Introduction
Positive and unlabeled learning, or positive-unlabeled (PU) learning, refers to the binary classification problem in which only positive labels are observed and the rest of the instances are unlabeled. Since the unlabeled part of the data consists of both positive and negative instances, naively treating them as negative and running a standard classification algorithm will underestimate the probability of being positive (Ward et al. 2009; Peng Yang et al. 2012). The absence of negative instances in the training set, however, prevents the direct use of well-developed supervised classification methods. To break through this dilemma, dozens of PU learning algorithms have been proposed in the past two decades.
One way to bypass the lack of negative instances is to disregard the unlabeled part and learn only from positive instances. Since this underlying idea is similar to the one-class classification problem, which is designed to perform classification when the negative instances are absent, poorly sampled, or not well defined (Khan and Madden 2014), many existing one-class learning algorithms can be readily adapted to PU learning (Pengyi Yang, Liu, and Yang 2017), such as the Positive Naive Bayesian (PNB) classifier (Wang et al. 2006; Calvo, Larrañaga, and Lozano 2007), one-class SVM (Joachims 1999; De Bie et al. 2007; W. Li, Guo, and Elkan 2010), and one-class KNN (Munroe and Madden 2005). However, since unlabeled instances are not used during training, these methods may not be competitive with algorithms that effectively utilize information from both positive and unlabeled instances.
To better utilize the unlabeled instances, one class of algorithms adopts a heuristic two-step strategy (Manevitz and Yousef 2001; Yu, Han, and Chang 2002; Liu et al. 2002, 2003; X. Li and Liu 2003; Peng Yang et al. 2012). In the first step, instances that are likely to have negative labels are identified using similarity or distance metrics. In the second step, a classifier is developed based on the positive instances, the quasi-negative instances detected in the first step, and the remaining unlabeled instances. Alternatively, another branch of methods assigns different weights to positive and unlabeled instances in the loss function and/or classification model to account for the asymmetric nature of the PU learning problem. Many existing classification algorithms that are able to incorporate weights have been studied, such as naive Bayes (Nigam et al. 2000), biased SVM (Liu et al. 2003), and the biased logistic model (Lee and Liu 2003). One potential drawback of the aforementioned methods is that they either explicitly or implicitly rely on the assumption that the data are generated from a mixture model (Nigam et al. 2000). Hence, they are more appropriate for the deterministic scheme than for the probabilistic scheme, following the nomenclature proposed by Song et al. (Song and Raskutti 2019).
Meanwhile, a more theoretical viewpoint of PU learning has been developed by placing it in a case-control framework (Ward et al. 2009), motivated by the fact that the positive set and the unlabeled set are sampled differently. In the positive set, only instances whose true label is positive are sampled; these are the cases. The unlabeled set, on the other hand, is a completely random sample from the population, which serves as the controls. Under this framework, Ward et al. are able to estimate the underlying positive-negative logistic model from observed positive-unlabeled data via the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). Recently, Song et al. (Song and Raskutti 2019) extended this method with penalty terms to accomplish variable selection, with guaranteed convergence. Denote by \(x\) the covariates, by \(y\) the unobserved true label, and by \(z\) the observed label, where unlabeled instances are treated as negative. The main flaw of this method is that the population prevalence \(Pr(y=1)\) must be known, which is almost never the case in practical applications; otherwise the model is not identifiable from observed positive and unlabeled instances (Ward et al. 2009). Slightly different from case-control modeling, Elkan et al. and Scott et al. (Elkan and Noto 2008; Scott and Blanchard 2009) consider a single-training-set scenario in which all instances are sampled from a joint distribution of \((x, y, z)\) but only \((x, z)\) is recorded. The estimation of the population prevalence is straightforward under this setting, but an unrealistic noiseless assumption \(P(y=1\mid x)=1\) is required. They therefore provide a weighting approach that removes this assumption; however, this approach can be viewed as a one-step iteration of Ward's EM algorithm (Song and Raskutti 2019).
Recently, researchers have incorporated the bagging idea (Breiman 1996) into PU learning, generating a final classifier by aggregating multiple PU classifiers estimated from bootstrap samples (Mordelet and Vert 2011; Mordelet and Vert 2014; Claesen et al. 2015; Pengyi Yang et al. 2016). This approach takes advantage of bagging to reduce the noise arising from modeling directly with unlabeled instances and to obtain more stable predictions. Rather than sampling with a uniform probability, AdaSample (Pengyi Yang et al. 2018), a more boosting-like algorithm, applies a different sampling probability at each step, where the probability is calculated from the PU model estimated in the previous iteration. The performance of the PU classifier fitted to each bootstrap sample is of great importance to the success of this set of methods.
PUwrapper (basic idea)
The PUwrapper provides a general framework that can wrap any traditional supervised learning algorithm and make it work for a positive-unlabeled dataset.
In the positive-only data setting, there are two fundamental conditions:
- Condition 1. Labeled positive instances are a completely random selection from the positive population;
- Condition 2. The unlabeled instances are a random sample from the population.
Because the observed dataset is not a single random sample from the population but rather separate random samples of positive and unlabeled instances, it fits the case-control framework. Among the unlabeled instances, however, there is a mixture of true positives and negatives, which makes it hard to directly apply traditional case-control methods.
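To make Conditions 1 and 2 concrete, here is a minimal data-generating sketch under the case-control scheme; the model, parameter values, and sample sizes are illustrative assumptions, not part of any particular study.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_prob(x, beta=1.2):
    """Hypothetical positive-class probability: logit(P(y=1|x)) = beta * x."""
    return 1.0 / (1.0 + np.exp(-beta * x))

def sample_pu(n_pos=500, n_unlab=2000):
    """Draw a PU dataset satisfying Conditions 1 and 2."""
    # Condition 1: cases are drawn completely at random from the positive
    # sub-population (rejection sampling restricted to y = 1).
    pos_x = []
    while len(pos_x) < n_pos:
        x = rng.normal()
        if rng.random() < true_prob(x):       # keep only draws with y = 1
            pos_x.append(x)
    # Condition 2: the unlabeled set is a plain random sample from the population.
    unlab_x = rng.normal(size=n_unlab)
    x_all = np.concatenate([np.array(pos_x), unlab_x])
    z = np.concatenate([np.ones(n_pos), np.zeros(n_unlab)])   # observed label
    return x_all, z

x, z = sample_pu()
```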
Since only part of \(y\) is observed, the PU problem can be viewed as a missing data problem, and a commonly used tool for missing data problems is the EM algorithm (Dempster, Laird, and Rubin 1977). In the following sections, we first introduce the existing EM algorithm designed for the PU problem as well as the basic idea of PUwrapper.
Observed likelihood: \[ L^{obs}(\theta \mid x, z, s =1) = \prod_i P_\theta(z_i =1\mid s_i = 1,x_i)^{z_i}(1-P_\theta(z_i =1\mid s_i = 1, x_i))^{1-z_i} \] which involves \(P_\theta(z_i =1\mid s_i = 1,x_i)\) rather than the desired quantity \(P_\theta(y_i =1\mid x_i)\). An adjustment could be made to the observed likelihood to target \(P_\theta(y_i =1\mid x_i)\) directly, but the resulting problem is hard to solve (Ward et al. 2009). Hence, an EM algorithm is proposed: viewing the problem as a simple missing-label problem makes it easier to work with the full likelihood.
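For reference, the observed likelihood above is just a Bernoulli likelihood in the observed label \(z\); a minimal sketch, where the argument `p_z` is assumed to hold \(P_\theta(z_i=1 \mid s_i=1, x_i)\) (which is not the same object as \(P_\theta(y_i=1 \mid x_i)\)):

```python
import numpy as np

def observed_loglik(p_z, z, eps=1e-12):
    """Bernoulli log-likelihood of the observed labels z, where
    p_z[i] = P_theta(z_i = 1 | s_i = 1, x_i)."""
    p = np.clip(p_z, eps, 1.0 - eps)     # guard against log(0)
    return np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))
```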
Full likelihood: \[ L^{full}(\theta \mid x, y,z, s =1) \propto \prod_i P_\theta(y_i=1 \mid s_i = 1, x_i)^{y_i}P_\theta(y_i=0 \mid s_i = 1, x_i)^{1-y_i} \]
EM algorithm (adjusted): maximize the observed likelihood by iteratively maximizing the expected full likelihood conditional on the observed data and the current parameter estimates. Here let \[ \begin{aligned} f^*_\theta(x) &:= P_\theta(y=1 \mid x,s=1); \\ f_{\theta}(x) &:= P_{\theta}(y = 1 \mid x). \end{aligned} \]
E-step: \[ \begin{aligned} Q(\theta \mid \theta^{(k)}) &= E[\ell^{full}(\theta \mid x, y,z,s=1) \mid x, z, s=1, \theta^{(k)}] \\ &= \sum_i \left\{E[y_i \mid z_i, x_i, s_i=1,\theta^{(k)}] \log{f^*_\theta(x_i)} + (1-E[y_i \mid z_i, x_i, s_i=1,\theta^{(k)}])\log{(1-f^*_\theta(x_i))} \right\} \end{aligned} \] where \(E[y_i \mid z_i, x_i, s_i=1,\theta^{(k)}] = f_{\theta^{(k)}}(x_i)^{(1-z_i)}\).
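In code, the E-step reduces to these posterior weights: labeled positives keep weight one, and each unlabeled instance is imputed with its current estimated probability \(f_{\theta^{(k)}}(x_i)\). A tiny sketch:

```python
import numpy as np

def e_step(f_k, z):
    """E-step weights: E[y_i | z_i, x_i, s_i = 1] = 1 if z_i = 1,
    and f_{theta^(k)}(x_i) if z_i = 0."""
    return np.where(z == 1, 1.0, f_k)
```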
M-step: In the M-step, we maximize the expected full log-likelihood computed in the E-step: \[ \theta^{(k+1)} = \arg\max_\theta \quad Q(\theta \mid \theta^{(k)}). \]
Mapping: We add an additional step here to make the procedure work like a wrapper that is free of the specification of the functional form of \(f^*_\theta(x)\) and \(f_{\theta}(x)\). After optimizing in the M-step, we obtain \(\hat\theta^{(k+1)}\) as well as \(f^*_{\hat\theta^{(k+1)}}(x)\), but using \(\hat\theta^{(k+1)}\) directly would require knowing the functional form of \(f_{\theta}(x)\). Instead, we can directly use the relationship between \(f_{\theta}(x)\) and \(f^*_{\theta}(x)\): \[ f_\theta(x) = \frac{(c-1)f^*_\theta(x)}{c - f^*_\theta(x)} \] where \[ c = \frac{Pr(y=1 \mid s=1)}{Pr(z=1\mid s=1)}. \]
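The mapping step only needs the fitted values \(f^*\) and the constant \(c\); a small helper implementing the displayed formula (how \(c\) is obtained is discussed next):

```python
import numpy as np

def map_fstar_to_f(f_star, c):
    """Map f*_theta(x) = P(y=1 | x, s=1) to f_theta(x) = P(y=1 | x)
    using f = (c - 1) f* / (c - f*)."""
    return (c - 1.0) * f_star / (c - f_star)
```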
How to obtain \(c\)?
In the formula \(c = \frac{Pr(y=1 \mid s=1)}{Pr(z=1\mid s=1)}\), \(Pr(z=1\mid s=1)\) is directly observable, but \(Pr(y=1 \mid s=1)\) depends on the population prevalence \(\pi = P(y=1)\), which is unknown. In most of the literature, \(\pi\) is assumed to be known, but in real applications \(\pi\) is rarely available, which hinders the use of these methods. However, noticing that \[ \begin{aligned} E(y \mid s=1) &= E_x[E(y \mid s=1, x)] \\ &\approx \frac{1}{n}\sum_i Pr(y_i=1 \mid s_i=1, x_i) \\ &= \frac{1}{n}\sum_i f^*_{\theta}(x_i), \end{aligned} \] we can replace the unknown \(E(y \mid s=1)\) by its empirical counterpart. That is, \[ f_{\hat\theta^{(k+1)}}(x_i) = \frac{(\frac{1}{n}\sum_j f^*_{\hat\theta^{(k+1)}}(x_j)-1)\, f^*_{\hat\theta^{(k+1)}}(x_i)}{\frac{1}{n}\sum_j f^*_{\hat\theta^{(k+1)}}(x_j) - f^*_{\hat\theta^{(k+1)}}(x_i)}. \]
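Putting the E-step, M-step, empirical \(c\), and mapping together, a minimal sketch of the wrapper loop is given below. The callable `fit_predict` stands for any wrapped supervised learner that accepts fractional (soft) labels and returns fitted probabilities \(f^*\); the name and interface are illustrative assumptions, not an existing API.

```python
import numpy as np

def pu_wrapper(x, z, fit_predict, n_iter=50, tol=1e-6):
    """EM-style wrapper around a generic probabilistic classifier.

    x : (n, p) covariate matrix; z : (n,) observed labels (1 = labeled positive);
    fit_predict : callable (x, soft_labels) -> f* = estimated P(y=1 | x, s=1).
    Returns the estimated f(x) = P(y=1 | x) and the final value of c.
    """
    f = np.full(len(z), z.mean())                  # crude initialization of f_theta
    c = 1.0
    for _ in range(n_iter):
        y_soft = np.where(z == 1, 1.0, f)          # E-step: impute E[y | z, x, s=1]
        f_star = fit_predict(x, y_soft)            # M-step: refit the wrapped learner
        c = f_star.mean() / z.mean()               # empirical c = mean(f*) / P(z=1|s=1)
        f_new = (c - 1.0) * f_star / (c - f_star)  # mapping step
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f, c
```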
How to do variable selection during EM
If we want to accomplish variable selection during the EM algorithm, the objective function at the M-step is adjusted to \[ \theta^{(k+1)} = \arg\max_\theta \quad Q(\theta \mid \theta^{(k)}) - \lambda J(\theta), \] where \(J(\theta)\) is a penalty term. The imputation-regularized optimization (IRO) framework developed by Liang et al. (2018) proposes a general strategy for handling missing data (missingness in the covariates rather than the outcome) in high-dimensional settings. PULasso (Song and Raskutti 2019) adopts quadratic majorization for the M-step (QM-EM), which mainly targets computational efficiency.
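As one concrete (purely illustrative) way to carry out a penalized M-step with an off-the-shelf learner, the fractional labels produced by the E-step can be expanded into weighted positive and negative copies and passed to an L1-penalized logistic regression. This is a sketch using scikit-learn, not the PULasso or IRO implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def penalized_m_step(x, y_soft, C=1.0):
    """L1-penalized M-step: each instance contributes with weight y_soft_i as a
    positive and weight (1 - y_soft_i) as a negative, reproducing the expected
    full log-likelihood up to the penalty term."""
    n = len(y_soft)
    x_rep = np.vstack([x, x])
    y_rep = np.concatenate([np.ones(n), np.zeros(n)])
    w_rep = np.concatenate([y_soft, 1.0 - y_soft])
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(x_rep, y_rep, sample_weight=w_rep)
    # Fitted f* on the original instances and the (sparse) coefficient vector.
    return model.predict_proba(x)[:, 1], model.coef_.ravel()
```

A thin adapter that keeps only the probabilities, e.g. `lambda x, y: penalized_m_step(x, y)[0]`, can be passed as `fit_predict` to the wrapper sketch above.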
Basically, as a wrapper, we do not need to know the specific value of \(\theta^{(k+1)}\); all the wrapper needs is \(f^*_{\theta}(x_i)\). Variable selection, if necessary, should therefore be embedded in the wrapped algorithm. For example, penalized logistic regression (PLR) or reinforcement learning trees (RLT), when wrapped, are able to select important variables.
Unbalanced scenarios (observation unbalancedness and population unbalancedness)
This is not something a wrapper needs to handle directly. If necessary, we could start by rewriting \(c\) as \[ c =1+Pr(y=1 \mid z=0, s=1)\frac{Pr(z= 0 \mid s = 1)}{ Pr(z=1\mid s=1)}, \] in which the unbalancedness is explicitly expressed in the second term. Specifically, population unbalancedness is captured by \(Pr(y=1 \mid z=0, s=1)\), since \(Pr(y=1 \mid z=0, s=1) = Pr(y=1)\) under Condition 2, and \(\frac{Pr(z= 0 \mid s = 1)}{ Pr(z=1\mid s=1)}\) is a measure of observation unbalancedness.
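For completeness, this decomposition follows from the law of total probability and the fact that every labeled instance is truly positive, i.e. \(Pr(y=1, z=1 \mid s=1) = Pr(z=1 \mid s=1)\):
\[
\begin{aligned}
c &= \frac{Pr(y=1 \mid s=1)}{Pr(z=1\mid s=1)}
= \frac{Pr(y=1, z=1 \mid s=1) + Pr(y=1, z=0 \mid s=1)}{Pr(z=1\mid s=1)} \\
&= \frac{Pr(z=1 \mid s=1) + Pr(y=1 \mid z=0, s=1)\,Pr(z=0 \mid s=1)}{Pr(z=1\mid s=1)}
= 1 + Pr(y=1 \mid z=0, s=1)\,\frac{Pr(z=0 \mid s=1)}{Pr(z=1\mid s=1)}.
\end{aligned}
\]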
How to find the optimal cutoff without knowing the true \(\pi\)
We are able to generate an estimated probability for each unlabeled instance; even though the probability itself is biased, the ranking of the instances is preserved (as shown by the ROC curve). For application purposes, we sometimes need to assign a definite class to each instance, which is a problem when the true \(\pi\) is not provided. One way is to estimate \(\pi\) from the estimated \(c\) obtained at convergence of the algorithm. However, we should be aware of the potential bias of such an estimator, since \(\pi\) is not identifiable.
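One possible way to operationalize this, combining the decomposition of \(c\) above with Condition 2 (so that \(Pr(y=1 \mid z=0, s=1) = \pi\)), is sketched below; the rule of matching the fraction of predicted positives among the unlabeled instances to \(\hat\pi\) is only one of several reasonable choices.

```python
import numpy as np

def cutoff_from_c(c_hat, z, f_hat):
    """Back out pi from the converged c_hat via
    c = 1 + pi * P(z=0|s=1) / P(z=1|s=1)  (valid under Condition 2),
    then set the cutoff so that the fraction of unlabeled instances
    classified as positive matches the estimated prevalence."""
    p_lab = z.mean()                                   # P(z=1 | s=1)
    pi_hat = float(np.clip((c_hat - 1.0) * p_lab / (1.0 - p_lab), 0.0, 1.0))
    cutoff = np.quantile(f_hat[z == 0], 1.0 - pi_hat)
    return pi_hat, cutoff
```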
Illustration via Simple Simulation Studies
In the first simulation study, we compare the outcomes obtained with the true \(\pi\) to those using the estimated \(\pi\) under a simplified setting, i.e., \(logit(\Pr(y=1)) = 1.2X+\varepsilon\). We display the estimated probabilities against \(X\) in Figure 1 (a), (b).
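Operationally, "knowing the true \(\pi\)" means the wrapper can fix \(c\) once and for all instead of re-estimating it at each iteration; under Condition 2 the conversion follows the decomposition of \(c\) above. A small illustrative helper:

```python
import numpy as np

def c_from_pi(pi, z):
    """Convert a known prevalence pi into the constant c used by the mapping
    step, c = 1 + pi * P(z=0|s=1) / P(z=1|s=1), valid under Condition 2."""
    p_lab = z.mean()
    return 1.0 + pi * (1.0 - p_lab) / p_lab
```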
Based on the results, there are several comments:
- A biased \(\pi\) (equivalently, a biased \(c\)) leads to biased estimates of the probabilities
- Using the estimated \(c\) appears to be a valid solution
- The initial value used to start the EM algorithm does not matter much
- Using the true \(c\) gives a better convergence rate than re-estimating \(c\) at each step
In the second simulation study, we wrap two basic supervised learning algorithms: random forests (RF) and penalized logistic regression (PLR). At the same time, we would like to conduct variable selection during the process. For PLR, we adopt the LASSO at each EM iteration, while for RF, we use variable importance scores as the weights. We generate 100 variables, only 5 of which are related to the outcome; the rest are pure noise. In Figure 2, we display the results based on PUwrapper, and the selected variables, or variable importance scores, are given in panels (c) and (f).
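A minimal sketch of the data-generating step for this study is shown below; the sample sizes and coefficient values are illustrative assumptions rather than the exact settings used. Variables with nonzero LASSO coefficients (or high RF importance scores) at convergence of the wrapper are then reported as selected.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_sparse_pu(n_pos=400, n_unlab=1600, p=100, n_signal=5):
    """Case-control PU data with 100 covariates, only the first 5 informative."""
    beta = np.zeros(p)
    beta[:n_signal] = 1.0                            # illustrative signal strength

    def draw(n):
        x = rng.normal(size=(n, p))
        prob = 1.0 / (1.0 + np.exp(-x @ beta))
        return x, rng.binomial(1, prob)

    # Condition 2: the unlabeled set is a plain random sample from the population.
    x_unlab, _ = draw(n_unlab)
    # Condition 1: keep sampling until n_pos true positives (cases) are collected.
    x_pos = np.empty((0, p))
    while len(x_pos) < n_pos:
        x_new, y_new = draw(n_pos)
        x_pos = np.vstack([x_pos, x_new[y_new == 1]])
    x_pos = x_pos[:n_pos]
    x = np.vstack([x_pos, x_unlab])
    z = np.concatenate([np.ones(n_pos), np.zeros(n_unlab)])
    return x, z
```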