WP 2016-8.0

We study a regression problem in which, for part of the data, we observe both the *label* variable (Y) and the *predictors* (X), while for the rest of the data only the predictors are given. Such a problem arises, for example, when observations of the label variable are costly and may require a skilled human agent. If the conditional expectation E[Y | X] is exactly linear in X, then the additional observations of X typically contain no useful information; otherwise, the unlabeled data can be informative. In the latter case, our aim is to construct the best linear predictor. We suggest improved alternatives to the standard naive procedures, which depend only on the labeled data. Our estimation method is easy to implement and has asymptotic properties that are simple to describe. The new estimates asymptotically dominate the standard procedures under a certain non-linearity condition on E[Y | X]; otherwise, they are asymptotically equivalent. The performance of the new estimator for small sample sizes is investigated in an extensive simulation study. A real data example of inferring the homeless population is used to illustrate the new methodology.
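
For reference, the best linear predictor mentioned above is usually defined as the population least-squares projection of Y on X. The formulation below is the standard textbook one, not notation taken from the paper; the symbols \alpha, \beta, and \Sigma_X are assumptions introduced here for illustration, and \Sigma_X is assumed to be invertible.

```latex
% Standard definition of the best linear predictor (assumed notation).
(\alpha^{*}, \beta^{*})
  = \arg\min_{\alpha,\, \beta} \; E\!\left[ \left( Y - \alpha - \beta^{\top} X \right)^{2} \right],
\qquad
\beta^{*} = \Sigma_X^{-1} \operatorname{Cov}(X, Y),
\qquad
\alpha^{*} = E[Y] - \beta^{*\top} E[X],
\qquad
\Sigma_X = \operatorname{Cov}(X).
```

When E[Y | X] is exactly linear, \beta^{*} coincides with the regression slope and does not depend on the marginal distribution of X. When E[Y | X] is not linear, \beta^{*} generally changes with the distribution of X, which is one way to see why extra observations of X alone can be informative.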

WP2016-08_Berk_SemiSupervisedLinearRegression_12.07.2016(1).pdf