Random Forest Imputation:
One of the nice "extras" of the random forest algorithm (Breiman, 2001) is its use for imputing mixed-type (numeric and categorical) data. The R package randomForest, following the lead of the original code, will fail if there are any missing values in the predictor space. Unlike modeling approaches such as generalized linear models, which employ case-wise deletion, random forests throw an error when they encounter missing values.
Out of the box, the random forest algorithm will perform imputation (in R, the rfImpute function) by leveraging a "proximity matrix", which also allows random forests to be used for outlier detection and clustering. As each tree in the RF is built, all cases are "run down" the tree, and if two cases (i and j) end up in the same terminal node, they are considered relatively alike according to the model. An N x N matrix is ultimately produced, with element (i, j) incremented each time cases i and j end up in the same node. This is then normalized by the number of trees in the forest.
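The proximity matrix described above can be sketched as follows. This is a minimal illustration in Python with scikit-learn standing in for the R randomForest package (an assumption for demonstration purposes; the dataset and model settings are illustrative): two cases gain proximity each time they land in the same terminal node of a tree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaves[i, t] = index of the terminal node that case i reaches in tree t
leaves = rf.apply(X)

n = X.shape[0]
proximity = np.zeros((n, n))
for t in range(leaves.shape[1]):
    # increment (i, j) whenever cases i and j share a terminal node in tree t
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf
proximity /= leaves.shape[1]  # normalize by the number of trees
```

Each case always shares a leaf with itself, so the diagonal is exactly 1, and off-diagonal entries measure how often the forest groups two cases together.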
The RF imputation process:
- Missing categorical variables are first imputed with their modal (most frequent) class, and missing numeric variables with the column median.
- A random forest is built on this roughly imputed data set.
- The numeric missing values are re-imputed as the weighted average of the non-missing values of that column, where the weights are the corresponding row vector from the proximity matrix. Thus, there is more weight given to otherwise similar cases.
- Categorical predictors are likewise re-imputed as the category with the largest average proximity.
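The re-imputation steps above can be sketched in Python with numpy (a hedged illustration; the function names and the small proximity matrix are invented for the example, and `proximity` is assumed to be the N x N matrix described earlier):

```python
import numpy as np

def reimpute_numeric(col, proximity):
    """col: 1-D float array with np.nan marking missing values."""
    out = col.copy()
    obs = ~np.isnan(col)
    for i in np.where(~obs)[0]:
        # weights = proximities of case i to the observed cases
        w = proximity[i, obs]
        out[i] = np.average(col[obs], weights=w) if w.sum() > 0 else np.nanmean(col)
    return out

def reimpute_categorical(col, proximity):
    """col: 1-D object array with None marking missing values."""
    out = col.copy()
    obs = np.array([v is not None for v in col])
    levels = sorted(set(col[obs]))
    for i in np.where(~obs)[0]:
        # pick the category with the largest average proximity to case i
        scores = {lv: proximity[i, obs & (col == lv)].mean() for lv in levels}
        out[i] = max(scores, key=scores.get)
    return out
```

For example, with a 3-case proximity matrix where case 2 is much closer to case 1 than to case 3, a missing numeric value in case 2 is pulled toward case 1's value rather than the plain column mean.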
From a predictive modeling standpoint, there is a problem here - namely, that the target variable must be fully populated or the imputation routine will throw an error. This is because a RF model is built at each step of the process above, and that model requires a target variable. Besides supervised imputation being generally unwise, this rules out applying the same imputation process to new data at scoring time, where the target is unknown, that was applied to the training data. This could give rise to serious instability.
missForest Imputation:
There is an alternative random-forest-based approach that can be (but does not need to be) completely unsupervised - allowing its use on test data sets.
The missForest approach of Stekhoven and Bühlmann (Stekhoven, 2012) does not require the target variable to be populated. The missForest procedure iteratively loops through the predictor variables - with loop order based on degree of missingness, least missing first - fitting a random forest model on the available non-missing data to predict the missing data, repeating until a convergence criterion of sorts is reached. Specifically, the algorithm appears to proceed as follows:
For each variable, until the difference between successive imputed matrices first increases:
o Fit a random forest using the non-missing cases of the ith predictor variable $X_{i}$ as the pseudo-target variable and the values of the other variables $X_{not i}$ corresponding to these cases as the independent variables.
o Predict the missing values of $X_{i}$ by running the cases where $X_{i}$ is missing (using their values of $X_{not i}$) through the random forest.
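The loop above can be sketched in Python with scikit-learn standing in for the R missForest package (a hedged, numeric-only illustration; the function name `miss_forest` and all settings are invented for the example, and the real package also handles categorical variables):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def miss_forest(X, max_iter=10, random_state=0):
    """Iterative RF imputation of a numeric matrix with np.nan for missing."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # initial rough fill: column means
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    # visit columns in order of increasing missingness
    order = np.argsort(miss.sum(axis=0))
    prev_diff = np.inf
    for _ in range(max_iter):
        X_old = X.copy()
        for j in order:
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            rf = RandomForestRegressor(n_estimators=50, random_state=random_state)
            rf.fit(others[obs], X[obs, j])  # non-missing cases of X_i as pseudo-target
            X[miss[:, j], j] = rf.predict(others[miss[:, j]])
        diff = ((X - X_old) ** 2).sum()
        if diff >= prev_diff:
            return X_old  # change first increased: keep the previous imputation
        prev_diff = diff
    return X
```

The stopping rule mirrors the description above: iteration continues until the change between successive imputed matrices first grows, at which point the previous imputation is returned.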
R Package: http://cran.r-project.org/web/packages/missForest/index.html
Paper describing the method: http://arxiv.org/abs/1105.0828
R Example