Wednesday, February 6, 2013

Imputation with Random Forest: Miss Forest



Random Forest Imputation:

One of the nice "extras" of the random forest algorithm (Breiman, 2001) is its use for imputing mixed data types (numeric and categorical). The R package randomForest, following the lead of the original code, will fail if there are any missing values in the predictor space. Unlike other modeling approaches, such as generalized linear models, which employ case-wise deletion, random forests will throw an error when they encounter missing values.

Out of the box, the random forest algorithm will perform imputation (in R, the rfImpute function) by leveraging a "proximity matrix," which also allows random forests to be used for outlier detection and clustering. As each tree in the forest is built, all cases are "run down" the tree, and if two cases (i and j) end up in the same terminal node, they are considered relatively alike according to the model. An N x N matrix is ultimately produced, with element (i, j) incremented each time cases i and j end up in the same node. This is then normalized by the number of trees in the forest.

The RF imputation process:


  1. Missing categorical variables are first imputed with their modal class (most frequent level) and missing numeric variables with the column median.
  2. A random forest is built on the data set using these roughly imputed values.
  • The missing numeric values are then re-imputed as the weighted average of the non-missing values of that column, where the weights are the corresponding row of the proximity matrix. Thus, more weight is given to otherwise similar cases.
  • Missing categorical values are likewise re-imputed as the category with the largest average proximity.
Step 2 is often repeated several times, averaging the final imputation.
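
A minimal sketch of this built-in approach, using the randomForest package and the iris data with some artificially injected missing values (the data and the hole-punching here are purely illustrative); note that the target (Species) must be fully observed:

library(randomForest)

set.seed(42)
iris_na <- iris
iris_na[sample(nrow(iris_na), 40), "Sepal.Width"] <- NA   # punch some holes

rough   <- na.roughfix(iris_na)                  # step 1 only: median / modal fill
imputed <- rfImpute(Species ~ ., data = iris_na, # steps 1 and 2, iterated
                    iter = 5, ntree = 300)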


From a predictive modeling standpoint, there is a problem here: the target variable must be fully populated for the imputation routine to run without error, because a random forest model is built at each step of the process above and requires a target. Besides supervised imputation being generally unwise, this rules out applying the same imputation process to new scoring data, where the target is unknown, that was applied to the training data. That mismatch could give rise to serious instability.

Miss Forest Imputation:

There is an alternative random forest based approach that can be (but doesn't need to be) completely unsupervised, allowing its use on test data sets.



The Miss Forest approach of Stekhoven and Buhlmann (Stekhoven, 2012) does not require the target variable to be populated. The procedure loops through the predictor variables, ordered by degree of missingness, fitting a random forest model on the available non-missing data to predict the missing data, and repeats until a form of convergence is reached. Specifically, the algorithm appears to be:

For each variable, and until the difference between successive imputed matrices begins to increase:
o   Fit a random forest using the non-missing cases of the ith predictor variable $X_{i}$ as the pseudo target variable and the corresponding values of the other variables $X_{not\ i}$ for those cases as the independent variables.
o   Predict the missing values of $X_{i}$ by running the cases where $X_{i}$ is missing (their values of $X_{not\ i}$) through the random forest.
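
To make the loop concrete, here is a deliberately naive sketch of the procedure written directly against the randomForest package (the function name naive_miss_forest and all variable names are invented for illustration; the missForest package itself handles the stopping rule, mixed types, and OOB error estimates far more carefully):

library(randomForest)

naive_miss_forest <- function(df, max_iter = 10) {
  na_idx <- lapply(df, function(x) which(is.na(x)))     # remember where the holes are
  imp    <- na.roughfix(df)                             # initial fill: median / mode
  vars   <- names(df)[sapply(na_idx, length) > 0]
  vars   <- vars[order(sapply(na_idx[vars], length))]   # least missing first
  prev_change <- Inf
  for (iter in seq_len(max_iter)) {
    old <- imp
    for (v in vars) {
      miss <- na_idx[[v]]
      fit  <- randomForest(reformulate(setdiff(names(df), v), response = v),
                           data = imp[-miss, , drop = FALSE])
      imp[miss, v] <- predict(fit, newdata = imp[miss, , drop = FALSE])
    }
    # crude convergence check on the numeric columns only
    num    <- sapply(imp, is.numeric)
    change <- sum((data.matrix(imp[, num, drop = FALSE]) -
                   data.matrix(old[, num, drop = FALSE]))^2)
    if (change > prev_change) return(old)   # stop when the change starts to increase
    prev_change <- change
  }
  imp
}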


R Package: http://cran.r-project.org/web/packages/missForest/index.html
Paper describing the method: http://arxiv.org/abs/1105.0828


R Example
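
A hedged sketch using the missForest package on the iris data with values knocked out at random (the data set and the missingness rate are purely illustrative):

library(missForest)

set.seed(42)
iris_mis <- prodNA(iris, noNA = 0.1)   # make 10% of the values missing at random

imp <- missForest(iris_mis, verbose = TRUE)

head(imp$ximp)    # the completed data set
imp$OOBerror      # out-of-bag estimate of the imputation error

# With the true values in hand, we can also compute the actual imputation error:
mixError(imp$ximp, iris_mis, iris)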





Monday, January 7, 2013

Partial Dependency Plots and GBM


My favorite "go-to-first" modeling algorithm for a classification or regression task is Gradient Boosted Regression Trees (Friedman, 2001 and 2002), especially as coded in the gbm package in R. Gradient boosted regression was one of the focuses of my master's thesis (along with random forests) and has gotten more and more attention as data mining contests have been won, at least in part, by employing this method. Some of the reasons I pick Gradient Boosted Regression Trees as the best off-the-shelf predictive modeling algorithm available are:
  • High predictive accuracy across many domains and target data types
    • Ability to specify various loss functions (Gaussian, Bernoulli, Poisson, etc.) as well as to run survival analysis via Cox proportional hazards, quantile regression, etc.
  • Handles mixed data types (continuous and nominal)
  • Seamlessly deals with missing values
  • Contains out-of-bag (OOB) estimates of error
  • Contains variable importance measures
  • Contains variable interaction measures / detection
  • Allows estimates of the marginal effects of one or more predictors via Partial Dependency Plots.

This latter point is a nice feature coded into the gbm package, giving the analyst the ability to produce univariate and bivariate partial dependency plots. These plots let the researcher understand the effect of a predictor variable (or the interaction between two predictors) on the target outcome, given the other predictors (partialling them out, i.e. after accounting for their average effects).
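
A hedged sketch of producing these plots with the gbm package, using simulated data (all variable names and tuning values below are made up for illustration):

library(gbm)

set.seed(1)
n  <- 5000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- factor(sample(letters[1:3], n, replace = TRUE))
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 - 0.5 * x1 * x2 + 0.7 * (x3 == "b")))
dat_sim <- data.frame(y, x1, x2, x3)

fit_gbm <- gbm(y ~ x1 + x2 + x3, data = dat_sim, distribution = "bernoulli",
               n.trees = 2000, interaction.depth = 3, shrinkage = 0.01,
               bag.fraction = 0.5, verbose = FALSE)

best_iter <- gbm.perf(fit_gbm, method = "OOB", plot.it = FALSE)

summary(fit_gbm, n.trees = best_iter)               # relative variable importance
plot(fit_gbm, i.var = "x1", n.trees = best_iter)    # univariate partial dependence
plot(fit_gbm, i.var = c("x1", "x2"),                # bivariate partial dependence
     n.trees = best_iter)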

The technique is not unique to Gradient Boosted Regression Trees and is in fact a general method for understanding any black box modeling algorithm. When we use ensembles, for instance, this tends to be the only way to understand how changing the value of a predictor - say a binary predictor from 0 to 1, or a continuous predictor within its observed range - affects the outcome, given the model (i.e. accounting for the impact of the other predictors in the model).

The idea of these "marginal" plots is to display the modeled outcome over the range of a given predictor. Hastie (2009) describes the general technique by considering the full model function $f$, a small subset of predictors we are interested in, $X_{s}$ (typically one or two predictors in practice), and the remaining predictors in the model, $X_{c}$. The full model function is then written as $f(X)=f(X_{s},X_{c})$, where $X=X_{s} \cup X_{c}$. A partial dependence plot displays the marginal expected value of $f$ as a function of $X_{s}$, obtained by averaging over the values of $X_{c}$. Practically, this is $\bar{f}(x_{s})=\frac{1}{N}\sum_{i=1}^{N} f(x_{s}, x_{ci})$: for a given value (or vector) $x_{s}$ of the subset of predictors, set $X_{s}$ to $x_{s}$ for every case, leave the $X_{c}$ values as they exist in the data set, and average the model output.

A brute force method is to create a copy of the training data for every pattern of $X_{s}$ of interest, plug that pattern into each copy, score each copy with the model, and report the average prediction for each distinct value of $X_{s}$.
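
A hedged brute-force version of the same calculation for x1, reusing the simulated data and fitted model from the sketch above: plug each grid value into every row, score, and average.

grid_x1 <- quantile(dat_sim$x1, probs = seq(0.05, 0.95, by = 0.05))

pdp_x1 <- sapply(grid_x1, function(v) {
  tmp    <- dat_sim
  tmp$x1 <- v                                                  # set x1 everywhere to v
  mean(predict(fit_gbm, newdata = tmp, n.trees = best_iter))   # average on the link scale
})

plot(grid_x1, pdp_x1, type = "l",
     xlab = "x1", ylab = "Average predicted logit")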











Wednesday, January 2, 2013

Need for Repeated Hold Outs in Predictive Models

Many models are built and deployed using a training/validation partitioning approach. Under this construct, a data set is randomly split into two parts: one part used to train the model and the other held out to validate how well the model works. The split is sometimes done 50/50, and other times more data is left for training than validating, especially when one is dealing with a rare event as the target variable (which is almost always the case in database marketing). Personally, I use a 65/35 train/validate split as the standard. Sometimes a third partition is sampled out to further tune the model before the final validation on a "test" set. The only time I have used this approach is when calibrating scores into probabilities, but it is used frequently by others and is the default in SAS Enterprise Miner when partitioning.

There are many types of validation, from this split sample approach to bootstrapping, k-fold cross validation, Leave One Out Cross Validation (LOOCV) and many others. In R, the caret package is a fantastic workbench for tuning and validating predictive models.

In database marketing, it seems that many (most?) models are validated with a single hold-out partition.

Here is a quick example of the need to understand the variability inherent in the random split between training and validation. If the split is done only once, the results may be quite misleading.

Below are the error curves for a model built on a data set with 152,000 records. 65% was sampled to train the model and 35% was held back to validate. The full data set contains about 3% positive events (a binary target). The model is a form of Generalized Naive Bayes.

The process followed to create this plot was simple:

  • Split the data set into train and validate with a random seed.
  • Do all processing of the data (e.g. feature selection) that involves the target variable or could be considered a tuning parameter. [It is vital to do this independently each time in any cross validation process.]
  • Train the model.
  • Predict the outcome on the validation set.
  • Rank the validation data into 10 equally sized groups (deciles) by the predicted score.
  • For each decile, calculate the average actual "response" (i.e. proportion of 1's), the average predicted response and the difference between these.

Repeat the above multiple times - here it was done 10 times. To make this feasible, the pre-processing, feature selection and modeling need to be fully automated (programmatic), which is not always possible.
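
A minimal sketch of this loop, using an ordinary logistic regression as a stand-in for the generalized naive Bayes model and assuming a data frame dat with a binary target y (both names are hypothetical):

set.seed(2013)
n_reps  <- 10
results <- vector("list", n_reps)

for (r in seq_len(n_reps)) {
  idx   <- sample(nrow(dat), size = floor(0.65 * nrow(dat)))   # 65/35 split
  train <- dat[idx, ]
  valid <- dat[-idx, ]

  fit_r <- glm(y ~ ., data = train, family = binomial)         # train
  p_hat <- predict(fit_r, newdata = valid, type = "response")  # score the hold-out

  decile <- ceiling(10 * rank(-p_hat, ties.method = "first") / length(p_hat))  # 1 = highest scores

  by_decile <- aggregate(data.frame(actual = valid$y, predicted = p_hat),
                         by = list(decile = decile), FUN = mean)
  by_decile$rep <- r
  results[[r]] <- by_decile
}

decile_runs <- do.call(rbind, results)
decile_runs$error <- decile_runs$actual - decile_runs$predicted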

There is a large degree of variability between deciles from run to run - variability that a decision maker likely needs to understand, that the modeler needs to be aware of, and that could well be lost with a single hold-out partition.




I would be interested to know what others do in these cases. I lean towards presenting something like the plot below - from the same data as above - where a loess smooth is fit to the average actual response by decile, along with the same error data.
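
A hedged sketch of that summary plot, using the decile_runs data frame built in the sketch above:

plot(decile_runs$decile, decile_runs$actual,
     xlab = "Score decile (1 = highest)", ylab = "Actual response rate")
lo <- loess(actual ~ decile, data = decile_runs)
lines(1:10, predict(lo, newdata = data.frame(decile = 1:10)), col = "blue", lwd = 2)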







Delta Method in Logistic Regression

In the last post we looked at how to construct and evaluate a simple linear hypothesis test within the framework of logistic regression (or any generalized linear model fit via maximum likelihood). While the example was simple, the idea can be quite powerful for quickly testing hypotheses concerning regression coefficients.

For this method to work, however, you need to be able to write your (null) hypothesis as a linear combination of the model parameters or coefficients. In our simple example, this was possible because we wished to test the null hypothesis $\beta_{2}-\beta_{3}=0$. What happens when we cannot construct the weights ($L$) to place on the coefficients with a vector (or a matrix, when interested in multiple simultaneous contrasts such as Ho: $\beta_{2}=\beta_{3}=0$)? We need to apply other methods to the estimation and inference task. An often used classical procedure is the Delta Method.

Delta Method

The main idea of the delta method is that, when you are not facing a simple linear combination of observations (random variables), you create a linear approximation to the more complicated function using a Taylor expansion and derive the variance of that approximation instead. This yields an approximate variance in large samples, and thus a confidence interval, for these non-linear estimators.

There are many sources for the theory of the delta method, but the main result of interest here is that, in large samples and under standard regularity conditions, a function $G(b)$ of the coefficients of our generalized linear model ($b_{0},b_{1},\ldots,b_{p}$) will be distributed as $N\left(G(\beta),\ \frac{\partial G(\beta)}{\partial \beta'}VAR(\hat{\beta})\frac{\partial G(\beta)}{\partial \beta}\right)$, which we evaluate at the ML estimates:

Point Estimate: $G(\hat{\beta})$
Variance: $\frac{\partial G(\hat{\beta})}{\partial \hat{\beta}'}VAR(\hat{\beta})\frac{\partial G(\hat{\beta})}{\partial \hat{\beta}}$

For example, returning to the last post, let's say we are interested in the ratio $\frac{\beta_{2}}{\beta_{3}}$.

We have the vector of maximum likelihood estimates:

$[-2.8211865,\ -0.5794947,\ -0.4589188,\ -0.3883266]$

and the estimated variance-covariance matrix:

             (Intercept)        Gifts1         Gifts2         Gifts3
(Intercept)   0.0002674919  -0.0002674919  -0.0002674919  -0.0002674919
Gifts1       -0.0002674919   0.0034667138   0.0002674919   0.0002674919
Gifts2       -0.0002674919   0.0002674919   0.0039732911   0.0002674919
Gifts3       -0.0002674919   0.0002674919   0.0002674919   0.0039964383


From here, we know the point estimate of this quantity is just the expression evaluated at the maximum likelihood estimates:

$\frac{ -0.4589188}{ -0.3883266}=1.181778$

Next, we need the first derivative of the function:

$\frac{\partial G(\beta)}{\partial \beta} = \left[0,\ 0,\ \frac{1}{\beta_{3}},\ \frac{-\beta_{2}}{\beta_{3}^{2}}\right]^{t}$, which evaluated at the ML estimates is $\left[0,\ 0,\ \frac{1}{-0.3883266},\ \frac{-(-0.4589188)}{(-0.3883266)^{2}}\right]^{t}$.

The variance of $\frac{\beta_{2}}{\beta_{3}}$ is then 0.05916905, and a 95% confidence interval can be constructed as $1.181778\pm1.96\sqrt{0.0591690}=(0.7050221,\ 1.65855)$.
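
For reference, here is the hand calculation above done in R; the coefficient vector and covariance matrix are simply typed in from the values shown (in practice they would come from coef() and vcov() on the fitted model):

b <- c(-2.8211865, -0.5794947, -0.4589188, -0.3883266)
V <- matrix(c( 0.0002674919, -0.0002674919, -0.0002674919, -0.0002674919,
              -0.0002674919,  0.0034667138,  0.0002674919,  0.0002674919,
              -0.0002674919,  0.0002674919,  0.0039732911,  0.0002674919,
              -0.0002674919,  0.0002674919,  0.0002674919,  0.0039964383),
            nrow = 4, byrow = TRUE)

est  <- b[3] / b[4]                         # point estimate: ~1.181778
grad <- c(0, 0, 1 / b[4], -b[3] / b[4]^2)   # gradient of G(b) = b2 / b3
v    <- as.numeric(t(grad) %*% V %*% grad)  # delta method variance: ~0.0591690
est + c(-1, 1) * 1.96 * sqrt(v)             # approximate 95% confidence interval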

The car package in R provides a very nice shortcut to all of this, supplying symbolic differentiation to save you from calculating the derivative. The deltaMethod function accepts a variety of model objects, or you can supply a coefficient vector and covariance matrix. Here is a replication of the simple analysis above:
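
A hedged sketch of that replication, assuming fit is the logistic regression fit with glm in the earlier post (so the coefficients are named Gifts1, Gifts2 and Gifts3):

library(car)
deltaMethod(fit, "Gifts2 / Gifts3")
# Should reproduce the hand calculation: estimate ~1.18 with SE ~0.243.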



Another common use of this method is to compute confidence intervals for predicted values from generalized linear models on the response scale - for example, in logistic regression we may be interested in the effect of a predictor variable on the probability p rather than on the logit (link) scale. Xu and Long show the derivation for several predicted probabilities.

  • SAS NLMIXED uses the delta method for estimating the variance with its ESTIMATE statement - for example, when you are interested in the variance of the difference in estimated probabilities between two different predictor values (or vectors of values).


 When you have time and computing power, a bootstrap analysis will typically provide better coverage. In a future post we will compare bootstrapping, simulation and the delta method.


Monday, December 31, 2012

Linear Hypothesis Tests

Returning to our example from the last post: within the framework of a generalized linear model, one way to statistically test whether there are differences in the effect on the outcome variable between two levels of a nominal variable (when neither level is the reference level under dummy coding) is to use a linear contrast (simulation and bootstrapping being two others). The estimates are made ceteris paribus (holding all the other model predictors constant), and while simple in this case, the approach generalizes easily to much more complex situations (e.g. multi-way interactions and much larger models).

We are testing the null hypothesis $Ho: L'b=0$, where L is a vector or matrix of hypothesis coefficients (the contrast matrix), carefully chosen to represent a linear combination of the coefficients from the model of interest, and b is the vector of maximum likelihood estimates (the coefficients of our model).

In this very simple example, we are interested in the difference between $\beta_{2}$ and $\beta_{3}$ in our logistic regression. Thus we are interested in the linear combination of parameters $\beta_{2}-\beta_{3}=0$ (i.e. $\beta_{2}=\beta_{3}$).

The first type of test we describe is a Wald test.
  • The intuition for this test comes from the test of a single model parameter: $\frac{b_{1}-\beta_{1}}{s\{b_{1}\}}$, where $s\{b_{1}\}$ is the standard error of $b_{1}$, which is distributed t with degrees of freedom equal to the number of observations in the data set minus the number of parameters estimated. In large samples the t distribution approaches the normal distribution, and a squared standard normal random variable is distributed as chi-square.
  • Thus, from this heuristic description, the formulation of the Wald test makes sense and is given as $S=(L'b)'(L'\Sigma L)^{-1}(L'b)$ where $\Sigma$ is the variance-covariance matrix of the estimated coefficients. $S$ is asymptotically distributed as a chi-square with degrees of freedom equal to the number of restrictions in the null hypothesis. In this case, we are conducting a single degree of freedom test.
  • With our model $log(\frac{p}{(1-p)})=\beta_{0}+\beta_{1}(Gifts1)+\beta_{2}(Gifts2)+\beta_{3}(Gifts3)$ the formulation of $L$ to construct  $\beta_{2}-\beta_{3}=0$ is clearly $L=[0,0,1,-1]^{t}$ which comes from the contrast of interest $[\beta_{0}+\beta_{1}(0)+\beta_{2}(1)+\beta_{3}(0)] - [\beta_{0}+\beta_{1}(0)+\beta_{2}(0)+\beta_{3}(1)] $
    • For a maximum likelihood estimator, the MLE of $\beta_{2}-\beta_{3}$ is this expression evaluated at the MLE of the coefficients: $b_{2}-b_{3}$.
    • Further, this contrast has a variance of $L'\Sigma L$, which comes from first principles: $VAR(b_{2}-b_{3})=VAR(b_{2})+VAR(b_{3})-2COV(b_{2},b_{3})$.
  • In R we can use the linearHypothesis function in John Fox's excellent car package. We fail to reject the null hypothesis, with a test statistic of 0.6702656 and a p-value of 0.412959. We also compute the test statistic manually, as shown in the sketch below:

The contrast $\beta_{2}-\beta_{3}$ has a point estimate of -0.07059, with a variance of 0.007434746. We can construct a Wald-type 95% confidence interval as $-0.07059\pm1.96\sqrt{0.007434746}=(-0.2395909,\ 0.09841095)$.
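
A hedged sketch of both the linearHypothesis call and the manual calculation, again assuming fit is the glm logistic regression from the last post:

library(car)
linearHypothesis(fit, "Gifts2 - Gifts3 = 0")    # Wald chi-square test, 1 df

# The same test by hand from the coefficients and their covariance matrix
L   <- c(0, 0, 1, -1)
est <- sum(L * coef(fit))                       # point estimate of b2 - b3
v   <- as.numeric(t(L) %*% vcov(fit) %*% L)     # variance of the contrast
S   <- est^2 / v                                # Wald chi-square statistic
pchisq(S, df = 1, lower.tail = FALSE)           # p-value
est + c(-1, 1) * 1.96 * sqrt(v)                 # Wald-type 95% confidence interval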




Another type of test is the Likelihood Ratio (LR) test. Asymptotically, the LR and Wald tests give the same answer, and they certainly will with a sample this large.

  • The basic idea of the LR test is to compare the log likelihood (LL) at the maximized value without the constraint to the maximized value with the constraint(s): $G^{2}=2\,LL(\text{full model})-2\,LL(\text{constrained model})$. This is illustrated below via brute force, with the constrained model (setting $\beta_{2}=\beta_{3}$) fit as: $$log(\frac{p}{(1-p)})=\beta_{0}+\beta_{1}(Gifts1)+\beta_{2}(Gifts2)+\beta_{2}(Gifts3)=\beta_{0}+\beta_{1}(Gifts1)+\beta_{2}(Gifts2+Gifts3)$$


  • This constrained model is very simple to refit in R and does not require use of an offset variable:
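
A hedged sketch of that refit and the LR statistic, assuming the data frame dat and the unconstrained fit from the last post (variable names as described there):

fit_constrained <- glm(Respond ~ Gifts1 + I(Gifts2 + Gifts3),
                       data = dat, family = binomial)

G2 <- 2 * (logLik(fit) - logLik(fit_constrained))    # LR statistic, 1 df
pchisq(as.numeric(G2), df = 1, lower.tail = FALSE)   # p-value

# Equivalently, in one line:
anova(fit_constrained, fit, test = "Chisq")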


Sunday, December 30, 2012

Simple Logistic Regression (GLM and Optim)

We will be using the data from the 1998 KDD Cup in the next couple of posts - at least a couple of its columns - which we will re-purpose. Using R to download and unzip the data set, we keep three fields, renaming all of them and binning the last (a sketch of this recoding follows the field list below). We are left with the following fields:

  • Respond: 1 if the prospect donated, 0 otherwise.
  • Amount: The amount the prospect donated.
  • Gifts: We will create three dummy variables that take the value 1 if the binned value is 1, 2, or 3 respectively, with 4+ set as our reference level.
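
A hedged sketch of the recoding step only (the download/unzip is omitted here, and the raw data frame raw and its column names are hypothetical placeholders, not the actual KDD Cup field names):

dat <- data.frame(Respond = raw$target_flag,        # 1 if the prospect donated
                  Amount  = raw$target_amount,      # amount donated
                  Gifts   = pmin(raw$gift_count, 4))  # bin 4+ together

dat$Gifts1 <- as.numeric(dat$Gifts == 1)
dat$Gifts2 <- as.numeric(dat$Gifts == 2)
dat$Gifts3 <- as.numeric(dat$Gifts == 3)            # 4+ is the reference level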

We will imagine here that the Gifts variable is a nominal variable representing 4 levels of an experiment. For concreteness, imagine that we sent a direct mail letter to prospects asking them for a donation, with a free gift for anyone donating - say, a free gift card to department store X versus a free gift card to store Y, etc. We are interested in which of these gifts led to the greatest response. These free gift offers were randomized among the prospects in an unbalanced way (the sample sizes are quite different in this data set).



To analyze the result of our simple experiment, we decide to fit a logistic regression examining the effect of the 4 gifts on the probability that the prospect will respond (ignoring the amount donated), making Gift 4 the reference level. In this simple case we could use other methods, but a logistic regression will generalize to larger, more complex experiments.

We can do this easily in R using the glm function. Here we quickly examine the coefficients fit via maximum likelihood, their standard errors, the log-likelihood of the data at the MLEs, and the variance-covariance matrix of these estimates.
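
A minimal sketch of this step, using the dat data frame from the recoding sketch above:

fit <- glm(Respond ~ Gifts1 + Gifts2 + Gifts3, data = dat, family = binomial)

summary(fit)    # coefficients and standard errors
logLik(fit)     # log-likelihood at the MLEs
vcov(fit)       # variance-covariance matrix of the estimates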



For reasons that will be useful later, we can also use the optim function, along with the log-likelihood of the logistic model, to obtain the same results. Here we minimize the negative log-likelihood and compare the estimates to those from glm, noting they are very nearly identical.
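
A hedged sketch of the optim version, minimizing the negative log-likelihood of the logistic model directly and comparing with the glm estimates above:

neg_loglik <- function(beta, X, y) {
  eta <- X %*% beta
  p   <- 1 / (1 + exp(-eta))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

X   <- model.matrix(~ Gifts1 + Gifts2 + Gifts3, data = dat)
y   <- dat$Respond
opt <- optim(rep(0, ncol(X)), neg_loglik, X = X, y = y,
             method = "BFGS", hessian = TRUE)

cbind(glm = coef(fit), optim = opt$par)   # estimates agree to several decimal places
sqrt(diag(solve(opt$hessian)))            # standard errors from the inverted Hessian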



Next post will examine a linear hypothesis test - looking at the difference between Gifts=2 and Gifts=3.


Thursday, December 27, 2012

Introduction

This blog will be an outlet to write about data mining and predictive modeling. I plan to explore and comment on the use of these disciplines aimed (mainly) at marketing analytics and database marketing - in a very applied fashion.

My goal is to create a space to write about topics that interest me, allow me to learn (nothing is better for learning a new topic or technique than to write it down and *try* to explain it!) and share experience.


Several posts are already planned:
  • Illustration of linear contrasts in a logistic regression
  • Illustrations of non-linear contrasts in a logistic regression using the delta method
  • Partial dependency plots in a gradient boosted regression tree model
  • An application of zero inflated gamma regression - a handy technique to understand the impact that a factor (e.g. a coupon) has on a continuous variable (like revenue) that contains many zero values - representing those targeted customers that did not purchase.
  • Simulation based power analysis for direct marketing factorial designs.

Feel free to comment, correct or suggest on any post.