Wednesday, January 2, 2013

Need for Repeated Hold Outs in Predictive Models

Many models are built and deployed using a training/validation partitioning approach. Under this construct a data set is randomly split into two parts: one part is used to train the model and the other is held out to validate how well the model works. The split is sometimes done 50/50, and other times more data is left for training than for validating, especially when the target variable is a rare event (which is almost always the case in database marketing). Personally, I use a 65/35 train/validate split as the standard. Sometimes a third partition is sampled out to further tune the model before the final validation on a "test" set. The only time I have used that approach is when calibrating scores into probabilities, but it is used frequently by others and is the default in SAS Enterprise Miner when partitioning.
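As a minimal sketch, a 65/35 split in R looks something like this. The data here is simulated purely for illustration (it is not the data set described later in the post); only the splitting mechanics are the point.

    ## A minimal sketch of a single 65/35 train/validate split in base R.
    ## `dat` is simulated for illustration: a binary target `y` with roughly
    ## a 3% event rate and two arbitrary predictors.
    set.seed(2013)
    n   <- 10000
    dat <- data.frame(x1 = rnorm(n),
                      x2 = rnorm(n),
                      y  = rbinom(n, 1, 0.03))

    train_idx <- sample(seq_len(n), size = round(0.65 * n))
    train     <- dat[train_idx, ]
    validate  <- dat[-train_idx, ]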

There are many types of validation, from this split-sample approach to bootstrapping, k-fold cross validation, leave-one-out cross validation (LOOCV) and many others. In R, the caret package is a fantastic workbench for tuning and validating predictive models.
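For instance, here is a quick sketch of repeated 10-fold cross validation with caret, using the simulated data from the split example above and a plain logistic regression. This is just to show the mechanics, not the validation scheme or model used for the results below.

    ## Repeated 10-fold cross validation with caret on the simulated data.
    ## caret wants a factor outcome with valid level names when class
    ## probabilities are requested.
    library(caret)

    dat$y_f <- factor(ifelse(dat$y == 1, "event", "nonevent"),
                      levels = c("event", "nonevent"))

    ctrl <- trainControl(method          = "repeatedcv",  # k-fold CV, repeated
                         number          = 10,            # 10 folds
                         repeats         = 5,             # 5 repeats
                         classProbs      = TRUE,
                         summaryFunction = twoClassSummary)

    fit <- train(y_f ~ x1 + x2, data = dat,
                 method = "glm", metric = "ROC", trControl = ctrl)
    fit$results   # resampled performance across all folds and repeats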

In database marketing, it seems that many (perhaps the majority) of models are validated with a single hold-out partition.

Here is a quick example of the need to understand the variability that is inherent in the random split between training and validation data. If the split is done only once, the results may be quite misleading.

The plot below displays the error curves for a model built on a data set of 152,000 records. 65% was sampled to train the model and 35% was held back to validate. In the full data set there were about 3% positive events (the target is binary). The model is a form of Generalized Naive Bayes.

The process followed to create this plot was simple:

  • Split the data set into train and validate with a random seed.
  • Do all processing of the data (e.g. feature selection) that involves the target variable or could be considered a tuning parameter. [It is vital to do this independently each time in any cross-validation process.]
  • Train the model.
  • Predict the outcome on the validation set.
  • Rank the validation data into 10 equally sized groups (deciles) by the predicted score.
  • For each decile, calculate the average actual "response" (i.e. proportion of 1's), the average predicted response and the difference between these.

Repeat the above multiple times; here it was done 10 times. To make this feasible, the pre-processing, feature selection and modeling need to be largely automated (programmatic). This is not always possible.
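Here is a rough R sketch of that loop, again on the simulated data from above and with a plain logistic regression standing in for the generalized naive Bayes model; the structure of the loop, not the model, is the point.

    ## Repeat the train/validate split, model fit and decile error
    ## calculation several times, keeping the per-decile results.
    n_reps  <- 10
    results <- vector("list", n_reps)

    for (r in seq_len(n_reps)) {
      ## 1. Random 65/35 split, new seed each repetition
      set.seed(r)
      idx      <- sample(seq_len(nrow(dat)), size = round(0.65 * nrow(dat)))
      train    <- dat[idx, ]
      validate <- dat[-idx, ]

      ## 2. Any target-aware preprocessing or feature selection belongs
      ##    here, inside the loop, so it is redone for every split.

      ## 3. Train the model and 4. score the validation set
      fit <- glm(y ~ x1 + x2, data = train, family = binomial)
      validate$score <- predict(fit, newdata = validate, type = "response")

      ## 5. Rank validation records into deciles by predicted score
      ##    (decile 1 = highest scores)
      validate$decile <- 11 - cut(validate$score,
                                  breaks = quantile(validate$score,
                                                    probs = seq(0, 1, 0.1)),
                                  include.lowest = TRUE, labels = FALSE)

      ## 6. Average actual vs. predicted response per decile
      by_dec       <- aggregate(cbind(actual = y, predicted = score) ~ decile,
                                data = validate, FUN = mean)
      by_dec$diff  <- by_dec$actual - by_dec$predicted
      by_dec$rep   <- r
      results[[r]] <- by_dec
    }

    all_runs <- do.call(rbind, results)   # one row per decile per repetition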

There is a large degree of variability across the runs within each decile, variability that a decision maker likely needs to understand and that the modeler needs to be aware of, and that could well be lost with a single hold-out partition.

[Plot: actual minus predicted response by decile for each of the 10 train/validate splits]

I would be interested to know what others do in these cases. I lean towards presenting something like the plot below, from the same data as above, where a smooth (loess) is fit to the average actual response by decile, along with this same error data.

[Plot: loess smooth of average actual response by decile, shown with the same per-run error data]

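Something along these lines, in base R, using the all_runs data frame from the loop sketch above (the plotting details are just a rough stand-in for the actual graphic):

    ## Average actual response by decile across all repeated runs,
    ## with a loess smooth overlaid; uses all_runs from the loop above.
    plot(all_runs$decile, all_runs$actual,
         xlab = "Decile (1 = highest predicted score)",
         ylab = "Average actual response",
         pch  = 19, col = "grey60")

    sm       <- loess(actual ~ decile, data = all_runs)
    dec_grid <- seq(1, 10, by = 0.1)
    lines(dec_grid, predict(sm, newdata = data.frame(decile = dec_grid)), lwd = 2)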