There are many types of validation, from this split-sample approach to bootstrapping, k-fold cross-validation, leave-one-out cross-validation (LOOCV) and many others. In R, the caret package is a fantastic workbench for tuning and validating predictive models.
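For example, a minimal caret sketch of repeated k-fold cross-validation might look like the following (the data frame `dat`, the factor target `y`, and the `glm` method are placeholders here, not the model from this post):

```r
library(caret)

# 10-fold cross-validation, repeated 5 times; `dat` and the factor
# target `y` (with levels such as "no"/"yes") are placeholder names,
# and "glm" stands in for whatever model is actually being tuned
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit <- train(y ~ ., data = dat, method = "glm",
             metric = "ROC", trControl = ctrl)
```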
In database marketing, it seems that many (perhaps the majority of) models are validated with a single hold-out partition.
Here is a quick example of the need to understand the variability inherent in a random train/validation split. If the split is done only once, the results may be quite misleading.
The plot below displays the error curves for a model built on a data set of 152,000 records. 65% was sampled to train the model and 35% was held back for validation. The full data set contained about 3% positive events (the target is binary). The model is a form of Generalized Naive Bayes.
The process followed to create this plot was simple:
- Split the data set into training and validation sets using a random seed.
- Perform all processing of the data (e.g. feature selection) that involves the target variable or could be considered a tuning parameter. [It is vital to do this independently each time in any cross-validation process.]
- Train the model.
- Predict the outcome on the validation set.
- Rank the validation data into 10 equally sized groups (deciles) by the predicted score.
- For each decile, calculate the average actual "response" (i.e. proportion of 1's), the average predicted response and the difference between these.
Repeat the above multiple times - here it was done 10 times. To make this feasible, the pre-processing, feature selection and modeling need to be automated programmatically, which is not always possible. A sketch of the whole loop is below.
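A rough R sketch of that loop, assuming a data frame `dat` with a 0/1 target `y`, and with `glm()` standing in for the Generalized Naive Bayes model actually used:

```r
set.seed(42)
n_runs <- 10

decile_runs <- lapply(seq_len(n_runs), function(run) {
  # fresh 65/35 split on each run
  idx   <- sample(nrow(dat), size = floor(0.65 * nrow(dat)))
  train <- dat[idx, ]
  valid <- dat[-idx, ]

  # any target-aware pre-processing / feature selection belongs here,
  # redone independently on each training split

  fit  <- glm(y ~ ., data = train, family = binomial)
  pred <- predict(fit, newdata = valid, type = "response")

  # rank the validation records into 10 equal-sized groups by score
  # (decile 1 = highest predicted response)
  decile <- cut(rank(-pred, ties.method = "first"),
                breaks = 10, labels = FALSE)

  data.frame(run       = run,
             decile    = 1:10,
             actual    = as.vector(tapply(valid$y, decile, mean)),
             predicted = as.vector(tapply(pred, decile, mean)))
})

decile_df <- do.call(rbind, decile_runs)
decile_df$diff <- decile_df$actual - decile_df$predicted
```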
There is a large degree of variability within each decile from run to run, variability that a decision maker likely needs to understand and that the modeler needs to be aware of, and that could well be lost with a single hold-out partition.
I would be interested to know what others do in these cases. I lean towards presenting something like the plot below - built from the same data as above - where a loess smooth is fit to the average actual response by decile, along with the same error data.
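One way to draw that is a ggplot2 sketch over the `decile_df` assembled above (the exact aesthetics are an assumption about what the original plot showed):

```r
library(ggplot2)

# points for every run's actual response by decile, with a loess
# smooth (and its error band) drawn through them
ggplot(decile_df, aes(x = decile, y = actual)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = TRUE) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Decile (1 = highest predicted score)",
       y = "Actual response rate")
```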