My favorite "go-to-first" modeling algorithm for a classification or regression task is Gradient Boosted Regression Trees (Friedman, 2001 and 2002), especially as coded in the GBM package of R. Gradient boosted regression is one of the focuses of my master's thesis (along with random forests) and has gotten more and more attention as more and more data mining contests have been won, at least in part, by employing this method. Some of the reasons I pick Gradient Boosted Regression Trees as the best off-the-shelf predictive modeling algorithm available are:
 High predictive accuracy across many domains and target data types
 Ability to specify various loss functions (Gaussian, Bernoulli, Poisson, etc.) as well as run survival analysis via Cox proportional hazards, quantile regression, etc.
 Handles mixed data types (continuous and nominal)
 Seamlessly deals with missing values
 Contains out-of-bag (OOB) estimates of error
 Contains variable importance measures
 Contains variable interaction measures / detection
 Allows estimates of the marginal effects of a predictor (or predictors) via partial dependency plots.
This latter point is a nice feature coded into the GBM package that gives the analyst the ability to
produce univariate and bivariate partial dependency plots. These plots enable
the researcher to understand the effect of a predictor variable (or the interaction
between two predictors) on the target outcome, given the other predictors (partialling them out, i.e., after accounting for their average effects).
The technique is not
unique to Gradient Boosted Regression Trees and in fact is a general method for
understanding any black-box modeling algorithm. When we use ensembles, for instance, this tends to be the only way to understand how changing the value of a predictor, say a binary predictor from 0 to 1 or a continuous predictor across its observed range, affects the outcome, given the model (i.e., accounting for the impact of the other predictors in the model).
The idea of these “marginal” plots is to display the
modeled outcome over the range of a given predictor. Hastie (2009) describes
this general technique as partitioning the predictors into a small subset we are interested
in, $X_{s}$, where this subset is typically one or two predictors
in practice, and the remaining predictors in the model, $X_{c}$. The full model function is then written as $f(X)=f(X_{s},X_{c})$ where $X=X_{s} \cup X_{c}$. A partial
dependence plot displays the marginal expected value of $f$ over $X_{s}$, obtained by averaging over the values of $X_{c}$. Practically, with $N$ observations this is estimated by $\bar{f}(x_{s})=\frac{1}{N}\sum_{i=1}^{N} f(x_{s}, x_{ci})$: the subset
predictors are set to the value (vector) of interest, $x_{s}$, and the model output is averaged over the values of $X_{c}$ as they exist in the data set.
A brute-force method
would be to make copies of the training data, one for
every pattern of $X_{s}$ of interest plugged in, then use the model to score
each copy and output the average prediction for each distinct value of $X_{s}$.
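This brute-force averaging can be sketched in a few lines. The function below is a hypothetical helper, again using scikit-learn in place of GBM: for each grid value of the predictor of interest ($X_{s}$), it overwrites that column in a copy of the data, scores the copy with the model, and averages the predictions over the observed values of the complement predictors ($X_{c}$):

```python
# Minimal brute-force partial dependence sketch; names and data are illustrative.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=300, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average model output with column `feature` forced to each grid value,
    leaving the complement predictors X_c at their observed values."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                      # set X_s = x_s for every row
        pd_values.append(model.predict(X_mod).mean())  # average over X_c
    return np.array(pd_values)

# Evaluate the partial dependence of the model on predictor 0 over its range.
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_curve = partial_dependence(model, X, feature=0, grid=grid)
```

Plotting `pd_curve` against `grid` gives the univariate partial dependency plot; the same loop over a two-column grid gives the bivariate version.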


