
Monday, December 31, 2012

Linear Hypothesis Tests

Returning to our example from the last post, within the framework of a generalized linear model, one way to statistically test whether the effect on the outcome variable differs between two levels of a nominal variable (when neither level is the reference level under dummy coding) is to use a linear contrast (simulation and bootstrapping being two others). The estimates are made ceteris paribus (holding all the other model predictors constant) and, while simple in this case, the approach generalizes easily to much more complex situations (e.g. multi-way interactions and much larger models).

We are testing the null hypothesis H_{0}: L'\beta=0, where L is a vector or matrix of hypothesis coefficients (the contrast matrix), carefully chosen to represent a linear combination of coefficients from the model of interest, and \beta is the parameter vector, estimated by the vector of maximum likelihood estimates b (the fitted coefficients of our model).

In this very simple example, we are interested in the difference between \beta_{2} and \beta_{3} in our logistic regression. Thus we are interested in testing the linear combination of parameters \beta_{2}-\beta_{3}=0 (i.e., \beta_{2}=\beta_{3}).

The first type of test we describe is a Wald test.
  • The intuition for this test comes from the test of a single model parameter: \frac{b_{1}-\beta_{1}}{s\{b_{1}\}}, where s\{b_{1}\} is the estimated standard error of b_{1}. This statistic follows a t distribution with degrees of freedom equal to the number of observations in the data set minus the number of parameters estimated. In large samples the t distribution approaches the standard normal distribution, and the square of a standard normal random variable follows a chi-square distribution (a quick numeric check of this relationship appears after the code below).
  • Thus, from this heuristic description, the formulation of the Wald test makes sense and is given as S=(L'b)'(L'\Sigma L)^{-1}(L'b), where \Sigma is the variance-covariance matrix of the estimated coefficients. S is asymptotically distributed as a chi-square with degrees of freedom equal to the number of restrictions in the null hypothesis. In this case, we are conducting a single degree of freedom test.
  • With our model log(\frac{p}{1-p})=\beta_{0}+\beta_{1}(Gifts1)+\beta_{2}(Gifts2)+\beta_{3}(Gifts3), the formulation of L to construct \beta_{2}-\beta_{3}=0 is clearly L=[0,0,1,-1]^{T}, which comes from the contrast of interest [\beta_{0}+\beta_{1}(0)+\beta_{2}(1)+\beta_{3}(0)] - [\beta_{0}+\beta_{1}(0)+\beta_{2}(0)+\beta_{3}(1)]
    • For a maximum likelihood estimator, the MLE of \beta_{2}-\beta_{3} is this expression evaluated at the MLE of the coefficients: b_{2}-b_{3}.
    • Further, this contrast has a variance of (L'\Sigma L), which follows from first principles: VAR(b_{2}-b_{3})=VAR(b_{2})+VAR(b_{3})-2COV(b_{2},b_{3}).
  • In R we can use the linearHypothesis function in John Fox's excellent car package. We fail to reject the null hypothesis, with a test statistic of 0.6702656 and p-value of 0.412959. We also compute the test statistic manually: 
library(car)
#B2-B3=0
cVec<-c(0,0,1,-1) #this is L
car::linearHypothesis(mod_glm,cVec, verbose=TRUE, test = "Chisq") #Estimated linear function (hypothesis.matrix %*% coef - rhs)
#Hypothesis matrix:
#     [,1] [,2] [,3] [,4]
#[1,]    0    0    1   -1
#Right-hand-side vector:
#[1] 0
#Estimated linear function (hypothesis.matrix %*% coef - rhs)
#[1] -0.07059217
#Linear hypothesis test
#Hypothesis:
#Gifts2 - Gifts3 = 0
#Model 1: restricted model
#Model 2: Respond ~ Gifts1 + Gifts2 + Gifts3
#  Res.Df Df  Chisq Pr(>Chisq)
#1  95409
#2  95408  1 0.6703      0.413
#Manually
(t(t(cVec)%*%mod_glm$coefficients)) %*% solve((t(cVec)%*%vcov(mod_glm)%*%cVec)) %*% (t(cVec)%*%mod_glm$coefficients)
#[1,] 0.6702656
#p-value
1-pchisq(0.6702656, 1)
#[1] 0.412959
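
As a quick aside tying this back to the single-parameter intuition above, the squared z value reported by summary() is the one degree of freedom Wald chi-square for testing that coefficient against zero. A minimal check (not part of the original output, and assuming the mod_glm object fit in the previous post):

#squared z value for Gifts2 from the summary table
z<-coef(summary(mod_glm))["Gifts2","z value"]
z^2
#compare to the 1 df Wald test of B2=0 via the contrast L=[0,0,1,0]
car::linearHypothesis(mod_glm, c(0,0,1,0), test="Chisq")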

The contrast \beta_{2}-\beta_{3} has a point estimate of -0.07059 with a variance of 0.007434746. We can construct a Wald-type 95% confidence interval as -0.07059\pm1.96*\sqrt{0.007434746}=(-0.2395909, 0.09841095).


#linear predictor when Gifts2=1 (all other dummies 0)
-2.82119+(-0.57949*0)+(-0.45892*1)+(-0.38833*0) #-3.28011
#linear predictor when Gifts3=1 (all other dummies 0)
-2.82119+(-0.57949*0)+(-0.45892*0)+(-0.38833*1) #-3.20952
#Difference is -3.28011-(-3.20952)= -0.07059
#This difference has the following variance:
(t(cVec)%*%vcov(mod_glm)%*%cVec)
#0.007434746
#95% CI
-0.07059-1.96*sqrt(0.007434746)#-0.2395909
-0.07059+1.96*sqrt(0.007434746)#0.09841095
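
As a cross-check (an addition, not in the original post), the glht function in the multcomp package will compute the same contrast estimate and a Wald-type confidence interval from a one-row contrast matrix, assuming multcomp is installed:

library(multcomp)
#one-row contrast matrix for B2-B3
K<-matrix(c(0,0,1,-1),nrow=1)
contrast<-glht(mod_glm,linfct=K)
summary(contrast) #point estimate of the contrast and its standard error
confint(contrast) #Wald-type 95% confidence interval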


Another type of test we describe is a Likelihood Ratio (LR) test. Asymptotically, the LR and Wald tests give the same answer, and they certainly do with a sample this large.

  • The basic idea of the LR test is to compare the log-likelihood (LL) at the unconstrained maximum to the maximum under the constraint(s): G^{2}=2*(LL full model)-2*(LL constrained model). This is illustrated below via brute force, with the constrained model (imposing \beta_{3}=\beta_{2}) fit as: log(\frac{p}{1-p})=\beta_{0}+\beta_{1}(Gifts1)+\beta_{2}(Gifts2)+\beta_{2}(Gifts3)
=> log(\frac{p}{1-p})=\beta_{0}+\beta_{1}(Gifts1)+\beta_{2}(Gifts2+Gifts3)


  • This constrained model is very simple to refit in R and does not require the use of an offset variable:
#LL of the unconstrained model
logLik(mod_glm)
#'log Lik.' -19061.75 (df=4)
#construct the new combined variable
data$Gifts2_3<-data$Gifts2+data$Gifts3
#fit new model
mod_glm_new<-glm(Respond~Gifts1+Gifts2_3,data=data, family=binomial(link=logit))
logLik(mod_glm_new)
#'log Lik.' -19062.08 (df=3)
#LR test
2*logLik(mod_glm)-2*logLik(mod_glm_new)
#[1] 0.6701161
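
The same comparison can be run with the anova method for nested glm fits, together with the chi-square p-value for the brute-force statistic. A short sketch (not part of the original gists), assuming the two models fit above:

#p-value for the brute-force LR statistic on 1 df
1-pchisq(0.6701161, 1)
#equivalent nested-model comparison
anova(mod_glm_new, mod_glm, test="Chisq")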


Sunday, December 30, 2012

Simple Logistic Regression (GLM and Optim)

We will be using data from the 1998 KDD Cup in the next couple of posts - at least a couple of the columns - which we will re-purpose. Using R to download and unzip the data set, we keep three fields, renaming all three and binning the last of them. We are left with the following fields:

  • Respond: 1 if the prospect donated, 0 otherwise.
  • Amount: The amount the prospect donated.
  • Gifts: We will create three dummy variables that take the value 1 if the number of gifts is 1, 2, or 3 respectively, with 4+ set as our reference level. 

We will imagine here that the Gifts variable is a nominal variable representing 4 levels of an experiment. For concreteness, we can imagine that we sent a direct mail letter to prospects asking them for a donation, with a free gift for anyone donating. We are interested in which of these gifts led to the greatest response - say, a free gift card to department store X versus a free gift card to store Y, and so on. These free gift offers were randomized among the prospects in an unbalanced way (sample sizes are quite different across levels in this data set).

    #download data set from URL (95,412 records)
    temp <- tempfile()
    download.file("http://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98lrn.zip",temp)
    data <- read.table(unz(temp, "cup98LRN.txt"),sep=",",header=TRUE)
    unlink(temp)
    #keep only three variables
    data<-data[,c("TARGET_D","TARGET_B","NGIFTALL")]
    #bin the gifts into 1,2,3,4+
    data$NGIFTALL<-ifelse(data$NGIFTALL>3,4,data$NGIFTALL)
    colnames(data)<-c("Amount","Respond","Gifts")
    #create dummy variables for gifts
    data$Gifts1<-ifelse(data$Gifts==1,1,0)
    data$Gifts2<-ifelse(data$Gifts==2,1,0)
    data$Gifts3<-ifelse(data$Gifts==3,1,0)
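
    As a quick sanity check (an addition, not part of the original walkthrough), we can look at the raw response rate and sample size for each gift level before modeling:

    #raw response rate and count by gift level (1, 2, 3, 4+)
    tapply(data$Respond, data$Gifts, mean)
    table(data$Gifts)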


    To analyze the result of our simple experiment, we decide to fit a logistic regression to the data, examining the effect of the 4 gifts on the probability that the prospect will respond (ignoring the amount donated). We are making Gift 4 the reference level. In this simple case we could use other methods, but a logistic regression will generalize to larger, more complex experiments.

    We can do this easily in R using the glm function. Here we quickly examine the coefficients that were fit via maximum likelihood, their standard errors, the log-likelihood of the data at the MLEs and the variance-covariance matrix of these estimates.

    #logistic regression via glm##############
    ##########################################
    mod_glm<-glm(Respond~Gifts1+Gifts2+Gifts3,data=data, family=binomial(link=logit))
    summary(mod_glm)
    #Coefficients:
    #             Estimate Std. Error  z value Pr(>|z|)
    #(Intercept)  -2.82119    0.01636 -172.495  < 2e-16 ***
    #Gifts1       -0.57949    0.05888   -9.842  < 2e-16 ***
    #Gifts2       -0.45892    0.06303   -7.280 3.33e-13 ***
    #Gifts3       -0.38833    0.06322   -6.143 8.11e-10 ***
    logLik(mod_glm) #LL
    #'log Lik.' -19061.75 (df=4)
    vcov(mod_glm)
    #             (Intercept)        Gifts1        Gifts2        Gifts3
    #(Intercept)  0.0002674919 -0.0002674919 -0.0002674919 -0.0002674919
    #Gifts1      -0.0002674919  0.0034667138  0.0002674919  0.0002674919
    #Gifts2      -0.0002674919  0.0002674919  0.0039732911  0.0002674919
    #Gifts3      -0.0002674919  0.0002674919  0.0002674919  0.0039964383
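
    To put the coefficients on a more interpretable scale, here is a minimal sketch (an addition, not from the original gists) of the fitted response probability for each gift level; the data frame name new_dat is hypothetical:

    #fitted response probability for Gifts = 1, 2, 3 and the 4+ reference level
    new_dat<-data.frame(Gifts1=c(1,0,0,0), Gifts2=c(0,1,0,0), Gifts3=c(0,0,1,0))
    predict(mod_glm, newdata=new_dat, type="response")
    #equivalently, apply the inverse logit to the linear predictor by hand, e.g. for Gifts=2
    plogis(-2.82119+(-0.45892))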


    For reasons that will be useful later, we can also use the optim function, along with the log-likelihood of the logistic model, to achieve the same results. Here we minimize the negative log-likelihood and compare the estimates to those from glm - noting they are very nearly identical.

    #negative log-likelihood function
    logistic.lik<-function(par,dat)
    {
      xb<-par[1]*1+par[2]*dat$Gifts1+par[3]*dat$Gifts2+par[4]*dat$Gifts3 #linear predictor
      g_x<-1/(1+exp(-xb)) #p(Y=1 | betas)
      LL<-sum(dat$Respond*log(g_x)+(1-dat$Respond)*log(1-g_x))
      return(-LL) #returns the negative LL since by default optim minimizes
    }
    result<-optim(c(0,0,0,0),logistic.lik, dat=data, hessian=TRUE, method="BFGS")
    result$par #MLE of Bo, B1, B2 and B3
    #[1] -2.8211834 -0.5795474 -0.4588734 -0.3883401
    (-1)*result$value #maximized log-likelihood
    #[1] -19061.75
    varCov<-solve(result$hessian) #no minus sign needed since the Hessian is of the negative log-likelihood (the observed information)
    varCov
    #              [,1]          [,2]          [,3]          [,4]
    #[1,]  0.0002674911 -0.0002674911 -0.0002674911 -0.0002674911
    #[2,] -0.0002674911  0.0034668663  0.0002674911  0.0002674911
    #[3,] -0.0002674911  0.0002674911  0.0039731245  0.0002674911
    #[4,] -0.0002674911  0.0002674911  0.0002674911  0.0039964732
    #As an example, the standard error of B2
    sqrt(varCov[3,3])
    #[1] 0.06303273
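
    As one more comparison (an addition, not from the original gists), the standard errors from the two fits can be placed side by side by taking the square roots of the diagonals of the two variance-covariance matrices:

    #standard errors from glm and from the inverted Hessian of optim
    cbind(glm_se=sqrt(diag(vcov(mod_glm))), optim_se=sqrt(diag(varCov)))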


    The next post will examine a linear hypothesis test - looking at the difference between Gifts=2 and Gifts=3.


    Thursday, December 27, 2012

    Introduction

    This blog will be an outlet to write about data mining and predictive modeling. I plan to explore and comment on the use of these disciplines aimed (mainly) at marketing analytics and database marketing - in a very applied fashion.

    My goal is to create a space to write about topics that interest me, allow me to learn (nothing is better for learning a new topic or technique than to write it down and *try* to explain it!) and share experience.


    Several posts are already planned:
    • Illustration of linear contrasts in a logistic regression
    • Illustrations of non-linear contrasts in a logistic regression using the delta method
    • Partial dependency plots in a gradient boosted regression tree model
    • An application of zero inflated gamma regression - a handy technique to understand the impact that a factor (e.g. a coupon) has on a continuous variable (like revenue) that contains many zero values - representing those targeted customers that did not purchase.
    • Simulation based power analysis for direct marketing factorial designs.

    Feel free to comment on, correct, or make suggestions about any post.