
Sunday, December 30, 2012

Simple Logistic Regression (GLM and Optim)

We will be using data from the 1998 KDD Cup in the next couple of posts - at least a couple of its columns, which we will re-purpose. Using R to download and unzip the data set, we will keep three fields, renaming all of them and binning the last. We are left with the following fields:

  • Respond: 1 if the prospect donated, 0 otherwise.
  • Amount: The amount the prospect donated.
  • Gifts: We will create three dummy variables that take on the value 1 if the number of gifts is 1, 2, or 3 respectively, with 4+ set as our reference level.

We will imagine here that the Gifts variable is a nominal variable representing 4 levels of an experiment. For concreteness, imagine that we sent prospects a direct mail letter asking for a donation, with a free gift for anyone who donates - say, a free gift card to department store X versus a free gift card to store Y, and so on. We are interested in which of these gifts led to the greatest response. The free gift offers were randomized among the prospects in an unbalanced way (the sample sizes in this data set are quite different).

    #download data set from URL (95,412 records)
    temp <- tempfile()
    download.file("http://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98lrn.zip",temp)
    data <- read.table(unz(temp, "cup98LRN.txt"),sep=",",header=TRUE)
    unlink(temp)
    #keep only three variables
    data<-data[,c("TARGET_D","TARGET_B","NGIFTALL")]
    #bin the gifts into 1,2,3,4+
    data$NGIFTALL<-ifelse(data$NGIFTALL>3,4,data$NGIFTALL)
    colnames(data)<-c("Amount","Respond","Gifts")
    #create dummy variables for gifts
    data$Gifts1<-ifelse(data$Gifts==1,1,0)
    data$Gifts2<-ifelse(data$Gifts==2,1,0)
    data$Gifts3<-ifelse(data$Gifts==3,1,0)
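As an aside, the same dummy coding could be left to R's factor machinery, using relevel to make 4+ the reference level and letting glm build the dummy columns itself. A minimal sketch on simulated data (the variable names mirror ours, but the data here are made up):

```r
#sketch on simulated data: code Gifts as a factor with 4+ as the
#reference level and let glm create the Gifts1-Gifts3 dummies
set.seed(1)
df <- data.frame(Gifts = sample(1:4, 400, replace = TRUE))
df$Respond <- rbinom(400, 1, 0.1)
df$Gifts <- relevel(factor(df$Gifts), ref = "4")
mod <- glm(Respond ~ Gifts, data = df, family = binomial)
names(coef(mod)) #"(Intercept)" "Gifts1" "Gifts2" "Gifts3"
```

Hand-built dummies, as above, make the coefficients easy to address individually, which is convenient when we re-fit the model with optim below.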


    To analyze the result of our simple experiment, we decide to fit a logistic regression to the data, examining the effect of the 4 gifts on the probability that the prospect will respond (ignoring the amount donated). We are making Gift 4 the reference level. In this simple case we could use other methods, but a logistic regression will generalize to larger, more complex experiments.

    We can do this easily in R using the glm function. Here we quickly examine the coefficients that were fit via maximum likelihood, their standard errors, the log-likelihood of the data at the MLEs and the variance-covariance matrix of these estimates.

    #logistic regression via glm##############
    ##########################################
    mod_glm<-glm(Respond~Gifts1+Gifts2+Gifts3,data=data, family=binomial(link=logit))
    summary(mod_glm)
    #Coefficients:
    #             Estimate Std. Error  z value Pr(>|z|)
    #(Intercept) -2.82119    0.01636 -172.495  < 2e-16 ***
    #Gifts1      -0.57949    0.05888   -9.842  < 2e-16 ***
    #Gifts2      -0.45892    0.06303   -7.280 3.33e-13 ***
    #Gifts3      -0.38833    0.06322   -6.143 8.11e-10 ***
    logLik(mod_glm) #LL
    #'log Lik.' -19061.75 (df=4)
    vcov(mod_glm)
    #             (Intercept)        Gifts1        Gifts2        Gifts3
    #(Intercept)  0.0002674919 -0.0002674919 -0.0002674919 -0.0002674919
    #Gifts1      -0.0002674919  0.0034667138  0.0002674919  0.0002674919
    #Gifts2      -0.0002674919  0.0002674919  0.0039732911  0.0002674919
    #Gifts3      -0.0002674919  0.0002674919  0.0002674919  0.0039964383
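Since the coefficients are on the log-odds scale, we can pass them through the inverse logit to recover each gift's estimated response rate; with only these dummies in the model, the intercept maps to the raw response rate of the 4+ reference group. The numbers below are copied from the glm output above:

```r
#inverse logit of the fitted coefficients (values copied from the
#glm summary above) to recover each gift's response probability
inv_logit <- function(x) 1/(1+exp(-x))
b0 <- -2.82119 #intercept: log-odds of responding for the 4+ reference level
b <- c(Gifts1=-0.57949, Gifts2=-0.45892, Gifts3=-0.38833)
round(c(inv_logit(b0 + b), Gifts4=inv_logit(b0)), 4)
#Gifts1 Gifts2 Gifts3 Gifts4
#0.0323 0.0363 0.0388 0.0562
```

So all three gift offers are associated with a lower response rate than the 4+ reference level, which is what the negative coefficients say on the log-odds scale.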


    For reasons that will be useful later, we can also use the optim function, along with the log-likelihood of the logistic model, to achieve the same results. Here we minimize the negative log-likelihood and compare the estimates to those from glm, noting they are very nearly identical.

    #negative log-likelihood function
    logistic.lik<-function(par,dat)
    {
      xb<-par[1]*1+par[2]*dat$Gifts1+par[3]*dat$Gifts2+par[4]*dat$Gifts3 #linear predictor
      g_x<-1/(1+exp(-xb)) #P(Y=1 | betas)
      LL<-sum(dat$Respond*log(g_x)+(1-dat$Respond)*log(1-g_x)) #log-likelihood
      return(-LL) #returns the negative LL since by default optim minimizes
    }
    result<-optim(c(0,0,0,0),logistic.lik, dat=data, hessian=TRUE, method="BFGS")
    result$par #MLE of Bo, B1, B2 and B3
    #[1] -2.8211834 -0.5795474 -0.4588734 -0.3883401
    (-1)*result$value #maximized log-likelihood
    #[1] -19061.75
    varCov<-solve(result$hessian) #don't negate the hessian: we minimized the negative log-likelihood, so it is already the observed information
    varCov
    #             [,1]          [,2]          [,3]          [,4]
    #[1,]  0.0002674911 -0.0002674911 -0.0002674911 -0.0002674911
    #[2,] -0.0002674911  0.0034668663  0.0002674911  0.0002674911
    #[3,] -0.0002674911  0.0002674911  0.0039731245  0.0002674911
    #[4,] -0.0002674911  0.0002674911  0.0002674911  0.0039964732
    #As an example, the standard error of B2
    sqrt(varCov[3,3])
    #[1] 0.06303273
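As a further check that the optim fit reproduces glm's inference, we can form Wald z-statistics (estimate divided by standard error) from result$par and the diagonal of varCov. The numbers below are copied from the optim output above, and the resulting z values match the glm summary:

```r
#Wald z-statistics and p-values from the optim fit
#(estimates and variances copied from the output above)
est <- c(-2.8211834, -0.5795474, -0.4588734, -0.3883401) #result$par
se <- sqrt(c(0.0002674911, 0.0034668663, 0.0039731245, 0.0039964732)) #diag(varCov)
z <- est/se
round(z, 3)
#[1] -172.495   -9.843   -7.280   -6.143
p <- 2*pnorm(-abs(z)) #two-sided p-values, as in the glm summary
```

The largest p-value (for Gifts3) agrees with the 8.11e-10 reported by glm.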


    The next post will examine a linear hypothesis test, looking at the difference between Gifts=2 and Gifts=3.

