- Respond: 1 if the prospect donated, 0 otherwise.
- Amount: The amount the prospect donated.
- Gifts: A four-level variable (1, 2, 3, 4+). We will create three dummy variables that take the value 1 when Gifts equals 1, 2, or 3 respectively, with 4+ set as our reference level.
We will imagine here that Gifts is a nominal variable representing the 4 levels of an experiment. For concreteness, suppose we sent prospects a direct mail letter asking for a donation and offering a free gift to anyone who donates - say, a gift card to department store X versus a gift card to store Y, and so on. We are interested in which of these gift offers led to the greatest response. The offers were randomized among the prospects in an unbalanced way (the sample sizes differ quite a bit in this data set).
#download data set from URL (95,412 records)
temp <- tempfile()
download.file("http://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98lrn.zip", temp)
data <- read.table(unz(temp, "cup98LRN.txt"), sep = ",", header = TRUE)
unlink(temp)
#keep only three variables
data <- data[, c("TARGET_D", "TARGET_B", "NGIFTALL")]
#bin the gifts into 1, 2, 3, 4+
data$NGIFTALL <- ifelse(data$NGIFTALL > 3, 4, data$NGIFTALL)
colnames(data) <- c("Amount", "Respond", "Gifts")
#create dummy variables for gifts (4+ is the reference level)
data$Gifts1 <- ifelse(data$Gifts == 1, 1, 0)
data$Gifts2 <- ifelse(data$Gifts == 2, 1, 0)
data$Gifts3 <- ifelse(data$Gifts == 3, 1, 0)
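Before fitting a model, a quick sanity check (an optional addition, not in the original script) is to tabulate the raw response rate and group size within each gift level using base R. Because the model below contains only this one fully dummy-coded factor, its fitted probabilities should reproduce these raw proportions.
#observed response rate by gift level (4 = the 4+ bin)
tapply(data$Respond, data$Gifts, mean)
#unbalanced group sizes noted above
table(data$Gifts)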
To analyze the result of our simple experiment, we fit a logistic regression examining the effect of the 4 gifts on the probability that the prospect responds (ignoring the amount donated), with the 4+ level as the reference. In this simple case we could use other methods, but a logistic regression will generalize to larger, more complex experiments.
We can do this easily in R using the glm function. Below we fit the model and then examine the coefficients estimated via maximum likelihood, their standard errors, the log-likelihood of the data at the MLEs, and the variance-covariance matrix of these estimates.
#logistic regression via glm##############
##########################################
mod_glm <- glm(Respond ~ Gifts1 + Gifts2 + Gifts3, data = data, family = binomial(link = logit))
summary(mod_glm)
#Coefficients:
#             Estimate Std. Error  z value Pr(>|z|)
#(Intercept)  -2.82119    0.01636 -172.495  < 2e-16 ***
#Gifts1       -0.57949    0.05888   -9.842  < 2e-16 ***
#Gifts2       -0.45892    0.06303   -7.280 3.33e-13 ***
#Gifts3       -0.38833    0.06322   -6.143 8.11e-10 ***
logLik(mod_glm) #LL
#'log Lik.' -19061.75 (df=4)
vcov(mod_glm)
#              (Intercept)        Gifts1        Gifts2        Gifts3
#(Intercept)  0.0002674919 -0.0002674919 -0.0002674919 -0.0002674919
#Gifts1      -0.0002674919  0.0034667138  0.0002674919  0.0002674919
#Gifts2      -0.0002674919  0.0002674919  0.0039732911  0.0002674919
#Gifts3      -0.0002674919  0.0002674919  0.0002674919  0.0039964383
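Since 4+ is the reference level, the intercept is the log-odds of responding for the 4+ group and each Gifts coefficient is the shift in log-odds for that level. As a small illustration (not part of the original output), the coefficients can be back-transformed to fitted response probabilities and compared with the raw rates tabulated earlier:
#fitted P(respond) for gift levels 1, 2, 3 and the 4+ reference
b <- unname(coef(mod_glm))
plogis(c(b[1] + b[2], b[1] + b[3], b[1] + b[4], b[1]))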
For reasons that will be useful later, we can also use the optim function, along with the log-likelihood of the logistic model, to obtain the same results. Here we minimize the negative log-likelihood and compare the estimates to those from glm - noting they are very nearly identical.
#negative log-likelihood of the logistic model
logistic.lik <- function(par, dat)
{
  xb <- par[1]*1 + par[2]*dat$Gifts1 + par[3]*dat$Gifts2 + par[4]*dat$Gifts3 #linear predictor
  g_x <- 1/(1 + exp(-xb)) #P(Y=1 | betas)
  LL <- sum(dat$Respond*log(g_x) + (1 - dat$Respond)*log(1 - g_x)) #log-likelihood
  return(-LL) #return the negative LL since by default optim minimizes
}
result <- optim(c(0,0,0,0), logistic.lik, dat = data, hessian = TRUE, method = "BFGS")
result$par #MLEs of B0, B1, B2 and B3
#[1] -2.8211834 -0.5795474 -0.4588734 -0.3883401
(-1)*result$value #maximized log-likelihood
#[1] -19061.75
varCov <- solve(result$hessian) #no minus sign needed: we minimized the negative log-likelihood, so its Hessian is the observed information
varCov
#              [,1]          [,2]          [,3]          [,4]
#[1,]  0.0002674911 -0.0002674911 -0.0002674911 -0.0002674911
#[2,] -0.0002674911  0.0034668663  0.0002674911  0.0002674911
#[3,] -0.0002674911  0.0002674911  0.0039731245  0.0002674911
#[4,] -0.0002674911  0.0002674911  0.0002674911  0.0039964732
#As an example, the standard error of B2
sqrt(varCov[3,3])
#[1] 0.06303273
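To make the "very nearly identical" claim concrete, a short side-by-side comparison (again an illustrative addition) lines up the estimates and standard errors from glm and from optim:
#compare coefficients and standard errors from the two fits
cbind(glm_est   = coef(mod_glm),
      optim_est = result$par,
      glm_se    = sqrt(diag(vcov(mod_glm))),
      optim_se  = sqrt(diag(varCov)))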
The next post will examine a linear hypothesis test, looking at the difference between Gifts = 2 and Gifts = 3 (roughly sketched below).
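As a rough preview only (the next post may take a different route), a minimal sketch of a Wald test of H0: B3 - B2 = 0, built from the optim-based estimates and variance-covariance matrix above, might look like this:
#Wald test of H0: B3 - B2 = 0
diff_est <- result$par[4] - result$par[3]
diff_se  <- sqrt(varCov[3,3] + varCov[4,4] - 2*varCov[3,4])
z <- diff_est/diff_se
2*pnorm(-abs(z)) #two-sided p-value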