Pg 143/144 & https://github.com/TaddyLab/BDS/blob/master/examples/paidsearch.R
Text:
Figure 5.3 shows the log difference between average revenues in each group.
Caption:
The log-scale average revenue difference ..
In the code, however, both plots use totalrev and are created before semavg is defined.
The total and average log differences produce the same pattern on different scales, but the mismatch initially confused me as I walked through the code/example.
Relatedly, let's assume the graphs plot the mean instead of the total, so that they match the model.
The graphs first take the average (or the total, in the current code) and then take the log of that average, i.e. log(mean(revenue)).
The model uses y from semavg, which takes the log first and then the mean: in the code, y=mean(log(revenue)).
Whether we use sum or mean in the model, it seems like we would want to take the log after aggregating. This seems especially true if we were going to use sum rather than mean.
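To make the distinction concrete, here is a short sketch of how the two quantities relate. The notation is my own assumption, not the book's: r_1, ..., r_n are the positive daily revenues within one dma/treatment_period group.

```latex
\[
\underbrace{\frac{1}{n}\sum_{k=1}^{n}\log r_k}_{\texttt{mean(log(revenue))}}
  \;=\; \log\Bigl(\prod_{k=1}^{n} r_k\Bigr)^{1/n}
  \;\le\;
\underbrace{\log\Bigl(\frac{1}{n}\sum_{k=1}^{n} r_k\Bigr)}_{\texttt{log(mean(revenue))}}
\]
```

So mean(log(revenue)) is the log of the geometric mean and log(mean(revenue)) is the log of the arithmetic mean. By Jensen's inequality the two are equal only when every r_k in the group is identical, and the gap widens as revenue varies more within a group, which is why the two definitions of y need not give the same coefficient.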
Original Code (mean(log(revenue)))
library(data.table)
sem <- as.data.table(sem)
sem_avg_log <- sem[,
list(d=mean(1-search.stays.on), y=mean(log(revenue))),
by=c("dma","treatment_period")]
setnames(sem_avg_log, "treatment_period", "t") # names to match slides
sem_avg_log <- as.data.frame(sem_avg_log)
coef(glm(y ~ d*t, data=sem_avg_log))['d:t']
gives -0.006586852
log(mean(revenue)):
sem_log_avg <- sem[,
list(d=mean(1-search.stays.on), y=log(mean(revenue))),
by=c("dma","treatment_period")]
setnames(sem_log_avg, "treatment_period", "t") # names to match slides
sem_log_avg <- as.data.frame(sem_log_avg)
coef(glm(y ~ d*t, data=sem_log_avg))['d:t']
gives -0.005775498
If we were to take the sum rather than the mean and then the log, i.e. log(sum(revenue)):
sem_log_sum <- sem[,
list(d=mean(1-search.stays.on), y=log(sum(revenue))),
by=c("dma","treatment_period")]
setnames(sem_log_sum, "treatment_period", "t") # names to match slides
sem_log_sum <- as.data.frame(sem_log_sum)
coef(glm(y ~ d*t, data=sem_log_sum))['d:t']
gives -0.005775498, which is the same as log(mean(revenue))
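A possible explanation for the identical coefficients, sketched under an assumption I have not checked against the data: every DMA has the same number of daily rows within a given period. Here n_t (rows per DMA in period t) and \bar{r}_{it} (mean revenue for DMA i in period t) are my own notation.

```latex
\[
\log\Bigl(\sum_{k=1}^{n_t} r_{itk}\Bigr) \;=\; \log n_t + \log \bar{r}_{it},
\qquad
\log n_t \;=\; \log n_0 + t\,\bigl(\log n_1 - \log n_0\bigr), \quad t \in \{0, 1\}.
\]
```

If that assumption holds, switching from mean to sum adds a term that depends only on t. In y ~ d*t it is absorbed by the intercept and the t coefficient, so the d and d:t coefficients are unchanged.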
If we were to use sum(log(revenue)), which would clearly be wrong because the control is a larger group, we'd get -0.2534986...
Is there a reason we should specifically use mean(log(revenue)) rather than log(mean(revenue))?