Figure 5.3 appears to be total log revenue rather than average, conflicts with text; model uses mean(log(revenue)) rather than log(mean(revenue)) #17

@shane-kercheval

Pg 143/144 & https://github.com/TaddyLab/BDS/blob/master/examples/paidsearch.R

Text:

Figure 5.3 shows the log difference between average revenues in each group.

Caption:

The log-scale average revenue difference ..

In the code, however, both plots use totalrev and are created before semavg is defined.

The total and average log differences produce the same pattern on different scales, but the mismatch initially confused me as I walked through the code/example.
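
To illustrate why the pattern is the same: on the log scale, the total-based difference equals the mean-based difference plus a constant, log(n_control / n_treat). This is a minimal sketch on made-up data, not the book's data. The names `control`, `treat`, `n_control`, and `n_treat` are hypothetical, and it assumes the number of DMAs in each group is the same on every day.

```r
# Toy data (made up): daily revenue for 3 control DMAs and 2 treatment DMAs over 4 days.
set.seed(1)
n_control <- 3
n_treat   <- 2
n_days    <- 4
control <- matrix(rlnorm(n_control * n_days, meanlog = 10), nrow = n_days)  # rows = days
treat   <- matrix(rlnorm(n_treat * n_days, meanlog = 10), nrow = n_days)

# Log difference of daily totals (what the current plotting code computes)
diff_total <- log(rowSums(control)) - log(rowSums(treat))

# Log difference of daily averages (what the text and caption describe)
diff_mean <- log(rowMeans(control)) - log(rowMeans(treat))

# The two series differ by the same constant on every day
offset <- diff_total - diff_mean
print(offset)
print(log(n_control / n_treat))
```

So the two versions of the plot would have the same shape and differ only by a vertical shift.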


Relatedly, let's assume the graphs plot the mean instead of the total, so that they match the model.

The graphs first take the average (or the total, in the current code) and then take the log of the result, i.e. log(mean(revenue)).

The model uses y from semavg, which takes the log first and then the mean. In the code, y is defined as y=mean(log(revenue)).

Whether we use sum or mean in the model, it seems like we would want to take the log after aggregating. This seems especially true if we were going to use sum rather than mean.
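
For reference, the two orderings measure different things: mean(log(x)) is the log of the geometric mean, while log(mean(x)) is the log of the arithmetic mean, and the former is never larger than the latter (Jensen's inequality). A small sketch with made-up numbers (the `revenue` vector here is hypothetical, not the sem data):

```r
# Made-up revenue values, for illustration only
revenue <- c(100, 200, 400)

mean_of_logs <- mean(log(revenue))   # log of the geometric mean
log_of_mean  <- log(mean(revenue))   # log of the arithmetic mean

geo_mean   <- exp(mean_of_logs)      # geometric mean
arith_mean <- mean(revenue)          # arithmetic mean

print(c(geo_mean = geo_mean, arith_mean = arith_mean))
print(c(mean_of_logs = mean_of_logs, log_of_mean = log_of_mean))
```

The gap between the two grows with the spread of revenue within a group, so the choice can move the estimated coefficient.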


Original Code (mean(log(revenue)))

library(data.table)
sem <- as.data.table(sem)
sem_avg_log <- sem[, 
			list(d=mean(1-search.stays.on), y=mean(log(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_avg_log, "treatment_period", "t") # names to match slides
sem_avg_log <- as.data.frame(sem_avg_log)
coef(glm(y ~ d*t, data=sem_avg_log))['d:t']

gives -0.006586852


log(mean(revenue)):

sem_log_avg <- sem[, 
			list(d=mean(1-search.stays.on), y=log(mean(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_log_avg, "treatment_period", "t") # names to match slides
sem_log_avg <- as.data.frame(sem_log_avg)
coef(glm(y ~ d*t, data=sem_log_avg))['d:t']

gives -0.005775498


If we were to use sum rather than mean and then take the log, i.e. log(sum(revenue)):

sem_log_sum <- sem[, 
			list(d=mean(1-search.stays.on), y=log(sum(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_log_sum, "treatment_period", "t") # names to match slides
sem_log_sum <- as.data.frame(sem_log_sum)
coef(glm(y ~ d*t, data=sem_log_sum))['d:t']

gives -0.005775498, which is the same as log(mean(revenue))
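
The match is presumably because log(sum(revenue)) = log(mean(revenue)) + log(n), where n is the number of rows in each dma/period group. If n depends only on the period (every DMA has the same number of days within a period), the extra term is absorbed by the intercept and the t coefficient, leaving d:t unchanged. A sketch on simulated data to check this (the `toy` data frame and its effect size are made up; I have not verified that the group sizes in the real data are balanced this way):

```r
set.seed(42)

# Hypothetical panel: 6 DMAs, 10 pre-period days (t = 0) and 5 post-period days (t = 1)
toy <- expand.grid(dma = 1:6, day = 1:15)
toy$t <- as.integer(toy$day > 10)
toy$d <- as.integer(toy$dma <= 2)   # DMAs 1-2 play the role of the treated group
toy$revenue <- rlnorm(nrow(toy), meanlog = 10 - 0.1 * toy$d * toy$t, sdlog = 0.3)

# Aggregate to one row per dma/period, taking the log after mean or after sum
log_mean <- aggregate(revenue ~ dma + t + d, data = toy,
                      FUN = function(x) log(mean(x)))
log_sum  <- aggregate(revenue ~ dma + t + d, data = toy,
                      FUN = function(x) log(sum(x)))

fit_mean <- coef(glm(revenue ~ d * t, data = log_mean))
fit_sum  <- coef(glm(revenue ~ d * t, data = log_sum))

# Interaction coefficients agree; only the intercept and t coefficient shift
print(rbind(log_mean = fit_mean, log_sum = fit_sum))
```

In this setup the t coefficient shifts by log(5/10), the log ratio of the period lengths, while d and d:t are identical across the two fits.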


If we were to use sum(log(revenue)), which would clearly be wrong because the control is a larger group, we'd get -0.2534986...


Is there a reason we should specifically use mean(log(revenue)) rather than log(mean(revenue))?
