Figure 5.3 appears to be total log revenue rather than average, conflicts with text; model uses mean(log(revenue)) rather than log(mean(revenue))

Pg 143/144 & https://github.com/TaddyLab/BDS/blob/master/examples/paidsearch.R


Text: 

> Figure 5.3 shows the log difference between **_average_** revenues in each group.

Caption:

> The log-scale **_average_** revenue difference ..

In the code, however, both plots use `totalrev` and are created before `semavg` is defined.

The total and average log differences produce the same pattern (offset by a constant if the group sizes stay fixed), but the mismatch with the text initially confused me as I walked through the code and example.
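
To make the "same pattern" point concrete, here is a minimal sketch with made-up numbers (not the `paidsearch` data). The group sizes `n_control` and `n_treat` and the revenue values are assumptions for illustration only. If each group has a fixed number of DMAs, the log difference of totals equals the log difference of averages plus `log(n_control / n_treat)`:

```r
# Hypothetical numbers, not taken from the paidsearch data.
n_control <- 5   # assumed number of DMAs in the control group
n_treat   <- 3   # assumed number of DMAs in the treatment group

# Assumed average revenue per DMA on four consecutive days.
avg_control <- c(100, 110, 105, 120)
avg_treat   <- c( 90, 100,  93, 104)

# Totals are just averages scaled by the (fixed) group size.
total_control <- n_control * avg_control
total_treat   <- n_treat   * avg_treat

diff_avg   <- log(avg_control)   - log(avg_treat)
diff_total <- log(total_control) - log(total_treat)

# The gap between the two curves is the same on every day.
offset <- diff_total - diff_avg
print(offset)
print(log(n_control / n_treat))
```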

---

Relatedly, suppose the graphs plotted the `mean` instead of the `total`, so that they matched the model.

The graphs first take the average (or the total, in the current code) and then take the log of that average, i.e. `log(mean(revenue))`.

The model uses `y` from `semavg`, which takes the log first and then the mean. In the code, `y` is defined as `y=mean(log(revenue))`.

Whether we use `sum` or `mean` in the model, it seems like we would want to take the log after aggregating. This seems especially true if we were going to use `sum` rather than `mean`.
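
As a quick illustration of why the order matters, here is a small sketch on a made-up revenue vector (the values are hypothetical). `mean(log(x))` is the log of the geometric mean, and by Jensen's inequality it is never larger than `log(mean(x))`, so the two definitions of `y` are genuinely different quantities:

```r
# Hypothetical revenues for one DMA/period cell.
revenue <- c(50, 100, 400)

mean_of_log <- mean(log(revenue))   # what the model's y uses
log_of_mean <- log(mean(revenue))   # what the plots would use

# exp(mean(log(x))) is the geometric mean of x.
geo_mean <- exp(mean_of_log)

print(c(mean_of_log = mean_of_log, log_of_mean = log_of_mean))
print(c(geometric = geo_mean, arithmetic = mean(revenue)))
```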

---

Original Code (`mean(log(revenue))`)

```
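library(data.table)
# Assumes `sem` has already been loaded as in paidsearch.R
sem <- as.data.table(sem)
# Per DMA and period: treatment indicator d and outcome y = mean(log(revenue))
sem_avg_log <- sem[, 
			list(d=mean(1-search.stays.on), y=mean(log(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_avg_log, "treatment_period", "t") # names to match slides
sem_avg_log <- as.data.frame(sem_avg_log)
coef(glm(y ~ d*t, data=sem_avg_log))['d:t'] # interaction coefficient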
```

gives `-0.006586852`

---

`log(mean(revenue))`:

```
sem_log_avg <- sem[, 
			list(d=mean(1-search.stays.on), y=log(mean(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_log_avg, "treatment_period", "t") # names to match slides
sem_log_avg <- as.data.frame(sem_log_avg)
coef(glm(y ~ d*t, data=sem_log_avg))['d:t']
```

gives `-0.005775498`

---

If we were to use `sum` rather than `mean` and then take the log, i.e. `log(sum(revenue))`:

```
sem_log_sum <- sem[, 
			list(d=mean(1-search.stays.on), y=log(sum(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_log_sum, "treatment_period", "t") # names to match slides
sem_log_sum <- as.data.frame(sem_log_sum)
coef(glm(y ~ d*t, data=sem_log_sum))['d:t']
```

gives `-0.005775498`, which is the same as `log(mean(revenue))`
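
A possible explanation for the identical coefficient, which I have not checked against the data: if every DMA has the same number of rows within a period, then `log(sum) = log(mean) + log(n_t)`, where `n_t` depends only on `t`. That extra term is absorbed by the intercept and the `t` main effect, leaving `d:t` unchanged. A minimal sketch on simulated data (all names and numbers below are hypothetical):

```r
set.seed(1)

# Simulated panel: 20 DMAs observed in two periods (t = 0, 1).
panel <- expand.grid(dma = 1:20, t = c(0, 1))
panel$d <- as.numeric(panel$dma <= 8)       # assumed treated DMAs
panel$n <- ifelse(panel$t == 0, 30, 60)     # assumed rows per cell, by period only

# Simulated average revenue per cell.
panel$mean_rev <- exp(rnorm(nrow(panel), mean = 10, sd = 0.5))

panel$y_mean <- log(panel$mean_rev)             # log(mean(revenue))
panel$y_sum  <- log(panel$n * panel$mean_rev)   # log(sum(revenue))

fit_mean <- glm(y_mean ~ d * t, data = panel)
fit_sum  <- glm(y_sum  ~ d * t, data = panel)

# The interaction coefficients agree; only the intercept and t shift.
print(coef(fit_mean)["d:t"])
print(coef(fit_sum)["d:t"])
print(coef(fit_sum)["t"] - coef(fit_mean)["t"])  # compare with log(60 / 30)
```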

---

If we were to use `sum(log(revenue))`, which would clearly be wrong because the control is a larger group, we'd get `-0.2534986`.
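
For what it's worth, the size dependence is easy to see in isolation: `sum(log(x))` is just `length(x) * mean(log(x))`, so a cell with more rows gets a mechanically larger `y` even when the revenue level is identical. A tiny sketch with hypothetical values:

```r
# Two hypothetical cells with the same revenue level but different row counts.
small_cell <- rep(200, 10)
large_cell <- rep(200, 40)

# sum(log(x)) scales with the number of rows in the cell.
sum_log_small <- sum(log(small_cell))
sum_log_large <- sum(log(large_cell))

# mean(log(x)) does not.
mean_log_small <- mean(log(small_cell))
mean_log_large <- mean(log(large_cell))

print(c(sum_log_small, sum_log_large))
print(c(mean_log_small, mean_log_large))
```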

---

Is there a reason we should specifically use `mean(log(revenue))` rather than `log(mean(revenue))`?
