Allow expressions in data_summary() that return > 1 column summaries #673

Merged
strengejacke merged 33 commits into main from data_summary_multiple_rows
Mar 12, 2026

@strengejacke
Member

strengejacke commented Mar 10, 2026

Revision

@mattansb
Member

I think summary data frames should have one row (to keep the shape consistent) but still allow multi-value expressions, simply expanding them into columns.
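A minimal base R sketch of that idea, assuming nothing about how datawizard implements it: every summary may return one or several values, and each value becomes its own column so that the result stays a single row. The object names (`summaries`, `flat`, `out`) are made up for illustration.

```r
# Hypothetical illustration in base R, not datawizard's internals.
set.seed(123)
x <- rnorm(100, 1, 1)

# Each summary may return one value or several values.
summaries <- list(
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x)
)

# unlist() flattens the list into one named vector; as.list() then turns
# every value into its own column, so the result has exactly one row.
flat <- unlist(summaries)
out <- as.data.frame(as.list(flat), check.names = FALSE)

dim(out)
#> [1] 1 3
```

The two quantiles and the mean end up as three columns of one row, which is the shape-consistent layout described above.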

strengejacke and others added 4 commits March 10, 2026 16:01
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
@strengejacke strengejacke added the "Won't fix 🚫" label (This will not be worked on) Mar 10, 2026
@strengejacke
Member Author

strengejacke commented Mar 10, 2026

Errors:

library(datawizard)

set.seed(123)
d <- data.frame(
  x = rnorm(100, 1, 1),
  y = rnorm(100, 2, 2),
  groups = rep(1:4, each = 25)
)

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  quant_y = quantile(y, c(0.1, 0.9)),
  suffix = c("a", "b", "c")
)
#> Error:
#> ! Argument `suffix` must have the same length as the result of the
#>   regarding summary expression. `suffix` has 3 elements (`a`, `b` and `c`)
#>   for the expression `quantile(x, c(0.25, 0.75))`, which returned 2
#>   values.

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  quant_y = quantile(y, c(0.1, 0.9)),
  suffix = list(c("a", "b"), c("c", "d"), c("e", "f"))
)
#> Error:
#> ! If `suffix` is a list of character vectors, it should have the same
#>   length as the number of expressions. `suffix` has 3 elements, but
#>   there are 2 expressions.

data_summary(
  mtcars,
  n = unique(mpg),
  j = c(min(am), max(am)),
  by = c("am", "gear")
)
#> Error:
#> ! Each expression must return the same number of values for each group.
#>   Some of the expressions seem to return varying numbers of values.

Created on 2026-03-11 with reprex v2.1.1
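The last error can be understood with a small base R check, sketched here under the assumption that the function compares how many values an expression returns in each group. The names `groups` and `n_values` are illustrative only.

```r
# Base R sketch: count how many values `unique(mpg)` returns per group.
# `drop = TRUE` discards am/gear combinations that do not occur in mtcars.
groups <- split(mtcars$mpg, list(mtcars$am, mtcars$gear), drop = TRUE)
n_values <- lengths(lapply(groups, unique))
n_values

# More than one distinct length means the results cannot be aligned
# into the same set of columns for every group.
length(unique(n_values)) > 1
#> [1] TRUE
```

By contrast, `c(min(am), max(am))` always returns two values, so that expression alone would not trigger the error.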

strengejacke and others added 4 commits March 11, 2026 12:08
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@strengejacke
Member Author

@etiennebacher WDYT about the current implementation? It is no longer an additional "off-label use" like reframe(); instead, it extends the summary functionality to allow expressions that return an arbitrary number of summary results, which makes the function more flexible. The suffix argument is also optional: if it is not provided, columns with identical names are simply renumbered (see the example above).

@mattansb
Member

If the summary is a named vector, can those names be used as suffixes instead of the suffix argument?
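For context, a short base R check of which summary functions return named vectors: `quantile()` names its result by default, while `fivenum()` returns an unnamed vector, so only the former offers names that could serve as suffixes.

```r
set.seed(123)
y <- rnorm(100, 2, 2)

# quantile() labels its result with the requested probabilities.
names(quantile(y, c(0.25, 0.75)))
#> [1] "25%" "75%"

# fivenum() returns a plain, unnamed numeric vector.
names(fivenum(y))
#> NULL
```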

@strengejacke
Member Author

Which behaviour would you suggest? To make it less complex, we could do:

When the summary expression returns more than one value (and only then):

  • suffix has to be a named list, to name multiple columns.
  • For all expressions without a matching named element in suffix, the names of the returned summary are used.
  • When names are not present, automatic numbering is done.

I would then not allow suffix to be just a character vector that applies to all expressions, because all these options would be quite complex and difficult to understand and document.
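The proposed precedence can be sketched as a small base R function. `name_columns()` is a hypothetical helper written for this illustration and is not part of datawizard; the way names are pasted together is an assumption.

```r
# Hypothetical helper, not part of datawizard: derive the column names for
# one summary expression, following the three rules proposed above.
name_columns <- function(expr_name, values, suffix = NULL) {
  # Single-value summaries keep the expression name unchanged.
  if (length(values) == 1) {
    return(expr_name)
  }
  if (!is.null(suffix[[expr_name]])) {
    # Rule 1: a matching element of the named list `suffix` wins.
    paste0(expr_name, suffix[[expr_name]])
  } else if (!is.null(names(values))) {
    # Rule 2: otherwise fall back to the names of the returned summary.
    paste0(expr_name, names(values))
  } else {
    # Rule 3: otherwise number the columns.
    paste0(expr_name, "_", seq_along(values))
  }
}

y <- c(3, 1, 4, 1, 5, 9, 2, 6)
name_columns("quant_y", quantile(y, c(0.25, 0.75)))
#> [1] "quant_y25%" "quant_y75%"
name_columns("fivenum", fivenum(y))
#> [1] "fivenum_1" "fivenum_2" "fivenum_3" "fivenum_4" "fivenum_5"
name_columns("quant_y", quantile(y, c(0.25, 0.75)), list(quant_y = c("_Q1", "_Q3")))
#> [1] "quant_y_Q1" "quant_y_Q3"
```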

Member

@etiennebacher etiennebacher left a comment

Looks good to me but I'd like to wait for @mattansb's opinion before merging.

@mattansb
Member

Sounds good to me, thanks!

@strengejacke
Member Author

Here's the (hopefully) final implementation:

library(datawizard)

set.seed(123)
d <- data.frame(
  x = rnorm(100, 1, 1),
  y = rnorm(100, 2, 2),
  w = rnorm(100, 3, 0.5),
  z = rnorm(100, 4, 3),
  groups = rep(1:4, each = 25)
)

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  quant_y = quantile(y, c(0.25, 0.5, 0.75))
)
#> quant_x25% | quant_x75% | mean_x | quant_y25% | quant_y50% | quant_y75%
#> -----------------------------------------------------------------------
#>       0.51 |       1.69 |   1.09 |       0.40 |       1.55 |       2.94

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  fivenum = fivenum(y)
)
#> quant_x25% | quant_x75% | mean_x | fivenum_1 | fivenum_2 | fivenum_3
#> --------------------------------------------------------------------
#>       0.51 |       1.69 |   1.09 |     -2.11 |      0.37 |      1.55
#> 
#> quant_x25% | fivenum_4 | fivenum_5
#> ----------------------------------
#>       0.51 |      2.97 |      8.48

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  quant_y = quantile(y, c(0.25, 0.5, 0.75)),
  suffix = list(quant_y = c("_Q1", "_Q2", "_Q3"))
)
#> quant_x25% | quant_x75% | mean_x | quant_y_Q1 | quant_y_Q2 | quant_y_Q3
#> -----------------------------------------------------------------------
#>       0.51 |       1.69 |   1.09 |       0.40 |       1.55 |       2.94

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  quant_y = quantile(y, c(0.25, 0.5, 0.75)),
  suffix = list(quant_x = c("Q1", "Q3"), quant_y = c("_Q1", "_Q2", "_Q3"))
)
#> quant_xQ1 | quant_xQ3 | mean_x | quant_y_Q1 | quant_y_Q2 | quant_y_Q3
#> ---------------------------------------------------------------------
#>      0.51 |      1.69 |   1.09 |       0.40 |       1.55 |       2.94

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.5)),
  quant_w = quantile(w, c(0.25, 0.5)),
  quant_y = quantile(y, c(0.25, 0.5)),
  quant_z = quantile(z, c(0.25, 0.5)),
  suffix = c("_Q1", "_Q2")
)
#> quant_x_Q1 | quant_x_Q2 | quant_w_Q1 | quant_w_Q2 | quant_y_Q1 | quant_y_Q2
#> ---------------------------------------------------------------------------
#>       0.51 |       1.06 |       2.73 |       3.02 |       0.40 |       1.55
#> 
#> quant_x_Q1 | quant_z_Q1 | quant_z_Q2
#> ------------------------------------
#>       0.51 |       1.81 |       3.99

# errors
data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  quant_y = quantile(y, c(0.25, 0.5, 0.75)),
  suffix = list(quant_xy = c("_Q1", "_Q2", "_Q3"))
)
#> Error:
#> ! Names of `suffix` must match the names of the expressions. Suffix
#>   `quant_xy` has no corresponding expression.

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  quant_y = quantile(y, c(0.25, 0.5, 0.75)),
  suffix = list(c("Q1", "Q3"), "mean", c("_Q1", "_Q2", "_Q3"))
)
#> Error:
#> ! All elements of `suffix` must have names.

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  quant_y = quantile(y, c(0.25, 0.5, 0.75)),
  suffix = c("_Q1", "_Q2", "_Q3")
)
#> Error:
#> ! Argument `suffix` must have the same length as the result of the
#>   corresponding summary expression. `suffix` has 3 elements (`_Q1`, `_Q2`
#>   and `_Q3`) for the expression `quantile(x, c(0.25, 0.75))`, which
#>   returned 2 values.

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  quant_y = quantile(y, c(0.25, 0.5, 0.75)),
  suffix = list(quant_x = c("_Q1", "_Q2", "_Q3"))
)
#> Error:
#> ! Argument `suffix` must have the same length as the result of the
#>   corresponding summary expression. `suffix` has 3 elements (`_Q1`, `_Q2`
#>   and `_Q3`) for the expression `quantile(x, c(0.25, 0.75))`, which
#>   returned 2 values.

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.75)),
  mean_x = mean(x),
  quant_y = quantile(y, c(0.25, 0.5, 0.75)),
  suffix = list(quant_x = c("Q1", "Q3"), quant_y = c("_Q1", "_Q2", "_Q2"))
)
#> Error:
#> ! All suffixes for a single expression must be unique. Suffix for element
#>   `quant_y` has duplicate values.

data_summary(
  d,
  quant_x = quantile(x, c(0.25, 0.5)),
  quant_w = quantile(w, c(0.25, 0.5)),
  quant_y = quantile(y, c(0.25, 0.5)),
  quant_z = quantile(z, c(0.25, 0.5)),
  suffix = c("_Q1", "_Q2", "_Q3")
)
#> Error:
#> ! Argument `suffix` must have the same length as the result of the
#>   corresponding summary expression. `suffix` has 3 elements (`_Q1`, `_Q2`
#>   and `_Q3`) for the expression `quantile(x, c(0.25, 0.5))`, which
#>   returned 2 values.

Created on 2026-03-12 with reprex v2.1.1
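The checks behind the errors for list-type `suffix` arguments can be approximated in base R. `check_suffix()` below is a hypothetical validator with simplified messages, not datawizard's code; `n_values` is an assumed named integer vector holding the number of values each expression returned.

```r
# Hypothetical validator, not datawizard's code.
check_suffix <- function(suffix, n_values) {
  nms <- names(suffix)
  # Every list element needs a name.
  if (is.null(nms) || any(nms == "")) {
    stop("All elements of `suffix` must have names.")
  }
  # Every name must refer to an existing expression.
  unknown <- setdiff(nms, names(n_values))
  if (length(unknown) > 0) {
    stop("No expression named: ", paste(unknown, collapse = ", "))
  }
  for (nm in nms) {
    # One suffix per returned value, and no duplicates.
    if (length(suffix[[nm]]) != n_values[[nm]]) {
      stop("Wrong number of suffixes for `", nm, "`.")
    }
    if (anyDuplicated(suffix[[nm]]) > 0) {
      stop("Duplicate suffixes for `", nm, "`.")
    }
  }
  invisible(TRUE)
}

n_values <- c(quant_x = 2L, mean_x = 1L, quant_y = 3L)

# Passes silently.
check_suffix(list(quant_y = c("_Q1", "_Q2", "_Q3")), n_values)

# Fails: there is no expression called `quant_xy`.
res <- try(check_suffix(list(quant_xy = c("_Q1", "_Q2", "_Q3")), n_values), silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```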

@mattansb
Member

looks great!

Member

@etiennebacher etiennebacher left a comment

Looks good, just a suggestion to reorganize the param description a bit.

Thanks!

strengejacke and others added 3 commits March 12, 2026 16:59
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
@strengejacke strengejacke merged commit b2f4416 into main Mar 12, 2026
25 of 27 checks passed
@strengejacke strengejacke deleted the data_summary_multiple_rows branch March 12, 2026 17:03
4 participants