Add CountDistinct and CountIf methods to GroupBy#7575
Add CountDistinct and CountIf methods to GroupBy#7575sevenzees wants to merge 7 commits intodotnet:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7575 +/- ##
=======================================
Coverage 69.02% 69.03%
=======================================
Files 1482 1482
Lines 274099 274169 +70
Branches 28266 28272 +6
=======================================
+ Hits 189199 189259 +60
- Misses 77518 77522 +4
- Partials 7382 7388 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
|
||
| public override DataFrame CountDistinct(params string[] columnNames) | ||
| { | ||
| HashSet<GroupByPredicateInput> seenValues = []; |
There was a problem hiding this comment.
[Critical — Correctness] HashSet<GroupByPredicateInput> with a mutable, reused object produces wrong results.
CountDistinct captures a single HashSet<GroupByPredicateInput> in its lambda. Inside CountIf, a single GroupByPredicateInput instance is created per (group, column) and then reused across all rows in the inner loop:
// line ~202: created once per (group, column)
var groupByPredicateInput = new GroupByPredicateInput { ColumnName = column.Name, GroupKey = firstColumn[rowIndex] };
foreach (long row in rowEnumerable)
{
groupByPredicateInput.RowValue = column[row]; // ← line ~209: mutated each iteration
if (predicate(groupByPredicateInput)) // CountDistinct does .Contains() then .Add()
count++;
}CountDistinct's predicate calls seenValues.Add(input), storing the same reference that line 209 mutates on the next iteration. This violates the HashSet contract: after insertion, the object's hash code changes, so it lands in the wrong bucket.
How it breaks concretely: When the HashSet resizes (e.g. at the 3rd distinct value, capacity jumps from 3→7), it rehashes all entries. Since they're all the same reference with the same current RowValue, they all collapse into one bucket. After that, any previously-seen value that hashes to a different bucket won't be found — so it gets double-counted as "distinct."
The test data [1, 1, 2, 3, 4] for Group 2 avoids this because the duplicate 1 appears at index 1 (before the resize at index 2). If the data were [1, 2, 3, 4, 1] instead, the final 1 would be miscounted as distinct (returning 5 instead of 4).
Additionally, the HashSet is never cleared between groups/columns, so it accumulates stale entries — a memory leak proportional to total rows × columns.
Suggested fix: Use a HashSet<object> keyed on just the row value, scoped per (group, column). The simplest approach is to implement CountDistinct directly (like the original Count) rather than routing through CountIf:
public override DataFrame CountDistinct(params string[] columnNames)
{
// Same structure as Count, but with a per-(group, column) HashSet<object>
// to track distinct non-null values.
}| /// <param name="predicate">A function that takes in the column name, group key, and row value and returns true to include that row in the group count or false to exclude it.</param> | ||
| /// <param name="columnNames">The columns within which to compute the number of values in each group that match the predicate. A default value includes all columns.</param> | ||
| /// <returns></returns> | ||
| public abstract DataFrame CountIf(Func<GroupByPredicateInput, bool> predicate, params string[] columnNames); |
There was a problem hiding this comment.
[Major — Breaking change] Adding abstract methods to a public non-sealed class is a binary-breaking change. Any downstream code that subclasses GroupBy will fail to compile (or fail at load time for pre-compiled assemblies) until it implements CountIf and CountDistinct.
Options to avoid the break:
- Make them
virtualwith a default implementation (e.g.throw new NotSupportedException()). - Add them only on
GroupBy<TKey>(which is the concrete class) rather than the abstract base. - Seal
GroupByif external subclassing is not intended (coordinate with maintainers).
|
|
||
| public override DataFrame Count(params string[] columnNames) | ||
| { | ||
| return CountIf(input => input.RowValue != null, columnNames); |
There was a problem hiding this comment.
[Major — Performance regression] Count() was previously allocation-free — a tight if (column[row] != null) count++ loop. By delegating through CountIf, every row now:
- Boxes the cell value into
object RowValue(Int32, Double, DateTime, etc.). - Allocates a
GroupByPredicateInputper (group, column) pair. - Invokes a delegate per row.
Count() is the most commonly-used GroupBy aggregation; on a 1M-row x 10-column DataFrame this adds millions of boxing allocations.
Suggested fix: Keep the original inlined null-check implementation for Count(). CountIf can coexist as a separate method without Count needing to delegate to it.
| /// <summary> | ||
| /// The name of the column that is being aggregated | ||
| /// </summary> | ||
| public string ColumnName { get; set; } |
There was a problem hiding this comment.
[Minor — Design] Two small things on this record:
-
Mutable
setaccessors on a record. Records are designed for value semantics and immutability. Using{ get; init; }instead of{ get; set; }would prevent accidental mutation after construction and make storing instances in collections (like theHashSetinCountDistinct) safe by design. -
object-typed properties cause boxing. Every value-type cell (Int32, Double, DateTime, etc.) gets boxed intoobject GroupKey/object RowValue. If perf matters, consider a generic variant likeGroupByPredicateInput<TKey, TValue>— though that adds API complexity, soobjectmay be an acceptable trade-off if documented.
| /// </summary> | ||
| /// <param name="predicate">A function that takes in the column name, group key, and row value and returns true to include that row in the group count or false to exclude it.</param> | ||
| /// <param name="columnNames">The columns within which to compute the number of values in each group that match the predicate. A default value includes all columns.</param> | ||
| /// <returns></returns> |
There was a problem hiding this comment.
[Minor — Docs] <returns></returns> is empty here and on CountDistinct. Consider filling these in, e.g.:
/// <returns>A <see cref="DataFrame"/> with one row per group and one column per aggregated column, containing the matching counts.</returns>| } | ||
| } | ||
|
|
||
| DataFrame countIf = dfWithDuplicates.GroupBy("Group").CountIf((GroupByPredicateInput input) => input.RowValue is int and < 3, "Int"); |
There was a problem hiding this comment.
[Minor — Test coverage] CountIf is only tested with a single column ("Int"). Consider adding:
- A call with no column names (default all-columns path) to verify it works across all column types.
- An always-true predicate — result should match
Count()output. - An always-false predicate — all counts should be 0.
- Edge cases: empty groups, all-null columns, single-row groups.
Fixes #7574
Adds a CountDistinct method to the GroupBy class which counts distinct values per group. Adds a CountIf method to the GroupBy class which counts values per group that satisfy a custom predicate.