Skip to content

Add CountDistinct and CountIf methods to GroupBy#7575

Open
sevenzees wants to merge 7 commits intodotnet:mainfrom
sevenzees:nunique
Open

Add CountDistinct and CountIf methods to GroupBy#7575
sevenzees wants to merge 7 commits intodotnet:mainfrom
sevenzees:nunique

Conversation

@sevenzees
Copy link
Contributor

Fixes #7574

Adds a CountDistinct method to the GroupBy class which counts distinct values per group. Adds a CountIf method to the GroupBy class which counts values per group that satisfy a custom predicate.

@sevenzees sevenzees marked this pull request as draft January 30, 2026 22:49
@sevenzees sevenzees marked this pull request as ready for review January 30, 2026 22:52
@codecov
Copy link

codecov bot commented Jan 31, 2026

Codecov Report

❌ Patch coverage is 98.59155% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 69.03%. Comparing base (25b977e) to head (2e6f45e).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
src/Microsoft.Data.Analysis/GroupBy.cs 97.05% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7575   +/-   ##
=======================================
  Coverage   69.02%   69.03%           
=======================================
  Files        1482     1482           
  Lines      274099   274169   +70     
  Branches    28266    28272    +6     
=======================================
+ Hits       189199   189259   +60     
- Misses      77518    77522    +4     
- Partials     7382     7388    +6     
Flag Coverage Δ
Debug 69.03% <98.59%> (+<0.01%) ⬆️
production 63.31% <97.05%> (+<0.01%) ⬆️
test 89.47% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs 99.90% <100.00%> (+<0.01%) ⬆️
src/Microsoft.Data.Analysis/GroupBy.cs 97.88% <97.05%> (-0.09%) ⬇️

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


public override DataFrame CountDistinct(params string[] columnNames)
{
HashSet<GroupByPredicateInput> seenValues = [];
Copy link
Member

@rokonec rokonec Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical — Correctness] HashSet<GroupByPredicateInput> with a mutable, reused object produces wrong results.

CountDistinct captures a single HashSet<GroupByPredicateInput> in its lambda. Inside CountIf, a single GroupByPredicateInput instance is created per (group, column) and then reused across all rows in the inner loop:

// line ~202: created once per (group, column)
var groupByPredicateInput = new GroupByPredicateInput { ColumnName = column.Name, GroupKey = firstColumn[rowIndex] };

foreach (long row in rowEnumerable)
{
    groupByPredicateInput.RowValue = column[row];  // ← line ~209: mutated each iteration
    if (predicate(groupByPredicateInput))           // CountDistinct does .Contains() then .Add()
        count++;
}

CountDistinct's predicate calls seenValues.Add(input), storing the same reference that line 209 mutates on the next iteration. This violates the HashSet contract: after insertion, the object's hash code changes, so it lands in the wrong bucket.

How it breaks concretely: When the HashSet resizes (e.g. at the 3rd distinct value, capacity jumps from 3→7), it rehashes all entries. Since they're all the same reference with the same current RowValue, they all collapse into one bucket. After that, any previously-seen value that hashes to a different bucket won't be found — so it gets double-counted as "distinct."

The test data [1, 1, 2, 3, 4] for Group 2 avoids this because the duplicate 1 appears at index 1 (before the resize at index 2). If the data were [1, 2, 3, 4, 1] instead, the final 1 would be miscounted as distinct (returning 5 instead of 4).

Additionally, the HashSet is never cleared between groups/columns, so it accumulates stale entries — a memory leak proportional to total rows × columns.

Suggested fix: Use a HashSet<object> keyed on just the row value, scoped per (group, column). The simplest approach is to implement CountDistinct directly (like the original Count) rather than routing through CountIf:

public override DataFrame CountDistinct(params string[] columnNames)
{
    // Same structure as Count, but with a per-(group, column) HashSet<object>
    // to track distinct non-null values.
}

/// <param name="predicate">A function that takes in the column name, group key, and row value and returns true to include that row in the group count or false to exclude it.</param>
/// <param name="columnNames">The columns within which to compute the number of values in each group that match the predicate. A default value includes all columns.</param>
/// <returns></returns>
public abstract DataFrame CountIf(Func<GroupByPredicateInput, bool> predicate, params string[] columnNames);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major — Breaking change] Adding abstract methods to a public non-sealed class is a binary-breaking change. Any downstream code that subclasses GroupBy will fail to compile (or fail at load time for pre-compiled assemblies) until it implements CountIf and CountDistinct.

Options to avoid the break:

  1. Make them virtual with a default implementation (e.g. throw new NotSupportedException()).
  2. Add them only on GroupBy<TKey> (which is the concrete class) rather than the abstract base.
  3. Seal GroupBy if external subclassing is not intended (coordinate with maintainers).


public override DataFrame Count(params string[] columnNames)
{
return CountIf(input => input.RowValue != null, columnNames);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major — Performance regression] Count() was previously allocation-free — a tight if (column[row] != null) count++ loop. By delegating through CountIf, every row now:

  1. Boxes the cell value into object RowValue (Int32, Double, DateTime, etc.).
  2. Allocates a GroupByPredicateInput per (group, column) pair.
  3. Invokes a delegate per row.

Count() is the most commonly-used GroupBy aggregation; on a 1M-row x 10-column DataFrame this adds millions of boxing allocations.

Suggested fix: Keep the original inlined null-check implementation for Count(). CountIf can coexist as a separate method without Count needing to delegate to it.

/// <summary>
/// The name of the column that is being aggregated
/// </summary>
public string ColumnName { get; set; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor — Design] Two small things on this record:

  1. Mutable set accessors on a record. Records are designed for value semantics and immutability. Using { get; init; } instead of { get; set; } would prevent accidental mutation after construction and make storing instances in collections (like the HashSet in CountDistinct) safe by design.

  2. object-typed properties cause boxing. Every value-type cell (Int32, Double, DateTime, etc.) gets boxed into object GroupKey / object RowValue. If perf matters, consider a generic variant like GroupByPredicateInput<TKey, TValue> — though that adds API complexity, so object may be an acceptable trade-off if documented.

/// </summary>
/// <param name="predicate">A function that takes in the column name, group key, and row value and returns true to include that row in the group count or false to exclude it.</param>
/// <param name="columnNames">The columns within which to compute the number of values in each group that match the predicate. A default value includes all columns.</param>
/// <returns></returns>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor — Docs] <returns></returns> is empty here and on CountDistinct. Consider filling these in, e.g.:

/// <returns>A <see cref="DataFrame"/> with one row per group and one column per aggregated column, containing the matching counts.</returns>

}
}

DataFrame countIf = dfWithDuplicates.GroupBy("Group").CountIf((GroupByPredicateInput input) => input.RowValue is int and < 3, "Int");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor — Test coverage] CountIf is only tested with a single column ("Int"). Consider adding:

  • A call with no column names (default all-columns path) to verify it works across all column types.
  • An always-true predicate — result should match Count() output.
  • An always-false predicate — all counts should be 0.
  • Edge cases: empty groups, all-null columns, single-row groups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add CountDistinct method to GroupBy

2 participants