Add CountDistinct and CountIf methods to GroupBy#7575

Open

sevenzees wants to merge 7 commits intodotnet:mainfrom

sevenzees:nunique

Contributor

sevenzees commented Jan 30, 2026

Fixes #7574

Adds a CountDistinct method to the GroupBy class which counts distinct values per group. Adds a CountIf method to the GroupBy class which counts values per group that satisfy a custom predicate.

sevenzees added 3 commits

January 30, 2026 15:11


          Add CountDistinct method in GroupBy

a67a974


          Add CountIf implementation to GroupBy

6e27c1e


          Add test for CountIf

d936477

dotnet-policy-service bot added the community-contribution label

sevenzees marked this pull request as draft

January 30, 2026 22:49

sevenzees marked this pull request as ready for review

January 30, 2026 22:52

codecov bot commented Jan 31, 2026 •

edited

Loading

Codecov Report

❌ Patch coverage is 98.59155% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 69.03%. Comparing base (25b977e) to head (2e6f45e).
⚠️ Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
src/Microsoft.Data.Analysis/GroupBy.cs	97.05%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #7575   +/-   ##
=======================================
  Coverage   69.02%   69.03%           
=======================================
  Files        1482     1482           
  Lines      274099   274169   +70     
  Branches    28266    28272    +6     
=======================================
+ Hits       189199   189259   +60     
- Misses      77518    77522    +4     
- Partials     7382     7388    +6

Flag	Coverage Δ
Debug	`69.03% <98.59%> (+<0.01%)`	⬆️
production	`63.31% <97.05%> (+<0.01%)`	⬆️
test	`89.47% <100.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs	`99.90% <100.00%> (+<0.01%)`	⬆️
src/Microsoft.Data.Analysis/GroupBy.cs	`97.88% <97.05%> (-0.09%)`	⬇️

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

This was referenced Jan 31, 2026

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

[android-arm64] The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#6408

Open


          Add test for GroupByPredicateInput getters and setters

aaa8646

build-analysis bot mentioned this pull request

"We stopped hearing from agent Azure Pipelines 32. Verify the agent machine is running and has a healthy network connection" dotnet/dnceng#1886

Open

3 tasks

sevenzees added 3 commits

February 3, 2026 12:23


          Revert "Add test for GroupByPredicateInput getters and setters"

e89130b

This reverts commit aaa8646.


          Add documentation for GroupByPredicateInput record.

c3232dd


          Specify input type explicitly

2e6f45e

This was referenced Mar 4, 2026

🏥 ML.NET Repository Health Dashboard dotutils/GH-AW-Tests#3

Closed

🏥 ML.NET Repository Health Dashboard dotutils/GH-AW-Tests#2

Open

github-actions bot mentioned this pull request

🏥 Repo Health Dashboard #7584

Open

rokonec requested changes

View reviewed changes

src/Microsoft.Data.Analysis/GroupBy.cs

+                      public override DataFrame CountDistinct(params string[] columnNames)
+                      {
+                          HashSet<GroupByPredicateInput> seenValues = [];

Member

rokonec Mar 20, 2026 •

edited

Loading

[Critical — Correctness] HashSet<GroupByPredicateInput> with a mutable, reused object produces wrong results.

CountDistinct captures a single HashSet<GroupByPredicateInput> in its lambda. Inside CountIf, a single GroupByPredicateInput instance is created per (group, column) and then reused across all rows in the inner loop:

// line ~202: created once per (group, column)
var groupByPredicateInput = new GroupByPredicateInput { ColumnName = column.Name, GroupKey = firstColumn[rowIndex] };

foreach (long row in rowEnumerable)
{
    groupByPredicateInput.RowValue = column[row];  // ← line ~209: mutated each iteration
    if (predicate(groupByPredicateInput))           // CountDistinct does .Contains() then .Add()
        count++;
}

CountDistinct's predicate calls seenValues.Add(input), storing the same reference that line 209 mutates on the next iteration. This violates the HashSet contract: after insertion, the object's hash code changes, so it lands in the wrong bucket.

How it breaks concretely: When the HashSet resizes (e.g. at the 3rd distinct value, capacity jumps from 3→7), it rehashes all entries. Since they're all the same reference with the same current RowValue, they all collapse into one bucket. After that, any previously-seen value that hashes to a different bucket won't be found — so it gets double-counted as "distinct."

The test data [1, 1, 2, 3, 4] for Group 2 avoids this because the duplicate 1 appears at index 1 (before the resize at index 2). If the data were [1, 2, 3, 4, 1] instead, the final 1 would be miscounted as distinct (returning 5 instead of 4).

Additionally, the HashSet is never cleared between groups/columns, so it accumulates stale entries — a memory leak proportional to total rows × columns.

Suggested fix: Use a HashSet<object> keyed on just the row value, scoped per (group, column). The simplest approach is to implement CountDistinct directly (like the original Count) rather than routing through CountIf:

public override DataFrame CountDistinct(params string[] columnNames)
{
    // Same structure as Count, but with a per-(group, column) HashSet<object>
    // to track distinct non-null values.
}

src/Microsoft.Data.Analysis/GroupBy.cs

+                      /// <param name="predicate">A function that takes in the column name, group key, and row value and returns true to include that row in the group count or false to exclude it.</param>
+                      /// <param name="columnNames">The columns within which to compute the number of values in each group that match the predicate. A default value includes all columns.</param>
+                      /// <returns></returns>
+                      public abstract DataFrame CountIf(Func<GroupByPredicateInput, bool> predicate, params string[] columnNames);

Member

rokonec Mar 20, 2026

[Major — Breaking change] Adding abstract methods to a public non-sealed class is a binary-breaking change. Any downstream code that subclasses GroupBy will fail to compile (or fail at load time for pre-compiled assemblies) until it implements CountIf and CountDistinct.

Options to avoid the break:

Make them virtual with a default implementation (e.g. throw new NotSupportedException()).
Add them only on GroupBy<TKey> (which is the concrete class) rather than the abstract base.
Seal GroupBy if external subclassing is not intended (coordinate with maintainers).

src/Microsoft.Data.Analysis/GroupBy.cs

                       public override DataFrame Count(params string[] columnNames)
+                      {
+                          return CountIf(input => input.RowValue != null, columnNames);

Member

rokonec Mar 20, 2026

[Major — Performance regression] Count() was previously allocation-free — a tight if (column[row] != null) count++ loop. By delegating through CountIf, every row now:

Boxes the cell value into object RowValue (Int32, Double, DateTime, etc.).
Allocates a GroupByPredicateInput per (group, column) pair.
Invokes a delegate per row.

Count() is the most commonly-used GroupBy aggregation; on a 1M-row x 10-column DataFrame this adds millions of boxing allocations.

Suggested fix: Keep the original inlined null-check implementation for Count(). CountIf can coexist as a separate method without Count needing to delegate to it.

src/Microsoft.Data.Analysis/GroupBy.cs

+                      /// <summary>
+                      /// The name of the column that is being aggregated
+                      /// </summary>
+                      public string ColumnName { get; set; }

Member

rokonec Mar 20, 2026

[Minor — Design] Two small things on this record:

Mutable set accessors on a record. Records are designed for value semantics and immutability. Using { get; init; } instead of { get; set; } would prevent accidental mutation after construction and make storing instances in collections (like the HashSet in CountDistinct) safe by design.
object-typed properties cause boxing. Every value-type cell (Int32, Double, DateTime, etc.) gets boxed into object GroupKey / object RowValue. If perf matters, consider a generic variant like GroupByPredicateInput<TKey, TValue> — though that adds API complexity, so object may be an acceptable trade-off if documented.

src/Microsoft.Data.Analysis/GroupBy.cs

+                      /// </summary>
+                      /// <param name="predicate">A function that takes in the column name, group key, and row value and returns true to include that row in the group count or false to exclude it.</param>
+                      /// <param name="columnNames">The columns within which to compute the number of values in each group that match the predicate. A default value includes all columns.</param>
+                      /// <returns></returns>

Member

rokonec Mar 20, 2026

[Minor — Docs] <returns></returns> is empty here and on CountDistinct. Consider filling these in, e.g.:

/// <returns>A <see cref="DataFrame"/> with one row per group and one column per aggregated column, containing the matching counts.</returns>

test/Microsoft.Data.Analysis.Tests/DataFrameTests.cs

+                              }
+                          }
+                          DataFrame countIf = dfWithDuplicates.GroupBy("Group").CountIf((GroupByPredicateInput input) => input.RowValue is int and < 3, "Int");

Member

rokonec Mar 20, 2026

[Minor — Test coverage] CountIf is only tested with a single column ("Int"). Consider adding:

A call with no column names (default all-columns path) to verify it works across all column types.
An always-true predicate — result should match Count() output.
An always-false predicate — all counts should be 0.
Edge cases: empty groups, all-null columns, single-row groups.

dotnet-policy-service bot added the needs-author-action label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution needs-author-action