`countDistinct` overloads on `GroupBy` by Allex-Nik · Pull Request #1875 · Kotlin/dataframe

Allex-Nik · 2026-05-27T15:06:14Z

Fixes #533.

The countDistinct function is currently defined only on DataFrame. In this PR, I add two countDistinct aggregations on GroupBy:

without a columns selector (all columns are considered)
with a columns selector.

KDocs and website docs are adjusted accordingly.

Jolanrensen · 2026-05-28T18:04:37Z

+    // TODO: check columns as well when #1531 is fixed
+    @Test
+    fun `countDistinct on empty GroupBy`() {
+        df.drop(df.nrow).groupBy("group").countDistinct().count() shouldBe 0


purely cosmetic, but I'd recommend formatting DF operations across multiple lines, per operation, to understand what follows what, and it reads best like a sentence:

df .drop(df.nrow) .groupBy("group").countDistinct() .count() shouldBe 0

I've disabled the respective ktlint rule to allow this notation, otherwise it always puts everthing on the next line.

I think we follow this principle in examples, not as important in tests, but I thought it would be informative nonetheless :)

I'd recommend formatting DF operations across multiple lines, per operation

It is only for the cases when we do many (for example, 3+) DF operations, right?
If we do 2 operations: df.groupBy { isHappy }.countDistinct { name.firstName }, would you keep them in one line?

yes. However, I count groupBy+countDistinct still as 1 operation, so I'd keep them in one line/attached regardless.

(okay, for groupBy the line is a bit more of a gray area, but for, say, df.update { cols }.where {}.with {}, it's clearer to see they belong together on one line, especially if there are other operations after it.)

I see, thank you! I have formatted operations in the tests

Jolanrensen · 2026-05-28T18:14:15Z

+    fun countDistinctColumnsDf() {
+        df.format().perRowCol { row, _ ->
+            val lastName = df[row.index()].name.lastName
+            background(lastNameToColor(lastName)) and textColor(black)


hmm. It took me a bit longer than expected to figure out what the colors meant exactly. I think the problem mainly is that your colors mean two things: "green" means "these rows are considered the same and counted as 1", while "white" means "these rows are all distinct". Could I recommend using a distinct color for each distinct row and using the same color for rows considered similar? (maybe use fewer rows if you run out of color XD)

What if we color only the two rows that will be considered the same? I think this way we would focus the reader's attention on the most important thing: these two rows will be counted as one.

@Allex-Nik hmm, slightly better, but could you see what it looks like with different colors for a dataframe with 4 rows max? I still believe that may show off the idea more clearly. This doesn't work well if the dataframe is larger than that, I believe

I have updated the dataframe in the countDistinct samples so that it only has 4 rows. I have also adjusted samples so that the new shorted dataframe still illustrates the ideas in the documentation. In the groupBy docs, I have made a separate sample for using countDistinct on GroupBy because the dataframe in the countDistinct docs is now different, so the sample cannot be reused in the groupBy docs.

It looks better indeed!

indeed! :D thanks!

Jolanrensen · 2026-05-28T18:17:02Z

+                    background(firstNameToColor(firstName)) and textColor(black)
+                }
+            }
+            .saveDfHtmlSample()


candidate for all nested frames to be expanded by default :D I recommended Andrei a way to achieve this not that long ago... maybe you can find and apply it here :) otherwise we can do it later

I found the script, but I was thinking: since this expanding should be applied to many different samples, would it make sense to create a separate issue for this and implement it in a separate PR? Maybe within this issue we could also investigate if we need to just add a helper method with this script to DataFrameSampleHelper (and maybe even to update the existing samples to be expanded where needed) or we can change the main rendering script (init.js, html.kt) so that tables are expanded by default.

Or do you still think it's better to try expanding on this sample in this PR for now? If yes, I think I can add this script as a method in DataFrameSampleHelper and apply it only for countDistinct

We can do both maybe :) creating an issue is a good idea; then we can add the script here temporarily with a TODO pointing to the issue. That way the website looks good and we will fix it properly in the future. wdyt?

Jolanrensen · 2026-05-28T18:17:31Z

good job! :D

… of new overloads on `GroupBy`

…tDistinct` overloads

Allex-Nik requested review from Jolanrensen and zaleslaw May 27, 2026 15:06

Allex-Nik commented May 28, 2026

View reviewed changes

Comment thread core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/countDistinct.kt

Allex-Nik commented May 28, 2026

View reviewed changes

Comment thread core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/countDistinct.kt

Allex-Nik commented May 28, 2026

View reviewed changes

Comment thread docs/StardustDocs/topics/groupBy.md

Jolanrensen reviewed May 28, 2026

View reviewed changes

Allex-Nik added 6 commits June 1, 2026 17:34

Add countDistinct overloads on GroupBy and KDocs for them

2d747c2

Add tests for countDistinct overloads on GroupBy

7718511

Update website documentation for the countDistinct function because…

d243c71

… of new overloads on `GroupBy`

Update groupBy website documentation and KDocs because of new `coun…

048334a

…tDistinct` overloads

Remove unused import in Access.kt

e4f4392

Fixes to countDistinct and groupBy documentation after review

43c51cf

Allex-Nik force-pushed the count-distinct-on-group-by branch from 666dd6d to 43c51cf Compare June 1, 2026 15:34

Conversation

Allex-Nik commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Allex-Nik Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Jolanrensen commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Allex-Nik Jun 1, 2026 •

edited

Loading