Skip to content

countDistinct overloads on GroupBy#1875

Open
Allex-Nik wants to merge 6 commits into
masterfrom
count-distinct-on-group-by
Open

countDistinct overloads on GroupBy#1875
Allex-Nik wants to merge 6 commits into
masterfrom
count-distinct-on-group-by

Conversation

@Allex-Nik
Copy link
Copy Markdown
Collaborator

Fixes #533.

The countDistinct function is currently defined only on DataFrame. In this PR, I add two countDistinct aggregations on GroupBy:

  • without a columns selector (all columns are considered)
  • with a columns selector.

KDocs and website docs are adjusted accordingly.

@Allex-Nik Allex-Nik requested review from Jolanrensen and zaleslaw May 27, 2026 15:06
Comment thread docs/StardustDocs/topics/groupBy.md
// TODO: check columns as well when #1531 is fixed
@Test
fun `countDistinct on empty GroupBy`() {
df.drop(df.nrow).groupBy("group").countDistinct().count() shouldBe 0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

purely cosmetic, but I'd recommend formatting DF operations across multiple lines, per operation, to understand what follows what, and it reads best like a sentence:

df
    .drop(df.nrow)
    .groupBy("group").countDistinct()
    .count() shouldBe 0

I've disabled the respective ktlint rule to allow this notation, otherwise it always puts everthing on the next line.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we follow this principle in examples, not as important in tests, but I thought it would be informative nonetheless :)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend formatting DF operations across multiple lines, per operation

It is only for the cases when we do many (for example, 3+) DF operations, right?
If we do 2 operations: df.groupBy { isHappy }.countDistinct { name.firstName }, would you keep them in one line?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. However, I count groupBy+countDistinct still as 1 operation, so I'd keep them in one line/attached regardless.

(okay, for groupBy the line is a bit more of a gray area, but for, say, df.update { cols }.where {}.with {}, it's clearer to see they belong together on one line, especially if there are other operations after it.)

Copy link
Copy Markdown
Collaborator Author

@Allex-Nik Allex-Nik Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thank you! I have formatted operations in the tests

fun countDistinctColumnsDf() {
df.format().perRowCol { row, _ ->
val lastName = df[row.index()].name.lastName
background(lastNameToColor(lastName)) and textColor(black)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. It took me a bit longer than expected to figure out what the colors meant exactly. I think the problem mainly is that your colors mean two things: "green" means "these rows are considered the same and counted as 1", while "white" means "these rows are all distinct". Could I recommend using a distinct color for each distinct row and using the same color for rows considered similar? (maybe use fewer rows if you run out of color XD)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we color only the two rows that will be considered the same? I think this way we would focus the reader's attention on the most important thing: these two rows will be counted as one.
rows

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Allex-Nik hmm, slightly better, but could you see what it looks like with different colors for a dataframe with 4 rows max? I still believe that may show off the idea more clearly. This doesn't work well if the dataframe is larger than that, I believe

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the dataframe in the countDistinct samples so that it only has 4 rows. I have also adjusted samples so that the new shorted dataframe still illustrates the ideas in the documentation. In the groupBy docs, I have made a separate sample for using countDistinct on GroupBy because the dataframe in the countDistinct docs is now different, so the sample cannot be reused in the groupBy docs.

It looks better indeed!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image indeed! :D thanks!

background(firstNameToColor(firstName)) and textColor(black)
}
}
.saveDfHtmlSample()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

candidate for all nested frames to be expanded by default :D I recommended Andrei a way to achieve this not that long ago... maybe you can find and apply it here :) otherwise we can do it later

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the script, but I was thinking: since this expanding should be applied to many different samples, would it make sense to create a separate issue for this and implement it in a separate PR? Maybe within this issue we could also investigate if we need to just add a helper method with this script to DataFrameSampleHelper (and maybe even to update the existing samples to be expanded where needed) or we can change the main rendering script (init.js, html.kt) so that tables are expanded by default.

Or do you still think it's better to try expanding on this sample in this PR for now? If yes, I think I can add this script as a method in DataFrameSampleHelper and apply it only for countDistinct

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do both maybe :) creating an issue is a good idea; then we can add the script here temporarily with a TODO pointing to the issue. That way the website looks good and we will fix it properly in the future. wdyt?

@Jolanrensen
Copy link
Copy Markdown
Collaborator

good job! :D

@Allex-Nik Allex-Nik force-pushed the count-distinct-on-group-by branch from 666dd6d to 43c51cf Compare June 1, 2026 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Define and add countDistinct aggregation for GroupBy

2 participants