countDistinct overloads on GroupBy#1875
Conversation
| // TODO: check columns as well when #1531 is fixed | ||
| @Test | ||
| fun `countDistinct on empty GroupBy`() { | ||
| df.drop(df.nrow).groupBy("group").countDistinct().count() shouldBe 0 |
There was a problem hiding this comment.
purely cosmetic, but I'd recommend formatting DF operations across multiple lines, per operation, to understand what follows what, and it reads best like a sentence:
df
.drop(df.nrow)
.groupBy("group").countDistinct()
.count() shouldBe 0I've disabled the respective ktlint rule to allow this notation, otherwise it always puts everthing on the next line.
There was a problem hiding this comment.
I think we follow this principle in examples, not as important in tests, but I thought it would be informative nonetheless :)
There was a problem hiding this comment.
I'd recommend formatting DF operations across multiple lines, per operation
It is only for the cases when we do many (for example, 3+) DF operations, right?
If we do 2 operations: df.groupBy { isHappy }.countDistinct { name.firstName }, would you keep them in one line?
There was a problem hiding this comment.
yes. However, I count groupBy+countDistinct still as 1 operation, so I'd keep them in one line/attached regardless.
(okay, for groupBy the line is a bit more of a gray area, but for, say, df.update { cols }.where {}.with {}, it's clearer to see they belong together on one line, especially if there are other operations after it.)
There was a problem hiding this comment.
I see, thank you! I have formatted operations in the tests
| fun countDistinctColumnsDf() { | ||
| df.format().perRowCol { row, _ -> | ||
| val lastName = df[row.index()].name.lastName | ||
| background(lastNameToColor(lastName)) and textColor(black) |
There was a problem hiding this comment.
hmm. It took me a bit longer than expected to figure out what the colors meant exactly. I think the problem mainly is that your colors mean two things: "green" means "these rows are considered the same and counted as 1", while "white" means "these rows are all distinct". Could I recommend using a distinct color for each distinct row and using the same color for rows considered similar? (maybe use fewer rows if you run out of color XD)
There was a problem hiding this comment.
@Allex-Nik hmm, slightly better, but could you see what it looks like with different colors for a dataframe with 4 rows max? I still believe that may show off the idea more clearly. This doesn't work well if the dataframe is larger than that, I believe
There was a problem hiding this comment.
I have updated the dataframe in the countDistinct samples so that it only has 4 rows. I have also adjusted samples so that the new shorted dataframe still illustrates the ideas in the documentation. In the groupBy docs, I have made a separate sample for using countDistinct on GroupBy because the dataframe in the countDistinct docs is now different, so the sample cannot be reused in the groupBy docs.
It looks better indeed!
| background(firstNameToColor(firstName)) and textColor(black) | ||
| } | ||
| } | ||
| .saveDfHtmlSample() |
There was a problem hiding this comment.
candidate for all nested frames to be expanded by default :D I recommended Andrei a way to achieve this not that long ago... maybe you can find and apply it here :) otherwise we can do it later
There was a problem hiding this comment.
I found the script, but I was thinking: since this expanding should be applied to many different samples, would it make sense to create a separate issue for this and implement it in a separate PR? Maybe within this issue we could also investigate if we need to just add a helper method with this script to DataFrameSampleHelper (and maybe even to update the existing samples to be expanded where needed) or we can change the main rendering script (init.js, html.kt) so that tables are expanded by default.
Or do you still think it's better to try expanding on this sample in this PR for now? If yes, I think I can add this script as a method in DataFrameSampleHelper and apply it only for countDistinct
There was a problem hiding this comment.
We can do both maybe :) creating an issue is a good idea; then we can add the script here temporarily with a TODO pointing to the issue. That way the website looks good and we will fix it properly in the future. wdyt?
|
good job! :D |
… of new overloads on `GroupBy`
…tDistinct` overloads
666dd6d to
43c51cf
Compare


Fixes #533.
The
countDistinctfunction is currently defined only onDataFrame. In this PR, I add twocountDistinctaggregations onGroupBy:KDocs and website docs are adjusted accordingly.