Skip to content

Conversation

@CarloMariaProietti
Copy link
Contributor

@CarloMariaProietti CarloMariaProietti commented Nov 10, 2025

FIXES #658
It makes possible to compare DataFrame by exploiting Myers difference algotithm whose cost is O((M+N)*D) .
M is length of dfA, N is length of dfB, D is length of shortest edit script to get B from A.

Returns a DataFrame< ComparisonDescription >,
ComparisonDescription is a schema created specifically for this use case.

It comes with a proper test case.

About Myers difference algotithm:
https://neil.fraser.name/writing/diff/myers.pdf

@CarloMariaProietti CarloMariaProietti marked this pull request as ready for review November 16, 2025 19:05
// row at index 'x-1' of dfA was removed
xPrev + 1 == x && yPrev + 1 != y -> {
comparisonDf = comparisonDf.concat(
dataFrameOf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange formatting, can you lint this file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also probably best to add argument names to ComparisonDescription, "magic" argumentless constants are often a source of bugs :)

internal class ComparisonDescription(
val rowAtIndex: Int,
val of: String,
val wasRemoved: Boolean?,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this results in true or null... what's wrong with false? Same below

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so... rows can either be removed or inserted, right? nothing else. Let's turn these two booleans in an enum as well then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I choose for null instead of false to remark that if wasInserted is true, wasRemoved column does not even make sense (and vive-versa).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see; well, in that case an enum makes more sense I think

val of: String,
val wasRemoved: Boolean?,
val wasInserted: Boolean?,
val afterRow: Int?,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this insertedAfterRow or something more expressive. That explains why it's null if wasRemoved == true

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but honestly, this seems a thing AI would do wrong. If you're using AI, that's okay; however, you remain the one responsible for the code.

At first you had:

val wasRemoved: Boolean?
val wasInserted: Boolean?
val afterRow: Int?

I remarked here: #1556 (comment) both values had two options: either null or true. This is an odd thing to do with booleans which, as you know, have 2 values already: true or false, however, you explained why, so that makes slightly more sense now :)

What I don't understand is why you now have two nullable RowOfComparisons:

val wasRemoved: RowOfComparison?
val wasInserted: RowOfComparison?

What would these even contain? A row is either 'removed' (WAS_REMOVED) or 'inserted' (WAS_INSERTED_AFTER_ROW), right? So why not just make one column called modification: RowOfComparison?

Next, I meant you to rename afterRow to insertedAfterRow, not wasInserted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, no AI at all was used in this code. I misunderstood #1556 (comment),
I thought that I had to create an enum whose constants had an exact correspondance with boolean values true and false (so that the schema could remain the same) .
However, now I understood what You meant :), I proceed to correct.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I take back what I said :) Thanks for the explanation!

val wasRemoved: Boolean?,
val wasInserted: Boolean?,
val afterRow: Int?,
) : DataRowSchema
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could include the modified DataRow<*> in the resulting DataFrame as well. That could make it a bit easier

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Myers difference algorithm exploits the idea of comparison in a boolean sense,
a member (row) of the data structure (df) is equal or not-equal to another member.
In this logic a modified row is represented by two DataRow<ComparisonDescription> ,
One row is like: row n of dfA was removed and the other: row m of dfB was inserted after row n-1 (of dfA).

Copy link
Contributor Author

@CarloMariaProietti CarloMariaProietti Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo representing explicitly modified rows means customing Myers Alg logic by introducing a 3-possible-output non boolean logic of comparison: Equal, Non Equal, Similar. Similar means that compared row differ for a limited number of elements (that may be proportional to row's length).
In a lower abstraction level, 'similar' looks like a 'flagged' diagonal move in the edit script graph
-> list of Pair would be no more enough to represent the result, we would need a list of 3-ple.
Anyway, imo, this slution has the following weakness: the result is less neutral because the previous definition of 'Similarity' can't determine with certainty wheter a row was actually modified.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not (yet) asking for in-row differences. Simply for adding the original row to a new column when that row was "Not Equal", so either "removed" or "inserted". A ComparisonDescription like row n of dfA was removed doesn't really help me if I can't see what "row n of dfA" actually contains. Similar to row m of dfB.

Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would make the comparison output more independent. I can try to implement it.

Copy link
Contributor Author

@CarloMariaProietti CarloMariaProietti Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would make the comparison output more independent. I can try to implement it.

Done, i added a DataRow<T> column to ComparisonDescription<T> representing the content of the row.

@Jolanrensen Jolanrensen added this to the 1.0.0-Beta5 milestone Dec 4, 2025
@CarloMariaProietti
Copy link
Contributor Author

If there is anything I can do, I am at full disposal :)

@Jolanrensen Jolanrensen modified the milestones: 1.0.0-Beta5, Backlog Jan 2, 2026
@Jolanrensen
Copy link
Collaborator

If there is anything I can do, I am at full disposal :)

Sure! However, in the team we currently give more priority to fixing bugs for the 1.0 release rather than adding new features like this. This will likely be a 1.1 thing, so it has lower priority for us; we'll likely revisit it later.

However, if you want, you could explore how DataFrame users would actually use this new functionality. Your PR only provides internal implementation code, after all. Similar to describe(), how would users actually compare two dataframes? Both in "normal" Kotlin code and in notebooks. What would be a good API for this, and does your implementation provide all information the users would expect to be there? Would you use it yourself?

@CarloMariaProietti
Copy link
Contributor Author

If there is anything I can do, I am at full disposal :)

Sure! However, in the team we currently give more priority to fixing bugs for the 1.0 release rather than adding new features like this. This will likely be a 1.1 thing, so it has lower priority for us; we'll likely revisit it later.

However, if you want, you could explore how DataFrame users would actually use this new functionality. Your PR only provides internal implementation code, after all. Similar to describe(), how would users actually compare two dataframes? Both in "normal" Kotlin code and in notebooks. What would be a good API for this, and does your implementation provide all information the users would expect to be there? Would you use it yourself?

I think that comparing dataframes could be very usefull when the goal is to monitor something in the time.
For example: I have 2 DFs, representing my customer base, one refers to today and the other refers to last year. In this context I would like to know the characteristics of the the clients I gained/lost and I would like the comparison to return a DF whose rows represent these clients.
Current implementation of compare allows to do this by properly quering returned DF; however the only usefull columns would be modifiedRowContent and modification.
Regarding the API, if I were an user I would like to do something like dfRepresentingToday.compareTo(dfRepresentingLastYear).

@Jolanrensen
Copy link
Collaborator

I think that comparing dataframes could be very usefull when the goal is to monitor something in the time. For example: I have 2 DFs, representing my customer base, one refers to today and the other refers to last year. In this context I would like to know the characteristics of the the clients I gained/lost and I would like the comparison to return a DF whose rows represent these clients. Current implementation of compare allows to do this by properly quering returned DF; however the only usefull columns would be modifiedRowContent and modification. Regarding the API, if I were an user I would like to do something like dfRepresentingToday.compareTo(dfRepresentingLastYear).

Yeah I think that makes sense :) maybe give it a try with a draft API and a notebook using it (you can publish DF to maven Local and run it in a notebook using:

USE { repositories(mavenLocal()) }
%use dataframe(v=1.0.0-dev)

)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Comparing two data frame

2 participants