Compare DataFrames #1556

CarloMariaProietti · 2025-11-10T17:24:22Z

FIXES #658
It makes it possible to compare DataFrames using the Myers difference algorithm, whose cost is O((M+N)*D).
M is the length of dfA, N is the length of dfB, and D is the length of the shortest edit script that produces dfB from dfA.

Returns a DataFrame<ComparisonDescription>;
ComparisonDescription is a schema created specifically for this use case.

It comes with a proper test case.

About the Myers difference algorithm:
https://neil.fraser.name/writing/diff/myers.pdf
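
For reference, a minimal sketch of the greedy forward pass of the Myers algorithm that only returns D, the length of the shortest edit script. It works on plain lists rather than DataFrames, and the function name `shortestEditScriptLength` is hypothetical, not the one used in this PR.

```kotlin
// Sketch only: computes D for two plain lists, assuming rows are compared with ==.
// `shortestEditScriptLength` is a hypothetical name, not part of the PR.
fun <T> shortestEditScriptLength(a: List<T>, b: List<T>): Int {
    val m = a.size // M in the PR description
    val n = b.size // N in the PR description
    val max = m + n
    if (max == 0) return 0
    val offset = max
    // v[offset + k] holds the furthest x reached so far on diagonal k = x - y
    val v = IntArray(2 * max + 2)
    for (d in 0..max) {
        var k = -d
        while (k <= d) {
            // Decide whether to arrive on diagonal k by an insertion (from k + 1)
            // or by a removal (from k - 1), keeping the furthest-reaching x.
            var x = if (k == -d || (k != d && v[offset + k - 1] < v[offset + k + 1])) {
                v[offset + k + 1]
            } else {
                v[offset + k - 1] + 1
            }
            var y = x - k
            // Follow the "snake": consecutive equal rows cost nothing.
            while (x < m && y < n && a[x] == b[y]) {
                x++
                y++
            }
            v[offset + k] = x
            if (x >= m && y >= n) return d
            k += 2
        }
    }
    error("unreachable: D can never exceed M + N")
}
```

For the sequences ABCABBA and CBABAC (the example used in the Myers paper) this returns 5. The outer loop runs D + 1 times and each round visits at most D + 1 diagonals, which is where the dependence on D in the stated cost comes from.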


core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt

core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/compareDataFrames.kt

Jolanrensen · 2025-11-28T12:50:54Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt

+            // row at index 'x-1' of dfA was removed
+            xPrev + 1 == x && yPrev + 1 != y -> {
+                comparisonDf = comparisonDf.concat(
+                    dataFrameOf


strange formatting, can you lint this file?

also probably best to add argument names to ComparisonDescription, "magic" argumentless constants are often a source of bugs :)

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt

Jolanrensen · 2025-11-28T13:06:36Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt

+internal class ComparisonDescription(
+    val rowAtIndex: Int,
+    val of: String,
+    val wasRemoved: Boolean?,


this results in true or null... what's wrong with false? Same below

so... rows can either be removed or inserted, right? nothing else. Let's turn these two booleans into an enum as well then

I chose null instead of false to signal that if wasInserted is true, the wasRemoved column does not even make sense (and vice versa).

I see; well, in that case an enum makes more sense I think

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt

Jolanrensen · 2025-11-28T13:09:06Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt

+    val of: String,
+    val wasRemoved: Boolean?,
+    val wasInserted: Boolean?,
+    val afterRow: Int?,


maybe call this insertedAfterRow or something more expressive. That explains why it's null if wasRemoved == true

Correct me if I'm wrong, but honestly, this seems a thing AI would do wrong. If you're using AI, that's okay; however, you remain the one responsible for the code.

At first you had:

val wasRemoved: Boolean?
val wasInserted: Boolean?
val afterRow: Int?

I remarked here: #1556 (comment) that both values had two options: either null or true. This is an odd thing to do with booleans, which, as you know, already have two values: true or false. However, you explained why, so that makes slightly more sense now :)

What I don't understand is why you now have two nullable RowOfComparisons:

val wasRemoved: RowOfComparison?
val wasInserted: RowOfComparison?

What would these even contain? A row is either 'removed' (WAS_REMOVED) or 'inserted' (WAS_INSERTED_AFTER_ROW), right? So why not just make one column called modification: RowOfComparison?

Next, I meant you to rename afterRow to insertedAfterRow, not wasInserted.
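
A rough sketch of the shape being asked for here, with a single non-null modification column and afterRow renamed to insertedAfterRow. It is a plain data class so that it stays self-contained (the real class implements DataRowSchema), and the consistency check in init is an assumption added for illustration, not something taken from the PR.

```kotlin
// Sketch only: field names follow the review comments above.
// DataRowSchema is left out so the snippet compiles without the DataFrame library.
enum class RowOfComparison { WAS_REMOVED, WAS_INSERTED_AFTER_ROW }

data class ComparisonDescription(
    val rowAtIndex: Int,
    val of: String,
    val modification: RowOfComparison,
    // Only meaningful for insertions, hence nullable.
    val insertedAfterRow: Int?,
) {
    init {
        // Assumption for illustration: insertedAfterRow is set exactly when the row was inserted.
        val isInsertion = modification == RowOfComparison.WAS_INSERTED_AFTER_ROW
        require(isInsertion == (insertedAfterRow != null)) {
            "insertedAfterRow must be set for insertions and null for removals"
        }
    }
}
```

With one enum column there is no state in which a row is both removed and inserted, and named arguments such as ComparisonDescription(rowAtIndex = 2, of = "dfA", modification = RowOfComparison.WAS_REMOVED, insertedAfterRow = null) avoid the "magic" positional constants mentioned earlier in the review.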

Honestly, no AI at all was used in this code. I misunderstood #1556 (comment):
I thought that I had to create an enum whose constants corresponded exactly to the boolean values true and false (so that the schema could remain the same).
However, I now understand what you meant :) I will correct it.

Then I take back what I said :) Thanks for the explanation!

Jolanrensen · 2025-11-28T13:12:13Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt

+    val wasRemoved: Boolean?,
+    val wasInserted: Boolean?,
+    val afterRow: Int?,
+) : DataRowSchema


I wonder if we could include the modified DataRow<*> in the resulting DataFrame as well. That could make it a bit easier

The Myers difference algorithm relies on comparison in a boolean sense:
a member (row) of the data structure (df) is either equal or not equal to another member.
In this logic, a modified row is represented by two DataRow<ComparisonDescription>:
one says that row n of dfA was removed, and the other that row m of dfB was inserted after row n-1 (of dfA).

Imo, representing modified rows explicitly means customizing the Myers algorithm's logic by introducing a non-boolean comparison with three possible outputs: Equal, Not Equal, Similar. Similar means that the compared rows differ in a limited number of elements (which may be proportional to the row's length).
At a lower abstraction level, 'Similar' looks like a 'flagged' diagonal move in the edit script graph
-> a list of Pair would no longer be enough to represent the result; we would need a list of triples.
Anyway, imo, this solution has the following weakness: the result is less neutral, because the above definition of 'Similarity' can't determine with certainty whether a row was actually modified.

I'm not (yet) asking for in-row differences. Simply for adding the original row to a new column when that row was "Not Equal", so either "removed" or "inserted". A ComparisonDescription like row n of dfA was removed doesn't really help me if I can't see what "row n of dfA" actually contains. Similar to row m of dfB.

Does that make sense?

Yes, it would make the comparison output more independent. I can try to implement it.

Done, I added a DataRow<T> column to ComparisonDescription<T> representing the content of the row.
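
To make the thread above concrete, here is a hedged sketch of how an edit path could be turned into descriptions that carry the row content. It assumes the path is a list of (x, y) points taken in unit steps through the edit graph and uses plain lists in place of DataFrames. The function `describePath` and the exact field layout are hypothetical; only the column names come from this conversation.

```kotlin
// Sketch only: plain lists stand in for dfA and dfB, T stands in for DataRow<T>.
enum class RowOfComparison { WAS_REMOVED, WAS_INSERTED_AFTER_ROW }

data class ComparisonDescription<T>(
    val rowAtIndex: Int,
    val of: String,
    val modification: RowOfComparison,
    val insertedAfterRow: Int?,
    val modifiedRowContent: T,
)

// `path` is assumed to hold consecutive points of the edit graph, one unit step apart.
fun <T> describePath(a: List<T>, b: List<T>, path: List<Pair<Int, Int>>): List<ComparisonDescription<T>> =
    path.zipWithNext().mapNotNull { (prev, cur) ->
        val (xPrev, yPrev) = prev
        val (x, y) = cur
        when {
            // horizontal move: row at index x - 1 of dfA was removed
            xPrev + 1 == x && yPrev == y ->
                ComparisonDescription(x - 1, "dfA", RowOfComparison.WAS_REMOVED, null, a[x - 1])
            // vertical move: row at index y - 1 of dfB was inserted after row x - 1 of dfA
            xPrev == x && yPrev + 1 == y ->
                ComparisonDescription(y - 1, "dfB", RowOfComparison.WAS_INSERTED_AFTER_ROW, x - 1, b[y - 1])
            // diagonal move: rows are equal, nothing to report
            else -> null
        }
    }
```

With a = ["x", "y", "z"], b = ["x", "Y", "z"] and the path (0,0), (1,1), (1,2), (2,2), (3,3), the modified middle row shows up as two entries, an insertion of "Y" after row 0 and a removal of "y", and each entry carries the row it refers to.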

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt

CarloMariaProietti · 2025-12-28T17:16:17Z

If there is anything I can do, I am at your full disposal :)

Jolanrensen · 2026-01-02T12:47:20Z

Sure! However, in the team we currently give more priority to fixing bugs for the 1.0 release rather than adding new features like this. This will likely be a 1.1 thing, so it has lower priority for us; we'll likely revisit it later.

However, if you want, you could explore how DataFrame users would actually use this new functionality. Your PR only provides internal implementation code, after all. Similar to describe(), how would users actually compare two dataframes? Both in "normal" Kotlin code and in notebooks. What would be a good API for this, and does your implementation provide all information the users would expect to be there? Would you use it yourself?

CarloMariaProietti · 2026-01-02T19:10:08Z

I think that comparing dataframes could be very useful when the goal is to monitor something over time.
For example: I have 2 DFs representing my customer base; one refers to today and the other to last year. In this context I would like to know the characteristics of the clients I gained/lost, and I would like the comparison to return a DF whose rows represent these clients.
The current implementation of compare allows this by properly querying the returned DF; however, the only useful columns would be modifiedRowContent and modification.
Regarding the API, if I were a user I would like to do something like dfRepresentingToday.compareTo(dfRepresentingLastYear).
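
As an illustration of the querying described above, a small sketch that splits a comparison result into gained and lost clients using only the modification and modifiedRowContent columns. A plain list stands in for the returned DataFrame, and it assumes last year's data is dfA and today's is dfB, so inserted rows are gained clients. The type `ComparisonRow` and the helpers `gained` and `lost` are hypothetical.

```kotlin
// Sketch only: a list of ComparisonRow stands in for the DataFrame returned by compare.
enum class RowOfComparison { WAS_REMOVED, WAS_INSERTED_AFTER_ROW }

data class ComparisonRow<T>(
    val modification: RowOfComparison,
    val modifiedRowContent: T,
)

// Assumption: dfA is last year, dfB is today, so an inserted row is a gained client.
fun <T> List<ComparisonRow<T>>.gained(): List<T> =
    filter { it.modification == RowOfComparison.WAS_INSERTED_AFTER_ROW }
        .map { it.modifiedRowContent }

// Under the same assumption, a removed row is a lost client.
fun <T> List<ComparisonRow<T>>.lost(): List<T> =
    filter { it.modification == RowOfComparison.WAS_REMOVED }
        .map { it.modifiedRowContent }
```

If the receiver and argument of the proposed compareTo were swapped, the two helpers would swap meaning as well, so a public API would probably need to document which side counts as the "old" frame.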

Jolanrensen · 2026-01-05T13:57:42Z

Yeah, I think that makes sense :) Maybe give it a try with a draft API and a notebook using it. You can publish DF to Maven Local and run it in a notebook using:

USE { repositories(mavenLocal()) }
%use dataframe(v=1.0.0-dev)

CarloMariaProietti added 10 commits November 7, 2025 16:52

length of edit script is correct, working on path

9580248

trying

eeee3f3

this is working but snake before f.r.e.

5fd2bfa

cleaning

fccaaf6

cleaning

bcc41e0

refining logic

51e2b23

algorythm works good with strings, next step is swithcing to dataFram…

f4b21e3

…es, there is no difference in the logic

pull master

ce46f72

cleaning

b25c712

tests

026ffe6

CarloMariaProietti marked this pull request as draft November 10, 2025 17:24

CarloMariaProietti mentioned this pull request Nov 10, 2025

Comparing two data frame #658

Open

CarloMariaProietti added 8 commits November 14, 2025 19:18

improve logic

2de3ad1

improving comments

352bf45

Update ValueColumn.kt

0db34d8

works fine with df

3387489

compareImpl

e052646

compare is ready to use

4db19bf

ready for review

b4be510

pull

e7b648f

CarloMariaProietti marked this pull request as ready for review November 16, 2025 19:05

Jolanrensen reviewed Nov 21, 2025
