uniqdiff

Data diff engine for CSV, Parquet, JSONL, streams, and large datasets.

uniqdiff tells you what changed between two datasets:

  • rows only in the old dataset;
  • rows only in the new dataset;
  • rows present in both datasets;
  • changed fields for rows with the same key;
  • schema drift: added/removed columns, type changes, nullable changes.

It is built for data engineers, backend engineers, ETL pipelines, and data QA jobs that need repeatable diffs without writing pandas merge code or DuckDB SQL.

pip install uniqdiff
uniqdiff diff old.csv new.csv --format csv --key id --summary

Why uniqdiff

  • Less memory than pandas for large diffs: use disk-backed and file-result modes.
  • Simpler than DuckDB for comparison jobs: no SQL required.
  • More scalable than csv-diff for large outputs: stream JSONL/CSV result rows.
  • Engine-first API: use it from Python, CLI, CI, ETL, or other tools.
  • No heavy required dependencies: core uses the Python standard library.

Use uniqdiff when you need a focused comparison engine, not a dataframe framework, SQL database, or report generator.

Quick start

Create two files:

python -c "from pathlib import Path; Path('old.csv').write_text('id,name,status\n1,Alice,active\n2,Bob,pending\n3,Cara,active\n', encoding='utf-8'); Path('new.csv').write_text('id,name,status\n2,Bob,active\n3,Cara,active\n4,Dana,pending\n', encoding='utf-8')"

Install and compare:

pip install uniqdiff
uniqdiff diff old.csv new.csv --format csv --key id --summary

Find changed fields for matching keys:

uniqdiff diff old.csv new.csv --format csv --key id --field-diff

Use the sorted streaming path when upstream exports are already sorted by key:

uniqdiff diff old.csv new.csv --format csv --key id --field-diff --sorted-input
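
The same sorted path is available from Python via iter_field_diff_sorted (listed in the Python API section below). A minimal sketch; both inputs must already be sorted by the key column, and the shape of each yielded item is not documented here, so the sketch just prints it:

from uniqdiff import iter_field_diff_sorted

# Rows are plain dicts for illustration; both lists are pre-sorted by "id".
old_sorted = [{"id": "1", "status": "active"}, {"id": "2", "status": "pending"}]
new_sorted = [{"id": "2", "status": "active"}, {"id": "3", "status": "pending"}]

for change in iter_field_diff_sorted(old_sorted, new_sorted, key="id"):
    print(change)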

Check schema drift:

uniqdiff diff old.csv new.csv --format csv --schema-diff --summary

Use from Python:

from uniqdiff import compare

result = compare(
    [{"id": 1}, {"id": 2}],
    [{"id": 2}, {"id": 3}],
    key="id",
    include_common=True,
)

print(result.only_in_first)   # [{"id": 1}]
print(result.only_in_second)  # [{"id": 3}]
print(result.common)          # [{"id": 2}]

Example output

CLI summary:

{
  "equal": false,
  "only_in_first_count": 1,
  "only_in_second_count": 1,
  "common_count": 2,
  "duplicate_first_count": 0,
  "duplicate_second_count": 0,
  "backend": "memory",
  "result_mode": "memory"
}

Field-level diff:

{
  "rows": [
    {
      "key": "2",
      "changes": [
        {
          "field": "status",
          "left": "pending",
          "right": "active"
        }
      ]
    }
  ],
  "summary_by_column": {
    "status": 1
  }
}
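
The same field-level diff can be produced from Python with compare_fields (documented in the Python API section below). A minimal sketch; the exact shape of the returned object is not shown in this README, so it is simply printed:

from uniqdiff import compare_fields

old_rows = [{"id": "2", "name": "Bob", "status": "pending"}]
new_rows = [{"id": "2", "name": "Bob", "status": "active"}]

# compare_fields is part of the public API listed below; the structure of
# the result object is not fully documented here, so print it to inspect.
fields = compare_fields(old_rows, new_rows, key="id", columns=("status",))
print(fields)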

Machine-readable JSONL event stream:

uniqdiff diff old.csv new.csv \
  --key id \
  --field-diff \
  --columns status \
  --format jsonl \
  --output diff.jsonl

Example uniqdiff.jsonl events:

{"type":"metadata","format":"uniqdiff.jsonl","format_version":"1.0","tool":"uniqdiff","tool_version":"1.0.0","mode":"field_diff","key_columns":["id"],"compared_columns":["status"],"created_at":"2026-05-06T12:00:00Z"}
{"type":"only_left","key":{"id":"1"}}
{"type":"field_change","key":{"id":"2"},"column":"status","left":"pending","right":"active"}
{"type":"summary","left_rows":3,"right_rows":3,"common_rows":2,"only_left":1,"only_right":1,"changed_rows":1,"changed_fields":1,"duplicate_keys_left":0,"duplicate_keys_right":0,"schema_changes":0}

Legacy section/value file-result output:

uniqdiff diff old.csv new.csv \
  --format csv \
  --key id \
  --mode disk \
  --result-mode file \
  --output diff.jsonl

Example JSONL rows:

{"section":"only_in_first","value":{"id":"1","name":"Alice","status":"active"}}
{"section":"only_in_second","value":{"id":"4","name":"Dana","status":"pending"}}

Use cases

  • Compare daily CSV exports.
  • Detect added and removed records by primary key.
  • Compare Parquet snapshots in ETL pipelines.
  • Catch row-level changes before loading data downstream.
  • Detect schema drift in CI.
  • Stream large diff outputs to JSONL for later processing.
  • Load diff events into DuckDB, Spark, BigQuery, ClickHouse, or CI jobs (see the DuckDB sketch after this list).
  • Validate backend service exports against expected data.
  • Build data QA tools on top of a stable comparison engine.
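
For example, loading the event stream into DuckDB takes one query. A minimal sketch, assuming the duckdb package is installed and diff.jsonl was produced by the commands above; left, right, and column are quoted because they collide with SQL keywords:

import duckdb

con = duckdb.connect()
# read_json_auto unions the keys across event types; fields absent from
# a given event come back as NULL.
con.execute("CREATE TABLE diff_events AS SELECT * FROM read_json_auto('diff.jsonl')")
rows = con.execute(
    """SELECT key, "column", "left", "right"
       FROM diff_events WHERE type = 'field_change'"""
).fetchall()
print(rows)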

Benchmarks

These are real local benchmark runs from this repository using benchmarks/run.py. Results depend on CPU, disk, Python version, input shape, overlap, and output size.

Verification commands run before updating this README:

python -m ruff check .
python -m mypy src
python -m pytest tests -q

Result:

ruff: All checks passed
mypy: Success, no issues found in 35 source files
pytest: full test suite passed, 2 optional tests skipped

Scalar keyed diff, 100_000 rows per input, 50% overlap:

python benchmarks/run.py \
  --size 100000 \
  --chunk-size 50000 \
  --scenario memory \
  --scenario sqlite \
  --scenario file_result \
  --json
Scenario     Backend  Time    Peak memory  Rows only in left  Rows only in right  Common  Output
memory       memory   0.085s  13.308 MB    50,000             50,000              50,000  0 MB
sqlite       sqlite   3.177s  9.206 MB     50,000             50,000              50,000  0 MB
file_result  sqlite   4.360s  9.202 MB     50,000             50,000              50,000  5.759 MB

Dictionary row diff by id, 50_000 rows per input, 50% overlap:

python benchmarks/run.py \
  --size 50000 \
  --data-shape dict \
  --chunk-size 25000 \
  --scenario memory \
  --scenario sqlite \
  --scenario file_result \
  --json
Scenario     Backend  Time    Peak memory  Rows only in left  Rows only in right  Common  Output
memory       memory   0.063s  6.638 MB     25,000             25,000              25,000  0 MB
sqlite       sqlite   3.123s  26.818 MB    25,000             25,000              25,000  0 MB
file_result  sqlite   5.584s  26.589 MB    25,000             25,000              25,000  5.057 MB

What the numbers show:

  • memory mode is fastest when data fits comfortably in RAM;
  • disk mode trades speed for bounded materialization and file-backed workflows;
  • file-result mode handles large outputs without keeping result rows in memory;
  • benchmark runs are reproducible with stdlib-only tooling.

Cross-tool benchmarks are available separately:

pip install -e ".[benchmark]"
python benchmarks/comparison/run.py --rows 10000

The comparison suite reports fit-by-use-case labels for pandas, DuckDB, csv-diff/csvdiff-style workflows, DataComPy, and uniqdiff.

Features

  • Exact row presence diff:
    • only_in_first
    • only_in_second
    • common
    • unique
  • Duplicate detection.
  • Comparison by key or composite key.
  • Field-level diff by key.
  • Schema-aware diff:
    • added columns;
    • removed columns;
    • type changes;
    • nullable changes.
  • CSV, TSV, JSONL, TXT, gzip, and optional Parquet.
  • Streaming sorted diff for already sorted inputs.
  • Memory, SQLite, hash partition, external sort, and auto modes.
  • File-result mode for JSONL/CSV output.
  • uniqdiff.jsonl event stream with schema version 1.0.
  • Lazy readers for large result files.
  • CLI with --summary and --fail-on-diff.
  • Type hints and stable result objects.

Usage

Compare files:

uniqdiff diff old.csv new.csv --format csv --key id

Compact CI output:

uniqdiff diff expected.csv actual.csv \
  --format csv \
  --key id \
  --summary \
  --fail-on-diff
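
The same check can be scripted. A minimal sketch, assuming --fail-on-diff makes the process exit nonzero when differences are found, which is what the flag name implies:

import subprocess

proc = subprocess.run(
    ["uniqdiff", "diff", "expected.csv", "actual.csv",
     "--format", "csv", "--key", "id", "--summary", "--fail-on-diff"],
    capture_output=True,
    text=True,
)
if proc.returncode != 0:
    # Nonzero exit status is assumed to signal a diff, per --fail-on-diff.
    print("datasets differ:")
    print(proc.stdout)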

Field-level diff:

uniqdiff diff old.csv new.csv \
  --format csv \
  --key id \
  --field-diff \
  --columns name,status,email \
  --summary

Schema diff:

uniqdiff diff old.csv new.csv \
  --format csv \
  --schema-diff \
  --summary \
  --fail-on-diff

Large output:

uniqdiff diff old.csv new.csv \
  --key id \
  --format jsonl \
  --output diff.jsonl

Use --input-format csv when file extensions are not enough for input detection:

uniqdiff diff snapshot_a snapshot_b \
  --input-format csv \
  --key id \
  --format jsonl

Python API:

from uniqdiff import (
    compare_fields,
    compare_file_schema,
    compare_files,
    iter_compare_events,
    iter_event_rows,
    iter_field_diff_sorted,
    summarize_event_file,
)

old_rows = [{"id": "1", "status": "active"}, {"id": "2", "status": "pending"}]
new_rows = [{"id": "2", "status": "active"}, {"id": "3", "status": "pending"}]
old_rows_sorted = sorted(old_rows, key=lambda row: row["id"])
new_rows_sorted = sorted(new_rows, key=lambda row: row["id"])

rows = compare_files("old.csv", "new.csv", format="csv", key="id")
fields = compare_fields(old_rows, new_rows, key="id", columns=("status",))
streamed_fields = iter_field_diff_sorted(old_rows_sorted, new_rows_sorted, key="id")
schema = compare_file_schema("old.csv", "new.csv", format="csv")
events = iter_compare_events(old_rows, new_rows, key="id")
saved_events = iter_event_rows("diff.jsonl")   # diff.jsonl from the CLI examples above
summary = summarize_event_file("diff.jsonl")

JSONL event types:

Type           Meaning
metadata       First event; format, version, key columns, compared columns, timestamp
only_left      Key exists only in the left input
only_right     Key exists only in the right input
row_changed    Matching-key row has one or more changed fields
field_change   One changed field with left/right values
duplicate_key  Duplicate key detected on one side
schema_change  Added/removed/type/nullable schema change
error          Machine-readable error event for integrations
summary        Last event; counts and metrics

The event stream format name is uniqdiff.jsonl; the schema version is 1.0. The JSON Schema lives at docs/schemas/uniqdiff-jsonl-1.0.schema.json.
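
Events can be checked against that schema. A minimal sketch, assuming the third-party jsonschema package and that the schema validates one event object at a time:

import json
from pathlib import Path

import jsonschema  # third-party: pip install jsonschema

schema = json.loads(
    Path("docs/schemas/uniqdiff-jsonl-1.0.schema.json").read_text(encoding="utf-8")
)
for line in Path("diff.jsonl").read_text(encoding="utf-8").splitlines():
    # Raises jsonschema.ValidationError on the first non-conforming event.
    jsonschema.validate(json.loads(line), schema)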

Parquet:

pip install "uniqdiff[parquet]"
from uniqdiff import compare_files

result = compare_files(
    "old.parquet",
    "new.parquet",
    format="parquet",
    key="id",
    columns=("id", "status"),
)

Comparison

Need                   | uniqdiff                      | pandas             | DuckDB                    | csv-diff
Simple keyed row diff  | Built in                      | Custom merge code  | SQL query                 | Built in
Field-level diff       | Built in                      | Custom code        | SQL/custom code           | Partial
Schema drift check     | Built in                      | Custom code        | SQL/custom code           | No
Low-memory output      | JSONL/CSV streaming           | Manual             | Possible with SQL/export  | Limited
No SQL required        | Yes                           | Yes                | No                        | Yes
Parquet support        | Optional extra                | Yes                | Yes                       | No/limited
CI-friendly CLI        | Yes                           | No                 | Partial                   | Yes
Large output handling  | Designed for file-result mode | Often materialized | Good with query design    | Less scalable

Choose uniqdiff when the job is comparison.

Choose pandas when the job is dataframe transformation.

Choose DuckDB when the team wants SQL and query planning.

Choose csv-diff for small CSV-only diffs.

How it works

uniqdiff is an engine layer:

input source
  -> reader / connector
  -> key extraction and normalization
  -> backend selection
  -> exact comparison
  -> memory result or streaming file result

Backends:

  • memory: fastest path for small and medium datasets.
  • sqlite: disk-backed exact comparison with no optional dependency.
  • hash_partition: splits data into hash partitions for large comparisons.
  • external_sort: sorts chunks on disk and merges token streams.
  • auto: chooses a backend from input/options metadata.

Result modes:

  • memory: returns Python result objects.
  • file: streams JSONL/CSV output and returns stats/metadata.

Engine boundaries:

  • uniqdiff does comparison.
  • Reports, dashboards, workflow orchestration, cloud connector management, and business rules belong in higher-level tools such as UniqTools.

Limitations

  • Memory mode is fastest but requires data/results to fit in RAM.
  • Disk mode is slower because it uses temporary storage.
  • File-result mode avoids materializing output rows, but output size still affects runtime.
  • Field-level diff currently indexes the second input by key.
  • Schema inference is based on observed values, not database DDL.
  • Parquet requires pyarrow via uniqdiff[parquet].
  • Fuzzy matching and Bloom filters are helper APIs, not replacements for exact diff.
  • Benchmark results are workload-dependent; run the included benchmark suite on your data shape.

Call to action

If uniqdiff helps you compare datasets with less code and safer memory behavior, star the repository.

Stars help more data engineers and backend engineers find the project, and they make it easier to keep improving the engine.
