[DRAFT] Transparent Compression by Erotemic · Pull Request #133 · evaleval/every_eval_ever

Erotemic · 2026-05-08T02:01:54Z

This PR adds transparent compression support for aggregate and sample result files. Compressed variants of <uuid>.json and <uuid>_samples.jsonl are now treated as equivalents of the plain files for supported codecs: .gz, .zst, .bz2, .xz, and .lz4.

Changes:

Add every_eval_ever/io.py with compression suffix detection, transparent openers, result discovery, and duplicate-variant resolution.
Route validation and duplicate-entry checks through the new IO helpers.
Add a duplicate_variant validation error when both plain and compressed variants of the same logical file are present.
Add writer-side --compress, --compress-aggregate, and --compress-samples flags.
Compute recorded checksums over the actual on-disk bytes.
Document compression behavior and add optional [zst] and [lz4] codec extras.

Tests:

Validate compressed files through expand_paths and validate_file.
Ensure compressed files with invalid payloads still fail validation.
Cover duplicate compressed/plain variants as CI-failing errors.
Confirm aggregate and sample files do not false-positive as duplicates.
Ensure missing optional codecs produce typed errors instead of crashes.
Verify writer-side compression flags and override behavior.

Defaults remain unchanged: outputs are still uncompressed unless explicitly requested.

Out of scope:
The manifest-generation and viewer-Parquet scripts also need this resolver, but they live in EEE_datastore. This PR keeps the package-level compression support isolated and leaves datastore integration for a companion PR.

NOTE:

I would not be comfortable merging this until we have CI running. I have not tested this robustly yet.

New module ``every_eval_ever/io.py`` provides a single source of truth for opening EEE result files (``<uuid>.json`` aggregates and ``<uuid>_samples.jsonl`` per-instance samples) regardless of compression. Recognized codecs match the HuggingFace Hub's documented set (https://huggingface.co/docs/hub/en/datasets-adding#file-formats): .gz, .zst, .bz2, .xz, .lz4 ``.zip`` is intentionally excluded — it is an archive container, not a stream codec, and would conflict with the duplicate-variant rule. Helpers: * ``open_eee_text(path, mode)`` — text I/O with codec auto-detection * ``is_eee_result(path)`` — 'aggregate' / 'samples' / None * ``eee_uuid_stem(path)`` — strip kind+codec suffixes * ``detect_compression(path)`` — name of the trailing codec * ``add_compression_suffix(path, cs)`` — synthesize the on-disk filename * ``iter_eee_results(roots)`` — recursive discovery of EEE files * ``find_duplicate_variants(paths)`` — enforces 'one variant per (folder, uuid, kind)' rule ``zstandard`` and ``lz4`` are optional runtime dependencies; the open helper raises ``ImportError`` with an actionable message pointing at the right ``[zst]`` / ``[lz4]`` extra. The stdlib codecs (.gz, .bz2, .xz) work out of the box. This commit only adds the module + unit tests. Subsequent commits wire the rest of the package (validate, check_duplicate_entries, converters, CLI flags) through these helpers. 53 passed, 2 skipped (lz4-codec tests skip when ``lz4`` is not installed; same for ``zstandard`` if absent).

Wire ``validate.py`` and ``check_duplicate_entries.py`` through the ``every_eval_ever.io`` helpers so compressed result files (``.gz``, ``.zst``, ``.bz2``, ``.xz``, ``.lz4``) are validated identically to their plain counterparts. Validator changes ----------------- * ``validate_aggregate`` and ``validate_instance_file`` open via ``io.open_eee_text`` rather than ``Path.open`` / ``Path.read_text``. * ``validate_file`` dispatches on ``io.is_eee_result(path)`` rather than ``path.suffix == '.json' / '.jsonl'``. Compressed forms are recognized via the suffix-stripping detector; non-EEE filenames (``.zip``, ``.parquet``, plain ``.csv``) report ``unsupported_extension`` as before. * ``expand_paths`` enumerates EEE result files via ``io.iter_eee_results`` for directory inputs. Files passed explicitly are accepted regardless of suffix and let ``validate_file`` produce the unsupported-extension error. * New ``_duplicate_variant_reports`` produces synthetic ``ValidationReport``s with a ``duplicate_variant`` error type for every ``(folder, uuid_stem, kind)`` group with more than one physical variant. ``validate.main`` runs this check unconditionally before per-file validation, so a CI/PR-bot path cannot silently green-light a folder containing both ``abc.json`` and ``abc.json.gz``. * Missing optional codec dependencies surface as ``codec_unavailable`` validation errors rather than crashing the whole run. check_duplicate_entries changes ------------------------------- * ``expand_paths`` recognizes compressed aggregate files; explicit file inputs that aren't aggregate-shaped are skipped (preserving the historical "ignore non-JSON" behaviour for non-EEE content). * The aggregate-payload reader uses ``io.open_eee_text``. ``_samples.jsonl`` files are intentionally excluded from this command's discovery — duplicate detection runs over aggregate metadata only. io.py refinement ---------------- * ``is_eee_result`` and ``eee_uuid_stem`` are lenient about the ``_samples`` prefix on per-instance files: bare ``*.jsonl`` is still recognized as samples (matching what lm-eval emits when no file UUID is supplied). Strict ``_samples.jsonl`` matching is still preferred for stem extraction. Tests ----- * New ``tests/test_validate_compression.py`` covers: same fixture in plain + ``.gz`` form yields identical validation outcomes for both aggregate and samples; ``expand_paths`` finds compressed files; duplicate-variant reports fire and CI-gate appropriately; distinct kinds in the same folder don't false-positive; missing optional codec surfaces as a typed error rather than an exception. 213 passed, 5 skipped (4 require optional codec deps, 1 is unrelated upstream skip).

Threads compression through the converter writer paths. Defaults to ``--compress none`` everywhere for full backwards compatibility — this commit only enables compression on opt-in. CLI flags (added to all four ``convert`` subcommands) ----------------------------------------------------- * ``--compress {none|gz|zst|bz2|xz|lz4}`` Default codec for both aggregate and per-instance files. * ``--compress-aggregate {none|gz|zst|bz2|xz|lz4}`` Override for aggregate ``<uuid>.json`` only. Defaults to ``--compress``'s value. Recommended setting: ``none``, since HF and GitHub render uncompressed JSON inline for spot-checking. * ``--compress-samples {none|gz|zst|bz2|xz|lz4}`` Override for per-instance ``<uuid>_samples.jsonl`` only. Recommended setting for any submission going to a public store: ``gz`` (5–15x size reduction with no UX cost — the samples files are too large for web preview anyway). Argparse rejects unsupported codecs at parse time. Optional codec deps (``zstandard``, ``lz4``) trigger ``ImportError`` with an actionable message at write time when the user picks the codec but hasn't installed the extra. Writer-side plumbing -------------------- * ``cli._write_log`` accepts ``compression``; output filename gets the codec suffix appended (``<uuid>.json[.gz]``). * ``LMEvalInstanceLevelAdapter.transform_and_save`` accepts ``compression``; the recorded ``DetailedEvaluationResults.checksum`` is computed over the on-disk (possibly compressed) bytes so that downstream verifiers can validate the file as-stored. * ``HELMInstanceLevelDataAdapter`` accepts ``compression`` in its constructor; the adapter's ``self.path`` reflects the on-disk filename including any codec suffix. * ``cli._cmd_convert_helm`` threads the resolved samples-compression via ``metadata_args['samples_compression']`` rather than adding a new positional to the long-existing ``transform_from_directory`` signature. Tests ----- * ``tests/test_cli_compression.py``: full coverage of ``_resolve_compression``, ``_write_log`` round-trips per codec, LM-Eval samples writer compression including checksum-on-disk and the recorded ``file_path``, and parser-level wiring (defaults, per-kind override, invalid codec rejection). * ``tests/test_cli_inspect_uuid.py``: existing fake-write-log stubs updated to accept the new ``compression=`` keyword. 226 passed, 5 skipped. Backwards compatibility preserved across all existing tests.

* Add ``[zst]`` and ``[lz4]`` optional-dependency groups for the third-party codecs not available in the standard library. ``[all]`` now pulls them in alongside ``[inspect]`` and ``[helm]``. * README: new subsection under "Data Validation" documenting the recognized compressed forms, the duplicate-variant rule, and the ``--compress`` / ``--compress-samples`` writer flags. Notes that defaults remain uncompressed for backwards compatibility, and recommends ``--compress-samples gz`` (not the global ``--compress``) for public submissions so aggregate JSON stays browsable on the HF web UI. 226 passed, 5 skipped.

Erotemic added 4 commits May 7, 2026 21:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DRAFT] Transparent Compression#133

[DRAFT] Transparent Compression#133
Erotemic wants to merge 4 commits into
evaleval:mainfrom
Erotemic:dev/compression

Erotemic commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Erotemic commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant