[DRAFT] Transparent Compression #133

Open
Erotemic wants to merge 4 commits into evaleval:main from Erotemic:dev/compression

Erotemic commented May 8, 2026

This PR adds transparent compression support for aggregate and sample result files. Compressed variants of <uuid>.json and <uuid>_samples.jsonl are now treated as equivalents of the plain files for supported codecs: .gz, .zst, .bz2, .xz, and .lz4.

Changes:

  • Add every_eval_ever/io.py with compression suffix detection, transparent openers, result discovery, and duplicate-variant resolution.
  • Route validation and duplicate-entry checks through the new IO helpers.
  • Add a duplicate_variant validation error when both plain and compressed variants of the same logical file are present.
  • Add writer-side --compress, --compress-aggregate, and --compress-samples flags.
  • Compute recorded checksums over the actual on-disk bytes.
  • Document compression behavior and add optional [zst] and [lz4] codec extras.

Tests:

  • Validate compressed files through expand_paths and validate_file.
  • Ensure compressed files with invalid payloads still fail validation.
  • Cover duplicate compressed/plain variants as CI-failing errors.
  • Confirm aggregate and sample files do not false-positive as duplicates.
  • Ensure missing optional codecs produce typed errors instead of crashes.
  • Verify writer-side compression flags and override behavior.

Defaults remain unchanged: outputs are still uncompressed unless explicitly requested.

Out of scope:
The manifest-generation and viewer-Parquet scripts also need this resolver, but they live in EEE_datastore. This PR keeps the package-level compression support isolated and leaves datastore integration for a companion PR.

NOTE:

I would not be comfortable merging this until we have CI running. I have not tested this robustly yet.

Erotemic added 4 commits May 7, 2026 21:58
New module ``every_eval_ever/io.py`` provides a single source of truth
for opening EEE result files (``<uuid>.json`` aggregates and
``<uuid>_samples.jsonl`` per-instance samples) regardless of compression.

Recognized codecs match the HuggingFace Hub's documented set
(https://huggingface.co/docs/hub/en/datasets-adding#file-formats):

  .gz, .zst, .bz2, .xz, .lz4

``.zip`` is intentionally excluded — it is an archive container, not a
stream codec, and would conflict with the duplicate-variant rule.

Helpers:

* ``open_eee_text(path, mode)``        — text I/O with codec auto-detection
* ``is_eee_result(path)``              — 'aggregate' / 'samples' / None
* ``eee_uuid_stem(path)``              — strip kind+codec suffixes
* ``detect_compression(path)``         — name of the trailing codec
* ``add_compression_suffix(path, cs)`` — synthesize the on-disk filename
* ``iter_eee_results(roots)``          — recursive discovery of EEE files
* ``find_duplicate_variants(paths)``   — enforces 'one variant per
                                         (folder, uuid, kind)' rule
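
The suffix-based helpers above can be illustrated with a minimal sketch. It is not the code in ``every_eval_ever/io.py``: the function names are borrowed from the list above, but the bodies, the ``CODEC_SUFFIXES`` table, and the ``_strip_codec`` helper are assumptions made for illustration, and it uses strict ``_samples.jsonl`` matching only.

```python
from pathlib import PurePath
from typing import Optional

# Codec suffixes named in the commit message; this lookup table is an assumption.
CODEC_SUFFIXES = {".gz": "gz", ".zst": "zst", ".bz2": "bz2", ".xz": "xz", ".lz4": "lz4"}


def detect_compression(path) -> Optional[str]:
    """Return the codec name for a trailing codec suffix, or None for plain files."""
    return CODEC_SUFFIXES.get(PurePath(path).suffix.lower())


def _strip_codec(name: str) -> str:
    """Hypothetical helper: drop one trailing codec suffix if present."""
    for suffix in CODEC_SUFFIXES:
        if name.lower().endswith(suffix):
            return name[: -len(suffix)]
    return name


def is_eee_result(path) -> Optional[str]:
    """Classify a filename as 'aggregate', 'samples', or None (not a result file)."""
    name = _strip_codec(PurePath(path).name)
    if name.endswith("_samples.jsonl"):
        return "samples"
    if name.endswith(".json"):
        return "aggregate"
    return None


def eee_uuid_stem(path) -> Optional[str]:
    """Strip the kind and codec suffixes, leaving the logical stem."""
    name = _strip_codec(PurePath(path).name)
    for tail in ("_samples.jsonl", ".json"):
        if name.endswith(tail):
            return name[: -len(tail)]
    return None
```

Under these assumptions, ``abc.json`` and ``abc.json.gz`` map to the same ``(stem, kind)`` pair, which is the property the duplicate-variant rule relies on, while ``abc.zip`` is not classified at all.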

``zstandard`` and ``lz4`` are optional runtime dependencies; the open
helper raises ``ImportError`` with an actionable message pointing at
the right ``[zst]`` / ``[lz4]`` extra. The stdlib codecs (.gz, .bz2,
.xz) work out of the box.
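
A transparent opener along these lines might look like the sketch below. The name ``open_text`` and the error wording are hypothetical, and the real ``open_eee_text`` may be structured differently; the sketch only shows the general pattern of dispatching on the trailing suffix and importing the third-party codecs lazily.

```python
import bz2
import gzip
import lzma
from pathlib import Path


def open_text(path, mode="rt", encoding="utf-8"):
    """Open a plain or compressed file in text mode, chosen by its suffix.

    ``mode`` is expected to be a text mode such as "rt" or "wt".
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".gz":
        return gzip.open(path, mode, encoding=encoding)
    if suffix == ".bz2":
        return bz2.open(path, mode, encoding=encoding)
    if suffix == ".xz":
        return lzma.open(path, mode, encoding=encoding)
    if suffix == ".zst":
        try:
            import zstandard  # optional dependency, imported only when needed
        except ImportError as exc:
            raise ImportError(
                "Handling .zst files needs the optional 'zstandard' package "
                "(the [zst] extra)."
            ) from exc
        return zstandard.open(path, mode, encoding=encoding)
    if suffix == ".lz4":
        try:
            import lz4.frame  # optional dependency, imported only when needed
        except ImportError as exc:
            raise ImportError(
                "Handling .lz4 files needs the optional 'lz4' package "
                "(the [lz4] extra)."
            ) from exc
        return lz4.frame.open(path, mode, encoding=encoding)
    # No recognized codec suffix: treat the file as plain text.
    return open(path, mode, encoding=encoding)
```

Because the optional imports happen inside the branch, a missing ``zstandard`` or ``lz4`` install only matters when a file with that suffix is actually opened.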

This commit only adds the module + unit tests. Subsequent commits wire
the rest of the package (validate, check_duplicate_entries, converters,
CLI flags) through these helpers.

53 passed, 2 skipped (lz4-codec tests skip when ``lz4`` is not
installed; same for ``zstandard`` if absent).

Wire ``validate.py`` and ``check_duplicate_entries.py`` through the
``every_eval_ever.io`` helpers so compressed result files
(``.gz``, ``.zst``, ``.bz2``, ``.xz``, ``.lz4``) are validated
identically to their plain counterparts.

Validator changes
-----------------

* ``validate_aggregate`` and ``validate_instance_file`` open via
  ``io.open_eee_text`` rather than ``Path.open`` / ``Path.read_text``.
* ``validate_file`` dispatches on ``io.is_eee_result(path)`` rather
  than ``path.suffix == '.json' / '.jsonl'``. Compressed forms are
  recognized via the suffix-stripping detector; non-EEE filenames
  (``.zip``, ``.parquet``, plain ``.csv``) report
  ``unsupported_extension`` as before.
* ``expand_paths`` enumerates EEE result files via
  ``io.iter_eee_results`` for directory inputs. Files passed
  explicitly are accepted regardless of suffix and let
  ``validate_file`` produce the unsupported-extension error.
* New ``_duplicate_variant_reports`` produces synthetic
  ``ValidationReport``s with a ``duplicate_variant`` error type for
  every ``(folder, uuid_stem, kind)`` group with more than one
  physical variant. ``validate.main`` runs this check unconditionally
  before per-file validation, so a CI/PR-bot path cannot silently
  green-light a folder containing both ``abc.json`` and ``abc.json.gz``.
* Missing optional codec dependencies surface as ``codec_unavailable``
  validation errors rather than crashing the whole run.
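
The duplicate-variant rule described above amounts to grouping paths by ``(folder, uuid_stem, kind)`` and flagging any group with more than one member. The following is a rough sketch under that reading, not the PR's implementation: ``logical_key`` is a hypothetical helper, and the real check returns ``ValidationReport`` objects rather than a dict.

```python
from collections import defaultdict
from pathlib import Path

# Codec suffixes named in the PR description.
CODECS = (".gz", ".zst", ".bz2", ".xz", ".lz4")


def logical_key(path):
    """Hypothetical helper: map a path to (folder, uuid_stem, kind), or None."""
    p = Path(path)
    name = p.name
    for codec in CODECS:
        if name.endswith(codec):
            name = name[: -len(codec)]
            break
    if name.endswith("_samples.jsonl"):
        return (p.parent, name[: -len("_samples.jsonl")], "samples")
    if name.endswith(".json"):
        return (p.parent, name[: -len(".json")], "aggregate")
    return None


def find_duplicate_variants(paths):
    """Return {key: [paths]} for every logical file stored in more than one form."""
    groups = defaultdict(list)
    for path in paths:
        key = logical_key(path)
        if key is not None:
            groups[key].append(Path(path))
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


dups = find_duplicate_variants(
    ["run/abc.json", "run/abc.json.gz", "run/abc_samples.jsonl", "other/abc.json"]
)
```

Here only ``run/abc.json`` and ``run/abc.json.gz`` collide. The samples file in the same folder has a different kind, and ``other/abc.json`` lives in a different folder, so neither is reported.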

check_duplicate_entries changes
-------------------------------

* ``expand_paths`` recognizes compressed aggregate files; explicit
  file inputs that aren't aggregate-shaped are skipped (preserving
  the historical "ignore non-JSON" behaviour for non-EEE content).
* The aggregate-payload reader uses ``io.open_eee_text``.
  ``_samples.jsonl`` files are intentionally excluded from this
  command's discovery — duplicate detection runs over aggregate
  metadata only.

io.py refinement
----------------

* ``is_eee_result`` and ``eee_uuid_stem`` are lenient about the
  ``_samples`` suffix on per-instance files: bare ``*.jsonl`` is
  still recognized as samples (matching what lm-eval emits when no
  file UUID is supplied). Strict ``_samples.jsonl`` matching is still
  preferred for stem extraction.

Tests
-----

* New ``tests/test_validate_compression.py`` covers: same fixture in
  plain + ``.gz`` form yields identical validation outcomes for both
  aggregate and samples; ``expand_paths`` finds compressed files;
  duplicate-variant reports fire and CI-gate appropriately;
  distinct kinds in the same folder don't false-positive; missing
  optional codec surfaces as a typed error rather than an exception.

213 passed, 5 skipped (4 require optional codec deps, 1 is unrelated
upstream skip).

Threads compression through the converter writer paths. Defaults to
``--compress none`` everywhere for full backwards compatibility — this
commit only enables compression on opt-in.

CLI flags (added to all four ``convert`` subcommands)
-----------------------------------------------------

* ``--compress {none|gz|zst|bz2|xz|lz4}``
    Default codec for both aggregate and per-instance files.
* ``--compress-aggregate {none|gz|zst|bz2|xz|lz4}``
    Override for aggregate ``<uuid>.json`` only. Defaults to
    ``--compress``'s value. Recommended setting: ``none``, since HF and
    GitHub render uncompressed JSON inline for spot-checking.
* ``--compress-samples {none|gz|zst|bz2|xz|lz4}``
    Override for per-instance ``<uuid>_samples.jsonl`` only. Recommended
    setting for any submission going to a public store: ``gz`` (5–15x
    size reduction with no UX cost — the samples files are too large
    for web preview anyway).

Argparse rejects unsupported codecs at parse time. Optional codec deps
(``zstandard``, ``lz4``) trigger ``ImportError`` with an actionable
message at write time when the user picks the codec but hasn't
installed the extra.
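
The flag semantics above (a global default plus per-kind overrides) can be sketched with plain ``argparse``. This is an illustration only: ``build_parser`` and ``resolve_compression`` are hypothetical names, and the signature of the PR's own ``_resolve_compression`` is not shown in the commit message.

```python
import argparse

CODEC_CHOICES = ("none", "gz", "zst", "bz2", "xz", "lz4")


def build_parser():
    """Build a stand-in parser carrying only the three compression flags."""
    parser = argparse.ArgumentParser(prog="convert-sketch")
    parser.add_argument("--compress", choices=CODEC_CHOICES, default="none")
    # None means "not given", so the per-kind flags fall back to --compress.
    parser.add_argument("--compress-aggregate", choices=CODEC_CHOICES, default=None)
    parser.add_argument("--compress-samples", choices=CODEC_CHOICES, default=None)
    return parser


def resolve_compression(args):
    """Return (aggregate_codec, samples_codec) after applying the overrides."""
    aggregate = args.compress if args.compress_aggregate is None else args.compress_aggregate
    samples = args.compress if args.compress_samples is None else args.compress_samples
    return aggregate, samples


parser = build_parser()
recommended = resolve_compression(parser.parse_args(["--compress-samples", "gz"]))
```

With only ``--compress-samples gz`` given, the aggregate stays uncompressed and the samples file is gzipped, which matches the recommended setting for public submissions. Using ``choices=`` is what makes argparse reject an unsupported codec at parse time.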

Writer-side plumbing
--------------------

* ``cli._write_log`` accepts ``compression``; output filename gets the
  codec suffix appended (``<uuid>.json[.gz]``).
* ``LMEvalInstanceLevelAdapter.transform_and_save`` accepts
  ``compression``; the recorded ``DetailedEvaluationResults.checksum``
  is computed over the on-disk (possibly compressed) bytes so that
  downstream verifiers can validate the file as-stored.
* ``HELMInstanceLevelDataAdapter`` accepts ``compression`` in its
  constructor; the adapter's ``self.path`` reflects the on-disk
  filename including any codec suffix.
* ``cli._cmd_convert_helm`` threads the resolved samples-compression
  via ``metadata_args['samples_compression']`` rather than adding a
  new positional to the long-existing ``transform_from_directory``
  signature.
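
To make the checksum-over-on-disk-bytes point concrete, here is a small sketch that writes a gzip-compressed JSONL file and hashes the stored bytes rather than the decompressed payload. The function name ``write_samples_gz`` is hypothetical, and SHA-256 is an assumption: the commit message does not say which hash the recorded checksum uses.

```python
import gzip
import hashlib
import json
from pathlib import Path


def write_samples_gz(rows, path):
    """Write rows as gzip-compressed JSONL and return a digest of the stored file."""
    path = Path(path)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Hash the bytes exactly as stored, so a verifier can check the file
    # without decompressing it first. SHA-256 is an assumed choice.
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

One consequence of this design is that the checksum identifies a specific stored file, not its logical content: recompressing the same rows (for example with a different codec or compression level) will generally produce a different digest.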

Tests
-----

* ``tests/test_cli_compression.py``: full coverage of
  ``_resolve_compression``, ``_write_log`` round-trips per codec,
  LM-Eval samples writer compression including checksum-on-disk and
  the recorded ``file_path``, and parser-level wiring (defaults,
  per-kind override, invalid codec rejection).
* ``tests/test_cli_inspect_uuid.py``: existing fake-write-log stubs
  updated to accept the new ``compression=`` keyword.

226 passed, 5 skipped. Backwards compatibility preserved across all
existing tests.

* Add ``[zst]`` and ``[lz4]`` optional-dependency groups for the
  third-party codecs not available in the standard library. ``[all]``
  now pulls them in alongside ``[inspect]`` and ``[helm]``.

* README: new subsection under "Data Validation" documenting the
  recognized compressed forms, the duplicate-variant rule, and the
  ``--compress`` / ``--compress-samples`` writer flags. Notes that
  defaults remain uncompressed for backwards compatibility, and
  recommends ``--compress-samples gz`` (not the global ``--compress``)
  for public submissions so aggregate JSON stays browsable on the HF
  web UI.

226 passed, 5 skipped.