[DRAFT] Transparent Compression #133
Open
Erotemic wants to merge 4 commits into
New module ``every_eval_ever/io.py`` provides a single source of truth for
opening EEE result files (``<uuid>.json`` aggregates and
``<uuid>_samples.jsonl`` per-instance samples) regardless of compression.

Recognized codecs match the HuggingFace Hub's documented set
(https://huggingface.co/docs/hub/en/datasets-adding#file-formats):
``.gz``, ``.zst``, ``.bz2``, ``.xz``, ``.lz4``.

``.zip`` is intentionally excluded: it is an archive container, not a stream
codec, and would conflict with the duplicate-variant rule.

Helpers:

* ``open_eee_text(path, mode)``: text I/O with codec auto-detection
* ``is_eee_result(path)``: 'aggregate' / 'samples' / None
* ``eee_uuid_stem(path)``: strip kind+codec suffixes
* ``detect_compression(path)``: name of the trailing codec
* ``add_compression_suffix(path, cs)``: synthesize the on-disk filename
* ``iter_eee_results(roots)``: recursive discovery of EEE files
* ``find_duplicate_variants(paths)``: enforces the 'one variant per
  (folder, uuid, kind)' rule

``zstandard`` and ``lz4`` are optional runtime dependencies; the open helper
raises ``ImportError`` with an actionable message pointing at the right
``[zst]`` / ``[lz4]`` extra. The stdlib codecs (``.gz``, ``.bz2``, ``.xz``)
work out of the box.

This commit only adds the module + unit tests. Subsequent commits wire the
rest of the package (validate, check_duplicate_entries, converters, CLI
flags) through these helpers.

53 passed, 2 skipped (lz4-codec tests skip when ``lz4`` is not installed;
same for ``zstandard`` if absent).
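To illustrate the suffix-based detection described above, here is a minimal sketch restricted to the stdlib codecs. It is not the code in ``every_eval_ever/io.py``: the names ``classify`` and ``open_text`` are hypothetical stand-ins for ``is_eee_result`` and ``open_eee_text``, whose real signatures and behaviour may differ.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Assumption: only the stdlib codecs are wired up in this sketch.
_STDLIB_OPENERS = {"gz": gzip.open, "bz2": bz2.open, "xz": lzma.open}
_KNOWN_CODECS = {"gz", "zst", "bz2", "xz", "lz4"}


def detect_compression(path):
    """Return the trailing codec name without the dot, or None."""
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix if suffix in _KNOWN_CODECS else None


def classify(path):
    """Return 'aggregate', 'samples', or None after stripping one codec suffix."""
    name = Path(path).name
    if detect_compression(name) is not None:
        name = name[: name.rfind(".")]
    if name.endswith(".jsonl"):
        return "samples"
    if name.endswith(".json"):
        return "aggregate"
    return None


def open_text(path, mode="r"):
    """Open a plain or stdlib-compressed file in UTF-8 text mode."""
    codec = detect_compression(path)
    if codec is None:
        return open(path, mode, encoding="utf-8")
    opener = _STDLIB_OPENERS.get(codec)
    if opener is None:
        raise NotImplementedError(f"codec {codec!r} is not covered by this sketch")
    # The compressed openers default to binary, so request text mode explicitly.
    text_mode = mode if "t" in mode else mode + "t"
    return opener(path, text_mode, encoding="utf-8")
```

With this shape, ``abc.json.gz`` classifies as an aggregate and ``abc.zip`` as neither, which mirrors the exclusion of ``.zip`` noted above.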
Wire ``validate.py`` and ``check_duplicate_entries.py`` through the
``every_eval_ever.io`` helpers so compressed result files (``.gz``, ``.zst``,
``.bz2``, ``.xz``, ``.lz4``) are validated identically to their plain
counterparts.

Validator changes
-----------------
* ``validate_aggregate`` and ``validate_instance_file`` open via
  ``io.open_eee_text`` rather than ``Path.open`` / ``Path.read_text``.
* ``validate_file`` dispatches on ``io.is_eee_result(path)`` rather than
  ``path.suffix == '.json' / '.jsonl'``. Compressed forms are recognized via
  the suffix-stripping detector; non-EEE filenames (``.zip``, ``.parquet``,
  plain ``.csv``) report ``unsupported_extension`` as before.
* ``expand_paths`` enumerates EEE result files via ``io.iter_eee_results``
  for directory inputs. Files passed explicitly are accepted regardless of
  suffix and let ``validate_file`` produce the unsupported-extension error.
* New ``_duplicate_variant_reports`` produces synthetic
  ``ValidationReport``s with a ``duplicate_variant`` error type for every
  ``(folder, uuid_stem, kind)`` group with more than one physical variant.
  ``validate.main`` runs this check unconditionally before per-file
  validation, so a CI/PR-bot path cannot silently green-light a folder
  containing both ``abc.json`` and ``abc.json.gz``.
* Missing optional codec dependencies surface as ``codec_unavailable``
  validation errors rather than crashing the whole run.

check_duplicate_entries changes
-------------------------------
* ``expand_paths`` recognizes compressed aggregate files; explicit file
  inputs that aren't aggregate-shaped are skipped (preserving the historical
  "ignore non-JSON" behaviour for non-EEE content).
* The aggregate-payload reader uses ``io.open_eee_text``. ``_samples.jsonl``
  files are intentionally excluded from this command's discovery, because
  duplicate detection runs over aggregate metadata only.
io.py refinement
----------------
* ``is_eee_result`` and ``eee_uuid_stem`` are lenient about the ``_samples``
  prefix on per-instance files: bare ``*.jsonl`` is still recognized as
  samples (matching what lm-eval emits when no file UUID is supplied).
  Strict ``_samples.jsonl`` matching is still preferred for stem extraction.

Tests
-----
* New ``tests/test_validate_compression.py`` covers: same fixture in plain +
  ``.gz`` form yields identical validation outcomes for both aggregate and
  samples; ``expand_paths`` finds compressed files; duplicate-variant reports
  fire and CI-gate appropriately; distinct kinds in the same folder don't
  false-positive; missing optional codec surfaces as a typed error rather
  than an exception.

213 passed, 5 skipped (4 require optional codec deps, 1 is unrelated upstream
skip).
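The 'one variant per (folder, uuid, kind)' rule can be pictured as a group-by over a logical key. The sketch below is an illustration under assumptions, not the module's ``find_duplicate_variants``: ``logical_key`` and ``find_duplicates`` are hypothetical names, and the real return type is not stated in this PR.

```python
from collections import defaultdict
from pathlib import Path

CODEC_SUFFIXES = (".gz", ".zst", ".bz2", ".xz", ".lz4")


def logical_key(path):
    """Return (folder, uuid_stem, kind) for an EEE-style path, or None."""
    p = Path(path)
    name = p.name
    # Strip at most one trailing codec suffix.
    for codec in CODEC_SUFFIXES:
        if name.endswith(codec):
            name = name[: -len(codec)]
            break
    # Prefer the strict ``_samples.jsonl`` form, then fall back to bare ``.jsonl``.
    if name.endswith("_samples.jsonl"):
        return (p.parent, name[: -len("_samples.jsonl")], "samples")
    if name.endswith(".jsonl"):
        return (p.parent, name[: -len(".jsonl")], "samples")
    if name.endswith(".json"):
        return (p.parent, name[: -len(".json")], "aggregate")
    return None


def find_duplicates(paths):
    """Map each logical key to its physical variants when there is more than one."""
    groups = defaultdict(list)
    for path in paths:
        key = logical_key(path)
        if key is not None:
            groups[key].append(Path(path))
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}
```

Because the kind is part of the key, ``abc.json`` and ``abc_samples.jsonl`` in the same folder do not collide, while ``abc.json`` and ``abc.json.gz`` do.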
Threads compression through the converter writer paths. Defaults to
``--compress none`` everywhere for full backwards compatibility; this
commit enables compression only on opt-in.
CLI flags (added to all four ``convert`` subcommands)
-----------------------------------------------------
* ``--compress {none|gz|zst|bz2|xz|lz4}``
  Default codec for both aggregate and per-instance files.
* ``--compress-aggregate {none|gz|zst|bz2|xz|lz4}``
  Override for aggregate ``<uuid>.json`` only. Defaults to
  ``--compress``'s value. Recommended setting: ``none``, since HF and
  GitHub render uncompressed JSON inline for spot-checking.
* ``--compress-samples {none|gz|zst|bz2|xz|lz4}``
  Override for per-instance ``<uuid>_samples.jsonl`` only. Recommended
  setting for any submission going to a public store: ``gz`` (5–15x
  size reduction with no UX cost, since the samples files are too large
  for web preview anyway).
Argparse rejects unsupported codecs at parse time. Optional codec deps
(``zstandard``, ``lz4``) trigger ``ImportError`` with an actionable
message at write time when the user picks the codec but hasn't
installed the extra.
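The fallback between the three flags can be sketched with plain ``argparse``. This is an assumed shape, not the package's parser: ``build_parser`` and ``resolve_compression`` are hypothetical names (the PR's tests mention a ``_resolve_compression`` helper, whose actual signature is not shown here).

```python
import argparse

CODECS = ["none", "gz", "zst", "bz2", "xz", "lz4"]


def build_parser():
    """Build a stand-in parser exposing only the three compression flags."""
    parser = argparse.ArgumentParser(prog="convert-sketch")
    parser.add_argument("--compress", choices=CODECS, default="none")
    # None means "not given", so the per-kind flags can fall back to --compress.
    parser.add_argument("--compress-aggregate", choices=CODECS, default=None)
    parser.add_argument("--compress-samples", choices=CODECS, default=None)
    return parser


def resolve_compression(args):
    """Return (aggregate_codec, samples_codec); per-kind flags win over --compress."""
    aggregate = args.compress if args.compress_aggregate is None else args.compress_aggregate
    samples = args.compress if args.compress_samples is None else args.compress_samples
    return aggregate, samples
```

Using ``choices`` is what lets argparse reject an unsupported codec at parse time, before any file is written.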
Writer-side plumbing
--------------------
* ``cli._write_log`` accepts ``compression``; output filename gets the
  codec suffix appended (``<uuid>.json[.gz]``).
* ``LMEvalInstanceLevelAdapter.transform_and_save`` accepts
  ``compression``; the recorded ``DetailedEvaluationResults.checksum``
  is computed over the on-disk (possibly compressed) bytes so that
  downstream verifiers can validate the file as-stored.
* ``HELMInstanceLevelDataAdapter`` accepts ``compression`` in its
  constructor; the adapter's ``self.path`` reflects the on-disk
  filename including any codec suffix.
* ``cli._cmd_convert_helm`` threads the resolved samples-compression
  via ``metadata_args['samples_compression']`` rather than adding a
  new positional to the long-existing ``transform_from_directory``
  signature.
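The checksum-over-on-disk-bytes behaviour can be illustrated with a small writer. This is a sketch, not the adapter code: ``write_samples`` is a hypothetical helper, only ``gz`` is handled, and SHA-256 is an assumption because the PR does not name the hash algorithm.

```python
import gzip
import hashlib
import json
from pathlib import Path


def write_samples(rows, path, compression="none"):
    """Write rows as JSONL, optionally gzip-compressed; return (path, sha256)."""
    path = Path(path)
    if compression == "gz":
        # Append the codec suffix so the filename reflects what is on disk.
        path = path.with_name(path.name + ".gz")
        opener = gzip.open
    elif compression == "none":
        opener = open
    else:
        raise ValueError(f"codec {compression!r} is not covered by this sketch")
    with opener(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Hash the stored bytes, not the decompressed text (SHA-256 is assumed).
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return path, digest
```

Hashing the stored bytes means a verifier can check the file without decompressing it, at the cost that the same logical content has a different checksum under each codec.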
Tests
-----
* ``tests/test_cli_compression.py``: full coverage of
  ``_resolve_compression``, ``_write_log`` round-trips per codec,
  LM-Eval samples writer compression including checksum-on-disk and
  the recorded ``file_path``, and parser-level wiring (defaults,
  per-kind override, invalid codec rejection).
* ``tests/test_cli_inspect_uuid.py``: existing fake-write-log stubs
  updated to accept the new ``compression=`` keyword.
226 passed, 5 skipped. Backwards compatibility preserved across all
existing tests.
* Add ``[zst]`` and ``[lz4]`` optional-dependency groups for the
  third-party codecs not available in the standard library. ``[all]`` now
  pulls them in alongside ``[inspect]`` and ``[helm]``.
* README: new subsection under "Data Validation" documenting the recognized
  compressed forms, the duplicate-variant rule, and the ``--compress`` /
  ``--compress-samples`` writer flags. Notes that defaults remain
  uncompressed for backwards compatibility, and recommends
  ``--compress-samples gz`` (not the global ``--compress``) for public
  submissions so aggregate JSON stays browsable on the HF web UI.

226 passed, 5 skipped.
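One way the optional codecs could be guarded is a lazy import that re-raises with a pointer to the matching extra. The sketch below is an assumption about the approach, not the package's code: ``require_codec`` and the ``_OPTIONAL`` table are hypothetical, and the exact wording of the real error message is not shown in this PR.

```python
import importlib

# Hypothetical table: codec name -> (importable module, extra that provides it).
_OPTIONAL = {
    "zst": ("zstandard", "zst"),
    "lz4": ("lz4.frame", "lz4"),
}


def require_codec(codec):
    """Import the module backing an optional codec, or raise a helpful ImportError."""
    module_name, extra = _OPTIONAL[codec]
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        package = module_name.split(".")[0]
        raise ImportError(
            f"Codec {codec!r} needs the optional {package!r} package; "
            f"install the [{extra}] extra to enable it."
        ) from exc
```

Deferring the import to first use keeps the stdlib codecs working on a minimal install, and a validator could catch this ``ImportError`` and report it as a typed error instead of aborting.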
This PR adds transparent compression support for aggregate and sample result files. Compressed variants of ``<uuid>.json`` and ``<uuid>_samples.jsonl`` are now treated as equivalents of the plain files for supported codecs: ``.gz``, ``.zst``, ``.bz2``, ``.xz``, and ``.lz4``.

Changes:

* ``every_eval_ever/io.py`` with compression suffix detection, transparent openers, result discovery, and duplicate-variant resolution.
* ``duplicate_variant`` validation error when both plain and compressed variants of the same logical file are present.
* ``--compress``, ``--compress-aggregate``, and ``--compress-samples`` flags.
* ``[zst]`` and ``[lz4]`` codec extras.

Tests:

* ``expand_paths`` and ``validate_file``.

Defaults remain unchanged: outputs are still uncompressed unless explicitly requested.

Out of scope:

The manifest-generation and viewer-Parquet scripts also need this resolver, but they live in ``EEE_datastore``. This PR keeps the package-level compression support isolated and leaves datastore integration for a companion PR.

NOTE:

I would not be comfortable merging this until we have CI running. I have not tested this robustly yet.