
perf(bench): add fastcover vs ffi dict-training delta#73

Open
polaz wants to merge 13 commits into main from feat/#25

Conversation

@polaz
Member

@polaz polaz commented Apr 5, 2026

Summary

  • add dict-train benchmark stage to compare_ffi for Rust FastCOVER vs C dictionary training
  • emit and parse REPORT_DICT_TRAIN lines in .github/scripts/run-benchmarks.sh
  • compare Rust/FFI dictionary training on a unified corpus model and report actual sampled training bytes
  • normalize FastCOVER accel consistently in tuning/scoring/output params
  • harden raw COVER builder to propagate I/O errors (io::Result<()>) instead of panicking
  • include dictionary training table in benchmark-report.md and canonical deltas in benchmark-delta.md/json
  • require dict_builder for compare_ffi bench and align benchmark docs
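As a rough illustration of the new reporting stage, a `REPORT_DICT_TRAIN` line could be emitted by the bench and parsed by the runner along these lines. The exact field names and layout are not shown in this PR description, so everything below is an assumption, not the benchmark's real wire format:

```rust
// Hypothetical REPORT_DICT_TRAIN layout: scenario name, per-side training
// times in ms, and the resulting dictionary sizes. Field names are assumed.
fn emit_dict_train(scenario: &str, rust_ms: f64, ffi_ms: f64, rust_bytes: usize, ffi_bytes: usize) -> String {
    format!(
        "REPORT_DICT_TRAIN scenario={scenario} rust_ms={rust_ms:.3} ffi_ms={ffi_ms:.3} \
         rust_dict_bytes={rust_bytes} ffi_dict_bytes={ffi_bytes}"
    )
}

// Parse a line back into (scenario, rust_ms, ffi_ms); returns None on malformed input.
fn parse_dict_train(line: &str) -> Option<(String, f64, f64)> {
    let rest = line.strip_prefix("REPORT_DICT_TRAIN ")?;
    let (mut scenario, mut rust, mut ffi) = (None, None, None);
    for kv in rest.split_whitespace() {
        let (k, v) = kv.split_once('=')?;
        match k {
            "scenario" => scenario = Some(v.to_string()),
            "rust_ms" => rust = v.parse().ok(),
            "ffi_ms" => ffi = v.parse().ok(),
            _ => {} // ignore fields the parser does not need
        }
    }
    Some((scenario?, rust?, ffi?))
}

fn main() {
    let line = emit_dict_train("small-10k-random", 0.079, 0.702, 4096, 4096);
    let (scenario, rust_ms, ffi_ms) = parse_dict_train(&line).unwrap();
    assert_eq!(scenario, "small-10k-random");
    // C-over-Rust ratio, in the same style as the speedup numbers below.
    println!("{scenario}: {:.4}x", ffi_ms / rust_ms);
}
```

The real parser lives in `.github/scripts/run-benchmarks.sh`; this sketch only shows the emit/parse round-trip shape.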

Validation

  • cargo fmt --all -- --check
  • cargo clippy --all-targets --features hash,std,dict_builder -- -D warnings
  • cargo nextest run --workspace --features hash,std,dict_builder
  • cargo build --workspace --features hash,std,dict_builder
  • cargo test --doc --workspace --features hash,std,dict_builder

Same-run numbers (Rust FastCOVER vs C)

  • decodecorpus-z000033: 2.703 ms vs 23.342 ms (8.6371x)
  • small-10k-random: 0.079 ms vs 0.702 ms (8.8451x)
  • small-4k-log-lines: 0.057 ms vs 0.217 ms (3.7868x)

Closes #25

Summary by CodeRabbit

  • New Features

    • Added pure‑Rust FastCover dictionary training, new dict‑training benchmarks comparing Rust vs C FFI, and a dedicated "Dictionary Training" report section.
  • Documentation

    • Expanded docs and README to document the dict_builder feature, dict‑train runs, and the speed‑delta fallback (throughput or ms/iter).
  • Bug Fixes

    • Bench tooling enforces presence of dict‑training data and falls back to ms/iter for delta computation when throughput is unavailable.
  • Tests

    • Added unit tests covering FastCover and dictionary creation edge cases.
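The throughput-or-ms/iter fallback described above can be sketched as follows. The actual logic lives in the shell/Python benchmark runner; this Rust sketch with an illustrative `Row` struct only shows the ratio selection, and the ratio direction is an assumption:

```rust
// Illustrative row of parsed benchmark data; dict-train rows may lack throughput.
struct Row {
    rust_throughput: Option<f64>, // bytes/sec, absent when not reported
    ffi_throughput: Option<f64>,
    rust_ms_per_iter: f64,
    ffi_ms_per_iter: f64,
}

fn speed_delta(row: &Row) -> f64 {
    match (row.rust_throughput, row.ffi_throughput) {
        // Higher throughput is better, so compare Rust/FFI throughput directly...
        (Some(r), Some(f)) if f > 0.0 => r / f,
        // ...and lower ms/iter is better, so the fallback inverts the ratio
        // to keep "larger delta = Rust faster" semantics.
        _ => row.ffi_ms_per_iter / row.rust_ms_per_iter,
    }
}

fn main() {
    let with_tp = Row { rust_throughput: Some(200.0), ffi_throughput: Some(100.0), rust_ms_per_iter: 1.0, ffi_ms_per_iter: 2.0 };
    let without = Row { rust_throughput: None, ffi_throughput: None, rust_ms_per_iter: 1.0, ffi_ms_per_iter: 2.0 };
    assert_eq!(speed_delta(&with_tp), 2.0);
    assert_eq!(speed_delta(&without), 2.0); // both paths agree here by construction
}
```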

Copilot AI review requested due to automatic review settings April 5, 2026 17:40
@coderabbitai

coderabbitai bot commented Apr 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a pure‑Rust FastCOVER trainer and dictionary finalization APIs, adjusts COVER internals, adds Rust vs C-FFI dict‑training benchmarks with a new bench target, and extends the CI benchmark runner/parser to emit and parse REPORT_DICT_TRAIN with a throughput-or-ms/iter fallback.

Changes

Cohort / File(s) Summary
FastCOVER implementation
zstd/src/dictionary/fastcover.rs
New module implementing FastCOVER training, parameter optimization, deterministic hashing, scoring/assembly, edge handling, and unit tests; exports params/tuned types and default candidate lists.
Dictionary API & finalization
zstd/src/dictionary/mod.rs
Added fastcover submodule and re-exports; new public types FastCoverOptions, FinalizeOptions; added finalize_raw_dict, train_fastcover_raw_from_slice, create_fastcover_raw_dict_from_source, create_fastcover_dict_from_source; switched to buffered in-memory training and propagated I/O errors; added tests.
COVER adjustments
zstd/src/dictionary/cover.rs
pick_best_segment now takes epoch: &[u8]; score_segment short‑circuits for too-short segments and fixes k‑mer loop bounds to include final window.
Benchmarks: compare & new target
zstd/benches/compare_ffi.rs, zstd/benches/dict_builder_fastcover.rs, zstd/Cargo.toml
compare_ffi.rs now separates Rust FastCOVER and C FFI training, times both, emits REPORT_DICT_TRAIN, and uses the FFI-trained dict for compression reports; added dict_builder_fastcover.rs benchmark; bench targets gated by required-features = ["dict_builder"].
Benchmark runner & parsing
.github/scripts/run-benchmarks.sh, BENCHMARKS.md
Runner invokes cargo bench --features dict_builder; parser adds DICT_TRAIN_RE to parse REPORT_DICT_TRAIN, records dictionary_training_rows and scenario_training_bytes, requires ≥1 REPORT_DICT_TRAIN (exit 1 if none), uses training bytes for dict-train canonical rows, and falls back from throughput ratio to ms/iter ratio when throughput missing; report texts updated.
Docs / README
README.md, BENCHMARKS.md
Expanded dict_builder capabilities and API descriptions; documented REPORT_DICT_TRAIN and the throughput-or-ms/iter fallback for speed-delta computation.
Tests / minor
zstd/src/encoding/frame_compressor.rs
Test updated to call .expect(...) on dict_builder creation so training errors surface during tests.

Sequence Diagram(s)

sequenceDiagram
  participant Runner as Bench Runner (script)
  participant Bench as Bench Binary
  participant Rust as Rust FastCOVER
  participant CFFI as C FFI Trainer
  participant Parser as Report Parser
  participant Report as Generated Report

  Runner->>Bench: run benches (--features dict_builder)
  Bench->>Rust: train_fastcover_raw_from_slice / optimize\n(measure rust_train_ms, rust_dict_bytes, score)
  Bench->>CFFI: call C trainer (via FFI)\n(measure ffi_train_ms, ffi_dict_bytes)
  Rust-->>Bench: rust_train_ms, rust_dict_bytes, score
  CFFI-->>Bench: ffi_train_ms, ffi_dict_bytes
  Bench-->>Runner: emit REPORT_DICT_TRAIN and other REPORT_* lines
  Runner->>Parser: parse raw output (REPORT_* including REPORT_DICT_TRAIN)
  Parser-->>Report: populate dictionary_training_rows and delta tables\n(use throughput or ms/iter fallback)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰 I hopped through bytes and tuned a score,
I stitched raw bits into a headered door,
Rust ran FastCOVER, C warmed the forge,
Reports counted trains — I nibbled the gorge,
🥕 tiny dict, big crunch, I danced on the floor.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'perf(bench): add fastcover vs ffi dict-training delta' clearly and concisely summarizes the main change: adding a dict-training benchmark stage that compares FastCOVER vs FFI dictionary training.
  • Linked Issues Check: ✅ Passed. The PR fully implements all coding requirements from issue #25: the FastCOVER algorithm with configurable parameters, dictionary finalization with zstd headers, parameter optimization, and benchmark validation that FastCOVER produces matching quality and achieves a ≥5× speedup.
  • Out of Scope Changes Check: ✅ Passed. All changes are directly aligned with issue #25 objectives: the FastCOVER implementation, dictionary finalization, benchmark infrastructure for the dict-training comparison, and documentation updates. No unrelated modifications detected.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.



Comment @coderabbitai help to get the list of available commands and usage tips.


Copilot AI left a comment


Pull request overview

Adds a FastCOVER-based (pure Rust) dictionary-training path and integrates it into the Criterion compare_ffi benchmarks + CI benchmark reporting, so the repo can track Rust-vs-C dictionary training deltas alongside the existing compression/decompression matrix.

Changes:

  • Implement FastCOVER raw dictionary training (plus tuning/optimization scaffolding) and expose new public APIs for training and finalizing dictionaries.
  • Extend compare_ffi to benchmark/report dictionary training (dict-train) and update benchmark parsing/report generation to include REPORT_DICT_TRAIN.
  • Require dict_builder for relevant benches and update docs to reflect the new benchmark stage and dictionary capabilities.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Summary per file:

  • zstd/src/dictionary/mod.rs: Exposes FastCOVER options/APIs, updates the COVER trainer to operate over in-memory input, and adds dictionary finalization logic plus tests.
  • zstd/src/dictionary/fastcover.rs: Adds the FastCOVER raw trainer, optimizer, and unit tests.
  • zstd/src/dictionary/cover.rs: Updates epoch-based segment selection and fixes k-mer scoring bounds for short segments.
  • zstd/Cargo.toml: Gates compare_ffi and the new bench behind dict_builder and adds the dict_builder_fastcover bench target.
  • zstd/benches/dict_builder_fastcover.rs: Adds a dedicated Criterion bench comparing COVER vs FastCOVER raw training.
  • zstd/benches/compare_ffi.rs: Adds the dict-train stage, emits REPORT_DICT_TRAIN, and benchmarks Rust FastCOVER vs the C trainer.
  • README.md: Updates dictionary generation documentation to include FastCOVER and finalization APIs.
  • BENCHMARKS.md: Documents the new dict-train stage and updates benchmark invocation flags.
  • .github/scripts/run-benchmarks.sh: Parses/emits dictionary training tables and adds a speed-delta fallback when throughput is unavailable.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/dictionary/cover.rs`:
- Around line 70-75: The chained method call creating `segments` is misformatted
and failing `cargo fmt --check`; update the expression that builds `segments`
(the `epoch.chunks(params.segment_size as usize).peekable()` chain) to follow
rustfmt style — either place each method on its own indented line with the dot
leading (e.g. `epoch.chunks(...).peekable()` reformatted across lines) or simply
run `cargo fmt` to apply the canonical formatting so the `segments` binding
compiles with rustfmt rules; ensure you adjust the `params.segment_size as
usize` cast placement if needed to satisfy formatting.

In `@zstd/src/dictionary/fastcover.rs`:
- Around line 103-130: The function signature for train_fastcover_raw is
misformatted for rustfmt; reformat it to follow Rust style (wrap parameters
across lines and place the return type and opening brace on their own line) so
cargo fmt passes — e.g. format the pub fn train_fastcover_raw(sample: &[u8],
dict_size: usize, params: FastCoverParams) -> Vec<u8> declaration to a
rustfmt-friendly multi-line signature and keep the body calling
build_raw_dict(sample, dict_size, params) unchanged; running cargo fmt/rustfmt
after the change should verify the fix.

In `@zstd/src/dictionary/mod.rs`:
- Around line 181-184: The code now allocates Vec::with_capacity(source_size)
and calls source.read_to_end(&mut all), which forces the entire training corpus
into memory; revert to a streaming approach or explicitly document the memory
constraint. Either replace read_to_end with a BufReader-based streaming read
(using BufReader around source and processing chunks iteratively instead of
collecting into all) or add a comment/Function-level documentation next to the
Vec/all/read_to_end code that states large corpora will be fully buffered and
recommend using a streaming API; reference the symbols Vec::with_capacity, all,
source.read_to_end, source_size and BufReader when making the change.
- Around line 390-439: Both create_fastcover_raw_dict_from_source and
create_fastcover_dict_from_source duplicate reading the source, empty-check, and
branching on fastcover.optimize; extract the shared training logic into a helper
(e.g., train_fastcover_internal(sample: &[u8], dict_size: usize, options:
&FastCoverOptions) -> (Vec<u8>, FastCoverTuned)) that calls
fastcover::optimize_fastcover_raw or fastcover::train_fastcover_raw and returns
(raw_dict, tuned), then have both public functions read and validate the source,
call train_fastcover_internal, call finalize_raw_dict(final_raw.as_slice(),
sample.as_slice(), dict_size, finalize) as before, write output, and return the
tuned struct.
- Around line 30-50: Reorder the use/import block so crate-internal modules are
grouped and sorted consistently (e.g., BitWriter,
decoding::dictionary::MAGIC_NUM, fse::fse_encoder, huff0::HuffmanTable,
huff0::huff0_encoder::{HuffmanEncoder, HuffmanTable},
dictionary::reservoir::create_sample, cover::*, fastcover::... ) and place std
imports (boxed::Box, collections::..., fs::..., io::..., path::..., vec::Vec) in
their own group; move the create_sample import into the crate-internal group,
run cargo fmt to apply rustfmt ordering rules, and ensure symbols like
BitWriter, DICT_MAGIC_NUM, fse_encoder, HuffmanDecoderTable, HuffmanEncoder,
create_sample, and DEFAULT_D_CANDIDATES remain correctly referenced after
reordering.
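The read_to_end concern above can be addressed with a BufReader-based chunked read. This is a minimal sketch under stated assumptions: `consume_chunk` stands in for whatever per-chunk processing the trainer would do and is not the crate's real API:

```rust
use std::io::{BufReader, Read, Result};

// Streaming alternative to read_to_end: feed fixed-size chunks to the caller
// instead of buffering the whole training corpus in memory at once.
fn train_streaming<R: Read>(source: R, mut consume_chunk: impl FnMut(&[u8])) -> Result<u64> {
    let mut reader = BufReader::new(source);
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        consume_chunk(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}

fn main() -> Result<()> {
    let corpus: &[u8] = &[7u8; 200_000];
    let mut seen = 0usize;
    let total = train_streaming(corpus, |chunk| seen += chunk.len())?;
    assert_eq!(total, 200_000);
    assert_eq!(seen, 200_000);
    Ok(())
}
```

Whether streaming is feasible depends on the trainer; COVER-style algorithms need random access to the corpus, so the documented-buffering option may be the pragmatic choice here.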

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 37e2cb36-e61f-41fa-aa4c-1e5637716601

📥 Commits

Reviewing files that changed from the base of the PR and between d205d66 and b18d48f.

📒 Files selected for processing (9)
  • .github/scripts/run-benchmarks.sh
  • BENCHMARKS.md
  • README.md
  • zstd/Cargo.toml
  • zstd/benches/compare_ffi.rs
  • zstd/benches/dict_builder_fastcover.rs
  • zstd/src/dictionary/cover.rs
  • zstd/src/dictionary/fastcover.rs
  • zstd/src/dictionary/mod.rs

@codecov

codecov bot commented Apr 5, 2026

Codecov Report

❌ Patch coverage is 96.43735% with 29 lines in your changes missing coverage. Please review.

Files with missing lines (patch % / lines missing):

  • zstd/src/dictionary/mod.rs: 94.92% (27 missing ⚠️)
  • zstd/src/dictionary/cover.rs: 83.33% (1 missing ⚠️)
  • zstd/src/dictionary/fastcover.rs: 99.63% (1 missing ⚠️)



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/scripts/run-benchmarks.sh:
- Around line 244-273: The dict-train parsing block (dict_train_match ->
dictionary_training_rows) must not inherit compression's scenario_input_bytes
for throughput math; stop populating or referencing scenario_input_bytes for
dict-train rows and instead add a dedicated "training_bytes" field (set to None
if the report doesn't include it) so downstream throughput calculations
skip/avoid computing rust_bytes_per_sec/ffi_bytes_per_sec for dict-train. Update
the downstream throughput/printing logic to check for "training_bytes" (or its
absence) before computing bytes/sec and only use REPORT_DICT_TRAIN-provided
training_bytes if you add that field; if you need absolute throughput later, add
training_bytes to REPORT_DICT_TRAIN and consume it here. Ensure changes
reference dict_train_match, dictionary_training_rows, and the throughput
computation code that reads scenario_input_bytes.

In `@zstd/src/dictionary/fastcover.rs`:
- Around line 128-179: optimize_fastcover_raw currently panics for 1-byte
samples because split_idx.clamp(1, sample.len().saturating_sub(1)) can produce
an invalid range and it may return an empty/zeroed winner when all candidates
score 0; modify optimize_fastcover_raw to first handle tiny samples (e.g., if
sample.len() < 2 return early or set split_idx = 1 only when sample.len() > 1)
so the clamp bounds are valid, and seed the "best" and "best_dict" with the
first evaluated candidate inside the nested loops (compute
params/build_raw_dict/coverage_score for the first f/d/k triple and assign
regardless of score) so there is always a concrete dict and params returned; add
regression tests calling train_fastcover_raw_from_slice()/optimize_fastcover_raw
with a 1-byte sample and a short sample where eval split is smaller than the
minimum d to ensure no panic and a non-empty dict is produced.
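The tiny-sample guard suggested above can be sketched as follows; the function name and the 75/25 split are illustrative, not the crate's actual code:

```rust
// Guard for the 1-byte-sample panic: only clamp the eval split index when the
// sample can actually be split. usize::clamp(min, max) panics if min > max,
// which is exactly what clamp(1, len - 1) does when len < 2.
fn eval_split_index(sample_len: usize) -> Option<usize> {
    if sample_len < 2 {
        // Too small to hold out an evaluation slice; the caller should train
        // on the whole sample instead of clamping into an invalid range.
        return None;
    }
    let split = (sample_len * 3) / 4; // e.g. 75% train / 25% eval
    Some(split.clamp(1, sample_len - 1))
}

fn main() {
    assert_eq!(eval_split_index(0), None);
    assert_eq!(eval_split_index(1), None); // previously: clamp(1, 0) would panic
    assert_eq!(eval_split_index(4), Some(3));
    assert_eq!(eval_split_index(100), Some(75));
}
```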

In `@zstd/src/dictionary/mod.rs`:
- Around line 327-330: Reject dict_id == 0 before serializing: after computing
dict_id (from options.dict_id or derive_dict_id(raw_content)) check if dict_id
== 0 and return an early error instead of writing to out; mirror the existing
behavior used by FrameCompressor::set_dictionary which returns
DictionaryDecodeError::ZeroDictionaryId so propagate the same error on
encode/serialization paths (check both when options.dict_id is provided and when
derived via derive_dict_id).
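A minimal sketch of the zero-id rejection, assuming an `io::Result` error path; the function name is hypothetical and the real code would return the crate's DictionaryDecodeError::ZeroDictionaryId:

```rust
use std::io;

// Whether the id was supplied explicitly or derived from the raw content,
// a value of 0 is reserved and must fail before any bytes are written.
fn resolve_dict_id(explicit: Option<u32>, derived: u32) -> io::Result<u32> {
    let id = explicit.unwrap_or(derived);
    if id == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "dictionary id 0 is reserved",
        ));
    }
    Ok(id)
}

fn main() {
    assert_eq!(resolve_dict_id(Some(42), 7).unwrap(), 42);
    assert_eq!(resolve_dict_id(None, 7).unwrap(), 7);
    assert!(resolve_dict_id(Some(0), 7).is_err()); // explicit zero rejected
    assert!(resolve_dict_id(None, 0).is_err()); // derived zero rejected too
}
```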

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 868b26a4-86cf-42d7-b1d2-eab09d92da77

📥 Commits

Reviewing files that changed from the base of the PR and between b18d48f and 879ab3f.

📒 Files selected for processing (5)
  • .github/scripts/run-benchmarks.sh
  • zstd/benches/compare_ffi.rs
  • zstd/src/dictionary/cover.rs
  • zstd/src/dictionary/fastcover.rs
  • zstd/src/dictionary/mod.rs


@sw-release-bot sw-release-bot bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
The benchmark result for this commit is worse than the previous result, exceeding the 1.15 threshold.

| Benchmark suite | Current: c30e9ba | Previous: b280757 | Ratio |
|---|---|---|---|
| compress/better/small-1k-random/matrix/c_ffi | 0.11 ms | 0.092 ms | 1.20 |
| compress/best/small-1k-random/matrix/pure_rust | 0.296 ms | 0.249 ms | 1.19 |
| compress/default/small-4k-log-lines/matrix/c_ffi | 0.021 ms | 0.018 ms | 1.17 |
| compress/better/small-4k-log-lines/matrix/pure_rust | 0.152 ms | 0.124 ms | 1.23 |
| compress/better/small-4k-log-lines/matrix/c_ffi | 0.099 ms | 0.077 ms | 1.29 |
| compress/best/small-4k-log-lines/matrix/pure_rust | 0.258 ms | 0.203 ms | 1.27 |
| compress/best/small-4k-log-lines/matrix/c_ffi | 0.368 ms | 0.28 ms | 1.31 |
| compress/fastest/low-entropy-1m/matrix/c_ffi | 0.189 ms | 0.142 ms | 1.33 |
| compress/fastest/large-log-stream/matrix/c_ffi | 2.726 ms | 2.149 ms | 1.27 |
| compress/best/large-log-stream/matrix/c_ffi | 15.06 ms | 12.903 ms | 1.17 |
| decompress/fastest/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
| decompress/fastest/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.027 ms | 1.30 |
| decompress/default/high-entropy-1m/c_stream/matrix/c_ffi | 0.037 ms | 0.027 ms | 1.37 |
| decompress/best/high-entropy-1m/c_stream/matrix/c_ffi | 0.172 ms | 0.038 ms | 4.53 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/scripts/run-benchmarks.sh:
- Around line 293-298: The current check for missing dictionary_training_rows
only prints a warning; change it to fail fast by replacing the print-only
behavior in the if not dictionary_training_rows: block so the script exits with
a non-zero status (e.g., call sys.exit(1)) after writing the same error message
to stderr; locate and update the if not dictionary_training_rows: conditional in
.github/scripts/run-benchmarks.sh to preserve the stderr output and then exit to
ensure CI fails when REPORT_DICT_TRAIN rows are missing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ca663b82-0133-4187-8e96-f731d78116f3

📥 Commits

Reviewing files that changed from the base of the PR and between 879ab3f and a3091ec.

📒 Files selected for processing (3)
  • .github/scripts/run-benchmarks.sh
  • zstd/src/dictionary/fastcover.rs
  • zstd/src/dictionary/mod.rs


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 35-41: The code currently clamps table bits inside
clamp_table_bits and build_frequency_table but trains, scores, and records using
the original raw params (e.g., build_raw_dict, params.d, k, FastCoverTuned), so
normalize/replace params once up front: in the entrypoints (including the
non-optimized wrapper in mod.rs) compute normalized params.d (clamped to 4..32),
params.f (clamped via clamp_table_bits to 8..20) and adjust k to match the
normalized d, then pass these normalized values into build_frequency_table,
build_raw_dict, the optimizer scoring path, and when constructing FastCoverTuned
so that training, evaluation, and reporting all use the same clamped/adjusted
parameters.
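The normalize-once idea can be sketched like this; the struct fields and clamp ranges follow the comment above (d in 4..=32, f in 8..=20, accel in 1..=10), but the exact bounds and the k adjustment rule are assumptions about the crate's internals:

```rust
// Normalization sketch: clamp once at the entrypoint so training, scoring,
// and the reported tuned params all agree on the same values.
#[derive(Clone, Copy, Debug, PartialEq)]
struct FastCoverParams {
    d: u32,     // d-mer length
    k: u32,     // segment size
    f: u32,     // frequency table log size
    accel: u32, // acceleration level
}

fn normalize(mut p: FastCoverParams) -> FastCoverParams {
    p.d = p.d.clamp(4, 32);
    p.f = p.f.clamp(8, 20);
    p.accel = p.accel.clamp(1, 10);
    // k must cover at least one full d-mer after d has been adjusted.
    p.k = p.k.max(p.d);
    p
}

fn main() {
    let raw = FastCoverParams { d: 2, k: 3, f: 30, accel: 0 };
    let n = normalize(raw);
    assert_eq!(n, FastCoverParams { d: 4, k: 4, f: 20, accel: 1 });
    // Idempotent: normalizing an already-normalized value changes nothing.
    assert_eq!(normalize(n), n);
}
```

Passing the normalized struct into training, scoring, and the tuned-result constructor removes the class of bugs where each stage re-clamps (or forgets to clamp) independently.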

In `@zstd/src/dictionary/mod.rs`:
- Around line 405-416: train_fastcover_raw_from_slice currently only rejects
empty sample slices but can return Ok((Vec::new(), tuned)) for tiny samples;
update this function to treat a produced empty dictionary as invalid when
dict_size > 0: after calling train_fastcover_internal (or by checking the dict
produced there), return an Err(io::ErrorKind::InvalidInput, ...) if
dict.is_empty() && dict_size > 0, or alternatively implement the same
tiny-source fallback logic used by create_raw_dict_from_source so small samples
fall back to the tiny-source path; reference train_fastcover_raw_from_slice,
train_fastcover_internal, create_fastcover_raw_dict_from_source,
create_fastcover_dict_from_source, and finalize_raw_dict when adding the check
or fallback.
- Around line 337-345: finalize_raw_dict() now writes Huffman/FSE tables before
the repeat offsets, but Dictionary::decode_dict() still assumes offsets are at
raw[8..20] and content at raw[20..], causing mis-decoding; update
Dictionary::decode_dict() (and any caller like
FrameCompressor::set_dictionary()) to parse the new wire format by first
reading/consuming the serialized Huffman and three FSE tables (as produced by
serialize_huffman_table and serialize_fse_table) before reading the three repeat
offsets and the remaining content, or alternatively gate finalize_raw_dict()
behind a feature/API flag until the in-crate parser is updated so local
consumers do not receive finalized dictionaries with mismatched layout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ce3585f0-341a-4ce8-a310-d180922a6582

📥 Commits

Reviewing files that changed from the base of the PR and between a3091ec and b24e75f.

📒 Files selected for processing (2)
  • zstd/src/dictionary/fastcover.rs
  • zstd/src/dictionary/mod.rs

polaz added 8 commits April 6, 2026 00:14
- extend compare_ffi with dict-train stage and REPORT_DICT_TRAIN lines
- parse and render dictionary training section in benchmark reports
- require dict_builder feature for compare_ffi benchmark and pipeline command
- document new benchmark stage and speed-delta fallback behavior
- treat source_size as hint and document in-memory training behavior
- fix epoch sizing units, bound huffman stats sampling, and share fastcover training helper
- add k-candidate fallback + regression tests
- rename dict-train delta field to match C/Rust ratio semantics

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 427-435: Add a short doc comment to clearly state that these
public Read-based helpers fully buffer the training corpus into memory before
training: update the doc comment for create_fastcover_raw_dict_from_source
(which calls source.read_to_end and then train_fastcover_raw_from_slice) and the
other public Read-based helper that likewise calls read_to_end (the sibling
function around lines 441-449) to say something like "This function fully
buffers the entire training corpus into memory (uses read_to_end) and may
consume large amounts of RAM for large inputs; callers should stream or use
slice-based APIs for very large corpora." Ensure both functions' public docs
contain that exact memory-buffering warning.
- Around line 450-453: The trainer currently calls
train_fastcover_raw_from_slice(..., dict_size, ...) but finalize_raw_dict(...,
dict_size, ...) only allows dict_size - header_bytes for raw content, so
training targets the wrong budget; change the training budget to use the actual
post-finalization content budget by computing the finalized content capacity the
same way finalize_raw_dict does (reuse its header/table sizing logic), e.g.,
add/implement finalized_content_budget(...) and pass that result into
train_fastcover_raw_from_slice instead of dict_size so the raw trainer only
optimizes the bytes that will survive finalization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 61ab7123-92f2-44a7-b8cb-9ad833e2561f

📥 Commits

Reviewing files that changed from the base of the PR and between b24e75f and f83dc71.

📒 Files selected for processing (9)
  • .github/scripts/run-benchmarks.sh
  • BENCHMARKS.md
  • README.md
  • zstd/Cargo.toml
  • zstd/benches/compare_ffi.rs
  • zstd/benches/dict_builder_fastcover.rs
  • zstd/src/dictionary/cover.rs
  • zstd/src/dictionary/fastcover.rs
  • zstd/src/dictionary/mod.rs


@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
zstd/src/dictionary/mod.rs (3)

450-453: ⚠️ Potential issue | 🟠 Major

Train against the post-finalization content budget.

This path tunes raw content with the full dict_size, then finalize_raw_dict() removes header/table bytes and may reject the request anyway. Successful calls ship a truncated dictionary that FastCOVER never optimized, and impossible budgets still pay the full optimization cost.

Possible direction
-    let (raw_dict, tuned) =
-        train_fastcover_raw_from_slice(sample.as_slice(), dict_size, fastcover)?;
+    let content_budget = finalized_content_budget(sample.as_slice(), dict_size, finalize)?;
+    let (raw_dict, tuned) =
+        train_fastcover_raw_from_slice(sample.as_slice(), content_budget, fastcover)?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/dictionary/mod.rs` around lines 450 - 453, The code currently calls
train_fastcover_raw_from_slice with the full dict_size then finalizes, which
wastes work and can exceed the post-finalization content budget; change the flow
to compute the actual content budget to be used for training (the maximum number
of payload bytes allowed after finalize_raw_dict removes headers/tables) before
training, check that this content_budget is positive and return the appropriate
error if it is impossible, then call
train_fastcover_raw_from_slice(sample.as_slice(), content_budget, fastcover) to
produce raw_dict tuned to the post-finalize budget and finally call
finalize_raw_dict(raw_dict.as_slice(), sample.as_slice(), dict_size, finalize)
as before; reference the functions train_fastcover_raw_from_slice and
finalize_raw_dict and add the pre-check for an impossible content budget to
avoid wasted work.

337-340: ⚠️ Potential issue | 🟠 Major

Sequence entropy tables are still hard-coded.

finalize_raw_dict() always emits default_of_table(), default_ml_table(), and default_ll_table(), so only the Huffman side is corpus-trained. Finalized dictionaries will systematically trail C finalization quality because they never learn sample-specific sequence stats.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/dictionary/mod.rs` around lines 337 - 340, finalize_raw_dict()
currently emits hard-coded sequence FSE tables (fse_encoder::default_of_table(),
default_ml_table(), default_ll_table()), so sequence statistics from sample_data
are never used; replace those default tables with FSE tables built from the
sample sequence statistics before calling serialize_fse_table. Locate the block
where out.extend_from_slice(serialize_fse_table(...)) is called and: compute
sequence counts/stats from sample_data (the same sample source used for
serialize_huffman_table), build FSE encoder tables from those counts (instead of
using default_of_table/default_ml_table/default_ll_table), and pass the
generated tables into serialize_fse_table so finalized dictionaries include
sample-trained OF/ML/LL sequence tables.

188-199: ⚠️ Potential issue | 🟠 Major

Don't panic on I/O failures in this public builder.

create_raw_dict_from_source crashes on read/write errors (lines 193, 202, 257) via .expect(), while the newer create_fastcover_raw_dict_from_source and create_fastcover_dict_from_source properly propagate errors as io::Result. Change the return type to io::Result<()> and use ? instead of .expect() for consistent error handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/dictionary/mod.rs` around lines 188 - 199, The public builder
create_raw_dict_from_source currently panics on I/O errors via .expect() calls
(e.g., the read_to_end and write_all sites) — change its signature to return
io::Result<()> and replace those .expect(...) calls with the ? operator so I/O
errors are propagated; also update any other remaining .expect() in
create_raw_dict_from_source (including the later write at the tiny-dictionary
path) to use ? and ensure callers handle the propagated Result.
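The suggested change can be sketched as below. The body is a stand-in, not the crate's real create_raw_dict_from_source; the point is the signature change to io::Result<()> and the `?` operator replacing .expect():

```rust
use std::io::{self, Read, Write};

// Error-propagation sketch: the builder returns io::Result<()> and uses `?`
// instead of .expect(), matching the newer fastcover builders. The "training"
// step here is a placeholder that just takes a prefix of the sample.
fn create_raw_dict_from_source<R: Read, W: Write>(mut source: R, mut out: W) -> io::Result<()> {
    let mut sample = Vec::new();
    source.read_to_end(&mut sample)?; // was: .expect("read failed")
    let dict = &sample[..sample.len().min(16)]; // placeholder for real training
    out.write_all(dict)?; // was: .expect("write failed")
    Ok(())
}

fn main() {
    let data: &[u8] = &[1, 2, 3];
    let mut out = Vec::new();
    create_raw_dict_from_source(data, &mut out).unwrap();
    assert_eq!(out, vec![1, 2, 3]);
}
```

Callers that previously relied on the panic must now handle (or deliberately `.unwrap()`) the propagated Result, which is an API-visible change.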
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 197-200: The FFI and Rust trainers are being fed different
corpora: Rust uses a flattened training_blob while the FFI calls still use
sample_refs (preserving original boundaries), so make them identical by choosing
one canonical representation and using it for both trainers; e.g., keep
training_blob (Vec<u8>) and compute sample_offsets/lengths or create slices that
point into training_blob (replace sample_refs with slices derived from
training_blob) and pass those same slices/offsets to the FFI dict-train calls
(the variables to change are training_samples, sample_refs, training_blob,
total_training_bytes and the FFI dict-train invocations near where sample_refs
are passed). Ensure total_training_bytes is computed from the unified
representation and update both trainer call sites to use that unified data.
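One way to realize the "single canonical representation" is a contiguous blob plus per-sample lengths, from which both trainers derive byte-identical inputs. Names below are illustrative, not the benchmark's actual symbols.

```rust
// Build one canonical corpus: a contiguous blob plus sample lengths.
fn build_training_blob(samples: &[&[u8]]) -> (Vec<u8>, Vec<usize>) {
    let mut blob = Vec::new();
    let mut lengths = Vec::with_capacity(samples.len());
    for s in samples {
        blob.extend_from_slice(s);
        lengths.push(s.len());
    }
    (blob, lengths)
}

// Re-derive per-sample slices for the FFI trainer from the same blob,
// so both sides see identical bytes AND identical sample boundaries.
fn sample_slices<'a>(blob: &'a [u8], lengths: &[usize]) -> Vec<&'a [u8]> {
    let mut refs = Vec::with_capacity(lengths.len());
    let mut offset = 0;
    for &len in lengths {
        refs.push(&blob[offset..offset + len]);
        offset += len;
    }
    refs
}

fn main() {
    let samples: Vec<&[u8]> = vec![b"hello", b"world!"];
    let (blob, lengths) = build_training_blob(&samples);
    let total_training_bytes: usize = lengths.iter().sum();
    assert_eq!(total_training_bytes, blob.len());
    assert_eq!(sample_slices(&blob, &lengths), samples);
}
```

With this shape, `total_training_bytes` falls out of the unified representation for free, and both trainer call sites consume the same data.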

In `@zstd/src/dictionary/fastcover.rs`:
- Around line 39-43: normalize_fastcover_params currently clamps d/k/f but
leaves accel unchecked so callers and the optimizer can use out-of-range accel
values; update normalize_fastcover_params(FastCoverParams) to clamp params.accel
to 1..=10 (same pattern as params.d and params.f using clamp_table_bits) and
return the clamped struct, then ensure any other normalization or tuning paths
that construct FastCoverTuned or score using accel (the optimizer/scoring code
and the code that builds FastCoverTuned) use that clamped accel value instead of
the raw input so the tuned result never contains accel outside 1..=10.

In `@zstd/src/dictionary/mod.rs`:
- Around line 385-398: The FastCoverParams are normalized into params via
normalize_fastcover_params(FastCoverParams { ... }) but the FastCoverTuned
struct still uses options.accel; change it to use the normalized params.accel
instead. Locate the block that creates params and returns
(fastcover::train_fastcover_raw(...), FastCoverTuned { ... }) and replace the
accel field value from options.accel to params.accel so FastCoverTuned
consistently reflects the normalized parameters (refer to
normalize_fastcover_params, FastCoverParams, params, options.accel, and
FastCoverTuned).

---

Duplicate comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 450-453: The code currently calls train_fastcover_raw_from_slice
with the full dict_size then finalizes, which wastes work and can exceed the
post-finalization content budget; change the flow to compute the actual content
budget to be used for training (the maximum number of payload bytes allowed
after finalize_raw_dict removes headers/tables) before training, check that this
content_budget is positive and return the appropriate error if it is impossible,
then call train_fastcover_raw_from_slice(sample.as_slice(), content_budget,
fastcover) to produce raw_dict tuned to the post-finalize budget and finally
call finalize_raw_dict(raw_dict.as_slice(), sample.as_slice(), dict_size,
finalize) as before; reference the functions train_fastcover_raw_from_slice and
finalize_raw_dict and add the pre-check for an impossible content budget to
avoid wasted work.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 076babeb-74dc-441d-a1f2-a9c33c1d779a

📥 Commits

Reviewing files that changed from the base of the PR and between b24e75f and f83dc71.

📒 Files selected for processing (9)
  • .github/scripts/run-benchmarks.sh
  • BENCHMARKS.md
  • README.md
  • zstd/Cargo.toml
  • zstd/benches/compare_ffi.rs
  • zstd/benches/dict_builder_fastcover.rs
  • zstd/src/dictionary/cover.rs
  • zstd/src/dictionary/fastcover.rs
  • zstd/src/dictionary/mod.rs

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
zstd/src/dictionary/mod.rs (1)

479-492: 🧹 Nitpick | 🔵 Trivial

Use the normalized accel in FastCoverTuned.

k, d, and f come from normalize_fastcover_params(...), but accel still comes from the raw option. That makes the public FastCoverTuned result inconsistent with the parameters actually passed to fastcover::train_fastcover_raw(...).

♻️ Proposed fix
         (
             fastcover::train_fastcover_raw(sample, dict_size, params),
             FastCoverTuned {
                 k: params.k,
                 d: params.d,
                 f: params.f,
-                accel: options.accel,
+                accel: params.accel,
                 score: 0,
             },
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/dictionary/mod.rs` around lines 479 - 492, The FastCoverTuned
returned struct uses options.accel instead of the normalized accel from
normalize_fastcover_params, causing a mismatch with the params passed to
fastcover::train_fastcover_raw; update the FastCoverTuned initializer to set
accel: params.accel (and ensure FastCoverParams/normalize_fastcover_params
provide accel) so k, d, f, and accel in FastCoverTuned reflect the actual
normalized params used for training.
zstd/benches/compare_ffi.rs (1)

197-200: ⚠️ Potential issue | 🟠 Major

Benchmark the same corpus model on both sides.

The Rust path trains on one contiguous training_blob, while the C path still trains on sample_refs with explicit sample boundaries. That changes both the work performed and the resulting dictionary, so the reported dict-train delta is not comparing like-for-like inputs.

Also applies to: 206-208, 221-223, 251-267

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 235-243: The REPORT_DICT_TRAIN reported training_bytes is
incorrect because emit_dictionary_training_report is being passed scenario.len()
even though split_training_samples() may truncate samples; update the calls to
compute the actual byte count from the samples returned by
split_training_samples (e.g., sum of lengths of the training sample slices) and
pass that sum as the training_bytes argument to emit_dictionary_training_report
(change the calls around emit_dictionary_training_report where
rust_dictionary/ffi_dictionary are computed; also apply the same fix to the
other occurrence around lines 419-431). Ensure you reference
split_training_samples, emit_dictionary_training_report, and REPORT_DICT_TRAIN
when locating and fixing the code.
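The fix reduces to summing the lengths of the samples the trainer actually received. A sketch with a hypothetical truncating splitter (`split_samples` stands in for `split_training_samples`):

```rust
// Hypothetical splitter that caps total training bytes, like
// split_training_samples may: it can emit fewer bytes than the input.
fn split_samples(data: &[u8], max_total: usize, chunk: usize) -> Vec<&[u8]> {
    data.chunks(chunk).take(max_total / chunk).collect()
}

fn main() {
    let scenario = vec![0u8; 10_000];
    let samples = split_samples(&scenario, 8_192, 1_024);
    // Report what the trainer saw, not the scenario size.
    let training_bytes: usize = samples.iter().map(|s| s.len()).sum();
    assert_eq!(training_bytes, 8_192);
    assert!(training_bytes < scenario.len()); // scenario.len() would over-report
}
```

That `training_bytes` sum is the value `REPORT_DICT_TRAIN` should carry; `scenario.len()` silently over-reports whenever the splitter truncates.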

---

Duplicate comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 479-492: The FastCoverTuned returned struct uses options.accel
instead of the normalized accel from normalize_fastcover_params, causing a
mismatch with the params passed to fastcover::train_fastcover_raw; update the
FastCoverTuned initializer to set accel: params.accel (and ensure
FastCoverParams/normalize_fastcover_params provide accel) so k, d, f, and accel
in FastCoverTuned reflect the actual normalized params used for training.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7e2ed0ba-b8ff-4cba-91d5-6cf52e42bc1f

📥 Commits

Reviewing files that changed from the base of the PR and between f83dc71 and c30e9ba.

📒 Files selected for processing (3)
  • .github/scripts/run-benchmarks.sh
  • zstd/benches/compare_ffi.rs
  • zstd/src/dictionary/mod.rs

@polaz
Member Author

polaz commented Apr 5, 2026

@coderabbitai Re duplicate about FastCoverTuned.accel: this is already fixed on HEAD (dcd9185). train_fastcover_internal now uses accel: params.accel, and normalize_fastcover_params also clamps accel to 1..=10 in zstd/src/dictionary/fastcover.rs.

@coderabbitai

coderabbitai bot commented Apr 5, 2026


@polaz Thanks for the clarification! Acknowledged — FastCoverTuned.accel is correctly populated in train_fastcover_internal (using accel: params.accel) and properly clamped via normalize_fastcover_params in zstd/src/dictionary/fastcover.rs on HEAD (dcd9185). I'll make sure not to re-flag this in future reviews.


✏️ Learnings added
Learnt from: polaz
URL: https://github.com/structured-world/structured-zstd/pull/73

Timestamp: 2026-04-05T22:29:06.349Z
Learning: In `zstd/src/dictionary/fastcover.rs` (structured-world/structured-zstd), `FastCoverTuned.accel` is correctly populated in `train_fastcover_internal` using `accel: params.accel`, and `normalize_fastcover_params` clamps `accel` to `1..=10`. Do not flag `FastCoverTuned.accel` as unclamped or incorrectly set.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 73
File: zstd/src/dictionary/mod.rs:337-345
Timestamp: 2026-04-05T21:31:53.614Z
Learning: In `zstd/src/decoding/dictionary.rs` (structured-world/structured-zstd), `Dictionary::decode_dict()` parses the full zstd dictionary wire format: magic (4 bytes) → dict_id (4 bytes) → Huffman table (variable, consumed via `build_decoder`) → OF FSE table → ML FSE table → LL FSE table → 3 × u32 repeat offsets → raw content. This matches the layout produced by `finalize_raw_dict()` in `zstd/src/dictionary/mod.rs`. Do not flag these as mismatched layouts.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 206-246: The benchmark compares non-equivalent outputs:
zstd::dict::from_samples returns a finalized dictionary while
train_fastcover_raw_from_slice returns only raw content, so either finalize the
Rust result before measuring and reporting (e.g., pass
rust_dictionary/rust_tuned through the same finalization used by
create_fastcover_dict_from_source or a slice-based wrapper so
rust_dictionary.len() and rust_train_ms reflect the full finalized dict), or
change the emitted metrics in emit_dictionary_training_report to clearly label
these as "raw-content training" and report raw vs finalized sizes separately
(update the fields/labels for rust_dict_bytes vs ffi_dict_bytes and any timing
fields accordingly). Ensure you update the code paths around
train_fastcover_raw_from_slice, rust_dictionary, rust_tuned,
zstd::dict::from_samples and emit_dictionary_training_report so the compared
artifacts are equivalent or the report explicitly notes raw vs finalized.

In `@zstd/benches/dict_builder_fastcover.rs`:
- Around line 39-71: Hoist the FastCoverOptions construction out of the b.iter()
closures so the benches don't include allocation/setup cost: create one
FastCoverOptions instance for the optimized bench (e.g., fastcover_opts =
FastCoverOptions::default()) and one for the fixed bench (e.g.,
fastcover_fixed_opts = FastCoverOptions { optimize:false, accel:4, k:256, d:8,
f:20, ..FastCoverOptions::default() }) before calling c.bench_function, then
pass &fastcover_opts / &fastcover_fixed_opts into
create_fastcover_raw_dict_from_source inside the closures (references or clones
as appropriate) instead of building them inside b.iter; reference symbols:
FastCoverOptions, create_fastcover_raw_dict_from_source, the bench names
"dict_builder/fastcover_raw_opt" and "dict_builder/fastcover_raw_fixed".
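The hoisting pattern itself is simple: build the options once, outside the timed closure, and only borrow them inside it. In the sketch below `run_bench` stands in for Criterion's `b.iter()` and the options struct is illustrative — the point is only where construction happens relative to the measurement loop.

```rust
#[derive(Clone)]
struct FastCoverOptions {
    optimize: bool,
    accel: u32,
    k: u32,
    d: u32,
    f: u32,
}

impl Default for FastCoverOptions {
    fn default() -> Self {
        Self { optimize: true, accel: 1, k: 1024, d: 8, f: 20 }
    }
}

// Stand-in for Criterion's b.iter(): runs the timed body repeatedly.
fn run_bench<F: FnMut()>(iters: u32, mut body: F) {
    for _ in 0..iters {
        body();
    }
}

fn main() {
    // Built once, before the measurement loop...
    let fixed_opts = FastCoverOptions { optimize: false, accel: 4, k: 256, d: 8, f: 20 };
    let mut calls = 0u32;
    run_bench(3, || {
        // ...and only borrowed inside it, so setup cost is never timed.
        let opts = &fixed_opts;
        assert!(!opts.optimize);
        calls += 1;
    });
    assert_eq!(calls, 3);
}
```

Applied to the two benches, `fastcover_opts` and `fastcover_fixed_opts` would be constructed before `c.bench_function` and passed by reference into the closures.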
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a8629e61-02e4-4fd1-b930-f0275077209a

📥 Commits

Reviewing files that changed from the base of the PR and between c30e9ba and dcd9185.

📒 Files selected for processing (5)
  • zstd/benches/compare_ffi.rs
  • zstd/benches/dict_builder_fastcover.rs
  • zstd/src/dictionary/fastcover.rs
  • zstd/src/dictionary/mod.rs
  • zstd/src/encoding/frame_compressor.rs

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
zstd/benches/compare_ffi.rs (1)

208-225: ⚠️ Potential issue | 🟠 Major

Benchmark against the post-finalization content budget, not the full dict_size.

Lines 209-224 and Lines 268-279 still train FastCOVER with the full dict_size, then finalize_raw_dict() trims that raw dictionary down to dict_size - header_bytes. That means rust_train_ms includes work spent optimizing bytes that can never survive into the emitted artifact, so the new dict-train delta is still skewed. Please reuse the same post-finalization budget logic as create_fastcover_dict_from_source() before calling train_fastcover_raw_from_slice() in both the report path and the Criterion loop.

Also applies to: 266-280

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/benches/compare_ffi.rs` around lines 208 - 225, The benchmark is
training FastCOVER with the pre-finalization dict_size but then
finalize_raw_dict trims header bytes, so training work includes bytes that
cannot appear in output; compute the post-finalization budget the same way as
create_fastcover_dict_from_source (i.e., subtract header/metadata overhead to
get effective_dict_size) and pass that effective size into
train_fastcover_raw_from_slice instead of dict_size in both the report path
(where rust_train_started is measured) and the Criterion loop; ensure the same
effective size is used when calling finalize_raw_dict to keep training and
finalization budgets consistent (refer to train_fastcover_raw_from_slice,
finalize_raw_dict, and create_fastcover_dict_from_source to mirror its budget
computation).
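The budget computation being asked for can be sketched as a fail-fast helper. `HEADER_OVERHEAD` is a hypothetical stand-in for whatever `finalize_raw_dict` actually consumes (magic, dict ID, entropy tables); the real value would be derived the same way `create_fastcover_dict_from_source` derives it.

```rust
// Illustrative overhead only — the real number comes from the
// finalization path's header/table accounting.
const HEADER_OVERHEAD: usize = 256;

// Bytes of raw content that can survive finalization. Errors out
// before any training work if the budget is impossible.
fn content_budget(dict_size: usize) -> Result<usize, &'static str> {
    match dict_size.checked_sub(HEADER_OVERHEAD) {
        Some(budget) if budget > 0 => Ok(budget),
        _ => Err("dict_size cannot hold any content after finalization"),
    }
}

fn main() {
    // Train FastCOVER against the post-finalization budget, not dict_size.
    assert_eq!(content_budget(4096), Ok(3840));
    // Impossible budgets fail fast instead of wasting training work.
    assert!(content_budget(200).is_err());
    assert!(content_budget(256).is_err()); // exactly zero content is also an error
}
```

Both the `REPORT_DICT_TRAIN` path and the Criterion loop would pass `content_budget(dict_size)?` into `train_fastcover_raw_from_slice`, so `rust_train_ms` never includes work on bytes the finalized artifact cannot contain.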
zstd/src/dictionary/mod.rs (1)

301-319: ⚠️ Potential issue | 🟠 Major

LL/ML/OF tables are still trained from the wrong symbol space.

Lines 307-316 collapse raw corpus bytes into the FSE code space with byte % (max_symbol + 1), and Line 333 builds the tables from that synthetic stream. That makes the tables data-dependent, but not sequence-dependent: the resulting histograms still have no relationship to the compressor’s actual literal-length, match-length, or offset-code frequencies. The dictionaries remain parseable, but this finalized path will still miss the intended compression-quality parity. Please derive these histograms from real emitted LL/ML/OF codes before calling build_table_from_data.

Also applies to: 321-335

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/dictionary/mod.rs` around lines 301 - 319, The current
bounded_fse_symbols function and its use to create FSE tables (via
build_table_from_data) wrongly map raw corpus bytes into the FSE symbol space
(byte % (max_symbol+1)), producing histograms that do not reflect actual emitted
LL/ML/OF codes; instead, compute the histograms from real emitted
literal-length, match-length, and offset-code sequences as produced by the
compressor/parser and pass those symbol streams into build_table_from_data.
Modify the code paths that call bounded_fse_symbols (and any place that feeds
build_table_from_data for LL/ML/OF tables) to generate symbol streams by running
the corpus through the same parsing/emission logic that outputs LL/ML/OF codes,
then feed those emitted code sequences into build_table_from_data so tables are
trained on true sequence-dependent frequencies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 208-219: The benchmark removed the Rust FastCOVER score: keep the
returned rust_tuned from train_fastcover_raw_from_slice (the
Ok((rust_raw_dictionary, _rust_tuned)) pattern) and extract its score
(rust_fastcover_score) to store in DictTrainingMetrics and include it when
emitting REPORT_DICT_TRAIN; specifically, update the pattern that currently
discards rust_tuned to bind it (e.g., rust_tuned), add the rust_fastcover_score
field into the DictTrainingMetrics construction (where metrics are created
around lines ~247-258), and ensure the REPORT_DICT_TRAIN output (around lines
~441-463) appends the rust_fastcover_score field so the external script parsing
REPORT_DICT_TRAIN sees the trailing score value.

---

Duplicate comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 208-225: The benchmark is training FastCOVER with the
pre-finalization dict_size but then finalize_raw_dict trims header bytes, so
training work includes bytes that cannot appear in output; compute the
post-finalization budget the same way as create_fastcover_dict_from_source
(i.e., subtract header/metadata overhead to get effective_dict_size) and pass
that effective size into train_fastcover_raw_from_slice instead of dict_size in
both the report path (where rust_train_started is measured) and the Criterion
loop; ensure the same effective size is used when calling finalize_raw_dict to
keep training and finalization budgets consistent (refer to
train_fastcover_raw_from_slice, finalize_raw_dict, and
create_fastcover_dict_from_source to mirror its budget computation).

In `@zstd/src/dictionary/mod.rs`:
- Around line 301-319: The current bounded_fse_symbols function and its use to
create FSE tables (via build_table_from_data) wrongly map raw corpus bytes into
the FSE symbol space (byte % (max_symbol+1)), producing histograms that do not
reflect actual emitted LL/ML/OF codes; instead, compute the histograms from real
emitted literal-length, match-length, and offset-code sequences as produced by
the compressor/parser and pass those symbol streams into build_table_from_data.
Modify the code paths that call bounded_fse_symbols (and any place that feeds
build_table_from_data for LL/ML/OF tables) to generate symbol streams by running
the corpus through the same parsing/emission logic that outputs LL/ML/OF codes,
then feed those emitted code sequences into build_table_from_data so tables are
trained on true sequence-dependent frequencies.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ddbef425-cf43-4e2f-aa78-7f7b152262e2

📥 Commits

Reviewing files that changed from the base of the PR and between 9625c35 and dbe0b38.

📒 Files selected for processing (2)
  • zstd/benches/compare_ffi.rs
  • zstd/src/dictionary/mod.rs


@polaz
Member Author

polaz commented Apr 6, 2026

@coderabbitai Re: duplicate item about post-finalization content budget in zstd/benches/compare_ffi.rs (ranges 208-225 and 266-280) — fixed in commit 8622344. Rust FastCOVER now trains with finalized content budget in both report and Criterion paths.

@coderabbitai

coderabbitai bot commented Apr 6, 2026


@polaz Thanks for the update! Acknowledged — confirmed that as of commit 8622344, Rust FastCOVER trains with the correct finalized content budget in both the REPORT_DICT_TRAIN report path and the Criterion benchmark path in zstd/benches/compare_ffi.rs.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-05T22:29:06.349Z
Learning: In `zstd/src/dictionary/fastcover.rs` (structured-world/structured-zstd), `FastCoverTuned.accel` is correctly populated in `train_fastcover_internal` using `accel: params.accel`, and `normalize_fastcover_params` clamps `accel` to `1..=10`. Do not flag `FastCoverTuned.accel` as unclamped or incorrectly set.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 73
File: zstd/src/dictionary/mod.rs:337-345
Timestamp: 2026-04-05T21:31:53.614Z
Learning: In `zstd/src/decoding/dictionary.rs` (structured-world/structured-zstd), `Dictionary::decode_dict()` parses the full zstd dictionary wire format: magic (4 bytes) → dict_id (4 bytes) → Huffman table (variable, consumed via `build_decoder`) → OF FSE table → ML FSE table → LL FSE table → 3 × u32 repeat offsets → raw content. This matches the layout produced by `finalize_raw_dict()` in `zstd/src/dictionary/mod.rs`. Do not flag these as mismatched layouts.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-05T07:19:26.176Z
Learning: In `cli/src/main.rs` (structured-world/structured-zstd), path-based compression calls `set_source_size_hint(source_size as u64)` (NOT `set_pledged_content_size`) for the filesystem metadata size. This is intentional: `metadata().len()` is unreliable for pseudo-files (`/proc/*`), FIFOs, and growing inputs, so the hint must remain advisory (only scaling matcher/window parameters) and must not impose a hard byte contract on the encoder. Do not flag the absence of `set_pledged_content_size` in path-based compression as a missing feature.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 33
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T00:05:14.600Z
Learning: In structured-world/structured-zstd, the Clippy requirement is: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`. The `rustc-dep-of-std` feature is excluded (internal feature for Rust stdlib builds only) and `fuzz_exports` is excluded (fuzzing-specific entry points validated separately from the regular lint gate). The `--all-features` flag should NOT be used.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:479-523
Timestamp: 2026-04-02T19:19:44.750Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), the per-level roundtrip test suites (Better, Best) share a common `level_roundtrip_suite!` macro that expands the same 7-test matrix (compressible, random, multi-block, streaming, edge cases, repeat offsets, large literals) for each level via module-scoped macro invocations. Do not suggest re-inlining these as separate flat test functions — the macro was added intentionally to eliminate cross-level drift.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-03T16:36:03.248Z
Learning: In `zstd/src/decoding/frame_decoder.rs`, `FrameDecoderState::reset()` appears asymmetric to `new()` in its pre-allocation, but the asymmetry is intentional: `decoder_scratch.reset(window_size)` internally calls `DecodeBuffer::reset()` which already reserves `window_size`. Therefore `reset()` only needs to call `scratch.buffer.reserve(frame_content_size)` on top of that for the FCS delta, rather than explicitly reserving `window_size` again as `new()` does. Do not re-flag this as a bug or asymmetry.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Avoid suggesting architectural changes that diverge too far from upstream, as this is a fork of ruzstd (KillingSpark/zstd-rs)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1470-1490
Timestamp: 2026-04-02T07:27:15.870Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::pick_lazy_match` intentionally queries `find_best_match(abs_pos + 1, ...)` and `find_best_match(abs_pos + 2, ...)` *before* those positions are inserted into the hash-chain tables. This matches C zstd lazy evaluation semantics: pre-seeding those positions would allow a position to match against itself, producing invalid self-matches. Do not flag the absent pre-seeding as a bug.

@coderabbitai

coderabbitai bot commented Apr 6, 2026


@polaz Thanks for the heads-up! Acknowledged — with commit 8622344, Rust FastCOVER trains with the finalized content budget in both the report path and the Criterion benchmark path in zstd/benches/compare_ffi.rs, eliminating the inconsistency between the two ranges.


✏️ Learnings added
Learnt from: polaz
URL: https://github.com/structured-world/structured-zstd/pull/73

Timestamp: 2026-04-06T01:40:24.358Z
Learning: In `zstd/benches/compare_ffi.rs` (structured-world/structured-zstd), Rust FastCOVER trains with the post-finalization content budget in both the `REPORT_DICT_TRAIN` emission path (around lines 208-225) and the Criterion benchmark path (around lines 266-280). Both paths were aligned in commit 8622344. Do not flag these ranges as using inconsistent budget values.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-05T22:29:06.349Z
Learning: In `zstd/src/dictionary/fastcover.rs` (structured-world/structured-zstd), `FastCoverTuned.accel` is correctly populated in `train_fastcover_internal` using `accel: params.accel`, and `normalize_fastcover_params` clamps `accel` to `1..=10`. Do not flag `FastCoverTuned.accel` as unclamped or incorrectly set.
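The clamping described above can be sketched as a stand-alone function. This is an assumed behaviour illustration, not the crate's actual `normalize_fastcover_params` code; only the documented `1..=10` range is taken from the learning.

```rust
// Hypothetical stand-alone sketch of the accel clamping described above.
// The real logic lives in `normalize_fastcover_params`; this shows only the
// documented invariant: accel is forced into the inclusive range 1..=10.
fn normalize_accel(accel: u32) -> u32 {
    accel.clamp(1, 10)
}

fn main() {
    // Below-range, in-range, and above-range inputs.
    println!("{} {} {}", normalize_accel(0), normalize_accel(5), normalize_accel(99));
}
```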

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 73
File: zstd/src/dictionary/mod.rs:337-345
Timestamp: 2026-04-05T21:31:53.614Z
Learning: In `zstd/src/decoding/dictionary.rs` (structured-world/structured-zstd), `Dictionary::decode_dict()` parses the full zstd dictionary wire format: magic (4 bytes) → dict_id (4 bytes) → Huffman table (variable, consumed via `build_decoder`) → OF FSE table → ML FSE table → LL FSE table → 3 × u32 repeat offsets → raw content. This matches the layout produced by `finalize_raw_dict()` in `zstd/src/dictionary/mod.rs`. Do not flag these as mismatched layouts.
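The fixed-size prefix of that wire layout (magic, then dict_id) can be parsed without any entropy decoders. The sketch below is a hypothetical stand-alone helper, not the crate's `Dictionary::decode_dict`; it only demonstrates the documented header order, using the standard zstd dictionary magic `0xEC30A437` (little-endian), and treats a zero dict_id as invalid per the fail-fast contract noted elsewhere in this PR.

```rust
// Hypothetical sketch: parse only the fixed-size dictionary header prefix,
// magic (4 bytes, LE) followed by dict_id (4 bytes, LE). The variable-length
// Huffman/FSE sections that follow need real table decoders and are skipped.
const DICT_MAGIC: u32 = 0xEC30_A437; // zstd dictionary magic number

fn parse_dict_prefix(raw: &[u8]) -> Option<u32> {
    if raw.len() < 8 {
        return None;
    }
    let magic = u32::from_le_bytes(raw[0..4].try_into().ok()?);
    if magic != DICT_MAGIC {
        return None;
    }
    let dict_id = u32::from_le_bytes(raw[4..8].try_into().ok()?);
    // dict_id == 0 is reserved for "no dictionary", so reject it here.
    (dict_id != 0).then_some(dict_id)
}

fn main() {
    let mut raw = Vec::new();
    raw.extend_from_slice(&DICT_MAGIC.to_le_bytes());
    raw.extend_from_slice(&7u32.to_le_bytes());
    println!("{:?}", parse_dict_prefix(&raw)); // Some(7)
}
```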

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-05T07:19:26.176Z
Learning: In `cli/src/main.rs` (structured-world/structured-zstd), path-based compression calls `set_source_size_hint(source_size as u64)` (NOT `set_pledged_content_size`) for the filesystem metadata size. This is intentional: `metadata().len()` is unreliable for pseudo-files (`/proc/*`), FIFOs, and growing inputs, so the hint must remain advisory (only scaling matcher/window parameters) and must not impose a hard byte contract on the encoder. Do not flag the absence of `set_pledged_content_size` in path-based compression as a missing feature.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:479-523
Timestamp: 2026-04-02T19:19:44.750Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), the per-level roundtrip test suites (Better, Best) share a common `level_roundtrip_suite!` macro that expands the same 7-test matrix (compressible, random, multi-block, streaming, edge cases, repeat offsets, large literals) for each level via module-scoped macro invocations. Do not suggest re-inlining these as separate flat test functions — the macro was added intentionally to eliminate cross-level drift.
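The share-one-macro-per-level pattern described above can be illustrated with a minimal sketch. The names and values here are illustrative only, not the crate's actual `level_roundtrip_suite!` macro or its 7-test matrix; the point is that one `macro_rules!` body expands identically for each level, so the per-level suites cannot drift apart.

```rust
// Hypothetical sketch of a per-level suite macro: each invocation expands
// the same body into its own module, parameterized only by the level value.
macro_rules! level_suite {
    ($name:ident, $level:expr) => {
        mod $name {
            // In the real crate this would be a matrix of roundtrip tests;
            // here a single accessor stands in for the shared body.
            pub fn search_depth() -> u32 {
                $level
            }
        }
    };
}

level_suite!(better, 16);
level_suite!(best, 32);

fn main() {
    println!("{} {}", better::search_depth(), best::search_depth());
}
```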

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-05T08:17:53.856Z
Learning: In `zstd/src/encoding/frame_compressor.rs` (structured-world/structured-zstd), `set_source_size_hint` applies only to the payload (uncompressed source) bytes, not to dictionary priming. Dictionary priming is separate and does not inflate the advertised frame window size. Do not suggest including dictionary bytes in the size hint passed to `set_source_size_hint`.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 589-607: The test finalize_raw_dict_roundtrips_with_ffi_decoder
currently trains raw at the full 4096 target then finalizes into a 4096 budget,
which skips the real-world path; update the test to compute the
post-finalization content budget (use
finalized_content_budget(sample.as_slice(), 4096) or simply call
create_fastcover_dict_from_source(sample.as_slice(),
finalized_content_budget(...)) to obtain a raw trained at the smaller target)
and train the raw with that smaller target via fastcover::train_fastcover_raw
(or call create_fastcover_dict_from_source directly) before calling
finalize_raw_dict so the budgeted finalize path is exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2efb5ad9-7b5b-4a69-bc20-3602df8da914

📥 Commits

Reviewing files that changed from the base of the PR and between dbe0b38 and 8622344.

📒 Files selected for processing (2)
  • zstd/benches/compare_ffi.rs
  • zstd/src/dictionary/mod.rs

Comment on lines +589 to +607
fn finalize_raw_dict_roundtrips_with_ffi_decoder() {
    let sample = training_data();
    let raw = fastcover::train_fastcover_raw(
        sample.as_slice(),
        4096,
        FastCoverParams {
            k: 256,
            d: 8,
            f: 20,
            accel: 1,
        },
    );
    let finalized = finalize_raw_dict(
        raw.as_slice(),
        sample.as_slice(),
        4096,
        FinalizeOptions::default(),
    )
    .expect("finalization should succeed");

🧹 Nitpick | 🔵 Trivial

Exercise the budgeted finalize path in this interop test.

This test trains raw at the full 4096-byte target and then finalizes into the same 4096-byte budget, so finalize_raw_dict() can truncate and still pass the roundtrip. That bypasses the production path, where callers first compute the post-finalization content budget and train to that smaller raw size. Reusing create_fastcover_dict_from_source() here, or at least training against finalized_content_budget(...), would keep the size-alignment fix under regression coverage.





Successfully merging this pull request may close these issues.

feat: dictionary builder — FastCOVER algorithm and dictionary finalization
