Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit settings.
📝 Walkthrough
Adds a pure-Rust FastCOVER trainer and dictionary finalization APIs, adjusts COVER internals, adds Rust vs C-FFI dict-training benchmarks and a new bench target, and extends the CI benchmark runner/parser to emit and parse REPORT_DICT_TRAIN with a throughput-or-ms/iter fallback.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Bench Runner (script)
    participant Bench as Bench Binary
    participant Rust as Rust FastCOVER
    participant CFFI as C FFI Trainer
    participant Parser as Report Parser
    participant Report as Generated Report
    Runner->>Bench: run benches (--features dict_builder)
    Bench->>Rust: train_fastcover_raw_from_slice / optimize<br/>(measure rust_train_ms, rust_dict_bytes, score)
    Bench->>CFFI: call C trainer (via FFI)<br/>(measure ffi_train_ms, ffi_dict_bytes)
    Rust-->>Bench: rust_train_ms, rust_dict_bytes, score
    CFFI-->>Bench: ffi_train_ms, ffi_dict_bytes
    Bench-->>Runner: emit REPORT_DICT_TRAIN and other REPORT_* lines
    Runner->>Parser: parse raw output (REPORT_* including REPORT_DICT_TRAIN)
    Parser-->>Report: populate dictionary_training_rows and delta tables<br/>(use throughput or ms/iter fallback)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
Pull request overview
Adds a FastCOVER-based (pure Rust) dictionary-training path and integrates it into the Criterion compare_ffi benchmarks + CI benchmark reporting, so the repo can track Rust-vs-C dictionary training deltas alongside the existing compression/decompression matrix.
Changes:
- Implement FastCOVER raw dictionary training (plus tuning/optimization scaffolding) and expose new public APIs for training and finalizing dictionaries.
- Extend compare_ffi to benchmark/report dictionary training (dict-train) and update benchmark parsing/report generation to include REPORT_DICT_TRAIN.
- Require dict_builder for relevant benches and update docs to reflect the new benchmark stage and dictionary capabilities.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| zstd/src/dictionary/mod.rs | Exposes FastCOVER options/APIs, updates COVER trainer to operate over in-memory input, and adds dictionary finalization logic + tests. |
| zstd/src/dictionary/fastcover.rs | Adds FastCOVER raw trainer + optimizer and unit tests. |
| zstd/src/dictionary/cover.rs | Updates epoch-based segment selection and fixes k-mer scoring bounds for short segments. |
| zstd/Cargo.toml | Gates compare_ffi/new bench behind dict_builder and adds dict_builder_fastcover bench target. |
| zstd/benches/dict_builder_fastcover.rs | Adds a dedicated Criterion bench comparing COVER vs FastCOVER raw training. |
| zstd/benches/compare_ffi.rs | Adds dict-train stage, emits REPORT_DICT_TRAIN, and benchmarks Rust FastCOVER vs C trainer. |
| README.md | Updates dictionary generation documentation to include FastCOVER + finalization APIs. |
| BENCHMARKS.md | Documents the new dict-train stage and updates benchmark invocation flags. |
| .github/scripts/run-benchmarks.sh | Parses/emits dictionary training tables and adds a speed-delta fallback when throughput is unavailable. |
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/dictionary/cover.rs`:
- Around line 70-75: The chained method call creating `segments` is misformatted
and failing `cargo fmt --check`; update the expression that builds `segments`
(the `epoch.chunks(params.segment_size as usize).peekable()` chain) to follow
rustfmt style — either place each method on its own indented line with the dot
leading (e.g. `epoch.chunks(...).peekable()` reformatted across lines) or simply
run `cargo fmt` to apply the canonical formatting so the `segments` binding
compiles with rustfmt rules; ensure you adjust the `params.segment_size as
usize` cast placement if needed to satisfy formatting.
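The layout the comment asks for can be sketched as follows; `epoch` and `segment_size` are stand-ins here for the fields used in cover.rs, and the body is illustrative only:

```rust
// Illustrative sketch of the rustfmt-friendly chain layout: each method on
// its own line with a leading dot. `epoch` / `segment_size` mimic the names
// from the review comment, not the real cover.rs code.
fn count_segments(epoch: &[u8], segment_size: u64) -> usize {
    let segments = epoch
        .chunks(segment_size as usize)
        .peekable();
    segments.count()
}
```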
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 103-130: The function signature for train_fastcover_raw is
misformatted for rustfmt; reformat it to follow Rust style (wrap parameters
across lines and place the return type and opening brace on their own line) so
cargo fmt passes — e.g. format the pub fn train_fastcover_raw(sample: &[u8],
dict_size: usize, params: FastCoverParams) -> Vec<u8> declaration to a
rustfmt-friendly multi-line signature and keep the body calling
build_raw_dict(sample, dict_size, params) unchanged; running cargo fmt/rustfmt
after the change should verify the fix.
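A minimal stand-in showing the multi-line signature style the comment describes; the struct and body here are hypothetical placeholders, not the crate's real trainer:

```rust
// Hypothetical stand-in so the rustfmt multi-line signature style can be
// shown in isolation; the real FastCoverParams and trainer live in
// fastcover.rs and behave differently.
pub struct FastCoverParams {
    pub k: u32,
    pub d: u32,
}

pub fn train_fastcover_raw(
    sample: &[u8],
    dict_size: usize,
    _params: FastCoverParams,
) -> Vec<u8> {
    // Placeholder body: truncate the sample to the requested size.
    sample.iter().copied().take(dict_size).collect()
}
```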
In `@zstd/src/dictionary/mod.rs`:
- Around line 181-184: The code now allocates Vec::with_capacity(source_size)
and calls source.read_to_end(&mut all), which forces the entire training corpus
into memory; revert to a streaming approach or explicitly document the memory
constraint. Either replace read_to_end with a BufReader-based streaming read
(using BufReader around source and processing chunks iteratively instead of
collecting into all) or add a comment/Function-level documentation next to the
Vec/all/read_to_end code that states large corpora will be fully buffered and
recommend using a streaming API; reference the symbols Vec::with_capacity, all,
source.read_to_end, source_size and BufReader when making the change.
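The streaming alternative suggested above could look roughly like this; `process_chunk` is a hypothetical stand-in for whatever per-chunk work the trainer would do:

```rust
use std::io::{BufReader, Read};

// Sketch of the BufReader-based streaming read: process the corpus in
// fixed-size chunks instead of collecting everything with read_to_end.
fn stream_samples<R: Read>(
    source: R,
    mut process_chunk: impl FnMut(&[u8]),
) -> std::io::Result<u64> {
    let mut reader = BufReader::new(source);
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        process_chunk(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}
```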
- Around line 390-439: Both create_fastcover_raw_dict_from_source and
create_fastcover_dict_from_source duplicate reading the source, empty-check, and
branching on fastcover.optimize; extract the shared training logic into a helper
(e.g., train_fastcover_internal(sample: &[u8], dict_size: usize, options:
&FastCoverOptions) -> (Vec<u8>, FastCoverTuned)) that calls
fastcover::optimize_fastcover_raw or fastcover::train_fastcover_raw and returns
(raw_dict, tuned), then have both public functions read and validate the source,
call train_fastcover_internal, call finalize_raw_dict(final_raw.as_slice(),
sample.as_slice(), dict_size, finalize) as before, write output, and return the
tuned struct.
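The shape of that refactor can be sketched as below; the trainer calls are stubbed out here (in the real crate they would be `fastcover::optimize_fastcover_raw` / `fastcover::train_fastcover_raw`, and the structs have more fields):

```rust
// Sketch of the suggested shared helper: both public entry points read and
// validate the source, then funnel through this one function.
struct FastCoverOptions {
    optimize: bool,
}

struct FastCoverTuned {
    k: u32,
}

fn train_fastcover_internal(
    sample: &[u8],
    dict_size: usize,
    options: &FastCoverOptions,
) -> (Vec<u8>, FastCoverTuned) {
    // Placeholder training: truncate the sample to the budget.
    let raw: Vec<u8> = sample.iter().copied().take(dict_size).collect();
    if options.optimize {
        // stand-in for fastcover::optimize_fastcover_raw(...)
        (raw, FastCoverTuned { k: 256 })
    } else {
        // stand-in for fastcover::train_fastcover_raw(...)
        (raw, FastCoverTuned { k: 64 })
    }
}
```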
- Around line 30-50: Reorder the use/import block so crate-internal modules are
grouped and sorted consistently (e.g., BitWriter,
decoding::dictionary::MAGIC_NUM, fse::fse_encoder, huff0::HuffmanTable,
huff0::huff0_encoder::{HuffmanEncoder, HuffmanTable},
dictionary::reservoir::create_sample, cover::*, fastcover::... ) and place std
imports (boxed::Box, collections::..., fs::..., io::..., path::..., vec::Vec) in
their own group; move the create_sample import into the crate-internal group,
run cargo fmt to apply rustfmt ordering rules, and ensure symbols like
BitWriter, DICT_MAGIC_NUM, fse_encoder, HuffmanDecoderTable, HuffmanEncoder,
create_sample, and DEFAULT_D_CANDIDATES remain correctly referenced after
reordering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 37e2cb36-e61f-41fa-aa4c-1e5637716601
📒 Files selected for processing (9)
.github/scripts/run-benchmarks.sh, BENCHMARKS.md, README.md, zstd/Cargo.toml, zstd/benches/compare_ffi.rs, zstd/benches/dict_builder_fastcover.rs, zstd/src/dictionary/cover.rs, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/scripts/run-benchmarks.sh:
- Around line 244-273: The dict-train parsing block (dict_train_match ->
dictionary_training_rows) must not inherit compression's scenario_input_bytes
for throughput math; stop populating or referencing scenario_input_bytes for
dict-train rows and instead add a dedicated "training_bytes" field (set to None
if the report doesn't include it) so downstream throughput calculations
skip/avoid computing rust_bytes_per_sec/ffi_bytes_per_sec for dict-train. Update
the downstream throughput/printing logic to check for "training_bytes" (or its
absence) before computing bytes/sec and only use REPORT_DICT_TRAIN-provided
training_bytes if you add that field; if you need absolute throughput later, add
training_bytes to REPORT_DICT_TRAIN and consume it here. Ensure changes
reference dict_train_match, dictionary_training_rows, and the throughput
computation code that reads scenario_input_bytes.
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 128-179: optimize_fastcover_raw currently panics for 1-byte
samples because split_idx.clamp(1, sample.len().saturating_sub(1)) can produce
an invalid range and it may return an empty/zeroed winner when all candidates
score 0; modify optimize_fastcover_raw to first handle tiny samples (e.g., if
sample.len() < 2 return early or set split_idx = 1 only when sample.len() > 1)
so the clamp bounds are valid, and seed the "best" and "best_dict" with the
first evaluated candidate inside the nested loops (compute
params/build_raw_dict/coverage_score for the first f/d/k triple and assign
regardless of score) so there is always a concrete dict and params returned; add
regression tests calling train_fastcover_raw_from_slice()/optimize_fastcover_raw
with a 1-byte sample and a short sample where eval split is smaller than the
minimum d to ensure no panic and a non-empty dict is produced.
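The two fixes suggested there (guard tiny samples before clamping, and seed the winner with the first candidate) can be sketched like this; `build_dict` and the scoring expression are hypothetical stand-ins for the real trainer:

```rust
// Sketch of the suggested guard + seeding logic. Bail out on tiny samples
// before the clamp can produce an inverted range, and seed the winner with
// the first candidate so an all-zero scoring pass still returns a dict.
fn optimize_sketch(sample: &[u8], candidates: &[u32]) -> Option<(u32, Vec<u8>)> {
    if sample.len() < 2 {
        // Too small to split into train/eval halves: use everything.
        return candidates.first().map(|&k| (k, sample.to_vec()));
    }
    let split_idx = (sample.len() / 2).clamp(1, sample.len() - 1);
    let (train, _eval) = sample.split_at(split_idx);
    let mut best: Option<(u64, u32, Vec<u8>)> = None;
    for &k in candidates {
        let dict = train.to_vec(); // stand-in for build_dict(train, k)
        let score = dict.len() as u64 % (k as u64 + 1); // stand-in score
        // First candidate always seeds `best`, even with score 0.
        if best.as_ref().map_or(true, |(b, _, _)| score > *b) {
            best = Some((score, k, dict));
        }
    }
    best.map(|(_, k, dict)| (k, dict))
}
```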
In `@zstd/src/dictionary/mod.rs`:
- Around line 327-330: Reject dict_id == 0 before serializing: after computing
dict_id (from options.dict_id or derive_dict_id(raw_content)) check if dict_id
== 0 and return an early error instead of writing to out; mirror the existing
behavior used by FrameCompressor::set_dictionary which returns
DictionaryDecodeError::ZeroDictionaryId so propagate the same error on
encode/serialization paths (check both when options.dict_id is provided and when
derived via derive_dict_id).
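The validation being requested could look roughly like this; `derive_dict_id` here is a hypothetical stand-in for the crate's derivation, and the error type is simplified to `io::Error` rather than the crate's `DictionaryDecodeError::ZeroDictionaryId`:

```rust
use std::io;

// Sketch of the suggested check: reject a zero dictionary ID before any
// bytes are written, whether it was supplied explicitly or derived.
fn derive_dict_id(raw_content: &[u8]) -> u32 {
    // Placeholder derivation, not the crate's real algorithm.
    raw_content
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

fn resolve_dict_id(explicit: Option<u32>, raw_content: &[u8]) -> io::Result<u32> {
    let dict_id = explicit.unwrap_or_else(|| derive_dict_id(raw_content));
    if dict_id == 0 {
        // Mirrors FrameCompressor::set_dictionary rejecting a zero ID.
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "dictionary id must be non-zero",
        ));
    }
    Ok(dict_id)
}
```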
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 868b26a4-86cf-42d7-b1d2-eab09d92da77
📒 Files selected for processing (5)
.github/scripts/run-benchmarks.sh, zstd/benches/compare_ffi.rs, zstd/src/dictionary/cover.rs, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
The benchmark result for this commit is worse than the previous result by more than the 1.15 threshold.
| Benchmark suite | Current: c30e9ba | Previous: b280757 | Ratio |
|---|---|---|---|
| compress/better/small-1k-random/matrix/c_ffi | 0.11 ms | 0.092 ms | 1.20 |
| compress/best/small-1k-random/matrix/pure_rust | 0.296 ms | 0.249 ms | 1.19 |
| compress/default/small-4k-log-lines/matrix/c_ffi | 0.021 ms | 0.018 ms | 1.17 |
| compress/better/small-4k-log-lines/matrix/pure_rust | 0.152 ms | 0.124 ms | 1.23 |
| compress/better/small-4k-log-lines/matrix/c_ffi | 0.099 ms | 0.077 ms | 1.29 |
| compress/best/small-4k-log-lines/matrix/pure_rust | 0.258 ms | 0.203 ms | 1.27 |
| compress/best/small-4k-log-lines/matrix/c_ffi | 0.368 ms | 0.28 ms | 1.31 |
| compress/fastest/low-entropy-1m/matrix/c_ffi | 0.189 ms | 0.142 ms | 1.33 |
| compress/fastest/large-log-stream/matrix/c_ffi | 2.726 ms | 2.149 ms | 1.27 |
| compress/best/large-log-stream/matrix/c_ffi | 15.06 ms | 12.903 ms | 1.17 |
| decompress/fastest/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
| decompress/fastest/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.027 ms | 1.30 |
| decompress/default/high-entropy-1m/c_stream/matrix/c_ffi | 0.037 ms | 0.027 ms | 1.37 |
| decompress/best/high-entropy-1m/c_stream/matrix/c_ffi | 0.172 ms | 0.038 ms | 4.53 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/scripts/run-benchmarks.sh:
- Around line 293-298: The current check for missing dictionary_training_rows
only prints a warning; change it to fail fast by replacing the print-only
behavior in the if not dictionary_training_rows: block so the script exits with
a non-zero status (e.g., call sys.exit(1)) after writing the same error message
to stderr; locate and update the if not dictionary_training_rows: conditional in
.github/scripts/run-benchmarks.sh to preserve the stderr output and then exit to
ensure CI fails when REPORT_DICT_TRAIN rows are missing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ca663b82-0133-4187-8e96-f731d78116f3
📒 Files selected for processing (3)
.github/scripts/run-benchmarks.sh, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 35-41: The code currently clamps table bits inside
clamp_table_bits and build_frequency_table but trains, scores, and records using
the original raw params (e.g., build_raw_dict, params.d, k, FastCoverTuned), so
normalize/replace params once up front: in the entrypoints (including the
non-optimized wrapper in mod.rs) compute normalized params.d (clamped to 4..32),
params.f (clamped via clamp_table_bits to 8..20) and adjust k to match the
normalized d, then pass these normalized values into build_frequency_table,
build_raw_dict, the optimizer scoring path, and when constructing FastCoverTuned
so that training, evaluation, and reporting all use the same clamped/adjusted
parameters.
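The "normalize once up front" approach could be sketched as follows; the clamp ranges (d in 4..=32, f in 8..=20) come from the review comment and are not verified against the crate, and the struct is a simplified stand-in:

```rust
// Sketch of a single normalization entry point: clamp the params once and
// use the same struct for training, scoring, and FastCoverTuned reporting.
#[derive(Clone, Copy, PartialEq, Debug)]
struct FastCoverParams {
    k: u32,
    d: u32,
    f: u32,
}

fn normalize_fastcover_params(mut p: FastCoverParams) -> FastCoverParams {
    p.d = p.d.clamp(4, 32); // d range from the review comment
    p.f = p.f.clamp(8, 20); // clamp_table_bits equivalent
    p.k = p.k.max(p.d); // k must cover at least one d-mer
    p
}
```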
In `@zstd/src/dictionary/mod.rs`:
- Around line 405-416: train_fastcover_raw_from_slice currently only rejects
empty sample slices but can return Ok((Vec::new(), tuned)) for tiny samples;
update this function to treat a produced empty dictionary as invalid when
dict_size > 0: after calling train_fastcover_internal (or by checking the dict
produced there), return an Err(io::ErrorKind::InvalidInput, ...) if
dict.is_empty() && dict_size > 0, or alternatively implement the same
tiny-source fallback logic used by create_raw_dict_from_source so small samples
fall back to the tiny-source path; reference train_fastcover_raw_from_slice,
train_fastcover_internal, create_fastcover_raw_dict_from_source,
create_fastcover_dict_from_source, and finalize_raw_dict when adding the check
or fallback.
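The empty-dictionary guard being asked for could look like this; `train` stands in for `train_fastcover_internal`, and the error type is simplified:

```rust
use std::io;

// Sketch of the suggested validation: treat a produced empty dictionary as
// an error whenever a non-zero dict_size was requested.
fn train_checked(
    sample: &[u8],
    dict_size: usize,
    train: impl Fn(&[u8], usize) -> Vec<u8>,
) -> io::Result<Vec<u8>> {
    let dict = train(sample, dict_size);
    if dict.is_empty() && dict_size > 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "sample too small to train a dictionary",
        ));
    }
    Ok(dict)
}
```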
- Around line 337-345: finalize_raw_dict() now writes Huffman/FSE tables before
the repeat offsets, but Dictionary::decode_dict() still assumes offsets are at
raw[8..20] and content at raw[20..], causing mis-decoding; update
Dictionary::decode_dict() (and any caller like
FrameCompressor::set_dictionary()) to parse the new wire format by first
reading/consuming the serialized Huffman and three FSE tables (as produced by
serialize_huffman_table and serialize_fse_table) before reading the three repeat
offsets and the remaining content, or alternatively gate finalize_raw_dict()
behind a feature/API flag until the in-crate parser is updated so local
consumers do not receive finalized dictionaries with mismatched layout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ce3585f0-341a-4ce8-a310-d180922a6582
📒 Files selected for processing (2)
zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
This reverts commit 5f15353.
- extend compare_ffi with dict-train stage and REPORT_DICT_TRAIN lines
- parse and render dictionary training section in benchmark reports
- require dict_builder feature for compare_ffi benchmark and pipeline command
- document new benchmark stage and speed-delta fallback behavior
- treat source_size as hint and document in-memory training behavior
- fix epoch sizing units, bound huffman stats sampling, and share fastcover training helper
- add k-candidate fallback + regression tests
- rename dict-train delta field to match C/Rust ratio semantics
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 427-435: Add a short doc comment to clearly state that these
public Read-based helpers fully buffer the training corpus into memory before
training: update the doc comment for create_fastcover_raw_dict_from_source
(which calls source.read_to_end and then train_fastcover_raw_from_slice) and the
other public Read-based helper that likewise calls read_to_end (the sibling
function around lines 441-449) to say something like "This function fully
buffers the entire training corpus into memory (uses read_to_end) and may
consume large amounts of RAM for large inputs; callers should stream or use
slice-based APIs for very large corpora." Ensure both functions' public docs
contain that exact memory-buffering warning.
- Around line 450-453: The trainer currently calls
train_fastcover_raw_from_slice(..., dict_size, ...) but finalize_raw_dict(...,
dict_size, ...) only allows dict_size - header_bytes for raw content, so
training targets the wrong budget; change the training budget to use the actual
post-finalization content budget by computing the finalized content capacity the
same way finalize_raw_dict does (reuse its header/table sizing logic), e.g.,
add/implement finalized_content_budget(...) and pass that result into
train_fastcover_raw_from_slice instead of dict_size so the raw trainer only
optimizes the bytes that will survive finalization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 61ab7123-92f2-44a7-b8cb-9ad833e2561f
📒 Files selected for processing (9)
.github/scripts/run-benchmarks.sh, BENCHMARKS.md, README.md, zstd/Cargo.toml, zstd/benches/compare_ffi.rs, zstd/benches/dict_builder_fastcover.rs, zstd/src/dictionary/cover.rs, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
Actionable comments posted: 3
♻️ Duplicate comments (3)
zstd/src/dictionary/mod.rs (3)
450-453: ⚠️ Potential issue | 🟠 Major: Train against the post-finalization content budget.
This path tunes raw content with the full `dict_size`, then `finalize_raw_dict()` removes header/table bytes and may reject the request anyway. Successful calls ship a truncated dictionary that FastCOVER never optimized, and impossible budgets still pay the full optimization cost.
Possible direction:

```diff
-    let (raw_dict, tuned) =
-        train_fastcover_raw_from_slice(sample.as_slice(), dict_size, fastcover)?;
+    let content_budget = finalized_content_budget(sample.as_slice(), dict_size, finalize)?;
+    let (raw_dict, tuned) =
+        train_fastcover_raw_from_slice(sample.as_slice(), content_budget, fastcover)?;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 450 - 453, the code currently calls train_fastcover_raw_from_slice with the full dict_size then finalizes, which wastes work and can exceed the post-finalization content budget; change the flow to compute the actual content budget to be used for training (the maximum number of payload bytes allowed after finalize_raw_dict removes headers/tables) before training, check that this content_budget is positive and return the appropriate error if it is impossible, then call train_fastcover_raw_from_slice(sample.as_slice(), content_budget, fastcover) to produce raw_dict tuned to the post-finalize budget and finally call finalize_raw_dict(raw_dict.as_slice(), sample.as_slice(), dict_size, finalize) as before; reference the functions train_fastcover_raw_from_slice and finalize_raw_dict and add the pre-check for an impossible content budget to avoid wasted work.
337-340: ⚠️ Potential issue | 🟠 Major: Sequence entropy tables are still hard-coded.
`finalize_raw_dict()` always emits `default_of_table()`, `default_ml_table()`, and `default_ll_table()`, so only the Huffman side is corpus-trained. Finalized dictionaries will systematically trail C finalization quality because they never learn sample-specific sequence stats.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 337 - 340, finalize_raw_dict() currently emits hard-coded sequence FSE tables (fse_encoder::default_of_table(), default_ml_table(), default_ll_table()), so sequence statistics from sample_data are never used; replace those default tables with FSE tables built from the sample sequence statistics before calling serialize_fse_table. Locate the block where out.extend_from_slice(serialize_fse_table(...)) is called and: compute sequence counts/stats from sample_data (the same sample source used for serialize_huffman_table), build FSE encoder tables from those counts (instead of using default_of_table/default_ml_table/default_ll_table), and pass the generated tables into serialize_fse_table so finalized dictionaries include sample-trained OF/ML/LL sequence tables.
188-199: ⚠️ Potential issue | 🟠 Major: Don't panic on I/O failures in this public builder.
`create_raw_dict_from_source` crashes on read/write errors (lines 193, 202, 257) via `.expect()`, while the newer `create_fastcover_raw_dict_from_source` and `create_fastcover_dict_from_source` properly propagate errors as `io::Result`. Change the return type to `io::Result<()>` and use `?` instead of `.expect()` for consistent error handling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 188 - 199, the public builder create_raw_dict_from_source currently panics on I/O errors via .expect() calls (e.g., the read_to_end and write_all sites); change its signature to return io::Result<()> and replace those .expect(...) calls with the ? operator so I/O errors are propagated; also update any other remaining .expect() in create_raw_dict_from_source (including the later write at the tiny-dictionary path) to use ? and ensure callers handle the propagated Result.
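The `.expect()`-to-`?` change described above can be sketched with a simplified Read-to-Write pipeline (the real builder does far more between the read and the write):

```rust
use std::io::{self, Read, Write};

// Sketch of the error-propagation fix: return io::Result<()> and use `?`
// instead of panicking with .expect() on I/O failures.
fn copy_dict(mut source: impl Read, mut out: impl Write) -> io::Result<()> {
    let mut all = Vec::new();
    source.read_to_end(&mut all)?; // was: .expect("...")
    out.write_all(&all)?; // was: .expect("...")
    Ok(())
}
```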
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 197-200: The FFI and Rust trainers are being fed different
corpora: Rust uses a flattened training_blob while the FFI calls still use
sample_refs (preserving original boundaries), so make them identical by choosing
one canonical representation and using it for both trainers; e.g., keep
training_blob (Vec<u8>) and compute sample_offsets/lengths or create slices that
point into training_blob (replace sample_refs with slices derived from
training_blob) and pass those same slices/offsets to the FFI dict-train calls
(the variables to change are training_samples, sample_refs, training_blob,
total_training_bytes and the FFI dict-train invocations near where sample_refs
are passed). Ensure total_training_bytes is computed from the unified
representation and update both trainer call sites to use that unified data.
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 39-43: normalize_fastcover_params currently clamps d/k/f but
leaves accel unchecked so callers and the optimizer can use out-of-range accel
values; update normalize_fastcover_params(FastCoverParams) to clamp params.accel
to 1..=10 (same pattern as params.d and params.f using clamp_table_bits) and
return the clamped struct, then ensure any other normalization or tuning paths
that construct FastCoverTuned or score using accel (the optimizer/scoring code
and the code that builds FastCoverTuned) use that clamped accel value instead of
the raw input so the tuned result never contains accel outside 1..=10.
In `@zstd/src/dictionary/mod.rs`:
- Around line 385-398: The FastCoverParams are normalized into params via
normalize_fastcover_params(FastCoverParams { ... }) but the FastCoverTuned
struct still uses options.accel; change it to use the normalized params.accel
instead. Locate the block that creates params and returns
(fastcover::train_fastcover_raw(...), FastCoverTuned { ... }) and replace the
accel field value from options.accel to params.accel so FastCoverTuned
consistently reflects the normalized parameters (refer to
normalize_fastcover_params, FastCoverParams, params, options.accel, and
FastCoverTuned).
---
Duplicate comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 450-453: The code currently calls train_fastcover_raw_from_slice
with the full dict_size then finalizes, which wastes work and can exceed the
post-finalization content budget; change the flow to compute the actual content
budget to be used for training (the maximum number of payload bytes allowed
after finalize_raw_dict removes headers/tables) before training, check that this
content_budget is positive and return the appropriate error if it is impossible,
then call train_fastcover_raw_from_slice(sample.as_slice(), content_budget,
fastcover) to produce raw_dict tuned to the post-finalize budget and finally
call finalize_raw_dict(raw_dict.as_slice(), sample.as_slice(), dict_size,
finalize) as before; reference the functions train_fastcover_raw_from_slice and
finalize_raw_dict and add the pre-check for an impossible content budget to
avoid wasted work.
- Around line 337-340: finalize_raw_dict() currently emits hard-coded sequence
FSE tables (fse_encoder::default_of_table(), default_ml_table(),
default_ll_table()), so sequence statistics from sample_data are never used;
replace those default tables with FSE tables built from the sample sequence
statistics before calling serialize_fse_table. Locate the block where
out.extend_from_slice(serialize_fse_table(...)) is called and: compute sequence
counts/stats from sample_data (the same sample source used for
serialize_huffman_table), build FSE encoder tables from those counts (instead of
using default_of_table/default_ml_table/default_ll_table), and pass the
generated tables into serialize_fse_table so finalized dictionaries include
sample-trained OF/ML/LL sequence tables.
- Around line 188-199: The public builder create_raw_dict_from_source currently
panics on I/O errors via .expect() calls (e.g., the read_to_end and write_all
sites) — change its signature to return io::Result<()> and replace those
.expect(...) calls with the ? operator so I/O errors are propagated; also update
any other remaining .expect() in create_raw_dict_from_source (including the
later write at the tiny-dictionary path) to use ? and ensure callers handle the
propagated Result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 076babeb-74dc-441d-a1f2-a9c33c1d779a
📒 Files selected for processing (9)
.github/scripts/run-benchmarks.sh, BENCHMARKS.md, README.md, zstd/Cargo.toml, zstd/benches/compare_ffi.rs, zstd/benches/dict_builder_fastcover.rs, zstd/src/dictionary/cover.rs, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
Actionable comments posted: 1
♻️ Duplicate comments (2)
zstd/src/dictionary/mod.rs (1)
479-492: 🧹 Nitpick | 🔵 Trivial: Use the normalized `accel` in `FastCoverTuned`.
`k`, `d`, and `f` come from `normalize_fastcover_params(...)`, but `accel` still comes from the raw option. That makes the public `FastCoverTuned` result inconsistent with the parameters actually passed to `fastcover::train_fastcover_raw(...)`.
♻️ Proposed fix

```diff
 (
     fastcover::train_fastcover_raw(sample, dict_size, params),
     FastCoverTuned {
         k: params.k,
         d: params.d,
         f: params.f,
-        accel: options.accel,
+        accel: params.accel,
         score: 0,
     },
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 479 - 492, the FastCoverTuned returned struct uses options.accel instead of the normalized accel from normalize_fastcover_params, causing a mismatch with the params passed to fastcover::train_fastcover_raw; update the FastCoverTuned initializer to set accel: params.accel (and ensure FastCoverParams/normalize_fastcover_params provide accel) so k, d, f, and accel in FastCoverTuned reflect the actual normalized params used for training.
zstd/benches/compare_ffi.rs (1)
197-200: ⚠️ Potential issue | 🟠 Major: Benchmark the same corpus model on both sides.
The Rust path trains on one contiguous `training_blob`, while the C path still trains on `sample_refs` with explicit sample boundaries. That changes both the work performed and the resulting dictionary, so the reported `dict-train` delta is not comparing like-for-like inputs.
Also applies to: 206-208, 221-223, 251-267
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 235-243: The REPORT_DICT_TRAIN reported training_bytes is
incorrect because emit_dictionary_training_report is being passed scenario.len()
even though split_training_samples() may truncate samples; update the calls to
compute the actual byte count from the samples returned by
split_training_samples (e.g., sum of lengths of the training sample slices) and
pass that sum as the training_bytes argument to emit_dictionary_training_report
(change the calls around emit_dictionary_training_report where
rust_dictionary/ffi_dictionary are computed; also apply the same fix to the
other occurrence around lines 419-431). Ensure you reference
split_training_samples, emit_dictionary_training_report, and REPORT_DICT_TRAIN
when locating and fixing the code.
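The byte-count fix described there amounts to summing the actual training slices; a minimal sketch, with `training_bytes` as a hypothetical helper name:

```rust
// Sketch of the fix: report the bytes actually fed to the trainers by
// summing the split sample slices, instead of passing scenario.len().
fn training_bytes(samples: &[&[u8]]) -> usize {
    samples.iter().map(|s| s.len()).sum()
}
```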
---
Duplicate comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 479-492: The FastCoverTuned returned struct uses options.accel
instead of the normalized accel from normalize_fastcover_params, causing a
mismatch with the params passed to fastcover::train_fastcover_raw; update the
FastCoverTuned initializer to set accel: params.accel (and ensure
FastCoverParams/normalize_fastcover_params provide accel) so k, d, f, and accel
in FastCoverTuned reflect the actual normalized params used for training.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7e2ed0ba-b8ff-4cba-91d5-6cf52e42bc1f
📒 Files selected for processing (3)
- .github/scripts/run-benchmarks.sh
- zstd/benches/compare_ffi.rs
- zstd/src/dictionary/mod.rs
@coderabbitai Re duplicate about …
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 206-246: The benchmark compares non-equivalent outputs:
zstd::dict::from_samples returns a finalized dictionary while
train_fastcover_raw_from_slice returns only raw content, so either finalize the
Rust result before measuring and reporting (e.g., pass
rust_dictionary/rust_tuned through the same finalization used by
create_fastcover_dict_from_source or a slice-based wrapper so
rust_dictionary.len() and rust_train_ms reflect the full finalized dict), or
change the emitted metrics in emit_dictionary_training_report to clearly label
these as "raw-content training" and report raw vs finalized sizes separately
(update the fields/labels for rust_dict_bytes vs ffi_dict_bytes and any timing
fields accordingly). Ensure you update the code paths around
train_fastcover_raw_from_slice, rust_dictionary, rust_tuned,
zstd::dict::from_samples and emit_dictionary_training_report so the compared
artifacts are equivalent or the report explicitly notes raw vs finalized.
In `@zstd/benches/dict_builder_fastcover.rs`:
- Around line 39-71: Hoist the FastCoverOptions construction out of the b.iter()
closures so the benches don't include allocation/setup cost: create one
FastCoverOptions instance for the optimized bench (e.g., fastcover_opts =
FastCoverOptions::default()) and one for the fixed bench (e.g.,
fastcover_fixed_opts = FastCoverOptions { optimize:false, accel:4, k:256, d:8,
f:20, ..FastCoverOptions::default() }) before calling c.bench_function, then
pass &fastcover_opts / &fastcover_fixed_opts into
create_fastcover_raw_dict_from_source inside the closures (references or clones
as appropriate) instead of building them inside b.iter; reference symbols:
FastCoverOptions, create_fastcover_raw_dict_from_source, the bench names
"dict_builder/fastcover_raw_opt" and "dict_builder/fastcover_raw_fixed".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a8629e61-02e4-4fd1-b930-f0275077209a
📒 Files selected for processing (5)
- zstd/benches/compare_ffi.rs
- zstd/benches/dict_builder_fastcover.rs
- zstd/src/dictionary/fastcover.rs
- zstd/src/dictionary/mod.rs
- zstd/src/encoding/frame_compressor.rs
Actionable comments posted: 1
♻️ Duplicate comments (2)
zstd/benches/compare_ffi.rs (1)
208-225: ⚠️ Potential issue | 🟠 Major

Benchmark against the post-finalization content budget, not the full `dict_size`.

Lines 209-224 and lines 268-279 still train FastCOVER with the full `dict_size`, then `finalize_raw_dict()` trims that raw dictionary down to `dict_size - header_bytes`. That means `rust_train_ms` includes work spent optimizing bytes that can never survive into the emitted artifact, so the new dict-train delta is still skewed. Please reuse the same post-finalization budget logic as `create_fastcover_dict_from_source()` before calling `train_fastcover_raw_from_slice()` in both the report path and the Criterion loop.

Also applies to: 266-280
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/benches/compare_ffi.rs` around lines 208 - 225, The benchmark is training FastCOVER with the pre-finalization dict_size but then finalize_raw_dict trims header bytes, so training work includes bytes that cannot appear in output; compute the post-finalization budget the same way as create_fastcover_dict_from_source (i.e., subtract header/metadata overhead to get effective_dict_size) and pass that effective size into train_fastcover_raw_from_slice instead of dict_size in both the report path (where rust_train_started is measured) and the Criterion loop; ensure the same effective size is used when calling finalize_raw_dict to keep training and finalization budgets consistent (refer to train_fastcover_raw_from_slice, finalize_raw_dict, and create_fastcover_dict_from_source to mirror its budget computation).

zstd/src/dictionary/mod.rs (1)
301-319: ⚠️ Potential issue | 🟠 Major

LL/ML/OF tables are still trained from the wrong symbol space.

Lines 307-316 collapse raw corpus bytes into the FSE code space with `byte % (max_symbol + 1)`, and line 333 builds the tables from that synthetic stream. That makes the tables data-dependent, but not sequence-dependent: the resulting histograms still have no relationship to the compressor's actual literal-length, match-length, or offset-code frequencies. The dictionaries remain parseable, but this finalized path will still miss the intended compression-quality parity. Please derive these histograms from real emitted LL/ML/OF codes before calling `build_table_from_data`.

Also applies to: 321-335
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 301 - 319, The current bounded_fse_symbols function and its use to create FSE tables (via build_table_from_data) wrongly map raw corpus bytes into the FSE symbol space (byte % (max_symbol+1)), producing histograms that do not reflect actual emitted LL/ML/OF codes; instead, compute the histograms from real emitted literal-length, match-length, and offset-code sequences as produced by the compressor/parser and pass those symbol streams into build_table_from_data. Modify the code paths that call bounded_fse_symbols (and any place that feeds build_table_from_data for LL/ML/OF tables) to generate symbol streams by running the corpus through the same parsing/emission logic that outputs LL/ML/OF codes, then feed those emitted code sequences into build_table_from_data so tables are trained on true sequence-dependent frequencies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 208-219: The benchmark removed the Rust FastCOVER score: keep the
returned rust_tuned from train_fastcover_raw_from_slice (the
Ok((rust_raw_dictionary, _rust_tuned)) pattern) and extract its score
(rust_fastcover_score) to store in DictTrainingMetrics and include it when
emitting REPORT_DICT_TRAIN; specifically, update the pattern that currently
discards rust_tuned to bind it (e.g., rust_tuned), add the rust_fastcover_score
field into the DictTrainingMetrics construction (where metrics are created
around lines ~247-258), and ensure the REPORT_DICT_TRAIN output (around lines
~441-463) appends the rust_fastcover_score field so the external script parsing
REPORT_DICT_TRAIN sees the trailing score value.
---
Duplicate comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 208-225: The benchmark is training FastCOVER with the
pre-finalization dict_size but then finalize_raw_dict trims header bytes, so
training work includes bytes that cannot appear in output; compute the
post-finalization budget the same way as create_fastcover_dict_from_source
(i.e., subtract header/metadata overhead to get effective_dict_size) and pass
that effective size into train_fastcover_raw_from_slice instead of dict_size in
both the report path (where rust_train_started is measured) and the Criterion
loop; ensure the same effective size is used when calling finalize_raw_dict to
keep training and finalization budgets consistent (refer to
train_fastcover_raw_from_slice, finalize_raw_dict, and
create_fastcover_dict_from_source to mirror its budget computation).
In `@zstd/src/dictionary/mod.rs`:
- Around line 301-319: The current bounded_fse_symbols function and its use to
create FSE tables (via build_table_from_data) wrongly map raw corpus bytes into
the FSE symbol space (byte % (max_symbol+1)), producing histograms that do not
reflect actual emitted LL/ML/OF codes; instead, compute the histograms from real
emitted literal-length, match-length, and offset-code sequences as produced by
the compressor/parser and pass those symbol streams into build_table_from_data.
Modify the code paths that call bounded_fse_symbols (and any place that feeds
build_table_from_data for LL/ML/OF tables) to generate symbol streams by running
the corpus through the same parsing/emission logic that outputs LL/ML/OF codes,
then feed those emitted code sequences into build_table_from_data so tables are
trained on true sequence-dependent frequencies.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ddbef425-cf43-4e2f-aa78-7f7b152262e2
📒 Files selected for processing (2)
- zstd/benches/compare_ffi.rs
- zstd/src/dictionary/mod.rs
@coderabbitai Re: duplicate item about post-finalization content budget in zstd/benches/compare_ffi.rs (ranges 208-225 and 266-280) — fixed in commit 8622344. Rust FastCOVER now trains with finalized content budget in both report and Criterion paths. |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 589-607: The test finalize_raw_dict_roundtrips_with_ffi_decoder
currently trains raw at the full 4096 target then finalizes into a 4096 budget,
which skips the real-world path; update the test to compute the
post-finalization content budget (use
finalized_content_budget(sample.as_slice(), 4096) or simply call
create_fastcover_dict_from_source(sample.as_slice(),
finalized_content_budget(...)) to obtain a raw trained at the smaller target)
and train the raw with that smaller target via fastcover::train_fastcover_raw
(or call create_fastcover_dict_from_source directly) before calling
finalize_raw_dict so the budgeted finalize path is exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2efb5ad9-7b5b-4a69-bc20-3602df8da914
📒 Files selected for processing (2)
- zstd/benches/compare_ffi.rs
- zstd/src/dictionary/mod.rs
```rust
fn finalize_raw_dict_roundtrips_with_ffi_decoder() {
    let sample = training_data();
    let raw = fastcover::train_fastcover_raw(
        sample.as_slice(),
        4096,
        FastCoverParams {
            k: 256,
            d: 8,
            f: 20,
            accel: 1,
        },
    );
    let finalized = finalize_raw_dict(
        raw.as_slice(),
        sample.as_slice(),
        4096,
        FinalizeOptions::default(),
    )
    .expect("finalization should succeed");
```
🧹 Nitpick | 🔵 Trivial
Exercise the budgeted finalize path in this interop test.
This test trains raw at the full 4096-byte target and then finalizes into the same 4096-byte budget, so finalize_raw_dict() can truncate and still pass the roundtrip. That bypasses the production path, where callers first compute the post-finalization content budget and train to that smaller raw size. Reusing create_fastcover_dict_from_source() here, or at least training against finalized_content_budget(...), would keep the size-alignment fix under regression coverage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@zstd/src/dictionary/mod.rs` around lines 589 - 607, The test
finalize_raw_dict_roundtrips_with_ffi_decoder currently trains raw at the full
4096 target then finalizes into a 4096 budget, which skips the real-world path;
update the test to compute the post-finalization content budget (use
finalized_content_budget(sample.as_slice(), 4096) or simply call
create_fastcover_dict_from_source(sample.as_slice(),
finalized_content_budget(...)) to obtain a raw trained at the smaller target)
and train the raw with that smaller target via fastcover::train_fastcover_raw
(or call create_fastcover_dict_from_source directly) before calling
finalize_raw_dict so the budgeted finalize path is exercised.
Summary
- `dict-train` benchmark stage to `compare_ffi` for Rust FastCOVER vs C dictionary training
- `REPORT_DICT_TRAIN` lines in `.github/scripts/run-benchmarks.sh`
- `accel` consistently in tuning/scoring/output params
- `io::Result<()>` instead of panicking
- `benchmark-report.md` and canonical deltas in `benchmark-delta.md`/`json`
- `dict_builder` for `compare_ffi` bench and align benchmark docs

Validation

- `cargo fmt --all -- --check`
- `cargo clippy --all-targets --features hash,std,dict_builder -- -D warnings`
- `cargo nextest run --workspace --features hash,std,dict_builder`
- `cargo build --workspace --features hash,std,dict_builder`
- `cargo test --doc --workspace --features hash,std,dict_builder`

Same-run numbers (Rust FastCOVER vs C)

- `decodecorpus-z000033`: 2.703 ms vs 23.342 ms (8.6371x)
- `small-10k-random`: 0.079 ms vs 0.702 ms (8.8451x)
- `small-4k-log-lines`: 0.057 ms vs 0.217 ms (3.7868x)

Closes #25
Summary by CodeRabbit
New Features
Documentation
Bug Fixes
Tests