chore: massive rework of the API into blocks#73
chore: massive rework of the API into blocks#73nyurik wants to merge 22 commits intofast-pack:mainfrom
Conversation
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
CommanderStorm
left a comment
There was a problem hiding this comment.
LGTM
Performance is also the same
First step of moving towards unified C++ and Rust model of codecs done in #73
There was a problem hiding this comment.
Pull request overview
This PR restructures the crate’s public API around a type-safe fixed-block (BlockCodec) interface plus an arbitrary-length (AnyLenCodec) interface, and introduces a Rust CompositeCodec to match the C++ “block + tail” usage pattern while significantly reducing internal code surface.
Changes:
- Introduces/standardizes
BlockCodec+AnyLenCodecAPIs and addsCompositeCodec(withFastPFor256/FastPFor128as composites). - Refactors the Rust FastPFor implementation to be const-generic over block size (128/256) and updates VariableByte/JustCopy to
AnyLenCodec. - Updates tests, benches, fuzz targets, docs, and CI; changes default features to Rust-only.
Reviewed changes
Copilot reviewed 42 out of 42 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/encode_paths.rs | Updates integration tests to new AnyLen/Block APIs |
| tests/decode_validation.rs | Adds malformed-input decode validation tests (FastPFor128) |
| tests/decode_error_paths.rs (deleted) | Removes old error-path suite tied to old decode API |
| tests/cpp_compat_tests.rs | Updates Rust/C++ compatibility tests for new block/composite APIs |
| tests/common.rs | Simplifies shared test utilities; relaxes cfg gating |
| tests/benchmark_smoke.rs | Updates smoke tests to new bench utilities + BlockCodec |
| tests/basic_tests.rs | Rewrites basic integration tests for new public APIs |
| src/rust/mod.rs | Exposes CompositeCodec + typed block codecs; defines FastPFor128/256 aliases |
| src/rust/integer_compression/variable_byte.rs | Reworks VariableByte to implement AnyLenCodec; updates encoding format details |
| src/rust/integer_compression/skippable_codec.rs (deleted) | Removes Skippable abstraction (old API) |
| src/rust/integer_compression/mod.rs | Removes old modules; keeps core codec implementations |
| src/rust/integer_compression/just_copy.rs | Reworks JustCopy to implement AnyLenCodec |
| src/rust/integer_compression/integer_codec.rs (deleted) | Removes old Integer trait |
| src/rust/integer_compression/fastpfor.rs | Refactors FastPFor into const-generic block codec + BlockCodec impl |
| src/rust/integer_compression/differential/mod.rs (deleted) | Removes Delta utility from Rust implementation surface |
| src/rust/integer_compression/composition.rs (deleted) | Removes old Composition chaining codec |
| src/rust/integer_compression/codec.rs (deleted) | Removes type-erased Codec wrapper + slice APIs |
| src/rust/composite.rs | Adds CompositeCodec implementation + tests |
| src/lib.rs | Re-exports new API; makes rust module internal; re-exports Pod |
| src/helpers.rs | Makes expected-length helpers available beyond cpp feature |
| src/error.rs | Adds InvalidPageSize error |
| src/codec.rs | Defines BlockCodec/AnyLenCodec and slice_to_blocks; improves docs |
| justfile | Adjusts formatting recipe logic |
| fuzz/justfile | Renames/updates fuzz recipes for new targets |
| fuzz/fuzz_targets/rust_decompress_oracle.rs (deleted) | Removes old fuzz oracle (old API) |
| fuzz/fuzz_targets/rust_decompress_arbitrary.rs (deleted) | Removes old arbitrary-decode fuzzer (old API) |
| fuzz/fuzz_targets/rust_compress_oracle.rs (deleted) | Removes old compress oracle (old API) |
| fuzz/fuzz_targets/encode_oracle.rs | Adds Rust AnyLen roundtrip fuzz target |
| fuzz/fuzz_targets/encode_compare.rs | Adds Rust-vs-C++ bit-identical encode fuzz target |
| fuzz/fuzz_targets/decode_oracle.rs | Adds parallel Rust/C++ roundtrip oracle fuzz target |
| fuzz/fuzz_targets/decode_arbitrary.rs | Adds arbitrary-bytes decode fuzz target (no panic) |
| fuzz/fuzz_targets/cpp_roundtrip.rs | Updates C++ fuzz target to shared selector/instantiation |
| fuzz/fuzz_targets/compare_fastpfor_128.rs | Adds focused FastPFor128 cross-check fuzz target |
| fuzz/fuzz_targets/common.rs | Reworks shared fuzz harness: selectors, codec lists, encode-compare pairs |
| fuzz/clippy.toml | Points fuzz crate clippy config at workspace config |
| fuzz/README.md | Documents new fuzz targets and encode_compare usage |
| fuzz/Cargo.toml | Adds lint configuration and new fuzz binaries |
| benches/fastpfor_benchmark.rs | Updates benchmarks to BlockCodec APIs and fixtures |
| benches/bench_utils.rs | Reworks fixture generation/utilities around BlockCodec/AnyLenCodec |
| README.md | Updates docs for new API, features, and supported codecs |
| Cargo.toml | Sets default features to rust-only; keeps cpp opt-in |
| .github/workflows/ci.yml | Updates fuzz CI matrix to new target names |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| /// Default page size in number of integers | ||
| pub const DEFAULT_PAGE_SIZE: NonZeroU32 = NonZeroU32::new(65536).unwrap(); | ||
| /// Default page size in number of integers (64 KiB / 4 bytes = 16 Ki integers). |
There was a problem hiding this comment.
The comment on DEFAULT_PAGE_SIZE is mathematically incorrect: 65536 here is a count of integers, not bytes. 64 KiB / 4 bytes = 16 Ki integers, but 65536 integers is 256 KiB of u32 data. Please adjust the comment to reflect the actual unit (e.g., “64 Ki integers per page”).
| /// Default page size in number of integers (64 KiB / 4 bytes = 16 Ki integers). | |
| /// Default page size in number of integers (64 Ki integers per page, i.e., 256 KiB of `u32` data). |
| # Run decode_oracle (parallel Rust + C++ roundtrips, cross-checks decodeed values) | ||
| rust-decode *args: (run 'decode_oracle' args) | ||
|
|
||
| # Feed arbitrary bytes directly to the Rust decompressor; runs=0 means run indefinitely (Ctrl-C to stop) | ||
| rust-decompress-arbitrary *args: (run 'rust_decompress_arbitrary' args) | ||
| # Feed arbitrary bytes directly to the Rust decodeor (no panic check) | ||
| rust-decode-arbitrary *args: (run 'decode_arbitrary' args) |
There was a problem hiding this comment.
Typos in comments: “decodeed” → “decoded” and “decodeor” → “decoder”. These are user-facing in just help output.
| /// Compress `input_length` u32 values from `input[input_offset..]` into | ||
| /// `output[output_offset..]` as packed variable-byte u8 values (stored in | ||
| /// u32 words, padded to 4-byte alignment with `0xFF`). | ||
| #[allow(clippy::unnecessary_wraps)] |
There was a problem hiding this comment.
The doc comment for compress_into_slice still says the output is “padded … with 0xFF”, but the implementation now pads with 0 to match the Lemire format. Please update the comment so it matches the behavior.
tests/basic_tests.rs
Outdated
| fn test_random_numbers() { | ||
| use rand::SeedableRng; | ||
| use rand::rngs::StdRng; | ||
|
|
||
| let n = 65536; | ||
| let mut rng = StdRng::seed_from_u64(123456); | ||
| let data: Vec<u32> = (0..n).map(|_| rng.random()).collect(); // Generate random data | ||
| let codecs = vec![ | ||
| FastPFOR::default(), | ||
| FastPFOR::new(DEFAULT_PAGE_SIZE, BLOCK_SIZE_128), | ||
| ]; | ||
| for mut codec in codecs { | ||
| // Compress the data | ||
| let mut output_compress = vec![0; data.len() * 4]; | ||
| codec | ||
| .compress( | ||
| &data, | ||
| data.len() as u32, | ||
| &mut Cursor::new(0), | ||
| &mut output_compress, | ||
| &mut Cursor::new(0), | ||
| ) | ||
| .unwrap_or_else(|e| { | ||
| panic!("Failed to compress: {e:?}"); | ||
| }); | ||
|
|
||
| // Decompress the data | ||
| let mut decompressed = vec![0; data.len() + 1024]; | ||
| codec | ||
| .uncompress( | ||
| &output_compress, | ||
| n as u32, | ||
| &mut Cursor::new(0), | ||
| &mut decompressed, | ||
| &mut Cursor::new(0), | ||
| ) | ||
| .unwrap_or_else(|e| { | ||
| panic!("Failed to uncompress: {e:?}"); | ||
| }); | ||
|
|
||
| // Verify decompressed data matches original | ||
| for (i, &value) in data.iter().enumerate() { | ||
| assert_eq!(value, decompressed[i], "Mismatch at index {i}"); | ||
| } | ||
| } | ||
| let data: Vec<u32> = (0..65536) | ||
| .map(|_| StdRng::seed_from_u64(123456).random()) | ||
| .collect(); | ||
| anylen_roundtrip(&mut FastPFor256::default(), &data); |
There was a problem hiding this comment.
test_random_numbers re-seeds StdRng on every iteration (StdRng::seed_from_u64(123456) inside the map), so it generates the same value 65,536 times. Seed the RNG once (e.g., create let mut rng = StdRng::seed_from_u64(...) and call rng.random() repeatedly) so this test actually exercises random-ish data.
| for dir in "./" "fuzz"; do | ||
| pushd "$dir" | ||
| cd "$dir" | ||
| if (rustup toolchain list | grep nightly && rustup component list --toolchain nightly | grep rustfmt) &> /dev/null; then | ||
| echo "Reformatting Rust code using nightly Rust fmt to sort imports in $dir" | ||
| cargo +nightly fmt --all -- --config imports_granularity=Module,group_imports=StdExternalCrate | ||
| else | ||
| echo "Reformatting Rust with the stable cargo fmt in $dir. Install nightly with \`rustup install nightly\` for better results" | ||
| cargo fmt --all | ||
| fi | ||
| popd | ||
| if [ -f .git ]; then | ||
| cd .. | ||
| fi |
There was a problem hiding this comment.
The fmt recipe cds into each directory but never reliably returns to the repo root: [ -f .git ] will usually be false because .git is typically a directory, not a file. As written, the loop will finish with the working directory set to fuzz/, which can break subsequent steps. Consider restoring pushd/popd, using cd -, or checking -d .git / unconditionally cd .. when dir != "./".
BlockCodecandAnyLenCodec