Skip to content

chore: massive rework of the API into blocks#73

Open
nyurik wants to merge 22 commits intofast-pack:mainfrom
nyurik:block-size
Open

chore: massive rework of the API into blocks#73
nyurik wants to merge 22 commits intofast-pack:mainfrom
nyurik:block-size

Conversation

@nyurik
Copy link
Member

@nyurik nyurik commented Mar 19, 2026

  • split all API into cleanly defined "block" api and "variable length" api. The API forces the input to be blocks of constant size, at compile time, and for the codec to actually match that.
  • Made code shorter by a thousand lines
  • Identical C++ and Rust usage
  • Introduce a new Composite codec that requires a generic BlockCodec and AnyLenCodec
pub trait BlockCodec {
    /// The fixed-size block type.  Must be plain-old-data (`Pod`).
    /// In practice this will be `[u32; 128]` or `[u32; 256]`.
    type Block: Pod;

    /// Compress a slice of complete, fixed-size blocks.
    ///
    /// No remainder is possible — the caller must split the input first using
    /// [`slice_to_blocks`] and handle any remainder separately.
    fn encode_blocks(
        &mut self,
        blocks: &[Self::Block],
        out: &mut Vec<u32>,
    ) -> Result<(), FastPForError>;

    /// Decompress exactly `n_blocks` blocks from `input`.
    fn decode_blocks(
        &mut self,
        input: &[u32],
        n_blocks: usize,
        out: &mut Vec<u32>,
    ) -> Result<(), FastPForError>;
}

/// Compresses and decompresses an arbitrary-length `&[u32]` slice.
pub trait AnyLenCodec {
    /// Compress an arbitrary-length slice of `u32` values.
    fn encode(&mut self, input: &[u32], out: &mut Vec<u32>) -> Result<(), FastPForError>;

    /// Decompress a previously compressed slice of `u32` values.
    fn decode(&mut self, input: &[u32], out: &mut Vec<u32>) -> Result<(), FastPForError>;
}

/// Split a flat `&[u32]` into `(&[Blocks::Block], &[u32])` without copying.
///
/// # Example
///
/// ```rust,ignore
/// let data: Vec<u32> = (0..600).collect(); // 2 × 256 + 88 remainder
/// let (blocks, remainder) = slice_to_blocks::<FastPFor256>(&data);
/// assert_eq!(blocks.len(), 2);    // 2 blocks of [u32; 256]
/// assert_eq!(remainder.len(), 88);
/// ```
#[must_use]
pub fn slice_to_blocks<Blocks: BlockCodec>(input: &[u32]) -> (&[Blocks::Block], &[u32]) { ... }

/// Combines a block-oriented codec with an arbitrary-length tail codec.
///
/// `CompositeCodec<Blocks, Tail>` implements [`AnyLenCodec`]: it accepts any
/// input length, encodes the aligned prefix with `Blocks`, and the
/// sub-block remainder with `Tail`.
///
/// # Wire format
///
/// ```text
/// [ n_blocks: u32 ] [ Blocks encoded data... ] [ Tail encoded data... ]
/// ```
///
/// # Example
///
/// ```rust,ignore
/// use fastpfor::{AnyLenCodec, FastPFor256WithVByte};
///
/// let data: Vec<u32> = (0..600).collect(); // 2 × 256 + 88 remainder
/// let codec = FastPFor256WithVByte::default();
///
/// let mut encoded = Vec::new();
/// codec.encode(&data, &mut encoded).unwrap();
///
/// let mut decoded = Vec::new();
/// codec.decode(&encoded, &mut decoded).unwrap();
/// assert_eq!(decoded, data);
/// ```
pub struct CompositeCodec<Blocks: BlockCodec, Tail: AnyLenCodec> {...}

@nyurik nyurik requested review from CommanderStorm and Copilot and removed request for Copilot March 19, 2026 11:25
@codecov
Copy link

codecov bot commented Mar 19, 2026

Copilot AI review requested due to automatic review settings March 19, 2026 11:34

This comment was marked as outdated.

Copy link
Collaborator

@CommanderStorm CommanderStorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Performance is also the same

nyurik added a commit that referenced this pull request Mar 22, 2026
First step of moving towards unified C++ and Rust model of codecs done
in #73
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR restructures the crate’s public API around a type-safe fixed-block (BlockCodec) interface plus an arbitrary-length (AnyLenCodec) interface, and introduces a Rust CompositeCodec to match the C++ “block + tail” usage pattern while significantly reducing internal code surface.

Changes:

  • Introduces/standardizes BlockCodec + AnyLenCodec APIs and adds CompositeCodec (with FastPFor256/FastPFor128 as composites).
  • Refactors the Rust FastPFor implementation to be const-generic over block size (128/256) and updates VariableByte/JustCopy to AnyLenCodec.
  • Updates tests, benches, fuzz targets, docs, and CI; changes default features to Rust-only.

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/encode_paths.rs Updates integration tests to new AnyLen/Block APIs
tests/decode_validation.rs Adds malformed-input decode validation tests (FastPFor128)
tests/decode_error_paths.rs (deleted) Removes old error-path suite tied to old decode API
tests/cpp_compat_tests.rs Updates Rust/C++ compatibility tests for new block/composite APIs
tests/common.rs Simplifies shared test utilities; relaxes cfg gating
tests/benchmark_smoke.rs Updates smoke tests to new bench utilities + BlockCodec
tests/basic_tests.rs Rewrites basic integration tests for new public APIs
src/rust/mod.rs Exposes CompositeCodec + typed block codecs; defines FastPFor128/256 aliases
src/rust/integer_compression/variable_byte.rs Reworks VariableByte to implement AnyLenCodec; updates encoding format details
src/rust/integer_compression/skippable_codec.rs (deleted) Removes Skippable abstraction (old API)
src/rust/integer_compression/mod.rs Removes old modules; keeps core codec implementations
src/rust/integer_compression/just_copy.rs Reworks JustCopy to implement AnyLenCodec
src/rust/integer_compression/integer_codec.rs (deleted) Removes old Integer trait
src/rust/integer_compression/fastpfor.rs Refactors FastPFor into const-generic block codec + BlockCodec impl
src/rust/integer_compression/differential/mod.rs (deleted) Removes Delta utility from Rust implementation surface
src/rust/integer_compression/composition.rs (deleted) Removes old Composition chaining codec
src/rust/integer_compression/codec.rs (deleted) Removes type-erased Codec wrapper + slice APIs
src/rust/composite.rs Adds CompositeCodec implementation + tests
src/lib.rs Re-exports new API; makes rust module internal; re-exports Pod
src/helpers.rs Makes expected-length helpers available beyond cpp feature
src/error.rs Adds InvalidPageSize error
src/codec.rs Defines BlockCodec/AnyLenCodec and slice_to_blocks; improves docs
justfile Adjusts formatting recipe logic
fuzz/justfile Renames/updates fuzz recipes for new targets
fuzz/fuzz_targets/rust_decompress_oracle.rs (deleted) Removes old fuzz oracle (old API)
fuzz/fuzz_targets/rust_decompress_arbitrary.rs (deleted) Removes old arbitrary-decode fuzzer (old API)
fuzz/fuzz_targets/rust_compress_oracle.rs (deleted) Removes old compress oracle (old API)
fuzz/fuzz_targets/encode_oracle.rs Adds Rust AnyLen roundtrip fuzz target
fuzz/fuzz_targets/encode_compare.rs Adds Rust-vs-C++ bit-identical encode fuzz target
fuzz/fuzz_targets/decode_oracle.rs Adds parallel Rust/C++ roundtrip oracle fuzz target
fuzz/fuzz_targets/decode_arbitrary.rs Adds arbitrary-bytes decode fuzz target (no panic)
fuzz/fuzz_targets/cpp_roundtrip.rs Updates C++ fuzz target to shared selector/instantiation
fuzz/fuzz_targets/compare_fastpfor_128.rs Adds focused FastPFor128 cross-check fuzz target
fuzz/fuzz_targets/common.rs Reworks shared fuzz harness: selectors, codec lists, encode-compare pairs
fuzz/clippy.toml Points fuzz crate clippy config at workspace config
fuzz/README.md Documents new fuzz targets and encode_compare usage
fuzz/Cargo.toml Adds lint configuration and new fuzz binaries
benches/fastpfor_benchmark.rs Updates benchmarks to BlockCodec APIs and fixtures
benches/bench_utils.rs Reworks fixture generation/utilities around BlockCodec/AnyLenCodec
README.md Updates docs for new API, features, and supported codecs
Cargo.toml Sets default features to rust-only; keeps cpp opt-in
.github/workflows/ci.yml Updates fuzz CI matrix to new target names

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


/// Default page size in number of integers
pub const DEFAULT_PAGE_SIZE: NonZeroU32 = NonZeroU32::new(65536).unwrap();
/// Default page size in number of integers (64 KiB / 4 bytes = 16 Ki integers).
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on DEFAULT_PAGE_SIZE is mathematically incorrect: 65536 here is a count of integers, not bytes. 64 KiB / 4 bytes = 16 Ki integers, but 65536 integers is 256 KiB of u32 data. Please adjust the comment to reflect the actual unit (e.g., “64 Ki integers per page”).

Suggested change
/// Default page size in number of integers (64 KiB / 4 bytes = 16 Ki integers).
/// Default page size in number of integers (64 Ki integers per page, i.e., 256 KiB of `u32` data).

Copilot uses AI. Check for mistakes.
Comment on lines +37 to +41
# Run decode_oracle (parallel Rust + C++ roundtrips, cross-checks decodeed values)
rust-decode *args: (run 'decode_oracle' args)

# Feed arbitrary bytes directly to the Rust decompressor; runs=0 means run indefinitely (Ctrl-C to stop)
rust-decompress-arbitrary *args: (run 'rust_decompress_arbitrary' args)
# Feed arbitrary bytes directly to the Rust decodeor (no panic check)
rust-decode-arbitrary *args: (run 'decode_arbitrary' args)
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typos in comments: “decodeed” → “decoded” and “decodeor” → “decoder”. These are user-facing in just help output.

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +38
/// Compress `input_length` u32 values from `input[input_offset..]` into
/// `output[output_offset..]` as packed variable-byte u8 values (stored in
/// u32 words, padded to 4-byte alignment with `0xFF`).
#[allow(clippy::unnecessary_wraps)]
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment for compress_into_slice still says the output is “padded … with 0xFF”, but the implementation now pads with 0 to match the Lemire format. Please update the comment so it matches the behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +103
fn test_random_numbers() {
use rand::SeedableRng;
use rand::rngs::StdRng;

let n = 65536;
let mut rng = StdRng::seed_from_u64(123456);
let data: Vec<u32> = (0..n).map(|_| rng.random()).collect(); // Generate random data
let codecs = vec![
FastPFOR::default(),
FastPFOR::new(DEFAULT_PAGE_SIZE, BLOCK_SIZE_128),
];
for mut codec in codecs {
// Compress the data
let mut output_compress = vec![0; data.len() * 4];
codec
.compress(
&data,
data.len() as u32,
&mut Cursor::new(0),
&mut output_compress,
&mut Cursor::new(0),
)
.unwrap_or_else(|e| {
panic!("Failed to compress: {e:?}");
});

// Decompress the data
let mut decompressed = vec![0; data.len() + 1024];
codec
.uncompress(
&output_compress,
n as u32,
&mut Cursor::new(0),
&mut decompressed,
&mut Cursor::new(0),
)
.unwrap_or_else(|e| {
panic!("Failed to uncompress: {e:?}");
});

// Verify decompressed data matches original
for (i, &value) in data.iter().enumerate() {
assert_eq!(value, decompressed[i], "Mismatch at index {i}");
}
}
let data: Vec<u32> = (0..65536)
.map(|_| StdRng::seed_from_u64(123456).random())
.collect();
anylen_roundtrip(&mut FastPFor256::default(), &data);
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_random_numbers re-seeds StdRng on every iteration (StdRng::seed_from_u64(123456) inside the map), so it generates the same value 65,536 times. Seed the RNG once (e.g., create let mut rng = StdRng::seed_from_u64(...) and call rng.random() repeatedly) so this test actually exercises random-ish data.

Copilot uses AI. Check for mistakes.
Comment on lines 87 to +98
for dir in "./" "fuzz"; do
pushd "$dir"
cd "$dir"
if (rustup toolchain list | grep nightly && rustup component list --toolchain nightly | grep rustfmt) &> /dev/null; then
echo "Reformatting Rust code using nightly Rust fmt to sort imports in $dir"
cargo +nightly fmt --all -- --config imports_granularity=Module,group_imports=StdExternalCrate
else
echo "Reformatting Rust with the stable cargo fmt in $dir. Install nightly with \`rustup install nightly\` for better results"
cargo fmt --all
fi
popd
if [ -f .git ]; then
cd ..
fi
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fmt recipe cds into each directory but never reliably returns to the repo root: [ -f .git ] will usually be false because .git is typically a directory, not a file. As written, the loop will finish with the working directory set to fuzz/, which can break subsequent steps. Consider restoring pushd/popd, using cd -, or checking -d .git / unconditionally cd .. when dir != "./".

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants