perf(decoding): parallel block decompression for multi-block frames #72

@polaz

Problem

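A zstd frame consists of one or more blocks, each introduced by a 3-byte header that gives its type and size. The current decoder processes blocks sequentially. Blocks are not independent in general: a compressed block can reuse the previous block's entropy tables and repeat offsets, and its matches can reach back into earlier blocks' output. For frames with a known content size and multiple blocks, blocks that carry no such dependency can be decompressed in parallel.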

This is distinct from #19 (multi-threaded compression) — this is about parallel decompression.

Goal

Enable parallel decompression of independent blocks within a single zstd frame, targeting the CoordiNode use case where many small-to-medium frames are read from an LSM tree.

Design considerations

Frame-level parallelism (simpler, higher impact for CoordiNode)

CoordiNode reads many independent zstd frames (one per SSTable block). These can be decompressed in parallel trivially:

  • `par_iter()` (from `rayon::prelude`) over the frame list
  • Each frame gets its own `FrameDecoder` — no shared state
  • Expected gain: Linear scaling with core count for batch reads
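A minimal sketch of the frame-level batch idea, under stated assumptions. To keep it dependency-free it uses `std::thread::scope` with one chunk of frames per worker instead of rayon, and `decode_frame` is a hypothetical stand-in for the real per-frame decoder call (here it only copies its input). The name `decompress_batch_sketch` is likewise illustrative.

```rust
use std::thread;

/// Hypothetical stand-in for a real per-frame decode call.
/// It copies the input so the sketch compiles without a zstd crate.
fn decode_frame(frame: &[u8]) -> Vec<u8> {
    frame.to_vec()
}

/// Decode all frames, splitting the list into one chunk per worker thread.
/// Outputs are returned in the same order as the input frames.
fn decompress_batch_sketch(frames: &[&[u8]]) -> Vec<Vec<u8>> {
    let workers = thread::available_parallelism().map_or(1, |n| n.get());
    // `chunks(0)` panics, so keep the chunk length at least 1.
    let chunk_len = frames.len().div_ceil(workers).max(1);

    thread::scope(|s| {
        // Each worker owns its decoder state; nothing is shared between chunks.
        let handles: Vec<_> = frames
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|&f| decode_frame(f)).collect::<Vec<_>>()
                })
            })
            .collect();

        // Joining in spawn order preserves the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("decoder thread panicked"))
            .collect()
    })
}
```

With rayon the body would reduce to a parallel map over the frame list; the chunking here only approximates what a work-stealing pool does.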

Block-level parallelism within a frame (complex, lower impact)

Single large frame with N blocks:

  1. Parse all block headers sequentially (3 bytes each — fast)
  2. Identify block boundaries and compressed sizes
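  3. Decompress blocks in parallel where a block does not depend on earlier ones (repeat-mode FSE tables, treeless Huffman literals, repeat offsets, and matches reaching into earlier blocks all create dependencies)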
  4. Write decompressed blocks to pre-allocated output buffer at correct offsets
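Steps 1 and 2 can be sketched as follows. The header layout (little-endian 24 bits: `Last_Block` in bit 0, `Block_Type` in bits 1-2, `Block_Size` in the remaining 21 bits) follows the zstd format specification; the type and function names are hypothetical. The sketch assumes `data` starts at the first block header, that is, the frame header has already been consumed, and it ignores any trailing checksum.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlockType { Raw, Rle, Compressed, Reserved }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockHeader {
    last_block: bool,
    block_type: BlockType,
    /// Raw/Compressed: bytes of block content that follow the header.
    /// RLE: number of times the single content byte is repeated.
    block_size: u32,
}

fn parse_block_header(bytes: [u8; 3]) -> BlockHeader {
    let raw = u32::from(bytes[0]) | (u32::from(bytes[1]) << 8) | (u32::from(bytes[2]) << 16);
    let block_type = match (raw >> 1) & 0b11 {
        0 => BlockType::Raw,
        1 => BlockType::Rle,
        2 => BlockType::Compressed,
        _ => BlockType::Reserved,
    };
    BlockHeader { last_block: raw & 1 == 1, block_type, block_size: raw >> 3 }
}

/// Walk the headers and return each block's header plus the byte range
/// of its content. Returns None on truncated or reserved-type input.
fn scan_blocks(data: &[u8]) -> Option<Vec<(BlockHeader, Range<usize>)>> {
    let mut blocks = Vec::new();
    let mut pos = 0usize;
    loop {
        let h = data.get(pos..pos.checked_add(3)?)?;
        let header = parse_block_header([h[0], h[1], h[2]]);
        let content_len = match header.block_type {
            BlockType::Raw | BlockType::Compressed => header.block_size as usize,
            BlockType::Rle => 1, // one byte on the wire, repeated block_size times
            BlockType::Reserved => return None,
        };
        let start = pos + 3;
        let end = start.checked_add(content_len)?;
        if end > data.len() {
            return None;
        }
        blocks.push((header, start..end));
        if header.last_block {
            return Some(blocks);
        }
        pos = end;
    }
}
```

The scan touches only three bytes per block, so it stays cheap even though it is inherently sequential.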

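Challenge: Block N's output position depends on the decompressed sizes of blocks 0..N-1. The block header gives that size directly only for Raw and RLE blocks; for Compressed blocks, `Block_Size` is the compressed size, so the decompressed size must be determined before output offsets can be computed. The frame's content size fixes the total output length but not the per-block split.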
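Once per-block decompressed sizes are known, output offsets are a prefix sum, and the pre-allocated buffer can be split into disjoint mutable slices so that workers write without synchronization. A rough sketch, assuming the sizes are already available: `block_offsets`, `split_output`, and `decode_blocks_into` are hypothetical names, the `decode` callback stands in for a real block decoder, and scoped std threads are used in place of rayon.

```rust
use std::thread;

/// Output offset of each block: exclusive prefix sum of decompressed sizes.
fn block_offsets(sizes: &[usize]) -> Vec<usize> {
    let mut next = 0usize;
    sizes.iter().map(|&s| { let at = next; next += s; at }).collect()
}

/// Split `out` into one disjoint mutable slice per block.
/// Returns None unless the sizes add up to exactly `out.len()`.
fn split_output<'a>(mut out: &'a mut [u8], sizes: &[usize]) -> Option<Vec<&'a mut [u8]>> {
    let mut parts = Vec::with_capacity(sizes.len());
    for &size in sizes {
        if size > out.len() {
            return None;
        }
        let (head, tail) = std::mem::take(&mut out).split_at_mut(size);
        parts.push(head);
        out = tail;
    }
    if out.is_empty() { Some(parts) } else { None }
}

/// Run `decode(block_index, destination)` for every block on its own scoped thread.
/// Returns false if the sizes do not match the output buffer.
fn decode_blocks_into<F>(out: &mut [u8], sizes: &[usize], decode: F) -> bool
where
    F: Fn(usize, &mut [u8]) + Sync,
{
    let Some(parts) = split_output(out, sizes) else { return false };
    let decode = &decode;
    thread::scope(|s| {
        for (i, dst) in parts.into_iter().enumerate() {
            // Each thread owns a non-overlapping region of the output.
            s.spawn(move || decode(i, dst));
        }
    }); // the scope joins all threads before returning
    true
}
```

This only covers blocks whose output is self-contained; a block whose matches read earlier output would need more than a private destination slice.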

Challenge: Huffman "treeless" blocks reuse the previous block's Huffman table → creates dependency. Must detect and serialize treeless chains.
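One possible way to serialize treeless chains is to group block indices into runs that must be decoded in order, then schedule the runs in parallel. The sketch below assumes a per-block flag has already been computed; `is_treeless_literals` reads the two lowest bits of the first literals-section byte (`Literals_Block_Type`, where 3 means treeless per the zstd format specification) and applies only to Compressed blocks. Both function names are hypothetical.

```rust
use std::ops::Range;

/// True when the literals section header marks a Treeless_Literals_Block.
/// `first_literals_byte` is the first byte of a Compressed block's content.
fn is_treeless_literals(first_literals_byte: u8) -> bool {
    first_literals_byte & 0b11 == 3
}

/// Group blocks into chains that must be decoded sequentially.
/// `reuses_previous[i]` is true when block i needs state from block i - 1.
/// The flag of block 0 is ignored, since it has no predecessor.
fn dependency_chains(reuses_previous: &[bool]) -> Vec<Range<usize>> {
    let mut chains = Vec::new();
    let mut start = 0usize;
    for i in 1..reuses_previous.len() {
        if !reuses_previous[i] {
            // Block i starts fresh, so the previous chain ends here.
            chains.push(start..i);
            start = i;
        }
    }
    if !reuses_previous.is_empty() {
        chains.push(start..reuses_previous.len());
    }
    chains
}
```

Each returned range can be handed to one worker, which decodes its blocks in order. The same grouping would apply to any other carried state, such as repeat-mode FSE tables, if the flag is widened to cover it.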

API surface

```rust
// Frame-level parallelism (simple)
pub fn decompress_batch(frames: &[&[u8]]) -> Vec<Vec<u8>>;

// Block-level parallelism (advanced)
pub fn decompress_parallel(frame: &[u8], thread_pool: &rayon::ThreadPool) -> Vec<u8>;
```

Performance expectations

  • Frame-level batch (CoordiNode workload): +200-400% on 4+ cores
  • Block-level within single frame: +50-150% on 4+ cores (limited by header parsing + treeless dependencies)

Acceptance criteria

  • Frame-level batch decompression API with rayon.
  • Block-level parallel decompression for frames with content size.
  • Correct handling of treeless Huffman blocks (serialize dependent blocks).
  • Byte-exact output parity with sequential decoder.
  • Feature-gated behind `parallel` feature (rayon dependency optional).
  • Benchmark with CoordiNode-representative workload (many 1-16KB frames).

Dependencies

  • Independent from other performance issues.
  • Rayon as optional dependency.

Estimate

3d

Labels

  • P2-medium: medium priority, important improvement
  • P3-low: low priority, nice to have
  • enhancement: new feature or request
  • performance: performance optimization
