Problem
Zstd frames consist of multiple blocks, each with its own 3-byte header, so block boundaries are known once the headers are parsed. The current decoder processes blocks sequentially. For frames with a known content size and multiple blocks, blocks that do not depend on state from earlier blocks (see the challenges below) can be decompressed in parallel.
This is distinct from #19 (multi-threaded compression): this issue covers parallel decompression only.
Goal
Enable parallel decompression of zstd data, both across independent frames and across blocks within a single frame, targeting the CoordiNode use case where many small-to-medium frames are read from an LSM tree.
Design considerations
Frame-level parallelism (simpler, higher impact for CoordiNode)
CoordiNode reads many independent zstd frames (one per SSTable block). These can be decompressed in parallel trivially:
- `par_iter()` (from `rayon::prelude`) over the frame list
- Each frame gets its own `FrameDecoder`, so there is no shared state
- Expected gain: near-linear scaling with core count for batch reads
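The frame-level approach can be sketched without committing to a decoder API. The version below is a minimal illustration that uses `std::thread::scope` in place of rayon so that it stays dependency-free. The name `decompress_batch_with` and the `decode` callback (a stand-in for "create a fresh `FrameDecoder` and decode one frame") are assumptions of this sketch, not part of the crate.

```rust
use std::thread;

/// Decode every frame in `frames`, spreading the work over up to `workers` threads.
/// `decode` is a hypothetical stand-in for decoding a single frame with its own decoder.
pub fn decompress_batch_with<F>(frames: &[&[u8]], workers: usize, decode: F) -> Vec<Vec<u8>>
where
    F: Fn(&[u8]) -> Vec<u8> + Sync,
{
    if frames.is_empty() {
        return Vec::new();
    }
    // Ceiling division, so at most `workers` chunks are produced.
    let workers = workers.max(1);
    let chunk_len = (frames.len() + workers - 1) / workers;
    let decode = &decode;

    thread::scope(|s| {
        // One scoped thread per chunk; frames share no decoder state.
        let handles: Vec<_> = frames
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&f| decode(f)).collect::<Vec<_>>()))
            .collect();

        // Joining in spawn order keeps the outputs aligned with the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("decoder thread panicked"))
            .collect()
    })
}
```

With rayon enabled, the same shape would presumably collapse to a `par_iter().map(...).collect()` chain; the scoped-thread version only shows that no synchronization is needed beyond joining the workers.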
Block-level parallelism within a frame (complex, lower impact)
Single large frame with N blocks:
- Parse all block headers sequentially (3 bytes each, so this is fast)
- Identify block boundaries and compressed sizes
- Decompress blocks in parallel (a block that carries its own Huffman/FSE table descriptions does not need the previous block's tables)
- Write decompressed blocks to pre-allocated output buffer at correct offsets
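The sequential header pass can be sketched as follows. The bit layout (bit 0 `Last_Block`, bits 1-2 `Block_Type`, bits 3-23 `Block_Size`, little-endian) follows the Zstandard format specification; the `BlockInfo` and `scan_blocks` names are illustrative only, and the input is assumed to start at the first block header, right after the frame header.

```rust
/// Block types from the 2-bit `Block_Type` field of a zstd block header.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockType {
    Raw,
    Rle,
    Compressed,
}

/// Location of one block inside the frame (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockInfo {
    pub block_type: BlockType,
    /// Offset of the block content, just past its 3-byte header.
    pub content_start: usize,
    /// Number of input bytes the block content occupies.
    pub content_len: usize,
    /// Output size when the header alone determines it (Raw and RLE only).
    pub regenerated_size: Option<usize>,
}

/// Walk the block headers in `data` up to and including the last block.
/// Returns `None` on truncated input or a reserved block type.
pub fn scan_blocks(data: &[u8]) -> Option<Vec<BlockInfo>> {
    let mut blocks = Vec::new();
    let mut pos = 0usize;
    loop {
        let header = data.get(pos..pos.checked_add(3)?)?;
        // The header is a 24-bit little-endian value.
        let raw = u32::from(header[0])
            | (u32::from(header[1]) << 8)
            | (u32::from(header[2]) << 16);
        let is_last = (raw & 1) == 1;
        let size = (raw >> 3) as usize;
        let content_start = pos + 3;
        let (block_type, content_len, regenerated_size) = match (raw >> 1) & 0b11 {
            0 => (BlockType::Raw, size, Some(size)),
            // An RLE block stores one byte that is repeated `size` times.
            1 => (BlockType::Rle, 1, Some(size)),
            // For Compressed blocks, `size` is the compressed size only.
            2 => (BlockType::Compressed, size, None),
            _ => return None, // reserved block type
        };
        let content_end = content_start.checked_add(content_len)?;
        if content_end > data.len() {
            return None;
        }
        blocks.push(BlockInfo { block_type, content_start, content_len, regenerated_size });
        if is_last {
            return Some(blocks);
        }
        pos = content_end;
    }
}
```

Any bytes after the last block (such as an optional content checksum) are ignored by this sketch.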
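Challenge: Block N's output position depends on the decompressed sizes of blocks 0..N-1. If the frame declares its content size and every block's decompressed size is known up front, positions are computable from headers alone. Note that the 3-byte block header gives the decompressed size only for Raw and RLE blocks; for Compressed blocks it gives the compressed size, so their output sizes are not available from the block header.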
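Assuming the per-block decompressed sizes are known, the output positions are an exclusive prefix sum, and the pre-allocated buffer can be split into disjoint mutable slices so that workers never alias. The sketch below is illustrative: `block_offsets`, `split_output`, `fill_blocks_parallel` and the `fill` callback (a stand-in for "decode block `i` into this slice") are hypothetical names, and one thread per block is a simplification a real implementation would replace with a pool.

```rust
use std::thread;

/// Exclusive prefix sum: the output offset of each block.
pub fn block_offsets(sizes: &[usize]) -> Vec<usize> {
    sizes
        .iter()
        .scan(0usize, |next, &size| {
            let start = *next;
            *next += size;
            Some(start)
        })
        .collect()
}

/// Split `out` into one disjoint mutable slice per block, in block order.
/// Returns `None` if the sizes do not add up to `out.len()`.
fn split_output<'a>(mut out: &'a mut [u8], sizes: &[usize]) -> Option<Vec<&'a mut [u8]>> {
    let mut parts = Vec::with_capacity(sizes.len());
    for &size in sizes {
        if size > out.len() {
            return None;
        }
        // `mem::take` moves the slice out so both halves keep the full lifetime.
        let (head, tail) = std::mem::take(&mut out).split_at_mut(size);
        parts.push(head);
        out = tail;
    }
    if out.is_empty() { Some(parts) } else { None }
}

/// Allocate `content_size` bytes and let `fill(i, slice)` write block `i` in parallel.
pub fn fill_blocks_parallel<F>(content_size: usize, sizes: &[usize], fill: F) -> Option<Vec<u8>>
where
    F: Fn(usize, &mut [u8]) + Sync,
{
    let mut out = vec![0u8; content_size];
    let parts = split_output(&mut out, sizes)?;
    let fill = &fill;
    thread::scope(|s| {
        for (index, part) in parts.into_iter().enumerate() {
            s.spawn(move || fill(index, part));
        }
    });
    Some(out)
}
```

Splitting with `split_at_mut` keeps the whole thing in safe Rust: the borrow checker guarantees that no two blocks can write to the same bytes.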
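Challenge: Huffman "treeless" blocks reuse the previous block's Huffman table, which creates a dependency. Treeless chains must be detected and decoded serially. Sequence sections in Repeat_Mode reuse the previous block's FSE tables in the same way, and match offsets may point into earlier blocks' output, so those cases need the same treatment.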
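Detecting such chains could look like the sketch below. The treeless check relies on the Zstandard format, where the low two bits of the first literals-section byte hold `Literals_Block_Type` and the value 3 means `Treeless_Literals_Block`. The grouping helper and its names are hypothetical; the `depends_on_prev` flag can be set for any cross-block dependency, not only treeless literals.

```rust
use std::ops::Range;

/// True when a Compressed block's literals section is a Treeless_Literals_Block.
/// `block_content` is the block content without its 3-byte block header.
pub fn has_treeless_literals(block_content: &[u8]) -> bool {
    block_content.first().map_or(false, |b| (*b & 0b11) == 3)
}

/// Group block indices into chains that must be decoded serially.
/// `depends_on_prev[i]` says block `i` needs state from block `i - 1`.
/// The flag of block 0 is ignored, since it has no predecessor in the frame.
pub fn dependency_chains(depends_on_prev: &[bool]) -> Vec<Range<usize>> {
    let mut chains = Vec::new();
    let mut start = 0;
    for i in 1..depends_on_prev.len() {
        if !depends_on_prev[i] {
            // Block `i` is self-contained, so it starts a new chain.
            chains.push(start..i);
            start = i;
        }
    }
    if !depends_on_prev.is_empty() {
        chains.push(start..depends_on_prev.len());
    }
    chains
}
```

Each returned range can then be handed to one worker, which decodes its blocks in order while other chains run concurrently.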
API surface
```rust
// Frame-level parallelism (simple): one decompressed buffer per input frame
pub fn decompress_batch(frames: &[&[u8]]) -> Vec<Vec<u8>>;
// Block-level parallelism (advanced): decompress one frame on the given pool
pub fn decompress_parallel(frame: &[u8], thread_pool: &rayon::ThreadPool) -> Vec<u8>;
```
Performance expectations
- Frame-level batch (CoordiNode workload): +200-400% on 4+ cores
- Block-level within single frame: +50-150% on 4+ cores (limited by header parsing + treeless dependencies)
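As a rough sanity check on these ranges, Amdahl's law relates the achievable speedup to the fraction of work that can run in parallel. The helper below is a back-of-the-envelope aid, not a measurement, and `amdahl_speedup` is an illustrative name. For instance, a +150% gain (2.5x) on 4 cores corresponds to about 80% of the decode time being parallelizable, while a +400% gain (5x) is only reachable with more than 4 cores.

```rust
/// Upper bound on speedup when `parallel_fraction` of the work scales across `cores`.
pub fn amdahl_speedup(parallel_fraction: f64, cores: usize) -> f64 {
    let p = parallel_fraction.clamp(0.0, 1.0);
    // The serial part (1 - p) is unaffected; the parallel part is divided by the core count.
    1.0 / ((1.0 - p) + p / cores.max(1) as f64)
}
```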
Acceptance criteria
Dependencies
- Independent of other performance issues.
- Rayon as an optional dependency.
Estimate
3d