perf(decoding): parallel block decompression for multi-block frames

## Problem

A zstd frame consists of one or more blocks. Block boundaries can be recovered from the 3-byte block headers alone, but the current decoder processes blocks strictly sequentially. For frames with a known content size and multiple blocks, part of the per-block work can be done in parallel, subject to the inter-block dependencies described under Design considerations.

This is distinct from #19 (multi-threaded compression) — this is about parallel **decompression**.

## Goal

Enable parallel decompression of blocks within a single zstd frame where their dependencies allow it, targeting the CoordiNode use case where many small-to-medium frames are read from the LSM tree.

## Design considerations

### Frame-level parallelism (simpler, higher impact for CoordiNode)
CoordiNode reads many independent zstd frames (one per SSTable block). These can be decompressed in parallel trivially:
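- `par_iter()` (from `rayon::prelude`) over the frame list
- Each frame gets its own `FrameDecoder`, so there is no shared state
- **Expected gain:** Near-linear scaling with core count for batch reads, as long as the batch has enough frames to keep every core busy and memory bandwidth is not the bottleneck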
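
A minimal sketch of the frame-level approach above. To keep it dependency-free it uses `std::thread::scope` in place of rayon, and the per-frame decoder is passed in as a closure; `decompress_batch_with`, `workers`, and `decode` are hypothetical names used for illustration only, not part of any existing API.

```rust
use std::thread;

/// Decode every frame on a worker thread, preserving input order.
/// `decode` is a hypothetical stand-in for a per-frame decoder call;
/// each invocation is independent, so no state is shared between frames.
pub fn decompress_batch_with<F, E>(
    frames: &[&[u8]],
    workers: usize,
    decode: F,
) -> Vec<Result<Vec<u8>, E>>
where
    F: Fn(&[u8]) -> Result<Vec<u8>, E> + Sync,
    E: Send,
{
    if frames.is_empty() {
        return Vec::new();
    }
    // Split the batch into contiguous chunks, at most one per worker.
    let workers = workers.max(1);
    let chunk_len = (frames.len() + workers - 1) / workers;
    let decode = &decode;
    thread::scope(|s| {
        let handles: Vec<_> = frames
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|&f| decode(f)).collect::<Vec<_>>())
            })
            .collect();
        // Joining in spawn order keeps results aligned with the input frames.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("decoder worker panicked"))
            .collect()
    })
}
```

With rayon enabled, the body would likely reduce to `frames.par_iter().map(|f| decode(f)).collect()`. Returning one `Result` per frame is a design choice of this sketch: a single corrupt frame then does not discard the rest of the batch.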

### Block-level parallelism within a frame (complex, lower impact)
Single large frame with N blocks:
1. Parse all block headers sequentially (3 bytes each, so this pass is cheap)
2. Identify block boundaries and compressed sizes
3. Decompress blocks in parallel where possible (a block may carry its own Huffman/FSE tables, but it may also reuse tables from the previous block)
4. Write decompressed blocks to pre-allocated output buffer at correct offsets
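
Steps 1 and 2 can be sketched as a single pass over the block headers. The bit layout follows the zstd format (24-bit little-endian header: `Last_Block` in bit 0, `Block_Type` in bits 1-2, `Block_Size` in bits 3-23); the type and function names (`BlockType`, `BlockInfo`, `scan_blocks`) are hypothetical, and the sketch assumes the frame header has already been parsed so that the input starts at the first block.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockType {
    Raw,
    Rle,
    Compressed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockInfo {
    pub block_type: BlockType,
    /// Byte range of the block payload within the scanned region.
    pub payload: Range<usize>,
    /// Regenerated size, when the header alone determines it.
    pub regenerated_size: Option<usize>,
}

/// Walk the block headers of one frame without decoding any payload.
/// `blocks` must start at the first block header.
pub fn scan_blocks(blocks: &[u8]) -> Result<Vec<BlockInfo>, &'static str> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    loop {
        let h = blocks.get(pos..pos + 3).ok_or("truncated block header")?;
        let raw = u32::from(h[0]) | (u32::from(h[1]) << 8) | (u32::from(h[2]) << 16);
        let last = (raw & 1) == 1;
        let size = (raw >> 3) as usize;
        let (block_type, payload_len, regenerated_size) = match (raw >> 1) & 0b11 {
            0 => (BlockType::Raw, size, Some(size)),
            // An RLE block stores a single byte that is repeated Block_Size times.
            1 => (BlockType::Rle, 1, Some(size)),
            // For compressed blocks Block_Size is the compressed size only.
            2 => (BlockType::Compressed, size, None),
            _ => return Err("reserved block type"),
        };
        let start = pos + 3;
        let end = start + payload_len;
        if end > blocks.len() {
            return Err("truncated block payload");
        }
        out.push(BlockInfo { block_type, payload: start..end, regenerated_size });
        pos = end;
        if last {
            // Anything after the last block (e.g. a content checksum) is ignored here.
            return Ok(out);
        }
    }
}
```

Note that `regenerated_size` is `None` for compressed blocks: the header pass yields payload boundaries for every block, but output sizes only for raw and RLE blocks.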

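**Challenge:** Block N's output position depends on the decompressed sizes of blocks 0..N-1. The block header gives the decompressed size only for raw and RLE blocks; for a compressed block, `Block_Size` is the compressed size, so its output length is known only after its literals and sequences sections have been decoded. The frame content size fixes the total output length but not the per-block offsets.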
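
Once per-block output sizes are known by whatever means (for compressed blocks they are not stored in the block header), the offsets are an exclusive prefix sum, and the pre-allocated buffer can be split into non-overlapping slices so that workers never alias each other's output. A minimal sketch, with hypothetical helper names `output_offsets` and `split_output`:

```rust
/// Exclusive prefix sum: `offsets[i]` is where block `i` starts in the output.
/// Returns the offsets together with the total regenerated size.
pub fn output_offsets(block_sizes: &[usize]) -> (Vec<usize>, usize) {
    let mut total = 0usize;
    let offsets: Vec<usize> = block_sizes
        .iter()
        .map(|&len| {
            let start = total;
            total += len;
            start
        })
        .collect();
    (offsets, total)
}

/// Split one pre-allocated buffer into per-block slices that do not overlap,
/// so each slice can be handed to a different worker without locking.
/// Panics if the sizes add up to more than `out.len()`.
pub fn split_output<'a>(mut out: &'a mut [u8], block_sizes: &[usize]) -> Vec<&'a mut [u8]> {
    let mut parts = Vec::with_capacity(block_sizes.len());
    for &len in block_sizes {
        // `mem::take` moves the remaining slice out so both halves keep lifetime 'a.
        let (head, tail) = std::mem::take(&mut out).split_at_mut(len);
        parts.push(head);
        out = tail;
    }
    parts
}
```

Handing out disjoint `&mut [u8]` slices lets the borrow checker enforce that no two workers write to the same bytes, which avoids `unsafe` for the write step.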

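**Challenge:** Blocks are not fully independent. A literals section in "treeless" mode reuses the previous block's Huffman table, sequence FSE tables can be reused through repeat mode, repeat offsets carry over from block to block, and matches may reference output produced by earlier blocks within the window. Dependent chains must be detected and serialized; only work that needs no earlier state can run in parallel.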
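
Detecting chains can be sketched as a grouping pass, assuming a per-block flag that says whether the block reuses entropy tables from its predecessor. How that flag is obtained is outside this sketch, and `dependency_chains` is a hypothetical helper name. The grouping covers table reuse only, not other forms of inter-block state.

```rust
use std::ops::Range;

/// `reuses_previous[i]` is true when block `i` needs entropy tables left
/// behind by block `i - 1`. Returns index ranges: blocks inside one range
/// must be decoded in order, while different ranges do not share tables.
pub fn dependency_chains(reuses_previous: &[bool]) -> Vec<Range<usize>> {
    let mut chains = Vec::new();
    let mut start = 0usize;
    // The first block always opens a chain, so scanning begins at index 1.
    for (i, &reuses) in reuses_previous.iter().enumerate().skip(1) {
        if !reuses {
            chains.push(start..i);
            start = i;
        }
    }
    if !reuses_previous.is_empty() {
        chains.push(start..reuses_previous.len());
    }
    chains
}
```

Each returned range is then a unit of scheduling: a worker decodes the blocks of one chain sequentially, and separate chains can go to separate workers.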

### API surface
```rust
// Frame-level parallelism (simple)
pub fn decompress_batch(frames: &[&[u8]]) -> Vec<Vec<u8>>;

// Block-level parallelism (advanced)
pub fn decompress_parallel(frame: &[u8], thread_pool: &rayon::ThreadPool) -> Vec<u8>;
```

## Performance expectations
- Frame-level batch (CoordiNode workload): **+200-400%** on 4+ cores
- Block-level within single frame: **+50-150%** on 4+ cores (limited by header parsing + treeless dependencies)

## Acceptance criteria
- [ ] Frame-level batch decompression API with rayon.
- [ ] Block-level parallel decompression for frames with content size.
- [ ] Correct handling of treeless Huffman blocks (serialize dependent blocks).
- [ ] Byte-exact output parity with sequential decoder.
- [ ] Feature-gated behind `parallel` feature (rayon dependency optional).
- [ ] Benchmark with CoordiNode-representative workload (many 1-16KB frames).

## Dependencies
- Independent of other performance issues.
- Rayon as optional dependency.

## Estimate
3d
