perf(decoding): investigate best-effort partial decode past block failure [SPIKE] #175

@polaz

⚠️ Deferred — post-Phase 6 drop-in parity (2026-05-19)

Project priority sequence per #28: complete encoder rewrite (#111 incl. #23) → speed/ratio optimizations (#178, #180) → params API (#27) → magicless (#26) → Phase 6 C-ABI / CLI drop-in (#126/#127/#128/#130/#131/#132) → THEN this track. lsm-tree bilateral coordination accepted 2026-05-18 — commitment preserved, but execution defers until drop-in parity ships. Pre-Phase-6 work on this issue will not be scheduled.


⚠️ Feature gate (mandatory): all Rust code added by this issue is compiled only when the lsm Cargo feature is enabled, with #[cfg(feature = "lsm")] on every new public item (module, struct, enum variant, impl block, function). The feature is off by default and opt-in for downstream consumers. Without lsm, the build is byte-identical to today's: no new public symbols, and the Phase 6 cdylib stays a strict drop-in for donor libzstd v1.5.7. The C FFI surface is unaffected regardless of feature state.


Status

Investigation / spike. This is the largest task in the structural-vocabulary set and the lowest priority. It lands only after lsm-tree's ECC layer (LSM-T5) is in production and demonstrates an operational need for "decode-as-much-as-possible-past-corruption" beyond the targeted block repair from #174.

Acceptable outcome: "ECC repair already covers the cases we care about, partial-decode is over-engineered, close without landing." That's a valid result — the analysis itself is the deliverable.

Context

When the lsm-tree ECC layer (LSM-T5 + LSM-T6) has exhausted its parity budget on an SSTable block and a zstd block is still unrecoverable, downstream callers asking for a range query may still benefit from "give me whatever decoded successfully before the corruption." Today FrameDecoder::decode_blocks treats any block-decode failure as terminal: decompressed bytes already in the buffer are discarded when the error is returned.

A best-effort partial-decode mode would let the caller extract bytes_decoded bytes of plaintext plus a positional error pointing at the unrecoverable block, then proceed with degraded results.

Proposed scope

impl FrameDecoder {
    /// Decode blocks from `src`, emitting decoded bytes via the existing
    /// read interface, stopping at the first block-decode failure. Returns
    /// what was decoded plus the error position. Unlike `decode_blocks`,
    /// the decoded prefix is preserved and accessible via `read()`.
    pub fn decode_blocks_partial(
        &mut self,
        src: &mut impl Read,
    ) -> Result<PartialDecode, FrameDecoderError>;
}

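pub struct PartialDecode {
    /// Total decompressed bytes from the blocks that decoded successfully.
    pub bytes_decoded: u64,
    /// Number of blocks decoded before stopping.
    pub blocks_decoded: u32,
    /// Index of the failing block and its error; `None` if no block failed.
    pub stopped_at: Option<(u32 /* block_index */, FrameDecoderError)>,
}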
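
A caller-side sketch of how the proposed result might be consumed for a range query. Everything here is an assumption for illustration: `FrameDecoderError::CorruptBlock`, `RangeResult`, and `classify` are hypothetical names, and the two types are local stand-ins that mirror the proposal above so the snippet compiles by itself.

```rust
// Hypothetical stand-ins; the real types would come from the crate.
#[derive(Debug, Clone, PartialEq)]
pub enum FrameDecoderError {
    /// Placeholder variant, not one of the crate's real error variants.
    CorruptBlock,
}

#[derive(Debug)]
pub struct PartialDecode {
    pub bytes_decoded: u64,
    pub blocks_decoded: u32,
    pub stopped_at: Option<(u32, FrameDecoderError)>,
}

/// What a range-query caller might hand back to its own caller (hypothetical).
#[derive(Debug, PartialEq)]
pub enum RangeResult {
    /// Every block decoded; the data is complete.
    Complete(Vec<u8>),
    /// Decoding stopped early; only the prefix before `failed_block` is valid.
    Degraded { prefix: Vec<u8>, failed_block: u32 },
}

/// `drained` stands for the bytes pulled from the decoder's read interface
/// after the partial decode returned.
pub fn classify(outcome: &PartialDecode, drained: Vec<u8>) -> RangeResult {
    // The drained prefix is expected to match the reported byte count.
    debug_assert_eq!(drained.len() as u64, outcome.bytes_decoded);
    match &outcome.stopped_at {
        None => RangeResult::Complete(drained),
        Some((block_index, _err)) => RangeResult::Degraded {
            prefix: drained,
            failed_block: *block_index,
        },
    }
}
```

Keeping the failing block index in the degraded result lets the caller report which part of the range is missing instead of silently returning short data.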

Why this is hard

The current decode state machine treats a block failure as fatal and may leave internal state mid-update (block buffer half-written, window not advanced, FSE state mid-reload). A resumable abort needs careful unwinding so that the caller can still drain the bytes_decoded bytes from prior blocks without corrupted state poisoning the read interface. Estimate: ~300-500 LoC plus a nontrivial test surface, since buffered-output integrity has to be validated after each terminal abort variant.
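
One possible shape for the unwinding, shown as a minimal sketch and not as the decoder's actual design: record the output length before each block and truncate back to it on failure, so a half-written block never reaches the reader. `CheckpointedOutput` and its methods are hypothetical names, and the closure stands in for the real per-block decode step. The sketch covers only the output buffer; window and FSE state would need their own checkpoints.

```rust
/// Output buffer that only keeps bytes from blocks that decoded cleanly.
/// Hypothetical type for illustration; not part of the crate.
#[derive(Default)]
pub struct CheckpointedOutput {
    buf: Vec<u8>,
    blocks_committed: u32,
}

impl CheckpointedOutput {
    /// Run one block's decode step against the buffer. On error, roll the
    /// buffer back to its length before the block started.
    pub fn decode_block<F>(&mut self, decode: F) -> Result<(), &'static str>
    where
        F: FnOnce(&mut Vec<u8>) -> Result<(), &'static str>,
    {
        let checkpoint = self.buf.len();
        match decode(&mut self.buf) {
            Ok(()) => {
                self.blocks_committed += 1;
                Ok(())
            }
            Err(e) => {
                // Drop whatever the failed block appended; the prefix stays intact
                // as long as the decode step only appends.
                self.buf.truncate(checkpoint);
                Err(e)
            }
        }
    }

    /// Bytes from fully decoded blocks only.
    pub fn committed(&self) -> &[u8] {
        &self.buf
    }

    pub fn blocks_committed(&self) -> u32 {
        self.blocks_committed
    }
}
```

Recording a length and truncating avoids an extra copy on the happy path, which matters given the 1% budget; it does assume the failing block only appended and never overwrote earlier output.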

Kill-switch criteria

Close without landing if any of:

  • lsm-tree's ECC layer + LSM-T6 lazy repair handles every operationally observed case → partial decode never triggered in practice.
  • Adding the resumable-abort plumbing slows the happy-path decode benches by more than 1%.
  • The state-machine cleanup turns out to require touching encoding/blocks internals, which would expand scope beyond a single decode-path PR.

Acceptance criteria (if it lands)

  • Happy-path decode (no corruption) is byte-identical to existing decode_blocks output.
  • Corrupt block at index N → PartialDecode { bytes_decoded: <sum of bodies 0..N>, blocks_decoded: N, stopped_at: Some((N, _)) } and FrameDecoder::read() returns exactly the decompressed bytes of blocks 0..N.
  • No happy-path perf regression on compare_ffi decompress benches (< 1% delta).
  • 503/503 lib tests pass.
  • At least one integration test demonstrating lsm-tree-style range-query partial recovery.
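
The 1% threshold used in both the kill-switch and acceptance criteria can be made mechanical. A minimal sketch, assuming each bench reports a mean time per iteration in nanoseconds; `slowdown_pct` and `within_budget` are hypothetical helpers, not part of the compare_ffi harness.

```rust
/// Relative slowdown of `candidate_ns` versus `baseline_ns`, in percent.
/// Positive means the candidate is slower.
pub fn slowdown_pct(baseline_ns: f64, candidate_ns: f64) -> f64 {
    (candidate_ns - baseline_ns) / baseline_ns * 100.0
}

/// True when the candidate stays strictly under the allowed slowdown.
pub fn within_budget(baseline_ns: f64, candidate_ns: f64, budget_pct: f64) -> bool {
    slowdown_pct(baseline_ns, candidate_ns) < budget_pct
}
```

A single comparison like this is only meaningful if the bench noise is well below 1%, so in practice the inputs would be aggregated over several runs.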

Related


ADDENDUM (2026-05-18): feature gating

If this lands, decode_blocks_partial and PartialDecode are gated behind the lsm Cargo feature (default off) — same gate as #171/#172/#173/#174.

#[cfg(feature = "lsm")]
impl FrameDecoder {
    pub fn decode_blocks_partial(...) -> Result<PartialDecode, FrameDecoderError>;
}

#[cfg(feature = "lsm")]
pub struct PartialDecode { ... }

Default-build cdylib from #126/#127 remains strict drop-in for donor v1.5.7 — donor has no partial-decode primitive, so its absence in the no-feature build is correct.


Bilateral cross-reference

    Labels

    P3-low: Low priority, nice to have
    enhancement: New feature or request
    performance: Performance optimization
