
feat(decoding): expected-field validation setters on FrameDecoder (expect_dict_id, expect_window_descriptor) #177

@polaz

⚠️ Deferred — post-Phase 6 drop-in parity (2026-05-19)

Project priority sequence per #28: complete encoder rewrite (#111 incl. #23) → speed/ratio optimizations (#178, #180) → params API (#27) → magicless (#26) → Phase 6 C-ABI / CLI drop-in (#126/#127/#128/#130/#131/#132) → THEN this track. lsm-tree bilateral coordination accepted 2026-05-18 — commitment preserved, but execution defers until drop-in parity ships. Pre-Phase-6 work on this issue will not be scheduled.


⚠️ Feature gate (mandatory): all Rust code added by this issue is compiled only when the lsm Cargo feature is enabled (#[cfg(feature = "lsm")] on every new public item — module, struct, enum variant, impl block, function). Feature is default off, opt-in for downstream consumers. Without lsm: build is byte-identical to today, no new public symbols, cdylib from Phase 6 stays strict drop-in for donor libzstd v1.5.7. C FFI surface is unaffected regardless of feature state.


Context

lsm-tree's LSM-T2 (encrypted wire-format encoder/decoder) needs post-AEAD-decrypt validation: after AEAD authenticates the ciphertext and the bytes are decrypted to a raw zstd frame, the wire-format spec mandates two cross-checks against the MetadataPayload fields baked into the AAD:

  1. Dictionary_ID match. The inner zstd frame's dict_id (the Dictionary_ID field of its frame header, whose presence and width are signaled by Dictionary_ID_Flag in the Frame_Header_Descriptor) MUST equal MetadataPayload.DictID. Otherwise an attacker who somehow forges a valid AEAD blob with the wrong dictionary could trigger silent corruption; see §8 of the wire-format draft, "dict substitution" row.
  2. Window_Descriptor match. Inner frame's window descriptor byte MUST equal MetadataPayload.WindowLog. Defeats decompression-bomb attacks where a swapped-in ciphertext claims a much larger window than the encrypted blob originally permitted.

Currently lsm-tree would have to parse the inner frame header itself, read FrameHeader::dictionary_id() and FrameHeader::window_size() exposed at zstd/src/decoding/frame.rs:142, 116, and compare against expectations. That's open-coded validation duplicated at every decryption site. Better: structured-zstd accepts the expectations up front and fails the decode with a typed error if they don't match.
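
For illustration, the open-coded alternative looks roughly like the sketch below: every decryption site would peek at the raw frame bytes before handing them to the decoder. This is a standalone sketch based on the RFC 8878 frame header layout, not structured-zstd's API; `PeekedHeader` and `peek_header` are hypothetical names introduced here.

```rust
/// Fields this sketch extracts from a raw zstd frame header (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct PeekedHeader {
    /// Raw Window_Descriptor byte; `None` for single-segment frames.
    pub window_descriptor: Option<u8>,
    /// Dictionary_ID; `None` when Dictionary_ID_Flag is 0.
    pub dict_id: Option<u32>,
}

/// zstd frame magic number 0xFD2FB528, stored little-endian on the wire.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Hypothetical helper: read the two header fields without decoding any block.
/// Returns `None` if the input is truncated or does not start with the magic.
pub fn peek_header(frame: &[u8]) -> Option<PeekedHeader> {
    if frame.get(..4)? != ZSTD_MAGIC.as_slice() {
        return None;
    }
    let descriptor = *frame.get(4)?;
    let single_segment = (descriptor & 0x20) != 0; // Single_Segment_Flag, bit 5
    let dict_id_len = match descriptor & 0x03 {
        // Dictionary_ID_Flag, bits 1-0: field width in bytes
        0 => 0,
        1 => 1,
        2 => 2,
        _ => 4,
    };

    let mut pos = 5;
    // Window_Descriptor is present only when Single_Segment_Flag is unset.
    let window_descriptor = if single_segment {
        None
    } else {
        let byte = *frame.get(pos)?;
        pos += 1;
        Some(byte)
    };

    // Dictionary_ID is a little-endian integer of 0, 1, 2, or 4 bytes.
    let dict_id = if dict_id_len == 0 {
        None
    } else {
        let bytes = frame.get(pos..pos + dict_id_len)?;
        let mut buf = [0u8; 4];
        buf[..dict_id_len].copy_from_slice(bytes);
        Some(u32::from_le_bytes(buf))
    };

    Some(PeekedHeader { window_descriptor, dict_id })
}
```

Each caller would then still have to compare both fields against MetadataPayload and map mismatches to its own error type, which is the duplication this issue aims to remove.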

Scope — Rust-side only, behind lsm feature

#[cfg(feature = "lsm")]
impl FrameDecoder {
    /// Pin the expected `Dictionary_ID` for the next frame. If set, decode
    /// fails fast (before any block work) when the parsed frame header's
    /// `dict_id` does not match. `Some(0)` is treated as "no dictionary
    /// expected" — the frame's `dict_id` must be absent (or 0). `None`
    /// (default) disables the check.
    pub fn expect_dict_id(&mut self, expected: Option<u32>);

    /// Pin the expected raw `Window_Descriptor` byte (RFC 8878 §3.1.1.1.2
    /// layout: `(exp << 3) | mantissa`) for the next frame. If set, decode
    /// fails fast when the parsed frame header's `window_descriptor` byte
    /// does not match. `None` (default) disables the check.
    ///
    /// Note: this is byte-exact match, NOT a ceiling. Donor's
    /// `ZSTD_d_windowLogMax` is a separate ceiling-style limit that lives
    /// in the Phase 6 C FFI surface (#127); it's a different semantic and
    /// stays available unconditionally there.
    pub fn expect_window_descriptor(&mut self, expected: Option<u8>);
}

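#[non_exhaustive]
pub enum FrameDecoderError {
    // existing variants unchanged ...

    // Gate each new variant individually; gating the whole enum would
    // remove the existing error type from non-lsm builds.
    #[cfg(feature = "lsm")]
    UnexpectedDictId {
        expected: Option<u32>,
        found: Option<u32>,
    },
    #[cfg(feature = "lsm")]
    UnexpectedWindowDescriptor {
        expected: u8,
        found: u8,
    },
}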
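
For reference, the raw byte layout named in the `expect_window_descriptor` doc comment can be unpacked as in the sketch below, which follows the Window_Descriptor formula in RFC 8878. The function names are hypothetical and not part of structured-zstd.

```rust
/// Decode a raw Window_Descriptor byte into a window size in bytes,
/// using the RFC 8878 formula:
///   windowLog  = 10 + Exponent
///   windowBase = 1 << windowLog
///   windowAdd  = (windowBase / 8) * Mantissa
pub fn window_size_from_descriptor(byte: u8) -> u64 {
    let exponent = u32::from(byte >> 3); // bits 7-3
    let mantissa = u64::from(byte & 0x07); // bits 2-0
    let window_base = 1u64 << (10 + exponent);
    let window_add = (window_base / 8) * mantissa;
    window_base + window_add
}

/// Pack an exponent (0..=31) and mantissa (0..=7) into the raw byte.
pub fn pack_window_descriptor(exponent: u8, mantissa: u8) -> u8 {
    debug_assert!(exponent < 32 && mantissa < 8);
    (exponent << 3) | mantissa
}
```

With this layout, 0x4A unpacks to exponent 9 and mantissa 2 (a 640 KiB window), and 0x60 to exponent 12 and mantissa 0 (a 4 MiB window). The proposed setter compares the raw byte, so two descriptors are equal only if both exponent and mantissa agree.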

Implementation

Hook in FrameDecoder::init / FrameDecoder::reset (zstd/src/decoding/frame_decoder.rs:103, 131) immediately after frame::read_frame_header returns successfully:

let (frame_header, header_size) = frame::read_frame_header(source)?;

#[cfg(feature = "lsm")]
{
    if let Some(expected) = self.expect_dict_id {
        let found = frame_header.dictionary_id();
        // An absent Dictionary_ID is treated as 0 (Some(0) ↔ None equivalence).
        if expected != found.unwrap_or(0) {
            return Err(FrameDecoderError::UnexpectedDictId { expected: Some(expected), found });
        }
    }
    if let Some(expected) = self.expect_window_descriptor {
        // Window_Descriptor is present only when Single_Segment_Flag is unset;
        // single-segment frames omit the byte and take the window size from
        // Frame_Content_Size. The proposed accessor returns None in that case,
        // and this sketch skips the check (open design decision, see the
        // acceptance criteria).
        if let Some(found) = frame_header.window_descriptor() {
            if expected != found {
                return Err(FrameDecoderError::UnexpectedWindowDescriptor { expected, found });
            }
        }
    }
}

The window_descriptor field on FrameHeader (frame.rs:105) is currently pub(crate) — needs a pub fn window_descriptor(&self) -> Option<u8> accessor (returning None for single-segment frames). Small additive API exposure.

What this is NOT

  • Not a replacement for AEAD. AEAD authentication still runs first; this gate is for the post-decrypt sanity check that the wire-format spec mandates.
  • Not a ceiling limit (donor ZSTD_d_windowLogMax is a different thing: it lives in #127's C FFI surface, is unconditionally available, and has separate semantics). This is byte-exact equality.
  • Not a dict_id != 0 required gate. expected = Some(0) explicitly means "no dictionary" — useful for blocks that don't use one.
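
The difference between byte-exact equality and a ceiling can be sketched as follows. This is a simplified standalone model: `passes_ceiling` only approximates what a `ZSTD_d_windowLogMax`-style limit does (reject frames whose window exceeds 2^windowLogMax bytes) and is not the donor's implementation. All function names are hypothetical.

```rust
/// Window size in bytes encoded by a raw Window_Descriptor byte (RFC 8878).
fn window_size(byte: u8) -> u64 {
    let base = 1u64 << (10 + u32::from(byte >> 3));
    base + (base / 8) * u64::from(byte & 0x07)
}

/// Byte-exact semantics, as proposed for `expect_window_descriptor`.
pub fn passes_exact(expected: u8, found: u8) -> bool {
    expected == found
}

/// Simplified model of a ceiling-style limit: accept any frame whose window
/// fits within 2^window_log_max bytes. Assumes `window_log_max < 64`.
pub fn passes_ceiling(window_log_max: u32, found: u8) -> bool {
    window_size(found) <= (1u64 << window_log_max)
}
```

Under this model a frame with descriptor 0x48 passes a ceiling of 2^20 bytes yet fails an exact expectation of 0x4A, which is the behavior the wire-format cross-check needs: any deviation from the authenticated value is rejected, including a smaller window.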

Why feature-gate behind lsm

Consistent with #171-#175: preserves strict drop-in C FFI parity (the default-build cdylib has zero new surface). The validation is bespoke to wire-format-with-AAD scenarios; non-lsm consumers don't need it.

Acceptance criteria

  • expect_dict_id(Some(42)), decode frame with dict_id = 42 → ok, decodes normally.
  • expect_dict_id(Some(42)), decode frame with dict_id = 43 → UnexpectedDictId { expected: Some(42), found: Some(43) }, no bytes decoded.
  • expect_dict_id(Some(42)), decode frame with no dict_id (flag 0) → UnexpectedDictId { expected: Some(42), found: None }.
  • expect_dict_id(Some(0)), decode frame with no dict_id → ok (Some(0) ↔ None equivalence).
  • expect_dict_id(None) (default), decode anything → no validation, no behavior change.
  • expect_window_descriptor(Some(0x4A)), decode frame with window_descriptor=0x4A → ok.
  • expect_window_descriptor(Some(0x4A)), decode frame with window_descriptor=0x60 → UnexpectedWindowDescriptor { expected: 0x4A, found: 0x60 }.
  • Single-segment frame + expect_window_descriptor(Some(_)) → behavior documented (skip check, or fail explicitly — needs design decision in PR).
  • Validation fires BEFORE any block decode work — no allocation, no XXH64 init, no partial output.
  • After validation failure, calling init/reset again clears the failed state cleanly.
  • Without lsm feature: setters absent, error variants absent, build byte-identical to today.
  • 503/503 lib tests pass.
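
The matching rules in the criteria above can be modeled in isolation, which may help when writing the real tests. The sketch below uses stand-in types (`Expectations`, `HeaderView`, `ExpectationError` are hypothetical, not structured-zstd items) and assumes the "skip check" option for single-segment frames, which is still an open design decision.

```rust
/// Stand-in for the two proposed error variants.
#[derive(Debug, PartialEq, Eq)]
pub enum ExpectationError {
    UnexpectedDictId { expected: Option<u32>, found: Option<u32> },
    UnexpectedWindowDescriptor { expected: u8, found: u8 },
}

/// Stand-in for the parsed header fields the check consults.
pub struct HeaderView {
    pub dict_id: Option<u32>,
    /// `None` models a single-segment frame (no Window_Descriptor byte).
    pub window_descriptor: Option<u8>,
}

/// Stand-in for the state the two setters would store; `None` disables a check.
#[derive(Default)]
pub struct Expectations {
    pub dict_id: Option<u32>,
    pub window_descriptor: Option<u8>,
}

impl Expectations {
    /// Compare the expectations against a parsed header, before any block work.
    pub fn check(&self, header: &HeaderView) -> Result<(), ExpectationError> {
        if let Some(expected) = self.dict_id {
            // An absent Dictionary_ID counts as 0 (Some(0) ↔ None equivalence).
            if expected != header.dict_id.unwrap_or(0) {
                return Err(ExpectationError::UnexpectedDictId {
                    expected: Some(expected),
                    found: header.dict_id,
                });
            }
        }
        // Assumption: single-segment frames (no descriptor byte) skip the check.
        if let (Some(expected), Some(found)) =
            (self.window_descriptor, header.window_descriptor)
        {
            if expected != found {
                return Err(ExpectationError::UnexpectedWindowDescriptor { expected, found });
            }
        }
        Ok(())
    }
}
```

If the PR settles on failing explicitly for single-segment frames instead, only the second branch changes; the dict_id rules stay as written.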

Estimated size

~80 LoC + tests. One small additive accessor (FrameHeader::window_descriptor()), two setters on FrameDecoder, two new error variants, validation block in init/reset.

Phasing

PR-F. Independent of #171/#172/#173/#174/#175. Lands any time before lsm-tree's LSM-T2 (wire-format encoder/decoder) goes live — LSM-T2 uses these setters in its post-AEAD-decrypt validation step. Can ship parallel to PR-A in Phase α of the bilateral phasing.

Related


Bilateral cross-reference
