feat(encryption): raft envelope + engine pre-apply hook (Stage 3)#744
feat(encryption): raft envelope + engine pre-apply hook (Stage 3)#744bootjp wants to merge 1 commit intofeat/encryption-storagefrom
Conversation
Stage 3 of the encryption rollout (design doc §4.2 + §6.3). Adds
the AES-256-GCM raft envelope used to wrap Raft proposal payloads
end-to-end through the WAL, plus the engine-side pre-apply hook
that unwraps committed entries past a configurable cutover index.
Wire format. Reuses the §4.1 storage envelope layout but with a
distinct AAD: 'R' (0x52) ‖ envelope_version ‖ key_id BE32. No
pebble_key — the raft envelope is location-independent; the engine
identifies entries by raftpb.Entry.Index. Storage and raft envelopes
therefore cannot be cross-replayed: a storage ciphertext fed to
UnwrapRaftPayload (or vice versa) fails GCM tag verification on
the AAD prefix.
Engine pre-apply hook. internal/raftengine/etcd/engine.go gains
two optional OpenConfig fields (RaftCipher, RaftCutoverIndex) plus
matching Engine struct fields. applyNormalEntry now decodes the
proposal envelope FIRST (preserving the leading
0x01 proposalEnvelopeVersion byte that resolveProposal needs), then
— iff cipher!=nil AND entry.Index > cutoverIndex() — unwraps the
inner fsm payload through the raft DEK before handing the
cleartext to fsm.Apply. The FSM contract is unchanged at
Apply(data []byte) any. resolveProposal continues to receive the
original entry.Data, so per-proposal id resolution and response
delivery are untouched.
Fail-closed. applyNormalEntry now returns (any, error). On unwrap
failure (GCM tag mismatch / unknown DEK / malformed envelope)
applyCommitted propagates ErrRaftUnwrapFailed up to runLoop and
DOES NOT advance setApplied — the next restart must replay the
same entry, not skip it. Silent skip would let the local FSM
diverge from peers that successfully unwrapped, defeating the
integrity property the tag was added to detect (design §6.3).
applyCommitted refactor. Each entry-type arm extracted into its
own helper (applyNormalCommitted, applyConfChangeCommitted,
applyConfChangeV2Committed) so the dispatch loop stays under the
cyclomatic budget while the new (any, error) signature is woven
through.
Coordinator seam. kv/raft_payload_wrapper.go introduces the
RaftPayloadWrapper closure type and a wrappedProposer adapter that
runs the wrapper transparently before forwarding to a
raftengine.Proposer. Stage 3 ships only the seam: a nil wrapper is
returned verbatim by newWrappedProposer, preserving the existing
zero-encryption behaviour for every coordinator path. Stage 6
installs the active wrapper (cipher + active raft DEK + nonce
factory) from the sidecar.
Stage 3 is INERT in production. With cipher=nil and cutover=
^uint64(0) (the OpenConfig defaults from orInertCutover), every
existing engine_test continues to pass byte-for-byte; the new
behaviour only fires when Stage 5/6 wires the cipher and cutover.
Test coverage:
- internal/encryption/raft_envelope_test.go: round-trip across 4
sizes, deterministic-nonce determinism, tag tamper, key_id
tamper, retired DEK, short-input, nil cipher, AAD layout pin
(TestBuildRaftAAD), layer-confusion (storage envelope rejected
by UnwrapRaftPayload), no-plaintext-leak.
- internal/raftengine/etcd/encryption_test.go: cipher=nil pass-
through, below-cutover pass-through (incl. boundary index ==
cutover), above-cutover Unwrap, unwrap-failure halts apply
loop without advancing setApplied, proposal id still
resolvable after unwrap, OpenConfig nil-cutover default.
- kv/raft_payload_wrapper_test.go: nil wrapper is identity,
wrapper invoked exactly once per Propose, propagates wrap-side
errors without touching the inner proposer, end-to-end round-
trip through a real Cipher.
Self-review (CLAUDE.md 5 passes):
1. Data loss: tampered raft entries fail closed via
ErrRaftUnwrapFailed; setApplied never advances past a failing
entry; runLoop's existing fatal-error path takes the process
down for operator-supervised restart.
2. Concurrency: raftCipher / raftCutoverIndex are set once at
Open and only read from the run-loop apply goroutine — same
single-writer discipline as fsm. The wrapper closure is
documented as concurrent-safe; Stage 3's pass-through default
avoids the question entirely.
3. Performance: cipher=nil hot path is one nil branch + one
uint64 compare per applied entry. With cipher set, the
Unwrap path adds one DecodeEnvelope + one AEAD.Open per
above-cutover entry; AES key expansion is amortised in the
keystore's pre-initialised AEAD cache.
4. Consistency: AAD purpose byte 'R' rules out storage↔raft
layer confusion; key_id participates in AAD so on-disk
rewrite fails GCM; cutover boundary is strict greater-than
so the enable-flag entry itself stays cleartext (design
§7.1).
5. Test coverage: 30+ unit tests across three packages plus the
existing raftengine_test.go which continues to pass with no
changes (cipher=nil legacy path).
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request implements Stage 3 of the Raft-level encryption design, introducing the core logic for wrapping and unwrapping Raft payloads using AES-GCM. It adds a new raft_envelope package for handling encrypted envelopes with distinct AAD, integrates an unwrap hook into the etcd-based Raft engine with cutover index support, and provides a coordinator-side wrapper for transparently encrypting proposals. The implementation ensures that decryption failures are fatal and prevent the advancement of the applied index to maintain consistency. I have no feedback to provide.
Summary
Stage 3 of the data-at-rest encryption rollout (design doc:
docs/design/2026_04_29_proposed_data_at_rest_encryption.md, §4.2 + §6.3). Wraps Raft proposal payloads in an authenticated AES-256-GCM envelope so the etcd/raft WAL on disk holds ciphertext only, and adds the engine pre-apply hook that unwraps committed entries past the §7.1 Phase 2 cutover index. Stage 3 is inert in production: with the newOpenConfigfields nil/unset, the apply hook never fires and every existing engine test continues to pass byte-for-byte.'R' ‖ envelope_version ‖ key_id) so storage↔raft layer-confusion replay fails GCM.applyNormalEntrybecomes(any, error); on unwrap failureapplyCommittedhalts without advancingsetApplied, mapping the entry-replay invariant directly into the runLoop's fatal-error path. Silent skip is rejected by design (§6.3).applyCommittedarms refactored intoapplyNormalCommitted/applyConfChangeCommitted/applyConfChangeV2Committedto stay under the cyclomatic budget while threading the new error.RaftPayloadWrapper+wrappedProposeradapter — Stage 3 ships only the seam (nil wrapper = identity); Stage 6 will install the active wrapper from the sidecar.Out of scope (deferred)
raftEncodeEncryptionRegistration/Bootstrap/Rotationopcodes)Test plan
go test -race ./internal/encryption/...— round-trip, tag/keyID tamper, retired DEK, short input, layer-confusion, no-leak, AAD layout pingo test -race ./internal/raftengine/etcd/...— cipher=nil pass-through, cutover boundary (strict greater-than), above-cutover Unwrap, unwrap-failure halts apply loop without advancing setApplied, proposal-id still resolvable after unwrap, OpenConfig nil-cutover defaultgo test -race ./kv/...— wrapper nil = identity, wrapper invoked exactly once, propagates wrap-side errors, end-to-end round-trip with real Ciphergolangci-lint run ./internal/raftengine/... ./internal/encryption/... ./kv/...cleanmake testfull sweep (CI verifies)Self-review (CLAUDE.md 5 passes)
ErrRaftUnwrapFailed;setAppliednever advances past a failing entry; runLoop's existing fatal-error path takes the process down for operator-supervised restart. The replay-not-skip invariant is the load-bearing safety property and is locked down byTestApplyCommitted_UnwrapFailure_DoesNotAdvanceApplied.raftCipher/raftCutoverIndexare set once atOpenand only read from the run-loop apply goroutine (same single-writer discipline asfsm). The wrapper closure type is documented as concurrent-safe; Stage 3's pass-through default sidesteps the question.cipher=nilhot path is one nil branch + one uint64 compare per applied entry. With cipher set, the Unwrap path adds oneDecodeEnvelope+ oneAEAD.Openper above-cutover entry; AES key expansion is amortised in the keystore's pre-initialised AEAD cache.'R'rules out storage↔raft layer confusion;key_idparticipates in AAD so on-disk rewrite fails GCM; cutover boundary is strict>so the enable-flag entry itself stays cleartext (design §7.1).raftengine_test.gocontinues to pass untouched.