feat(cluster): implement replica bootstrap #3163
Conversation
Codecov Report
❌ Your patch check has failed because the patch coverage (9.07%) is below the target coverage (50.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## master #3163 +/- ##
============================================
- Coverage 74.43% 71.09% -3.35%
Complexity 943 943
============================================
Files 1186 1190 +4
Lines 106343 102827 -3516
Branches 83377 79873 -3504
============================================
- Hits 79161 73107 -6054
- Misses 24434 26769 +2335
- Partials 2748 2951 +203
Force-pushed from 39ce6cb to a1d20d7
hubcio left a comment
out-of-diff findings:
core/partitions/src/messages_writer.rs:60-63 — let _ = file.sync_all().await.map_err(...) swallows IO error on the file_exists open path. EIO at boot indicates dying media, but bootstrap proceeds with a stale view of disk. on master HEAD already, but server-ng makes this path live for every partition. propagate the error.
core/partitions/src/iggy_index_writer.rs:53 — let _ = file.sync_all().await; same pattern, no map_err at all. propagate.
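Both `sync_all` findings reduce to the same one-line fix. A minimal sketch of the before/after, using synchronous `std::fs` in place of the async compio API (function names are illustrative, not the actual writer methods):

```rust
use std::fs::File;
use std::io;

// Anti-pattern from the review: `let _ =` discards the Result, so an EIO
// reported by the kernel during fsync is silently lost and bootstrap
// continues with a stale view of disk.
fn sync_swallowed(file: &File) {
    let _ = file.sync_all();
}

// Propagating version: `?` surfaces the IO error to the caller, which can
// abort bootstrap instead of proceeding on dying media.
fn sync_propagated(file: &File) -> io::Result<()> {
    file.sync_all()?;
    Ok(())
}
```

In the real writers the return type would thread through the existing `map_err(...)` conversion instead of raw `io::Result`.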
core/partitions/src/messages_writer.rs:137 — let chunk_vec: Vec<_> = chunk.to_vec(); allocates a Vec per chunk per save_frozen_batches. steady-state per-batch alloc on what is now the primary write path. cache a reusable Vec in the writer or refactor the compio iov call to take a borrowed slice.
core/shard/src/lib.rs:622 — let namespaces: Vec<_> = planes.1.0.namespaces().copied().collect(); allocates a fresh Vec per inbox-frame iteration. core/shard/src/router.rs:270 calls process_loopback after every dispatched frame in steady state. hoist a namespaces_buf: Vec<u64> like the existing loopback_buf, or short-circuit when partitions / loopback drained are empty.
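The hoist suggested for both allocation findings follows the same pattern as the existing `loopback_buf`: keep one buffer on the struct and refill it in place. A sketch with illustrative names (not the actual shard types):

```rust
// Hypothetical stand-in for the shard router; only the buffer-reuse
// pattern is the point here.
struct Router {
    namespaces_buf: Vec<u64>,
}

impl Router {
    fn process_loopback(&mut self, namespaces: impl Iterator<Item = u64>) -> usize {
        // clear() keeps the existing allocation; extend() refills it in
        // place, so steady-state iterations allocate nothing once the
        // buffer has grown to its working size.
        self.namespaces_buf.clear();
        self.namespaces_buf.extend(namespaces);
        self.namespaces_buf.len()
    }
}
```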
core/shard/src/router.rs:94-103 and 199-208 — shards_table.shard_for(...).unwrap_or_else(|| { warn!(...); 0 }) silently falls back to shard 0. single-shard server-ng masks this today; multi-shard makes it a silent routing-bug hider. drop the fallback or fail loud.
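The fail-loud alternative can be sketched as follows; `ShardsTable` and its method bodies are illustrative stand-ins, not the actual `core/shard` types:

```rust
// Hypothetical shards table: maps a routing key to a shard id.
struct ShardsTable {
    shards: Vec<u16>,
}

impl ShardsTable {
    fn shard_for(&self, key: u64) -> Option<u16> {
        self.shards
            .get((key as usize) % self.shards.len().max(1))
            .copied()
    }
}

// The reviewed pattern: a missing mapping silently degrades into
// "everything lands on shard 0", which a single-shard deployment masks.
fn route_silent(table: &ShardsTable, key: u64) -> u16 {
    table.shard_for(key).unwrap_or(0)
}

// Fail-loud alternative: surface the missing mapping to the caller so a
// multi-shard routing bug cannot hide.
fn route_loud(table: &ShardsTable, key: u64) -> Result<u16, String> {
    table
        .shard_for(key)
        .ok_or_else(|| format!("no shard mapping for key {key}"))
}
```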
core/binary_protocol/src/consensus/message.rs:347 — transmute_header does a 256B header copy + 256B memset + Message::try_from(owned) re-validate after the closure already wrote a known-valid header. that's a 3rd validation per request (1st transport recv, 2nd try_into_typed, 3rd here). reached per-client-message via the new make_deferred_*_handler closures at core/server-ng/src/bootstrap.rs:1086,1116. add a transmute_header_unchecked that skips the trailing TryFrom validation and rely on the closure invariant.
core/metadata/src/stm/stream.rs:615-676 — Snapshotable::to_snapshot reads round_robin_counter with Ordering::Relaxed during snapshot fill. concurrent producer increments race the snapshot read. single-shard today: not exploitable. document the invariant.
core/common/src/types/streaming_stats.rs counter ordering — counter fetch_add / fetch_sub use Ordering::AcqRel even though the counters synchronize no other data. Relaxed is correct; AcqRel forces unnecessary ldar / stlr fences on aarch64. per-batch produce path. confirmed via method names like _inconsistent and load_for_snapshot that no consumer relies on the AcqRel half.
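The ordering change is mechanical; a minimal sketch of a pure statistics counter with the proposed `Relaxed` ordering (struct and method names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A counter that synchronizes no other data: only the increment itself
// needs to be atomic, so Relaxed suffices and avoids the acquire/release
// fences (ldar/stlr on aarch64) that AcqRel would emit on the hot
// per-batch produce path.
struct StreamingStats {
    messages: AtomicU64,
}

impl StreamingStats {
    fn record_batch(&self, n: u64) {
        self.messages.fetch_add(n, Ordering::Relaxed);
    }

    // Readers that tolerate a slightly stale value (metrics scrapes,
    // snapshot fills) can also load Relaxed.
    fn load_for_snapshot(&self) -> u64 {
        self.messages.load(Ordering::Relaxed)
    }
}
```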
Addressed all of the correctness issues; I've skipped the optimization nits, as this code will change quite a few times before it's used by the main server binary.
hubcio left a comment
additional findings cite code outside this PR's diff hunks:
core/journal/src/prepare_journal.rs:143 - open() does not detect WAL exceeding slot capacity
file not in this PR but the recovery rewrite exposes this gap. live append cannot wrap the 1024-slot ring (panic at prepare_journal.rs:411-443:422-431; SnapshotCoordinator::checkpoint_if_needed at core/metadata/src/impls/metadata.rs:243-260 with CHECKPOINT_MARGIN=64 forces snapshot near capacity). live runtime cannot lose ops. but the rebuild loop at lines 143-184 silently overwrites slot N % SLOT_COUNT when reading a WAL file containing more entries than the ring can hold (manual recovery, copy from peer, mismatched binary). iter_headers_from(0) then returns only the last 1024 slots. recovery.rs:142 (the recovery caller in this PR) does not detect this. fix: at open, bail when first slot's op > expected replay_from, or when slot count exceeds SLOT_COUNT. defensive; cheap.
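The proposed open-time guard is cheap to sketch. Constants and the function shape below are illustrative; the real checks would live in `prepare_journal.rs`'s `open()`:

```rust
// Illustrative ring size matching the 1024-slot WAL described above.
const SLOT_COUNT: u64 = 1024;

// Defensive validation at open: bail when the on-disk WAL holds more
// entries than the ring can represent, or when its first op is already
// past the expected replay point. Either condition indicates a
// hand-copied, peer-copied, or mismatched-binary WAL file, which the
// rebuild loop would otherwise silently wrap at slot N % SLOT_COUNT.
fn validate_wal(first_op: u64, entry_count: u64, replay_from: u64) -> Result<(), String> {
    if entry_count > SLOT_COUNT {
        return Err(format!(
            "WAL holds {entry_count} entries but the ring has {SLOT_COUNT} slots"
        ));
    }
    if first_op > replay_from {
        return Err(format!(
            "WAL starts at op {first_op}, past expected replay point {replay_from}"
        ));
    }
    Ok(())
}
```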
core/consensus/src/impls.rs:732 - set_view should take &self
view: Cell<u32> (line 504). set_last_prepare_checksum(&self) (line 779) and set_log_view(&self) (line 788) take &self to mutate identical Cell fields. set_view is the only odd one out at &mut self. Cell allows mutation through shared refs; &mut is not load-bearing. caller bootstrap.rs:251 has let mut consensus = VsrConsensus::new(...) solely because of this method, then later setters work on &self. fix: pub fn set_view(&self, view: u32). one-line.
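The one-line fix rests on `Cell` providing interior mutability through shared references. A self-contained sketch (struct pared down to the one field under discussion):

```rust
use std::cell::Cell;

// Minimal model of the consensus struct: Cell already permits mutation
// through &self, so the setter does not need &mut.
struct VsrConsensus {
    view: Cell<u32>,
}

impl VsrConsensus {
    // &self matches the sibling setters (set_last_prepare_checksum,
    // set_log_view) and lets the caller drop its `let mut` binding.
    fn set_view(&self, view: u32) {
        self.view.set(view);
    }

    fn view(&self) -> u32 {
        self.view.get()
    }
}
```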
core/consensus/src/impls.rs:750 - pipeline_mut is dead code
pub const fn pipeline_mut(&mut self) -> &mut RefCell<P>. rg pipeline_mut returns 1 match, the definition itself, with no callers anywhere in core/. unused / dead code should be removed. returning &mut RefCell is doubly redundant: RefCell already provides interior mutability via borrow_mut. the author TODO at line 741 already flags the borrow-across-await footgun on the related pipeline() accessor; same risk applies here, plus the &mut requirement. fix: delete.
```rust
let recovered = recover::<ServerNgMuxStateMachine>(Path::new(&config.system.path))
    .await
    .map_err(ServerNgError::MetadataRecovery)?;
let restored_op = recovered.last_applied_op.unwrap_or_else(|| {
```
recovery.rs:139-158 walks the WAL and applies every replayed entry, setting last_applied_op to journal tail M. bootstrap then calls restore_commit_state(M, M) declaring commit_min == commit_max == M. impls.rs:1110-1130, 1218, 1230 propagate commit_min via primary heartbeat and DVC; view_change_quorum.rs:88-95 dvc_max_commit propagates further into peer state via complete_view_change_as_primary.
WAL entries past the last quorum-confirmed op are NOT committed; bootstrap declares them so. multi-replica safety violation: a replica with uncommitted-but-replayed ops claims they're committed.
single-node trivially correct (quorum=1). latent landmine the moment cluster ships.
fix: persist a separate commit watermark (superblock) and only restore commit_min up to that watermark. replay-as-applied does not mean replay-as-committed.
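The proposed restore rule can be sketched as a clamp against the persisted watermark; all names here are illustrative, since the superblock watermark does not exist yet:

```rust
// Hypothetical recovered state: the journal tail M is what replay
// applied, the watermark is the last quorum-confirmed op read back from
// the (proposed) superblock.
struct RecoveredState {
    last_applied_op: u64,
    commit_watermark: u64,
}

// Instead of restore_commit_state(M, M), clamp commit_min to the
// watermark: ops in (watermark, tail] stay replayed-but-uncommitted,
// so a replica never advertises an unconfirmed op as committed.
fn restore_commit_range(state: &RecoveredState) -> (u64, u64) {
    let commit_min = state.last_applied_op.min(state.commit_watermark);
    (commit_min, commit_min)
}
```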
We don't persist the commit ranges yet. I've done this under the assumption that, for now, the cluster on bootstrap will start either with empty state, or with all replicas in exactly the same state at the shutdown/reboot moment (no state transfer support).
```rust
pub const STREAM_MASK: u64 = (1u64 << STREAM_BITS) - 1;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NamespaceCapacityError {
```
I think we should use derive(Error) from the thiserror crate.
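For reference, what `#[derive(thiserror::Error)]` with an `#[error("...")]` attribute generates is roughly the hand-written pair below; the variant name and message are hypothetical, since the enum's body is not shown in the diff:

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NamespaceCapacityError {
    // Hypothetical variant for illustration only.
    StreamLimitExceeded,
}

// Equivalent of thiserror's generated code: a Display impl plus the
// std::error::Error marker, so the enum composes with `?` and
// Box<dyn Error> instead of being a bare data enum.
impl fmt::Display for NamespaceCapacityError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::StreamLimitExceeded => write!(f, "stream limit exceeded"),
        }
    }
}

impl Error for NamespaceCapacityError {}
```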
Implement Replica bootstrap logic. Currently it does not handle cases where replicas are out of sync (needs State Transfer to be implemented).