
feat(cluster): implement replica bootstrap #3163

Open
numinnex wants to merge 11 commits into master from replica_bootstrap

Conversation

@numinnex
Contributor

Implements replica bootstrap logic. It does not yet handle cases where replicas are out of sync (that needs State Transfer to be implemented).

@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 9.07850% with 1332 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.09%. Comparing base (78f447e) to head (59a9231).
⚠️ Report is 4 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| core/server-ng/src/bootstrap.rs | 0.00% | 1142 Missing ⚠️ |
| core/server-ng/src/config_writer.rs | 0.00% | 71 Missing ⚠️ |
| core/server-ng/src/main.rs | 0.00% | 33 Missing ⚠️ |
| core/shard/src/router.rs | 0.00% | 17 Missing ⚠️ |
| core/common/src/types/streaming_stats.rs | 50.00% | 12 Missing ⚠️ |
| core/configs/src/server_ng_config/validators.rs | 41.17% | 7 Missing and 3 partials ⚠️ |
| core/consensus/src/impls.rs | 0.00% | 10 Missing ⚠️ |
| core/configs/src/server_ng_config/displays.rs | 0.00% | 7 Missing ⚠️ |
| core/partitions/src/iggy_index_writer.rs | 0.00% | 7 Missing ⚠️ |
| core/partitions/src/messages_writer.rs | 0.00% | 6 Missing ⚠️ |

... and 5 more

❌ Your patch check has failed because the patch coverage (9.07%) is below the target coverage (50.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
```diff
@@             Coverage Diff              @@
##             master    #3163      +/-   ##
============================================
- Coverage     74.43%   71.09%   -3.35%
  Complexity      943      943
============================================
  Files          1186     1190       +4
  Lines        106343   102827    -3516
  Branches      83377    79873    -3504
============================================
- Hits          79161    73107    -6054
- Misses        24434    26769    +2335
- Partials       2748     2951     +203
```
| Components | Coverage Δ |
|---|---|
| Rust Core | 71.42% <9.07%> (-4.30%) ⬇️ |
| Java SDK | 60.14% <ø> (ø) |
| C# SDK | 69.07% <ø> (-0.31%) ⬇️ |
| Python SDK | 81.43% <ø> (ø) |
| Node SDK | 91.44% <ø> (+0.03%) ⬆️ |
| Go SDK | 39.80% <ø> (+0.20%) ⬆️ |
| Files with missing lines | Coverage Δ |
|---|---|
| core/configs/src/server_config/server.rs | 80.28% <100.00%> (+6.20%) ⬆️ |
| core/configs/src/server_ng_config/defaults.rs | 100.00% <100.00%> (ø) |
| core/configs/src/server_ng_config/server_ng.rs | 47.05% <100.00%> (+6.07%) ⬆️ |
| core/message_bus/src/replica/io.rs | 83.09% <100.00%> (+0.18%) ⬆️ |
| core/metadata/src/stm/mod.rs | 41.50% <100.00%> (+1.70%) ⬆️ |
| core/metadata/src/stm/mux.rs | 92.63% <100.00%> (+0.96%) ⬆️ |
| core/partitions/src/lib.rs | 0.00% <ø> (ø) |
| core/common/src/sharding/namespace.rs | 95.00% <93.75%> (+29.37%) ⬆️ |
| core/message_bus/src/lib.rs | 92.08% <0.00%> (-0.82%) ⬇️ |
| core/metadata/src/impls/recovery.rs | 74.12% <88.88%> (+3.49%) ⬆️ |

... and 12 more

... and 123 files with indirect coverage changes


@numinnex numinnex force-pushed the replica_bootstrap branch from 39ce6cb to a1d20d7 Compare May 4, 2026 13:18
@hubcio (Contributor) left a comment:

out-of-diff findings:

core/partitions/src/messages_writer.rs:60-63 — `let _ = file.sync_all().await.map_err(...)` swallows the IO error on the file_exists open path. EIO at boot indicates dying media, but bootstrap proceeds with a stale view of disk. on master HEAD already, but server-ng makes this path live for every partition. propagate the error.

core/partitions/src/iggy_index_writer.rs:53 — `let _ = file.sync_all().await;` same pattern, no map_err at all. propagate.
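Both sync_all findings are the same one-line fix: bubble the IO error up instead of discarding it. A minimal sketch using blocking std::fs (the real code uses async compio files and the crate's own error types — `WriterError` and `sync_existing` here are hypothetical stand-ins):

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

// Hypothetical stand-in for the writer's real error enum.
#[derive(Debug)]
pub enum WriterError {
    Io(io::Error),
}

// Instead of `let _ = file.sync_all()...` swallowing the error, propagate
// it so an EIO at boot aborts bootstrap rather than continuing with a
// stale view of disk.
pub fn sync_existing(path: &Path) -> Result<(), WriterError> {
    let file = OpenOptions::new()
        .append(true)
        .open(path)
        .map_err(WriterError::Io)?;
    file.sync_all().map_err(WriterError::Io)?;
    Ok(())
}
```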

core/partitions/src/messages_writer.rs:137 — `let chunk_vec: Vec<_> = chunk.to_vec();` allocates a Vec per chunk per save_frozen_batches. steady-state per-batch alloc on what is now the primary write path. cache a reusable Vec in the writer or refactor the compio iov call to take a borrowed slice.
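The reusable-buffer variant could look like this sketch (the struct and method names are illustrative, not the writer's real API); clearing a Vec keeps its capacity, so steady state does no per-batch allocation:

```rust
// Hypothetical writer fragment: a scratch Vec owned by the writer replaces
// the per-chunk `chunk.to_vec()` allocation.
pub struct MessagesWriterScratch {
    chunk_buf: Vec<u8>,
}

impl MessagesWriterScratch {
    pub fn new() -> Self {
        Self { chunk_buf: Vec::new() }
    }

    // `clear()` drops the contents but keeps the allocation, so after the
    // first batch this is copy-only — no allocator traffic per chunk.
    pub fn stage_chunk(&mut self, chunk: &[u8]) -> &[u8] {
        self.chunk_buf.clear();
        self.chunk_buf.extend_from_slice(chunk);
        &self.chunk_buf
    }
}
```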

core/shard/src/lib.rs:622 — `let namespaces: Vec<_> = planes.1.0.namespaces().copied().collect();` allocates a fresh Vec per inbox-frame iteration. core/shard/src/router.rs:270 calls process_loopback after every dispatched frame in steady state. hoist a `namespaces_buf: Vec<u64>` like the existing loopback_buf, or short-circuit when partitions / loopback drained are empty.

core/shard/src/router.rs:94-103 and 199-208 — `shards_table.shard_for(...).unwrap_or_else(|| { warn!(...); 0 })` silently falls back to shard 0. single-shard server-ng masks this today; multi-shard makes it a silent routing-bug hider. drop the fallback or fail loud.
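Failing loud instead of defaulting to shard 0 could look like the following sketch (the table shape and error type are illustrative; the real code consults `shards_table.shard_for(...)`, which returns an Option):

```rust
#[derive(Debug, PartialEq)]
pub enum RouteError {
    NoShardForNamespace(u64),
}

// Illustrative lookup: (namespace, shard id) pairs standing in for the
// real shards table.
pub fn shard_for(table: &[(u64, usize)], namespace: u64) -> Result<usize, RouteError> {
    table
        .iter()
        .find(|(ns, _)| *ns == namespace)
        .map(|&(_, shard)| shard)
        // No silent `unwrap_or(0)`: a miss is a routing bug, surface it.
        .ok_or(RouteError::NoShardForNamespace(namespace))
}
```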

core/binary_protocol/src/consensus/message.rs:347 — transmute_header does a 256B header copy + 256B memset + `Message::try_from(owned)` re-validate after the closure already wrote a known-valid header. that's a 3rd validation per request (1st at transport recv, 2nd in try_into_typed, 3rd here). reached per client message via the new make_deferred_*_handler closures at core/server-ng/src/bootstrap.rs:1086,1116. add a transmute_header_unchecked that skips the trailing TryFrom validation and rely on the closure invariant.

core/metadata/src/stm/stream.rs:615-676 — Snapshotable::to_snapshot reads round_robin_counter with Ordering::Relaxed during snapshot fill. concurrent producer increments race the snapshot read. single-shard today: not exploitable. document the invariant.

core/common/src/types/streaming_stats.rs — counter fetch_add / fetch_sub use Ordering::AcqRel even though the counters synchronize no other data. Relaxed is correct; AcqRel forces unnecessary ldar / stlr fences on aarch64, on the per-batch produce path. confirmed via method names like _inconsistent and load_for_snapshot that no consumer relies on the AcqRel half.
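The relaxation is safe for standalone counters because atomicity of the read-modify-write is independent of the ordering parameter; only cross-data happens-before guarantees are given up, and here no other data hangs off the counter. A minimal sketch (struct and method names are hypothetical, not the crate's actual API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stats counter: it publishes no other memory, so Relaxed
// keeps the RMW atomic while dropping the acquire/release fences that
// AcqRel would emit (ldar/stlr on aarch64).
pub struct StreamingStats {
    messages: AtomicU64,
}

impl StreamingStats {
    pub fn new() -> Self {
        Self { messages: AtomicU64::new(0) }
    }

    pub fn add_messages(&self, n: u64) {
        self.messages.fetch_add(n, Ordering::Relaxed);
    }

    // Snapshot read: no consumer relies on ordering here either, so a
    // Relaxed load is sufficient.
    pub fn load_for_snapshot(&self) -> u64 {
        self.messages.load(Ordering::Relaxed)
    }
}
```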

Comment thread core/server-ng/src/bootstrap.rs (×5)
Comment thread core/configs/src/server_ng_config/server_ng.rs
Comment thread core/configs/src/server_config/server.rs
Comment thread core/consensus/src/impls.rs
Comment thread core/metadata/src/impls/recovery.rs
Comment thread core/common/src/sharding/namespace.rs
@numinnex
Contributor Author

numinnex commented May 6, 2026

Addressed all of the correctness issues. I've skipped the optimization nits, as this code will change quite a few times before it's used by the main server binary.

@hubcio (Contributor) left a comment:

additional findings, citing code outside this PR's diff hunks:

core/journal/src/prepare_journal.rs:143 - open() does not detect WAL exceeding slot capacity

file not in this PR, but the recovery rewrite exposes this gap. live append cannot wrap the 1024-slot ring (panic at prepare_journal.rs:422-431 within 411-443; SnapshotCoordinator::checkpoint_if_needed at core/metadata/src/impls/metadata.rs:243-260 with CHECKPOINT_MARGIN=64 forces a snapshot near capacity), so the live runtime cannot lose ops. but the rebuild loop at lines 143-184 silently overwrites slot N % SLOT_COUNT when reading a WAL file containing more entries than the ring can hold (manual recovery, copy from a peer, mismatched binary). iter_headers_from(0) then returns only the last 1024 slots, and recovery.rs:142 (the recovery caller in this PR) does not detect this. fix: at open, bail when the first slot's op exceeds the expected replay_from, or when the slot count exceeds SLOT_COUNT. defensive; cheap.
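The proposed open-time check amounts to two comparisons; a sketch under the finding's assumptions (the function name, error type, and how the counts are obtained are all illustrative):

```rust
const SLOT_COUNT: u64 = 1024; // ring capacity cited in the finding

#[derive(Debug, PartialEq)]
pub enum WalOpenError {
    // More entries than slots: the rebuild loop would silently wrap.
    ExceedsRingCapacity { entries: u64 },
    // First op already past the expected replay point: ops are missing.
    FirstOpAheadOfReplay { first_op: u64, replay_from: u64 },
}

pub fn validate_wal_at_open(
    entries: u64,
    first_op: u64,
    replay_from: u64,
) -> Result<(), WalOpenError> {
    if entries > SLOT_COUNT {
        return Err(WalOpenError::ExceedsRingCapacity { entries });
    }
    if first_op > replay_from {
        return Err(WalOpenError::FirstOpAheadOfReplay { first_op, replay_from });
    }
    Ok(())
}
```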

core/consensus/src/impls.rs:732 - set_view should take &self

`view: Cell<u32>` (line 504). set_last_prepare_checksum(&self) (line 779) and set_log_view(&self) (line 788) take &self to mutate identical Cell fields; set_view is the only odd one out at &mut self. Cell allows mutation through shared refs, so the &mut is not load-bearing. caller bootstrap.rs:251 has `let mut consensus = VsrConsensus::new(...)` solely because of this method, then the later setters work on &self. fix: `pub fn set_view(&self, view: u32)`. one-line.
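Cell's set goes through a shared reference, so the change is purely a signature relaxation. A reduced sketch (only the view field of the real struct is modeled here):

```rust
use std::cell::Cell;

// Reduced stand-in for VsrConsensus: only the view field matters here.
struct VsrConsensus {
    view: Cell<u32>,
}

impl VsrConsensus {
    fn new() -> Self {
        Self { view: Cell::new(0) }
    }

    // &self, matching set_last_prepare_checksum / set_log_view: Cell gives
    // interior mutability, so &mut buys nothing and only forces callers
    // into `let mut consensus`.
    pub fn set_view(&self, view: u32) {
        self.view.set(view);
    }

    pub fn view(&self) -> u32 {
        self.view.get()
    }
}
```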

core/consensus/src/impls.rs:750 - pipeline_mut is dead code

`pub const fn pipeline_mut(&mut self) -> &mut RefCell<P>`. `rg pipeline_mut` returns 1 match — the definition itself — with no callers anywhere in core/. unused dead code should be removed. returning &mut RefCell is doubly redundant: RefCell already provides interior mutability via borrow_mut. the author TODO at line 741 already flags the borrow-across-await footgun on the related pipeline() accessor; the same risk applies here, plus the &mut requirement. fix: delete.

Comment thread core/server-ng/src/bootstrap.rs Outdated
Comment thread core/server-ng/src/bootstrap.rs Outdated
```rust
let recovered = recover::<ServerNgMuxStateMachine>(Path::new(&config.system.path))
    .await
    .map_err(ServerNgError::MetadataRecovery)?;
let restored_op = recovered.last_applied_op.unwrap_or_else(|| {
```
Contributor


recovery.rs:139-158 walks the WAL and applies every replayed entry, setting last_applied_op to journal tail M. bootstrap then calls restore_commit_state(M, M) declaring commit_min == commit_max == M. impls.rs:1110-1130, 1218, 1230 propagate commit_min via primary heartbeat and DVC; view_change_quorum.rs:88-95 dvc_max_commit propagates further into peer state via complete_view_change_as_primary.

WAL entries past the last quorum-confirmed op are NOT committed; bootstrap declares them so. multi-replica safety violation: a replica with uncommitted-but-replayed ops claims they're committed.

single-node trivially correct (quorum=1). latent landmine the moment cluster ships.

fix: persist a separate commit watermark (superblock) and only restore commit_min up to that watermark. replay-as-applied does not mean replay-as-committed.
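One possible shape for the fix, assuming a persisted quorum watermark is available at recovery time (the function and struct names are hypothetical):

```rust
#[derive(Debug, PartialEq)]
pub struct CommitState {
    pub commit_min: u64,
    pub commit_max: u64,
}

// Clamp the restored commit bounds to the quorum-confirmed watermark rather
// than the journal tail: entries in (watermark, tail] were replayed but
// never quorum-confirmed, so they must not be declared committed on restart.
pub fn restore_commit_state(journal_tail: u64, persisted_watermark: u64) -> CommitState {
    let committed = persisted_watermark.min(journal_tail);
    CommitState { commit_min: committed, commit_max: committed }
}
```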

Contributor Author


We don't store the commit ranges persistently yet. I've done this under the assumption that, for now, the cluster on bootstrap will start either with empty state, or with all replicas in exactly the same state at the shutdown/reboot moment (no state transfer support).

Comment thread core/server-ng/src/bootstrap.rs
Comment thread core/server-ng/src/bootstrap.rs
```rust
pub const STREAM_MASK: u64 = (1u64 << STREAM_BITS) - 1;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NamespaceCapacityError {
```
Contributor


I think we should use derive Error from thiserror crate.
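For reference, thiserror's derive replaces the hand-written Display and Error impls; shown below side by side, with an illustrative variant since the enum's real variants aren't visible in this excerpt:

```rust
use std::fmt;

// With thiserror, the two impls below collapse to:
//
//     #[derive(Debug, Clone, Copy, PartialEq, Eq, thiserror::Error)]
//     pub enum NamespaceCapacityError {
//         #[error("stream id {0} exceeds namespace capacity")]
//         StreamIdOverflow(u64),
//     }
//
// The variant and message are illustrative, not the crate's actual ones.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NamespaceCapacityError {
    StreamIdOverflow(u64),
}

impl fmt::Display for NamespaceCapacityError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::StreamIdOverflow(id) => {
                write!(f, "stream id {id} exceeds namespace capacity")
            }
        }
    }
}

impl std::error::Error for NamespaceCapacityError {}
```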
