feat: make chainlink aware of validator mode#1227
Conversation
📝 WalkthroughWalkthroughAdds coordination-mode checks to HTTP RPC handling to reject or short-circuit ensure/transaction flows based on coordination mode and Chainlink primary runtime readiness. Refactors Chainlink into a lifecycle/runtime model with enable_primary/disable and runtime_fetch_cloner gating, implements coordinated shutdown for FetchCloner, RemoteAccountProvider, and SubMuxClient (including test mocks), updates replicator/validator primary/standby sequencing to enforce ordering, and expands tests and metrics to cover routing decisions and shutdown behavior. Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
magicblock-api/src/magic_validator.rs (1)
879-901:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon't treat
mode_tx.send(...)as a completed Primary transition.This only queues
SchedulerMode::Primary. The new HTTP/chainlink gates readCoordinationMode::current(), so startup can continue pastenable_primary()while the old mode is still visible. Please make the coordination-mode switch synchronous here, or wait for an acknowledgment before enabling primary-only behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-api/src/magic_validator.rs` around lines 879 - 901, The code currently only enqueues SchedulerMode::Primary via mode_tx.send(...) which doesn't guarantee CoordinationMode::current() has switched before calling chainlink.enable_primary(); change this to perform a synchronous coordination-mode transition or wait for an explicit acknowledgment: either (a) invoke the coordination-mode switch synchronously (so CoordinationMode::current() reflects Primary before calling chainlink.enable_primary()), or (b) extend the scheduler handshake to return/await an ack after processing SchedulerMode::Primary (e.g., send a request that the scheduler replies to) and only call chainlink.enable_primary() after receiving that ack; update the logic around mode_tx.send, SchedulerMode::Primary, and chainlink.enable_primary to ensure the coordination mode is visible before enabling primary-only behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@magicblock-replicator/src/service/context.rs`:
- Around line 244-249: The promotion/demotion path currently enqueues
SchedulerMode inside enter_primary_mode()/enter_replica_mode() but silently
drops send failures, so the rest of
run_primary_readiness_sequence()/run_replica_readiness_sequence() can proceed
while CoordinationMode::current() is stale or the receiver is gone; change
enter_primary_mode and enter_replica_mode to return Result<(), Error> (or a
suitable error type) and have them propagate the failure of mode_tx.send(...)
instead of ignoring it, then update the calls in run_primary_readiness_sequence
and the analogous replica sequence to invoke those methods through the fallible
closures you already pass (e.g. replace || self.enter_primary_mode() with a
closure that returns the Result), so the readiness sequence aborts on
send/send-failure and the mode switch is externally visible; ensure
SchedulerMode send failures are converted into the same error type you return so
callers can handle them.
---
Outside diff comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 879-901: The code currently only enqueues SchedulerMode::Primary
via mode_tx.send(...) which doesn't guarantee CoordinationMode::current() has
switched before calling chainlink.enable_primary(); change this to perform a
synchronous coordination-mode transition or wait for an explicit acknowledgment:
either (a) invoke the coordination-mode switch synchronously (so
CoordinationMode::current() reflects Primary before calling
chainlink.enable_primary()), or (b) extend the scheduler handshake to
return/await an ack after processing SchedulerMode::Primary (e.g., send a
request that the scheduler replies to) and only call chainlink.enable_primary()
after receiving that ack; update the logic around mode_tx.send,
SchedulerMode::Primary, and chainlink.enable_primary to ensure the coordination
mode is visible before enabling primary-only behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 43480cb3-ed7b-4cd8-8ece-e9efe4b990f5
📒 Files selected for processing (10)
magicblock-aperture/src/requests/http/mod.rsmagicblock-aperture/src/requests/http/send_transaction.rsmagicblock-aperture/src/requests/http/simulate_transaction.rsmagicblock-api/src/magic_validator.rsmagicblock-chainlink/src/chainlink/mod.rsmagicblock-chainlink/tests/10_non_primary_gating.rsmagicblock-chainlink/tests/utils/test_context.rsmagicblock-replicator/src/error.rsmagicblock-replicator/src/service/context.rsmagicblock-replicator/src/service/mod.rs
* master: fix: CommitFinalize post-commit actions (#1220)
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@magicblock-aperture/src/requests/http/mod.rs`:
- Around line 207-210: The ensure_routing_decision calls are using hardcoded
labels instead of the passed RPC method parameter, which drops per-RPC context;
update each call site to pass the method variable rather than a string literal
(e.g., replace ensure_routing_decision("read_account") with
ensure_routing_decision(method)), and if the parameter is named _method rename
it to method (or remove the leading underscore) so it can be used; apply this
fix for all functions that accept a method label and currently call
ensure_routing_decision with fixed strings (the three occurrences shown).
In `@magicblock-metrics/src/metrics/mod.rs`:
- Around line 216-223: Replace the .unwrap() on the metric construction with
explicit error handling: call IntCounterVec::new(...) and match on the Result,
returning the created IntCounterVec on Ok or invoking panic! with a clear,
descriptive message including the error on Err. Apply the same change for the
other metric initialization referenced around lines 257-265; reference the
static names APERTURE_ENSURE_ROUTING_COUNT (and the other metric static) so you
replace .unwrap() with a match that panics with a contextual message and the
underlying error.
In `@magicblock-replicator/src/service/context.rs`:
- Around line 246-255: After calling chainlink.enable_primary().await (in the
block using chainlink.enable_primary()), ensure we rollback by calling
chainlink.disable_primary().await when subsequent steps fail: if
chainlink.primary_runtime_readiness() does not allow promotion or if
mode_tx.send(SchedulerMode::Primary).await fails, call disable_primary() and
propagate an error; for the mode_tx send failure ensure disable_primary() is
awaited even if send returns Err, and convert any disable error into/log
alongside the original error so Chainlink is not left enabled when the function
returns an Err.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 192b3c12-314f-401a-bc9d-a4a58900cc75
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (20)
magicblock-aperture/Cargo.tomlmagicblock-aperture/src/error.rsmagicblock-aperture/src/requests/http/get_account_info.rsmagicblock-aperture/src/requests/http/get_balance.rsmagicblock-aperture/src/requests/http/get_delegation_status.rsmagicblock-aperture/src/requests/http/get_multiple_accounts.rsmagicblock-aperture/src/requests/http/get_token_account_balance.rsmagicblock-aperture/src/requests/http/mod.rsmagicblock-aperture/src/requests/http/send_transaction.rsmagicblock-aperture/src/requests/http/simulate_transaction.rsmagicblock-api/src/magic_validator.rsmagicblock-chainlink/Cargo.tomlmagicblock-chainlink/src/chainlink/errors.rsmagicblock-chainlink/src/chainlink/mod.rsmagicblock-chainlink/tests/10_non_primary_gating.rsmagicblock-chainlink/tests/11_lifecycle_shutdown.rsmagicblock-metrics/src/metrics/mod.rsmagicblock-replicator/Cargo.tomlmagicblock-replicator/src/service/context.rsmagicblock-replicator/src/service/mod.rs
| _method: &'static str, | ||
| pubkey: &Pubkey, | ||
| ) -> Option<AccountSharedData> { | ||
| let mark_empty_if_not_found = [*pubkey]; | ||
| let _timer = ENSURE_ACCOUNTS_TIME | ||
| .with_label_values(&["account"]) | ||
| .start_timer(); | ||
| let _ = self | ||
| .chainlink | ||
| .ensure_accounts( | ||
| &[*pubkey], | ||
| Some(&mark_empty_if_not_found), | ||
| AccountFetchOrigin::GetAccount, | ||
| None, | ||
| ) | ||
| .await | ||
| .inspect_err(|e| { | ||
| // There is nothing we can do if fetching the account fails | ||
| // Log the error and return whatever is in the accounts db | ||
| debug!(error = ?e, "Failed to ensure account"); | ||
| }); | ||
| self.accountsdb.get_account(pubkey) | ||
| ) -> RpcResult<Option<AccountSharedData>> { | ||
| match self.ensure_routing_decision("read_account").await { |
There was a problem hiding this comment.
Use the passed RPC method instead of hardcoded ensure labels.
Line 207, Line 249, and Line 354 accept a method label, but Line 210, Line 252, and Line 357 call ensure_routing_decision(...) with fixed strings. This drops per-RPC context, so call-site method names are ignored for routing metrics/log labels.
Suggested fix
async fn read_account_with_ensure(
&self,
- _method: &'static str,
+ method: &'static str,
pubkey: &Pubkey,
) -> RpcResult<Option<AccountSharedData>> {
- match self.ensure_routing_decision("read_account").await {
+ match self.ensure_routing_decision(method).await {
...
}
}
async fn read_accounts_with_ensure(
&self,
- _method: &'static str,
+ method: &'static str,
pubkeys: &[Pubkey],
) -> RpcResult<Vec<Option<AccountSharedData>>> {
- match self.ensure_routing_decision("read_accounts").await {
+ match self.ensure_routing_decision(method).await {
...
}
}
async fn ensure_transaction_accounts(
&self,
- _method: &'static str,
+ method: &'static str,
transaction: &SanitizedTransaction,
) -> RpcResult<()> {
- match self.ensure_routing_decision("transaction").await {
+ match self.ensure_routing_decision(method).await {
...
}
}Also applies to: 249-252, 354-357
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-aperture/src/requests/http/mod.rs` around lines 207 - 210, The
ensure_routing_decision calls are using hardcoded labels instead of the passed
RPC method parameter, which drops per-RPC context; update each call site to pass
the method variable rather than a string literal (e.g., replace
ensure_routing_decision("read_account") with ensure_routing_decision(method)),
and if the parameter is named _method rename it to method (or remove the leading
underscore) so it can be used; apply this fix for all functions that accept a
method label and currently call ensure_routing_decision with fixed strings (the
three occurrences shown).
| pub static ref APERTURE_ENSURE_ROUTING_COUNT: IntCounterVec = IntCounterVec::new( | ||
| Opts::new( | ||
| "aperture_ensure_routing_count", | ||
| "Number of aperture ensure-routing decisions by method and decision", | ||
| ), | ||
| &["method", "decision"], | ||
| ).unwrap(); | ||
|
|
There was a problem hiding this comment.
Remove .unwrap() from production metric initialization.
Both new metric constructors use .unwrap() in production code. Please switch to explicit error handling (or an explicit invariant path without unwrap/expect) to satisfy the repository rule.
Example adjustment pattern
+fn build_int_counter_vec(name: &str, help: &str, labels: &[&str]) -> IntCounterVec {
+ match IntCounterVec::new(Opts::new(name, help), labels) {
+ Ok(metric) => metric,
+ Err(err) => panic!("metric definition invariant violated for `{name}`: {err}"),
+ }
+}
...
- pub static ref APERTURE_ENSURE_ROUTING_COUNT: IntCounterVec = IntCounterVec::new(
- Opts::new(
- "aperture_ensure_routing_count",
- "Number of aperture ensure-routing decisions by method and decision",
- ),
- &["method", "decision"],
- ).unwrap();
+ pub static ref APERTURE_ENSURE_ROUTING_COUNT: IntCounterVec = build_int_counter_vec(
+ "aperture_ensure_routing_count",
+ "Number of aperture ensure-routing decisions by method and decision",
+ &["method", "decision"],
+ );As per coding guidelines {magicblock-*,programs,storage-proto}/**: Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.
Also applies to: 257-265
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-metrics/src/metrics/mod.rs` around lines 216 - 223, Replace the
.unwrap() on the metric construction with explicit error handling: call
IntCounterVec::new(...) and match on the Result, returning the created
IntCounterVec on Ok or invoking panic! with a clear, descriptive message
including the error on Err. Apply the same change for the other metric
initialization referenced around lines 257-265; reference the static names
APERTURE_ENSURE_ROUTING_COUNT (and the other metric static) so you replace
.unwrap() with a match that panics with a contextual message and the underlying
error.
| let outcome = chainlink.enable_primary().await.map_err(Error::from)?; | ||
| let readiness = chainlink.primary_runtime_readiness().await; | ||
| if !readiness.allows_primary_ensure() { | ||
| return Err(Error::Internal(format!( | ||
| "chainlink primary runtime not ready after enable: outcome={outcome:?}, readiness={readiness:?}" | ||
| ))); | ||
| } | ||
| mode_tx.send(SchedulerMode::Primary).await.map_err(|e| { | ||
| Error::Internal(format!("failed to enter primary mode: {e}")) | ||
| }) |
There was a problem hiding this comment.
Rollback Chainlink if promotion fails after enable_primary().
After Line 246 succeeds, failures on Line 248 (readiness gate) or Line 253 (mode publish) return early without disabling Chainlink. That can leave remote runtime active while still non-primary.
Suggested rollback pattern
let outcome = chainlink.enable_primary().await.map_err(Error::from)?;
let readiness = chainlink.primary_runtime_readiness().await;
if !readiness.allows_primary_ensure() {
+ let _ = chainlink.disable().await;
return Err(Error::Internal(format!(
"chainlink primary runtime not ready after enable: outcome={outcome:?}, readiness={readiness:?}"
)));
}
-mode_tx.send(SchedulerMode::Primary).await.map_err(|e| {
- Error::Internal(format!("failed to enter primary mode: {e}"))
-})
+if let Err(e) = mode_tx.send(SchedulerMode::Primary).await {
+ let _ = chainlink.disable().await;
+ return Err(Error::Internal(format!("failed to enter primary mode: {e}")));
+}
+Ok(())📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let outcome = chainlink.enable_primary().await.map_err(Error::from)?; | |
| let readiness = chainlink.primary_runtime_readiness().await; | |
| if !readiness.allows_primary_ensure() { | |
| return Err(Error::Internal(format!( | |
| "chainlink primary runtime not ready after enable: outcome={outcome:?}, readiness={readiness:?}" | |
| ))); | |
| } | |
| mode_tx.send(SchedulerMode::Primary).await.map_err(|e| { | |
| Error::Internal(format!("failed to enter primary mode: {e}")) | |
| }) | |
| let outcome = chainlink.enable_primary().await.map_err(Error::from)?; | |
| let readiness = chainlink.primary_runtime_readiness().await; | |
| if !readiness.allows_primary_ensure() { | |
| let _ = chainlink.disable().await; | |
| return Err(Error::Internal(format!( | |
| "chainlink primary runtime not ready after enable: outcome={outcome:?}, readiness={readiness:?}" | |
| ))); | |
| } | |
| if let Err(e) = mode_tx.send(SchedulerMode::Primary).await { | |
| let _ = chainlink.disable().await; | |
| return Err(Error::Internal(format!("failed to enter primary mode: {e}"))); | |
| } | |
| Ok(()) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-replicator/src/service/context.rs` around lines 246 - 255, After
calling chainlink.enable_primary().await (in the block using
chainlink.enable_primary()), ensure we rollback by calling
chainlink.disable_primary().await when subsequent steps fail: if
chainlink.primary_runtime_readiness() does not allow promotion or if
mode_tx.send(SchedulerMode::Primary).await fails, call disable_primary() and
propagate an error; for the mode_tx send failure ensure disable_primary() is
awaited even if send returns Err, and convert any disable error into/log
alongside the original error so Chainlink is not left enabled when the function
returns an Err.
|
Superseeded by simpler PR |
Summary
Make Chainlink and RPC behavior coordination-mode aware so replica/standby validators stay local-only until they become primary, and fully tie Chainlink remote runtime startup/shutdown to primary lifecycle transitions.
ADDRESSES: #1203
Details
This PR tightens the boundary between primary and non-primary validator behavior. Transaction RPCs are rejected while a validator is non-primary, account read RPCs remain local-only, and Chainlink remote work now exists only while the validator is explicitly enabled as primary.
Steps when Switching Modes
Switching to Primary Mode
For standalone startup, the validator performs account-bank cleanup before enabling Chainlink and publishing primary scheduler mode. For replicated validators, the same cleanup runs during promotion after replica replay work is idle.
chainlink.enable_primary(). This builds and installs the active Chainlink runtime: remote account provider, WebSocket/gRPC/Laserstream pubsub clients, FetchCloner, removed-account subscription handling, and subscription management.Primarymode to the scheduler/global coordination state. At this point primary traffic can be accepted and Chainlink remote paths are active.Switching from Primary Mode
When a primary demotes into standby/replica behavior, the order is intentionally reversed around externally visible behavior: stop being primary first, then perform Chainlink shutdown and standby setup.
Replicaimmediately at the start of standby transition.chainlink.disable()to take and stop the active Chainlink runtime.AccountsDb, undelegation subscription setup is skipped, removed-account eviction notifications cannot submit eviction transactions, and public calls do not recreate remote subscriptions.sendTransactionandsimulateTransaction) reject while non-primary, so they cannot trigger transaction preparation, account ensure, signature-cache mutation, or scheduler simulation.magicblock-aperture
RPC dispatch checks the current coordination mode.
sendTransactionis rejected while non-primary before it can affect the transaction cache, andsimulateTransactionis rejected before transaction preparation, account ensuring, or scheduler simulation. Account reads remain usable against local state. Tests cover non-primary transaction RPC rejection and local account reads.magicblock-chainlink
Chainlink now owns an explicit lifecycle state (
Disabled,Starting,Enabled,Stopping) and an optional active runtime. Runtime construction is deferred untilenable_primary(), whiledisable()tears down the active runtime asynchronously. The previous broad defensive remote-sync gates inside lower-level Chainlink paths were removed in favor of this active/inactive runtime boundary.The shutdown path covers FetchCloner cancellation, pending waiter failure, remote account provider shutdown, deterministic pubsub/submux cleanup, deferred pubsub task shutdown, and removed-account subscription task cancellation. Tests cover inactive startup modes, repeated enable/disable cycles, public calls after disable, provider account updates after disable, removed-account notifications after disable, and fresh runtime startup after shutdown.
magicblock-api
Standalone validators now reset the account bank only as part of primary readiness, then enable Chainlink before publishing primary scheduler mode. If sending primary mode to the scheduler fails, Chainlink enablement is rolled back. Replicated validators skip startup reset and defer cleanup until promotion.
magicblock-replicator
Promotion ordering is documented and covered: wait for scheduler idle, reset account bank, enable Chainlink runtime, then enter primary mode. Demotion switches to replica mode before disabling Chainlink and before standby setup can block or retry, preventing stale primary behavior from continuing during transition. Tests cover promotion failure ordering, Chainlink enable/disable ordering, and replica-only behavior that must not reach the primary path.
Summary by CodeRabbit
Bug Fixes
Refactor
New Features
Tests