
Stop user RPC on local DB too big #175

Open

ZanCorDX wants to merge 18 commits into main from dx/stop-rpc-on-disk-full

Conversation

@ZanCorDX (Collaborator) commented Feb 5, 2026

This PR adds a new parameter, disk_max_size_to_accept_user_rpc_mb.
When the user RPC detects that the local DB backup size exceeds disk_max_size_to_accept_user_rpc_mb, it starts rejecting flow.
This lets us configure the limit so that we never lose archive flow.
It also exposes disk_max_size_to_accept_user_rpc_mb as a metric, so alarms can be added easily.

@metachris requested a review from Copilot February 11, 2026 07:55
@metachris (Contributor)

If it starts rejecting flow, does it also return an error on the readyz endpoint? It probably should, so it's removed from the load balancer for the time being.
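
For illustration, a minimal sketch of the kind of readiness gating being asked about here, assuming the health endpoint can see the same shared disk-size state as the ingress. The type names, the is_ready helper, and the use of the http crate's StatusCode are assumptions for the sketch, not the PR's actual code:

use std::sync::atomic::{AtomicU64, Ordering};

/// Shared state visible to both the ingress and the health endpoint (illustrative).
pub struct ReadyState {
    pub backup_disk_size_bytes: AtomicU64,
    pub disk_max_size_to_accept_user_rpc: u64,
}

impl ReadyState {
    /// The node should stay in the load balancer only while the local backup
    /// is still at or below the configured threshold.
    pub fn is_ready(&self) -> bool {
        self.backup_disk_size_bytes.load(Ordering::Relaxed)
            <= self.disk_max_size_to_accept_user_rpc
    }
}

/// Hypothetical readyz handler body: 200 while healthy, 503 once user RPC is
/// being rejected, so the load balancer drops the instance until the backup drains.
pub fn readyz_status(state: &ReadyState) -> http::StatusCode {
    if state.is_ready() {
        http::StatusCode::OK
    } else {
        http::StatusCode::SERVICE_UNAVAILABLE
    }
}

Returning 503 rather than 500 makes the condition look like temporary unavailability, so the instance is re-added automatically once the backup drains.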

Copilot AI left a comment

Pull request overview

Adds a configurable safeguard to reject user RPC requests when the local ClickHouse backup DB grows beyond a configured size, protecting archive flow and exposing the threshold via metrics.

Changes:

  • Introduces disk_max_size_to_accept_user_rpc_mb CLI/config and propagates it into ingress runtime state.
  • Adds a DiskFull JSON-RPC error and rejects user RPC when ClickHouse local backup exceeds threshold.
  • Exposes configured threshold as a metric and adds an integration test covering reject→drain→accept behavior.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.

Summary per file:

src/cli.rs: Adds new CLI/env config for the max disk size threshold (MB)
src/lib.rs: Publishes the threshold metric and passes the threshold (bytes) into ingress
src/metrics.rs: Adds a gauge metric for the configured threshold
src/jsonrpc.rs: Adds the DiskFull JSON-RPC error and its mappings
src/ingress/mod.rs: Enforces the threshold by rejecting requests based on shared disk-size state
src/indexer/click/mod.rs: Tracks backup DB disk size globally, updates it from ClickHouse backup metrics, and adds an integration test
src/utils.rs: Adds helpers for generating bundles with predictable tx input sizes (test support)
Cargo.toml: Adds the parking_lot dependency

Comment on lines +269 to +277
/// Maximum local ClickHouse backup disk size in MB above which user RPC (e.g. eth_sendBundle)
/// is rejected with disk full. Defaults to 1024 MB (1 GiB).
#[clap(
long = "disk-max-size-to-accept-user-rpc-mb",
default_value_t = 1024,
env = "DISK_MAX_SIZE_TO_ACCEPT_USER_RPC",
id = "DISK_MAX_SIZE_TO_ACCEPT_USER_RPC"
)]
pub disk_max_size_to_accept_user_rpc_mb: u64,

Copilot AI Feb 11, 2026

The flag is documented as 'MB' but the runtime conversion uses 1024 * 1024 (MiB). Either update docs/flag naming to MiB (recommended for clarity) or change the conversion to use decimal MB (1_000_000) to match the current documentation. Also consider aligning the env var name with the unit (e.g., ..._MB or ..._MIB) to avoid ambiguity.
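
For reference, the two conversions the comment is contrasting, using the flag's documented default of 1024 (constant names here are illustrative only):

// Binary interpretation (MiB), which is what a `* 1024 * 1024` runtime conversion implements:
const THRESHOLD_MIB: u64 = 1024;
const THRESHOLD_BYTES_BINARY: u64 = THRESHOLD_MIB * 1024 * 1024; // 1_073_741_824 bytes = 1 GiB

// Decimal interpretation (MB), which is what the current "MB" documentation implies:
const THRESHOLD_MB: u64 = 1024;
const THRESHOLD_BYTES_DECIMAL: u64 = THRESHOLD_MB * 1_000_000; // 1_024_000_000 bytes, about 5% less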

Self::MethodNotFound(_) => StatusCode::NOT_FOUND,
Self::RateLimited => StatusCode::TOO_MANY_REQUESTS,
Self::Internal | Self::Unknown(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::Internal | Self::Unknown(_) | Self::DiskFull => StatusCode::INTERNAL_SERVER_ERROR,

Copilot AI Feb 11, 2026

Mapping DiskFull to HTTP 500 makes it look like a server crash rather than a capacity/backpressure condition. Consider using 507 Insufficient Storage (closest semantic match) or 503 Service Unavailable so clients can distinguish this from internal failures.

Suggested change
Self::Internal | Self::Unknown(_) | Self::DiskFull => StatusCode::INTERNAL_SERVER_ERROR,
Self::Internal | Self::Unknown(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::DiskFull => StatusCode::INSUFFICIENT_STORAGE,

self.bundle_receipts_size = size;
}
pub(crate) fn disk_size(&self) -> u64 {
self.bundles_size + self.bundle_receipts_size

Copilot AI Feb 11, 2026

Summing two u64 sizes with + can overflow and wrap/panic depending on build settings. Using saturating_add (or checked_add with fallback/logging) avoids incorrect disk-size readings when sizes get large.

Suggested change
self.bundles_size + self.bundle_receipts_size
self.bundles_size.saturating_add(self.bundle_receipts_size)

Comment on lines 161 to 164
if CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.lock().disk_size() >
ingress.disk_max_size_to_accept_user_rpc
{
return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);

Copilot AI Feb 11, 2026

This takes a global mutex on every user RPC request (hot path). Since the value is just a size snapshot, consider switching to atomics (e.g., AtomicU64 for total size, or two atomics and saturating_add on read) to avoid contention and reduce tail latency under load.

Suggested change
if CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.lock().disk_size() >
ingress.disk_max_size_to_accept_user_rpc
{
return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);
if let Ok(backup_disk_size) = CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.try_lock() {
if backup_disk_size.disk_size() > ingress.disk_max_size_to_accept_user_rpc {
return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);
}

(DISK_MAX_SIZE_TO_ACCEPT_USER_RPC_MB * 1024 * 1024) as usize /
(BUNDLE_TX_COUNT * BUNDLE_TX_INPUT_SIZE);

const FLOWPROXY_START_DELAY_MS: Duration = Duration::from_millis(800);

Copilot AI Feb 11, 2026

The test uses a fixed sleep to wait for the server to start, which is a common source of flaky CI failures. Prefer polling a readiness condition (e.g., hitting /readyz or retrying the RPC until it responds) with a bounded timeout instead of sleeping a fixed duration.

Comment on lines +692 to +696
tokio::time::sleep(FLOWPROXY_START_DELAY_MS).await;

let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();


Copilot AI Feb 11, 2026

The test uses a fixed sleep to wait for the server to start, which is a common source of flaky CI failures. Prefer polling a readiness condition (e.g., hitting /readyz or retrying the RPC until it responds) with a bounded timeout instead of sleeping a fixed duration.

Suggested change
tokio::time::sleep(FLOWPROXY_START_DELAY_MS).await;
let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();
let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();
let readiness_client = reqwest_client.clone();
tokio::time::timeout(FLOWPROXY_START_DELAY_MS, async move {
loop {
match readiness_client.get(&user_url).send().await {
Ok(_) => break,
Err(_) => tokio::time::sleep(Duration::from_millis(50)).await,
}
}
})
.await
.expect("flowproxy did not become ready in time");

@ZanCorDX (Collaborator, Author)

If it starts rejecting flow, does it also return an error on the readyz endpoint? It probably should, so it's removed from the load balancer for the time being.

Good point, added.

Comment on lines 59 to 61
pub(crate) static CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE: LazyLock<
Mutex<ClickhouseLocalBackupDiskSize>,
> = LazyLock::new(|| Mutex::new(ClickhouseLocalBackupDiskSize::default()));
Collaborator

Instead of a Mutex, would it make sense to use AtomicU64 in ClickhouseLocalBackupDiskSize? Just to avoid locking for every RPC request

Collaborator Author

done
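
For illustration, a sketch of what the AtomicU64-based version suggested above could look like. It mirrors the field and method names from the excerpt, but it is only a sketch of the idea, not the code that was actually committed:

use std::sync::atomic::{AtomicU64, Ordering};

/// Lock-free version of the shared backup-size state (illustrative).
pub(crate) struct ClickhouseLocalBackupDiskSize {
    bundles_size: AtomicU64,
    bundle_receipts_size: AtomicU64,
}

impl ClickhouseLocalBackupDiskSize {
    /// Writers (the backup task) publish new sizes with plain stores.
    pub(crate) fn set_bundles_size(&self, size: u64) {
        self.bundles_size.store(size, Ordering::Relaxed);
    }

    pub(crate) fn set_bundle_receipts_size(&self, size: u64) {
        self.bundle_receipts_size.store(size, Ordering::Relaxed);
    }

    /// Readers (the user-RPC hot path) do two relaxed loads; no mutex is needed,
    /// and saturating_add avoids the overflow concern raised in an earlier comment.
    pub(crate) fn disk_size(&self) -> u64 {
        self.bundles_size
            .load(Ordering::Relaxed)
            .saturating_add(self.bundle_receipts_size.load(Ordering::Relaxed))
    }
}

/// The global can then drop both the Mutex and the LazyLock (AtomicU64::new is const).
pub(crate) static CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE: ClickhouseLocalBackupDiskSize =
    ClickhouseLocalBackupDiskSize {
        bundles_size: AtomicU64::new(0),
        bundle_receipts_size: AtomicU64::new(0),
    };

With this shape, the per-request check becomes a couple of relaxed loads instead of a global lock, which also addresses the hot-path contention comment above.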

@mempirate (Collaborator) left a comment

Would it make sense to return a handle from spawn_clickhouse_indexer_and_backup in rbuilder_utils that contains these metrics, instead of this callback pattern? Seems a little less complex to me

ZanCorDX and others added 2 commits February 18, 2026 09:38
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ZanCorDX (Collaborator, Author) commented Feb 18, 2026

Would it make sense to return a handle from spawn_clickhouse_indexer_and_backup in rbuilder_utils that contains these metrics, instead of this callback pattern? Seems a little less complex to me

It got a little more complex because the metric is stateless: there is no "metrics" object being passed around.
The nicer approach would be to pass an Arc<dyn Metric> and implement whatever we want in set_disk_backup_size.
I just didn't want to make a huge change to the lib.
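
For illustration, a rough sketch of the Arc<dyn Metric> shape mentioned above. The trait, its method name, the FlowproxyMetrics type, and the spawn signature are hypothetical here, not the actual rbuilder_utils API:

use std::sync::Arc;

/// Hypothetical metrics sink the library could accept instead of a callback.
pub trait Metric: Send + Sync {
    fn set_disk_backup_size(&self, bytes: u64);
}

/// The caller implements it however it likes, e.g. by updating a gauge and the
/// shared disk-size state used by the ingress check.
struct FlowproxyMetrics;

impl Metric for FlowproxyMetrics {
    fn set_disk_backup_size(&self, bytes: u64) {
        // Illustrative: forward to whatever gauge/state this crate actually uses.
        println!("clickhouse backup size is now {bytes} bytes");
    }
}

/// Sketch of the signature being discussed: the library takes the sink and calls it
/// from the backup loop instead of invoking a stateless callback.
pub fn spawn_clickhouse_indexer_and_backup_sketch(metrics: Arc<dyn Metric>) {
    metrics.set_disk_backup_size(0);
}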

@ZanCorDX requested a review from mempirate February 18, 2026 13:08