
Stop user RPC on local DB too big #175

Open

ZanCorDX wants to merge 18 commits into main from dx/stop-rpc-on-disk-full

Conversation

@ZanCorDX (Collaborator) commented Feb 5, 2026

This PR adds a new parameter, disk_max_size_to_accept_user_rpc_mb.
When the user RPC detects that the local DB backup size exceeds disk_max_size_to_accept_user_rpc_mb, it starts rejecting flow.
This lets us configure the limit so that we never lose archive flow.
It also exposes disk_max_size_to_accept_user_rpc_mb as a metric, so alarms can be added easily.

@metachris requested a review from Copilot February 11, 2026 07:55
@metachris (Contributor)

If it starts rejecting flow, does it also return an error on the readyz endpoint? It probably should, so it's removed from the load balancer for the time being.
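
For illustration, a minimal sketch of the kind of readiness gating being asked about here, assuming the health endpoint can see the same shared disk-size state as the ingress. The type names, the is_ready helper, and the use of the http crate's StatusCode are assumptions for the sketch, not the PR's actual code:

use std::sync::atomic::{AtomicU64, Ordering};

/// Shared state visible to both the ingress and the health endpoint (illustrative).
pub struct ReadyState {
    pub backup_disk_size_bytes: AtomicU64,
    pub disk_max_size_to_accept_user_rpc: u64,
}

impl ReadyState {
    /// The node should stay in the load balancer only while the local backup
    /// is still at or below the configured threshold.
    pub fn is_ready(&self) -> bool {
        self.backup_disk_size_bytes.load(Ordering::Relaxed)
            <= self.disk_max_size_to_accept_user_rpc
    }
}

/// Hypothetical readyz handler body: 200 while healthy, 503 once user RPC is
/// being rejected, so the load balancer drops the instance until the backup drains.
pub fn readyz_status(state: &ReadyState) -> http::StatusCode {
    if state.is_ready() {
        http::StatusCode::OK
    } else {
        http::StatusCode::SERVICE_UNAVAILABLE
    }
}

Returning 503 rather than 500 makes the condition look like temporary unavailability, so the instance is re-added automatically once the backup drains.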

Copilot AI left a comment

Pull request overview

Adds a configurable safeguard to reject user RPC requests when the local ClickHouse backup DB grows beyond a configured size, protecting archive flow and exposing the threshold via metrics.

Changes:

  • Introduces disk_max_size_to_accept_user_rpc_mb CLI/config and propagates it into ingress runtime state.
  • Adds a DiskFull JSON-RPC error and rejects user RPC when ClickHouse local backup exceeds threshold.
  • Exposes configured threshold as a metric and adds an integration test covering reject→drain→accept behavior.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.

Summary per file:

src/cli.rs: Adds new CLI/env config for the max disk size threshold (MB)
src/lib.rs: Publishes the threshold metric and passes the threshold (bytes) into ingress
src/metrics.rs: Adds a gauge metric for the configured threshold
src/jsonrpc.rs: Adds the DiskFull JSON-RPC error and its mappings
src/ingress/mod.rs: Enforces the threshold by rejecting requests based on shared disk-size state
src/indexer/click/mod.rs: Tracks backup DB disk size globally, updates it from ClickHouse backup metrics, and adds an integration test
src/utils.rs: Adds helpers for generating bundles with predictable tx input sizes (test support)
Cargo.toml: Adds the parking_lot dependency

Comment on lines +269 to +277
/// Maximum local ClickHouse backup disk size in MB above which user RPC (e.g. eth_sendBundle)
/// is rejected with disk full. Defaults to 1024 MB (1 GiB).
#[clap(
long = "disk-max-size-to-accept-user-rpc-mb",
default_value_t = 1024,
env = "DISK_MAX_SIZE_TO_ACCEPT_USER_RPC",
id = "DISK_MAX_SIZE_TO_ACCEPT_USER_RPC"
)]
pub disk_max_size_to_accept_user_rpc_mb: u64,

Copilot AI Feb 11, 2026

The flag is documented as 'MB' but the runtime conversion uses 1024 * 1024 (MiB). Either update docs/flag naming to MiB (recommended for clarity) or change the conversion to use decimal MB (1_000_000) to match the current documentation. Also consider aligning the env var name with the unit (e.g., ..._MB or ..._MIB) to avoid ambiguity.
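
For reference, the two conversions the comment is contrasting, using the flag's documented default of 1024 (constant names here are illustrative only):

// Binary interpretation (MiB), which is what a `* 1024 * 1024` runtime conversion implements:
const THRESHOLD_MIB: u64 = 1024;
const THRESHOLD_BYTES_BINARY: u64 = THRESHOLD_MIB * 1024 * 1024; // 1_073_741_824 bytes = 1 GiB

// Decimal interpretation (MB), which is what the current "MB" documentation implies:
const THRESHOLD_MB: u64 = 1024;
const THRESHOLD_BYTES_DECIMAL: u64 = THRESHOLD_MB * 1_000_000; // 1_024_000_000 bytes, about 5% less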

Self::MethodNotFound(_) => StatusCode::NOT_FOUND,
Self::RateLimited => StatusCode::TOO_MANY_REQUESTS,
Self::Internal | Self::Unknown(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::Internal | Self::Unknown(_) | Self::DiskFull => StatusCode::INTERNAL_SERVER_ERROR,

Copilot AI Feb 11, 2026

Mapping DiskFull to HTTP 500 makes it look like a server crash rather than a capacity/backpressure condition. Consider using 507 Insufficient Storage (closest semantic match) or 503 Service Unavailable so clients can distinguish this from internal failures.

Suggested change
Self::Internal | Self::Unknown(_) | Self::DiskFull => StatusCode::INTERNAL_SERVER_ERROR,
Self::Internal | Self::Unknown(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::DiskFull => StatusCode::INSUFFICIENT_STORAGE,

self.bundle_receipts_size = size;
}
pub(crate) fn disk_size(&self) -> u64 {
self.bundles_size + self.bundle_receipts_size

Copilot AI Feb 11, 2026

Summing two u64 sizes with + can overflow and wrap/panic depending on build settings. Using saturating_add (or checked_add with fallback/logging) avoids incorrect disk-size readings when sizes get large.

Suggested change
self.bundles_size + self.bundle_receipts_size
self.bundles_size.saturating_add(self.bundle_receipts_size)

Comment on lines 161 to 164
if CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.lock().disk_size() >
ingress.disk_max_size_to_accept_user_rpc
{
return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);

Copilot AI Feb 11, 2026

This takes a global mutex on every user RPC request (hot path). Since the value is just a size snapshot, consider switching to atomics (e.g., AtomicU64 for total size, or two atomics and saturating_add on read) to avoid contention and reduce tail latency under load.

Suggested change
if CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.lock().disk_size() >
ingress.disk_max_size_to_accept_user_rpc
{
return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);
if let Ok(backup_disk_size) = CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.try_lock() {
if backup_disk_size.disk_size() > ingress.disk_max_size_to_accept_user_rpc {
return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);
}

(DISK_MAX_SIZE_TO_ACCEPT_USER_RPC_MB * 1024 * 1024) as usize /
(BUNDLE_TX_COUNT * BUNDLE_TX_INPUT_SIZE);

const FLOWPROXY_START_DELAY_MS: Duration = Duration::from_millis(800);

Copilot AI Feb 11, 2026

The test uses a fixed sleep to wait for the server to start, which is a common source of flaky CI failures. Prefer polling a readiness condition (e.g., hitting /readyz or retrying the RPC until it responds) with a bounded timeout instead of sleeping a fixed duration.

Comment on lines +692 to +696
tokio::time::sleep(FLOWPROXY_START_DELAY_MS).await;

let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();


Copilot AI Feb 11, 2026

The test uses a fixed sleep to wait for the server to start, which is a common source of flaky CI failures. Prefer polling a readiness condition (e.g., hitting /readyz or retrying the RPC until it responds) with a bounded timeout instead of sleeping a fixed duration.

Suggested change
tokio::time::sleep(FLOWPROXY_START_DELAY_MS).await;
let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();
let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();
let readiness_client = reqwest_client.clone();
tokio::time::timeout(FLOWPROXY_START_DELAY_MS, async move {
loop {
match readiness_client.get(&user_url).send().await {
Ok(_) => break,
Err(_) => tokio::time::sleep(Duration::from_millis(50)).await,
}
}
})
.await
.expect("flowproxy did not become ready in time");

@ZanCorDX (Collaborator, Author)

If it starts rejecting flow, does it also return an error on the readyz endpoint? It probably should, so it's removed from the load balancer for the time being.

Good point, added.

Comment on lines 59 to 61
pub(crate) static CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE: LazyLock<
Mutex<ClickhouseLocalBackupDiskSize>,
> = LazyLock::new(|| Mutex::new(ClickhouseLocalBackupDiskSize::default()));
Collaborator

Instead of a Mutex, would it make sense to use AtomicU64 in ClickhouseLocalBackupDiskSize? Just to avoid locking for every RPC request

Collaborator Author

done
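
For illustration, a sketch of what the AtomicU64-based version suggested above could look like. It mirrors the field and method names from the excerpt, but it is only a sketch of the idea, not the code that was actually committed:

use std::sync::atomic::{AtomicU64, Ordering};

/// Lock-free version of the shared backup-size state (illustrative).
pub(crate) struct ClickhouseLocalBackupDiskSize {
    bundles_size: AtomicU64,
    bundle_receipts_size: AtomicU64,
}

impl ClickhouseLocalBackupDiskSize {
    /// Writers (the backup task) publish new sizes with plain stores.
    pub(crate) fn set_bundles_size(&self, size: u64) {
        self.bundles_size.store(size, Ordering::Relaxed);
    }

    pub(crate) fn set_bundle_receipts_size(&self, size: u64) {
        self.bundle_receipts_size.store(size, Ordering::Relaxed);
    }

    /// Readers (the user-RPC hot path) do two relaxed loads; no mutex is needed,
    /// and saturating_add avoids the overflow concern raised in an earlier comment.
    pub(crate) fn disk_size(&self) -> u64 {
        self.bundles_size
            .load(Ordering::Relaxed)
            .saturating_add(self.bundle_receipts_size.load(Ordering::Relaxed))
    }
}

/// The global can then drop both the Mutex and the LazyLock (AtomicU64::new is const).
pub(crate) static CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE: ClickhouseLocalBackupDiskSize =
    ClickhouseLocalBackupDiskSize {
        bundles_size: AtomicU64::new(0),
        bundle_receipts_size: AtomicU64::new(0),
    };

With this shape, the per-request check becomes a couple of relaxed loads instead of a global lock, which also addresses the hot-path contention comment above.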

@mempirate (Collaborator) left a comment

Would it make sense to return a handle from spawn_clickhouse_indexer_and_backup in rbuilder_utils that contains these metrics, instead of this callback pattern? Seems a little less complex to me

ZanCorDX and others added 2 commits February 18, 2026 09:38
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ZanCorDX (Collaborator, Author) commented Feb 18, 2026

Would it make sense to return a handle from spawn_clickhouse_indexer_and_backup in rbuilder_utils that contains these metrics, instead of this callback pattern? Seems a little less complex to me

It got a little more complex because the metric is stateless: there is no "metrics" object being passed around.
The nicer approach would be to pass an Arc<dyn Metric> and implement whatever we want in set_disk_backup_size.
I just didn't want to make a huge change to the lib.
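
For illustration, a rough sketch of the Arc<dyn Metric> shape mentioned above. The trait, its method name, the FlowproxyMetrics type, and the spawn signature are hypothetical here, not the actual rbuilder_utils API:

use std::sync::Arc;

/// Hypothetical metrics sink the library could accept instead of a callback.
pub trait Metric: Send + Sync {
    fn set_disk_backup_size(&self, bytes: u64);
}

/// The caller implements it however it likes, e.g. by updating a gauge and the
/// shared disk-size state used by the ingress check.
struct FlowproxyMetrics;

impl Metric for FlowproxyMetrics {
    fn set_disk_backup_size(&self, bytes: u64) {
        // Illustrative: forward to whatever gauge/state this crate actually uses.
        println!("clickhouse backup size is now {bytes} bytes");
    }
}

/// Sketch of the signature being discussed: the library takes the sink and calls it
/// from the backup loop instead of invoking a stateless callback.
pub fn spawn_clickhouse_indexer_and_backup_sketch(metrics: Arc<dyn Metric>) {
    metrics.set_disk_backup_size(0);
}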

@ZanCorDX requested a review from mempirate February 18, 2026 13:08