
disagg: limit S3 read amplification on compute nodes (#10771) #10789

Merged
ti-chi-bot[bot] merged 3 commits into pingcap:release-nextgen-20251011 from ti-chi-bot:cherry-pick-10771-to-release-nextgen-20251011
Apr 8, 2026

Conversation

@ti-chi-bot
Member

This is an automated cherry-pick of #10771

What problem does this PR solve?

Issue Number: close #10752

Problem Summary:

In disaggregated read workloads, compute nodes can generate uncontrolled S3 traffic when cold reads and repeated cache misses amplify remote reads on the same node. We need a production-safe way to bound S3 traffic, reduce duplicate downloads for the same object, and make the resulting behavior observable and tunable.

What is changed and how it works?

  • add a node-level S3 byte limiter and wire it into both direct reads and FileCache downloads
  • avoid duplicate S3 downloads for the same key by adding a bounded wait on in-flight FileCache entries
  • add phase-2 remote cache observability for bounded wait and background download stages
  • refine FileCache and S3 read paths for maintainability, comments, failure handling, and reload consistency
  • add an English design document for the latest disagg S3 backpressure plan

More specifically:

  • add S3ReadLimiter to cap node-level S3 read bytes for both direct reads and FileCache downloads
  • add the storage.io_rate_limit.s3_max_read_bytes_per_sec setting to configure the cap
  • remove the earlier s3_max_get_object_streams stream-cap design after validation showed that a reader-lifetime stream token can stall disaggregated query progress
  • publish the shared limiter through ClientFactory so all S3 clients on the node observe the same node-level byte budget
  • keep the limiter object stable across config reloads; reloading to 0 now updates the existing limiter instance instead of replacing it, so in-flight readers observe the disable and the related metric is updated consistently
  • make S3RandomAccessFile use limiter-aware chunked read/seek paths while sharing common read/seek tail logic
  • make FileCache downloads reuse the same limiter, add profiles.default.dt_filecache_wait_on_downloading_ms, and let follower requests perform bounded wait on an in-flight same-key download instead of immediately creating more remote-read pressure
  • handle bounded-wait download failures by removing failed placeholders eagerly so later retries can proceed
  • add remote-cache observability for:
    • bounded wait result / bytes / wait-seconds by result x file_type
    • background download stage seconds by stage x file_type
    • reject counts for too_many_download by file_type
    • background download queue depth
  • add an English design doc under docs/design/2026-03-24-disagg-s3-node-level-backpressure-and-filecache-dedup.md
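The actual S3ReadLimiter is C++ inside TiFlash; the node-level byte-budget idea above can be sketched as a token bucket with debt, shared by every S3 reader on the node. This is a minimal Python model: the class name matches the PR, but the method names and the 1-second burst capacity are illustrative assumptions, not the PR's API.

```python
import threading
import time

class S3ReadLimiter:
    """Sketch of a node-level S3 read-byte limiter (token bucket with debt).

    One instance is shared by all S3 readers on the node, so direct reads
    and FileCache downloads draw from the same byte budget. A rate of 0
    disables limiting; a config reload mutates this instance in place
    rather than replacing it, mirroring the reload behavior described
    above, so in-flight readers observe the change immediately.
    """

    def __init__(self, max_bytes_per_sec: int):
        self._lock = threading.Lock()
        self._rate = max_bytes_per_sec
        self._tokens = float(max_bytes_per_sec)  # allow up to a 1-second burst
        self._last = time.monotonic()

    def set_rate(self, max_bytes_per_sec: int) -> None:
        """Config reload path: update the existing limiter in place."""
        with self._lock:
            self._rate = max_bytes_per_sec

    def request(self, nbytes: int) -> None:
        """Charge `nbytes` against the node budget, sleeping off any debt."""
        with self._lock:
            if self._rate <= 0:
                return  # limiter disabled
            now = time.monotonic()
            # Refill tokens for elapsed time, capped at one second of budget.
            self._tokens = min(float(self._rate),
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            self._tokens -= nbytes
            debt = -self._tokens
            rate = self._rate
        if debt > 0:
            # Admit the read but delay proportionally to the overdraft.
            time.sleep(debt / rate)
```

Chunked reads (as in the limiter-aware S3RandomAccessFile paths) would call `request` once per chunk, so individual charges stay small relative to the budget.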

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Manual test

Tested with the following Python script:

# Round 1: cold-cache baseline without node-level S3 byte limiting and without FileCache bounded wait.
# Round 2: cold-cache baseline with two concurrent queries and no parameter changes.
# Round 3: cold-cache run with node-level S3 byte limiting enabled at 150 MiB/s.
# Round 4: keep the same S3 byte limit, enable FileCache bounded wait at 5000ms.
# Round 5: raise the S3 byte limit from 150 MiB/s to 200 MiB/s and rerun the cold-cache query.
# Round 6: keep the same config as round 5 and run the target query twice concurrently.
The six-round run_cold_disagg_read.py experiment clearly reproduces the target scenario described in the design doc: under cold-cache conditions, two concurrent queries without any limiter push the host bond1 receive peak to about 2.48 GB/s, while direct reads and FileCache downloads both surge at the same time. This is exactly the node-level S3 read amplification problem the proposal is meant to control.

After enabling s3_max_read_bytes_per_sec, TiFlash shows sustained s3_read_byte pending activity, and the host-level network peak is consistently reduced to the few-hundred-MB/s range. That confirms the node-level byte limiter is effective not only in internal accounting but also at the real machine network level, turning the system from “unbounded amplification that can blow up the node network” into a bounded, observable, and tunable degraded state.

In this workload, 150 MiB/s is too conservative and significantly stretches TiFlash execution time. Raising the limit to 200 MiB/s noticeably improves both tiflash_task execution time and overall slowlog wall time in the single-query case, while still keeping network usage under control. For the current sample, s3_max_read_bytes_per_sec = 209715200 together with dt_filecache_wait_on_downloading_ms = 5000 is the most balanced configuration observed so far. Bounded wait helps mainly on the merged path in single-query runs, but under two-query concurrency it also starts producing substantial coldata hits/timeouts, which indicates that its benefit expands once overlap becomes strong enough. At this point, the dominant remaining bottleneck is no longer uncontrolled network traffic, but FileCache background-download supply: bg_downloading_count, bg_download_queue_count, and too_many_download stay high across multiple rounds. Therefore, the next optimization priority should be downloader concurrency / queue / admission behavior rather than further increasing the bounded-wait budget.
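The bounded-wait dedup exercised in rounds 4–6 can be modeled with a small sketch: the first requester of a key becomes the downloader, followers wait up to the configured budget on the in-flight placeholder instead of issuing a duplicate S3 GET, and a failed placeholder is removed eagerly so later retries can proceed. This is an illustrative Python model under assumed names (FileCacheSketch, get, the result labels), not TiFlash's FileCache API.

```python
import threading

class FileCacheSketch:
    """Illustrative model of same-key download dedup with bounded wait."""

    def __init__(self, wait_on_downloading_ms: int):
        self._timeout = wait_on_downloading_ms / 1000.0
        self._lock = threading.Lock()
        self._entries = {}    # key -> bytes (completed downloads)
        self._inflight = {}   # key -> threading.Event (download placeholder)

    def get(self, key, download_fn):
        with self._lock:
            if key in self._entries:
                return self._entries[key], "hit"
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                # Become the downloader: install an in-flight placeholder.
                ev = threading.Event()
                self._inflight[key] = ev
        if not leader:
            # Bounded wait on the in-flight same-key download.
            if ev.wait(self._timeout):
                with self._lock:
                    if key in self._entries:
                        return self._entries[key], "wait_hit"
            # Timed out, or the leader failed: fall back to a direct read.
            return download_fn(key), "fallback"
        try:
            data = download_fn(key)
            with self._lock:
                self._entries[key] = data
            return data, "download"
        finally:
            with self._lock:
                # Remove the placeholder eagerly, including on failure,
                # so later requests retry instead of waiting on a corpse.
                self._inflight.pop(key, None)
            ev.set()
```

Under this model, two overlapping requests for the same key produce exactly one remote download, which is the "merged path" benefit noted above.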

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

profiles.default.dt_filecache_wait_on_downloading_ms
storage.io_rate_limit.s3_max_read_bytes_per_sec
  • Contains experimental features
  • Changes MySQL compatibility
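Based on the setting names above, the two knobs would presumably be set like this in a TiFlash config file; the section layout is inferred from the setting paths, and the values are the ones the manual test found most balanced (200 MiB/s and a 5000 ms wait budget):

```toml
[profiles.default]
# Bounded wait for followers on an in-flight same-key FileCache download.
dt_filecache_wait_on_downloading_ms = 5000

[storage.io_rate_limit]
# Node-level cap on S3 read bytes per second; reloading to 0 disables it.
# 209715200 bytes = 200 MiB/s.
s3_max_read_bytes_per_sec = 209715200
```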

Release note

limit S3 read amplification on compute nodes

Summary by CodeRabbit

  • New Features

    • Node-level S3 read-rate limiter; bounded follower wait for in-flight FileCache downloads; two per-node HTTP cache-evict endpoints.
  • Improvements

    • Limiter-aware chunked reads/seeks and downloads, background-download queue/stage timing, failed-placeholder cleanup, expanded remote-cache & I/O-limiter metrics, runtime wait tuning setting.
  • Tests

    • New unit/integration coverage for limiter, FileCache wait semantics, background-failure and scheduling cleanup paths, and limiter publication.
  • Documentation

    • Added design doc and HTTP API docs.

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-nextgen-20251011 labels Apr 8, 2026
@ti-chi-bot
Member Author

@JaySon-Huang This PR has conflicts, so I have put it on hold.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@coderabbitai

coderabbitai bot commented Apr 8, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (3)
  • release-8.5
  • release-7.5
  • release-8.1

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9a68380c-4dcf-4200-af18-259d9a118655


@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

@ti-chi-bot: If you want to know how to resolve the conflicts, please read the guide in the TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Contributor

@JaySon-Huang JaySon-Huang left a comment


LGTM

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 8, 2026
@ti-chi-bot ti-chi-bot bot added lgtm approved and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 8, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-08 00:59:11.875529172 +0000 UTC m=+917957.080889219: ☑️ agreed by JaySon-Huang.
  • 2026-04-08 01:09:06.803653923 +0000 UTC m=+918552.009013980: ☑️ agreed by yudongusa.

@ti-chi-bot ti-chi-bot bot added the cherry-pick-approved Cherry pick PR approved by release team. label Apr 8, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

@kolafish: adding LGTM is restricted to approvers and reviewers in OWNERS files.


@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, kolafish, yudongusa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JaySon-Huang
Contributor

/unhold
Conflicts are resolved.

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 8, 2026
@ti-chi-bot ti-chi-bot bot merged commit 416eaca into pingcap:release-nextgen-20251011 Apr 8, 2026
5 checks passed
@ti-chi-bot ti-chi-bot bot deleted the cherry-pick-10771-to-release-nextgen-20251011 branch April 8, 2026 02:51

Labels

approved cherry-pick-approved Cherry pick PR approved by release team. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-nextgen-20251011


4 participants