
disagg: limit S3 read amplification on compute nodes (#10771) #10789

Merged
ti-chi-bot[bot] merged 3 commits into pingcap:release-nextgen-20251011 from ti-chi-bot:cherry-pick-10771-to-release-nextgen-20251011
Apr 8, 2026

Conversation

@ti-chi-bot
Member

This is an automated cherry-pick of #10771

What problem does this PR solve?

Issue Number: close #10752

Problem Summary:

In disaggregated read workloads, compute nodes can generate uncontrolled S3 traffic when cold reads and repeated cache misses amplify remote reads on the same node. We need a production-safe way to bound S3 traffic, reduce duplicate downloads for the same object, and make the resulting behavior observable and tunable.

What is changed and how it works?

  • add a node-level S3 byte limiter and wire it into both direct reads and FileCache downloads
  • avoid duplicate S3 downloads for the same key by adding a bounded wait on in-flight FileCache entries
  • add phase-2 remote cache observability for bounded wait and background download stages
  • refine FileCache and S3 read paths for maintainability, comments, failure handling, and reload consistency
  • add an English design document for the latest disagg S3 backpressure plan

More specifically:

  • add S3ReadLimiter to cap node-level S3 read bytes for both direct reads and FileCache downloads
  • add the storage.io_rate_limit.s3_max_read_bytes_per_sec setting to configure the cap
  • remove the earlier s3_max_get_object_streams stream-cap design after validation showed that a reader-lifetime stream token can stall disaggregated query progress
  • publish the shared limiter through ClientFactory so all S3 clients on the node observe the same node-level byte budget
  • keep the limiter object stable across config reloads; reloading to 0 now updates the existing limiter instance instead of replacing it, so in-flight readers observe the disable and the related metric is updated consistently
  • make S3RandomAccessFile use limiter-aware chunked read/seek paths while sharing common read/seek tail logic
  • make FileCache downloads reuse the same limiter, add profiles.default.dt_filecache_wait_on_downloading_ms, and let follower requests perform bounded wait on an in-flight same-key download instead of immediately creating more remote-read pressure
  • handle bounded-wait download failures by removing failed placeholders eagerly so later retries can proceed
  • add remote-cache observability for:
    • bounded wait result / bytes / wait-seconds by result x file_type
    • background download stage seconds by stage x file_type
    • reject counts for too_many_download by file_type
    • background download queue depth
  • add an English design doc under docs/design/2026-03-24-disagg-s3-node-level-backpressure-and-filecache-dedup.md
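The actual S3ReadLimiter is C++ inside TiFlash; the node-level byte-budget idea above can be sketched as a token bucket with debt, shared by every S3 reader on the node. This is a minimal Python model: the class name matches the PR, but the method names and the 1-second burst capacity are illustrative assumptions, not the PR's API.

```python
import threading
import time

class S3ReadLimiter:
    """Sketch of a node-level S3 read-byte limiter (token bucket with debt).

    One instance is shared by all S3 readers on the node, so direct reads
    and FileCache downloads draw from the same byte budget. A rate of 0
    disables limiting; a config reload mutates this instance in place
    rather than replacing it, mirroring the reload behavior described
    above, so in-flight readers observe the change immediately.
    """

    def __init__(self, max_bytes_per_sec: int):
        self._lock = threading.Lock()
        self._rate = max_bytes_per_sec
        self._tokens = float(max_bytes_per_sec)  # allow up to a 1-second burst
        self._last = time.monotonic()

    def set_rate(self, max_bytes_per_sec: int) -> None:
        """Config reload path: update the existing limiter in place."""
        with self._lock:
            self._rate = max_bytes_per_sec

    def request(self, nbytes: int) -> None:
        """Charge `nbytes` against the node budget, sleeping off any debt."""
        with self._lock:
            if self._rate <= 0:
                return  # limiter disabled
            now = time.monotonic()
            # Refill tokens for elapsed time, capped at one second of budget.
            self._tokens = min(float(self._rate),
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            self._tokens -= nbytes
            debt = -self._tokens
            rate = self._rate
        if debt > 0:
            # Admit the read but delay proportionally to the overdraft.
            time.sleep(debt / rate)
```

Chunked reads (as in the limiter-aware S3RandomAccessFile paths) would call `request` once per chunk, so individual charges stay small relative to the budget.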

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Manual test

Tested with the following Python script:

# Round 1: cold-cache baseline without node-level S3 byte limiting and without FileCache bounded wait.
# Round 2: cold-cache baseline with two concurrent queries and no parameter changes.
# Round 3: cold-cache run with node-level S3 byte limiting enabled at 150 MiB/s.
# Round 4: keep the same S3 byte limit, enable FileCache bounded wait at 5000ms.
# Round 5: raise the S3 byte limit from 150 MiB/s to 200 MiB/s and rerun the cold-cache query.
# Round 6: keep the same config as round 5 and run the target query twice concurrently.
The six-round run_cold_disagg_read.py experiment clearly reproduces the target scenario described in the design doc: under cold-cache conditions, two concurrent queries without any limiter push the host bond1 receive peak to about 2.48 GB/s, while direct reads and FileCache downloads both surge at the same time. This is exactly the node-level S3 read amplification problem the proposal is meant to control.

After enabling s3_max_read_bytes_per_sec, TiFlash shows sustained s3_read_byte pending activity, and the host-level network peak is consistently reduced to the few-hundred-MB/s range. That confirms the node-level byte limiter is effective not only in internal accounting but also at the real machine network level, turning the system from “unbounded amplification that can blow up the node network” into a bounded, observable, and tunable degraded state.

In this workload, 150 MiB/s is too conservative and significantly stretches TiFlash execution time. Raising the limit to 200 MiB/s noticeably improves both tiflash_task execution time and overall slowlog wall time in the single-query case, while still keeping network usage under control. For the current sample, s3_max_read_bytes_per_sec = 209715200 together with dt_filecache_wait_on_downloading_ms = 5000 is the most balanced configuration observed so far. Bounded wait helps mainly on the merged path in single-query runs, but under two-query concurrency it also starts producing substantial coldata hits/timeouts, which indicates that its benefit expands once overlap becomes strong enough. At this point, the dominant remaining bottleneck is no longer uncontrolled network traffic, but FileCache background-download supply: bg_downloading_count, bg_download_queue_count, and too_many_download stay high across multiple rounds. Therefore, the next optimization priority should be downloader concurrency / queue / admission behavior rather than further increasing the bounded-wait budget.
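The bounded-wait dedup exercised in rounds 4–6 can be modeled with a small sketch: the first requester of a key becomes the downloader, followers wait up to the configured budget on the in-flight placeholder instead of issuing a duplicate S3 GET, and a failed placeholder is removed eagerly so later retries can proceed. This is an illustrative Python model under assumed names (FileCacheSketch, get, the result labels), not TiFlash's FileCache API.

```python
import threading

class FileCacheSketch:
    """Illustrative model of same-key download dedup with bounded wait."""

    def __init__(self, wait_on_downloading_ms: int):
        self._timeout = wait_on_downloading_ms / 1000.0
        self._lock = threading.Lock()
        self._entries = {}    # key -> bytes (completed downloads)
        self._inflight = {}   # key -> threading.Event (download placeholder)

    def get(self, key, download_fn):
        with self._lock:
            if key in self._entries:
                return self._entries[key], "hit"
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                # Become the downloader: install an in-flight placeholder.
                ev = threading.Event()
                self._inflight[key] = ev
        if not leader:
            # Bounded wait on the in-flight same-key download.
            if ev.wait(self._timeout):
                with self._lock:
                    if key in self._entries:
                        return self._entries[key], "wait_hit"
            # Timed out, or the leader failed: fall back to a direct read.
            return download_fn(key), "fallback"
        try:
            data = download_fn(key)
            with self._lock:
                self._entries[key] = data
            return data, "download"
        finally:
            with self._lock:
                # Remove the placeholder eagerly, including on failure,
                # so later requests retry instead of waiting on a corpse.
                self._inflight.pop(key, None)
            ev.set()
```

Under this model, two overlapping requests for the same key produce exactly one remote download, which is the "merged path" benefit noted above.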

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

profiles.default.dt_filecache_wait_on_downloading_ms
storage.io_rate_limit.s3_max_read_bytes_per_sec
  • Contains experimental features
  • Changes MySQL compatibility
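Based on the setting names above, the two knobs would presumably be set like this in a TiFlash config file; the section layout is inferred from the setting paths, and the values are the ones the manual test found most balanced (200 MiB/s and a 5000 ms wait budget):

```toml
[profiles.default]
# Bounded wait for followers on an in-flight same-key FileCache download.
dt_filecache_wait_on_downloading_ms = 5000

[storage.io_rate_limit]
# Node-level cap on S3 read bytes per second; reloading to 0 disables it.
# 209715200 bytes = 200 MiB/s.
s3_max_read_bytes_per_sec = 209715200
```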

Release note

limit S3 read amplification on compute nodes

Summary by CodeRabbit

  • New Features

    • Node-level S3 read-rate limiter; bounded follower wait for in-flight FileCache downloads; two per-node HTTP cache-evict endpoints.
  • Improvements

    • Limiter-aware chunked reads/seeks and downloads, background-download queue/stage timing, failed-placeholder cleanup, expanded remote-cache & I/O-limiter metrics, runtime wait tuning setting.
  • Tests

    • New unit/integration coverage for limiter, FileCache wait semantics, background-failure and scheduling cleanup paths, and limiter publication.
  • Documentation

    • Added design doc and HTTP API docs.

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-nextgen-20251011 labels Apr 8, 2026
@ti-chi-bot
Member Author

@JaySon-Huang This PR has conflicts, so I have put it on hold.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@coderabbitai

coderabbitai bot commented Apr 8, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (3)
  • release-8.5
  • release-7.5
  • release-8.1

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9a68380c-4dcf-4200-af18-259d9a118655


@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

@ti-chi-bot: If you want to know how to resolve the conflicts, please read the guide in the TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Contributor

@JaySon-Huang JaySon-Huang left a comment


LGTM

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 8, 2026
@ti-chi-bot ti-chi-bot bot added lgtm approved and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 8, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-08 00:59:11.875529172 +0000 UTC m=+917957.080889219: ☑️ agreed by JaySon-Huang.
  • 2026-04-08 01:09:06.803653923 +0000 UTC m=+918552.009013980: ☑️ agreed by yudongusa.

@ti-chi-bot ti-chi-bot bot added the cherry-pick-approved Cherry pick PR approved by release team. label Apr 8, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

@kolafish: adding LGTM is restricted to approvers and reviewers in OWNERS files.


@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, kolafish, yudongusa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JaySon-Huang
Contributor

/unhold
Conflicts are resolved.

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 8, 2026
@ti-chi-bot ti-chi-bot bot merged commit 416eaca into pingcap:release-nextgen-20251011 Apr 8, 2026
5 checks passed
@ti-chi-bot ti-chi-bot bot deleted the cherry-pick-10771-to-release-nextgen-20251011 branch April 8, 2026 02:51

Labels

approved cherry-pick-approved Cherry pick PR approved by release team. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-nextgen-20251011


4 participants