disagg: Fix unexpected object storage usage caused by pre-lock residue#10760
ti-chi-bot[bot] merged 21 commits into pingcap:master from
Conversation
Signed-off-by: JaySon-Huang <tshent@qq.com>
Review Complete. Findings: 7 issues. ℹ️ Learn more details on Pantheon AI.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough

This PR fixes a critical bug where pre-lock keys accumulated indefinitely under write-group concurrency in S3GC, causing remote storage to bloat to ~240 TB. The changes capture per-writer applied data files before group merge in PageDirectory, refactor S3LockLocalManager to return created lock keys and clean up failures, track and clean pre-lock keys on write failures in UniversalPageStorage, and replace boolean remote-write tracking with a counter.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Writer1 as Writer 1<br/>(Remote Write)
    participant Writer2 as Writer 2<br/>(Remote Write)
    participant PageDir as PageDirectory<br/>Write-Group
    participant LockMgr as S3LockLocalManager
    participant BlobStore as Blob Store
    participant S3 as S3 Remote
    rect rgba(0, 150, 200, 0.5)
        Note over Writer1,Writer2: Concurrent Write Preparation
        Writer1->>LockMgr: createS3LockForWriteBatch(batch1)
        LockMgr->>S3: Create lock files
        LockMgr-->>Writer1: Return {lock_key_1}
        Writer2->>LockMgr: createS3LockForWriteBatch(batch2)
        LockMgr->>S3: Create lock files
        LockMgr-->>Writer2: Return {lock_key_2}
    end
    rect rgba(100, 200, 100, 0.5)
        Note over Writer1,PageDir: Write-Group Coordination
        Writer1->>PageDir: apply(edit1)
        Writer2->>PageDir: apply(edit2)
        PageDir->>PageDir: Enter write group
        PageDir->>BlobStore: write(blob_data)
        BlobStore-->>PageDir: Success
        PageDir->>PageDir: Capture applied_data_files<br/>for each writer
        PageDir-->>Writer1: Return {applied_lock_1}
        PageDir-->>Writer2: Return {applied_lock_2}
    end
    rect rgba(200, 100, 100, 0.5)
        Note over Writer1,LockMgr: Lock Cleanup
        Writer1->>LockMgr: cleanAppliedS3ExternalFiles({applied_lock_1})
        LockMgr->>S3: Remove cleaned locks
        Writer2->>LockMgr: cleanAppliedS3ExternalFiles({applied_lock_2})
        LockMgr->>S3: Remove cleaned locks
        LockMgr-->>Writer1: Cleanup complete
        LockMgr-->>Writer2: Cleanup complete
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (9)
dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalSegment.cpp (1)
1286-1293: Avoid duplicate logs for the same invalid-segment branch. This branch now emits both INFO and DEBUG with nearly identical content. Keeping one structured log entry is cleaner and reduces log volume under contention.
🧹 Possible consolidation
```diff
 if (!isSegmentValid(lock, segment))
 {
     LOG_INFO(
         log,
         "MergeDelta - Give up segmentMergeDelta because segment not valid, reason=concurrent_update segment={}",
         segment->simpleInfo());
-    LOG_DEBUG(
-        log,
-        "MergeDelta - Give up segmentMergeDelta because segment not valid, segment={}",
-        segment->simpleInfo());
     wbs.setRollback();
     return {};
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalSegment.cpp` around lines 1286-1293: the code emits duplicate logs for the same invalid-segment branch. Remove the redundant LOG_DEBUG call and keep a single structured LOG_INFO entry (references: LOG_INFO, LOG_DEBUG, segment->simpleInfo(), the segmentMergeDelta branch in DeltaMergeStore_InternalSegment.cpp); ensure the remaining message contains the reason ("concurrent_update") and segment->simpleInfo().

dbms/src/Storages/DeltaMerge/WriteBatchesImpl.h (1)
174-179: Add table/keyspace context to rollback log for easier correlation. This rollback log is useful, but it's hard to correlate in shared deployments without identifiers. Consider including `keyspace_id` and `ns_id` in the same line.

As per coding guidelines, use `LoggerPtr` and `LOG_INFO(log, ...)` with relevant context (e.g., `region_id`, `table_id`) for logging in storage engine code.

💡 Proposed logging-context tweak
```diff
-    LOG_INFO(
-        Logger::get("WriteBatches"),
-        "Rollback written log/data, run_mode={} written_log_pages={} written_data_pages={}",
-        magic_enum::enum_name(run_mode),
-        written_log.size(),
-        written_data.size());
+    auto log = Logger::get("WriteBatches");
+    LOG_INFO(
+        log,
+        "Rollback written log/data, keyspace_id={} table_id={} run_mode={} written_log_pages={} written_data_pages={}",
+        keyspace_id,
+        ns_id,
+        magic_enum::enum_name(run_mode),
+        written_log.size(),
+        written_data.size());
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/DeltaMerge/WriteBatchesImpl.h` around lines 174-179: update the rollback log call to include keyspace_id and ns_id and to use a LoggerPtr with LOG_INFO(log, ...). Replace the current Logger::get("WriteBatches") usage in the LOG_INFO call and add the keyspace_id and ns_id values alongside run_mode, written_log.size() and written_data.size() so the message can be correlated across deployments; keep the existing run_mode/magic_enum::enum_name and written_log/written_data references and ensure the new log uses a local LoggerPtr (e.g., log) per coding guidelines.

dbms/src/Storages/Page/V3/BlobStore.cpp (1)
430-441: Avoid INFO-level full `wb.toString()` in write path. Serializing the full write batch in an INFO log can be expensive and noisy. Keep summary fields at INFO and move the full payload to DEBUG/TRACE.
⚙️ Suggested log split
```diff
     LOG_INFO(
         log,
         "BlobStore::write remote edit summary, stage={} wb_size={} edit_size={} put_external_count={} "
-        "put_remote_count={} checkpoint_info_count={} sample_data_file_ids={} wb={}",
+        "put_remote_count={} checkpoint_info_count={} sample_data_file_ids={}",
         stage,
         wb.size(),
         edit.size(),
         put_external_count,
         put_remote_count,
         checkpoint_info_count,
-        sample_data_file_ids,
-        wb.toString());
+        sample_data_file_ids);
+    LOG_DEBUG(log, "BlobStore::write remote edit detail, stage={} wb={}", stage, wb.toString());
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/BlobStore.cpp` around lines 430-441: BlobStore::write currently logs the full write batch via wb.toString() at INFO level; remove wb.toString() from the LOG_INFO call (keep stage, wb.size(), edit.size(), put_external_count, put_remote_count, checkpoint_info_count, sample_data_file_ids) and add a separate LOG_DEBUG (or TRACE) entry that logs the full wb.toString() payload along with context (stage, maybe wb.size() and edit.size()) so the heavy serialization is only produced at debug/trace levels; update the logging calls around the write function accordingly.

dbms/src/Storages/Page/V3/tests/gtest_page_directory.cpp (1)
727-789: Consider folding this into the previous test to reduce duplication. This scenario is very close to the prior test and can be merged to keep maintenance cost lower while preserving coverage.
♻️ Optional simplification
```diff
-TEST_F(PageDirectoryTest, BatchWriteRemoteCheckpointEachWriterShouldGetAppliedDataFiles)
-{
-    // duplicated setup + orchestration ...
-    // weaker assertions than the previous test
-}
+// Option A:
+// Merge the non-empty assertions into
+// BatchWriteRemoteCheckpointEachWriterReturnsAppliedDataFiles
+// and remove this second test.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/tests/gtest_page_directory.cpp` around lines 727-789: fold this test's logic into the previous closely-related test instead of duplicating setup: move the body of BatchWriteRemoteCheckpointEachWriterShouldGetAppliedDataFiles (including make_remote_entry, keys key1/key2/key3, the SyncPointCtl waits, the three async dir->apply calls, and the final ASSERT_FALSE/ASSERT_EQ checks) into the prior test after its existing assertions (or convert both into a single parameterized test), remove the standalone TEST_F for BatchWriteRemoteCheckpointEachWriterShouldGetAppliedDataFiles, and ensure you preserve the SyncPointCtl scopes ("before_PageDirectory::leader_apply" and "after_PageDirectory::enter_write_group") and calls to dir->apply, th_write1/2/3 handling, and the appliedN checks so behavior and coverage remain identical.

dbms/src/Storages/Page/V3/PageDirectory.cpp (1)
1856-1873: Test-only diagnostics - ensure follow-up before production. Similar to `UniversalPageStorage.cpp`, the comment on line 1858 indicates these are "test-only verbose diagnostics" that should be removed or demoted before production rollout.
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/PageDirectory.cpp` around lines 1856-1873: the block contains a "test-only verbose diagnostics" comment and a LOG_INFO call that should not remain at info level for production; change the comment to a TODO note referencing follow-up (or remove it) and demote the LOG_INFO invocation to a lower verbosity (e.g., LOG_DEBUG or LOG_TRACE) so apply-checkpoint diagnostics (the LOG call that logs edit_size, put_count, put_external_count, checkpoint_info_count, put_external_with_checkpoint_info_count, edit_digest, applied_data_files, candidate_records, suspicious_records) are not noisy in production while preserving the details for debugging.

dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp (2)
130-149: Performance consideration: diagnostic data collection overhead. When `has_writes_from_remote` is true, the code:

- Iterates all writes to collect `remote_lock_keys` (lines 130-142)
- Converts the entire write batch to a debug string (line 143)

For large write batches, `write_batch.toString()` may be expensive. Consider deferring string conversion to only when an error occurs, or sampling for normal logging.

♻️ Proposed optimization for deferred string conversion
```diff
-    write_batch_debug = write_batch.toString();
     LOG_INFO(
         log,
-        "Remote write batch begin, trace_id={} write_batch={} lock_keys={}",
+        "Remote write batch begin, trace_id={} write_count={} lock_keys={}",
         trace_id,
-        write_batch_debug,
+        write_batch.size(),
         remote_lock_keys);
```

Then only capture `write_batch_debug` in the error handlers where it's needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp` around lines 130 - 149, The current logging eagerly builds write_batch_debug by calling write_batch.toString() and always collects remote_lock_keys before the LOG_INFO call in UniversalPageStorage.cpp; to avoid expensive diagnostic work for large batches, remove the unconditional write_batch.toString() and only call write_batch.toString() (and compute any heavy diagnostics) inside error paths or detailed debug branches, e.g., compute write_batch_debug lazily when preparing error logs or when log level is DEBUG; keep the remote_lock_keys collection only if it’s required for the immediate log path (or defer building remote_lock_keys similarly) and update the LOG_INFO invocation to log lightweight info while reserving full write_batch.toString() for error handlers that need the full dump.
188-219: Test-only diagnostics marked for removal - ensure follow-up. The comment on line 188 indicates these diagnostics are test-only and should be removed or have their log level lowered before production rollout. Consider creating a tracking issue or TODO to ensure this is addressed.
Would you like me to open a new issue to track the removal/demotion of these verbose diagnostics before production rollout?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp` around lines 188-219: the verbose test-only diagnostics (construction of edit_records_debug, edit_digest and the LOG_INFO call) must not run unconditionally in production: move the creation of edit_records_debug and edit_digest and the log emission behind a debug-only guard (e.g., check logger debug-level or a runtime flag) so the expensive string/hash work is skipped in normal runs, change LOG_INFO to a debug-level log (e.g., LOG_DEBUG) and add a single-line TODO with a tracking issue reference to remove or reevaluate these diagnostics before production; refer to symbols edit_records_debug, edit_digest, edit.getRecords(), LOG_INFO and trace_id to locate and update the code.

dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp (1)
308-315: Consider reducing logging verbosity for production. Logging each erased and missing key individually at `LOG_INFO` level could generate excessive log volume in production when many keys are processed. The aggregate summary on lines 301-307 already provides useful diagnostics.

Consider either:

- Moving per-key logs to `LOG_DEBUG` level
- Adding a threshold to only log when counts are unexpectedly high

♻️ Proposed fix to reduce verbosity
```diff
-    for (const auto & key : erased_keys)
-    {
-        LOG_INFO(log, "Clean applied S3 external files, erase_hit lockkey={}", key);
-    }
-    for (const auto & key : missing_keys)
-    {
-        LOG_INFO(log, "Clean applied S3 external files, erase_miss lockkey={}", key);
-    }
+    for (const auto & key : erased_keys)
+    {
+        LOG_DEBUG(log, "Clean applied S3 external files, erase_hit lockkey={}", key);
+    }
+    for (const auto & key : missing_keys)
+    {
+        LOG_DEBUG(log, "Clean applied S3 external files, erase_miss lockkey={}", key);
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp` around lines 308-315: the per-key logging in the loops iterating erased_keys and missing_keys is too verbose at LOG_INFO; change those per-key messages to LOG_DEBUG (replace LOG_INFO with LOG_DEBUG for the loops that log "erase_hit lockkey={}" and "erase_miss lockkey={}") or add a threshold guard (e.g., only emit per-key logs when erased_keys.size() or missing_keys.size() is below a small threshold, otherwise skip per-key logs and rely on the existing aggregate summary). Update the loops in S3LockLocalManager (the blocks that iterate erased_keys and missing_keys) to use one of these approaches and ensure the aggregate summary remains at INFO.

dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp (1)
618-675: Missing `CATCH` macro for consistent exception handling. Other tests in this file (e.g., `WriteRead`, `WriteManyNonExistedDeleted`) use the `try { ... } CATCH` pattern for proper exception handling and test failure reporting. This test should follow the same pattern.

♻️ Proposed fix to add try-CATCH pattern
```diff
 TEST_F(UniPageStorageTest, ConcurrentRemoteWriteShouldNotLeavePreLockKeys)
+try
 {
     DB::tests::TiFlashTestEnv::enableS3Config();
```

And at the end of the test:

```diff
     if (DB::tests::TiFlashTestEnv::isMockedS3Client())
         DB::tests::TiFlashTestEnv::deleteBucket(*s3_client);
     DB::tests::TiFlashTestEnv::disableS3Config();
 }
+CATCH
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp` around lines 618 - 675, Wrap the body of the TEST_F function ConcurrentRemoteWriteShouldNotLeavePreLockKeys in a try { ... } CATCH block to match other tests' exception handling; specifically, enclose everything from DB::tests::TiFlashTestEnv::enableS3Config() through DB::tests::TiFlashTestEnv::disableS3Config() in try and add the existing CATCH macro at the end so exceptions are handled and reported consistently (refer to the test name ConcurrentRemoteWriteShouldNotLeavePreLockKeys and the use of page_storage, SyncPointCtl, and S3::ClientFactory::instance() within the test to locate the code to wrap).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@metrics/grafana/tiflash_summary.json`:
- Around line 19244-19254: The PromQL expr for metric
tiflash_storage_s3_lock_mgr_status contains two filters on the same label
(instance=~"$instance", instance=~"$tiflash_role") which forces an impossible
AND; fix by removing the duplicate instance filter — either drop
instance=~"$tiflash_role" so only instance=~"$instance" remains, or if you
intended to filter by role replace the second instance=~"$tiflash_role" with the
correct label (e.g., tiflash_role=~"$tiflash_role"); update the expr string used
in the target accordingly.
- Around line 19347-19357: The PromQL target has a duplicated instance filter
(instance=~"$instance", instance=~"$tiflash_role"); remove the second duplicate
so only one instance matcher remains (e.g., keep instance=~"$instance" and
delete instance=~"$tiflash_role") across all occurrences (~436 places) and run a
systematic replace. Also adjust the legendFormat which incorrectly uses
{{$additional_groupby}} (a dashboard variable, not a Prometheus label) — replace
it with the dashboard variable itself (e.g., use $additional_groupby) or remove
it, then test the panel in Grafana to confirm the legend renders as intended.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 812c7b73-fc68-4449-ae11-244c93130fb7
📒 Files selected for processing (13)
dbms/src/Common/TiFlashMetrics.h
dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalSegment.cpp
dbms/src/Storages/DeltaMerge/WriteBatchesImpl.h
dbms/src/Storages/Page/V3/BlobStore.cpp
dbms/src/Storages/Page/V3/PageDirectory.cpp
dbms/src/Storages/Page/V3/PageDirectory.h
dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp
dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp
dbms/src/Storages/Page/V3/Universal/UniversalPageStorageService.cpp
dbms/src/Storages/Page/V3/Universal/tests/gtest_lock_local_mgr.cpp
dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp
dbms/src/Storages/Page/V3/tests/gtest_page_directory.cpp
metrics/grafana/tiflash_summary.json
dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp
Second-opinion review completed for upstream/master...HEAD (PR #10760 head branch). I did not confirm any blocking correctness/concurrency/compatibility defects in the current diff. Residual risk: monitor the newly added S3 lock manager logs/metrics in staging to ensure observability signal-to-noise stays healthy.
Signed-off-by: JaySon-Huang <tshent@qq.com>
@pantheon-ai review
Review Complete. Findings: 4 issues. ℹ️ Learn more details on Pantheon AI.
Pull request overview
This PR fixes a correctness bug in disaggregated remote-write paths where write-group concurrency (and partial failures) could leave S3LockLocalManager.pre_lock_keys resident, causing manifests to retain obsolete lock keys and S3GC to over-retain objects (inflating remote storage usage).
Changes:
- Make `PageDirectory::apply` return per-writer `applied_data_files` so every writer in a write-group can clean up its own pre-lock keys.
- Harden the `UniversalPageStorage::write` + `S3LockLocalManager` lock lifecycle: return "newly appended" pre-lock keys, and add explicit failure cleanup.
- Add regression tests for write-group concurrency and lock-manager cleanup semantics; add lock-manager metrics and update style guidance.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| dbms/src/Storages/Page/V3/tests/gtest_page_directory.cpp | Adds remote-checkpoint write-group concurrency regression tests; forces std::async to run asynchronously. |
| dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp | Adds an end-to-end concurrent remote write test validating no residual pre_lock_keys. |
| dbms/src/Storages/Page/V3/Universal/tests/gtest_lock_local_mgr.cpp | Adds unit tests for partial cleanup, failure cleanup, return semantics, and partial-failure behavior. |
| dbms/src/Storages/Page/V3/Universal/UniversalWriteBatchImpl.h | Replaces boolean remote-write flag with a remote-write counter (writesRemoteCount). |
| dbms/src/Storages/Page/V3/Universal/UniversalPageStorageService.cpp | Adds log line to expose checkpoint lock-set size/sequence at upload time. |
| dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp | Uses per-writer applied ids for cleanup and adds failure-path pre-lock cleanup. |
| dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.h | Changes lock-creation API to return newly appended keys; adds failure-cleanup API. |
| dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp | Implements atomic-ish append semantics, cleanup helpers, and lock-manager metrics/logging. |
| dbms/src/Storages/Page/V3/PageDirectory.h | Extends Writer to retain per-writer applied checkpoint lock ids. |
| dbms/src/Storages/Page/V3/PageDirectory.cpp | Captures per-writer checkpoint ids pre-merge and returns them for both owner/followers. |
| dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalSegment.cpp | Adjusts logging severity/message for a MergeDelta invalidation case. |
| dbms/src/Common/TiFlashMetrics.h | Adds S3 lock-manager gauge/counters. |
| AGENTS.md | Updates documented C++ error-handling/logging guidelines. |
dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp
Signed-off-by: JaySon-Huang <tshent@qq.com>
🧹 Nitpick comments (1)
dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp (1)
137-151: Consider wrapping cleanup call to avoid masking the original exception. If `cleanPreLockKeysOnWriteFailure` throws, the original exception caught at line 137 would be lost. While cleanup functions should generally be noexcept, adding a defensive wrapper ensures the original failure context is preserved for diagnostics.

🛡️ Optional: Defensive wrapper for cleanup
```diff
 catch (...)
 {
     if (remote_write_count > 0)
     {
         // If write fails after pre-lock creation, clean these pre-lock keys
         // to avoid residual entries in checkpoint_manager.pre_lock_keys.
-        remote_locks_local_mgr->cleanPreLockKeysOnWriteFailure(std::move(created_pre_lock_keys));
+        try
+        {
+            remote_locks_local_mgr->cleanPreLockKeysOnWriteFailure(std::move(created_pre_lock_keys));
+        }
+        catch (...)
+        {
+            tryLogCurrentException(log, "Failed to clean pre-lock keys on write failure");
+        }
         tryLogCurrentException(
             log,
             fmt::format(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp` around lines 137 - 151, The catch(...) block currently calls remote_locks_local_mgr->cleanPreLockKeysOnWriteFailure(std::move(created_pre_lock_keys)) which could throw and mask the original exception; wrap that cleanup call in its own try-catch around cleanPreLockKeysOnWriteFailure (and any logging) so any exception from cleanup is caught and logged (e.g., via tryLogCurrentException) but does not replace the original exception being rethrown by the outer catch; ensure you still call tryLogCurrentException(...) for the original failure (using failed_stage and remote_write_count) and then rethrow the original exception after the defensive cleanup handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 38399200-3ace-4a2f-b475-021ab2dcdf3b
📒 Files selected for processing (1)
dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: CalvinNeo, JinheLin. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold

/unhold (review comments are resolved)
/cherry-pick release-nextgen-20251011
@JaySon-Huang: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011 in the new PR and assign it to you. Details: in response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@JaySon-Huang: new pull request created to branch

Details: in response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
What problem does this PR solve?
Issue Number: close #10763
Problem Summary:
- `PageDirectory` write-group semantics could cause follower writers to miss their own applied lock-id cleanup signals.
- `S3LockLocalManager.pre_lock_keys` could remain resident and be repeatedly written into manifest locks.
End-to-end correctness fixes for lock lifecycle
PageDirectory::applynow returns writer-scopedapplied_data_filesfor both write-group owner and followers, so each writer gets its own cleanup signal.UniversalPageStorage::writeuses those per-writer ids to clean pre-locks reliably after apply.cleanPreLockKeysOnWriteFailure(...)is invoked when remote write/apply fails.createS3LockForWriteBatchwas adjusted to avoid partial pre-lock residue on partial lock-creation failures (append topre_lock_keysafter lock-creation pass), and its return value is now aligned with "newly appended keys" semantics.Test coverage and regression guards
PageDirectoryandUniversalPageStoragepaths.S3LockLocalManagertests for partial cleanup, failure cleanup, lock-return semantics, and partial-failure atomicity.std::launch::asyncto avoid deferred scheduling risk.Observability and operations improvements
S3GCManagerService.tiflash_storage_s3_store_summary_bytes{store_id, type=data_file_bytes|dt_file_bytes}remote_summary_interval_secondsand wired it throughTMTContext;<= 0disables periodic summary task registration.Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
Bug Fixes
New Features
Improvements