disagg: Fix unexpected object storage usage caused by pre-lock residue#10760
ti-chi-bot[bot] merged 21 commits into pingcap:master from
Conversation
Signed-off-by: JaySon-Huang <tshent@qq.com>
Review Complete. Findings: 7 issues. ℹ️ Learn more details on Pantheon AI.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough

This PR fixes a critical bug where pre-lock keys accumulated indefinitely under write-group concurrency in S3GC, causing remote storage to bloat to ~240 TB. The changes capture per-writer applied data files before group merge in PageDirectory, refactor S3LockLocalManager to return created lock keys and clean up failures, track and clean pre-lock keys on write failures in UniversalPageStorage, and replace boolean remote-write tracking with a counter.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Writer1 as Writer 1<br/>(Remote Write)
    participant Writer2 as Writer 2<br/>(Remote Write)
    participant PageDir as PageDirectory<br/>Write-Group
    participant LockMgr as S3LockLocalManager
    participant BlobStore as Blob Store
    participant S3 as S3 Remote
    rect rgba(0, 150, 200, 0.5)
        Note over Writer1,Writer2: Concurrent Write Preparation
        Writer1->>LockMgr: createS3LockForWriteBatch(batch1)
        LockMgr->>S3: Create lock files
        LockMgr-->>Writer1: Return {lock_key_1}
        Writer2->>LockMgr: createS3LockForWriteBatch(batch2)
        LockMgr->>S3: Create lock files
        LockMgr-->>Writer2: Return {lock_key_2}
    end
    rect rgba(100, 200, 100, 0.5)
        Note over Writer1,PageDir: Write-Group Coordination
        Writer1->>PageDir: apply(edit1)
        Writer2->>PageDir: apply(edit2)
        PageDir->>PageDir: Enter write group
        PageDir->>BlobStore: write(blob_data)
        BlobStore-->>PageDir: Success
        PageDir->>PageDir: Capture applied_data_files<br/>for each writer
        PageDir-->>Writer1: Return {applied_lock_1}
        PageDir-->>Writer2: Return {applied_lock_2}
    end
    rect rgba(200, 100, 100, 0.5)
        Note over Writer1,LockMgr: Lock Cleanup
        Writer1->>LockMgr: cleanAppliedS3ExternalFiles({applied_lock_1})
        LockMgr->>S3: Remove cleaned locks
        Writer2->>LockMgr: cleanAppliedS3ExternalFiles({applied_lock_2})
        LockMgr->>S3: Remove cleaned locks
        LockMgr-->>Writer1: Cleanup complete
        LockMgr-->>Writer2: Cleanup complete
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (9)
dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalSegment.cpp (1)
1286-1293: Avoid duplicate logs for the same invalid-segment branch. This branch now emits both INFO and DEBUG with nearly identical content. Keeping one structured log entry is cleaner and reduces log volume under contention.
🧹 Possible consolidation
```diff
 if (!isSegmentValid(lock, segment))
 {
     LOG_INFO(
         log,
         "MergeDelta - Give up segmentMergeDelta because segment not valid, reason=concurrent_update segment={}",
         segment->simpleInfo());
-    LOG_DEBUG(
-        log,
-        "MergeDelta - Give up segmentMergeDelta because segment not valid, segment={}",
-        segment->simpleInfo());
     wbs.setRollback();
     return {};
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalSegment.cpp` around lines 1286-1293: the code emits duplicate logs for the same invalid-segment branch. Remove the redundant LOG_DEBUG call and keep a single structured LOG_INFO entry (references: LOG_INFO, LOG_DEBUG, segment->simpleInfo(), the segmentMergeDelta branch in DeltaMergeStore_InternalSegment.cpp); ensure the remaining message contains the reason ("concurrent_update") and segment->simpleInfo().

dbms/src/Storages/DeltaMerge/WriteBatchesImpl.h (1)
174-179: Add table/keyspace context to rollback log for easier correlation. This rollback log is useful, but it's hard to correlate in shared deployments without identifiers. Consider including `keyspace_id` and `ns_id` in the same line.

As per coding guidelines, use `LoggerPtr` and `LOG_INFO(log, ...)` with relevant context (e.g., `region_id`, `table_id`) for logging in storage engine code.

💡 Proposed logging-context tweak
```diff
-    LOG_INFO(
-        Logger::get("WriteBatches"),
-        "Rollback written log/data, run_mode={} written_log_pages={} written_data_pages={}",
-        magic_enum::enum_name(run_mode),
-        written_log.size(),
-        written_data.size());
+    auto log = Logger::get("WriteBatches");
+    LOG_INFO(
+        log,
+        "Rollback written log/data, keyspace_id={} table_id={} run_mode={} written_log_pages={} written_data_pages={}",
+        keyspace_id,
+        ns_id,
+        magic_enum::enum_name(run_mode),
+        written_log.size(),
+        written_data.size());
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/DeltaMerge/WriteBatchesImpl.h` around lines 174-179: update the rollback log call to include keyspace_id and ns_id and to use a LoggerPtr with LOG_INFO(log, ...). Replace the current Logger::get("WriteBatches") usage in the LOG_INFO call and add the keyspace_id and ns_id values alongside run_mode, written_log.size() and written_data.size() so the message can be correlated across deployments; keep the existing run_mode/magic_enum::enum_name and written_log/written_data references and ensure the new log uses a local LoggerPtr (e.g., log) per coding guidelines.

dbms/src/Storages/Page/V3/BlobStore.cpp (1)
430-441: Avoid INFO-level full `wb.toString()` in write path. Serializing the full write batch in an INFO log can be expensive and noisy. Keep summary fields at INFO and move the full payload to DEBUG/TRACE.
⚙️ Suggested log split
```diff
     LOG_INFO(
         log,
         "BlobStore::write remote edit summary, stage={} wb_size={} edit_size={} put_external_count={} "
-        "put_remote_count={} checkpoint_info_count={} sample_data_file_ids={} wb={}",
+        "put_remote_count={} checkpoint_info_count={} sample_data_file_ids={}",
         stage,
         wb.size(),
         edit.size(),
         put_external_count,
         put_remote_count,
         checkpoint_info_count,
-        sample_data_file_ids,
-        wb.toString());
+        sample_data_file_ids);
+    LOG_DEBUG(log, "BlobStore::write remote edit detail, stage={} wb={}", stage, wb.toString());
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/BlobStore.cpp` around lines 430-441: BlobStore::write currently logs the full write batch via wb.toString() at INFO level; remove wb.toString() from the LOG_INFO call (keep stage, wb.size(), edit.size(), put_external_count, put_remote_count, checkpoint_info_count, sample_data_file_ids) and add a separate LOG_DEBUG (or TRACE) entry that logs the full wb.toString() payload along with context (stage, maybe wb.size() and edit.size()) so the heavy serialization is only produced at debug/trace levels; update the logging calls around the write function accordingly.

dbms/src/Storages/Page/V3/tests/gtest_page_directory.cpp (1)
727-789: Consider folding this into the previous test to reduce duplication. This scenario is very close to the prior test and can be merged to keep maintenance cost lower while preserving coverage.
♻️ Optional simplification
```diff
-TEST_F(PageDirectoryTest, BatchWriteRemoteCheckpointEachWriterShouldGetAppliedDataFiles)
-{
-    // duplicated setup + orchestration ...
-    // weaker assertions than the previous test
-}
+// Option A:
+// Merge the non-empty assertions into
+// BatchWriteRemoteCheckpointEachWriterReturnsAppliedDataFiles
+// and remove this second test.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/tests/gtest_page_directory.cpp` around lines 727-789: fold this test's logic into the previous closely-related test instead of duplicating setup: move the body of BatchWriteRemoteCheckpointEachWriterShouldGetAppliedDataFiles (including make_remote_entry, keys key1/key2/key3, the SyncPointCtl waits, the three async dir->apply calls, and the final ASSERT_FALSE/ASSERT_EQ checks) into the prior test after its existing assertions (or convert both into a single parameterized test), remove the standalone TEST_F for BatchWriteRemoteCheckpointEachWriterShouldGetAppliedDataFiles, and ensure you preserve the SyncPointCtl scopes ("before_PageDirectory::leader_apply" and "after_PageDirectory::enter_write_group") and calls to dir->apply, th_write1/2/3 handling, and the appliedN checks so behavior and coverage remain identical.

dbms/src/Storages/Page/V3/PageDirectory.cpp (1)
1856-1873: Test-only diagnostics - ensure follow-up before production. Similar to `UniversalPageStorage.cpp`, the comment on line 1858 indicates these are "test-only verbose diagnostics" that should be removed or demoted before production rollout.
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/PageDirectory.cpp` around lines 1856-1873: the block contains a "test-only verbose diagnostics" comment and a LOG_INFO call that should not remain at info level for production; change the comment to a TODO note referencing follow-up (or remove it) and demote the LOG_INFO invocation to a lower verbosity (e.g., LOG_DEBUG or LOG_TRACE) so apply-checkpoint diagnostics (the LOG call that logs edit_size, put_count, put_external_count, checkpoint_info_count, put_external_with_checkpoint_info_count, edit_digest, applied_data_files, candidate_records, suspicious_records) are not noisy in production while preserving the details for debugging.

dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp (2)
130-149: Performance consideration: diagnostic data collection overhead. When `has_writes_from_remote` is true, the code:

- Iterates all writes to collect `remote_lock_keys` (lines 130-142)
- Converts the entire write batch to a debug string (line 143)

For large write batches, `write_batch.toString()` may be expensive. Consider deferring string conversion to only when an error occurs, or sampling for normal logging.

♻️ Proposed optimization for deferred string conversion
```diff
-    write_batch_debug = write_batch.toString();
     LOG_INFO(
         log,
-        "Remote write batch begin, trace_id={} write_batch={} lock_keys={}",
+        "Remote write batch begin, trace_id={} write_count={} lock_keys={}",
         trace_id,
-        write_batch_debug,
+        write_batch.size(),
         remote_lock_keys);
```

Then only capture `write_batch_debug` in the error handlers where it's needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp` around lines 130 - 149, The current logging eagerly builds write_batch_debug by calling write_batch.toString() and always collects remote_lock_keys before the LOG_INFO call in UniversalPageStorage.cpp; to avoid expensive diagnostic work for large batches, remove the unconditional write_batch.toString() and only call write_batch.toString() (and compute any heavy diagnostics) inside error paths or detailed debug branches, e.g., compute write_batch_debug lazily when preparing error logs or when log level is DEBUG; keep the remote_lock_keys collection only if it’s required for the immediate log path (or defer building remote_lock_keys similarly) and update the LOG_INFO invocation to log lightweight info while reserving full write_batch.toString() for error handlers that need the full dump.
188-219: Test-only diagnostics marked for removal - ensure follow-up. The comment on line 188 indicates these diagnostics are test-only and should be removed or have their log level lowered before production rollout. Consider creating a tracking issue or TODO to ensure this is addressed.
Would you like me to open a new issue to track the removal/demotion of these verbose diagnostics before production rollout?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp` around lines 188-219: the verbose test-only diagnostics (construction of edit_records_debug, edit_digest and the LOG_INFO call) must not run unconditionally in production: move the creation of edit_records_debug and edit_digest and the log emission behind a debug-only guard (e.g., check logger debug-level or a runtime flag) so the expensive string/hash work is skipped in normal runs, change LOG_INFO to a debug-level log (e.g., LOG_DEBUG) and add a single-line TODO with a tracking issue reference to remove or reevaluate these diagnostics before production; refer to symbols edit_records_debug, edit_digest, edit.getRecords(), LOG_INFO and trace_id to locate and update the code.

dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp (1)
308-315: Consider reducing logging verbosity for production. Logging each erased and missing key individually at `LOG_INFO` level could generate excessive log volume in production when many keys are processed. The aggregate summary on lines 301-307 already provides useful diagnostics.

Consider either:

- Moving per-key logs to `LOG_DEBUG` level
- Adding a threshold to only log when counts are unexpectedly high

♻️ Proposed fix to reduce verbosity
```diff
-    for (const auto & key : erased_keys)
-    {
-        LOG_INFO(log, "Clean applied S3 external files, erase_hit lockkey={}", key);
-    }
-    for (const auto & key : missing_keys)
-    {
-        LOG_INFO(log, "Clean applied S3 external files, erase_miss lockkey={}", key);
-    }
+    for (const auto & key : erased_keys)
+    {
+        LOG_DEBUG(log, "Clean applied S3 external files, erase_hit lockkey={}", key);
+    }
+    for (const auto & key : missing_keys)
+    {
+        LOG_DEBUG(log, "Clean applied S3 external files, erase_miss lockkey={}", key);
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp` around lines 308-315: the per-key logging in the loops iterating erased_keys and missing_keys is too verbose at LOG_INFO; change those per-key messages to LOG_DEBUG (replace LOG_INFO with LOG_DEBUG for the loops that log "erase_hit lockkey={}" and "erase_miss lockkey={}") or add a threshold guard (e.g., only emit per-key logs when erased_keys.size() or missing_keys.size() is below a small threshold, otherwise skip per-key logs and rely on the existing aggregate summary). Update the loops in S3LockLocalManager (the blocks that iterate erased_keys and missing_keys) to use one of these approaches and ensure the aggregate summary remains at INFO.

dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp (1)
618-675: Missing `CATCH` macro for consistent exception handling. Other tests in this file (e.g., `WriteRead`, `WriteManyNonExistedDeleted`) use the `try { ... } CATCH` pattern for proper exception handling and test failure reporting. This test should follow the same pattern.

♻️ Proposed fix to add try-CATCH pattern
```diff
 TEST_F(UniPageStorageTest, ConcurrentRemoteWriteShouldNotLeavePreLockKeys)
+try
 {
     DB::tests::TiFlashTestEnv::enableS3Config();
```

And at the end of the test:

```diff
     if (DB::tests::TiFlashTestEnv::isMockedS3Client())
         DB::tests::TiFlashTestEnv::deleteBucket(*s3_client);
     DB::tests::TiFlashTestEnv::disableS3Config();
 }
+CATCH
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp` around lines 618 - 675, Wrap the body of the TEST_F function ConcurrentRemoteWriteShouldNotLeavePreLockKeys in a try { ... } CATCH block to match other tests' exception handling; specifically, enclose everything from DB::tests::TiFlashTestEnv::enableS3Config() through DB::tests::TiFlashTestEnv::disableS3Config() in try and add the existing CATCH macro at the end so exceptions are handled and reported consistently (refer to the test name ConcurrentRemoteWriteShouldNotLeavePreLockKeys and the use of page_storage, SyncPointCtl, and S3::ClientFactory::instance() within the test to locate the code to wrap).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@metrics/grafana/tiflash_summary.json`:
- Around line 19244-19254: The PromQL expr for metric
tiflash_storage_s3_lock_mgr_status contains two filters on the same label
(instance=~"$instance", instance=~"$tiflash_role") which forces an impossible
AND; fix by removing the duplicate instance filter — either drop
instance=~"$tiflash_role" so only instance=~"$instance" remains, or if you
intended to filter by role replace the second instance=~"$tiflash_role" with the
correct label (e.g., tiflash_role=~"$tiflash_role"); update the expr string used
in the target accordingly.
- Around line 19347-19357: The PromQL target has a duplicated instance filter
(instance=~"$instance", instance=~"$tiflash_role"); remove the second duplicate
so only one instance matcher remains (e.g., keep instance=~"$instance" and
delete instance=~"$tiflash_role") across all occurrences (~436 places) and run a
systematic replace. Also adjust the legendFormat which incorrectly uses
{{$additional_groupby}} (a dashboard variable, not a Prometheus label) — replace
it with the dashboard variable itself (e.g., use $additional_groupby) or remove
it, then test the panel in Grafana to confirm the legend renders as intended.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 812c7b73-fc68-4449-ae11-244c93130fb7
📒 Files selected for processing (13)
dbms/src/Common/TiFlashMetrics.h
dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalSegment.cpp
dbms/src/Storages/DeltaMerge/WriteBatchesImpl.h
dbms/src/Storages/Page/V3/BlobStore.cpp
dbms/src/Storages/Page/V3/PageDirectory.cpp
dbms/src/Storages/Page/V3/PageDirectory.h
dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp
dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp
dbms/src/Storages/Page/V3/Universal/UniversalPageStorageService.cpp
dbms/src/Storages/Page/V3/Universal/tests/gtest_lock_local_mgr.cpp
dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp
dbms/src/Storages/Page/V3/tests/gtest_page_directory.cpp
metrics/grafana/tiflash_summary.json
dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp
Second-opinion review completed for upstream/master...HEAD (PR #10760 head branch). I did not confirm any blocking correctness/concurrency/compatibility defects in the current diff. Residual risk: monitor the newly added S3 lock manager logs/metrics in staging to ensure observability signal-to-noise stays healthy.
Signed-off-by: JaySon-Huang <tshent@qq.com>
@pantheon-ai review
Review Complete. Findings: 4 issues. ℹ️ Learn more details on Pantheon AI.
Pull request overview
This PR fixes a correctness bug in disaggregated remote-write paths where write-group concurrency (and partial failures) could leave S3LockLocalManager.pre_lock_keys resident, causing manifests to retain obsolete lock keys and S3GC to over-retain objects (inflating remote storage usage).
Changes:
- Make `PageDirectory::apply` return per-writer `applied_data_files` so every writer in a write-group can clean up its own pre-lock keys.
- Harden the `UniversalPageStorage::write` + `S3LockLocalManager` lock lifecycle: return "newly appended" pre-lock keys, and add explicit failure cleanup.
- Add regression tests for write-group concurrency and lock-manager cleanup semantics; add lock-manager metrics and update style guidance.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| dbms/src/Storages/Page/V3/tests/gtest_page_directory.cpp | Adds remote-checkpoint write-group concurrency regression tests; forces std::async to run asynchronously. |
| dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp | Adds an end-to-end concurrent remote write test validating no residual pre_lock_keys. |
| dbms/src/Storages/Page/V3/Universal/tests/gtest_lock_local_mgr.cpp | Adds unit tests for partial cleanup, failure cleanup, return semantics, and partial-failure behavior. |
| dbms/src/Storages/Page/V3/Universal/UniversalWriteBatchImpl.h | Replaces boolean remote-write flag with a remote-write counter (writesRemoteCount). |
| dbms/src/Storages/Page/V3/Universal/UniversalPageStorageService.cpp | Adds log line to expose checkpoint lock-set size/sequence at upload time. |
| dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp | Uses per-writer applied ids for cleanup and adds failure-path pre-lock cleanup. |
| dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.h | Changes lock-creation API to return newly appended keys; adds failure-cleanup API. |
| dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp | Implements atomic-ish append semantics, cleanup helpers, and lock-manager metrics/logging. |
| dbms/src/Storages/Page/V3/PageDirectory.h | Extends Writer to retain per-writer applied checkpoint lock ids. |
| dbms/src/Storages/Page/V3/PageDirectory.cpp | Captures per-writer checkpoint ids pre-merge and returns them for both owner/followers. |
| dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalSegment.cpp | Adjusts logging severity/message for a MergeDelta invalidation case. |
| dbms/src/Common/TiFlashMetrics.h | Adds S3 lock-manager gauge/counters. |
| AGENTS.md | Updates documented C++ error-handling/logging guidelines. |
dbms/src/Storages/Page/V3/Universal/tests/gtest_universal_page_storage.cpp
Signed-off-by: JaySon-Huang <tshent@qq.com>
🧹 Nitpick comments (1)
dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp (1)
137-151: Consider wrapping cleanup call to avoid masking the original exception. If `cleanPreLockKeysOnWriteFailure` throws, the original exception caught at line 137 would be lost. While cleanup functions should generally be noexcept, adding a defensive wrapper ensures the original failure context is preserved for diagnostics.

🛡️ Optional: Defensive wrapper for cleanup
```diff
 catch (...)
 {
     if (remote_write_count > 0)
     {
         // If write fails after pre-lock creation, clean these pre-lock keys
         // to avoid residual entries in checkpoint_manager.pre_lock_keys.
-        remote_locks_local_mgr->cleanPreLockKeysOnWriteFailure(std::move(created_pre_lock_keys));
+        try
+        {
+            remote_locks_local_mgr->cleanPreLockKeysOnWriteFailure(std::move(created_pre_lock_keys));
+        }
+        catch (...)
+        {
+            tryLogCurrentException(log, "Failed to clean pre-lock keys on write failure");
+        }
         tryLogCurrentException(
             log,
             fmt::format(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp` around lines 137 - 151, The catch(...) block currently calls remote_locks_local_mgr->cleanPreLockKeysOnWriteFailure(std::move(created_pre_lock_keys)) which could throw and mask the original exception; wrap that cleanup call in its own try-catch around cleanPreLockKeysOnWriteFailure (and any logging) so any exception from cleanup is caught and logged (e.g., via tryLogCurrentException) but does not replace the original exception being rethrown by the outer catch; ensure you still call tryLogCurrentException(...) for the original failure (using failed_stage and remote_write_count) and then rethrow the original exception after the defensive cleanup handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 38399200-3ace-4a2f-b475-021ab2dcdf3b
📒 Files selected for processing (1)
dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: CalvinNeo, JinheLin. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold

/unhold (review comments are resolved)
/cherry-pick release-nextgen-20251011
@JaySon-Huang: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011 in the new PR and assign it to you. Details: in response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@JaySon-Huang: new pull request created to branch

Details: in response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
What problem does this PR solve?
Issue Number: close #10763
Problem Summary:
- `PageDirectory` write-group semantics could cause follower writers to miss their own applied lock-id cleanup signals.
- `S3LockLocalManager.pre_lock_keys` could remain resident and be repeatedly written into manifest locks.
End-to-end correctness fixes for lock lifecycle
PageDirectory::applynow returns writer-scopedapplied_data_filesfor both write-group owner and followers, so each writer gets its own cleanup signal.UniversalPageStorage::writeuses those per-writer ids to clean pre-locks reliably after apply.cleanPreLockKeysOnWriteFailure(...)is invoked when remote write/apply fails.createS3LockForWriteBatchwas adjusted to avoid partial pre-lock residue on partial lock-creation failures (append topre_lock_keysafter lock-creation pass), and its return value is now aligned with "newly appended keys" semantics.Test coverage and regression guards
PageDirectoryandUniversalPageStoragepaths.S3LockLocalManagertests for partial cleanup, failure cleanup, lock-return semantics, and partial-failure atomicity.std::launch::asyncto avoid deferred scheduling risk.Observability and operations improvements
S3GCManagerService.tiflash_storage_s3_store_summary_bytes{store_id, type=data_file_bytes|dt_file_bytes}remote_summary_interval_secondsand wired it throughTMTContext;<= 0disables periodic summary task registration.Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
Bug Fixes
New Features
Improvements