Skip to content

Conversation

@maxlepikhin
Copy link

@maxlepikhin maxlepikhin commented Dec 6, 2025

Description

Prevent peer/segment recovery timeouts from being reset by mere status polling. Previously, every call to ReplicationCollection#get() bumped the recovery’s lastAccessTime, so a hung recovery could sit in INITIALIZING indefinitely. This change removes that implicit update and explicitly touches the timestamp only when meaningful work happens (start, file transfer, clean‑files, translog, finalize, etc.), ensuring indices.recovery.recovery_activity_timeout actually fails stuck recoveries.

Related Issues

Resolves #20177

Check List

  • Functionality includes testing. (./gradlew :server:test --tests org.opensearch.recovery.ReplicationCollectionTests)
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

Summary by CodeRabbit

  • Bug Fixes

    • Ensure recovery and replication targets record last-access times when actively used, and prevent frequent polling from inadvertently extending timeouts.
    • Stop updating last-access time on simple collection reference creation to avoid misleading activity metrics.
  • Tests

    • Added tests validating last-access updates and that polling does not reset recovery/replication timeouts.

✏️ Tip: You can customize this high-level summary in your review settings.

@maxlepikhin maxlepikhin requested a review from a team as a code owner December 6, 2025 22:43
@github-actions github-actions bot added _No response_ bug Something isn't working labels Dec 6, 2025
@coderabbitai
Copy link

coderabbitai bot commented Dec 6, 2025

Walkthrough

Centralized last-access-time updates: removed automatic updates from ReplicationCollection refs and added explicit setLastAccessTime() calls at recovery and replication checkpoints to record activity for timeout detection across recovery and replication flows.

Changes

Cohort / File(s) Summary
Recovery handler access-time updates
server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java
Added recoveryTarget.setLastAccessTime() calls at multiple recovery handler entry points (FilesInfo, FileChunk, CleanFiles, PrepareForTranslogOperations, FinalizeRecovery, HandoffPrimaryContext, TranslogOperations) and in the main doRecovery path.
Replication operation access-time updates
server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java, server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java
Added target.setLastAccessTime() immediately after obtaining replication targets and at the start of handleFileChunk to mark activity when replication work begins.
Removed automatic access-time updates
server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java
Removed the implicit status.setLastAccessTime() call from ReplicationCollection.ReplicationRef constructor; last-access updates are now explicit.
Tests: verify polling vs explicit updates
server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java
Modified testLastAccessTimeUpdate to assert that get() does not auto-update lastAccessTime and that explicit setLastAccessTime() does; added testRecoveryTimeoutNotResetByPolling to ensure polling status does not reset recovery timeout.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Multiple similar edits lower cognitive load, but review needs to ensure coverage consistency.
  • Pay special attention to:
    • All recovery and replication entry points where activity should reset the timeout (PeerRecoveryTargetService handlers, SegmentReplicator, ReplicationTarget).
    • Any code paths that previously relied on automatic updates from ReplicationRef — ensure no gaps introduced.
    • Tests for timing (flaky/wait) and their reliability across CI environments.

Poem

🐰 I hop through logs and tenderly trace,
Timestamps at checkpoints to quicken the race.
No more hidden peeks that silently stay,
Now every hard hop marks time on the way.
Freed slots and fresh retries — a happier place! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix peer recovery activity tracking' directly addresses the main change: fixing how recovery activity (last access time) is tracked to prevent timeouts from being reset by mere polling.
Description check ✅ Passed The description clearly explains what the change achieves, identifies the related issue (#20177), and confirms testing. All key template sections are present and adequately filled.
Linked Issues check ✅ Passed The PR directly addresses the first objective from issue #20177: auto-failing hung recoveries by removing implicit lastAccessTime updates on polling and explicitly updating it only during meaningful work, enabling indices.recovery.recovery_activity_timeout to function correctly.
Out of Scope Changes check ✅ Passed All changes are strictly scoped to implementing the peer recovery timeout fix: removing implicit updates in ReplicationCollection constructor, adding explicit setLastAccessTime() calls in recovery handlers, and adding relevant tests.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Max Lepikhin <mlepikhin@gmail.com>
@maxlepikhin maxlepikhin force-pushed the bugfix/peer-recovery-timeout branch from afeebd6 to 4263de5 Compare December 6, 2025 22:46
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (1)

239-247: Explicit activity bump on each received file chunk

Updating replicationTarget.setLastAccessTime() at the top of handleFileChunk makes chunk traffic reliably count as recovery activity, independent of status polling. Note that FileChunkTransportRequestHandler also calls setLastAccessTime() before invoking this method, so you now have two updates per chunk; that’s harmless but could be consolidated in one place later if you want to avoid duplication.

server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (1)

229-241: Good separation between “real recovery work” and status polling

The new recoveryTarget.setLastAccessTime() calls in doRecovery and all the recovery request handlers ensure that only meaningful events (start/restart, files info, file chunks, clean files, prepare/finalize, handoff primary context, and translog ops, including mapping‑retry runs) advance the activity timestamp. This matches the goal of letting idle/hung recoveries time out even while status is being polled via ReplicationCollection.get().

There is a bit of benign duplication (e.g., for translog ops and file chunks you now bump lastAccessTime in both the transport handler and the downstream helper), so if you later want to tighten things up you could centralize the update in a single layer per operation type.

Also applies to: 399-408, 414-425, 431-438, 447-457, 471-473, 535-552, 559-573, 585-595

server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (1)

96-127: Solid regression test; consider tightening latch assertion

testRecoveryTimeoutNotResetByPolling does a good job simulating frequent status polling and asserting that the recovery still times out (via the custom listener and failed flag), which directly protects against the original bug.

For extra clarity in failure cases, you might also assert that latch.await(30, TimeUnit.SECONDS) itself returns true before checking failed.get(), so a stuck test reports “recovery did not complete” rather than only “failed to timeout.”

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0449ce8 and afeebd6.

📒 Files selected for processing (5)
  • server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (9 hunks)
  • server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1 hunks)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java (0 hunks)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (1 hunks)
  • server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (2 hunks)
💤 Files with no reviewable changes (1)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (2)
server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1)

320-333: Aligns segment replication with explicit activity tracking

Calling target.setLastAccessTime() immediately after obtaining the ReplicationRef cleanly reintroduces the “replication just (re)started” activity bump that was previously hidden in ReplicationRef’s constructor. This keeps timeout behavior correct after the implicit update removal.

server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (1)

72-88: Tests now correctly enforce explicit lastAccessTime updates

The extra assertBusy before calling setLastAccessTime() is a nice regression guard: it proves collection.get(recoveryId) alone no longer mutates lastAccessTime, and that only an explicit setLastAccessTime() changes what subsequent observers see. This matches the new ReplicationRef behavior.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (1)

399-402: Preparing for translog phase now properly refreshes activity timestamp

Updating lastAccessTime right after resolving the RecoveryTarget for PREPARE_TRANSLOG ties the timeout to an actual protocol step rather than to generic collection access, which matches the stated goal.

You repeat the getSafe(...) → recoveryRef.get() → setLastAccessTime() pattern in several handlers; consider a small helper to centralize this so future handlers can’t forget the activity update.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between afeebd6 and 4263de5.

📒 Files selected for processing (5)
  • server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (9 hunks)
  • server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1 hunks)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java (0 hunks)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (1 hunks)
  • server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (2 hunks)
💤 Files with no reviewable changes (1)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java
✅ Files skipped from review due to trivial changes (1)
  • server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java
🚧 Files skipped from review as they are similar to previous changes (2)
  • server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java
  • server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (8)
server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (8)

239-241: Explicit activity bump at start/re-establish of recovery looks correct

Calling recoveryTarget.setLastAccessTime() immediately after acquiring the RecoveryTarget ensures long-running local prep (e.g., pre-recovery, remote segment sync) is treated as activity and prevents premature timeout, while still decoupling from bare status polling.


416-419: Finalize phase correctly counted as recovery activity

Bumping lastAccessTime before finalizeRecovery(...) ensures that a recovery doing only finalize/refresh work is not timed out as idle, which is consistent with “meaningful work” semantics.


435-438: Primary-context handoff now tracked as active recovery work

Setting lastAccessTime before handoffPrimaryContext(...) brings this path in line with the other phases; the handoff is a critical step and should indeed reset the inactivity timer.


449-452: Translog-ops request correctly refreshes activity timestamp

Updating lastAccessTime upon receipt of a TRANSLOG_OPS request ensures that inbound translog batches keep the recovery alive and avoids timeouts being reset by unrelated collection lookups.


471-473: Extra activity bump in performTranslogOps is justified for retry path

Calling setLastAccessTime() again inside performTranslogOps covers internal retries triggered after mapping updates (which re-enter via performTranslogOps without a new transport message), so those retries won’t be misclassified as idle; the double bump for the initial call is harmless.


536-539: Files-info phase now participates in recovery timeout semantics

Marking activity when handling FILES_INFO aligns this early phase with the rest of the recovery pipeline and prevents a recovery from timing out while the source is legitimately sending file metadata.


560-563: Clean-files step correctly refreshes last access time

Treating CLEAN_FILES as activity makes sense, since it can be non-trivial work and may be the only thing happening for a period; this fits well with the intent of recovery_activity_timeout.


585-588: File-chunk handling updates lastAccessTime at the right place

Updating lastAccessTime right before handleFileChunk(...) ties the timer to actual segment-bytes transfer, which is central to the “no bytes transferred within the interval” heuristic and complements the additional updates in the replication layer.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 6, 2025

❌ Gradle check result for 4263de5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working _No response_

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Hung peer recovery permanently blocks replica allocation.

1 participant