Guard PID reset against replacement process #4325
Open
gkatz2 wants to merge 1 commit into stacklok:main
Add ResetWorkloadPIDIfMatch to StatusManager that only resets PID to 0 when the status file PID matches the caller's PID. This prevents a dying process's cleanup from clobbering the PID written by a replacement process during thv rm + thv run sequences.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
Codecov Report
❌ Patch coverage is ...
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4325      +/-   ##
==========================================
- Coverage   68.61%   68.41%   -0.20%
==========================================
  Files         478      478
  Lines       48450    48499      +49
==========================================
- Hits        33243    33182      -61
- Misses      12367    12422      +55
- Partials     2840     2895      +55
Summary
- During a `thv rm` + `thv run` sequence, the old process's cleanup unconditionally writes `process_id: 0` to the status file after a shutdown timeout (up to 30s). By then, the new process has already written its own PID, so the old process clobbers it. This causes false "desync" reports in monitoring tools.
- Add `ResetWorkloadPIDIfMatch` to `StatusManager`. It reads the current PID under the existing file lock and resets it to 0 only if it matches the caller's PID. Both cleanup paths in `runner.go` now use this guarded reset.

Fixes #4324
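The race and the guarded reset described in the summary can be sketched with an in-memory stand-in for the status file. This is only an illustration under stated assumptions: `statusStore`, `SetPID`, `PID`, `ResetPID`, and `ResetPIDIfMatch` are hypothetical names, and a `sync.Mutex` stands in for the file lock; none of this is the code from the PR.

```go
package main

import "sync"

// statusStore is a hypothetical in-memory stand-in for the status file.
// The mutex plays the role of the file lock mentioned in the PR.
type statusStore struct {
	mu   sync.Mutex
	pids map[string]int // workload name -> recorded PID
}

func newStatusStore() *statusStore {
	return &statusStore{pids: make(map[string]int)}
}

// SetPID records pid for the workload, as a starting process would.
func (s *statusStore) SetPID(name string, pid int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pids[name] = pid
}

// PID returns the recorded PID for the workload (0 if none).
func (s *statusStore) PID(name string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.pids[name]
}

// ResetPID is the unconditional reset: it writes 0 no matter who owns the entry.
func (s *statusStore) ResetPID(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pids[name] = 0
}

// ResetPIDIfMatch resets the PID to 0 only when the recorded PID equals
// expected. The read, compare, and write all happen under one lock.
// It reports whether a reset took place.
func (s *statusStore) ResetPIDIfMatch(name string, expected int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pids[name] != expected {
		return false // a replacement process owns the entry; leave it alone
	}
	s.pids[name] = 0
	return true
}
```

If the old process (say PID 100) calls `ResetPIDIfMatch("example", 100)` after the replacement has already recorded PID 200, the call returns `false` and the recorded PID stays 200. The unconditional `ResetPID` would have overwritten it with 0.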
Type of change
Test plan
- `task test`
- `task lint-fix`
- Reproduced the bug with Glean (the only server whose backend holds SSE streams open long enough to trigger the race). Verified that after the fix, the status file PID remains correct after `thv rm` + `thv run` + 35s wait.

Changes
- `pkg/workloads/statuses/status.go`: `ResetWorkloadPIDIfMatch` added to interface + `runtimeStatusManager` no-op
- `pkg/workloads/statuses/file_status.go`
- `pkg/workloads/statuses/noop.go`
- `pkg/workloads/statuses/mocks/mock_status_manager.go`
- `pkg/runner/runner.go`: `ResetWorkloadPID` call sites → `ResetWorkloadPIDIfMatch(ctx, name, os.Getpid())`
- `pkg/workloads/statuses/file_status_test.go`

Special notes for reviewers
The existing `ResetWorkloadPID` (unconditional) is kept for backward compatibility. It's still correct for the `stopProcess` path in `manager.go` where the manager just killed the process and wants an unconditional reset. Only the runner's self-cleanup paths need the guarded version.

The file lock (`withFileLock`) ensures the read-compare-write is atomic with respect to other status file operations, preventing TOCTOU races.

Generated with Claude Code