Skip to content

Guard PID reset against replacement process#4325

Open
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix/status-file-pid-clobbering
Open

Guard PID reset against replacement process#4325
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix/status-file-pid-clobbering

Conversation

@gkatz2
Copy link
Contributor

@gkatz2 gkatz2 commented Mar 24, 2026

Summary

  • When replacing a server (thv rm + thv run), the old process's cleanup unconditionally writes process_id: 0 to the status file after a shutdown timeout (up to 30s). By then, the new process has already written its own PID, so the old process clobbers it. This causes false "desync" reports in monitoring tools.
  • Add ResetWorkloadPIDIfMatch to StatusManager that reads the current PID under the existing file lock and only resets to 0 if it matches the caller's PID. Both cleanup paths in runner.go now use this guarded reset.

Fixes #4324

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Reproduced the bug with Glean (the only server whose backend holds SSE streams open long enough to trigger the race). Verified that after the fix, the status file PID remains correct after thv rm + thv run + 35s wait.

Changes

File Change
pkg/workloads/statuses/status.go Add ResetWorkloadPIDIfMatch to interface + runtimeStatusManager no-op
pkg/workloads/statuses/file_status.go Implement PID-guarded reset: read PID under file lock, compare, skip if mismatched
pkg/workloads/statuses/noop.go No-op implementation
pkg/workloads/statuses/mocks/mock_status_manager.go Regenerated mock
pkg/runner/runner.go Both ResetWorkloadPID call sites → ResetWorkloadPIDIfMatch(ctx, name, os.Getpid())
pkg/workloads/statuses/file_status_test.go 3 new tests: matching PID resets, non-matching skips, non-existent is no-op

Special notes for reviewers

The existing ResetWorkloadPID (unconditional) is kept for backward compatibility. It's still correct for the stopProcess path in manager.go where the manager just killed the process and wants an unconditional reset. Only the runner's self-cleanup paths need the guarded version.

The file lock (withFileLock) ensures the read-compare-write is atomic with respect to other status file operations, preventing TOCTOU races.

Generated with Claude Code

Add ResetWorkloadPIDIfMatch to StatusManager that only
resets PID to 0 when the status file PID matches the
caller's PID. This prevents a dying process's cleanup
from clobbering the PID written by a replacement process
during thv rm + thv run sequences.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 24, 2026
@codecov
Copy link

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 35.48387% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.41%. Comparing base (e735122) to head (785c9d9).

Files with missing lines Patch % Lines
pkg/workloads/statuses/file_status.go 47.82% 4 Missing and 8 partials ⚠️
pkg/workloads/statuses/status.go 0.00% 4 Missing ⚠️
pkg/runner/runner.go 0.00% 2 Missing ⚠️
pkg/workloads/statuses/noop.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4325      +/-   ##
==========================================
- Coverage   68.61%   68.41%   -0.20%     
==========================================
  Files         478      478              
  Lines       48450    48499      +49     
==========================================
- Hits        33243    33182      -61     
- Misses      12367    12422      +55     
- Partials     2840     2895      +55     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dying process clobbers replacement process PID in status file

1 participant