Skip to content

fix(stop-review-gate): treat job as inactive when its pid is dead#297

Open
yoshitarof wants to merge 1 commit intoopenai:mainfrom
yoshitarof:fix/zombie-job-pid-liveness
Open

fix(stop-review-gate): treat job as inactive when its pid is dead#297
yoshitarof wants to merge 1 commit intoopenai:mainfrom
yoshitarof:fix/zombie-job-pid-liveness

Conversation

@yoshitarof
Copy link
Copy Markdown

Summary

Codex plugin job state (state.json / jobs/task-*.json) records status: "running" while a job is in flight. The dying codex.exe is expected to update this field to "completed" / "cancelled" on exit, but that responsibility cannot be honored if the process is killed by something external (SIGKILL, OS reboot, power loss, a cleanup script, or taskkill /F).

Result: zombie entries with status: "running" accumulate in state.json. stop-review-gate-hook.mjs reads them, finds a "running" entry, and emits Codex task task-XYZ is still running. on every session end — even when no Codex process is actually alive. Users have to either ignore the false alarm or repeatedly invoke /codex:cancel against ghost ids that no longer exist.

We hit this in the wild on 2026-05-05: 11 zombie entries with status "running" persisted across multiple sessions after a process cleanup; all 11 pids were verified dead via the host OS but the hook continued firing. The root cause is that the hook trusts status as a single source of truth without verifying the recorded pid.

Fix

Add two helpers in plugins/codex/scripts/lib/state.mjs:

  • isPidAlive(pid) — cross-platform pid liveness via process.kill(pid, 0) (ESRCH = dead, EPERM = exists but unsignalable / still alive). Both error codes handled correctly.
  • isJobActive(job) — combines the existing status === "queued" || "running" check with isPidAlive(job.pid). Returns false when pid is missing (older state format) so it fails safe.

Replace the inline filter in plugins/codex/scripts/stop-review-gate-hook.mjs:

-  const runningJob = jobs.find((job) => job.status === "queued" || job.status === "running");
+  const runningJob = jobs.find(isJobActive);

Net change: ~30 lines added, 1 line modified.

Why this scope (read-only)

The patch does not mutate state.json. Persistent zombie cleanup (rewriting status: "running" to "cancelled" with a cancelReason field) deserves its own PR — it changes mutation semantics that other consumers (e.g., /codex:status) depend on. The hook fix alone stops the user-visible noise immediately and is safe to land independently.

Test plan

Smoke-tested locally on Windows (Node v24, but the helpers use only standard Node APIs):

  • isPidAlive(99999999)false (dead pid)
  • isPidAlive(process.pid)true (self)
  • isPidAlive(null)false
  • isJobActive({status: "running", pid: process.pid})true
  • isJobActive({status: "running", pid: 99999999})false (zombie case)
  • isJobActive({status: "completed", pid: process.pid})false (already done)
  • isJobActive({status: "running", pid: null})false (older state, fails safe)
  • node --check on both modified files

End-to-end: injected a fake {id: "task-FAKE", status: "running", pid: 99999999} into a real state.json, confirmed the hook previously emitted Codex task task-FAKE is still running., applied the patch, confirmed no false alarm, then verified a real long-lived process with a recorded pid is still detected correctly.

Caveats

  • For workspaces where pid was never recorded (older format, pre-pid-tracking), the helper falls back to false (treat as inactive) rather than trusting status alone. This is the safer default for the false-alarm use case but is technically a behavior change for that legacy slice. If reviewers prefer, we can fall back to the old status-only check when pid is missing — happy to adjust.
  • Persistent zombies remain in state.json until a separate cleanup pass rewrites them. We have a sweeper script in our downstream workspace; we'd be glad to contribute it as a follow-up PR if the maintainers are interested in folding it in.

Notes for reviewer

This was discovered during a downstream workspace audit (Nippon Medical School research environment). The local mitigation (a PowerShell sweeper run on SessionStart) is already in place; this upstream PR is the smaller, safer half of the fix.

🤖 Generated with Claude Code

Codex plugin job state (state.json / jobs/task-*.json) records
status: "running" while a job is in flight. The dying codex.exe is
expected to update this field to "completed" / "cancelled" on exit.
But that responsibility cannot be honored if the process is killed by
something external (SIGKILL, OS reboot, power loss, a cleanup script,
or `taskkill /F`).

Result: zombie entries with status: "running" accumulate in
state.json. stop-review-gate-hook.mjs reads them, finds a "running"
entry, and emits "Codex task task-XYZ is still running." on every
session end -- even when no Codex process is actually alive.

Fix: add isPidAlive(pid) and isJobActive(job) helpers in lib/state.mjs.
isPidAlive uses process.kill(pid, 0) -- cross-platform: ESRCH = dead,
EPERM = exists but unsignalable (still alive). isJobActive combines
the existing status check with the pid liveness probe.

Replace the inline filter in stop-review-gate-hook.mjs with
jobs.find(isJobActive). The hook now only reports a running task when
it is actually running.

This is read-only -- the hook does not mutate state.json. Persistent
zombie cleanup is a separate concern that deserves its own PR.

Smoke-tested locally on Windows (Node v24): isPidAlive correctly
returns true for self pid, false for unused pid 99999999, false for
null. isJobActive correctly returns true for {running, alive pid},
false for {running, dead pid}, false for {completed, alive pid},
false for {running, no pid}.
@yoshitarof yoshitarof requested a review from a team May 5, 2026 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant