fix(stop-review-gate): treat job as inactive when its pid is dead#297
Open
yoshitarof wants to merge 1 commit intoopenai:mainfrom
Open
fix(stop-review-gate): treat job as inactive when its pid is dead#297yoshitarof wants to merge 1 commit intoopenai:mainfrom
yoshitarof wants to merge 1 commit intoopenai:mainfrom
Conversation
Codex plugin job state (state.json / jobs/task-*.json) records
status: "running" while a job is in flight. The dying codex.exe is
expected to update this field to "completed" / "cancelled" on exit.
But that responsibility cannot be honored if the process is killed by
something external (SIGKILL, OS reboot, power loss, a cleanup script,
or `taskkill /F`).
Result: zombie entries with status: "running" accumulate in
state.json. stop-review-gate-hook.mjs reads them, finds a "running"
entry, and emits "Codex task task-XYZ is still running." on every
session end -- even when no Codex process is actually alive.
Fix: add isPidAlive(pid) and isJobActive(job) helpers in lib/state.mjs.
isPidAlive uses process.kill(pid, 0) -- cross-platform: ESRCH = dead,
EPERM = exists but unsignalable (still alive). isJobActive combines
the existing status check with the pid liveness probe.
Replace the inline filter in stop-review-gate-hook.mjs with
jobs.find(isJobActive). The hook now only reports a running task when
it is actually running.
This is read-only -- the hook does not mutate state.json. Persistent
zombie cleanup is a separate concern that deserves its own PR.
Smoke-tested locally on Windows (Node v24): isPidAlive correctly
returns true for self pid, false for unused pid 99999999, false for
null. isJobActive correctly returns true for {running, alive pid},
false for {running, dead pid}, false for {completed, alive pid},
false for {running, no pid}.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Codex plugin job state (
state.json/jobs/task-*.json) recordsstatus: "running"while a job is in flight. The dyingcodex.exeis expected to update this field to"completed"/"cancelled"on exit, but that responsibility cannot be honored if the process is killed by something external (SIGKILL, OS reboot, power loss, a cleanup script, ortaskkill /F).Result: zombie entries with
status: "running"accumulate instate.json.stop-review-gate-hook.mjsreads them, finds a "running" entry, and emitsCodex task task-XYZ is still running.on every session end — even when no Codex process is actually alive. Users have to either ignore the false alarm or repeatedly invoke/codex:cancelagainst ghost ids that no longer exist.We hit this in the wild on 2026-05-05: 11 zombie entries with status
"running"persisted across multiple sessions after a process cleanup; all 11 pids were verified dead via the host OS but the hook continued firing. The root cause is that the hook trustsstatusas a single source of truth without verifying the recordedpid.Fix
Add two helpers in
plugins/codex/scripts/lib/state.mjs:isPidAlive(pid)— cross-platform pid liveness viaprocess.kill(pid, 0)(ESRCH= dead,EPERM= exists but unsignalable / still alive). Both error codes handled correctly.isJobActive(job)— combines the existingstatus === "queued" || "running"check withisPidAlive(job.pid). Returnsfalsewhenpidis missing (older state format) so it fails safe.Replace the inline filter in
plugins/codex/scripts/stop-review-gate-hook.mjs:Net change: ~30 lines added, 1 line modified.
Why this scope (read-only)
The patch does not mutate
state.json. Persistent zombie cleanup (rewritingstatus: "running"to"cancelled"with acancelReasonfield) deserves its own PR — it changes mutation semantics that other consumers (e.g.,/codex:status) depend on. The hook fix alone stops the user-visible noise immediately and is safe to land independently.Test plan
Smoke-tested locally on Windows (Node v24, but the helpers use only standard Node APIs):
isPidAlive(99999999)→false(dead pid)isPidAlive(process.pid)→true(self)isPidAlive(null)→falseisJobActive({status: "running", pid: process.pid})→trueisJobActive({status: "running", pid: 99999999})→false(zombie case)isJobActive({status: "completed", pid: process.pid})→false(already done)isJobActive({status: "running", pid: null})→false(older state, fails safe)node --checkon both modified filesEnd-to-end: injected a fake
{id: "task-FAKE", status: "running", pid: 99999999}into a realstate.json, confirmed the hook previously emittedCodex task task-FAKE is still running., applied the patch, confirmed no false alarm, then verified a real long-lived process with a recorded pid is still detected correctly.Caveats
pidwas never recorded (older format, pre-pid-tracking), the helper falls back tofalse(treat as inactive) rather than trusting status alone. This is the safer default for the false-alarm use case but is technically a behavior change for that legacy slice. If reviewers prefer, we can fall back to the old status-only check whenpidis missing — happy to adjust.state.jsonuntil a separate cleanup pass rewrites them. We have a sweeper script in our downstream workspace; we'd be glad to contribute it as a follow-up PR if the maintainers are interested in folding it in.Notes for reviewer
This was discovered during a downstream workspace audit (Nippon Medical School research environment). The local mitigation (a PowerShell sweeper run on
SessionStart) is already in place; this upstream PR is the smaller, safer half of the fix.🤖 Generated with Claude Code