## What
When a scheduled job hits MAX_CONSECUTIVE_ERRORS (10), the executor flips it to status='failed', sets next_run_at=NULL, and stops touching it. There is no public-API path back from failed. resumeJob at src/scheduler/service.ts:160-181 refuses anything that isn't paused, and runJobNow at src/scheduler/service.ts:226-243 refuses anything that isn't active. Recovery requires a raw SQLite UPDATE against scheduled_jobs.
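The guard pattern described above can be sketched as follows (hypothetical names and shapes; the real implementations live in src/scheduler/service.ts):

```typescript
// Illustrative sketch of the lifecycle guards described above; names and
// types are stand-ins, not the project's actual API.
type JobStatus = "active" | "paused" | "failed" | "completed";

interface ScheduledJob {
  id: string;
  status: JobStatus;
  consecutiveErrors: number;
  nextRunAt: number | null; // epoch ms; NULL once the circuit breaks
}

// resumeJob refuses anything that isn't paused...
function canResume(job: ScheduledJob): boolean {
  return job.status === "paused";
}

// ...and runJobNow refuses anything that isn't active.
function canRunNow(job: ScheduledJob): boolean {
  return job.status === "active";
}

// A circuit-broken row is therefore unreachable from the public API:
const wedged: ScheduledJob = {
  id: "job-1",
  status: "failed",
  consecutiveErrors: 10,
  nextRunAt: null,
};
console.log(canResume(wedged), canRunNow(wedged)); // false false
```

Neither guard matches a failed row, which is the whole gap.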
I've hit this twice as the operator:
- 2026-04-30: long silence wedge; multiple jobs hit 10 consecutive errors during the gap, all flipped to failed, and none recovered on restart because staggerMissedJobs at src/scheduler/recovery.ts:23-29 only picks up status='active' rows.
- 2026-05-05: a rate-limit storm against the model provider drove the same set of jobs past MAX_CONSECUTIVE_ERRORS in a few minutes; same shape, same SQLite recovery.
Both incidents resolved with a hand-rolled UPDATE scheduled_jobs SET status='active', consecutive_errors=0, next_run_at=... plus the operator computing the next fire time from the schedule. That's documented in my own memory under reference_scheduler_revive_failed_jobs.md, but the documentation is a private workaround, not a supported path.
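For illustration, the manual recipe amounts to something like the sketch below. The table and column names come from the incident notes above; the fixed-interval nextFireTime is a stand-in for the real schedule arithmetic the operator currently does by hand:

```typescript
// Sketch of the hand-rolled revival: compute the next fire time, then issue
// a raw UPDATE against scheduled_jobs. The fixed interval is illustrative;
// the real schedule math depends on the job's cron expression.
function nextFireTime(intervalMs: number, nowMs: number): number {
  return nowMs + intervalMs;
}

function buildReviveSql(
  jobId: string,
  nextRunAt: number
): { sql: string; params: [number, string] } {
  return {
    sql: "UPDATE scheduled_jobs SET status='active', consecutive_errors=0, next_run_at=? WHERE id=?",
    params: [nextRunAt, jobId],
  };
}

const now = Date.UTC(2026, 4, 5); // 2026-05-05, the second incident
const revive = buildReviveSql("job-1", nextFireTime(60 * 60 * 1000, now));
```

Everything here is exactly what a typed recovery path would do for the operator; the point of the proposal is to stop requiring it by hand.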
## Why the gap is deliberate
The comment at src/scheduler/service.ts:163-167 is explicit:
> Only paused jobs may be resumed. Failed and completed are terminal states; force-reviving them would bypass the lifecycle (e.g., re-running a one-shot that already deleted itself, or restarting a circuit-broken job without addressing the failure).
That reasoning holds for completed (especially deleteAfterRun=true one-shots that the executor at src/scheduler/executor.ts:134-136 deletes inline). It is less clean-cut for failed. Two things change the calculus for failed:
- The reason the circuit broke is often transient and external (model-provider rate limits, a brief Slack outage, a stuck session). The operator knows when the underlying cause has cleared.
- cleanupOldTerminalJobs at src/scheduler/recovery.ts:59-69 will sweep failed rows whose updated_at is older than 30 days, so the recovery window is hard-bounded. The current state of the world is "either revive via SQL within 30 days, or lose the job definition."
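In other words, revivability has a hard deadline. A minimal sketch of that bound, assuming the 30-day window named above:

```typescript
// Sketch of the deadline implied by cleanupOldTerminalJobs: a failed row is
// only revivable while its updated_at is within 30 days of now.
const SWEEP_WINDOW_MS = 30 * 24 * 60 * 60 * 1000;

function stillRevivable(updatedAtMs: number, nowMs: number): boolean {
  return nowMs - updatedAtMs < SWEEP_WINDOW_MS;
}
```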
## Shapes
Three viable shapes, ranked.
1. Add a force parameter to resumeJob that accepts failed in addition to paused. It resets consecutive_errors=0, recomputes next_run_at from computeNextRunAt(job.schedule), and leaves completed rejected (the one-shot-deletion footgun the comment names is real for completed, not for failed). The MCP-tool surface gains an optional force: boolean field. Cost: ~30 lines in resumeJob, plus an MCP schema update and a test.
2. Add a dedicated recoverFailedJob(id) action that only accepts status='failed'. This is symmetric to how runJobNow is its own admin-override action rather than a flag on a more general API. Cost: same as shape (1) plus one new MCP tool.
3. Document the SQLite recovery path as the supported answer. Keep the state machine strict and ship operator docs at the README level explaining the manual revival recipe. Cost: docs only, but it ships a known footgun: operator-facing tools should not require dropping to raw SQL.
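Shape (1) could look roughly like the sketch below. The types and computeNextRunAt are stand-ins, not the project's actual code:

```typescript
// Sketch of shape (1): resumeJob gains an optional force flag that also
// accepts failed rows. Hypothetical types; the real resumeJob lives in
// src/scheduler/service.ts.
type JobStatus = "active" | "paused" | "failed" | "completed";

interface ScheduledJob {
  id: string;
  status: JobStatus;
  schedule: string;
  consecutiveErrors: number;
  nextRunAt: number | null;
}

// Stand-in for the real schedule arithmetic.
function computeNextRunAt(_schedule: string): number {
  return Date.now() + 60_000;
}

function resumeJob(job: ScheduledJob, opts: { force?: boolean } = {}): ScheduledJob {
  const revivable =
    job.status === "paused" || (opts.force === true && job.status === "failed");
  if (!revivable) {
    // completed stays terminal: the one-shot-deletion footgun is real there.
    throw new Error(`cannot resume job in status '${job.status}'`);
  }
  return {
    ...job,
    status: "active",
    consecutiveErrors: 0, // reset the circuit breaker
    nextRunAt: computeNextRunAt(job.schedule),
  };
}
```

Without force the default behavior is unchanged: paused rows resume, everything else is rejected.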
I'd lean toward shape (1). The force flag keeps the existing default behavior unchanged (still no silent revival of failed jobs), keeps completed rejected for the reason the existing comment names, and gives operators a typed path that does the schedule arithmetic the SQL recipe currently makes them do by hand. Shape (2) is also fine if you'd rather keep resumeJob semantically pure; I don't have a strong preference between (1) and (2).
Happy to push a PR along whichever shape you pick. If you'd rather just ship the docs in shape (3) for now, I'll write the README section.
## Repro of the failed-row state
The easiest way to see the gap without staging a 10-error storm: pick a one-shot at-kind job, point it at a task that always returns Error:, and let it run three times. After the third error, src/scheduler/executor.ts:73-74 flips at-kind jobs to failed with the same shape as the cron-kind 10-error path. Then call resumeJob and observe the no-op return.
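The repro leans on the two circuit-breaker thresholds the text names: 3 consecutive errors for at-kind jobs, MAX_CONSECUTIVE_ERRORS (10) for cron-kind. A sketch of that decision, with illustrative constants:

```typescript
// Sketch of the two circuit-breaker paths the repro relies on. The 3-error
// at-kind limit and the 10-error cron-kind limit come from the report above;
// the function shape is illustrative, not the executor's actual code.
type Kind = "at" | "cron";

const MAX_CONSECUTIVE_ERRORS = 10;
const AT_KIND_ERROR_LIMIT = 3;

function shouldCircuitBreak(kind: Kind, consecutiveErrors: number): boolean {
  const limit = kind === "at" ? AT_KIND_ERROR_LIMIT : MAX_CONSECUTIVE_ERRORS;
  return consecutiveErrors >= limit;
}
```

An at-kind job therefore reaches the same wedged failed state after three runs, which is why it makes a quick repro.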