feat(tasks): persist saga leases and stale-worker recovery decisions #168

@hadamrd

Parent: #163

Goal

Persist task saga lifecycle, leases, heartbeats, stale-worker recovery decisions, and compensation requirements.

This ticket is about making "stale workers" a durable control-plane state instead of an operator surprise. A crashed or expired worker should be recoverable because the task saga says who owns it, when the lease expires, and what cleanup/compensation is required.

Scope

Expected owned surface:

  • src/forge_loop/tasks/saga.py
  • a new task saga store module under src/forge_loop/tasks/ if needed;
  • focused tests under tests/test_task_saga_store.py or equivalent.

This can be a standalone store first. Runner integration can be a later ticket unless a narrow hook is cheap and well-tested.

Required Behavior

  • Create a task saga in the planned, dispatched, or running state, recording its issue number, branch, worktree, and registered compensations.
  • Acquire a lease with owner ID and expiry time.
  • Extend the lease with a heartbeat before it expires.
  • Detect stale/expired leases.
  • Mark terminal states: completed, failed, compensated, quarantined.
  • Prevent leasing or mutating terminal tasks except for explicitly allowed audit metadata.
  • Record compensation decisions with enough data to clean a worktree/branch later.
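
A minimal in-memory sketch of the lifecycle rules listed above, to make the intended contract concrete. Everything here is hypothetical and not taken from the existing saga.py: the names TaskSaga, SagaState, TerminalStateError, and LeaseError, the float timestamps, and the ttl parameter are all illustrative assumptions. The sketch treats all four terminal states as final and does not model a failed-to-compensated transition, which the ticket leaves open.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class SagaState(str, Enum):
    PLANNED = "planned"
    DISPATCHED = "dispatched"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATED = "compensated"
    QUARANTINED = "quarantined"


TERMINAL = {
    SagaState.COMPLETED,
    SagaState.FAILED,
    SagaState.COMPENSATED,
    SagaState.QUARANTINED,
}


class TerminalStateError(RuntimeError):
    """Raised instead of silently ignoring a mutation of a terminal task."""


class LeaseError(RuntimeError):
    """Raised when a lease cannot be acquired or extended."""


@dataclass
class TaskSaga:  # hypothetical shape, not the real saga.py model
    issue_number: int
    branch: str
    worktree: str
    compensations: list[str] = field(default_factory=list)
    state: SagaState = SagaState.PLANNED
    lease_owner: str | None = None
    lease_expires_at: float | None = None

    def _require_mutable(self) -> None:
        if self.state in TERMINAL:
            raise TerminalStateError(f"task #{self.issue_number} is {self.state.value}")

    def is_stale(self, now: float) -> bool:
        # Stale: not terminal, a lease was granted, and that lease has expired.
        return (
            self.state not in TERMINAL
            and self.lease_expires_at is not None
            and now >= self.lease_expires_at
        )

    def acquire_lease(self, owner: str, now: float, ttl: float) -> None:
        self._require_mutable()
        held = self.lease_owner is not None and not self.is_stale(now)
        if held and self.lease_owner != owner:
            raise LeaseError(f"lease held by {self.lease_owner}")
        self.lease_owner = owner
        self.lease_expires_at = now + ttl
        self.state = SagaState.RUNNING

    def heartbeat(self, owner: str, now: float, ttl: float) -> None:
        self._require_mutable()
        if self.lease_owner != owner or self.is_stale(now):
            raise LeaseError("heartbeat from non-owner or after expiry")
        self.lease_expires_at = now + ttl

    def mark_terminal(self, state: SagaState) -> None:
        self._require_mutable()
        if state not in TERMINAL:
            raise ValueError(f"{state.value} is not a terminal state")
        self.state = state
        self.lease_owner = None
        self.lease_expires_at = None
```

Passing `now` explicitly rather than reading the clock inside the methods keeps expiry tests deterministic, which matters for the "expired task is reported stale" acceptance test.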

Acceptance Tests

  • New task starts leaseable and becomes running after lease acquisition.
  • Heartbeat extends the lease.
  • Expired task is reported stale.
  • Completed task cannot be leased again.
  • Failed task requires or preserves a compensation record.
  • Reopening the store preserves task state, lease expiry, and compensations.
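
The reopen requirement could be sketched with the standard-library sqlite3 module as follows. SQLite as the backend is an assumption, as are the SqliteSagaStore class, the task_sagas table, and its columns; the ticket does not specify a storage engine or schema. Branch and worktree columns are omitted for brevity.

```python
from __future__ import annotations

import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_sagas (
    issue_number     INTEGER PRIMARY KEY,
    state            TEXT NOT NULL,
    lease_owner      TEXT,
    lease_expires_at REAL,
    compensations    TEXT NOT NULL
)
"""

ACTIVE = "('planned', 'dispatched', 'running')"


class SqliteSagaStore:  # hypothetical store, schema is illustrative only
    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(SCHEMA)
        self._db.commit()

    def create(self, issue_number: int, compensations: list[str]) -> None:
        with self._db:  # commits on success, rolls back on exception
            self._db.execute(
                "INSERT INTO task_sagas VALUES (?, 'planned', NULL, NULL, ?)",
                (issue_number, json.dumps(compensations)),
            )

    def acquire_lease(self, issue_number: int, owner: str, now: float, ttl: float) -> None:
        # One conditional UPDATE: succeeds only for a non-terminal task
        # with no lease or with an expired lease.
        with self._db:
            cur = self._db.execute(
                "UPDATE task_sagas SET state = 'running', lease_owner = ?, "
                "lease_expires_at = ? WHERE issue_number = ? "
                f"AND state IN {ACTIVE} "
                "AND (lease_expires_at IS NULL OR lease_expires_at <= ?)",
                (owner, now + ttl, issue_number, now),
            )
        if cur.rowcount != 1:
            # Fail loudly rather than silently ignoring the refused mutation.
            raise RuntimeError(f"lease refused for task #{issue_number}")

    def stale(self, now: float) -> list[int]:
        rows = self._db.execute(
            f"SELECT issue_number FROM task_sagas WHERE state IN {ACTIVE} "
            "AND lease_expires_at <= ? ORDER BY issue_number",
            (now,),
        )
        return [row[0] for row in rows]

    def get(self, issue_number: int) -> dict | None:
        row = self._db.execute(
            "SELECT state, lease_owner, lease_expires_at, compensations "
            "FROM task_sagas WHERE issue_number = ?",
            (issue_number,),
        ).fetchone()
        if row is None:
            return None
        return {
            "state": row[0],
            "lease_owner": row[1],
            "lease_expires_at": row[2],
            "compensations": json.loads(row[3]),
        }

    def close(self) -> None:
        self._db.close()
```

A reopen test would then create a task, acquire a lease, close the store, construct a second SqliteSagaStore on the same path, and assert that get() returns the same state, lease expiry, and compensations. Pointing the path at a pytest tmp_path keeps the test away from real worktrees.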

Non-goals

  • Do not implement microVM isolation here.
  • Do not delete real worktrees in tests.
  • Do not change worker execution behavior until the store contract is green.
  • Do not silently ignore terminal-state mutations.

Verification

Run at minimum:

  • New task saga store tests.
  • env -u VIRTUAL_ENV uv run --extra dev pytest tests/test_eventlog_sqlite.py -q if the store writes events.
  • env -u VIRTUAL_ENV uv run --extra dev ruff check <changed files>
  • env -u VIRTUAL_ENV uv run --extra dev ruff format --check <changed files>

Customer Story

An operator who dispatches workers on a real repository, without trusting their context or code, benefits because stale or crashed workers become explicit, recoverable task states.

Source

Expanded during Forge self-dogfood sprint planning on 2026-06-02.
