Skip to content

feat(tasks): persist saga leases and stale recovery state#184

Merged
hadamrd merged 6 commits into
trunkfrom
loop/168-feat-tasks-persist-saga-leases-and-stale
Jun 3, 2026
Merged

feat(tasks): persist saga leases and stale recovery state#184
hadamrd merged 6 commits into
trunkfrom
loop/168-feat-tasks-persist-saga-leases-and-stale

Conversation

@hadamrd
Copy link
Copy Markdown
Owner

@hadamrd hadamrd commented Jun 2, 2026

Fixes #168

Acceptance criteria

  • Creates planned task sagas with issue, branch, worktree, and registered compensations.
  • Acquires leases with owner ID, expiry, and initial heartbeat timestamp.
  • Extends heartbeats before expiry and reports expired non-terminal leases as stale.
  • Marks terminal states and rejects lease/mutation attempts on terminal tasks except explicit audit metadata.
  • Requires or preserves compensation records for failed tasks.
  • Reopening SQLite preserves task state, lease expiry, heartbeat, and compensations.

Tests

  • env -u VIRTUAL_ENV uv run --extra dev pytest tests/test_task_saga_store.py::TestTaskSagaLeaseLifecycle -q
  • env -u VIRTUAL_ENV uv run --extra dev pytest tests/test_task_saga_store.py -q
  • env -u VIRTUAL_ENV uv run --extra dev ruff check src/forge_loop/tasks/saga.py src/forge_loop/tasks/store.py src/forge_loop/tasks/init.py src/forge_loop/_testing/task_saga_store.py tests/test_task_saga_store.py
  • env -u VIRTUAL_ENV uv run --extra dev ruff format --check src/forge_loop/tasks/saga.py src/forge_loop/tasks/store.py src/forge_loop/tasks/init.py src/forge_loop/_testing/task_saga_store.py tests/test_task_saga_store.py
  • env -u VIRTUAL_ENV uv run --extra dev mypy src/forge_loop/tasks/saga.py src/forge_loop/tasks/store.py src/forge_loop/_testing/task_saga_store.py
  • env -u VIRTUAL_ENV uv run --extra dev pyright src/forge_loop/tasks/saga.py src/forge_loop/tasks/store.py src/forge_loop/_testing/task_saga_store.py tests/test_task_saga_store.py

Lumen discovery was skipped because mcp__lumen__semantic_search is unavailable in this session.

Persist task saga lease ownership, expiry, heartbeats, stale-worker detection, terminal lifecycle transitions, and compensation records for #168.

This gives operators a durable control-plane record for crashed or expired workers while keeping terminal tasks immutable except for explicit audit metadata.
@hadamrd hadamrd added the critic:blocking Critic found blocking issues label Jun 2, 2026
Copy link
Copy Markdown
Owner Author

@hadamrd hadamrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[sev1/correctness] Lease acquisition is a read-then-write transition: two workers can both read an unleased or expired non-terminal task, pass the lease check, and then both write a RUNNING lease, with the last writer silently winning. That breaks the lease ownership guarantee for stale-worker recovery. Make acquisition atomic, e.g. use a transaction/BEGIN IMMEDIATE or a conditional UPDATE/INSERT with a WHERE clause that checks terminal state and expiry, then raise LeaseConflictError when no row is claimed.

) -> TaskSaga:
if expires_at <= acquired_at:
raise LeaseConflictError("lease expiry must be after acquisition time")
saga = self._require_mutable(task_id)
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[sev1/correctness] Lease acquisition is a read-then-write transition: two workers can both read an unleased or expired non-terminal task, pass the lease check, and then both write a RUNNING lease, with the last writer silently winning. That breaks the lease ownership guarantee for stale-worker recovery. Make acquisition atomic, e.g. use a transaction/BEGIN IMMEDIATE or a conditional UPDATE/INSERT with a WHERE clause that checks terminal state and expiry, then raise LeaseConflictError when no row is claimed.

Copy link
Copy Markdown
Owner Author

@hadamrd hadamrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[sev1/tests] The issue requires marking terminal states completed, failed, compensated, and quarantined, but the tests only exercise mark_failed; completed is created via put(), and mark_completed/mark_compensated/mark_quarantined can be broken while this suite stays green. Add assertions that each terminal marker writes the expected TaskState and terminal_reason and rejects later mutation/lease attempts.

assert real.get("task-running") == fake.get("task-running")


class TestTaskSagaLeaseLifecycle:
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[sev1/tests] The issue requires marking terminal states completed, failed, compensated, and quarantined, but the tests only exercise mark_failed; completed is created via put(), and mark_completed/mark_compensated/mark_quarantined can be broken while this suite stays green. Add assertions that each terminal marker writes the expected TaskState and terminal_reason and rejects later mutation/lease attempts.

Copy link
Copy Markdown
Owner Author

@hadamrd hadamrd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manifesto violations:

  • [sev2] testing.md#TE-002 — `+ def heartbeat(
  •    self,
    
  •    task_id: str,
    
  •    *,
    
  •    owner_id: str,
    
  •    heartbeat_at: dat` → Add at least one adversarial heartbeat test, such as wrong owner, no active lease, expired lease, or expires_at that does not extend the current lease, and assert LeaseConflictError plus unchanged stored state.
    

MAJDOUB Khalid added 5 commits June 3, 2026 02:15
Address #184 critic findings for #168: guarded lease acquisition now uses a conditional SQLite update so stale readers cannot overwrite a concurrent claimant, and tests cover concurrent claims, heartbeat wrong-owner preservation, and terminal marker methods.
@hadamrd hadamrd removed the critic:blocking Critic found blocking issues label Jun 3, 2026
@hadamrd
Copy link
Copy Markdown
Owner Author

hadamrd commented Jun 3, 2026

Manual monitor follow-up after Forge critic on #168.\n\nAdded fixes on top of worker output:\n- atomic lease acquisition and heartbeat transitions with adversarial race coverage\n- terminal-state and failed-compensation invariants on public put paths\n- stale generic put protection so active leases cannot be cleared\n- duplicate create rejection\n- UTC normalization for lease timestamp storage/comparison\n- fake store parity for acquire/heartbeat/reclaim behavior\n\nVerification on afb858e:\n- env -u VIRTUAL_ENV uv run --extra dev pytest tests/test_task_saga_store.py -q -> 20 passed\n- ruff check focused task files -> passed\n- ruff format --check focused task files -> passed\n- mypy focused task files -> passed\n- pyright focused task files/tests -> 0 errors\n- GitHub replay -> success\n- direct Forge critic -> only non-blocking sev3 no-actionable-defect note

@hadamrd hadamrd merged commit 21fb482 into trunk Jun 3, 2026
2 checks passed
@hadamrd hadamrd deleted the loop/168-feat-tasks-persist-saga-leases-and-stale branch June 3, 2026 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(tasks): persist saga leases and stale-worker recovery decisions

1 participant