feat(tasks): persist saga leases and stale recovery state#184
Conversation
Persist task saga lease ownership, expiry, heartbeats, stale-worker detection, terminal lifecycle transitions, and compensation records for #168. This gives operators a durable control-plane record for crashed or expired workers while keeping terminal tasks immutable except for explicit audit metadata.
hadamrd
left a comment
There was a problem hiding this comment.
[sev1/correctness] Lease acquisition is a read-then-write transition: two workers can both read an unleased or expired non-terminal task, pass the lease check, and then both write a RUNNING lease, with the last writer silently winning. That breaks the lease ownership guarantee for stale-worker recovery. Make acquisition atomic, e.g. use a transaction/BEGIN IMMEDIATE or a conditional UPDATE/INSERT with a WHERE clause that checks terminal state and expiry, then raise LeaseConflictError when no row is claimed.
| ) -> TaskSaga: | ||
| if expires_at <= acquired_at: | ||
| raise LeaseConflictError("lease expiry must be after acquisition time") | ||
| saga = self._require_mutable(task_id) |
There was a problem hiding this comment.
[sev1/correctness] Lease acquisition is a read-then-write transition: two workers can both read an unleased or expired non-terminal task, pass the lease check, and then both write a RUNNING lease, with the last writer silently winning. That breaks the lease ownership guarantee for stale-worker recovery. Make acquisition atomic, e.g. use a transaction/BEGIN IMMEDIATE or a conditional UPDATE/INSERT with a WHERE clause that checks terminal state and expiry, then raise LeaseConflictError when no row is claimed.
hadamrd
left a comment
There was a problem hiding this comment.
[sev1/tests] The issue requires marking terminal states completed, failed, compensated, and quarantined, but the tests only exercise mark_failed; completed is created via put(), and mark_completed/mark_compensated/mark_quarantined can be broken while this suite stays green. Add assertions that each terminal marker writes the expected TaskState and terminal_reason and rejects later mutation/lease attempts.
| assert real.get("task-running") == fake.get("task-running") | ||
|
|
||
|
|
||
| class TestTaskSagaLeaseLifecycle: |
There was a problem hiding this comment.
[sev1/tests] The issue requires marking terminal states completed, failed, compensated, and quarantined, but the tests only exercise mark_failed; completed is created via put(), and mark_completed/mark_compensated/mark_quarantined can be broken while this suite stays green. Add assertions that each terminal marker writes the expected TaskState and terminal_reason and rejects later mutation/lease attempts.
hadamrd
left a comment
There was a problem hiding this comment.
Manifesto violations:
- [sev2] testing.md#TE-002 — `+ def heartbeat(
-
self, -
task_id: str, -
*, -
owner_id: str, -
heartbeat_at: dat` → Add at least one adversarial heartbeat test, such as wrong owner, no active lease, expired lease, or expires_at that does not extend the current lease, and assert LeaseConflictError plus unchanged stored state.
|
Manual monitor follow-up after Forge critic on #168.\n\nAdded fixes on top of worker output:\n- atomic lease acquisition and heartbeat transitions with adversarial race coverage\n- terminal-state and failed-compensation invariants on public put paths\n- stale generic put protection so active leases cannot be cleared\n- duplicate create rejection\n- UTC normalization for lease timestamp storage/comparison\n- fake store parity for acquire/heartbeat/reclaim behavior\n\nVerification on afb858e:\n- env -u VIRTUAL_ENV uv run --extra dev pytest tests/test_task_saga_store.py -q -> 20 passed\n- ruff check focused task files -> passed\n- ruff format --check focused task files -> passed\n- mypy focused task files -> passed\n- pyright focused task files/tests -> 0 errors\n- GitHub replay -> success\n- direct Forge critic -> only non-blocking sev3 no-actionable-defect note |
Fixes #168
Acceptance criteria
Tests
Lumen discovery was skipped because mcp__lumen__semantic_search is unavailable in this session.