Spec
Specification: specs/064-durable-execution/spec.md (added in commit 65f3a525 on branch feat/durable-execution-layer).
The spec was produced via a spec-driven design chain with adversarial validation: architecture → critic + security + performance review (all returned significant) → revision → re-validation (all cleared) → spec review. 14 invariants, 15 functional requirements, 7 measurable NFRs, 10 criterion benchmarks.
Summary
zeph-durable is a new Layer-0 in-process durable execution crate that adds crash-resumable execution to Zeph without an external runtime (we evaluated and rejected Restate-as-core; it remains an optional backend). It journals execution control flow and replays it after a crash, turning today's "discard orphaned tool pairs" behavior into a resume.
Core: append-only journal with deterministic replay (StepId by call position + ReplayDivergence fingerprint guard), DurableStep with per-EffectClass exactly-once semantics (two-phase EffectIntent/StepResult, per-class OnAmbiguous), a background JournalWriter actor on a dedicated durable.db pool (fsync off the hot path), vault-keyed XChaCha20-Poly1305 payload cipher (AAD-bound), durable promises (resolver-token auth) and timers, and a sealed ExecutionBackend trait (always-on LocalBackend + optional restate backend in the server bundle).
Scope — implement in phases (separate PRs)
- zeph-durable core — journal,
DurableContext/DurableStep, JournalWriter actor, EffectClass/OnAmbiguous, IdempotencyKey, ReplayDivergence, PayloadCipher, DurablePromise/DurableTimer, sealed ExecutionBackend + LocalBackend, schema in zeph-db/migrations, zeph durable CLI, config [durable], --init/--migrate-config.
- P1 — agent tool loop (highest reliability value). Explicit change in
zeph-agent-tools/tier_loop.rs (LLM call + tool dispatch wrapped as durable steps). Triggers the LLM-serialization live-API gate — requires a live multi-turn + tool-call session test before merge.
- P2 — orchestration
/plan resume replan-counter restore (narrow scope; not auto-crash-recovery).
- P3 — scheduler exactly-once job fire.
- P4 — subagent durable promise spawn/await.
Each adapter is opt-in via config and degrades to today's behavior when disabled.
Key acceptance criteria (testable)
- Crash mid-turn (mid tool/LLM step) resumes from the first un-journaled step on restart; no double-emission of already-printed output; no double-persisted messages (invariant 057).
bench_step_run_exactly_once_n at N=5 completes in <= 5 ms on CI SSD (hot-path regression gate).
ExactlyOnceGuarded effect crashed in the ambiguous window resolves per OnAmbiguous (paid-LLM Skip; destructive Fail + audit); destructive step with unspecified policy is a construction-time error.
- Journal payloads are AEAD-encrypted at rest; a tampered journal entry fails authentication on replay (no forged tool/LLM result injection).
durable.db lives on its own pool; durable schema is applied via zeph_db::run_migrations with no sqlx::migrate! in zeph-durable (single migration runner, invariant 031).
- Restate backend only compiles under the
restate feature; absent it, LocalBackend is the default and the binary has no Restate dependency.
Suggested next step
Run /rust-agents:team-develop new-feature on this issue, starting with phase 1 (zeph-durable core). The architect and developer will pick up specs/064-durable-execution/spec.md automatically. Implement phases 2-5 as follow-up PRs gated on phase 1.
Housekeeping (separate, out of scope)
specs/ has pre-existing duplicate numbers (two 057-*, two 058-*). An automated spec-number picker could re-collide; the autoskill duplicates should be renumbered in a separate task.
Spec
Specification:
specs/064-durable-execution/spec.md(added in commit65f3a525on branchfeat/durable-execution-layer).The spec was produced via a spec-driven design chain with adversarial validation: architecture → critic + security + performance review (all returned
significant) → revision → re-validation (all cleared) → spec review. 14 invariants, 15 functional requirements, 7 measurable NFRs, 10 criterion benchmarks.Summary
zeph-durableis a new Layer-0 in-process durable execution crate that adds crash-resumable execution to Zeph without an external runtime (we evaluated and rejected Restate-as-core; it remains an optional backend). It journals execution control flow and replays it after a crash, turning today's "discard orphaned tool pairs" behavior into a resume.Core: append-only journal with deterministic replay (StepId by call position +
ReplayDivergencefingerprint guard),DurableStepwith per-EffectClassexactly-once semantics (two-phaseEffectIntent/StepResult, per-classOnAmbiguous), a backgroundJournalWriteractor on a dedicateddurable.dbpool (fsync off the hot path), vault-keyed XChaCha20-Poly1305 payload cipher (AAD-bound), durable promises (resolver-token auth) and timers, and a sealedExecutionBackendtrait (always-onLocalBackend+ optionalrestatebackend in theserverbundle).Scope — implement in phases (separate PRs)
DurableContext/DurableStep,JournalWriteractor,EffectClass/OnAmbiguous,IdempotencyKey,ReplayDivergence,PayloadCipher,DurablePromise/DurableTimer, sealedExecutionBackend+LocalBackend, schema inzeph-db/migrations,zeph durableCLI, config[durable],--init/--migrate-config.zeph-agent-tools/tier_loop.rs(LLM call + tool dispatch wrapped as durable steps). Triggers the LLM-serialization live-API gate — requires a live multi-turn + tool-call session test before merge./plan resumereplan-counter restore (narrow scope; not auto-crash-recovery).Each adapter is opt-in via config and degrades to today's behavior when disabled.
Key acceptance criteria (testable)
bench_step_run_exactly_once_nat N=5 completes in <= 5 ms on CI SSD (hot-path regression gate).ExactlyOnceGuardedeffect crashed in the ambiguous window resolves perOnAmbiguous(paid-LLM Skip; destructive Fail + audit); destructive step with unspecified policy is a construction-time error.durable.dblives on its own pool; durable schema is applied viazeph_db::run_migrationswith nosqlx::migrate!inzeph-durable(single migration runner, invariant 031).restatefeature; absent it,LocalBackendis the default and the binary has no Restate dependency.Suggested next step
Run
/rust-agents:team-develop new-featureon this issue, starting with phase 1 (zeph-durablecore). The architect and developer will pick upspecs/064-durable-execution/spec.mdautomatically. Implement phases 2-5 as follow-up PRs gated on phase 1.Housekeeping (separate, out of scope)
specs/has pre-existing duplicate numbers (two057-*, two058-*). An automated spec-number picker could re-collide; the autoskill duplicates should be renumbered in a separate task.