feat: add resume session feature (Claude Code-inspired append-only JSONL) #467

ahmedk20 wants to merge 1 commit
Conversation
Greptile Summary

This PR introduces an append-only JSONL session-persistence layer.

Confidence Score: 3/5 — Not ready to merge: one new P1 plus three unresolved P1s from prior review rounds.

Important Files Changed:

- strix/sessions/resume.py (new P1)
- strix/telemetry/tracer.py (broken iteration_count)
- strix/telemetry/session_meta.py (race condition)
```python
if mode == "reopen":
    state.add_message(
        "user",
        "The previous scan session has been reopened. Please summarize the key "
        "findings so far and ask what to investigate or test next.",
    )
```
"Reopen" synthetic message not persisted to JSONL
When mode == "reopen", state.add_message(...) is called on a freshly-constructed AgentState whose _conversation_log is still None — set_conversation_log isn't called until BaseAgent.__init__ later. The synthetic "user" message is added to the in-memory state.messages list but never written to conversation.jsonl.
The LLM then responds to it, and that assistant message IS written (the log is set by then). On a subsequent resume of this reopened session, replay() reconstructs a conversation that starts with an assistant message with no preceding user turn — an invalid message sequence that will likely cause an LLM API error or confuse the model.
Fix: write the synthetic message to the log explicitly after the log is wired up (e.g., in BaseAgent.__init__ after set_conversation_log, check for a pending synthetic message), or pass a flag so ConversationLog.append_message can be called immediately after the log is attached.
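A sketch of the deferred-persistence option, assuming the `AgentState`/`ConversationLog` shapes described in this review (`_pending_persist` is a hypothetical field name, not in the PR):

```python
# Sketch only — assumes the AgentState/ConversationLog interaction described
# in this review; `_pending_persist` is a hypothetical field.
class AgentState:
    def __init__(self) -> None:
        self.messages: list[dict] = []
        self._conversation_log = None
        self._pending_persist: list[dict] = []  # added before the log exists

    def add_message(self, role: str, content: str) -> None:
        msg = {"role": role, "content": content}
        self.messages.append(msg)
        if self._conversation_log is not None:
            self._conversation_log.append_message(msg)
        else:
            # defer persistence until set_conversation_log wires up the log
            self._pending_persist.append(msg)

    def set_conversation_log(self, log) -> None:
        self._conversation_log = log
        for msg in self._pending_persist:  # flush anything added pre-wiring
            log.append_message(msg)
        self._pending_persist.clear()
```

With this shape, the synthetic "reopen" message added before `BaseAgent.__init__` attaches the log still reaches conversation.jsonl as soon as the log is set.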
Path: `strix/sessions/resume.py`, lines 58-63
```python
"status": "completed" if completed else "errored",
"ended_at": datetime.now(UTC).isoformat(),
"iteration_count": max(
    (a.get("iteration", 0) for ag in self.agents.values()
     for a in ag.get("tool_executions", [])), default=0
),
"vulnerability_count": len(self.vulnerability_reports),
"agent_count": len(self.agents),
},
)
```
iteration_count expression crashes silently — always writes 0
self.agents[agent_id]["tool_executions"] is a list[int] (execution IDs appended in log_tool_execution_start), not a list of dicts. The generator expression calls .get("iteration", 0) on each int, which raises AttributeError. The surrounding try/except Exception: pass swallows the error silently, so iteration_count is never written to session_meta.json and the listing table always shows "?" for every session.
```python
# current (broken) — iterates integers, not dicts
"iteration_count": max(
    (a.get("iteration", 0) for ag in self.agents.values()
     for a in ag.get("tool_executions", [])), default=0
),
```

The tracer doesn't store AgentState.iteration directly, so a straightforward fix is to read the final iteration from the last iteration_end entry written to the conversation log, or store the iteration on the tracer when append_iteration_end is called in BaseAgent.
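One way to implement the suggested log-based fix, assuming the `iteration_end` entry shape from this PR's Architecture section (`final_iteration` is a hypothetical helper name):

```python
import json
from pathlib import Path


def final_iteration(log_path: Path) -> int:
    """Return the iteration of the last iteration_end entry, or 0 if none.

    Hypothetical helper — assumes one JSON object per line, as written by
    the append-only conversation.jsonl described in this PR.
    """
    last = 0
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry.get("type") == "iteration_end":
                last = entry.get("iteration", last)
    return last
```

The tracer's cleanup path could then write `"iteration_count": final_iteration(run_dir / "conversation.jsonl")` instead of the broken generator expression.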
Path: `strix/telemetry/tracer.py`, lines 896-905
```python
def write_session_meta(run_dir: Path, meta: dict[str, Any]) -> None:
    """Atomically merge *meta* into session_meta.json, preserving user fields."""
    path = run_dir / "session_meta.json"
    existing = _read_raw(path) or {}

    merged = {**existing, **meta}
    # Always preserve user-editable fields from existing file
    merged["title"] = existing.get("title", merged.get("title"))
    merged["tags"] = existing.get("tags", merged.get("tags", []))
    merged["last_updated"] = datetime.now(UTC).isoformat()

    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")
    os.replace(tmp, path)
```
Non-atomic read-modify-write in write_session_meta
os.replace makes the final rename atomic, but the read→merge→write cycle is not protected by a lock. If two threads call write_session_meta concurrently (e.g., _write_initial_session_meta from set_scan_config and _finalize_session_meta from cleanup racing on shutdown), both can read the same old state and the later write silently discards the earlier one's updates. Consider adding a per-file lock (similar to the one used in ConversationLog._append) around the entire read-merge-write block.
Path: `strix/telemetry/session_meta.py`, lines 10-23
Pulls usestrix#467 (feat: add resume session feature) ahead of upstream merge. Enables unattended scans to survive crashes or forced termination (e.g. AWS STS session expiry mid-scan) by appending every LLM message to `strix_runs/<run>/conversation.jsonl` and replaying on `--resume`.

New CLI:

```
strix --list-sessions      # tabulate past scans
strix --continue (or -c)   # resume most recent
strix --resume <run_name>  # resume by name
strix --resume             # open interactive picker
```

Motivates: https://github.com/seedcx/strix-scan-workflow/actions/runs/25116247499 — the trade-api flag-aware scan exited at 71min with APIConnectionError ("security token included in the request is expired"). The session had produced 10 findings; all were lost once the report upload also hit ExpiredToken. With this commit merged, the CI composite action can wrap strix in a re-assume + `--continue` loop so each sub-session fits inside the 60min STS budget and total scan duration is bounded only by the overall workflow timeout.

Upstream integration:

- Once usestrix#467 merges, drop from the seedcx-build recipe
- Until then, include alongside usestrix#454 (memory compression), usestrix#460 (truncate retry), usestrix#468 (thinking retry), this one, and the adaptive-thinking commit on this branch

Clean 3-way merge, no conflicts. Smoke tests:

- `uv sync` succeeds
- `strix --help` surfaces `--resume` / `--continue` / `--list-sessions`
- imports (`strix.sessions.resume`, `strix.telemetry.conversation_log`) OK

Follow-up (separate): strix-scan-workflow README recipe + composite action consumer changes to invoke resume on re-dispatch.
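The re-assume + `--continue` wrapper could be sketched roughly like this (a hypothetical illustration, not part of this PR; the round runner is injected so the CI action can re-assume AWS credentials before each invocation):

```python
from typing import Callable


def run_with_resume(run_round: Callable[[list[str]], int], max_rounds: int = 10) -> int:
    """Run strix, resuming with --continue until a round exits cleanly.

    Hypothetical CI wrapper: `run_round` executes one strix invocation
    (e.g. via subprocess, after re-assuming STS credentials) and returns
    its exit code.
    """
    for round_no in range(max_rounds):
        # first round starts the scan; later rounds resume the same session
        args = ["strix", "--continue"] if round_no else ["strix"]
        if run_round(args) == 0:
            return 0  # scan completed within this sub-session's STS budget
    return 1  # gave up; overall duration stays bounded by max_rounds
```

Each round then fits inside the 60min STS window while the resumed session carries the full conversation forward.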
Summary
This PR adds a resume session feature that lets users continue a past penetration test scan from where it left off — including the full LLM conversation history, iteration state, and agent context.
Rather than a snapshot checkpoint (which loses progress on crashes), this uses an append-only conversation.jsonl file that writes every LLM message to disk in real time, inspired by how Claude Code persists sessions.
Why not the checkpoint approach (from test-resume branch)?
The previous test-resume attempt saved a checkpoint.json snapshot periodically. The fatal flaw: a crash between saves loses all progress since the last checkpoint. It also blocked resuming completed scans and only worked in --non-interactive mode.
Architecture
Two new files per run:
- `strix_runs/{run_name}/conversation.jsonl` — full LLM conversation, appended in real time
- `strix_runs/{run_name}/session_meta.json` — lightweight sidecar for fast listing (no need to parse the full log)
conversation.jsonl entry types:
```json
{"type": "session_start", "scan_config": {...}, "schema_version": 1, ...}
{"type": "message", "role": "user", "content": "...", "iteration": 1, ...}
{"type": "message", "role": "assistant", "content": [...], "iteration": 1, ...}
{"type": "iteration_end", "iteration": 1, "context": {...}, "completed": false, ...}
{"type": "session_end", "completed": true, ...}
```
On resume: replay message entries → AgentState.messages, latest iteration_end → iteration/context.
Crash safety: even a hard kill (kill -9) leaves all messages written. The worst case is a missing iteration_end marker — the iteration count is inferred from message count instead.
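A minimal replay sketch, assuming the entry types listed above (the real `replay()` in `strix/sessions/resume.py` may differ), including the fallback for when a hard kill leaves no trailing `iteration_end` marker:

```python
import json
from pathlib import Path


def replay(log_path: Path) -> tuple[list[dict], int]:
    """Rebuild (messages, iteration) from an append-only conversation.jsonl.

    Sketch only — entry shapes assumed from this PR's description.
    """
    messages: list[dict] = []
    iteration = 0
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("type") == "message":
            messages.append({"role": entry["role"], "content": entry["content"]})
            # fallback: message entries also carry the iteration, so a missing
            # iteration_end marker (kill -9) still yields a usable count
            iteration = max(iteration, entry.get("iteration", 0))
        elif entry.get("type") == "iteration_end":
            iteration = max(iteration, entry.get("iteration", 0))
    return messages, iteration
```

Because every entry is a complete JSON line, replay never has to repair a half-written snapshot — at worst it drops one torn trailing line.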
Changes
New files:
Modified files:
Test plan
Manual QA checklist: