Add Strix Checkpoint Feature (Custom Modification): resume/checkpoint system for interrupted scans #380

Open
Ahmex000 wants to merge 18 commits into usestrix:main from Ahmex000:main

Conversation

@Ahmex000

Strix Checkpoint Feature (Custom Modification)

I implemented a simple modification that introduces an important feature:
the ability to stop a running scan and resume it later without losing progress or restarting from scratch.

🔹 How It Works

Run Strix with a --run-name to create a checkpoint:

strix --target http://php.testinvicti.com/ --run-name my-scan
  • This saves a checkpoint under the name: my-scan.
  • You can resume the scan at any time using the same command.

🔹 Starting a New Scan with the Same Name

If you want to start a fresh scan using the same name:

strix --target http://php.testinvicti.com/ --run-name my-scan --new

🔹 Using a Different Checkpoint Name

Alternatively, you can simply use a new run name:

strix --target http://php.testinvicti.com/ --run-name my-scan-2
  • Note: I have fixed the issue in the previous code (https://github.com/usestrix/strix/pull/373), where you could only create one checkpoint; in the latest update you can pause and resume more than once.

  • Note: I have fixed the issue in the previous code (https://github.com/usestrix/strix/pull/378). Previously there were some issues with file locations, but the process is now simpler and lighter on the system.

The current version works without any problems, and you can pause and resume scans multiple times.

Ahmex000 and others added 15 commits March 19, 2026 07:30
- New strix/telemetry/checkpoint.py: Pydantic CheckpointModel + CheckpointManager
  with atomic writes (.tmp → rename), non-fatal errors, target-hash validation
- base_agent.py: save checkpoint after every iteration (root agents only),
  delete on clean completion, guard against duplicate task message on resume
- main.py: add --run-name, --resume, --new/--force-new CLI flags;
  _setup_checkpoint_on_args() handles load/validate/corrupt-recovery
- cli.py: resume banner, history replay (previous vulns + last 3 thoughts),
  restore AgentState with fresh sandbox + extended max_iterations budget
- tui.py: pre-populate tracer from checkpoint, restore AgentState in agent_config
- README.md: add "Resuming Interrupted Scans" section with usage examples

Original scan behavior is 100% preserved when --run-name is not used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
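The atomic write mentioned above (write to a `.tmp` file, then rename) can be sketched as follows. This is a minimal illustration, not the actual `CheckpointManager` API; the function name and payload shape are assumptions:

```python
import json
import os
from pathlib import Path
from typing import Any


def save_checkpoint(path: Path, data: dict[str, Any]) -> None:
    """Write a checkpoint atomically: a crash mid-write leaves the old file intact."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)  # atomic rename: readers see the old or new file, never a partial one
```

Because `os.replace` is atomic on the same filesystem, a reader can never observe a half-written checkpoint, which is what makes the "non-fatal errors" property possible.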
Previously checkpoints only saved tracer.chat_messages and
tracer.vulnerability_reports, leaving tracer.agents and
tracer.tool_executions empty on resume — so all sub-agents
(both in-progress and completed) were invisible after resuming.

Changes:
- checkpoint.py: add tracer_agents, tracer_tool_executions,
  tracer_next_execution_id fields to CheckpointModel; populate
  them in CheckpointManager.save() from the live tracer
- cli.py: on resume, restore agents dict, tool_executions dict,
  and advance _next_execution_id to avoid ID collisions
- tui.py: same restore logic so TUI sidebar shows all previous
  agents and their tool results immediately on resume

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous guard `if not self.state.messages` broke sub-agents because
they can have pre-loaded context messages in their state before agent_loop
is called. This caused them to start without a task and produce no output.

Fix: only skip the initial task message when parent_id is None AND messages
is already populated (= root agent resume). Sub-agents always get their
task message regardless of whether their state has prior context.

- Fresh root agent:        parent_id=None, messages=[]   → adds task ✓
- Fresh sub-agent:         parent_id=set,  messages=[]   → adds task ✓
- Sub-agent with context:  parent_id=set,  messages=[..] → adds task ✓
- Resumed root agent:      parent_id=None, messages=[..] → skips  ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three bugs fixed:

1. Ghost sub-agents (root cause of all sub-agent issues):
   Restoring tracer.agents/tool_executions injected old sub-agent
   entries that had no live instances. The TUI showed them as
   interactive but they could not receive messages or run. Worse,
   they polluted the agent message-routing system so new sub-agents
   spawned after resume failed to communicate with the root agent.
   Fix: only restore chat_messages, vulnerability_reports, and the
   execution ID counter. The root agent's LLM context (message
   history) already knows what all sub-agents did.

2. Root agent stuck in wait state after resume:
   If the scan was interrupted while the root agent was in a wait
   state (waiting_for_input=True, stop_requested=True, etc.) the
   restored AgentState had those flags set and the loop froze
   immediately. Fix: reset all blocking flags on restore in both
   cli.py and tui.py.

3. Completed flag causing instant exit:
   If completed=True was serialised into the checkpoint (edge case)
   the loop would exit on the first should_stop() check. Fix:
   reset completed=False on restore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sume

Defines the missing helper function called in cli.py and adds the equivalent
to tui.py. Injects a user message into the restored AgentState so the LLM
knows the scan was interrupted and must continue rather than call finish_scan
or agent_finish due to an abruptly-ended message history.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of resume not working:
- generate_run_name() adds a random suffix every time, so without
  --run-name the checkpoint from a previous session was never found.
- Ctrl+C during the first iteration (before any checkpoint was saved)
  left no checkpoint to resume from.

Fixes:
1. _find_checkpoint_by_target_hash(): scans strix_runs/ for the most
   recent checkpoint whose target_hash matches the current targets.
   Now running `strix --target example.com` again automatically
   resumes the last interrupted scan without needing --run-name.

2. _save_checkpoint_on_interrupt(): saves current agent state in both
   the signal handler and atexit in cli.py and tui.py, so a Ctrl+C
   mid-first-iteration still produces a valid checkpoint.

3. _setup_checkpoint_on_args() restructured: handles run_name=None,
   --force-new, explicit --run-name, and auto-detect in one place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root causes:
1. Resume message said "re-spawn sub-agents" but didn't say WHICH ones
   or that old agent IDs are dead — LLM tried to interact with old IDs
   and got confused.
2. send_message_to_agent returned unhelpful "not found" error when the
   LLM used old (dead) agent IDs after resume.

Fixes:
- _build_resume_context_message / _inject_resume_context_message now
  accept the full checkpoint_data object and extract tracer_agents to
  list every non-completed sub-agent by name and task. The LLM now
  knows exactly which agents to re-spawn.
- Message explicitly forbids interacting with any agent ID from history
  and instructs the LLM to call view_agent_graph first.
- send_message_to_agent returns an actionable error when target is not
  found: explains it may be a dead session ID and tells the LLM to use
  view_agent_graph then create_agent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… context

Root cause: create_agent passes the parent's full conversation history
(inherited_messages) to each new sub-agent. After resume, the parent's
history ends with a [SYSTEM - SCAN RESUMED] message that says:
  "ALL previous sub-agents have been terminated"
  "Do NOT call agent_finish unless all testing is genuinely complete"

Sub-agents reading this in their inherited context got confused:
- They thought they were the "terminated" agents and shouldn't be running
- They avoided calling agent_finish even when their task was done
- This caused them to hang, loop, or exit immediately without reporting

Fix: filter out any [SYSTEM - SCAN RESUMED] messages from the inherited
context before giving it to sub-agents. The resume instructions are only
relevant to the root agent — sub-agents should see normal parent context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
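The filtering fix can be sketched as below. The message-dict shape (`role`/`content` keys) is an assumption for illustration; the marker strings are the ones named in the commit messages:

```python
RESUME_MARKERS = ("[SYSTEM - SCAN RESUMED]", "[SYSTEM - SUB-AGENT RESUMED]")


def filter_inherited_messages(messages: list[dict]) -> list[dict]:
    """Drop resume-only system messages before handing the parent's
    history to a freshly spawned sub-agent, so the child never reads
    instructions meant only for the resumed root agent."""
    return [
        m for m in messages
        if not any(marker in str(m.get("content", "")) for marker in RESUME_MARKERS)
    ]
```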
Previously sub-agents were terminated on Ctrl+C and never properly restored
— only the root agent resumed. This is the full fix.

Architecture change:
- checkpoint.py: Added sub_agent_states field (dict[agent_id -> AgentState
  dump]) saved from _agent_instances at every checkpoint write. Every
  currently-running non-root agent is captured.

- base_agent.py: Replaced fragile _is_root_resume heuristic with an explicit
  is_resumed flag (set via agent config). Works for both root and sub-agents.
  Prevents duplicate task message from being added to restored agents.

- cli.py / tui.py: Added _restore_sub_agents() which, on resume, iterates
  checkpoint sub_agent_states in topological order (parents before children),
  restores each agent's full AgentState, resets blocking flags, clears the
  old sandbox, injects a [SYSTEM - SUB-AGENT RESUMED] message, and spawns
  each agent in a daemon thread — identical to how the root agent is handled.
  Sub-agents are spawned BEFORE execute_scan so root agent can communicate
  with them immediately using their original IDs.

- Root agent's resume message now says "these sub-agents are ALREADY RUNNING
  at IDs [X, Y]" instead of "re-spawn them" — prevents double-spawning.

- agents_graph_actions.py: [SYSTEM - SUB-AGENT RESUMED] filtered from
  inherited context alongside [SYSTEM - SCAN RESUMED] so freshly-spawned
  child agents never see these system markers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add _needs_fresh_container flag: always create a fresh container on
  first call to _get_or_create_container in a new runtime instance,
  preventing reuse of stale containers from a previous session whose
  async docker-rm hasn't completed yet.

- Add _cleanup_existing_containers(): uses subprocess.run (synchronous
  docker rm -f) instead of the SDK remove() which returns before Docker
  fully frees the container name, causing 409 Conflict on containers.run().
  Searches by both name filter and strix-scan-id label to catch containers
  in mid-removal state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
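The synchronous cleanup can be sketched as below. The helper names are illustrative; the point is that `subprocess.run` blocks until `docker rm -f` returns, by which time the daemon has freed the container name, unlike the SDK's `remove()`:

```python
import subprocess


def force_remove_command(container_name: str) -> list[str]:
    """Build the synchronous `docker rm -f` invocation."""
    return ["docker", "rm", "-f", container_name]


def cleanup_existing_container(container_name: str) -> None:
    # check=False: removing a container that no longer exists is not an error
    subprocess.run(force_remove_command(container_name), check=False, capture_output=True)
```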
Multiple sub-agent threads starting simultaneously all saw
_needs_fresh_container=True and each called _create_container,
which removes the container the previous thread just created.
Result: only 1 of N sub-agents got a live container; the rest
got 'Container X not found' on every tool call, and the root
agent's sandbox init also failed when sub-agents trashed the
container underneath it.

Fix: add threading.Lock (_container_init_lock) around the
slow path in _get_or_create_container. Only the first thread
to acquire the lock creates the container; all waiting threads
re-check _scan_container inside the lock and reuse the
already-running one, paying zero extra Docker overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
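The locking pattern described above is classic double-checked locking; a self-contained sketch (container creation is stubbed out, and the class name is invented for illustration):

```python
import threading


class ContainerPool:
    """Only the first thread to take the lock creates the container;
    waiting threads re-check inside the lock and reuse it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._container: str | None = None
        self.create_calls = 0  # instrumentation for the sketch

    def _create_container(self) -> str:
        self.create_calls += 1  # stands in for the slow Docker call
        return "container-0"

    def get_or_create(self) -> str:
        if self._container is None:           # fast path: no lock once initialized
            with self._lock:                  # slow path
                if self._container is None:   # re-check: another thread may have won
                    self._container = self._create_container()
        return self._container
```

Threads that lose the race block on the lock, then hit the re-check and return the already-running container, paying no extra Docker overhead.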
Root cause: checkpoint and tracer run directories used CWD-relative
paths (Path("strix_runs") and Path.cwd() / "strix_runs"). Launching
strix from different directories across sessions created separate
checkpoint files that never updated each other, so the third session
always resumed from the first session's iteration.

Fix: use Path.home() / "strix_runs" as the canonical absolute path in
both tracer.py and main.py so all sessions write to the same location
regardless of CWD.

Also includes earlier serialization robustness fixes (mode="json" +
_json_default fallback) and explicit checkpoint save in action_custom_quit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writes SAVED/FAILED entries to ~/strix_checkpoint_debug.log on every
save attempt, bypassing the suppressed warning logger. This lets us
see if saves are happening during resumed sessions and what error (if
any) is causing them to fail silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- config.py: cli-config.json LLM vars now always override shell env,
  preventing stale shell values from reverting the configured model
- checkpoint_restore.py: extract shared restore logic from cli/tui to
  eliminate code duplication
- cli.py / tui.py: use shared checkpoint_restore module, add
  double-save guard via threading.Event
- agents_graph_actions.py: add _agents_lock for thread-safe access to
  _running_agents and _agent_instances, fix mutable default arg in
  restore_sub_agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- checkpoint.py: remove debug log file (strix_checkpoint_debug.log)
  that was writing to every user's home directory on each save
- checkpoint.py: acquire _agents_lock before iterating _agent_instances
  to prevent RuntimeError on concurrent sub-agent creation/removal
- checkpoint_restore.py: register restored sub-agents in _agent_graph,
  _agent_instances, and _agent_states so send_message_to_agent can
  route to them — previously they were unreachable after resume
- tracer.py: revert Path.home() back to Path.cwd() to avoid silent
  breaking change for all users; checkpoint logic in main.py already
  uses Path.home() directly so tracer change was not needed
- cli.py / tui.py: move checkpoint_restore imports to top of file
  per PEP 8, remove noqa: E402 suppressions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR introduces a checkpoint/resume system for Strix scans, allowing users to pause and continue interrupted penetration tests without losing progress. The feature is activated via --run-name, --resume, and --new CLI flags, and checkpoints are saved atomically after each agent iteration to ~/strix_runs/<run-name>/checkpoint.json.

Key changes:

  • strix/telemetry/checkpoint.py (new): CheckpointManager with atomic write (.tmp → rename), CheckpointModel Pydantic schema, and compute_target_hash for checkpoint validation.
  • strix/interface/checkpoint_restore.py (new): restore_sub_agents spawns previously-running sub-agents from checkpoint state in topological order; build_root_resume_message injects a resume context message into the root agent.
  • strix/agents/base_agent.py: Saves checkpoint after every successful iteration (root agents only) and deletes it on clean completion; is_resumed flag suppresses the duplicate task message.
  • strix/interface/cli.py / tui.py: Resume banner, previous-findings replay, checkpoint manager wiring, and a completed guard that prevents re-saving the checkpoint after a clean scan finish (addressing prior review concerns).
  • strix/runtime/docker_runtime.py: Adds _container_init_lock and synchronous _cleanup_existing_containers (via subprocess.run docker rm) to prevent multi-thread container creation races on resume.
  • strix/tools/agents_graph/agents_graph_actions.py: Adds _agents_lock for thread-safe access to _running_agents / _agent_instances; strips resume system messages from inherited sub-agent context.
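The `compute_target_hash` validation mentioned above could plausibly look like the following; the canonicalization rules here are assumptions, not the actual implementation:

```python
import hashlib


def compute_target_hash(targets: list[str]) -> str:
    """Order-independent fingerprint of the scan targets, used to check
    that a checkpoint belongs to the same scan before resuming it."""
    canonical = "\n".join(sorted(t.strip().lower() for t in targets))
    return hashlib.sha256(canonical.encode()).hexdigest()
```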

Two concerns remain in checkpoint_restore.py:

  1. The _run closure that drives restored sub-agents does not replicate the graph-cleanup logic from _run_agent_in_thread — restored sub-agents permanently show status: "running" in view_agent_graph after they finish, which may confuse the root agent.
  2. t.start() is called before the agent is registered in _agent_instances and _agent_graph["nodes"], reversing the ordering used in create_agent and creating a small race window.

Confidence Score: 4/5

Safe to merge for most use cases; two P2 issues in restore_sub_agents may cause stale graph state for multi-agent resumed scans.

The core checkpoint save/load/delete cycle, atomic writes, interrupt guards, and the completed-flag fixes are all correct. The two open issues (stale graph status for restored sub-agents and reversed thread-start vs registration order) only affect the multi-sub-agent resume path and don't cause data loss or crashes — they could cause the root agent LLM to see a slightly inconsistent graph view, potentially leading to duplicated sub-agent spawning in edge cases.

strix/interface/checkpoint_restore.py — the _run closure needs graph-cleanup logic and thread registration should happen before t.start()

Important Files Changed

Filename Overview
strix/telemetry/checkpoint.py New file: CheckpointModel (Pydantic), CheckpointManager with atomic write (.tmp → rename), load, delete, and compute_target_hash helper. Logic is sound and non-fatal on I/O errors.
strix/interface/checkpoint_restore.py New file: restore_sub_agents starts sub-agents in daemon threads without proper graph-status cleanup on exit, and registers agents after thread start (reversed order vs. create_agent). build_root_resume_message is well structured.
strix/interface/cli.py Adds resume banner, output replay, checkpoint manager wiring, and interrupt handlers with completed-guard. Previous-thread concerns about stale checkpoint on clean completion are now fixed.
strix/interface/tui.py Mirrors cli.py resume support: restores tracer state, wires checkpoint manager, adds completed-guard before saving on quit. Previous-thread stale-checkpoint issue is addressed.
strix/interface/main.py Adds --run-name, --resume, --new CLI flags and _setup_checkpoint_on_args which handles all four cases: force-new, auto-detect by target hash, explicit run name with checkpoint, explicit run name without. Target-hash mismatch warning is clear.
strix/agents/base_agent.py Adds checkpoint_manager save call after each successful iteration (root agents only) and delete on clean completion. is_resumed flag prevents duplicate task message injection.
strix/runtime/docker_runtime.py Adds _container_init_lock to prevent multiple threads from racing to create a container on resume, and _cleanup_existing_containers using synchronous docker CLI rm to avoid the async SDK race where the name registry still holds the name.
strix/tools/agents_graph/agents_graph_actions.py Adds _agents_lock threading.Lock to protect _running_agents and _agent_instances; strips resume system messages from inherited context to prevent sub-agent confusion.
strix/config/config.py Removes _llm_env_changed guard (stale LLM keys no longer auto-purged from stored config) and refactors apply logic. The effective precedence rule (shell env wins when set) is unchanged, but see prior thread discussion.
This is a comment left during a code review.
Path: strix/interface/checkpoint_restore.py
Line: 77-92

Comment:
**Restored sub-agents never update graph status on completion**

The `_run` closure calls `agent.agent_loop()` but does nothing when the loop exits. In the normal path (`_run_agent_in_thread`), the thread wrapper updates the agent graph node status to `"completed"` or `"error"` and pops the agent from `_running_agents` / `_agent_instances`:

```python
# _run_agent_in_thread (agents_graph_actions.py) — NOT done by _run
_agent_graph["nodes"][state.agent_id]["status"] = "completed"
_agent_graph["nodes"][state.agent_id]["finished_at"] = datetime.now(UTC).isoformat()
with _agents_lock:
    _running_agents.pop(state.agent_id, None)
    _agent_instances.pop(state.agent_id, None)
```

Because `_run` omits this, every restored sub-agent permanently shows `status: "running"` in `view_agent_graph` after it finishes. The root agent may conclude those sub-agents are still alive, avoid re-spawning needed work, or wait indefinitely for results that have already been delivered.

Consider replacing the bare `_run` closure with a call to `_run_agent_in_thread`, or at minimum replicating the graph-cleanup logic inside `_run`'s `finally` block:

```python
def _run(a: Any = agent, s: Any = state) -> None:
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(a.agent_loop(s.task))
    except Exception:  # noqa: BLE001
        with contextlib.suppress(Exception):
            agents_graph_actions._agent_graph["nodes"][s.agent_id]["status"] = "error"
    else:
        with contextlib.suppress(Exception):
            agents_graph_actions._agent_graph["nodes"][s.agent_id]["status"] = "completed"
    finally:
        loop.close()
        with contextlib.suppress(Exception):
            with agents_graph_actions._agents_lock:
                agents_graph_actions._running_agents.pop(s.agent_id, None)
                agents_graph_actions._agent_instances.pop(s.agent_id, None)
```


---

This is a comment left during a code review.
Path: strix/interface/checkpoint_restore.py
Line: 83-100

Comment:
**Thread starts before agent is registered in the global registry**

`t.start()` is called before the agent is inserted into `_agent_instances`, `_running_agents`, and `_agent_graph["nodes"]`. This creates a window where the sub-agent thread is executing but the global registry doesn't know about it yet.

In the normal path (`create_agent` in `agents_graph_actions.py`), `_agent_instances[state.agent_id] = agent` is set *before* `thread.start()`. The reversed order here means that if the sub-agent's first LLM call returns very quickly and triggers a graph lookup (e.g. `view_agent_graph` or `agent_finish`), the node won't exist yet, potentially leading to a `KeyError`.

Move `t.start()` to after the full registration block:

```python
with agents_graph_actions._agents_lock:
    agents_graph_actions._running_agents[agent_id] = t
    agents_graph_actions._agent_instances[agent_id] = agent
    agents_graph_actions._agent_states[agent_id] = state
agents_graph_actions._agent_graph["nodes"][agent_id] = {
    "status": "running",
    "name": state.agent_name,
    "task": state.task,
    "parent_id": state.parent_id,
    "started_at": datetime.now(UTC).isoformat(),
}
t.start()  # start only after full registration
```



@Ahmex000
Author

2026-03-21.11-02-02.mp4

Checkpoint / resume fixes (bot review on PR usestrix#380):
- cli.py: skip checkpoint save when scan completed cleanly (agent.state.completed)
  to prevent stale checkpoint re-creating after base_agent.py deletes it
- tui.py: same completed guard in both _save_checkpoint_on_interrupt and
  action_custom_quit to cover all TUI exit paths
- checkpoint_restore.py: fix infinite recursion in _depth() for cyclic
  parent_id references in corrupted checkpoints — mark node before recursing
- config.py: restore original shell-env-wins precedence for LLM vars;
  cli-config.json only applies when the shell var is absent, preventing
  silent override of rotated keys managed via shell environment

New vulnerability skills (from upstream PRs usestrix#204 and usestrix#334):
- clickjacking, cors_misconfiguration, nosql_injection, prototype_pollution,
  ssti, websocket_security (PR usestrix#204)
- mfa_bypass, edge_cases (PR usestrix#334)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Ahmex000
Author

@greptile-apps

Add strix/skills/vulnerabilities/http_request_smuggling.md covering CL.TE, TE.CL, H2.CL, H2.TE desync techniques, detection methodology, exploitation scenarios, and validation steps.

Cherry-picked from usestrix#405.
Replace minimal nosql_injection.md with comprehensive version from usestrix#404 covering MongoDB operator injection, Redis/Elasticsearch/DynamoDB attack surfaces, blind extraction, bypass techniques, and validation methodology.