Add Strix Checkpoint Feature (Custom Modification): resume/checkpoint system for interrupted scans #380

Open
Ahmex000 wants to merge 18 commits into usestrix:main from Ahmex000:main

Conversation

@Ahmex000

Strix Checkpoint Feature (Custom Modification)

I implemented a simple modification that introduces an important feature:
the ability to stop a running scan and resume it later without losing progress or restarting from scratch.

🔹 How It Works

Run Strix with a --run-name to create a checkpoint:

strix --target http://php.testinvicti.com/ --run-name my-scan
  • This saves a checkpoint under the name: my-scan.
  • You can resume the scan at any time using the same command.

🔹 Starting a New Scan with the Same Name

If you want to start a fresh scan using the same name:

strix --target http://php.testinvicti.com/ --run-name my-scan --new

🔹 Using a Different Checkpoint Name

Alternatively, you can simply use a new run name:

strix --target http://php.testinvicti.com/ --run-name my-scan-2
  • Note: I have fixed the issue in the previous code (https://github.com/usestrix/strix/pull/373), where you could only create one checkpoint; in the latest update you can pause and resume more than once.

  • Note: I have fixed the issue in the previous code (https://github.com/usestrix/strix/pull/378). Previously there were some issues with file locations, but the process is now simpler and lighter on the system.

The current version works without any problems, and you can pause and resume scans multiple times.

Ahmex000 and others added 15 commits March 19, 2026 07:30
- New strix/telemetry/checkpoint.py: Pydantic CheckpointModel + CheckpointManager
  with atomic writes (.tmp → rename), non-fatal errors, target-hash validation
- base_agent.py: save checkpoint after every iteration (root agents only),
  delete on clean completion, guard against duplicate task message on resume
- main.py: add --run-name, --resume, --new/--force-new CLI flags;
  _setup_checkpoint_on_args() handles load/validate/corrupt-recovery
- cli.py: resume banner, history replay (previous vulns + last 3 thoughts),
  restore AgentState with fresh sandbox + extended max_iterations budget
- tui.py: pre-populate tracer from checkpoint, restore AgentState in agent_config
- README.md: add "Resuming Interrupted Scans" section with usage examples

Original scan behavior is 100% preserved when --run-name is not used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
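The atomic write mentioned above (write to a `.tmp` file, then rename) can be sketched as follows. This is a minimal illustration, not the actual `CheckpointManager` API; the function name and payload shape are assumptions:

```python
import json
import os
from pathlib import Path
from typing import Any


def save_checkpoint(path: Path, data: dict[str, Any]) -> None:
    """Write a checkpoint atomically: a crash mid-write leaves the old file intact."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)  # atomic rename: readers see the old or new file, never a partial one
```

Because `os.replace` is atomic on the same filesystem, a reader can never observe a half-written checkpoint, which is what makes the "non-fatal errors" property possible.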
Previously checkpoints only saved tracer.chat_messages and
tracer.vulnerability_reports, leaving tracer.agents and
tracer.tool_executions empty on resume — so all sub-agents
(both in-progress and completed) were invisible after resuming.

Changes:
- checkpoint.py: add tracer_agents, tracer_tool_executions,
  tracer_next_execution_id fields to CheckpointModel; populate
  them in CheckpointManager.save() from the live tracer
- cli.py: on resume, restore agents dict, tool_executions dict,
  and advance _next_execution_id to avoid ID collisions
- tui.py: same restore logic so TUI sidebar shows all previous
  agents and their tool results immediately on resume

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous guard `if not self.state.messages` broke sub-agents because
they can have pre-loaded context messages in their state before agent_loop
is called. This caused them to start without a task and produce no output.

Fix: only skip the initial task message when parent_id is None AND messages
is already populated (= root agent resume). Sub-agents always get their
task message regardless of whether their state has prior context.

- Fresh root agent:        parent_id=None, messages=[]   → adds task ✓
- Fresh sub-agent:         parent_id=set,  messages=[]   → adds task ✓
- Sub-agent with context:  parent_id=set,  messages=[..] → adds task ✓
- Resumed root agent:      parent_id=None, messages=[..] → skips  ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three bugs fixed:

1. Ghost sub-agents (root cause of all sub-agent issues):
   Restoring tracer.agents/tool_executions injected old sub-agent
   entries that had no live instances. The TUI showed them as
   interactive but they could not receive messages or run. Worse,
   they polluted the agent message-routing system so new sub-agents
   spawned after resume failed to communicate with the root agent.
   Fix: only restore chat_messages, vulnerability_reports, and the
   execution ID counter. The root agent's LLM context (message
   history) already knows what all sub-agents did.

2. Root agent stuck in wait state after resume:
   If the scan was interrupted while the root agent was in a wait
   state (waiting_for_input=True, stop_requested=True, etc.) the
   restored AgentState had those flags set and the loop froze
   immediately. Fix: reset all blocking flags on restore in both
   cli.py and tui.py.

3. Completed flag causing instant exit:
   If completed=True was serialised into the checkpoint (edge case)
   the loop would exit on the first should_stop() check. Fix:
   reset completed=False on restore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sume

Defines the missing helper function called in cli.py and adds the equivalent
to tui.py. Injects a user message into the restored AgentState so the LLM
knows the scan was interrupted and must continue rather than call finish_scan
or agent_finish due to an abruptly-ended message history.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of resume not working:
- generate_run_name() adds a random suffix every time, so without
  --run-name the checkpoint from a previous session was never found.
- Ctrl+C during the first iteration (before any checkpoint was saved)
  left no checkpoint to resume from.

Fixes:
1. _find_checkpoint_by_target_hash(): scans strix_runs/ for the most
   recent checkpoint whose target_hash matches the current targets.
   Now running `strix --target example.com` again automatically
   resumes the last interrupted scan without needing --run-name.

2. _save_checkpoint_on_interrupt(): saves current agent state in both
   the signal handler and atexit in cli.py and tui.py, so a Ctrl+C
   mid-first-iteration still produces a valid checkpoint.

3. _setup_checkpoint_on_args() restructured: handles run_name=None,
   --force-new, explicit --run-name, and auto-detect in one place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root causes:
1. Resume message said "re-spawn sub-agents" but didn't say WHICH ones
   or that old agent IDs are dead — LLM tried to interact with old IDs
   and got confused.
2. send_message_to_agent returned unhelpful "not found" error when the
   LLM used old (dead) agent IDs after resume.

Fixes:
- _build_resume_context_message / _inject_resume_context_message now
  accept the full checkpoint_data object and extract tracer_agents to
  list every non-completed sub-agent by name and task. The LLM now
  knows exactly which agents to re-spawn.
- Message explicitly forbids interacting with any agent ID from history
  and instructs the LLM to call view_agent_graph first.
- send_message_to_agent returns an actionable error when target is not
  found: explains it may be a dead session ID and tells the LLM to use
  view_agent_graph then create_agent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… context

Root cause: create_agent passes the parent's full conversation history
(inherited_messages) to each new sub-agent. After resume, the parent's
history ends with a [SYSTEM - SCAN RESUMED] message that says:
  "ALL previous sub-agents have been terminated"
  "Do NOT call agent_finish unless all testing is genuinely complete"

Sub-agents reading this in their inherited context got confused:
- They thought they were the "terminated" agents and shouldn't be running
- They avoided calling agent_finish even when their task was done
- This caused them to hang, loop, or exit immediately without reporting

Fix: filter out any [SYSTEM - SCAN RESUMED] messages from the inherited
context before giving it to sub-agents. The resume instructions are only
relevant to the root agent — sub-agents should see normal parent context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
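The filtering fix can be sketched as below. The message-dict shape (`role`/`content` keys) is an assumption for illustration; the marker strings are the ones named in the commit messages:

```python
RESUME_MARKERS = ("[SYSTEM - SCAN RESUMED]", "[SYSTEM - SUB-AGENT RESUMED]")


def filter_inherited_messages(messages: list[dict]) -> list[dict]:
    """Drop resume-only system messages before handing the parent's
    history to a freshly spawned sub-agent, so the child never reads
    instructions meant only for the resumed root agent."""
    return [
        m for m in messages
        if not any(marker in str(m.get("content", "")) for marker in RESUME_MARKERS)
    ]
```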
Previously sub-agents were terminated on Ctrl+C and never properly restored
— only the root agent resumed. This is the full fix.

Architecture change:
- checkpoint.py: Added sub_agent_states field (dict[agent_id -> AgentState
  dump]) saved from _agent_instances at every checkpoint write. Every
  currently-running non-root agent is captured.

- base_agent.py: Replaced fragile _is_root_resume heuristic with an explicit
  is_resumed flag (set via agent config). Works for both root and sub-agents.
  Prevents duplicate task message from being added to restored agents.

- cli.py / tui.py: Added _restore_sub_agents() which, on resume, iterates
  checkpoint sub_agent_states in topological order (parents before children),
  restores each agent's full AgentState, resets blocking flags, clears the
  old sandbox, injects a [SYSTEM - SUB-AGENT RESUMED] message, and spawns
  each agent in a daemon thread — identical to how the root agent is handled.
  Sub-agents are spawned BEFORE execute_scan so root agent can communicate
  with them immediately using their original IDs.

- Root agent's resume message now says "these sub-agents are ALREADY RUNNING
  at IDs [X, Y]" instead of "re-spawn them" — prevents double-spawning.

- agents_graph_actions.py: [SYSTEM - SUB-AGENT RESUMED] filtered from
  inherited context alongside [SYSTEM - SCAN RESUMED] so freshly-spawned
  child agents never see these system markers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add _needs_fresh_container flag: always create a fresh container on
  first call to _get_or_create_container in a new runtime instance,
  preventing reuse of stale containers from a previous session whose
  async docker-rm hasn't completed yet.

- Add _cleanup_existing_containers(): uses subprocess.run (synchronous
  docker rm -f) instead of the SDK remove() which returns before Docker
  fully frees the container name, causing 409 Conflict on containers.run().
  Searches by both name filter and strix-scan-id label to catch containers
  in mid-removal state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
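The synchronous cleanup can be sketched as below. The helper names are illustrative; the point is that `subprocess.run` blocks until `docker rm -f` returns, by which time the daemon has freed the container name, unlike the SDK's `remove()`:

```python
import subprocess


def force_remove_command(container_name: str) -> list[str]:
    """Build the synchronous `docker rm -f` invocation."""
    return ["docker", "rm", "-f", container_name]


def cleanup_existing_container(container_name: str) -> None:
    # check=False: removing a container that no longer exists is not an error
    subprocess.run(force_remove_command(container_name), check=False, capture_output=True)
```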
Multiple sub-agent threads starting simultaneously all saw
_needs_fresh_container=True and each called _create_container,
which removes the container the previous thread just created.
Result: only 1 of N sub-agents got a live container; the rest
got 'Container X not found' on every tool call, and the root
agent's sandbox init also failed when sub-agents trashed the
container underneath it.

Fix: add threading.Lock (_container_init_lock) around the
slow path in _get_or_create_container. Only the first thread
to acquire the lock creates the container; all waiting threads
re-check _scan_container inside the lock and reuse the
already-running one, paying zero extra Docker overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
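The locking pattern described above is classic double-checked locking; a self-contained sketch (container creation is stubbed out, and the class name is invented for illustration):

```python
import threading


class ContainerPool:
    """Only the first thread to take the lock creates the container;
    waiting threads re-check inside the lock and reuse it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._container: str | None = None
        self.create_calls = 0  # instrumentation for the sketch

    def _create_container(self) -> str:
        self.create_calls += 1  # stands in for the slow Docker call
        return "container-0"

    def get_or_create(self) -> str:
        if self._container is None:           # fast path: no lock once initialized
            with self._lock:                  # slow path
                if self._container is None:   # re-check: another thread may have won
                    self._container = self._create_container()
        return self._container
```

Threads that lose the race block on the lock, then hit the re-check and return the already-running container, paying no extra Docker overhead.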
Root cause: checkpoint and tracer run directories used CWD-relative
paths (Path("strix_runs") and Path.cwd() / "strix_runs"). Launching
strix from different directories across sessions created separate
checkpoint files that never updated each other, so the third session
always resumed from the first session's iteration.

Fix: use Path.home() / "strix_runs" as the canonical absolute path in
both tracer.py and main.py so all sessions write to the same location
regardless of CWD.

Also includes earlier serialization robustness fixes (mode="json" +
_json_default fallback) and explicit checkpoint save in action_custom_quit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writes SAVED/FAILED entries to ~/strix_checkpoint_debug.log on every
save attempt, bypassing the suppressed warning logger. This lets us
see if saves are happening during resumed sessions and what error (if
any) is causing them to fail silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- config.py: cli-config.json LLM vars now always override shell env,
  preventing stale shell values from reverting the configured model
- checkpoint_restore.py: extract shared restore logic from cli/tui to
  eliminate code duplication
- cli.py / tui.py: use shared checkpoint_restore module, add
  double-save guard via threading.Event
- agents_graph_actions.py: add _agents_lock for thread-safe access to
  _running_agents and _agent_instances, fix mutable default arg in
  restore_sub_agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- checkpoint.py: remove debug log file (strix_checkpoint_debug.log)
  that was writing to every user's home directory on each save
- checkpoint.py: acquire _agents_lock before iterating _agent_instances
  to prevent RuntimeError on concurrent sub-agent creation/removal
- checkpoint_restore.py: register restored sub-agents in _agent_graph,
  _agent_instances, and _agent_states so send_message_to_agent can
  route to them — previously they were unreachable after resume
- tracer.py: revert Path.home() back to Path.cwd() to avoid silent
  breaking change for all users; checkpoint logic in main.py already
  uses Path.home() directly so tracer change was not needed
- cli.py / tui.py: move checkpoint_restore imports to top of file
  per PEP 8, remove noqa: E402 suppressions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR introduces a checkpoint/resume system for Strix scans, allowing users to pause and continue interrupted penetration tests without losing progress. The feature is activated via --run-name, --resume, and --new CLI flags, and checkpoints are saved atomically after each agent iteration to ~/strix_runs/<run-name>/checkpoint.json.

Key changes:

  • strix/telemetry/checkpoint.py (new): CheckpointManager with atomic write (.tmp → rename), CheckpointModel Pydantic schema, and compute_target_hash for checkpoint validation.
  • strix/interface/checkpoint_restore.py (new): restore_sub_agents spawns previously-running sub-agents from checkpoint state in topological order; build_root_resume_message injects a resume context message into the root agent.
  • strix/agents/base_agent.py: Saves checkpoint after every successful iteration (root agents only) and deletes it on clean completion; is_resumed flag suppresses the duplicate task message.
  • strix/interface/cli.py / tui.py: Resume banner, previous-findings replay, checkpoint manager wiring, and a completed guard that prevents re-saving the checkpoint after a clean scan finish (addressing prior review concerns).
  • strix/runtime/docker_runtime.py: Adds _container_init_lock and synchronous _cleanup_existing_containers (via subprocess.run docker rm) to prevent multi-thread container creation races on resume.
  • strix/tools/agents_graph/agents_graph_actions.py: Adds _agents_lock for thread-safe access to _running_agents / _agent_instances; strips resume system messages from inherited sub-agent context.
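The `compute_target_hash` validation mentioned above could plausibly look like the following; the canonicalization rules here are assumptions, not the actual implementation:

```python
import hashlib


def compute_target_hash(targets: list[str]) -> str:
    """Order-independent fingerprint of the scan targets, used to check
    that a checkpoint belongs to the same scan before resuming it."""
    canonical = "\n".join(sorted(t.strip().lower() for t in targets))
    return hashlib.sha256(canonical.encode()).hexdigest()
```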

Two concerns remain in checkpoint_restore.py:

  1. The _run closure that drives restored sub-agents does not replicate the graph-cleanup logic from _run_agent_in_thread — restored sub-agents permanently show status: "running" in view_agent_graph after they finish, which may confuse the root agent.
  2. t.start() is called before the agent is registered in _agent_instances and _agent_graph["nodes"], reversing the ordering used in create_agent and creating a small race window.

Confidence Score: 4/5

Safe to merge for most use cases; two P2 issues in restore_sub_agents may cause stale graph state for multi-agent resumed scans.

The core checkpoint save/load/delete cycle, atomic writes, interrupt guards, and the completed-flag fixes are all correct. The two open issues (stale graph status for restored sub-agents and reversed thread-start vs registration order) only affect the multi-sub-agent resume path and don't cause data loss or crashes — they could cause the root agent LLM to see a slightly inconsistent graph view, potentially leading to duplicated sub-agent spawning in edge cases.

strix/interface/checkpoint_restore.py — the _run closure needs graph-cleanup logic and thread registration should happen before t.start()

Important Files Changed

Filename Overview
strix/telemetry/checkpoint.py New file: CheckpointModel (Pydantic), CheckpointManager with atomic write (.tmp → rename), load, delete, and compute_target_hash helper. Logic is sound and non-fatal on I/O errors.
strix/interface/checkpoint_restore.py New file: restore_sub_agents starts sub-agents in daemon threads without proper graph-status cleanup on exit, and registers agents after thread start (reversed order vs. create_agent). build_root_resume_message is well structured.
strix/interface/cli.py Adds resume banner, output replay, checkpoint manager wiring, and interrupt handlers with completed-guard. Previous-thread concerns about stale checkpoint on clean completion are now fixed.
strix/interface/tui.py Mirrors cli.py resume support: restores tracer state, wires checkpoint manager, adds completed-guard before saving on quit. Previous-thread stale-checkpoint issue is addressed.
strix/interface/main.py Adds --run-name, --resume, --new CLI flags and _setup_checkpoint_on_args which handles all four cases: force-new, auto-detect by target hash, explicit run name with checkpoint, explicit run name without. Target-hash mismatch warning is clear.
strix/agents/base_agent.py Adds checkpoint_manager save call after each successful iteration (root agents only) and delete on clean completion. is_resumed flag prevents duplicate task message injection.
strix/runtime/docker_runtime.py Adds _container_init_lock to prevent multiple threads from racing to create a container on resume, and _cleanup_existing_containers using synchronous docker CLI rm to avoid the async SDK race where the name registry still holds the name.
strix/tools/agents_graph/agents_graph_actions.py Adds _agents_lock threading.Lock to protect _running_agents and _agent_instances; strips resume system messages from inherited context to prevent sub-agent confusion.
strix/config/config.py Removes _llm_env_changed guard (stale LLM keys no longer auto-purged from stored config) and refactors apply logic. The effective precedence rule (shell env wins when set) is unchanged, but see prior thread discussion.
This is a comment left during a code review.
Path: strix/interface/checkpoint_restore.py
Line: 77-92

Comment:
**Restored sub-agents never update graph status on completion**

The `_run` closure calls `agent.agent_loop()` but does nothing when the loop exits. In the normal path (`_run_agent_in_thread`), the thread wrapper updates the agent graph node status to `"completed"` or `"error"` and pops the agent from `_running_agents` / `_agent_instances`:

```python
# _run_agent_in_thread (agents_graph_actions.py) — NOT done by _run
_agent_graph["nodes"][state.agent_id]["status"] = "completed"
_agent_graph["nodes"][state.agent_id]["finished_at"] = datetime.now(UTC).isoformat()
with _agents_lock:
    _running_agents.pop(state.agent_id, None)
    _agent_instances.pop(state.agent_id, None)
```

Because `_run` omits this, every restored sub-agent permanently shows `status: "running"` in `view_agent_graph` after it finishes. The root agent may conclude those sub-agents are still alive, avoid re-spawning needed work, or wait indefinitely for results that have already been delivered.

Consider replacing the bare `_run` closure with a call to `_run_agent_in_thread`, or at minimum replicating the graph-cleanup logic inside `_run`'s `finally` block:

```python
def _run(a: Any = agent, s: Any = state) -> None:
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(a.agent_loop(s.task))
    except Exception:  # noqa: BLE001
        with contextlib.suppress(Exception):
            agents_graph_actions._agent_graph["nodes"][s.agent_id]["status"] = "error"
    else:
        with contextlib.suppress(Exception):
            agents_graph_actions._agent_graph["nodes"][s.agent_id]["status"] = "completed"
    finally:
        loop.close()
        with contextlib.suppress(Exception):
            with agents_graph_actions._agents_lock:
                agents_graph_actions._running_agents.pop(s.agent_id, None)
                agents_graph_actions._agent_instances.pop(s.agent_id, None)
```


---

This is a comment left during a code review.
Path: strix/interface/checkpoint_restore.py
Line: 83-100

Comment:
**Thread starts before agent is registered in the global registry**

`t.start()` is called before the agent is inserted into `_agent_instances`, `_running_agents`, and `_agent_graph["nodes"]`. This creates a window where the sub-agent thread is executing but the global registry doesn't know about it yet.

In the normal path (`create_agent` in `agents_graph_actions.py`), `_agent_instances[state.agent_id] = agent` is set *before* `thread.start()`. The reversed order here means that if the sub-agent's first LLM call returns very quickly and triggers a graph lookup (e.g. `view_agent_graph` or `agent_finish`), the node won't exist yet, potentially leading to a `KeyError`.

Move `t.start()` to after the full registration block:

```python
with agents_graph_actions._agents_lock:
    agents_graph_actions._running_agents[agent_id] = t
    agents_graph_actions._agent_instances[agent_id] = agent
    agents_graph_actions._agent_states[agent_id] = state
agents_graph_actions._agent_graph["nodes"][agent_id] = {
    "status": "running",
    "name": state.agent_name,
    "task": state.task,
    "parent_id": state.parent_id,
    "started_at": datetime.now(UTC).isoformat(),
}
t.start()  # start only after full registration
```



@Ahmex000
Author

2026-03-21.11-02-02.mp4

Checkpoint / resume fixes (bot review on PR usestrix#380):
- cli.py: skip checkpoint save when scan completed cleanly (agent.state.completed)
  to prevent stale checkpoint re-creating after base_agent.py deletes it
- tui.py: same completed guard in both _save_checkpoint_on_interrupt and
  action_custom_quit to cover all TUI exit paths
- checkpoint_restore.py: fix infinite recursion in _depth() for cyclic
  parent_id references in corrupted checkpoints — mark node before recursing
- config.py: restore original shell-env-wins precedence for LLM vars;
  cli-config.json only applies when the shell var is absent, preventing
  silent override of rotated keys managed via shell environment

New vulnerability skills (from upstream PRs usestrix#204 and usestrix#334):
- clickjacking, cors_misconfiguration, nosql_injection, prototype_pollution,
  ssti, websocket_security (PR usestrix#204)
- mfa_bypass, edge_cases (PR usestrix#334)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Ahmex000
Author

@greptile-apps

Add strix/skills/vulnerabilities/http_request_smuggling.md covering CL.TE, TE.CL, H2.CL, H2.TE desync techniques, detection methodology, exploitation scenarios, and validation steps.

Cherry-picked from usestrix#405.
Replace minimal nosql_injection.md with comprehensive version from usestrix#404 covering MongoDB operator injection, Redis/Elasticsearch/DynamoDB attack surfaces, blind extraction, bypass techniques, and validation methodology.