Skip to content

feat(engine): stop/resume workflow control with provider-level idle detection#53

Open
jrob5756 wants to merge 5 commits intomainfrom
feature/stop-resume-idle-detection-v2
Open

feat(engine): stop/resume workflow control with provider-level idle detection#53
jrob5756 wants to merge 5 commits intomainfrom
feature/stop-resume-idle-detection-v2

Conversation

@jrob5756
Copy link
Collaborator

@jrob5756 jrob5756 commented Mar 22, 2026

Summary

Stop/Resume workflow control for the web dashboard, provider-level idle detection with interrupt support, and dashboard state resilience across page refreshes.

Changes

Provider-level idle detection + interrupt (Copilot)

  • Merged interrupt support INTO the idle detection loop so both work simultaneously
  • Previously, having an interrupt_signal completely bypassed idle recovery, causing sessions to hang forever
  • Now the provider monitors for user interrupts AND sends recovery prompts after 90s of silence, while respecting active work (events flowing = not idle)
  • Lowered idle timeout from 300s to 90s for faster stall recovery
  • Recovery prompt ("please continue") is harmless to active work — model ignores it if already processing

Provider interrupt during API calls (Claude)

  • Race interrupt signal against both _execute_api_call and _execute_with_parse_recovery for provider parity
  • Copy working_messages in _request_partial_output to avoid mutating caller's history

Web dashboard Stop → Resume

  • Stop (POST /api/stop): Aborts the current agent via interrupt event (not kill workflow)
  • Resume (POST /api/resume): Re-executes the paused agent
  • Kill (POST /api/kill): Hard-stop with checkpoint saved for CLI resume
  • Separate kill_event from _stop_event to prevent permanent poisoning across pause cycles
  • disconnect_event for auto-resume when browser closes during pause (prevents hanging forever)
  • agent_paused / agent_resumed events for dashboard state tracking

Dashboard state resilience (page refresh)

  • workflow_started handler uses event timestamp instead of Date.now() — elapsed timer survives refresh
  • agent_started / script_started store startedAt from event timestamp — node elapsed timers survive refresh
  • replayState sets lastEventTime from last replayed event — idle timer works after refresh
  • useLiveElapsed hook uses stored startedAt instead of Date.now() on mount

Web mode interrupt support

  • Always create interrupt_event in --web/--web-bg mode so dashboard Stop button reaches the engine
  • Share interrupt_event with dashboard via set_interrupt_event()

Other

  • Remove revive infrastructure (Ctrl+R, /api/revive, revive watcher) — replaced by idle detection + Stop/Resume
  • Remove dead _wait_with_interrupt method (merged into idle detection)
  • Hoist json import to module level in workflow engine
  • Silently consume stale interrupt in web mode between-agent checks
  • Bump flaky CLI import time test threshold from 0.5s to 1.0s
  • Header: useEffect for button state resets, error logging in catch blocks, killing state
  • Upgrade Copilot SDK dependency to >=0.2.0

Test results

  • 1760 passed, 9 skipped
  • Lint clean, typecheck clean
  • Frontend builds successfully

- Separate kill_event from _stop_event to prevent permanent poisoning
- Add disconnect_event for auto-resume when browser closes during pause
- Race interrupt signal against _execute_with_parse_recovery (Claude parity)
- Copy working_messages in _request_partial_output to avoid mutation
- Hoist json import to module level, remove scattered lazy imports
- Silently consume stale interrupt in web mode between-agent checks
- Remove dead _wait_with_interrupt method (merged into idle detection)
- Header: useEffect for state resets, error logging, killing state
- Updated docstrings for stop/kill terminology consistency
@jrob5756 jrob5756 force-pushed the feature/stop-resume-idle-detection-v2 branch from e6bf69f to 8bbdd1e Compare March 22, 2026 18:48
Jason Robert added 4 commits March 22, 2026 20:06
- Separate kill_event from _stop_event to prevent permanent poisoning
- Add disconnect_event for auto-resume when browser closes during pause
- Race interrupt signal against _execute_with_parse_recovery (Claude parity)
- Copy working_messages in _request_partial_output to avoid mutation
- Hoist json import to module level, remove scattered lazy imports
- Silently consume stale interrupt in web mode between-agent checks
- Remove dead _wait_with_interrupt method (merged into idle detection)
- Header: useEffect for state resets, error logging, killing state
- Updated docstrings for stop/kill terminology consistency
- Extended kill test to verify kill_event and bg_event
- Update github-copilot-sdk dependency to >=0.2.0
- Lower idle timeout from 300s to 90s for faster stall recovery
- Update provider and tests for SDK v0.2.0 API changes
- Add --locked flag to all uv tool install commands for reproducible deps
- Increase idle recovery max attempts from 3 to 5 (7.5min total runway)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant