feat(engine): unified event-driven observability with always-on logging and idle detection #52

Merged
jrob5756 merged 4 commits into main from feature/always-on-event-logging
Mar 22, 2026

jrob5756 commented Mar 22, 2026

Problem

When a workflow hangs or fails, there is no easy way to diagnose what happened. The event history exists only in memory and is lost on a crash, log files require an explicit --log-file flag, and the dashboard gives no indication of whether the model is thinking or the connection is dead. Timeout errors say "workflow timed out" but not which agent was stuck or for how long.

Solution

Unify all observability through the event emitter — make it always-on, persist events to disk automatically, and surface diagnostic data in the dashboard.

Always-on structured event logging

  • WorkflowEventEmitter is now created for every run, not just --web mode
  • New EventLogSubscriber writes every event as JSONL to $TMPDIR/conductor/conductor-<name>-<timestamp>.events.jsonl
  • Path printed to stderr on exit — always available for post-mortem analysis
  • No flag needed, no opt-in, zero overhead when not read
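
For readers unfamiliar with the pattern, here is a minimal sketch of a JSONL event subscriber. It is not the EventLogSubscriber code from this PR: the class name `JsonlEventLog`, the `on_event` method, the record envelope (`type`, `timestamp`, `data`), and the timestamp format are all assumptions made for illustration. Only the directory and filename shape follow the description above.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Dict, Optional


class JsonlEventLog:
    """Hypothetical stand-in for EventLogSubscriber; not the PR's actual code."""

    def __init__(self, workflow_name: str, base_dir: Optional[Path] = None) -> None:
        # Assumption: logs live in a "conductor" folder under the system temp dir.
        root = (base_dir or Path(tempfile.gettempdir())) / "conductor"
        root.mkdir(parents=True, exist_ok=True)
        stamp = time.strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
        self.path = root / f"conductor-{workflow_name}-{stamp}.events.jsonl"
        self._file = self.path.open("a", encoding="utf-8")

    def on_event(self, event_type: str, data: Dict[str, Any]) -> None:
        # One JSON object per line; flush so the line is handed to the OS
        # right away instead of sitting in a Python buffer if the process dies.
        record = {"type": event_type, "timestamp": time.time(), "data": data}
        self._file.write(json.dumps(record) + "\n")
        self._file.flush()

    def close(self) -> None:
        self._file.close()
```

Because each line is a complete JSON document, a partially written file is still readable up to the last flushed event, which is the property that matters for post-mortem analysis.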

Engine refactor: events as the single source of truth

  • Removed all 24 _verbose_log* calls and 13 lazy-import wrappers from workflow.py (~180 lines deleted)
  • Console output now flows through ConsoleEventSubscriber in run.py — subscribes to the emitter and calls the existing verbose_log_* display functions
  • The engine only emits events; how they are displayed is a subscriber concern
  • Three subscribers run in parallel: console, JSONL file, web dashboard
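
The split between "engine emits" and "subscribers display" can be sketched roughly as follows. The `Emitter` class and its `subscribe`/`emit` methods are illustrative names, not the real WorkflowEventEmitter API, and this sketch delivers events synchronously in subscription order rather than in parallel.

```python
from typing import Any, Callable, Dict, List

Subscriber = Callable[[str, Dict[str, Any]], None]


class Emitter:
    """Illustrative emitter; a stand-in for WorkflowEventEmitter, not its real API."""

    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        self._subscribers.append(callback)

    def emit(self, event_type: str, data: Dict[str, Any]) -> None:
        # The engine calls emit() and knows nothing about who is listening.
        for callback in list(self._subscribers):
            callback(event_type, data)


console_lines: List[str] = []        # stands in for console output
history: List[Dict[str, Any]] = []   # stands in for the JSONL file or dashboard

emitter = Emitter()
emitter.subscribe(lambda t, d: console_lines.append(f"[{t}] {d}"))
emitter.subscribe(lambda t, d: history.append({"type": t, "data": d}))

# A single emit reaches every subscriber.
emitter.emit("checkpoint_saved", {"agent": "researcher"})
```

Adding a new output (for example a metrics sink) then means adding a subscriber, with no change to the engine.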

Richer failure diagnostics

  • workflow_failed event now includes elapsed_seconds, timeout_seconds, and current_agent for timeout errors — directly answers "which agent was stuck?"
  • New checkpoint_saved event emitted with file path, agent name, and error type
  • Dashboard error banner shows timed-out agent name and checkpoint filename
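
As an illustration of how the new fields answer "which agent was stuck?", here is a small sketch that scans event-log lines for a timeout. The field names `elapsed_seconds`, `timeout_seconds`, and `current_agent` come from this PR; the surrounding envelope (`type` and `data` keys) and the helper name `describe_timeout` are assumptions.

```python
import json
from typing import Iterable, Optional


def describe_timeout(lines: Iterable[str]) -> Optional[str]:
    """Return a one-line summary of the first timeout failure, if any."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "workflow_failed":
            continue
        data = event.get("data", {})
        # Assumption: only timeout failures carry current_agent.
        if "current_agent" not in data:
            continue
        return (
            f"agent {data['current_agent']!r} was running when the workflow "
            f"timed out after {data['elapsed_seconds']:.0f}s "
            f"(limit {data['timeout_seconds']:.0f}s)"
        )
    return None
```

The function accepts any iterable of lines, so it can be pointed at an open `.events.jsonl` file directly.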

Dashboard: idle detection and log access

  • Idle timer in status bar: shows Xs idle after 5s of no events, turns amber at 60s — immediately tells you whether the model is thinking or the connection stalled
  • Logs download button always visible in header (not just after completion) — grab a snapshot of events mid-run
  • GET /api/logs endpoint returns full event history as downloadable JSON
  • awaiting_model event emitted by both providers right before SDK/API calls, marking the exact start of dead zones
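
The idle-timer thresholds can be expressed as a small pure function. This is a sketch of the logic only, written in Python for consistency with the rest of the repo; the dashboard's actual implementation is not shown in this PR description, and the function name, return shape, and the choice of inclusive boundaries are assumptions.

```python
from typing import Dict, Optional

IDLE_SHOW_AFTER = 5.0    # seconds of silence before the timer appears
IDLE_WARN_AFTER = 60.0   # seconds of silence before it turns amber


def idle_status(last_event_at: float, now: float) -> Optional[Dict[str, str]]:
    """Return what the status bar should show, or None while events are fresh."""
    idle = now - last_event_at
    if idle < IDLE_SHOW_AFTER:
        return None
    level = "amber" if idle >= IDLE_WARN_AFTER else "normal"
    return {"label": f"{int(idle)}s idle", "level": level}
```

Resetting `last_event_at` on every incoming event, including awaiting_model, is what lets the timer measure the gap since the model call began rather than since the last visible output.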

Provider parity

  • Added awaiting_model event to Claude provider (was only in Copilot)
  • Documented provider parity rules in AGENTS.md — all providers must maintain feature parity for event callbacks, retry semantics, output contracts, tool execution, and session management
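
A rough sketch of the parity rule for awaiting_model: every provider emits the event immediately before its blocking SDK/API call. The helper `call_with_marker`, the `emit` callback signature, and the `request_sent` marker used by the fake request are hypothetical and exist only to show the ordering.

```python
from typing import Any, Callable, Dict, List, Tuple

Emit = Callable[[str, Dict[str, Any]], None]


def call_with_marker(emit: Emit, agent_name: str, send_request: Callable[[], str]) -> str:
    # Emit right before the blocking call so any silence that follows
    # is attributable to the model, not to the engine.
    emit("awaiting_model", {"agent": agent_name})
    return send_request()


events: List[Tuple[str, Dict[str, Any]]] = []


def fake_request() -> str:
    # Stands in for a provider SDK call; records when it actually ran.
    events.append(("request_sent", {}))
    return "ok"


reply = call_with_marker(lambda t, d: events.append((t, d)), "researcher", fake_request)
```

Routing both providers through one helper like this is one way to keep the event sequence identical across them; the PR itself may achieve parity differently.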

Testing

  • 1729 tests pass, 0 failures
  • 6 new tests for EventLogSubscriber
  • Updated test_for_each_verbose.py to verify events instead of patching removed functions
  • Updated 3 Claude provider tests for new awaiting_model event sequence
  • Verified with live runs: simple-qa (console), parallel-research (console), simple-qa (--web-bg dashboard)

Commits

  • aaf42dc — Always-on JSONL event logging, richer timeout/checkpoint events
  • 1cedfcc — Consolidate verbose logging into ConsoleEventSubscriber
  • b978284 — Idle timer, always-on logs button, awaiting_model event (Copilot)
  • 24d9e28 — Claude provider parity, AGENTS.md provider parity rules

Jason Robert added 4 commits March 22, 2026 11:07
- Add EventLogSubscriber: writes every workflow event as JSONL to
  $TMPDIR/conductor/ so diagnostic data is always available, not just
  when --web or --log-file is passed.
- Always create the WorkflowEventEmitter regardless of --web flag,
  enabling event-driven diagnostics for all runs.
- Enrich workflow_failed event with timeout-specific fields
  (elapsed_seconds, timeout_seconds, current_agent) so timeouts
  are immediately diagnosable.
- Emit new checkpoint_saved event with path, agent, and error type.
- Add /api/logs download endpoint to web dashboard.
- Show "Download Logs" button in dashboard header after workflow ends.
- Enhance error banner with timeout agent name and checkpoint path.
- Add 6 tests for EventLogSubscriber.
jrob5756 changed the title from "feat(engine): always-on structured event logging and richer diagnostics" to "feat(engine): unified event-driven observability with always-on logging and idle detection" on Mar 22, 2026
jrob5756 merged commit 1e88d06 into main on Mar 22, 2026
7 checks passed
jrob5756 deleted the feature/always-on-event-logging branch on March 22, 2026 16:21