feat(engine): unified event-driven observability with always-on logging and idle detection by jrob5756 · Pull Request #52 · microsoft/conductor

jrob5756 · 2026-03-22T15:08:11Z

Problem

When a workflow hangs or fails, there is no easy way to diagnose what happened. The event history only exists in-memory (lost on crash), log files require an explicit --log-file flag, and the dashboard gives no indication of whether the model is thinking or the connection is dead. Timeout errors say "workflow timed out" but not which agent was stuck or for how long.

Solution

Unify all observability through the event emitter — make it always-on, persist events to disk automatically, and surface diagnostic data in the dashboard.

Always-on structured event logging

WorkflowEventEmitter is now created for every run, not just --web mode
New EventLogSubscriber writes every event as JSONL to $TMPDIR/conductor/conductor-<name>-<timestamp>.events.jsonl
Path printed to stderr on exit — always available for post-mortem analysis
No flag needed, no opt-in; the only cost is the file append, and the log can be ignored until it is needed
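As a rough illustration of the always-on log, here is a minimal sketch of a JSONL subscriber. It is not the actual EventLogSubscriber: the class name `JsonlEventLog`, the `on_event` method, the `base_dir` parameter, and the timestamp format are all assumptions made for the example. Only the directory layout and file naming pattern come from the description above.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


class JsonlEventLog:
    """Hypothetical stand-in for EventLogSubscriber: one JSON object per line."""

    def __init__(self, workflow_name: str, base_dir: Optional[str] = None) -> None:
        # tempfile.gettempdir() honours $TMPDIR when it is set.
        root = Path(base_dir or tempfile.gettempdir()) / "conductor"
        root.mkdir(parents=True, exist_ok=True)
        stamp = time.strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
        self.path = root / f"conductor-{workflow_name}-{stamp}.events.jsonl"
        self._file = self.path.open("a", encoding="utf-8")

    def on_event(self, event: dict) -> None:
        # Flush after every event so the file stays useful after a crash.
        self._file.write(json.dumps(event, default=str) + "\n")
        self._file.flush()

    def close(self) -> None:
        self._file.close()
```

Flushing per event trades a little throughput for a log that is complete up to the last event before a hang or crash, which is the case this PR targets.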

Engine refactor: events as the single source of truth

Removed all 24 _verbose_log* calls and 13 lazy-import wrappers from workflow.py (~180 lines deleted)
Console output now flows through ConsoleEventSubscriber in run.py — subscribes to the emitter and calls the existing verbose_log_* display functions
The engine only emits events; how they are displayed is a subscriber concern
Three subscribers run in parallel: console, JSONL file, web dashboard
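The emit-only design can be sketched as a small fan-out emitter. This is an illustration under stated assumptions, not the real WorkflowEventEmitter: the `EventEmitter` class, its `subscribe` and `emit` methods, the flat event dict, and the choice to swallow subscriber errors are all hypothetical.

```python
from typing import Callable, List

Subscriber = Callable[[dict], None]


class EventEmitter:
    """Hypothetical emitter: the engine calls emit(), subscribers decide display."""

    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        self._subscribers.append(callback)

    def emit(self, event_type: str, **data) -> None:
        event = {"type": event_type, **data}
        # Copy the list so a subscriber may unsubscribe or subscribe mid-emit.
        for callback in list(self._subscribers):
            try:
                callback(event)
            except Exception:
                # Assumed policy: a broken display must not fail the workflow
                # or starve the other subscribers.
                pass


# Usage: console, file, and dashboard would each be one subscriber.
console_lines: List[str] = []
emitter = EventEmitter()
emitter.subscribe(lambda e: console_lines.append(f"[{e['type']}] {e.get('agent', '')}"))
emitter.emit("awaiting_model", agent="researcher")
```

Because the engine only knows about `emit`, adding or removing an output channel does not touch workflow code.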

Richer failure diagnostics

workflow_failed event now includes elapsed_seconds, timeout_seconds, and current_agent for timeout errors — directly answers "which agent was stuck?"
New checkpoint_saved event emitted with file path, agent name, and error type
Dashboard error banner shows timed-out agent name and checkpoint filename
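For post-mortem use, the enriched event can be read straight from the JSONL file. The sketch below assumes a flat event shape with a top-level `type` key next to `elapsed_seconds`, `timeout_seconds`, and `current_agent`; the real envelope may nest these differently, and the helper name `describe_timeout` is hypothetical.

```python
import json
from typing import Iterable, Optional


def describe_timeout(lines: Iterable[str]) -> Optional[str]:
    """Return a one-line summary of the first timeout failure, if any."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        if event.get("type") != "workflow_failed":
            continue
        agent = event.get("current_agent")
        if agent is None:
            # A failure without timeout fields is not a timeout; keep looking.
            continue
        elapsed = event["elapsed_seconds"]
        limit = event["timeout_seconds"]
        return f"{agent} was stuck: {elapsed:.0f}s elapsed, timeout {limit:.0f}s"
    return None


# Sample data invented for the example.
sample = [
    '{"type": "awaiting_model", "agent": "researcher"}',
    '{"type": "workflow_failed", "current_agent": "researcher", '
    '"elapsed_seconds": 612.4, "timeout_seconds": 600}',
]
summary = describe_timeout(sample)
```

With a real log, the same function could be fed an open file handle, since a file iterates line by line.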

Dashboard: idle detection and log access

Idle timer in status bar: shows "Xs idle" after 5s of no events and turns amber at 60s, so you can tell at a glance whether the model is thinking or the connection has stalled
Logs download button always visible in header (not just after completion) — grab a snapshot of events mid-run
GET /api/logs endpoint returns full event history as downloadable JSON
awaiting_model event emitted by both providers right before SDK/API calls, marking the exact start of dead zones
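The idle timer's thresholds reduce to a small pure function. The dashboard itself is a web frontend, so this Python version only illustrates the logic; the function name, the return shape, and the treatment of the exact boundaries (5s and 60s inclusive) are assumptions.

```python
from typing import Optional, Tuple

SHOW_AFTER_SECONDS = 5    # label appears after this much silence
AMBER_AFTER_SECONDS = 60  # label turns amber at this point


def idle_indicator(idle_seconds: float) -> Tuple[Optional[str], str]:
    """Map seconds since the last event to (label, colour)."""
    if idle_seconds < SHOW_AFTER_SECONDS:
        return None, "normal"  # recent activity: show nothing
    label = f"{int(idle_seconds)}s idle"
    colour = "amber" if idle_seconds >= AMBER_AFTER_SECONDS else "normal"
    return label, colour
```

The caller would reset `idle_seconds` to zero whenever any event arrives, including `awaiting_model`, so a long idle reading after that event points at the model call rather than the engine.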

Provider parity

Added awaiting_model event to Claude provider (was only in Copilot)
Documented provider parity rules in AGENTS.md — all providers must maintain feature parity for event callbacks, retry semantics, output contracts, tool execution, and session management
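One way to keep the `awaiting_model` behaviour identical across providers is to route both through a shared helper, sketched below. The helper `call_with_awaiting_event` and the fake provider functions are hypothetical and do not reflect how the Claude or Copilot providers are actually structured.

```python
from typing import Callable, List


def call_with_awaiting_event(
    agent: str,
    prompt: str,
    send: Callable[[str], str],
    emit: Callable[[dict], None],
) -> str:
    """Emit awaiting_model immediately before the blocking model call."""
    emit({"type": "awaiting_model", "agent": agent})
    return send(prompt)


# Two fake providers standing in for real SDK/API calls.
def fake_claude(prompt: str) -> str:
    return f"claude:{prompt}"


def fake_copilot(prompt: str) -> str:
    return f"copilot:{prompt}"


events: List[dict] = []
replies = [
    call_with_awaiting_event("writer", "hi", provider, events.append)
    for provider in (fake_claude, fake_copilot)
]
```

A parity test can then assert that every provider yields the same event sequence for the same input, which is the kind of check the rules in AGENTS.md call for.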

Testing

1729 tests pass, 0 failures
6 new tests for EventLogSubscriber
Updated test_for_each_verbose.py to verify events instead of patching removed functions
Updated 3 Claude provider tests for new awaiting_model event sequence
Verified with live runs: simple-qa (console), parallel-research (console), simple-qa (--web-bg dashboard)

Commits

aaf42dc — Always-on JSONL event logging, richer timeout/checkpoint events
1cedfcc — Consolidate verbose logging into ConsoleEventSubscriber
b978284 — Idle timer, always-on logs button, awaiting_model event (Copilot)
24d9e28 — Claude provider parity, AGENTS.md provider parity rules

- Add EventLogSubscriber: writes every workflow event as JSONL to $TMPDIR/conductor/ so diagnostic data is always available, not just when --web or --log-file is passed.
- Always create the WorkflowEventEmitter regardless of --web flag, enabling event-driven diagnostics for all runs.
- Enrich workflow_failed event with timeout-specific fields (elapsed_seconds, timeout_seconds, current_agent) so timeouts are immediately diagnosable.
- Emit new checkpoint_saved event with path, agent, and error type.
- Add /api/logs download endpoint to web dashboard.
- Show "Download Logs" button in dashboard header after workflow ends.
- Enhance error banner with timeout agent name and checkpoint path.
- Add 6 tests for EventLogSubscriber.


Jason Robert added 4 commits March 22, 2026 11:07

refactor(engine): consolidate verbose logging into event subscriber

1cedfcc

feat(web): idle timer, always-on logs button, and awaiting_model event

b978284

feat(claude): add awaiting_model event and document provider parity rules

24d9e28

jrob5756 changed the title ~~feat(engine): always-on structured event logging and richer diagnostics~~ feat(engine): unified event-driven observability with always-on logging and idle detection Mar 22, 2026

jrob5756 merged commit 1e88d06 into main Mar 22, 2026
7 checks passed

jrob5756 deleted the feature/always-on-event-logging branch March 22, 2026 16:21

jrob5756 merged 4 commits into main from feature/always-on-event-logging