Skip to content

CDP telemetry fixes#251

Open
Sayan- wants to merge 2 commits into
mainfrom
sayan/kernel-1301-add-more-events--cdp
Open

CDP telemetry fixes#251
Sayan- wants to merge 2 commits into
mainfrom
sayan/kernel-1301-add-more-events--cdp

Conversation

@Sayan-
Copy link
Copy Markdown
Contributor

@Sayan- Sayan- commented May 22, 2026

Overview

Bug

When chromium restarted mid-session (PATCH /chromium/flags, /configure, etc., or a crash + supervisord autorestart), the CDP proxy emitted cdp_disconnect.reason = "client_close" instead of "upstream_changed". Live repro confirmed: ~1.6s after the WS opened, reason was client_close even though chromium clearly went away.

Root cause

A goroutine watched mgr.Subscribe() for the new chromium URL and stamped reasonOverride = upstream_changed before triggering pump teardown. But the upstream's TCP EOF (chromium dying) reaches the pump before the new "DevTools listening on" log line is emitted (~5–8 s gap). So cleanup always ran first, saw reasonOverride == nil, and fell through to client_close.

Fix

  • wsproxy.Pump's onClose callback now receives a PumpExitCause (client / upstream / context).
  • Deleted the URL-watcher goroutine + reasonOverride in devtoolsproxy. Reason resolution moved into cleanup as a small helper, resolveDisconnectReason, which:
    • returns client_close / context_cancelled immediately from the cause,
    • on upstream cause, checks mgr.Current() and otherwise waits up to restartConfirmWait (10 s) on urlCh for a new URL — new URL ⇒ upstream_changed, timeout ⇒ upstream_error.

Tests

  • TestResolveDisconnectReason — 6 table cases covering all four reasons (sub-100 ms).
  • TestWebSocketProxyHandler_EmitsUpstreamChangedOnMidStreamRestart — integration through the handler with restartConfirmWait shortened.
  • Existing dial-time upstream_error + client_close tests continue to pass.

Verification

Repro'd live against the headful image. Before fix: reason=client_close, duration_ms=1620. After fix: reason=upstream_changed, duration_ms=10388.

Note

Medium Risk
Touches the WebSocket proxy pump/cleanup path and changes how cdp_disconnect reasons are derived, which could affect client reconnect behavior and telemetry classification under failure/restart conditions.

Overview
Improves CDP disconnect telemetry by teaching wsproxy.Pump to report which side caused the proxy loop to exit, and using that signal in devtoolsproxy to classify disconnects as client_close, context_cancelled, upstream_error, or upstream_changed.

When the upstream side dies, the proxy now briefly waits for an updated DevTools URL from UpstreamManager (with a configurable restartConfirmWait) to distinguish Chromium restarts from generic upstream failures; adds targeted unit tests for the new resolution logic and a mid-stream restart scenario. Most other diffs are import ordering/whitespace cleanup.

Reviewed by Cursor Bugbot for commit d6ef60c. Bugbot is set up for automated code reviews on this repo. Configure here.

@firetiger-agent
Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR title 'CDP telemetry fixes' does not indicate changes to kernel API endpoints or Temporal workflows; please add a comment to opt in to deploy monitoring if this affects those areas.

To monitor this PR anyway, reply with @firetiger monitor this.

@Sayan- Sayan- requested review from archandatta and hiroTamada May 22, 2026 21:49
} else if r.Context().Err() != nil {
reason = oapi.ContextCancelled
}
reason := resolveDisconnectReason(cause, r.Context(), mgr, urlCh, upstreamURL, restartConfirmWait, logger)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be a behavioral change, not just telemetry. Previously the URL watcher canceled the pump as soon as UpstreamManager published a different DevTools URL, forcing clients off stale upstream sessions. With this change, urlCh is only checked during dial/reason resolution after the pump exits, so a session can stay attached to the old upstream until the websocket itself errors or the request context is canceled. Could we preserve the watcher for stale-session cancellation while keeping this new cause-based reason resolution?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I am wrong, but just double-checking here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants