Skip to content

Fix message-subscription races in callCommand and expect/intercept#73

Merged
nathanfallet merged 2 commits into
mainfrom
fix/flow-subscription-races
May 23, 2026
Merged

Fix message-subscription races in callCommand and expect/intercept#73
nathanfallet merged 2 commits into
mainfrom
fix/flow-subscription-races

Conversation

@nathanfallet
Copy link
Copy Markdown
Member

Summary

Two hangs that share one root cause: code acts first and only then subscribes to hear the result. Because messages flow through a replay = 0 shared flow, a message arriving in that window is dropped.

1. callCommand response race

It sent the request and only afterwards subscribed (responses.first { it.id == … }) to await the reply. A reply that arrived before the collector subscribed was lost → callCommand hangs forever (no timeout).

Fix: register an id -> CompletableDeferred<Message.Response> in a pendingRequests map (mutex-guarded) before sending; the receive loop completes the matching deferred, so the reply is captured regardless of await timing. The map is also a keyed registry — O(1) routing instead of N broadcast collectors, and a foundation for failing in-flight requests on disconnect / per-request timeouts (follow-ups).

2. expect / intercept subscribe-after-enable race

BaseRequestExpectation / BaseFetchInterception enabled the domain and only then launched the event collectors (lazily), so an event fired in that window was missed → getRequestEvent() hangs.

Fix: launch the collectors with CoroutineStart.UNDISPATCHED and before enable(), so the subscription is established before any event can fire.

Testability seam

The WebSocket is extracted behind a WebSocketTransport interface (default KtorWebSocketTransport), injected via a protected open fun createTransport() on DefaultConnection, so the message plumbing can be exercised without a real browser.

Verification (deterministic red → green)

Both tests use UnconfinedTestDispatcher + a fake transport that delivers the reply/event in the exact race window, so they fail against the old code and pass against the fix (not just guard tests):

  • CallCommandResponseRaceTest — delivers the response inside send(): RED (timeout) on responses.first, GREEN with the waiter map.
  • ExpectationSubscribeRaceTest — fires requestWillBeSent while handling Network.enable: RED (timeout) on enable-then-subscribe, GREEN with subscribe-before-enable.

Full :core:jvmTest (real headless Chrome — incl. RequestExpectationTest, FetchInterceptionTest, BrowserTest) + :opentelemetry:jvmTest pass. macOS native tests pass; Linux/MinGW targets compile.

Out of scope (tracked separately)

  • Failing in-flight callCommand deferreds on disconnect, and closing the WS HttpClient, are pre-existing limitations (the old code hung/leaked the same way). The new registry makes the disconnect fix clean — slated for the connection-lifecycle PR.

Test plan

  • ./gradlew :core:jvmTest --tests "*.CallCommandResponseRaceTest" --tests "*.ExpectationSubscribeRaceTest" — red on old code, green after
  • ./gradlew :core:jvmTest (real Chrome) — all pass
  • ./gradlew :opentelemetry:jvmTest — all pass
  • ./gradlew :core:macosArm64Test + Linux/MinGW compile — pass

Two hangs shared one root cause: code acted first and only then subscribed to
hear the result, and because messages flow through a replay-0 shared flow, a
message arriving in that window was dropped.

callCommand: it sent the request and only afterwards subscribed (responses.first)
to await the reply. A reply that arrived before the collector subscribed was lost
and callCommand hung forever. It now registers an id -> CompletableDeferred in a
pendingRequests map *before* sending; the receive loop completes the matching
deferred, so the reply is captured regardless of await timing. The map is also a
keyed registry (O(1) routing, and a foundation for failing in-flight requests on
disconnect and per-request timeouts later) rather than N broadcast collectors.

expect/intercept: BaseRequestExpectation/BaseFetchInterception enabled the domain
and only then launched the event collectors (lazily), so an event fired in that
window was missed and getRequestEvent() hung. They now launch the collectors with
CoroutineStart.UNDISPATCHED and *before* enable(), so the subscription is
established before any event can fire.

To make this testable without a real browser, the WebSocket is extracted behind a
WebSocketTransport interface (default KtorWebSocketTransport) injected via a
protected createTransport() factory on DefaultConnection.

Verified red->green deterministically (UnconfinedTestDispatcher + a fake transport
that delivers the reply/event in the exact race window):
- CallCommandResponseRaceTest: RED (hang/timeout) on the old responses.first path,
  GREEN with the waiter map.
- ExpectationSubscribeRaceTest: RED (timeout) on the old enable-then-subscribe
  ordering, GREEN with subscribe-before-enable.
Full :core:jvmTest (real Chrome) + :opentelemetry:jvmTest pass; macOS native tests
pass and Linux/MinGW targets compile.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 23, 2026

Codecov Report

❌ Patch coverage is 77.58621% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../kdriver/core/connection/KtorWebSocketTransport.kt 62.06% 3 Missing and 8 partials ⚠️
...n/dev/kdriver/core/connection/DefaultConnection.kt 90.47% 0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

@nathanfallet nathanfallet merged commit b5311b5 into main May 23, 2026
5 checks passed
@nathanfallet nathanfallet deleted the fix/flow-subscription-races branch May 23, 2026 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant