Skip to content

Fix target registry: thread-safety, single creation policy, wait-for-target in get()#75

Merged
nathanfallet merged 4 commits into
mainfrom
fix/target-registry
May 24, 2026
Merged

Fix target registry: thread-safety, single creation policy, wait-for-target in get()#75
nathanfallet merged 4 commits into
mainfrom
fix/target-registry

Conversation

@nathanfallet
Copy link
Copy Markdown
Member

Summary

The browser's target registry (the targets list) had three issues sharing one root cause: one mutable list populated by two code paths with different rules and no shared lock.

ISSUE-5 — thread-safety

targets was mutated under updateTargetInfoMutex on the event path, but read/written without it elsewhere (updateTargets(), and the non-suspend getters tabs/mainTab/targets) → ConcurrentModificationException/torn reads on a multi-threaded dispatcher. Now a private val mutableTargets is mutated only under the mutex, and an immutable copy is published to @Volatile targetsSnapshot after each mutation; the non-suspend getters read the snapshot. Browser.targets tightened MutableList<Connection>List<Connection> (no caller mutated it; OpenTelemetryBrowser delegates).

Why a snapshot rather than just marking the list @Volatile: volatile governs reassignment of the reference, not in-place add/remove, and MutableList isn't thread-safe — a reader iterating the live list would still hit ConcurrentModificationException. Readers iterate the immutable snapshot instead.

ISSUE-6 — inconsistent creation / duplicates

updateTargets() created plain Connections (invisible to tabs/mainTab, which filter Tab), while the event path created Tabs and appended without dedup — so a target could appear twice, or the initial page was a non-Tab. A single upsertTarget() now serves both paths: dedupe by targetId, page targets become Tabs.

ISSUE-8 — get() race / crash

get() did targets.filterIsInstance<Tab>().first { … }, throwing NoSuchElementException when the Tab wasn't registered yet (the targetCreated event races createTarget() returning; at start-up the initial page registers asynchronously). Replaced with awaitPageTab(), which polls the snapshot with a timeout. This is the mechanism behind the intermittent ElementTest.testGetChildren CI failures.

Verification (red → green)

New TargetRegistryTest (fake connection/tab via stubbed callCommand, no real browser):

  • updateTargets_registersPageTargetAsTab — RED on unfixed (createConnection → not a Tabtabs empty), GREEN after.
  • get_waitsForPageTabToAppear — RED on unfixed (.first {}NoSuchElementException), GREEN after.
  • updateTargets_dedupesByTargetId — guard.

Independently reviewed (safe to merge). Full :core:jvmTest (real Chrome) + :opentelemetry:jvmTest pass; macOS native tests pass; Linux/MinGW compile.

Test plan

  • ./gradlew :core:jvmTest --tests "*.TargetRegistryTest" — red on unfixed, green after
  • ./gradlew :core:jvmTest (real Chrome) — all pass
  • ./gradlew :opentelemetry:jvmTest — all pass
  • ./gradlew :core:macosArm64Test + Linux/MinGW compile — pass

…target in get()

The browser's target registry (the targets list) had three issues sharing one
root cause: one mutable list populated by two code paths with different rules
and no shared lock.

- Thread-safety (ISSUE-5). targets was mutated under updateTargetInfoMutex on the
  event path but read/written without it elsewhere (updateTargets, and the
  non-suspend getters tabs/mainTab/targets), risking ConcurrentModificationException
  or torn reads on a multi-threaded dispatcher. Now a private mutableTargets is
  mutated only under the mutex, and an immutable copy is published to a @volatile
  targetsSnapshot after every mutation; the non-suspend getters read the snapshot.
  Browser.targets is tightened MutableList<Connection> -> List<Connection>
  (no caller mutated it; OpenTelemetryBrowser delegates).

- Inconsistent creation (ISSUE-6). updateTargets() created plain Connections
  (invisible to tabs/mainTab, which filter Tab), while the event path created Tabs
  and appended without dedup, so a target could appear twice or the initial page
  was a non-Tab. A single upsertTarget() helper now serves both paths: dedupe by
  targetId, page targets become Tabs.

- get() race (ISSUE-8). get() did targets.filterIsInstance<Tab>().first { ... },
  which threw NoSuchElementException when the Tab had not been registered yet
  (the targetCreated event races createTarget() returning, and at start-up the
  initial page registers asynchronously). Replaced with awaitPageTab(), which
  polls the snapshot with a timeout.

Verified red->green with TargetRegistryTest (fake connection/tab via stubbed
callCommand): a page target is registered as a Tab, and get() waits for the page
tab to appear instead of throwing. Independently reviewed. Full :core:jvmTest
(real Chrome) + :opentelemetry:jvmTest pass; macOS native tests pass; Linux/MinGW
compile.
PR-C made the initial page register as a Tab, but only if getTargets() (in
start()'s updateTargets()) already returns it. getTargets() can return before
the browser has created the page, and no targetCreated event may have fired yet,
so browser.mainTab could still be null right after createBrowser — the residual
startup race behind the flaky `browser.mainTab ?: error(...)` tests
(e.g. testEvaluateWaitPromiseSuccess).

start() now ends with a best-effort awaitInitialPageTab(): it polls (re-running
updateTargets each tick so it works with or without auto-discovery events) until
a page Tab is registered, bounded by the same 10s timeout. It does NOT throw on
timeout — a browser may legitimately have no page, in which case mainTab stays
null as before.

This closes the mainTab startup race the same way the get() fix closed the
navigation race. Verified red->green with TargetRegistryTest.awaitInitialPageTab_
makesMainTabReadyOncePageAppears (mainTab null before the await; non-null once the
page appears during it). Full :core:jvmTest (real Chrome) + :opentelemetry:jvmTest
pass; macOS native tests pass; Linux/MinGW compile.
… timeouts

Two unrelated-to-the-fix CI-reliability changes so the build is a trustworthy signal:

- Set strategy.fail-fast: false so an intermittent failure on one OS leg no
  longer cancels the other two — the real per-OS result is visible instead of
  masked.
- Widen the real-browser network-event test timeouts from 3s (and a 1s handler
  wait) to 10s. These assert on requestWillBeSent/responseReceived/requestPaused
  arriving after a real navigation; 3s is too tight on loaded CI runners (notably
  Windows) and caused TimeoutCancellationException flakes in FetchInterceptionTest
  /RequestExpectationTest. The deterministic fake-transport unit tests keep their
  short timeouts.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 24, 2026

Codecov Report

❌ Patch coverage is 92.50000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../kotlin/dev/kdriver/core/browser/DefaultBrowser.kt 92.50% 0 Missing and 3 partials ⚠️

📢 Thoughts on this report? Let us know!

The on-runner HTML report isn't accessible from the Actions log, so a failing
assertion only shows its top frame. --info logs the test failure exception
(expected/actual) so CI-only failures can be diagnosed without re-uploading
reports. Diagnostic; can be trimmed once the flaky Windows test is understood.
@nathanfallet nathanfallet merged commit 5131422 into main May 24, 2026
4 of 5 checks passed
@nathanfallet nathanfallet deleted the fix/target-registry branch May 24, 2026 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant