Skip to content

SDK-471 Fix Swift access race in Pending/Fulfill#1063

Open
sumeruchat wants to merge 2 commits into
masterfrom
feature/SDK-471-fix-pending-access-race
Open

SDK-471 Fix Swift access race in Pending/Fulfill#1063
sumeruchat wants to merge 2 commits into
masterfrom
feature/SDK-471-fix-pending-access-race

Conversation

@sumeruchat
Copy link
Copy Markdown
Contributor

@sumeruchat sumeruchat commented May 11, 2026

What

Thread Sanitizer flagged a Swift access race inside Pending / Fulfill (reproducible in customer apps, GitHub issue #767, easiest repro via setUserId). All mutable state — successCallbacks, errorCallbacks, and result — was accessed with zero synchronization while URLSession delegate queues, persistenceContext, and main thread routinely fan into the same instance.

This change serializes all state mutations behind a private DispatchQueue (matching the established AuthManager.callbackQueue pattern) and snapshots callbacks before invoking them outside the critical section so re-entrant registrations cannot deadlock. Multi-event semantics of Fulfill (used by MockInAppDisplayer.onShow and similar in-tree callers) are preserved.

Changes

  • swift-sdk/Internal/Pending.swift — All reads/writes of result, successCallbacks, errorCallbacks now run inside stateQueue.sync. Callbacks fire OUTSIDE the sync block. Fulfill.resolve / reject route through a new setResult(_:) instead of relying on didSet. Repeated resolution / rejection still notifies every registered callback (Fulfill is still usable as an event signal — no behavior change).
  • onCompletion atomicity — both callbacks are appended inside one sync block; a .failure resolve landing between the two appends can no longer lose the error handler.
  • Late registration on resolved Pending — appends the callback and replays the stored result for the new callback only (previously iterated the full accumulated array, causing O(N²) firing). Late-registered callbacks remain in the list and receive subsequent resolves.
  • tests/unit-tests/PendingConcurrencyTests.swift — 11 new tests covering concurrent register+resolve, onCompletion atomicity under race (deterministic via DispatchGroup), re-entrant callback registration, flatMap cross-queue resolution, late-register-after-resolve, repeated resolve/reject behaviour, mixed resolve-after-reject, and a 1000-iteration stress run.

Impact

  • Breaking changes: None. The latent O(N²) firing on late-register-against-resolved-Pending is corrected — each late-registered callback now fires once per resolution rather than accumulating. No in-tree caller depended on the O(N²) behaviour; all existing tests pass.
  • Dependencies: None.
  • Performance: Net win — the O(N²) firing path is gone. Sync overhead is a single uncontended dispatch_sync per operation.

Testing

Automated (verified locally):

  • ./agent_test.sh PendingTests — all 9 original tests pass
  • ./agent_test.sh PendingConcurrencyTests — all 11 new tests pass
  • ./agent_test.sh InAppTests — 43/43 pass (covers MockInAppDisplayer.onShow multi-event use)
  • ./agent_test.sh AuthTests — 37/37 pass

How to verify TSan locally:

  1. Open swift-sdk.xcodeproj in Xcode
  2. Edit swift-sdk scheme → Test → Diagnostics → enable Thread Sanitizer
  3. Run PendingConcurrencyTests — expect zero TSan reports
  4. Optional: run a customer-style setUserId flow against a sample app with TSan enabled — the original prestolabs warning should be gone

Edge cases covered:

  • Resolve racing N=50 onSuccess registrations from a concurrent queue
  • Reject racing N=50 onError registrations
  • onCompletion(value:error:) partial-append race (200 iterations, DispatchGroup-synchronized)
  • Re-entrant onSuccess registration from inside an executing callback
  • flatMap chain crossing 3 queues (background → main → background)
  • Repeated resolve / reject notifies every registered callback
  • Mixed reject then resolve notifies the matching branch each time
  • Late-registered callbacks stay registered and receive subsequent resolves
  • concurrentPerform-based stress run of 1000 Fulfills

Thread Sanitizer flagged concurrent read/write on `successCallbacks`,
`errorCallbacks`, and `result` because `Pending` had zero synchronization
while URLSession delegate queues, persistenceContext, and main thread all
mutate the same instance during flows like `setUserId`. Serialize all
state mutations through a private `DispatchQueue` (matches the
`AuthManager.callbackQueue` convention) and snapshot callbacks before
invoking them outside the critical section so re-entrant registrations
do not deadlock.

Folds in three related correctness fixes:
- `onCompletion` now appends both callbacks atomically (previously a
  `.failure` resolve landing between the two appends would lose the
  error handler).
- Late registration on an already-resolved `Pending` now fires only the
  newly-registered callback (was: iterated the full accumulated array,
  causing O(N^2) firing).
- Double-resolve / double-reject are now ignored with a debug log
  (was: silently double-fired through `didSet`).

Adds `PendingConcurrencyTests` covering concurrent register+resolve,
`onCompletion` atomicity under race, re-entrant callback registration,
flatMap cross-queue resolution, late-register-after-resolve, double
resolve/reject, and a 1000-iteration stress run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.22%. Comparing base (e901170) to head (7300c66).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1063      +/-   ##
==========================================
+ Coverage   71.11%   71.22%   +0.10%     
==========================================
  Files         112      112              
  Lines        9328     9337       +9     
==========================================
+ Hits         6634     6650      +16     
+ Misses       2694     2687       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Address review feedback: the previous commit over-scoped the fix by
turning `Fulfill` into a one-shot promise. In-tree callers depend on
`Fulfill` being usable as a multi-event broadcast — `MockInAppDisplayer`
calls `onShow.resolve(with:)` once per shown in-app message and expects
the single registered `onSuccess` to fire each time. The one-shot guard
broke `testAutoShowInAppMultipleWithOrdering` and similar tests.

This commit narrows the fix to thread-safety only:

- `setResult` no longer drops duplicate resolves and no longer drains
  the callback arrays. Each call overwrites `result` and fires every
  currently-registered callback against a snapshot taken under the
  lock. Callbacks still fire OUTSIDE the critical section so re-entrant
  registrations cannot deadlock.
- `onSuccess` / `onError` / `onCompletion` always append the callback,
  including on already-resolved Pendings, so future resolves continue
  to notify late-registered listeners (matching prior behavior).
- The O(N^2) latent bug remains fixed: late registration on a resolved
  Pending replays the stored result for the newly-registered callback
  only, instead of iterating the full accumulated array.
- `onCompletion` still appends both callbacks inside a single
  `stateQueue.sync`, eliminating the partial-append race.

Tests updated:

- `testOnCompletionAtomicityUnderRace` switched from a 50ms `asyncAfter`
  drain (flaky under TSan / busy CI) to a `DispatchGroup`-driven assert
  that waits on actual completion before reading counters.
- Removed `testDoubleResolveIsIgnored`, `testDoubleRejectIsIgnored`,
  and `testResolveAfterRejectIsIgnored` — those semantics are no
  longer claimed.
- Added `testRepeatedResolveNotifiesRegisteredCallbacks`,
  `testRepeatedRejectNotifiesRegisteredCallbacks`,
  `testResolveAfterRejectNotifiesMatchingBranches`, and
  `testLateRegisterAfterResolveStaysRegisteredForFutureResolves` to
  pin down the preserved multi-event behavior.

Verified locally: PendingTests 9/9, PendingConcurrencyTests 11/11,
InAppTests 43/43 (including the previously-broken case), AuthTests
37/37.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant