Centralize timer-arming to close TOCTOU race#983
Open
jhugard wants to merge 3 commits into
Open
Conversation
Follow-up to #975 and #980: the CAS-then-Start pattern for retargeting m_timerDue had a TOCTOU window where thread A could win the CAS but before calling Start(), thread B could CAS+Start an earlier deadline, which thread A's Start() would then overwrite — stranding the earlier callback until independent traffic arrived. The fix in #980 added post-Start verification in SubmitPendingCallbacks, but the same unguarded pattern existed in QueueItem and PromoteReadyPendingCallbacks. This change extracts the CAS+Start+verify logic into three helpers (ArmTimerIfEarlier, ArmNextPendingCallback, RearmObservedDueTime) so the post-Start verification is applied uniformly at every call site. ~67 lines of duplicated inline CAS logic removed.
499ff7d to
8386b37
Compare
brianpepin
approved these changes
May 21, 2026
Contributor
brianpepin
left a comment
There was a problem hiding this comment.
Much cleaner. Thanks for the effort!
Collaborator
Author
I should have seen this race earlier, so thanks for pointing it out! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #975 and #980: retargeting
m_timerDuestill had a TOCTOU windowaround the non-atomic CAS -> Start sequence. Thread A could publish a later due
time, thread B could then publish and arm an earlier one, and thread A's
trailing
Start()could overwrite the OS timer with the later deadline andstrand the earlier callback until unrelated traffic arrived, as in #973.
The previous fix in #980 added a post-Start verification check in
SubmitPendingCallbacks, but the same unguarded pattern existed inQueueItemand
PromoteReadyPendingCallbacks.This change centralizes the CAS+Start+verify logic into three focused helpers
so every timer-arm path uses the same post-Start verification pattern without
burying it in call-site control flow.
The public
XTaskQueueAPI surface is unchanged. The fix is internal to thedelayed-callback scheduling path and matters to any consumer relying on timed
waits, deferred continuations, retry back-off, or timeout enforcement.
Bug addressed
Timer-arm TOCTOU overwrite. When two threads race to arm
m_timerDue, thesequence CAS → Start is not atomic. Between a successful CAS and the subsequent
Start()call, a concurrent thread can:m_timerDue.Start()with that earlier deadline.The first thread's
Start()then fires after thread B's, overwriting the OStimer with a later deadline. The earlier pending entry is stranded until some
unrelated timer fire or new submission happens to flush it.
The external symptom matches the stale-callback bug fixed in #975: a delayed
callback becomes runnable only after unrelated later work wakes the queue.
Under sustained concurrent delayed-callback submission this can show up as
latency jitter or starvation for a particular deadline bucket.
What changed
New helpers (
TaskQueue.cpp)Three private methods replace all inline CAS+Start patterns:
ArmTimerIfEarlier(dueTime)falseonly when the published due time moved later and the caller should re-evaluate.ArmTimerForNextPendingDueTime(previousDueTime, nextDueTime)m_timerDuelater. Returnsfalsewhen the caller must rescan because the published value changed again.RearmTimerIfDueTimeUnchanged(dueTime)falseif the observed due time is stale.Call-site changes
QueueItem: Replaced the inline CAS loop with a single call toArmTimerIfEarlier(entry.enqueueTime). The return value is intentionallyignored because the entry is already safe in
m_pendingListeven if anotherthread retargets the timer first.
PromoteReadyPendingCallbacks: Replaced the inline CAS loop thatpublished
nextItem.enqueueTimewithArmTimerForNextPendingDueTime(dueTime, nextItem.enqueueTime). Onfalse(world changed), the function refreshesnowanddueTimeand rescans.SubmitPendingCallbacks: Replaced the inline CAS + post-verificationstanza from Update TaskQueue with changes from Windows repo #980 with
RearmTimerIfDueTimeUnchanged(dueTime). Onfalse,the outer loop reloads state and re-evaluates.
The helper extraction removes the duplicated inline CAS logic from all three
call sites.
Design rationale
Why three helpers instead of one?
Each call site has distinct preconditions on the direction the published
deadline may move:
QueueItem: only moves earlier (new entry).PromoteReadyPendingCallbacks: moves later (just-fired deadline replacedwith next future one).
SubmitPendingCallbacks: re-arms the same value (early/stale timer fire).A single function would need mode flags or extra parameters to express these
constraints, and its post-Start verification logic would diverge per mode. Three
small functions with clear names and focused contracts are simpler to audit.
Why
<=inArmTimerIfEarlier?The
<=comparison keeps equality on the same verified path. That matters forbenign races where concurrent submitters compute the same due time, and it
keeps
ArmTimerIfEarliercorrect for callers that need to re-arm a deadlinethat is already published rather than strictly earlier.
Post-Start verification pattern
All three helpers perform the same post-
Start()check against the publishedvalue:
That verification closes the TOCTOU window: if another thread's CAS lands
between ours and
Start(), we detect it and either repair the earlier arm oryield to the thread that already published it, rather than leaving it stranded.
ABA safety
If
m_timerDuegoes throughX → Y → Xbetween our CAS and verification, weread back our own value and return
true. The concurrent thread that moved itto
Yand back called its ownStart(X), so the timer is correctly armed forX. ABA here is benign by construction.Public API surface
XTaskQueuesignatures change.XTaskQueueSubmitDelayedCallbackbehavior is unchanged from the caller'sperspective: the not-before-deadline guarantee is now upheld more reliably
under concurrent submission and timer retargeting races.
Validation
libHttpClientWindows unit test suite:87 passed, 0 failed, 0 skipped.
current tree still carries unrelated WebSocket connect and mock multi-match
expectation failures.
VerifyDelayedCallbackTimerRaceOnManualQueueVerifyFutureDelayedCallbackQueuedDuringEmptySweepDoesNotStallVerifyTerminationDoesNotEarlyPromoteSiblingDelayedCallbackVerifyStaleDelayedCallbackDoesNotEarlyPromoteNextPendingEntryThese tests were added in #975 and exercise the exact interleaving scenarios
that this change hardens. The TOCTOU window itself is probabilistic and
difficult to deterministically reproduce in a unit test (the window is a few
nanoseconds between CAS and Start), but the structural guarantee — that
post-Start verification is now uniformly applied — eliminates the class of bug
rather than papering over individual instances.
Additional downstream integration validation:
queues and delayed callbacks to implement sleep, timeout, and cancellation.
on that coroutine layer.
Scope
code paths).
unrelated task-queue behavior are unchanged.
the fix through the shared
TaskQueuePortImplscheduling logic.Review focus
<=semantics inArmTimerIfEarlier, especially around equal-deadlineraces between concurrent submitters.
ArmTimerForNextPendingDueTime's decision to returntruewhen another threadpublishes an earlier deadline (delegating responsibility to that thread's own
verify cycle).
RearmTimerIfDueTimeUnchanged's refusal to resurrect a stale deadline whenm_timerDuehas already advanced past it.true= stable/covered,false=re-evaluate) is clear and consistently used at each call site.