dialog: fix refcount race between BYE and timer expiry#3836
Open
NormB wants to merge 1 commit intoOpenSIPS:masterfrom
Open
dialog: fix refcount race between BYE and timer expiry#3836NormB wants to merge 1 commit intoOpenSIPS:masterfrom
NormB wants to merge 1 commit intoOpenSIPS:masterfrom
Conversation
b5b39c3 to
196b51f
Compare
ovidiusas
added a commit
to ovidiusas/opensips
that referenced
this pull request
Apr 7, 2026
887d200 to
42c3041
Compare
…3835) The previous fix (196b51f) narrowed the race window between BYE processing and dlg_ontimeout() but could not fully close it: a BYE arriving between the state check and dlg_end_dlg() creates a TM transaction ref (via dlg_set_tm_dialog_ctx) that outlives the dialog. When the timeout-BYE responses destroy the dialog first, the real BYE's transaction callback dereferences freed memory. The same race pattern exists in dlg_options_routine() and dlg_reinvite_routine(), where expired ping dialogs call dlg_end_dlg() without any state transition. Fix: replace racy state checks with atomic next_state_dlg(DLG_EVENT_REQBYE) calls. The hash entry lock guarantees only one code path -- timer, ping handler, or BYE handler -- wins the CONFIRMED -> DELETED transition. Since dual_bye_event() now only sees DELETED -> DELETED for the timeout-BYE responses, the cleanup that it would normally perform on the first CONFIRMED -> DELETED transition (rt_on_hangup, profile linkers, DLGCB_TERMINATED callback, DB removal) is moved into the winning timeout/ping path. Three paths fixed: - dlg_ontimeout() bye_on_timeout (dlg_handlers.c) - dlg_options_routine() expired pings (dlg_timer.c) - dlg_reinvite_routine() expired pings (dlg_timer.c) Ref: OpenSIPS#3835
42c3041 to
1c2d0bd
Compare
ovidiusas
added a commit
to ovidiusas/opensips
that referenced
this pull request
Apr 7, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix use-after-free crash caused by a race between BYE processing and
dlg_ontimeout()whenbye_on_timeoutis enabled (create_dialog("B")). Reproduced on ARM64 (aarch64) — matches the crash signature reported in #3835 (bogus ref -1/ SIGSEGV indlg_release_cloned_leg).Details
Two independent race paths exist between BYE processing and the dialog timer handler:
Race Path 1 — Timer removal outside lock:
next_state_dlg()transitions the dialog toDELETEDand releases the hash entry lock before the caller performsremove_dlg_timer(). A concurrent worker can observestate=DELETEDand begin its own cleanup, racing with the caller's timer removal.Race Path 2 — Stale state read in timer handler:
dlg_ontimeout()readsdlg->stateat three locations without the hash entry lock (dlg_handlers.c:2468,dlg_handlers.c:2491,dlg_handlers.c:2518). On ARM64's relaxed memory ordering, these reads can return a staleCONFIRMEDvalue after a BYE worker has already setDELETEDunder the lock. The timer handler then enters thebye_on_timeoutpath on a dialog that is already being torn down.Both races are architecture-sensitive — ARM64's relaxed memory ordering makes them significantly more likely than on x86_64. This is consistent with the reporter seeing crashes on Graviton while the issue is difficult to reproduce on x86.
Race window diagram
The same timer-removal-outside-lock pattern exists in all four callers of
next_state_dlg():dlg_handlers.cdlg_onroute()dlg_req_within.cdual_bye_event()dlg_replication.cdlg_replicated_delete()dlg_replication.cdrop_dlg()Solution
A two-part fix. Our testing indicates neither part alone is sufficient — both are needed to cover all race windows.
Part 1: Timer removal inside lock (
dlg_hash.c): Moveremove_dlg_timer()insidenext_state_dlg()'s lock scope. When transitioning toDELETED, the timer is removed atomically with the state change. All four callers' post-next_state_dlg()remove_dlg_timer()blocks are removed sincenext_state_dlg()now handles this centrally.Lock ordering note:
remove_dlg_timer()acquiresd_timer->lockinternally. The resulting hash entry lock → timer lock ordering is already established by existing code paths (e.g.,insert_dlg_timer()called while holding the hash lock).Part 2: Locked state read in
dlg_ontimeout()(dlg_handlers.c): Readdlg->stateunder the hash entry lock into a localdlg_statevariable, then usedlg_statefor all subsequent state-dependent decisions. The lock acquire/release provides the memory barriers needed for ARM64 visibility.Why both parts are needed
Part 1 alone was applied and tested on our ARM64 test VM — it still crashed within ~2 minutes. The reason: when the timer has already been dequeued by the timer wheel before the BYE arrives,
remove_dlg_timer()returns>0(not0), so Part 1's(*unref)++does not execute. The timer handler then proceeds with a staleCONFIRMEDstate and consumes the ref viaunref_dlg(dlg, 1), leading to premature destruction.Part 2 alone doesn't cover the window where the timer hasn't fired yet — a concurrent worker can see
state=DELETEDand race to unref before the timer ref has been accounted for.ARM64 reproduction results
Reproduced on QEMU aarch64 VM (Debian 12, 2 vCPUs, 2GB RAM) using unmodified OpenSIPS master source (commit
aad6b85). No simulation code or injected delays — crashes occurred naturally under load.Test methodology: high worker count (32) on 2 vCPUs, short dialog timeout (2s) with SIPp BYE timed ~100ms before timeout, stress-ng for CPU/memory/IO pressure, taskset pinning timer + workers to same CPU core.
Crash backtraces from unpatched runs:
GDB inspection of the Part-1-only crash confirmed use-after-free: the dialog at the crash address had
state=EARLY(1), ref=2— a completely different dialog occupying reused shared memory.Important caveat: these results are from a QEMU-emulated ARM64 environment, not production hardware. The crashes are consistent with the reported issue, but testing on production ARM64 hardware under real traffic patterns would provide additional confidence.
Refcount walkthrough (normal BYE, pre-fix crash, post-fix safe)
Normal BYE (with fix):
BYE + Timer race (current code, crash):
BYE + Timer race (with fix, safe):
Compatibility
No behavioral changes to existing call flows. The fix centralizes timer removal that was previously done by each caller, and adds a lock/read/unlock at the top of
dlg_ontimeout(). The lock ordering (hash entry → timer) is already established in the existing codebase. No configuration changes or migration needed.Note: the existing
DLG_FLAG_RACE_CONDITION_OCCURREDmechanism (SIP-level CANCEL/200-OK race viacreate_dialog("E")) is a different race at the signaling level. This fix protects that code path as well since it also flows throughdlg_ontimeout().Closing issues
Closes #3835