dialog: fix refcount race between BYE and timer expiry by NormB · Pull Request #3836 · OpenSIPS/opensips

NormB · 2026-03-05T18:40:56Z

Summary

Fix use-after-free crash caused by a race between BYE processing and dlg_ontimeout() when bye_on_timeout is enabled (create_dialog("B")). Reproduced on ARM64 (aarch64) — matches the crash signature reported in #3835 (bogus ref -1 / SIGSEGV in dlg_release_cloned_leg).

Details

Two independent race paths exist between BYE processing and the dialog timer handler:

Race Path 1 — Timer removal outside lock: next_state_dlg() transitions the dialog to DELETED and releases the hash entry lock before the caller performs remove_dlg_timer(). A concurrent worker can observe state=DELETED and begin its own cleanup, racing with the caller's timer removal.

Race Path 2 — Stale state read in timer handler: dlg_ontimeout() reads dlg->state at three locations without the hash entry lock (dlg_handlers.c:2468, dlg_handlers.c:2491, dlg_handlers.c:2518). On ARM64's relaxed memory ordering, these reads can return a stale CONFIRMED value after a BYE worker has already set DELETED under the lock. The timer handler then enters the bye_on_timeout path on a dialog that is already being torn down.

Both races are architecture-sensitive — ARM64's relaxed memory ordering makes them significantly more likely than on x86_64. This is consistent with the reporter seeing crashes on Graviton while the issue is difficult to reproduce on x86.

Race window diagram

Worker A (BYE handler):                     Worker B (timer handler):
----------------------------                ----------------------------------
                                            dlg_ontimeout() fires
                                            reads dlg->state WITHOUT hash lock
                                            sees state=CONFIRMED (stale on ARM64)

next_state_dlg(REQBYE)
  lock(hash_entry)
  state: CONFIRMED -> DELETED
  unref = 1
  unlock(hash_entry)
                                            still using stale CONFIRMED state
  +--- race window ---+                     enters bye_on_timeout path
  |  caller-side       |                    dlg_end_dlg() sends BYEs
  |  timer removal     |                    unref_dlg(dlg, 1)
  |  and unref not     |                      ^-- timer ref consumed
  |  yet performed     |
  +--------------------+

remove_dlg_timer()    <-- returns 0 (success) or races
unref_dlg(dlg, unref) <-- double-free of timer ref -> "bogus ref -1"

The same timer-removal-outside-lock pattern exists in all four callers of next_state_dlg():

Caller	File	Function
BYE processing	`dlg_handlers.c`	`dlg_onroute()`
Sequential forking BYE	`dlg_req_within.c`	`dual_bye_event()`
Replicated delete	`dlg_replication.c`	`dlg_replicated_delete()`
Bulk dialog drop	`dlg_replication.c`	`drop_dlg()`

Solution

A two-part fix. Our testing indicates neither part alone is sufficient — both are needed to cover all race windows.

Part 1: Timer removal inside lock (dlg_hash.c): Move remove_dlg_timer() inside next_state_dlg()'s lock scope. When transitioning to DELETED, the timer is removed atomically with the state change. All four callers' post-next_state_dlg() remove_dlg_timer() blocks are removed since next_state_dlg() now handles this centrally.

Lock ordering note: remove_dlg_timer() acquires d_timer->lock internally. The resulting hash entry lock → timer lock ordering is already established by existing code paths (e.g., insert_dlg_timer() called while holding the hash lock).

Part 2: Locked state read in dlg_ontimeout() (dlg_handlers.c): Read dlg->state under the hash entry lock into a local dlg_state variable, then use dlg_state for all subsequent state-dependent decisions. The lock acquire/release provides the memory barriers needed for ARM64 visibility.

Why both parts are needed

Part 1 alone was applied and tested on our ARM64 test VM — it still crashed within ~2 minutes. The reason: when the timer has already been dequeued by the timer wheel before the BYE arrives, remove_dlg_timer() returns >0 (not 0), so Part 1's (*unref)++ does not execute. The timer handler then proceeds with a stale CONFIRMED state and consumes the ref via unref_dlg(dlg, 1), leading to premature destruction.

Part 2 alone doesn't cover the window where the timer hasn't fired yet — a concurrent worker can see state=DELETED and race to unref before the timer ref has been accounted for.

Scenario	Part 1 alone	Part 2 alone	Both
BYE before timer fires	Safe	Safe	Safe
Timer fires, then BYE	Crash (stale read)	Safe	Safe
Timer reads state before BYE	Crash (stale read)	Crash (double cleanup)	Safe

ARM64 reproduction results

Reproduced on QEMU aarch64 VM (Debian 12, 2 vCPUs, 2GB RAM) using unmodified OpenSIPS master source (commit aad6b85). No simulation code or injected delays — crashes occurred naturally under load.

Test methodology: high worker count (32) on 2 vCPUs, short dialog timeout (2s) with SIPp BYE timed ~100ms before timeout, stress-ng for CPU/memory/IO pressure, taskset pinning timer + workers to same CPU core.

Run	Patch	Workers	CPS	Duration	Result
1	None	16	100	~3 min	SIGSEGV at ~20k calls
2	None	32	200	1m55s	SIGSEGV at ~15k calls
3	Part 1 only	32	200	~2 min	SIGABRT (same root cause)
4	Part 1 + Part 2	32	200	15 min	No crash (~180k calls)
5	Part 1 + Part 2	32	200	76 min	No crash (~920k calls)

Crash backtraces from unpatched runs:

#0  qm_frag_size (p=0x6e6f3d32723b) at mem/q_malloc.h:192
#1  _shm_free (ptr=..., file="dlg_handlers.c",
       function="dlg_release_cloned_leg", line=506)
#2  dlg_release_cloned_leg (dlg=0xffff96e2e230) at dlg_handlers.c:506
#3  push_reply_in_dialog (...) at dlg_handlers.c:570
#4  dlg_onreply (t=..., type=8) at dlg_handlers.c:665
#5  run_trans_callbacks (...) at t_hooks.c:233
#6  relay_reply (..., msg_status=200) at t_reply.c:1322

GDB inspection of the Part-1-only crash confirmed use-after-free: the dialog at the crash address had state=EARLY(1), ref=2 — a completely different dialog occupying reused shared memory.

Important caveat: these results are from a QEMU-emulated ARM64 environment, not production hardware. The crashes are consistent with the reported issue, but testing on production ARM64 hardware under real traffic patterns would provide additional confidence.

Refcount walkthrough (normal BYE, pre-fix crash, post-fix safe)

Normal BYE (with fix):

Initial:  ref=3 (base + hash + timer)

next_state_dlg(REQBYE):
  lock(hash_entry)
  state: CONFIRMED -> DELETED
  remove_dlg_timer() succeeds -> unref++ (now unref=2: hash + timer)
  unlock(hash_entry)

dlg_onroute():
  unref_dlg(dlg, 2)  -> ref 3 -> 1 (base ref remains for TM callback)

TM callback:
  unref_dlg(dlg, 1)  -> ref 1 -> 0 -> destroy

BYE + Timer race (current code, crash):

Initial:  ref=3 (base + hash + timer)

Worker A - next_state_dlg(REQBYE):
  lock -> state: CONFIRMED -> DELETED, unref=1 (hash only) -> unlock

                                    Worker B - dlg_ontimeout():
  [race window]                     reads state=CONFIRMED (stale)
                                    enters bye_on_timeout path
                                    unref_dlg(dlg, 1) -> ref 3 -> 2

Worker A - dlg_onroute():
  remove_dlg_timer() -> returns 0, unref=2
  unref_dlg(dlg, 2)  -> ref 2 -> 0 -> premature destroy

Worker B - bye_reply_cb:
  accesses freed dialog -> SIGSEGV / "bogus ref -1"

BYE + Timer race (with fix, safe):

Initial:  ref=3 (base + hash + timer)

Worker A - next_state_dlg(REQBYE):
  lock -> state: CONFIRMED -> DELETED
  remove_dlg_timer() succeeds -> unref=2 (under lock) -> unlock

                                    Worker B - dlg_ontimeout():
                                    lock -> reads state=DELETED -> unlock
                                    does NOT enter bye_on_timeout path

Worker A - dlg_onroute():
  unref_dlg(dlg, 2)  -> ref 3 -> 1 (TM callback ref remains)

TM callback:
  unref_dlg(dlg, 1)  -> ref 1 -> 0 -> clean destroy

Compatibility

No behavioral changes to existing call flows. The fix centralizes timer removal that was previously done by each caller, and adds a lock/read/unlock at the top of dlg_ontimeout(). The lock ordering (hash entry → timer) is already established in the existing codebase. No configuration changes or migration needed.

Note: the existing DLG_FLAG_RACE_CONDITION_OCCURRED mechanism (SIP-level CANCEL/200-OK race via create_dialog("E")) is a different race at the signaling level. This fix protects that code path as well since it also flows through dlg_ontimeout().

Closing issues

Closes #3835

…3835) The previous fix (196b51f) narrowed the race window between BYE processing and dlg_ontimeout() but could not fully close it: a BYE arriving between the state check and dlg_end_dlg() creates a TM transaction ref (via dlg_set_tm_dialog_ctx) that outlives the dialog. When the timeout-BYE responses destroy the dialog first, the real BYE's transaction callback dereferences freed memory. The same race pattern exists in dlg_options_routine() and dlg_reinvite_routine(), where expired ping dialogs call dlg_end_dlg() without any state transition. Fix: replace racy state checks with atomic next_state_dlg(DLG_EVENT_REQBYE) calls. The hash entry lock guarantees only one code path -- timer, ping handler, or BYE handler -- wins the CONFIRMED -> DELETED transition. Since dual_bye_event() now only sees DELETED -> DELETED for the timeout-BYE responses, the cleanup that it would normally perform on the first CONFIRMED -> DELETED transition (rt_on_hangup, profile linkers, DLGCB_TERMINATED callback, DB removal) is moved into the winning timeout/ping path. Three paths fixed: - dlg_ontimeout() bye_on_timeout (dlg_handlers.c) - dlg_options_routine() expired pings (dlg_timer.c) - dlg_reinvite_routine() expired pings (dlg_timer.c) Ref: OpenSIPS#3835

NormB mentioned this pull request Mar 5, 2026

[BUG] CRITICAL:dialog:_unref_dlg: bogus ref -2 with cnt 1 for dlg #3835

Open

NormB force-pushed the fix/gh3835-dialog-race branch from b5b39c3 to 196b51f Compare March 5, 2026 19:00

bogdan-iancu self-assigned this Mar 10, 2026

ovidiusas added a commit to ovidiusas/opensips that referenced this pull request Apr 7, 2026

Applying patch from OpenSIPS#3836 PR

361ca23

NormB force-pushed the fix/gh3835-dialog-race branch from 887d200 to 42c3041 Compare April 7, 2026 14:02

NormB force-pushed the fix/gh3835-dialog-race branch from 42c3041 to 1c2d0bd Compare April 7, 2026 16:00

ovidiusas added a commit to ovidiusas/opensips that referenced this pull request Apr 7, 2026

dialog: apply patch from OpenSIPS#3836

55718e1

bogdan-iancu added the investigating label May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dialog: fix refcount race between BYE and timer expiry#3836

dialog: fix refcount race between BYE and timer expiry#3836
NormB wants to merge 1 commit into
OpenSIPS:masterfrom
NormB:fix/gh3835-dialog-race

NormB commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

NormB commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants