
rutabaga_gfx/cross_domain: fix audio glitches by dropping write lock #671

Open
aford173 wants to merge 1 commit into containers:main from aford173:fix-audio-glitches

Conversation


@aford173 aford173 commented May 7, 2026

The previous write() held item_state.lock() across the write_volatile call that performs the actual syscall, serializing every cross-domain CMD_WRITE behind any other operation on the items table — including add_item from process_receive, which fires for every new fd received via SCM_RIGHTS as guests open additional channels.

For per-period eventfd wakeups (e.g. PipeWire audio streams signaling host playback every audio period), this means the write completes only after any in-flight item_state operation finishes. Under stream-create churn — many guest applications opening streams concurrently, each delivering new fds via SCM_RIGHTS that hit add_item under the same lock — the wait can exceed the audio period budget and produce missed-deadline glitches at the host's audio output.

This change shrinks the critical section to a brief fd dup() under the lock, performs the syscall lock-free, and only re-acquires the lock if hang_up indicates the item should be removed. In the common case (hang_up == 0, e.g. repeated eventfd wakeups for an active stream) the table is no longer touched per write, eliminating both the contention and the previous "remove + conditional re-insert" churn.

Behavioral changes vs the old code:

  • Common case (hang_up == 0): item stays in the table; we hand out a dup'd fd for the write. Net behavior identical, lock hold time bounded by dup().
  • hang_up == 1: item removed in a separate phase after the write, instead of "removed unconditionally then re-inserted on hang_up == 0". Same observed end state.
  • Concurrent writes to the same id (no longer serialized): the kernel guarantees atomicity for eventfd writes (8 bytes) and pipe writes <= PIPE_BUF, which are the only two CrossDomainItem variants this branch handles. Each caller dup's its own fd and writes through it independently.
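For illustration, here is a minimal sketch of the reshaped write path described above. The type and field names (CrossDomainState, CrossDomainItem::WaitFd, item_state) are stand-ins modeled on this description rather than the actual rutabaga_gfx definitions, and error handling is simplified:

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{Error, ErrorKind, Write};
use std::os::fd::OwnedFd;
use std::sync::Mutex;

// Illustrative stand-ins, not the real rutabaga_gfx types.
enum CrossDomainItem {
    WaitFd(OwnedFd),
}

struct CrossDomainState {
    item_state: Mutex<BTreeMap<u32, CrossDomainItem>>,
}

impl CrossDomainState {
    fn write(&self, id: u32, data: &[u8], hang_up: bool) -> std::io::Result<()> {
        // Phase 1: brief critical section, held only long enough to dup() the fd.
        let dup_fd: OwnedFd = {
            let items = self.item_state.lock().unwrap();
            match items.get(&id) {
                Some(CrossDomainItem::WaitFd(fd)) => fd.try_clone()?, // dup()
                None => return Err(Error::from(ErrorKind::NotFound)),
            }
        }; // lock dropped here

        // Phase 2: the actual syscall runs without holding item_state.
        File::from(dup_fd).write_all(data)?;

        // Phase 3: only on hang_up is the table touched again, to drop the item.
        if hang_up {
            self.item_state.lock().unwrap().remove(&id);
        }
        Ok(())
    }
}
```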

Verified with a synthetic reproducer: a sustained 1 kHz sine playing through a guest PipeWire stream while ~200 short-lived guest streams open and close concurrently (each issuing SCM_RIGHTS for new eventfds, contending with the sustained stream's per-period writes for item_state). Capturing the host sink monitor and counting sample-to-sample deltas exceeding a clean-sine threshold, the stress workload produced ~10 distinct glitch bursts in an 8-second capture before this change, and zero across five consecutive runs after.
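The burst counting was along these lines (a hypothetical helper, not the actual reproducer script; the threshold is derived from the largest sample-to-sample step a clean sine can produce, amplitude * 2π * 1000 / sample_rate):

```rust
/// Hypothetical sketch of the glitch metric: counts bursts of sample-to-sample
/// jumps larger than a clean sine allows, collapsing hits closer than `gap`
/// frames into a single burst.
fn count_glitch_bursts(samples: &[f32], threshold: f32, gap: usize) -> usize {
    let mut bursts = 0;
    let mut last_hit: Option<usize> = None;
    for (i, pair) in samples.windows(2).enumerate() {
        if (pair[1] - pair[0]).abs() > threshold {
            if last_hit.map_or(true, |prev| i - prev > gap) {
                bursts += 1;
            }
            last_hit = Some(i);
        }
    }
    bursts
}
```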


Signed-off-by: Adam Ford <adam.ford@anodize.com>
@valpackett
Contributor

Woah, that's quite the torture test scenario, did you ever actually experience stream-create churn in realistic regular desktop usage? 👀

I agree that the current code is locking too much, but I'm wary of the dup(): turning write() into dup()+write()+close() is quite some overhead. If only this were all async code, the forwarding of each fd would've just been an independent task… sigh.

Maybe we could use a refcount instead? It would only add a pointer chase for dereferencing the Rc stored in the table, better than two extra syscalls churning kernel state.
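A minimal sketch of that alternative, assuming the table holds the fd behind an Arc (names again illustrative): the critical section becomes an atomic refcount bump instead of a dup() syscall, and the write still happens with the lock dropped.

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::Write;
use std::sync::{Arc, Mutex};

struct Table {
    // Illustrative stand-in: the fd wrapped in a File, shared via Arc.
    items: Mutex<BTreeMap<u32, Arc<File>>>,
}

impl Table {
    fn write(&self, id: u32, buf: &[u8]) -> std::io::Result<()> {
        // Critical section: an Arc::clone (refcount bump), no syscall, no table churn.
        let file = {
            let items = self.items.lock().unwrap();
            items.get(&id).cloned().ok_or(std::io::ErrorKind::NotFound)?
        };
        // Lock-free write through the shared handle; no dup()/close() pair.
        (&*file).write_all(buf)
    }
}
```

One wrinkle with this shape: a hang_up removal only drops the table's reference, so the underlying fd stays open until any in-flight writers drop their clones.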


BTW, keep in mind that for libkrun 2.0 the plan is to switch to upstream rutabaga (#560), so changes should be proposed there as well (or ideally, first).

@aford173
Author

aford173 commented May 8, 2026

@valpackett - I never saw it in real life, but I was playing with Claude. My end-game is to run multiple instances of muvm on the same machine and have them share an audio pipeline, so I asked it to help craft a worst-case scenario.
