Signal before unlocking proxy context mutex in emscripten_proxy_finish #26582

dschuff merged 6 commits into emscripten-core:main from
Conversation
When finishing a proxied call, the following race condition can happen:
```
thread 1 in emscripten_proxy_finish:

  pthread_mutex_lock(&ctx->sync.mutex);
  ctx->sync.state = DONE;
  remove_active_ctx(ctx);
  pthread_mutex_unlock(&ctx->sync.mutex);
  // --> thread is preempted or suspended here <--
  pthread_cond_signal(&ctx->sync.cond);

thread 2 in emscripten_proxy_sync_with_ctx (ctx is on this thread's stack):

  pthread_mutex_lock(&ctx.sync.mutex);     // <-- locks after the unlock above
  while (ctx.sync.state == PENDING) {      // <-- reads sync.state == DONE
    pthread_cond_wait(&ctx.sync.cond, &ctx.sync.mutex); // <-- doesn't run
  }
  pthread_mutex_unlock(&ctx.sync.mutex);
  int ret = ctx.sync.state == DONE;
  em_proxying_ctx_deinit(&ctx);            // <-- frees ctx and returns
```
Then thread 1 tries to run pthread_cond_signal on the freed ctx.
This may be what is causing flake in the pselect/ppoll tests on the CI
waterfall.
I'm suspicious of the AI's root-cause analysis, even though this does indeed look like a real bug.

The reason I'm suspicious is that I can't see how addition of the
I think the rationale here might be: "In many cases, it is desirable to signal after unlocking the mutex to avoid the 'woken' thread immediately blocking again while trying to acquire the lock that the signaler still holds." However, this might not be safe in this case due to the deallocation sequence.
For the codesize failure I think you might just need to rebase.

Yeah, that makes sense. I am also not sure whether this is affecting the current flake, but it looked like enough of a real possibility that it seemed worth doing. I'll check the other places in this code too.
I think we need to update cancel_ctx too, since it's called on the target thread and could race the same way.

LGTM, but I'm curious to see what @tlively has to say about the optimization here.

FTR, it looks like #26586 is more likely the cause of the flake, but I still think the reasoning here is sound and we should apply this fix.

I don't think there is a problem here. The actual communication happens via

We chatted offline and agreed (IIRC) that there is an issue here. I think @tlively is going to look into a solution that maybe keeps the performance optimization here?
We actually decided it wasn't worth trying to engineer something more performant right now. The problem can only happen if the proxying thread can wake up and try to acquire the lock faster than the target thread can continue to the next statement and unlock the lock. Which can realistically only happen when the target thread gets suspended at the wrong time (actually this is exactly the same situation where we have the UAF in the current code, and we don't know of any real problems caused by this currently). |
This is an automatic change generated by tools/maint/rebaseline_tests.py. The following (2) test expectation files were updated by running the tests with `--rebaseline`:

```
codesize/test_codesize_minimal_pthreads.json:           26409 => 26409 [+0 bytes / +0.00%]
codesize/test_codesize_minimal_pthreads_memgrowth.json: 26812 => 26812 [+0 bytes / +0.00%]

Average change: +0.00% (+0.00% - +0.00%)
```
sbc100 left a comment
1 byte code size saving too! Let's go!
When finishing a proxied call, the race condition described above can happen: thread 1 in emscripten_proxy_finish ends up calling pthread_cond_signal on a ctx that thread 2 in emscripten_proxy_sync_with_ctx has already freed (ctx is on thread 2's stack). This same logic applies to cancel_ctx, which is also called on the target thread. This may be what is causing flake in the pselect/ppoll tests on the CI waterfall.