Signal before unlocking proxy context mutex in emscripten_proxy_finish #26582

dschuff merged 6 commits into emscripten-core:main from
Conversation
When finishing a proxied call, the following race condition can happen:
```
thread 1 in emscripten_proxy_finish:

  pthread_mutex_lock(&ctx->sync.mutex);
  ctx->sync.state = DONE;
  remove_active_ctx(ctx);
  pthread_mutex_unlock(&ctx->sync.mutex);
  // --> thread is preempted or suspended here <--
  pthread_cond_signal(&ctx->sync.cond);

thread 2 in emscripten_proxy_sync_with_ctx (ctx is on this thread's stack):

  pthread_mutex_lock(&ctx.sync.mutex);     // <-- locks after the unlock above
  while (ctx.sync.state == PENDING) {      // <-- reads sync.state == DONE
    pthread_cond_wait(&ctx.sync.cond, &ctx.sync.mutex); // <-- doesn't run
  }
  pthread_mutex_unlock(&ctx.sync.mutex);
  int ret = ctx.sync.state == DONE;
  em_proxying_ctx_deinit(&ctx);            // <-- frees ctx and returns
```
Then thread 1 tries to run pthread_cond_signal on the freed ctx.
This may be what is causing flake in the pselect/ppoll tests on the CI
waterfall.
I'm suspicious of the AI's root-cause analysis, even though this does indeed look like a real bug.

The reason I'm suspicious is that I can't see how addition of the
I think the rationale here might be: "In many cases, it is desirable to signal after unlocking the mutex to avoid the 'woken' thread immediately blocking again while trying to acquire the lock that the signaler still holds." However, this might not be safe in this case due to the deallocation sequence.
For the codesize failure I think you might just need to rebase.

Yeah, that makes sense. I am also not sure whether this is affecting the current flake, but it looked like enough of a real possibility that it seemed worth doing. I'll check the other places in this code too.
I think we need to update cancel_ctx too, since it's called on the target thread and could race the same way.

LGTM, but I'm curious to see what @tlively has to say about the optimization here.

FTR, it looks like #26586 is more likely the cause of the flake, but I still think the reasoning here is sound and we should apply this fix.

I don't think there is a problem here. The actual communication happens via

We chatted offline and agreed (IIRC) that there is an issue here. I think @tlively is going to look into a solution that maybe keeps the performance optimization here?
We actually decided it wasn't worth trying to engineer something more performant right now. The problem can only happen if the proxying thread can wake up and try to acquire the lock faster than the target thread can continue to the next statement and unlock the lock. Which can realistically only happen when the target thread gets suspended at the wrong time (actually this is exactly the same situation where we have the UAF in the current code, and we don't know of any real problems caused by this currently). |
This is an automatic change generated by tools/maint/rebaseline_tests.py. The following (2) test expectation files were updated by running the tests with `--rebaseline`:

```
codesize/test_codesize_minimal_pthreads.json:           26409 => 26409 [+0 bytes / +0.00%]
codesize/test_codesize_minimal_pthreads_memgrowth.json: 26812 => 26812 [+0 bytes / +0.00%]

Average change: +0.00% (+0.00% - +0.00%)
```
sbc100 left a comment
1 byte code size saving too! Let's go!
When finishing a proxied call, the race condition described above can happen: thread 1 in emscripten_proxy_finish ends up calling pthread_cond_signal on a ctx that thread 2 in emscripten_proxy_sync_with_ctx has already freed (ctx is on thread 2's stack). This same logic applies to cancel_ctx, which is also called on the target thread. This may be what is causing flake in the pselect/ppoll tests on the CI waterfall.