Fix: execution context queue stress tests failures #16472

ysbaddaden · 2025-12-02T17:45:34Z

Refactors and abstracts the stress test specs of Fiber::ExecutionContext::Runnables and Fiber::ExecutionContext::GlobalQueue so that no loop can run forever: the "thread setup" part has been removed, so the main fiber no longer blocks waiting for the threads to be ready; each thread's loop eventually times out and the thread returns, so the main fiber no longer blocks while joining the threads.

I abstracted a helper because the different tests shared the same structure, and duplicating the logic was painful & noisy.

This fixes the regular CI failures that occurred often on Darwin, and that I just reproduced on Linux by running these specs in tight loops, multiple times in parallel, to overload the CPU cores.

Might fix #16470, or at least let it fail (instead of hanging for 6h).
Related to #15630.
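
The bounded-loop idea described above can be sketched as follows. This is a minimal illustration and not the helper added by this PR: `run_bounded` and its parameters are hypothetical names, fibers stand in for the spec's threads, and `Time.monotonic` is assumed as the clock for the deadline.

```crystal
# Hypothetical helper (not the one added by this PR): runs *n* workers and
# makes every worker loop end once *limit* has elapsed.
def run_bounded(n : Int32, limit : Time::Span, &work : Int32 -> Bool) : Nil
  done = Channel(Nil).new(n)
  deadline = Time.monotonic + limit

  n.times do |i|
    spawn do
      # Leave the loop when the work reports completion or the deadline has
      # passed, whichever comes first.
      until work.call(i) || Time.monotonic >= deadline
        Fiber.yield
      end
      done.send(nil)
    end
  end

  # Each worker sends exactly once after leaving its loop, so this wait ends
  # as long as every individual call to *work* returns.
  n.times { done.receive }
end

counter = Atomic(Int32).new(0)
run_bounded(4, 2.seconds) { |_i| counter.add(1) >= 1_000 }
```

The point of the design is that termination no longer depends on the test succeeding: a worker that never observes its exit condition still returns at the deadline, so the caller fails instead of hanging.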

Refactors and abstracts the stress test runs so no loop will run forever: the "thread setup" part has been removed, the main fiber won't block waiting for the threads to be ready, and each thread's loop will eventually time out and the thread will return, so the main fiber won't block while joining the threads. This fixes the regular CI failures that occurred often on Darwin, and that may happen on Linux when running both specs in tight loops multiple times in parallel to overload the CPU cores.

straight-shoota · 2025-12-02T20:05:44Z

With this patch, the spec does not get stuck any more on my machine using seed 88224. But it still blocks with 12968 🤷

There are two STRESS-* threads left with the following backtraces:

Thread 53 (Thread 0x7896727fe700 (LWP 1606)):
#0  0x00007896bf55ba47 in epoll_wait (epfd=67, events=0x7896727fcda8, maxevents=128,
    timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005a853e19d391 in wait () at /mnt/src/crystal/system/unix/epoll.cr:49
#2  0x00005a853e1a596b in run () at /mnt/src/crystal/event_loop/epoll.cr:52
#3  0x00005a853e1922cd in reschedule () at /mnt/src/crystal/scheduler.cr:144
#4  0x00005a853e19224a in reschedule () at /mnt/src/crystal/scheduler.cr:62
#5  0x00005a853e194ec6 in suspend () at /mnt/src/fiber.cr:351
#6  0x00005a853e1c5f4b in lock_slow () at /mnt/src/crystal/fd_lock.cr:122
#7  0x00005a853e1800d2 in system_read () at /mnt/src/crystal/fd_lock.cr:65
#8  0x00005a853e17ff9d in unbuffered_read () at /mnt/src/io/file_descriptor.cr:330
#9  0x00005a853e17fdd2 in read () at /mnt/src/io/buffered.cr:91
#10 0x00005a853e17fc66 in read_fully? () at /mnt/src/io.cr:544
#11 0x00005a853e17fb9b in read_fully () at /mnt/src/io.cr:527
#12 0x00005a853e17e936 in random_bytes () at /mnt/src/crystal/system/unix/urandom.cr:20
#13 0x00005a853e2c6e16 in random_bytes () at /mnt/src/random/secure.cr:27
#14 0x00005a853e2c6d21 in rand_type () at /mnt/src/random/secure.cr:30
#15 0x00005a853e2c6c61 in rand_type () at /mnt/src/random/secure.cr:30
#16 0x00005a853e2c6bdb in rand_range () at /mnt/src/random.cr:170
#17 0x00005a853e2c6aa7 in rand () at /mnt/src/random.cr:335
#18 0x00005a853e2c22ff in new () at /mnt/src/random/pcg32.cr:43
#19 0x00005a853e45e1ae in thread_default () at /mnt/src/random.cr:58
#20 0x00005a853ec86450 in sample () at /mnt/src/indexable.cr:968
#21 0x00005a853d951ac0 in -> ()
    at /mnt/spec/std/fiber/execution_context/runnables_spec.cr:242
#22 0x00005a853d94724d in -> () at /mnt/src/primitives.cr:414
#23 0x00005a853d9473a0 in -> () at /mnt/src/primitives.cr:414
#24 0x00005a853e1930c2 in start () at /mnt/src/primitives.cr:414
#25 0x00005a853e193b3e in thread_proc () at /mnt/src/crystal/system/unix/pthread.cr:47
#26 0x00005a853d760e56 in ~procProc(Pointer(Void), Pointer(Void)) ()
    at /mnt/spec/std/thread_spec.cr:8
#27 0x00005a853ef104f1 in GC_inner_start_routine ()
#28 0x00005a853ef07cd3 in GC_call_with_stack_base ()
#29 0x00007896bfc4e6db in start_thread (arg=0x7896727fe700) at pthread_create.c:463
#30 0x00007896bf55b71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 54 (Thread 0x789671ffd700 (LWP 1607)):
#0  0x00007896bf55ba47 in epoll_wait (epfd=65, events=0x789671ffbda8, maxevents=128,
    timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005a853e19d391 in wait () at /mnt/src/crystal/system/unix/epoll.cr:49
#2  0x00005a853e1a596b in run () at /mnt/src/crystal/event_loop/epoll.cr:52
#3  0x00005a853e1922cd in reschedule () at /mnt/src/crystal/scheduler.cr:144
#4  0x00005a853e19224a in reschedule () at /mnt/src/crystal/scheduler.cr:62
#5  0x00005a853e194ec6 in suspend () at /mnt/src/fiber.cr:351
#6  0x00005a853e1c5f4b in lock_slow () at /mnt/src/crystal/fd_lock.cr:122
#7  0x00005a853e1800d2 in system_read () at /mnt/src/crystal/fd_lock.cr:65
#8  0x00005a853e17ff9d in unbuffered_read () at /mnt/src/io/file_descriptor.cr:330
#9  0x00005a853e17fdd2 in read () at /mnt/src/io/buffered.cr:91
#10 0x00005a853e17fc66 in read_fully? () at /mnt/src/io.cr:544
#11 0x00005a853e17fb9b in read_fully () at /mnt/src/io.cr:527
#12 0x00005a853e17e936 in random_bytes () at /mnt/src/crystal/system/unix/urandom.cr:20
#13 0x00005a853e2c6e16 in random_bytes () at /mnt/src/random/secure.cr:27
#14 0x00005a853e2c6d21 in rand_type () at /mnt/src/random/secure.cr:30
#15 0x00005a853e2c6c61 in rand_type () at /mnt/src/random/secure.cr:30
#16 0x00005a853e2c6bdb in rand_range () at /mnt/src/random.cr:170
#17 0x00005a853e2c6aa7 in rand () at /mnt/src/random.cr:335
#18 0x00005a853e2c22ff in new () at /mnt/src/random/pcg32.cr:43
#19 0x00005a853e45e1ae in thread_default () at /mnt/src/random.cr:58
#20 0x00005a853ec86450 in sample () at /mnt/src/indexable.cr:968
#21 0x00005a853d951ac0 in -> ()
    at /mnt/spec/std/fiber/execution_context/runnables_spec.cr:242
#22 0x00005a853d94724d in -> () at /mnt/src/primitives.cr:414
#23 0x00005a853d9473a0 in -> () at /mnt/src/primitives.cr:414
#24 0x00005a853e1930c2 in start () at /mnt/src/primitives.cr:414
#25 0x00005a853e193b3e in thread_proc () at /mnt/src/crystal/system/unix/pthread.cr:47
#26 0x00005a853d760e56 in ~procProc(Pointer(Void), Pointer(Void)) ()
    at /mnt/spec/std/thread_spec.cr:8
#27 0x00005a853ef104f1 in GC_inner_start_routine ()
#28 0x00005a853ef07cd3 in GC_call_with_stack_base ()
#29 0x00007896bfc4e6db in start_thread (arg=0x789671ffd700) at pthread_create.c:463
#30 0x00007896bf55b71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

So apparently they are waiting for randomness.

ysbaddaden · 2025-12-03T13:04:53Z

Whenever I try to fix the issue, mingw fails, this time on ARM64 😡

Why is it using urandom?! stdlib should always use getrandom on linux 😕

ysbaddaden · 2025-12-03T13:11:34Z

Answering myself: because the libc method check macro doesn't work in older crystal releases!

So, multiple fixes:

- always use getrandom on Linux;
- only fall back to urandom on Android;
- fix the stress test to use a local RNG per thread instead of the default one.
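
The last item, a local RNG per worker, might look like the sketch below. It is illustrative only: fibers stand in for the spec's threads, the seeds are arbitrary constants, and an explicitly seeded `Random::PCG32` is an assumption of this sketch rather than the approach the PR finally took.

```crystal
# Illustrative only: fibers stand in for the spec's threads, and the seeds
# are arbitrary constants chosen for this sketch.
items = (1..100).to_a
done = Channel(Int32).new(4)

4.times do |i|
  spawn do
    # Each worker owns its generator. Nothing here touches the shared default
    # RNG, so no worker triggers its lazy seeding from the system source.
    rng = Random::PCG32.new(0x1234_u64 + i.to_u64)
    picked = 0
    1_000.times do
      items.sample(rng)
      picked += 1
    end
    done.send(picked)
  end
end

total = 0
4.times { total += done.receive }
```

Passing the generator explicitly to `sample` keeps the hot loop free of any file descriptor read, which is where the two stuck threads in the backtraces above were waiting.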

ysbaddaden · 2025-12-03T13:19:49Z

Aside: why is urandom failing with EAGAIN? It should never block.

Maybe it's wrong to make the fd non-blocking, and we should always read blocking instead since it should never block (readiness might not work).

straight-shoota · 2025-12-03T13:59:29Z

Oh, {% if LibC.has_method?(:getrandom) %} is one of the remaining top-level LibC.has_method? calls after #15635. These calls are broken in Crystal < 1.7.
I opened a separate issue about dealing with these: #16475

The spec always fails on CI for this specific target.

straight-shoota · 2025-12-09T22:19:01Z

spec/std/fiber/execution_context/spec_helper.cr

+  # Runs a multithreaded test by starting *n* threads, waiting for all the
+  # threads to have been started the *publish* proc.


Suggested change

-  # Runs a multithreaded test by starting *n* threads, waiting for all the
-  # threads to have been started the *publish* proc.
+  # Runs a multithreaded test by starting *n* threads, waiting for all the
+  # threads to have been started, then runs the *publish* proc.

straight-shoota · 2025-12-09T22:20:45Z

src/random.cr

  end

+  # See `#split`.
+  def self.split : Random


suggestion: We should probably add this in a separate PR because it's adding a new public feature.

ysbaddaden added 2 commits December 2, 2025 18:27

Add Thread::WaitGroup#wait(time)

d4d9bb5

ysbaddaden self-assigned this Dec 2, 2025

ysbaddaden added kind:bug topic:stdlib:runtime topic:multithreading labels Dec 2, 2025

github-project-automation bot added this to Multi-threading Dec 2, 2025

github-project-automation bot moved this to Review in Multi-threading Dec 2, 2025

fixup! Fix: execution context queue stress tests failures

b434a79

straight-shoota mentioned this pull request Dec 3, 2025

Broken LibC.has_method? macros in top-level code #16475

Open

ysbaddaden mentioned this pull request Dec 4, 2025

ExecutionContext::Runnables stress test is flaky with Crystal 1.0 #16470

Closed

ysbaddaden and others added 4 commits December 4, 2025 11:24

Add Random.split [fixup crystal-lang#16342]

7a7e30c

Split random instead of creating one Random.thread_default per thread

85e4fd7

Fix: disable spec on aarch64-windows

07cdf45

The spec always fails on CI for this specific target.

Merge branch 'master' into fix/execution-context-stress-test-failures

0bc9ca3

straight-shoota approved these changes Dec 9, 2025
