@ysbaddaden
Contributor

Refactors and abstracts the stress test specs of Fiber::ExecutionContext::Runnables and Fiber::ExecutionContext::GlobalQueue so that no loop can run forever: the "thread setup" part has been removed, so the main fiber won't block waiting for the threads to be ready; and the threads' loops will eventually time out and the threads return, so the main fiber won't block while joining the threads.

I abstracted a helper because the different tests used the same structure, and it was painful & noisy to duplicate the logic.

This fixes the regular CI failures that occurred often on Darwin, and that I just reproduced on Linux by running these specs in tight loops, multiple times in parallel, to overload the CPU cores.

Might fix #16470, or at least let it fail (not hang for 6h).
Related to #15630.
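
A minimal sketch of the "loops will eventually time out" idea, not the helper added by this PR: `run_until` is a hypothetical name, and the one-second and ten-millisecond timeouts are arbitrary values chosen for illustration. The block is polled until it succeeds or a monotonic deadline passes, so the caller always gets control back.

```crystal
# Hypothetical helper (not part of the PR): calls the block repeatedly until it
# returns a truthy value or *timeout* elapses. Returns whether the block succeeded.
def run_until(timeout : Time::Span, &) : Bool
  deadline = Time.monotonic + timeout
  until yield
    # Give up instead of spinning forever when the condition never holds.
    return false if Time.monotonic > deadline
  end
  true
end

# A condition that eventually holds: the loop exits as soon as it does.
count = 0
finished = run_until(1.second) { (count += 1) >= 1000 }
puts finished

# A condition that never holds: the loop still returns once the deadline passes.
stuck = run_until(10.milliseconds) { false }
puts stuck
```

With a bound like this, a thread whose condition never becomes true returns instead of spinning, so joining it cannot hang the main fiber.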

Refactors and abstracts the stress test runs so no loop will run
forever: the "thread setup" part has been removed, the main fiber won't
block waiting for the threads to be ready, and the threads' loops will
eventually time out, and the threads return, so the main fiber won't block
while joining the threads.

This fixes the regular CI failures that occurred often on Darwin, and may
happen on Linux when running both specs in tight loops multiple times
in parallel to overload the CPU cores.
@ysbaddaden ysbaddaden self-assigned this Dec 2, 2025
@ysbaddaden ysbaddaden added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:runtime topic:multithreading labels Dec 2, 2025
@straight-shoota
Member

With this patch, the spec does not get stuck any more on my machine using seed 88224. But it still blocks with 12968 🤷

There are two STRESS-* threads left with the following backtraces:

Thread 53 (Thread 0x7896727fe700 (LWP 1606)):
#0  0x00007896bf55ba47 in epoll_wait (epfd=67, events=0x7896727fcda8, maxevents=128,
    timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005a853e19d391 in wait () at /mnt/src/crystal/system/unix/epoll.cr:49
#2  0x00005a853e1a596b in run () at /mnt/src/crystal/event_loop/epoll.cr:52
#3  0x00005a853e1922cd in reschedule () at /mnt/src/crystal/scheduler.cr:144
#4  0x00005a853e19224a in reschedule () at /mnt/src/crystal/scheduler.cr:62
#5  0x00005a853e194ec6 in suspend () at /mnt/src/fiber.cr:351
#6  0x00005a853e1c5f4b in lock_slow () at /mnt/src/crystal/fd_lock.cr:122
#7  0x00005a853e1800d2 in system_read () at /mnt/src/crystal/fd_lock.cr:65
#8  0x00005a853e17ff9d in unbuffered_read () at /mnt/src/io/file_descriptor.cr:330
#9  0x00005a853e17fdd2 in read () at /mnt/src/io/buffered.cr:91
#10 0x00005a853e17fc66 in read_fully? () at /mnt/src/io.cr:544
#11 0x00005a853e17fb9b in read_fully () at /mnt/src/io.cr:527
#12 0x00005a853e17e936 in random_bytes () at /mnt/src/crystal/system/unix/urandom.cr:20
#13 0x00005a853e2c6e16 in random_bytes () at /mnt/src/random/secure.cr:27
#14 0x00005a853e2c6d21 in rand_type () at /mnt/src/random/secure.cr:30
#15 0x00005a853e2c6c61 in rand_type () at /mnt/src/random/secure.cr:30
#16 0x00005a853e2c6bdb in rand_range () at /mnt/src/random.cr:170
#17 0x00005a853e2c6aa7 in rand () at /mnt/src/random.cr:335
#18 0x00005a853e2c22ff in new () at /mnt/src/random/pcg32.cr:43
#19 0x00005a853e45e1ae in thread_default () at /mnt/src/random.cr:58
#20 0x00005a853ec86450 in sample () at /mnt/src/indexable.cr:968
#21 0x00005a853d951ac0 in -> ()
    at /mnt/spec/std/fiber/execution_context/runnables_spec.cr:242
#22 0x00005a853d94724d in -> () at /mnt/src/primitives.cr:414
#23 0x00005a853d9473a0 in -> () at /mnt/src/primitives.cr:414
#24 0x00005a853e1930c2 in start () at /mnt/src/primitives.cr:414
#25 0x00005a853e193b3e in thread_proc () at /mnt/src/crystal/system/unix/pthread.cr:47
#26 0x00005a853d760e56 in ~procProc(Pointer(Void), Pointer(Void)) ()
    at /mnt/spec/std/thread_spec.cr:8
#27 0x00005a853ef104f1 in GC_inner_start_routine ()
#28 0x00005a853ef07cd3 in GC_call_with_stack_base ()
#29 0x00007896bfc4e6db in start_thread (arg=0x7896727fe700) at pthread_create.c:463
#30 0x00007896bf55b71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 54 (Thread 0x789671ffd700 (LWP 1607)):
#0  0x00007896bf55ba47 in epoll_wait (epfd=65, events=0x789671ffbda8, maxevents=128,
    timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005a853e19d391 in wait () at /mnt/src/crystal/system/unix/epoll.cr:49
#2  0x00005a853e1a596b in run () at /mnt/src/crystal/event_loop/epoll.cr:52
#3  0x00005a853e1922cd in reschedule () at /mnt/src/crystal/scheduler.cr:144
#4  0x00005a853e19224a in reschedule () at /mnt/src/crystal/scheduler.cr:62
#5  0x00005a853e194ec6 in suspend () at /mnt/src/fiber.cr:351
#6  0x00005a853e1c5f4b in lock_slow () at /mnt/src/crystal/fd_lock.cr:122
#7  0x00005a853e1800d2 in system_read () at /mnt/src/crystal/fd_lock.cr:65
#8  0x00005a853e17ff9d in unbuffered_read () at /mnt/src/io/file_descriptor.cr:330
#9  0x00005a853e17fdd2 in read () at /mnt/src/io/buffered.cr:91
#10 0x00005a853e17fc66 in read_fully? () at /mnt/src/io.cr:544
#11 0x00005a853e17fb9b in read_fully () at /mnt/src/io.cr:527
#12 0x00005a853e17e936 in random_bytes () at /mnt/src/crystal/system/unix/urandom.cr:20
#13 0x00005a853e2c6e16 in random_bytes () at /mnt/src/random/secure.cr:27
#14 0x00005a853e2c6d21 in rand_type () at /mnt/src/random/secure.cr:30
#15 0x00005a853e2c6c61 in rand_type () at /mnt/src/random/secure.cr:30
#16 0x00005a853e2c6bdb in rand_range () at /mnt/src/random.cr:170
#17 0x00005a853e2c6aa7 in rand () at /mnt/src/random.cr:335
#18 0x00005a853e2c22ff in new () at /mnt/src/random/pcg32.cr:43
#19 0x00005a853e45e1ae in thread_default () at /mnt/src/random.cr:58
#20 0x00005a853ec86450 in sample () at /mnt/src/indexable.cr:968
#21 0x00005a853d951ac0 in -> ()
    at /mnt/spec/std/fiber/execution_context/runnables_spec.cr:242
#22 0x00005a853d94724d in -> () at /mnt/src/primitives.cr:414
#23 0x00005a853d9473a0 in -> () at /mnt/src/primitives.cr:414
#24 0x00005a853e1930c2 in start () at /mnt/src/primitives.cr:414
#25 0x00005a853e193b3e in thread_proc () at /mnt/src/crystal/system/unix/pthread.cr:47
#26 0x00005a853d760e56 in ~procProc(Pointer(Void), Pointer(Void)) ()
    at /mnt/spec/std/thread_spec.cr:8
#27 0x00005a853ef104f1 in GC_inner_start_routine ()
#28 0x00005a853ef07cd3 in GC_call_with_stack_base ()
#29 0x00007896bfc4e6db in start_thread (arg=0x789671ffd700) at pthread_create.c:463
#30 0x00007896bf55b71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

So apparently they are waiting for randomness.

@ysbaddaden
Contributor Author

Whenever I try to fix the issue, mingw fails, this time on ARM64 😡

Why is it using urandom?! stdlib should always use getrandom on Linux 😕

@ysbaddaden
Contributor Author

ysbaddaden commented Dec 3, 2025

Answering myself: because the libc method check macro doesn't work in older crystal releases!

So, multiple fixes:

  1. always use getrandom on Linux;
  2. only fall back to urandom on Android;
  3. fix the stress test to use a local RNG per thread instead of the default one.
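
A sketch of the third fix, assuming each thread simply builds its own explicitly seeded PCG32 generator; this is not necessarily how the PR does it. `local_rng`, the seed value, and the worker index are hypothetical. Seeding explicitly skips the `Random::PCG32.new` overload visible in the backtraces above, which draws its seed from `Random::Secure` and so ends up reading urandom.

```crystal
# Hypothetical helper (not from the PR): builds an RNG owned by one worker.
# The generator is seeded explicitly, so no secure random source is read
# and no state is shared between workers.
def local_rng(seed : UInt64, worker : Int32) : Random
  Random::PCG32.new(seed, worker.to_u64)
end

values = [10, 20, 30, 40]

# Each worker would create its generator once, then pass it explicitly
# instead of relying on the default RNG used by `values.sample`.
rng = local_rng(12968_u64, 0)
picks = Array.new(5) { values.sample(rng) }
puts picks.all? { |v| values.includes?(v) }
```

Passing the generator explicitly to `#sample` keeps the hot loop free of any lazily initialized shared state.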

@ysbaddaden
Contributor Author

ysbaddaden commented Dec 3, 2025

Aside: why is urandom failing with EAGAIN? It should never block.

Maybe it's wrong to make the fd non-blocking. Since urandom should never block, we should probably always do a blocking read instead (readiness notifications might not work for it).

@straight-shoota
Copy link
Member

Oh, {% if LibC.has_method?(:getrandom) %} is one of the remaining top-level LibC.has_method? calls after #15635. These calls are broken in Crystal < 1.7.
I opened a separate issue about dealing with these: #16475

Comment on lines +23 to +24
# Runs a multithreaded test by starting *n* threads, waiting for all the
# threads to have been started the *publish* proc.
Member

Suggested change
- # Runs a multithreaded test by starting *n* threads, waiting for all the
- # threads to have been started the *publish* proc.
+ # Runs a multithreaded test by starting *n* threads, waiting for all the
+ # threads to have been started, then runs the *publish* proc.

end

# See `#split`.
def self.split : Random
Member

suggestion: We should probably add this in a separate PR because it's adding a new public feature.

Labels

kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:multithreading topic:stdlib:runtime

Projects

Status: Review

Development

Successfully merging this pull request may close these issues.

ExecutionContext::Runnables stress test is flaky with Crystal 1.0

2 participants