Gracefully shutdown workers on timeout or high mem threshold#151
Gracefully shutdown workers on timeout or high mem threshold#151nickrobinson251 wants to merge 4 commits intomainfrom
Conversation
| end | ||
| signal = Base.SIGTERM | ||
| while true | ||
| while !process_exited(w.process) |
There was a problem hiding this comment.
I forgot that we never swapped this package over to use ConcurrentUtilities workers. We should probably do that and maintain that code in one place.
There was a problem hiding this comment.
it sure would be nice to maintain this in only one place... on the other hand i don't think it's ideal for a test framework to have dependencies (since then you can't use it to the test code that requires a different version of that same dependency), so i think if we were to maintain it in one place (rather than maintain a duplicate codebase here, that will slightly diverge over time) then we'd want ReTestItems to vendor it some other way
There was a problem hiding this comment.
(for vendoring it in some other way, we could include the other repo as a git submodule, or have the build script checkout the repo from a committed tag or something like that.)
but for now i agree it makes sense to keep it duplicated until we invest in a cleanup like that.
|
julia> w = ReTestItems.Worker(; threads="2")
Worker(pid=28632)
julia> close(w, :foo)
┌ Debug: closing worker 28632 from foo
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:112
┌ Debug: Error processing responses from worker 28632
│ exception =
│ EOFError: read end of file
│ Stacktrace:
│ [1] read(this::Sockets.TCPSocket, ::Type{UInt8})
│ @ Base ./stream.jl:980
│ [2] deserialize
│ @ ~/repos/rai-julia/usr/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814 [inlined]
│ [3] deserialize(s::Sockets.TCPSocket)
│ @ Serialization ~/repos/rai-julia/usr/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:801
│ [4] process_responses(w::ReTestItems.Workers.Worker, ev::Base.Event)
│ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:233
│ [5] (::ReTestItems.Workers.var"#9#17"{ReTestItems.Workers.Worker, Base.Event})()
│ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:185
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:250
┌ Debug: terminating worker 28632 from process_responses
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:75
┌ Debug: sending signal 15 to worker 28632
└ @ ReTestItems.Workers ~/repos/ReTestItems.jl/src/workers.jl:86
Worker 28632: [28632] signal (15): Terminated: 15
in expression starting at none:1 |
|
Hrmm yeah, good call nick. It looks like when the process shutsdown gracefully, it closes the pipe abruptly, surprising the "process_responses" task. So probably there needs to be a graceful shutdown message in the other direction too; either sent by the original task that is requesting the shutdown, or by the worker who is shutting down. |
|
Or you could set a "closing" field on the struct and check that in the exception handler. |
| @testset "clean shutdown ($w)" begin | ||
| close(w) | ||
| @test !process_running(w.process) | ||
| @test w.process.termsignal == Base.SIGTERM |
There was a problem hiding this comment.
I'm wondering where the "SIGTERM" comes from? Your PR description says:
terminate workers with SIGTERM
but i think if it's a graceful close, the process just exits, meaning there is no signal at all.
So maybe this should be:
| @test w.process.termsignal == Base.SIGTERM | |
| @test w.process.termsignal == 0 |
?
That's what i'm seeing when running the tests:
Expression: w.process.termsignal == Base.SIGTERM
Evaluated: 0 == 15
SIGTERM(neverSIGKILL) if we (the test framework) are doing it ourselvescc @NHDaly @kpamnany @Drvi
edit: i don't trust this existing
closefunction yet, so marked draft