Skip to content

Signal to socket refactor#502

Merged
TApplencourt merged 12 commits into
develfrom
signal_to_socket_refactor
May 20, 2026
Merged

Signal to socket refactor#502
TApplencourt merged 12 commits into
develfrom
signal_to_socket_refactor

Conversation

@TApplencourt
Copy link
Copy Markdown
Collaborator

No description provided.

TApplencourt and others added 9 commits May 19, 2026 16:31
The Ruby driver coordinated with its child daemons (sync_daemon_mpi,
sync_daemon_fs, thapi_sampling_daemon) via POSIX real-time signals
(SIGRTMIN+0..3) and a busy-spin on a Signal.trap-toggled @@ready flag.
That flag was shared across SyncDaemon and SamplingDaemon, which was
a latent race.

Switch to one SOCK_SEQPACKET UNIX socketpair per daemon. The parent
sends short ASCII tokens (INIT, LOCAL_BARRIER, GLOBAL_BARRIER, FINISH)
and blocks on recvmsg for the daemon's READY reply. No signal handlers,
no busy waits, no shared state.

The argv contract for the daemon binaries changes: they now take a
single <fd> argument (parent_pid is gone). These are internal binaries
shipped under BINDIR, but worth noting.

sync_daemon_mpi.c is renamed to .cpp so MPI session setup and the
message loop share the same C++ idioms (string_view dispatch); the
build switches to MPICXX, and configure.ac wraps AX_MPI/AX_MPI_SESSION
in AC_LANG_PUSH([C++]) so MPICXX is set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove manual = alignment on constexpr/using declarations, collapse
the dlsym reinterpret_cast lines to fit clang-format's continuation
indent, and break the if-then-return chains onto two lines (LLVM
default disallows short ifs on one line).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
C++ forbids a goto from jumping over a variable initialization.
CHECK_MPI expands to goto fn_exit, so const int fd must be declared
before the first CHECK_MPI call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ruby creates UNIX sockets with O_NONBLOCK on Linux by default. The
inherited fd carried that flag into the daemon, so its blocking read()
returned EAGAIN ("Resource temporarily unavailable") and the daemon
exited before processing any message. Clear the flag on the child end
before spawn in both SyncDaemon and SamplingDaemon.

Replace integration_tests/light_iprof_only_sync.sh with a Ruby driver
that speaks the new socketpair protocol. The old shell driver spawned
the daemon directly using the signal protocol (passing PARENT_PID as
argv) and is incompatible with the new <fd> argv contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The setter is not loaded by default in Ruby 3+; CI hit NoMethodError.
Reproduced locally and confirmed the require resolves it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extract the socketpair protocol shared by iprof and its daemons:

- Ruby side: new DaemonClient module holds MSG_INIT/FINISH/READY,
  lazy_exec, finalize, wait_ready, and the spawn boilerplate (incl.
  the nonblock= dance). SyncDaemon and SamplingDaemon each include it
  and supply only their own gating predicate and per-instance state.
- C++ side: new utils/include/daemon_proto.hpp holds the MSG_*
  string_view constants, send_msg, and recv_expect. Both
  sync_daemon_mpi and thapi_sampling_daemon use it.
- Also: shrink the SyncDaemon daemon-path case statement, drop the
  send_ready lambda in sync_daemon_fs, and add the new header to
  utils/Makefile.am noinst_HEADERS.

Net: ~80 fewer lines across the daemon code (header included).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
These LOGGER.debug { "spawn(...)" } lines predated the socketpair
refactor when they included Process.pid (needed for kill-based IPC).
Now that spawn is gated by lazy_exec, the wrapping
LOGGER.with_info_block("Initialize ...Daemon ...") already logs entry
and exit; the inner debug is dead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ctive

- Lift daemon spawn boilerplate into DaemonClient#spawn (mirrors
  Process.spawn signature + log:); fd insertion and `fd => fd` cloexec
  clearing now hidden from callers.
- Both SyncDaemon and SamplingDaemon early-return on `unless active?`,
  so bogus THAPI_SYNC_DAEMON values and missing binaries no longer
  raise on runs that don't actually need a daemon.
- Replace top-level `sampling?` with `SamplingDaemon.active?` (class
  method); add SyncDaemon.active? for symmetry; hoist the instance
  `active?` delegate into DaemonClient.
- SyncDaemon#initialize daemon-path selection rewritten as case/when,
  with a better error pointing at --disable-sync-daemon-mpi.
- Fix SAMPLING_DAEMON_PATH copy-paste bug (was pointing at
  sync_daemon_mpi); switch sampling spawn to argv-form to drop the
  shell middleman.
- LocalMaster.setup guards SamplingDaemon against bt_analysis raising
  before teardown can run.
- Add integration test: bogus THAPI_SYNC_DAEMON is ignored when not
  under MPI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread xprof/Makefile.am Outdated
Comment thread xprof/xprof.rb.in Outdated
Thomas Applencourt and others added 3 commits May 20, 2026 18:54
Was discarding $? from Process.wait. If the daemon crashes during
teardown (e.g., MPI_Finalize segfault — see THAPI_SYNC_DAEMON_MPI_NO_FINALIZE
workaround in parallel_execution.bats), the failure was silent.
Switch to wait2 and warn on non-success.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull '-Wall -Wextra $(WERROR)' (plus '-std=c++17' for C++) out of
each per-target line into AM_CFLAGS / AM_CXXFLAGS. Per-target rules
that already set _CFLAGS / _CXXFLAGS now reference $(AM_*FLAGS)
explicitly, since automake target-specific flags replace rather
than extend AM_*.

Side fix: sampling/thapi_sampling_daemon was missing -std=c++17
entirely. It now picks it up via AM_CXXFLAGS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
clang-format-18 on daemon_proto.hpp (CI runs the same version) and
strip an over-aligned 'then nil' in sync_daemon_fs's case arm.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@TApplencourt TApplencourt changed the title [draft] Signal to socket refactor Signal to socket refactor May 20, 2026
Copy link
Copy Markdown
Contributor

@nscottnichols nscottnichols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@TApplencourt TApplencourt merged commit 3f65297 into devel May 20, 2026
27 checks passed
@TApplencourt TApplencourt deleted the signal_to_socket_refactor branch May 20, 2026 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants