Skip to content

Conversation

@clumens
Copy link
Contributor

@clumens clumens commented Jan 6, 2026

No description provided.

@clumens clumens requested a review from nrwahl2 January 6, 2026 17:00
@clumens clumens force-pushed the execd-ipc-cleanup branch from 00e8ce1 to 720a70b Compare January 9, 2026 19:10
@clumens
Copy link
Contributor Author

clumens commented Jan 9, 2026

There's a push with all the important changes. The next push will just be to rebase on main and resolve a conflict.

This brings it in line with how all the other daemons (aside from
based and controld) are organized.
* Add doxygen comments to explain what each function does.

* Standardize log messages with what's happening in the same functions
  in attrd.

* Standardize variable names (c becomes client, qbc becomes c).

* Make sure return values and comments match what libqb documents.
@clumens clumens force-pushed the execd-ipc-cleanup branch from 720a70b to f7488cd Compare January 9, 2026 19:31
Copy link
Contributor

@nrwahl2 nrwahl2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to merge this and ask you to address the comments in a follow-up PR, but we may have a problem for remote acks on invalid message. Maybe I'll merge it and then merge my own follow-up PR to fix it.

/* We haven't read the complete message yet, so just return. */
return 0;

} else if (rc == pcmk_rc_ok) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never liked that of the three cases we handle, the middle one is the only one that doesn't return immediately. I sort of get it -- the first two are "expected" conditions, while the third is "some kind of error." I hope that we rearrange this later when we functionize the common code. I'd like to see either "error, more, OK" or "more, error, OK", so that the pcmk_rc_ok part doesn't sit inside an if block and so that we assign msg closer to where we use it.

* Rename invalid_msg to execd_invalid_msg and make it publically
  visible.  I think this might be something that other daemons should do
  in the future.

* Rename execd_process_message to execd_handle_request and make it take
  a pcmk__request_t instead of the arguments it was taking.

* Update the callers to construct and pass it a pcmk__request_t, in line
  with how other daemons function right now.
* Add execd_ipc_init to register callback functions and add them to the
  mainloop.  This means the callback struct can now be static.

* Add execd_ipc_cleanup to clean up.  This includes client dropping and
  cleanup tasks that were previously not happening.  I'm not sure if
  this was a bug or if it doesn't really matter, but it's still good to do
  this in all daemons.
This just brings it more in line with what's happening in the other
daemons.
* Add doxygen comments to explain what each function does.

* Standardize log messages with what's happening in the same functions
  in attrd.

* Minor code rearranging like initializing things to NULL and removing
  unnecessary spacing.

This file does a lot of the same kind of IPC stuff as execd_ipc.c does,
but for the specialized case of proxying to/from remote nodes.  It
therefore looks pretty similar to that file, and these changes make it
even more similar.  Nothing here is especially interesting.
I've only done this to files that I touched as part of this patch
series.  There's plenty more cleanups that could be done across execd
and I didn't want to get into that as part of working on its IPC code,
but doing this much seemed worth it.
* Use pcmk__ipc_server_name when calling mainloop_add_ipc_server for all
  but based, which has multiple servers with multliple names.

* Use the same phrasing for each, including pcmk__ipc_log_name.  I'm
  not sure the "inhibiting respawn" message is all that important.

* Upgrade message severity to critical.
request.ipc_flags should be an IPC message flag (crm_ipc_flags), not a
client flag (pcmk__client_flags).  Nothing in the execd_handle_request
chain uses ipc_flags (aside from sending an ACK on error), so just set
them to crm_ipc_flags_none.

And while I'm at it, do the same in a few other places for type clarity.
It shouldn't be possible for client->id to be NULL at this point.  We
have a pcmk__client_t, and the id member is set by execd_ipc_accept ->
pcmk__new_client -> client_from_connection.  The only way to fail at
that point appears to involve an assertion.

Likewise, the check in ipc_proxy_dispatch can also be removed.  For that
matter, there's a redundant check on client != NULL that can be removed
too.
Since handlers can be used for both IPC and CPG, the unregister function
needs to be called after we've also disconnected from the cluster.
And add trivial doxygen to functions that were missing them.
This is just for brevity.  We could potentially also use it for cleaning
up cib_{ro,rw,shm} but that requires changing all other callers of
pcmk__stop_based_ipc which I'm not interested in doing at the moment.
pcmk__ipc_send_ack should be an IPC message flag (crm_ipc_flags), not a
client flag (pcmk__client_flags).  Typically, we get the IPC flags from
the incoming message's pcmk__ipc_header_t (for instance, see the call to
pcmk__client_data2xml in attrd_ipc_dispatch).

However, we don't have that kind of message here.  The remote message is
a remote_header_v0 with the XML appended right after it.  There's no
pcmk__ipc_header_t, so we can't get the IPC flags.

It appears to be good enough to just hardcode crm_ipc_client_response as
the flags, however.  This flag is necessary for pcmk__ipc_create_ack_as
to actually construct an ACK.  None of the other flags seem appropriate.

Two notes:

(1) remote_header_v0 has its own flags member, but I can't find where
they are ever set.

(2) This whole thing almost certainly never sees any testing.  It
requires execd receiving a non-NULL but incorrect (garbled during
transmission?  incorrectly constructed?) message.  In that case, we
probably have bigger problems in which case the ACK handling isn't
getting any attention.
pcmk__ipc_send_ack(client, id, client->flags, PCMK__XE_NACK, NULL,
CRM_EX_PROTOCOL);
pcmk__ipc_send_ack(client, id, crm_ipc_client_response, PCMK__XE_NACK,
NULL, CRM_EX_PROTOCOL);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, behavior on main and on this branch is the same. If you change execd_invalid_msg to just return true, here's what gets logged on the client side of this connection:

Jan 13 11:17:08.127 rhel9-scratch-1 pacemaker-controld  [681195] (lrmd_ipc_connect@lrmd_client.c:1122)  info: Connecting to executor
Jan 13 11:17:08.130 rhel9-scratch-1 pacemaker-controld  [681195] (crm_ipc_send@ipc_client.c:1439)       trace: Sending lrmd IPC request 4 of 135 bytes using -1ms timeout
Jan 13 11:17:08.130 rhel9-scratch-1 pacemaker-controld  [681195] (crm_ipc_send@ipc_client.c:1443)       trace: Text = <lrmd_command t="lrmd" lrmd_op="register" lrmd_clientname="crmd" lrmd_protocol_version="1.2"/>
Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld  [681195] (internal_ipc_get_reply@ipc_client.c:1213)     trace: Expecting reply to lrmd IPC message 4
Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld  [681195] (internal_ipc_get_reply@ipc_client.c:1280)     trace: Received 101-byte reply 4 to lrmd IPC 4: <nack function="execd_ipc_dispatch" line="173" status="76"/>
Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld  [681195] (process_lrmd_handshake_reply@lrmd_client.c:1044)      error: Invalid registration message: (null)
Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld  [681195] (process_lrmd_handshake_reply@lrmd_client.c:1045)      error: Bad reply   <nack function="execd_ipc_dispatch" line="173" status="76"/>
Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld  [681195] (lrmd_ipc_connection_destroy@lrmd_client.c:584)        info: Disconnected from local executor
Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld  [681195] (crm_ipc_destroy@ipc_client.c:993)     trace: Destroying inactive lrmd IPC connection
Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld  [681195] (try_local_executor_connect@controld_execd.c:332)      warning: Failed to connect to the local executor 1 time (30 max): Protocol error

And if we look at process_lrmd_handshake_reply where the reply is, uh, being processed, we see that it is expecting an actual reply, not a NACK. It looks for a PCMK__XA_LRMD_RC instead of a status="" attribute, and it looks for various other attributes that would never be set on a NACK. Basically, it's treating the NACK as an invalid message instead of a valid message that contains an error.

One way to fix this (not in this PR) would be to update process_lrmd_handshake_reply to expect a possible NACK. The other way to fix it (also not in this PR) would be to do what https://projects.clusterlabs.org/T991 suggests and remove the NACK entirely. Instead, we could have the server side of the connection send a reply with PCMK__XA_LRMD_RC set like is currently expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, we could have the server side of the connection send a reply with PCMK__XA_LRMD_RC set like is currently expected.

I suspect this is gonna make the most sense.

@nrwahl2 nrwahl2 merged commit 3e28cf7 into ClusterLabs:main Jan 13, 2026
1 check passed
@clumens clumens deleted the execd-ipc-cleanup branch January 15, 2026 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants