Standardize IPC code in execd #4018

clumens · 2026-01-06T17:00:34Z

No description provided.

daemons/execd/execd_ipc.c

daemons/execd/remoted_tls.c

daemons/execd/execd_ipc.c

daemons/execd/pacemaker-execd.c

daemons/execd/pacemaker-execd.h

daemons/execd/remoted_tls.c

clumens · 2026-01-09T19:11:25Z

There's a push with all the important changes. The next push will just be to rebase on main and resolve a conflict.

This brings it in line with how all the other daemons (aside from based and controld) are organized.

* Add doxygen comments to explain what each function does. * Standardize log messages with what's happening in the same functions in attrd. * Standardize variable names (c becomes client, qbc becomes c). * Make sure return values and comments match what libqb documents.

nrwahl2

I was going to merge this and ask you to address the comments in a follow-up PR, but we may have a problem for remote acks on invalid message. Maybe I'll merge it and then merge my own follow-up PR to fix it.

nrwahl2 · 2026-01-09T22:30:38Z

daemons/execd/execd_ipc.c

        /* We haven't read the complete message yet, so just return. */
        return 0;

    } else if (rc == pcmk_rc_ok) {


I have never liked that of the three cases we handle, the middle one is the only one that doesn't return immediately. I sort of get it -- the first two are "expected" conditions, while the third is "some kind of error." I hope that we rearrange this later when we functionize the common code. I'd like to see either "error, more, OK" or "more, error, OK", so that the pcmk_rc_ok part doesn't sit inside an if block and so that we assign msg closer to where we use it.

daemons/execd/execd_ipc.c

lib/common/ipc_server.c

daemons/execd/remoted_tls.c

daemons/execd/execd_messages.c

daemons/execd/pacemaker-execd.c

daemons/execd/remoted_tls.c

* Rename invalid_msg to execd_invalid_msg and make it publically visible. I think this might be something that other daemons should do in the future. * Rename execd_process_message to execd_handle_request and make it take a pcmk__request_t instead of the arguments it was taking. * Update the callers to construct and pass it a pcmk__request_t, in line with how other daemons function right now.

Nothing uses it yet.

* Add execd_ipc_init to register callback functions and add them to the mainloop. This means the callback struct can now be static. * Add execd_ipc_cleanup to clean up. This includes client dropping and cleanup tasks that were previously not happening. I'm not sure if this was a bug or if it doesn't really matter, but it's still good to do this in all daemons.

This just brings it more in line with what's happening in the other daemons.

* Add doxygen comments to explain what each function does. * Standardize log messages with what's happening in the same functions in attrd. * Minor code rearranging like initializing things to NULL and removing unnecessary spacing. This file does a lot of the same kind of IPC stuff as execd_ipc.c does, but for the specialized case of proxying to/from remote nodes. It therefore looks pretty similar to that file, and these changes make it even more similar. Nothing here is especially interesting.

I've only done this to files that I touched as part of this patch series. There's plenty more cleanups that could be done across execd and I didn't want to get into that as part of working on its IPC code, but doing this much seemed worth it.

* Use pcmk__ipc_server_name when calling mainloop_add_ipc_server for all but based, which has multiple servers with multliple names. * Use the same phrasing for each, including pcmk__ipc_log_name. I'm not sure the "inhibiting respawn" message is all that important. * Upgrade message severity to critical.

request.ipc_flags should be an IPC message flag (crm_ipc_flags), not a client flag (pcmk__client_flags). Nothing in the execd_handle_request chain uses ipc_flags (aside from sending an ACK on error), so just set them to crm_ipc_flags_none. And while I'm at it, do the same in a few other places for type clarity.

It shouldn't be possible for client->id to be NULL at this point. We have a pcmk__client_t, and the id member is set by execd_ipc_accept -> pcmk__new_client -> client_from_connection. The only way to fail at that point appears to involve an assertion. Likewise, the check in ipc_proxy_dispatch can also be removed. For that matter, there's a redundant check on client != NULL that can be removed too.

Since handlers can be used for both IPC and CPG, the unregister function needs to be called after we've also disconnected from the cluster.

And add trivial doxygen to functions that were missing them.

This is just for brevity. We could potentially also use it for cleaning up cib_{ro,rw,shm} but that requires changing all other callers of pcmk__stop_based_ipc which I'm not interested in doing at the moment.

pcmk__ipc_send_ack should be an IPC message flag (crm_ipc_flags), not a client flag (pcmk__client_flags). Typically, we get the IPC flags from the incoming message's pcmk__ipc_header_t (for instance, see the call to pcmk__client_data2xml in attrd_ipc_dispatch). However, we don't have that kind of message here. The remote message is a remote_header_v0 with the XML appended right after it. There's no pcmk__ipc_header_t, so we can't get the IPC flags. It appears to be good enough to just hardcode crm_ipc_client_response as the flags, however. This flag is necessary for pcmk__ipc_create_ack_as to actually construct an ACK. None of the other flags seem appropriate. Two notes: (1) remote_header_v0 has its own flags member, but I can't find where they are ever set. (2) This whole thing almost certainly never sees any testing. It requires execd receiving a non-NULL but incorrect (garbled during transmission? incorrectly constructed?) message. In that case, we probably have bigger problems in which case the ACK handling isn't getting any attention.

clumens · 2026-01-13T16:30:23Z

daemons/execd/remoted_tls.c

-        pcmk__ipc_send_ack(client, id, client->flags, PCMK__XE_NACK, NULL,
-                           CRM_EX_PROTOCOL);
+        pcmk__ipc_send_ack(client, id, crm_ipc_client_response, PCMK__XE_NACK,
+                           NULL, CRM_EX_PROTOCOL);


Okay, behavior on main and on this branch is the same. If you change execd_invalid_msg to just return true, here's what gets logged on the client side of this connection:

Jan 13 11:17:08.127 rhel9-scratch-1 pacemaker-controld [681195] (lrmd_ipc_connect@lrmd_client.c:1122) info: Connecting to executor Jan 13 11:17:08.130 rhel9-scratch-1 pacemaker-controld [681195] (crm_ipc_send@ipc_client.c:1439) trace: Sending lrmd IPC request 4 of 135 bytes using -1ms timeout Jan 13 11:17:08.130 rhel9-scratch-1 pacemaker-controld [681195] (crm_ipc_send@ipc_client.c:1443) trace: Text = <lrmd_command t="lrmd" lrmd_op="register" lrmd_clientname="crmd" lrmd_protocol_version="1.2"/> Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld [681195] (internal_ipc_get_reply@ipc_client.c:1213) trace: Expecting reply to lrmd IPC message 4 Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld [681195] (internal_ipc_get_reply@ipc_client.c:1280) trace: Received 101-byte reply 4 to lrmd IPC 4: <nack function="execd_ipc_dispatch" line="173" status="76"/> Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld [681195] (process_lrmd_handshake_reply@lrmd_client.c:1044) error: Invalid registration message: (null) Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld [681195] (process_lrmd_handshake_reply@lrmd_client.c:1045) error: Bad reply <nack function="execd_ipc_dispatch" line="173" status="76"/> Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld [681195] (lrmd_ipc_connection_destroy@lrmd_client.c:584) info: Disconnected from local executor Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld [681195] (crm_ipc_destroy@ipc_client.c:993) trace: Destroying inactive lrmd IPC connection Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld [681195] (try_local_executor_connect@controld_execd.c:332) warning: Failed to connect to the local executor 1 time (30 max): Protocol error

And if we look at process_lrmd_handshake_reply where the reply is, uh, being processed, we see that it is expecting an actual reply, not a NACK. It looks for a PCMK__XA_LRMD_RC instead of a status="" attribute, and it looks for various other attributes that would never be set on a NACK. Basically, it's treating the NACK as an invalid message instead of a valid message that contains an error.

One way to fix this (not in this PR) would be to update process_lrmd_handshake_reply to expect a possible NACK. The other way to fix it (also not in this PR) would be to do what https://projects.clusterlabs.org/T991 suggests and remove the NACK entirely. Instead, we could have the server side of the connection send a reply with PCMK__XA_LRMD_RC set like is currently expected.

Instead, we could have the server side of the connection send a reply with PCMK__XA_LRMD_RC set like is currently expected.

I suspect this is gonna make the most sense.

clumens requested a review from nrwahl2 January 6, 2026 17:00

nrwahl2 reviewed Jan 7, 2026

View reviewed changes

clumens force-pushed the execd-ipc-cleanup branch from 00e8ce1 to 720a70b Compare January 9, 2026 19:10

clumens added 3 commits January 9, 2026 14:11

Refactor: daemons: Split up IPC code in execd.

d629ce3

This brings it in line with how all the other daemons (aside from based and controld) are organized.

Refactor: daemons: Rename functions in execd_ipc.c.

462e7ab

clumens force-pushed the execd-ipc-cleanup branch from 720a70b to f7488cd Compare January 9, 2026 19:31

nrwahl2 reviewed Jan 10, 2026

View reviewed changes

clumens added 14 commits January 12, 2026 16:38

Refactor: libcrmcommon: Add pcmk__serve_execd_ipc.

b436713

Nothing uses it yet.

Refactor: daemons: Add some additional checks to execd_ipc_dispatch.

7f48b4f

This just brings it more in line with what's happening in the other daemons.

Low: daemons: Clean up XML memory in IPC dispatch functions.

1dda700

Refactor: daemons: Don't call _unregister_handlers from _ipc_cleanup.

0c2b94b

Since handlers can be used for both IPC and CPG, the unregister function needs to be called after we've also disconnected from the cluster.

Refactor: daemons: Use g_clear_pointer in IPC cleanup functions.

74852a8

And add trivial doxygen to functions that were missing them.

Refactor: daemons: Use g_clear_pointer in ipc_proxy_cleanup.

1eb3aca

This is just for brevity. We could potentially also use it for cleaning up cib_{ro,rw,shm} but that requires changing all other callers of pcmk__stop_based_ipc which I'm not interested in doing at the moment.

clumens force-pushed the execd-ipc-cleanup branch from f7488cd to 69411f2 Compare January 13, 2026 16:24

clumens commented Jan 13, 2026

View reviewed changes

nrwahl2 approved these changes Jan 13, 2026

View reviewed changes

nrwahl2 merged commit 3e28cf7 into ClusterLabs:main Jan 13, 2026
1 check passed

clumens deleted the execd-ipc-cleanup branch January 15, 2026 17:40

Standardize IPC code in execd #4018

Standardize IPC code in execd #4018

Conversation

clumens commented Jan 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clumens commented Jan 9, 2026

Uh oh!

nrwahl2 left a comment

Choose a reason for hiding this comment

Uh oh!

nrwahl2 Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clumens Jan 13, 2026

Choose a reason for hiding this comment

Uh oh!

nrwahl2 Jan 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants