-
Notifications
You must be signed in to change notification settings - Fork 352
Standardize IPC code in execd #4018
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
00e8ce1 to
720a70b
Compare
|
There's a push with all the important changes. The next push will just be to rebase on main and resolve a conflict. |
This brings it in line with how all the other daemons (aside from based and controld) are organized.
* Add doxygen comments to explain what each function does. * Standardize log messages with what's happening in the same functions in attrd. * Standardize variable names (c becomes client, qbc becomes c). * Make sure return values and comments match what libqb documents.
720a70b to
f7488cd
Compare
nrwahl2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to merge this and ask you to address the comments in a follow-up PR, but we may have a problem for remote acks on invalid message. Maybe I'll merge it and then merge my own follow-up PR to fix it.
| /* We haven't read the complete message yet, so just return. */ | ||
| return 0; | ||
|
|
||
| } else if (rc == pcmk_rc_ok) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have never liked that of the three cases we handle, the middle one is the only one that doesn't return immediately. I sort of get it -- the first two are "expected" conditions, while the third is "some kind of error." I hope that we rearrange this later when we functionize the common code. I'd like to see either "error, more, OK" or "more, error, OK", so that the pcmk_rc_ok part doesn't sit inside an if block and so that we assign msg closer to where we use it.
* Rename invalid_msg to execd_invalid_msg and make it publically visible. I think this might be something that other daemons should do in the future. * Rename execd_process_message to execd_handle_request and make it take a pcmk__request_t instead of the arguments it was taking. * Update the callers to construct and pass it a pcmk__request_t, in line with how other daemons function right now.
Nothing uses it yet.
* Add execd_ipc_init to register callback functions and add them to the mainloop. This means the callback struct can now be static. * Add execd_ipc_cleanup to clean up. This includes client dropping and cleanup tasks that were previously not happening. I'm not sure if this was a bug or if it doesn't really matter, but it's still good to do this in all daemons.
This just brings it more in line with what's happening in the other daemons.
* Add doxygen comments to explain what each function does. * Standardize log messages with what's happening in the same functions in attrd. * Minor code rearranging like initializing things to NULL and removing unnecessary spacing. This file does a lot of the same kind of IPC stuff as execd_ipc.c does, but for the specialized case of proxying to/from remote nodes. It therefore looks pretty similar to that file, and these changes make it even more similar. Nothing here is especially interesting.
I've only done this to files that I touched as part of this patch series. There's plenty more cleanups that could be done across execd and I didn't want to get into that as part of working on its IPC code, but doing this much seemed worth it.
* Use pcmk__ipc_server_name when calling mainloop_add_ipc_server for all but based, which has multiple servers with multliple names. * Use the same phrasing for each, including pcmk__ipc_log_name. I'm not sure the "inhibiting respawn" message is all that important. * Upgrade message severity to critical.
request.ipc_flags should be an IPC message flag (crm_ipc_flags), not a client flag (pcmk__client_flags). Nothing in the execd_handle_request chain uses ipc_flags (aside from sending an ACK on error), so just set them to crm_ipc_flags_none. And while I'm at it, do the same in a few other places for type clarity.
It shouldn't be possible for client->id to be NULL at this point. We have a pcmk__client_t, and the id member is set by execd_ipc_accept -> pcmk__new_client -> client_from_connection. The only way to fail at that point appears to involve an assertion. Likewise, the check in ipc_proxy_dispatch can also be removed. For that matter, there's a redundant check on client != NULL that can be removed too.
Since handlers can be used for both IPC and CPG, the unregister function needs to be called after we've also disconnected from the cluster.
And add trivial doxygen to functions that were missing them.
This is just for brevity. We could potentially also use it for cleaning
up cib_{ro,rw,shm} but that requires changing all other callers of
pcmk__stop_based_ipc which I'm not interested in doing at the moment.
pcmk__ipc_send_ack should be an IPC message flag (crm_ipc_flags), not a client flag (pcmk__client_flags). Typically, we get the IPC flags from the incoming message's pcmk__ipc_header_t (for instance, see the call to pcmk__client_data2xml in attrd_ipc_dispatch). However, we don't have that kind of message here. The remote message is a remote_header_v0 with the XML appended right after it. There's no pcmk__ipc_header_t, so we can't get the IPC flags. It appears to be good enough to just hardcode crm_ipc_client_response as the flags, however. This flag is necessary for pcmk__ipc_create_ack_as to actually construct an ACK. None of the other flags seem appropriate. Two notes: (1) remote_header_v0 has its own flags member, but I can't find where they are ever set. (2) This whole thing almost certainly never sees any testing. It requires execd receiving a non-NULL but incorrect (garbled during transmission? incorrectly constructed?) message. In that case, we probably have bigger problems in which case the ACK handling isn't getting any attention.
f7488cd to
69411f2
Compare
| pcmk__ipc_send_ack(client, id, client->flags, PCMK__XE_NACK, NULL, | ||
| CRM_EX_PROTOCOL); | ||
| pcmk__ipc_send_ack(client, id, crm_ipc_client_response, PCMK__XE_NACK, | ||
| NULL, CRM_EX_PROTOCOL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, behavior on main and on this branch is the same. If you change execd_invalid_msg to just return true, here's what gets logged on the client side of this connection:
Jan 13 11:17:08.127 rhel9-scratch-1 pacemaker-controld [681195] (lrmd_ipc_connect@lrmd_client.c:1122) info: Connecting to executor
Jan 13 11:17:08.130 rhel9-scratch-1 pacemaker-controld [681195] (crm_ipc_send@ipc_client.c:1439) trace: Sending lrmd IPC request 4 of 135 bytes using -1ms timeout
Jan 13 11:17:08.130 rhel9-scratch-1 pacemaker-controld [681195] (crm_ipc_send@ipc_client.c:1443) trace: Text = <lrmd_command t="lrmd" lrmd_op="register" lrmd_clientname="crmd" lrmd_protocol_version="1.2"/>
Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld [681195] (internal_ipc_get_reply@ipc_client.c:1213) trace: Expecting reply to lrmd IPC message 4
Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld [681195] (internal_ipc_get_reply@ipc_client.c:1280) trace: Received 101-byte reply 4 to lrmd IPC 4: <nack function="execd_ipc_dispatch" line="173" status="76"/>
Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld [681195] (process_lrmd_handshake_reply@lrmd_client.c:1044) error: Invalid registration message: (null)
Jan 13 11:17:08.131 rhel9-scratch-1 pacemaker-controld [681195] (process_lrmd_handshake_reply@lrmd_client.c:1045) error: Bad reply <nack function="execd_ipc_dispatch" line="173" status="76"/>
Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld [681195] (lrmd_ipc_connection_destroy@lrmd_client.c:584) info: Disconnected from local executor
Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld [681195] (crm_ipc_destroy@ipc_client.c:993) trace: Destroying inactive lrmd IPC connection
Jan 13 11:17:08.132 rhel9-scratch-1 pacemaker-controld [681195] (try_local_executor_connect@controld_execd.c:332) warning: Failed to connect to the local executor 1 time (30 max): Protocol error
And if we look at process_lrmd_handshake_reply where the reply is, uh, being processed, we see that it is expecting an actual reply, not a NACK. It looks for a PCMK__XA_LRMD_RC instead of a status="" attribute, and it looks for various other attributes that would never be set on a NACK. Basically, it's treating the NACK as an invalid message instead of a valid message that contains an error.
One way to fix this (not in this PR) would be to update process_lrmd_handshake_reply to expect a possible NACK. The other way to fix it (also not in this PR) would be to do what https://projects.clusterlabs.org/T991 suggests and remove the NACK entirely. Instead, we could have the server side of the connection send a reply with PCMK__XA_LRMD_RC set like is currently expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, we could have the server side of the connection send a reply with
PCMK__XA_LRMD_RCset like is currently expected.
I suspect this is gonna make the most sense.
No description provided.