Page MenuHomeClusterLabs Projects

Make remote reads asynchronous
Open (Work In Progress), HighPublic

Assigned To
Authored By
kgaillot
Jul 30 2024, 1:57 PM
Tags
  • Restricted Project
  • Restricted Project
  • Restricted Project
  • Restricted Project
  • Restricted Project
Referenced Files
None
Subscribers

Description

Currently, pcmk__read_remote_message() reads a single message from a remote connection synchronously with a timeout (typically 60 seconds). That means if the other side stops sending data while a message is being read, the remote daemon can block, making it unable to accept new connections until the timeout expires.

The reproducer would be bringing the network interface down on the connection host. The connection host should be fenced and the connection should restart on the other cluster node.

The read should be made asynchronous. If data isn't immediately available, return to the mainloop and try again in some small amount of time. New connections should cancel any partial reads. The timeout in pcmk__remote_ready() must be considered as well.

See also:

Event Timeline

kgaillot created this task.
kgaillot created this object with edit policy "Restricted Project (Project)".
kgaillot renamed this task from Make remote reads asynchronous to Make remote handshakes and reads asynchronous.Jul 30 2024, 2:14 PM
kgaillot renamed this task from Make remote handshakes and reads asynchronous to Make remote reads asynchronous.
kgaillot updated the task description. (Show Details)
kgaillot updated the task description. (Show Details)
kgaillot added projects: Restricted Project, Restricted Project.

My first thought here is to start by converting pcmk__remote_ready to be an async function that checks the remote to see if it's ready once and sends a result up to the mainloop. However, one caller of this function is lrmd_poll, which we do not use anywhere but I assume is public API. Do we need to continue keeping it (and lrmd_dispatch) around? If so, we might need sync and async versions of this function.

In T855#13286, @clumens wrote:

My first thought here is to start by converting pcmk__remote_ready to be an async function that checks the remote to see if it's ready once and sends a result up to the mainloop. However, one caller of this function is lrmd_poll, which we do not use anywhere but I assume is public API. Do we need to continue keeping it (and lrmd_dispatch) around? If so, we might need sync and async versions of this function.

Yeah that's public API, for someone who wants a non-mainloop connection. We could've deprecated it for 2.1.8, but oh well. Note that it calls pcmk__remote_ready() with a timeout of 0, which means return immediately without blocking, so it would be fine with the check-once approach.

Currently lrmd_remote_client_msg() calls pcmk__read_remote_message(), which loops over read_available_remote_data() up to the timeout.

I think the basic idea will be for lrmd_remote_client_msg() to do something like read_available_remote_data() directly, process the message if it has been received in its entirety, and otherwise return to the main loop. The main loop will call it again when more data is available. I don't think any timeout is necessary since it's simply dispatching data from the socket and has nothing else to do.

cib_remote_msg() (server side) and cib_remote_callback_dispatch()/cib_remote_command_dispatch() (client side) are equivalent for remote CIB administration and should be able to follow the same pattern.

cib_remote_perform_op() calls pcmk__read_remote_message() only for sync replies, so it could keep the current approach.

cib_tls_signon() calls pcmk__read_remote_message() to get an authentication reply. That's probably fine to do sync since it's for the client, so it can keep the current approach too.

lrmd_tls_dispatch() calls pcmk__read_remote_message() and is used in three different contexts: by lrmd_dispatch() for non-mainloop socket dispatch (another public API not used internally, and since there's no mainloop it should keep the current approach); by add_tls_to_mainloop() for mainloop socket dispatch (which should follow the new approach); and as a trigger callback (which is for saved messages and doesn't really matter for this purpose, but can stay with the mainloop socket dispatch).

read_remote_reply() calls pcmk__read_remote_message() and is called by lrmd_send_xml() -> lrmd_tls_send_recv() which is sync, so the current approach is fine.

kgaillot moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Aug 21 2024, 4:37 PM
clumens changed the task status from Open to WIP.Aug 28 2024, 2:43 PM

Following up on the state of this issue so I don't forget...

Everything except lrmd_tls_dispatch has been figured out and committed. I have a patch to introduce lrmd_tls_dispatch_async and it is very simple, but it causes problems that most often occur when running the RemoteBasic CTSlab test. You can also reproduce this on a test cluster by following along with what RemoteBasic does - add two real nodes and a VM to run pacemaker_remote, create the remote resource, and then create the Dummy resource. Stop and then start the cluster, and you'll see all sorts of XML IPC error messages.

These message seem to be caused by the following:

  • lrmd_tls_dispatch_async gets EAGAIN and reads part of a message, puts the bytes into the buffer, and returns to the mainloop.
  • lrmd_send_command gets called somewhere to send a message. This uses lrmd_send_command, which synchronously and immediately wants to read a reply.
  • There's bytes in the buffer (the first part of the message that was read), so read_remote_reply immediately calls pcmk__remote_message_xml in the beginning of that for loop.
  • Errors occur because the bytes in the buffer aren't a complete, valid XML document. These errors cascade causing probes to fail and nodes to be fenced.
  • lrmd_tls_dispatch_async never gets a chance to read the rest of the message.

Basically, mixing async and sync is causing problems and we need to convert lrmd_send_command to async before the last part of this can be done.