Refactor task tracking to ensure we await/abort any spawned tasks #619
Conversation
tnull commented on Aug 19, 2025
Previously we a) added a new internal `Runtime` API that cleans up our internal logic and b) added tracking for spawned background tasks to be able to await/abort them on shutdown. Here we move the tracking into the `Runtime` object, which will allow us to easily extend the tracking to *any* spawned tasks in the next step.
We now drop the generic `spawn` from our internal `Runtime` API, ensuring we always have to use either `spawn_cancellable_background_task` or `spawn_background_task`.
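As an illustration of what such centralized tracking could look like, here is a minimal sketch. The two spawn method names are taken from the description above; everything else (the field names, the use of `tokio_util`'s `CancellationToken`, and the shutdown logic) is an assumption for illustration rather than the actual implementation:

```rust
// Hypothetical sketch: a Runtime that tracks every spawned task so it can be
// awaited or aborted on shutdown. Details beyond the two spawn method names
// mentioned in the description are illustrative only.
use std::sync::Mutex;
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

pub(crate) struct Runtime {
	rt: tokio::runtime::Handle,
	tasks: Mutex<Vec<JoinHandle<()>>>,
	stop_token: CancellationToken,
}

impl Runtime {
	/// Spawns a task that is tracked and awaited on shutdown.
	pub(crate) fn spawn_background_task<F>(&self, fut: F)
	where
		F: std::future::Future<Output = ()> + Send + 'static,
	{
		let handle = self.rt.spawn(fut);
		self.tasks.lock().unwrap().push(handle);
	}

	/// Spawns a task that is cancelled (rather than run to completion) on shutdown.
	pub(crate) fn spawn_cancellable_background_task<F>(&self, fut: F)
	where
		F: std::future::Future<Output = ()> + Send + 'static,
	{
		let token = self.stop_token.clone();
		self.spawn_background_task(async move {
			tokio::select! {
				_ = token.cancelled() => {},
				_ = fut => {},
			}
		});
	}

	/// Signals cancellation and awaits all tracked tasks.
	pub(crate) async fn shutdown(&self) {
		self.stop_token.cancel();
		let handles: Vec<JoinHandle<()>> =
			self.tasks.lock().unwrap().drain(..).collect();
		for handle in handles {
			let _ = handle.await;
		}
	}
}
```

In this shape, cancellable tasks exit promptly via the token on shutdown, while the remaining tracked tasks are awaited to completion.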
👋 Thanks for assigning @TheBlueMatt as a reviewer!
`pub fn spawn<F>(&self, future: F) -> JoinHandle<F::Output>`
FWIW, I considered doing the same for `spawn_blocking`, but a) blocking tasks can't be cancelled anyway and b) it is currently only used by the electrum client, where each call is wrapped in a `tokio::time::timeout`. It would be odd to wait on any of the blocking tasks during shutdown if, during normal operation, we already decided to move on and handled the timeout error.
Any opinions here, @TheBlueMatt?
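For context, a minimal sketch of the timeout-wrapping pattern described above; the 30-second value, the function name, and the closure body are placeholders rather than the actual electrum code. It illustrates why also awaiting the blocking task at shutdown would be largely redundant: during normal operation we already move on once the timeout fires.

```rust
// Illustrative only: the general shape of wrapping a blocking call in a
// timeout. The blocking task itself cannot be aborted; once the timeout
// fires we move on and it finishes (or fails) on its own in the background.
use std::time::Duration;

async fn run_sync_step() -> Result<(), ()> {
	let blocking = tokio::task::spawn_blocking(|| {
		// ... blocking electrum client call would go here ...
	});
	match tokio::time::timeout(Duration::from_secs(30), blocking).await {
		Ok(Ok(())) => Ok(()),
		Ok(Err(_join_err)) => Err(()),
		Err(_elapsed) => Err(()),
	}
}
```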
I don't have a strong opinion, other than that we should push hard to remove the blocking calls there (looks like at least the `tx_sync.sync` call will be non-blocking in LDK 0.2, and we should try the async version of the electrum client?)
> I don't have a strong opinion, other than that we should push hard to remove the blocking calls there (looks like at least the `tx_sync.sync` call will be non-blocking in LDK 0.2
Not sure where you saw that? I don't think anything changed, we'll have esplora-blocking, esplora-async, and the blocking electrum client.
> and we should try the async version of the electrum client?
There currently is no async version of the electrum client though? Evan recently put up an experimental crate (https://crates.io/crates/electrum_streaming_client), but I don't expect `bdk_electrum` to use it too soon. Once it does offer that, I agree that we should also support it / switch to it in `lightning-transaction-sync`.
> Not sure where you saw that? I don't think anything changed, we'll have esplora-blocking, esplora-async, and the blocking electrum client.
Oh, sorry, I misread that.
> There currently is no async version of the electrum client though? Evan recently put up an experimental crate (https://crates.io/crates/electrum_streaming_client), but I don't expect `bdk_electrum` to use it too soon. Once it does offer that, I agree that we should also support it / switch to it in `lightning-transaction-sync`.
Ah, somehow I thought there was. That's... frustrating.
Sadly, we can't just leave this be entirely. If we're connecting blocks, things might happen that make our `ChannelManager` need re-persistence, which means we shouldn't exit the LDK background processor until the sync logic is done with its work. Cf. #620.
> Sadly, we can't just leave this be entirely. If we're connecting blocks, things might happen that make our `ChannelManager` need re-persistence, which means we shouldn't exit the LDK background processor until the sync logic is done with its work. Cf. #620.
We aren't connecting blocks though? We can't abort the blocking tasks, so all we could do is wait for them, but then again, that would leave us open to all kinds of stalling, which is why we employ the timeouts in the first place. FWIW, we also have timeouts on the shutdown itself, because we can't just get stuck forever if some task decides to never return for some reason.
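A minimal sketch of bounding the shutdown wait itself, as mentioned above; the 10-second value and the helper name are assumptions for illustration, not the actual implementation:

```rust
// Illustrative only: cap how long we wait for tracked tasks at shutdown so a
// task that never returns cannot stall us forever.
use std::time::Duration;
use tokio::task::JoinHandle;

async fn await_tasks_with_timeout(handles: Vec<JoinHandle<()>>) {
	let join_all = async {
		for handle in handles {
			// Ignore individual join errors (e.g., aborted or panicked tasks).
			let _ = handle.await;
		}
	};
	if tokio::time::timeout(Duration::from_secs(10), join_all).await.is_err() {
		// Some task never returned; log and proceed with shutdown anyway.
		eprintln!("Timed out waiting on background tasks during shutdown");
	}
}
```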
Hmm, my understanding (based on the name) of calls like `tx_sync.sync(confirmables)` and `bdk_electrum_client.sync(...)` was that they were actually fetching and then connecting blocks/transaction data to the underlying logic (in the first case LDK, in the second BDK). I get that we can't really just block on them finishing their work forever, but this does seem like a somewhat urgent issue.
> Hmm, my understanding (based on the name) of calls like `tx_sync.sync(confirmables)` and `bdk_electrum_client.sync(...)` was that they were actually fetching and then connecting blocks/transaction data to the underlying logic (in the first case LDK, in the second BDK). I get that we can't really just block on them finishing their work forever, but this does seem like a somewhat urgent issue.
Not quite. That's basically true for LDK, but BDK has by now decoupled retrieving the data from applying it to the wallet. So `bdk_electrum_client.sync`/`bdk_electrum_client.full_scan` just do the retrieval, and the results are applied separately afterwards. Meaning, if the higher-level task is cancelled/awaited, the `spawn_blocking` tasks would just eventually end/time out, but nothing would be applied.
The story is indeed a bit different for LDK, as `tx_sync.sync` directly connects the updates to the confirmables as part of the (blocking) task. I wonder if we should refactor `lightning-transaction-sync` to allow for a similar decoupling for users. That would probably require lightningdevkit/rust-lightning#3867, and then `sync` wouldn't directly apply the updates to the confirmables, but rather return a `Vec<BlockUpdate>` or similar, moving the responsibility to apply these updates to all `Confirm` implementations to the user.
FWIW, that would also be nice for the (async) call graph, as otherwise the Tokio task that is responsible for syncing would necessarily call back into our `Confirm` implementation, possibly also driving persistence, resulting broadcasts, etc.
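Purely as an illustration of the decoupling floated above, a hypothetical shape such an API could take; `BlockUpdate`, its fields, and the usage below are invented for this sketch and do not exist in `lightning-transaction-sync` today.

```rust
// Hypothetical sketch only: neither `BlockUpdate` nor this sync shape exist
// in lightning-transaction-sync; they illustrate the proposed decoupling.
use bitcoin::{BlockHash, Transaction, Txid};

pub struct BlockUpdate {
	pub block_hash: BlockHash,
	pub block_height: u32,
	pub confirmed_txs: Vec<Transaction>,
	pub unconfirmed_txids: Vec<Txid>,
}

// Instead of `tx_sync.sync(confirmables)` applying updates internally, the
// user would receive the updates and apply them to each `Confirm` themselves:
//
//   let updates: Vec<BlockUpdate> = tx_sync.sync()?;
//   for update in updates {
//       for confirmable in &confirmables {
//           // apply `update` via the `Confirm` trait methods
//       }
//   }
```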
> The story is indeed a bit different for LDK, as `tx_sync.sync` directly connects the updates to the confirmables as part of the (blocking) task. I wonder if we should refactor `lightning-transaction-sync` to allow for a similar decoupling for users. That would probably require lightningdevkit/rust-lightning#3867, and then `sync` wouldn't directly apply the updates to the confirmables, but rather return a `Vec<BlockUpdate>` or similar, moving the responsibility to apply these updates to all `Confirm` implementations to the user.
That might be one way to address it, yea, but of course the "right" way is for the fetching of resources to actually be async :)
> FWIW, that would also be nice for the (async) call graph, as otherwise the Tokio task that is responsible for syncing would necessarily call back into our `Confirm` implementation, possibly also driving persistence, resulting broadcasts, etc.
AFAIU, as of LDK 0.2 all the persistence (hopefully including the `ChannelMonitor`s) will be spawn-and-don't-block, so it shouldn't be an issue :). That's already the case for transaction broadcasting, so I'm not sure why that's a concern here.
Grrr, CI has really become unusable due to #618 right now :(
TheBlueMatt left a comment
Diff itself LGTM, but I also noticed #620.
Landing this, will open a follow-up PR for #620 tomorrow.