Introduce Runtime object allowing to detect outer runtime context
#543
Conversation
👋 Thanks for assigning @TheBlueMatt as a reviewer!
Force-pushed from d42f762 to 2f43096
Runtime object and allow to take a `tokio::runtime::Handle`
Runtime object allowing to detect outer runtime context
andrei-21 left a comment
I have tested with my prototype; everything works OK.
src/runtime.rs (Outdated)
```rust
pub fn block_on<F: Future>(&self, future: F) -> Result<F::Output, RuntimeError> {
	let handle = self.handle()?;
	Ok(tokio::task::block_in_place(move || handle.block_on(future)))
}
```
After our offline chat, I was looking up the code for `block_in_place`: https://github.com/tokio-rs/tokio/blob/17d8c2b29d94550f504d8fd76d8d8aaf66095864/tokio/src/runtime/scheduler/multi_thread/worker.rs#L358
It indeed uses a thread-local context inside.
If you use this anyway, isn't the only correct way to always also use `tokio::runtime::Handle::try_current()`?
We spoke about it, but seeing this again it feels a bit undefined to mix things in this non-transparent way.
You mentioned the case where there is no current runtime though. But that is detectable too, and in that case the call doesn't need to be wrapped in `block_on`?
Code comments exactly explaining why this `block_in_place` -> `block_on` chaining is needed would be helpful too.
Yeah, please take a look at the approach I just pushed: now went with preferring the `Handle`/spawned runtime everywhere, except in `block_on`, where the outer context will take precedence.
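For reference, a minimal self-contained sketch of that approach, stitched together from the fragments quoted in this thread (the wrapper struct and field names here are assumptions, not the exact merged code):

```rust
use std::future::Future;
use tokio::runtime::Handle;

struct Runtime {
	// Handle configured at initialization (detected or user-provided).
	handle: Handle,
}

impl Runtime {
	fn block_on<F: Future>(&self, future: F) -> F::Output {
		// Prefer an outer runtime context if the caller is already inside
		// one, since `block_in_place` operates on the thread-local context;
		// otherwise fall back to the handle set at initialization.
		let handle = Handle::try_current().unwrap_or_else(|_| self.handle.clone());
		// `block_in_place` tells the multi-threaded scheduler that this
		// worker thread is about to block, moving pending tasks off of it
		// before we synchronously wait on `future`.
		tokio::task::block_in_place(move || handle.block_on(future))
	}
}
```

Note that `block_in_place` panics when called on a current-thread runtime, which is another reason to keep this dance confined to one place.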
Force-pushed from 2f43096 to 2f32e15
Now pushed a new approach that initializes the […]
Force-pushed from 3d7c8a6 to 48a42f1
joostjager left a comment
Nice cleanup of the error cases indeed.
src/runtime.rs (Outdated)
```rust
// during `block_on`, as this is the context `block_in_place` would operate on. So we try
// to detect the outer context here, and otherwise use whatever was set during
// initialization.
let handle = tokio::runtime::Handle::try_current().unwrap_or(self.handle());
```
If this is used here, shouldn't it be used everywhere (`spawn`, `spawn_blocking`)?
I'm very confused: below you argue against invisible capture, here you say we should invisibly capture the context from any entry point? No, I think generally using what was set on startup and only making an exception for `block_on` makes sense.
It is not the same. In `new` the handle is saved and there's the implicit requirement for it to stay alive.
Here it isn't saved. It's just used if present, and otherwise we fall back to what the user configured and knows they need to keep alive.
Why make an exception for `block_on`? I'd think it is better to be consistent, and if it can't be avoided in `block_on`, it should be done everywhere?
```rust
impl Runtime {
	pub fn new() -> Result<Self, std::io::Error> {
		let mode = match tokio::runtime::Handle::try_current() {
			Ok(handle) => RuntimeMode::Handle(handle),
```
It still makes me feel a bit uncomfortable that the handle is captured here invisibly, and that there's the implicit assumption that the user will keep it alive.
Removing it here, and letting the user pass it in themselves via the builder, seems to be a more transparent way to signal that it is used and needs to remain available.
Hmm, I see your concern, but I'm not sure. I think the current behavior is the expected default behavior that just makes it work transparently for any upstream users. And we need to auto-detect for `block_on` at the very least anyways. Not sure if @TheBlueMatt has an opinion here, since he (as a user) requested auto-detection in #491?
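For illustration, a self-contained sketch of the auto-detection being debated (the `match` follows the quoted snippet; the `Owned` variant and builder settings are assumptions):

```rust
use tokio::runtime::Handle;

enum RuntimeMode {
	// Reuse a handle to an outer runtime detected at build time. Implicit
	// contract: the user must keep that runtime alive.
	Handle(Handle),
	// No outer context was found: spawn and own a runtime ourselves.
	Owned(tokio::runtime::Runtime),
}

struct Runtime {
	mode: RuntimeMode,
}

impl Runtime {
	pub fn new() -> Result<Self, std::io::Error> {
		let mode = match Handle::try_current() {
			// Built from inside a tokio context: capture the ambient handle.
			Ok(handle) => RuntimeMode::Handle(handle),
			// Otherwise fall back to a runtime we own and will shut down.
			Err(_) => {
				let rt = tokio::runtime::Builder::new_multi_thread().enable_all().build()?;
				RuntimeMode::Owned(rt)
			},
		};
		Ok(Self { mode })
	}
}
```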
```rust
			log_error!(logger, "Failed to setup tokio runtime: {}", e);
			BuildError::RuntimeSetupFailed
		})?)
	};
```
Code reuse opportunity
NACK, IMO it's much more readable to keep short blocks like this inlined, instead of having the reader jump all over the file, losing context.
I don't agree. I think eliminating the risk of future code changes not being applied in both places is more important than having the reader jump to a function. The function can have a descriptive name too, such as `build_runtime`. I don't think it is bad for readability at all.
Your argument about jumping all over the file would also apply to code that isn't duplicated, because the reader would have to jump in that case too. I think that would lead to long function bodies, of which there are already too many in ldk, and those absolutely do not help with readability.
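For illustration, the suggested extraction could look roughly like this (`build_runtime` is the name floated above; the error type and logging are simplified stand-ins for the crate's own):

```rust
#[derive(Debug)]
enum BuildError {
	RuntimeSetupFailed,
}

// One shared construction path instead of duplicating the block inline
// at both build sites.
fn build_runtime() -> Result<tokio::runtime::Runtime, BuildError> {
	tokio::runtime::Builder::new_multi_thread().enable_all().build().map_err(|e| {
		eprintln!("Failed to setup tokio runtime: {}", e);
		BuildError::RuntimeSetupFailed
	})
}
```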
```rust
	}
}

impl FutureSpawner for RuntimeSpawner {
```
Can this be implemented directly on the new `Runtime`?
Good thought, unfortunately no, as `GossipVerifier::new` takes `FutureSpawner` by value, and of course we can't `impl FutureSpawner for Arc<Runtime>` as both sides of it would include non-local types.
Is there really no way to do this? Implement an interface on a local type and then pass it to the other crate? Or would it require a different type on the LDK side for `FutureSpawner`?
Perhaps, longer term, moving `Runtime` to rust-lightning could be a direction too?
> Is there really no way to do this? Implement an interface on a local type and then pass it to the other crate?
There is a way to do this, which is the newtype pattern, which is essentially what we have.
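A sketch of that pattern for the case at hand (the trait is redefined locally here only to keep the example self-contained; in the real code it is the foreign rust-lightning trait, which is exactly why `impl FutureSpawner for Arc<Runtime>` hits the orphan rule while a local wrapper does not):

```rust
use std::future::Future;
use std::sync::Arc;

// Local stand-in mirroring the shape of the foreign trait.
trait FutureSpawner: Send + Sync + 'static {
	fn spawn<T: Future<Output = ()> + Send + 'static>(&self, future: T);
}

struct Runtime {
	handle: tokio::runtime::Handle,
}

// The newtype: a local type wrapping `Arc<Runtime>`, so the trait impl
// has a local type on one side and is accepted by the compiler.
struct RuntimeSpawner(Arc<Runtime>);

impl FutureSpawner for RuntimeSpawner {
	fn spawn<T: Future<Output = ()> + Send + 'static>(&self, future: T) {
		// Delegate to the wrapped runtime's handle.
		self.0.handle.spawn(future);
	}
}
```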
> Or would it require a different type on the LDK side for `FutureSpawner`?
The easiest way to avoid this would be to have `GossipVerifier::new` take a `Deref<Target = FutureSpawner>` or `Borrow<FutureSpawner>`, but honestly a newtype isn't too bad in this case, IMO.
Interesting. I might, for the async KV store PR, use `Deref<Target = FS>` then already as a preparation. But ofc the wrapper is no big deal.
@andrei-21 Thanks! Mind reconfirming this works as expected with the new approach?
tnull left a comment
Hmm, seems switching `VssStore` over to use the same runtime makes the integration test hang. Will need some more debugging.
```diff
 pub fn disconnect(&self, counterparty_node_id: PublicKey) -> Result<(), Error> {
-	let rt_lock = self.runtime.read().unwrap();
-	if rt_lock.is_none() {
+	if !*self.is_running.read().unwrap() {
```
Out of scope for this PR, but curious: we talked about the type-state pattern for `Runtime` and that it may not be ideal. Could the pattern work for `Node` though? So `start` returning a type that exposes the method, and that type being consumed by `stop`.
Yeah, we had considered that previously, too. Maybe, although I'm not sure if we want to force this pattern on our users, i.e., they'd need to create something akin to an enum wrapper that could hold `StoppedNode`/`StartedNode` variants, or would pipe through different versions of the node in their app, depending on the node state. Also, in the future we'd like to handle (re-)starting on persistence failure for the users, and if they hold a `StartedNode` object, there is really no way to force them to drop it.
I see. I have no experience with the pattern. It looks pretty powerful in combination with Rust's type system. Don't know if the pipe through can be done with a wrapper around the state types. Restart could be a method on the StartedNode perhaps. Either way, this was just a side remark.
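For concreteness, a toy sketch of the type-state idea with hypothetical names (not a proposal for the actual API): `start` consumes the stopped state and only the started type exposes runtime-dependent methods, so "not running" errors become unrepresentable at compile time.

```rust
struct StoppedNode;
struct StartedNode;

impl StoppedNode {
	fn start(self) -> StartedNode {
		// ... spin up runtime and background tasks ...
		StartedNode
	}
}

impl StartedNode {
	fn disconnect(&self) {
		// No `is_running` check needed: holding a `StartedNode` proves it.
	}

	fn stop(self) -> StoppedNode {
		// Consuming `self` statically prevents use-after-stop.
		StoppedNode
	}
}

fn main() {
	let node = StoppedNode.start();
	node.disconnect();
	let _stopped = node.stop();
}
```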
Force-pushed from f65ad68 to 838c1bc
Tried again, seems to work as expected.
Thanks again!
🔔 1st Reminder Hey @andrei-21! This PR has been waiting for your review.
Force-pushed from 15f300b to 47207f5
Force-pushed from 47207f5 to b838720
Force-pushed from b838720 to 636c357
Rebased this on current main to make some progress finally. I'm tempted to punt on the VSS part of it, as we're about to get a 'real' async KVStore with #462, so dropped that commit for now.
Force-pushed from 636c357 to 56e7977
Instead of holding an `Arc<RwLock<Option<Arc<tokio::runtime::Runtime>>>>` and dealing with stuff like `tokio::task::block_in_place` at all callsites, we introduce a `Runtime` object that takes care of the state transitions, and allows detecting and reusing an outer runtime context. We also adjust the `with_runtime` API to take a `tokio::runtime::Handle` rather than an `Arc<Runtime>`.
Force-pushed from 56e7977 to 4879002
Rebased on main to resolve conflicts.
🔔 1st Reminder Hey @TheBlueMatt! This PR has been waiting for your review.
TheBlueMatt left a comment
The diff itself looks correct. I didn't try to verify that there aren't missing places where we should check for `is_running`, but did leave some suggestions to improve test coverage marginally.
```diff
 	self.logger,
 	"Active runtime tasks left prior to shutdown: {}",
-	metrics_runtime.metrics().active_tasks_count()
+	runtime_handle.metrics().active_tasks_count()
```
Can we add a debug assertion that it's zero? It seems like it would be pretty easy to spawn a long-running task that loops forever, has a reference to the owned runtime, and prevents it from ever dropping.
I think it might not necessarily be 0, especially when reusing an outer runtime. Currently we still have some gossip verification and HTLC-forwarding tasks that aren't necessarily finished before this point, but the latter will def. go away with #462.
Right but we really need to test that we don't have forever-running tasks which leak. As written it would be really easy for a bug to slip in that lets a task run forever and `stop()` leaves a handful of threads lying around. Maybe that means tests need to wait for gossip verification and HTLC forwarding tasks even if prod builds don't.
> Right but we really need to test that we don't have forever-running tasks which leak. As written it would be really easy for a bug to slip in that lets a task run forever and `stop()` leaves a handful of threads lying around.
Yes, that is a possibility. I guess we could move the newly-introduced `background_tasks` and `cancellable_background_tasks` `JoinSet`s into `Runtime`, and then only expose `spawn_background_task`/`spawn_cancellable_background_task` methods, ensuring that whenever we spawn we track the `JoinHandle` and abort or await it on shutdown.
Mmm, that sounds like a much better approach than just trying to test it.
Did it in #619
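A rough sketch of that direction (field and method names follow the comment above; the split between awaited and aborted sets, and the shutdown order, are assumptions):

```rust
use std::future::Future;
use std::sync::Mutex;
use tokio::runtime::Handle;
use tokio::task::JoinSet;

struct Runtime {
	handle: Handle,
	// Tasks we must wait on during shutdown.
	background_tasks: Mutex<JoinSet<()>>,
	// Tasks we may simply abort during shutdown.
	cancellable_background_tasks: Mutex<JoinSet<()>>,
}

impl Runtime {
	fn spawn_background_task<F>(&self, fut: F)
	where
		F: Future<Output = ()> + Send + 'static,
	{
		// Every spawn goes through here, so every JoinHandle is tracked
		// and nothing can leak past `shutdown`.
		self.background_tasks.lock().unwrap().spawn_on(fut, &self.handle);
	}

	fn spawn_cancellable_background_task<F>(&self, fut: F)
	where
		F: Future<Output = ()> + Send + 'static,
	{
		self.cancellable_background_tasks.lock().unwrap().spawn_on(fut, &self.handle);
	}

	async fn shutdown(&self) {
		// Abort the cancellable set, then drain the must-finish set.
		self.cancellable_background_tasks.lock().unwrap().abort_all();
		let mut tasks = std::mem::take(&mut *self.background_tasks.lock().unwrap());
		while tasks.join_next().await.is_some() {}
	}
}
```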
```rust
	payment_store: Arc<PaymentStore>,
	peer_store: Arc<PeerStore<Arc<Logger>>>,
	config: Arc<Config>,
	is_running: Arc<RwLock<bool>>,
```
Since we have our own `Runtime` type, can we move this into the `Runtime` and then add a debug assertion, when spawning a new task, that running is true? Seems like that would add some additional test coverage and simplify the diff somewhat.
See below: `is_running` is really a property of `Node` that (as of this PR) no longer means "`Runtime` is available" (which we now assume), but rather acts as a reentrancy guard for runtime state transitions.
But the `Runtime` (not the tokio `Runtime`, but rather the ldk-node `Runtime`) is also a "property of the `Node`", so why shouldn't "you can use the runtime" be part of that? Again, this really feels like it's missing a lot of debug assertions - bugs slipping in here in the future seems really likely.
```rust
		let background_tasks = Mutex::new(None);
		let cancellable_background_tasks = Mutex::new(None);

		let is_running = Arc::new(RwLock::new(false));
```
This should really be an `AtomicBool`, not a `RwLock<bool>`.
No, we chose an `RwLock` for a reason, namely that we acquire and hold the lock during `start`/`stop`, which keeps us from running into weird intermediate states if users were to call `start`/`stop` in quick succession.
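A condensed sketch of that argument (simplified; not the actual `Node` internals): holding the write guard across the whole transition is what an `AtomicBool` flip couldn't give us.

```rust
use std::sync::RwLock;

struct Node {
	is_running: RwLock<bool>,
}

impl Node {
	fn start(&self) -> Result<(), &'static str> {
		// Hold the write guard across the whole transition: a concurrent
		// `start`/`stop` blocks here instead of observing a half-started node.
		let mut running = self.is_running.write().unwrap();
		if *running {
			return Err("already running");
		}
		// ... spawn runtime and background tasks while the guard is held ...
		*running = true;
		Ok(())
	}

	fn stop(&self) -> Result<(), &'static str> {
		let mut running = self.is_running.write().unwrap();
		if !*running {
			return Err("not running");
		}
		// ... tear down while the guard is held ...
		*running = false;
		Ok(())
	}
}
```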
Landing this for now, might open a PR with minor follow-ups.
We previously attempted to drop the internal runtime from `VssStore`, resulting in blocking behavior. While we recently made changes that improved our situation (having VSS CI pass again pretty reliably), we just ran into yet another case where the VSS CI hung (cf. https://github.com/lightningdevkit/vss-server/actions/runs/19023212819/job/54322173817?pr=59). Here we attempt to restore even more of the original pre-ab3d78d / lightningdevkit#543 behavior to get rid of the reappearing blocking behavior, i.e., only use the internal runtime in `VssStore`.
Closes #491
Instead of holding an `Arc<RwLock<Option<Arc<tokio::runtime::Runtime>>>>` and dealing with stuff like `tokio::task::block_in_place` at all callsites, we introduce a `Runtime` object that takes care of the state transitions, and allows detecting and reusing an outer runtime context.
We also adjust the `with_runtime` API to take a `tokio::runtime::Handle` rather than an `Arc<Runtime>`. We then also go ahead and reuse said `Runtime` object for `VssStore`. (cc @andrei-21)