Introduce Runtime object allowing to detect outer runtime context
#543
Conversation
👋 Thanks for assigning @TheBlueMatt as a reviewer!
Force-pushed from d42f762 to 2f43096
Runtime object and allow to take a `tokio::runtime::Handle`
Runtime object allowing to detect outer runtime context
andrei-21 left a comment
I have tested with my prototype; everything works OK.
src/runtime.rs (Outdated)
```rust
pub fn block_on<F: Future>(&self, future: F) -> Result<F::Output, RuntimeError> {
	let handle = self.handle()?;
	Ok(tokio::task::block_in_place(move || handle.block_on(future)))
}
```
After our offline chat, I was looking up the code for `block_in_place`: https://github.com/tokio-rs/tokio/blob/17d8c2b29d94550f504d8fd76d8d8aaf66095864/tokio/src/runtime/scheduler/multi_thread/worker.rs#L358
It indeed uses a thread-local context inside.
If you use this anyway, isn't the only correct way to always also use `tokio::runtime::Handle::try_current()`?
We spoke about it, but seeing this again it feels a bit undefined to mix things in this non-transparent way.
You mentioned the case where there is no current runtime though. But that is detectable too, and in that case the call doesn't need to be wrapped in `block_on`?
Code comments exactly explaining why this `block_in_place` -> `block_on` chaining is needed would be helpful too.
Yeah, please take a look at the approach I just pushed: now went with preferring the `Handle`/spawned runtime everywhere, except in `block_on`, where the outer context will take precedence.
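For reference, a minimal self-contained sketch of that approach, stitched together from the fragments quoted in this thread (the wrapper struct and field names here are assumptions, not the exact merged code):

```rust
use std::future::Future;
use tokio::runtime::Handle;

struct Runtime {
	// Handle configured at initialization (detected or user-provided).
	handle: Handle,
}

impl Runtime {
	fn block_on<F: Future>(&self, future: F) -> F::Output {
		// Prefer an outer runtime context if the caller is already inside
		// one, since `block_in_place` operates on the thread-local context;
		// otherwise fall back to the handle set at initialization.
		let handle = Handle::try_current().unwrap_or_else(|_| self.handle.clone());
		// `block_in_place` tells the multi-threaded scheduler that this
		// worker thread is about to block, moving pending tasks off of it
		// before we synchronously wait on `future`.
		tokio::task::block_in_place(move || handle.block_on(future))
	}
}
```

Note that `block_in_place` panics when called on a current-thread runtime, which is another reason to keep this dance confined to one place.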
Force-pushed from 2f43096 to 2f32e15
Now pushed a new approach that initializes the […]
Force-pushed from 3d7c8a6 to 48a42f1
joostjager left a comment
Nice cleanup of the error cases indeed.
src/runtime.rs (Outdated)
```rust
// during `block_on`, as this is the context `block_in_place` would operate on. So we try
// to detect the outer context here, and otherwise use whatever was set during
// initialization.
let handle = tokio::runtime::Handle::try_current().unwrap_or(self.handle());
```
If this is used here, shouldn't it be used everywhere (`spawn`, `spawn_blocking`)?
I'm very confused: below you argue against invisible capture, here you say we should invisibly capture the context from any entry point? No, I think generally using what was set on startup and only making an exception for `block_on` makes sense.
It is not the same. In `new` the handle is saved and there's the implicit requirement for it to stay alive.
Here it isn't saved. It's just used if present, and otherwise we fall back to what the user configured and knows they need to keep alive.
Why make an exception for `block_on`? I'd think it is better to be consistent, and if it can't be avoided in `block_on`, it should be done everywhere?
```rust
impl Runtime {
	pub fn new() -> Result<Self, std::io::Error> {
		let mode = match tokio::runtime::Handle::try_current() {
			Ok(handle) => RuntimeMode::Handle(handle),
```
It still makes me feel a bit uncomfortable that the handle is captured here invisibly, and that there's the implicit assumption that the user will keep it alive.
Removing it here, and letting the user pass it in themselves via the builder, seems to be a more transparent way to signal that it is used and needs to remain available.
Hmm, I see your concern, but I'm not sure. I think the current behavior is the expected default behavior that just makes it work transparently for any upstream users. And we need to auto-detect for `block_on` at the very least anyways. Not sure if @TheBlueMatt has an opinion here, since he (as a user) requested auto-detection in #491?
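For illustration, a self-contained sketch of the auto-detection being debated (the `match` follows the quoted snippet; the `Owned` variant and builder settings are assumptions):

```rust
use tokio::runtime::Handle;

enum RuntimeMode {
	// Reuse a handle to an outer runtime detected at build time. Implicit
	// contract: the user must keep that runtime alive.
	Handle(Handle),
	// No outer context was found: spawn and own a runtime ourselves.
	Owned(tokio::runtime::Runtime),
}

struct Runtime {
	mode: RuntimeMode,
}

impl Runtime {
	pub fn new() -> Result<Self, std::io::Error> {
		let mode = match Handle::try_current() {
			// Built from inside a tokio context: capture the ambient handle.
			Ok(handle) => RuntimeMode::Handle(handle),
			// Otherwise fall back to a runtime we own and will shut down.
			Err(_) => {
				let rt = tokio::runtime::Builder::new_multi_thread().enable_all().build()?;
				RuntimeMode::Owned(rt)
			},
		};
		Ok(Self { mode })
	}
}
```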
```rust
			log_error!(logger, "Failed to setup tokio runtime: {}", e);
			BuildError::RuntimeSetupFailed
		})?)
	};
```
Code reuse opportunity
NACK, IMO it's much more readable to keep short blocks like this inlined, instead of having the reader jump all over the file, losing context.
I don't agree. I think eliminating the risk of future code changes not being applied in both places is more important than having the reader jump to a function. The function can have a descriptive name too, such as `build_runtime`. I don't think it is bad for readability at all.
Your argument about jumping all over the file would also apply to code that isn't duplicated, because the reader would have to jump in that case too. I think that would lead to long function bodies, of which there are already too many in ldk, and those absolutely do not help with readability.
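For illustration, the suggested extraction could look roughly like this (`build_runtime` is the name floated above; the error type and logging are simplified stand-ins for the crate's own):

```rust
#[derive(Debug)]
enum BuildError {
	RuntimeSetupFailed,
}

// One shared construction path instead of duplicating the block inline
// at both build sites.
fn build_runtime() -> Result<tokio::runtime::Runtime, BuildError> {
	tokio::runtime::Builder::new_multi_thread().enable_all().build().map_err(|e| {
		eprintln!("Failed to setup tokio runtime: {}", e);
		BuildError::RuntimeSetupFailed
	})
}
```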
```rust
	}
}

impl FutureSpawner for RuntimeSpawner {
```
Can this be implemented directly on the new `Runtime`?
Good thought, unfortunately no, as `GossipVerifier::new` takes `FutureSpawner` by value, and of course we can't `impl FutureSpawner for Arc<Runtime>` as both sides of it would include non-local types.
Is there really no way to do this? Implement an interface on a local type and then pass it to the other crate? Or would it require a different type on the LDK side for `FutureSpawner`?
Perhaps, longer term, moving `Runtime` to rust-lightning could be a direction too?
> Is there really no way to do this? Implement an interface on a local type and then pass it to the other crate?
There is a way to do this, which is the newtype pattern, which is essentially what we have.
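A sketch of that pattern for the case at hand (the trait is redefined locally here only to keep the example self-contained; in the real code it is the foreign rust-lightning trait, which is exactly why `impl FutureSpawner for Arc<Runtime>` hits the orphan rule while a local wrapper does not):

```rust
use std::future::Future;
use std::sync::Arc;

// Local stand-in mirroring the shape of the foreign trait.
trait FutureSpawner: Send + Sync + 'static {
	fn spawn<T: Future<Output = ()> + Send + 'static>(&self, future: T);
}

struct Runtime {
	handle: tokio::runtime::Handle,
}

// The newtype: a local type wrapping `Arc<Runtime>`, so the trait impl
// has a local type on one side and is accepted by the compiler.
struct RuntimeSpawner(Arc<Runtime>);

impl FutureSpawner for RuntimeSpawner {
	fn spawn<T: Future<Output = ()> + Send + 'static>(&self, future: T) {
		// Delegate to the wrapped runtime's handle.
		self.0.handle.spawn(future);
	}
}
```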
> Or would it require a different type on the LDK side for `FutureSpawner`?
The easiest way to avoid this would be to have `GossipVerifier::new` take a `Deref<Target = FutureSpawner>` or `Borrow<FutureSpawner>`, but honestly a newtype isn't too bad in this case, IMO.
Interesting. I might, for the async KV store PR, use `Deref<Target = FS>` then already as a preparation. But ofc the wrapper is no big deal.
@andrei-21 Thanks! Mind reconfirming this works as expected with the new approach?
tnull left a comment
Hmm, seems switching `VssStore` over to use the same runtime makes the integration test hang. Will need some more debugging.
```diff
 pub fn disconnect(&self, counterparty_node_id: PublicKey) -> Result<(), Error> {
-	let rt_lock = self.runtime.read().unwrap();
-	if rt_lock.is_none() {
+	if !*self.is_running.read().unwrap() {
```
Out of scope for this PR, but curious: we talked about the type-state pattern for `Runtime` and that it may not be ideal. Could the pattern work for `Node` though? So `start` returning a type that exposes the method, and that type being consumed by `stop`.
Yeah, we had considered that previously, too. Maybe, although I'm not sure if we want to force this pattern on our users, i.e., they'd need to create something akin to an enum wrapper that could hold `StoppedNode`/`StartedNode` variants, or would pipe through different versions of the node in their app, depending on the node state. Also, in the future we'd like to handle (re-)starting on persistence failure for the users, and if they hold a `StartedNode` object, there is really no way to force them to drop it.
I see. I have no experience with the pattern. It looks pretty powerful in combination with Rust's type system. Don't know if the pipe through can be done with a wrapper around the state types. Restart could be a method on the StartedNode perhaps. Either way, this was just a side remark.
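For concreteness, a toy sketch of the type-state idea with hypothetical names (not a proposal for the actual API): `start` consumes the stopped state and only the started type exposes runtime-dependent methods, so "not running" errors become unrepresentable at compile time.

```rust
struct StoppedNode;
struct StartedNode;

impl StoppedNode {
	fn start(self) -> StartedNode {
		// ... spin up runtime and background tasks ...
		StartedNode
	}
}

impl StartedNode {
	fn disconnect(&self) {
		// No `is_running` check needed: holding a `StartedNode` proves it.
	}

	fn stop(self) -> StoppedNode {
		// Consuming `self` statically prevents use-after-stop.
		StoppedNode
	}
}

fn main() {
	let node = StoppedNode.start();
	node.disconnect();
	let _stopped = node.stop();
}
```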
Force-pushed from f65ad68 to 838c1bc
Tried again, seems to work as expected.
Thanks again!
🔔 1st Reminder Hey @andrei-21! This PR has been waiting for your review.
Force-pushed from 15f300b to 47207f5
Force-pushed from 47207f5 to b838720
Force-pushed from b838720 to 636c357
Rebased this on current main to make some progress finally. I'm tempted to punt on the VSS part of it, as we're about to get a 'real' async KVStore with #462, so dropped that commit for now.
Force-pushed from 636c357 to 56e7977
Instead of holding an `Arc<RwLock<Option<Arc<tokio::runtime::Runtime>>>>` and dealing with stuff like `tokio::task::block_in_place` at all callsites, we introduce a `Runtime` object that takes care of the state transitions, and allows detecting and reusing an outer runtime context. We also adjust the `with_runtime` API to take a `tokio::runtime::Handle` rather than an `Arc<Runtime>`.
Force-pushed from 56e7977 to 4879002
Rebased on main to resolve conflicts.
🔔 1st Reminder Hey @TheBlueMatt! This PR has been waiting for your review.
TheBlueMatt left a comment
The diff itself looks correct. I didn't try to verify that there aren't missing places where we should check for `is_running`, but did leave some suggestions to improve test coverage marginally.
```diff
 	self.logger,
 	"Active runtime tasks left prior to shutdown: {}",
-	metrics_runtime.metrics().active_tasks_count()
+	runtime_handle.metrics().active_tasks_count()
```
Can we add a debug assertion that it's zero? It seems like it would be pretty easy to spawn a long-running task that loops forever, has a reference to the owned runtime, and prevents it from ever dropping.
I think it might not necessarily be 0, especially when reusing an outer runtime. Currently we still have some gossip verification and HTLC-forwarding tasks that aren't necessarily finished before this point, but the latter will def. go away with #462.
Right but we really need to test that we don't have forever-running tasks which leak. As written it would be really easy for a bug to slip in that lets a task run forever and `stop()` leaves a handful of threads lying around. Maybe that means tests need to wait for gossip verification and HTLC forwarding tasks even if prod builds don't.
> Right but we really need to test that we don't have forever-running tasks which leak. As written it would be really easy for a bug to slip in that lets a task run forever and `stop()` leaves a handful of threads lying around.
Yes, that is a possibility. I guess we could move the newly-introduced `background_tasks` and `cancellable_background_tasks` `JoinSet`s into `Runtime`, and then only expose `spawn_background_task`/`spawn_cancellable_background_task` methods, ensuring that whenever we spawn we track the `JoinHandle` and abort or await it on shutdown.
Mmm, that sounds like a much better approach than just trying to test it.
Did it in #619
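A rough sketch of that direction (field and method names follow the comment above; the split between awaited and aborted sets, and the shutdown order, are assumptions):

```rust
use std::future::Future;
use std::sync::Mutex;
use tokio::runtime::Handle;
use tokio::task::JoinSet;

struct Runtime {
	handle: Handle,
	// Tasks we must wait on during shutdown.
	background_tasks: Mutex<JoinSet<()>>,
	// Tasks we may simply abort during shutdown.
	cancellable_background_tasks: Mutex<JoinSet<()>>,
}

impl Runtime {
	fn spawn_background_task<F>(&self, fut: F)
	where
		F: Future<Output = ()> + Send + 'static,
	{
		// Every spawn goes through here, so every JoinHandle is tracked
		// and nothing can leak past `shutdown`.
		self.background_tasks.lock().unwrap().spawn_on(fut, &self.handle);
	}

	fn spawn_cancellable_background_task<F>(&self, fut: F)
	where
		F: Future<Output = ()> + Send + 'static,
	{
		self.cancellable_background_tasks.lock().unwrap().spawn_on(fut, &self.handle);
	}

	async fn shutdown(&self) {
		// Abort the cancellable set, then drain the must-finish set.
		self.cancellable_background_tasks.lock().unwrap().abort_all();
		let mut tasks = std::mem::take(&mut *self.background_tasks.lock().unwrap());
		while tasks.join_next().await.is_some() {}
	}
}
```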
```rust
	payment_store: Arc<PaymentStore>,
	peer_store: Arc<PeerStore<Arc<Logger>>>,
	config: Arc<Config>,
	is_running: Arc<RwLock<bool>>,
```
Since we have our own `Runtime` type, can we move this into the `Runtime` and then add a debug assertion, when spawning a new task, that running is true? Seems like that would add some additional test coverage and simplify the diff somewhat.
See below: `is_running` is really a property of `Node` that (as of this PR) no longer means "`Runtime` is available" (which we now assume), but rather acts as a reentrancy guard for runtime state transitions.
But the `Runtime` (not the tokio `Runtime`, but rather the ldk-node `Runtime`) is also a "property of the `Node`", so why shouldn't "you can use the runtime" be part of that? Again, this really feels like it's missing a lot of debug assertions - bugs slipping in here in the future seems really likely.
```rust
		let background_tasks = Mutex::new(None);
		let cancellable_background_tasks = Mutex::new(None);

		let is_running = Arc::new(RwLock::new(false));
```
This should really be an `AtomicBool`, not a `RwLock<bool>`.
No, we chose an `RwLock` for a reason, namely that we acquire and hold the lock during `start`/`stop`, which keeps us from running into weird intermediate states if users were to call `start`/`stop` in quick succession.
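A condensed sketch of that argument (simplified; not the actual `Node` internals): holding the write guard across the whole transition is what an `AtomicBool` flip couldn't give us.

```rust
use std::sync::RwLock;

struct Node {
	is_running: RwLock<bool>,
}

impl Node {
	fn start(&self) -> Result<(), &'static str> {
		// Hold the write guard across the whole transition: a concurrent
		// `start`/`stop` blocks here instead of observing a half-started node.
		let mut running = self.is_running.write().unwrap();
		if *running {
			return Err("already running");
		}
		// ... spawn runtime and background tasks while the guard is held ...
		*running = true;
		Ok(())
	}

	fn stop(&self) -> Result<(), &'static str> {
		let mut running = self.is_running.write().unwrap();
		if !*running {
			return Err("not running");
		}
		// ... tear down while the guard is held ...
		*running = false;
		Ok(())
	}
}
```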
Landing this for now, might open a PR with minor follow-ups.
We previously attempted to drop the internal runtime from `VssStore`, resulting in blocking behavior. While we recently made changes that improved our situation (having VSS CI pass again pretty reliably), we just ran into yet another case where the VSS CI hung (cf. https://github.com/lightningdevkit/vss-server/actions/runs/19023212819/job/54322173817?pr=59). Here we attempt to restore even more of the original pre-ab3d78d / lightningdevkit#543 behavior to get rid of the reappearing blocking behavior, i.e., only use the internal runtime in `VssStore`.
Closes #491
Instead of holding an `Arc<RwLock<Option<Arc<tokio::runtime::Runtime>>>>` and dealing with stuff like `tokio::task::block_in_place` at all callsites, we introduce a `Runtime` object that takes care of the state transitions, and allows detecting and reusing an outer runtime context.
We also adjust the `with_runtime` API to take a `tokio::runtime::Handle` rather than an `Arc<Runtime>`. We then also go ahead and reuse said `Runtime` object for `VssStore`. (cc @andrei-21)