Nexus must hang onto qorb resolvers used for MGS updates#8466
Nexus must hang onto qorb resolvers used for MGS updates#8466davepacheco merged 22 commits intomainfrom
Conversation
| /// DNS resolver used by MgsUpdateDriver for MGS | ||
| // We don't need to do anything with this, but we can't let it be dropped | ||
| // while Nexus is running. | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Normally these resolvers are handed off to the pool, so their lifetime is coupled with the pool itself.
Do you think qorb could have done anything more clear to identify that the resolver object needs to be kept alive for resolution to keep happening?
There was a problem hiding this comment.
Good question. Maybe update the Resolver docs to say something like:
Resolvers generally do the bulk of their work (e.g., DNS resolution) in a separate tokio task. Dropping the resolver aborts that task. Any
Receivers previously returned bymonitor()will contain the last-updated set of backends indefinitely.
This last thing isn't qorb's fault but I found it to be a surprising footgun (with watch channels, I guess) and so worth calling out. I thought I'd have seen a RecvError because the other end of the channel got dropped. But my watch consumer only ever uses borrow() so it didn't notice the channel was closed.
I'd also update the docs (and maybe name?) for monitor(). The name sounds like it's going to take some action and the docs say "Start running a resolver". But that's not right. The resolver is already running before you call it. Maybe call it subscribe() and just drop the "Start running a resolver" sentence? I understand though if it's not worth making a breaking change for this.
I also think it's worth mentioning the thing above under monitor(), something like:
Note that if the
Resolvergets dropped, thenReceivers previously returned by this method will stop getting updated, but they will contain to report the last known set of backends.
There was a problem hiding this comment.
Updated docs in oxidecomputer/qorb#112 for Resolver::monitor
|
See #8291 for testing notes. |
The failure mode here is that when using Nexus to update SPs, all the updates fail with messages like this:
It says "found no MGS backends in DNS" but there are MGS backends in DNS. The problem is that we dropped the resolvers and so they're not doing any DNS resolution.