Conversation
When workerd reloads while the container is still running, there is a chance the container loses connectivity. This change makes sure we can recover the proxy sidecar we were running and listen on the same port we were using before in workerd.
I'm Bonk, and I've done a quick review of your PR.
This PR adds cancellation of stale Docker cleanup tasks during workerd reloads and recovery of sidecar proxy state when a container is still running after restart.
Issues (highest severity first):
- `parseAs<uint16_t>()` can throw in the `inspectSidecarEgressPort()` recovery path: If the sidecar container's args contain a malformed `--http-egress-port` value, `parseAs` will throw a `kj::Exception`. Since this is called from `status()`, which is the initial health-check RPC, an unhandled parse failure would break status reporting for a container that is otherwise running fine. The recovery path should degrade gracefully to `kj::none` instead.
- `status()` now performs side effects (starts listeners/sidecars) without joining the `getRpcTurn()` queue: Previously `status()` was read-only. The new recovery code in `status()` calls `ensureEgressListenerStarted()` and `ensureSidecarStarted()`, both of which mutate state. The atomic exchange guards prevent double-execution, but if `status()` and `start()` interleave at `co_await` points, you could get surprising ordering (e.g., `status()` starts the listener on the recovered port, then `start()` tries to start the sidecar with `containerSidecarStarted = false` overriding the `true` that `status()` just stored). This isn't a correctness bug today because the atomic guards protect the critical sections, but it's a fragile arrangement. Consider whether `status()` should acquire a mutating turn or whether the recovery should be deferred to a separate method that does.
```cpp
auto args = jsonRoot.getArgs();
for (auto i = 0u; i < args.size(); i++) {
  if (args[i] == "--http-egress-port" && i + 1 < args.size()) {
    co_return kj::str(args[i + 1]).parseAs<uint16_t>();
```
`parseAs<uint16_t>()` throws if the value is not a valid integer. Since this is a recovery/best-effort path inside `status()`, a parse failure shouldn't crash the status check. Consider catching the failure and returning `kj::none`:
```diff
-    co_return kj::str(args[i + 1]).parseAs<uint16_t>();
+    auto parsed = kj::str(args[i + 1]).tryParseAs<uint16_t>();
+    KJ_IF_SOME(port, parsed) {
+      co_return port;
+    }
+    co_return kj::Maybe<uint16_t>(kj::none);
```
I think a parse failure has to crash the status check, as it's unexpected state that we don't have the flag set!
Review posted successfully. Here's a summary of what I flagged: PR #6201 adds two features: (1) cancellation of stale Docker cleanup tasks when a new `ContainerClient` takes over, and (2) recovery of sidecar proxy state when a container is still running after a workerd reload. Issues found:
- The canceler lifetime management and declaration ordering in `ContainerClient`.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #6201     +/-  ##
=========================================
- Coverage   70.59%   70.55%    -0.04%
=========================================
  Files         413      413
  Lines      109986   110059      +73
  Branches    18120    18134      +14
=========================================
+ Hits        77644    77653       +9
- Misses      21531    21596      +65
+ Partials    10811    10810       -1
```

☔ View full report in Codecov by Sentry.
Force-pushed 017e554 to 8cd69a7
…ntainerClient takes over

The new system gives us a containerCleanupCanceler reference that the container destructor uses to wrap async calls that should be cancelled if a new ContainerClient with the same id takes over. This fixes issues where we abort a DO container, then call it again, racing the old destructor's tasks against the new ContainerClient.
Force-pushed 8cd69a7 to 79757a5
containers: Fix workerd reloads with egress interception
containers: Cancel async destruction from ContainerClient if a new ContainerClient takes over