Skip to content

containers: Cancel async destruction from ContainerClient if a new ContainerClient takes over, and be able to recover container sidecar proxies#6201

Open
gabivlj wants to merge 2 commits intomainfrom
gv/fix-reload-by-cancelling-tasks
Open

containers: Cancel async destruction from ContainerClient if a new ContainerClient takes over, and be able to recover container sidecar proxies#6201
gabivlj wants to merge 2 commits intomainfrom
gv/fix-reload-by-cancelling-tasks

Conversation

@gabivlj
Copy link
Contributor

@gabivlj gabivlj commented Feb 27, 2026

containers: Fix workerd reloads with egress interception

When workerd reloads, and the container is still running, there might be
a chance that the container loses connectivity.

This changes make sure that we can recover the proxy sidecar we were
running, and listen in the same port we were before in Workerd.

containers: Cancel async destruction from ContainerClient if a new ContainerClient takes over

The new system makes it so we have a containerCleanupCanceler reference
that a container destructor uses to wrap async calls that should be
cancelled if a new ContainerClient of the same id takes over.

This fixes issues where we abort a DO container, but then we call it
again, racing the old destructor tasks against the new ContainerClient.

When workerd reloads, and the container is still running, there might be
a chance that the container loses connectivity.

This changes make sure that we can recover the proxy sidecar we were
running, and listen in the same port we were before in Workerd.
@gabivlj gabivlj requested review from a team as code owners February 27, 2026 20:22
Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR adds cancellation of stale Docker cleanup tasks during workerd reloads and recovery of sidecar proxy state when a container is still running after restart.

Issues (highest severity first):

  1. parseAs<uint16_t>() can throw in inspectSidecarEgressPort() recovery path — If the sidecar container's args contain a malformed --http-egress-port value, parseAs will throw a kj::Exception. Since this is called from status(), which is the initial health-check RPC, an unhandled parse failure would break status reporting for a container that is otherwise running fine. The recovery path should degrade gracefully to kj::none instead.

  2. status() now performs side effects (starts listeners/sidecars) without joining getRpcTurn() queue — Previously status() was read-only. The new recovery code in status() calls ensureEgressListenerStarted() and ensureSidecarStarted(), both of which mutate state. The atomic exchange guards prevent double-execution, but if status() and start() interleave at co_await points, you could get surprising ordering (e.g., status() starts the listener on the recovered port, then start() tries to start the sidecar with containerSidecarStarted = false overriding the true that status() just stored). This isn't a correctness bug today because the atomic guards protect the critical sections, but it's a fragile arrangement. Consider whether status() should acquire a mutating turn or whether the recovery should be deferred to a separate method that does.

auto args = jsonRoot.getArgs();
for (auto i = 0u; i < args.size(); i++) {
if (args[i] == "--http-egress-port" && i + 1 < args.size()) {
co_return kj::str(args[i + 1]).parseAs<uint16_t>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseAs<uint16_t>() throws if the value is not a valid integer. Since this is a recovery/best-effort path inside status(), a parse failure shouldn't crash the status check. Consider catching the exception and returning kj::none:

Suggested change
co_return kj::str(args[i + 1]).parseAs<uint16_t>();
auto parsed = kj::str(args[i + 1]).tryParseAs<uint16_t>();
KJ_IF_SOME(port, parsed) {
co_return port;
}
co_return kj::Maybe<uint16_t>(kj::none);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a parse failure has to crash the status check as its unexpected state that we dont have the flag set!

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Feb 27, 2026

Review posted successfully. Here's a summary of what I flagged:

PR #6201 adds two features: (1) cancellation of stale Docker cleanup tasks when a new ContainerClient takes over the same container ID during workerd reloads, and (2) recovery of sidecar proxy state by inspecting running sidecar containers.

Issues found:

  1. Medium: parseAs<uint16_t>() can throw in recovery path (container-client.c++:578) — The inspectSidecarEgressPort() method parses the sidecar's --http-egress-port arg, but if it's malformed, the exception propagates up through status() and breaks the health check. Posted a suggestion to use tryParseAs instead.

  2. Low: status() now has side effects without RPC turn serialization — The recovery logic starts listeners and sidecars from status(), which doesn't join the getRpcTurn() queue. The atomic guards prevent double-execution, but interleaving with start() at co_await points could produce surprising ordering. Noted as a design concern.

The canceler lifetime management and declaration ordering in server.c++ look correct. The removal of setAutoRemove(true) from the sidecar is necessary for the recovery feature. Test changes properly reorder interceptOutboundHttp before start().

github run

@codecov-commenter
Copy link

codecov-commenter commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 0% with 77 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.55%. Comparing base (f3b58ab) to head (79757a5).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
src/workerd/server/container-client.c++ 0.00% 51 Missing ⚠️
src/workerd/server/server.c++ 0.00% 26 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6201      +/-   ##
==========================================
- Coverage   70.59%   70.55%   -0.04%     
==========================================
  Files         413      413              
  Lines      109986   110059      +73     
  Branches    18120    18134      +14     
==========================================
+ Hits        77644    77653       +9     
- Misses      21531    21596      +65     
+ Partials    10811    10810       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@gabivlj gabivlj force-pushed the gv/fix-reload-by-cancelling-tasks branch 4 times, most recently from 017e554 to 8cd69a7 Compare February 27, 2026 23:03
…ntainerClient takes over

The new system makes it so we have a containerCleanupCanceler reference
that a container destructor uses to wrap async calls that should be
cancelled if a new ContainerClient of the same id takes over.

This fixes issues where we abort a DO container, but then we call it
again, racing the old destructor tasks against the new ContainerClient.
@gabivlj gabivlj force-pushed the gv/fix-reload-by-cancelling-tasks branch from 8cd69a7 to 79757a5 Compare February 28, 2026 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants