fix: use SSA for HTTPProxy child resource writes to eliminate 409 races#170
Draft
drewr wants to merge 1 commit into
Draft
fix: use SSA for HTTPProxy child resource writes to eliminate 409 races#170drewr wants to merge 1 commit into
drewr wants to merge 1 commit into
Conversation
The HTTPProxy controller used CreateOrUpdate (read-modify-write) for Gateway, HTTPRoute, and EndpointSlice child resources. When multiple reconcile goroutines ran simultaneously — triggered by the .Owns() watch chain firing as each child was created — they raced on the resource version, producing a burst of 409 Conflict errors. Controller-runtime then applied exponential backoff; after ~15 consecutive failures the wait reached 3-4 minutes, silencing the controller until its next periodic tick. Root cause: concurrent goroutines doing Get → Update is inherently racy. Requeuing quickly on conflict treats the symptom, not the cause. Fix: replace CreateOrUpdate with Server-Side Apply (client.Apply + ForceOwnership) for Gateway, HTTPRoute, HTTPRouteFilter, and EndpointSlice. Concurrent goroutines applying the same desired state with the same field manager are deduplicated by the API server — the second apply is a no-op rather than a 409. Additional properties of SSA that improve the controller: - An idempotent apply that produces no server-side change fires no watch event, naturally reducing re-queue churn from the .Owns() watches. - The BackendCertHostnameAnnotation on EndpointSlice is already set (or absent) by collectDesiredResources; SSA + ForceOwnership ensures it is added when present and removed when absent without an explicit delete call. - The complex read-modify-write in the EndpointSlice mutate function is removed. Gateway listener hostname note: the gateway controller sets listener hostnames via plain Update (no SSA field manager), making them unmanaged in SSA terms. ForceOwnership on a list item would clear unmanaged fields not in the apply payload. We carry forward any hostname the gateway controller has already set in the single Get we retain for ownership-conflict detection. This becomes unnecessary once the gateway controller adopts SSA itself. Closes: network-services-operator#166
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Closes #166. Supersedes #169 (which treated the symptom rather than the cause).
When a tunnel (HTTPProxy) is created, the
.Owns()watch chain fires as each child resource (Gateway, HTTPRoute, EndpointSlice) is created, queuing concurrent reconciles of the same HTTPProxy. Those concurrent reconciles all ran the sameCreateOrUpdate(read → compute → write) path, racing on the resource version and producing a burst of 409 Conflict errors. Controller-runtime applied exponential backoff; after ~15 failures the wait reached 3-4 minutes, silencing the operator until the next periodic tick.Fix
Replace
CreateOrUpdatewith Server-Side Apply (client.Apply+ForceOwnership) for Gateway, HTTPRoute, HTTPRouteFilter, and EndpointSlice.With SSA, concurrent goroutines applying the same desired state with the same field manager are deduplicated by the API server. The second (and third, fourth…) apply is a no-op rather than a 409. No resource-version conflict is possible.
Additional improvements from SSA:
.Owns()re-queue that triggered the race in the first place largely disappears for steady-state reconciles.BackendCertHostnameAnnotationsimplified: the annotation is already set or absent on each desired EndpointSlice incollectDesiredResources. SSA + ForceOwnership adds it when present and removes it when absent — no explicitdelete()call needed. The per-field merge logic inside theCreateOrUpdatemutate function is removed.Gateway listener hostname note
The gateway controller sets listener hostnames via plain
Update(no SSA field manager). In SSA terms those hostnames are unmanaged.ForceOwnershipon a list item (matched by merge keyname) takes ownership of all fields within that item, which would clear unmanaged fields not in the apply payload.To avoid losing those hostnames, the single
Getwe already need for ownership-conflict detection also reads any hostname the gateway controller has already written, and we carry it forward in the apply payload. This becomes unnecessary once the gateway controller adopts SSA itself (a natural follow-up).Testing
All existing
TestHTTPProxyReconcilesubtests pass, includingaddress_and_hostname_propagationwhich exercises the hostname carry-forward path. No new test was added — the correctness is structural: concurrent SSA applies of the same state cannot produce a 409.Related