
Improve resiliency #31

Open
mweibel wants to merge 9 commits into main from improve-resiliency

Conversation

@mweibel
Collaborator

mweibel commented May 12, 2026

  • set up HTTP timeout configuration and reuse the HTTP transport, to avoid DNS lookups on every NewClient instantiation and to keep idle connections alive between reconciles
  • improve timeout handling per reconcile, depending on the API endpoint (creates have longer timeouts than reads, for example)
  • use consistent requeueAfter timeouts for creates to avoid potential create -> timeout -> create -> timeout loops as outlined in feat: server groups #7 (comment)
  • do not delete the server group on cluster delete if servers are still present
  • some refactoring of test handling (replacing t.Fatal with g.Fail to use the test framework everywhere)
  • handle network/LB pending errors specifically
  • set up concurrency flags for the cluster (between 1-4) and machine (between 1-10) controllers. Initially we wanted to hardcode concurrency to 1 because, API-side, certain CREATEs are not idempotent (e.g. server groups), but that would prevent reconciling different clusters and machines concurrently. Instead, we added a mutex around server group creation to prevent concurrency issues there.
  • fix a bug where a LoadBalancer never left the degraded state because we stopped reconciling pool members while degraded (reported by a beta tester)
  • fix a network CIDR overlap when using Cilium by choosing a different CIDR
  • remove LoadBalancer IPFamily: the field was unused, and the API does not support it anyway

mweibel added 8 commits May 12, 2026 11:17
- Add IsTimeoutError/IsDeadlineExceeded helpers to cloudscale client
- Configure client with tuned http.Transport (5s dial, 10s response header, 60s overall)
- Wrap all .Create() call sites with IsTimeoutError checks and 90s requeue
- Update reconcileNetwork and reconcileManagedNetwork to return (ctrl.Result, error)
- Add comprehensive timeout handling unit tests across all controller files
Prevent 'Cannot delete non-empty server group' race condition during cluster
deletion by checking CloudscaleMachine resources for owned servers before
attempting server group deletion. When owned servers are still present,
return a sentinel error that triggers a 10s graceful requeue instead of
exponential backoff.

Also skip server groups containing foreign servers (servers from other
clusters sharing this server group) to prevent accidental deletion.
Convert the 'There are still one or more load balancer pool members'
API error into a graceful 10s requeue instead of propagating it as
a failure (which triggers exponential backoff up to 30 minutes).

Added isLBPoolMembersError() helper to detect the specific API error
message and errNetworkNotReady sentinel error for controlled flow
through the delete path.
Replace the "not running" check in reconcileLoadBalancer with a
switch that only blocks on "changing". When the LB is degraded or in error,
proceed with member reconciliation so stale pool members get removed.
Previously, a degraded LB could block forever, creating a deadlock.

Also refactor and clean up the load balancer tests:
- Add constants for all used cloudscale LB API statuses (running, changing,
  degraded, error).
- Fix the "creating" mock status.
- Introduce reusable "exists" mock helpers and auto-default nil mocks,
  cutting orchestrator test setup from ~20 lines to ~3 lines each.
- Remove 12 duplicate sub-reconciler tests already covered by
  ensureResource tests in cloudscale_services_test.go.
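The switch described above reduces to gating only on the transitional status. This is a simplified sketch; the constant names follow the cloudscale LB API statuses mentioned in the commit, and `shouldReconcileMembers` is a hypothetical helper, not the PR's actual function:

```go
package main

import "fmt"

// Cloudscale LB API statuses referenced in the commit.
const (
	statusRunning  = "running"
	statusChanging = "changing"
	statusDegraded = "degraded"
	statusError    = "error"
)

// shouldReconcileMembers blocks only while the LB is "changing". Degraded
// and error states still reconcile, so stale pool members can be removed
// and the LB can recover instead of deadlocking.
func shouldReconcileMembers(status string) bool {
	switch status {
	case statusChanging:
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(shouldReconcileMembers(statusChanging)) // false
	fmt.Println(shouldReconcileMembers(statusDegraded)) // true
}
```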
Cilium's default cluster pool is `10.0.0.0/8`, carved into per-node `/24`
pod CIDRs (starting at `10.0.0.0/24`). If the machine network subnet uses
the same CIDR, the load balancer health check may break, resulting in a
broken cluster.
Setting 172.18.0.0/24 avoids this:
 - it does not overlap with Cilium's default cluster pool 10.0.0.0/8;
 - it does not overlap with the template's clusterNetwork.pods.cidrBlocks
   ["192.168.0.0/16"] or services.cidrBlocks ["10.96.0.0/12"];
 - it avoids 172.17.0.0/16 (Docker's default bridge on many Linux hosts).

Old range: 10.100.0.0/24 was still inside Cilium's 10.0.0.0/8 and only worked for small clusters.
…e-fast

avoids test failures due to timeouts, since conformance tests take a long time
@mweibel
Collaborator Author

mweibel commented May 12, 2026

E2E tests ran through

