
Improve resiliency #31

Open
mweibel wants to merge 9 commits into main from improve-resiliency

Conversation

@mweibel
Collaborator

mweibel commented May 12, 2026

  • set up HTTP timeout configuration and reuse the HTTP transport, to avoid DNS lookups on every NewClient instantiation and to keep idle connections alive between reconciles
  • improve timeout handling per reconcile, depending on the API endpoint (creates have longer timeouts than reads, for example)
  • use consistent requeueAfter timeouts for creates to avoid potential create -> timeout -> create -> timeout loops as outlined in feat: server groups #7 (comment)
  • do not delete the server group on cluster delete if servers are still present
  • some refactoring of test handling (replacing t.Fatal with g.Fail to use the test framework everywhere)
  • handle network/LB pending errors specifically
  • set up concurrency flags for the cluster (between 1-4) and machine (between 1-10) controllers. Initially we wanted to hardcode concurrency to 1 because, API-side, certain CREATEs are not idempotent (e.g. server groups), but that would prevent reconciling different clusters and machines concurrently. Instead, we added a mutex around server group creation to prevent concurrency issues there.
  • fix a bug where a LoadBalancer never left the degraded state because we stopped reconciling pool members while degraded (reported by a beta tester)
  • fix a network CIDR overlap when using Cilium by choosing a different CIDR
  • remove LoadBalancer IPFamily: the field was unused, and the API does not support it anyway

mweibel added 8 commits May 12, 2026 11:17
- Add IsTimeoutError/IsDeadlineExceeded helpers to cloudscale client
- Configure client with tuned http.Transport (5s dial, 10s response header, 60s overall)
- Wrap all .Create() call sites with IsTimeoutError checks and 90s requeue
- Update reconcileNetwork and reconcileManagedNetwork to return (ctrl.Result, error)
- Add comprehensive timeout handling unit tests across all controller files
Prevent 'Cannot delete non-empty server group' race condition during cluster
deletion by checking CloudscaleMachine resources for owned servers before
attempting server group deletion. When owned servers are still present,
return a sentinel error that triggers a 10s graceful requeue instead of
exponential backoff.

Also skip server groups containing foreign servers (servers from other
clusters sharing this server group) to prevent accidental deletion.
Convert the 'There are still one or more load balancer pool members'
API error into a graceful 10s requeue instead of propagating it as
a failure (which triggers exponential backoff up to 30 minutes).

Added isLBPoolMembersError() helper to detect the specific API error
message and errNetworkNotReady sentinel error for controlled flow
through the delete path.
Replace the "not running" check in reconcileLoadBalancer with a
switch that only blocks on "changing". When the LB is degraded or in error,
proceed with member reconciliation so stale pool members get removed.
Previously, a degraded LB could block forever, creating a deadlock.

Also refactor and clean up the load balancer tests:
- Add constants for all used cloudscale LB API statuses (running, changing,
  degraded, error).
- Fix the "creating" mock status.
- Introduce reusable "exists" mock helpers and auto-default nil mocks,
  cutting orchestrator test setup from ~20 lines to ~3 lines each.
- Remove 12 duplicate sub-reconciler tests already covered by
  ensureResource tests in cloudscale_services_test.go.
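The switch described above reduces to gating only on the transitional status. This is a simplified sketch; the constant names follow the cloudscale LB API statuses mentioned in the commit, and `shouldReconcileMembers` is a hypothetical helper, not the PR's actual function:

```go
package main

import "fmt"

// Cloudscale LB API statuses referenced in the commit.
const (
	statusRunning  = "running"
	statusChanging = "changing"
	statusDegraded = "degraded"
	statusError    = "error"
)

// shouldReconcileMembers blocks only while the LB is "changing". Degraded
// and error states still reconcile, so stale pool members can be removed
// and the LB can recover instead of deadlocking.
func shouldReconcileMembers(status string) bool {
	switch status {
	case statusChanging:
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(shouldReconcileMembers(statusChanging)) // false
	fmt.Println(shouldReconcileMembers(statusDegraded)) // true
}
```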
Cilium's default cluster pool is `10.0.0.0/8`, carved into per-node `/24`
pod CIDRs (starting at `10.0.0.0/24`). If the machine network subnet uses
the same CIDR, the load balancer health check may break, resulting in a
broken cluster.
Setting 172.18.0.0/24 avoids this:
 - it does not overlap with Cilium's default cluster pool 10.0.0.0/8;
 - it does not overlap with the template's clusterNetwork.pods.cidrBlocks
   ["192.168.0.0/16"] or services.cidrBlocks ["10.96.0.0/12"];
 - it avoids 172.17.0.0/16 (Docker's default bridge on many Linux hosts).

Old range: 10.100.0.0/24 was still inside Cilium's 10.0.0.0/8 and only worked for small clusters.
…e-fast

avoids test failures due to timeouts, since conformance tests take a long time
@mweibel
Collaborator Author

mweibel commented May 12, 2026

E2E tests ran through

