Improve resiliency #31
Open
mweibel wants to merge 9 commits into
Conversation
- Add IsTimeoutError/IsDeadlineExceeded helpers to the cloudscale client.
- Configure the client with a tuned http.Transport (5s dial, 10s response header, 60s overall).
- Wrap all .Create() call sites with IsTimeoutError checks and a 90s requeue.
- Update reconcileNetwork and reconcileManagedNetwork to return (ctrl.Result, error).
- Add comprehensive timeout-handling unit tests across all controller files.
Prevent 'Cannot delete non-empty server group' race condition during cluster deletion by checking CloudscaleMachine resources for owned servers before attempting server group deletion. When owned servers are still present, return a sentinel error that triggers a 10s graceful requeue instead of exponential backoff. Also skip server groups containing foreign servers (servers from other clusters sharing this server group) to prevent accidental deletion.
Convert the 'There are still one or more load balancer pool members' API error into a graceful 10s requeue instead of propagating it as a failure (which triggers exponential backoff up to 30 minutes). Add an isLBPoolMembersError() helper to detect the specific API error message and an errNetworkNotReady sentinel error for controlled flow through the delete path.
Replace the "not running" check in reconcileLoadBalancer with a switch that only blocks on "changing". When the LB is degraded or in error, proceed with member reconciliation to remove stale pool members; previously, a degraded LB might have blocked forever, creating a deadlock.

Also refactor and clean up the load balancer tests:
- Add constants for all used cloudscale LB API statuses (running, changing, degraded, error).
- Fix the "creating" mock status.
- Introduce reusable "exists" mock helpers and auto-default nil mocks, cutting orchestrator test setup from ~20 lines to ~3 lines each.
- Remove 12 duplicate sub-reconciler tests already covered by the ensureResource tests in cloudscale_services_test.go.
Cilium's CNI allocates per-node pod CIDRs (`/24` blocks by default) from its default cluster pool `10.0.0.0/8`. If our network subnet CIDR overlaps with that range, the load balancer health check may break, which results in a broken cluster. Setting 172.18.0.0/24 avoids this:
- It does not overlap with Cilium's default cluster-pool 10.0.0.0/8.
- It does not overlap with the template's clusterNetwork.pods.cidrBlocks ["192.168.0.0/16"] or services.cidrBlocks ["10.96.0.0/12"].
- It avoids 172.17.0.0/16 (Docker's default bridge on many Linux hosts).

The old range, 10.100.0.0/24, was still inside Cilium's 10.0.0.0/8 and only worked for small clusters.
…e-fast avoids test failures due to timeouts, because conformance tests take a long time.