feat(start): wait for cluster readiness on Kind/K3d cluster start #4897
devantler wants to merge 4 commits
`ksail cluster start` for Kind (Vanilla) and K3d (K3s) previously returned
as soon as the node container was started, leaving the cluster racing the
API server's authorizer warm-up: an operation run immediately after start
could hit a transient 403 ("kubernetes-admin cannot list namespaces") or
race the asynchronous creation of the default ServiceAccount. The Talos Docker
path already blocked on readiness; Kind/K3d now match that behavior.
Add `k8s.WaitForClusterReady`, which waits for the API server `/readyz`
and then polls a basic authorized read (listing namespaces), retrying any
error — including the transient warm-up 403 — until it succeeds or times
out. Kind and K3d `Start()` invoke it after the node container starts,
via an injectable seam so unit tests run without a live cluster. K3d gains
a kubeconfig field (threaded from `Connection.Kubeconfig`) to build the
client; Kind reuses its existing kubeconfig path.
Validated with unit tests and manually: create a Kind cluster, stop,
start (now blocks ~6s on readiness), then immediately
`ksail workload get namespaces` succeeds with no race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MegaLinter analysis: Error

- ❌ COPYPASTE / jscpd: 7 errors
- ❌ REPOSITORY / osv-scanner: 1 error
- ❌ ACTION / zizmor: 1 error
- ✅ Linters with no issues: actionlint, bash-exec, git_diff, hadolint, jsonlint, lychee, markdown-table-formatter, markdownlint, prettier, prettier, shellcheck, shfmt, stylelint, syft, trivy-sbom, trufflehog, v8r, v8r, yamllint
Pull request overview
This PR makes `ksail cluster start` for Kind (Vanilla) and K3d (K3s) block until the Kubernetes API is not only reachable but also usable (i.e., an authorized read succeeds). This aligns the behavior with the existing Talos readiness gating and avoids the transient 403 and default-ServiceAccount races that could follow a start.
Changes:
- Add `pkg/k8s.WaitForClusterReady(...)`, which waits for `/readyz` and then polls an authorized `Namespaces().List(...)` until success or timeout.
- Update Kind and K3d provisioners' `Start()` to invoke the readiness wait after starting nodes/cluster, with injectable seams to keep unit tests clusterless.
- Thread kubeconfig path into K3d provisioner via a new `WithKubeconfig(...)` and factory wiring; add unit tests covering readiness wait behavior and error propagation.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/svc/provisioner/cluster/kind/provisioner.go | Calls readiness wait after StartNodes using kind-<name> context. |
| pkg/svc/provisioner/cluster/kind/provisioner_test.go | Adds tests asserting readiness wait is invoked, skipped on start failure, and error propagation. |
| pkg/svc/provisioner/cluster/kind/export_test.go | Adds WithWaitForReadyForTest seam for clusterless Start() tests. |
| pkg/svc/provisioner/cluster/k3d/provisioner.go | Calls readiness wait after k3d cluster start; adds kubeconfig field + WithKubeconfig. |
| pkg/svc/provisioner/cluster/k3d/provisioner_test.go | Adds tests for readiness wait invocation, error propagation, and kubeconfig defaults/override. |
| pkg/svc/provisioner/cluster/k3d/export_test.go | Adds WithWaitForReadyForTest seam and KubeconfigForTest. |
| pkg/svc/provisioner/cluster/factory.go | Wires cluster connection kubeconfig into the K3d provisioner via WithKubeconfig. |
| pkg/k8s/wait.go | Introduces WaitForClusterReady and internal waitForAuthorizedRead. |
| pkg/k8s/wait_test.go | Adds unit tests for authorized-read success, retry-on-403, timeout, and invalid kubeconfig behavior. |
| pkg/k8s/export_test.go | Exposes waitForAuthorizedRead via WaitForAuthorizedReadForTest for unit testing with fake clientsets. |
| pkg/k8s/errors.go | Adds ErrClusterNotReady sentinel error. |
…lling

Address review feedback on WaitForClusterReady:

- Canonicalize the resolved kubeconfig path with fsutil.EvalCanonicalPath
  (resolves symlinks) before loading it, since the path can come from
  user/config input. This aligns with the repo's path-safety practices.
- Lower the initial readiness-poll backoff from 1s to 100ms (named
  initialPollInterval), so readiness is detected promptly after the API
  server reports ready and unit tests no longer pay a fixed ~1s startup
  delay. The ×2 backoff and 5s cap are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Coverage Overview

Languages: Go

Go / code-coverage/go: The overall coverage in the branch is 55%. Coverage data for the branch is not yet available.
ℹ️ The failing mega-linter check is fixed separately in #4898 (deduplicates all 7 jscpd clones; verified 0 clones via the real MegaLinter image). Once #4898 merges, rebasing this branch will turn mega-linter green here. 🤖 automated note from a Claude Code session
WithKubeconfig trimmed the path only for its emptiness check but stored
the original, so a value with leading/trailing whitespace would be passed
to the post-start readiness wait and fail kubeconfig loading. Trim once
and store the trimmed value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
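
The "trim once and store the trimmed value" fix can be illustrated as follows. The `provisioner` struct and the setter's return type are hypothetical stand-ins for the K3d provisioner, and leaving the existing value untouched for an empty path is an assumption about how the default is preserved.

```go
package main

import (
	"fmt"
	"strings"
)

// provisioner is a hypothetical stand-in for the K3d provisioner.
type provisioner struct {
	kubeconfig string
}

// WithKubeconfig trims the path once and stores the trimmed value.
// An empty or whitespace-only path leaves the existing value unchanged (assumed behavior).
func (p *provisioner) WithKubeconfig(path string) *provisioner {
	trimmed := strings.TrimSpace(path)
	if trimmed != "" {
		p.kubeconfig = trimmed
	}
	return p
}

func main() {
	p := &provisioner{kubeconfig: "~/.kube/config"}
	fmt.Printf("%q\n", p.WithKubeconfig("  /tmp/kubeconfig \n").kubeconfig) // prints "/tmp/kubeconfig"

	q := &provisioner{kubeconfig: "~/.kube/config"}
	fmt.Printf("%q\n", q.WithKubeconfig("   ").kubeconfig) // prints "~/.kube/config"
}
```

The bug described in the commit corresponds to checking `trimmed != ""` but assigning `path`; using the same variable for both the check and the assignment removes that mismatch.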

What & why

`ksail cluster start` for Kind (Vanilla) and K3d (K3s) previously returned as soon as the node container was started (`KindProvisioner.Start` just called `infraProvider.StartNodes`; `K3dProvisioner.Start` just ran the k3d start command). That left the cluster racing the API server's authorizer warm-up: an operation run immediately after `cluster start` could hit a transient `403` ("kubernetes-admin cannot list namespaces") or race the asynchronous creation of a namespace's default ServiceAccount.

The Talos Docker path already blocked on readiness (`waitForDockerClusterReadyAfterStart`). This PR brings Kind and K3d in line, so users get a genuinely ready cluster from `ksail cluster start`.

Changes

- `pkg/k8s`: new `WaitForClusterReady(ctx, kubeconfigPath, contextName)`. It waits for the API server `/readyz` (reusing `WaitForAPIServer`), then polls a basic authorized read (`Namespaces().List`) until it succeeds. Any error, including the transient warm-up `403`, is treated as "not ready yet" and retried until success or timeout (`ErrClusterNotReady`).
- Kind (`pkg/svc/provisioner/cluster/kind`): `Start()` calls the readiness wait with `kind-<name>` after `StartNodes`.
- K3d (`pkg/svc/provisioner/cluster/k3d`): `Start()` calls the readiness wait with `k3d-<name>` after the start command. Adds a `kubeconfig` field (threaded from `Connection.Kubeconfig` via a new `WithKubeconfig` setter in the factory; defaults to `~/.kube/config`).
- Both provisioners call the wait through an injectable `waitForReady` seam (`WithWaitForReadyForTest`) so unit tests run without a live cluster. The no-distribution `MultiProvisioner.Start` path inherits this automatically.

Testing

- Unit tests for the readiness wait (`pkg/k8s`); per-provisioner tests asserting the wait runs with the correct kubeconfig + `kind-`/`k3d-` context, that errors propagate, and that the wait is skipped when the start command itself fails.
- `go build ./...`, `go test ./...` pass; `golangci-lint` clean on the diff.
- Manual check on a Kind cluster: `stop` → `start` (now blocks ~6.4 s on readiness, vs. near-instant before) → immediately `ksail workload get namespaces` returned all namespaces with no 403 race. Cluster and mirror-registry containers cleaned up afterward.

Reviewer notes

The `SetupDinD` → `WaitForDefaultServiceAccount` workaround referenced in the original motivation lives on a separate branch, not on `main`. There is no readiness-gate workaround on this branch to remove. The system-test action's 3-attempt retry loop guards genuine `cluster start` failures (and now also a readiness-wait timeout), so it remains useful. The CI gate can be dropped when that sibling branch is reconciled with this in-`Start()` wait.

🤖 Generated with Claude Code