Skip to content

feat(start): wait for cluster readiness on Kind/K3d cluster start#4897

Open
devantler wants to merge 4 commits into
mainfrom
claude/exciting-heyrovsky-6d1681
Open

feat(start): wait for cluster readiness on Kind/K3d cluster start#4897
devantler wants to merge 4 commits into
mainfrom
claude/exciting-heyrovsky-6d1681

Conversation

@devantler
Copy link
Copy Markdown
Contributor

What & why

ksail cluster start for Kind (Vanilla) and K3d (K3s) previously returned as soon as the node container was started (KindProvisioner.Start just called infraProvider.StartNodes; K3dProvisioner.Start just ran the k3d start command). That left the cluster racing the API server's authorizer warm-up — an operation run immediately after cluster start could hit a transient 403 ("kubernetes-admin cannot list namespaces") or the asynchronous creation of a namespace's default ServiceAccount.

The Talos Docker path already blocked on readiness (waitForDockerClusterReadyAfterStart). This PR brings Kind and K3d in line, so users get a genuinely ready cluster from ksail cluster start.

Changes

  • pkg/k8s: new WaitForClusterReady(ctx, kubeconfigPath, contextName) — waits for the API server /readyz (reusing WaitForAPIServer), then polls a basic authorized read (Namespaces().List) until it succeeds. Any error — including the transient warm-up 403 — is treated as "not ready yet" and retried until success or timeout (ErrClusterNotReady).
  • Kind (pkg/svc/provisioner/cluster/kind): Start() calls the readiness wait with kind-<name> after StartNodes.
  • K3d (pkg/svc/provisioner/cluster/k3d): Start() calls the readiness wait with k3d-<name> after the start command. Adds a kubeconfig field (threaded from Connection.Kubeconfig via a new WithKubeconfig setter in the factory; defaults to ~/.kube/config).
  • Both use an injectable waitForReady seam (WithWaitForReadyForTest) so unit tests run without a live cluster. The no-distribution MultiProvisioner.Start path inherits this automatically.

Testing

  • New unit tests: authorized-read success / transient-403-retry / timeout (pkg/k8s); per-provisioner tests asserting the wait runs with the correct kubeconfig + kind-/k3d- context, that errors propagate, and that the wait is skipped when the start command itself fails.
  • go build ./..., go test ./... pass; golangci-lint clean on the diff.
  • Manual (real Kind cluster): create → stopstart (now blocks ~6.4 s on readiness, vs. near-instant before) → immediately ksail workload get namespaces returned all namespaces with no 403 race. Cluster and mirror-registry containers cleaned up afterward.

Reviewer notes

  • The CI "wait for host cluster ready after start" gate / SetupDinDWaitForDefaultServiceAccount workaround referenced in the original motivation lives on a separate branch, not on main. There is no readiness-gate workaround on this branch to remove. The system-test action's 3-attempt retry loop guards genuine cluster start failures (and now also a readiness-wait timeout), so it remains useful. The CI gate can be dropped when that sibling branch is reconciled with this in-Start() wait.

🤖 Generated with Claude Code

`ksail cluster start` for Kind (Vanilla) and K3d (K3s) previously returned
as soon as the node container was started, leaving the cluster racing the
API server's authorizer warm-up: an operation run immediately after start
could hit a transient 403 ("kubernetes-admin cannot list namespaces") or
the asynchronous creation of the default ServiceAccount. The Talos Docker
path already blocked on readiness; Kind/K3d now match that behavior.

Add `k8s.WaitForClusterReady`, which waits for the API server `/readyz`
and then polls a basic authorized read (listing namespaces), retrying any
error — including the transient warm-up 403 — until it succeeds or times
out. Kind and K3d `Start()` invoke it after the node container starts,
via an injectable seam so unit tests run without a live cluster. K3d gains
a kubeconfig field (threaded from `Connection.Kubeconfig`) to build the
client; Kind reuses its existing kubeconfig path.

Validated with unit tests and manually: create a Kind cluster, stop,
start (now blocks ~6s on readiness), then immediately
`ksail workload get namespaces` succeeds with no race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 21:44
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 26, 2026

MegaLinter analysis: Error

❌ COPYPASTE / jscpd - 7 errors
Clone found (typescript):
 - vsce/src/mcp/schemaClient.ts [325:18 - 332:33] (7 lines, 56 tokens)
   vsce/src/mcp/schemaClient.ts [309:20 - 316:26]

Clone found (typescript):
 - vsce/src/ksail/clusters.ts [381:34 - 390:46] (9 lines, 59 tokens)
   vsce/src/ksail/clusters.ts [361:35 - 370:47]

Clone found (typescript):
 - vsce/src/ksail/clusters.ts [406:44 - 415:46] (9 lines, 61 tokens)
   vsce/src/ksail/clusters.ts [341:33 - 350:48]

Clone found (typescript):
 - vsce/src/ksail/clusters.ts [426:36 - 435:48] (9 lines, 59 tokens)
   vsce/src/ksail/clusters.ts [361:35 - 370:47]

Clone found (typescript):
 - vsce/src/commands/prompts.ts [658:29 - 669:3] (11 lines, 55 tokens)
   vsce/src/commands/prompts.ts [201:8 - 211:3]

Clone found (go):
 - pkg/svc/tenant/argocd.go [400:61 - 413:37] (13 lines, 99 tokens)
   pkg/fsutil/configmanager/talos/manager.go [231:64 - 244:37]

Clone found (typescript):
 - vsce/src/extension.ts [162:46 - 173:18] (11 lines, 59 tokens)
   vsce/src/extension.ts [140:39 - 151:5]

┌────────────┬────────────────┬─────────────┬──────────────┬──────────────┬──────────────────┬───────────────────┐
│ Format     │ Files analyzed │ Total lines │ Total tokens │ Clones found │ Duplicated lines │ Duplicated tokens │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ toml       │ 6              │ 64          │ 277          │ 0            │ 0 (0%)           │ 0 (0%)            │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ go         │ 592            │ 98153       │ 642913       │ 1            │ 13 (0.01%)       │ 99 (0.02%)        │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ markdown   │ 7              │ 302         │ 638          │ 0            │ 0 (0%)           │ 0 (0%)            │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ typescript │ 23             │ 4084        │ 22492        │ 6            │ 56 (1.37%)       │ 349 (1.55%)       │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ javascript │ 15             │ 1108        │ 9310         │ 0            │ 0 (0%)           │ 0 (0%)            │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ tsx        │ 12             │ 1781        │ 14802        │ 0            │ 0 (0%)           │ 0 (0%)            │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ css        │ 2              │ 45          │ 148          │ 0            │ 0 (0%)           │ 0 (0%)            │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ smarty     │ 1              │ 63          │ 1039         │ 0            │ 0 (0%)           │ 0 (0%)            │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ bash       │ 2              │ 52          │ 164          │ 0            │ 0 (0%)           │ 0 (0%)            │
├────────────┼────────────────┼─────────────┼──────────────┼──────────────┼──────────────────┼───────────────────┤
│ Total:     │ 660            │ 105652      │ 691783       │ 7            │ 69 (0.07%)       │ 448 (0.06%)       │
└────────────┴────────────────┴─────────────┴──────────────┴──────────────┴──────────────────┴───────────────────┘
Found 7 clones.
HTML report saved to megalinter-reports/copy-paste/html/
ERROR: jscpd found too many duplicates (0.07%) over threshold (0%)
Error: ERROR: jscpd found too many duplicates (0.07%) over threshold (0%)
    at ThresholdReporter.report (/node-deps/node_modules/@jscpd/finder/dist/index.js:615:13)
    at /node-deps/node_modules/@jscpd/finder/dist/index.js:109:18
    at Array.forEach (<anonymous>)
    at /node-deps/node_modules/@jscpd/finder/dist/index.js:108:22
    at async /node-deps/node_modules/jscpd/dist/bin/jscpd.js:9:5
❌ REPOSITORY / osv-scanner - 1 error
Scanning dir .
Starting filesystem walk for root: /
Scanned docs/package-lock.json file and found 549 packages
Scanned desktop/go.mod file and found 401 packages
Scanned web/ui/package-lock.json file and found 187 packages
Scanned go.mod file and found 965 packages
Scanned vsce/package-lock.json file and found 592 packages
End status: 306 dirs visited, 2039 inodes visited, 5 Extract calls, 186.26389ms elapsed, 186.26409ms wall time
Filtered 1 local/unscannable package/s from the scan.
Failed to run code analysis (govulncheck) on 'desktop/go.mod' because govulncheck: loading packages: 
There are errors with the provided package patterns:

-: # github.com/webview/webview_go
# [pkg-config --cflags  -- gtk+-3.0 webkit2gtk-4.0]
Package gtk+-3.0 was not found in the pkg-config search path.
Perhaps you should add the directory containing `gtk+-3.0.pc'
to the PKG_CONFIG_PATH environment variable
Package 'gtk+-3.0' not found
Package 'webkit2gtk-4.0' not found
/go/pkg/mod/github.com/webview/webview_go@v0.0.0-20240831120633-6173450d4dd6/webview.go:26:8: could not import C (no metadata for C)
desktop/main.go:64:42: cannot use webview.HintNone (constant unknown with invalid type) as webview.Hint value in argument to view.SetSize

For details on package patterns, see https://pkg.go.dev/cmd/go#hdr-Package_lists_and_patterns.

(the Go toolchain is required)
❌ ACTION / zizmor - 1 error
INFO zizmor: 🌈 zizmor v1.25.0
fatal: no audit was performed
'artipacked' audit failed on file://.github/workflows/cd.yaml

Caused by:
    0: error in 'artipacked' audit
    1: couldn't list tags for actions/checkout
    2: request error while accessing GitHub API
    3: HTTP status client error (401 Unauthorized) for url (https://github.com/actions/checkout.git/git-upload-pack)


[ZizmorLinter] Zizmor failed to reach the GitHub API.
To allow zizmor to use GITHUB_TOKEN, add the following to your .mega-linter.yml:
ACTION_ZIZMOR_UNSECURED_ENV_VARIABLES:
  - GITHUB_TOKEN

✅ Linters with no issues

actionlint, bash-exec, git_diff, hadolint, jsonlint, lychee, markdown-table-formatter, markdownlint, prettier, prettier, shellcheck, shfmt, stylelint, syft, trivy-sbom, trufflehog, v8r, v8r, yamllint

Notices

📣 MegaLinter 9.5.0 is out! Discover the new features and security recommendations in the release announcement. (Skip this info by defining SECURITY_SUGGESTIONS: false)

See detailed reports in MegaLinter artifacts

MegaLinter is graciously provided by OX Security
Show us your support by starring ⭐ the repository

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes ksail cluster start for Kind (Vanilla) and K3d (K3s) block until the Kubernetes API is not only reachable but also usable (i.e., an authorized read succeeds), aligning behavior with the existing Talos readiness gating to avoid post-start transient 403/SA races.

Changes:

  • Add pkg/k8s.WaitForClusterReady(...), which waits for /readyz and then polls an authorized Namespaces().List(...) until success or timeout.
  • Update Kind and K3d provisioners’ Start() to invoke the readiness wait after starting nodes/cluster, with injectable seams to keep unit tests clusterless.
  • Thread kubeconfig path into K3d provisioner via a new WithKubeconfig(...) and factory wiring; add unit tests covering readiness wait behavior and error propagation.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/svc/provisioner/cluster/kind/provisioner.go Calls readiness wait after StartNodes using kind-<name> context.
pkg/svc/provisioner/cluster/kind/provisioner_test.go Adds tests asserting readiness wait is invoked, skipped on start failure, and error propagation.
pkg/svc/provisioner/cluster/kind/export_test.go Adds WithWaitForReadyForTest seam for clusterless Start() tests.
pkg/svc/provisioner/cluster/k3d/provisioner.go Calls readiness wait after k3d cluster start; adds kubeconfig field + WithKubeconfig.
pkg/svc/provisioner/cluster/k3d/provisioner_test.go Adds tests for readiness wait invocation, error propagation, and kubeconfig defaults/override.
pkg/svc/provisioner/cluster/k3d/export_test.go Adds WithWaitForReadyForTest seam and KubeconfigForTest.
pkg/svc/provisioner/cluster/factory.go Wires cluster connection kubeconfig into the K3d provisioner via WithKubeconfig.
pkg/k8s/wait.go Introduces WaitForClusterReady and internal waitForAuthorizedRead.
pkg/k8s/wait_test.go Adds unit tests for authorized-read success, retry-on-403, timeout, and invalid kubeconfig behavior.
pkg/k8s/export_test.go Exposes waitForAuthorizedRead via WaitForAuthorizedReadForTest for unit testing with fake clientsets.
pkg/k8s/errors.go Adds ErrClusterNotReady sentinel error.

Comment thread pkg/k8s/wait.go Outdated
Comment thread pkg/k8s/wait.go
Comment thread pkg/k8s/wait_test.go
@devantler devantler marked this pull request as ready for review May 26, 2026 21:52
…lling

Address review feedback on WaitForClusterReady:

- Canonicalize the resolved kubeconfig path with fsutil.EvalCanonicalPath
  (resolves symlinks) before loading it, since the path can come from
  user/config input — aligns with the repo's path-safety practices.
- Lower the initial readiness-poll backoff from 1s to 100ms (named
  initialPollInterval), so readiness is detected promptly after the API
  server reports ready and unit tests no longer pay a fixed ~1s startup
  delay. The ×2 backoff and 5s cap are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ksail-bot ksail-bot Bot enabled auto-merge May 26, 2026 21:56
@github-code-quality
Copy link
Copy Markdown
Contributor

github-code-quality Bot commented May 26, 2026

Code Coverage Overview

Languages: Go

Go / code-coverage/go

The overall coverage in the branch is 55%. Coverage data for the branch is not yet available.

Show a code coverage summary of the most covered files.
File 6d2a270 +/-
pkg/cli/cmd/wor...load/gen/gen.go 96%
pkg/notify/progress.go 94%
pkg/svc/image/exporter.go 92%
pkg/fsutil/conf...alos/configs.go 84%
pkg/cli/cmd/cipher/cipher.go 79%
internal/contro...r_controller.go 79%
pkg/fsutil/gene...os/generator.go 78%
pkg/cli/ui/chat/model.go 70%
pkg/cli/cmd/cluster/cluster.go 55%
pkg/cli/cmd/wor...oad/workload.go 52%

Updated May 27, 2026 05:34 UTC
Code Coverage is in Public Preview. Learn more and provide us with your feedback.

@devantler
Copy link
Copy Markdown
Contributor Author

ℹ️ The failing 🧹 Lint - mega-linter check here is not caused by this PR's changes. MegaLinter runs jscpd with VALIDATE_ALL_CODEBASE: true + threshold: 0, so it full-scans the whole repo and flags 7 pre-existing clones in main (6 in vsce/*.ts, 1 Go clone tenant/argocd.goconfigmanager/talos/manager.go) — none in files this PR touches.

Fixed separately in #4898 (deduplicates all 7; verified 0 clones via the real MegaLinter image). Once #4898 merges, rebasing this branch will turn mega-linter green here.

🤖 automated note from a Claude Code session

Copilot AI review requested due to automatic review settings May 27, 2026 05:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comment thread pkg/svc/provisioner/cluster/k3d/provisioner.go
WithKubeconfig trimmed the path only for its emptiness check but stored the
original, so a value with leading/trailing whitespace would be passed to the
post-start readiness wait and fail kubeconfig loading. Trim once and store the
trimmed value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🫴 Ready

Development

Successfully merging this pull request may close these issues.

2 participants