Improve test isolation to reduce flakiness#619
Merged
Conversation
PR 5 landed a COUNT(DISTINCT (writer, id)) assertion in TestK8sDuckLakeConcurrentWriters to mask a flake where the test harness's at-least-once retry semantics double-INSERTed batches when the prior destructive test (TestK8sDuckLakeDurabilityAcrossWorkerRestart) left the control plane still doing housekeeping work. That workaround ducks the real problem: a regression that emitted duplicate rows would now silently pass. This PR puts the test back on a COUNT(*) check and addresses the isolation gap properly. waitForControlPlaneIdle (controlplane_idle_helper_test.go) polls the K8s apiserver until the worker-pod set is stable for 3 consecutive ticks at 1s intervals, with every pod Running + PodReady=True and nothing in a DeletionTimestamp/Pending state. That's the test isolation primitive that was missing: destructive tests now call it at the end so the next test doesn't run while the CP is mid-cleanup (post-retire artifact reaper, warm-pool replenishment activation). Wired into the two destructive tests in the suite: - TestK8sDuckLakeDurabilityAcrossWorkerRestart (deletes a worker pod) - TestK8sWorkerCrashRecovery (also deletes a worker pod) Signals are K8s-API-only by design. Runtime-store transitions (spawning, activating, draining) aren't directly observed but every CP-side state change with visible cluster side effects shows up here: the CP can't move a worker row to a stable state without making K8s API calls, and it's the apiserver load from those calls that drops the next test's port-forward. Per tests/k8s/CLAUDE.md the destructive TestMain blocks running unit tests in this package; helper logic is verified by integration use. Compile-only check via 'go test -c -tags k8s_integration kubernetes' passes; full integration validation comes from CI.
This was referenced May 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
PR #618 landed a `COUNT(DISTINCT (writer, id))` assertion in `TestK8sDuckLakeConcurrentWriters` to unblock CI when the test flaked. The flake was real — but the fix masked it instead of solving it. A future regression that emitted duplicate rows would now silently pass that assertion.
What's actually happening
`TestK8sDuckLakeDurabilityAcrossWorkerRestart` deletes a worker pod, calls `waitForWorkerReplacement` (which only checks for one ready replacement), and returns. The CP's housekeeping continues for several seconds after that helper returns:
That trailing apiserver load is what bumps the next test's port-forward. When the next test happens to be `TestK8sDuckLakeConcurrentWriters` (4 goroutines doing concurrent INSERTs), a port-forward drop mid-INSERT lets the at-least-once retry helper double-INSERT a 25-row batch — and the test sees `row count = 175, want 100`.
What this PR does
Revert the `COUNT(DISTINCT (writer, id))` assertion back to `COUNT(*)`. A regression that produced duplicate rows must be caught here.
Add `waitForControlPlaneIdle(t, timeout)` in a new `controlplane_idle_helper_test.go`. Polls the K8s apiserver until:
Wire the helper into the two destructive tests:
What this PR doesn't fix
Runtime-store transitions (`spawning`/`activating`/`draining`) aren't directly observed. The argument for K8s-API-only is that the CP can't make a worker row reach a stable state (idle, hot, hot_idle) without making K8s API calls, so any state change with cluster-visible side effects — the side effects that drive the apiserver load that broke the port-forward — will show up here. If a regression appears where the CP transitions rows without touching K8s, we'd need to bring the runtime store into the helper.
Helper logic is verified by integration use only — `tests/k8s/CLAUDE.md` documents that the package's `TestMain` is destructive against the active kubeconfig and forbids running anything but the full integration suite. Compile + vet are clean via `go test -c -tags 'k8s_integration kubernetes'`.
Test plan
🤖 Generated with Claude Code