Skip to content

Improve test isolation to reduce flakiness#619

Merged
bill-ph merged 1 commit into
mainfrom
codex/pr6-test-isolation
May 24, 2026
Merged

Improve test isolation to reduce flakiness#619
bill-ph merged 1 commit into
mainfrom
codex/pr6-test-isolation

Conversation

@bill-ph
Copy link
Copy Markdown
Collaborator

@bill-ph bill-ph commented May 24, 2026

Why

PR #618 landed a `COUNT(DISTINCT (writer, id))` assertion in `TestK8sDuckLakeConcurrentWriters` to unblock CI when the test flaked. The flake was real — but the fix masked it instead of solving it. A future regression that emitted duplicate rows would now silently pass that assertion.

What's actually happening

`TestK8sDuckLakeDurabilityAcrossWorkerRestart` deletes a worker pod, calls `waitForWorkerReplacement` (which only checks for one ready replacement), and returns. The CP's housekeeping continues for several seconds after that helper returns:

  • Health checker fires `MarkLost` CAS.
  • `DeleteWorkerArtifacts` is async (`go p.retireWorkerPod`).
  • Warm-pool replenishment may kick off a second fresh spawn.
  • The replacement gets activated (lifecycle CAS, STS broker, Flight RPC).

That trailing apiserver load is what bumps the next test's port-forward. When the next test happens to be `TestK8sDuckLakeConcurrentWriters` (4 goroutines doing concurrent INSERTs), a port-forward drop mid-INSERT lets the at-least-once retry helper double-INSERT a 25-row batch — and the test sees `row count = 175, want 100`.

What this PR does

  1. Revert the `COUNT(DISTINCT (writer, id))` assertion back to `COUNT(*)`. A regression that produced duplicate rows must be caught here.

  2. Add `waitForControlPlaneIdle(t, timeout)` in a new `controlplane_idle_helper_test.go`. Polls the K8s apiserver until:

    • Every worker pod is `Running` + `PodReady=True` + no `DeletionTimestamp`.
    • The set of worker pod names is stable for 3 consecutive ticks at 1s.
  3. Wire the helper into the two destructive tests:

    • `TestK8sDuckLakeDurabilityAcrossWorkerRestart` (deletes the latest worker).
    • `TestK8sWorkerCrashRecovery` (deletes a worker, verifies recovery).

What this PR doesn't fix

Runtime-store transitions (`spawning`/`activating`/`draining`) aren't directly observed. The argument for K8s-API-only is that the CP can't make a worker row reach a stable state (idle, hot, hot_idle) without making K8s API calls, so any state change with cluster-visible side effects — the side effects that drive the apiserver load that broke the port-forward — will show up here. If a regression appears where the CP transitions rows without touching K8s, we'd need to bring the runtime store into the helper.

Helper logic is verified by integration use only — `tests/k8s/CLAUDE.md` documents that the package's `TestMain` is destructive against the active kubeconfig and forbids running anything but the full integration suite. Compile + vet are clean via `go test -c -tags 'k8s_integration kubernetes'`.

Test plan

  • CI's `k8s-integration-tests` job passes (stable across reruns).
  • Compile-only check passes.
  • No unit tests for the helper in this package (TestMain blocks it). If the helper needs unit coverage long term, the right move is a subpackage that doesn't share TestMain — call that out separately.

🤖 Generated with Claude Code

PR 5 landed a COUNT(DISTINCT (writer, id)) assertion in
TestK8sDuckLakeConcurrentWriters to mask a flake where the test
harness's at-least-once retry semantics double-INSERTed batches when
the prior destructive test (TestK8sDuckLakeDurabilityAcrossWorkerRestart)
left the control plane still doing housekeeping work. That workaround
ducks the real problem: a regression that emitted duplicate rows would
now silently pass.

This PR puts the test back on a COUNT(*) check and addresses the
isolation gap properly.

waitForControlPlaneIdle (controlplane_idle_helper_test.go) polls the
K8s apiserver until the worker-pod set is stable for 3 consecutive
ticks at 1s intervals, with every pod Running + PodReady=True and
nothing in a DeletionTimestamp/Pending state. That's the test
isolation primitive that was missing: destructive tests now call it
at the end so the next test doesn't run while the CP is mid-cleanup
(post-retire artifact reaper, warm-pool replenishment activation).

Wired into the two destructive tests in the suite:
- TestK8sDuckLakeDurabilityAcrossWorkerRestart (deletes a worker pod)
- TestK8sWorkerCrashRecovery (also deletes a worker pod)

Signals are K8s-API-only by design. Runtime-store transitions
(spawning, activating, draining) aren't directly observed but every
CP-side state change with visible cluster side effects shows up here:
the CP can't move a worker row to a stable state without making K8s
API calls, and it's the apiserver load from those calls that drops
the next test's port-forward.

Per tests/k8s/CLAUDE.md the destructive TestMain blocks running unit
tests in this package; helper logic is verified by integration use.
Compile-only check via 'go test -c -tags k8s_integration kubernetes'
passes; full integration validation comes from CI.
@bill-ph bill-ph changed the title Revert COUNT(DISTINCT) workaround; add waitForControlPlaneIdle helper Improve test isolation to reduce flakiness May 24, 2026
@bill-ph bill-ph marked this pull request as ready for review May 24, 2026 14:37
@bill-ph bill-ph merged commit f45cee7 into main May 24, 2026
22 checks passed
@bill-ph bill-ph deleted the codex/pr6-test-isolation branch May 24, 2026 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant