
fix(e2e): reduce E2E test flakiness #8480

Open
r2k1 wants to merge 2 commits into main from e2e-flakiness-fixes

Conversation

Contributor

@r2k1 r2k1 commented May 8, 2026

Summary

The E2E pipeline on main has a ~84% failure rate (46 of 55 builds failed over 3 weeks). This PR addresses 6 distinct sources of flakiness.

Changes (4 files, 56 insertions, 32 deletions)

1. Wait for cloud-provider initialization before pod scheduling (kube.go)

WaitUntilNodeReady() returned success on NodeReady=True even when the node still had the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint. Test pods don't tolerate this taint, so they stayed Pending until the test timed out.

Now keeps polling until the taint is removed.
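A minimal sketch of the readiness condition described above. The Node, Taint, and NodeCondition types are local stand-ins so the example compiles without k8s.io/api, and nodeSchedulable is a hypothetical helper name, not the function in kube.go.

```go
package main

import "fmt"

// Local stand-ins for the corev1 types. Field names are simplified for
// this sketch and are not the real API structs.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

type NodeCondition struct {
	Type   string
	Status string
}

type Node struct {
	Name       string
	Taints     []Taint
	Conditions []NodeCondition
}

const uninitializedTaintKey = "node.cloudprovider.kubernetes.io/uninitialized"

// nodeSchedulable (hypothetical helper) reports whether the node is Ready
// and no longer carries the cloud-provider uninitialized taint.
func nodeSchedulable(n Node) bool {
	ready := false
	for _, c := range n.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			ready = true
			break
		}
	}
	if !ready {
		return false
	}
	for _, t := range n.Taints {
		if t.Key == uninitializedTaintKey {
			return false // a poller would keep waiting here
		}
	}
	return true
}

func main() {
	n := Node{
		Name:       "node-0",
		Conditions: []NodeCondition{{Type: "Ready", Status: "True"}},
		Taints:     []Taint{{Key: uninitializedTaintKey, Value: "true", Effect: "NoSchedule"}},
	}
	fmt.Println(nodeSchedulable(n)) // false: Ready, but still tainted
	n.Taints = nil
	fmt.Println(nodeSchedulable(n)) // true
}
```

In a poll loop, a false result maps to "not done yet, no error", so the wait continues until the taint disappears or the timeout fires.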

2. Tolerate transient FailedCreatePodSandBox events (kube.go)

The "loopback interrupted system call" is a known transient kernel error. Kubelet retries sandbox creation automatically, but the test framework treated the first FailedCreatePodSandBox event as fatal (maxRetries=0).

Changed to count aggregate sandbox failures via event.Count and only fail after the threshold is exceeded. validatePodRunning passes maxRetries=3. Other callers still use maxRetries=0 (unchanged behavior).
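A rough sketch of the threshold logic, assuming events carry a Reason and an aggregated Count. The Event type is a local stand-in, and sandboxFailures and exceedsSandboxRetries are hypothetical names. Treating a zero Count as one occurrence is an assumption of this sketch.

```go
package main

import "fmt"

// Event is a local stand-in for the two event fields this sketch needs.
type Event struct {
	Reason string
	Count  int32
}

// sandboxFailures sums Count across FailedCreatePodSandBox events.
func sandboxFailures(events []Event) int32 {
	var total int32
	for _, e := range events {
		if e.Reason != "FailedCreatePodSandBox" {
			continue
		}
		if e.Count < 1 {
			total++ // assumption: an unset Count still means one occurrence
		} else {
			total += e.Count
		}
	}
	return total
}

// exceedsSandboxRetries reports whether the aggregate failure count is
// above the caller's retry budget.
func exceedsSandboxRetries(events []Event, maxRetries int32) bool {
	return sandboxFailures(events) > maxRetries
}

func main() {
	events := []Event{
		{Reason: "FailedCreatePodSandBox", Count: 2},
		{Reason: "Scheduled", Count: 1},
	}
	fmt.Println(exceedsSandboxRetries(events, 0)) // true: any failure is fatal
	fmt.Println(exceedsSandboxRetries(events, 3)) // false: still within budget
}
```

With maxRetries=0 the first failure is still fatal, which matches the unchanged behavior for other callers.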

3. Add DHCP rule to iptables allowlist (validation.go)

ValidateIPTablesCompatibleWithCiliumEBPF() rejected a normal host DHCP INPUT rule (-A INPUT -p udp -m udp --dport 68 -j ACCEPT). This is standard OS networking, unrelated to eBPF host routing.
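For illustration, an allowlist check over iptables-save style output might look like the sketch below. The exact-match map and the unexpectedRules helper are assumptions; the real validator may match rules differently.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedRules holds rules that are expected on the host and unrelated to
// eBPF host routing. Only the DHCP rule from this PR is listed here.
var allowedRules = map[string]bool{
	"-A INPUT -p udp -m udp --dport 68 -j ACCEPT": true,
}

// unexpectedRules (hypothetical helper) returns every "-A ..." rule line
// that is not in the allowlist.
func unexpectedRules(iptablesSave string) []string {
	var unexpected []string
	for _, line := range strings.Split(iptablesSave, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "-A ") {
			continue // skip table headers, chain policies, and COMMIT
		}
		if !allowedRules[line] {
			unexpected = append(unexpected, line)
		}
	}
	return unexpected
}

func main() {
	dump := "*filter\n:INPUT ACCEPT [0:0]\n" +
		"-A INPUT -p udp -m udp --dport 68 -j ACCEPT\n" +
		"-A FORWARD -j DROP\nCOMMIT\n"
	fmt.Println(unexpectedRules(dump))
}
```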

4. Increase wireserver validation timeout (validation.go)

validateWireServerBlocked polled for 1 minute, but iptables FORWARD rules can take longer to converge after kube-proxy chain recreation. Increased to 3 minutes.

5. Raise CSE installDeps threshold (scenario_cse_perf_test.go)

The installDeps threshold was 90s, with the comment "no direct prod data; generous for full install". The function waits on apt locks, sets up repos, runs apt-get update, and bulk-installs packages, so its duration is inherently variable. Raised to 120s for both Ubuntu 22.04 and 24.04.

6. Deduplicate CSE timing events (cse_timing.go)

Events appearing in both the primary GA events directory and handler-version subdirectories created duplicate subtests (e.g. Task_installDeps#01). Deduplicates by (TaskName, StartTime, EndTime).

Contributor

Copilot AI left a comment


Pull request overview

This PR targets major sources of E2E test flakiness by tightening readiness gating, reducing validator false positives, and stabilizing timing-based assertions. It also extends scenario mutator hooks to receive cluster context, enabling scenarios that depend on cluster-level derived values (e.g., a proxy URL).

Changes:

  • Make node readiness waiting more accurate by requiring cloud-provider uninitialized taint removal before proceeding.
  • Reduce false-positive failures by expanding the eBPF/iptables allowlist and deduplicating CSE timing events; relax overly tight CSE perf thresholds.
  • Update E2E scenario mutator function signatures to accept *Cluster, adjust specific scenarios (e.g., Flatcar AzureCNI), and skip a consistently failing Ubuntu 20.04 FIPS lane.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| e2e/kube.go | Improves node readiness gating; adds proxy ConfigMap/DaemonSet + proxy discovery; fixes DaemonSet CreateOrUpdate mutation. |
| e2e/cluster.go | Adds cluster-level ProxyURL plumbing, sets up private DNS for API server, and broadens retryable cluster-create errors. |
| e2e/aks_model.go | Adds 409-conflict handling/waits for private DNS zone and VNet link creation. |
| e2e/validation.go | Allows a standard DHCP INPUT rule in the eBPF-host-routing iptables compatibility allowlist. |
| e2e/cse_timing.go | Deduplicates extracted CSE event timings that can appear in multiple event directories. |
| e2e/scenario_cse_perf_test.go | Adjusts full-install timing thresholds (notably installDeps) and updates mutator signatures. |
| e2e/types.go | Updates BootstrapConfigMutator / AKSNodeConfigMutator signatures to include *Cluster. |
| e2e/test_helpers.go | Threads *Cluster through mutator invocations (including pre-provision flow). |
| e2e/scenario_test.go | Removes forced Azure CNI plugin settings for Flatcar AzureCNI, skips Ubuntu2004FIPS, adds HTTPS proxy + private DNS scenario, updates mutator signatures across scenarios. |
| e2e/scenario_win_test.go | Updates bootstrap mutator signatures for Windows scenarios. |
| e2e/scenario_localdns_hosts_test.go | Updates mutator signatures for LocalDNS hosts plugin scenarios. |
| e2e/scenario_gpu_managed_experience_test.go | Updates mutator signatures for GPU managed experience scenarios. |
| e2e/scenario_gpu_daemonset_test.go | Updates mutator signatures for GPU daemonset scenario. |

Comments suppressed due to low confidence (2)

e2e/kube.go:330

  • EnsureDebugDaemonsets now creates the proxy ConfigMap/DaemonSet for every non-network-isolated cluster. If the proxy is only needed for a subset of scenarios, consider decoupling it from the general "debug daemonsets" setup so failures in pulling/starting the proxy don’t break unrelated tests during cluster preparation.
		return nil
	})
	if err != nil {
		return err
	}
	return nil
}

func (k *Kubeclient) createKubernetesSecret(ctx context.Context, namespace, secretName, registryName, username, password string) error {
	defer toolkit.LogStepCtxf(ctx, "creating kubernetes secret %s in namespace %s for registry %s", secretName, namespace, registryName)()

e2e/kube.go:606

  • The proxy DaemonSet runs with HostNetwork: true and exposes a fixed HostPort (8888) while tolerating all taints. This effectively opens an unauthenticated forward proxy on every system-pool node IP, which is reachable from within the VNet and may be abused or interfere with other host processes that might already bind the port. Consider scoping this down (e.g., run on a single chosen node/instance, tighten tolerations/nodeSelectors, and/or add network-level restrictions) and document the intended threat model for this test-only proxy.

@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from ba7f563 to e29f86b Compare May 8, 2026 04:56
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from e29f86b to fb806d0 Compare May 8, 2026 04:57
Copilot AI review requested due to automatic review settings May 8, 2026 04:57
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread e2e/kube.go Outdated
Comment on lines +172 to +176
// Check if node still has the cloud-provider uninitialized taint
// which prevents normal pods from being scheduled
for _, taint := range node.Spec.Taints {
	if taint.Key == "node.cloudprovider.kubernetes.io/uninitialized" {
		t.Logf("node %s is ready but still has uninitialized taint, waiting for cloud-provider initialization. Taints: %s", node.Name, string(nodeTaints))
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from fb806d0 to 2473076 Compare May 8, 2026 05:01
Copilot AI review requested due to automatic review settings May 8, 2026 05:02
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from 2473076 to 7d30d00 Compare May 8, 2026 05:02
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread e2e/kube.go
// Wait for cloud-provider to remove the uninitialized taint,
// otherwise normal pods can't be scheduled on this node.
for _, taint := range node.Spec.Taints {
	if taint.Key == "node.cloudprovider.kubernetes.io/uninitialized" {
Contributor Author


no, it's noisy

Comment thread e2e/kube.go
Comment on lines +172 to +176
// Wait for cloud-provider to remove the uninitialized taint,
// otherwise normal pods can't be scheduled on this node.
for _, taint := range node.Spec.Taints {
	if taint.Key == "node.cloudprovider.kubernetes.io/uninitialized" {
		return false, nil
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from 7d30d00 to 1dcf1c6 Compare May 8, 2026 05:07
Copilot AI review requested due to automatic review settings May 8, 2026 05:14
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from 1dcf1c6 to b0a3691 Compare May 8, 2026 05:14
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from b0a3691 to b5692bc Compare May 8, 2026 05:17
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (3)

e2e/kube.go:106

  • WaitUntilPodRunningWithRetry says it ignores stale FailedCreatePodSandBox events, but it lists Events only by involvedObject.name and does not filter by the current pod UID / creation time. This means old events from previous pods with the same name can trigger retries/deletes against the new pod, causing flaky behavior. Consider filtering events by involvedObject.uid == pod.UID (or by event timestamp >= pod.CreationTimestamp) and/or using a field selector that includes the UID.
		// Check for FailedCreatePodSandBox events
		events, err := k.Typed.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{FieldSelector: "involvedObject.name=" + pod.Name})
		if err == nil {
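
One way to read the suggestion above is a client-side filter on the pod UID, sketched below. Event and ObjectRef are local stand-ins for the event fields involved, and eventsForPod is a hypothetical helper. Whether the test framework should filter client-side or via a field selector is left open.

```go
package main

import "fmt"

// ObjectRef and Event are local stand-ins for the event fields used here.
type ObjectRef struct {
	Name string
	UID  string
}

type Event struct {
	InvolvedObject ObjectRef
	Reason         string
}

// eventsForPod (hypothetical helper) drops events that belong to an
// earlier pod which happened to have the same name.
func eventsForPod(events []Event, podUID string) []Event {
	var out []Event
	for _, e := range events {
		if e.InvolvedObject.UID == podUID {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []Event{
		{InvolvedObject: ObjectRef{Name: "test-pod", UID: "old-uid"}, Reason: "FailedCreatePodSandBox"},
		{InvolvedObject: ObjectRef{Name: "test-pod", UID: "new-uid"}, Reason: "Scheduled"},
	}
	for _, e := range eventsForPod(events, "new-uid") {
		fmt.Println(e.Reason) // only the event for the current pod
	}
}
```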

e2e/validation.go:160

  • WaitUntilPodRunningWithRetry deletes the pod on FailedCreatePodSandBox and then continues polling for a pod with the same name, but for pods created directly (no controller), nothing will recreate it. With the new maxRetries=3 usage in validatePodRunning, this can turn a transient sandbox failure into a guaranteed timeout/hang. Either avoid deleting unmanaged pods (no OwnerReferences), or move the retry loop up to validatePodRunning so it can delete+recreate the pod spec explicitly.
		if err != nil {

e2e/validation.go:160

  • PR description mentions changes in e2e/scenario_test.go (removing forced networkPlugin=azure for Flatcar AzureCNI tests and skipping Ubuntu2004FIPS lane), but those changes are not present in this PR branch. Please either update the PR description to match the actual code changes, or include the missing scenario_test.go updates.
		if err != nil {

@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from b5692bc to 172a16a Compare May 8, 2026 05:25
Copilot AI review requested due to automatic review settings May 8, 2026 05:26
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from 172a16a to 1787d71 Compare May 8, 2026 05:26
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from 1787d71 to b1e817d Compare May 8, 2026 05:29
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

e2e/validation.go:145

  • Pod name uniqueness can be lost here: you append a random suffix and then call truncatePodName(), but truncatePodName truncates to 63 chars from the end. For long base names (e.g., based on s.Runtime.VM.KubeName), this can truncate off the random suffix entirely, reintroducing name collisions/AlreadyExists errors on retries. Consider truncating the base name to leave room for "-" (or updating truncatePodName to preserve the suffix), and also note that callers/logs still reference the original pod.Name even though the created pod name differs.
	truncatePodName(s.T, pod)
	start := time.Now()

Comment thread e2e/kube.go
Comment on lines 172 to +176

for _, cond := range node.Status.Conditions {
	if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
		// Wait for cloud-provider to remove the uninitialized taint,
		// otherwise normal pods can't be scheduled on this node.
r2k1 and others added 2 commits May 8, 2026 17:32
Add E2E test for node bootstrapping with HTTPProxyConfig set and
private DNS zone for the API server FQDN. Regression coverage for
IcM 603699115 / ADO#31707996.

Changes:
- Refactor BootstrapConfigMutator and AKSNodeConfigMutator to accept
  *Cluster parameter, enabling scenarios to access cluster properties
- Deploy Python-based CONNECT proxy DaemonSet on all non-isolated
  clusters using mcr.microsoft.com/cbl-mariner/base/python:3
- Create private DNS zone for API server FQDN on all non-isolated
  clusters, linked to VNet with A record
- Add Test_Ubuntu2204_HTTPProxy_PrivateDNS scenario
- Fix cluster creation retry to handle NotFound errors

Test verified: node boots, CSE completes, kubelet starts, node Ready,
test pod runs. Proxy receives CONNECT traffic from CSE outbound check.

Fixes: ADO#31707996

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert a1bebdc (feat(e2e): add HTTP_PROXY + private DNS test scenario)
which had issues on the e2e-flakiness-fixes branch.

Analysis of 55 E2E builds on main (3 weeks) showed 84% failure rate.
Root causes identified and fixed:

1. Node readiness race (kube.go): WaitUntilNodeReady() returned success
   on NodeReady=True even when node still had the cloud-provider
   uninitialized taint, preventing test pod scheduling. Now waits for
   taint removal before declaring node ready.

2. iptables false positives (validation.go): the eBPF-host-routing
   iptables validator rejected a normal host DHCP INPUT rule (UDP/68)
   that was not in its allowlist. Added to allowlist.

3. CSE timing threshold (scenario_cse_perf_test.go): installDeps 90s
   threshold was set with 'no direct prod data' and consistently
   exceeded by the network-heavy apt workflow. Raised to 120s.

4. Duplicate CSE events (cse_timing.go): events appearing in both GA
   events directory and handler subdirectories created spurious
   Task_installDeps#01 subtests. Added deduplication.

5. Broken Ubuntu2004FIPS lane (scenario_test.go): Test added on
   2026-04-22 without VMSS FIPS capability setup, never green. Skipped
   until properly fixed.

Dropped from earlier version: Flatcar AzureCNI networkPlugin removal.
Rubber duck review found removing networkPlugin=azure defaults to
kubenet (not none), which would break tests differently. Proper fix
requires PR #7463 (set to none instead).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@r2k1 r2k1 force-pushed the e2e-flakiness-fixes branch from b1e817d to beeaaaa Compare May 8, 2026 05:32
@r2k1 r2k1 changed the title from "fix(e2e): address multiple sources of E2E test flakiness" to "fix(e2e): reduce E2E test flakiness" May 8, 2026
2 participants