fix(e2e): reduce E2E test flakiness by r2k1 · Pull Request #8480 · Azure/AgentBaker

r2k1 · 2026-05-08T04:13:02Z

Summary

The E2E pipeline on main has a ~84% failure rate (46/55 builds failed over 3 weeks). This PR addresses 6 distinct sources of flakiness.

Changes (4 files, 56 insertions, 32 deletions)

1. Wait for cloud-provider initialization before pod scheduling (`kube.go`)

WaitUntilNodeReady() returned success on NodeReady=True even when the node still had the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint. Test pods don't tolerate this taint, so they stayed Pending until the test timed out.

It now keeps polling until the taint is removed.
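As a rough illustration of the readiness gate described above, here is a minimal sketch. The `Node`, `Taint`, and `Condition` types and the `nodeReadyForPods` helper are hypothetical stand-ins for the corev1 types and are not taken from `kube.go`; only the taint key comes from this PR.

```go
package main

import "fmt"

// Hypothetical, reduced stand-ins for the corev1 node types.
type Taint struct {
	Key    string
	Effect string
}

type Condition struct {
	Type   string
	Status string
}

type Node struct {
	Name       string
	Taints     []Taint
	Conditions []Condition
}

const uninitializedTaintKey = "node.cloudprovider.kubernetes.io/uninitialized"

// nodeReadyForPods reports true only when the node is Ready and the
// cloud-provider uninitialized taint has been removed.
func nodeReadyForPods(n Node) bool {
	ready := false
	for _, c := range n.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			ready = true
			break
		}
	}
	if !ready {
		return false
	}
	for _, t := range n.Taints {
		if t.Key == uninitializedTaintKey {
			return false // still uninitialized: the caller should keep polling
		}
	}
	return true
}

func main() {
	n := Node{
		Name:       "node-a",
		Conditions: []Condition{{Type: "Ready", Status: "True"}},
		Taints:     []Taint{{Key: uninitializedTaintKey, Effect: "NoSchedule"}},
	}
	fmt.Println("with taint:", nodeReadyForPods(n))
	n.Taints = nil
	fmt.Println("taint removed:", nodeReadyForPods(n))
}
```

In a poll loop, a false result would translate to "not done yet, no error", which matches the `return false, nil` shown in the review diff.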

2. Tolerate transient FailedCreatePodSandBox events (`kube.go`)

The "loopback interrupted system call" is a known transient kernel error. Kubelet retries sandbox creation automatically, but the test framework treated the first FailedCreatePodSandBox event as fatal (maxRetries=0).

Changed to count aggregate sandbox failures via event.Count and only fail after the threshold is exceeded. validatePodRunning passes maxRetries=3. Other callers still use maxRetries=0 (unchanged behavior).
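The threshold logic could look roughly like the sketch below. The `Event` type and both helper names are hypothetical; treating an event with a zero count as one occurrence is also an assumption made for this example, not something the PR states.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for corev1.Event with only the fields used here.
type Event struct {
	Reason string
	Count  int32
}

// sandboxFailures sums the occurrence counts of FailedCreatePodSandBox events.
func sandboxFailures(events []Event) int {
	total := 0
	for _, e := range events {
		if e.Reason != "FailedCreatePodSandBox" {
			continue
		}
		n := int(e.Count)
		if n < 1 {
			n = 1 // assumption: an event without a count still represents one failure
		}
		total += n
	}
	return total
}

// exceedsSandboxRetries reports whether the aggregate failures exceed maxRetries.
// With maxRetries=0 the first failure is fatal, which is the previous behavior.
func exceedsSandboxRetries(events []Event, maxRetries int) bool {
	return sandboxFailures(events) > maxRetries
}

func main() {
	events := []Event{
		{Reason: "FailedCreatePodSandBox", Count: 2},
		{Reason: "Pulled", Count: 1},
	}
	fmt.Println("maxRetries=3 fails:", exceedsSandboxRetries(events, 3))
	fmt.Println("maxRetries=0 fails:", exceedsSandboxRetries(events, 0))
}
```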

3. Add DHCP rule to iptables allowlist (`validation.go`)

ValidateIPTablesCompatibleWithCiliumEBPF() rejected a normal host DHCP INPUT rule (-A INPUT -p udp -m udp --dport 68 -j ACCEPT). This is standard OS networking, unrelated to eBPF host routing.
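A minimal sketch of an exact-match allowlist check follows. The `unexpectedRules` helper and the `-A FORWARD -j DROP` rule are invented for illustration; only the DHCP rule string comes from this PR, and the real validator may match rules differently.

```go
package main

import (
	"fmt"
	"strings"
)

// dhcpInputRule is the host DHCP rule quoted in the PR description.
const dhcpInputRule = "-A INPUT -p udp -m udp --dport 68 -j ACCEPT"

// unexpectedRules returns the rules that are not present in the allowlist.
// Matching is exact after trimming whitespace (an assumption of this sketch).
func unexpectedRules(rules []string, allowed map[string]bool) []string {
	var out []string
	for _, r := range rules {
		r = strings.TrimSpace(r)
		if r == "" || allowed[r] {
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	allowed := map[string]bool{dhcpInputRule: true}
	rules := []string{
		dhcpInputRule,
		"-A FORWARD -j DROP", // made-up rule, only here to show a rejection
	}
	fmt.Println(unexpectedRules(rules, allowed))
}
```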

4. Increase wireserver validation timeout (`validation.go`)

validateWireServerBlocked polled for 1 minute, but iptables FORWARD rules can take longer to converge after kube-proxy chain recreation. Increased to 3 minutes.

5. Raise CSE installDeps threshold (`scenario_cse_perf_test.go`)

The installDeps threshold was 90s with the comment "no direct prod data; generous for full install". The function waits on apt locks, sets up repos, runs apt-get update, and bulk-installs packages, so its duration is inherently variable. Raised to 120s for both Ubuntu 22.04 and 24.04.

6. Deduplicate CSE timing events (`cse_timing.go`)

Events appearing in both the primary GA events directory and handler-version subdirectories created duplicate subtests (e.g. Task_installDeps#01). Deduplicates by (TaskName, StartTime, EndTime).

Copilot

Pull request overview

This PR targets major sources of E2E test flakiness by tightening readiness gating, reducing validator false positives, and stabilizing timing-based assertions. It also extends scenario mutator hooks to receive cluster context, enabling scenarios that depend on cluster-level derived values (e.g., a proxy URL).

Changes:

- Make node readiness waiting more accurate by requiring cloud-provider uninitialized taint removal before proceeding.
- Reduce false-positive failures by expanding the eBPF/iptables allowlist and deduplicating CSE timing events; relax overly tight CSE perf thresholds.
- Update E2E scenario mutator function signatures to accept *Cluster, adjust specific scenarios (e.g., Flatcar AzureCNI), and skip a consistently failing Ubuntu 20.04 FIPS lane.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| e2e/kube.go | Improves node readiness gating; adds proxy ConfigMap/DaemonSet + proxy discovery; fixes DaemonSet CreateOrUpdate mutation. |
| e2e/cluster.go | Adds cluster-level `ProxyURL` plumbing, sets up private DNS for API server, and broadens retryable cluster-create errors. |
| e2e/aks_model.go | Adds 409-conflict handling/waits for private DNS zone and VNet link creation. |
| e2e/validation.go | Allows a standard DHCP INPUT rule in the eBPF-host-routing iptables compatibility allowlist. |
| e2e/cse_timing.go | Deduplicates extracted CSE event timings that can appear in multiple event directories. |
| e2e/scenario_cse_perf_test.go | Adjusts full-install timing thresholds (notably `installDeps`) and updates mutator signatures. |
| e2e/types.go | Updates `BootstrapConfigMutator` / `AKSNodeConfigMutator` signatures to include `*Cluster`. |
| e2e/test_helpers.go | Threads `*Cluster` through mutator invocations (including pre-provision flow). |
| e2e/scenario_test.go | Removes forced Azure CNI plugin settings for Flatcar AzureCNI, skips Ubuntu2004FIPS, adds HTTPS proxy + private DNS scenario, updates mutator signatures across scenarios. |
| e2e/scenario_win_test.go | Updates bootstrap mutator signatures for Windows scenarios. |
| e2e/scenario_localdns_hosts_test.go | Updates mutator signatures for LocalDNS hosts plugin scenarios. |
| e2e/scenario_gpu_managed_experience_test.go | Updates mutator signatures for GPU managed experience scenarios. |
| e2e/scenario_gpu_daemonset_test.go | Updates mutator signatures for GPU daemonset scenario. |

Comments suppressed due to low confidence (2)

e2e/kube.go:330

EnsureDebugDaemonsets now creates the proxy ConfigMap/DaemonSet for every non-network-isolated cluster. If the proxy is only needed for a subset of scenarios, consider decoupling it from the general "debug daemonsets" setup so failures in pulling/starting the proxy don’t break unrelated tests during cluster preparation.

		return nil
	})
	if err != nil {
		return err
	}
	return nil
}

func (k *Kubeclient) createKubernetesSecret(ctx context.Context, namespace, secretName, registryName, username, password string) error {
	defer toolkit.LogStepCtxf(ctx, "creating kubernetes secret %s in namespace %s for registry %s", secretName, namespace, registryName)()

e2e/kube.go:606

The proxy DaemonSet runs with HostNetwork: true and exposes a fixed HostPort (8888) while tolerating all taints. This effectively opens an unauthenticated forward proxy on every system-pool node IP, which is reachable from within the VNet and may be abused or interfere with other host processes that might already bind the port. Consider scoping this down (e.g., run on a single chosen node/instance, tighten tolerations/nodeSelectors, and/or add network-level restrictions) and document the intended threat model for this test-only proxy.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

+					// Check if node still has the cloud-provider uninitialized taint
+					// which prevents normal pods from being scheduled
+					for _, taint := range node.Spec.Taints {
+						if taint.Key == "node.cloudprovider.kubernetes.io/uninitialized" {
+							t.Logf("node %s is ready but still has uninitialized taint, waiting for cloud-provider initialization. Taints: %s", node.Name, string(nodeTaints))


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

r2k1 · 2026-05-08T05:15:06Z

+					// Wait for cloud-provider to remove the uninitialized taint,
+					// otherwise normal pods can't be scheduled on this node.
+					for _, taint := range node.Spec.Taints {
+						if taint.Key == "node.cloudprovider.kubernetes.io/uninitialized" {


no, it's noisy

+					// Wait for cloud-provider to remove the uninitialized taint,
+					// otherwise normal pods can't be scheduled on this node.
+					for _, taint := range node.Spec.Taints {
+						if taint.Key == "node.cloudprovider.kubernetes.io/uninitialized" {
+							return false, nil


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (3)

e2e/kube.go:106

WaitUntilPodRunningWithRetry says it ignores stale FailedCreatePodSandBox events, but it lists Events only by involvedObject.name and does not filter by the current pod UID / creation time. This means old events from previous pods with the same name can trigger retries/deletes against the new pod, causing flaky behavior. Consider filtering events by involvedObject.uid == pod.UID (or by event timestamp >= pod.CreationTimestamp) and/or using a field selector that includes the UID.

		// Check for FailedCreatePodSandBox events
		events, err := k.Typed.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{FieldSelector: "involvedObject.name=" + pod.Name})
		if err == nil {

e2e/validation.go:160

WaitUntilPodRunningWithRetry deletes the pod on FailedCreatePodSandBox and then continues polling for a pod with the same name, but for pods created directly (no controller), nothing will recreate it. With the new maxRetries=3 usage in validatePodRunning, this can turn a transient sandbox failure into a guaranteed timeout/hang. Either avoid deleting unmanaged pods (no OwnerReferences), or move the retry loop up to validatePodRunning so it can delete+recreate the pod spec explicitly.

		if err != nil {
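The guard this comment asks for can be illustrated with a short sketch. The `Pod` type and `canDeleteAndWait` helper are hypothetical, and the owner-reference strings are invented for the example.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for corev1.Pod, reduced to the fields used here.
type Pod struct {
	Name            string
	OwnerReferences []string // owning controllers; empty for pods created directly
}

// canDeleteAndWait reports whether deleting the pod and polling for a
// replacement can succeed: only a controller-owned pod gets recreated.
func canDeleteAndWait(p Pod) bool {
	return len(p.OwnerReferences) > 0
}

func main() {
	bare := Pod{Name: "test-pod"}
	owned := Pod{Name: "ds-pod", OwnerReferences: []string{"DaemonSet/debug"}}
	fmt.Println("bare pod:", canDeleteAndWait(bare))
	fmt.Println("owned pod:", canDeleteAndWait(owned))
}
```

For a bare pod the caller would have to recreate the pod from its spec, or keep waiting on the existing pod, instead of deleting it and polling.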

e2e/validation.go:160

PR description mentions changes in e2e/scenario_test.go (removing forced networkPlugin=azure for Flatcar AzureCNI tests and skipping Ubuntu2004FIPS lane), but those changes are not present in this PR branch. Please either update the PR description to match the actual code changes, or include the missing scenario_test.go updates.

		if err != nil {

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

e2e/validation.go:145

Pod name uniqueness can be lost here: you append a random suffix and then call truncatePodName(), but truncatePodName truncates to 63 chars from the end. For long base names (e.g., based on s.Runtime.VM.KubeName), this can truncate off the random suffix entirely, reintroducing name collisions/AlreadyExists errors on retries. Consider truncating the base name to leave room for "-" plus the random suffix (or updating truncatePodName to preserve the suffix), and also note that callers/logs still reference the original pod.Name even though the created pod name differs.

	truncatePodName(s.T, pod)
	start := time.Now()


 			for _, cond := range node.Status.Conditions {
 				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
+					// Wait for cloud-provider to remove the uninitialized taint,
+					// otherwise normal pods can't be scheduled on this node.


Add E2E test for node bootstrapping with HTTPProxyConfig set and private DNS zone for the API server FQDN. Regression coverage for IcM 603699115 / ADO#31707996.

Changes:
- Refactor BootstrapConfigMutator and AKSNodeConfigMutator to accept *Cluster parameter, enabling scenarios to access cluster properties
- Deploy Python-based CONNECT proxy DaemonSet on all non-isolated clusters using mcr.microsoft.com/cbl-mariner/base/python:3
- Create private DNS zone for API server FQDN on all non-isolated clusters, linked to VNet with A record
- Add Test_Ubuntu2204_HTTPProxy_PrivateDNS scenario
- Fix cluster creation retry to handle NotFound errors

Test verified: node boots, CSE completes, kubelet starts, node Ready, test pod runs. Proxy receives CONNECT traffic from CSE outbound check.

Fixes: ADO#31707996
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert a1bebdc (feat(e2e): add HTTP_PROXY + private DNS test scenario) which had issues on the e2e-flakiness-fixes branch.

Analysis of 55 E2E builds on main (3 weeks) showed 84% failure rate. Root causes identified and fixed:

1. Node readiness race (kube.go): WaitUntilNodeReady() returned success on NodeReady=True even when node still had the cloud-provider uninitialized taint, preventing test pod scheduling. Now waits for taint removal before declaring node ready.
2. iptables false positives (validation.go): iptables eBPF-host-routing validator rejected a normal host DHCP INPUT rule (UDP/68) not in its allowlist. Added to allowlist.
3. CSE timing threshold (scenario_cse_perf_test.go): installDeps 90s threshold was set with 'no direct prod data' and consistently exceeded by the network-heavy apt workflow. Raised to 120s.
4. Duplicate CSE events (cse_timing.go): events appearing in both GA events directory and handler subdirectories created spurious Task_installDeps#01 subtests. Added deduplication.
5. Broken Ubuntu2004FIPS lane (scenario_test.go): Test added on 2026-04-22 without VMSS FIPS capability setup, never green. Skipped until properly fixed.

Dropped from earlier version: Flatcar AzureCNI networkPlugin removal. Rubber duck review found removing networkPlugin=azure defaults to kubenet (not none), which would break tests differently. Proper fix requires PR #7463 (set to none instead).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 8, 2026 04:13

r2k1 requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, sulixu, surajssd, timmy-wright and zachary-bailey as code owners May 8, 2026 04:13

r2k1 temporarily deployed to test May 8, 2026 04:13 with GitHub Actions (Inactive)

Copilot started reviewing on behalf of r2k1 May 8, 2026 04:13

r2k1 force-pushed the e2e-flakiness-fixes branch from f2a44c2 to ba7f563 May 8, 2026 04:18

r2k1 temporarily deployed to test May 8, 2026 04:19 with GitHub Actions (Inactive)

Copilot AI reviewed May 8, 2026

r2k1 force-pushed the e2e-flakiness-fixes branch from ba7f563 to e29f86b May 8, 2026 04:56

r2k1 temporarily deployed to test May 8, 2026 04:56 with GitHub Actions (Inactive)

r2k1 force-pushed the e2e-flakiness-fixes branch from e29f86b to fb806d0 May 8, 2026 04:57

Copilot AI review requested due to automatic review settings May 8, 2026 04:57

r2k1 temporarily deployed to test May 8, 2026 04:57 with GitHub Actions (Inactive)

Copilot started reviewing on behalf of r2k1 May 8, 2026 04:58

Copilot AI reviewed May 8, 2026

r2k1 force-pushed the e2e-flakiness-fixes branch from fb806d0 to 2473076 May 8, 2026 05:01

r2k1 temporarily deployed to test May 8, 2026 05:01 with GitHub Actions (Inactive)

Copilot AI review requested due to automatic review settings May 8, 2026 05:02

r2k1 force-pushed the e2e-flakiness-fixes branch from 2473076 to 7d30d00 May 8, 2026 05:02

r2k1 temporarily deployed to test May 8, 2026 05:02 with GitHub Actions (Inactive)

Copilot started reviewing on behalf of r2k1 May 8, 2026 05:02

Copilot AI reviewed May 8, 2026

r2k1 force-pushed the e2e-flakiness-fixes branch from 7d30d00 to 1dcf1c6 May 8, 2026 05:07

r2k1 temporarily deployed to test May 8, 2026 05:07 with GitHub Actions (Inactive)

Copilot AI review requested due to automatic review settings May 8, 2026 05:14

r2k1 force-pushed the e2e-flakiness-fixes branch from 1dcf1c6 to b0a3691 May 8, 2026 05:14

r2k1 temporarily deployed to test May 8, 2026 05:14 with GitHub Actions (Inactive)

Copilot started reviewing on behalf of r2k1 May 8, 2026 05:15

r2k1 force-pushed the e2e-flakiness-fixes branch from b0a3691 to b5692bc May 8, 2026 05:17

r2k1 temporarily deployed to test May 8, 2026 05:17 with GitHub Actions (Inactive)

Copilot AI reviewed May 8, 2026

r2k1 force-pushed the e2e-flakiness-fixes branch from b5692bc to 172a16a May 8, 2026 05:25

r2k1 temporarily deployed to test May 8, 2026 05:25 with GitHub Actions (Inactive)

Copilot AI review requested due to automatic review settings May 8, 2026 05:26

r2k1 force-pushed the e2e-flakiness-fixes branch from 172a16a to 1787d71 May 8, 2026 05:26

r2k1 temporarily deployed to test May 8, 2026 05:27 with GitHub Actions (Inactive)

Copilot started reviewing on behalf of r2k1 May 8, 2026 05:27

r2k1 force-pushed the e2e-flakiness-fixes branch from 1787d71 to b1e817d May 8, 2026 05:29

r2k1 temporarily deployed to test May 8, 2026 05:29 with GitHub Actions (Inactive)

Copilot AI reviewed May 8, 2026

Comment thread e2e/kube.go

Comment on lines 172 to +176

for _, cond := range node.Status.Conditions {
	if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
		// Wait for cloud-provider to remove the uninitialized taint,
		// otherwise normal pods can't be scheduled on this node.

r2k1 and others added 2 commits May 8, 2026 17:32

r2k1 force-pushed the e2e-flakiness-fixes branch from b1e817d to beeaaaa May 8, 2026 05:32

r2k1 temporarily deployed to test May 8, 2026 05:32 with GitHub Actions (Inactive)

r2k1 changed the title from "fix(e2e): address multiple sources of E2E test flakiness" to "fix(e2e): reduce E2E test flakiness" May 8, 2026

r2k1 wants to merge 2 commits into main from e2e-flakiness-fixes
