Conversation
Pull request overview
This PR targets major sources of E2E test flakiness by tightening readiness gating, reducing validator false positives, and stabilizing timing-based assertions. It also extends scenario mutator hooks to receive cluster context, enabling scenarios that depend on cluster-level derived values (e.g., a proxy URL).
Changes:
- Make node readiness waiting more accurate by requiring cloud-provider uninitialized taint removal before proceeding.
- Reduce false-positive failures by expanding the eBPF/iptables allowlist and deduplicating CSE timing events; relax overly tight CSE perf thresholds.
- Update E2E scenario mutator function signatures to accept `*Cluster`, adjust specific scenarios (e.g., Flatcar AzureCNI), and skip a consistently failing Ubuntu 20.04 FIPS lane.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| e2e/kube.go | Improves node readiness gating; adds proxy ConfigMap/DaemonSet + proxy discovery; fixes DaemonSet CreateOrUpdate mutation. |
| e2e/cluster.go | Adds cluster-level ProxyURL plumbing, sets up private DNS for API server, and broadens retryable cluster-create errors. |
| e2e/aks_model.go | Adds 409-conflict handling/waits for private DNS zone and VNet link creation. |
| e2e/validation.go | Allows a standard DHCP INPUT rule in the eBPF-host-routing iptables compatibility allowlist. |
| e2e/cse_timing.go | Deduplicates extracted CSE event timings that can appear in multiple event directories. |
| e2e/scenario_cse_perf_test.go | Adjusts full-install timing thresholds (notably installDeps) and updates mutator signatures. |
| e2e/types.go | Updates BootstrapConfigMutator / AKSNodeConfigMutator signatures to include *Cluster. |
| e2e/test_helpers.go | Threads *Cluster through mutator invocations (including pre-provision flow). |
| e2e/scenario_test.go | Removes forced Azure CNI plugin settings for Flatcar AzureCNI, skips Ubuntu2004FIPS, adds HTTPS proxy + private DNS scenario, updates mutator signatures across scenarios. |
| e2e/scenario_win_test.go | Updates bootstrap mutator signatures for Windows scenarios. |
| e2e/scenario_localdns_hosts_test.go | Updates mutator signatures for LocalDNS hosts plugin scenarios. |
| e2e/scenario_gpu_managed_experience_test.go | Updates mutator signatures for GPU managed experience scenarios. |
| e2e/scenario_gpu_daemonset_test.go | Updates mutator signatures for GPU daemonset scenario. |
Comments suppressed due to low confidence (2)
e2e/kube.go:330
`EnsureDebugDaemonsets` now creates the proxy ConfigMap/DaemonSet for every non-network-isolated cluster. If the proxy is only needed for a subset of scenarios, consider decoupling it from the general "debug daemonsets" setup so failures in pulling/starting the proxy don't break unrelated tests during cluster preparation.
```go
		return nil
	})
	if err != nil {
		return err
	}
	return nil
}

func (k *Kubeclient) createKubernetesSecret(ctx context.Context, namespace, secretName, registryName, username, password string) error {
	defer toolkit.LogStepCtxf(ctx, "creating kubernetes secret %s in namespace %s for registry %s", secretName, namespace, registryName)()
```
e2e/kube.go:606
- The proxy DaemonSet runs with `HostNetwork: true` and exposes a fixed `HostPort` (8888) while tolerating all taints. This effectively opens an unauthenticated forward proxy on every system-pool node IP, reachable from within the VNet, which may be abused or may interfere with other host processes that might already bind the port. Consider scoping this down (e.g., run on a single chosen node/instance, tighten tolerations/nodeSelectors, and/or add network-level restrictions) and document the intended threat model for this test-only proxy.
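A minimal sketch of one way to scope this down, assuming a hypothetical `e2e/proxy-host=true` label applied to a single chosen node; the selector and toleration values are illustrative, not the repo's actual configuration:

```go
import corev1 "k8s.io/api/core/v1"

// scopedProxyPodSpec pins the test-only proxy to one explicitly labeled node
// and tolerates only the taint it actually needs, instead of running on
// every system-pool node with blanket tolerations.
func scopedProxyPodSpec(base corev1.PodSpec) corev1.PodSpec {
	base.HostNetwork = true
	// Hypothetical label selecting a single proxy-host node.
	base.NodeSelector = map[string]string{"e2e/proxy-host": "true"}
	// Narrow tolerations: only the CriticalAddonsOnly taint used by AKS system pools.
	base.Tolerations = []corev1.Toleration{{
		Key:      "CriticalAddonsOnly",
		Operator: corev1.TolerationOpExists,
	}}
	return base
}
```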
Diff context for the node-readiness change in `e2e/kube.go`:

```go
for _, cond := range node.Status.Conditions {
	if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
		// Wait for cloud-provider to remove the uninitialized taint,
		// which prevents normal pods from being scheduled on this node.
		for _, taint := range node.Spec.Taints {
			if taint.Key == "node.cloudprovider.kubernetes.io/uninitialized" {
				t.Logf("node %s is ready but still has uninitialized taint, waiting for cloud-provider initialization. Taints: %s", node.Name, string(nodeTaints))
				return false, nil
			}
		}
	}
}
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (3)
e2e/kube.go:106
- `WaitUntilPodRunningWithRetry` says it ignores stale FailedCreatePodSandBox events, but it lists Events only by `involvedObject.name` and does not filter by the current pod UID / creation time. This means old events from previous pods with the same name can trigger retries/deletes against the new pod, causing flaky behavior. Consider filtering events by `involvedObject.uid == pod.UID` (or by event timestamp >= `pod.CreationTimestamp`) and/or using a field selector that includes the UID.
```go
// Check for FailedCreatePodSandBox events
events, err := k.Typed.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{FieldSelector: "involvedObject.name=" + pod.Name})
if err == nil {
```
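A sketch of the suggested narrowing, reusing the list call shape from the snippet above; the combined field selector is an assumption about how the fix might look, not the repo's actual code:

```go
// Scope the event list to the current pod instance so stale events from an
// earlier pod with the same name cannot trigger retries against the new pod.
events, err := k.Typed.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{
	FieldSelector: fmt.Sprintf("involvedObject.name=%s,involvedObject.uid=%s", pod.Name, pod.UID),
})
```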
e2e/validation.go:160
- `WaitUntilPodRunningWithRetry` deletes the pod on FailedCreatePodSandBox and then continues polling for a pod with the same name, but for pods created directly (no controller), nothing will recreate it. With the new `maxRetries=3` usage in `validatePodRunning`, this can turn a transient sandbox failure into a guaranteed timeout/hang. Either avoid deleting unmanaged pods (no OwnerReferences), or move the retry loop up to `validatePodRunning` so it can delete+recreate the pod spec explicitly.
```go
if err != nil {
```
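A minimal sketch of the "avoid deleting unmanaged pods" option; the error wording and surrounding control flow are hypothetical:

```go
// Only delete the pod if a controller owns it and will recreate it; otherwise
// surface the sandbox failure so the caller can recreate the pod spec itself.
if len(pod.OwnerReferences) > 0 {
	if err := k.Typed.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("deleting pod %s for sandbox retry: %w", pod.Name, err)
	}
} else {
	return fmt.Errorf("pod %s hit FailedCreatePodSandBox and has no controller to recreate it", pod.Name)
}
```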
e2e/validation.go:160
- PR description mentions changes in e2e/scenario_test.go (removing forced networkPlugin=azure for Flatcar AzureCNI tests and skipping Ubuntu2004FIPS lane), but those changes are not present in this PR branch. Please either update the PR description to match the actual code changes, or include the missing scenario_test.go updates.
```go
if err != nil {
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
e2e/validation.go:145
- Pod name uniqueness can be lost here: you append a random suffix and then call `truncatePodName()`, but `truncatePodName` truncates to 63 chars from the end. For long base names (e.g., based on `s.Runtime.VM.KubeName`), this can truncate off the random suffix entirely, reintroducing name collisions/AlreadyExists errors on retries. Consider truncating the base name to leave room for the suffix (or updating `truncatePodName` to preserve the suffix), and also note that callers/logs still reference the original `pod.Name` even though the created pod name differs.
```go
truncatePodName(s.T, pod)
start := time.Now()
```
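A sketch of the suggested fix, assuming the 63-character DNS label limit and a 5-character random suffix; `uniquePodName` is an illustrative helper, not a function in the repo:

```go
import "k8s.io/apimachinery/pkg/util/rand"

const maxPodNameLen = 63 // DNS label limit enforced by the API server

// uniquePodName truncates the base name first so the uniqueness suffix
// always survives, instead of truncating the already-suffixed name.
func uniquePodName(base string) string {
	suffix := "-" + rand.String(5)
	if len(base)+len(suffix) > maxPodNameLen {
		base = base[:maxPodNameLen-len(suffix)]
	}
	return base + suffix
}
```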
Add E2E test for node bootstrapping with HTTPProxyConfig set and a private DNS zone for the API server FQDN. Regression coverage for IcM 603699115 / ADO#31707996.

Changes:
- Refactor BootstrapConfigMutator and AKSNodeConfigMutator to accept a *Cluster parameter, enabling scenarios to access cluster properties (signature sketch below)
- Deploy a Python-based CONNECT proxy DaemonSet on all non-isolated clusters using mcr.microsoft.com/cbl-mariner/base/python:3
- Create a private DNS zone for the API server FQDN on all non-isolated clusters, linked to the VNet with an A record
- Add the Test_Ubuntu2204_HTTPProxy_PrivateDNS scenario
- Fix cluster creation retry to handle NotFound errors

Test verified: node boots, CSE completes, kubelet starts, node goes Ready, test pod runs. The proxy receives CONNECT traffic from the CSE outbound check.

Fixes: ADO#31707996

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
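A sketch of the shape of the refactored mutator signatures this commit describes; the concrete config types are assumptions based on the repo's existing mutators and may differ from `e2e/types.go`:

```go
// Both mutators gain a *Cluster parameter so scenarios can read cluster-level
// derived values (e.g., the discovered proxy URL) while mutating node config.
type BootstrapConfigMutator func(cluster *Cluster, nbc *datamodel.NodeBootstrappingConfiguration)
type AKSNodeConfigMutator func(cluster *Cluster, config *aksnodeconfigv1.Configuration)
```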
Revert a1bebdc (feat(e2e): add HTTP_PROXY + private DNS test scenario), which had issues on the e2e-flakiness-fixes branch.

Analysis of 55 E2E builds on main (3 weeks) showed an 84% failure rate. Root causes identified and fixed:

1. Node readiness race (kube.go): WaitUntilNodeReady() returned success on NodeReady=True even when the node still had the cloud-provider uninitialized taint, preventing test pod scheduling. Now waits for taint removal before declaring the node ready.
2. IPtables false positives (validation.go): the iptables eBPF-host-routing validator rejected a normal host DHCP INPUT rule (UDP/68) not in its allowlist. Added it to the allowlist.
3. CSE timing threshold (scenario_cse_perf_test.go): the installDeps 90s threshold was set with "no direct prod data" and consistently exceeded by the network-heavy apt workflow. Raised to 120s.
4. Duplicate CSE events (cse_timing.go): events appearing in both the GA events directory and handler subdirectories created spurious Task_installDeps#01 subtests. Added deduplication.
5. Broken Ubuntu2004FIPS lane (scenario_test.go): test added on 2026-04-22 without VMSS FIPS capability setup, never green. Skipped until properly fixed.

Dropped from the earlier version: the Flatcar AzureCNI networkPlugin removal. Rubber-duck review found that removing networkPlugin=azure defaults to kubenet (not none), which would break tests differently. The proper fix requires PR #7463 (set it to none instead).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
The E2E pipeline on `main` has a ~84% failure rate (46/55 builds failed over 3 weeks). This PR addresses 6 distinct flakiness sources.

Changes (4 files, 56 insertions, 32 deletions)
1. Wait for cloud-provider initialization before pod scheduling (`kube.go`)

`WaitUntilNodeReady()` returned success on `NodeReady=True` even when the node still had the `node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule` taint. Test pods don't tolerate this taint → stay Pending → timeout. Now keeps polling until the taint is removed.
2. Tolerate transient FailedCreatePodSandBox events (`kube.go`)

The "loopback interrupted system call" error is a known transient kernel issue. Kubelet retries sandbox creation automatically, but the test framework treated the first `FailedCreatePodSandBox` event as fatal (`maxRetries=0`). Changed to count aggregate sandbox failures via `event.Count` and only fail after the threshold is exceeded, as sketched below. `validatePodRunning` passes `maxRetries=3`; other callers still use `maxRetries=0` (unchanged behavior).
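A sketch of the aggregate-count check described in item 2; variable names are illustrative:

```go
// kubelet bumps Count on repeated FailedCreatePodSandBox events for the same
// pod, so the test only fails once the transient-retry budget is exhausted.
for _, ev := range events.Items {
	if ev.Reason == "FailedCreatePodSandBox" && int(ev.Count) > maxRetries {
		return fmt.Errorf("pod %s: sandbox creation failed %d times (max %d): %s",
			pod.Name, ev.Count, maxRetries, ev.Message)
	}
}
```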
3. Add DHCP rule to iptables allowlist (`validation.go`)

`ValidateIPTablesCompatibleWithCiliumEBPF()` rejected a normal host DHCP INPUT rule (`-A INPUT -p udp -m udp --dport 68 -j ACCEPT`). This is standard OS networking, unrelated to eBPF host routing; an allowlist sketch follows.
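A sketch of what the allowlist addition in item 3 could look like, assuming a regexp-based allowlist; the variable name is hypothetical:

```go
// Standard DHCP client traffic on the host; unrelated to eBPF host routing.
allowedInputRules = append(allowedInputRules,
	regexp.MustCompile(`^-A INPUT -p udp -m udp --dport 68 -j ACCEPT$`))
```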
4. Increase wireserver validation timeout (`validation.go`)

`validateWireServerBlocked` polled for 1 minute, but iptables FORWARD rules can take longer to converge after kube-proxy chain recreation. Increased to 3 minutes.
5. Raise CSE installDeps threshold (`scenario_cse_perf_test.go`)

The `installDeps` threshold was 90s with the comment "no direct prod data; generous for full install". The function does apt locks + repo setup + apt-get update + bulk package install, which is inherently variable. Raised to 120s for both Ubuntu 22.04 and 24.04.
6. Deduplicate CSE timing events (`cse_timing.go`)

Events appearing in both the primary GA events directory and handler-version subdirectories created duplicate subtests (e.g. `Task_installDeps#01`). Now deduplicates by `(TaskName, StartTime, EndTime)`, as sketched below.
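A sketch of the deduplication in item 6, with hypothetical type and field names:

```go
// Key each parsed CSE event by (TaskName, StartTime, EndTime) and keep only
// the first occurrence seen across the GA and handler-version directories.
type eventKey struct{ task, start, end string }

func dedupeEvents(events []CSEEvent) []CSEEvent {
	seen := make(map[eventKey]bool)
	var out []CSEEvent
	for _, e := range events {
		k := eventKey{e.TaskName, e.StartTime, e.EndTime}
		if seen[k] {
			continue // duplicate from a handler-version subdirectory
		}
		seen[k] = true
		out = append(out, e)
	}
	return out
}
```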