Flaky test report: committed-code failures on 2026-05-23 #274

@andrross

Summary

Two distinct tests failed against committed code (Timer builds on main) in the 24-hour window ending 2026-05-23T10:00 UTC. Both are chronic flaky tests that did not reproduce locally with the original seed, indicating timing-dependent failures.

Failing Tests

1. MixedClusterClientYamlTestSuiteIT.test {p0=cluster.health/10_basic/cluster health with closed index}

| Field | Value |
| --- | --- |
| Build | 77997 |
| Seed | DE77915B3B9483AC |
| Module | qa/mixed-cluster (BWC test against v3.6.1) |
| Error | expected [2xx] but got [408 Request Timeout]; cluster was red with 51 unassigned shards |
| Reproduced locally | No (passed with seed) |
| First seen | 2024-03-25 |
| Total builds affected | 135 |
| Pattern | Worsening: quiet Aug 2025–Mar 2026, resurfaced Apr 2026 (9 builds) and May 2026 (11 builds), coinciding with m7a.8xlarge runner migration |

2. SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase

| Field | Value |
| --- | --- |
| Build | 77989 |
| Seed | B6D9D10CA83177C3 |
| Module | qa/smoke-test-http |
| Error | AssertionError in ensureSearchTaskIsCancelled: assertBusy timed out waiting for task cancellation |
| Reproduced locally | No (passed with seed) |
| First seen | 2024-04-04 |
| Total builds affected | 205 |
| Pattern | Significantly worsening: steady low-level flakiness since Apr 2024, major spike Nov 2025 (41 builds), now at its worst-ever rate in May 2026 (33 builds in 23 days). Clearly exacerbated by faster CI runners. |
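
The "worst-ever" reading of May 2026 rests on a per-day rate, since 33 builds is a smaller raw count than the 41 seen in Nov 2025. A minimal sketch of that arithmetic follows. It assumes the Nov 2025 count covers the full 30-day calendar month (the report does not say), and the projection for the rest of May is a naive linear extrapolation, not a forecast. The helper name `builds_per_day` is hypothetical.

```python
import calendar


def builds_per_day(builds: int, days: int) -> float:
    """Average number of affected builds per day over a window."""
    return builds / days


# Counts come from the Pattern row; window lengths are assumptions.
nov_days = calendar.monthrange(2025, 11)[1]       # assumes the spike spans all of Nov 2025
may_days_so_far = 23                              # "33 builds in 23 days"
may_days_total = calendar.monthrange(2026, 5)[1]  # length of May 2026

nov_2025 = builds_per_day(41, nov_days)
may_2026 = builds_per_day(33, may_days_so_far)
projected_may = may_2026 * may_days_total         # naive linear extrapolation

print(f"Nov 2025: {nov_2025:.2f} builds/day")
print(f"May 2026: {may_2026:.2f} builds/day")
print(f"May 2026 projected total: {projected_may:.0f} builds")
```

Under these assumptions the May rate (about 1.43 builds/day) edges past November (about 1.37 builds/day), and would land near 44 affected builds if it held for the whole month.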

Summary Table

| Test | Builds Affected | First Seen | Trend | Reproduced |
| --- | --- | --- | --- | --- |
| SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase | 205 | 2024-04-04 | Significantly worsening | No |
| MixedClusterClientYamlTestSuiteIT.test {p0=cluster.health/10_basic/cluster health with closed index} | 135 | 2024-03-25 | Worsening (resurfaced) | No |

Notes

  • Neither test reproduced with the original seed, which is expected for timing-dependent failures. The seeds control randomization of test parameters but not thread scheduling, network timing, or GC pauses.
  • Both tests show increased failure rates starting April 2026, consistent with the CI runner migration from m5.8xlarge to m7a.8xlarge (faster CPUs shift thread timing and can expose latent race conditions more often).
  • SearchRestCancellationIT is the higher-priority target: 205 builds affected and actively worsening. The failure is an assertBusy timeout waiting for search task cancellation, suggesting the cancellation propagation path has a timing sensitivity that faster hardware exposes more frequently.
  • The MixedClusterClientYamlTestSuiteIT failure is a cluster health timeout in a BWC mixed-cluster scenario, likely related to shard allocation timing in a heterogeneous (mixed-version) cluster.
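
The notes on seeds and on the assertBusy timeout can be pictured with a small model. This is a sketch in Python, not the Java test framework's implementation: `params_from_seed`, `assert_busy`, `FakeClock`, and `cancellation_observed` are hypothetical names, and the 10-second budget and doubling back-off are assumptions chosen for illustration. It shows that the seed pins down the randomized parameters, while the pass/fail outcome depends on a cancellation latency that the seed does not control.

```python
import random
import time
from typing import Callable, Tuple


def params_from_seed(seed_hex: str) -> Tuple[int, int]:
    """Derive illustrative test parameters from a seed (same seed, same values)."""
    rng = random.Random(int(seed_hex, 16))
    return rng.randint(1, 5), rng.randint(10, 100)


def _now_ms() -> int:
    return time.monotonic_ns() // 1_000_000


def _sleep_ms(ms: int) -> None:
    time.sleep(ms / 1000)


def assert_busy(
    check: Callable[[], None],
    timeout_ms: int = 10_000,          # assumed budget, for illustration only
    now_ms: Callable[[], int] = _now_ms,
    sleep_ms: Callable[[int], None] = _sleep_ms,
) -> None:
    """Retry check() with doubling sleeps; re-raise the last failure at the deadline."""
    deadline = now_ms() + timeout_ms
    delay = 1
    while True:
        try:
            check()
            return
        except AssertionError:
            remaining = deadline - now_ms()
            if remaining <= 0:
                raise
            sleep_ms(min(delay, remaining))
            delay *= 2


class FakeClock:
    """Virtual clock so the model runs instantly and deterministically."""

    def __init__(self) -> None:
        self.t = 0

    def now(self) -> int:
        return self.t

    def sleep(self, ms: int) -> None:
        self.t += ms


def cancellation_observed(cancel_latency_ms: int, timeout_ms: int = 10_000) -> bool:
    """True if a task cancelled after cancel_latency_ms is seen before the timeout."""
    clock = FakeClock()

    def check() -> None:
        # The latency stands in for scheduling, network, and GC effects,
        # none of which are derived from the test seed.
        assert clock.now() >= cancel_latency_ms, "search task still running"

    try:
        assert_busy(check, timeout_ms, clock.now, clock.sleep)
        return True
    except AssertionError:
        return False


if __name__ == "__main__":
    seed = "B6D9D10CA83177C3"
    print(params_from_seed(seed) == params_from_seed(seed))
    print(cancellation_observed(500))
    print(cancellation_observed(10_001))
```

In this model the same seed always yields the same parameters, a 500 ms cancellation latency passes, and a latency just over the budget fails. That matches the pattern in the report: rerunning with the original seed reproduces the inputs but not the timing that caused the failure.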
