Flaky test report: committed-code failures on 2026-05-22 #273

@andrross

Summary

This report covers test failures observed in committed-code CI builds (Timer runs on main and Post Merge Action builds) during the 24-hour period ending 2026-05-22T10:00 UTC.

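10 distinct test failures were identified, and local reproduction was attempted for each using the original seed from the failing build. One additional failure, in PercolatorMatchedSlotSubFetchPhaseTests, was an infrastructure issue and was excluded from reproduction.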

Reproduction Results

| # | Test | Build | Seed | Reproduced Locally? |
|---|------|-------|------|---------------------|
| 1 | KeywordTermsAggregatorTests.testStarTreeKeywordTerms | 77938 | AF74A8FECAD5C079 | ✅ Yes (deterministic) |
| 2 | ClusterShardLimitIT.testOpenIndexOverLimit | 77968 | B9CEDF3158B2FA25 | ❌ No |
| 3 | FlightOutboundHandlerContextPropagationTests.testContextHeaderPropagatedToResponseHeaders | 77853 | D26119B5B11BFDBA | ❌ No |
| 4 | RemoteStoreKafkaIT.testDefaultGetIngestionState | 77924 | B7CC48B54A189AE4 | ❌ No |
| 5 | WarmIndexSegmentReplicationIT.testNodeDropWithOngoingReplication | 77966 | 9A058A1B6229E7D5 | ❌ No |
| 6 | IngestFromKafkaIT.testAllActiveOffsetBasedLag | 77844 | FCAF83325C4634AA | ❌ No |
| 7 | DeleteSnapshotIT.testDeleteShallowCopySnapshot | 77965 | 3498D183473E67DC | ❌ No |
| 8 | FullRollingRestartIT.testFullRollingRestart | 77847 | 1D9351DC770AC776 | ❌ No |
| 9 | FullRollingRestartIT.testFullRollingRestart_withNoRecoveryPayloadAndSource | 77936 | 409E6360D8D4B93A | ❌ No |
| 10 | IngestPipelineFromKafkaIT.testFieldMappingWithVersionAndPipeline | 77925 | E62FB90D706E6B7F | ❌ No |
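
For anyone repeating these attempts, the command template listed under Methodology can be filled in from a row of the table. The helper below is a minimal sketch of that substitution. The function name `repro_command` is hypothetical, and the module and task values in the usage line (`:server`, `test`) are assumptions for illustration, since the report does not list the module for each test.

```python
import shlex


def repro_command(module, task, test_class, method, seed):
    """Fill in the Methodology template:
    ./gradlew <module>:<task> --tests "<class>.<method>" -Dtests.seed=<SEED>
    """
    args = [
        "./gradlew",
        f"{module}:{task}",
        "--tests",
        f"{test_class}.{method}",
        f"-Dtests.seed={seed}",
    ]
    # shlex.join quotes an argument only when the shell requires it
    return shlex.join(args)


# Row 1 of the table; ":server" and "test" are assumed, not taken from the report.
print(repro_command(":server", "test",
                    "KeywordTermsAggregatorTests", "testStarTreeKeywordTerms",
                    "AF74A8FECAD5C079"))
```

The printed command omits the quotes around the test filter because none of its characters need shell quoting; the invocation is otherwise the same as the template.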

Historical Failure Patterns (sorted by total builds affected)

| Test | First Seen | Total Builds Affected | Recent Trend (last 5 months) | Pattern |
|------|------------|-----------------------|------------------------------|---------|
| FullRollingRestartIT.testFullRollingRestart* | 2024-10-11 | 260 | 0→35→25→22→27 | Chronic, stable. High-frequency flake since Feb 2026. Consistently ~25 builds/month. |
| KeywordTermsAggregatorTests.testStarTreeKeywordTerms | 2025-01-29 | 52 | 3→2→4→3→3 | Chronic, stable. Low but steady rate (~3/month). Deterministic with seed; likely a real bug. |
| ClusterShardLimitIT.testOpenIndexOverLimit | 2025-10-15 | 52 | 7→7→6→7→11 | Chronic, worsening. Uptick in May 2026 (11 builds vs ~7 prior). |
| IngestFromKafkaIT.testAllActiveOffsetBasedLag | 2025-10-15 | 37 | 0→0→8→13→14 | Worsening rapidly. Went from 0 to 14/month since March 2026. |
| DeleteSnapshotIT.testDeleteShallowCopySnapshot | 2024-04-06 | 32 | 1→1→1→0→2 | Chronic, low-rate. Long-lived flake at ~1/month. |
| FlightOutboundHandlerContextPropagationTests.testContextHeaderPropagatedToResponseHeaders | 2026-04-14 | 22 | —→—→—→11→11 | New, stable. Appeared mid-April 2026 (coincides with runner migration to m7a.8xlarge). ~11/month. |
| WarmIndexSegmentReplicationIT.testNodeDropWithOngoingReplication | 2025-03-17 | 16 | 1→0→1→3→3 | Chronic, slightly worsening. Uptick since April 2026. |
| RemoteStoreKafkaIT.testDefaultGetIngestionState | 2025-04-02 | 5 | 0→1→0→0→1 | Rare. Very low frequency, sporadic. |
| IngestPipelineFromKafkaIT.testFieldMappingWithVersionAndPipeline | 2026-05-21 | 1 | —→—→—→—→1 | Brand new. First-ever failure on 2026-05-21. May be a new regression or a one-off. |
| PercolatorMatchedSlotSubFetchPhaseTests.classMethod* | 2026-05-21 | 1 | —→—→—→—→1 | Infrastructure. JVM shutdown race, not a real test failure. |

*PercolatorMatchedSlotSubFetchPhaseTests was excluded from reproduction: the failure was IllegalStateException: Shutdown in progress during class initialization, not a test logic failure.
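
The "worsening" labels in the Pattern column compare the most recent month against the earlier ones. The snippet below is a rough sketch of that comparison, not the procedure used to produce the table. The function names are hypothetical, and treating any non-numeric entry as "no data yet" is an assumption.

```python
def parse_trend(trend):
    """Turn '7→7→6→7→11' into [7, 7, 6, 7, 11].

    Non-numeric entries (months before the test first failed) become None.
    """
    counts = []
    for part in trend.split("→"):
        part = part.strip()
        counts.append(int(part) if part.isdigit() else None)
    return counts


def last_vs_prior(trend):
    """Return (latest month, mean of the earlier months), or None if too few data points."""
    counts = [c for c in parse_trend(trend) if c is not None]
    if len(counts) < 2:
        return None
    prior = counts[:-1]
    return counts[-1], sum(prior) / len(prior)


print(last_vs_prior("7→7→6→7→11"))   # (11, 6.75)
print(last_vs_prior("0→0→8→13→14"))  # (14, 5.25)
```

For ClusterShardLimitIT this gives 11 against a prior mean of 6.75, which matches the "11 builds vs ~7 prior" note in the table.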

Key Observations

  1. FullRollingRestartIT is the highest-impact flake by far (260 builds). Both variants fail with the SEGMENT replication strategy. Neither failure reproduces from its seed, which points to a timing- or concurrency-dependent failure.

  2. KeywordTermsAggregatorTests.testStarTreeKeywordTerms is the only test that reproduced deterministically with its seed. This strongly suggests a real bug rather than a timing flake. Error: expected:<0> but was:<10>.

  3. FlightOutboundHandlerContextPropagationTests first failed in mid-April 2026, coinciding with the move of CI runners from m5.8xlarge to m7a.8xlarge. This is likely a timing issue amplified by the change in CPU speed.

  4. IngestFromKafkaIT.testAllActiveOffsetBasedLag is worsening rapidly (0→14/month in 3 months). Warrants investigation.

  5. ClusterShardLimitIT.testOpenIndexOverLimit shows a May 2026 uptick (11 vs typical 7). May also be runner-migration related.

  6. None of the integration tests (ClusterShardLimitIT, FullRollingRestartIT, WarmIndexSegmentReplicationIT, DeleteSnapshotIT, Kafka tests) reproduced with their seeds locally. This is expected — these are multi-node cluster tests where the seed controls randomization but not thread scheduling or network timing.
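
Observation 6 can be illustrated with a toy model. This is not OpenSearch code, and all names in it are hypothetical. The seed fixes the generated data, while an explicit `schedule` list stands in for thread scheduling and network timing, which the seed does not control.

```python
import random


def run_once(seed, schedule):
    """Return True if the toy 'replica' matches the toy 'primary' at the end."""
    rng = random.Random(seed)
    docs = [rng.randint(0, 999) for _ in range(5)]  # seed-controlled input
    pending = iter(docs)
    primary, replica = [], []
    for step in schedule:  # stand-in for timing; not derived from the seed
        if step == "index":
            primary.append(next(pending))
        elif step == "replicate":
            replica = list(primary)  # copy whatever the primary holds right now
    return replica == primary


SEED = 0xB9CEDF3158B2FA25  # any value works; the same seed is used for both runs
in_order = ["index"] * 5 + ["replicate"]
racy = ["index"] * 4 + ["replicate", "index"]

print(run_once(SEED, in_order))  # True
print(run_once(SEED, racy))      # False
```

Both runs see identical data because the seed is the same, yet only the second fails. Rerunning with the original seed therefore cannot be expected to reproduce a failure that depends on ordering.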

Methodology

  • Failures queried from the OpenSearch metrics cluster (gradle-check-* index)
  • Historical patterns aggregated across all build types (including PR builds) using monthly date histograms with unique build count
  • Reproduction attempted using ./gradlew <module>:<task> --tests "<class>.<method>" -Dtests.seed=<SEED> on the current main branch
  • Environment: Linux 5.10, JDK 25.0.3, 16 CPUs
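
The monthly aggregation described above could be expressed roughly as follows. This is a sketch under assumptions, not the query that was actually run: the field names (`test_class`, `test_name`, `test_status`, `build_start_time`, `build_number`) and the `"FAILED"` status value are hypothetical and would need to match the real mapping of the gradle-check-* index. The `date_histogram` and `cardinality` aggregations themselves are standard OpenSearch query DSL.

```python
import json


def monthly_failure_query(test_class, test_name):
    """Build a search body: unique failing builds per calendar month for one test."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"test_class": test_class}},   # assumed field name
                    {"term": {"test_name": test_name}},     # assumed field name
                    {"term": {"test_status": "FAILED"}},    # assumed field and value
                ]
            }
        },
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": "build_start_time",            # assumed field name
                    "calendar_interval": "month",
                },
                "aggs": {
                    # count distinct builds, not individual failure documents
                    "unique_builds": {"cardinality": {"field": "build_number"}}
                },
            }
        },
    }


body = monthly_failure_query("FullRollingRestartIT", "testFullRollingRestart")
print(json.dumps(body, indent=2))
```

The body would be sent to the `_search` endpoint of the gradle-check-* index pattern. Counting with `cardinality` rather than document count keeps a build that fails the same test several times from being counted more than once; note that `cardinality` is an approximate count at high volumes.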
