Summary
This report covers test failures observed in committed-code CI builds (Timer runs on main and Post Merge Action builds) during the 24-hour period ending 2026-05-22T10:00 UTC.
10 distinct test failures were identified. Local reproduction was attempted for each using the original seed from the failing build.
Reproduction Results
| # | Test | Build | Seed | Reproduced Locally? |
|---|---|---|---|---|
| 1 | KeywordTermsAggregatorTests.testStarTreeKeywordTerms | 77938 | AF74A8FECAD5C079 | ✅ Yes (deterministic) |
| 2 | ClusterShardLimitIT.testOpenIndexOverLimit | 77968 | B9CEDF3158B2FA25 | ❌ No |
| 3 | FlightOutboundHandlerContextPropagationTests.testContextHeaderPropagatedToResponseHeaders | 77853 | D26119B5B11BFDBA | ❌ No |
| 4 | RemoteStoreKafkaIT.testDefaultGetIngestionState | 77924 | B7CC48B54A189AE4 | ❌ No |
| 5 | WarmIndexSegmentReplicationIT.testNodeDropWithOngoingReplication | 77966 | 9A058A1B6229E7D5 | ❌ No |
| 6 | IngestFromKafkaIT.testAllActiveOffsetBasedLag | 77844 | FCAF83325C4634AA | ❌ No |
| 7 | DeleteSnapshotIT.testDeleteShallowCopySnapshot | 77965 | 3498D183473E67DC | ❌ No |
| 8 | FullRollingRestartIT.testFullRollingRestart | 77847 | 1D9351DC770AC776 | ❌ No |
| 9 | FullRollingRestartIT.testFullRollingRestart_withNoRecoveryPayloadAndSource | 77936 | 409E6360D8D4B93A | ❌ No |
| 10 | IngestPipelineFromKafkaIT.testFieldMappingWithVersionAndPipeline | 77925 | E62FB90D706E6B7F | ❌ No |
Historical Failure Patterns (sorted by total builds affected)
| Test | First Seen | Total Builds Affected | Recent Trend (last 5 months) | Pattern |
|---|---|---|---|---|
| FullRollingRestartIT.testFullRollingRestart* | 2024-10-11 | 260 | 0→35→25→22→27 | Chronic, stable. High-frequency flake since Feb 2026. Consistently ~25 builds/month. |
| KeywordTermsAggregatorTests.testStarTreeKeywordTerms | 2025-01-29 | 52 | 3→2→4→3→3 | Chronic, stable. Low but steady rate (~3/month). Deterministic with seed, so likely a real bug. |
| ClusterShardLimitIT.testOpenIndexOverLimit | 2025-10-15 | 52 | 7→7→6→7→11 | Chronic, worsening. Uptick in May 2026 (11 builds vs ~7 prior). |
| IngestFromKafkaIT.testAllActiveOffsetBasedLag | 2025-10-15 | 37 | 0→0→8→13→14 | Worsening rapidly. Went from 0 to 14 builds/month since March 2026. |
| DeleteSnapshotIT.testDeleteShallowCopySnapshot | 2024-04-06 | 32 | 1→1→1→0→2 | Chronic, low-rate. Long-lived flake at ~1/month. |
| FlightOutboundHandlerContextPropagationTests.testContextHeaderPropagatedToResponseHeaders | 2026-04-14 | 22 | — →11→11 | New, stable. Appeared mid-April 2026 (coincides with runner migration to m7a.8xlarge). ~11/month. |
| WarmIndexSegmentReplicationIT.testNodeDropWithOngoingReplication | 2025-03-17 | 16 | 1→0→1→3→3 | Chronic, slightly worsening. Uptick since April 2026. |
| RemoteStoreKafkaIT.testDefaultGetIngestionState | 2025-04-02 | 5 | 0→1→0→0→1 | Rare. Very low frequency, sporadic. |
| IngestPipelineFromKafkaIT.testFieldMappingWithVersionAndPipeline | 2026-05-21 | 1 | — → — → — → — →1 | Brand new. First recorded failure, one day before this report. May be a new regression or a one-off. |
| PercolatorMatchedSlotSubFetchPhaseTests.classMethod* | 2026-05-21 | 1 | — → — → — → — →1 | Infrastructure. JVM shutdown race, not a real test failure. |
*PercolatorMatchedSlotSubFetchPhaseTests was excluded from reproduction because it failed with IllegalStateException: Shutdown in progress during class initialization, which is not a test logic failure.
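The exclusion rule in the footnote can be written as a small filter. The sketch below is illustrative only: the function name, the marker list, and the shape of the inputs are assumptions, not part of the tooling used for this report.

```python
# Hypothetical helper: flags failures that look like infrastructure noise
# rather than test logic failures. The marker list is an assumption and
# covers only the single case named in this report.
INFRA_MARKERS = ("IllegalStateException: Shutdown in progress",)


def is_infrastructure_failure(test_name: str, message: str) -> bool:
    """Return True when the failure is attributed to the class as a whole
    (reported as classMethod) and the message matches a known marker."""
    class_level = test_name.endswith(".classMethod")
    return class_level and any(marker in message for marker in INFRA_MARKERS)
```

Requiring both conditions keeps the filter narrow: a regular test method that happens to log a shutdown message would still be treated as a candidate for reproduction.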
Key Observations
- FullRollingRestartIT is the highest-impact flake by far (260 builds). Both variants fail with the SEGMENT replication strategy. Rerunning with the failing seeds does not reproduce the failure, so it is timing- or concurrency-dependent.
- KeywordTermsAggregatorTests.testStarTreeKeywordTerms is the only test that reproduced deterministically with its seed, which strongly suggests a real bug rather than a timing flake. Error: expected:<0> but was:<10>.
- FlightOutboundHandlerContextPropagationTests first appeared in mid-April 2026, coinciding with the move of CI runners from m5.8xlarge to m7a.8xlarge. This is likely a timing issue amplified by the faster CPUs.
- IngestFromKafkaIT.testAllActiveOffsetBasedLag is worsening rapidly (0 to 14 builds/month in three months) and warrants investigation.
- ClusterShardLimitIT.testOpenIndexOverLimit shows a May 2026 uptick (11 builds vs a typical 7). It may also be related to the runner migration.
- None of the integration tests (ClusterShardLimitIT, FullRollingRestartIT, WarmIndexSegmentReplicationIT, DeleteSnapshotIT, Kafka tests) reproduced locally with their seeds. This is expected: these are multi-node cluster tests where the seed controls randomization but not thread scheduling or network timing.
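The last observation, that a seed fixes the randomized inputs but not the scheduling, can be illustrated outside the test framework. The following Python sketch is an analogy, not OpenSearch code: the function names are hypothetical, and Python's random module stands in for the randomized-testing seed.

```python
import random
import threading


def seeded_inputs(seed: int, n: int = 5) -> list[int]:
    """Randomized inputs derived only from the seed: identical on every run."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


def arrival_order(seed: int, workers: int = 4) -> list[tuple[int, int]]:
    """Each worker computes a seed-derived value, then appends it to a shared
    list. The values are fixed by the seed; the order in which they land is
    decided by the thread scheduler and may differ between runs."""
    results: list[tuple[int, int]] = []
    lock = threading.Lock()

    def work(worker_id: int) -> None:
        value = random.Random(seed + worker_id).randint(0, 99)
        with lock:
            results.append((worker_id, value))

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Seed from build 77847, used here only as a number.
seed = int("1D9351DC770AC776", 16)
```

Comparing sorted(arrival_order(seed)) across runs always matches, while the unsorted lists are not guaranteed to. An assertion that depends on arrival order is the analogue of a failure that the seed alone cannot reproduce.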
Methodology
- Failures queried from the OpenSearch metrics cluster (gradle-check-* index)
- Historical patterns aggregated across all build types (including PR builds) using monthly date histograms with unique build counts
- Reproduction attempted using ./gradlew <module>:<task> --tests "<class>.<method>" -Dtests.seed=<SEED> on the current main branch
- Environment: Linux 5.10, JDK 25.0.3, 16 CPUs
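The two mechanical steps in the methodology, the monthly unique-build aggregation and the seeded rerun, might look roughly like the sketch below. It rests on stated assumptions: the field names test_class, test_name, test_status, build_start_time, and build_number are guesses at the index schema rather than confirmed names, and the module and task values stay as placeholders, as in the command above.

```python
def monthly_unique_builds_query(test_class: str, test_name: str) -> dict:
    """Search body for a monthly date histogram with a unique build count.
    All field names are assumed, not taken from the report."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"bool": {"filter": [
            {"term": {"test_class": test_class}},
            {"term": {"test_name": test_name}},
            {"term": {"test_status": "FAILED"}},  # assumed status value
        ]}},
        "aggs": {"per_month": {
            "date_histogram": {
                "field": "build_start_time",
                "calendar_interval": "month",
            },
            # cardinality gives an approximate distinct count per bucket
            "aggs": {"unique_builds": {"cardinality": {"field": "build_number"}}},
        }},
    }


def repro_command(module: str, task: str, test_class: str,
                  method: str, seed: str) -> list[str]:
    """argv for the seeded rerun described in the methodology."""
    return [
        "./gradlew", f"{module}:{task}",
        "--tests", f"{test_class}.{method}",
        f"-Dtests.seed={seed}",
    ]
```

Returning the command as an argument list avoids shell quoting around the --tests value; the list can be passed directly to subprocess.run.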