HTTP/2 connection lifecycle: max lifetime with per-connection jitter + PING health probing #48420

jeet1995 wants to merge 36 commits into Azure:main
Conversation
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
…r, design spec
- Switch from per-evaluation to per-connection jitter via CONNECTION_EXPIRY_NANOS channel attribute
- Make pingContent a static final constant (PING_CONTENT)
- Derive sweep interval from min(thresholds)/2 clamped to [1s, 5s]
- Add eviction rate limiter: max 1 eviction per sweep cycle (dead channels exempt)
- Add HTTP_CONNECTION_LIFECYCLE_SPEC.md design spec for review

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
9.1: Split installOnParentIfAbsent into stampConnectionExpiry + installOnParentIfAbsent.
Max lifetime works independently of PING — disabling PING no longer silently
disables max lifetime.
9.2: Two-phase eviction for Phase 3 (lifetime) via PENDING_EVICTION_NANOS attribute.
First sweep marks connection as pending. Subsequent sweeps evict when idle or
after 10s drain grace period. Prevents RST_STREAM on active H2 streams during
routine lifetime rotation. Phase 2 (PING-stale) stays immediate — degraded
connections should be evicted fast.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> HTTP/2 connections can become silently degraded — packet black-holes, half-open TCP,
> NAT/firewall timeout — without the SDK knowing. In sparse workloads, two problems arise:
>
> 1. **Silent degradation detection**: The next request discovers the dead connection via response
Silent degradation detection affects both sparse and non-sparse workloads.
> eviction predicate is invoked."* Since we need a custom predicate for PING health, the
> built-in `maxLifeTime` and `maxIdleTime` handling is replaced entirely.

> reactor-netty 1.3.4 introduces `maxLifeTimeVariance(double)` for per-connection jitter — exactly
Add an item to track maxLifeTimeVariance integration too.
> ┌──────────────────────────────────────────────────────────────────────┐
> │ ConnectionProvider (reactor-netty 1.2.13)                            │
> │                                                                      │
> │ evictInBackground(5s) sweeps all connections through:                │
Ensure the overview is up to date w.r.t rest of spec (section 9.1 and 9.2 changes should be reflected here). Design choices should precede the overview.
> ---

> ## 3. Eviction Predicate Design
Update section 3 with section 9.1 and 9.2 changes.
- Goal 2: Silent degradation affects all workloads, not just sparse
- Add maxLifeTimeVariance tracking item in motivation section
- New §2 Key Design Choices precedes architectural overview
- §3 Overview updated to reflect decoupled install paths and two-phase eviction
- §4 Phase 2/3 updated with immediate vs two-phase eviction details
- Section numbers renumbered (§2-§10)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Change max lifetime default from 300s (5 min) to 1800s (30 min) — defensive. Effective range with jitter: [30:01, 30:30]
- Add COSMOS.HTTP_CONNECTION_MAX_LIFETIME_ENABLED (default: true)
- Add COSMOS.HTTP2_PING_HEALTH_ENABLED (default: true)
- Both features now have explicit boolean toggles alongside numeric configs
- Update SPEC config table with new defaults and toggle flags

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove Phase 2 (PING ACK stale → evict) from eviction predicate
- PING handler remains for keepalive (prevents NAT/firewall idle reaping)
- Degraded connections handled by response timeout retry path
- Rewrite SPEC: decision-focused, ~150 lines, no code duplication
- Add TCP keepalive vs HTTP/2 PING distinction

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move stampConnectionExpiry + PING install to shared doOnConnected (all connections)
- H2-specific doOnConnected now only handles header cleaner
- Wire AddressResolverGroup injection via HttpClientConfig for e2e tests
- SPEC updated: both goals apply to all connections, architecture diagram updated

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HTTP/1.1 has no PING equivalent — L7 middleboxes can't see TCP keepalive. ChangeFeed (100% of H1.1 traffic) is long-polling so rarely idle. Low risk today but worth addressing if future H1.1 workloads emerge. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PING is an HTTP/2 protocol frame — it cannot be sent on HTTP/1.1 connections. Code already correct (isH2Enabled guard). SPEC now consistent:
- Goal 2: Connection keepalive (HTTP/2)
- Design Choice 3: PING keepalive is HTTP/2 only
- Architecture: PING install gated on H2 enabled
- Design choices renumbered (1-9, no duplicates)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> │ │
> │ └─ If PING keepalive enabled AND H2 enabled:
> │      installOnParentIfAbsent(channel, interval)
> │      → installs Http2PingHealthHandler (H2 only — PING is an HTTP/2 frame)
Why not use Netty's native PING support mechanism?
```java
.http2Settings(settings -> settings
    .pingAckTimeout(Duration.ofSeconds(10))
    .pingAckDropThreshold(3))
```
I think my initial design was: if pings are not responded to, then also evict the channel (which requires a custom ChannelDuplexHandler). Since pings are purely for extending idleness, I could use this.
.pingAckDropThreshold(3) -> this will cause the connection to drop?
Also, if the service side supports HTTP/2 PING, we should probably enable it by default; it helps with the timeout-detection part as well.
> (all connections expire together) and the non-determinism of re-rolling jitter each sweep.
> Matches reactor-netty 1.3.4's `maxLifeTimeVariance` semantics for easy migration.

> 6. **Two-phase eviction for lifetime** — Instead of immediately closing a connection past
Just thinking out loud: with the jitter in place, do we still need the rate limiting? Jitter should already help ensure the connections are not all closed at the same time.
This is me being defensive, but it's a valid point. I feel we can make a test-driven decision.
> Always faster than the smallest eviction threshold.

> 9. **30-minute default (defensive)** — .NET uses 5 minutes. We start at 30 minutes with
> `[30:01, 30:30]` effective range. Can be tuned down after production validation.
Since the config here will be maxLifeTime, maybe the range should be [29:30, 30:00], etc.
…r, Java 8 compat
- Switch PING keepalive from custom ChannelHandler to reactor-netty native pingAckTimeout/pingAckDropThreshold (available since 1.2.12). Simplifies code and enables dead-connection detection for half-open TCP.
- Fix jitter direction: subtract from base lifetime (effective [29:30, 30:00]) to match reactor-netty 1.3.4 maxLifeTimeVariance semantics. maxLifeTime is now the upper bound, never exceeded.
- Replace Http2PingHealthHandler with HttpConnectionLifecycleUtil (utility class for channel attributes and connection expiry stamping).
- Fix Set.copyOf() -> Collections.unmodifiableSet() for Java 8 compatibility.
- Update spec: rate limiter rationale, native PING design, .NET parity table.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wire AddressResolverGroup through ConnectionPolicy → RxDocumentClientImpl → HttpClientConfig so tests can inject a custom DNS resolver via the CosmosClientBuilderAccessor bridge pattern.

New test validates the full chain: max lifetime expiry → eviction → pool creates new connection → FilterableDnsResolverGroup re-resolves to a different backend IP (IP1 blocked) → traffic moves to IP2.

Production changes:
- ConnectionPolicy: add addressResolverGroup field + getter/setter
- RxDocumentClientImpl.httpClient(): propagate resolver to HttpClientConfig
- CosmosClientBuilder: add field, wire in buildConnectionPolicy()
- ImplementationBridgeHelpers: add setAddressResolverGroup to accessor

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add Features Added entries under 4.80.0-beta.1 (Unreleased) for:
- HTTP connection max lifetime with per-connection jitter for DNS re-resolution
- HTTP/2 PING keepalive via native reactor-netty pingAckTimeout/pingAckDropThreshold

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace native reactor-netty pingAckTimeout (incompatible with custom evictionPredicate) with a manual Http2PingHandler ChannelDuplexHandler installed on the parent H2 channel. The handler:
- Tracks last read/write activity on the parent channel
- Schedules PING frames when idle > configured interval (default 10s)
- Counts PINGs sent and ACKs received (for observability/testing)
- Does NOT close the connection on missed ACKs (keepalive only)
- Detected via Http2MultiplexHandler in pipeline (not channel.parent())

Key finding: reactor-netty's first doOnConnected fires for the parent TCP channel (parent() == null), not stream channels. H2 parent detection uses Http2MultiplexHandler presence in the pipeline.

Removed degradedConnectionEvictedByPingHealthCheck test — PING is keepalive-only, not eviction. Degraded connections are handled by the response timeout retry path (6s/6s/10s escalation → cross-region failover).

Test: pingFramesSentAndAcknowledgedOnIdleConnection
- Installs Http2PingHandler via doOnConnectedCallback on H2 parent
- Configures 3s PING interval, waits 20s idle
- Asserts pingsSent > 0 (proven: pingsSent=5, pingAcksReceived=10)
- Asserts connection survived (same parentChannelId)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
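The idle gating this commit describes — send a PING only when the parent channel has been quiet for longer than the configured interval — reduces to a single time comparison. A minimal sketch; the class and method names are illustrative, not the actual `Http2PingHandler` API:

```java
import java.time.Duration;

// Hypothetical sketch of the idle-gated PING decision: a PING is due only
// when no read/write activity has been observed for at least `interval`.
public final class PingScheduling {

    // nowNanos / lastActivityNanos come from a monotonic clock (System.nanoTime()).
    public static boolean shouldSendPing(long nowNanos, long lastActivityNanos, Duration interval) {
        return nowNanos - lastActivityNanos >= interval.toNanos();
    }
}
```

The real handler would re-schedule itself on the channel's event loop and reset `lastActivityNanos` on every read/write, which is why the commit notes the field no longer needs to be `volatile`.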
The prio qdisc's default priomap routes packets by TOS bits to bands BEFORE tc filters are consulted. Without an explicit priomap, non-SYN data packets could be routed to the delayed bands (1:1 or 1:2) instead of the no-delay band (1:3), causing metadata fetch 503 failures.

Fix: set priomap to '2 2 2 ... 2' (all 16 entries point to band 3) so ALL traffic defaults to no-delay. Only explicitly marked SYN packets (via iptables mangle MARK) are routed to delay bands by the tc filters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…into AzCosmos_HttpConnectionMaxLife
- Fix 3 source files with incorrect 'native pingAckTimeout' comments (HttpClient.java, HttpConnectionLifecycleUtil.java, ReactorNettyClient.java) to reflect the actual custom Http2PingHandler implementation
- Replace 13+ inline fully qualified class names with imports (ReactorNettyClient.java, Http2ConnectionLifecycleTests.java)
- Hardcode TestNG group string in both test files, remove TEST_GROUP static var
- Add clearAllCosmosSystemProperties() helper for wide cleanup in @AfterMethod/@AfterClass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace raw AddressResolverGroup/doOnConnectedCallback fields threaded through CosmosClientBuilder -> ConnectionPolicy -> HttpClientConfig with a single IHttpClientInterceptor interface following the pattern from PR Azure#47231.

Production (azure-cosmos):
- IHttpClientInterceptor: minimal interface with getAddressResolverGroup() and getDoOnConnectedCallback(), null-safe in production
- ConnectionPolicy: no longer exposes Netty types (AddressResolverGroup removed)
- CosmosClientBuilder: holds IHttpClientInterceptor instead of raw Netty fields

Test (azure-cosmos-test):
- CosmosHttpClientInterceptor: concrete implementation
- CosmosInterceptorHelper.registerHttpClientInterceptor(): convenience API consistent with existing registerTransportClientInterceptor()

Tests updated to use CosmosInterceptorHelper instead of bridge helpers directly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract duplicated tc netem and iptables helpers into a reusable NetworkFaultInjector utility class. Consolidates:
- sudo/root detection
- network interface discovery
- addNetworkDelay(delayMs), removeNetworkDelay()
- addPacketDrop(port), removePacketDrop(port)
- removeAll() for wide cleanup

Http2ConnectionLifecycleTests refactored to use NetworkFaultInjector. Http2ConnectTimeoutBifurcationTests can follow in a subsequent commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tom PING
- Http2ConnectTimeoutBifurcationTests: use NetworkFaultInjector for sudo detection, iptables helpers, and cleanup. Remove duplicated methods. Per-port delay methods (addPerPortDelay, addPerPortSynDelay) kept locally as they are bifurcation-test-specific.
- SPEC: Fix Design Choices #3 and #4 to reflect custom Http2PingHandler (not native pingAckTimeout). Fix Architecture diagram, Config table, Non-Goals, and .NET Parity sections.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause: the 8s tc netem delay was less than the e2e timeout (15s/25s), so requests completed slowly but successfully instead of timing out.

Fixes:
- Increase tc netem delay from 8s to 20s (exceeds e2e timeout)
- Add 1s settling delay in NetworkFaultInjector.addNetworkDelay() to ensure the qdisc is active before the first packet enters the queue
- Accept both 408/10002 (ReadTimeout) and 408/20008 (e2e cancel) in assertContainsGatewayTimeout — both prove the delay caused failure
- Relax retryUsesConsistentParentChannelId to accept >= 1 attempt (the 20s delay leaves only 5s of the 25s e2e budget — insufficient for retry)

Remaining: multiParentChannelConnectionReuse gets a transient 500 from the thin-client proxy under a 100-concurrent-request burst — server-side.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HttpClient.java:
- Always install eviction predicate + evictInBackground (no longer gated by maxLifetimeSeconds > 0). The predicate dynamically checks Configs.isHttpConnectionMaxLifetimeEnabled() for Phase 3 (lifetime). Toggling the flag at runtime disables lifetime eviction without restart; dead + idle eviction continue to work.

Http2PingHandler.java:
- Add dynamic Configs.isHttp2PingHealthEnabled() check in maybeSendPing(). Toggling the flag at runtime stops PINGs on existing connections.
- Make HANDLER_NAME private (only used internally)
- Remove unnecessary volatile from lastActivityNanos (event-loop-bound)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move FilterableDnsResolverGroup to azure-cosmos-test module (com.azure.cosmos.test.faultinjection package) for reuse
- Add azure-cosmos-test dependency to azure-cosmos-benchmark pom
- Add dnsBlockingEnabled + dnsBlockingCycleMinutes config to TenantWorkloadConfig (JSON-driven, tenantDefaults supported)
- Wire into AsyncBenchmark: inject FilterableDnsResolverGroup via CosmosInterceptorHelper, start a background scheduler that cycles NORMAL -> BLOCKED -> NORMAL on a configurable interval
- Add IpRotationHarness test for standalone DNS rotation validation
- Update test imports for the new FilterableDnsResolverGroup package

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…CycleMinutes Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix CHANGELOG: describe custom Http2PingHandler instead of native pingAckTimeout
- Add jitter > lifetime guard in HttpConnectionLifecycleUtil to prevent connection storms
- Remove stale HTTP2_PING_ACK_TIMEOUT_IN_SECONDS test property (dead code from earlier design)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
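The jitter-versus-lifetime guard mentioned in this commit could look like the following sketch. This is illustrative, not the actual `HttpConnectionLifecycleUtil` code, and it assumes a degenerate configuration falls back to zero jitter:

```java
// Hypothetical sketch of a "jitter > lifetime" guard: if the configured jitter
// window is as large as the base lifetime itself, the effective expiry
// [base - jitter, base] could be instant (or negative), causing connections to
// be evicted immediately and en masse. Fall back to no jitter instead.
public final class JitterGuard {

    public static long effectiveJitterNanos(long baseLifetimeNanos, long configuredJitterNanos) {
        if (configuredJitterNanos >= baseLifetimeNanos) {
            // Degenerate config (e.g. lifetime 10s, jitter 30s): disable jitter
            // rather than producing an immediate expiry storm.
            return 0L;
        }
        return configuredJitterNanos;
    }
}
```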
HTTP connection lifecycle management: max lifetime + PING keepalive for Gateway V2
Summary
This PR adds two independent HTTP connection lifecycle features for Cosmos DB Gateway V2 (thin client) endpoints: max connection lifetime (forces periodic DNS re-resolution by evicting long-lived connections) and HTTP/2 PING keepalive (prevents L7 middleboxes from silently reaping idle connections). Both HTTP/1.1 and HTTP/2 coexist on the same account — Kusto telemetry shows 43.8M HTTP/2 vs 1.1M HTTP/1.1 requests in a 6-hour window — so max lifetime applies to both protocols while PING keepalive targets HTTP/2 only. All settings are internal system properties (no public API changes) with conservative defaults chosen for safe production rollout.
1. Purpose / Motivation
Solution: Two orthogonal features address these independently:
Both HTTP/1.1 and HTTP/2 coexist on the same Cosmos DB account (confirmed by Kusto evidence: 43.8M HTTP/2 and 1.1M HTTP/1.1 requests in a 6-hour window), so the implementation must handle both protocols.
2. Implementation Approach

Max Lifetime Eviction (HTTP/1.1 + HTTP/2)

A custom `evictionPredicate` in `ConnectionProvider.Builder` implements a 3-phase eviction order: dead channels → idle channels → lifetime-expired channels.

For HTTP/2, a two-phase lifetime eviction avoids sending RST_STREAM on active streams: the first sweep stamps a `PENDING_EVICTION_NANOS` attribute on the channel; subsequent sweeps evict once the connection is idle or the drain grace period has elapsed.

Per-connection subtractive jitter in `[base - 30s, base]` prevents thundering-herd reconnection storms. Each connection independently computes its expiry time at creation, so connections opened at the same time expire at different times.

PING Keepalive (HTTP/2 only)
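The per-connection subtractive jitter can be sketched in a few lines. The names here are illustrative (not the SDK's actual `HttpConnectionLifecycleUtil` API); the key property is that the jitter is rolled once, at connection creation, and the configured max lifetime is an upper bound that is never exceeded:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of per-connection subtractive jitter: each connection computes a
// fixed expiry timestamp when it is created, so the jitter is never re-rolled
// on later sweeps. Effective lifetime lands in [base - maxJitter, base].
public final class ConnectionExpiry {

    public static long stampExpiryNanos(long creationNanos, long baseLifetimeNanos, long maxJitterNanos) {
        long jitter = maxJitterNanos > 0
            ? ThreadLocalRandom.current().nextLong(maxJitterNanos + 1) // inclusive of maxJitter
            : 0L;
        return creationNanos + baseLifetimeNanos - jitter;
    }
}
```

Stamping the result into a channel attribute (as the `CONNECTION_EXPIRY_NANOS` attribute described above) is what makes the eviction predicate deterministic per connection.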
Uses a custom `Http2PingHandler` (a `ChannelDuplexHandler`) installed on HTTP/2 parent channels via `doOnConnected`. The handler checks `Configs.isHttp2PingHealthEnabled()` dynamically, allowing a runtime toggle without client restart; any `ChannelPipelineException` (e.g., a duplicate handler from a Netty regression) is swallowed.

Eviction Rate Limiter

At most 1 connection is evicted per sweep cycle (dead channels are exempt from this limit). This is a defense-in-depth measure alongside jitter to prevent mass eviction.
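The 3-phase ordering plus the one-eviction-per-sweep rate limiter could be sketched as follows. The types and names are illustrative, not the actual reactor-netty `evictionPredicate` wiring:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of one background sweep: dead channels are always evicted (no limit);
// otherwise at most one idle- or lifetime-expired connection goes per sweep.
public final class SweepSketch {

    public static final class Conn {
        public final boolean dead, idleExpired, lifetimeExpired;
        public Conn(boolean dead, boolean idleExpired, boolean lifetimeExpired) {
            this.dead = dead;
            this.idleExpired = idleExpired;
            this.lifetimeExpired = lifetimeExpired;
        }
    }

    // Returns the connections selected for eviction in this sweep.
    public static List<Conn> sweep(List<Conn> pool) {
        List<Conn> evicted = new ArrayList<>();
        boolean budgetUsed = false; // rate limiter: 1 non-dead eviction per sweep
        for (Conn c : pool) {
            if (c.dead) {
                evicted.add(c);               // phase 1: dead channels, exempt from limit
            } else if (!budgetUsed && (c.idleExpired || c.lifetimeExpired)) {
                evicted.add(c);               // phases 2/3: idle, then lifetime
                budgetUsed = true;
            }
        }
        return evicted;
    }
}
```

With jitter already staggering expiries, the rate limiter is the second, independent layer: even if many connections expire in the same sweep window, at most one healthy connection is closed per cycle.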
Sweep Interval

Dynamically derived: `clamp(min(idleTimeout, baseMaxLifetime) / 2, 1s, 5s)`

Dynamic Runtime Toggle

Both features check their enable flag at runtime (not at client construction time):

- `Configs.isHttpConnectionMaxLifetimeEnabled()` on every sweep — returns `false` for lifetime-expired connections when disabled
- `Configs.isHttp2PingHealthEnabled()` in `maybeSendPing()` — skips PING when disabled

This allows safe production rollout: flip system properties to disable either feature without restarting the application.
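The sweep-interval formula above — `clamp(min(idleTimeout, baseMaxLifetime) / 2, 1s, 5s)` — translates directly into code. A small illustrative helper (not SDK code):

```java
import java.time.Duration;

// Sketch of the sweep-interval derivation: half the smallest eviction
// threshold, clamped to [1s, 5s], so the background sweep always runs at
// least twice as often as the fastest way a connection can expire.
public final class SweepInterval {

    private static final Duration MIN = Duration.ofSeconds(1);
    private static final Duration MAX = Duration.ofSeconds(5);

    public static Duration derive(Duration idleTimeout, Duration baseMaxLifetime) {
        Duration smallest = idleTimeout.compareTo(baseMaxLifetime) <= 0 ? idleTimeout : baseMaxLifetime;
        Duration half = smallest.dividedBy(2);
        if (half.compareTo(MIN) < 0) return MIN;
        if (half.compareTo(MAX) > 0) return MAX;
        return half;
    }
}
```

With the documented defaults (idle timeout well below the 1800s lifetime), the clamp means the sweep effectively runs every 5 seconds.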
IHttpClientInterceptor Pattern
Test-time injection of `AddressResolverGroup` and `doOnConnected` callbacks uses the `IHttpClientInterceptor` interface (following the pattern from PR #47231). Netty-specific types stay off the public `ConnectionPolicy` class — the interceptor is wired through `ImplementationBridgeHelpers` and is `null` in production (zero overhead).

Configuration
All settings are internal system properties (not public API):

| Property | Default |
|---|---|
| `COSMOS.HTTP_CONNECTION_MAX_LIFETIME_ENABLED` | `true` |
| `COSMOS.HTTP_CONNECTION_MAX_LIFETIME_IN_SECONDS` | `1800` (30 min) |
| `COSMOS.HTTP2_PING_HEALTH_ENABLED` | `true` |
| `COSMOS.HTTP2_PING_INTERVAL_IN_SECONDS` | `30` |

The default 30 min max lifetime is deliberately conservative compared to .NET's 5 min — we will tune after production validation.
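Because these are ordinary JVM system properties, no API calls are involved in a rollout toggle — `-D` flags at launch or `System.setProperty` suffice (and, per the runtime-toggle design above, the `ENABLED` flags are re-read while the client is running). A small example using the documented names and defaults:

```java
// Example only: sets the documented lifecycle system properties to their
// defaults. In production these would typically be -D flags, e.g.
//   -DCOSMOS.HTTP_CONNECTION_MAX_LIFETIME_IN_SECONDS=1800
public final class LifecycleConfigExample {
    public static void main(String[] args) {
        System.setProperty("COSMOS.HTTP_CONNECTION_MAX_LIFETIME_ENABLED", "true");
        System.setProperty("COSMOS.HTTP_CONNECTION_MAX_LIFETIME_IN_SECONDS", "1800");
        System.setProperty("COSMOS.HTTP2_PING_HEALTH_ENABLED", "true");
        System.setProperty("COSMOS.HTTP2_PING_INTERVAL_IN_SECONDS", "30");
        System.out.println(System.getProperty("COSMOS.HTTP_CONNECTION_MAX_LIFETIME_IN_SECONDS")); // prints 1800
    }
}
```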
3. Key Files Changed

- `HttpClient.java`
- `HttpConnectionLifecycleUtil.java` — channel attributes (`CONNECTION_EXPIRY_NANOS`, `PENDING_EVICTION_NANOS`)
- `Http2PingHandler.java`
- `ReactorNettyClient.java` — `doOnConnected` expiry stamping + PING handler install, resolver group support
- `Configs.java`
- `IHttpClientInterceptor.java`
- `CosmosInterceptorHelper.java`
- `CosmosClientBuilder.java`
- `ConnectionPolicy.java`
- `RxDocumentClientImpl.java` — `ConnectionPolicy` → `HttpClientConfig`
- `NetworkFaultInjector.java` — `tc netem`/`iptables` fault injection in tests
- `FilterableDnsResolverGroup.java`
- `Http2ConnectionLifecycleTests.java`
- `Http2ConnectTimeoutBifurcationTests.java`
- `tests.yml`
- `HTTP_CONNECTION_LIFECYCLE_SPEC.md`

4. Benchmark Results
Test matrix: `{c10, c2} × {ReadThroughput, WriteThroughput}` + `{c1 sparse} × {ReadThroughput, WriteThroughput}`, GATEWAY mode, 2h per scenario (721 × 10s-interval samples), 30 min per sparse scenario. All endpoints route through Azure Traffic Manager (Central US region, 4 backend IPs). Both `main` and the dev branch were tested on the same VM sequentially.

Infrastructure: Standard_D2s_v3 (2 vCPU, 8 GB, Central US), JDK 21, Maven 3.8.7. Account `abhm-cfp-region-test` (3 regions: East US, West US, Central US), container at autoscale 100K RU/s. Dev branch config: `maxLifetime=300s` (5 min), `pingInterval=30s`.

Throughput & Latency
Federation Distribution (ComputeRequest5M Kusto Validation)
Kusto confirms the key finding: main pins all traffic to 1 federation; dev with maxLifetime distributes across 4 federations.
IP Rotation Validation
`IpRotationHarness` on a VM (Central US, maxLife=60s, 15 min runtime, 3 phases with `FilterableDnsResolverGroup` IP blocking): 59 connection rotations in Phase 1 alone (maxLife=60s + jitter). Each rotation forced a DNS re-resolve, discovering different backend IPs behind Azure Traffic Manager.
DNS Behavior Validation
`DefaultAddressResolverGroup` delegates to `InetAddress.getByName()` with no additional cache — only the JVM DNS cache (30s TTL) sits between the SDK and Azure Traffic Manager. When ATM removes a dead federation, new connections get healthy IPs within ~30 seconds.

Conclusion
5. Testing Methodology
Tests use real network fault injection (not SDK synthetic faults) via
`tc netem` and `iptables` on Linux VMs. A shared `NetworkFaultInjector` utility handles sudo detection, tc netem delay, iptables drop, and cleanup.
- `connectionReuseAfterRealNettyTimeout` — `ReadTimeoutException`
- `multiParentChannelConnectionReuse`
- `retryUsesConsistentParentChannelId`
- `connectionSurvivesE2ETimeoutWithRealDelay`
- `parentChannelSurvivesE2ECancelWithoutReadTimeout` — `ReadTimeout` doesn't kill parent

Max Lifetime Eviction (3 tests)
- `connectionRotatedAfterMaxLifetimeExpiry`
- `perConnectionJitterStaggersEviction`
- `connectionEvictedAfterMaxLifetimeEvenWithHealthyPings`

PING Health (1 test)
- `degradedConnectionEvictedByPingHealthCheck` — `iptables` blackhole → PING ACK timeout → connection evicted

DNS Rotation (1 test)
- `dnsRotationAfterMaxLifetimeExpiry` — `FilterableDnsResolverGroup` blocks IP1; max lifetime eviction forces DNS re-resolution to IP2

CI Integration
New `Cosmos_Live_Test_HttpNetworkFault` stage in `tests.yml`: `tc`/`iptables` prerequisites, `MaxParallel=1` (network faults are host-global).

6. .NET Parity
`[0s, 30s)` / `[0s, 30s]` (subtractive) / custom `Http2PingHandler`

7. Configuration Quick Reference
Enable/disable max lifetime: `COSMOS.HTTP_CONNECTION_MAX_LIFETIME_ENABLED`

Enable/disable PING keepalive: `COSMOS.HTTP2_PING_HEALTH_ENABLED`

To fully disable both features, set both `ENABLED` properties to `false`.

8. Future Work
- `maxLifeTime()` + `maxLifeTimeVariance()` once we upgrade

All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines