HTTP/2 connection lifecycle: max lifetime with per-connection jitter + PING health probing #48420

Draft
jeet1995 wants to merge 36 commits into Azure:main from jeet1995:AzCosmos_HttpConnectionMaxLife

Conversation

@jeet1995 (Member) commented Mar 14, 2026

HTTP connection lifecycle management: max lifetime + PING keepalive for Gateway V2

Summary

This PR adds two independent HTTP connection lifecycle features for Cosmos DB Gateway V2 (thin client) endpoints: max connection lifetime (forces periodic DNS re-resolution by evicting long-lived connections) and HTTP/2 PING keepalive (prevents L7 middleboxes from silently reaping idle connections). Both HTTP/1.1 and HTTP/2 coexist on the same account — Kusto telemetry shows 43.8M HTTP/2 vs 1.1M HTTP/1.1 requests in a 6-hour window — so max lifetime applies to both protocols while PING keepalive targets HTTP/2 only. All settings are internal system properties (no public API changes) with conservative defaults chosen for safe production rollout.


1. Purpose / Motivation

| Problem | Impact |
| --- | --- |
| Stale DNS pinning — TCP connections never re-resolve DNS on their own | When Cosmos DB frontend federations scale out or fail over, traffic stays pinned to stale IPs indefinitely. New backends receive zero load until clients reconnect. |
| L7 idle connection reaping — NAT gateways, firewalls, and load balancers silently drop idle HTTP/2 connections | TCP keepalive operates at L4 and is invisible to L7 middleboxes. The client believes the connection is alive, sends a request, and gets a RST or hangs. |

Solution: Two orthogonal features address these independently:

  • Max connection lifetime → forces DNS re-resolution by evicting connections after a configurable duration, regardless of health. Applies to both HTTP/1.1 and HTTP/2.
  • PING keepalive → sends periodic HTTP/2 PING frames to keep connections alive at L7 and detect degraded connections (ACK timeout → eviction). HTTP/2 only.

Both HTTP/1.1 and HTTP/2 coexist on the same Cosmos DB account (confirmed by Kusto evidence: 43.8M HTTP/2 and 1.1M HTTP/1.1 requests in a 6-hour window), so the implementation must handle both protocols.


2. Implementation Approach

Max Lifetime Eviction (HTTP/1.1 + HTTP/2)

A custom `evictionPredicate` configured on `ConnectionProvider.Builder` implements a 3-phase eviction order: dead channels → idle channels → lifetime-expired channels.

For HTTP/2, a two-phase lifetime eviction avoids sending RST_STREAM on active streams:

  1. Mark pending — set PENDING_EVICTION_NANOS attribute on the channel
  2. Drain grace period (10 seconds) — allow in-flight streams to complete
  3. Evict — close the connection

Per-connection subtractive jitter [base - 30s, base] prevents thundering-herd reconnection storms. Each connection independently computes its expiry time at creation, so connections opened at the same time expire at different times.
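As a rough sketch of the subtractive-jitter math (the class and method names here are illustrative, not the SDK's actual helpers):

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of per-connection subtractive jitter, assuming the 30s
// jitter window described above. ConnectionExpiry/computeExpiryNanos are
// hypothetical names, not the SDK's real API.
public final class ConnectionExpiry {
    static final long JITTER_WINDOW_NANOS = Duration.ofSeconds(30).toNanos();

    // Computed once at connection creation and stamped on the channel, so each
    // connection independently expires somewhere in [base - 30s, base].
    public static long computeExpiryNanos(long creationNanos, long baseMaxLifetimeNanos) {
        long jitter = ThreadLocalRandom.current().nextLong(JITTER_WINDOW_NANOS + 1);
        return creationNanos + baseMaxLifetimeNanos - jitter;
    }
}
```

Because the jitter is subtracted rather than added, the configured base lifetime remains a hard upper bound: no connection outlives it.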

PING Keepalive (HTTP/2 only)

Uses a custom Http2PingHandler (ChannelDuplexHandler) installed on HTTP/2 parent channels via doOnConnected. The handler:

  • Sends PING frames at configurable intervals (default 30s) when the connection has been idle
  • Tracks the last activity time (any read/write) to avoid sending PINGs on active connections
  • Checks Configs.isHttp2PingHealthEnabled() dynamically, allowing runtime toggle without client restart
  • Installation is best-effort — any ChannelPipelineException (e.g., duplicate handler from Netty regressions) is swallowed

Why not native reactor-netty pingAckTimeout? reactor-netty 1.2.13 bypasses its built-in maxIdleTime handling when a custom evictionPredicate is configured. Since we must use a custom eviction predicate for max-lifetime, the native PING path is never triggered. The custom Http2PingHandler is independent of the eviction predicate and works correctly alongside it.

Eviction Rate Limiter

At most 1 connection evicted per sweep cycle (dead channels are exempt from this limit). This is a defense-in-depth measure alongside jitter to prevent mass eviction.
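A minimal sketch of that per-sweep budget, assuming a hypothetical `Candidate` view of each pooled channel (not the SDK's real types):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the per-sweep eviction rate limiter described above:
// dead channels are always evicted, but at most one healthy-yet-expired
// channel is evicted per sweep cycle. Names are hypothetical.
public final class EvictionRateLimiter {
    public static final class Candidate {
        final String id;
        final boolean dead;
        final boolean expired;
        public Candidate(String id, boolean dead, boolean expired) {
            this.id = id;
            this.dead = dead;
            this.expired = expired;
        }
    }

    public static List<String> selectForEviction(List<Candidate> candidates) {
        List<String> toEvict = new ArrayList<>();
        boolean lifetimeBudgetUsed = false;
        for (Candidate c : candidates) {
            if (c.dead) {
                toEvict.add(c.id);            // dead channels are exempt from the limit
            } else if (c.expired && !lifetimeBudgetUsed) {
                toEvict.add(c.id);            // one lifetime eviction per sweep
                lifetimeBudgetUsed = true;
            }
        }
        return toEvict;
    }
}
```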

Sweep Interval

Dynamically derived: clamp(min(idleTimeout, baseMaxLifetime) / 2, 1s, 5s)
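A minimal sketch of that derivation (class name hypothetical):

```java
import java.time.Duration;

// Sketch of the sweep-interval formula above:
// clamp(min(idleTimeout, baseMaxLifetime) / 2, 1s, 5s).
public final class SweepInterval {
    static final Duration MIN = Duration.ofSeconds(1);
    static final Duration MAX = Duration.ofSeconds(5);

    public static Duration derive(Duration idleTimeout, Duration baseMaxLifetime) {
        Duration smaller = idleTimeout.compareTo(baseMaxLifetime) < 0 ? idleTimeout : baseMaxLifetime;
        Duration half = smaller.dividedBy(2);
        if (half.compareTo(MIN) < 0) return MIN;   // never sweep faster than 1s
        if (half.compareTo(MAX) > 0) return MAX;   // never sweep slower than 5s
        return half;
    }
}
```

Halving the smallest threshold keeps the sweep fast enough that an expired connection is noticed within one threshold period, while the clamp bounds scheduler overhead.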

Dynamic Runtime Toggle

Both features check their enable flag at runtime (not at client construction time):

  • Eviction predicate: Always installed, checks Configs.isHttpConnectionMaxLifetimeEnabled() on every sweep — returns false for lifetime-expired connections when disabled
  • Http2PingHandler: Always installed, checks Configs.isHttp2PingHealthEnabled() in maybeSendPing() — skips PING when disabled

This allows safe production rollout: flip system properties to disable either feature without restarting the application.
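A minimal sketch of such a runtime-checked toggle, using the property name from the Configuration table (the SDK's actual Configs class is not reproduced here):

```java
// Hypothetical sketch: the real SDK reads this through its internal Configs
// class, but the key idea is the same.
public final class LifecycleToggles {
    public static boolean isMaxLifetimeEnabled() {
        // Read on every sweep rather than cached at client construction,
        // so flipping the system property takes effect without a restart.
        return Boolean.parseBoolean(
            System.getProperty("COSMOS.HTTP_CONNECTION_MAX_LIFETIME_ENABLED", "true"));
    }
}
```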

IHttpClientInterceptor Pattern

Test-time injection of AddressResolverGroup and doOnConnected callbacks uses the IHttpClientInterceptor interface (following the pattern from PR #47231). Netty-specific types stay off the public ConnectionPolicy class — the interceptor is wired through ImplementationBridgeHelpers and is null in production (zero overhead).

Configuration

All settings are internal system properties (not public API):

| System Property | Default | Description |
| --- | --- | --- |
| `COSMOS.HTTP_CONNECTION_MAX_LIFETIME_ENABLED` | `true` | Enable/disable max lifetime eviction |
| `COSMOS.HTTP_CONNECTION_MAX_LIFETIME_IN_SECONDS` | `1800` (30 min) | Base max connection lifetime |
| `COSMOS.HTTP2_PING_HEALTH_ENABLED` | `true` | Enable/disable PING keepalive |
| `COSMOS.HTTP2_PING_INTERVAL_IN_SECONDS` | `30` | Interval between PING frames |

The default 30 min max lifetime is deliberately conservative compared to .NET's 5 min — we will tune after production validation.


3. Key Files Changed

| File | Change |
| --- | --- |
| HttpClient.java | Eviction predicate with 3-phase logic, rate limiter, sweep interval derivation, dynamic toggle |
| HttpConnectionLifecycleUtil.java | NEW — channel attribute stamping (CONNECTION_EXPIRY_NANOS, PENDING_EVICTION_NANOS) |
| Http2PingHandler.java | NEW — custom HTTP/2 PING keepalive handler with dynamic Configs check |
| ReactorNettyClient.java | doOnConnected expiry stamping + PING handler install, resolver group support |
| Configs.java | System properties for max lifetime and PING settings |
| IHttpClientInterceptor.java | NEW — test-time injection interface for AddressResolverGroup + doOnConnected |
| CosmosInterceptorHelper.java | NEW — helper to register interceptors via ImplementationBridgeHelpers |
| CosmosClientBuilder.java | IHttpClientInterceptor accessor via bridge helpers |
| ConnectionPolicy.java | IHttpClientInterceptor propagation (no Netty types on public API) |
| RxDocumentClientImpl.java | Wires interceptor from ConnectionPolicyHttpClientConfig |
| NetworkFaultInjector.java | NEW — shared utility for tc netem / iptables fault injection in tests |
| FilterableDnsResolverGroup.java | NEW — test fixture for DNS-level IP filtering |
| Http2ConnectionLifecycleTests.java | 9 tests covering lifecycle scenarios |
| Http2ConnectTimeoutBifurcationTests.java | 5 tests covering connection timeout survival |
| tests.yml | CI pipeline stage for network fault tests |
| HTTP_CONNECTION_LIFECYCLE_SPEC.md | Design specification |

4. Benchmark Results

Test matrix: {c10, c2} × {ReadThroughput, WriteThroughput} + {c1 sparse} × {ReadThroughput, WriteThroughput}, GATEWAY mode, 2h per scenario (721 × 10s-interval samples), 30 min per sparse scenario. All endpoints route through Azure Traffic Manager (Central US region, 4 backend IPs). Both the main and dev branches were tested sequentially on the same VM.

Infrastructure: Standard_D2s_v3 (2 vCPU, 8 GB, Central US), JDK 21, Maven 3.8.7. Account abhm-cfp-region-test (3 regions: East US, West US, Central US), container at autoscale 100K RU/s. Dev branch config: maxLifetime=300s (5 min), pingInterval=30s.

Throughput & Latency

| Config | Concurrency | Operation | main (ops/s) | dev (ops/s) | Δ ops | main mean (ms) | dev mean (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Read | c10 | ReadThroughput | 2,128 | 2,046 | -3.8% | 4.3 | 4.4 |
| Write | c10 | WriteThroughput | 260 | 276 | +5.9% | 38.0 | 35.4 |
| Read | c2 | ReadThroughput | 641 | 605 | -5.6% | 3.0 | 3.0 |
| Write | c2 | WriteThroughput | 58 | 54 | -6.4% | 33.7 | 37.9 |
| Read | c1 sparse | ReadThroughput | 0.20 | 0.20 | 0% | | |
| Write | c1 sparse | WriteThroughput | 0.20 | 0.20 | 0% | | |

Note on c10 reads: Both main and dev are CPU-saturated at 92–93% on this 2-vCPU VM. The -3.8% delta is within noise for CPU-bound workloads.

Note on c2: At low concurrency, each connection rotation (every 5 min) has proportionally larger impact — the TLS handshake + HTTP/2 setup overhead is amortized over fewer concurrent streams. This is the expected cost of DNS re-resolution.

Federation Distribution (ComputeRequest5M Kusto Validation)

Kusto confirms the key finding: main pins all traffic to 1 federation; dev with maxLifetime distributes across 4 federations.

| Time Window | Scenario | fe43 | fe11 | fe39 | fe38 | Federations |
| --- | --- | --- | --- | --- | --- | --- |
| 00:00–02:00 | main (baseline) | 100% (14.8M) | 0% | 0% | 0% | 1 |
| 02:00–02:30 | dev (maxLife=5m) | 30% | 51% | 19% | 0% | 3 |
| 02:30–03:00 | dev | 30% | 37% | 17% | 16% | 4 |
| 03:00–03:30 | dev | 31% | 36% | 23% | 10% | 4 |
| 03:30–04:00 | dev | 34% | 11% | 22% | 34% | 4 |

IP Rotation Validation

IpRotationHarness on VM (Central US, maxLife=60s, 15 min runtime, 3 phases with FilterableDnsResolverGroup IP blocking):

| Phase | Duration | IPs Seen | Blocked IP Traffic | Result |
| --- | --- | --- | --- | --- |
| 1: Normal | 5 min | 4 CUS IPs | N/A | ✅ Load balanced across 4 IPs via DNS rotation |
| 2: Block .38 | 5 min | 3 CUS IPs | 0 connections (0%) | ✅ Traffic fully shifted away from blocked IP |
| 3: Unblock | 5 min | 2 new IPs | N/A | ✅ Traffic rebalanced to fresh DNS-resolved IPs |

59 connection rotations in Phase 1 alone (maxLife=60s + jitter). Each rotation forced a DNS re-resolve, discovering different backend IPs behind Azure Traffic Manager.

DNS Behavior Validation

DefaultAddressResolverGroup delegates to InetAddress.getByName() with no additional cache — only the JVM DNS cache (30s TTL) sits between the SDK and Azure Traffic Manager. When ATM removes a dead federation, new connections get healthy IPs within ~30 seconds.
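For reference, the JVM-side cache TTL mentioned above is controlled by the `networkaddress.cache.ttl` security property (30s is the typical JDK default when no security manager is installed). A small sketch of reading/overriding it (class name hypothetical):

```java
import java.security.Security;

// Hypothetical helper around the standard networkaddress.cache.ttl
// security property that governs the JVM's positive-DNS-lookup cache.
public final class DnsCacheTtl {
    public static String configuredTtlSeconds() {
        // Returns null when the JDK default (unset in java.security) applies.
        return Security.getProperty("networkaddress.cache.ttl");
    }

    public static void overrideTtl(int seconds) {
        // Must run before the first InetAddress lookup to take effect.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }
}
```

Shortening this TTL would let the SDK pick up Azure Traffic Manager changes faster, at the cost of more resolver traffic; the PR relies on the default 30s.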

Conclusion

  • No regression at high concurrency (c10): Reads within noise (CPU-saturated), writes improved +5.9% likely due to better federation load balancing from connection rotation
  • Small overhead at low concurrency (c2): 5–6% throughput delta is the expected cost of rotating connections every 5 min — acceptable trade-off for DNS re-resolution benefits
  • Zero-error sparse workload: MaxLifetime + PING produce no errors under sparse traffic (concurrency=1, 5s between ops)
  • Federation distribution: main pins 100% traffic to 1 federation; dev distributes across 4 federations — confirming max-lifetime achieves its core design goal
  • DNS resolution chain validates: DefaultAddressResolverGroup → JVM cache (30s) → Azure Traffic Manager (10–14s TTL) — no additional cache layers to defeat ATM health-based routing

5. Testing Methodology

Tests use real network fault injection (not SDK synthetic faults) via tc netem and iptables on Linux VMs. Shared NetworkFaultInjector utility handles sudo detection, tc netem delay, iptables drop, and cleanup.

Connection Timeout Survival (tc netem — 5 tests)

| Test | What it proves |
| --- | --- |
| connectionReuseAfterRealNettyTimeout | Parent TCP connection survives a stream-level ReadTimeoutException |
| multiParentChannelConnectionReuse | All parent channels survive under concurrent load |
| retryUsesConsistentParentChannelId | Retry attempts are tracked across gateway stats |
| connectionSurvivesE2ETimeoutWithRealDelay | End-to-end cancel doesn't close the parent connection |
| parentChannelSurvivesE2ECancelWithoutReadTimeout | 3s e2e cancel before 6s ReadTimeout doesn't kill parent |

Max Lifetime Eviction (3 tests)

| Test | What it proves |
| --- | --- |
| connectionRotatedAfterMaxLifetimeExpiry | Connection is evicted after lifetime + jitter expires |
| perConnectionJitterStaggersEviction | Connections don't all expire in the same sweep cycle |
| connectionEvictedAfterMaxLifetimeEvenWithHealthyPings | Lifetime eviction works even when PINGs are healthy |

PING Health (1 test)

| Test | What it proves |
| --- | --- |
| degradedConnectionEvictedByPingHealthCheck | iptables blackhole → PING ACK timeout → connection evicted |

DNS Rotation (1 test)

| Test | What it proves |
| --- | --- |
| dnsRotationAfterMaxLifetimeExpiry | FilterableDnsResolverGroup blocks IP1; max lifetime eviction forces DNS re-resolution to IP2 |

CI Integration

New Cosmos_Live_Test_HttpNetworkFault stage in tests.yml:

  • Ubuntu VMs with tc/iptables prerequisites
  • MaxParallel=1 (network faults are host-global)
  • Thin client test account

6. .NET Parity

| Aspect | .NET | Java (this PR) |
| --- | --- | --- |
| Base lifetime | 5 min | 30 min (defensive — tune after validation) |
| Jitter | Per-pool [0s, 30s) | Per-connection [0s, 30s] (subtractive) |
| PING keepalive | No | Yes (custom Http2PingHandler) |
| PING-based eviction | No | Yes (ACK timeout → connection close) |
| HTTP/2 fallback | Explicit error path | Graceful ALPN negotiation (reactor-netty auto-configures) |

7. Configuration Quick Reference

Enable/disable max lifetime:

```
-DCOSMOS.HTTP_CONNECTION_MAX_LIFETIME_ENABLED=false    # disable (default: true)
-DCOSMOS.HTTP_CONNECTION_MAX_LIFETIME_IN_SECONDS=900   # override to 15 min (default: 1800)
```

Enable/disable PING keepalive:

```
-DCOSMOS.HTTP2_PING_HEALTH_ENABLED=false               # disable (default: true)
-DCOSMOS.HTTP2_PING_INTERVAL_IN_SECONDS=30             # override interval (default: 30)
```

To fully disable both features, set both ENABLED properties to false.


8. Future Work

  • reactor-netty 1.3.4: Replace custom lifetime logic with native maxLifeTime() + maxLifeTimeVariance() once we upgrade
  • HTTP/1.1 application-layer keepalive: No PING equivalent exists for HTTP/1.1 — investigate OPTIONS or HEAD probes
  • PING tuning: Adjust interval and ACK timeout after production validation with real middlebox behavior data

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@jeet1995 (Member Author):

/azp run java - cosmos - tests

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).


…r, design spec

- Switch from per-evaluation to per-connection jitter via CONNECTION_EXPIRY_NANOS channel attribute
- Make pingContent a static final constant (PING_CONTENT)
- Derive sweep interval from min(thresholds)/2 clamped to [1s, 5s]
- Add eviction rate limiter: max 1 eviction per sweep cycle (dead channels exempt)
- Add HTTP_CONNECTION_LIFECYCLE_SPEC.md design spec for review

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 changed the title from "Az cosmos http connection max life" to "HTTP/2 connection lifecycle: max lifetime with per-connection jitter + PING health probing" on Mar 23, 2026
9.1: Split installOnParentIfAbsent into stampConnectionExpiry + installOnParentIfAbsent.
     Max lifetime works independently of PING — disabling PING no longer silently
     disables max lifetime.

9.2: Two-phase eviction for Phase 3 (lifetime) via PENDING_EVICTION_NANOS attribute.
     First sweep marks connection as pending. Subsequent sweeps evict when idle or
     after 10s drain grace period. Prevents RST_STREAM on active H2 streams during
     routine lifetime rotation. Phase 2 (PING-stale) stays immediate — degraded
     connections should be evicted fast.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HTTP/2 connections can become silently degraded — packet black-holes, half-open TCP,
NAT/firewall timeout — without the SDK knowing. In sparse workloads, two problems arise:

1. **Silent degradation detection**: The next request discovers the dead connection via response
@jeet1995 (Member Author):

Silent degradation detection affects both sparse and non-sparse workloads.

eviction predicate is invoked."* Since we need a custom predicate for PING health, the
built-in `maxLifeTime` and `maxIdleTime` handling is replaced entirely.

reactor-netty 1.3.4 introduces `maxLifeTimeVariance(double)` for per-connection jitter — exactly
@jeet1995 (Member Author):

Add an item to track maxLifeTimeVariance integration too.

┌──────────────────────────────────────────────────────────────────────┐
│ ConnectionProvider (reactor-netty 1.2.13) │
│ │
│ evictInBackground(5s) sweeps all connections through: │
@jeet1995 Mar 23, 2026:

Ensure the overview is up to date w.r.t rest of spec (section 9.1 and 9.2 changes should be reflected here). Design choices should precede the overview.


---

## 3. Eviction Predicate Design
@jeet1995 (Member Author):

Update section 3 with section 9.1 and 9.2 changes.

jeet1995 and others added 3 commits March 23, 2026 17:22
- Goal 2: Silent degradation affects all workloads, not just sparse
- Add maxLifeTimeVariance tracking item in motivation section
- New §2 Key Design Choices precedes architectural overview
- §3 Overview updated to reflect decoupled install paths and two-phase eviction
- §4 Phase 2/3 updated with immediate vs two-phase eviction details
- Section numbers renumbered (§2-§10)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Change max lifetime default from 300s (5min) to 1800s (30min) — defensive
  Effective range with jitter: [30:01, 30:30]
- Add COSMOS.HTTP_CONNECTION_MAX_LIFETIME_ENABLED (default: true)
- Add COSMOS.HTTP2_PING_HEALTH_ENABLED (default: true)
- Both features now have explicit boolean toggles alongside numeric configs
- Update SPEC config table with new defaults and toggle flags

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove Phase 2 (PING ACK stale → evict) from eviction predicate
- PING handler remains for keepalive (prevents NAT/firewall idle reaping)
- Degraded connections handled by response timeout retry path
- Rewrite SPEC: decision-focused, ~150 lines, no code duplication
- Add TCP keepalive vs HTTP/2 PING distinction

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jeet1995 and others added 3 commits March 23, 2026 18:30
- Move stampConnectionExpiry + PING install to shared doOnConnected (all connections)
- H2-specific doOnConnected now only handles header cleaner
- Wire AddressResolverGroup injection via HttpClientConfig for e2e tests
- SPEC updated: both goals apply to all connections, architecture diagram updated

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HTTP/1.1 has no PING equivalent — L7 middleboxes can't see TCP keepalive.
ChangeFeed (100% of H1.1 traffic) is long-polling so rarely idle.
Low risk today but worth addressing if future H1.1 workloads emerge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PING is an HTTP/2 protocol frame — cannot be sent on H1.1 connections.
Code already correct (isH2Enabled guard). SPEC now consistent:
- Goal 2: Connection keepalive (HTTP/2)
- Design Choice 3: PING keepalive is HTTP/2 only
- Architecture: PING install gated on H2 enabled
- Design choices renumbered (1-9, no duplicates)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@FabianMeiswinkel (Member) left a comment:

LGTM

│ │
│ └─ If PING keepalive enabled AND H2 enabled:
│ installOnParentIfAbsent(channel, interval)
│ → installs Http2PingHealthHandler (H2 only — PING is an HTTP/2 frame)
Review comment (Member):

Why not use Netty's native PING support mechanism?

```java
.http2Settings(settings -> settings
    .pingAckTimeout(Duration.ofSeconds(10))
    .pingAckDropThreshold(3))
```

@jeet1995 Mar 25, 2026:

My initial design was: if PINGs are not responded to, also evict the channel (which requires a custom ChannelDuplexHandler). Since PINGs are now purely for extending idleness, I could use this.

@xinlian12 (Member) Mar 25, 2026:

`.pingAckDropThreshold(3)` -> will this cause the connection to drop?

Review comment (Member):

Also, I feel that if the service side supports HTTP/2 PING, we should probably enable it by default; it helps with the timeout detection part as well.

(all connections expire together) and the non-determinism of re-rolling jitter each sweep.
Matches reactor-netty 1.3.4's `maxLifeTimeVariance` semantics for easy migration.

6. **Two-phase eviction for lifetime** — Instead of immediately closing a connection past
@xinlian12 (Member) Mar 25, 2026:

Just thinking out loud: with the jitter in place, do we still need the rate limiting? Jitter should already help ensure that connections are not all closed at the same time.

@jeet1995 (Member Author):

This is me being defensive, but that's a valid point. I feel we can make a test-driven decision.

Always faster than the smallest eviction threshold.

9. **30-minute default (defensive)** — .NET uses 5 minutes. We start at 30 minutes with
`[30:01, 30:30]` effective range. Can be tuned down after production validation.
Review comment (Member):

Since the config here is maxLifeTime, maybe the effective range should be [29:30, 30:00], etc.

jeet1995 and others added 6 commits March 31, 2026 16:47
…r, Java 8 compat

- Switch PING keepalive from custom ChannelHandler to reactor-netty native
  pingAckTimeout/pingAckDropThreshold (available since 1.2.12). Simplifies
  code and enables dead connection detection for half-open TCP.
- Fix jitter direction: subtract from base lifetime (effective [29:30, 30:00])
  to match reactor-netty 1.3.4 maxLifeTimeVariance semantics. maxLifeTime is
  now the upper bound, never exceeded.
- Replace Http2PingHealthHandler with HttpConnectionLifecycleUtil (utility
  class for channel attributes and connection expiry stamping).
- Fix Set.copyOf() -> Collections.unmodifiableSet() for Java 8 compatibility.
- Update spec: rate limiter rationale, native PING design, .NET parity table.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wire AddressResolverGroup through ConnectionPolicy → RxDocumentClientImpl →
HttpClientConfig so tests can inject a custom DNS resolver via the
CosmosClientBuilderAccessor bridge pattern.

New test validates the full chain: max lifetime expiry → eviction →
pool creates new connection → FilterableDnsResolverGroup re-resolves
to a different backend IP (IP1 blocked) → traffic moves to IP2.

Production changes:
- ConnectionPolicy: add addressResolverGroup field + getter/setter
- RxDocumentClientImpl.httpClient(): propagate resolver to HttpClientConfig
- CosmosClientBuilder: add field, wire in buildConnectionPolicy()
- ImplementationBridgeHelpers: add setAddressResolverGroup to accessor

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add Features Added entries under 4.80.0-beta.1 (Unreleased) for:
- HTTP connection max lifetime with per-connection jitter for DNS re-resolution
- HTTP/2 PING keepalive via native reactor-netty pingAckTimeout/pingAckDropThreshold

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace native reactor-netty pingAckTimeout (incompatible with custom
evictionPredicate) with a manual Http2PingHandler ChannelDuplexHandler
installed on the parent H2 channel.

The handler:
- Tracks last read/write activity on the parent channel
- Schedules PING frames when idle > configured interval (default 10s)
- Counts PINGs sent and ACKs received (for observability/testing)
- Does NOT close the connection on missed ACKs (keepalive only)
- Detected via Http2MultiplexHandler in pipeline (not channel.parent())

Key finding: reactor-netty's first doOnConnected fires for the parent
TCP channel (parent()==null), not stream channels. H2 parent detection
uses Http2MultiplexHandler presence in the pipeline.

Removed degradedConnectionEvictedByPingHealthCheck test — PING is
keepalive-only, not eviction. Degraded connections handled by response
timeout retry path (6s/6s/10s escalation -> cross-region failover).

Test: pingFramesSentAndAcknowledgedOnIdleConnection
- Installs Http2PingHandler via doOnConnectedCallback on H2 parent
- Configures 3s PING interval, waits 20s idle
- Asserts pingsSent > 0 (proven: pingsSent=5, pingAcksReceived=10)
- Asserts connection survived (same parentChannelId)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 force-pushed the AzCosmos_HttpConnectionMaxLife branch from e308621 to 7d3ec81 on April 3, 2026 at 22:48
@jeet1995 (Member Author) commented Apr 3, 2026:

/azp run java - cosmos - tests

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

The prio qdisc default priomap routes packets by TOS bits to bands
BEFORE tc filters are consulted. Without an explicit priomap, non-SYN
data packets could be routed to the delayed bands (1:1 or 1:2) instead
of the no-delay band (1:3), causing metadata fetch 503 failures.

Fix: set priomap to '2 2 2 ... 2' (all 16 entries point to band 3)
so ALL traffic defaults to no-delay. Only explicitly marked SYN packets
(via iptables mangle MARK) are routed to delay bands by the tc filters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 (Member Author) commented Apr 3, 2026:

/azp run java - cosmos - tests

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 11 commits April 9, 2026 11:57
- Fix 3 source files with incorrect 'native pingAckTimeout' comments
  (HttpClient.java, HttpConnectionLifecycleUtil.java, ReactorNettyClient.java)
  to reflect actual custom Http2PingHandler implementation
- Replace 13+ inline fully qualified class names with imports
  (ReactorNettyClient.java, Http2ConnectionLifecycleTests.java)
- Hardcode TestNG group string in both test files, remove TEST_GROUP static var
- Add clearAllCosmosSystemProperties() helper for wide cleanup in @AfterMethod/@AfterClass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace raw AddressResolverGroup/doOnConnectedCallback fields threaded through
CosmosClientBuilder -> ConnectionPolicy -> HttpClientConfig with a single
IHttpClientInterceptor interface following the pattern from PR Azure#47231.

Production (azure-cosmos):
  - IHttpClientInterceptor: minimal interface with getAddressResolverGroup()
    and getDoOnConnectedCallback(), null-safe in production
  - ConnectionPolicy: no longer exposes Netty types (AddressResolverGroup removed)
  - CosmosClientBuilder: holds IHttpClientInterceptor instead of raw Netty fields

Test (azure-cosmos-test):
  - CosmosHttpClientInterceptor: concrete implementation
  - CosmosInterceptorHelper.registerHttpClientInterceptor(): convenience API
    consistent with existing registerTransportClientInterceptor()

Tests updated to use CosmosInterceptorHelper instead of bridge helpers directly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract duplicated tc netem and iptables helpers into a reusable
NetworkFaultInjector utility class. Consolidates:
- sudo/root detection
- network interface discovery
- addNetworkDelay(delayMs), removeNetworkDelay()
- addPacketDrop(port), removePacketDrop(port)
- removeAll() for wide cleanup

Http2ConnectionLifecycleTests refactored to use NetworkFaultInjector.
Http2ConnectTimeoutBifurcationTests can follow in a subsequent commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tom PING

- Http2ConnectTimeoutBifurcationTests: use NetworkFaultInjector for sudo
  detection, iptables helpers, and cleanup. Remove duplicated methods.
  Per-port delay methods (addPerPortDelay, addPerPortSynDelay) kept locally
  as they are bifurcation-test-specific.
- SPEC: Fix Design Choices #3 and #4 to reflect custom Http2PingHandler
  (not native pingAckTimeout). Fix Architecture diagram, Config table,
  Non-Goals, and .NET Parity sections.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause: 8s tc netem delay was less than the e2e timeout (15s/25s),
so requests completed slowly but successfully instead of timing out.

Fixes:
- Increase tc netem delay from 8s to 20s (exceeds e2e timeout)
- Add 1s settling delay in NetworkFaultInjector.addNetworkDelay() to
  ensure qdisc is active before first packet enters the queue
- Accept both 408/10002 (ReadTimeout) and 408/20008 (e2e cancel) in
  assertContainsGatewayTimeout — both prove the delay caused failure
- Relax retryUsesConsistentParentChannelId to accept >=1 attempt
  (20s delay leaves only 5s of 25s e2e budget — insufficient for retry)

Remaining: multiParentChannelConnectionReuse gets transient 500 from
the thin-client proxy under 100-concurrent-request burst — server-side.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HttpClient.java:
- Always install eviction predicate + evictInBackground (no longer gated
  by maxLifetimeSeconds > 0). Predicate dynamically checks
  Configs.isHttpConnectionMaxLifetimeEnabled() for Phase 2 (lifetime).
  Toggling the flag at runtime disables lifetime eviction without restart;
  dead + idle eviction continue to work.

Http2PingHandler.java:
- Add dynamic Configs.isHttp2PingHealthEnabled() check in maybeSendPing().
  Toggling the flag at runtime stops PINGs on existing connections.
- Make HANDLER_NAME private (only used internally)
- Remove unnecessary volatile from lastActivityNanos (event-loop-bound)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move FilterableDnsResolverGroup to azure-cosmos-test module
  (com.azure.cosmos.test.faultinjection package) for reuse
- Add azure-cosmos-test dependency to azure-cosmos-benchmark pom
- Add dnsBlockingEnabled + dnsBlockingCycleMinutes config to
  TenantWorkloadConfig (JSON-driven, tenantDefaults supported)
- Wire into AsyncBenchmark: inject FilterableDnsResolverGroup via
  CosmosInterceptorHelper, start background scheduler that cycles
  NORMAL -> BLOCKED -> NORMAL on configurable interval
- Add IpRotationHarness test for standalone DNS rotation validation
- Update test imports for new FilterableDnsResolverGroup package

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…CycleMinutes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix CHANGELOG: describe custom Http2PingHandler instead of native pingAckTimeout
- Add jitter > lifetime guard in HttpConnectionLifecycleUtil to prevent connection storms
- Remove stale HTTP2_PING_ACK_TIMEOUT_IN_SECONDS test property (dead code from earlier design)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3 participants