perf(h2,tls): hybrid emit selector — DRAIN small bodies, GATHER large (#30) #32

Merged
EdmondDantes merged 2 commits into main from 30-performance-opt-2 on May 19, 2026

@EdmondDantes
Contributor

Summary

  • New hybrid TLS emit selector for HTTP/2 sessions: small responses take the DRAIN path (mem_send + BIO_write, no gather alloc churn), bodies ≥ 2 KiB or streaming take the GATHER path (NO_COPY refs + one SSL_write_ex).
  • Streams are pinned on a per-session counter at submit time; emit selects per-pass based on whether any large stream is in flight.
  • TRUE_ASYNC_H2_TLS_EMIT_MODE env override for A/B testing (`drain` / `gather` / `hybrid` default), read once and cached.
  • docs/H2_TLS_EMIT_STRATEGIES.md describes the three paths and the arithmetic that picks between them.

Bench (release PHP, h2 TLS, c=100 m=32, h2load -t 1, 10s × N median)

| body | gather | drain | hybrid | best |
|---|---|---|---|---|
| dyn 3B | 162k | 235k | 243k | hybrid |
| dyn 16K | 58k | 43k | 57k | hybrid |
| dyn 64K | 18k | 11k | 18k | hybrid |
| static 100B | 125k | 146k | 145k | drain ≈ hybrid |
| static 4K | 83k | 76k | ~83k | gather, hybrid catches it via threshold |
| static 16K | 55k | 40k | 61k | hybrid |
| static 64K | 17k | 12k | 17k | hybrid |

Profile (`perf record -F 999 -g`, static 4K): gather lowers `memmove` from 8.6% to 6.8% (one body memcpy vs two in DRAIN) at the cost of +1.1pp `_emalloc` for the gather scratch. The net −0.7pp CPU translates to ~10% RPS.

Test plan

  • `make` clean rebuild — green
  • `tests/phpt` full suite — 165/170 PASS, 4 pre-existing FAIL (TCP fragmentation, ThreadPool / bootloader, static-workers — all fail identically on parent commit `2b4e3e4`), 1 SKIP
  • h2load alternating drain/hybrid/gather sanity, dynamic 3B / 16K / 64K — hybrid best-of-three

Notes

Documents the two HTTP/2 TLS emit paths and the per-pass selector that
sits between them, with the per-strategy memcpy / allocation arithmetic
and the bench numbers driving the threshold. Companion to the Phase 1
implementation work on the same issue.

Two TLS emit paths now coexist behind an adaptive selector:

  DRAIN  — drain nghttp2 via mem_send into a 16 KiB stack buffer and
           BIO_write straight into the plaintext BIO. No records[] /
           body_refs[] gather machinery, no per-pass emalloc churn.
           Wins on short responses where alloc/zval_ptr_dtor cost
           dominates.

  GATHER — drive nghttp2 via session_send + NO_COPY callbacks, fold
           frames into records[] (with body_refs[] keeping bodies
           alive), then memcpy everything into stage[] and ship with
           one SSL_write_ex. Wins on bodies that fill at least one
           TLS record (amortises cipher setup; only one memcpy of
           the body instead of two — mem_send + BIO_write).
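
The copy counts quoted for the two paths can be written down as a small model. This is a rough sketch rather than code from the PR: the function names are hypothetical, and it counts only body bytes, ignoring frame headers and any copying that happens inside OpenSSL itself.

```c
#include <stddef.h>

/* DRAIN: per the description above, the body is copied twice
 * (once by mem_send into the buffer, once by BIO_write). */
size_t sketch_drain_body_copy_bytes(size_t body_len)
{
    return 2 * body_len;
}

/* GATHER: the body is copied once, into stage[], before SSL_write_ex. */
size_t sketch_gather_body_copy_bytes(size_t body_len)
{
    return body_len;
}

/* Body bytes of memcpy that GATHER saves relative to DRAIN. */
size_t sketch_gather_copy_saving(size_t body_len)
{
    return sketch_drain_body_copy_bytes(body_len)
         - sketch_gather_body_copy_bytes(body_len);
}
```

In this model the saving grows linearly with body size, while the gather bookkeeping (records[] / body_refs[] allocation) is a per-pass cost. That matches the description of DRAIN winning on short responses and GATHER winning on larger ones.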

Selector lives on http2_session_t::large_streams_pending. Each submit
site (dynamic submit_response / submit_response_streaming, static
buffered + streaming submit) pins the counter when the response body
exceeds H2_TLS_HYBRID_LARGE_THRESHOLD (2 KiB); cb_on_stream_close
unpins. Streaming responses with unknown total size are pessimistically
treated as large. http2_session_emit takes DRAIN while the counter is
zero, GATHER otherwise.
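
The pin / unpin / select logic described above might look roughly like the following. This is a minimal sketch, not the PR's code: the struct layouts, the `sketch_*` function names and the per-stream `pinned_large` flag are assumptions made for illustration. Only `large_streams_pending` and `H2_TLS_HYBRID_LARGE_THRESHOLD` are names taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 2 KiB, per the description above. */
#define H2_TLS_HYBRID_LARGE_THRESHOLD 2048u

typedef enum { EMIT_DRAIN, EMIT_GATHER } emit_path_t;

/* Hypothetical reduced session: only the selector counter. */
typedef struct {
    size_t large_streams_pending;
} session_sketch_t;

/* Hypothetical per-stream flag so close unpins only what submit pinned. */
typedef struct {
    bool pinned_large;
} stream_sketch_t;

/* Submit time: pin when the body is large, or when the total size is
 * unknown (streaming). The text says "exceeds", so a strict comparison is
 * used here; the summary writes ">= 2 KiB", so the real boundary may differ. */
void sketch_on_submit(session_sketch_t *s, stream_sketch_t *st,
                      size_t body_len, bool size_known)
{
    st->pinned_large = !size_known || body_len > H2_TLS_HYBRID_LARGE_THRESHOLD;
    if (st->pinned_large) {
        s->large_streams_pending++;
    }
}

/* Stream close: unpin if this stream was counted as large. */
void sketch_on_stream_close(session_sketch_t *s, stream_sketch_t *st)
{
    if (st->pinned_large) {
        assert(s->large_streams_pending > 0);
        s->large_streams_pending--;
        st->pinned_large = false;
    }
}

/* Per-pass selection: DRAIN while no large stream is in flight. */
emit_path_t sketch_select_path(const session_sketch_t *s)
{
    return s->large_streams_pending == 0 ? EMIT_DRAIN : EMIT_GATHER;
}
```

Because the choice is made per pass rather than per stream, a single large stream in flight moves every stream on that session to GATHER until it closes.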

Override the selector with TRUE_ASYNC_H2_TLS_EMIT_MODE = drain | gather
| hybrid (default) for A/B testing; env is read once and cached.
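
A read-once, cached override of this kind could be sketched as below. The enum and function names are hypothetical. Falling back to hybrid on an unrecognised value is an assumption, since the text only states that hybrid is the default. The cache here is a plain static and is not thread-safe.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { MODE_HYBRID, MODE_DRAIN, MODE_GATHER } emit_mode_t;

/* Pure parser. NULL, empty and unrecognised values map to hybrid
 * (the unrecognised-value behaviour is an assumption). */
emit_mode_t sketch_parse_emit_mode(const char *value)
{
    if (value != NULL) {
        if (strcmp(value, "drain") == 0) {
            return MODE_DRAIN;
        }
        if (strcmp(value, "gather") == 0) {
            return MODE_GATHER;
        }
    }
    return MODE_HYBRID;
}

/* Reads TRUE_ASYNC_H2_TLS_EMIT_MODE on the first call only and caches
 * the result; later changes to the environment are not observed. */
emit_mode_t sketch_emit_mode(void)
{
    static int cached = 0;
    static emit_mode_t mode = MODE_HYBRID;

    if (!cached) {
        mode = sketch_parse_emit_mode(getenv("TRUE_ASYNC_H2_TLS_EMIT_MODE"));
        cached = 1;
    }
    return mode;
}
```

For an A/B run the variable would be set in the server's environment before start, for example `TRUE_ASYNC_H2_TLS_EMIT_MODE=drain`, since a cached value cannot be flipped on a live process.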

Bench (release PHP, h2 TLS, c=100 m=32, h2load -t 1, 10s × N median):

  body          gather    drain    hybrid  gather vs drain
  static 100B   125k      146k     145k    drain win (~17%)
  static 1K     111k      120k     ~120k   drain win (~9%)
  static 4K     83k       76k      ~83k    gather win (~10%)
  static 16K    55k       40k      61k     gather win
  static 64K    17k       12k      17k     gather win
  dyn 3B        204k      264k     268k    drain win
  dyn 16K       70k       54k      75k     gather win
  dyn 64K       20k       13k      19k     gather win

Profile diff at static 4K (perf record -F999 -g): gather lowers
memmove from 8.57% to 6.75% (one body memcpy vs two in DRAIN), at
the cost of +1.14pp _emalloc for the gather scratch arrays — net
−0.7pp CPU translates to the ~10% RPS win.
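
As a quick check of the net figure, the two profile deltas can be added up directly. The helper name is hypothetical, and the sketch only reproduces the percentage-point arithmetic; how a −0.7pp CPU change maps to ~10% RPS is not derived here.

```c
/* Net change in CPU share, in percentage points:
 * (memmove after - memmove before) + extra _emalloc share. */
double sketch_net_cpu_pp(double memmove_before, double memmove_after,
                         double emalloc_delta)
{
    return (memmove_after - memmove_before) + emalloc_delta;
}
```

With the numbers above, `sketch_net_cpu_pp(8.57, 6.75, 1.14)` gives (6.75 − 8.57) + 1.14 = −0.68, which rounds to the quoted −0.7pp.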

phpt: server/h2 26/26, server/static+tls 27/28 (pre-existing
004-static-workers failure, unrelated).
EdmondDantes linked an issue on May 19, 2026 that may be closed by this pull request (4 tasks)
@github-actions
Contributor

Coverage

Total lines: 77.12% → 77.16% (+0.04 pp)

| File | Baseline | Current | Δ |
|---|---|---|---|
| src/http2/http2_session.c | 86.03% | 87.14% | +1.12 pp |
| src/http2/http2_static_response.c | 71.72% | 71.67% | -0.04 pp |
| src/http2/http2_strategy.c | 72.22% | 72.54% | +0.32 pp |
| src/http3/http3_callbacks.c | 79.21% | 78.61% | -0.59 pp |

EdmondDantes merged commit 8a8a4cc into main on May 19, 2026 (5 checks passed)
EdmondDantes deleted the 30-performance-opt-2 branch on May 19, 2026 at 10:26
Linked issue (may be closed by this pull request): Performance opt 2