Goal
Restore baseline-h2 / 1024c from current ~138k back to ≥190k (pre-regression a29459a), without losing the win on larger bodies (+105% on 16K, +27% on 64K).
Context
A bisect across 55 commits identified a single culprit: 597a474 "perf(tls): coalesce h2 emit through one SSL_write_ex per tick". The gather pattern is more expensive than the simple BIO_write loop on small responses.
Profiler (perf record :u), top deltas GOOD a29459a → BAD 597a474:
| Function | GOOD | BAD | Δ |
|---|---|---|---|
| zval_ptr_dtor | 1.8 | 5.1 | +3.4 |
| _emalloc | 9.9 | 12.9 | +3.0 |
| zend_hash_str_find | 2.5 | 4.8 | +2.3 |
| zend_hash_destroy | — | 1.8 | new |
| http_response_free | 1.0 | 2.3 | +1.4 |
Root cause: per-pass emalloc(records[]) + emalloc(body_refs[]) plus a batch of OBJ_RELEASE calls in a row. On a 3B body that means 300+ records per pass → hot churn in _emalloc and cache-cold teardown in zval_ptr_dtor.
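The per-pass allocation churn described above can be avoided by keeping the array alive between passes and only resetting its length. The following is a minimal standalone sketch, not the actual patch: `emit_record`, `emit_scratch`, and the `scratch_*` helpers are hypothetical names, and plain `realloc`/`free` stand in for the Zend allocator so the block compiles outside PHP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record descriptor; the real struct may differ. */
typedef struct {
    const void *data;
    size_t      len;
} emit_record;

/* Hypothetical per-connection scratch space that survives across passes. */
typedef struct {
    emit_record *records;
    size_t       cap;        /* allocated slots */
    size_t       len;        /* slots used in the current pass */
    size_t       grow_count; /* number of (re)allocations, for illustration */
} emit_scratch;

/* Start a new pass: keep the allocation, forget the contents. */
static void scratch_reset(emit_scratch *s)
{
    s->len = 0;
}

/* Append one record, growing geometrically only when capacity runs out.
 * Returns 0 on success, -1 on allocation failure. */
static int scratch_push(emit_scratch *s, const void *data, size_t len)
{
    if (s->len == s->cap) {
        size_t new_cap = s->cap ? s->cap * 2 : 64;
        /* realloc stands in for the Zend allocator in this sketch */
        emit_record *p = realloc(s->records, new_cap * sizeof(*p));
        if (p == NULL) {
            return -1;
        }
        s->records = p;
        s->cap = new_cap;
        s->grow_count++;
    }
    s->records[s->len].data = data;
    s->records[s->len].len = len;
    s->len++;
    return 0;
}

/* Release the scratch space once, when the connection goes away. */
static void scratch_free(emit_scratch *s)
{
    free(s->records);
    s->records = NULL;
    s->cap = 0;
    s->len = 0;
}
```

With 300 records per pass, the array grows a handful of times during the first pass and later passes touch the allocator not at all, which is the effect the reuse subtask is after.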
kTLS is out of scope — the memory-BIO architecture is incompatible with kernel-side crypto.
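Subtasks (Track A)
- A1: reuse records[] / body_refs[] across passes
- A3: defer OBJ_RELEASE until on_stream_close
- byte_cap (16K → 64K-128K)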
Work order
- A1 + A3 (reuse arrays + defer OBJ_RELEASE) — cheapest, 1-2 hours, should recover most of the −31%.
- Bench → if ≥180k is restored, A2/A4 are optional.
- Otherwise A2 → A4.
- Each change is a separate commit perf(tls): ... with the bench result in the commit message.
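One possible reading of the "defer OBJ_RELEASE" step is to park references on the stream during the emit pass and drop them together when the stream closes. The sketch below assumes that reading; `body_obj`, `stream_state`, `PENDING_MAX`, and the helper functions are hypothetical stand-ins, and a plain integer refcount replaces the Zend object machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a refcounted body object (a zend_object in the real code). */
typedef struct {
    int refcount;
    int destroyed; /* set when the last reference is dropped */
} body_obj;

/* Drop one reference; the real code would free the object at zero. */
static void body_release(body_obj *o)
{
    assert(o->refcount > 0);
    if (--o->refcount == 0) {
        o->destroyed = 1;
    }
}

#define PENDING_MAX 8 /* hypothetical fixed bound for this sketch */

/* Hypothetical per-stream state holding references to drop later. */
typedef struct {
    body_obj *pending[PENDING_MAX];
    size_t    n_pending;
} stream_state;

/* Emit path: park the reference instead of releasing it immediately.
 * Falls back to an immediate release when the pending list is full. */
static void stream_defer_release(stream_state *st, body_obj *o)
{
    if (st->n_pending < PENDING_MAX) {
        st->pending[st->n_pending++] = o;
    } else {
        body_release(o);
    }
}

/* Close path: release everything that was parked during emit passes. */
static void stream_on_close(stream_state *st)
{
    for (size_t i = 0; i < st->n_pending; i++) {
        body_release(st->pending[i]);
    }
    st->n_pending = 0;
}
```

The trade-off is that body objects live slightly longer, so the pending list needs a bound (here a fixed array with an immediate-release fallback) to keep memory use predictable on long-lived streams.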
Out of scope
- Do NOT fully revert 597a474 — it has a genuine win on 16K+ bodies.
- Do NOT touch the h2c iov path.
- Do NOT touch handshake / BIO bridge.
- No kTLS / sendfile fast-path.
Acceptance metrics
| Metric | Target |
|---|---|
| baseline-h2 / c=100 m=32 / body=3B | ≥190k |
| baseline-h2 / c=100 m=32 / body=16K | keep +105% |
| baseline-h2 / c=100 m=32 / body=64K | keep +27% |
| _emalloc in perf | ≤10% |
| zval_ptr_dtor in perf | ≤2% |
Bench: /home/edmond/bisect-work/run.sh (h2 TLS c=100 m=32). Regression threshold: ≥180k, otherwise revert.