OHTTP updates#74

Merged
adambalogh merged 3 commits into claude/anonymous-inference-privacy-SgzWN from ani/testing-fixes
May 18, 2026

@dixitaniket
Collaborator

No description provided.

Comment thread tee_gateway/__main__.py
Comment thread tee_gateway/__main__.py
@adambalogh adambalogh marked this pull request as ready for review May 18, 2026 15:53
@adambalogh adambalogh changed the title [WIP] OHTTP updates OHTTP updates May 18, 2026
@adambalogh adambalogh merged commit 95a17ee into claude/anonymous-inference-privacy-SgzWN May 18, 2026
5 checks passed
adambalogh added a commit that referenced this pull request May 18, 2026
* Add OHTTP-style anonymous inference endpoint

Implements RFC 9458 Oblivious HTTP encapsulation so clients can submit chat
completions through an independent relay without exposing their IP to the
enclave or their prompt to the relay. The HPKE X25519 keypair is generated
alongside the existing RSA signing key and bound to the same nitriding
registration digest, so the Nitro attestation document commits to both.

- tee_gateway/ohttp.py: HPKE wrap/unwrap helpers (DHKEM(X25519)/HKDF-SHA256/
  ChaCha20-Poly1305). Response keying derived per-context per RFC 9458 §4.2.
- tee_gateway/tee_manager.py: HPKE keypair, key-config blob, attestation
  document now includes the HPKE public key.
- tee_gateway/controllers/ohttp_controller.py: /v1/ohttp dispatches the
  decrypted request to the existing chat handler, scrubs identifying fields
  before forwarding upstream, refuses stream=true.
- /v1/ohttp/config exposes the HPKE key config for client discovery.
- Test coverage: round-trip, wrong-suite, truncated input, tampered ciphertext.

Known limitation: payment gating is not yet wired for this endpoint; a
blind-token layer will follow in a separate change.

https://claude.ai/code/session_01WyddtSz2rtiP61LtVJbsJy

* Update test_ohttp.py

* lint

* Add OHTTP anonymous chat completions with x402 payment integration (#71)

* OHTTP: derive HPKE from TEE RSA key + gate /v1/ohttp behind x402

* Replace the random os.urandom() seed for the HPKE keypair with an HKDF
  derivation from the RSA TEE private key (PKCS8 DER) salted with the RSA
  public DER. The HPKE keypair is now a deterministic function of the
  attested RSA key — anything that attests the RSA signing key implicitly
  covers the X25519 OHTTP key, with no separate randomness source to
  attest. Domain-separated info "og-tee-hpke-x25519-v1" pins the
  derivation to this use.
* ohttp.generate_keypair() -> ohttp.derive_keypair(seed), with explicit
  >=32-byte seed validation. Tests cover deterministic output for the
  same seed and rejection of short seeds.
* Add /v1/ohttp to the x402 payment middleware routes with the same
  CHAT_COMPLETIONS_OPG_SESSION_MAX_SPEND cap and upto scheme used by
  /v1/chat/completions. Anonymous inference is now metered identically
  to the public chat endpoint.
* Bridge the encrypted request/response back to the token-based cost
  calculator via a thread-local set in the OHTTP controller. The
  calculator detects path=/v1/ohttp and uses the stashed plaintext
  inner request/response instead of the (unparseable) ciphertext bytes
  the middleware would otherwise see.
* Fix the response-export length to max(Nn, Nk) per RFC 9458 §4.4; the
  prior _NK was equal here for ChaCha20-Poly1305 but would silently
  break under a different AEAD.
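
A minimal sketch of the response key schedule that the export-length fix above refers to, written with the standard library only. The function names and the stand-in `secret` are assumptions for illustration; in the gateway the secret would come from the HPKE context export, and this is not the code in `tee_gateway/ohttp.py`.

```python
import hashlib
import hmac
from typing import Tuple

NK = 32  # ChaCha20-Poly1305 key length in bytes
NN = 12  # ChaCha20-Poly1305 nonce length in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract with SHA-256 (RFC 5869)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand with SHA-256 (RFC 5869)."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_response_keys(secret: bytes, enc: bytes, response_nonce: bytes) -> Tuple[bytes, bytes]:
    """Derive (aead_key, aead_nonce) from the exported secret.

    Both the exported secret and the response nonce are max(Nn, Nk) bytes
    long, so the lengths stay correct if the AEAD is swapped for one
    whose nonce is longer than its key.
    """
    expected = max(NN, NK)
    if len(secret) != expected or len(response_nonce) != expected:
        raise ValueError("secret and response_nonce must be max(Nn, Nk) bytes")
    prk = hkdf_extract(enc + response_nonce, secret)
    return hkdf_expand(prk, b"key", NK), hkdf_expand(prk, b"nonce", NN)
```

For ChaCha20-Poly1305, max(Nn, Nk) is 32, which is why using Nk alone happened to give the right length.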

* Refactor /v1/ohttp as a thin WSGI wrapper around /v1/chat/completions

Replace the parallel routing/pricing logic with an in-process WSGI sub-
request: the OHTTP handler decrypts, dispatches the inner request as a
POST /v1/chat/completions through the app's own wsgi_app, captures the
status/headers/body, then encrypts and returns. Everything that already
existed for the public chat endpoint — x402 payment verification, the
pre-inference pricing gate, LangChain routing, post-inference cost
settlement, TEE response signing — runs unchanged for OHTTP requests.

* /v1/ohttp is no longer in the x402 RouteConfig table. Gating happens
  naturally when the sub-request hits /v1/chat/completions; the payment
  header travels inside the sealed envelope as `x-payment` so the relay
  never sees it.
* The thread-local side channel and the OHTTP-specific branch in
  _session_cost_calculator are removed — there is now only one cost
  calculator path for the whole gateway.
* Inner request envelope: `{"x-payment": "...", "body": {...}}`. Inner
  response envelope: `{"status": int, "headers": {...}, "body": ...}`,
  forwarding only x402/TEE settlement headers back to the client.
* Pre-decap errors stay plaintext; post-decap errors are sealed so the
  relay can't distinguish failure modes by response shape.
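
The in-process sub-request described above can be sketched roughly as follows, using only the standard library. `wsgi_subrequest` and `demo_app` are hypothetical names for illustration; the gateway's own helper and the Flask/connexion app it dispatches to are not shown in this PR text.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def wsgi_subrequest(app, path, body, extra_headers=None):
    """POST `body` to `path` on a WSGI app in-process; return (status, headers, payload)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required wsgi.* keys
    environ.update({
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_TYPE": "application/json",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    })
    for name, value in (extra_headers or {}).items():
        environ["HTTP_" + name.upper().replace("-", "_")] = value

    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = int(status.split(" ", 1)[0])
        captured["headers"] = list(headers)  # keep the list form
        return lambda data: None

    result = app(environ, start_response)
    try:
        payload = b"".join(result)
    finally:
        close = getattr(result, "close", None)
        if close is not None:
            close()  # middleware may hook close() for post-response work
    return captured["status"], captured["headers"], payload


def demo_app(environ, start_response):
    """Stand-in for the chat endpoint: reports the path and whether payment was attached."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    request = json.loads(environ["wsgi.input"].read(length))
    reply = {
        "path": environ["PATH_INFO"],
        "paid": "HTTP_X_PAYMENT" in environ,
        "model": request["model"],
    }
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(reply).encode()]
```

Because the sub-request goes through the full WSGI stack, any middleware wrapped around the app sees it the same way it would see an external call.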

* Revert HPKE key derivation; keep random HPKE keypair independent of RSA

Reverts deriving the OHTTP X25519 keypair from the RSA TEE private key.
The HPKE keypair is now freshly random per enclave boot (os.urandom(32)
fed to pyhpke's DeriveKeyPair). The attestation binding still works
because nitriding's transcript covers both public keys, but the two
private keys no longer share a derivation surface: a compromise of one
cannot be used to recover the other.

* ohttp.generate_keypair() restored; ohttp.derive_keypair() removed.
* tee_manager.TEEKeyManager no longer pulls HKDF; HPKE keypair is
  generated independently right after the RSA keypair.
* Test for deterministic derivation replaced with an independence test
  that asserts two generate_keypair() calls return different pubkeys.

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Relay-pays OHTTP: x-payment from outer header, surface usage to relay

Switches /v1/ohttp to the relay-pays model. The client encrypts only
a chat-completion request — no payment material — and a relay between
the client and the enclave supplies the x402 payment as a standard
outer-request header. The enclave reads x-payment from the outer
request, attaches it to the in-process sub-request to
/v1/chat/completions, and lets the existing x402 middleware verify
and settle exactly as it would for a public call.

* Inner plaintext is now bare chat-completion JSON; the {x-payment,
  body} envelope is gone since payment travels outside the seal.
* On 2xx the response body is still HPKE-sealed (it contains user
  prompts/completions), but the outer response surfaces token usage
  as headers so the relay can bill: X-Usage-Prompt-Tokens,
  X-Usage-Completion-Tokens, X-Usage-Total-Tokens, X-Usage-Model.
  x402 settlement and TEE signature headers are also forwarded.
* On non-2xx (402 payment required, validation errors) the body is
  forwarded as plaintext so the relay can read x402 payment
  requirements, retry with a larger payment, or surface errors.
  These bodies never contain user prompts/completions.
* Privacy: relay sees ciphertext + usage + settlement + relay-side
  wallet; never sees prompts, completions, or the client's IP.
  Unlinkability holds unless relay and enclave collude.
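
One way the usage headers above could be built from a completed chat response, as a rough sketch. The function name and the `usage` field names (`prompt_tokens`, `completion_tokens`, `total_tokens`) are assumptions based on the common chat-completion response shape, not taken from the gateway code.

```python
from typing import List, Tuple

# Header name -> field inside the response's "usage" object (assumed shape).
_USAGE_FIELDS = (
    ("X-Usage-Prompt-Tokens", "prompt_tokens"),
    ("X-Usage-Completion-Tokens", "completion_tokens"),
    ("X-Usage-Total-Tokens", "total_tokens"),
)


def usage_headers(response_json: dict) -> List[Tuple[str, str]]:
    """Return outer-response headers carrying token counts only, never content."""
    usage = response_json.get("usage") or {}
    headers = []
    for header, field in _USAGE_FIELDS:
        value = usage.get(field)
        # Only forward plain integers so nothing free-form leaks to the relay.
        if isinstance(value, int) and not isinstance(value, bool):
            headers.append((header, str(value)))
    model = response_json.get("model")
    if isinstance(model, str) and model:
        headers.append(("X-Usage-Model", model))
    return headers
```

Restricting the values to integers and the model name keeps prompt and completion text out of the plaintext outer response.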

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Chunked OHTTP: stream SSE inference responses end-to-end

Adds streaming support per draft-ietf-ohai-chunked-ohttp-08. When the
inner chat-completion request has stream=true, /v1/ohttp pipes the
sub-request's SSE events through a chunked OHTTP encrypter and yields
them as they arrive, instead of buffering. Non-streaming requests
continue to use the existing single-shot RFC 9458 §4.4 path.

ohttp.py:
* QUIC varint encode/decode helpers (RFC 9000 §16).
* New _LABEL_CHUNKED_RESPONSE = "message/bhttp chunked response" and a
  second secret export at decap time; DecapsulatedRequest now carries
  response_key + response_key_chunked so the controller can decide
  which mode to use AFTER inspecting the decrypted body.
* ChunkedResponseEncrypter: response_nonce header, varint(len)||ct per
  chunk (AAD=""), zero-prefix final chunk (AAD=b"final") so truncation
  is detectable, per-chunk nonce = aead_nonce XOR encode_be(counter).
* Extracted _derive_response_keys() shared between single-shot and
  chunked paths (HKDF-Extract on enc||response_nonce, then Expand twice
  for "key" and "nonce").
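
The per-chunk nonce and AAD rules listed above can be illustrated with a short sketch. The function names are hypothetical, and the AEAD sealing itself is left out because it needs a third-party library.

```python
def chunk_nonce(aead_nonce: bytes, counter: int) -> bytes:
    """XOR the base nonce with the big-endian chunk counter, padded to nonce length."""
    # to_bytes raises OverflowError for a negative or oversized counter,
    # so the counter can never silently wrap.
    counter_bytes = counter.to_bytes(len(aead_nonce), "big")
    return bytes(a ^ b for a, b in zip(aead_nonce, counter_bytes))


def chunk_aad(is_final: bool) -> bytes:
    """Non-final chunks use empty AAD; the final chunk is bound to b"final"."""
    return b"final" if is_final else b""
```

Since only the last chunk authenticates under `b"final"`, a stream cut short by the relay never produces a chunk that decrypts as final, which is how the client detects truncation.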

ohttp_controller.py:
* Drop the stream=true rejection. Pass stream through to the inner
  sub-request and detect text/event-stream in the captured headers.
* _wsgi_subrequest now returns the raw iterator instead of draining,
  so the streaming path can pipe chunks through Flask without
  buffering. close() still invoked downstream to trigger x402
  settlement.
* _build_streaming_response: look-ahead-by-one over the inner SSE
  iterator so the last event is sealed with AAD=b"final"; content-type
  message/ohttp-chunked-res; x402/TEE settlement headers forwarded.
  Usage stats stay inside the encrypted stream (final SSE event); the
  relay bills via X-Upto-Session as usual.
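
The look-ahead-by-one pattern mentioned for `_build_streaming_response` could look something like this. `mark_last` is a hypothetical helper name, shown only to illustrate the technique.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def mark_last(items: Iterable[T]) -> Iterator[Tuple[T, bool]]:
    """Yield (item, is_last) pairs by holding back one item at a time."""
    iterator = iter(items)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    for current in iterator:
        yield previous, False
        previous = current
    yield previous, True


# Usage: the streaming path would seal each event with the AAD chosen by is_last.
for event, is_last in mark_last([b"data: a\n\n", b"data: b\n\n", b"data: [DONE]\n\n"]):
    aad = b"final" if is_last else b""
```

The cost is one event of latency, since an event is only emitted once the next one arrives or the stream ends. An empty inner stream yields nothing here, so a caller would still have to emit a final chunk itself in that case.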

Tests: varint round-trip across all 4 length classes, chunked
response round-trip with a hand-rolled client-side decrypter that
walks the varint frames and verifies AAD=b"final", double-finalize
rejection. 96 unit tests total now passing.
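
For reference, a QUIC variable-length integer codec along the lines of RFC 9000 §16 can be sketched as below. The function names are illustrative and are not claimed to match the helpers in `ohttp.py`.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode an integer using the shortest of the four QUIC varint lengths."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    if value < 1 << 6:
        return value.to_bytes(1, "big")                    # prefix 00
    if value < 1 << 14:
        return (value | (0b01 << 14)).to_bytes(2, "big")   # prefix 01
    if value < 1 << 30:
        return (value | (0b10 << 30)).to_bytes(4, "big")   # prefix 10
    if value < 1 << 62:
        return (value | (0b11 << 62)).to_bytes(8, "big")   # prefix 11
    raise ValueError("varint exceeds 62 bits")


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one varint at `offset`; return (value, next_offset)."""
    if offset >= len(data):
        raise ValueError("truncated varint")
    length = 1 << (data[offset] >> 6)  # top two bits select 1, 2, 4 or 8 bytes
    if offset + length > len(data):
        raise ValueError("truncated varint")
    raw = int.from_bytes(data[offset:offset + length], "big")
    value = raw & ((1 << (8 * length - 2)) - 1)  # clear the two prefix bits
    return value, offset + length
```

A chunked-response reader would call `decode_varint` to get each chunk length, with a zero length marking that the final chunk follows.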

* README: document /v1/ohttp anonymous inference + chunked streaming

Adds the two OHTTP endpoints to the API table and a concise section
covering the relay-pays flow, the single-shot vs chunked response
modes, billing channel for each mode, and the relay/enclave/client
trust split. Refs RFC 9458 and draft-ietf-ohai-chunked-ohttp-08.

* Add scripts/test_ohttp.py local smoke-test client

Mirrors scripts/test_bytedance.py but exercises /v1/ohttp end-to-end:
fetches /v1/ohttp/config, cross-checks the HPKE pubkey against the
/signing-key attestation document, HPKE-encapsulates a chat request,
POSTs to /v1/ohttp, and decrypts the response. Supports both single-
shot and chunked OHTTP (--stream); the chunked path decrypts the
varint-framed sealed stream incrementally so you can see SSE events
arrive in real time. Includes a hand-rolled QUIC varint reader so the
script stays usable as a standalone client SDK reference.

Usage examples in the module docstring.

* Forward Authorization header on OHTTP sub-request

The OpenAPI spec declares a global ApiKeyAuth requirement; connexion
enforces it on /v1/chat/completions before any handler runs and
returns 401 "No authorization token provided" when missing. Our
WSGI sub-request from /v1/ohttp arrived without an Authorization
header, so OHTTP requests bounced with 401 before reaching the
chat backend.

security_controller.info_from_ApiKeyAuth is an intentional
passthrough (x402 is the real access control) so any token value
satisfies the schema check. Forward the outer Authorization header
to the sub-request when the relay supplied one, else inject a
placeholder bearer token.

* OHTTP: use a fixed dummy Authorization on the inner sub-request

Don't forward the outer Authorization header to the chat sub-request —
anything the relay attached there (API keys, JWT subjects, bearer
tokens, ...) could re-identify the client and defeat unlinkability.
A constant "Bearer ohttp" placeholder satisfies connexion's
ApiKeyAuth schema check (security_controller is a passthrough; x402
is the real access control) and keeps every OHTTP request
indistinguishable at this layer.

* Add TEE_GATEWAY_DEV_SKIP_X402 dev escape hatch

Set the env var to "1" before /v1/keys is POSTed and the gateway will
skip attaching the x402 payment middleware. Lets developers smoke-test
/v1/chat/completions and /v1/ohttp locally without a reachable
facilitator URL — without it, the middleware's first-request
initialize() blows up on facilitator DNS lookups.

Logs a WARNING when active and is explicitly NOT for production use.

* test_ohttp.py: dump the outgoing OHTTP request so you can eyeball it

Prints the request line, headers, the inner plaintext (clearly labeled
as never-on-the-wire), then a breakdown of the encapsulated body:
the 7-byte OHTTP header, the 32-byte ephemeral X25519 enc, and an
xxd-style hex dump of the AEAD ciphertext. Makes it visually obvious
that the relay only sees opaque sealed bytes — no prompt content, no
model name, no API key, nothing.

* Revert "Add TEE_GATEWAY_DEV_SKIP_X402 dev escape hatch"

This reverts commit 58908aa.

* tee_manager: refuse to register if HPKE pubkey is missing

The v2 attestation transcript labels both the RSA SPKI and the X25519
HPKE pubkey, but the previous (self.hpke_public_key_raw or b"") fallback
would silently produce a "v2"-labeled digest that actually only covers
RSA whenever hpke_public_key_raw was None or empty. A verifier trusting
the label would then accept an enclave whose HPKE key was never bound
to attestation.

Add an explicit length check (must be exactly 32 bytes) outside the
broad try/except, so a real misconfiguration raises clearly instead of
being masked as the "Could not register with nitriding (may not be in
TEE)" warning. Today _generate_keys() always sets both keys so this is
a defense-in-depth guard against future partial-init regressions.
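
A rough illustration of the guard described above. The transcript layout here (length-prefixed labels hashed with SHA-256) and every name in it are assumptions made up for this sketch; the real v2 transcript format is defined in `tee_manager.py` and is not reproduced in this PR text.

```python
import hashlib

HPKE_X25519_PUBKEY_LEN = 32


def registration_digest(rsa_spki_der: bytes, hpke_public_key_raw) -> bytes:
    """Hash both public keys into one digest, refusing to proceed if either is absent."""
    # Fail loudly instead of falling back to b"": a digest labeled as
    # covering both keys must actually cover both.
    if (
        not isinstance(hpke_public_key_raw, (bytes, bytearray))
        or len(hpke_public_key_raw) != HPKE_X25519_PUBKEY_LEN
    ):
        raise RuntimeError("HPKE public key missing or not 32 bytes; refusing to register")
    if not rsa_spki_der:
        raise RuntimeError("RSA public key missing; refusing to register")

    digest = hashlib.sha256()
    fields = ((b"rsa-spki", rsa_spki_der), (b"hpke-x25519", bytes(hpke_public_key_raw)))
    for label, value in fields:
        # Hypothetical framing: 2-byte label length, label, 4-byte value length, value.
        digest.update(len(label).to_bytes(2, "big") + label)
        digest.update(len(value).to_bytes(4, "big") + value)
    return digest.digest()
```

Raising before any broad try/except means a missing key surfaces as a hard error rather than as the "may not be in TEE" warning.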

* ohttp: normalize decap failures to ValueError with a generic message

decapsulate_request's docstring promised ValueError on malformed input,
but recipient.open() raises pyhpke / cryptography exception types on
AEAD tag failure, bad ephemeral keys, etc., so the contract was a lie.
The error strings from those libraries can encode oracle information
about which specific check failed (tag verification vs. length vs. KDF),
which would turn the function into a padding-oracle-style side channel
if any caller logged with exc_info=True.

* Wrap the crypto path (create_recipient_context + open) and re-raise
  as ValueError("HPKE decapsulation failed") with `from None` so the
  underlying exception chain is suppressed entirely. Don't wrap the
  HKDF exports — those are deterministic and can't fail on valid input.
* Bump the minimum input length to 7 + 32 + 16 so truncated inputs hit
  our own "too short" ValueError instead of whatever pyhpke would raise.
* Tighten test_rejects_tampered_ciphertext from pytest.raises(Exception)
  to pytest.raises(ValueError, match="HPKE decapsulation failed") so the
  contract is enforced by tests, not just documented.
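
The error-normalization pattern above, sketched with a stand-in for the HPKE open step. `FakeAeadError` and `_open_sealed` are placeholders invented for this example; they only imitate a library raising its own exception type with a detailed message.

```python
MIN_ENCAPSULATED_LEN = 7 + 32 + 16  # OHTTP header + X25519 enc + AEAD tag


class FakeAeadError(Exception):
    """Placeholder for whatever exception type a crypto library raises."""


def _open_sealed(data: bytes) -> bytes:
    """Stand-in for the real decapsulation; fails with a detailed message."""
    if data[-1] != 0:
        raise FakeAeadError("tag verification failed at final block")
    return data[7 + 32:-16]


def decapsulate(data: bytes, opener=_open_sealed) -> bytes:
    """Return the plaintext, or raise ValueError with a generic message."""
    if len(data) < MIN_ENCAPSULATED_LEN:
        raise ValueError("encapsulated request too short")
    try:
        return opener(data)
    except Exception:
        # `from None` suppresses the chained exception in tracebacks, so a
        # caller logging with exc_info=True does not print which check failed.
        raise ValueError("HPKE decapsulation failed") from None
```

A test can then match on the generic message, as the tightened `test_rejects_tampered_ciphertext` does.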

* ohttp_controller: fix inaccurate privacy claim in docstring

The previous wording said the relay "never sees the client's IP", which
is wrong — in the relay-pays model the client connects directly to the
relay, so the relay necessarily sees the client's IP at the network
layer. The actual privacy property is that the ENCLAVE never sees the
client's IP (it only sees the relay's), and the relay sees only the
encapsulated ciphertext (plus billing metadata it needs), not the
prompt or completion.

Reword to spell out network position vs. compute position for each
party and the precise unlinkability claim (and the collusion caveat).

* README: fix the Anonymous Inference trust-split wording

Mirror the docstring correction in ohttp_controller.py: the relay does
see the client's IP at the network layer (it terminates the TCP/TLS
connection). What it doesn't see is request/response content. The
unlinkability claim is that the ENCLAVE never sees the client's IP and
therefore can't tie a plaintext request to a specific end user.

* docs: clarify how usage stats reach the relay vs. how x402 settles

Previous wording on streaming was wrong / muddled: it implied the relay
could extract usage "inside the encrypted stream's final SSE event",
which is nonsense — the relay can't decrypt. Rewrite both the
controller docstring and the README Anonymous Inference section to
state the actual behavior:

* Source of truth for billing is x402 settlement against the relay's
  x-payment under the `upto` scheme. The gateway settles the real
  cost in both modes.
* stream=false: outer response ALSO exposes X-Usage-* headers so the
  relay can do its own per-token bookkeeping.
* stream=true: NO per-token detail in outer headers — they ship
  before any body chunk, so token counts aren't known yet, and the
  sealed body is opaque. The relay learns the settled amount by
  querying the facilitator with X-Upto-Session or via the next
  X-Payment-Response. Only the client sees per-token detail (from
  the final SSE event inside the decrypted stream).

Drop the "Usage to relay" column from the response-modes table since
the billing channel is now described in its own paragraph.

* ci: add tee_gateway/test/test_ohttp.py to the CI test list

.github/workflows/test.yml runs an explicit list of test files; the
new OHTTP test module wasn't in it, so the HPKE/varint/chunked-OHTTP
cryptography code was passing locally but not exercised in CI on PRs
or pushes to main. Add it to the list.

Followups to flag separately (out of scope for this PR): two other
local-passing unit-test files are similarly excluded from CI —
test_chat_controller.py and test_completions_controller.py — so the
chat and completions controllers don't get continuous coverage either.

* ci: discover tee_gateway/test/ as a directory

Switch the unit-tests step from an explicit per-file list to directory
discovery so newly added test modules aren't silently excluded the way
test_ohttp (this PR) was — and the way test_chat_controller and
test_completions_controller still are on main.

test_price_feed_integration.py raises unittest.SkipTest at module
import when RUN_INTEGRATION_TESTS isn't set, so directory discovery
picks it up but skips cleanly: 188 passed / 3 skipped, same as the
previous explicit invocation plus the two never-covered controller
test files.

tests/test_pricing.py lives outside the package test dir so it stays
listed explicitly.

* ohttp_controller: preserve duplicate forwarded headers (no dict collapse)

WSGI passes response headers as a list of (name, value) tuples precisely
because HTTP allows multi-valued headers. Two cases relevant to us:
RFC 7230 §3.2.2 (duplicates merge by comma but order matters) and
RFC 7235 §4.1 (WWW-Authenticate may legally repeat — one entry per
challenge scheme). The previous dict comprehension flattened the list
and would silently drop a duplicate if x402 ever emitted multiple
payment challenges or any other multi-valued header in our allowlist.

Keep `forwarded` as a list of tuples in both _build_outer_response and
_build_streaming_response; convert the dict-by-construction
_extract_usage_headers values to tuples at merge time. Werkzeug's
Response(headers=...) accepts either form and preserves duplicates when
given a list.
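
The difference between the two forms is easy to see in a small sketch. The allowlist contents and the challenge values are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical allowlist; header names are compared case-insensitively.
FORWARD_ALLOWLIST = {"www-authenticate", "x-payment-response"}


def filter_forwarded(headers: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep allowlisted headers as (name, value) tuples, preserving order and duplicates."""
    return [(name, value) for name, value in headers if name.lower() in FORWARD_ALLOWLIST]


inner_headers = [
    ("Content-Type", "application/json"),
    ("WWW-Authenticate", "SchemeA realm=one"),
    ("WWW-Authenticate", "SchemeB realm=two"),
]

as_list = filter_forwarded(inner_headers)
# The dict form keeps only the last value for a repeated name.
as_dict = {n: v for n, v in inner_headers if n.lower() in FORWARD_ALLOWLIST}
```

Here `as_list` keeps both challenges in order, while `as_dict` retains only the second one.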

* ohttp_controller: drop unused OHTTP_MEDIA_TYPE constant

The request media type was defined for symmetry with the response
constants but never read. Decapsulation itself is the security gate;
the unauthenticated Content-Type header gives us nothing to enforce.
The response constants (OHTTP_RESPONSE_MEDIA_TYPE,
OHTTP_CHUNKED_RESPONSE_MEDIA_TYPE) are still in use.

* pricing

* size limit

* lint

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* usage

* todo

* simplify pricing

* cost

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* controller test

* lint

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* OHTTP updates (#74)

* updates test

* updates

* updates

* readme

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: balogh.adam@icloud.com <adambalogh@mac.mynetworksettings.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Aniket Dixit <47004499+dixitaniket@users.noreply.github.com>
2 participants