OHTTP updates #74
Merged: adambalogh merged 3 commits into claude/anonymous-inference-privacy-SgzWN on May 18, 2026.
adambalogh reviewed twice on May 18, 2026.
Merge commit 95a17ee into claude/anonymous-inference-privacy-SgzWN; 5 checks passed.
adambalogh added a commit that referenced this pull request on May 18, 2026:
* Add OHTTP-style anonymous inference endpoint

Implements RFC 9458 Oblivious HTTP encapsulation so clients can submit chat completions through an independent relay without exposing their IP to the enclave or their prompt to the relay. The HPKE X25519 keypair is generated alongside the existing RSA signing key and bound to the same nitriding registration digest, so the Nitro attestation document commits to both.

- tee_gateway/ohttp.py: HPKE wrap/unwrap helpers (DHKEM(X25519)/HKDF-SHA256/ChaCha20-Poly1305). Response keys are derived per context per RFC 9458 §4.2.
- tee_gateway/tee_manager.py: HPKE keypair and key-config blob; the attestation document now includes the HPKE public key.
- tee_gateway/controllers/ohttp_controller.py: /v1/ohttp dispatches the decrypted request to the existing chat handler, scrubs identifying fields before forwarding upstream, and refuses stream=true.
- /v1/ohttp/config exposes the HPKE key config for client discovery.
- Test coverage: round-trip, wrong suite, truncated input, tampered ciphertext.

Known limitation: payment gating is not yet wired for this endpoint; a blind-token layer will follow in a separate change.

https://claude.ai/code/session_01WyddtSz2rtiP61LtVJbsJy

* Update test_ohttp.py

* lint

* Add OHTTP anonymous chat completions with x402 payment integration (#71)

* OHTTP: derive HPKE from TEE RSA key + gate /v1/ohttp behind x402

* Replace the random os.urandom() seed for the HPKE keypair with an HKDF derivation from the RSA TEE private key (PKCS8 DER), salted with the RSA public DER. The HPKE keypair is now a deterministic function of the attested RSA key: anything that attests the RSA signing key implicitly covers the X25519 OHTTP key, with no separate randomness source to attest. The domain-separated info string "og-tee-hpke-x25519-v1" pins the derivation to this use.
* ohttp.generate_keypair() -> ohttp.derive_keypair(seed), with explicit >=32-byte seed validation. Tests cover deterministic output for the same seed and rejection of short seeds.
* Add /v1/ohttp to the x402 payment middleware routes with the same CHAT_COMPLETIONS_OPG_SESSION_MAX_SPEND cap and `upto` scheme used by /v1/chat/completions. Anonymous inference is now metered identically to the public chat endpoint.
* Bridge the encrypted request/response back to the token-based cost calculator via a thread-local set in the OHTTP controller. The calculator detects path=/v1/ohttp and uses the stashed plaintext inner request/response instead of the (unparseable) ciphertext bytes the middleware would otherwise see.
* Fix the response-export length to max(Nn, Nk) per RFC 9458 §4.5. The prior _NK happened to give the same length for ChaCha20-Poly1305 (Nk = 32 > Nn = 12) but would silently break under an AEAD whose nonce is longer than its key.

* Refactor /v1/ohttp as a thin WSGI wrapper around /v1/chat/completions

Replace the parallel routing/pricing logic with an in-process WSGI sub-request: the OHTTP handler decrypts, dispatches the inner request as a POST /v1/chat/completions through the app's own wsgi_app, captures the status/headers/body, then encrypts and returns. Everything that already existed for the public chat endpoint (x402 payment verification, the pre-inference pricing gate, LangChain routing, post-inference cost settlement, TEE response signing) runs unchanged for OHTTP requests.

* /v1/ohttp is no longer in the x402 RouteConfig table. Gating happens naturally when the sub-request hits /v1/chat/completions; the payment header travels inside the sealed envelope as `x-payment`, so the relay never sees it.
* The thread-local side channel and the OHTTP-specific branch in _session_cost_calculator are removed; there is now only one cost-calculator path for the whole gateway.
* Inner request envelope: `{"x-payment": "...", "body": {...}}`. Inner response envelope: `{"status": int, "headers": {...}, "body": ...}`, forwarding only x402/TEE settlement headers back to the client.
* Pre-decap errors stay plaintext; post-decap errors are sealed so the relay can't distinguish failure modes by response shape.
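The RFC 9458 response-key schedule behind the max(Nn, Nk) fix (export a secret of max(Nn, Nk) bytes, run HKDF-Extract with salt enc || response_nonce, then HKDF-Expand once for "key" and once for "nonce") can be sketched with only the standard library. This is a minimal sketch, not the gateway's ohttp.py: it assumes the ChaCha20-Poly1305 suite (Nk = 32, Nn = 12), assumes `secret` has already been exported from the HPKE context with the label "message/bhttp response", and the helper names are hypothetical.

```python
import hashlib
import hmac

# Assumed suite: ChaCha20-Poly1305 (32-byte key, 12-byte nonce).
NK = 32
NN = 12
HASH_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256: PRK = HMAC(key=salt, msg=ikm)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-256."""
    if length > 255 * HASH_LEN:
        raise ValueError("HKDF output too long")
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        # T(i) = HMAC(PRK, T(i-1) || info || i), with T(0) empty
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_response_keys(secret: bytes, enc: bytes, response_nonce: bytes) -> tuple[bytes, bytes]:
    """Return (aead_key, aead_nonce) for an encapsulated response.

    `secret` is assumed to be the HPKE export for "message/bhttp response";
    both it and `response_nonce` must be max(Nn, Nk) bytes long.
    """
    export_len = max(NN, NK)  # sizing by NK alone is only correct when NK >= NN
    if len(secret) != export_len or len(response_nonce) != export_len:
        raise ValueError("secret and response_nonce must be max(Nn, Nk) bytes")
    prk = hkdf_extract(enc + response_nonce, secret)  # salt = enc || response_nonce
    return hkdf_expand(prk, b"key", NK), hkdf_expand(prk, b"nonce", NN)
```

In RFC 9458 the server draws `response_nonce` at random (e.g. `os.urandom(max(NN, NK))`), seals the response with the derived key and nonce, and sends `response_nonce || ciphertext`, so the client can rerun the same derivation.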
* Revert HPKE key derivation; keep random HPKE keypair independent of RSA

Reverts deriving the OHTTP X25519 keypair from the RSA TEE private key. The HPKE keypair is now freshly random per enclave boot (os.urandom(32) fed to pyhpke's DeriveKeyPair). The attestation binding still works because nitriding's transcript covers both public keys, but the two private keys no longer share a derivation surface: a compromise of one cannot be used to recover the other.

* ohttp.generate_keypair() restored; ohttp.derive_keypair() removed.
* tee_manager.TEEKeyManager no longer pulls in HKDF; the HPKE keypair is generated independently right after the RSA keypair.
* The test for deterministic derivation is replaced with an independence test that asserts two generate_keypair() calls return different pubkeys.

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Relay-pays OHTTP: x-payment from outer header, surface usage to relay

Switches /v1/ohttp to the relay-pays model. The client encrypts only a chat-completion request (no payment material), and a relay between the client and the enclave supplies the x402 payment as a standard outer-request header. The enclave reads x-payment from the outer request, attaches it to the in-process sub-request to /v1/chat/completions, and lets the existing x402 middleware verify and settle exactly as it would for a public call.

* Inner plaintext is now bare chat-completion JSON; the {x-payment, body} envelope is gone since payment travels outside the seal.
* On 2xx the response body is still HPKE-sealed (it contains user prompts/completions), but the outer response surfaces token usage as headers so the relay can bill: X-Usage-Prompt-Tokens, X-Usage-Completion-Tokens, X-Usage-Total-Tokens, X-Usage-Model. x402 settlement and TEE signature headers are also forwarded.
* On non-2xx (402 Payment Required, validation errors) the body is forwarded as plaintext so the relay can read x402 payment requirements, retry with a larger payment, or surface errors. These bodies never contain user prompts/completions.
* Privacy: the relay sees ciphertext + usage + settlement + relay-side wallet; it never sees prompts, completions, or the client's IP. Unlinkability holds unless relay and enclave collude.

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Chunked OHTTP: stream SSE inference responses end-to-end

Adds streaming support per draft-ietf-ohai-chunked-ohttp-08. When the inner chat-completion request has stream=true, /v1/ohttp pipes the sub-request's SSE events through a chunked OHTTP encrypter and yields them as they arrive instead of buffering. Non-streaming requests continue to use the existing single-shot RFC 9458 §4.5 path.

ohttp.py:
* QUIC varint encode/decode helpers (RFC 9000 §16).
* New _LABEL_CHUNKED_RESPONSE = "message/bhttp chunked response" and a second secret export at decap time; DecapsulatedRequest now carries response_key + response_key_chunked so the controller can decide which mode to use AFTER inspecting the decrypted body.
* ChunkedResponseEncrypter: response_nonce header, varint(len)||ct per chunk (AAD=""), zero-length-prefixed final chunk (AAD=b"final") so truncation is detectable, per-chunk nonce = aead_nonce XOR encode_be(counter).
* Extracted _derive_response_keys(), shared between the single-shot and chunked paths (HKDF-Extract on enc||response_nonce, then Expand twice for "key" and "nonce").

ohttp_controller.py:
* Drop the stream=true rejection. Pass stream through to the inner sub-request and detect text/event-stream in the captured headers.
* _wsgi_subrequest now returns the raw iterator instead of draining it, so the streaming path can pipe chunks through Flask without buffering. close() is still invoked downstream to trigger x402 settlement.
* _build_streaming_response: look-ahead-by-one over the inner SSE iterator so the last event is sealed with AAD=b"final"; content-type message/ohttp-chunked-res; x402/TEE settlement headers forwarded. Usage stats stay inside the encrypted stream (final SSE event); the relay bills via X-Upto-Session as usual.

Tests: varint round-trip across all 4 length classes, chunked response round-trip with a hand-rolled client-side decrypter that walks the varint frames and verifies AAD=b"final", and double-finalize rejection. 96 unit tests now pass in total.

* README: document /v1/ohttp anonymous inference + chunked streaming

Adds the two OHTTP endpoints to the API table and a concise section covering the relay-pays flow, the single-shot vs. chunked response modes, the billing channel for each mode, and the relay/enclave/client trust split. Refs RFC 9458 and draft-ietf-ohai-chunked-ohttp-08.

* Add scripts/test_ohttp.py local smoke-test client

Mirrors scripts/test_bytedance.py but exercises /v1/ohttp end-to-end: fetches /v1/ohttp/config, cross-checks the HPKE pubkey against the /signing-key attestation document, HPKE-encapsulates a chat request, POSTs to /v1/ohttp, and decrypts the response. Supports both single-shot and chunked OHTTP (--stream); the chunked path decrypts the varint-framed sealed stream incrementally so you can see SSE events arrive in real time. Includes a hand-rolled QUIC varint reader so the script stays usable as a standalone client SDK reference. Usage examples are in the module docstring.

* Forward Authorization header on OHTTP sub-request

The OpenAPI spec declares a global ApiKeyAuth requirement; connexion enforces it on /v1/chat/completions before any handler runs and returns 401 "No authorization token provided" when the header is missing. Our WSGI sub-request from /v1/ohttp arrived without an Authorization header, so OHTTP requests bounced with 401 before reaching the chat backend. security_controller.info_from_ApiKeyAuth is an intentional passthrough (x402 is the real access control), so any token value satisfies the schema check. Forward the outer Authorization header to the sub-request when the relay supplied one; otherwise inject a placeholder bearer token.

* OHTTP: use a fixed dummy Authorization on the inner sub-request

Don't forward the outer Authorization header to the chat sub-request: anything the relay attached there (API keys, JWT subjects, bearer tokens, ...) could re-identify the client and defeat unlinkability. A constant "Bearer ohttp" placeholder satisfies connexion's ApiKeyAuth schema check (security_controller is a passthrough; x402 is the real access control) and keeps every OHTTP request indistinguishable at this layer.

* Add TEE_GATEWAY_DEV_SKIP_X402 dev escape hatch

Set the env var to "1" before /v1/keys is POSTed and the gateway will skip attaching the x402 payment middleware. This lets developers smoke-test /v1/chat/completions and /v1/ohttp locally without a reachable facilitator URL; without it, the middleware's first-request initialize() blows up on facilitator DNS lookups. Logs a WARNING when active and is explicitly NOT for production use.

* test_ohttp.py: dump the outgoing OHTTP request so you can eyeball it

Prints the request line, the headers, the inner plaintext (clearly labeled as never-on-the-wire), then a breakdown of the encapsulated body: the 7-byte OHTTP header, the 32-byte ephemeral X25519 enc, and an xxd-style hex dump of the AEAD ciphertext. This makes it visually obvious that the relay only sees opaque sealed bytes: no prompt content, no model name, no API key, nothing.

* Revert "Add TEE_GATEWAY_DEV_SKIP_X402 dev escape hatch"

This reverts commit 58908aa.
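The 7-byte header and 32-byte ephemeral key that the smoke-test script prints follow the RFC 9458 request layout: a one-byte key ID, then two-byte KEM, KDF, and AEAD identifiers in network byte order, then `enc`, then the AEAD ciphertext. The parser below is a hypothetical sketch rather than the gateway's code. The suite constants are the RFC 9180 code points for DHKEM(X25519, HKDF-SHA256) (0x0020), HKDF-SHA256 (0x0001), and ChaCha20-Poly1305 (0x0003), and `expected_key_id` is a placeholder for whatever key ID /v1/ohttp/config advertises.

```python
import struct
from dataclasses import dataclass

HEADER_LEN = 7   # key_id (1) + kem_id (2) + kdf_id (2) + aead_id (2)
ENC_LEN = 32     # X25519 ephemeral public key
TAG_LEN = 16     # Poly1305 authentication tag
MIN_REQUEST_LEN = HEADER_LEN + ENC_LEN + TAG_LEN  # 55 bytes

# RFC 9180 identifiers for the suite named in the commit message.
KEM_X25519_HKDF_SHA256 = 0x0020
KDF_HKDF_SHA256 = 0x0001
AEAD_CHACHA20_POLY1305 = 0x0003


@dataclass(frozen=True)
class EncapsulatedRequest:
    key_id: int
    kem_id: int
    kdf_id: int
    aead_id: int
    enc: bytes
    ciphertext: bytes


def parse_request(data: bytes, expected_key_id: int) -> EncapsulatedRequest:
    """Split an OHTTP request into header fields, enc, and ciphertext (hypothetical helper)."""
    if len(data) < MIN_REQUEST_LEN:
        raise ValueError("encapsulated request too short")
    # "!BHHH": network byte order, no padding, 1 + 2 + 2 + 2 = 7 bytes.
    key_id, kem_id, kdf_id, aead_id = struct.unpack("!BHHH", data[:HEADER_LEN])
    expected = (expected_key_id, KEM_X25519_HKDF_SHA256, KDF_HKDF_SHA256, AEAD_CHACHA20_POLY1305)
    if (key_id, kem_id, kdf_id, aead_id) != expected:
        raise ValueError("unsupported key configuration")
    enc = data[HEADER_LEN:HEADER_LEN + ENC_LEN]
    ciphertext = data[HEADER_LEN + ENC_LEN:]
    return EncapsulatedRequest(key_id, kem_id, kdf_id, aead_id, enc, ciphertext)
```

Checking the length before touching the crypto library mirrors the "too short" guard described in the decapsulation commit: malformed input fails with a local, predictable error.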
* tee_manager: refuse to register if HPKE pubkey is missing

The v2 attestation transcript labels both the RSA SPKI and the X25519 HPKE pubkey, but the previous `(self.hpke_public_key_raw or b"")` fallback would silently produce a "v2"-labeled digest that actually covers only RSA whenever hpke_public_key_raw was None or empty. A verifier trusting the label would then accept an enclave whose HPKE key was never bound to attestation. Add an explicit length check (must be exactly 32 bytes) outside the broad try/except, so a real misconfiguration raises clearly instead of being masked as the "Could not register with nitriding (may not be in TEE)" warning. Today _generate_keys() always sets both keys, so this is a defense-in-depth guard against future partial-init regressions.

* ohttp: normalize decap failures to ValueError with a generic message

decapsulate_request's docstring promised ValueError on malformed input, but recipient.open() raises pyhpke / cryptography exception types on AEAD tag failure, bad ephemeral keys, etc., so the contract was a lie. The error strings from those libraries can encode oracle information about which specific check failed (tag verification vs. length vs. KDF), which would turn the function into a padding-oracle-style side channel if any caller logged with exc_info=True.

* Wrap the crypto path (create_recipient_context + open) and re-raise as ValueError("HPKE decapsulation failed") with `from None`, so the underlying exception chain is suppressed. Don't wrap the HKDF exports; those are deterministic and can't fail on valid input.
* Bump the minimum input length to 7 + 32 + 16 (header + X25519 enc + AEAD tag) so truncated inputs hit our own "too short" ValueError instead of whatever pyhpke would raise.
* Tighten test_rejects_tampered_ciphertext from pytest.raises(Exception) to pytest.raises(ValueError, match="HPKE decapsulation failed") so the contract is enforced by tests, not just documented.
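The two guards in these commits share one idea: fail loudly on a malformed key at registration time, and fail uniformly on any bad ciphertext. A minimal sketch with hypothetical names (`require_hpke_pubkey`, `open_or_fail`); `open_fn` is a caller-supplied stand-in for the pyhpke recipient-context call, not the library's real API, and this is not the gateway's actual code.

```python
from typing import Callable, Optional

X25519_PUBKEY_LEN = 32


def require_hpke_pubkey(raw: Optional[bytes]) -> bytes:
    """Refuse to proceed unless the HPKE public key is exactly 32 bytes.

    Raising here, instead of falling back to b"", keeps a "v2"-labeled
    transcript from silently covering only the RSA key.
    """
    if raw is None or len(raw) != X25519_PUBKEY_LEN:
        raise RuntimeError("HPKE public key missing or malformed; refusing to register")
    return raw


def open_or_fail(open_fn: Callable[[bytes], bytes], ciphertext: bytes) -> bytes:
    """Run the decryption step, collapsing every failure into one generic ValueError."""
    try:
        return open_fn(ciphertext)
    except Exception:
        # `from None` sets __cause__ to None and __suppress_context__ to True,
        # so formatted tracebacks (including logging with exc_info=True) omit
        # the library's specific exception type and message.
        raise ValueError("HPKE decapsulation failed") from None
```

Only the decryption call is wrapped; per the commit, deterministic steps such as the HKDF exports are left outside the try block so genuine bugs there are not masked.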
* ohttp_controller: fix inaccurate privacy claim in docstring

The previous wording said the relay "never sees the client's IP", which is wrong: in the relay-pays model the client connects directly to the relay, so the relay necessarily sees the client's IP at the network layer. The actual privacy property is that the ENCLAVE never sees the client's IP (it only sees the relay's), and the relay sees only the encapsulated ciphertext (plus the billing metadata it needs), not the prompt or completion. Reword to spell out network position vs. compute position for each party and the precise unlinkability claim (and the collusion caveat).

* README: fix the Anonymous Inference trust-split wording

Mirror the docstring correction in ohttp_controller.py: the relay does see the client's IP at the network layer (it terminates the TCP/TLS connection). What it doesn't see is request/response content. The unlinkability claim is that the ENCLAVE never sees the client's IP and therefore can't tie a plaintext request to a specific end user.

* docs: clarify how usage stats reach the relay vs. how x402 settles

The previous wording on streaming was wrong and muddled: it implied the relay could extract usage "inside the encrypted stream's final SSE event", which is nonsense, since the relay can't decrypt. Rewrite both the controller docstring and the README Anonymous Inference section to state the actual behavior:

* The source of truth for billing is x402 settlement against the relay's x-payment under the `upto` scheme. The gateway settles the real cost in both modes.
* stream=false: the outer response ALSO exposes X-Usage-* headers so the relay can do its own per-token bookkeeping.
* stream=true: NO per-token detail in the outer headers. They ship before any body chunk, so token counts aren't known yet, and the sealed body is opaque. The relay learns the settled amount by querying the facilitator with X-Upto-Session or via the next X-Payment-Response. Only the client sees per-token detail (from the final SSE event inside the decrypted stream).

Drop the "Usage to relay" column from the response-modes table, since the billing channel is now described in its own paragraph.

* ci: add tee_gateway/test/test_ohttp.py to the CI test list

.github/workflows/test.yml runs an explicit list of test files; the new OHTTP test module wasn't in it, so the HPKE/varint/chunked-OHTTP cryptography code passed locally but was not exercised in CI on PRs or pushes to main. Add it to the list.

Follow-ups to flag separately (out of scope for this PR): two other locally passing unit-test files, test_chat_controller.py and test_completions_controller.py, are similarly excluded from CI, so the chat and completions controllers don't get continuous coverage either.

* ci: discover tee_gateway/test/ as a directory

Switch the unit-tests step from an explicit per-file list to directory discovery so newly added test modules aren't silently excluded the way test_ohttp (this PR) was, and the way test_chat_controller and test_completions_controller still are on main. test_price_feed_integration.py raises unittest.SkipTest at module import when RUN_INTEGRATION_TESTS isn't set, so directory discovery picks it up but skips it cleanly: 188 passed / 3 skipped, matching the previous explicit invocation plus the two never-covered controller test files. tests/test_pricing.py lives outside the package test dir, so it stays listed explicitly.

* ohttp_controller: preserve duplicate forwarded headers (no dict collapse)

WSGI passes response headers as a list of (name, value) tuples precisely because HTTP allows multi-valued headers. Two cases are relevant to us: RFC 7230 §3.2.2 (duplicates merge by comma, but order matters) and RFC 7235 §4.1 (WWW-Authenticate may legally repeat, one entry per challenge scheme). The previous dict comprehension flattened the list and would silently drop a duplicate if x402 ever emitted multiple payment challenges or any other multi-valued header in our allowlist. Keep `forwarded` as a list of tuples in both _build_outer_response and _build_streaming_response; convert the dict-by-construction _extract_usage_headers values to tuples at merge time. Werkzeug's Response(headers=...) accepts either form and preserves duplicates when given a list.

* ohttp_controller: drop unused OHTTP_MEDIA_TYPE constant

The request media type was defined for symmetry with the response constants but never read. Decapsulation itself is the security gate; the unauthenticated Content-Type header gives us nothing to enforce. The response constants (OHTTP_RESPONSE_MEDIA_TYPE, OHTTP_CHUNKED_RESPONSE_MEDIA_TYPE) are still in use.

* pricing

* size limit

* lint

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* usage

* todo

* simplify pricing

* cost

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* controller test

* lint

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* OHTTP updates (#74)

* updates test

* updates

* updates

* readme

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: balogh.adam@icloud.com <adambalogh@mac.mynetworksettings.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Aniket Dixit <47004499+dixitaniket@users.noreply.github.com>
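The duplicate-header fix in "ohttp_controller: preserve duplicate forwarded headers" comes down to filtering the WSGI `(name, value)` list without ever building a dict. A minimal sketch: the allowlist contents and helper names are hypothetical (the gateway's real allowlist is not shown in this PR), while the X-Usage-* and X-Upto-Session names come from the commits above.

```python
from typing import Dict, Iterable, List, Tuple

Header = Tuple[str, str]

# Hypothetical allowlist of response headers forwarded to the relay (matched case-insensitively).
FORWARDED = frozenset({"www-authenticate", "x-payment-response", "x-upto-session"})


def select_forwarded(headers: Iterable[Header], allowlist: frozenset = FORWARDED) -> List[Header]:
    """Keep allowlisted headers, preserving their order and any repeats."""
    return [(name, value) for name, value in headers if name.lower() in allowlist]


def merge_usage(forwarded: List[Header], usage: Dict[str, str]) -> List[Header]:
    """Append usage headers (built as a dict) as tuples without deduplicating `forwarded`."""
    return forwarded + [(name, value) for name, value in usage.items()]


inner = [
    ("Content-Type", "application/json"),
    ("WWW-Authenticate", 'Bearer realm="a"'),
    ("WWW-Authenticate", 'Basic realm="b"'),
    ("X-Upto-Session", "s-123"),
]
outer = merge_usage(select_forwarded(inner), {"X-Usage-Total-Tokens": "42"})
# Both WWW-Authenticate entries survive; a dict comprehension would keep only the last one.
```

Because the result stays a list of tuples, it can be handed to a response object that accepts header lists without losing the repeated challenge.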
No description provided.