RFC-0054: HUD Integration for Out-of-Tree CI Results#96
Conversation
Defines the HUD-side ingestion and display layer for OOT CI results, building on RFC-0050 (Cross-Repository CI Relay). Covers write path, storage schemas, DB protection, security, and three frontend views. Reference implementation: subinz1/test-infra#1
|
Hi @jewelkm89! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
There was a problem hiding this comment.
A few quick high level comments based on the PR description (I really appreciate the diagram btw!):
-
I see some payload validation happening in HUD. Why not have that in Relay Server instead? In HUD, auth check should likely be limited to authenticating the relay server's header value and a maaaaybe payload cap to ensure we don't overburden the database. Everything else probably will find a better home in Relay Server, where the logic can be centralized. And the payload cap is likely worth adding to the relay server as well (HUD's check is more of a sanity check as the db guardian.)
-
Why does the Storage box show that only completed jobs will be replicated to ClickHouse? I think the default behavior would be to replicate both in progress and completed, and that would offer more value as well. Current practice for in-tree CI is we update the job status in clickhouse (in progress -> completed) and we should do the same for OOT too. That's why we're storing the data in dynamo, else we could have stored it in S3
Security note: We should give the relay tool it's own unique header to send to hud and only that header should be allowed to make these relay api calls. The X-Hud-Internal-Bot is a more widely used key and so far it has only been used for readonly operations. Let's not give it read-write access since the key is at higher risk of leaking.
ZainRizvi
left a comment
There was a problem hiding this comment.
Nice work — the architecture diagrams are really clear and the RFC builds on RFC-0050 well. A few things to sort out, mostly around aligning with the L2 PR's actual wire format and a couple of bugs in the extraction logic.
| const pr = cb.payload?.pull_request; | ||
|
|
||
| const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`; | ||
| const now = new Date().toISOString(); |
There was a problem hiding this comment.
There are a couple of bugs here:
-
started_atis set tonow()on every callback, includingcompleted. That's the time HUD received the POST, not when the job started. On a completed callback,started_at ≈ completed_at, so the ClickHouseduration_secondsalias will always be ≈ 0. Neitherstarted_atnorcompleted_atare available in thegithubcontext, so the action should capture its own wall-clock time and include it in the payload —in_progresssendsstarted_at,completedsendscompleted_at. Then HUD stores what the action reports instead of usingnow(). -
queue_timeis set fromtrusted.ci_metrics?.queue_time, which isnullon completed callbacks (the relay only sends it onin_progress). This needs to skip settingqueue_timewhen it's null rather than writing null over a valid value. Same applies toexecution_timeonin_progresscallbacks.
Also, the DynamoDB write needs to use UpdateItem with SET expressions instead of PutItem. PutItem does a full item replace, so even with the above fixes, the completed callback would clobber queue_time and started_at from the in_progress record. UpdateItem lets each callback write only its own fields.
There was a problem hiding this comment.
All three fixed:
- started_at/completed_at now use downstream-reported timestamps (wf.started_at, wf.completed_at) instead of new Date().toISOString(). Note: these fields need to be added to the L2 PR's callback action (action.yml), which currently doesn't send wall-clock timestamps.
- queue_time/execution_time are only set when the relay provides a non-null value. in_progress sets queue_time; completed sets execution_time. Neither overwrites the other.
- Changed from PutItem to UpdateItem with SET expressions throughout — Hop 3, data flow table, code samples, and implementation plan.
| const wf = cb.workflow; | ||
| const pr = cb.payload?.pull_request; | ||
|
|
||
| const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`; |
There was a problem hiding this comment.
This key needs to include the job name and attempt number. The callback action runs inside a specific job, and if a downstream repo uses it in multiple jobs within the same workflow (build, test-cpu, test-gpu), all those jobs share the same github.workflow name and would write to the same dynamoKey — last writer wins, the rest are silently lost. And without the attempt number, job retries would overwrite the original run.
Let's add github.job and github.run_attempt to the key: {repo}/{delivery_id}/{workflow_name}/{job_name}/{attempt}. That gives each job its own row, preserves retry history, and makes the /oot/[org]/[repo] matrix view ("columns = OOT jobs") actually useful.
There was a problem hiding this comment.
Thanks you for the thorough review @ZainRizvi, Changed from {repo}/{delivery_id}/{workflow_name} to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}. Added job_name and run_attempt to both DynamoDB and ClickHouse schemas. The L2 callback action needs to be updated to include github.job and github.run_attempt in the payload.
|
|
||
| ### Sample Payloads | ||
|
|
||
| #### In-Progress Callback |
There was a problem hiding this comment.
These samples don't match the actual wire format. They show a flat structure with top-level status, head_sha, pr_number, etc. But the L2 PR's action.yml sends a nested structure — delivery_id and payload from the dispatch envelope, with a workflow sibling containing status/conclusion/name/url. Your own TypeScript code (RelayPayload, extractDynamoRecord) parses the nested format correctly, so the code is right — but a downstream developer implementing from these samples would produce payloads the system can't parse. These need to match the actual {delivery_id, payload, workflow} structure from the action.
There was a problem hiding this comment.
Replaced all flat-structure samples with the actual nested wire format, shown at two stages:
Stage 1 (downstream → relay): {event_type, delivery_id, payload, workflow} — matching the callback action's output from action.yml
Stage 2 (relay → HUD): {trusted: {verified_repo, ci_metrics}, untrusted: {callback_payload: ...}} — matching forward_to_hud in hud.py
Also added a note about fields (job_name, run_attempt, started_at, completed_at) that need to be added to the L2 action.
| This hop requires **no new code**. The existing infrastructure handles it: | ||
|
|
||
| 1. **DynamoDB Streams** (enabled on the table with `NEW_AND_OLD_IMAGES`) captures every write | ||
| 2. **`clickhouse-replicator-dynamo`** receives stream events and inserts `completed` records into ClickHouse (`in_progress` records are filtered out — they serve only as mutable state in DynamoDB) |
There was a problem hiding this comment.
in_progress records need to be replicated to ClickHouse too, not just completed. Current practice for in-tree CI is we update job status in ClickHouse (in_progress → completed) and OOT should follow the same pattern. SharedReplacingMergeTree already supports upserts. Without this, the /pr/N page wouldn't be able to show running jobs.
There was a problem hiding this comment.
Fixed — both in_progress and completed records now replicate to ClickHouse. The replicator forwards all DynamoDB stream events without filtering. SharedReplacingMergeTree handles the upsert — completed replaces in_progress for the same dynamoKey after merges.
| | 2 | Schema validation + payload caps (2MB max body) | `400 Bad Request` | | ||
| | 3 | Write to DynamoDB | `502 Bad Gateway` | | ||
|
|
||
| The HUD API endpoint (`torchci/pages/api/oot/results.ts`) follows the existing `webhookToDynamo` pattern: |
There was a problem hiding this comment.
Schema validation and field extraction should live in the Relay Server, not HUD. The relay already does OIDC, allowlist checks, and rate limiting — adding schema validation there keeps HUD simple (authenticate the relay, write what it's told) and avoids duplicating validation if other consumers of relay data appear later. HUD should still enforce a payload size cap as a safety net.
There was a problem hiding this comment.
We have moved — schema validation is now at the relay (Hop 1, step 5). The HUD API endpoint is simplified to: validate X-OOT-Relay-Token header, enforce 2MB payload cap, extract/flatten the record, and UpdateItem write. HUD trusts that the relay has already validated the structure.
| | Hop | From → To | Auth Mechanism | Credential Scope | | ||
| |-----|-----------|----------------|------------------| | ||
| | 1 | Downstream → Result Handler | OIDC token (GHA-issued, 5 min TTL) | No secrets needed by downstream | | ||
| | 2 | Result Handler → HUD API | `X-Hud-Internal-Bot` header | Same pattern as DrCI / trymerge | |
There was a problem hiding this comment.
Let's use a dedicated auth header for the relay instead of X-Hud-Internal-Bot. That header is shared across multiple systems (DrCI, trymerge) and has only been used for read-only operations. Since this is a write path, giving the relay its own key limits blast radius if any single one leaks.
There was a problem hiding this comment.
Definitely @ZainRizvi,
we have changed to X-OOT-Relay-Token — a dedicated header scoped only to the OOT relay write path. Added an explanation in the RFC that this isolates the write path so key rotation or revocation doesn't affect other internal HUD consumers (DrCI, trymerge).
- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: moved schema validation from HUD to relay (Hop 1)
- Replication: both in_progress and completed records go to ClickHouse
- Timestamps: use downstream-reported started_at/completed_at
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Schema: added job_name, run_attempt columns
- Rate limit: 10 → 20 req/min (matches L2 PR default)
- Error handling: updated to match actual retry + raise behavior
- Sample payloads: rewritten to match actual nested {trusted, untrusted} wire format
Hello @ZainRizvi, Agreed — moved schema validation to the relay (Hop 1, step 5). The relay now validates delivery_id, workflow.status, and conclusion presence before forwarding. HUD is simplified to: auth (X-OOT-Relay-Token) + payload cap (2MB) + UpdateItem write. This aligns with the L2 PR's result_handler.py which already validates these fields at the relay level |
Fixed — removed all "completed only" language. Both in_progress and completed records now replicate to ClickHouse. SharedReplacingMergeTree handles the deduplication — when the completed record arrives for the same dynamoKey, it replaces the in_progress row after merges. This matches in-tree CI behavior and allows the /pr/N page to show running jobs.
Changed everywhere — diagrams, data flow table, authentication flow table, code samples, and implementation plan. The relay now uses a dedicated X-OOT-Relay-Token header instead of the shared X-Hud-Internal-Bot. This isolates the OOT write path so rotating/revoking the token doesn't affect DrCI/trymerge. |
Address @ZainRizvi's review on pytorch/rfcs#96: - Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header - Validation: removed schema validation from HUD (moved to relay) - Removed daily budget enforcement - DynamoDB: PutItem → UpdateItem to prevent null clobbering - DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt} - Timestamps: use downstream-reported started_at/completed_at instead of now() - Timing metrics: only set queue_time/execution_time when non-null - ClickHouse schema: added job_name, run_attempt columns - Queries: select job_name, run_attempt as proper columns - Frontend: updated interfaces to include new fields
KarhouTam
left a comment
There was a problem hiding this comment.
Hey, @jewelkm89. Thanks for this RFC, really a great work! According to the previous comments and suggestions on CRCR L2 implementation PR, some inconsistencies existed between the current implementation and this RFC. I've left some comments. PTAL!
| | Transition | Action | | ||
| |------------|--------| | ||
| | No record → `in_progress` | Accepted | | ||
| | No record → `completed` | Accepted (handles missed `in_progress` callback) | |
There was a problem hiding this comment.
Why No record → completed should be accepted?
There was a problem hiding this comment.
Hello @KarhouTam,
During the initial design, we have decided that if the in_progress couldn't make it to the DynamoDB due to outages, the downstream repo should still be allowed to insert the completed record. Please refer comment.
There was a problem hiding this comment.
After further review and aligning with the L2 implementation, we've updated the RFC to reject No record → completed. With check_run_id-based state tracking and the DISPATCHED prerequisite, accepting a completed without prior in_progress would bypass the state integrity check. The strict DISPATCHED → IN_PROGRESS → COMPLETED flow provides better replay and fabrication protection.
There was a problem hiding this comment.
Currently, when an in_progress callback is lost, the system returns a 400 error, which requires the on-call operator to manually rerun the job. While this behavior was designed to reflect real-world network instability, we have decided after thoughtful internal discussion that it would be more efficient to ensure a completed callback is always preceded by an in_progress callback.
There was a problem hiding this comment.
Thanks for confirming @KarhouTam. This aligns with the update we made to the RFC — No record → completed is now rejected, and the strict DISPATCHED → IN_PROGRESS → COMPLETED flow is enforced. If in_progress is lost, the operator reruns the job rather than accepting a completed without state integrity. The RFC reflects this as the agreed-upon behavior.
| "workflow": { | ||
| "status": "in_progress", | ||
| "conclusion": null, | ||
| "name": "xpu-ci", | ||
| "url": "https://github.com/{org}/{repo}/actions/runs/24033272679", | ||
| "job_name": "test-xpu-float32", | ||
| "run_attempt": 1, | ||
| "started_at": "2025-04-28T10:15:30Z" |
There was a problem hiding this comment.
In my state machine implementation, the Redis key now includes check_run_id instead of run_attempt. Since both values change with reruns, I believe check_run_id is a better fit for representing a unique job than {job_name}:{run_attempt}. Accordingly, the Redis state key has been updated to oot:state:{delivery_id}:{repo}:{check_run_id}. run_attempt still in the dict, but not used by relay server.
To improve HUD usability, I added run_id and job_name to the workflow dictionary, which keep consistent among all runs. This allows the HUD to use a grouping key like {job_name}:{run_id}, preventing a single downstream workflow from appearing as multiple rows due to reruns. Under this approach, every unique check_run_id is treated as a separate job.
There was a problem hiding this comment.
Great improvements @KarhouTam. Aligned the RFC with your changes:
- State machine key: Updated from
run_idtocheck_run_id—oot:state:{delivery_id}:{repo}:{check_run_id}. Sincecheck_run_idis GitHub-assigned and unique per job execution, it's a better uniqueness key than{job_name}:{run_attempt}. - 3-state model: Adopted the
DISPATCHED → IN_PROGRESS → COMPLETEDflow with no shortcuts.DISPATCHEDis repo-level (set by webhook handler),IN_PROGRESS/COMPLETEDare job-level percheck_run_id. Added your mermaid diagram. - New fields:
check_run_id,run_id,schema_versionadded to the workflow interface, DynamoDB/ClickHouse schemas, sample payloads, and the reference implementation. - DynamoKey: Switched from
{job_name}/{run_attempt}to{job_name}/{check_run_id}for per-execution uniqueness. - HUD grouping: Frontend groups by
job_nameand keeps the latestrun_attemptper PR, preventing reruns from appearing as duplicate rows.
run_attempt is still in the dict and stored for informational display, but not used for keying.
| > [!NOTE] | ||
| > The callback token is not strictly required for L2 (dashboard only). However, L1 is where the token gets minted, and once L1 is deployed and downstream repos depend on the current `client_payload` shape, retrofitting it becomes a breaking change. We recommend adding it now as optional groundwork for L3/L4. | ||
|
|
||
| #### State Machine for Status Transitions |
There was a problem hiding this comment.
The current state machine differs slightly from this RFC. Here are the state transitions:
According to previous comments in the L2 implementation PR, this state machine has three states:
DISPATCHED(set only once at the webhook handler per repository; this state is repo-level and every job needs it to update its state).IN_PROGRESS(job-level, cannot be duplicated; reruns are treated as separate jobs).COMPLETED(job-level, cannot be duplicated; reruns are treated as separate jobs).
Note that the direction graph below is for a single check run, reruns have different check_run_id and are treated as separate jobs, so they won't violate the state machine since they won't have a prior IN_PROGRESS or COMPLETED record.
stateDiagram-v2
direction LR
[*] --> DISPATCHED: webhook sends
DISPATCHED --> IN_PROGRESS: first callback
IN_PROGRESS --> COMPLETED: completion
IN_PROGRESS --> IN_PROGRESS: ❌ duplicate
DISPATCHED --> COMPLETED: ❌ skip IN_PROGRESS
COMPLETED --> COMPLETED: ❌ duplicate
COMPLETED --> IN_PROGRESS: ❌ wrong direction
[*] --> IN_PROGRESS: ❌ no dispatch
[*] --> COMPLETED: ❌ no dispatch
State types:
DISPATCHED: Repo-level state (check_run_id=DISPATCH_CHECK_RUN_ID) — one per repository.IN_PROGRESS/COMPLETED: Job-level states — independent per job.
Valid flows:
- Main:
DISPATCHED → IN_PROGRESS → COMPLETED(linear) - Reruns:
IN_PROGRESS → IN_PROGRESSorCOMPLETED → IN_PROGRESS(update timestamps)
Rejected:
- Skipping
IN_PROGRESS, duplicateCOMPLETED, or missingDISPATCHEDstate. - Replay attack: duplicate
IN_PROGRESS(samecheck_run_id).
| } | ||
| ``` | ||
|
|
||
| The `extractDynamoRecord` function flattens this into a single record. It uses `started_at` and `completed_at` timestamps reported by the downstream action (wall-clock time when the callback was sent), rather than generating timestamps at HUD. Timing metrics (`queue_time`, `execution_time`) are only set when the relay provides a non-null value, preventing `completed` callbacks from clobbering `queue_time` with `null`: |
There was a problem hiding this comment.
For the retry scenario, queue_time may be an invalid value. I suppose we should not update this value in this case.
There was a problem hiding this comment.
Good catch @can-gaa-hou. On retries (run_attempt > 1), queue_time = dispatch_timestamp → rerun_in_progress_timestamp could be hours or days old — meaningless as a CI metric.
The fix belongs on the relay side: the relay should set queue_time = null when run_attempt > 1. On the HUD side, we already only write queue_time when the relay provides a non-null value, so no HUD changes are needed.
Updated the RFC to document this behavior.
|
|
||
| The relay's Result Handler validates the callback body and produces a `{trusted, untrusted}` payload: | ||
|
|
||
| - `trusted` — relay-generated fields: `verified_repo` (OIDC-proven identity) and `ci_metrics` (relay-measured `queue_time`, `execution_time`) |
There was a problem hiding this comment.
downstream_repo_level is included for HUD convenience now.
Values: L1 ~ L4 (string)
There was a problem hiding this comment.
Thanks @KarhouTam. Good call — updated the RFC and reference implementation to source downstream_repo_level from trusted (relay-determined from the allowlist) instead of from the untrusted callback payload. Changes:
RFC:
- Added
downstream_repo_levelto thetrustedfields description andRelayPayloadinterface extractDynamoRecordnow reads it fromtrusted.downstream_repo_level- Both sample payloads (
in_progressandcompleted) now include"downstream_repo_level": "L2"in thetrustedsection
Reference implementation (subinz1/test-infra#1):
- Added
downstream_repo_leveltoRelayTrustedandOotWorkflowJobRecordinterfaces extractDynamoRecordextracts it fromtrusted.downstream_repo_level
The DynamoDB and ClickHouse schemas already had the column from a prior commit — only the ingestion/extraction path was missing.
| workflow: { | ||
| status: string; | ||
| conclusion?: string | null; | ||
| name: string; | ||
| url: string; | ||
| job_name?: string; | ||
| run_attempt?: number; | ||
| started_at?: string; | ||
| completed_at?: string; |
There was a problem hiding this comment.
workflow now is like:
workflow: dict = {
"schema_version": ...,
"status": ...,
"conclusion": ...,
"name": ...,
"url": ...,
"run_attempt": ...,
"job_name": ...,
"check_run_id": ...,
"run_id": ...,
"started_at": ...,
"completed_at": ...,
"test-results": ...
}There was a problem hiding this comment.
Thanks for the updated field list @KarhouTam. Updated the RFC and reference implementation to align:
- Added
schema_version,check_run_id, andrun_idto the workflow interface, DynamoDB/ClickHouse schemas, queries, and sample payloads - Switched
dynamoKeyfromrun_attempttocheck_run_idfor per-execution uniqueness - Fixed
test-resultskey to use the hyphenated form matching the L2 action (wastest_results— would have silently failed in JS)
One question: should test-results stay hyphenated, or would it be cleaner to normalize to test_results in the action? Hyphenated keys require bracket notation in TS (wf["test-results"]). Happy either way — just want to confirm the convention.
There was a problem hiding this comment.
Good one! Actually, in the latest action.yml, we changed it to workflow["test_results"] already.
Sorry for the misleading comment.
There was a problem hiding this comment.
Thanks @KarhouTam. We've updated both the RFC and reference implementation to use test_results (underscore) — it matches the L2 action's Python code (workflow["test_results"] = ...) and is more convenient for TypeScript (dot notation vs bracket notation). Also aligned test_results to be summary counts only (total/passed/failed/skipped), with detailed test results going via artifact_url as the action describes.
The L2 relay now includes downstream_repo_level (L1–L4) in the trusted dict, determined from the allowlist rather than self-reported by downstream. Updated the RelayPayload interface, extractDynamoRecord, and sample payloads to read it from trusted.
Align with updated L2 workflow dict: add schema_version, check_run_id, and run_id to the workflow interface, DynamoDB schema table, ClickHouse schema, extractDynamoRecord, and sample payloads. Switch dynamoKey from run_attempt to check_run_id for per-execution uniqueness. Fix test-results key to use the hyphenated form matching the L2 action.
queue_time is only meaningful for first attempts (run_attempt == 1). On retries, the dispatch timestamp is stale and the relay should send null. HUD already handles this — it only writes queue_time when non-null. Also fixed a stale dynamoKey reference that still showed run_attempt instead of check_run_id.
Replace the 2-state model with the L2 PR's strict 3-state machine: DISPATCHED → IN_PROGRESS → COMPLETED. No shortcuts allowed — missing IN_PROGRESS or missing DISPATCHED both result in rejection. State key updated from run_id to check_run_id for per-execution uniqueness. Added mermaid diagram and updated callback token note to reflect that DISPATCHED partially addresses dispatch provenance.
Use <hardware>, <company name>, <version> placeholders instead of specific vendor or accelerator names in sample payloads, schema comments, and motivation text.
The L2 action uses test_results (underscore) in the workflow dict, not test-results (hyphen). The hyphenated name is only the GHA input name. Also removed failures array from test_results — the L2 action sends summary counts only; detailed test results go via artifact_url.
f84edea to
9f49809
Compare
Address @ZainRizvi's review on pytorch/rfcs#96: - Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header - Validation: removed schema validation from HUD (moved to relay) - Removed daily budget enforcement - DynamoDB: PutItem → UpdateItem to prevent null clobbering - DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt} - Timestamps: use downstream-reported started_at/completed_at instead of now() - Timing metrics: only set queue_time/execution_time when non-null - ClickHouse schema: added job_name, run_attempt columns - Queries: select job_name, run_attempt as proper columns - Frontend: updated interfaces to include new fields
|
Hello @albanD, @huydhn and @atalman |
- Fix auth header: X-Hud-Internal-Bot → X-OOT-Relay-Token everywhere (mermaid diagram, data flow table, auth table, implementation plan) - Fix HTTP status: 502 → 500 for DynamoDB write errors - Fix DynamoDB key format: add job_name and check_run_id segments - Add missing DynamoDB fields: job_name, check_run_id, run_id, run_attempt - Update handler code to match actual implementation (relay token auth, no validateRelayPayload — validation done by relay) - Update RelayPayload interface: add downstream_repo_level, schema_version, job_name, check_run_id, run_id, run_attempt, artifact_url - Update extractDynamoRecord: downstream timestamps, Number() coercion for run_attempt, artifact_url extraction - Fix SQL queries to match actual saved queries: - oot_summary: add downstream_repo_level - oot_backend_dashboard: separate workflow_name/job_name, add per-job fields - oot_pr_results: same - Fix Two-Callback Model: both records replicated to ClickHouse, SharedReplacingMergeTree handles dedup (not "completed only") - Fix Hop 4: replicator inserts both in_progress and completed records
Replace flat-format sample payloads with the actual two-stage format:
- Stage 1 (downstream → relay): callback body with workflow dict,
as built by the L2 action.yml
- Stage 2 (relay → HUD): {trusted, untrusted} envelope,
as built by result_handler.py → forward_to_hud()
Includes note about run_attempt string-to-number coercion.
…pped
The L2 action sends test_results with {passed, failed, skipped} but no
total field. Updated the extractDynamoRecord snippet, RelayWorkflow
interface, and sample payloads to reflect this and compute total_tests
as the sum when tr.total is not present.
## Summary Creates the `default.oot_workflow_job` ClickHouse table for storing out-of-tree backend CI results. Split out from #8069 so the table can be deployed independently and populated with test data. The DynamoDB → ClickHouse replicator mapping (`torchci-oot-workflow-job` → `default.oot_workflow_job`) is already on main. ### Schema highlights - `SharedReplacingMergeTree` engine for upsert semantics (in_progress → completed) - Ordered by `(downstream_repo, delivery_id, dynamoKey)` for efficient per-backend and per-delivery queries - Bloom filter indexes on `status`, `pr_number`, `downstream_repo` - Timing fields: `queue_time`, `execution_time`, `started_at`, `completed_at` - Test result columns: `total_tests`, `passed_tests`, `failed_tests`, `skipped_tests` ### Related - Full HUD pipeline PR: #8069 - RFC: pytorch/rfcs#96 - Reference: Similar to #7365 (vLLM ClickHouse table) ## Test Plan - [ ] Table creation succeeds on ClickHouse - [ ] Populate with fake data and verify queries from #8069 return expected results
## Author - @KarhouTam - @can-gaa-hou - @fffrog ## Summary - This PR implements the L2 levels of the cross-repository CI relay described in pytorch/rfcs#90. - For the previous L1 implementation, please refer to #7847. - Please refer to pytorch/rfcs#90 (comment) for the overall implementation. - Please refer to pytorch/rfcs#96 for the design of HUD side. - Please refer to #8069 for the implementation of HUD side. Higher-level behaviors for `L3` and `L4` are intentionally left for follow-up work. ## Architecture The relay is split into two AWS Lambda functions: - `webhook` lambda function (Updated) - [x] receives GitHub webhook PR and push events from the upstream repo - [x] validates webhook signatures and authenticates with AWS Secret Manager - [x] reads the downstream whitelist from the URL and stores it in Redis - [x] for `opened`/`reopened`/`synchronized`/`closed` actions, forwards repository_dispatch events to downstream repos - `callback` lambda function (Added) - [x] receives downstream callback payload through a public lambda function URL - [x] validates callback payload with OIDC - [x] reads the downstream whitelist from the URL and stores it in Redis - [x] extracts CI result information from the payload and uploads to PyTorch HUD - [x] records `queue time` and `execute time` for evolution to `L3` repo ## Changes ```md ..github/ ├── workflows/ │ └── _lambda-do-release-runners.yml # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release │ └── actions/ └── cross-repo-ci-relay-callback/ └── action.yml # Composite action used by downstream workflows to report status back to the relay/result endpoint aws/lambda/cross_repo_ci_relay/ ├── tests/ # Unit tests for allowlist/config/webhook/result/redis behavior ├── README.md # Project overview, local development, callback flow, and result-side validation steps ├── Makefile # Top-level local developer entrypoint for test / deploy / clean ├── local_server.py # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints ├── requirements.txt # Python dependencies required by the relay Lambdas │ ├── utils/ │ ├── allowlist.py # Loads, parses, and queries the downstream allowlist by rollout level │ ├── config.py # Shared runtime config loading and cached get_config() helper │ ├── gh_helper.py # GitHub App, repository_dispatch, and GitHub file access helpers │ ├── hud.py # HUD write helpers for downstream result reporting │ ├── jwt_helper.py # Helpers for minting/verifying relay callback tokens │ ├── redis_helper.py # Redis helpers for allowlist cache, OOT state, and timing data │ └── misc.py # Shared TypedDict definitions and HTTPException │ ├── webhook/ │ ├── Makefile # Build/package/deploy commands for the webhook Lambda │ ├── lambda_function.py # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events │ └── event_handler.py # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos │ └── callback/ ├── Makefile # Build/package/deploy commands for the result Lambda ├── lambda_function.py # Result Lambda entrypoint: verifies callback token and GitHub OIDC token └── callback_handler.py # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD ``` ## Usage See [README.md](https://github.com/KarhouTam/test-infra/blob/crcr-L2/aws/lambda/cross_repo_ci_relay/README.md) for more details. ## Verification We performed the following scenario verification on our AWS Lambda instance: - [x] Test with Upstream PR create/reopen/synchronize and push events triggering webhook, then redispatching to the Downstream CI (different organization) workflow. - [x] Test with Downstream workflow send callback payload through the added action to the result lambda, then extract CI result information and send to PyTorch HUD. ## Terraform configuration - pytorch/ci-infra#446 ## Unit Tests - [x] Unit Tests (Mock) ## Security - **Callback payload carries full upstream webhook data back to HUD** — `action.yml` builds the callback body by mutating `github.event.client_payload` (which contains the entire original webhook payload: PR metadata, commits, author info) and adding `status`/`conclusion`/`workflow_name`/`workflow_url` on top. This full blob is forwarded verbatim by `hud.py` to HUD with no relay-side filtering. HUD receives both relay-trusted `verified_repo` and an unvalidated body — if HUD trusts self-reported fields inside the body over `verified_repo`, a manipulated dispatch payload could tamper with HUD records. - **Lambda callback URL is public and hardcoded** — The endpoint is hardcoded in `action.yml and exposed in a public action, making it trivially discoverable. OIDC verification blocks unauthorized HUD writes, but the endpoint has no rate limiting; request flooding can cause Lambda concurrency exhaustion or Redis connection saturation. - **Only OIDC is used for verification** — The callback lambda relies solely on GitHub OIDC token verification for authentication, without additional application-level secrets or signatures. If an attacker compromises a downstream repo's GitHub Actions permissions, they could forge authenticated requests to the callback endpoint. Besides, OIDC has its own limitations (e.g., token expiration, potential misconfigurations) that could lead to unauthorized access if not carefully managed. ## HUD Interaction - **Design Principle: Transparent Relay & Decoupling** The Relay Server acts as a **lightweight data passthrough layer**. It does not define or parse specific CI data formats; instead, it offloads data interpretation and validation to the HUD. This ensures complete decoupling between the relay infrastructure and business-specific data. - **Security & Risk Mitigation** The relay uses **OIDC authentication** to guarantee the authenticity of the data source (**Verified Repo**). Its core responsibility is to ensure the data originates from the claimed repository, while security filtering and content compliance are enforced at the HUD level. --------- Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com> Co-authored-by: fffrog <ljw1101.vip@gmail.com>
Summary
This RFC defines the HUD-side ingestion and display layer for Out-of-Tree (OOT) CI results, building on RFC-0050 (Cross-Repository CI Relay for PyTorch Out-of-Tree Backends).
Data Flow
flowchart LR subgraph Downstream["Downstream CI (OOT Backend)"] DS["Run tests\n+ upload artifacts"] end subgraph ART["Artifact Storage (org-managed)"] STORE[("Logs, test reports,\nJUnit XML")] end subgraph Relay["Relay Server"] RH["Result Handler\n• OIDC verify\n• Allowlist check\n• Rate limit"] end subgraph HUD["HUD"] API["/api/oot/results\n• Auth check\n• Payload caps (2MB)"] end subgraph Storage["Storage"] DDB[("DynamoDB\ntorchci-oot-workflow-job\n(in_progress + completed)")] STR["DynamoDB Stream"] REP["clickhouse-replicator-dynamo"] CH[("ClickHouse\ndefault.oot_workflow_job\n(SharedReplacingMergeTree)")] end subgraph Frontend["HUD Frontend"] P1["/oot — Global Summary"] P2["/oot/org/repo — Per-Backend"] P3["/pr/N — OOT Section"] end DS -->|"Upload artifacts"| STORE DS -->|"① POST in_progress\n② POST completed\n+ artifact_url\n(OIDC token)"| RH RH -->|"X-OOT-Relay-Token\n{trusted, untrusted}"| API API -->|"UpdateItem"| DDB DDB --> STR --> REP --> CH CH -->|"Query results +\nartifact_url"| P1 & P2 & P3 P2 & P3 -.->|"User clicks\nexternal link"| STOREKey points:
completedcallback payload and flow through the Result Handler → HUD API → DynamoDB → ClickHouseartifact_urlfrom ClickHouse and render it as an external link — no direct connection between HUD and downstream storagein_progressandcompletedrecords are replicated to ClickHouse;SharedReplacingMergeTreehandles deduplication — when acompletedrecord arrives for the samedynamoKey, it replaces thein_progressrowWhat this RFC covers
in_progresscallbacks → DynamoDB viaUpdateItem→ replicated to ClickHouse (shows "running" indicators)completedcallbacks → DynamoDB viaUpdateItem(merges into existing record) → replicated to ClickHouse (replacesin_progressrow viaSharedReplacingMergeTree)/oot— Global OOT CI summary (cross-repo health overview, repos sorted by pass rate)/oot/[org]/[repo]— Per-backend dashboard (matrix view: PRs × jobs, failure drill-down, external artifact links)/pr/[number]— Collapsible "Out-of-Tree Backends" section in existing PR pagesX-OOT-Relay-Tokenheader, error handling strategy, signed callback token proposal, 3-state machine for status transitionsUpdateItemwrite logicReference implementation
A reference implementation is available at pytorch/test-infra#8069, which includes the API endpoint, ClickHouse schema, replicator mapping, saved ClickHouse queries, and all three frontend pages.
HUD Mockup design
The draft OOT HUD UI mockup can be seen in OOT HUD Mockup link
Status
Ready for review