RFC-0054: HUD Integration for Out-of-Tree CI Results by jewelkm89 · Pull Request #96 · pytorch/rfcs

jewelkm89 · 2026-04-29T17:15:15Z

Summary

This RFC defines the HUD-side ingestion and display layer for Out-of-Tree (OOT) CI results, building on RFC-0050 (Cross-Repository CI Relay for PyTorch Out-of-Tree Backends).

Data Flow

flowchart LR
    subgraph Downstream["Downstream CI (OOT Backend)"]
        DS["Run tests\n+ upload artifacts"]
    end

    subgraph ART["Artifact Storage (org-managed)"]
        STORE[("Logs, test reports,\nJUnit XML")]
    end

    subgraph Relay["Relay Server"]
        RH["Result Handler\n• OIDC verify\n• Allowlist check\n• Rate limit"]
    end

    subgraph HUD["HUD"]
        API["/api/oot/results\n• Auth check\n• Payload caps (2MB)"]
    end

    subgraph Storage["Storage"]
        DDB[("DynamoDB\ntorchci-oot-workflow-job\n(in_progress + completed)")]
        STR["DynamoDB Stream"]
        REP["clickhouse-replicator-dynamo"]
        CH[("ClickHouse\ndefault.oot_workflow_job\n(SharedReplacingMergeTree)")]
    end

    subgraph Frontend["HUD Frontend"]
        P1["/oot — Global Summary"]
        P2["/oot/org/repo — Per-Backend"]
        P3["/pr/N — OOT Section"]
    end

    DS -->|"Upload artifacts"| STORE
    DS -->|"① POST in_progress\n② POST completed\n+ artifact_url\n(OIDC token)"| RH
    RH -->|"X-OOT-Relay-Token\n{trusted, untrusted}"| API
    API -->|"UpdateItem"| DDB
    DDB --> STR --> REP --> CH
    CH -->|"Query results +\nartifact_url"| P1 & P2 & P3
    P2 & P3 -.->|"User clicks\nexternal link"| STORE

Key points:

Artifact URLs are included in the completed callback payload and flow through the Result Handler → HUD API → DynamoDB → ClickHouse
HUD pages read artifact_url from ClickHouse and render it as an external link — no direct connection between HUD and downstream storage
Both in_progress and completed records are replicated to ClickHouse; SharedReplacingMergeTree handles deduplication — when a completed record arrives for the same dynamoKey, it replaces the in_progress row

What this RFC covers

Write path: Downstream CI → Result Handler → HUD API → DynamoDB → ClickHouse
- in_progress callbacks → DynamoDB via UpdateItem → replicated to ClickHouse (shows "running" indicators)
- completed callbacks → DynamoDB via UpdateItem (merges into existing record) → replicated to ClickHouse (replaces in_progress row via SharedReplacingMergeTree)
- Artifact URLs flow through the callback payload, not sent directly to HUD
Read path: Three new HUD views:
- /oot — Global OOT CI summary (cross-repo health overview, repos sorted by pass rate)
- /oot/[org]/[repo] — Per-backend dashboard (matrix view: PRs × jobs, failure drill-down, external artifact links)
- /pr/[number] — Collapsible "Out-of-Tree Backends" section in existing PR pages
Storage schemas: DynamoDB table and ClickHouse table designs
DB protection: Rate limiting (per-repo at relay), payload caps (2MB at HUD API)
Security: OIDC authentication, trusted/untrusted payload split, dedicated X-OOT-Relay-Token header, error handling strategy, signed callback token proposal, 3-state machine for status transitions
Sample payloads: Two-stage wire format examples (downstream → relay, relay → HUD) with full field definitions
Implementation plan: 6-phase rollout with task-level breakdown:
1. Storage Layer — DynamoDB + ClickHouse + replicator mapping
2. HUD API Endpoint — types, extraction, UpdateItem write logic
3. Relay Integration — handler → HUD forwarding, rate limiting, reusable GHA action
4. HUD Frontend Pages — 3 views + saved ClickHouse queries
5. End-to-End Validation — real downstream repo testing
6. Security Hardening — callback token, state machine (future)

Reference implementation

A reference implementation is available at pytorch/test-infra#8069, which includes the API endpoint, ClickHouse schema, replicator mapping, saved ClickHouse queries, and all three frontend pages.

HUD Mockup design

The draft OOT HUD UI mockup can be seen in OOT HUD Mockup link

Status

Ready for review

Defines the HUD-side ingestion and display layer for OOT CI results, building on RFC-0050 (Cross-Repository CI Relay). Covers write path, storage schemas, DB protection, security, and three frontend views. Reference implementation: subinz1/test-infra#1

meta-cla · 2026-04-29T17:15:21Z

Hi @jewelkm89!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

ZainRizvi

A few quick high level comments based on the PR description (I really appreciate the diagram btw!):

I see some payload validation happening in HUD. Why not have that in Relay Server instead? In HUD, auth check should likely be limited to authenticating the relay server's header value and a maaaaybe payload cap to ensure we don't overburden the database. Everything else probably will find a better home in Relay Server, where the logic can be centralized. And the payload cap is likely worth adding to the relay server as well (HUD's check is more of a sanity check as the db guardian.)
Why does the Storage box show that only completed jobs will be replicated to ClickHouse? I think the default behavior would be to replicate both in progress and completed, and that would offer more value as well. Current practice for in-tree CI is we update the job status in clickhouse (in progress -> completed) and we should do the same for OOT too. That's why we're storing the data in dynamo, else we could have stored it in S3

Security note: We should give the relay tool it's own unique header to send to hud and only that header should be allowed to make these relay api calls. The X-Hud-Internal-Bot is a more widely used key and so far it has only been used for readonly operations. Let's not give it read-write access since the key is at higher risk of leaking.

ZainRizvi

Nice work — the architecture diagrams are really clear and the RFC builds on RFC-0050 well. A few things to sort out, mostly around aligning with the L2 PR's actual wire format and a couple of bugs in the extraction logic.

ZainRizvi · 2026-04-30T19:08:59Z

+  const pr = cb.payload?.pull_request;
+
+  const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`;
+  const now = new Date().toISOString();


There are a couple of bugs here:

started_at is set to now() on every callback, including completed. That's the time HUD received the POST, not when the job started. On a completed callback, started_at ≈ completed_at, so the ClickHouse duration_seconds alias will always be ≈ 0. Neither started_at nor completed_at are available in the github context, so the action should capture its own wall-clock time and include it in the payload — in_progress sends started_at, completed sends completed_at. Then HUD stores what the action reports instead of using now().

queue_time is set from trusted.ci_metrics?.queue_time, which is null on completed callbacks (the relay only sends it on in_progress). This needs to skip setting queue_time when it's null rather than writing null over a valid value. Same applies to execution_time on in_progress callbacks.

Also, the DynamoDB write needs to use UpdateItem with SET expressions instead of PutItem. PutItem does a full item replace, so even with the above fixes, the completed callback would clobber queue_time and started_at from the in_progress record. UpdateItem lets each callback write only its own fields.

All three fixed:

started_at/completed_at now use downstream-reported timestamps (wf.started_at, wf.completed_at) instead of new Date().toISOString(). Note: these fields need to be added to the L2 PR's callback action (action.yml), which currently doesn't send wall-clock timestamps.

queue_time/execution_time are only set when the relay provides a non-null value. in_progress sets queue_time; completed sets execution_time. Neither overwrites the other.

Changed from PutItem to UpdateItem with SET expressions throughout — Hop 3, data flow table, code samples, and implementation plan.

ZainRizvi · 2026-04-30T19:08:59Z

+  const wf = cb.workflow;
+  const pr = cb.payload?.pull_request;
+
+  const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`;


This key needs to include the job name and attempt number. The callback action runs inside a specific job, and if a downstream repo uses it in multiple jobs within the same workflow (build, test-cpu, test-gpu), all those jobs share the same github.workflow name and would write to the same dynamoKey — last writer wins, the rest are silently lost. And without the attempt number, job retries would overwrite the original run.

Let's add github.job and github.run_attempt to the key: {repo}/{delivery_id}/{workflow_name}/{job_name}/{attempt}. That gives each job its own row, preserves retry history, and makes the /oot/[org]/[repo] matrix view ("columns = OOT jobs") actually useful.

Thanks you for the thorough review @ZainRizvi, Changed from {repo}/{delivery_id}/{workflow_name} to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}. Added job_name and run_attempt to both DynamoDB and ClickHouse schemas. The L2 callback action needs to be updated to include github.job and github.run_attempt in the payload.

ZainRizvi · 2026-04-30T19:08:59Z

+
+### Sample Payloads
+
+#### In-Progress Callback


These samples don't match the actual wire format. They show a flat structure with top-level status, head_sha, pr_number, etc. But the L2 PR's action.yml sends a nested structure — delivery_id and payload from the dispatch envelope, with a workflow sibling containing status/conclusion/name/url. Your own TypeScript code (RelayPayload, extractDynamoRecord) parses the nested format correctly, so the code is right — but a downstream developer implementing from these samples would produce payloads the system can't parse. These need to match the actual {delivery_id, payload, workflow} structure from the action.

Replaced all flat-structure samples with the actual nested wire format, shown at two stages:

Stage 1 (downstream → relay): {event_type, delivery_id, payload, workflow} — matching the callback action's output from action.yml
Stage 2 (relay → HUD): {trusted: {verified_repo, ci_metrics}, untrusted: {callback_payload: ...}} — matching forward_to_hud in hud.py
Also added a note about fields (job_name, run_attempt, started_at, completed_at) that need to be added to the L2 action.

ZainRizvi · 2026-04-30T19:08:59Z

+This hop requires **no new code**. The existing infrastructure handles it:
+
+1. **DynamoDB Streams** (enabled on the table with `NEW_AND_OLD_IMAGES`) captures every write
+2. **`clickhouse-replicator-dynamo`** receives stream events and inserts `completed` records into ClickHouse (`in_progress` records are filtered out — they serve only as mutable state in DynamoDB)


in_progress records need to be replicated to ClickHouse too, not just completed. Current practice for in-tree CI is we update job status in ClickHouse (in_progress → completed) and OOT should follow the same pattern. SharedReplacingMergeTree already supports upserts. Without this, the /pr/N page wouldn't be able to show running jobs.

Fixed — both in_progress and completed records now replicate to ClickHouse. The replicator forwards all DynamoDB stream events without filtering. SharedReplacingMergeTree handles the upsert — completed replaces in_progress for the same dynamoKey after merges.

ZainRizvi · 2026-04-30T19:08:59Z

+| 2 | Schema validation + payload caps (2MB max body) | `400 Bad Request` |
+| 3 | Write to DynamoDB | `502 Bad Gateway` |
+
+The HUD API endpoint (`torchci/pages/api/oot/results.ts`) follows the existing `webhookToDynamo` pattern:


Schema validation and field extraction should live in the Relay Server, not HUD. The relay already does OIDC, allowlist checks, and rate limiting — adding schema validation there keeps HUD simple (authenticate the relay, write what it's told) and avoids duplicating validation if other consumers of relay data appear later. HUD should still enforce a payload size cap as a safety net.

We have moved — schema validation is now at the relay (Hop 1, step 5). The HUD API endpoint is simplified to: validate X-OOT-Relay-Token header, enforce 2MB payload cap, extract/flatten the record, and UpdateItem write. HUD trusts that the relay has already validated the structure.

ZainRizvi · 2026-04-30T19:09:00Z

+| Hop | From → To | Auth Mechanism | Credential Scope |
+|-----|-----------|----------------|------------------|
+| 1 | Downstream → Result Handler | OIDC token (GHA-issued, 5 min TTL) | No secrets needed by downstream |
+| 2 | Result Handler → HUD API | `X-Hud-Internal-Bot` header | Same pattern as DrCI / trymerge |


Let's use a dedicated auth header for the relay instead of X-Hud-Internal-Bot. That header is shared across multiple systems (DrCI, trymerge) and has only been used for read-only operations. Since this is a write path, giving the relay its own key limits blast radius if any single one leaks.

Definitely @ZainRizvi,
we have changed to X-OOT-Relay-Token — a dedicated header scoped only to the OOT relay write path. Added an explanation in the RFC that this isolates the write path so key rotation or revocation doesn't affect other internal HUD consumers (DrCI, trymerge).

- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header - Validation: moved schema validation from HUD to relay (Hop 1) - Replication: both in_progress and completed records go to ClickHouse - Timestamps: use downstream-reported started_at/completed_at - DynamoDB: PutItem → UpdateItem to prevent null clobbering - DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt} - Schema: added job_name, run_attempt columns - Rate limit: 10 → 20 req/min (matches L2 PR default) - Error handling: updated to match actual retry + raise behavior - Sample payloads: rewritten to match actual nested {trusted, untrusted} wire format

subinz1 · 2026-05-01T06:43:08Z

I see some payload validation happening in HUD. Why not have that in Relay Server instead?

Hello @ZainRizvi, Agreed — moved schema validation to the relay (Hop 1, step 5). The relay now validates delivery_id, workflow.status, and conclusion presence before forwarding. HUD is simplified to: auth (X-OOT-Relay-Token) + payload cap (2MB) + UpdateItem write. This aligns with the L2 PR's result_handler.py which already validates these fields at the relay level

subinz1 · 2026-05-01T06:45:26Z

Why does the Storage box show that only completed jobs will be replicated to ClickHouse? I think the default behavior would be to replicate both in progress and completed, and that would offer more value as well. Current practice for in-tree CI is we update the job status in clickhouse (in progress -> completed) and we should do the same for OOT too. That's why we're storing the data in dynamo, else we could have stored it in S3

Fixed — removed all "completed only" language. Both in_progress and completed records now replicate to ClickHouse. SharedReplacingMergeTree handles the deduplication — when the completed record arrives for the same dynamoKey, it replaces the in_progress row after merges. This matches in-tree CI behavior and allows the /pr/N page to show running jobs.

Security note: We should give the relay tool it's own unique header to send to hud and only that header should be allowed to make these relay api calls. The X-Hud-Internal-Bot is a more widely used key and so far it has only been used for readonly operations. Let's not give it read-write access since the key is at higher risk of leaking.

Changed everywhere — diagrams, data flow table, authentication flow table, code samples, and implementation plan. The relay now uses a dedicated X-OOT-Relay-Token header instead of the shared X-Hud-Internal-Bot. This isolates the OOT write path so rotating/revoking the token doesn't affect DrCI/trymerge.

@ZainRizvi

Address @ZainRizvi's review on pytorch/rfcs#96: - Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header - Validation: removed schema validation from HUD (moved to relay) - Removed daily budget enforcement - DynamoDB: PutItem → UpdateItem to prevent null clobbering - DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt} - Timestamps: use downstream-reported started_at/completed_at instead of now() - Timing metrics: only set queue_time/execution_time when non-null - ClickHouse schema: added job_name, run_attempt columns - Queries: select job_name, run_attempt as proper columns - Frontend: updated interfaces to include new fields

KarhouTam

Hey, @jewelkm89. Thanks for this RFC, really a great work! According to the previous comments and suggestions on CRCR L2 implementation PR, some inconsistencies existed between the current implementation and this RFC. I've left some comments. PTAL!

cc @fffrog @can-gaa-hou

KarhouTam · 2026-05-09T06:35:10Z

+| Transition | Action |
+|------------|--------|
+| No record → `in_progress` | Accepted |
+| No record → `completed` | Accepted (handles missed `in_progress` callback) |


Why No record → completed should be accepted?

Hello @KarhouTam,
During the initial design, we have decided that if the in_progress couldn't make it to the DynamoDB due to outages, the downstream repo should still be allowed to insert the completed record. Please refer comment.

After further review and aligning with the L2 implementation, we've updated the RFC to reject No record → completed. With check_run_id-based state tracking and the DISPATCHED prerequisite, accepting a completed without prior in_progress would bypass the state integrity check. The strict DISPATCHED → IN_PROGRESS → COMPLETED flow provides better replay and fabrication protection.

Currently, when an in_progress callback is lost, the system returns a 400 error, which requires the on-call operator to manually rerun the job. While this behavior was designed to reflect real-world network instability, we have decided after thoughtful internal discussion that it would be more efficient to ensure a completed callback is always preceded by an in_progress callback.

cc @fffrog @can-gaa-hou

Thanks for confirming @KarhouTam. This aligns with the update we made to the RFC — No record → completed is now rejected, and the strict DISPATCHED → IN_PROGRESS → COMPLETED flow is enforced. If in_progress is lost, the operator reruns the job rather than accepting a completed without state integrity. The RFC reflects this as the agreed-upon behavior.

KarhouTam · 2026-05-09T06:41:24Z

+  "workflow": {
+    "status": "in_progress",
+    "conclusion": null,
+    "name": "xpu-ci",
+    "url": "https://github.com/{org}/{repo}/actions/runs/24033272679",
+    "job_name": "test-xpu-float32",
+    "run_attempt": 1,
+    "started_at": "2025-04-28T10:15:30Z"


In my state machine implementation, the Redis key now includes check_run_id instead of run_attempt. Since both values change with reruns, I believe check_run_id is a better fit for representing a unique job than {job_name}:{run_attempt}. Accordingly, the Redis state key has been updated to oot:state:{delivery_id}:{repo}:{check_run_id}. run_attempt still in the dict, but not used by relay server.

To improve HUD usability, I added run_id and job_name to the workflow dictionary, which keep consistent among all runs. This allows the HUD to use a grouping key like {job_name}:{run_id}, preventing a single downstream workflow from appearing as multiple rows due to reruns. Under this approach, every unique check_run_id is treated as a separate job.

Great improvements @KarhouTam. Aligned the RFC with your changes:

State machine key: Updated from run_id to check_run_id — oot:state:{delivery_id}:{repo}:{check_run_id}. Since check_run_id is GitHub-assigned and unique per job execution, it's a better uniqueness key than {job_name}:{run_attempt}.

3-state model: Adopted the DISPATCHED → IN_PROGRESS → COMPLETED flow with no shortcuts. DISPATCHED is repo-level (set by webhook handler), IN_PROGRESS/COMPLETED are job-level per check_run_id. Added your mermaid diagram.

New fields: check_run_id, run_id, schema_version added to the workflow interface, DynamoDB/ClickHouse schemas, sample payloads, and the reference implementation.

DynamoKey: Switched from {job_name}/{run_attempt} to {job_name}/{check_run_id} for per-execution uniqueness.

HUD grouping: Frontend groups by job_name and keeps the latest run_attempt per PR, preventing reruns from appearing as duplicate rows.

run_attempt is still in the dict and stored for informational display, but not used for keying.

KarhouTam · 2026-05-09T06:51:09Z

+> [!NOTE]
+> The callback token is not strictly required for L2 (dashboard only). However, L1 is where the token gets minted, and once L1 is deployed and downstream repos depend on the current `client_payload` shape, retrofitting it becomes a breaking change. We recommend adding it now as optional groundwork for L3/L4.
+
+#### State Machine for Status Transitions


The current state machine differs slightly from this RFC. Here are the state transitions:

According to previous comments in the L2 implementation PR, this state machine has three states:

DISPATCHED (set only once at the webhook handler per repository; this state is repo-level and every job needs it to update its state).

IN_PROGRESS (job-level, cannot be duplicated; reruns are treated as separate jobs).

COMPLETED (job-level, cannot be duplicated; reruns are treated as separate jobs).

Note that the direction graph below is for a single check run, reruns have different check_run_id and are treated as separate jobs, so they won't violate the state machine since they won't have a prior IN_PROGRESS or COMPLETED record.

stateDiagram-v2 direction LR [*] --> DISPATCHED: webhook sends DISPATCHED --> IN_PROGRESS: first callback IN_PROGRESS --> COMPLETED: completion IN_PROGRESS --> IN_PROGRESS: ❌ duplicate DISPATCHED --> COMPLETED: ❌ skip IN_PROGRESS COMPLETED --> COMPLETED: ❌ duplicate COMPLETED --> IN_PROGRESS: ❌ wrong direction [*] --> IN_PROGRESS: ❌ no dispatch [*] --> COMPLETED: ❌ no dispatch

Loading

State types:

DISPATCHED: Repo-level state (check_run_id=DISPATCH_CHECK_RUN_ID) — one per repository.

IN_PROGRESS / COMPLETED: Job-level states — independent per job.

Valid flows:

Main: DISPATCHED → IN_PROGRESS → COMPLETED (linear)

Reruns: IN_PROGRESS → IN_PROGRESS or COMPLETED → IN_PROGRESS (update timestamps)

Rejected:

Skipping IN_PROGRESS, duplicate COMPLETED, or missing DISPATCHED state.

Replay attack: duplicate IN_PROGRESS (same check_run_id).

can-gaa-hou · 2026-05-09T10:01:34Z

+}
+```
+
+The `extractDynamoRecord` function flattens this into a single record. It uses `started_at` and `completed_at` timestamps reported by the downstream action (wall-clock time when the callback was sent), rather than generating timestamps at HUD. Timing metrics (`queue_time`, `execution_time`) are only set when the relay provides a non-null value, preventing `completed` callbacks from clobbering `queue_time` with `null`:


For the retry scenario, queue_time may be an invalid value. I suppose we should not update this value in this case.

Good catch @can-gaa-hou. On retries (run_attempt > 1), queue_time = dispatch_timestamp → rerun_in_progress_timestamp could be hours or days old — meaningless as a CI metric.

The fix belongs on the relay side: the relay should set queue_time = null when run_attempt > 1. On the HUD side, we already only write queue_time when the relay provides a non-null value, so no HUD changes are needed.

Updated the RFC to document this behavior.

KarhouTam

More inconsistencies found.

KarhouTam · 2026-05-11T08:18:26Z

+
+The relay's Result Handler validates the callback body and produces a `{trusted, untrusted}` payload:
+
+- `trusted` — relay-generated fields: `verified_repo` (OIDC-proven identity) and `ci_metrics` (relay-measured `queue_time`, `execution_time`)


downstream_repo_level is included for HUD convenience now.

Values: L1 ~ L4 (string)

Thanks @KarhouTam. Good call — updated the RFC and reference implementation to source downstream_repo_level from trusted (relay-determined from the allowlist) instead of from the untrusted callback payload. Changes:

RFC:

Added downstream_repo_level to the trusted fields description and RelayPayload interface

extractDynamoRecord now reads it from trusted.downstream_repo_level

Both sample payloads (in_progress and completed) now include "downstream_repo_level": "L2" in the trusted section

Reference implementation (subinz1/test-infra#1):

Added downstream_repo_level to RelayTrusted and OotWorkflowJobRecord interfaces

extractDynamoRecord extracts it from trusted.downstream_repo_level

The DynamoDB and ClickHouse schemas already had the column from a prior commit — only the ingestion/extraction path was missing.

KarhouTam · 2026-05-11T08:32:04Z

+      workflow: {
+        status: string;
+        conclusion?: string | null;
+        name: string;
+        url: string;
+        job_name?: string;
+        run_attempt?: number;
+        started_at?: string;
+        completed_at?: string;


workflow now is like:

workflow: dict = { "schema_version": ..., "status": ..., "conclusion": ..., "name": ..., "url": ..., "run_attempt": ..., "job_name": ..., "check_run_id": ..., "run_id": ..., "started_at": ..., "completed_at": ..., "test-results": ... }

Thanks for the updated field list @KarhouTam. Updated the RFC and reference implementation to align:

Added schema_version, check_run_id, and run_id to the workflow interface, DynamoDB/ClickHouse schemas, queries, and sample payloads

Switched dynamoKey from run_attempt to check_run_id for per-execution uniqueness

Fixed test-results key to use the hyphenated form matching the L2 action (was test_results — would have silently failed in JS)

One question: should test-results stay hyphenated, or would it be cleaner to normalize to test_results in the action? Hyphenated keys require bracket notation in TS (wf["test-results"]). Happy either way — just want to confirm the convention.

Good one! Actually, in the latest action.yml, we changed it to workflow["test_results"] already.

https://github.com/KarhouTam/test-infra/blob/c2833bd9eee6f7299311695597ad1e57ad974c7b/.github/actions/cross-repo-ci-relay-callback/action.yml#L121-L126

Sorry for the misleading comment.

Thanks @KarhouTam. We've updated both the RFC and reference implementation to use test_results (underscore) — it matches the L2 action's Python code (workflow["test_results"] = ...) and is more convenient for TypeScript (dot notation vs bracket notation). Also aligned test_results to be summary counts only (total/passed/failed/skipped), with detailed test results going via artifact_url as the action describes.

The L2 relay now includes downstream_repo_level (L1–L4) in the trusted dict, determined from the allowlist rather than self-reported by downstream. Updated the RelayPayload interface, extractDynamoRecord, and sample payloads to read it from trusted.

Align with updated L2 workflow dict: add schema_version, check_run_id, and run_id to the workflow interface, DynamoDB schema table, ClickHouse schema, extractDynamoRecord, and sample payloads. Switch dynamoKey from run_attempt to check_run_id for per-execution uniqueness. Fix test-results key to use the hyphenated form matching the L2 action.

queue_time is only meaningful for first attempts (run_attempt == 1). On retries, the dispatch timestamp is stale and the relay should send null. HUD already handles this — it only writes queue_time when non-null. Also fixed a stale dynamoKey reference that still showed run_attempt instead of check_run_id.

Replace the 2-state model with the L2 PR's strict 3-state machine: DISPATCHED → IN_PROGRESS → COMPLETED. No shortcuts allowed — missing IN_PROGRESS or missing DISPATCHED both result in rejection. State key updated from run_id to check_run_id for per-execution uniqueness. Added mermaid diagram and updated callback token note to reflect that DISPATCHED partially addresses dispatch provenance.

Use <hardware>, <company name>, <version> placeholders instead of specific vendor or accelerator names in sample payloads, schema comments, and motivation text.

The L2 action uses test_results (underscore) in the workflow dict, not test-results (hyphen). The hyphenated name is only the GHA input name. Also removed failures array from test_results — the L2 action sends summary counts only; detailed test results go via artifact_url.

@ZainRizvi

Address @ZainRizvi's review on pytorch/rfcs#96: - Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header - Validation: removed schema validation from HUD (moved to relay) - Removed daily budget enforcement - DynamoDB: PutItem → UpdateItem to prevent null clobbering - DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt} - Timestamps: use downstream-reported started_at/completed_at instead of now() - Timing metrics: only set queue_time/execution_time when non-null - ClickHouse schema: added job_name, run_attempt columns - Queries: select job_name, run_attempt as proper columns - Frontend: updated interfaces to include new fields

jewelkm89 · 2026-05-14T09:36:13Z

Hello @albanD, @huydhn and @atalman
Request your review for the RFC as well.
Associated PR is pytorch/test-infra#8069

- Fix auth header: X-Hud-Internal-Bot → X-OOT-Relay-Token everywhere (mermaid diagram, data flow table, auth table, implementation plan) - Fix HTTP status: 502 → 500 for DynamoDB write errors - Fix DynamoDB key format: add job_name and check_run_id segments - Add missing DynamoDB fields: job_name, check_run_id, run_id, run_attempt - Update handler code to match actual implementation (relay token auth, no validateRelayPayload — validation done by relay) - Update RelayPayload interface: add downstream_repo_level, schema_version, job_name, check_run_id, run_id, run_attempt, artifact_url - Update extractDynamoRecord: downstream timestamps, Number() coercion for run_attempt, artifact_url extraction - Fix SQL queries to match actual saved queries: - oot_summary: add downstream_repo_level - oot_backend_dashboard: separate workflow_name/job_name, add per-job fields - oot_pr_results: same - Fix Two-Callback Model: both records replicated to ClickHouse, SharedReplacingMergeTree handles dedup (not "completed only") - Fix Hop 4: replicator inserts both in_progress and completed records

Replace flat-format sample payloads with the actual two-stage format: - Stage 1 (downstream → relay): callback body with workflow dict, as built by the L2 action.yml - Stage 2 (relay → HUD): {trusted, untrusted} envelope, as built by result_handler.py → forward_to_hud() Includes note about run_attempt string-to-number coercion.

…pped The L2 action sends test_results with {passed, failed, skipped} but no total field. Updated the extractDynamoRecord snippet, RelayWorkflow interface, and sample payloads to reflect this and compute total_tests as the sum when tr.total is not present.

## Summary Creates the `default.oot_workflow_job` ClickHouse table for storing out-of-tree backend CI results. Split out from #8069 so the table can be deployed independently and populated with test data. The DynamoDB → ClickHouse replicator mapping (`torchci-oot-workflow-job` → `default.oot_workflow_job`) is already on main. ### Schema highlights - `SharedReplacingMergeTree` engine for upsert semantics (in_progress → completed) - Ordered by `(downstream_repo, delivery_id, dynamoKey)` for efficient per-backend and per-delivery queries - Bloom filter indexes on `status`, `pr_number`, `downstream_repo` - Timing fields: `queue_time`, `execution_time`, `started_at`, `completed_at` - Test result columns: `total_tests`, `passed_tests`, `failed_tests`, `skipped_tests` ### Related - Full HUD pipeline PR: #8069 - RFC: pytorch/rfcs#96 - Reference: Similar to #7365 (vLLM ClickHouse table) ## Test Plan - [ ] Table creation succeeds on ClickHouse - [ ] Populate with fake data and verify queries from #8069 return expected results

@KarhouTam

## Author - @KarhouTam - @can-gaa-hou - @fffrog ## Summary - This PR implements the L2 levels of the cross-repository CI relay described in pytorch/rfcs#90. - For the previous L1 implementation, please refer to #7847. - Please refer to pytorch/rfcs#90 (comment) for the overall implementation. - Please refer to pytorch/rfcs#96 for the design of HUD side. - Please refer to #8069 for the implementation of HUD side. Higher-level behaviors for `L3` and `L4` are intentionally left for follow-up work. ## Architecture The relay is split into two AWS Lambda functions: - `webhook` lambda function (Updated) - [x] receives GitHub webhook PR and push events from the upstream repo - [x] validates webhook signatures and authenticates with AWS Secret Manager - [x] reads the downstream whitelist from the URL and stores it in Redis - [x] for `opened`/`reopened`/`synchronized`/`closed` actions, forwards repository_dispatch events to downstream repos - `callback` lambda function (Added) - [x] receives downstream callback payload through a public lambda function URL - [x] validates callback payload with OIDC - [x] reads the downstream whitelist from the URL and stores it in Redis - [x] extracts CI result information from the payload and uploads to PyTorch HUD - [x] records `queue time` and `execute time` for evolution to `L3` repo ## Changes ```md ..github/ ├── workflows/ │ └── _lambda-do-release-runners.yml # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release │ └── actions/ └── cross-repo-ci-relay-callback/ └── action.yml # Composite action used by downstream workflows to report status back to the relay/result endpoint aws/lambda/cross_repo_ci_relay/ ├── tests/ # Unit tests for allowlist/config/webhook/result/redis behavior ├── README.md # Project overview, local development, callback flow, and result-side validation steps ├── Makefile # Top-level local developer entrypoint for test / deploy / clean ├── local_server.py # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints ├── requirements.txt # Python dependencies required by the relay Lambdas │ ├── utils/ │ ├── allowlist.py # Loads, parses, and queries the downstream allowlist by rollout level │ ├── config.py # Shared runtime config loading and cached get_config() helper │ ├── gh_helper.py # GitHub App, repository_dispatch, and GitHub file access helpers │ ├── hud.py # HUD write helpers for downstream result reporting │ ├── jwt_helper.py # Helpers for minting/verifying relay callback tokens │ ├── redis_helper.py # Redis helpers for allowlist cache, OOT state, and timing data │ └── misc.py # Shared TypedDict definitions and HTTPException │ ├── webhook/ │ ├── Makefile # Build/package/deploy commands for the webhook Lambda │ ├── lambda_function.py # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events │ └── event_handler.py # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos │ └── callback/ ├── Makefile # Build/package/deploy commands for the result Lambda ├── lambda_function.py # Result Lambda entrypoint: verifies callback token and GitHub OIDC token └── callback_handler.py # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD ``` ## Usage See [README.md](https://github.com/KarhouTam/test-infra/blob/crcr-L2/aws/lambda/cross_repo_ci_relay/README.md) for more details. ## Verification We performed the following scenario verification on our AWS Lambda instance: - [x] Test with Upstream PR create/reopen/synchronize and push events triggering webhook, then redispatching to the Downstream CI (different organization) workflow. - [x] Test with Downstream workflow send callback payload through the added action to the result lambda, then extract CI result information and send to PyTorch HUD. ## Terraform configuration - pytorch/ci-infra#446 ## Unit Tests - [x] Unit Tests (Mock) ## Security - **Callback payload carries full upstream webhook data back to HUD** — `action.yml` builds the callback body by mutating `github.event.client_payload` (which contains the entire original webhook payload: PR metadata, commits, author info) and adding `status`/`conclusion`/`workflow_name`/`workflow_url` on top. This full blob is forwarded verbatim by `hud.py` to HUD with no relay-side filtering. HUD receives both relay-trusted `verified_repo` and an unvalidated body — if HUD trusts self-reported fields inside the body over `verified_repo`, a manipulated dispatch payload could tamper with HUD records. - **Lambda callback URL is public and hardcoded** — The endpoint is hardcoded in `action.yml and exposed in a public action, making it trivially discoverable. OIDC verification blocks unauthorized HUD writes, but the endpoint has no rate limiting; request flooding can cause Lambda concurrency exhaustion or Redis connection saturation. - **Only OIDC is used for verification** — The callback lambda relies solely on GitHub OIDC token verification for authentication, without additional application-level secrets or signatures. If an attacker compromises a downstream repo's GitHub Actions permissions, they could forge authenticated requests to the callback endpoint. Besides, OIDC has its own limitations (e.g., token expiration, potential misconfigurations) that could lead to unauthorized access if not carefully managed. ## HUD Interaction - **Design Principle: Transparent Relay & Decoupling** The Relay Server acts as a **lightweight data passthrough layer**. It does not define or parse specific CI data formats; instead, it offloads data interpretation and validation to the HUD. This ensures complete decoupling between the relay infrastructure and business-specific data. - **Security & Risk Mitigation** The relay uses **OIDC authentication** to guarantee the authenticity of the data source (**Verified Repo**). Its core responsibility is to ensure the data originates from the claimed repository, while security filtering and content compliance are enforced at the HUD level. --------- Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com> Co-authored-by: fffrog <ljw1101.vip@gmail.com>

Minor formatting fix

edac27d

meta-cla Bot added the cla signed label Apr 29, 2026

ZainRizvi reviewed Apr 30, 2026

View reviewed changes

KarhouTam reviewed May 9, 2026

View reviewed changes

can-gaa-hou reviewed May 9, 2026

View reviewed changes

KarhouTam reviewed May 11, 2026

View reviewed changes

jewelkm89 changed the title ~~[WIP] RFC-0054: HUD Integration for Out-of-Tree CI Results~~ RFC-0054: HUD Integration for Out-of-Tree CI Results May 12, 2026

subinz1 added 3 commits May 12, 2026 15:50

subinz1 mentioned this pull request May 12, 2026

OOT HUD: Full ingestion pipeline — API, schemas, queries, and frontend pages pytorch/test-infra#8069

Open

9 tasks

subinz1 added 2 commits May 12, 2026 16:33

Replace company/hardware names with generic placeholders

cf3879a

Use <hardware>, <company name>, <version> placeholders instead of specific vendor or accelerator names in sample payloads, schema comments, and motivation text.

subinz1 force-pushed the oot-hud-integration-rfc branch from f84edea to 9f49809 Compare May 12, 2026 11:57

KarhouTam mentioned this pull request May 12, 2026

[CRCR] Initial implementation of L2 pytorch/test-infra#7967

Merged

12 tasks

jewelkm89 marked this pull request as ready for review May 14, 2026 09:32

subinz1 added 4 commits May 15, 2026 10:32

Replace specific company names with generic terms in Motivation section

80a3429

subinz1 mentioned this pull request May 21, 2026

Add ClickHouse schema for OOT workflow job table pytorch/test-infra#8105

Merged

2 tasks

subinz1 mentioned this pull request May 22, 2026

Add L2 CRCR deployment configuration pytorch/ci-infra#614

Merged


		The relay's Result Handler validates the callback body and produces a `{trusted, untrusted}` payload:

		- `trusted` — relay-generated fields: `verified_repo` (OIDC-proven identity) and `ci_metrics` (relay-measured `queue_time`, `execution_time`)

Conversation

jewelkm89 commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Data Flow

What this RFC covers

Reference implementation

HUD Mockup design

Status

Uh oh!

meta-cla Bot commented Apr 29, 2026

Action Required

Process

Uh oh!

ZainRizvi left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ZainRizvi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

subinz1 commented May 1, 2026

Uh oh!

subinz1 commented May 1, 2026

Uh oh!

KarhouTam left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KarhouTam May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

subinz1 May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KarhouTam May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KarhouTam left a comment

Choose a reason for hiding this comment

jewelkm89 commented Apr 29, 2026 •

edited

Loading

ZainRizvi left a comment •

edited

Loading

KarhouTam May 12, 2026 •

edited

Loading

subinz1 May 12, 2026 •

edited

Loading

KarhouTam May 9, 2026 •

edited

Loading

subinz1 May 12, 2026 •

edited

Loading

KarhouTam May 12, 2026 •

edited

Loading

jewelkm89 commented May 14, 2026 •

edited

Loading