Skip to content

RFC-0054: HUD Integration for Out-of-Tree CI Results#96

Open
jewelkm89 wants to merge 13 commits into
pytorch:masterfrom
jewelkm89:oot-hud-integration-rfc
Open

RFC-0054: HUD Integration for Out-of-Tree CI Results#96
jewelkm89 wants to merge 13 commits into
pytorch:masterfrom
jewelkm89:oot-hud-integration-rfc

Conversation

@jewelkm89
Copy link
Copy Markdown

@jewelkm89 jewelkm89 commented Apr 29, 2026

Summary

This RFC defines the HUD-side ingestion and display layer for Out-of-Tree (OOT) CI results, building on RFC-0050 (Cross-Repository CI Relay for PyTorch Out-of-Tree Backends).

Data Flow

flowchart LR
    subgraph Downstream["Downstream CI (OOT Backend)"]
        DS["Run tests\n+ upload artifacts"]
    end

    subgraph ART["Artifact Storage (org-managed)"]
        STORE[("Logs, test reports,\nJUnit XML")]
    end

    subgraph Relay["Relay Server"]
        RH["Result Handler\n• OIDC verify\n• Allowlist check\n• Rate limit"]
    end

    subgraph HUD["HUD"]
        API["/api/oot/results\n• Auth check\n• Payload caps (2MB)"]
    end

    subgraph Storage["Storage"]
        DDB[("DynamoDB\ntorchci-oot-workflow-job\n(in_progress + completed)")]
        STR["DynamoDB Stream"]
        REP["clickhouse-replicator-dynamo"]
        CH[("ClickHouse\ndefault.oot_workflow_job\n(SharedReplacingMergeTree)")]
    end

    subgraph Frontend["HUD Frontend"]
        P1["/oot — Global Summary"]
        P2["/oot/org/repo — Per-Backend"]
        P3["/pr/N — OOT Section"]
    end

    DS -->|"Upload artifacts"| STORE
    DS -->|"① POST in_progress\n② POST completed\n+ artifact_url\n(OIDC token)"| RH
    RH -->|"X-OOT-Relay-Token\n{trusted, untrusted}"| API
    API -->|"UpdateItem"| DDB
    DDB --> STR --> REP --> CH
    CH -->|"Query results +\nartifact_url"| P1 & P2 & P3
    P2 & P3 -.->|"User clicks\nexternal link"| STORE
Loading

Key points:

  • Artifact URLs are included in the completed callback payload and flow through the Result Handler → HUD API → DynamoDB → ClickHouse
  • HUD pages read artifact_url from ClickHouse and render it as an external link — no direct connection between HUD and downstream storage
  • Both in_progress and completed records are replicated to ClickHouse; SharedReplacingMergeTree handles deduplication — when a completed record arrives for the same dynamoKey, it replaces the in_progress row

What this RFC covers

  • Write path: Downstream CI → Result Handler → HUD API → DynamoDB → ClickHouse
    • in_progress callbacks → DynamoDB via UpdateItem → replicated to ClickHouse (shows "running" indicators)
    • completed callbacks → DynamoDB via UpdateItem (merges into existing record) → replicated to ClickHouse (replaces in_progress row via SharedReplacingMergeTree)
    • Artifact URLs flow through the callback payload, not sent directly to HUD
  • Read path: Three new HUD views:
    • /oot — Global OOT CI summary (cross-repo health overview, repos sorted by pass rate)
    • /oot/[org]/[repo] — Per-backend dashboard (matrix view: PRs × jobs, failure drill-down, external artifact links)
    • /pr/[number] — Collapsible "Out-of-Tree Backends" section in existing PR pages
  • Storage schemas: DynamoDB table and ClickHouse table designs
  • DB protection: Rate limiting (per-repo at relay), payload caps (2MB at HUD API)
  • Security: OIDC authentication, trusted/untrusted payload split, dedicated X-OOT-Relay-Token header, error handling strategy, signed callback token proposal, 3-state machine for status transitions
  • Sample payloads: Two-stage wire format examples (downstream → relay, relay → HUD) with full field definitions
  • Implementation plan: 6-phase rollout with task-level breakdown:
    1. Storage Layer — DynamoDB + ClickHouse + replicator mapping
    2. HUD API Endpoint — types, extraction, UpdateItem write logic
    3. Relay Integration — handler → HUD forwarding, rate limiting, reusable GHA action
    4. HUD Frontend Pages — 3 views + saved ClickHouse queries
    5. End-to-End Validation — real downstream repo testing
    6. Security Hardening — callback token, state machine (future)

Reference implementation

A reference implementation is available at pytorch/test-infra#8069, which includes the API endpoint, ClickHouse schema, replicator mapping, saved ClickHouse queries, and all three frontend pages.

HUD Mockup design

The draft OOT HUD UI mockup can be seen in OOT HUD Mockup link

Status

Ready for review

Defines the HUD-side ingestion and display layer for OOT CI results,
building on RFC-0050 (Cross-Repository CI Relay). Covers write path,
storage schemas, DB protection, security, and three frontend views.
Reference implementation: subinz1/test-infra#1
@meta-cla
Copy link
Copy Markdown

meta-cla Bot commented Apr 29, 2026

Hi @jewelkm89!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@meta-cla meta-cla Bot added the cla signed label Apr 29, 2026
Copy link
Copy Markdown
Contributor

@ZainRizvi ZainRizvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few quick high level comments based on the PR description (I really appreciate the diagram btw!):

  • I see some payload validation happening in HUD. Why not have that in Relay Server instead? In HUD, auth check should likely be limited to authenticating the relay server's header value and a maaaaybe payload cap to ensure we don't overburden the database. Everything else probably will find a better home in Relay Server, where the logic can be centralized. And the payload cap is likely worth adding to the relay server as well (HUD's check is more of a sanity check as the db guardian.)

  • Why does the Storage box show that only completed jobs will be replicated to ClickHouse? I think the default behavior would be to replicate both in progress and completed, and that would offer more value as well. Current practice for in-tree CI is we update the job status in clickhouse (in progress -> completed) and we should do the same for OOT too. That's why we're storing the data in dynamo, else we could have stored it in S3

Security note: We should give the relay tool it's own unique header to send to hud and only that header should be allowed to make these relay api calls. The X-Hud-Internal-Bot is a more widely used key and so far it has only been used for readonly operations. Let's not give it read-write access since the key is at higher risk of leaking.

Copy link
Copy Markdown
Contributor

@ZainRizvi ZainRizvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work — the architecture diagrams are really clear and the RFC builds on RFC-0050 well. A few things to sort out, mostly around aligning with the L2 PR's actual wire format and a couple of bugs in the extraction logic.

const pr = cb.payload?.pull_request;

const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`;
const now = new Date().toISOString();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of bugs here:

  • started_at is set to now() on every callback, including completed. That's the time HUD received the POST, not when the job started. On a completed callback, started_at ≈ completed_at, so the ClickHouse duration_seconds alias will always be ≈ 0. Neither started_at nor completed_at are available in the github context, so the action should capture its own wall-clock time and include it in the payload — in_progress sends started_at, completed sends completed_at. Then HUD stores what the action reports instead of using now().

  • queue_time is set from trusted.ci_metrics?.queue_time, which is null on completed callbacks (the relay only sends it on in_progress). This needs to skip setting queue_time when it's null rather than writing null over a valid value. Same applies to execution_time on in_progress callbacks.

Also, the DynamoDB write needs to use UpdateItem with SET expressions instead of PutItem. PutItem does a full item replace, so even with the above fixes, the completed callback would clobber queue_time and started_at from the in_progress record. UpdateItem lets each callback write only its own fields.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three fixed:

  • started_at/completed_at now use downstream-reported timestamps (wf.started_at, wf.completed_at) instead of new Date().toISOString(). Note: these fields need to be added to the L2 PR's callback action (action.yml), which currently doesn't send wall-clock timestamps.
  • queue_time/execution_time are only set when the relay provides a non-null value. in_progress sets queue_time; completed sets execution_time. Neither overwrites the other.
  • Changed from PutItem to UpdateItem with SET expressions throughout — Hop 3, data flow table, code samples, and implementation plan.

const wf = cb.workflow;
const pr = cb.payload?.pull_request;

const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This key needs to include the job name and attempt number. The callback action runs inside a specific job, and if a downstream repo uses it in multiple jobs within the same workflow (build, test-cpu, test-gpu), all those jobs share the same github.workflow name and would write to the same dynamoKey — last writer wins, the rest are silently lost. And without the attempt number, job retries would overwrite the original run.

Let's add github.job and github.run_attempt to the key: {repo}/{delivery_id}/{workflow_name}/{job_name}/{attempt}. That gives each job its own row, preserves retry history, and makes the /oot/[org]/[repo] matrix view ("columns = OOT jobs") actually useful.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you for the thorough review @ZainRizvi, Changed from {repo}/{delivery_id}/{workflow_name} to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}. Added job_name and run_attempt to both DynamoDB and ClickHouse schemas. The L2 callback action needs to be updated to include github.job and github.run_attempt in the payload.


### Sample Payloads

#### In-Progress Callback
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These samples don't match the actual wire format. They show a flat structure with top-level status, head_sha, pr_number, etc. But the L2 PR's action.yml sends a nested structure — delivery_id and payload from the dispatch envelope, with a workflow sibling containing status/conclusion/name/url. Your own TypeScript code (RelayPayload, extractDynamoRecord) parses the nested format correctly, so the code is right — but a downstream developer implementing from these samples would produce payloads the system can't parse. These need to match the actual {delivery_id, payload, workflow} structure from the action.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced all flat-structure samples with the actual nested wire format, shown at two stages:

Stage 1 (downstream → relay): {event_type, delivery_id, payload, workflow} — matching the callback action's output from action.yml
Stage 2 (relay → HUD): {trusted: {verified_repo, ci_metrics}, untrusted: {callback_payload: ...}} — matching forward_to_hud in hud.py
Also added a note about fields (job_name, run_attempt, started_at, completed_at) that need to be added to the L2 action.

This hop requires **no new code**. The existing infrastructure handles it:

1. **DynamoDB Streams** (enabled on the table with `NEW_AND_OLD_IMAGES`) captures every write
2. **`clickhouse-replicator-dynamo`** receives stream events and inserts `completed` records into ClickHouse (`in_progress` records are filtered out — they serve only as mutable state in DynamoDB)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in_progress records need to be replicated to ClickHouse too, not just completed. Current practice for in-tree CI is we update job status in ClickHouse (in_progress → completed) and OOT should follow the same pattern. SharedReplacingMergeTree already supports upserts. Without this, the /pr/N page wouldn't be able to show running jobs.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — both in_progress and completed records now replicate to ClickHouse. The replicator forwards all DynamoDB stream events without filtering. SharedReplacingMergeTree handles the upsert — completed replaces in_progress for the same dynamoKey after merges.

| 2 | Schema validation + payload caps (2MB max body) | `400 Bad Request` |
| 3 | Write to DynamoDB | `502 Bad Gateway` |

The HUD API endpoint (`torchci/pages/api/oot/results.ts`) follows the existing `webhookToDynamo` pattern:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema validation and field extraction should live in the Relay Server, not HUD. The relay already does OIDC, allowlist checks, and rate limiting — adding schema validation there keeps HUD simple (authenticate the relay, write what it's told) and avoids duplicating validation if other consumers of relay data appear later. HUD should still enforce a payload size cap as a safety net.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have moved — schema validation is now at the relay (Hop 1, step 5). The HUD API endpoint is simplified to: validate X-OOT-Relay-Token header, enforce 2MB payload cap, extract/flatten the record, and UpdateItem write. HUD trusts that the relay has already validated the structure.

| Hop | From → To | Auth Mechanism | Credential Scope |
|-----|-----------|----------------|------------------|
| 1 | Downstream → Result Handler | OIDC token (GHA-issued, 5 min TTL) | No secrets needed by downstream |
| 2 | Result Handler → HUD API | `X-Hud-Internal-Bot` header | Same pattern as DrCI / trymerge |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a dedicated auth header for the relay instead of X-Hud-Internal-Bot. That header is shared across multiple systems (DrCI, trymerge) and has only been used for read-only operations. Since this is a write path, giving the relay its own key limits blast radius if any single one leaks.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely @ZainRizvi,
we have changed to X-OOT-Relay-Token — a dedicated header scoped only to the OOT relay write path. Added an explanation in the RFC that this isolates the write path so key rotation or revocation doesn't affect other internal HUD consumers (DrCI, trymerge).

- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: moved schema validation from HUD to relay (Hop 1)
- Replication: both in_progress and completed records go to ClickHouse
- Timestamps: use downstream-reported started_at/completed_at
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Schema: added job_name, run_attempt columns
- Rate limit: 10 → 20 req/min (matches L2 PR default)
- Error handling: updated to match actual retry + raise behavior
- Sample payloads: rewritten to match actual nested {trusted, untrusted} wire format
@subinz1
Copy link
Copy Markdown

subinz1 commented May 1, 2026

  • I see some payload validation happening in HUD. Why not have that in Relay Server instead?

Hello @ZainRizvi, Agreed — moved schema validation to the relay (Hop 1, step 5). The relay now validates delivery_id, workflow.status, and conclusion presence before forwarding. HUD is simplified to: auth (X-OOT-Relay-Token) + payload cap (2MB) + UpdateItem write. This aligns with the L2 PR's result_handler.py which already validates these fields at the relay level

@subinz1
Copy link
Copy Markdown

subinz1 commented May 1, 2026

  • Why does the Storage box show that only completed jobs will be replicated to ClickHouse? I think the default behavior would be to replicate both in progress and completed, and that would offer more value as well. Current practice for in-tree CI is we update the job status in clickhouse (in progress -> completed) and we should do the same for OOT too. That's why we're storing the data in dynamo, else we could have stored it in S3

Fixed — removed all "completed only" language. Both in_progress and completed records now replicate to ClickHouse. SharedReplacingMergeTree handles the deduplication — when the completed record arrives for the same dynamoKey, it replaces the in_progress row after merges. This matches in-tree CI behavior and allows the /pr/N page to show running jobs.

Security note: We should give the relay tool it's own unique header to send to hud and only that header should be allowed to make these relay api calls. The X-Hud-Internal-Bot is a more widely used key and so far it has only been used for readonly operations. Let's not give it read-write access since the key is at higher risk of leaking.

Changed everywhere — diagrams, data flow table, authentication flow table, code samples, and implementation plan. The relay now uses a dedicated X-OOT-Relay-Token header instead of the shared X-Hud-Internal-Bot. This isolates the OOT write path so rotating/revoking the token doesn't affect DrCI/trymerge.

subinz1 added a commit to subinz1/test-infra that referenced this pull request May 4, 2026
Address @ZainRizvi's review on pytorch/rfcs#96:

- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: removed schema validation from HUD (moved to relay)
- Removed daily budget enforcement
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Timestamps: use downstream-reported started_at/completed_at instead of now()
- Timing metrics: only set queue_time/execution_time when non-null
- ClickHouse schema: added job_name, run_attempt columns
- Queries: select job_name, run_attempt as proper columns
- Frontend: updated interfaces to include new fields
Copy link
Copy Markdown

@KarhouTam KarhouTam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @jewelkm89. Thanks for this RFC, really a great work! According to the previous comments and suggestions on CRCR L2 implementation PR, some inconsistencies existed between the current implementation and this RFC. I've left some comments. PTAL!

cc @fffrog @can-gaa-hou

| Transition | Action |
|------------|--------|
| No record → `in_progress` | Accepted |
| No record → `completed` | Accepted (handles missed `in_progress` callback) |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why No record → completed should be accepted?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @KarhouTam,
During the initial design, we have decided that if the in_progress couldn't make it to the DynamoDB due to outages, the downstream repo should still be allowed to insert the completed record. Please refer comment.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After further review and aligning with the L2 implementation, we've updated the RFC to reject No record → completed. With check_run_id-based state tracking and the DISPATCHED prerequisite, accepting a completed without prior in_progress would bypass the state integrity check. The strict DISPATCHED → IN_PROGRESS → COMPLETED flow provides better replay and fabrication protection.

Copy link
Copy Markdown

@KarhouTam KarhouTam May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, when an in_progress callback is lost, the system returns a 400 error, which requires the on-call operator to manually rerun the job. While this behavior was designed to reflect real-world network instability, we have decided after thoughtful internal discussion that it would be more efficient to ensure a completed callback is always preceded by an in_progress callback.

cc @fffrog @can-gaa-hou

Copy link
Copy Markdown

@subinz1 subinz1 May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming @KarhouTam. This aligns with the update we made to the RFC — No record → completed is now rejected, and the strict DISPATCHED → IN_PROGRESS → COMPLETED flow is enforced. If in_progress is lost, the operator reruns the job rather than accepting a completed without state integrity. The RFC reflects this as the agreed-upon behavior.

Comment on lines +741 to +748
"workflow": {
"status": "in_progress",
"conclusion": null,
"name": "xpu-ci",
"url": "https://github.com/{org}/{repo}/actions/runs/24033272679",
"job_name": "test-xpu-float32",
"run_attempt": 1,
"started_at": "2025-04-28T10:15:30Z"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my state machine implementation, the Redis key now includes check_run_id instead of run_attempt. Since both values change with reruns, I believe check_run_id is a better fit for representing a unique job than {job_name}:{run_attempt}. Accordingly, the Redis state key has been updated to oot:state:{delivery_id}:{repo}:{check_run_id}. run_attempt still in the dict, but not used by relay server.

To improve HUD usability, I added run_id and job_name to the workflow dictionary, which keep consistent among all runs. This allows the HUD to use a grouping key like {job_name}:{run_id}, preventing a single downstream workflow from appearing as multiple rows due to reruns. Under this approach, every unique check_run_id is treated as a separate job.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvements @KarhouTam. Aligned the RFC with your changes:

  • State machine key: Updated from run_id to check_run_idoot:state:{delivery_id}:{repo}:{check_run_id}. Since check_run_id is GitHub-assigned and unique per job execution, it's a better uniqueness key than {job_name}:{run_attempt}.
  • 3-state model: Adopted the DISPATCHED → IN_PROGRESS → COMPLETED flow with no shortcuts. DISPATCHED is repo-level (set by webhook handler), IN_PROGRESS/COMPLETED are job-level per check_run_id. Added your mermaid diagram.
  • New fields: check_run_id, run_id, schema_version added to the workflow interface, DynamoDB/ClickHouse schemas, sample payloads, and the reference implementation.
  • DynamoKey: Switched from {job_name}/{run_attempt} to {job_name}/{check_run_id} for per-execution uniqueness.
  • HUD grouping: Frontend groups by job_name and keeps the latest run_attempt per PR, preventing reruns from appearing as duplicate rows.

run_attempt is still in the dict and stored for informational display, but not used for keying.

> [!NOTE]
> The callback token is not strictly required for L2 (dashboard only). However, L1 is where the token gets minted, and once L1 is deployed and downstream repos depend on the current `client_payload` shape, retrofitting it becomes a breaking change. We recommend adding it now as optional groundwork for L3/L4.

#### State Machine for Status Transitions
Copy link
Copy Markdown

@KarhouTam KarhouTam May 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current state machine differs slightly from this RFC. Here are the state transitions:

According to previous comments in the L2 implementation PR, this state machine has three states:

  • DISPATCHED (set only once at the webhook handler per repository; this state is repo-level and every job needs it to update its state).
  • IN_PROGRESS (job-level, cannot be duplicated; reruns are treated as separate jobs).
  • COMPLETED (job-level, cannot be duplicated; reruns are treated as separate jobs).

Note that the direction graph below is for a single check run, reruns have different check_run_id and are treated as separate jobs, so they won't violate the state machine since they won't have a prior IN_PROGRESS or COMPLETED record.

stateDiagram-v2
    direction LR
    [*] --> DISPATCHED: webhook sends
    DISPATCHED --> IN_PROGRESS: first callback
    IN_PROGRESS --> COMPLETED: completion
    IN_PROGRESS --> IN_PROGRESS: ❌ duplicate
    DISPATCHED --> COMPLETED: ❌ skip IN_PROGRESS
    COMPLETED --> COMPLETED: ❌ duplicate
    COMPLETED --> IN_PROGRESS: ❌ wrong direction
    [*] --> IN_PROGRESS: ❌ no dispatch
    [*] --> COMPLETED: ❌ no dispatch
Loading

State types:

  • DISPATCHED: Repo-level state (check_run_id=DISPATCH_CHECK_RUN_ID) — one per repository.
  • IN_PROGRESS / COMPLETED: Job-level states — independent per job.

Valid flows:

  • Main: DISPATCHED → IN_PROGRESS → COMPLETED (linear)
  • Reruns: IN_PROGRESS → IN_PROGRESS or COMPLETED → IN_PROGRESS (update timestamps)

Rejected:

  • Skipping IN_PROGRESS, duplicate COMPLETED, or missing DISPATCHED state.
  • Replay attack: duplicate IN_PROGRESS (same check_run_id).

}
```

The `extractDynamoRecord` function flattens this into a single record. It uses `started_at` and `completed_at` timestamps reported by the downstream action (wall-clock time when the callback was sent), rather than generating timestamps at HUD. Timing metrics (`queue_time`, `execution_time`) are only set when the relay provides a non-null value, preventing `completed` callbacks from clobbering `queue_time` with `null`:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the retry scenario, queue_time may be an invalid value. I suppose we should not update this value in this case.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @can-gaa-hou. On retries (run_attempt > 1), queue_time = dispatch_timestamp → rerun_in_progress_timestamp could be hours or days old — meaningless as a CI metric.

The fix belongs on the relay side: the relay should set queue_time = null when run_attempt > 1. On the HUD side, we already only write queue_time when the relay provides a non-null value, so no HUD changes are needed.

Updated the RFC to document this behavior.

Copy link
Copy Markdown

@KarhouTam KarhouTam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More inconsistencies found.


The relay's Result Handler validates the callback body and produces a `{trusted, untrusted}` payload:

- `trusted` — relay-generated fields: `verified_repo` (OIDC-proven identity) and `ci_metrics` (relay-measured `queue_time`, `execution_time`)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

downstream_repo_level is included for HUD convenience now.

Values: L1 ~ L4 (string)

Copy link
Copy Markdown

@subinz1 subinz1 May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @KarhouTam. Good call — updated the RFC and reference implementation to source downstream_repo_level from trusted (relay-determined from the allowlist) instead of from the untrusted callback payload. Changes:

RFC:

  • Added downstream_repo_level to the trusted fields description and RelayPayload interface
  • extractDynamoRecord now reads it from trusted.downstream_repo_level
  • Both sample payloads (in_progress and completed) now include "downstream_repo_level": "L2" in the trusted section

Reference implementation (subinz1/test-infra#1):

  • Added downstream_repo_level to RelayTrusted and OotWorkflowJobRecord interfaces
  • extractDynamoRecord extracts it from trusted.downstream_repo_level

The DynamoDB and ClickHouse schemas already had the column from a prior commit — only the ingestion/extraction path was missing.

Comment on lines +253 to +261
workflow: {
status: string;
conclusion?: string | null;
name: string;
url: string;
job_name?: string;
run_attempt?: number;
started_at?: string;
completed_at?: string;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

workflow now is like:

workflow: dict = {
    "schema_version": ...,
    "status": ...,
    "conclusion": ...,
    "name": ...,
    "url": ...,
    "run_attempt": ...,
    "job_name": ...,
    "check_run_id": ...,
    "run_id": ...,
    "started_at": ...,
    "completed_at": ...,
    "test-results": ...
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated field list @KarhouTam. Updated the RFC and reference implementation to align:

  • Added schema_version, check_run_id, and run_id to the workflow interface, DynamoDB/ClickHouse schemas, queries, and sample payloads
  • Switched dynamoKey from run_attempt to check_run_id for per-execution uniqueness
  • Fixed test-results key to use the hyphenated form matching the L2 action (was test_results — would have silently failed in JS)

One question: should test-results stay hyphenated, or would it be cleaner to normalize to test_results in the action? Hyphenated keys require bracket notation in TS (wf["test-results"]). Happy either way — just want to confirm the convention.

Copy link
Copy Markdown

@KarhouTam KarhouTam May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one! Actually, in the latest action.yml, we changed it to workflow["test_results"] already.

https://github.com/KarhouTam/test-infra/blob/c2833bd9eee6f7299311695597ad1e57ad974c7b/.github/actions/cross-repo-ci-relay-callback/action.yml#L121-L126

Sorry for the misleading comment.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @KarhouTam. We've updated both the RFC and reference implementation to use test_results (underscore) — it matches the L2 action's Python code (workflow["test_results"] = ...) and is more convenient for TypeScript (dot notation vs bracket notation). Also aligned test_results to be summary counts only (total/passed/failed/skipped), with detailed test results going via artifact_url as the action describes.

The L2 relay now includes downstream_repo_level (L1–L4) in the trusted
dict, determined from the allowlist rather than self-reported by
downstream. Updated the RelayPayload interface, extractDynamoRecord, and
sample payloads to read it from trusted.
@jewelkm89 jewelkm89 changed the title [WIP] RFC-0054: HUD Integration for Out-of-Tree CI Results RFC-0054: HUD Integration for Out-of-Tree CI Results May 12, 2026
subinz1 added 3 commits May 12, 2026 15:50
Align with updated L2 workflow dict: add schema_version, check_run_id,
and run_id to the workflow interface, DynamoDB schema table, ClickHouse
schema, extractDynamoRecord, and sample payloads. Switch dynamoKey from
run_attempt to check_run_id for per-execution uniqueness. Fix
test-results key to use the hyphenated form matching the L2 action.
queue_time is only meaningful for first attempts (run_attempt == 1).
On retries, the dispatch timestamp is stale and the relay should send
null. HUD already handles this — it only writes queue_time when
non-null. Also fixed a stale dynamoKey reference that still showed
run_attempt instead of check_run_id.
Replace the 2-state model with the L2 PR's strict 3-state machine:
DISPATCHED → IN_PROGRESS → COMPLETED. No shortcuts allowed — missing
IN_PROGRESS or missing DISPATCHED both result in rejection. State key
updated from run_id to check_run_id for per-execution uniqueness.
Added mermaid diagram and updated callback token note to reflect that
DISPATCHED partially addresses dispatch provenance.
subinz1 added 2 commits May 12, 2026 16:33
Use <hardware>, <company name>, <version> placeholders instead of
specific vendor or accelerator names in sample payloads, schema
comments, and motivation text.
The L2 action uses test_results (underscore) in the workflow dict, not
test-results (hyphen). The hyphenated name is only the GHA input name.
Also removed failures array from test_results — the L2 action sends
summary counts only; detailed test results go via artifact_url.
@subinz1 subinz1 force-pushed the oot-hud-integration-rfc branch from f84edea to 9f49809 Compare May 12, 2026 11:57
subinz1 added a commit to subinz1/test-infra that referenced this pull request May 12, 2026
Address @ZainRizvi's review on pytorch/rfcs#96:

- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: removed schema validation from HUD (moved to relay)
- Removed daily budget enforcement
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Timestamps: use downstream-reported started_at/completed_at instead of now()
- Timing metrics: only set queue_time/execution_time when non-null
- ClickHouse schema: added job_name, run_attempt columns
- Queries: select job_name, run_attempt as proper columns
- Frontend: updated interfaces to include new fields
@jewelkm89 jewelkm89 marked this pull request as ready for review May 14, 2026 09:32
@jewelkm89
Copy link
Copy Markdown
Author

jewelkm89 commented May 14, 2026

Hello @albanD, @huydhn and @atalman
Request your review for the RFC as well.
Associated PR is pytorch/test-infra#8069

subinz1 added 4 commits May 15, 2026 10:32
- Fix auth header: X-Hud-Internal-Bot → X-OOT-Relay-Token everywhere
  (mermaid diagram, data flow table, auth table, implementation plan)
- Fix HTTP status: 502 → 500 for DynamoDB write errors
- Fix DynamoDB key format: add job_name and check_run_id segments
- Add missing DynamoDB fields: job_name, check_run_id, run_id, run_attempt
- Update handler code to match actual implementation (relay token auth,
  no validateRelayPayload — validation done by relay)
- Update RelayPayload interface: add downstream_repo_level, schema_version,
  job_name, check_run_id, run_id, run_attempt, artifact_url
- Update extractDynamoRecord: downstream timestamps, Number() coercion
  for run_attempt, artifact_url extraction
- Fix SQL queries to match actual saved queries:
  - oot_summary: add downstream_repo_level
  - oot_backend_dashboard: separate workflow_name/job_name, add per-job fields
  - oot_pr_results: same
- Fix Two-Callback Model: both records replicated to ClickHouse,
  SharedReplacingMergeTree handles dedup (not "completed only")
- Fix Hop 4: replicator inserts both in_progress and completed records
Replace flat-format sample payloads with the actual two-stage format:
- Stage 1 (downstream → relay): callback body with workflow dict,
  as built by the L2 action.yml
- Stage 2 (relay → HUD): {trusted, untrusted} envelope,
  as built by result_handler.py → forward_to_hud()

Includes note about run_attempt string-to-number coercion.
…pped

The L2 action sends test_results with {passed, failed, skipped} but no
total field. Updated the extractDynamoRecord snippet, RelayWorkflow
interface, and sample payloads to reflect this and compute total_tests
as the sum when tr.total is not present.
atalman pushed a commit to pytorch/test-infra that referenced this pull request May 21, 2026
## Summary

Creates the `default.oot_workflow_job` ClickHouse table for storing
out-of-tree backend CI results. Split out from #8069 so the table can be
deployed independently and populated with test data.

The DynamoDB → ClickHouse replicator mapping (`torchci-oot-workflow-job`
→ `default.oot_workflow_job`) is already on main.

### Schema highlights

- `SharedReplacingMergeTree` engine for upsert semantics (in_progress →
completed)
- Ordered by `(downstream_repo, delivery_id, dynamoKey)` for efficient
per-backend and per-delivery queries
- Bloom filter indexes on `status`, `pr_number`, `downstream_repo`
- Timing fields: `queue_time`, `execution_time`, `started_at`,
`completed_at`
- Test result columns: `total_tests`, `passed_tests`, `failed_tests`,
`skipped_tests`

### Related

- Full HUD pipeline PR: #8069
- RFC: pytorch/rfcs#96
- Reference: Similar to #7365 (vLLM ClickHouse table)

## Test Plan

- [ ] Table creation succeeds on ClickHouse
- [ ] Populate with fake data and verify queries from #8069 return
expected results
atalman pushed a commit to pytorch/test-infra that referenced this pull request May 22, 2026
## Author

- @KarhouTam 
- @can-gaa-hou
- @fffrog 

## Summary
- This PR implements the L2 levels of the cross-repository CI relay
described in pytorch/rfcs#90.
- For the previous L1 implementation, please refer to
#7847.
- Please refer to
pytorch/rfcs#90 (comment) for the
overall implementation.
- Please refer to pytorch/rfcs#96 for the design
of HUD side.
- Please refer to #8069 for
the implementation of HUD side.

Higher-level behaviors for `L3` and `L4` are intentionally left for
follow-up work.

## Architecture

The relay is split into two AWS Lambda functions:

- `webhook` lambda function (Updated)
- [x] receives GitHub webhook PR and push events from the upstream repo
- [x] validates webhook signatures and authenticates with AWS Secret
Manager
- [x] reads the downstream whitelist from the URL and stores it in Redis
- [x] for `opened`/`reopened`/`synchronized`/`closed` actions, forwards
repository_dispatch events to downstream repos
 
- `callback` lambda function (Added)
- [x] receives downstream callback payload through a public lambda
function URL
  - [x] validates callback payload with OIDC
- [x] reads the downstream whitelist from the URL and stores it in Redis
- [x] extracts CI result information from the payload and uploads to
PyTorch HUD
- [x] records `queue time` and `execute time` for evolution to `L3` repo

## Changes
```md
..github/
├── workflows/
│   └── _lambda-do-release-runners.yml     # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release
│
└── actions/
    └── cross-repo-ci-relay-callback/
        └── action.yml                     # Composite action used by downstream workflows to report status back to the relay/result endpoint

aws/lambda/cross_repo_ci_relay/
├── tests/                                 # Unit tests for allowlist/config/webhook/result/redis behavior
├── README.md                              # Project overview, local development, callback flow, and result-side validation steps
├── Makefile                               # Top-level local developer entrypoint for test / deploy / clean
├── local_server.py                        # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints
├── requirements.txt                       # Python dependencies required by the relay Lambdas
│
├── utils/
│   ├── allowlist.py                       # Loads, parses, and queries the downstream allowlist by rollout level
│   ├── config.py                          # Shared runtime config loading and cached get_config() helper
│   ├── gh_helper.py                       # GitHub App, repository_dispatch, and GitHub file access helpers
│   ├── hud.py                             # HUD write helpers for downstream result reporting
│   ├── jwt_helper.py                      # Helpers for minting/verifying relay callback tokens
│   ├── redis_helper.py                    # Redis helpers for allowlist cache, OOT state, and timing data
│   └── misc.py                            # Shared TypedDict definitions and HTTPException
│
├── webhook/
│   ├── Makefile                           # Build/package/deploy commands for the webhook Lambda
│   ├── lambda_function.py                 # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events
│   └── event_handler.py                   # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos
│
└── callback/
    ├── Makefile                           # Build/package/deploy commands for the result Lambda
    ├── lambda_function.py                 # Result Lambda entrypoint: verifies callback token and GitHub OIDC token
    └── callback_handler.py                # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD
```

## Usage

See
[README.md](https://github.com/KarhouTam/test-infra/blob/crcr-L2/aws/lambda/cross_repo_ci_relay/README.md)
for more details.

## Verification

We performed the following scenario verification on our AWS Lambda
instance:

- [x] Test with Upstream PR create/reopen/synchronize and push events
triggering webhook, then redispatching to the Downstream CI (different
organization) workflow.
- [x] Test with Downstream workflow send callback payload through the
added action to the result lambda, then extract CI result information
and send to PyTorch HUD.

## Terraform configuration

- pytorch/ci-infra#446

## Unit Tests

- [x] Unit Tests (Mock)

## Security

- **Callback payload carries full upstream webhook data back to HUD** —
`action.yml` builds the callback body by mutating
`github.event.client_payload` (which contains the entire original
webhook payload: PR metadata, commits, author info) and adding
`status`/`conclusion`/`workflow_name`/`workflow_url` on top. This full
blob is forwarded verbatim by `hud.py` to HUD with no relay-side
filtering. HUD receives both relay-trusted `verified_repo` and an
unvalidated body — if HUD trusts self-reported fields inside the body
over `verified_repo`, a manipulated dispatch payload could tamper with
HUD records.

- **Lambda callback URL is public and hardcoded** — The endpoint is
hardcoded in `action.yml and exposed in a public action, making it
trivially discoverable. OIDC verification blocks unauthorized HUD
writes, but the endpoint has no rate limiting; request flooding can
cause Lambda concurrency exhaustion or Redis connection saturation.

- **Only OIDC is used for verification** — The callback lambda relies
solely on GitHub OIDC token verification for authentication, without
additional application-level secrets or signatures. If an attacker
compromises a downstream repo's GitHub Actions permissions, they could
forge authenticated requests to the callback endpoint. Besides, OIDC has
its own limitations (e.g., token expiration, potential
misconfigurations) that could lead to unauthorized access if not
carefully managed.

## HUD Interaction

- **Design Principle: Transparent Relay & Decoupling**
The Relay Server acts as a **lightweight data passthrough layer**. It
does not define or parse specific CI data formats; instead, it offloads
data interpretation and validation to the HUD. This ensures complete
decoupling between the relay infrastructure and business-specific data.

- **Security & Risk Mitigation**
The relay uses **OIDC authentication** to guarantee the authenticity of
the data source (**Verified Repo**). Its core responsibility is to
ensure the data originates from the claimed repository, while security
filtering and content compliance are enforced at the HUD level.

---------

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
Co-authored-by: fffrog <ljw1101.vip@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants