PolicyEngine · anth-volk · May 12, 2026
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -13,3 +13,6 @@ from branches in `PolicyEngine/policyengine-us-data`; never create fork PRs.
 For PRs that change pipeline behavior, stage boundaries, generated artifacts, or
 public library functions, read
 `docs/engineering/skills/documentation_review.md` during review.
+
+For deployed Modal pipeline run status or failure diagnosis, read
+`docs/engineering/skills/pipeline_operations.md`.
diff --git a/.github/workflows/pr.yaml b/.github/workflows/pr.yaml
@@ -149,6 +149,8 @@ jobs:
     env:
       MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
       MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
+      MODAL_PROXY_TOKEN_ID: ${{ secrets.MODAL_PROXY_TOKEN_ID }}
+      MODAL_PROXY_TOKEN_SECRET: ${{ secrets.MODAL_PROXY_TOKEN_SECRET }}
       HUGGING_FACE_TOKEN: ${{ secrets.HUGGING_FACE_TOKEN }}
       GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
       MODAL_ENVIRONMENT: staging-us-data-pr-${{ github.event.pull_request.number }}

diff --git a/AGENTS.md b/AGENTS.md
@@ -17,6 +17,9 @@ When reviewing PRs that change pipeline behavior, stage boundaries, generated
 artifacts, or public library functions, read
 `docs/engineering/skills/documentation_review.md`.
 
+When diagnosing a deployed Modal pipeline run or a failed publication pipeline,
+read `docs/engineering/skills/pipeline_operations.md`.
+
 ## GitHub PRs
 
 Read `docs/engineering/skills/github-prs.md` before opening, replacing, or

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -25,6 +25,9 @@ When reviewing PRs that change pipeline behavior, stage boundaries, generated
 artifacts, or public library functions, read
 `docs/engineering/skills/documentation_review.md`.
 
+When diagnosing a deployed Modal pipeline run or a failed publication pipeline,
+read `docs/engineering/skills/pipeline_operations.md`.
+
 ## Safety boundaries
 
 Do not fabricate data, validation metrics, academic results, or performance

diff --git a/changelog.d/962.added.md b/changelog.d/962.added.md
@@ -0,0 +1 @@
+Add structured Modal pipeline status reporting with durable run-scoped error records.
diff --git a/docs/engineering/skills/README.md b/docs/engineering/skills/README.md
@@ -16,5 +16,7 @@ Current skills:
   conventions.
 - `pipeline_docs.md`: decorator-backed pipeline map maintenance and generated
   pydoc-style artifacts.
+- `pipeline_operations.md`: model-neutral workflow for diagnosing deployed Modal
+  pipeline status and durable error records.
 - `testing.md`: test layout, fixture scope, helper placement, and quality guard
   expectations.
diff --git a/docs/engineering/skills/pipeline_operations.md b/docs/engineering/skills/pipeline_operations.md
@@ -0,0 +1,107 @@
+# Pipeline Operations
+
+Use this skill when diagnosing a deployed Modal pipeline run, especially when a
+GitHub Actions pipeline launch fails or a user asks for the status of a run.
+
+## Source Of Truth
+
+Treat the pipeline status endpoint and run-scoped error records as the first
+diagnostic source. Modal dashboard logs are useful supporting evidence, but they
+are not the durable error record for this repo.
+
+The status system reports:
+
+- the run-level manifest;
+- all stage and substage manifests present for that run;
+- missing expected runtime manifest IDs;
+- the latest durable error record, when one exists;
+- a redacted, bounded traceback when one exists.
+
+## Status Surfaces
+
+The structured status payload is canonical. The pipeline status sub-app exposes
+three Modal functions:
+
+- `get_pipeline_status`: Python-callable structured JSON for agents, scripts,
+  dashboards, and tests. Prefer this for diagnosis and automation.
+- `pipeline_status_endpoint`: protected HTTP endpoint returning the same
+  structured JSON for non-Python clients. Use Modal proxy auth headers.
+- `pipeline_status_snippet`: human-readable text used by
+  `modal run modal_app/pipeline.py::main --action status`. This is for quick
+  terminal inspection only and must not be treated as a schema.
+
+## Fetch Status
+
+First identify the run context from the GitHub Actions summary, workflow logs, or
+run-context output:
+
+- `run_id`
+- Modal app name
+- Modal environment
+
+For agent or CLI diagnosis, call the deployed Modal function:
+
+```bash
+uv run python - <<'PY'
+import json
+import modal
+
+app_name = "POLICYENGINE_US_DATA_MODAL_APP"
+environment_name = "main"
+run_id = "US_DATA_RUN_ID"
+
+fn = modal.Function.from_name(
+    app_name,
+    "get_pipeline_status",
+    environment_name=environment_name,
+)
+print(json.dumps(fn.remote(run_id), indent=2))
+PY
+```
+
+The status payload includes a traceback when one is available. Tracebacks are
+redacted and, when very long, truncated to keep only the newest text.
+
+If the full project environment cannot be synced locally, run the same snippet
+in a temporary Modal-only environment by replacing `uv run python` with
+`uv run --no-sync --with modal python`.
+
+If using the HTTP endpoint, authenticate with Modal proxy auth headers. Do not
+publish or paste proxy auth values into PRs, issues, logs, or docs.
+
+```bash
+curl \
+  -H "Modal-Key: $MODAL_PROXY_TOKEN_ID" \
+  -H "Modal-Secret: $MODAL_PROXY_TOKEN_SECRET" \
+  "https://<status-endpoint>.modal.run?run_id=<run_id>"
+```
+
+## Interpret Results
+
+Use `status` and `message` for the short answer. Then inspect:
+
+- `error.stage_id`: canonical top-level stage, such as `3_fit_weights`;
+- `error.substage_id`: narrower substage, such as
+  `3a_weight_fitting_regional`;
+- `error.record_path`: immutable error record path in the pipeline volume;
+- `error.latest_path`: latest error pointer for the run;
+- `stage_manifests[].manifest.error`: manifest-local failure details;
+- `missing_expected_manifest_ids`: expected runtime manifests that have not yet
+  been written.
+
+When reporting back, name the failing stage and substage, summarize the exception
+type and message, and state whether the traceback came from the status endpoint or
+from Modal dashboard logs.
+
+## Safety Rules
+
+- Do not paste tracebacks into PRs, issues, or chat unless the user needs that
+  detail.
+- Redact secrets before sharing command output, even though the status endpoint
+  already applies obvious redaction.
+- Do not infer that a missing later-stage manifest is a failure if the run is
+  still running.
+- If the run was hard-killed before Python exception handling ran, the endpoint
+  may show a running run with no durable error. In that case, report the last
+  completed/running manifest and then use Modal dashboard logs as secondary
+  evidence.
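
Reviewer note on `docs/engineering/skills/pipeline_operations.md`: the "Interpret Results" and "Safety Rules" sections could be illustrated with a small helper that turns the structured payload into a short report. The following is a minimal sketch, not part of this PR. It assumes the payload is a plain dict carrying the field names listed in the skill (`status`, `message`, `error.stage_id`, `error.substage_id`, `error.record_path`, `missing_expected_manifest_ids`). The function name `summarize_status`, the status values `"running"` and `"failed"`, and every value in the example payload are hypothetical.

```python
from typing import Any


def summarize_status(payload: dict[str, Any]) -> str:
    """Return a short plain-text report built from a status payload.

    Assumes the dict shape described in pipeline_operations.md and that a
    run still in progress reports status == "running" (hypothetical value).
    """
    status = payload.get("status", "unknown")
    lines = [f"status: {status}", f"message: {payload.get('message', '')}"]

    # Treat an absent or null "error" as "no durable error record exists".
    error = payload.get("error") or {}
    if error:
        lines.append(f"failing stage: {error.get('stage_id', 'unknown')}")
        lines.append(f"failing substage: {error.get('substage_id', 'unknown')}")
        lines.append(f"error record: {error.get('record_path', 'unknown')}")

    missing = payload.get("missing_expected_manifest_ids") or []
    if missing:
        # A missing later-stage manifest is not a failure while the run is live.
        label = "pending manifests" if status == "running" else "missing manifests"
        lines.append(f"{label}: {', '.join(missing)}")

    return "\n".join(lines)


# Illustrative payload only; the paths and the "4_publish" ID are made up.
example = {
    "status": "failed",
    "message": "weight fitting raised an exception",
    "error": {
        "stage_id": "3_fit_weights",
        "substage_id": "3a_weight_fitting_regional",
        "record_path": "runs/example/errors/0001.json",
    },
    "missing_expected_manifest_ids": ["4_publish"],
}
print(summarize_status(example))
```

The helper deliberately leaves the traceback out of the report, in line with the rule against pasting tracebacks unless the user needs that detail.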