3 changes: 3 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -13,3 +13,6 @@ from branches in `PolicyEngine/policyengine-us-data`; never create fork PRs.
For PRs that change pipeline behavior, stage boundaries, generated artifacts, or
public library functions, read
`docs/engineering/skills/documentation_review.md` during review.

For deployed Modal pipeline run status or failure diagnosis, read
`docs/engineering/skills/pipeline_operations.md`.
2 changes: 2 additions & 0 deletions .github/workflows/pr.yaml
@@ -149,6 +149,8 @@ jobs:
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
MODAL_PROXY_TOKEN_ID: ${{ secrets.MODAL_PROXY_TOKEN_ID }}
MODAL_PROXY_TOKEN_SECRET: ${{ secrets.MODAL_PROXY_TOKEN_SECRET }}
HUGGING_FACE_TOKEN: ${{ secrets.HUGGING_FACE_TOKEN }}
GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
MODAL_ENVIRONMENT: staging-us-data-pr-${{ github.event.pull_request.number }}
3 changes: 3 additions & 0 deletions AGENTS.md
@@ -17,6 +17,9 @@ When reviewing PRs that change pipeline behavior, stage boundaries, generated
artifacts, or public library functions, read
`docs/engineering/skills/documentation_review.md`.

When diagnosing a deployed Modal pipeline run or a failed publication pipeline,
read `docs/engineering/skills/pipeline_operations.md`.

## GitHub PRs

Read `docs/engineering/skills/github-prs.md` before opening, replacing, or
3 changes: 3 additions & 0 deletions CLAUDE.md
@@ -25,6 +25,9 @@ When reviewing PRs that change pipeline behavior, stage boundaries, generated
artifacts, or public library functions, read
`docs/engineering/skills/documentation_review.md`.

When diagnosing a deployed Modal pipeline run or a failed publication pipeline,
read `docs/engineering/skills/pipeline_operations.md`.

## Safety boundaries

Do not fabricate data, validation metrics, academic results, or performance
1 change: 1 addition & 0 deletions changelog.d/962.added.md
@@ -0,0 +1 @@
Add structured Modal pipeline status reporting with durable run-scoped error records.
2 changes: 2 additions & 0 deletions docs/engineering/skills/README.md
@@ -16,5 +16,7 @@ Current skills:
conventions.
- `pipeline_docs.md`: decorator-backed pipeline map maintenance and generated
pydoc-style artifacts.
- `pipeline_operations.md`: model-neutral workflow for diagnosing deployed Modal
pipeline status and durable error records.
- `testing.md`: test layout, fixture scope, helper placement, and quality guard
expectations.
107 changes: 107 additions & 0 deletions docs/engineering/skills/pipeline_operations.md
@@ -0,0 +1,107 @@
# Pipeline Operations

Use this skill when diagnosing a deployed Modal pipeline run, especially when a
GitHub Actions pipeline launch fails or a user asks for the status of a run.

## Source Of Truth

Treat the pipeline status endpoint and run-scoped error records as the first
diagnostic source. Modal dashboard logs are useful supporting evidence, but they
are not the durable error record for this repo.

The status system reports:

- the run-level manifest;
- all stage and substage manifests present for that run;
- missing expected runtime manifest IDs;
- the latest durable error record, when one exists;
- a redacted, bounded traceback when one exists.

## Status Surfaces

The structured status payload is canonical. The pipeline status sub-app exposes
three Modal functions:

- `get_pipeline_status`: Python-callable structured JSON for agents, scripts,
dashboards, and tests. Prefer this for diagnosis and automation.
- `pipeline_status_endpoint`: protected HTTP endpoint returning the same
structured JSON for non-Python clients. Use Modal proxy auth headers.
- `pipeline_status_snippet`: human-readable text used by
`modal run modal_app/pipeline.py::main --action status`. This is for quick
terminal inspection only and must not be treated as a schema.

## Fetch Status

First identify the run context from the GitHub Actions summary, workflow logs, or
run-context output:

- `run_id`
- Modal app name
- Modal environment

For agent or CLI diagnosis, call the deployed Modal function:

```bash
uv run python - <<'PY'
import json
import modal

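# Replace these with the values from the run context identified above.
app_name = "POLICYENGINE_US_DATA_MODAL_APP"
environment_name = "main"
run_id = "US_DATA_RUN_ID"

# Look up the deployed status function by app name and environment.
fn = modal.Function.from_name(
    app_name,
    "get_pipeline_status",
    environment_name=environment_name,
)
# Call it remotely and pretty-print the structured payload.
print(json.dumps(fn.remote(run_id), indent=2))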
PY
```

The status payload includes a traceback when one is available. Tracebacks are
redacted, and very long ones are truncated so that only the newest text is kept.
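
As a rough illustration of what keeping the newest text means, the sketch below
truncates a long traceback from the front so that the final frames and the
exception line survive. The function name `bound_traceback`, the `max_chars`
default, and the truncation marker are assumptions made for this example; the
real limit and the redaction rules are defined by the pipeline code.

```python
def bound_traceback(text: str, max_chars: int = 4000) -> str:
    """Keep only the newest (last) max_chars characters of a traceback.

    Hypothetical helper: the name, default limit, and marker are illustrative.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if len(text) <= max_chars:
        # Short tracebacks pass through unchanged.
        return text
    # The exception type and message sit at the end, so drop the oldest text.
    marker = "[... older traceback text truncated ...]\n"
    return marker + text[-max_chars:]
```

Truncating from the front is the useful direction here because the last lines
of a Python traceback name the exception and the frame that raised it.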

If the local environment cannot sync the full project environment, use the same
snippet with a Modal-only temporary environment by replacing `uv run python`
with `uv run --no-sync --with modal python`.

If using the HTTP endpoint, authenticate with Modal proxy auth headers. Do not
publish or paste proxy auth values into PRs, issues, logs, or docs.

```bash
curl \
-H "Modal-Key: $MODAL_PROXY_TOKEN_ID" \
-H "Modal-Secret: $MODAL_PROXY_TOKEN_SECRET" \
"https://<status-endpoint>.modal.run?run_id=<run_id>"
```
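
For scripts that cannot shell out to `curl`, the same request can be sketched in
Python with only the standard library. This is a minimal sketch, not part of the
repo: `build_status_request` and `fetch_status` are hypothetical helper names,
and the endpoint URL must be supplied by the caller because it is not given
here. The header names and environment variables match the `curl` example above.

```python
import json
import os
import urllib.parse
import urllib.request


def build_status_request(endpoint_url: str, run_id: str) -> urllib.request.Request:
    """Build an authenticated GET request for the status endpoint (hypothetical helper)."""
    query = urllib.parse.urlencode({"run_id": run_id})
    return urllib.request.Request(
        f"{endpoint_url}?{query}",
        headers={
            # Proxy auth values come from the environment; never hard-code them.
            "Modal-Key": os.environ["MODAL_PROXY_TOKEN_ID"],
            "Modal-Secret": os.environ["MODAL_PROXY_TOKEN_SECRET"],
        },
    )


def fetch_status(endpoint_url: str, run_id: str, timeout: float = 30.0) -> dict:
    """Fetch and decode the structured status payload (hypothetical helper)."""
    request = build_status_request(endpoint_url, run_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating request construction from the network call keeps the auth handling
testable without contacting Modal.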

## Interpret Results

Use `status` and `message` for the short answer. Then inspect:

- `error.stage_id`: canonical top-level stage, such as `3_fit_weights`;
- `error.substage_id`: narrower substage, such as
`3a_weight_fitting_regional`;
- `error.record_path`: immutable error record path in the pipeline volume;
- `error.latest_path`: latest error pointer for the run;
- `stage_manifests[].manifest.error`: manifest-local failure details;
- `missing_expected_manifest_ids`: expected runtime manifests that have not yet
been written.

When reporting back, name the failing stage and substage, summarize the exception
type and message, and cite whether the traceback came from the status endpoint or
from Modal dashboard logs.
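
The fields above can be pulled into a short report along these lines. This is a
sketch under assumptions: it treats the payload as a plain `dict` with the keys
listed in this section, `summarize_status` is a hypothetical name, and every
lookup is defensive because the full schema is not specified here.

```python
def summarize_status(payload: dict) -> str:
    """Build a short text report from a status payload (hypothetical helper)."""
    lines = [
        f"status: {payload.get('status')}",
        f"message: {payload.get('message')}",
    ]
    # Run-level durable error record, when one exists.
    error = payload.get("error") or {}
    if error:
        lines.append(f"failing stage: {error.get('stage_id')}")
        lines.append(f"failing substage: {error.get('substage_id')}")
        lines.append(f"error record: {error.get('record_path')}")
    # Manifest-local failure details.
    for entry in payload.get("stage_manifests") or []:
        manifest = entry.get("manifest") or {}
        if manifest.get("error"):
            lines.append(f"manifest error: {manifest['error']}")
    # Expected runtime manifests that have not been written yet.
    missing = payload.get("missing_expected_manifest_ids") or []
    if missing:
        lines.append("manifests not yet written: " + ", ".join(map(str, missing)))
    return "\n".join(lines)
```

The report deliberately omits the traceback so that it is safe to paste by
default; add it only when the reader needs that detail.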

## Safety Rules

- Do not paste tracebacks into PRs, issues, or chat unless the user needs that
detail.
- Redact secrets before sharing command output, even though the status endpoint
already applies obvious redaction.
- Do not infer that a missing later-stage manifest is a failure if the run is
still running.
- If the run was hard-killed before Python exception handling ran, the endpoint
may show a running run with no durable error. In that case, report the last
completed/running manifest and then use Modal dashboard logs as secondary
evidence.
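
The last two rules can be sketched as a small triage helper. This is an
illustration only: `triage` is a hypothetical name, and the `"running"` status
string is an assumption, since the set of status values is not listed here.

```python
def triage(payload: dict, running_status: str = "running") -> str:
    """Classify a status payload for reporting (hypothetical helper).

    The running_status default is an assumed value, not a documented one.
    """
    has_error = bool(payload.get("error"))
    missing = payload.get("missing_expected_manifest_ids") or []
    is_running = payload.get("status") == running_status

    if has_error:
        # A durable error record exists: report its stage and substage.
        return "durable-error"
    if is_running:
        # Missing manifests are pending work here, not failures. A hard-killed
        # run looks the same, so check Modal dashboard logs if it seems stalled.
        return "running-or-hard-killed"
    if missing:
        # Not running, no error record, yet expected manifests are absent.
        return "incomplete"
    return "no-issue-found"
```

The payload alone cannot distinguish a healthy running run from one that was
hard-killed, which is why that branch points to the dashboard logs as secondary
evidence.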