Cross-pipeline integration-test framework for GFW data pipelines (pipe-gaps, anchorages_pipeline/port-visits, pipe-events/fishing).
The dit CLI drives Python workflow files that orchestrate phases (pipeline invocations) across modes (trigger patterns: bf / bfd / bftruncate / mutate-recover) and assert equivalence on the resulting BQ tables via table-check summary.
The framework is intentionally thin — a small library plus per-pipeline workflow files. Capabilities, in scope today:
- Pipeline-agnostic runners (
dit.runners.docker,dit.runners.dataflow). The docker runner invokes a published or locally-built image; the dataflow runner submits an in-process Beam pipeline with lock-split submit/wait. Both used by Phase 1; the docker runner doubles for "Beam-in-container submits to Dataflow" workflows (port-visits' shape) and for BQ-SQL-in-container workflows (pipe-events' shape —entrypoint,volumesfor a GCP auth named volume, configurable composeservice). - Three pipeline workflows today:
workflows/pipe_gaps/mode_equivalence.py(detect-gaps; in-process Beam; SCD-2 compare),workflows/port_visits/ais.py(port-visits; Beam-via-container; truncate compare onvisit_id),workflows/pipe_events/fishing.py(fishing events; BQ-SQL-via-container; truncate compare onevent_id). Each is a faithful port of that pipeline's bash/CLI mode-equivalence test plus the automated cross-mode comparison. Seeworkflows/README.md. - Comparison shim (
dit.compare.compare_tables) — thin wrapper overtable-check summaryfrom thetable_identical_checksrepo. Per-column tolerances,view_suffixfor SCD-2 last-versions vs. truncate-shape,keysfor the comparison join. - BQ + date utilities (
dit.bq,dit.dates) — drop tables by prefix, query for restricted ssvids, snapshot a table or whole dataset for source-data pinning across cross-version runs, half-open date iteration. Used by workflows that need pre/post-run setup or computed inputs. - Library-first: anything you can do via
dit run …you can also do by importingdit.*from a pytest target or another Python script. The CLI is one consumer of the library, not the only one. - Workflow file conventions: per-pipeline workflows live in
workflows/<pipeline>/<name>.pyand exposemain(argv) -> int. Output tables tagged with<experiment_id>_<commit>_<uuid>for provenance;--experiment-id/DIT_EXPERIMENT_IDclusters N runs (e.g.pipe-gaps@mainvs@pr-NNN) under one BQ-prefix-scannable slug, defaulting tosolo_<6-hex>when unset. Every run executes a committed git ref — a dirty tree is auto-snapshotted to a pushedrefs/dit-snapshots/<pipeline>/<sha>ref and run reproducibly (no--allow-dirty-tree; seedocs/no-dirty-tree-pivot.md). - Three install modes per pipeline: editable (fast inner loop), specific-ref (
REF=<sha-or-branch>), and snapshot (git stash create→ anchored on adit-snapshot-<epoch>branch). See Usage § Install modes below. - Per-user infra knobs via
DIT_*env vars:DIT_DEST_DATASET,DIT_DATAFLOW_SA,DIT_DATAFLOW_REGION,DIT_DATAFLOW_TEMP_BUCKET,DIT_DATAFLOW_SUBNETWORK,DIT_BQ_TEMP_DATASET. Plays cleanly with direnv via.envrc.example. - Pipeline integration contract (
docs/pipeline-contract.md) — what a pipeline must expose to be cleanly testable bydit, with an adoption matrix tracking where each current pipeline stands. - Cross-version experiments (
workflows/port_visits/cross_version_ais.py) — pin source data via BQ snapshots at a fixed timestamp, run a workflow at N pipeline-version bindings (git refs in the pipeline checkout), then diff corresponding output tables pairwise. Foundation for PR-validation comparisons (pipe-anchorages@mainvs@pr-NNN). - Content-addressable run cache (
dit.cache+world-fishing-827.tech_great_expectations.dit_runs) — everyexecute_*mode hashes(pipeline_commit, worker_image_digest, workflow_file_sha1, params)and looks up the run-cache table. On hit (output tables still present), the workflow skips Dataflow entirely and uses the cached FQN for downstream comparisons; on miss, it runs and inserts a row with output tables + provenance. The same table also serves as registry + cleanup source formake dit-cancel. Wired into both workflows — pipe-gaps (2026-05-22) and port-visits (M5b, 2026-05-29). Design:docs/run-cache.md; milestone tracker:docs/run-cache-impl.md. - Run cleanup control plane (
dit cache-cancel RUN_ID/make dit-cancel RUN_ID=<id>,dit.cache.cancel_run) — cancel all sibling modes of a run together: discovers the run's Dataflow jobs by thedit_run_idlabel both workflows stamp (the storeddataflow_job_idsare always empty), cancels the still-running ones, deletes the recorded output tables (table-level only — dataset-shaped or malformed values are skipped, so adit_exp_*snapshot dataset can never be touched), and marks the rowscancelled. Idempotent. Landed M5a, 2026-05-29. - Centralised Dataflow job-name builder (
dit.job_names.make_job_name) — single source of truth for thedit-<repo>-<step>-<exp>-<binding?>-<mode?>-<N?>-<M?>shape, with 63-char truncation that preserves the load-bearing tail. Both workflows use it; the prefix lets a single label-filter find every dit-launched Dataflow job in the UI. - Auto-built worker images (
dit.worker_image.ensure_worker_image) — Dataflow runs load pipeline code from two sources (submitter from/workspace, workers from the--worker-imagecontainer). When a run executes unreviewed code against the default worker image, the workers would otherwise run the stale published code;ensure_worker_imagebuilds a content-addressable worker image (gcr.io/world-fishing-827/dit/<pipeline>:dit-<commit>) from the source via a kaniko Cloud Build and points the run at it. Idempotent (existing tag → skip); no-op for reviewed code, an explicit--worker-image, or the docker runner. Replaces the oldwarn_if_worker_image_misses_dirty_treewarning (M-pivot-4) — closing the gap rather than flagging it. - Cloud Build ad-hoc runtime (
cloudbuild-dit.yaml,make dit-cloud,make publish-ditbox) — submit a workflow run to Cloud Build with one command. The pipeline checkout flows through as the build source; dit is cloned fresh per run at a configurable ref. ditbox builds via kaniko with a registry-backed layer cache; per-run pipeline installs useuv(~10× faster than pip).make dit-cloudis fire-and-forget by default (gcloud builds submit --async): it snapshots (if dirty), uploads the source, and triggers the build, returning a build id in seconds — the worker-image build + Dataflow run happen cloud-side. PassWAIT=1to stream logs and block on the result (for CI exit codes or live debugging). The_BEAM_VERSIONsubstitution pins apache-beam to match the worker image's SDK (per-pipeline defaults in the Makefile:pipe-gaps=2.71.0,anchorages_pipeline=2.69.0). Validated end-to-end 2026-05-22 — caught a realpipe-gaps:v0.9.6mode-equivalence bug and verified the fix against a custom-published worker image. PR-trigger integration in each pipeline repo comes as a follow-up.
Pipeline-shape primitives (Phase/Mode/Mutation/Oracle dataclasses, mutation library, phase-sharing via BQ COPY, golden-table regression mode) are deliberately not extracted yet — and the Phase 3 three-consumer review (2026-05-29) made the call explicit for the date-slice helpers too: the three execute_bf/bfd/bftruncate are <20% similar in execution and even their date math diverges (different daily-window conventions), so dit.phases stays deferred. Reasoning recorded in workflows/README.md. See Roadmap below for why and when.
docs/architecture.md— visual reference: repo ownership, run modes, workflow flows, Cloud Build runtime, image namespace. Mermaid diagrams that render on GitHub.docs/plan.md— implementation plan, three-repo split, public API contracts, Phase 1 task breakdown.docs/conventions.md— prod-infra boundary, dit image namespace (gcr.io/world-fishing-827/dit/*), standard build-and-push workflow.docs/pipeline-contract.md— what a GFW pipeline must expose to be cleanly integration-testable; adoption matrix for the three current pipelines. Audience: pipeline maintainers.docs/context.md— background, source bugs the framework caught, branch state at handoff.docs/framework-vision.md— long-term shape (don't optimise for it; Phase 1 stays imperative).CLAUDE.md— working agreements and Plan changelog.
python3 -m venv venv && source venv/bin/activate
make install-pipe-gaps # or install-port-visits / install-pipe-events / install-alldit is pipeline-agnostic; workflow dependencies (pipe-gaps, anchorages_pipeline, pipe-events) are not in dit's base requirements.txt. By default the Makefile assumes sibling checkouts ($(realpath ..)). If yours live elsewhere, prepend PROJECTS=/path to any target or copy .envrc.example → .envrc (gitignored; loaded by direnv).
For the framework only (no workflow deps), make install works — but the dataflow runner won't load without a workflow install bringing apache-beam[gcp] transitively.
| When | Target | What happens |
|---|---|---|
| Active dev on a pipeline (fast inner loop) | make install-<pipeline> |
pip install -e <pipeline-dir> — working-tree edits picked up immediately. |
| Reproducible run against a specific committed ref | make install-<pipeline>-ref REF=<sha-or-branch> |
pip install --force-reinstall --no-deps git+file://...@<ref> — non-editable, exactly that commit, ~5-10s per ref. |
| Test what's currently in the working tree, reproducibly | make snapshot-<pipeline> |
Builds a deterministic orphan commit from the dirty tracked-files (temp-index git write-tree + git commit-tree with frozen dates/identity) → pushes to refs/dit-snapshots/<pipeline>/<commit-short-sha> on origin → installs from that ref. Identical tree → identical SHA → cache hits on repeat invocations. Parent SHA recorded in the commit message. |
| Pipeline's transitive deps changed in target ref | add FULLDEPS=1 |
Drops --no-deps, lets pip reinstall the full dep tree (slower; only needed when the target ref bumped or added a dep). |
| Remove a specific snapshot ref (secret-leak remediation only) | make clean-snapshot PIPELINE=<name> REF=<sha-or-full-ref> |
Deletes the ref locally and on origin in one step. Snapshots otherwise live forever by design (bytes-scale, hidden namespace). |
Notes on snapshot mode: only files already tracked in HEAD are captured (modifications + deletions). Untracked files — including files you git add-ed but haven't git commit-ed — are not included; commit a new file first if you want it in the snapshot. This is the deliberate safety default (see § Safety in Scenario B below); auto-push goes to the pipeline's origin (potentially public), so the capture boundary is "things you've explicitly chosen to track" rather than "anything in the working tree". Requires git push permission on the pipeline's origin and gitleaks on PATH (or DIT_SKIP_SECRET_SCAN=1 to bypass with a loud banner).
dit run workflows/<pipeline>/<workflow>.py [workflow-args...]The CLI loads the Python module at the given path and invokes its main(argv) -> int entry point. Workflows can also be imported directly from a pytest target — dit is library-first, CLI-second.
Example (Phase 2 AIS-staging, the verified mode-equivalence test for port-visits):
dit run workflows/port_visits/ais.py --runner dataflow --parallel --build-from-sourceExample (Phase 3 pipe-events fishing events — a BQ-SQL pipeline run via docker; modes are _SESSION-isolated so --parallel is safe):
# One-time GCP auth into the shared docker named volume (in the pipe-events checkout):
docker volume create gcp
docker compose run gcloud auth application-default login
# Then run the three-mode equivalence test (compares the _fishing_events views on event_id):
dit run workflows/pipe_events/fishing.py --parallel --build-from-sourceThe workflow authenticates by mounting that gcp named volume at /root/.config (via the docker runner's volumes param). It runs the 4-step incremental fishing-events chain per mode and asserts the three modes (1_bf / 2_bfd / 3_bftruncate) produce identical _fishing_events (and _product_events_fishing) output — the comparison the original bash never had.
When the same workflow runs inside Cloud Build (make dit-cloud), there is no gcp named volume to mount; instead the docker runner picks up DIT_CLOUD_MODE=1 (set by cloudbuild-dit.yaml) and adds --network=host so the inner container shares the build VM's network namespace and reaches Cloud Build's metadata server for ADC — same mechanism prod uses via GKE Workload Identity, no on-disk credentials. The workflow code is unchanged across laptop / cloud — see docs/conventions.md § "Auth in the cloud path (ditbox)" for the three-context table.
dit covers a small set of orthogonal axes, summarised first, then walked through as concrete scenarios. Best-practice paths are flagged ⭐.
| Axis | Choices |
|---|---|
| Where | local (dit run …) / Cloud Build (make dit-cloud …) |
| Pipeline ref | clean HEAD of pipeline checkout / auto-created snapshot under refs/dit-snapshots/<pipeline>/<commit-short-sha> (content-addressable; one ref per distinct working-tree state) / a specific committed ref (make install-<pipeline>-ref REF=… or make dit-cloud REF=…) / a PR head SHA (automated, future) |
| Worker image | published default (e.g. pipe-gaps:v0.9.6) / custom-built from a ref (for changes that touch worker code) |
| Single vs cross-version | one ref / N refs compared (workflows/port_visits/cross_version_ais.py) |
| Trigger | manual CLI / PR event (automated, future) / scheduled (future) |
No-dirty-tree policy (see
docs/no-dirty-tree-pivot.md): every dit run executes a committed git ref. If your working tree is dirty when you invoke dit, the workflow auto-snapshots torefs/dit-snapshots/<pipeline>/<commit-short-sha>(content-addressable — identical tree state always produces the same SHA, so repeat runs of unchanged uncommitted code hit the cache), auto-pushes, and uses that ref — no--allow-dirty-treeflag, no special-case logic, every cache row reproducible. Unreviewed code (snapshot / unmerged branch) also auto-builds a matching Dataflow worker image so the workers actually run it.
You're working on a pipeline change; you commit it; push to a branch; open a PR. dit fires automatically and compares the PR against main's cached output.
$ cd $PROJECTS/pipe-gaps
$ git checkout -b fix/PIPELINE-1465-tiebreaker
$ git commit -am "Add deterministic tiebreaker to message sort"
$ git push -u origin fix/PIPELINE-1465-tiebreaker
$ gh pr create --title "Fix PIPELINE-1465 tiebreaker" --body "..."
# ← dit triggers here, automatically
# ← Check Run appears on the PR with the verdict
When to use: default for any pipeline change you intend to land. No commands beyond your normal git workflow. dit hits the cache for main's side (it's already been computed), runs the PR side fresh, posts the diff as a Check Run.
You're mid-development; you want feedback before opening a PR. dit detects the dirty tree, auto-snapshots, auto-pushes, runs against the snapshot. Your working tree is untouched.
$ cd $PROJECTS/pipe-gaps && vim src/... # edits, not committed
$ cd $PROJECTS/data_integration_tests
$ make dit-cloud PIPELINE=pipe-gaps WORKFLOW=workflows/pipe_gaps/mode_equivalence.py
# auto-snapshots → refs/dit-snapshots/pipe-gaps/<commit-short-sha>
# auto-pushes
# runs against the snapshot
When to use: iterating on a fix that isn't PR-ready (e.g. making a pipeline idempotent across modes). Each edit produces a fresh snapshot (new cache key, new MISS, full Dataflow workload). Once you're happy, commit + push the same code to a real branch + open a PR (Scenario A).
Caveats printed at snapshot time:
- First run against each new tree-state is a cache MISS; repeat runs of an unchanged tree are a HIT (snapshot SHA is content-addressable — derived from
git write-treewith frozen author/committer dates, so identical tree → identical SHA → identical cache key). - Only tracked modifications + deletions are captured. Untracked files (and files you
git add-ed but haven'tgit commit-ed yet) are never in the snapshot — see § "Safety: auto-push and credential leakage" below for why this is the deliberate default. Commit a new file before running dit if you want it included. - The cache row is tagged
unreviewed_code=TRUEso PR-validation queries skip it. - If your changes touch worker code, build+push a custom worker image too (
--worker-image=…). - Requires
git pushpermission on the pipeline repo. If you're a read-only viewer, dit will fail at push time with a clear error pointing atmake install-<pipeline>-ref REF=<committed-ref>or asking you to commit + push the changes via a normal branch first.
⚠ Safety: auto-push and credential leakage.
make dit-cloudpushes the snapshot to the pipeline repo's origin automatically. The pipeline repo may be a public GitHub repository. Treat every snapshot as potentially publicly visible.Three defense layers, smallest blast radius first:
git add -u(tracked-only) is the default. Only modifications + deletions to files already in HEAD enter the snapshot. Untracked files (and files yougit add-ed but haven'tgit commit-ed yet) stay out. Rogue.env, downloadedsa.json, one-offquery_results.csv, etc. are not captured.- Pre-push banner shows you the changed paths before the push happens. Visual review is the last-line-of-defence; if anything surprises you, Ctrl-C immediately.
- Pre-push secret scanner (gitleaks). Before each push, the script extracts the snapshot tree to a temp dir and runs
gitleaks detect. A finding aborts the snapshot. Required by default: if gitleaks isn't installed and no bypass env var is set, the snapshot refuses to proceed (brew install gitleaks/ one-line install instructions in the error). Override isexport DIT_SKIP_SECRET_SCAN=1and emits a loudWARNING: secret scan BYPASSEDbanner — for false positives only, never as a "make this work" shortcut. Ditbox has gitleaks pre-baked.Remediation if a snapshot containing secrets has already landed on origin:
- Rotate the credential. Load-bearing step. Anything ever pushed to a public-shaped repo must be treated as compromised (history snapshots, forks, mirrors, indexers all make untoward pushes effectively permanent).
make clean-snapshot PIPELINE=<name> REF=<sha>— deletes the ref locally + on origin. Necessary but not sufficient on its own.If a tracked file contains a secret (the scanner caught one, or you noticed in the banner), the structural fix is
git rm --cached <file>+ add to.gitignore+ commit the removal — then the next snapshot won't include it.
You want to test a specific committed ref without changing what's installed.
$ make dit-cloud PIPELINE=pipe-gaps REF=main WORKFLOW=workflows/pipe_gaps/mode_equivalence.py
# or:
$ make dit-cloud PIPELINE=pipe-gaps REF=fix/PIPELINE-1465-tiebreaker WORKFLOW=...
# or to install locally and run from your venv (e.g. for ad-hoc inspection, or to drive Dataflow without Cloud Build):
$ make install-pipe-gaps-ref REF=fix/PIPELINE-1465-tiebreaker
$ dit run workflows/pipe_gaps/mode_equivalence.py --runner dataflow ...
When to use: testing main as a regression baseline; reviewing a teammate's PR locally; reproducing a historical run from a dit_runs row's pipeline_commit.
You have a fix and want to verify it doesn't regress anywhere it shouldn't.
$ dit run workflows/port_visits/cross_version_ais.py \
--experiment-id pipeline-1465 \
--pin-source-at 2026-05-15T10:00:00Z \
--binding before=main \
--binding after=fix/PIPELINE-1465-tiebreaker \
--binding-worker-image after=gcr.io/world-fishing-827/dit/pipe-anchorages:fix-tiebreaker \
--runner dataflow --parallel --build-from-source
When to use: fix-verification before PR (or as a separate verification run alongside Scenario A). Each binding runs against a frozen source snapshot at --pin-source-at; outputs are diffed pairwise.
Per-binding --worker-image is important when the change touches worker code — see memory [[submitter-vs-worker-split]].
A previous cache row says pipeline_commit=<sha>. You want to rerun against that exact code.
$ bq query --use_legacy_sql=false --project_id=world-fishing-827 \
'SELECT pipeline_commit, params_json FROM `world-fishing-827.tech_great_expectations.dit_runs`
WHERE run_id = "<rid>"'
$ make dit-cloud PIPELINE=pipe-gaps REF=<sha> WORKFLOW=workflows/pipe_gaps/mode_equivalence.py ARGS="..."
When to use: investigating a past diff; auditing a result; sanity-checking that a snapshot ref is still fetchable. Because every dit run executes a pushed ref, this works for anyone with read access to the pipeline repo — the row is genuinely portable.
$ make dit-cancel RUN_ID=<rid> # or: dit cache-cancel <rid> [--region <r>]
Discovers the run's Dataflow jobs by the dit_run_id label dit stamps on every Dataflow job (not the stored dataflow_job_ids, which are always empty), cancels the still-running ones, drops the run's recorded output tables (table-level deletes only — never a dataset), and marks all the run's dit_runs rows cancelled. Cancels all sibling modes together. Idempotent. Region defaults to DIT_DATAFLOW_REGION → us-central1.
You can manually invoke the snapshot step before running dit. Equivalent to letting make dit-cloud auto-snapshot, but useful when you want to inspect the snapshot ref before the run, or when running multiple dit invocations against the same uncommitted code base (the second run hits cache against the first).
$ make snapshot-pipe-gaps
# Prints: Created refs/dit-snapshots/pipe-gaps/<commit-short-sha>; pushed to origin
# (or: ref already exists on origin, skipping push — content-addressable).
$ make dit-cloud PIPELINE=pipe-gaps REF=refs/dit-snapshots/pipe-gaps/<that> ...
Snapshots live forever by design — bytes-scale storage on origin in a hidden ref namespace, no measurable cost. There's no periodic-cleanup target.
The one exception:
$ make clean-snapshot REF=refs/dit-snapshots/pipe-gaps/<sha>
Deletes the specified snapshot ref locally and on origin in one step. Intended for secret-leak remediation — e.g. you noticed a .env file or a credential was tracked in the working tree when you ran make dit-cloud. Surgical, user-invoked. The cache row in dit_runs retains pipeline_commit_parent, so removing the snapshot ref doesn't lose reproduce context for the rest of the row's lifetime.
| Question | Answer | Scenario |
|---|---|---|
| Are you opening a PR? | Yes | A (best) |
| Not yet — iterating | B (auto-snapshot) | |
| Testing an existing ref | C | |
| Are you comparing two refs against pinned source? | Yes | D |
| Are you reproducing or auditing a past run? | Yes | E |
| Need to clean up a stuck run? | Yes | F |
Each phase below is a short summary of what's planned and where we are. The canonical detail lives in docs/plan.md; this section is the operational dashboard.
| Phase / capability | Scope | Status (2026-05-22) |
|---|---|---|
| 1 — pipe-gaps port | Stand up the repo. Lift the four-mode mode-equivalence test from pipe-gaps/tests/integration/mode_equivalence.py onto dit.* helpers. Drop --runner=local. Replace the source file with a shim. |
Code complete + verified on real BQ 2026-05-22 (caught a real pipe-gaps:v0.9.6 non-determinism bug; custom worker image with fix collapsed all 3 pairwise diffs to zero). Pending only: Track 5 shim swap in the pipe-gaps repo. |
| 2 — port-visits | Ship AIS-staging, VMS, and AIS-full workflows for anchorages_pipeline's two-step port-visits (thin_port_messages → port_visits). First real test of the dit.compare abstraction on the truncate-shape (view_suffix="", keys=["visit_id"]). |
AIS-staging verified 2026-05-15 (3/3 pairwise green). VMS not started; AIS-full not started. |
| Cross-version experiments | BQ snapshot helpers (dit.bq.snapshot_table/snapshot_dataset), experiment-id linkage (--experiment-id/DIT_EXPERIMENT_ID), structured Dataflow job names + dynamic labels, and an end-to-end orchestrator (workflows/port_visits/cross_version_ais.py) for diffing pipeline outputs across versions against pinned input. |
Landed 2026-05-15. Validated end-to-end via the PIPELINE-1465 cross-version test + the pipe-gaps --worker-image flow on 2026-05-22. |
| Runtime & CI (Cloud Build) | ditbox image + cloudbuild-dit.yaml + make dit-cloud / make publish-ditbox targets. Moves the orchestrator off the laptop; serves both gcloud builds submit ad-hoc and GitHub-webhook PR triggers. Tiered triggers (cheap AIS-staging on every PR, heavy AIS-full on label) come on top. |
Ad-hoc path battle-tested 2026-05-22 (kaniko cache, uv per-run installs, _BEAM_VERSION pin, dit.job_names centralisation, dit.git_info warn). Per-pipeline PR triggers pending. 30-concurrent-build slot cap is the next architectural ceiling — Cloud Run jobs flagged as the migration target if dit graduates to per-PR matrix testing. |
| 3 — pipe-events port | Port the bash mode-equivalence test (pipe-events/integration_tests/pipe3-bf_bfd_bftruncate.sh, no comparisons) to workflows/pipe_events/fishing.py. Add the automated comparison it lacks. Then decide whether to extract phase helpers (three-consumer evidence). |
Code complete 2026-05-29. workflows/pipe_events/fishing.py lands the 4-step BQ-SQL chain + truncate-shape comparison on event_id (NOT SCD-2 — plan corrected). Docker runner gained volumes/service; add_infra_args split. Phase-helper extraction deliberately deferred (three-consumer review: shapes too divergent — workflows/README.md). Pending: live e2e run + the pipe-events-repo bash shim (a separate processing-repo change). |
| 4 — composer-dags param sync | dit sync-params --from <composer-dags-checkout> reads production DAGs and regenerates params.yaml. Triggered when a real prod-vs-test param drift bug shows up. |
Not started. |
| 5 — Mutation library | Promote pipe-gaps' compute_restricted_ssvids into dit.mutations along with drop_messages, shift_timestamps, set_segment_flag. Cap at ~5 mutations. |
Not started; waits for a second consumer. |
| 6 — Phase sharing | Hash (image-tag, phase-config, mutation-set); second invocation of an identical phase becomes a BQ COPY instead of a re-run. Cuts wall-clock for CI. |
Not started; build only when duplicate-run cost matters operationally. |
| 7 — Golden-table mode | Per-workflow reference _1_bf table keyed by (image-tag, params-hash, date-range); future runs assert byte-equivalence vs. the golden table. Cheap PR-validation regression check. Implementable on top of the cross-version snapshot machinery. |
Not started. |
Operational next steps (rolling, in priority order):
- No-dirty-tree pivot — landed (
docs/no-dirty-tree-pivot.md). M-pivot-1..5 shipped (#22–#25 + docs): deterministic orphan snapshots + gitleaks; auto-snapshot inmake dit-cloud(--async) /dit run;pipeline_dirty→unreviewed_code+pipeline_commit_parent(snapshots cacheable); auto-built worker images closing the submitter-vs-worker gap;--allow-dirty-tree/ the warn-helper / the_dirtysuffix removed. Remaining follow-ups (tracked in the pivot doc § Loose ends): pipe-gaps Dockerfile reorder PR (GlobalFishingWatch/pipe-gaps#128); drop the dual-writtenpipeline_dirtycolumn after a release. The first real end-to-endmake dit-cloudauto-build run is done (2026-05-29) — it also surfaced + fixed twoREF=cloud-path bugs (#27) and reconfirmed thev0.9.6non-determinism (main 350/317/347 vs fix 0/0/0). - Run cache M5 — landed 2026-05-29 (see
docs/run-cache-impl.md§ Milestone 5). M5a ✓:cancel_run+dit cache-cancel+make dit-cancel— discovers Dataflow jobs bydit_run_idlabel (storeddataflow_job_idsare empty), deletes the output tables (table-level only), UPDATEs status; pipe-gaps now stamps thedit_run_idlabel too. M5b ✓: port-visits caching wired (its owncanonical_params_dict+CacheKeyover the shareddit.workflow.run_with_cache;ais.pyflipped toresolve_digest=True). User-gated follow-up: verifydataflow.jobs.cancel+ BQ table-delete for the laptop user (make dit-cancel) andautomated-testing@(the M6 path) against live infra — the M5 tests mock gcloud/BQ. - Run cache M6 — SIGTERM trap inside
dit runso Cloud Build cancellations cleanly tear down their own Dataflow + BQ artefacts viacancel_run(the natural follow-on now M5a has landed). - Per-pipeline PR triggers. Wire
pipe-gaps(most-verified) →anchorages_pipeline→pipe-eventsto fire dit on PRs. Trigger config lives in each pipeline repo; referencescloudbuild-dit.yamlfrom the dit checkout. Path-filter on pipeline-relevant files;dit:runlabel as the escape-hatch override. - PR comment integration via Check Runs.
dit.reportmodule + GitHub Check Run API call from insidedit run. Typed contract end-to-end (no log parsing). Design doc TBD; separately,docs/llm-pr-gating.mdcovers the negative-signal-only LLM pre-filter that gates whether a PR triggers dit (orthogonal concern from how the verdict is posted). - Track 5 — pipe-gaps repo shim, opportunistically (replace
pipe-gaps/tests/integration/mode_equivalence.pywith a thin re-export fromdit/workflows/pipe_gaps/...). - VMS port-visits workflow, then AIS-full (the latter is what motivates Cloud Run jobs hardest once concurrent runs scale up).
- PIPELINE-1465 cross-version full validation against newly-built pipe-anchorages images (parallel to the pipe-gaps validation just completed).
- The first dit release tag (
v0.1.0) once per-pipeline PR triggers + caching land — that's the natural "framework is usable by outsiders" milestone.
The canonical detailed version of this list lives in docs/plan.md § Next steps.