Skip to content
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,10 @@

- There are no FKs pointing into `CommitCheckRun` or `CommitStatusContext` from other models
(only the `Repository` cascade-delete direction), so bulk deletion is safe.
**Note (updated):** `PRQueueWindow` now holds nullable FKs into both `CommitCheckRun` and
`CommitStatusContext` for event attribution (see `040-queue-window-event-attribution.md`).
These use `on_delete=SET_NULL`, so bulk deletion remains safe, but the expire task must mark
affected windows stale and enqueue their PRs for rebuild before deleting rows — see doc 040.

## Decision
- Add a periodic cleanup task `syncer.expire_stale_ci_for_repo` (fanned out by
Expand Down
263 changes: 263 additions & 0 deletions docs/design-decisions/040-queue-window-event-attribution.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,263 @@
# Queue Window Event Attribution

## Context
- `PRQueueWindow` records the interval `[from_ts, to_ts)` during which a PR was on the queue
under a given ruleset, but carries no information about *why* each transition happened.
- For offline analysis we want to know what caused each window to open and close — was it a
label change, a CI result, a PR state change, etc.?
- Queue windows are already computed by replaying boundary events chronologically in
`_queue_windows_with_rules` (`qb_site/analyzer/services/queue_windows.py`). At each
boundary point the code already knows which event caused the re-evaluation; this decision
captures that context in the stored rows.
- Boundary sources:
- `PRTimelineEvent` (label add/remove, draft toggle, open/close, force-push)
- `CommitCheckRun` / `CommitStatusContext` (CI completion/update)
- Synthetic boundaries (PR creation, ruleset `effective_from`, revision changes)
- The expire task (`syncer.expire_stale_ci_for_repo`) deletes superseded CI rows
(see `038-commit-ci-row-expiry-and-superseded-cleanup.md`). Doc 038 originally noted
"there are no FKs pointing into `CommitCheckRun` or `CommitStatusContext` from other models"
— this decision changes that.

## Decision
Add ten nullable fields to `PRQueueWindow` recording which event caused each window to open
and close. Use FK references to the triggering DB row (when one exists) plus a discriminator
`event_type` field. When referenced CI rows are deleted by the expire task, mark affected
windows stale and enqueue their PRs for rebuild before deletion.

Attribution is to the *triggering event* — the single boundary event whose processing caused
`on_queue` to flip — not a full eligibility explanation.

## Architecture

### Event Type Discriminator

A new `QueueWindowEventType` `TextChoices` enum:

| Value | FK column populated | Notes |
|---|---|---|
| `REQUIRED_LABEL_ADDED` | `timeline_event` | A required label was added |
| `REQUIRED_LABEL_REMOVED` | `timeline_event` | A required label was removed |
| `FORBIDDEN_LABEL_ADDED` | `timeline_event` | A forbidden label was added |
| `FORBIDDEN_LABEL_REMOVED` | `timeline_event` | A forbidden label was removed |
| `CI_PASSED` | `check_run` or `status_context` | CI state flipped to eligible |
| `CI_FAILED` | `check_run` or `status_context` | CI state flipped to ineligible |
| `PR_OPENED` | `timeline_event` | PR reopened (or opened fresh) |
| `DRAFT_CONVERTED` | `timeline_event` | PR converted from draft to ready |
| `CONVERTED_TO_DRAFT` | `timeline_event` | PR converted to draft |
| `PR_CLOSED` | `timeline_event` | PR closed or merged |
| `HEAD_PUSHED` | `timeline_event` | Force-push; resets CI state for new SHA |
| `INITIAL_STATE` | *(all null)* | PR was already eligible at window start |
| `RULESET_EFFECTIVE` | *(all null)* | Ruleset `effective_from` boundary reached |
| `UNKNOWN` | *(all null)* | Defensive fallback; should not appear in practice |

At most one of the three FK columns is non-null for any open/close event; which one is set
is determined by `event_type`.

### New Fields on `PRQueueWindow`

```python
opened_by_event_type = CharField(max_length=32, null=True, choices=QueueWindowEventType.choices)
opened_by_timeline_event = ForeignKey("syncer.PRTimelineEvent", null=True, on_delete=SET_NULL, related_name="+")
opened_by_check_run = ForeignKey("syncer.CommitCheckRun", null=True, on_delete=SET_NULL, related_name="+")
opened_by_status_context = ForeignKey("syncer.CommitStatusContext",null=True, on_delete=SET_NULL, related_name="+")
opened_at_head_sha = CharField(max_length=40, null=True)

closed_by_event_type = CharField(max_length=32, null=True, choices=QueueWindowEventType.choices)
closed_by_timeline_event = ForeignKey("syncer.PRTimelineEvent", null=True, on_delete=SET_NULL, related_name="+")
closed_by_check_run = ForeignKey("syncer.CommitCheckRun", null=True, on_delete=SET_NULL, related_name="+")
closed_by_status_context = ForeignKey("syncer.CommitStatusContext",null=True, on_delete=SET_NULL, related_name="+")
closed_at_head_sha = CharField(max_length=40, null=True)
```

`closed_by_*` and `closed_at_head_sha` are null for windows that are still open (`to_ts = None`).

All FK columns use `on_delete=SET_NULL`. If a referenced row is deleted out-of-band the FK
silently becomes null; the rebuild-on-deletion mechanism (see below) ensures this state is
transient in practice.

Migration `0025_prqueuewindow_closed_at_head_sha_and_more.py` — pure `ADD COLUMN`, safe on
a live database with no backfill required.

### Attribution Computation (`_queue_windows_with_rules`)

The `_queue_windows_with_rules` function returns `List[AttributedQueueWindow]` (a new
dataclass wrapping `from_ts`, `to_ts`, `opened_by: WindowAttribution`, and
`closed_by: Optional[WindowAttribution]`). `WindowAttribution` holds `event_type`, at most
one FK object, and `head_sha`.

**Label-only path** — each boundary event is a `PRTimelineEvent`; `_timeline_event_type`
maps it to the correct `QueueWindowEventType` using the ruleset's required/forbidden label
sets. On each flip, the triggering event is captured directly.

**CI path** — `_determine_ci_boundary_attribution` resolves the triggering event by
priority:
1. CI eligibility changed **and** a CI event was present at this boundary → `CI_PASSED` /
`CI_FAILED` with FK to the triggering check run or status context.
2. A timeline event was present at this boundary → corresponding timeline type.
3. A revision change occurred (HEAD_PUSHED) with no other event.
4. CI changed without an identifiable event → `UNKNOWN` (defensive; should not fire on
clean data).

The current head SHA is already tracked across revision boundaries in the replay, so
`opened_at_head_sha` / `closed_at_head_sha` require no additional lookups.

`queue_windows_for_pr` strips attribution to preserve the `List[Tuple]` return type for
all existing callers outside the rebuild path.

### Persistence (`rebuild_queue_windows_for_ruleset`)

`_attribution_kwargs(attr, prefix)` unpacks a `WindowAttribution` into model field kwargs,
comparing FK objects by `_id` to avoid unnecessary DB fetches. Attribution fields are
always overwritten on both the create and update paths; they reflect the current computation
and are never carried forward from a previous rebuild.

### Rebuild-on-Deletion (`expire_stale_ci_for_repo_task`)

`_invalidate_queue_windows_for_ci_rows(check_run_ids, status_context_ids)` is called before
each of the four deletion passes with the IDs about to be deleted. It:
1. Queries `PRQueueWindow` for rows referencing any of the given IDs across all four CI FK
columns (`opened_by_check_run_id`, `closed_by_check_run_id`, etc.).
2. Nulls `windows_built_at` on `PRQueueWindowBuildState` for the affected `(PR, ruleset)` pairs.
3. Enqueues `process_pr_task.delay(pr_id)` for each affected PR.

Both the stale mark and the direct enqueue are done (defence in depth): the stale marker
ensures the periodic sweep catches any PR whose direct rebuild task fails; the direct enqueue
avoids waiting for the next sweep cycle.

### Sweep Staleness Extension (`rebuild_queue_windows_sweep`)

Two new conditions are added to the `attribution_backfill` `Exists` subquery that feeds the
`needs_rebuild` prefilter:

- `opened_by_event_type IS NULL` — detects pre-migration windows (null means "not yet
attributed", not a specific event type).
- `opened_by_event_type IN (CI_PASSED, CI_FAILED) AND opened_by_check_run IS NULL AND
opened_by_status_context IS NULL` (and the same for `closed_by_*`) — detects the
post-expire-task partial-failure state where the FK was SET_NULL but the event_type
was not cleared.

The subquery uses the same `Exists` pattern as the existing `rollup_backfill` condition and
operates on indexed FK columns; it does not add a costly scan. `collect_convergence.py`
includes the same conditions in its `rollup_stale_pairs` query and must be kept in sync with
the sweep.

`has_attribution_backfill: bool` is threaded into `_is_ruleset_stale_for_pr`, which returns
`True` immediately when set. The outer prefilter is conservative (SQL `Exists` across all
rulesets); the inner check is per-ruleset exact.

## Consequences

- Every new or rebuilt `PRQueueWindow` carries attribution fields identifying the triggering
event, suitable for offline analysis without rejoining raw timeline/CI tables.
- The `PRQueueWindow` table now holds nullable FKs into `CommitCheckRun` and
`CommitStatusContext`. Bulk deletion of CI rows remains safe (FK uses `SET_NULL`), but the
expire task now carries an explicit obligation to invalidate affected windows before deletion.
- After CI row deletion and rebuild, windows that depended on the deleted rows are recomputed
from surviving data. They may change shape, merge with adjacent windows, or disappear
entirely. They do **not** survive with `UNKNOWN` attribution — `UNKNOWN` is a defensive
fallback for data inconsistencies in the current DB, not a consequence of deletion.
- Attribution backfill via the periodic sweep is automatic; no explicit backfill command is
needed after deploying the migration.
- `CommitCheckRun` rows are mutable (upserted in place on the same `github_node_id`). An FK
on a window correctly identifies *which check run* caused the transition, but the row's
current `conclusion` may differ from its value at `from_ts` if the check run was updated
later. Use `from_ts` together with the check run's `gh_completed_at` to recover the
historical state.

## Subtleties / Invariants

**I1 — Windows are a pure function of current DB inputs.**
After any mutation to the inputs (timeline events, CI rows, rulesets), the stored windows
must exactly match what `_queue_windows_with_rules` would produce if run now. This decision
extends that invariant to cover CI row deletions.

**I2 — Rebuild-on-deletion must be triggered explicitly.**
The existing stale-detection covers `gh_updated_at` bumps, ruleset changes, and revision
version changes. It does NOT fire when the expire task deletes a CI row. The expire task must
both mark affected PRs stale and directly enqueue them for rebuild before deleting rows.
Without either mechanism, I1 is violated silently with no recovery path.

**I3 — Attribution FKs describe the computation that produced the window, not an absolute
historical record.**
After a rebuild, a window that previously closed because of check run A (now deleted) either
no longer exists or is re-attributed to the surviving check run. The FK on the new window is
always valid relative to the current DB state.

**I4 — The expire task is intentionally lossy; windows should reflect that.**
Transient failures that were rerun to success, and then had their failure row deleted, will
disappear from window history after the next rebuild. This is correct: the queue window
history reflects the *retained* CI data, not the unretained data.

**I5 — TimelineEvent FKs are stable; CI FKs are not.**
`PRTimelineEvent` rows are not subject to deletion by any existing task. `opened_by_timeline_event`
/ `closed_by_timeline_event` FKs will not be nullified. Only the CI FKs require the
rebuild-on-deletion guarantee.

**I6 — Synthetic boundaries have null FKs; `*_at_head_sha` is still populated.**
Windows that open due to `INITIAL_STATE` or `RULESET_EFFECTIVE` have no associated DB event
row, so all three FK columns are null. The `event_type` discriminator distinguishes
"null because synthetic" from "null because the row was deleted" or "null because
pre-migration". Offline analysis must not conflate these three cases.
`*_at_head_sha` is still populated for synthetic opens, providing a handle on which revision
was current at that boundary.

**I7 — Attribution is to the triggering event, not a complete eligibility explanation.**
Eligibility is a conjunction (open, not-draft, required labels, forbidden labels, CI).
Attribution records the single event whose processing caused `on_queue` to flip — the
"last missing piece" when opening, the "first broken condition" when closing. It does not
record all conditions that happened to be satisfied or unsatisfied at the boundary.

## Operational Notes

### Deploying to an existing install

The migration is pure `ADD COLUMN` with no backfill and no locking beyond the column
additions; safe to run against a live database. After migration all existing `PRQueueWindow`
rows have all attribution fields null. This means:

- `opened_by_event_type IS NULL` on a live row means "not yet rebuilt post-migration", not
`INITIAL_STATE`. After rebuild, `INITIAL_STATE` windows have that value set explicitly.
Offline analysis should treat `IS NULL` as "not yet attributed" rather than as a specific
event type.
- The `attribution_backfill` sweep condition fires for all such rows, so the normal periodic
sweep will progressively populate attribution without any manual intervention.
- If faster full attribution is needed, use `backfill_queue_window_build_states` to mark all
build states stale; the sweep will then pick up all PRs on its next pass.

### Staleness coverage (updated table from doc 024)

| Staleness cause | Detection mechanism |
|---|---|
| New commit → `revision_version` bumped | `process_pr`; sweep revision lag |
| Label or state change → `gh_updated_at` bumped | `process_pr` unconditional rebuild; sweep `windows_built_at < gh_updated_at` |
| Ruleset definition changed | Sweep `windows_built_at < ruleset.updated_at` |
| New ruleset → missing build-state row | Sweep missing-state count |
| Existing windows missing rollup fields | Sweep `has_rollup_backfill` Exists subquery |
| Pre-migration windows (`opened_by_event_type IS NULL`) | Sweep `has_attribution_backfill` Exists subquery |
| CI FK nulled by expire task without rebuild | Sweep `has_attribution_backfill` Exists subquery |
| CI row deleted → rebuild triggered | `_invalidate_queue_windows_for_ci_rows` marks stale + direct enqueue |

### Attribution reflects surviving CI data after deletion

After the expire task deletes superseded CI rows, rebuilding windows produces attribution
consistent with the *surviving* data only. Historical window shapes may change if a deleted
failure row was previously the triggering event. This is correct behavior (I4) but relevant
for long-term offline analysis of window history.

## References
- `docs/design-decisions/010-queue-windows-first.md`
- `docs/design-decisions/024-per-ruleset-queue-window-build-state.md`
- `docs/design-decisions/038-commit-ci-row-expiry-and-superseded-cleanup.md`
- `qb_site/analyzer/models/queue_window.py` — `QueueWindowEventType`, `PRQueueWindow`
- `qb_site/analyzer/services/queue_windows.py` — `_queue_windows_with_rules`,
`WindowAttribution`, `AttributedQueueWindow`, `_attribution_kwargs`,
`rebuild_queue_windows_for_ruleset`
- `qb_site/analyzer/tasks/rebuild_queue_windows_sweep.py` — `attribution_backfill` subquery
- `qb_site/analyzer/tasks/collect_convergence.py`
- `qb_site/analyzer/migrations/0025_prqueuewindow_closed_at_head_sha_and_more.py`
- `qb_site/syncer/tasks/sync_tasks.py` — `_invalidate_queue_windows_for_ci_rows`,
`expire_stale_ci_for_repo_task`
- `qb_site/analyzer/tests/services/test_queue_window_attribution.py`
- `qb_site/analyzer/tests/services/test_queue_window_attribution_persist.py`
- `qb_site/analyzer/tests/tasks/test_rebuild_queue_windows_sweep_attribution.py`
- `qb_site/syncer/tests/tasks/test_expire_stale_ci_attribution.py`
Loading