Skip to content

feat(redis_admin): split celery_broker caps into display vs scan limits#903

Merged
thomasrockhu-codecov merged 9 commits into
mainfrom
redis-admin/split-celery-broker-caps
May 1, 2026
Merged

feat(redis_admin): split celery_broker caps into display vs scan limits#903
thomasrockhu-codecov merged 9 commits into
mainfrom
redis-admin/split-celery-broker-caps

Conversation

@thomasrockhu-codecov
Copy link
Copy Markdown
Contributor

@thomasrockhu-codecov thomasrockhu-codecov commented Apr 30, 2026

Summary

The single REDIS_ADMIN_MAX_ITEMS_PER_KEY = 20_000 knob was bounding three call sites with very different cost profiles:

  • The changelist row materialisation keeps full parsed envelopes in the per-request _result_cache — RAM-heavy.
  • The frequency chart aggregator walks the queue in chunks and only retains (task_name, repoid, commitid) counts — CPU-light.
  • The streaming clear needs to drain the entire matching set, not a sample.
  • The clear-by-filter preview needs to count the entire matching set so the operator sees the same number both before and after submitting.

Tuning one number forced a bad trade: lowering it for page-load made the chart shallow; raising it for the chart bloated the changelist; and either way the clear inherited the wrong behaviour. After PR #897 lowered the cap to 20_000 the clear silently capped at that window too, so on a saturated queue the per-pass plateau check fired before the clear had drained.

Splits + drops the cap on the right things, then polishes the chart loading UX:

Setting / call site Default Used by
REDIS_ADMIN_CELERY_BROKER_DISPLAY_LIMIT 2_000 per-message rows on the changelist drill-down
REDIS_ADMIN_CELERY_BROKER_SCAN_LIMIT 100_000 frequency chart aggregator only
streaming clear unbounded walks LLEN(queue) so "Clear all" always fully drains
dry-run preview (celery_broker_clear) unbounded runs the streaming counter so result.count matches what the actual clear will do
clear-by-filter preview page unbounded uses the new streaming_celery_count helper so match_count agrees with result.count

REDIS_ADMIN_MAX_ITEMS_PER_KEY keeps its existing 20_000 default and stays wired to the unrelated SET/HASH browsing path in RedisItemQuerySet.

Folds in #904 (clear-cap regression) and #905 (visible loader card UX). Both source PRs left OPEN as historical references.

Test plan

  • Updated chart / clear / cap fakeredis tests for the renamed/dropped settings.
  • Regression: unbounded streaming clear drains a 41k-match queue to LLEN == 1 in a single pass.
  • Regression: dry-run preview returns the full-queue count (5k matches → result.count == 5_000, keep_one=True → 4_999, LLEN intact).
  • Regression: clear-by-filter preview page shows Matched: N message(s) for a queue with N > CELERY_BROKER_DISPLAY_LIMIT.
  • Drill-down rendering test asserts the loader card markup, CSS link, and role="status" are present.
  • ruff check + ruff format --check clean.
  • Manually verify on api-admin.codecov.io after deploy: chart sample size is back to ~100k AND the clear-by-filter preview page agrees with the post-clear LLEN delta AND clicking "Clear all" drains the queue AND the loader card is visible while the chart fragment fetches.

Note

Medium Risk
Changes celery broker admin/clear behavior and limits, including unbounded queue-draining logic; mistakes could over/under-delete messages or slow deep-queue operations.

Overview
Splits the previous single MAX_ITEMS_PER_KEY cap into two celery-specific settings: CELERY_BROKER_DISPLAY_LIMIT (bounds per-message drill-down materialisation/cache) and CELERY_BROKER_SCAN_LIMIT (bounds the frequency-chart sampler), and updates docs/UI text accordingly.

Fixes celery clear/preview semantics on deep queues by making the streaming clear walk the full LLEN(queue) (no artificial cap), adding wildcard-aware filter matching, and ensuring clear-by-filter preview + dry-run counts come from a new streaming counter (streaming_celery_count) rather than the display-limited queryset.

Polishes the drill-down chart loading UX with a CSP-safe external loader stylesheet + richer loading card, tweaks admin filtering (hide celery_broker from the generic queue family filter), adjusts summary-mode default ordering, and adds focused regression tests for deep-queue draining and filter/preview correctness.

Reviewed by Cursor Bugbot for commit 4f86eee. Bugbot is set up for automated code reviews on this repo. Configure here.

The single `MAX_ITEMS_PER_KEY = 20_000` knob was doing too many jobs:
it bounded the changelist row materialisation (RAM-heavy: full
envelopes stay in the per-request cache), the frequency chart
aggregator (CPU-light: discards payloads), and the streaming clear
(the same shape as the chart). Lowering it for page-load made the
chart shallow; raising it for the chart bloated the changelist.

Split into two purpose-specific settings:

- `REDIS_ADMIN_CELERY_BROKER_DISPLAY_LIMIT` = 2_000 — per-message
  rows on the changelist. Smaller because each row keeps a parsed
  envelope in memory.
- `REDIS_ADMIN_CELERY_BROKER_SCAN_LIMIT` = 100_000 — frequency
  chart aggregation + streaming clear deep scan. Both walk the
  queue in chunks and discard payloads, so memory stays flat. The
  chart and clear share this window so anything visible on the
  chart is reachable by the "Clear queue" button.

`MAX_ITEMS_PER_KEY` keeps its existing 20_000 default and stays
wired to the unrelated SET/HASH browsing path in
`RedisItemQuerySet`.
@sentry
Copy link
Copy Markdown
Contributor

sentry Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 95.12195% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.91%. Comparing base (baef6b4) to head (4f86eee).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/codecov-api/redis_admin/admin.py 84.61% 4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #903   +/-   ##
=======================================
  Coverage   91.90%   91.91%           
=======================================
  Files        1316     1316           
  Lines       50318    50380   +62     
  Branches     1625     1625           
=======================================
+ Hits        46245    46305   +60     
- Misses       3767     3769    +2     
  Partials      306      306           
Flag Coverage Δ
apiunit 94.98% <95.12%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codecov-notifications
Copy link
Copy Markdown

codecov-notifications Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 95.12195% with 4 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/codecov-api/redis_admin/admin.py 84.61% 4 Missing ⚠️

📢 Thoughts on this report? Let us know!

The cap split in this branch left the streaming clear bound to
`CELERY_BROKER_SCAN_LIMIT` (100_000), which is the same window the
chart sampler uses. On a queue saturated past 2x cap pass 1 and
pass 2 each tombstoned exactly that many matches and the
`prev_lset == matches_lset` plateau check spuriously declared
convergence, leaving the rest of the queue in place.

Fold in #904's fix: drop the cap from `_streaming_celery_clear`
so each pass walks `LLEN(queue)`. The plateau check stays for
genuine consumer-side drift. Update the `CELERY_BROKER_SCAN_LIMIT`
docstring + README to reflect that the clear is intentionally
unbounded and only the chart aggregator uses the sample window.

Adds the regression test from #904 (renamed to drop the now-
irrelevant `MAX_ITEMS_PER_KEY` reference): seeds 41k matching
messages and asserts `LLEN == 1` after a single clear pass —
fails on the pre-fold #903 (cap = 100k matches) and passes after.
The "Clear by filter" preview was using `pending_count` (the
materialised target list, capped by `CELERY_BROKER_DISPLAY_LIMIT
= 2_000`) as the dry-run match count. After the cap split in
this branch the actual streaming clear is unbounded, so on a
200k-deep queue with 190k matches the preview reported ~1.9k
while "Clear all" then drained 190k — confusing.

Run `_streaming_celery_clear(..., dry_run=True)` on dry-run too
so the preview walks the same `LLEN(queue)` the actual clear
walks. Adds `total_found` to `_StreamingClearStats` to expose
the count, and short-circuits dry-run after one pass since
nothing's changing across passes. `keep_one` mode subtracts the
per-queue keeper so the preview matches what the actual clear
will tombstone.

Adds regression test seeding 5k matches and asserting both
`dry_run=True / keep_one=False → count == 5_000` and
`keep_one=True → count == 4_999` while `LLEN` stays at 5_000.
Comment thread apps/codecov-api/redis_admin/services.py Outdated
The earlier dry-run fix on this branch made
`celery_broker_clear(dry_run=True).count` reflect the streaming
counter's full-queue scan, but the clear-by-filter preview page
itself was still showing `len(matches)` (capped by
`CELERY_BROKER_DISPLAY_LIMIT = 2_000`). On a queue with 190k
matches the preview said "1903" while "Clear all" then drained
190k. Move `match_count` and `kept_index` to a streaming pass
via the new `streaming_celery_count` helper. Keep the queryset
only for the small `sample_targets` preview table (<= 20 rows).
Adds `first_match_index` to `_StreamingClearStats` so the helper
can surface the lowest-index match for the "Clear all but first"
action's `kept_index` callout.

Also folds in #905: replaces the subtle "Loading frequency
chart..." help-text line with a styled loader card (animated
spinner, themed via Django admin CSS custom properties,
reduced-motion fallback). Loader styles ship as an external
stylesheet referenced via {% extrastyle %} so they survive
production CSP - same pattern PR #902 used for the script.

Adds two regressions: the clear-by-filter preview reports the
full-queue count (`Matched: N message(s)` for N > display
limit), and the changelist body contains the new loader card
markup + CSS link.
Bugbot review on PR #903 caught a low-severity off-by-one in the
clear-by-filter dry-run preview when the filter set spans multiple
queues. The preview was subtracting `len(by_queue_filters)` (one per
queue) from `total_found`, but `_streaming_celery_clear` only
preserves a survivor in queues that have at least one match. So if
a queue drained to zero matches between the changelist render and
the preview (e.g. a consumer popped the last match), the preview
underreported by one — diverging from what the real clear path
actually tombstones.

Fix tracks `queues_with_matches` while iterating per-queue and
deducts that count instead of `len(by_queue_filters)`.

Adds a regression test that pushes 100 matching messages into one
queue, leaves a second queue empty, and asserts `keep_one=True`
dry-run reports `n - 1` (not `n - 2`). Verified the test fails on
the pre-fix code with `assert 98 == (100 - 1)`.
Comment thread apps/codecov-api/redis_admin/admin.py Outdated
… clears land

Bugbot review on PR #903 caught a high-severity silent-no-op:
`clear_by_filter_view` was passing `sample_targets` (the materialised
window capped at `CELERY_BROKER_DISPLAY_LIMIT = 2_000`) as the
`targets` arg to `celery_broker_clear`. Two failure modes:

1. All matches sit beyond the display window (recent burst on a
   queue with `LLEN > 2_000`) -- `sample_targets` is empty, so
   `_celery_broker_clear` derives `by_queue_filters = {}`, the
   streaming-clear loop never executes, and the operator sees
   "Cleared 0 of N matching message(s)" while the queue stays
   full. `match_count` (from `streaming_celery_count`, which walks
   the full queue) had already reported the matches.
2. Filter set inferred from sample_targets enumerates only the
   `(task, repoid, commit)` triples present in the first 2_000
   rows -- triples living past that cap survive the clear.

Fix: synthesise a single `CeleryBrokerQueue` carrying the
operator's exact filter (queue + task_name? + repoid? + commitid?)
and pass that as `targets`. `_celery_broker_clear` now derives the
same `frozenset({(task or None, repoid, commit or None)})` that
`streaming_celery_count` walks the full queue with, so preview
count and actual clear count agree even on 500k-deep queues. The
per-message and bulk-select paths (delete_model / clear_selected /
delete_queryset / clear_dry_run) keep passing real materialised
rows -- they're index-driven, not filter-driven, and don't hit
this code path.

Also fixes the low-severity hint hardcode flagged in the same
review: the loader copy under the chart fragment said "Sampling up
to 100,000 messages" regardless of `CELERY_BROKER_SCAN_LIMIT`.
Pass the actual setting through `changelist_view`'s template
context (pre-formatted with thousands separators since
`django.contrib.humanize` isn't in INSTALLED_APPS).

Adds regression test seeding `CELERY_BROKER_DISPLAY_LIMIT + 1_000`
matching messages and asserting `celery_broker_clear` with a
synthetic filter target drains the entire queue
(`llen == 0`, `result.count == n`). Pre-fix that test fails with
`result.count == 0` and the queue intact.
Pre-existing bug surfaced by the synthetic-target fix in the prior
commit and by the existing `CeleryBrokerClearByFilterViewTest` /
`ChartFragmentViewTest` suites that started failing on this branch:
`_streaming_celery_clear` (used by both `streaming_celery_count`
and the actual clear path) compared `(meta.task, meta.repoid,
meta.commitid)` to `filter_tuples` via exact set membership. When
the operator left `task_name` unset, `streaming_celery_count`
built `frozenset({(None, repoid, commitid)})` and compared it
against real envelopes whose `meta.task` is always populated, so
nothing matched. The preview reported zero matches even when the
queryset's `filter(repoid=..., commitid__startswith=...)` would
surface rows -- which is what the eight failing
`CeleryBrokerClearByFilterViewTest` tests on commit 39aed24
were tripping over.

Fix: replace the membership check with `_envelope_matches_any_filter`
which treats `None` slots in any filter tuple as wildcards
(`ft_task is None or ft_task == meta.task`, same for repoid /
commitid). Mirrors the queryset's "filter by what's set" semantic,
so the streaming counter and clear agree with the queryset's match
set across every operator-input shape (queue + repoid only,
queue + task only, queue + commit only, queue + any combination).
Tuples with all three slots populated keep the same exact-equality
behaviour (`None` short-circuits, real values still match by
`==`), so existing tests using the `CeleryBrokerQueueQuerySet`-
materialised path continue to pass unchanged.

Adds two regression tests:

* `test_streaming_celery_count_wildcard_task_name_matches_any_task`
  pins the counter contract: `task_name=None, repoid=R, commit=C`
  matches every envelope with that `(repoid, commit)` regardless
  of task class. Pre-fix it returned 0; now it returns 5 (3
  NotifyTask + 2 UploadTask) while the unrelated `repoid+1`
  envelope is excluded.
* `test_celery_broker_clear_synthetic_target_with_wildcard_task_name_drains_all_matches`
  pins the clear contract: a synthetic target with `task_name=None`
  drains all 5 matching envelopes (across both task classes) and
  preserves the unrelated row. Pre-fix `result.count == 0` and
  the queue stayed full.
…nd-to-end

PR #903's clear-queue path has regressed twice (folded fixes:
PR #904 -- streaming clear walks full queue, no artificial cap; and
the synthetic-target fix -- clear-by-filter uses a single synthetic
target so deep-queue clears don't silently no-op). Both regressions
were silent at the existing-test level because the assertions
inferred drain success from `total_lset`/`stats.total_found` rather
than walking the post-call queue. Add a small focused module that:

* Asserts `LLEN(queue) == initial_total - matches` explicitly
  before/after `celery_broker_clear`.
* Walks the full post-call queue with `parse_celery_envelope` and
  *positively* counts how many envelopes still match the filter
  triple, so any future regression that silently leaves matches
  behind (capped scan, dropped tail, etc.) fails immediately.
* Exercises queues with `LLEN > CELERY_BROKER_DISPLAY_LIMIT` for
  every scenario so a queryset-bounded clear can never sneak back
  in.

Three scenarios:

* `test_celery_broker_clear_drains_deep_queue_end_to_end` --
  CELERY_BROKER_DISPLAY_LIMIT + 500 matching envelopes, single
  synthetic target through `celery_broker_clear(dry_run=False,
  keep_one=False)`, asserts `LLEN == 0` and zero remaining
  matches via the full-queue walk.
* `test_celery_broker_clear_keep_one_deep_queue_leaves_exactly_one_at_head`
  -- CELERY_BROKER_DISPLAY_LIMIT + 100 matching envelopes,
  `keep_one=True`, asserts exactly one survivor sitting at index 0
  (head-of-queue, the next message a worker would BLPOP).
* `test_celery_broker_clear_mixed_content_deep_queue_clears_only_filter`
  -- three task buckets (A: target, B: same repoid+commit but
  different task, C: same task but different repoid+commit), B + C
  seeded both before A (head-of-queue) and after A (past the
  display window) so the clear must walk the full LLEN. Asserts
  zero A survivors, every B and C survives.

Lives in its own module to insulate from churn on the much larger
`test_celery_broker_queue.py` and uses `fakeredis` +
`SimpleNamespace` so it runs cleanly under \`pytest -m 'not django_db'\`.
…ily option

- Default the CeleryBrokerQueueAdmin summary view to ORDER BY depth DESC,
  queue_name ASC so the deepest queues surface first (operator scans the
  top of the page for hot spots; ties resolve alphabetically). Drill-down
  per-message view keeps its existing index-ordered list.
- Remove `celery_broker` from the RedisQueueAdmin family list filter
  choices. The data path already excluded celery_broker rows, but the
  dropdown still offered it as a clickable option that returned zero
  rows — now hidden entirely via a new QueueFamilyFilter subclass that
  plumbs an `exclude` list through `families.iter_families()`.

Tests:
- Pin summary-mode ordering (depth DESC, name ASC, with a tie on depth).
- Pin admin.get_ordering flip between summary and drill-down modes.
- Pin family-filter choices to confirm celery_broker is dropped while
  other registered families are still offered (queue + lock categories
  each checked).
- Pin iter_families registry behaviour (unfiltered, category-narrowed,
  exclude-filtered).
@thomasrockhu-codecov thomasrockhu-codecov force-pushed the redis-admin/split-celery-broker-caps branch from 45dff23 to 4f86eee Compare May 1, 2026 14:12
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4f86eee. Configure here.

if ft_commitid is not None and ft_commitid != meta.commitid:
continue
return True
return False
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wildcard filter over-matches in per-message clear paths

High Severity

_envelope_matches_any_filter treats None as a wildcard ("don't constrain"), but the per-message clear paths (clear_selected, clear_dry_run) build filter tuples from materialised rows where None means "the envelope genuinely has no value for this field." A sync_repos task with repoid=None and commitid=None produces filter tuple (task, None, None), which under wildcard semantics matches every message with that task name — not just the one the operator selected. The old exact-tuple-membership check (key not in filter_tuples) correctly distinguished "field is None" from "field is unconstrained." The wildcard semantic is correct for the new synthetic-target path from clear_by_filter_view, but silently broadens the per-message paths, potentially tombstoning unintended messages.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4f86eee. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — this is a real over-broaden risk and would have been a data-loss regression in production. Fixed in 566eb62.

Switched the wildcard signal from None to a dedicated _FILTER_ANY sentinel (a _FilterWildcard singleton), and the match logic checks is _FILTER_ANY (identity) for the wildcard branch and falls through to == for everything else. So:

  • Operator-input paths (streaming_celery_count, the clear_by_filter_view synthetic target) substitute _FILTER_ANY for unset slots and get wildcard semantics.
  • Per-message paths keep passing real envelope values — including legitimate None slots like sync_repos's (repoid, commitid) = (None, None) — and get exact-tuple membership semantics. (task, None, None) only matches (task, None, None) envelopes, not every envelope sharing the task name.

Added regression test test_celery_broker_clear_per_message_target_does_not_overmatch_on_none_slots: seeds two (sync_repos, None, None) envelopes alongside three (sync_repos, repoid, commitid) envelopes; clearing a single materialised target with (repoid, commitid) = (None, None) must tombstone exactly the two matching envelopes, not all five. Pre-fix it would have asserted result.count == 5 and llen == 0 — a 3-message data loss for the operator who only wanted to clear two stuck messages.

@thomasrockhu-codecov thomasrockhu-codecov added this pull request to the merge queue May 1, 2026
Merged via the queue into main with commit b96458c May 1, 2026
40 checks passed
@thomasrockhu-codecov thomasrockhu-codecov deleted the redis-admin/split-celery-broker-caps branch May 1, 2026 14:30
thomasrockhu-codecov added a commit that referenced this pull request May 1, 2026
Bugbot review on PR #903 caught a high-severity over-broaden in
the wildcard fix from ebbb6cf: `_envelope_matches_any_filter`
treated `None` slots in any filter tuple as wildcards, but the
per-message clear paths (`clear_selected`, `clear_dry_run`,
`delete_model`, `delete_queryset`) build filter tuples from
materialised rows whose envelope legitimately carries `None` in
a slot (e.g. a `sync_repos` task with `repoid=None` and
`commitid=None`). Under the prior wildcard semantic, single-row
clear of one such message would have tombstoned EVERY message
sharing the task name regardless of repoid / commit -- a real
data-loss regression in production.

Fix introduces a dedicated `_FILTER_ANY` sentinel (a
`_FilterWildcard` singleton) for "do not constrain on this slot"
that is *distinct from* `None`. The match logic checks
`is _FILTER_ANY` (identity) for the wildcard branch and falls
through to `==` for everything else, so:

* Operator-input paths (`streaming_celery_count`, the
  `clear_by_filter_view` synthetic target) substitute
  `_FILTER_ANY` for unset slots and get wildcard semantics.
* Per-message paths keep passing real envelope values --
  including legitimate `None` slots -- and get exact-tuple
  membership semantics (`(task, None, None)` matches only
  `(task, None, None)` envelopes, not every envelope sharing
  the task name).

Adds a regression test
(`test_celery_broker_clear_per_message_target_does_not_overmatch_on_none_slots`)
seeding two `(sync_repos, None, None)` envelopes alongside three
`(sync_repos, repoid, commitid)` envelopes; clearing a single
materialised target with `(repoid, commitid) = (None, None)` must
tombstone exactly the two matching envelopes, not all five.
Pre-fix that test would have asserted `result.count == 5` and
`llen == 0` -- a 3-message data loss for the operator who only
wanted to clear two stuck messages.

Also updates the existing wildcard regression test
(`test_celery_broker_clear_synthetic_target_with_wildcard_task_name_drains_all_matches`)
to construct its synthetic target with `task_name=_FILTER_ANY`
instead of `None`, matching the new contract the
`clear_by_filter_view` synthetic target uses.
thomasrockhu-codecov added a commit that referenced this pull request May 1, 2026
Bugbot review on PR #903 caught a high-severity over-broaden in
the wildcard fix from ebbb6cf: `_envelope_matches_any_filter`
treated `None` slots in any filter tuple as wildcards, but the
per-message clear paths (`clear_selected`, `clear_dry_run`,
`delete_model`, `delete_queryset`) build filter tuples from
materialised rows whose envelope legitimately carries `None` in
a slot (e.g. a `sync_repos` task with `repoid=None` and
`commitid=None`). Under the prior wildcard semantic, single-row
clear of one such message would have tombstoned EVERY message
sharing the task name regardless of repoid / commit -- a real
data-loss regression in production.

Fix introduces a dedicated `_FILTER_ANY` sentinel (a
`_FilterWildcard` singleton) for "do not constrain on this slot"
that is *distinct from* `None`. The match logic checks
`is _FILTER_ANY` (identity) for the wildcard branch and falls
through to `==` for everything else, so:

* Operator-input paths (`streaming_celery_count`, the
  `clear_by_filter_view` synthetic target) substitute
  `_FILTER_ANY` for unset slots and get wildcard semantics.
* Per-message paths keep passing real envelope values --
  including legitimate `None` slots -- and get exact-tuple
  membership semantics (`(task, None, None)` matches only
  `(task, None, None)` envelopes, not every envelope sharing
  the task name).

Adds a regression test
(`test_celery_broker_clear_per_message_target_does_not_overmatch_on_none_slots`)
seeding two `(sync_repos, None, None)` envelopes alongside three
`(sync_repos, repoid, commitid)` envelopes; clearing a single
materialised target with `(repoid, commitid) = (None, None)` must
tombstone exactly the two matching envelopes, not all five.
Pre-fix that test would have asserted `result.count == 5` and
`llen == 0` -- a 3-message data loss for the operator who only
wanted to clear two stuck messages.

Also updates the existing wildcard regression test
(`test_celery_broker_clear_synthetic_target_with_wildcard_task_name_drains_all_matches`)
to construct its synthetic target with `task_name=_FILTER_ANY`
instead of `None`, matching the new contract the
`clear_by_filter_view` synthetic target uses.
thomasrockhu-codecov added a commit that referenced this pull request May 1, 2026
…ncel

Layered on top of the wildcard sentinel fix from 77047ba so PR #904
ships both the high-severity data-loss fix from #903's bugbot review
AND the next round of clear-queue UX work in a single PR.

Decouples the destructive clear's wall-clock from the HTTP request
lifetime: a confirmed `clear_all` / `clear_keep_one` submit on
`clear-by-filter/` no longer runs the streaming clear synchronously
inside the gunicorn worker -- it spawns a background daemon thread
tracked under `redis_admin:celery_clear_job:<uuid>` in the cache
Redis (kept off the broker so a clear that drains the broker can't
evict its own progress hash) and 302s the operator to a progress
page that polls a status JSON view every ~1s. The page renders a
`<progress>` bar, matched/total/drift/pass counters, a status pill
(neutral / blue / green / amber / red per state), and a Cancel
button that lands at the next chunk boundary (default 1000 messages
per LRANGE chunk in the chunked variant -- the synchronous path
keeps the existing 10k chunk size unchanged).

Service layer
-------------

- `start_celery_broker_clear_job(queue, *, user, task_name, repoid,
  commitid, keep_one, dry_run) -> str` snapshots `LLEN(queue)` as
  `total_estimated`, writes the initial job hash with
  `status=pending`, sets a 24h TTL so abandoned jobs auto-cleanup,
  then starts a daemon thread named `redis_admin.clear_job.<8hex>`.
- `get_celery_broker_clear_job(job_id)` reads the hash; the
  status JSON view casts numerics to int before serialising for
  a stable wire shape pinned by tests.
- `request_cancel_celery_broker_clear_job(job_id)` flips
  `cancel_requested=1` (idempotent; returns False on unknown id).

`_streaming_celery_clear` gains two optional kwargs --
`chunk_size` and `progress_callback` -- so the chunked-job worker
and the synchronous `celery_broker_clear` share the same chunk
inner-loop. The callback receives a `_ChunkProgress` snapshot
(per-chunk deltas + running totals) at every chunk boundary;
returning True triggers a clean cancel: in-flight tombstones for
the current pass are LREMmed (queue is left clean, not partial-
graveyard) and `_StreamingClearStats.cancelled=True`. Cancel only
ever lands at chunk boundaries, never mid-LSET, so a tombstone we
already wrote is always paired with the matching LREM at pass-end
or cancel-drain.

Worker exceptions are swallowed: `log.exception(...)` so Sentry
captures, plus `status=failed` + `error=<str(exc)>` on the hash
so the operator's progress page shows the failure. The daemon
thread never propagates to the gunicorn worker.

Audit log: a `LogEntry` row is written at job-completion (not
job-start) with `mode` ∈ `chunked-dry-run` /
`chunked-all-from-bucket` / `chunked-all-but-first` (vs
synchronous `dry-run` / `all-from-bucket` / `all-but-first`) plus
`job_id`, `passes_run`, `total_drifted`, `cancelled` extras so
queries can distinguish chunked from synchronous clears.

Why threading and not Celery: the queue we're clearing is the
Celery broker, so submitting a control task to it creates a
circular dependency, and the operator pod isn't part of the worker
fleet so the task wouldn't pick up anyway. Restart-safety
trade-off documented in README -- a worker killed mid-job leaves
in-flight tombstones in the queue; operator can re-run (the new
pass walks the full queue and skips them) or `LREM` by hand.

Admin views
-----------

Three new superuser-only URLs under `clear-by-filter/job/<uuid>/`:

- `clear_by_filter_progress_view` -- renders the progress page
  HTML with an initial server-side snapshot so the page is
  informative even before JS loads (matched / processed / drift /
  pass / status pill / timestamps).
- `clear_by_filter_status_view` -- JSON snapshot of the job
  hash. 404s on unknown / expired ids.
- `clear_by_filter_cancel_view` -- POST-only (405 on GET).
  Sets `cancel_requested=1` and returns 202 with the latest
  snapshot.

`clear_by_filter_view` rewired so confirmed destructive submits
now spawn a job and 302 to the progress page; dry-run keeps the
synchronous shape (audit-log + re-rendered preview) since the
streaming-count walk is fast.

Frontend
--------

CSP-compliant -- all CSS / JS loads via `{% static %}`, no inline
`<script>` / `<style>`. Mirrors the same approach as PR #902's
chart fragment loader (`celery_chart_fragment.{js,css}`).

- `clear_by_filter_progress.html` -- card layout, native
  `<progress>` element so screen readers announce progress, real
  `<form>` for the cancel button so the page degrades gracefully
  without JS.
- `celery_clear_progress.js` -- IIFE, no globals. Polls every 1s,
  backs off (1s -> 2s -> 5s, capped) on errors. Stops on
  terminal status, switches the layout (Cancel button reads
  "Done" / "Cancelled" / "Failed", back link enables). Cancel
  button hijacks form submit, fires `fetch()` with the CSRF
  token, prevents double-clicks.
- `celery_clear_progress.css` -- Django-admin theme variable
  driven (light + dark), `prefers-reduced-motion` fallback.

Tests
-----

12 new tests in `redis_admin/tests/test_celery_broker_clear_job.py`,
11 of which run under `-m 'not django_db'` (the audit-log test
needs `LogEntry` and runs in CI under `@pytest.mark.django_db`):

- service: hash shape pinned, deep-queue drain end-to-end
  (CELERY_BROKER_DISPLAY_LIMIT + 500 messages -> queue empty),
  cancel halts mid-pass with remaining matches intact, keep_one
  leaves exactly one survivor at index 0, monotonic progress
  callback, failure swallows exception + records `status=failed`.
- views: status JSON shape pinned, cancel rejects GET (405),
  unknown id returns 404 JSON, confirmed submit on
  `clear_by_filter_view` redirects to the progress page,
  dry-run preview unchanged (no chunked job spawned).
- audit (django_db): chunked-mode payload contains `mode`,
  `job_id`, `cancelled`, `passes_run`.

Request timeouts
----------------

In-repo gunicorn `--timeout` is at `apps/codecov-api/api.sh:118`
defaulting to 600s (env-overridable via `GUNICORN_TIMEOUT`).
The user-observed 120s ceiling is NOT in this repo; most likely
sources are external (Cloudflare's free-tier 100s ceiling, GCP
HTTPS LB Backend Service timeout, or k8s ingress annotations) and
need an infra-side investigation. Knob is documented in the
README; this PR doesn't bump it (deploy-tuning decision).

Long-term, the chunked-clear flow itself sidesteps the entire
upstream-timeout class -- the POST that starts the clear spawns
a thread + 302s in well under a second, and each subsequent poll
is sub-millisecond, so no HTTP request ever sits on the actual
clearing work.
github-merge-queue Bot pushed a commit that referenced this pull request May 1, 2026
…ldcard sentinel fix (#904)

* fix(redis_admin): isolate streaming wildcard semantic with sentinel

Bugbot review on PR #903 caught a high-severity over-broaden in
the wildcard fix from ebbb6cf: `_envelope_matches_any_filter`
treated `None` slots in any filter tuple as wildcards, but the
per-message clear paths (`clear_selected`, `clear_dry_run`,
`delete_model`, `delete_queryset`) build filter tuples from
materialised rows whose envelope legitimately carries `None` in
a slot (e.g. a `sync_repos` task with `repoid=None` and
`commitid=None`). Under the prior wildcard semantic, single-row
clear of one such message would have tombstoned EVERY message
sharing the task name regardless of repoid / commit -- a real
data-loss regression in production.

Fix introduces a dedicated `_FILTER_ANY` sentinel (a
`_FilterWildcard` singleton) for "do not constrain on this slot"
that is *distinct from* `None`. The match logic checks
`is _FILTER_ANY` (identity) for the wildcard branch and falls
through to `==` for everything else, so:

* Operator-input paths (`streaming_celery_count`, the
  `clear_by_filter_view` synthetic target) substitute
  `_FILTER_ANY` for unset slots and get wildcard semantics.
* Per-message paths keep passing real envelope values --
  including legitimate `None` slots -- and get exact-tuple
  membership semantics (`(task, None, None)` matches only
  `(task, None, None)` envelopes, not every envelope sharing
  the task name).

Adds a regression test
(`test_celery_broker_clear_per_message_target_does_not_overmatch_on_none_slots`)
seeding two `(sync_repos, None, None)` envelopes alongside three
`(sync_repos, repoid, commitid)` envelopes; clearing a single
materialised target with `(repoid, commitid) = (None, None)` must
tombstone exactly the two matching envelopes, not all five.
Pre-fix that test would have asserted `result.count == 5` and
`llen == 0` -- a 3-message data loss for the operator who only
wanted to clear two stuck messages.

Also updates the existing wildcard regression test
(`test_celery_broker_clear_synthetic_target_with_wildcard_task_name_drains_all_matches`)
to construct its synthetic target with `task_name=_FILTER_ANY`
instead of `None`, matching the new contract the
`clear_by_filter_view` synthetic target uses.

* feat(redis_admin): chunked celery_broker clear with progress bar + cancel

Layered on top of the wildcard sentinel fix from 77047ba so PR #904
ships both the high-severity data-loss fix from #903's bugbot review
AND the next round of clear-queue UX work in a single PR.

Decouples the destructive clear's wall-clock from the HTTP request
lifetime: a confirmed `clear_all` / `clear_keep_one` submit on
`clear-by-filter/` no longer runs the streaming clear synchronously
inside the gunicorn worker -- it spawns a background daemon thread
tracked under `redis_admin:celery_clear_job:<uuid>` in the cache
Redis (kept off the broker so a clear that drains the broker can't
evict its own progress hash) and 302s the operator to a progress
page that polls a status JSON view every ~1s. The page renders a
`<progress>` bar, matched/total/drift/pass counters, a status pill
(neutral / blue / green / amber / red per state), and a Cancel
button that lands at the next chunk boundary (default 1000 messages
per LRANGE chunk in the chunked variant -- the synchronous path
keeps the existing 10k chunk size unchanged).

Service layer
-------------

- `start_celery_broker_clear_job(queue, *, user, task_name, repoid,
  commitid, keep_one, dry_run) -> str` snapshots `LLEN(queue)` as
  `total_estimated`, writes the initial job hash with
  `status=pending`, sets a 24h TTL so abandoned jobs auto-cleanup,
  then starts a daemon thread named `redis_admin.clear_job.<8hex>`.
- `get_celery_broker_clear_job(job_id)` reads the hash; the
  status JSON view casts numerics to int before serialising for
  a stable wire shape pinned by tests.
- `request_cancel_celery_broker_clear_job(job_id)` flips
  `cancel_requested=1` (idempotent; returns False on unknown id).

`_streaming_celery_clear` gains two optional kwargs --
`chunk_size` and `progress_callback` -- so the chunked-job worker
and the synchronous `celery_broker_clear` share the same chunk
inner-loop. The callback receives a `_ChunkProgress` snapshot
(per-chunk deltas + running totals) at every chunk boundary;
returning True triggers a clean cancel: in-flight tombstones for
the current pass are LREMmed (queue is left clean, not partial-
graveyard) and `_StreamingClearStats.cancelled=True`. Cancel only
ever lands at chunk boundaries, never mid-LSET, so a tombstone we
already wrote is always paired with the matching LREM at pass-end
or cancel-drain.

Worker exceptions are swallowed: `log.exception(...)` so Sentry
captures, plus `status=failed` + `error=<str(exc)>` on the hash
so the operator's progress page shows the failure. The daemon
thread never propagates to the gunicorn worker.

Audit log: a `LogEntry` row is written at job-completion (not
job-start) with `mode` ∈ `chunked-dry-run` /
`chunked-all-from-bucket` / `chunked-all-but-first` (vs
synchronous `dry-run` / `all-from-bucket` / `all-but-first`) plus
`job_id`, `passes_run`, `total_drifted`, `cancelled` extras so
queries can distinguish chunked from synchronous clears.

Why threading and not Celery: the queue we're clearing is the
Celery broker, so submitting a control task to it creates a
circular dependency, and the operator pod isn't part of the worker
fleet so the task wouldn't pick up anyway. Restart-safety
trade-off documented in README -- a worker killed mid-job leaves
in-flight tombstones in the queue; operator can re-run (the new
pass walks the full queue and skips them) or `LREM` by hand.

Admin views
-----------

Three new superuser-only URLs under `clear-by-filter/job/<uuid>/`:

- `clear_by_filter_progress_view` -- renders the progress page
  HTML with an initial server-side snapshot so the page is
  informative even before JS loads (matched / processed / drift /
  pass / status pill / timestamps).
- `clear_by_filter_status_view` -- JSON snapshot of the job
  hash. 404s on unknown / expired ids.
- `clear_by_filter_cancel_view` -- POST-only (405 on GET).
  Sets `cancel_requested=1` and returns 202 with the latest
  snapshot.

`clear_by_filter_view` rewired so confirmed destructive submits
now spawn a job and 302 to the progress page; dry-run keeps the
synchronous shape (audit-log + re-rendered preview) since the
streaming-count walk is fast.

Frontend
--------

CSP-compliant -- all CSS / JS loads via `{% static %}`, no inline
`<script>` / `<style>`. Mirrors the same approach as PR #902's
chart fragment loader (`celery_chart_fragment.{js,css}`).

- `clear_by_filter_progress.html` -- card layout, native
  `<progress>` element so screen readers announce progress, real
  `<form>` for the cancel button so the page degrades gracefully
  without JS.
- `celery_clear_progress.js` -- IIFE, no globals. Polls every 1s,
  backs off (1s -> 2s -> 5s, capped) on errors. Stops on
  terminal status, switches the layout (Cancel button reads
  "Done" / "Cancelled" / "Failed", back link enables). Cancel
  button hijacks form submit, fires `fetch()` with the CSRF
  token, prevents double-clicks.
- `celery_clear_progress.css` -- Django-admin theme variable
  driven (light + dark), `prefers-reduced-motion` fallback.

Tests
-----

12 new tests in `redis_admin/tests/test_celery_broker_clear_job.py`,
11 of which run under `-m 'not django_db'` (the audit-log test
needs `LogEntry` and runs in CI under `@pytest.mark.django_db`):

- service: hash shape pinned, deep-queue drain end-to-end
  (CELERY_BROKER_DISPLAY_LIMIT + 500 messages -> queue empty),
  cancel halts mid-pass with remaining matches intact, keep_one
  leaves exactly one survivor at index 0, monotonic progress
  callback, failure swallows exception + records `status=failed`.
- views: status JSON shape pinned, cancel rejects GET (405),
  unknown id returns 404 JSON, confirmed submit on
  `clear_by_filter_view` redirects to the progress page,
  dry-run preview unchanged (no chunked job spawned).
- audit (django_db): chunked-mode payload contains `mode`,
  `job_id`, `cancelled`, `passes_run`.

Request timeouts
----------------

In-repo gunicorn `--timeout` is at `apps/codecov-api/api.sh:118`
defaulting to 600s (env-overridable via `GUNICORN_TIMEOUT`).
The user-observed 120s ceiling is NOT in this repo; most likely
sources are external (Cloudflare's free-tier 100s ceiling, GCP
HTTPS LB Backend Service timeout, or k8s ingress annotations) and
need an infra-side investigation. Knob is documented in the
README; this PR doesn't bump it (deploy-tuning decision).

Long-term, the chunked-clear flow itself sidesteps the entire
upstream-timeout class -- the POST that starts the clear spawns
a thread + 302s in well under a second, and each subsequent poll
is sub-millisecond, so no HTTP request ever sits on the actual
clearing work.

* docs(redis_admin): drop gunicorn timeout knob discussion

The chunked clear job already decouples the clear's wall-clock
from the HTTP request lifetime, so the gunicorn `--timeout`
knob is no longer relevant to whether a clear succeeds. Drop
the README + PR-body discussion of `apps/codecov-api/api.sh:118`,
`GUNICORN_TIMEOUT`, Cloudflare's per-request ceiling, the GCP
HTTPS LB Backend Service `timeoutSec`, and the ingress
annotations — operators don't need to reason about any of those
to use the new progress-bar flow. Trim the corresponding
"Why this exists" paragraph in the README and the analogous
header comment in services.py to just the design intent.

* fix(redis_admin): preserve status pill base class + 0-base chunk_index

Two Bugbot findings on PR #904:

* MEDIUM (`celery_clear_progress.js`): `setStatusPill` filtered
  classes by the `celery-clear-status-` prefix, which dropped the
  base `celery-clear-status-pill` class along with the previous
  state class. After the first poll the pill lost all base styling
  (display, padding, border-radius, font-weight, etc.) and only
  the colour overrides remained. Replace the prefix-strip with an
  explicit `STATE_CLASSES` enumeration so only state classes are
  toggled and the base class survives.

* LOW (`services.py`): `_ChunkProgress.chunk_index` was documented
  as 0-based but incremented BEFORE being passed to the callback,
  effectively making it 1-based (first chunk reported 1, last
  reported `chunks_total`). Move the increment to AFTER the
  callback so the snapshot describes the chunk that just finished
  and the documented contract holds. Adds a regression assertion
  that snapshots[*].chunk_index == [0, 1, ..., chunks_total - 1].

* test(redis_admin): make audit-log test cross-thread visible

CI failure on PR #904 traced to a Django + threading transaction-
visibility issue, not a logic regression: the chunked-mode audit-
log test creates a `User` via `UserFactory()` from the test thread,
then `start_celery_broker_clear_job` spawns the daemon worker
which opens its own DB connection and tries to insert a
`LogEntry` referring to that user. With plain `@pytest.mark.
django_db`, the test thread's writes live inside an uncommitted
transaction the worker thread can't see, so the FK to `users.id`
fails: `ForeignKeyViolation: Key (user_id)=(N) is not present in
table "users".`

Fix:

* Switch the test to `@pytest.mark.django_db(transaction=True)`
  so the user is committed before the worker thread reads it
  (Django's `TransactionTestCase` semantics).
* Add `close_old_connections()` to the worker thread's `finally`
  block so the lazily-opened ORM connection drops at thread exit.
  Without that, `transaction=True`'s truncate-on-teardown would
  block on our idle connection, and a long-lived gunicorn worker
  would slowly leak one connection per chunked clear that ever
  ran on it. `close_old_connections` is the documented helper
  for this exact background-thread cleanup case.

No production-code logic change. The non-DB tests continue to
pass (11/11); the `transaction=True` test will be validated by
CI's API CI / Test job since this sandbox can't reach Postgres.

* ref(redis-admin): consolidate _FILTER_ANY substitution into shared helper

Bugbot review on PR #904 (LOW): the operator-input filter-tuple
construction `(task_name if task_name else _FILTER_ANY, repoid if
repoid is not None else _FILTER_ANY, commitid if commitid else
_FILTER_ANY)` was duplicated identically in three sites:
`streaming_celery_count`, `_run_celery_broker_clear_job_body`, and
`clear_by_filter_view`. If the rule diverged (new filter field, or a
slot's truthiness check loosened) the dry-run count, the chunked
clear, and the synchronous preview would silently disagree on which
envelopes match — a data-loss vector.

Extract a `_substitute_filter_any(task_name, repoid, commitid)` helper
next to the `_FILTER_ANY` sentinel definition so all three call sites
share one rule, and add a regression test that pins the contract
(both directly and against the literal inline expression the three
sites used to repeat).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants