feat(redis_admin): split celery_broker caps into display vs scan limits#903
Conversation
The single `MAX_ITEMS_PER_KEY = 20_000` knob was doing too many jobs: it bounded the changelist row materialisation (RAM-heavy: full envelopes stay in the per-request cache), the frequency chart aggregator (CPU-light: discards payloads), and the streaming clear (the same shape as the chart). Lowering it for page-load made the chart shallow; raising it for the chart bloated the changelist. Split into two purpose-specific settings: - `REDIS_ADMIN_CELERY_BROKER_DISPLAY_LIMIT` = 2_000 — per-message rows on the changelist. Smaller because each row keeps a parsed envelope in memory. - `REDIS_ADMIN_CELERY_BROKER_SCAN_LIMIT` = 100_000 — frequency chart aggregation + streaming clear deep scan. Both walk the queue in chunks and discard payloads, so memory stays flat. The chart and clear share this window so anything visible on the chart is reachable by the "Clear queue" button. `MAX_ITEMS_PER_KEY` keeps its existing 20_000 default and stays wired to the unrelated SET/HASH browsing path in `RedisItemQuerySet`.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #903 +/- ##
=======================================
Coverage 91.90% 91.91%
=======================================
Files 1316 1316
Lines 50318 50380 +62
Branches 1625 1625
=======================================
+ Hits 46245 46305 +60
- Misses 3767 3769 +2
Partials 306 306
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
The cap split in this branch left the streaming clear bound to `CELERY_BROKER_SCAN_LIMIT` (100_000), which is the same window the chart sampler uses. On a queue saturated past 2x cap pass 1 and pass 2 each tombstoned exactly that many matches and the `prev_lset == matches_lset` plateau check spuriously declared convergence, leaving the rest of the queue in place. Fold in #904's fix: drop the cap from `_streaming_celery_clear` so each pass walks `LLEN(queue)`. The plateau check stays for genuine consumer-side drift. Update the `CELERY_BROKER_SCAN_LIMIT` docstring + README to reflect that the clear is intentionally unbounded and only the chart aggregator uses the sample window. Adds the regression test from #904 (renamed to drop the now- irrelevant `MAX_ITEMS_PER_KEY` reference): seeds 41k matching messages and asserts `LLEN == 1` after a single clear pass — fails on the pre-fold #903 (cap = 100k matches) and passes after.
The "Clear by filter" preview was using `pending_count` (the materialised target list, capped by `CELERY_BROKER_DISPLAY_LIMIT = 2_000`) as the dry-run match count. After the cap split in this branch the actual streaming clear is unbounded, so on a 200k-deep queue with 190k matches the preview reported ~1.9k while "Clear all" then drained 190k — confusing. Run `_streaming_celery_clear(..., dry_run=True)` on dry-run too so the preview walks the same `LLEN(queue)` the actual clear walks. Adds `total_found` to `_StreamingClearStats` to expose the count, and short-circuits dry-run after one pass since nothing's changing across passes. `keep_one` mode subtracts the per-queue keeper so the preview matches what the actual clear will tombstone. Adds regression test seeding 5k matches and asserting both `dry_run=True / keep_one=False → count == 5_000` and `keep_one=True → count == 4_999` while `LLEN` stays at 5_000.
The earlier dry-run fix on this branch made `celery_broker_clear(dry_run=True).count` reflect the streaming counter's full-queue scan, but the clear-by-filter preview page itself was still showing `len(matches)` (capped by `CELERY_BROKER_DISPLAY_LIMIT = 2_000`). On a queue with 190k matches the preview said "1903" while "Clear all" then drained 190k. Move `match_count` and `kept_index` to a streaming pass via the new `streaming_celery_count` helper. Keep the queryset only for the small `sample_targets` preview table (<= 20 rows). Adds `first_match_index` to `_StreamingClearStats` so the helper can surface the lowest-index match for the "Clear all but first" action's `kept_index` callout. Also folds in #905: replaces the subtle "Loading frequency chart..." help-text line with a styled loader card (animated spinner, themed via Django admin CSS custom properties, reduced-motion fallback). Loader styles ship as an external stylesheet referenced via {% extrastyle %} so they survive production CSP - same pattern PR #902 used for the script. Adds two regressions: the clear-by-filter preview reports the full-queue count (`Matched: N message(s)` for N > display limit), and the changelist body contains the new loader card markup + CSS link.
Bugbot review on PR #903 caught a low-severity off-by-one in the clear-by-filter dry-run preview when the filter set spans multiple queues. The preview was subtracting `len(by_queue_filters)` (one per queue) from `total_found`, but `_streaming_celery_clear` only preserves a survivor in queues that have at least one match. So if a queue drained to zero matches between the changelist render and the preview (e.g. a consumer popped the last match), the preview underreported by one — diverging from what the real clear path actually tombstones. Fix tracks `queues_with_matches` while iterating per-queue and deducts that count instead of `len(by_queue_filters)`. Adds a regression test that pushes 100 matching messages into one queue, leaves a second queue empty, and asserts `keep_one=True` dry-run reports `n - 1` (not `n - 2`). Verified the test fails on the pre-fix code with `assert 98 == (100 - 1)`.
… clears land Bugbot review on PR #903 caught a high-severity silent-no-op: `clear_by_filter_view` was passing `sample_targets` (the materialised window capped at `CELERY_BROKER_DISPLAY_LIMIT = 2_000`) as the `targets` arg to `celery_broker_clear`. Two failure modes: 1. All matches sit beyond the display window (recent burst on a queue with `LLEN > 2_000`) -- `sample_targets` is empty, so `_celery_broker_clear` derives `by_queue_filters = {}`, the streaming-clear loop never executes, and the operator sees "Cleared 0 of N matching message(s)" while the queue stays full. `match_count` (from `streaming_celery_count`, which walks the full queue) had already reported the matches. 2. Filter set inferred from sample_targets enumerates only the `(task, repoid, commit)` triples present in the first 2_000 rows -- triples living past that cap survive the clear. Fix: synthesise a single `CeleryBrokerQueue` carrying the operator's exact filter (queue + task_name? + repoid? + commitid?) and pass that as `targets`. `_celery_broker_clear` now derives the same `frozenset({(task or None, repoid, commit or None)})` that `streaming_celery_count` walks the full queue with, so preview count and actual clear count agree even on 500k-deep queues. The per-message and bulk-select paths (delete_model / clear_selected / delete_queryset / clear_dry_run) keep passing real materialised rows -- they're index-driven, not filter-driven, and don't hit this code path. Also fixes the low-severity hint hardcode flagged in the same review: the loader copy under the chart fragment said "Sampling up to 100,000 messages" regardless of `CELERY_BROKER_SCAN_LIMIT`. Pass the actual setting through `changelist_view`'s template context (pre-formatted with thousands separators since `django.contrib.humanize` isn't in INSTALLED_APPS). Adds regression test seeding `CELERY_BROKER_DISPLAY_LIMIT + 1_000` matching messages and asserting `celery_broker_clear` with a synthetic filter target drains the entire queue (`llen == 0`, `result.count == n`). Pre-fix that test fails with `result.count == 0` and the queue intact.
Pre-existing bug surfaced by the synthetic-target fix in the prior
commit and by the existing `CeleryBrokerClearByFilterViewTest` /
`ChartFragmentViewTest` suites that started failing on this branch:
`_streaming_celery_clear` (used by both `streaming_celery_count`
and the actual clear path) compared `(meta.task, meta.repoid,
meta.commitid)` to `filter_tuples` via exact set membership. When
the operator left `task_name` unset, `streaming_celery_count`
built `frozenset({(None, repoid, commitid)})` and compared it
against real envelopes whose `meta.task` is always populated, so
nothing matched. The preview reported zero matches even when the
queryset's `filter(repoid=..., commitid__startswith=...)` would
surface rows -- which is what the eight failing
`CeleryBrokerClearByFilterViewTest` tests on commit 39aed24
were tripping over.
Fix: replace the membership check with `_envelope_matches_any_filter`
which treats `None` slots in any filter tuple as wildcards
(`ft_task is None or ft_task == meta.task`, same for repoid /
commitid). Mirrors the queryset's "filter by what's set" semantic,
so the streaming counter and clear agree with the queryset's match
set across every operator-input shape (queue + repoid only,
queue + task only, queue + commit only, queue + any combination).
Tuples with all three slots populated keep the same exact-equality
behaviour (`None` short-circuits, real values still match by
`==`), so existing tests using the `CeleryBrokerQueueQuerySet`-
materialised path continue to pass unchanged.
Adds two regression tests:
* `test_streaming_celery_count_wildcard_task_name_matches_any_task`
pins the counter contract: `task_name=None, repoid=R, commit=C`
matches every envelope with that `(repoid, commit)` regardless
of task class. Pre-fix it returned 0; now it returns 5 (3
NotifyTask + 2 UploadTask) while the unrelated `repoid+1`
envelope is excluded.
* `test_celery_broker_clear_synthetic_target_with_wildcard_task_name_drains_all_matches`
pins the clear contract: a synthetic target with `task_name=None`
drains all 5 matching envelopes (across both task classes) and
preserves the unrelated row. Pre-fix `result.count == 0` and
the queue stayed full.
…nd-to-end PR #903's clear-queue path has regressed twice (folded fixes: PR #904 -- streaming clear walks full queue, no artificial cap; and the synthetic-target fix -- clear-by-filter uses a single synthetic target so deep-queue clears don't silently no-op). Both regressions were silent at the existing-test level because the assertions inferred drain success from `total_lset`/`stats.total_found` rather than walking the post-call queue. Add a small focused module that: * Asserts `LLEN(queue) == initial_total - matches` explicitly before/after `celery_broker_clear`. * Walks the full post-call queue with `parse_celery_envelope` and *positively* counts how many envelopes still match the filter triple, so any future regression that silently leaves matches behind (capped scan, dropped tail, etc.) fails immediately. * Exercises queues with `LLEN > CELERY_BROKER_DISPLAY_LIMIT` for every scenario so a queryset-bounded clear can never sneak back in. Three scenarios: * `test_celery_broker_clear_drains_deep_queue_end_to_end` -- CELERY_BROKER_DISPLAY_LIMIT + 500 matching envelopes, single synthetic target through `celery_broker_clear(dry_run=False, keep_one=False)`, asserts `LLEN == 0` and zero remaining matches via the full-queue walk. * `test_celery_broker_clear_keep_one_deep_queue_leaves_exactly_one_at_head` -- CELERY_BROKER_DISPLAY_LIMIT + 100 matching envelopes, `keep_one=True`, asserts exactly one survivor sitting at index 0 (head-of-queue, the next message a worker would BLPOP). * `test_celery_broker_clear_mixed_content_deep_queue_clears_only_filter` -- three task buckets (A: target, B: same repoid+commit but different task, C: same task but different repoid+commit), B + C seeded both before A (head-of-queue) and after A (past the display window) so the clear must walk the full LLEN. Asserts zero A survivors, every B and C survives. Lives in its own module to insulate from churn on the much larger `test_celery_broker_queue.py` and uses `fakeredis` + `SimpleNamespace` so it runs cleanly under \`pytest -m 'not django_db'\`.
…ily option - Default the CeleryBrokerQueueAdmin summary view to ORDER BY depth DESC, queue_name ASC so the deepest queues surface first (operator scans the top of the page for hot spots; ties resolve alphabetically). Drill-down per-message view keeps its existing index-ordered list. - Remove `celery_broker` from the RedisQueueAdmin family list filter choices. The data path already excluded celery_broker rows, but the dropdown still offered it as a clickable option that returned zero rows — now hidden entirely via a new QueueFamilyFilter subclass that plumbs an `exclude` list through `families.iter_families()`. Tests: - Pin summary-mode ordering (depth DESC, name ASC, with a tie on depth). - Pin admin.get_ordering flip between summary and drill-down modes. - Pin family-filter choices to confirm celery_broker is dropped while other registered families are still offered (queue + lock categories each checked). - Pin iter_families registry behaviour (unfiltered, category-narrowed, exclude-filtered).
45dff23 to
4f86eee
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 4f86eee. Configure here.
| if ft_commitid is not None and ft_commitid != meta.commitid: | ||
| continue | ||
| return True | ||
| return False |
There was a problem hiding this comment.
Wildcard filter over-matches in per-message clear paths
High Severity
_envelope_matches_any_filter treats None as a wildcard ("don't constrain"), but the per-message clear paths (clear_selected, clear_dry_run) build filter tuples from materialised rows where None means "the envelope genuinely has no value for this field." A sync_repos task with repoid=None and commitid=None produces filter tuple (task, None, None), which under wildcard semantics matches every message with that task name — not just the one the operator selected. The old exact-tuple-membership check (key not in filter_tuples) correctly distinguished "field is None" from "field is unconstrained." The wildcard semantic is correct for the new synthetic-target path from clear_by_filter_view, but silently broadens the per-message paths, potentially tombstoning unintended messages.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 4f86eee. Configure here.
There was a problem hiding this comment.
Good catch — this is a real over-broaden risk and would have been a data-loss regression in production. Fixed in 566eb62.
Switched the wildcard signal from None to a dedicated _FILTER_ANY sentinel (a _FilterWildcard singleton), and the match logic checks is _FILTER_ANY (identity) for the wildcard branch and falls through to == for everything else. So:
- Operator-input paths (
streaming_celery_count, theclear_by_filter_viewsynthetic target) substitute_FILTER_ANYfor unset slots and get wildcard semantics. - Per-message paths keep passing real envelope values — including legitimate
Noneslots likesync_repos's(repoid, commitid) = (None, None)— and get exact-tuple membership semantics.(task, None, None)only matches(task, None, None)envelopes, not every envelope sharing the task name.
Added regression test test_celery_broker_clear_per_message_target_does_not_overmatch_on_none_slots: seeds two (sync_repos, None, None) envelopes alongside three (sync_repos, repoid, commitid) envelopes; clearing a single materialised target with (repoid, commitid) = (None, None) must tombstone exactly the two matching envelopes, not all five. Pre-fix it would have asserted result.count == 5 and llen == 0 — a 3-message data loss for the operator who only wanted to clear two stuck messages.
Bugbot review on PR #903 caught a high-severity over-broaden in the wildcard fix from ebbb6cf: `_envelope_matches_any_filter` treated `None` slots in any filter tuple as wildcards, but the per-message clear paths (`clear_selected`, `clear_dry_run`, `delete_model`, `delete_queryset`) build filter tuples from materialised rows whose envelope legitimately carries `None` in a slot (e.g. a `sync_repos` task with `repoid=None` and `commitid=None`). Under the prior wildcard semantic, single-row clear of one such message would have tombstoned EVERY message sharing the task name regardless of repoid / commit -- a real data-loss regression in production. Fix introduces a dedicated `_FILTER_ANY` sentinel (a `_FilterWildcard` singleton) for "do not constrain on this slot" that is *distinct from* `None`. The match logic checks `is _FILTER_ANY` (identity) for the wildcard branch and falls through to `==` for everything else, so: * Operator-input paths (`streaming_celery_count`, the `clear_by_filter_view` synthetic target) substitute `_FILTER_ANY` for unset slots and get wildcard semantics. * Per-message paths keep passing real envelope values -- including legitimate `None` slots -- and get exact-tuple membership semantics (`(task, None, None)` matches only `(task, None, None)` envelopes, not every envelope sharing the task name). Adds a regression test (`test_celery_broker_clear_per_message_target_does_not_overmatch_on_none_slots`) seeding two `(sync_repos, None, None)` envelopes alongside three `(sync_repos, repoid, commitid)` envelopes; clearing a single materialised target with `(repoid, commitid) = (None, None)` must tombstone exactly the two matching envelopes, not all five. Pre-fix that test would have asserted `result.count == 5` and `llen == 0` -- a 3-message data loss for the operator who only wanted to clear two stuck messages. Also updates the existing wildcard regression test (`test_celery_broker_clear_synthetic_target_with_wildcard_task_name_drains_all_matches`) to construct its synthetic target with `task_name=_FILTER_ANY` instead of `None`, matching the new contract the `clear_by_filter_view` synthetic target uses.
Bugbot review on PR #903 caught a high-severity over-broaden in the wildcard fix from ebbb6cf: `_envelope_matches_any_filter` treated `None` slots in any filter tuple as wildcards, but the per-message clear paths (`clear_selected`, `clear_dry_run`, `delete_model`, `delete_queryset`) build filter tuples from materialised rows whose envelope legitimately carries `None` in a slot (e.g. a `sync_repos` task with `repoid=None` and `commitid=None`). Under the prior wildcard semantic, single-row clear of one such message would have tombstoned EVERY message sharing the task name regardless of repoid / commit -- a real data-loss regression in production. Fix introduces a dedicated `_FILTER_ANY` sentinel (a `_FilterWildcard` singleton) for "do not constrain on this slot" that is *distinct from* `None`. The match logic checks `is _FILTER_ANY` (identity) for the wildcard branch and falls through to `==` for everything else, so: * Operator-input paths (`streaming_celery_count`, the `clear_by_filter_view` synthetic target) substitute `_FILTER_ANY` for unset slots and get wildcard semantics. * Per-message paths keep passing real envelope values -- including legitimate `None` slots -- and get exact-tuple membership semantics (`(task, None, None)` matches only `(task, None, None)` envelopes, not every envelope sharing the task name). Adds a regression test (`test_celery_broker_clear_per_message_target_does_not_overmatch_on_none_slots`) seeding two `(sync_repos, None, None)` envelopes alongside three `(sync_repos, repoid, commitid)` envelopes; clearing a single materialised target with `(repoid, commitid) = (None, None)` must tombstone exactly the two matching envelopes, not all five. Pre-fix that test would have asserted `result.count == 5` and `llen == 0` -- a 3-message data loss for the operator who only wanted to clear two stuck messages. Also updates the existing wildcard regression test (`test_celery_broker_clear_synthetic_target_with_wildcard_task_name_drains_all_matches`) to construct its synthetic target with `task_name=_FILTER_ANY` instead of `None`, matching the new contract the `clear_by_filter_view` synthetic target uses.
…ncel Layered on top of the wildcard sentinel fix from 77047ba so PR #904 ships both the high-severity data-loss fix from #903's bugbot review AND the next round of clear-queue UX work in a single PR. Decouples the destructive clear's wall-clock from the HTTP request lifetime: a confirmed `clear_all` / `clear_keep_one` submit on `clear-by-filter/` no longer runs the streaming clear synchronously inside the gunicorn worker -- it spawns a background daemon thread tracked under `redis_admin:celery_clear_job:<uuid>` in the cache Redis (kept off the broker so a clear that drains the broker can't evict its own progress hash) and 302s the operator to a progress page that polls a status JSON view every ~1s. The page renders a `<progress>` bar, matched/total/drift/pass counters, a status pill (neutral / blue / green / amber / red per state), and a Cancel button that lands at the next chunk boundary (default 1000 messages per LRANGE chunk in the chunked variant -- the synchronous path keeps the existing 10k chunk size unchanged). Service layer ------------- - `start_celery_broker_clear_job(queue, *, user, task_name, repoid, commitid, keep_one, dry_run) -> str` snapshots `LLEN(queue)` as `total_estimated`, writes the initial job hash with `status=pending`, sets a 24h TTL so abandoned jobs auto-cleanup, then starts a daemon thread named `redis_admin.clear_job.<8hex>`. - `get_celery_broker_clear_job(job_id)` reads the hash; the status JSON view casts numerics to int before serialising for a stable wire shape pinned by tests. - `request_cancel_celery_broker_clear_job(job_id)` flips `cancel_requested=1` (idempotent; returns False on unknown id). `_streaming_celery_clear` gains two optional kwargs -- `chunk_size` and `progress_callback` -- so the chunked-job worker and the synchronous `celery_broker_clear` share the same chunk inner-loop. The callback receives a `_ChunkProgress` snapshot (per-chunk deltas + running totals) at every chunk boundary; returning True triggers a clean cancel: in-flight tombstones for the current pass are LREMmed (queue is left clean, not partial- graveyard) and `_StreamingClearStats.cancelled=True`. Cancel only ever lands at chunk boundaries, never mid-LSET, so a tombstone we already wrote is always paired with the matching LREM at pass-end or cancel-drain. Worker exceptions are swallowed: `log.exception(...)` so Sentry captures, plus `status=failed` + `error=<str(exc)>` on the hash so the operator's progress page shows the failure. The daemon thread never propagates to the gunicorn worker. Audit log: a `LogEntry` row is written at job-completion (not job-start) with `mode` ∈ `chunked-dry-run` / `chunked-all-from-bucket` / `chunked-all-but-first` (vs synchronous `dry-run` / `all-from-bucket` / `all-but-first`) plus `job_id`, `passes_run`, `total_drifted`, `cancelled` extras so queries can distinguish chunked from synchronous clears. Why threading and not Celery: the queue we're clearing is the Celery broker, so submitting a control task to it creates a circular dependency, and the operator pod isn't part of the worker fleet so the task wouldn't pick up anyway. Restart-safety trade-off documented in README -- a worker killed mid-job leaves in-flight tombstones in the queue; operator can re-run (the new pass walks the full queue and skips them) or `LREM` by hand. Admin views ----------- Three new superuser-only URLs under `clear-by-filter/job/<uuid>/`: - `clear_by_filter_progress_view` -- renders the progress page HTML with an initial server-side snapshot so the page is informative even before JS loads (matched / processed / drift / pass / status pill / timestamps). - `clear_by_filter_status_view` -- JSON snapshot of the job hash. 404s on unknown / expired ids. - `clear_by_filter_cancel_view` -- POST-only (405 on GET). Sets `cancel_requested=1` and returns 202 with the latest snapshot. `clear_by_filter_view` rewired so confirmed destructive submits now spawn a job and 302 to the progress page; dry-run keeps the synchronous shape (audit-log + re-rendered preview) since the streaming-count walk is fast. Frontend -------- CSP-compliant -- all CSS / JS loads via `{% static %}`, no inline `<script>` / `<style>`. Mirrors the same approach as PR #902's chart fragment loader (`celery_chart_fragment.{js,css}`). - `clear_by_filter_progress.html` -- card layout, native `<progress>` element so screen readers announce progress, real `<form>` for the cancel button so the page degrades gracefully without JS. - `celery_clear_progress.js` -- IIFE, no globals. Polls every 1s, backs off (1s -> 2s -> 5s, capped) on errors. Stops on terminal status, switches the layout (Cancel button reads "Done" / "Cancelled" / "Failed", back link enables). Cancel button hijacks form submit, fires `fetch()` with the CSRF token, prevents double-clicks. - `celery_clear_progress.css` -- Django-admin theme variable driven (light + dark), `prefers-reduced-motion` fallback. Tests ----- 12 new tests in `redis_admin/tests/test_celery_broker_clear_job.py`, 11 of which run under `-m 'not django_db'` (the audit-log test needs `LogEntry` and runs in CI under `@pytest.mark.django_db`): - service: hash shape pinned, deep-queue drain end-to-end (CELERY_BROKER_DISPLAY_LIMIT + 500 messages -> queue empty), cancel halts mid-pass with remaining matches intact, keep_one leaves exactly one survivor at index 0, monotonic progress callback, failure swallows exception + records `status=failed`. - views: status JSON shape pinned, cancel rejects GET (405), unknown id returns 404 JSON, confirmed submit on `clear_by_filter_view` redirects to the progress page, dry-run preview unchanged (no chunked job spawned). - audit (django_db): chunked-mode payload contains `mode`, `job_id`, `cancelled`, `passes_run`. Request timeouts ---------------- In-repo gunicorn `--timeout` is at `apps/codecov-api/api.sh:118` defaulting to 600s (env-overridable via `GUNICORN_TIMEOUT`). The user-observed 120s ceiling is NOT in this repo; most likely sources are external (Cloudflare's free-tier 100s ceiling, GCP HTTPS LB Backend Service timeout, or k8s ingress annotations) and need an infra-side investigation. Knob is documented in the README; this PR doesn't bump it (deploy-tuning decision). Long-term, the chunked-clear flow itself sidesteps the entire upstream-timeout class -- the POST that starts the clear spawns a thread + 302s in well under a second, and each subsequent poll is sub-millisecond, so no HTTP request ever sits on the actual clearing work.
…ldcard sentinel fix (#904) * fix(redis_admin): isolate streaming wildcard semantic with sentinel Bugbot review on PR #903 caught a high-severity over-broaden in the wildcard fix from ebbb6cf: `_envelope_matches_any_filter` treated `None` slots in any filter tuple as wildcards, but the per-message clear paths (`clear_selected`, `clear_dry_run`, `delete_model`, `delete_queryset`) build filter tuples from materialised rows whose envelope legitimately carries `None` in a slot (e.g. a `sync_repos` task with `repoid=None` and `commitid=None`). Under the prior wildcard semantic, single-row clear of one such message would have tombstoned EVERY message sharing the task name regardless of repoid / commit -- a real data-loss regression in production. Fix introduces a dedicated `_FILTER_ANY` sentinel (a `_FilterWildcard` singleton) for "do not constrain on this slot" that is *distinct from* `None`. The match logic checks `is _FILTER_ANY` (identity) for the wildcard branch and falls through to `==` for everything else, so: * Operator-input paths (`streaming_celery_count`, the `clear_by_filter_view` synthetic target) substitute `_FILTER_ANY` for unset slots and get wildcard semantics. * Per-message paths keep passing real envelope values -- including legitimate `None` slots -- and get exact-tuple membership semantics (`(task, None, None)` matches only `(task, None, None)` envelopes, not every envelope sharing the task name). Adds a regression test (`test_celery_broker_clear_per_message_target_does_not_overmatch_on_none_slots`) seeding two `(sync_repos, None, None)` envelopes alongside three `(sync_repos, repoid, commitid)` envelopes; clearing a single materialised target with `(repoid, commitid) = (None, None)` must tombstone exactly the two matching envelopes, not all five. Pre-fix that test would have asserted `result.count == 5` and `llen == 0` -- a 3-message data loss for the operator who only wanted to clear two stuck messages. Also updates the existing wildcard regression test (`test_celery_broker_clear_synthetic_target_with_wildcard_task_name_drains_all_matches`) to construct its synthetic target with `task_name=_FILTER_ANY` instead of `None`, matching the new contract the `clear_by_filter_view` synthetic target uses. * feat(redis_admin): chunked celery_broker clear with progress bar + cancel Layered on top of the wildcard sentinel fix from 77047ba so PR #904 ships both the high-severity data-loss fix from #903's bugbot review AND the next round of clear-queue UX work in a single PR. Decouples the destructive clear's wall-clock from the HTTP request lifetime: a confirmed `clear_all` / `clear_keep_one` submit on `clear-by-filter/` no longer runs the streaming clear synchronously inside the gunicorn worker -- it spawns a background daemon thread tracked under `redis_admin:celery_clear_job:<uuid>` in the cache Redis (kept off the broker so a clear that drains the broker can't evict its own progress hash) and 302s the operator to a progress page that polls a status JSON view every ~1s. The page renders a `<progress>` bar, matched/total/drift/pass counters, a status pill (neutral / blue / green / amber / red per state), and a Cancel button that lands at the next chunk boundary (default 1000 messages per LRANGE chunk in the chunked variant -- the synchronous path keeps the existing 10k chunk size unchanged). Service layer ------------- - `start_celery_broker_clear_job(queue, *, user, task_name, repoid, commitid, keep_one, dry_run) -> str` snapshots `LLEN(queue)` as `total_estimated`, writes the initial job hash with `status=pending`, sets a 24h TTL so abandoned jobs auto-cleanup, then starts a daemon thread named `redis_admin.clear_job.<8hex>`. - `get_celery_broker_clear_job(job_id)` reads the hash; the status JSON view casts numerics to int before serialising for a stable wire shape pinned by tests. - `request_cancel_celery_broker_clear_job(job_id)` flips `cancel_requested=1` (idempotent; returns False on unknown id). `_streaming_celery_clear` gains two optional kwargs -- `chunk_size` and `progress_callback` -- so the chunked-job worker and the synchronous `celery_broker_clear` share the same chunk inner-loop. The callback receives a `_ChunkProgress` snapshot (per-chunk deltas + running totals) at every chunk boundary; returning True triggers a clean cancel: in-flight tombstones for the current pass are LREMmed (queue is left clean, not partial- graveyard) and `_StreamingClearStats.cancelled=True`. Cancel only ever lands at chunk boundaries, never mid-LSET, so a tombstone we already wrote is always paired with the matching LREM at pass-end or cancel-drain. Worker exceptions are swallowed: `log.exception(...)` so Sentry captures, plus `status=failed` + `error=<str(exc)>` on the hash so the operator's progress page shows the failure. The daemon thread never propagates to the gunicorn worker. Audit log: a `LogEntry` row is written at job-completion (not job-start) with `mode` ∈ `chunked-dry-run` / `chunked-all-from-bucket` / `chunked-all-but-first` (vs synchronous `dry-run` / `all-from-bucket` / `all-but-first`) plus `job_id`, `passes_run`, `total_drifted`, `cancelled` extras so queries can distinguish chunked from synchronous clears. Why threading and not Celery: the queue we're clearing is the Celery broker, so submitting a control task to it creates a circular dependency, and the operator pod isn't part of the worker fleet so the task wouldn't pick up anyway. Restart-safety trade-off documented in README -- a worker killed mid-job leaves in-flight tombstones in the queue; operator can re-run (the new pass walks the full queue and skips them) or `LREM` by hand. Admin views ----------- Three new superuser-only URLs under `clear-by-filter/job/<uuid>/`: - `clear_by_filter_progress_view` -- renders the progress page HTML with an initial server-side snapshot so the page is informative even before JS loads (matched / processed / drift / pass / status pill / timestamps). - `clear_by_filter_status_view` -- JSON snapshot of the job hash. 404s on unknown / expired ids. - `clear_by_filter_cancel_view` -- POST-only (405 on GET). Sets `cancel_requested=1` and returns 202 with the latest snapshot. `clear_by_filter_view` rewired so confirmed destructive submits now spawn a job and 302 to the progress page; dry-run keeps the synchronous shape (audit-log + re-rendered preview) since the streaming-count walk is fast. Frontend -------- CSP-compliant -- all CSS / JS loads via `{% static %}`, no inline `<script>` / `<style>`. Mirrors the same approach as PR #902's chart fragment loader (`celery_chart_fragment.{js,css}`). - `clear_by_filter_progress.html` -- card layout, native `<progress>` element so screen readers announce progress, real `<form>` for the cancel button so the page degrades gracefully without JS. - `celery_clear_progress.js` -- IIFE, no globals. Polls every 1s, backs off (1s -> 2s -> 5s, capped) on errors. Stops on terminal status, switches the layout (Cancel button reads "Done" / "Cancelled" / "Failed", back link enables). Cancel button hijacks form submit, fires `fetch()` with the CSRF token, prevents double-clicks. - `celery_clear_progress.css` -- Django-admin theme variable driven (light + dark), `prefers-reduced-motion` fallback. Tests ----- 12 new tests in `redis_admin/tests/test_celery_broker_clear_job.py`, 11 of which run under `-m 'not django_db'` (the audit-log test needs `LogEntry` and runs in CI under `@pytest.mark.django_db`): - service: hash shape pinned, deep-queue drain end-to-end (CELERY_BROKER_DISPLAY_LIMIT + 500 messages -> queue empty), cancel halts mid-pass with remaining matches intact, keep_one leaves exactly one survivor at index 0, monotonic progress callback, failure swallows exception + records `status=failed`. - views: status JSON shape pinned, cancel rejects GET (405), unknown id returns 404 JSON, confirmed submit on `clear_by_filter_view` redirects to the progress page, dry-run preview unchanged (no chunked job spawned). - audit (django_db): chunked-mode payload contains `mode`, `job_id`, `cancelled`, `passes_run`. Request timeouts ---------------- In-repo gunicorn `--timeout` is at `apps/codecov-api/api.sh:118` defaulting to 600s (env-overridable via `GUNICORN_TIMEOUT`). The user-observed 120s ceiling is NOT in this repo; most likely sources are external (Cloudflare's free-tier 100s ceiling, GCP HTTPS LB Backend Service timeout, or k8s ingress annotations) and need an infra-side investigation. Knob is documented in the README; this PR doesn't bump it (deploy-tuning decision). Long-term, the chunked-clear flow itself sidesteps the entire upstream-timeout class -- the POST that starts the clear spawns a thread + 302s in well under a second, and each subsequent poll is sub-millisecond, so no HTTP request ever sits on the actual clearing work. * docs(redis_admin): drop gunicorn timeout knob discussion The chunked clear job already decouples the clear's wall-clock from the HTTP request lifetime, so the gunicorn `--timeout` knob is no longer relevant to whether a clear succeeds. Drop the README + PR-body discussion of `apps/codecov-api/api.sh:118`, `GUNICORN_TIMEOUT`, Cloudflare's per-request ceiling, the GCP HTTPS LB Backend Service `timeoutSec`, and the ingress annotations — operators don't need to reason about any of those to use the new progress-bar flow. Trim the corresponding "Why this exists" paragraph in the README and the analogous header comment in services.py to just the design intent. * fix(redis_admin): preserve status pill base class + 0-base chunk_index Two Bugbot findings on PR #904: * MEDIUM (`celery_clear_progress.js`): `setStatusPill` filtered classes by the `celery-clear-status-` prefix, which dropped the base `celery-clear-status-pill` class along with the previous state class. After the first poll the pill lost all base styling (display, padding, border-radius, font-weight, etc.) and only the colour overrides remained. Replace the prefix-strip with an explicit `STATE_CLASSES` enumeration so only state classes are toggled and the base class survives. * LOW (`services.py`): `_ChunkProgress.chunk_index` was documented as 0-based but incremented BEFORE being passed to the callback, effectively making it 1-based (first chunk reported 1, last reported `chunks_total`). Move the increment to AFTER the callback so the snapshot describes the chunk that just finished and the documented contract holds. Adds a regression assertion that snapshots[*].chunk_index == [0, 1, ..., chunks_total - 1]. * test(redis_admin): make audit-log test cross-thread visible CI failure on PR #904 traced to a Django + threading transaction- visibility issue, not a logic regression: the chunked-mode audit- log test creates a `User` via `UserFactory()` from the test thread, then `start_celery_broker_clear_job` spawns the daemon worker which opens its own DB connection and tries to insert a `LogEntry` referring to that user. With plain `@pytest.mark. django_db`, the test thread's writes live inside an uncommitted transaction the worker thread can't see, so the FK to `users.id` fails: `ForeignKeyViolation: Key (user_id)=(N) is not present in table "users".` Fix: * Switch the test to `@pytest.mark.django_db(transaction=True)` so the user is committed before the worker thread reads it (Django's `TransactionTestCase` semantics). * Add `close_old_connections()` to the worker thread's `finally` block so the lazily-opened ORM connection drops at thread exit. Without that, `transaction=True`'s truncate-on-teardown would block on our idle connection, and a long-lived gunicorn worker would slowly leak one connection per chunked clear that ever ran on it. `close_old_connections` is the documented helper for this exact background-thread cleanup case. No production-code logic change. The non-DB tests continue to pass (11/11); the `transaction=True` test will be validated by CI's API CI / Test job since this sandbox can't reach Postgres. * ref(redis-admin): consolidate _FILTER_ANY substitution into shared helper Bugbot review on PR #904 (LOW): the operator-input filter-tuple construction `(task_name if task_name else _FILTER_ANY, repoid if repoid is not None else _FILTER_ANY, commitid if commitid else _FILTER_ANY)` was duplicated identically in three sites: `streaming_celery_count`, `_run_celery_broker_clear_job_body`, and `clear_by_filter_view`. If the rule diverged (new filter field, or a slot's truthiness check loosened) the dry-run count, the chunked clear, and the synchronous preview would silently disagree on which envelopes match — a data-loss vector. Extract a `_substitute_filter_any(task_name, repoid, commitid)` helper next to the `_FILTER_ANY` sentinel definition so all three call sites share one rule, and add a regression test that pins the contract (both directly and against the literal inline expression the three sites used to repeat).


Summary
The single
REDIS_ADMIN_MAX_ITEMS_PER_KEY = 20_000knob was bounding three call sites with very different cost profiles:_result_cache— RAM-heavy.(task_name, repoid, commitid)counts — CPU-light.Tuning one number forced a bad trade: lowering it for page-load made the chart shallow; raising it for the chart bloated the changelist; and either way the clear inherited the wrong behaviour. After PR #897 lowered the cap to 20_000 the clear silently capped at that window too, so on a saturated queue the per-pass plateau check fired before the clear had drained.
Splits + drops the cap on the right things, then polishes the chart loading UX:
REDIS_ADMIN_CELERY_BROKER_DISPLAY_LIMIT2_000REDIS_ADMIN_CELERY_BROKER_SCAN_LIMIT100_000LLEN(queue)so "Clear all" always fully drainscelery_broker_clear)result.countmatches what the actual clear will dostreaming_celery_counthelper somatch_countagrees withresult.countREDIS_ADMIN_MAX_ITEMS_PER_KEYkeeps its existing 20_000 default and stays wired to the unrelated SET/HASH browsing path inRedisItemQuerySet.Folds in #904 (clear-cap regression) and #905 (visible loader card UX). Both source PRs left OPEN as historical references.
Test plan
LLEN == 1in a single pass.result.count == 5_000,keep_one=True → 4_999,LLENintact).Matched: N message(s)for a queue with N >CELERY_BROKER_DISPLAY_LIMIT.role="status"are present.ruff check+ruff format --checkclean.LLENdelta AND clicking "Clear all" drains the queue AND the loader card is visible while the chart fragment fetches.Note
Medium Risk
Changes celery broker admin/clear behavior and limits, including unbounded queue-draining logic; mistakes could over/under-delete messages or slow deep-queue operations.
Overview
Splits the previous single
MAX_ITEMS_PER_KEYcap into two celery-specific settings:CELERY_BROKER_DISPLAY_LIMIT(bounds per-message drill-down materialisation/cache) andCELERY_BROKER_SCAN_LIMIT(bounds the frequency-chart sampler), and updates docs/UI text accordingly.Fixes celery clear/preview semantics on deep queues by making the streaming clear walk the full
LLEN(queue)(no artificial cap), adding wildcard-aware filter matching, and ensuring clear-by-filter preview + dry-run counts come from a new streaming counter (streaming_celery_count) rather than the display-limited queryset.Polishes the drill-down chart loading UX with a CSP-safe external loader stylesheet + richer loading card, tweaks admin filtering (hide
celery_brokerfrom the generic queue family filter), adjusts summary-mode default ordering, and adds focused regression tests for deep-queue draining and filter/preview correctness.Reviewed by Cursor Bugbot for commit 4f86eee. Bugbot is set up for automated code reviews on this repo. Configure here.