[Draft] Don't overwrite logs & status in concurrent background tasks #1026
Draft
Conversation
mihow added a commit that referenced this pull request on Apr 20, 2026:
feat(celery): split tasks across three queues to prevent cross-task starvation

Previously all Celery tasks shared a single 'antenna' queue, so a burst of high-volume tasks could block lower-volume ones on the same worker pool. Observed scenario: a ~740-image async_api job emitted ~180 process_nats_pipeline_result tasks per 5 min and starved run_job invocations behind it, leaving newly submitted jobs stuck in PENDING for many minutes. Long-running run_job tasks can similarly hold worker slots and delay beat / housekeeping tasks.

Split into three queues, each with its own worker service:

- antenna (default) — beat tasks, cache refresh, sync, housekeeping
- jobs — run_job (can hold a slot for hours)
- ml_results — process_nats_pipeline_result + save_results bursts

The worker start script now takes CELERY_QUEUES as an env var (default: antenna), so one image serves all three services. Worker-only hosts (ami-worker-2, ami-worker-3) consume all three queues as spillover capacity via docker-compose.worker.yml.

Relates to #1256 (job logging bottleneck) and #1026 (concurrent job log updates) — those two tackle the write-path; this change tackles the dispatch-path.
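The per-queue routing described in the commit message above might look roughly like this in the Celery/Django settings — a minimal sketch, not the actual code from the branch; only the task and queue names come from the commit text, and the `ami.*` module paths are illustrative assumptions:

```python
# Hypothetical sketch of the queue split described above.
# Queue and task names are from the commit message; the module
# paths ("ami.jobs.tasks", "ami.ml.tasks") are assumptions.
CELERY_TASK_DEFAULT_QUEUE = "antenna"  # beat, cache refresh, sync, housekeeping

CELERY_TASK_ROUTES = {
    # Long-running jobs get their own queue so they can't hold
    # default-queue worker slots for hours.
    "ami.jobs.tasks.run_job": {"queue": "jobs"},
    # Bursty ML result processing is isolated on ml_results.
    "ami.ml.tasks.process_nats_pipeline_result": {"queue": "ml_results"},
    "ami.ml.tasks.save_results": {"queue": "ml_results"},
    # Anything unrouted falls through to the default "antenna" queue.
}
```

Celery consults `task_routes` at dispatch time, so unrouted tasks (beat, housekeeping) land on the default queue without being listed explicitly.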
mihow added a commit that referenced this pull request on Apr 20, 2026:
feat(celery): split tasks across three queues to prevent cross-task starvation (#1257)

* feat(celery): split tasks across three queues to prevent cross-task starvation

  Previously all Celery tasks shared a single 'antenna' queue, so a burst of high-volume tasks could block lower-volume ones on the same worker pool. Observed scenario: a ~740-image async_api job emitted ~180 process_nats_pipeline_result tasks per 5 min and starved run_job invocations behind it, leaving newly submitted jobs stuck in PENDING for many minutes. Long-running run_job tasks can similarly hold worker slots and delay beat / housekeeping tasks.

  Split into three queues, each with its own worker service:

  - antenna (default) — beat tasks, cache refresh, sync, housekeeping
  - jobs — run_job (can hold a slot for hours)
  - ml_results — process_nats_pipeline_result + save_results bursts

  The worker start script now takes CELERY_QUEUES as an env var (default: antenna), so one image serves all three services. Worker-only hosts (ami-worker-2, ami-worker-3) consume all three queues as spillover capacity via docker-compose.worker.yml.

  Relates to #1256 (job logging bottleneck) and #1026 (concurrent job log updates) — those two tackle the write-path; this change tackles the dispatch-path.

* docs(celery): add rollout plan for queue split branch

* chore(celery): parameterize local dev worker queues via CELERY_QUEUES

  Match the production start script: read the queue list from $CELERY_QUEUES, defaulting to all three queues (antenna, jobs, ml_results) so the single local worker keeps consuming everything by default. Lets devs override for isolation testing if they want.

* feat(celery): route create_detection_images to ml_results queue

  This task is emitted from save_results (pipeline.py:990, one delay per batch of source images) and does heavy image cropping + S3 writes. Left unrouted, it defaults to the antenna queue — the opposite of what the queue split is trying to achieve, since a single large job's cropping fan-out can then starve beat/housekeeping.

* refactor(celery): move jobs + ml_results workers off the app host

  Production topology previously put all three worker services (antenna, jobs, ml_results) on ami-live, which meant the bursty ML pool was competing with Django/beat/flower for CPU and RAM. With CELERY_WORKER_CONCURRENCY=16 inherited per service, that's 48 prefork processes before any dedicated worker VM spins up. Now:

  - docker-compose.production.yml runs only the antenna worker (alongside Django + beat + flower on the app host).
  - docker-compose.worker.yml runs three dedicated services (antenna / jobs / ml_results) per worker VM, so isolation holds there too — a burst on one class can't saturate a shared pool and starve another.

  Rollout doc updated to reflect the new topology.

* fix(celery): address review comments on queue split PR

  - Standardize rollout doc script name to reset_demo_to_branch.sh
  - Clarify settings comment: only staging/production/worker composes run per-queue dedicated workers; local/CI use a single all-queues worker

* docs(celery): use reset_to_branch.sh (the generic script name)

  The script is used on staging, demo, and single-box deploys — not demo-specific. Standardize both mentions to reset_to_branch.sh, which matches the actual filename on the hosts.

* docs(celery): scrub internal hostnames and update rollout doc accuracy

  - Generalize ami-live / ami-worker-2 / ami-worker-3 hostnames — this doc lives in a public repo and shouldn't reference deployment-specific names
  - Drop stale commit SHA; branch name is sufficient after further commits
  - Clarify that the "scp three files" list is the demo-path subset, not the full changeset on the branch

Co-authored-by: Claude <noreply@anthropic.com>
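The CELERY_QUEUES behavior described in the chore commit above — default to all three queues locally, let an env var narrow a worker to one queue — can be sketched in Python. This is a hypothetical reading of the env var, not the branch's actual start script; the variable names are illustrative:

```python
import os

# Hypothetical sketch of the queue-list default described above.
# Local/CI: one worker consumes everything unless CELERY_QUEUES overrides it.
# Production/worker composes: each service sets CELERY_QUEUES to a single queue.
DEFAULT_QUEUES = "antenna,jobs,ml_results"

queues = [q.strip() for q in os.environ.get("CELERY_QUEUES", DEFAULT_QUEUES).split(",")]

# The start script would then hand this list to the worker, roughly:
#   celery -A config.celery_app worker -Q antenna,jobs,ml_results
```

Keeping the default as "all queues" means local dev and CI behave exactly as before the split, while each dedicated production service only has to set one env var to get isolation.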

Pulled from #981
More coming soon