[Draft] Don't overwrite logs & status in concurrent background tasks #1026
Draft
Conversation
mihow added a commit that referenced this pull request on Apr 20, 2026:
feat(celery): split tasks across three queues to prevent cross-task starvation

Previously all Celery tasks shared a single 'antenna' queue, so a burst of high-volume tasks could block lower-volume ones on the same worker pool. Observed scenario: a ~740-image async_api job emitted ~180 process_nats_pipeline_result tasks per 5 min and starved run_job invocations behind it, leaving newly submitted jobs stuck in PENDING for many minutes. Long-running run_job tasks can similarly hold worker slots and delay beat / housekeeping tasks.

Split into three queues, each with its own worker service:

- antenna (default) — beat tasks, cache refresh, sync, housekeeping
- jobs — run_job (can hold a slot for hours)
- ml_results — process_nats_pipeline_result + save_results bursts

The worker start script now takes CELERY_QUEUES as an env var (default: antenna), so one image serves all three services. Worker-only hosts (ami-worker-2, ami-worker-3) consume all three queues as spillover capacity via docker-compose.worker.yml.

Relates to #1256 (job logging bottleneck) and #1026 (concurrent job log updates) — those two tackle the write-path; this change tackles the dispatch-path.
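The per-queue routing described in the commit message above might look roughly like this in the Celery/Django settings — a minimal sketch, not the actual code from the branch; only the task and queue names come from the commit text, and the `ami.*` module paths are illustrative assumptions:

```python
# Hypothetical sketch of the queue split described above.
# Queue and task names are from the commit message; the module
# paths ("ami.jobs.tasks", "ami.ml.tasks") are assumptions.
CELERY_TASK_DEFAULT_QUEUE = "antenna"  # beat, cache refresh, sync, housekeeping

CELERY_TASK_ROUTES = {
    # Long-running jobs get their own queue so they can't hold
    # default-queue worker slots for hours.
    "ami.jobs.tasks.run_job": {"queue": "jobs"},
    # Bursty ML result processing is isolated on ml_results.
    "ami.ml.tasks.process_nats_pipeline_result": {"queue": "ml_results"},
    "ami.ml.tasks.save_results": {"queue": "ml_results"},
    # Anything unrouted falls through to the default "antenna" queue.
}
```

Celery consults `task_routes` at dispatch time, so unrouted tasks (beat, housekeeping) land on the default queue without being listed explicitly.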
mihow added a commit that referenced this pull request on Apr 20, 2026:
feat(celery): split tasks across three queues to prevent cross-task starvation (#1257)

* feat(celery): split tasks across three queues to prevent cross-task starvation

  Previously all Celery tasks shared a single 'antenna' queue, so a burst of high-volume tasks could block lower-volume ones on the same worker pool. Observed scenario: a ~740-image async_api job emitted ~180 process_nats_pipeline_result tasks per 5 min and starved run_job invocations behind it, leaving newly submitted jobs stuck in PENDING for many minutes. Long-running run_job tasks can similarly hold worker slots and delay beat / housekeeping tasks.

  Split into three queues, each with its own worker service:

  - antenna (default) — beat tasks, cache refresh, sync, housekeeping
  - jobs — run_job (can hold a slot for hours)
  - ml_results — process_nats_pipeline_result + save_results bursts

  The worker start script now takes CELERY_QUEUES as an env var (default: antenna), so one image serves all three services. Worker-only hosts (ami-worker-2, ami-worker-3) consume all three queues as spillover capacity via docker-compose.worker.yml.

  Relates to #1256 (job logging bottleneck) and #1026 (concurrent job log updates) — those two tackle the write-path; this change tackles the dispatch-path.

* docs(celery): add rollout plan for queue split branch

* chore(celery): parameterize local dev worker queues via CELERY_QUEUES

  Match the production start script: read the queue list from $CELERY_QUEUES, defaulting to all three queues (antenna, jobs, ml_results) so the single local worker keeps consuming everything by default. Lets devs override for isolation testing if they want.

* feat(celery): route create_detection_images to ml_results queue

  This task is emitted from save_results (pipeline.py:990, one delay per batch of source images) and does heavy image cropping + S3 writes. Left unrouted, it defaults to the antenna queue — the opposite of what the queue split is trying to achieve, since a single large job's cropping fan-out can then starve beat/housekeeping.

* refactor(celery): move jobs + ml_results workers off the app host

  Production topology previously put all three worker services (antenna, jobs, ml_results) on ami-live, which meant the bursty ML pool was competing with Django/beat/flower for CPU and RAM. With CELERY_WORKER_CONCURRENCY=16 inherited per service, that's 48 prefork processes before any dedicated worker VM spins up. Now:

  - docker-compose.production.yml runs only the antenna worker (alongside Django + beat + flower on the app host).
  - docker-compose.worker.yml runs three dedicated services (antenna / jobs / ml_results) per worker VM, so isolation holds there too — a burst on one class can't saturate a shared pool and starve another.

  Rollout doc updated to reflect the new topology.

* fix(celery): address review comments on queue split PR

  - Standardize rollout doc script name to reset_demo_to_branch.sh
  - Clarify settings comment: only staging/production/worker composes run per-queue dedicated workers; local/CI use a single all-queues worker

* docs(celery): use reset_to_branch.sh (the generic script name)

  The script is used on staging, demo, and single-box deploys — not demo-specific. Standardize both mentions to reset_to_branch.sh, which matches the actual filename on the hosts.

* docs(celery): scrub internal hostnames and update rollout doc accuracy

  - Generalize ami-live / ami-worker-2 / ami-worker-3 hostnames — this doc lives in a public repo and shouldn't reference deployment-specific names
  - Drop stale commit SHA; branch name is sufficient after further commits
  - Clarify that the "scp three files" list is the demo-path subset, not the full changeset on the branch

Co-authored-by: Claude <noreply@anthropic.com>
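The CELERY_QUEUES behavior described in the chore commit above — default to all three queues locally, let an env var narrow a worker to one queue — can be sketched in Python. This is a hypothetical reading of the env var, not the branch's actual start script; the variable names are illustrative:

```python
import os

# Hypothetical sketch of the queue-list default described above.
# Local/CI: one worker consumes everything unless CELERY_QUEUES overrides it.
# Production/worker composes: each service sets CELERY_QUEUES to a single queue.
DEFAULT_QUEUES = "antenna,jobs,ml_results"

queues = [q.strip() for q in os.environ.get("CELERY_QUEUES", DEFAULT_QUEUES).split(",")]

# The start script would then hand this list to the worker, roughly:
#   celery -A config.celery_app worker -Q antenna,jobs,ml_results
```

Keeping the default as "all queues" means local dev and CI behave exactly as before the split, while each dedicated production service only has to set one env var to get isolation.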

Pulled from #981
More coming soon