Skip to content

Smartstack Scaffolding#444

Open
timbeccue wants to merge 45 commits into
mainfrom
feature/separate-site-deps
Open

Smartstack Scaffolding#444
timbeccue wants to merge 45 commits into
mainfrom
feature/separate-site-deps

Conversation

@timbeccue
Copy link
Copy Markdown
Contributor

@timbeccue timbeccue commented Mar 18, 2026

Originally this PR was just splitting Redis/RabbitMQ out of the site/local
docker-compose files, but it ended up bundling several dependent changes
together rather than stacking PRs that kept overriding each other.

What's in here now:

Smart stacking scaffolding

  • New StackFrame model + SiteBase declarative base for site-only tables
    (dbs.py)
  • SubframeListener + process_subframe Celery task: validates incoming
    RabbitMQ messages, inserts a tracking row, runs the reduction pipeline,
    updates the row with the reduced filepath, and pushes a Redis notification
    per camera (main.py, scheduling.py)
  • stacking.py: per-camera worker loop drains Redis notifications, checks
    group completion (all reduced AND all expected frames present, or
    is_last), and finalizes complete or timed-out stacks; supervisor spawns
    one worker per camera
  • New banzai-subframe-listener, banzai-subframe-worker,
    banzai-stacking-supervisor services + matching env vars
  • E2E coverage in test_smart_stacking.py and a new site-E2E phase

Redis / RabbitMQ as external dependencies

  • Pulled out of docker-compose-{site,local}.yml into
    docker-compose-dependencies.yml so they can be left running across
    pipeline restarts (and pointed at existing site infrastructure)
  • All workers connect via REDIS_URL / RABBITMQ_URL; no more hardcoded
    redis://redis:6379/0

Site deployment fixes & polish (from running this at a site)

  • Download worker: clearer status logging w/ heartbeat + state-change
    triggers, NULL-frameid handling, per-frameid failure backoff, filepath
    reconciliation when replication overwrites with NULL, "waiting for
    replication" log while calimages is empty during initialization
  • Replication: fixed subscription name, optional SLOT_NAME override,
    reuse existing slot when recreating subscription
  • Listener improvements: log and remove malformed JSON payloads from queue, top-level try/except around the stacking worker loop
  • created_at (not dateobs) for stack-timeout — handles NULL dateobs
    and reprocessing (note: stack timeout is being reworked a bit in the branch I'm currently working on)
  • Site E2E: Linux compatibility (named volume for postgres,
    host.docker.internal resolution), container-prefix support so the
    fixture stack and just site-up can coexist, absolute-HOST_*_DIR
    requirement so file paths resolve identically inside and outside
    containers
  • argparse entry points for site workers; removed unused
    OPENTSDB_PYTHON_METRICS_TEST_MODE plumbing

@timbeccue timbeccue linked an issue Mar 19, 2026 that may be closed by this pull request
@timbeccue timbeccue requested a review from cmccully March 19, 2026 12:19
@timbeccue timbeccue force-pushed the feature/smartstack-integration branch from 0f7e6d3 to 3512bdb Compare March 20, 2026 05:49
@timbeccue timbeccue force-pushed the feature/separate-site-deps branch 2 times, most recently from 3502eec to 8b6fb9a Compare March 23, 2026 18:38
@timbeccue timbeccue force-pushed the feature/smartstack-scaffold branch from 3805921 to ba59ec7 Compare March 27, 2026 01:32
@timbeccue timbeccue force-pushed the feature/separate-site-deps branch 2 times, most recently from 38dc3d6 to b0189da Compare March 27, 2026 05:19
@timbeccue timbeccue force-pushed the feature/smartstack-scaffold branch from ba59ec7 to e5350bd Compare March 27, 2026 05:19
@timbeccue timbeccue force-pushed the feature/smartstack-scaffold branch from e5350bd to 3bc4bc5 Compare April 6, 2026 21:01
@timbeccue timbeccue force-pushed the feature/separate-site-deps branch 2 times, most recently from 29b6816 to b720aa8 Compare April 9, 2026 22:02
@timbeccue timbeccue force-pushed the feature/separate-site-deps branch from 30a1d0b to ea53072 Compare April 23, 2026 06:55
@timbeccue timbeccue changed the base branch from feature/smartstack-scaffold to main April 23, 2026 07:09
@timbeccue timbeccue changed the title Separate Redis and RabbitMQ into standalone dependencies Smartstack Scaffolding May 6, 2026
@cmccully cmccully requested a review from Copilot May 8, 2026 13:41
Comment thread banzai/cache/download_worker.py Outdated
update_filepath(db_address, cal.id, dest_dir)
logger.info(f"Downloaded {cal.filename}")
logger.info(f"Cached {cal.filename}")
return True
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't generally like returning True unless it is a boolean check, e.g., is this object x type? I tend to raise an exception if something didn't work. This is kind of a stylistic thing, but it's a place to make the code more readable for future us.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function takes a file that needs to be cached and results in three possible outcomes: downloaded, skipped, or failed. The return value is used by the caller to keep track of how many files were downloaded vs how many were already cached. So I still think this is the cleanest solution, but I'm open to alternatives if you have one in mind.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the caller care if the file is downloaded or already cached? Doesn't it just care that the file is on disk now?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces “smart stacking” scaffolding for site deployments (DB tracking + subframe reduction + Redis-based notifications + per-camera stacking workers), while also refactoring local/site Docker Compose to treat Redis/RabbitMQ as external dependencies and polishing site deployment/E2E workflows.

Changes:

  • Add site-only DB tables (StackFrame via SiteBase) and smart-stacking worker/supervisor plumbing.
  • Externalize Redis/RabbitMQ into docker-compose-dependencies.yml and switch workers to REDIS_URL / RABBITMQ_URL.
  • Expand/adjust unit + site E2E coverage and update site/local env templates and documentation accordingly.

Reviewed changes

Copilot reviewed 30 out of 32 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
uv.lock Dependency lock revision bump.
site-banzai-env.default Site env defaults updated (absolute paths, broker URLs, smart stacking knobs).
local-banzai-env.default Local env defaults updated to use broker URLs and queue names.
scripts/queue_images.py Queueing script changed to publish absolute host paths.
README.rst Docs updated for new dependencies compose flow and env vars.
pytest.ini Adds smart stacking markers and updates recursion excludes.
pyproject.toml Adds new console entry points for stacking/subframe components.
docker-compose-site.yml Site compose updated for external brokers, absolute-path mounts, new stacking services, and e2e isolation knobs.
docker-compose-local.yml Local compose updated for external brokers and configurable queue/worker naming.
docker-compose-dependencies.yml New compose file for Redis + RabbitMQ dependencies.
banzai/utils/stage_utils.py run_pipeline_stages now returns the processed images list.
banzai/utils/fits_utils.py Adds optional attempt logging to download_from_s3.
banzai/stacking.py New stacking worker loop, timeout handling, and supervisor process manager.
banzai/settings.py Adds env-derived settings for subframe/stacking queues and broker URLs.
banzai/scheduling.py Adds process_subframe Celery task and wiring for DB + Redis notifications.
banzai/main.py Adds SubframeListener + CLI entrypoint to consume stack-queue messages and dispatch Celery tasks.
banzai/dbs.py Introduces SiteBase, StackFrame, site-aware sessions, and stack frame CRUD/upsert helpers.
banzai/celeryconfig.py Removes hardcoded broker_url from Celery config module.
banzai/cache/replication.py Subscription naming/slot reuse improvements for replication setup.
banzai/cache/init.py Ensures site DB schema creation and updated replication setup call signature.
banzai/cache/download_worker.py Adds argparse CLI, improved logging/heartbeat, NULL-frameid handling, reconciliation and backoff logic.
banzai/tests/test_smart_stacking.py New unit tests covering stacking DB ops, notifications, listener, task, supervisor, and resilience.
banzai/tests/test_replication.py Updates replication tests for new subscription/slot behavior.
banzai/tests/test_download_worker.py Updates tests for new download worker behavior and CLI parsing.
banzai/tests/test_dbs.py Adds tests for site-table creation and site-aware session enforcement.
banzai/tests/test_cache_init.py Updates cache-init expectations for site schema and replication call changes.
banzai/tests/site_e2e/test_site_e2e.py Expands site E2E coverage (cached cal usage, drift reconciliation, stacking path, slot reuse).
banzai/tests/site_e2e/conftest.py Improves e2e isolation/scoping, absolute path enforcement, compose helpers, and cleanup.
banzai/tests/site_e2e/site_e2e.env.template Adds isolation knobs + absolute host path requirements + broker URLs.
banzai/tests/site_e2e/README.md Updates e2e docs for isolation, absolute paths, and log watching.
.gitignore Ignores local-banzai-env.
.dockerignore Ignores data/ directory in Docker build context.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/queue_images.py
Comment thread banzai/main.py
Comment thread banzai/dbs.py
Comment thread banzai/settings.py Outdated
Comment thread banzai/cache/download_worker.py Outdated
Comment thread banzai/cache/init.py Outdated
Comment thread banzai/cache/replication.py
Comment thread banzai/tests/site_e2e/conftest.py Outdated
Comment thread banzai/tests/site_e2e/conftest.py
Comment thread banzai/tests/site_e2e/conftest.py Outdated
shutil.rmtree(path, ignore_errors=True)


def run_docker_compose(compose_file, *args, cwd=None, env=None, project=None):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may just be me, but I'm really nervous about python calling docker when this python code will likely be running in a docker container.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent is that site_e2e is a host-run deployment harness, not code that runs inside the banzai containers. Pytest is currently the outer orchestrator for Docker Compose. If we want the site_e2e tests to run in CI then I agree it makes sense to move the compose lifecycle management out to CI and let pytest talk only to the already running services. Are you ok if we defer that for now?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really think pytest should be in charge of container management. That feels like an anti-pattern to me. In principle we could use tox, but I'm not really sure what you are trying to accomplish, so I'm concerned.

Comment thread banzai/tests/site_e2e/README.md
Comment thread banzai/tests/site_e2e/README.md
@timbeccue timbeccue force-pushed the feature/separate-site-deps branch from adf8300 to 5f760f6 Compare May 16, 2026 23:47
@timbeccue
Copy link
Copy Markdown
Contributor Author

Sacking architecture and process flow diagrams have been added here: https://github.com/LCOGT/banzai/pull/444/changes#diff-1df90f6630375c6d0b8ea4221090de028e494a40178e28d9a35f0458f6d82451

@timbeccue timbeccue requested a review from cmccully May 18, 2026 15:25
@timbeccue timbeccue requested a review from jchate6 May 29, 2026 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Separate redis and rabbitmq from site docker-compose

3 participants