These files are generated artifacts. Do not edit them directly. The source of truth is each service's `docker-compose.yml` (plus its `swarm.fragment.yml` for Swarm-specific config).

To regenerate: `./stackctl.sh generate`

To check for drift: `./stackctl.sh sync`
- `infrastructure.yml`: Traefik, Portainer, APISIX (gateway + etcd + dashboard), Postgres, Mongo, Redis
- `observability.yml`: Prometheus, Grafana, Loki, Tempo, OTel Collector
- `platform.yml`: GrowthBook (dashboard + proxy), AniTrend apps/services (`anitrend`, `on-the-edge`, `edge-graphql`)
- Shared overlay network: `traefik-public` (external, attachable). Create once per swarm host.
- No Compose-only keys: do not use `container_name`, `restart`, or `build` in stacks.
- Use `deploy` for scheduling (mode, placement, resources) and `env_file` for configuration.
- All exposed services must attach to `traefik-public` and define Traefik labels for routing.
- Persist critical data via named volumes. Mark volumes as `external: true` to reuse existing data.
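The "no Compose-only keys" rule above can be checked mechanically. The following is a minimal sketch, assuming a line-based `grep` is good enough for these stack files; `find_compose_only_keys` is a hypothetical helper and is not part of `stackctl.sh`.

```shell
# Sketch only: flag the Compose-only keys named in the conventions above.
# This is a line-based heuristic, not a YAML parser.
find_compose_only_keys() {
  # Prints "line:text" for each offending key; exit status 0 means at least one was found.
  # Matching "key:" exactly keeps deploy.restart_policy from being mistaken for "restart".
  grep -nE '^[[:space:]]*(container_name|restart|build):' "$1"
}
```

For example, `find_compose_only_keys stacks/platform.yml && echo "remove the keys listed above"` prints a reminder only when a match exists.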
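# 1) Initialize Swarm (errors if this node is already in a swarm; that error is safe to ignore)
docker swarm init
# 2) Create shared overlay network (errors if the network already exists; that error is safe to ignore)
docker network create --driver=overlay --attachable traefik-public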
# 3) Deploy stacks (names are identifiers)
docker stack deploy -c stacks/infrastructure.yml infrastructure
docker stack deploy -c stacks/observability.yml observability
docker stack deploy -c stacks/platform.yml platform
# 4) Verify
docker stack services infrastructure
docker stack services observability
docker stack services platform
# 5) Teardown (keeps volumes)
docker stack rm platform
docker stack rm observability
docker stack rm infrastructure

The repo includes a helper script at the root, `./stackctl.sh`, which wraps the common lifecycle with preflight checks and nicer ergonomics.
Prerequisites:
- Docker Engine with Swarm enabled (single-node is fine)
- The external overlay network `traefik-public`
- Optional: local TLS certs in `traefik/certs/` for `*.docker.localhost`
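The first two prerequisites can be satisfied in a way that stays quiet on repeat runs. This is a rough sketch, not how `stackctl.sh doctor` is implemented; the helper names are hypothetical, while the `docker` subcommands and the `traefik-public` network name come from this README and the Docker CLI.

```shell
# Sketch only: re-runnable versions of the Swarm and network prerequisites.
# These helper names are hypothetical and are not taken from stackctl.sh.

swarm_active() {
  # LocalNodeState reads "active" once the node has initialized or joined a swarm.
  [ "$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)" = "active" ]
}

network_exists() {
  # `docker network inspect` exits non-zero when the network is missing.
  docker network inspect "$1" >/dev/null 2>&1
}

ensure_prereqs() {
  # Only initialize Swarm when the node is not already in one.
  swarm_active || docker swarm init || return 1
  # Only create the overlay network when it does not exist yet.
  network_exists traefik-public ||
    docker network create --driver=overlay --attachable traefik-public
}
```

Calling `ensure_prereqs` twice in a row should run `docker swarm init` and `docker network create` at most once each.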
Quick start:
# Validate your environment (safe to run repeatedly). Add --fix-network to auto-create the overlay network.
./stackctl.sh doctor --fix-network
# Optionally ensure external named volumes exist before deploying
./stackctl.sh doctor --fix-volumes
# Deploy all stacks and follow key logs (Traefik, Prometheus, Loki)
./stackctl.sh up
# Or deploy a subset
./stackctl.sh up -s infrastructure,observability
# Check status
./stackctl.sh status
# Tail logs for specific services
./stackctl.sh logs infrastructure_traefik observability_prometheus
# Remove stacks (keeps volumes); add --remove-network to also remove traefik-public
./stackctl.sh down -y

Notes:
- `stackctl.sh` finds stack files from either `stacks/*.yml` or the repo root (`infrastructure.yml`, etc.).
- The `doctor` command validates Compose syntax for each stack and reminds you to create `.env` files where a `.env.example` exists.
- If you use local HTTPS, make sure `traefik/certs/local-cert.pem` and `traefik/certs/local-key.pem` exist; see below for generation.
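To see which folders still need a `.env`, something like the sketch below may help. It assumes the convention described above (a `.env` sits next to its `.env.example`); `missing_env_files` is a hypothetical helper, and the real `doctor` command may detect this differently.

```shell
# Sketch only: list folders that ship a .env.example but have no .env yet.
# missing_env_files is a hypothetical helper, not part of stackctl.sh.
missing_env_files() {
  root=${1:-.}
  find "$root" -name .env.example -type f | sort | while IFS= read -r example; do
    dir=$(dirname "$example")
    # Print the folder only when the sibling .env is absent.
    [ -f "$dir/.env" ] || printf '%s\n' "$dir"
  done
}
```

Running `missing_env_files .` from the repo root prints one folder per line; copy each `.env.example` to `.env` and fill in real values before deploying.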
When deploying, stackctl.sh pre-renders variables into a copy of the stack file and writes it to .rendered/ with a docker-compose.* prefix:
- `stacks/infrastructure.yml` -> `.rendered/docker-compose.infrastructure.rendered.yml`
- `stacks/observability.yml` -> `.rendered/docker-compose.observability.rendered.yml`
- `stacks/platform.yml` -> `.rendered/docker-compose.platform.rendered.yml`
These files are ignored by Git and safe to regenerate at any time.
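The naming rule for rendered files can be expressed as a small function. This sketch only reproduces the path mapping described above; it does not show how `stackctl.sh` substitutes variables, and `rendered_path` is a hypothetical name.

```shell
# Sketch only: derive the rendered file name from a stack file path.
# rendered_path is a hypothetical helper that mirrors the documented mapping.
rendered_path() {
  # Strip the directory and the .yml suffix, then add the documented prefix and suffix.
  name=$(basename "$1" .yml)
  printf '.rendered/docker-compose.%s.rendered.yml\n' "$name"
}
```

Because only the base name is used, `rendered_path stacks/platform.yml` and `rendered_path platform.yml` resolve to the same rendered file.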
- Ensure each service folder has a `.env` copied from its `.env.example` where applicable.
- APISIX dashboard uses `apisix/api-dashboard/config/conf.yaml` (generated from `conf.example.yml`).
- Consider adding healthchecks for critical dependencies to improve startup reliability.
- Stacks set conservative `deploy.resources` reservations/limits to avoid runaway memory/CPU. Adjust in ±128–256 MiB steps based on telemetry.
- Services use the `local` logging driver with rotation (`max-size=10m`, `max-file=3`) to reduce JSON log churn. If you prefer a global default, set it in `/etc/docker/daemon.json` and restart Docker.
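For the global default mentioned above, the daemon configuration might look like the output of this sketch. The `log-driver` and `log-opts` keys are standard `daemon.json` settings; the function name is hypothetical, and the values simply mirror the per-service rotation settings.

```shell
# Sketch only: print a daemon.json fragment that makes `local` the default log driver
# with the same rotation settings the stacks apply per service.
print_daemon_logging_config() {
  cat <<'EOF'
{
  "log-driver": "local",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
}
```

If `/etc/docker/daemon.json` already exists, merge these keys into it rather than overwriting the file. The new default applies only to containers created after Docker restarts.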
- Prometheus: 3d retention (`--storage.tsdb.retention.time=3d`), `--query.max-concurrency=10`; scrape intervals relaxed to 30s for most jobs.
- Loki: retention 72h, chunk target ~1.5 MiB, moderate ingestion rate, compactor retention enabled.
- Tempo: local backend with 48h retention from config; single-replica by default.
- GrowthBook: Node heap capped via `NODE_OPTIONS=--max-old-space-size=512`.
- Traefik: access logs disabled by default; enable temporarily if debugging.
- Verify per-stack services: `docker stack services <stack>` and `docker service logs <stack>_<service>`.
- If Traefik can't reach a service, confirm it's attached to `traefik-public` and labels point to the correct `server.port` and host.
- For noisy logs or high disk writes, ensure the `local` driver is in effect and service-level logging options are applied.
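The checks above can be bundled into one first-look command per service. This is a sketch under the assumption that routing labels are set at the service level (under `deploy.labels`); `diagnose_service` is a hypothetical helper, and only standard `docker service` subcommands are used.

```shell
# Sketch only: gather first-look diagnostics for one Swarm service.
# diagnose_service is a hypothetical helper; service names follow <stack>_<service>.
diagnose_service() {
  svc="$1_$2"
  echo "== tasks for $svc =="
  # Task state plus untruncated error text (for example, failed placements).
  docker service ps --no-trunc "$svc"
  echo "== labels on $svc =="
  # Service-level labels, assumed here to carry the Traefik routing rules.
  docker service inspect --format '{{json .Spec.Labels}}' "$svc"
  echo "== recent logs for $svc =="
  docker service logs --tail 50 "$svc"
}
```

For example, `diagnose_service infrastructure traefik` inspects the `infrastructure_traefik` service.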
For local development with HTTPS on domains like grafana.docker.localhost, Traefik is configured with a local certificatesResolver and a file provider for TLS certificates.
What this means:
- ACME/Let’s Encrypt will not issue for `.localhost` domains. Instead, generate a local development certificate and key, and place them in `traefik/certs/` as `local-cert.pem` and `local-key.pem`.
- The dynamic config (`traefik/config/dynamic.yml`) already references these files and declares the `docker.localhost` SANs, including `*.docker.localhost`.
- Set `CERT_RESOLVER=local` in `traefik/.env` (and any service labels that reference it) to use the local resolver while Traefik serves the file-based certs.
Generate a dev cert (example using mkcert):
mkcert -install
mkcert -cert-file traefik/certs/local-cert.pem -key-file traefik/certs/local-key.pem "docker.localhost" "*.docker.localhost"

Notes:
- `traefik/certs/.gitignore` prevents committing private keys or ACME storage files.
- Browsers trust mkcert’s local CA after `mkcert -install`. If not using mkcert, you may need to trust your self-signed CA manually.
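Before deploying with local HTTPS, it can be worth confirming that both files are where the file provider expects them. A minimal sketch, assuming the default `traefik/certs` location described above; `check_local_certs` is a hypothetical helper.

```shell
# Sketch only: confirm the certificate and key files exist and are non-empty.
# check_local_certs is a hypothetical helper; the file names come from this README.
check_local_certs() {
  dir=${1:-traefik/certs}
  status=0
  for f in local-cert.pem local-key.pem; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}
```

To also confirm the SANs, `openssl x509 -in traefik/certs/local-cert.pem -noout -text` prints the certificate details, including the Subject Alternative Name entries.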