Full-stack data platform for federating multiple external data sources with ClickHouse, dbt, monitoring, and API capabilities.
Beefy Databarn is a modern data platform designed to aggregate, transform, and analyze data from multiple external sources (primarily PostgreSQL databases) without requiring data replication. The platform leverages ClickHouse's federation capabilities to query external databases directly, while using dbt for data modeling and transformation. This architecture enables real-time analytics, business intelligence, and monitoring without the overhead of maintaining duplicate data copies.
Data Federation: Instead of copying data from source databases, ClickHouse uses external table engines (PostgreSQL engine) to query remote databases on-demand. This means:
- No data duplication or storage overhead
- Always up-to-date data (queries hit the source directly)
- Reduced ETL complexity and maintenance
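For example, a federated table can be declared in ClickHouse with the PostgreSQL table engine. A minimal sketch, where the host, credentials, and columns are illustrative assumptions rather than the project's actual schema:

```sql
-- Illustrative sketch: host, credentials, and columns are assumptions.
CREATE TABLE staging.harvests_external
(
    id           UInt64,
    chain_id     String,
    amount       Float64,
    harvested_at DateTime
)
ENGINE = PostgreSQL('beefy-db:5432', 'beefy', 'harvests', 'readonly_user', 'secret');
```

Queries against this table are pushed down to PostgreSQL at read time, so results always reflect the current source data.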
dbt Data Modeling: The platform follows a three-layer data modeling approach:
- Staging: External tables that point to source databases (read-only views)
- Intermediate: Data cleaning, standardization, and business logic transformations
- Marts: Business-ready analytical models for consumption (revenue, yield, dimensions)
Production-Ready Infrastructure: The platform includes monitoring, security, and orchestration tools for reliable production deployment.
```mermaid
flowchart TB
%% Gateway
TR[Traefik<br/>Reverse Proxy & HTTPS]
%% External Data Sources
subgraph EXT["External Data Sources"]
BEEFY_DB[(PostgreSQL<br/>beefy-db)]
BEEFY_API[Beefy API]
end
%% ELT
subgraph ELT["ELT"]
DBT[dbt]
PYTHON[Python Script]
end
%% Data Warehouse
subgraph DW["Data Warehouse - ClickHouse"]
CH[(ClickHouse)]
MINIO[MinIO]
end
%% Analytics
subgraph ANALYTICS["Analytics"]
SUP[Superset]
CELERY[Celery]
SUP_W[Superset Worker]
PG[(PostgreSQL)]
REDIS_A[Redis]
end
%% Observability
subgraph OBS["Observability"]
GRAF[Grafana]
PG_OBS[(PostgreSQL)]
REDIS_OBS[Redis]
REDIS_C[Redis Exporter]
PG_C[Postgres Exporter]
PROM[Prometheus]
end
%% Users
USR[Users & Analysts]
%% External Data Flow
CH -.->|External Tables<br/>Federation Query| BEEFY_DB
CH -.->|External Tables<br/>S3 Engine| MINIO
PYTHON -->|Query API| BEEFY_API
PYTHON -->|Ingest Data| CH
%% dbt Flow
DBT -->|Creates Views & Tables| CH
%% Analytics Flow
SUP -->|Query| CH
SUP -->|Metadata| PG
SUP -->|Cache| REDIS_A
SUP -->|Async Queries| CELERY
CELERY -->|Work Queue| REDIS_A
REDIS_A -->|Work Requests| SUP_W
SUP_W -->|Query| CH
%% Observability Flow
GRAF -->|Query| PROM
PROM -->|Scrape Metrics| PG_C
PROM -->|Scrape Metrics| REDIS_C
GRAF -->|Metadata| PG_OBS
GRAF -->|Cache| REDIS_OBS
%% Gateway Flow
USR -->|HTTPS| TR
TR -->|Route| SUP
TR -->|Route| GRAF
%% Styling
classDef gateway fill:#ffffff,stroke:#1976d2,stroke-width:3px,color:#000
classDef external fill:#ffffff,stroke:#f57c00,stroke-width:2px,color:#000
classDef elt fill:#ffffff,stroke:#9c27b0,stroke-width:2px,color:#000
classDef warehouse fill:#ffffff,stroke:#388e3c,stroke-width:2px,color:#000
classDef analytics fill:#ffffff,stroke:#c2185b,stroke-width:2px,color:#000
classDef observability fill:#ffffff,stroke:#f9a825,stroke-width:2px,color:#000
classDef user fill:#ffffff,stroke:#d32f2f,stroke-width:2px,color:#000
class TR gateway
class BEEFY_DB,BEEFY_API external
class PYTHON,DBT elt
class CH,MINIO warehouse
class SUP,SUP_W,CELERY,PG,REDIS_A analytics
class PROM,GRAF,PG_OBS,REDIS_OBS,REDIS_C,PG_C observability
class USR user
```

- ClickHouse: Columnar analytics database that uses PostgreSQL table engines to federate queries to external databases. Stores transformed data in marts for fast analytical queries.
- dbt: Data transformation tool that manages the data modeling pipeline:
  - Creates external tables in ClickHouse pointing to source databases (staging layer)
  - Transforms and cleans data (intermediate layer)
  - Builds business-ready analytical models (marts layer)
  - Runs on a schedule via cron jobs in production
- Superset: Business intelligence platform that connects to ClickHouse as a native data source. Analysts can create dashboards, charts, and SQL queries against the dbt models.
- Traefik: Reverse proxy that handles:
  - HTTPS termination with Let's Encrypt certificates
  - Routing to internal services
  - Blue-green deployment support
- Prometheus + Grafana: Monitoring stack that collects metrics from all services and provides dashboards for infrastructure health, query performance, and system resources.
- MinIO: S3-compatible object storage for future data artifacts, backups, and pipeline outputs (see the sketch after this list).
- Docker Swarm: Container orchestration for production deployment, enabling high availability and service management across multiple nodes.
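ClickHouse can read such objects back through its S3 integration, which is what the "S3 Engine" edge in the architecture diagram refers to. A minimal sketch using the `s3` table function, where the endpoint, bucket, path, and credentials are illustrative assumptions:

```sql
-- Illustrative sketch: endpoint, bucket, path, and credentials are assumptions.
SELECT count() AS row_count
FROM s3(
    'http://minio:9000/exports/harvests/*.parquet',
    'admin',
    'changeme',
    'Parquet'
);
```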
- Docker and Docker Compose (for local development)
- Docker Swarm (for production deployment)
- uv (Python package manager)
- Access to external PostgreSQL "beefy-db" database
Quick start (automated):
```bash
make setup        # Setup .env and install dependencies
make infra start  # Start infrastructure services (ClickHouse auto-initializes)
make dbt run      # Run dbt models
```
Manual setup:
1. Clone the repository and run initial setup:
   ```bash
   make setup
   # Or manually:
   cp .env.example .env  # Edit .env with your credentials
   uv sync               # Install dependencies
   ```
2. Start infrastructure services:
   ```bash
   make infra start  # Start ClickHouse, Superset, Grafana, Prometheus, MinIO
   # ClickHouse automatically initializes on first startup
   ```
3. Run dbt models:
   ```bash
   make dbt run   # Run all models
   make dbt test  # Run tests
   ```
4. Access services:
   - ClickHouse: http://localhost:8123
   - Superset: http://localhost:8088
   - Grafana: http://localhost:3000 (admin/changeme by default)
   - Prometheus: http://localhost:9090
   - MinIO Console: http://localhost:9001 (admin/changeme by default)
   - MinIO API: http://localhost:9002
Other useful commands:
```bash
make help        # Show all available commands
make infra logs  # View infrastructure logs
make infra ps    # Check service status
make infra stop  # Stop all services
```
For production deployment:

1. Follow steps 1-2 from Local Development Setup
2. Navigate to production directory:
   ```bash
   cd infra/prod
   ```
3. Build and deploy infrastructure:
   ```bash
   # Build images (dbt, superset, etc.)
   make build
   # Deploy stack
   make deploy
   # Or combine: make deploy (builds and deploys)
   ```
4. ClickHouse automatically initializes on first startup via init script
5. Run dbt models (dbt runs on a schedule via cron in production):
   ```bash
   # Or manually trigger from dbt container:
   docker exec -it <dbt-container> /app/run_dbt.sh
   ```
Production Management:
```bash
cd infra/prod
make ps    # Show running services
make logs  # View logs from all services
make up    # Rolling update (redeploy stack)
```

Project layout:

- `infra/`: Infrastructure configurations (dev/prod stacks, ClickHouse, Superset, Traefik, monitoring)
- `infra/prod/`: Production deployment files (Makefile, docker-stack.yml, docker-compose.build.yml)
- `infra/dev/`: Development docker-compose configuration
- `infra/dbt/`: dbt container configuration (Dockerfile, cron jobs)
- `dbt/`: dbt project with staging, intermediate, and marts models
- `dlt/`: Future data ingestion pipelines
- `hasura/`: Future GraphQL API setup
The dbt project follows best practices with three layers that progressively transform raw source data into business-ready analytical models.
```mermaid
graph LR
A[PostgreSQL<br/>Source Tables] -->|External Table Engine| B[Staging Layer<br/>External Tables]
B -->|Transform & Clean| C[Intermediate Layer<br/>Views]
C -->|Aggregate & Model| D[Marts Layer<br/>Tables]
D -->|Query| E[Superset<br/>Dashboards]
style A fill:#FFB6C1
style B fill:#87CEEB
style C fill:#90EE90
style D fill:#DDA0DD
style E fill:#F0E68C
```

- Staging (`models/staging/`):
  - Creates ClickHouse external tables using PostgreSQL table engines
  - These tables act as read-only views that query the source database on-demand
  - No data is copied; queries are federated to PostgreSQL
  - Example: `stg_beefy_db__harvests` queries the `harvests` table from the beefy-db PostgreSQL database (see the sketch after this item)
  - Materialized as views (no storage overhead)
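A staging model of this shape might look like the sketch below; the column names and source definition are illustrative assumptions, not the project's actual schema:

```sql
-- models/staging/stg_beefy_db__harvests.sql
-- Illustrative sketch: column names and the source definition are assumptions.
{{ config(materialized='view') }}

select
    id         as harvest_id,
    chain_id,
    vault_id,
    amount,
    created_at as harvested_at
from {{ source('beefy_db', 'harvests') }}
```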
- Intermediate (`models/intermediate/`):
  - Cleans and standardizes data from staging
  - Handles data quality issues (nulls, invalid values, outliers)
  - Applies business logic and calculations
  - Joins related tables and enriches with dimension data
  - Example: `int_price` standardizes price data across sources (see the sketch after this item)
  - Materialized as views for flexibility
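An intermediate model along these lines might apply the cleanup rules described above; the model name `int_price` comes from the example, but the upstream model `stg_beefy_db__prices`, the thresholds, and the columns are illustrative assumptions:

```sql
-- models/intermediate/int_price.sql
-- Illustrative sketch: upstream model, thresholds, and columns are assumptions.
{{ config(materialized='view') }}

select
    token_id,
    priced_at,
    price_usd
from {{ ref('stg_beefy_db__prices') }}
where price_usd is not null
  and price_usd > 0    -- drop invalid values
  and price_usd < 1e9  -- crude outlier guard; real logic would be domain-specific
```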
- Marts (`models/marts/`):
  - Business-ready analytical models organized by domain
  - Dimensions (`dimensions/`): Reference data (chains, products)
  - Revenue (`revenue/`): Revenue-generating events (buyback events)
  - Yield (`yield/`): Yield-generating events (harvest events)
  - Optimized for analytical queries with proper indexing
  - Materialized as tables for query performance (see the sketch after this item)
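Putting the layers together, a mart such as `yield_harvest_events` might be assembled roughly as follows; the dimension models (`dim_chains`, `dim_products`) and the join keys are illustrative assumptions:

```sql
-- models/marts/yield/yield_harvest_events.sql
-- Illustrative sketch: dimension models and join keys are assumptions.
{{ config(materialized='table') }}

select
    h.harvest_id,
    c.chain_name,
    p.product_name,
    h.amount * pr.price_usd as amount_usd,
    h.harvested_at
from {{ ref('stg_beefy_db__harvests') }} h
join {{ ref('dim_chains') }}   c  on h.chain_id = c.chain_id
join {{ ref('dim_products') }} p  on h.vault_id = p.product_id
join {{ ref('int_price') }}    pr on h.vault_id = pr.token_id
```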
- Source: the PostgreSQL `harvests` table contains raw harvest transaction data
- Staging: `stg_beefy_db__harvests` creates an external table that queries PostgreSQL directly
- Intermediate: data is cleaned (filters invalid prices, removes outliers)
- Marts: `yield_harvest_events` joins with dimensions (chain, product) and calculates USD values
- Consumption: Superset queries `yield_harvest_events` for dashboards and analytics
View all commands:
```bash
make help
```

Infrastructure:
```bash
make infra start    # Start infrastructure services
make infra stop     # Stop infrastructure services
make infra restart  # Restart infrastructure services
make infra build    # Rebuild infrastructure images
make infra logs     # View logs
make infra ps       # Check service status
```

dbt:
```bash
make dbt run      # Run all models
make dbt test     # Run tests
make dbt compile  # Compile models
make dbt docs     # Generate and serve documentation
```

Development workflow:
```bash
make dev          # Complete setup: install deps, start services, run dbt
make infra start  # Start infrastructure services
```

If you prefer to use docker-compose and dbt directly:
```bash
# Infrastructure (from project root)
docker-compose -f infra/dev/docker-compose.yml up -d
docker-compose -f infra/dev/docker-compose.yml logs -f
docker-compose -f infra/dev/docker-compose.yml down

# dbt
cd dbt && uv run dbt run
cd dbt && uv run dbt test
```

The platform supports both development and production environments:
- Development: `infra/dev/docker-stack.yml`
- Production: `infra/prod/docker-stack.yml`
Services can be moved between nodes using Docker Swarm labels.
- Superset: http://localhost:8088
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- ClickHouse: http://localhost:8123
Note: Superset supports ClickHouse as a native data source. After initial setup, you can configure ClickHouse connections through the Superset UI to query your dbt models and data.
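For example, once the connection is configured, a dashboard dataset could be backed by a SQL Lab query against the mart described earlier; the table and column names are assumptions carried over from the illustrative models above:

```sql
-- Illustrative SQL Lab query: table and column names are assumptions.
SELECT
    chain_name,
    toDate(harvested_at) AS day,
    sum(amount_usd)      AS total_yield_usd
FROM yield_harvest_events
GROUP BY chain_name, day
ORDER BY day DESC
```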
- Traefik HTTPS termination with Let's Encrypt
- Docker Swarm network isolation
- Service-level security configurations in docker-stack.yml