Welcome! In this notebook, we'll build a complete, production-ready ELT pipeline from scratch. Here’s a brief overview of our project:
- The Dataset: We'll use the "Jaffle Shop," a fictional e-commerce store. Our raw data is split across three CSV files: `raw_customers`, `raw_orders`, and `raw_payments`. These tables are logically linked by shared `id` columns, which we'll use to join them, as shown in the schema diagram below.
- The Tasks: We will build an end-to-end pipeline. This includes Loading the data (using `dbt seed`), Transforming it with a 3-layer dbt model (staging → intermediate → marts), Testing our models for data quality (like uniqueness and relationships), and finally, Orchestrating the entire process into an automated, scheduled job with Airflow. (A minimal sketch of a staging model follows this list.)
- The Audience: This pipeline is for any business that wants to answer the critical question, "Who are my most valuable customers?" Our final product will be a clean, reliable, and analytics-ready table (`dim_customers`) that a BI tool (like Tableau or Power BI) can connect to for analysis.
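To make the staging layer concrete before we start, here is a minimal sketch of what such a model might look like. The model and column names (`stg_customers`, `customer_id`, etc.) follow the classic Jaffle Shop conventions but are illustrative, not taken from this notebook:

```sql
-- models/staging/stg_customers.sql (hypothetical example)
-- Staging models do light cleanup only: rename raw columns to
-- consistent names so downstream models have a stable interface.
select
    id as customer_id,
    first_name,
    last_name
from {{ ref('raw_customers') }}
```

The intermediate and marts layers would then `ref()` this staging model rather than the raw seed, so each layer builds only on the one beneath it.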
Follow the notebook to work through each of these steps.
We have now built and orchestrated a full data pipeline. It created the final analytical table, and the output directly answers the core business question: "Who are our most valuable customers?"
- Built Models (dbt): We used dbt to load seed data, run transformations (staging → intermediate → marts), and test our data quality.
- Orchestrated Pipeline (Airflow): We wrote an Airflow DAG and used the `airflow` command to run our entire dbt pipeline (`seed`, `run`, and `test`) in the correct, automated sequence (see the DAG sketch after this list).
- The Answer: The final output is the single, reliable `dim_customers` table, which a BI tool (like Tableau or Power BI) could connect to for analysis (an example query is sketched at the end of this section).
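As a sketch of what that orchestration can look like, here is a minimal DAG that chains the three dbt commands with `BashOperator`. The DAG id, schedule, and project path are assumptions for illustration (Airflow 2.4+ syntax):

```python
# dags/jaffle_shop_elt.py (hypothetical example)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/jaffle_shop"  # assumed dbt project location

with DAG(
    dag_id="jaffle_shop_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run the pipeline once a day
    catchup=False,
) as dag:
    seed = BashOperator(task_id="dbt_seed", bash_command=f"dbt seed --project-dir {DBT_DIR}")
    run = BashOperator(task_id="dbt_run", bash_command=f"dbt run --project-dir {DBT_DIR}")
    test = BashOperator(task_id="dbt_test", bash_command=f"dbt test --project-dir {DBT_DIR}")

    # Enforce the correct order: load seeds, build models, then test.
    seed >> run >> test
```

Running `dbt test` as the final task means a data-quality failure fails the DAG run, so bad data never silently reaches `dim_customers`.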
This is the core workflow of a modern data pipeline!
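For example, once `dim_customers` exists, answering the business question is a short query away. The column names below (`lifetime_value` in particular) are assumptions based on the standard Jaffle Shop project, not confirmed by this notebook:

```sql
-- Hypothetical BI-style query against the final mart table.
select
    customer_id,
    first_name,
    last_name,
    lifetime_value
from dim_customers
order by lifetime_value desc
limit 10;
```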
