Homework and exercises for the DataTalks.Club DE Zoomcamp 2026
Status: Completed | Folder: 01-docker-terraform/
What I learned:
- Docker fundamentals - Container lifecycle, volumes, and data persistence
- Docker networking - Container communication and port mapping
- Docker Compose - Multi-service orchestration (PostgreSQL + pgAdmin)
- Data pipelines - NYC Taxi data ingestion using Python, pandas, and SQLAlchemy
- SQL queries - Data analysis on PostgreSQL databases
- Terraform for GCP - Infrastructure as Code for Google Cloud (Storage + BigQuery)
- Terraform for AWS - Equivalent AWS infrastructure (S3 + Glue Catalog)
- Best practices - Environment variables, `.gitignore` for credentials, and project structure
Key deliverables:
- Homework 01 - Docker, SQL, and Terraform exercises ✅
- Working PostgreSQL + pgAdmin environment via Docker Compose
- Python data ingestion scripts for parquet and CSV files
- Terraform configurations for GCP and AWS resource provisioning
Technologies used: Docker, PostgreSQL, pgAdmin, Python, pandas, SQLAlchemy, Terraform, GCP, AWS
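As a rough illustration of the ingestion scripts above, here is a minimal pandas + SQLAlchemy sketch; the file path, table name, and connection string are placeholders, not the actual homework values:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the Dockerized PostgreSQL instance
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Read one month of NYC Taxi data (parquet here; CSV works the same way
# with pd.read_csv and chunksize for large files)
df = pd.read_parquet("yellow_tripdata_2021-01.parquet")

# Make sure the datetime columns are proper timestamps before loading
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])

# Write to PostgreSQL in batches to keep memory usage bounded
df.to_sql("yellow_taxi_data", con=engine, if_exists="replace", chunksize=100_000, index=False)
```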
Cloud Service Comparison - GCP vs AWS:
| GCP Service | AWS Equivalent | Purpose |
|---|---|---|
| Google Cloud Storage (GCS) | Amazon S3 | Data Lake storage |
| Uniform Bucket Level Access | S3 Public Access Block | Secure bucket access |
| Object Lifecycle Rules | S3 Lifecycle Configuration | Automatic data expiration |
| BigQuery Dataset | AWS Glue Catalog Database | Metadata & query layer |
| Service Account JSON Key | IAM User Profile (AWS CLI) | Authentication |
Authentication Difference:
- GCP: Requires a `my-creds.json` service account key file
- AWS: Uses the IAM profile from `~/.aws/credentials` (no separate key file needed)
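A small sketch of how that authentication difference shows up in client code; the library calls are standard, but the key file name and profile name are illustrative:

```python
import boto3
from google.cloud import storage

# GCP: the client is pointed at an explicit service-account key file
gcs_client = storage.Client.from_service_account_json("my-creds.json")
print([bucket.name for bucket in gcs_client.list_buckets()])

# AWS: boto3 picks up credentials automatically from ~/.aws/credentials
# (optionally via a named profile), so no key file is passed in code
session = boto3.Session(profile_name="default")
s3_client = session.client("s3")
print([bucket["Name"] for bucket in s3_client.list_buckets()["Buckets"]])
```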
Status: Completed | Folder: 02-workflow-orchestration/
What I learned:
- Kestra fundamentals - Modern declarative workflow orchestration platform
- ETL pipeline orchestration - Extract, transform, and load NYC taxi data to GCP
- Variables and expressions - Dynamic workflow configuration using Jinja templating
- Backfill functionality - Historical data processing for multiple time periods
- Scheduled triggers - Automated workflow execution with timezone support
- GCP integration - Cloud Storage and BigQuery data loading
- Secrets management - Secure credential handling with base64 encoding
- Docker Compose orchestration - Multi-service setup (Kestra + PostgreSQL + pgAdmin)
Key deliverables:
- Homework 02 - Workflow orchestration and data pipeline exercises ✅
- Working Kestra instance with PostgreSQL backend via Docker Compose
- Automated ETL flows processing millions of taxi trip records
- GCP bucket and BigQuery dataset created via Terraform-like flows
- Backfill executions for all 2020 data (Yellow: 24.6M rows, Green: 1.7M rows)
Technologies used: Kestra, Docker, PostgreSQL, Python, GCP (Cloud Storage + BigQuery), Gemini AI
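For the secrets-management point above, Kestra's Docker setup can read secrets from `SECRET_`-prefixed environment variables whose values are base64-encoded. A minimal helper to produce such a value from the GCP key file might look like this (the variable name and file path are illustrative):

```python
import base64
from pathlib import Path

# Read the GCP service-account key and base64-encode it for use as a secret value
key_json = Path("my-creds.json").read_text()
encoded = base64.b64encode(key_json.encode("utf-8")).decode("utf-8")

# Print an env-var line that can go into the .env file used by Docker Compose;
# the SECRET_ prefix follows Kestra's convention for environment-based secrets
print(f"SECRET_GCP_SERVICE_ACCOUNT={encoded}")
```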
Kestra Workflow Highlights:
| Flow | Purpose | Data Processed |
|---|---|---|
| 08_gcp_taxi | Manual ETL execution | Single month of taxi data |
| 09_gcp_taxi_scheduled | Scheduled ETL with backfill | Multiple months via cron triggers |
| 06_gcp_kv | Configuration management | Stores GCP project settings |
| 07_gcp_setup | Infrastructure provisioning | Creates GCS bucket + BigQuery dataset |
Data Pipeline Results:
- Yellow Taxi (2020): 24,648,499 records across 12 months
- Green Taxi (2020): 1,734,051 records across 12 months
- Yellow Taxi (March 2021): 1,925,152 records
- File size example: 128.3 MiB uncompressed CSV for Dec 2020
Status: Completed | Folder: 03-data-warehouse/
What I learned:
- BigQuery fundamentals - External tables, materialized tables, and native BigQuery storage
- Partitioning strategies - Date-based partitioning for query optimization (91% cost reduction)
- Clustering techniques - Organizing data within partitions for faster access
- Columnar storage - Understanding how BigQuery scans only requested columns
- Query optimization - Using `--dry_run` to estimate costs before execution
- Cost management - Cleanup strategies to avoid unnecessary GCP charges
- GCS integration - Creating external tables referencing Cloud Storage data
- Data analysis at scale - Working with 20.3M taxi trip records across 6 months
Key deliverables:
- Homework 03 - BigQuery & Data Warehousing exercises ✅
- External and materialized tables with 20.3M records (326.1 MiB parquet data)
- Partitioned and clustered table achieving 91% query cost reduction (310 MB → 27 MB)
- Complete cost analysis and GCP resource cleanup documentation
- SQL queries demonstrating columnar storage efficiency and metadata optimization
Technologies used: BigQuery, Google Cloud Storage, bq CLI, gsutil, SQL, Parquet
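A minimal sketch of the `--dry_run`-style cost estimation mentioned above, using the Python BigQuery client instead of the `bq` CLI; the project, dataset, table, and date range are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT COUNT(DISTINCT VendorID)
    FROM `my-project.taxi_dataset.yellow_tripdata_partitioned`
    WHERE DATE(tpep_dropoff_datetime) BETWEEN '2024-03-01' AND '2024-03-15'
"""

# dry_run=True asks BigQuery to estimate bytes scanned without running the query
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=job_config)

print(f"Estimated bytes processed: {job.total_bytes_processed / 1024**2:.2f} MB")
```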
BigQuery Optimization Results:
| Optimization Technique | Before | After | Savings |
|---|---|---|---|
| Partitioning (date-range query) | 310.24 MB | 26.84 MB | 91% reduction |
| Columnar storage (1 column → 2 columns) | 155 MB (1 col) | 310 MB (2 cols) | Scan scales linearly with columns read |
| COUNT(*) metadata usage | N/A | 0 bytes | 100% (no scan) |
| Query estimation: external vs materialized | 0 MB (external) | 155.12 MB (materialized) | Materialized tables give accurate size estimates |
Key Learning:
Partitioning by date (tpep_dropoff_datetime) + clustering by frequently filtered columns (VendorID) creates a powerful optimization strategy. For a 15-day query window on 6 months of data, partitioning achieved 91% reduction in data scanned (from 310 MB to 27 MB), translating directly to cost savings in production. Understanding the difference between external tables (data in GCS, no storage cost) and materialized tables (data in BigQuery, faster queries but storage cost) is crucial for cost optimization.
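The partition + cluster combination described above can be expressed as a single DDL statement, shown here as a hedged sketch run through the Python client; the project, dataset, and source table names are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

# PARTITION BY / CLUSTER BY mirrors the optimization strategy above:
# date-partition on the dropoff timestamp, cluster on the frequently filtered VendorID
ddl = """
    CREATE OR REPLACE TABLE `my-project.taxi_dataset.yellow_tripdata_partitioned_clustered`
    PARTITION BY DATE(tpep_dropoff_datetime)
    CLUSTER BY VendorID AS
    SELECT * FROM `my-project.taxi_dataset.yellow_tripdata_materialized`
"""
client.query(ddl).result()  # .result() blocks until the DDL job completes
```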
Status: Completed | Folder: 04-analytics-engineering/
What I learned:
- dbt Core fundamentals - Local CLI-based development with BigQuery adapter
- Project structure - Organizing models into staging, intermediate, and core layers
- Data modeling - Building dimensional models (facts and dimensions) from raw data
- Data quality testing - Implementing automated tests for uniqueness and referential integrity
- Deduplication strategies - 4-column composite key approach for handling duplicate records
- SQL materialization types - Views vs tables vs incremental models
- Jinja templating - Dynamic SQL with variables and macros (e.g., the `is_test_run` flag)
- dbt packages - Using `dbt_utils` for surrogate keys and testing utilities
- BigQuery optimization - QUALIFY window functions for efficient deduplication
- Data lineage - Understanding upstream/downstream dependencies between models
Key deliverables:
- Homework 04 - dbt modeling and transformation exercises ✅
- Complete dbt project with 8 models across 3 layers (staging, intermediate, core)
- Processed 130M+ raw records → 112M deduplicated trips
- Fact tables: `fct_trips` (112M records), `fct_monthly_zone_revenue` (11,662 aggregations)
- Data quality tests with 9 assertions covering uniqueness and relationships
- Terraform-managed GCP infrastructure (service account, BigQuery datasets)
Technologies used: dbt Core, BigQuery, SQL, Jinja, Python, GCP, Terraform, uv (package manager)
Hybrid Approach (different from course guides): Instead of using dbt Cloud (the course's recommended web IDE), I opted for a local-first hybrid setup:
- Local: dbt Core CLI + VSCode for development, Git for version control
- Cloud: BigQuery for data warehouse and SQL execution
- Infrastructure as Code: Terraform for GCP resource provisioning (service accounts, BigQuery datasets, IAM roles)
- Why: Better IDE experience, full control over environment, reproducible infrastructure, and no vendor lock-in
- Trade-off: Manual setup (Terraform configs, service accounts, profiles.yml) vs dbt Cloud's one-click setup
This approach mirrors real-world production environments where teams prefer local development with cloud compute and infrastructure automation.
dbt Model Architecture:
| Layer | Model | Purpose | Record Count |
|---|---|---|---|
| Staging | stg_green_tripdata | Clean & standardize green taxi data | ~2M records |
| Staging | stg_yellow_tripdata | Clean & standardize yellow taxi data | ~130M records |
| Staging | stg_fhv_tripdata | Clean & standardize FHV data | 43.2M records |
| Intermediate | int_trips_unioned | Union yellow + green taxi data | 114M records |
| Intermediate | int_trips | Deduplicate using 4-column key | 112M records |
| Core | fct_trips | Enriched trip facts with zone names | 112M records |
| Core | fct_monthly_zone_revenue | Monthly revenue aggregations | 11,662 records |
Deduplication Strategy:
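The dbt model itself is SQL/Jinja, but the core logic can be sketched as a standalone BigQuery query run via the Python client; the dataset and table names are illustrative. It keeps one row per 4-column composite key using a QUALIFY window function, as described in the Key Learning below:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep exactly one row per (vendorid, pickup_datetime, pickup_locationid, service_type);
# QUALIFY filters on the window function without needing a subquery
dedup_query = """
    SELECT *
    FROM `my-project.dbt_dataset.int_trips_unioned`
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY vendorid, pickup_datetime, pickup_locationid, service_type
        ORDER BY pickup_datetime
    ) = 1
"""
rows = client.query(dedup_query).result()
```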
Key Learning:
dbt transforms the data warehouse into a development environment with version control, testing, and documentation. The layered architecture (staging → intermediate → core) creates maintainable transformations: staging cleans raw data, intermediate handles business logic (like deduplication), and core creates analytics-ready tables. Using local dbt Core with a cloud data warehouse (BigQuery) provides the best of both worlds: a powerful IDE experience locally with scalable compute in the cloud. The 4-column deduplication strategy proved critical: deduplicating only on vendorid + pickup_datetime wasn't enough; adding pickup_locationid + service_type captured the true uniqueness of trip records.
Status: Completed | Folder: 05-data-platforms/
What I learned:
- Bruin CLI fundamentals - Unified tool for data ingestion, transformation, orchestration, and governance
- Pipeline architecture - Layered approach (ingestion, staging, reporting) for NYC Taxi data
- Incremental processing - Using `time_interval` materialization for efficient updates
- Quality checks & lineage - Built-in validation and visualization of data flows
- Asset management - Version-controlled assets (Python, SQL, YAML, CSV) for reproducible pipelines
- AI agent integration - Conversational pipeline development and troubleshooting with Bruin MCP
- Cloud deployment - Managed infrastructure and cloud-native workflows with Bruin Cloud
- Quick feedback cycles - Local-first development with rapid iteration
Key deliverables:
- Homework 05 - Bruin CLI, pipeline, and platform exercises ✅
- Complete Bruin project structure: `.bruin.yml`, `pipeline/pipeline.yml`, and layered assets
- End-to-end NYC Taxi pipeline (ingestion, staging, reporting)
- Reference notes and video tutorials for Bruin and data platform concepts
Technologies used: Bruin CLI, VS Code extension, DuckDB, BigQuery, Python, SQL, YAML
Bruin Platform Highlights:
| Feature | Purpose | Result |
|---|---|---|
| Unified CLI | Ingestion, transformation, orchestration, governance | One tool for all workflows |
| Layered pipeline | Modular asset structure | Reproducible, extensible pipelines |
| Incremental materialization | Efficient updates | Fast, scalable processing |
| Quality checks | Data validation | Reliable pipelines |
| Lineage visualization | Data flow tracking | Transparent dependencies |
| AI agent (MCP) | Conversational development | Fast troubleshooting & building |
| Cloud deployment | Managed infrastructure | Scalable, production-ready workflows |
Unique Approaches:
- AI-assisted pipeline development and troubleshooting (Bruin MCP)
- All pipeline configuration and assets are version-controlled text files (no UI lock-in)
- Mix-and-match asset types (Python, SQL, YAML, CSV) in a single pipeline
- Emphasis on quick feedback cycles and local development
Key Learning: Module 5 demonstrates a modern, unified data platform workflow using Bruin CLI. The approach emphasizes reproducibility, quality, extensibility, and rapid iteration. AI integration and version-controlled assets enable flexible, scalable pipelines suitable for both local and cloud environments.
Status: Completed | Folder: 06-batch/
What I learned:
- Batch processing fundamentals - Why batch workloads matter and where Spark fits in modern data platforms
- Spark setup and local development - Installing Java/PySpark and creating local Spark sessions
- Spark DataFrames and Spark SQL - Reading parquet/CSV data, transformations, aggregations, and SQL-based analysis
- Schema and storage handling - Working with taxi schemas, parquet output, and partition/repartition strategies
- Spark internals - Cluster anatomy, execution behavior, and performance implications of groupBy and joins
- RDD concepts (optional) - Lower-level distributed operations including `mapPartitions`
- Cloud execution patterns - Spark with GCS, local clusters, Dataproc setup, and BigQuery connectivity
- Validation through homework - Applying class concepts on NYC Taxi November 2025 data to confirm understanding
Key deliverables:
- Homework 06 - Spark batch-processing exercises and validated answers ✅
- Reproducible solver script: homework_queries.py
- Completed class materials and notebooks in 06-batch/class_materials/code/
- End-to-end analysis on Yellow Taxi November 2025 data using Spark DataFrames and joins
Technologies used: Apache Spark, PySpark, Spark SQL, Python, Parquet, CSV, Java, GCS, Dataproc, BigQuery
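A condensed PySpark sketch of the patterns listed above (repartitioning with parquet output, a zone-lookup join, and a timestamp-based duration metric); the file paths are assumptions, while the column names follow the standard yellow-taxi schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-homework").getOrCreate()

# Read one month of yellow taxi data and control the output file layout
df = spark.read.parquet("yellow_tripdata_2025-11.parquet")
df.repartition(4).write.mode("overwrite").parquet("data/yellow/2025/11/")

# Trip duration in hours from the pickup/dropoff timestamps
durations = df.withColumn(
    "duration_hours",
    (F.unix_timestamp("tpep_dropoff_datetime") - F.unix_timestamp("tpep_pickup_datetime")) / 3600,
)
durations.agg(F.max("duration_hours")).show()

# Join pickup counts with the zone lookup to find the least frequent pickup zone
pickup_counts = df.groupBy("PULocationID").count()
zones = spark.read.option("header", True).csv("taxi_zone_lookup.csv")
(pickup_counts
    .join(zones, pickup_counts.PULocationID == zones.LocationID)
    .orderBy("count")
    .select("Zone", "count")
    .show(1))
```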
Spark Learning Highlights:
| Topic | What was practiced | Outcome |
|---|---|---|
| Spark setup | Local session, runtime verification, Spark UI | Working Spark 4.1.1 environment |
| DataFrames + SQL | Filters, aggregations, SQL queries on taxi data | Reproducible batch analysis |
| Partitioning strategy | `repartition(4)` and parquet output sizing | Controlled file layout and storage behavior |
| GroupBy + joins | Pickup zone frequency with lookup joins | Correct low-frequency zone identification |
| Duration calculations | Timestamp-based trip duration metrics | Accurate max duration computation |
| Cloud patterns | GCS/Dataproc/BigQuery integration concepts | Clear path from local to cloud execution |
Key Learning: Module 6 reinforced that Spark is not only about writing transformations; it is about understanding execution patterns, partitioning behavior, and data layout decisions. The homework served as a practical validation layer for the class materials, confirming both conceptual understanding and implementation accuracy on a real dataset.
Status: Completed | Folder: 07-streaming/
What I learned:
- Streaming fundamentals - Event-driven architecture and the role of message brokers in real-time data pipelines
- Redpanda - Kafka-compatible broker with a simpler operational model; topic creation and management via the `rpk` CLI
- kafka-python producer/consumer - Serializing records to JSON, batching, and consuming messages from topics
- PyFlink Table API - Defining schemas, creating Kafka source/PostgreSQL sink DDLs, and running streaming SQL jobs
- Windowed aggregations - TUMBLE windows (fixed 5-min intervals) for trip counts and SESSION windows (gap-based) for session analysis
- Watermarks and event time - Defining watermarks via computed columns to handle late-arriving events
- JDBC sink connector - Writing Flink results to PostgreSQL tables using the Flink JDBC connector
- End-to-end streaming pipeline - Producer → Redpanda → Flink job → PostgreSQL, all orchestrated with Docker Compose
Key deliverables:
- Homework 07 - Streaming exercises with Redpanda, kafka-python, and PyFlink ✅
- Workshop stack: Docker Compose environment with Redpanda, Flink JobManager + TaskManager, and PostgreSQL
- Green Taxi October 2025 dataset (49,416 trips) streamed and windowed in real time
Technologies used: Redpanda, Apache Flink (PyFlink), Kafka-Python, PostgreSQL, Docker Compose, Python, Parquet
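A minimal kafka-python producer sketch matching the ingestion step above, JSON-serializing each trip record to the `green-trips` topic; the broker address, file name, and column selection are illustrative:

```python
import json

import pandas as pd
from kafka import KafkaProducer

# Redpanda speaks the Kafka protocol, so the standard kafka-python client works as-is
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

df = pd.read_parquet("green_tripdata_2025-10.parquet")
columns = ["lpep_pickup_datetime", "lpep_dropoff_datetime", "PULocationID", "DOLocationID"]

for row in df[columns].itertuples(index=False):
    # Cast values (including timestamps) to strings so they are JSON-serializable
    message = {col: str(val) for col, val in zip(columns, row)}
    producer.send("green-trips", value=message)

producer.flush()  # make sure every buffered message reaches the broker
```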
Streaming Pipeline Highlights:
| Component | Role | Key Detail |
|---|---|---|
| Redpanda | Message broker | Kafka-compatible, v25.3.9, green-trips topic |
| kafka-python KafkaProducer | Data ingestion | JSON-serialized rows, ~2.92 s for full dataset |
| PyFlink Table API | Stream processing | SQL-based windowed aggregations |
| TUMBLE(5 min) window | Trip count per zone | PULocationID 74 → 15 trips (top zone) |
| SESSION(5 min gap) window | Session analysis | Longest session: 81 trips for PULocationID 74 |
| PostgreSQL JDBC sink | Result storage | Flink writes aggregated results to DB tables |
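To make the windowing concrete, here is a hedged PyFlink Table API sketch of a 5-minute TUMBLE aggregation over the Kafka source; the schema is trimmed, the connector properties are illustrative, and the watermark column is an assumption:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Kafka (Redpanda) source with an event-time watermark on the dropoff timestamp
t_env.execute_sql("""
    CREATE TABLE green_trips (
        PULocationID INT,
        DOLocationID INT,
        lpep_dropoff_datetime TIMESTAMP(3),
        WATERMARK FOR lpep_dropoff_datetime AS lpep_dropoff_datetime - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'green-trips',
        'properties.bootstrap.servers' = 'redpanda:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Fixed 5-minute tumbling windows: trip counts per pickup zone
result = t_env.sql_query("""
    SELECT
        PULocationID,
        TUMBLE_START(lpep_dropoff_datetime, INTERVAL '5' MINUTE) AS window_start,
        COUNT(*) AS trip_count
    FROM green_trips
    GROUP BY PULocationID, TUMBLE(lpep_dropoff_datetime, INTERVAL '5' MINUTE)
""")
result.execute().print()
```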
Key Learning: Module 7 demonstrated how real-time streaming pipelines differ from batch jobs: events are processed continuously as they arrive, windowing (TUMBLE and SESSION) replaces GROUP BY on static data, and watermarks control how late events are handled. Redpanda's Kafka-compatible API made it easy to reuse standard kafka-python clients while keeping the broker lightweight. PyFlink's Table API enables declarative SQL over streaming data, bridging the gap between familiar batch SQL and low-latency stream processing.