14 changes: 14 additions & 0 deletions .claude/settings.local.json
@@ -0,0 +1,14 @@
{
"permissions": {
"allow": [
"Bash(cd /home/sspaeti/git/data-engineering/practical-data-engineering/src/pipelines/real-estate && rm -f setup.py setup.cfg dev-requirements.txt tox.ini && rm -rf realestate.egg-info/ real_estate.egg-info/ spark-warehouse/)",
"Bash(rm -f uv.lock && uv lock 2>&1 | tail -30)",
"Bash(uv sync:*)",
"Bash(uv run:*)",
"Bash(pkill -f \"dagster dev\" 2>/dev/null; sleep 1; timeout 15 uv run dagster dev 2>&1 || true)",
"Bash(timeout 15 uv run dagster dev 2>&1; echo \"EXIT: $?\")",
"Read(//tmp/**)",
"Bash(curl -s -A \"Mozilla/5.0 \\(X11; Linux x86_64; rv:128.0\\) Gecko/20100101 Firefox/128.0\" \"https://www.immoscout24.ch/en/real-estate/buy/city-twann?r=0&map=1\" -o /tmp/immo_twann2.html && wc -c /tmp/immo_twann2.html)"
]
}
}
123 changes: 90 additions & 33 deletions README.md
@@ -6,11 +6,11 @@

[![Open Source Logos](https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/images/open-source-logos.png)](https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/)

This repository contains a practical implementation of a data engineering project that spans web-scraping real estate listings, processing with Spark and Delta Lake, adding data science with Jupyter Notebooks, ingesting data into Apache Druid, visualizing with Apache Superset, and managing workflows with Dagster—all orchestrated on Kubernetes.

**Built your own DE project or forked mine? Let me know in the comments; I'd be curious to hear more about it.**

## About This Project

This Practical Data Engineering project addresses common data engineering challenges while exploring innovative technologies. It should serve as a learning project but incorporate comprehensive real-world use cases. It's a guide to building a data application that collects real-estate data, enriches it with various metrics, and offers insights through machine learning and data visualization. This application helps you find your dream properties in your area and showcases how to handle a full-fledged data engineering pipeline using modern tools and frameworks.

@@ -22,68 +22,127 @@
### Key Features & Learnings:
- Scraping real estate listings with [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/index.html)
- Change Data Capture (CDC) mechanisms for efficient data updates
- Utilizing [SeaweedFS](https://github.com/seaweedfs/seaweedfs) as an S3-compatible object store for cloud-agnostic storage
- Implementing UPSERTs and ACID transactions with [Delta Lake](https://delta.io/)
- Integrating [Jupyter Notebooks](https://github.com/jupyter/notebook) for data science tasks
- Visualizing data with [Apache Superset](https://github.com/apache/superset)
- Orchestrating workflows with [Dagster](https://github.com/dagster-io/dagster/)
- Deploying on [Kubernetes](https://github.com/kubernetes/kubernetes) for scalability and cloud-agnostic architecture
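The CDC bullet above boils down to detecting which scraped listings are new or changed before writing them downstream. A minimal hash-based sketch of that idea in Python follows; it is an illustration of the pattern, not the project's actual implementation, and the `id` field is a hypothetical unique listing key:

```python
import hashlib
import json


def fingerprint(listing: dict) -> str:
    """Stable content hash of a listing, used to detect changes between scrapes."""
    payload = json.dumps(listing, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def detect_changes(previous: dict, scraped: list) -> tuple:
    """Split freshly scraped listings into new vs. changed, skipping unchanged ones.

    `previous` maps listing id -> fingerprint from the last run.
    """
    new, changed = [], []
    for listing in scraped:
        key = listing["id"]  # hypothetical unique listing id
        fp = fingerprint(listing)
        if key not in previous:
            new.append(listing)
        elif previous[key] != fp:
            changed.append(listing)
    return new, changed
```

Unchanged listings produce the same fingerprint and are dropped, so only the delta needs to be UPSERTed into the Delta table.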

### Technologies, Tools, and Frameworks:
This project leverages a vast array of open-source technologies including SeaweedFS, Delta Lake, Jupyter Notebooks, Apache Druid, Apache Superset, and Dagster—all running on Kubernetes to ensure scalability and cloud-agnostic deployment.

<p align="center">
<img src="https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/images/lakehouse-open-sourced.png" height="500">
</p>

### Project Evolution and Updates

This project started in November 2020 as a way for me to learn and teach about data engineering. I published the entire project in March 2021 (see the initial version on [branch `v1`](https://github.com/sspaeti-com/practical-data-engineering/tree/v1)). Three years later, it's interesting that the tools used in this project are still in use today. We always say how fast the Modern Data Stack changes, but if you choose wisely, good tools stand the test of time. In `March 2024`, I updated the project to the latest Dagster and representative tool versions. I kept most technologies, except Apache Spark, which was a nightmare to set up locally and to use with the Delta Lake SQL API. I replaced it with [delta-rs](https://github.com/delta-io/delta-rs), which is implemented in Rust and can edit and write Delta Tables directly from Python.

**March 2026 Update**: Migrated from Dagster 1.5.1 to 1.12.x with modern patterns (`import dagster as dg`, `ConfigurableResource`, `Definitions`). Replaced MinIO with [SeaweedFS](https://github.com/seaweedfs/seaweedfs) for S3-compatible storage. Added Docker Compose setup for Apache Druid. All ops/graphs/jobs modernized while keeping the same pipeline logic. Spark code is kept as commented reference for future reactivation via Dagster Pipes.


## Installation & Usage

### Prerequisites:
- Python 3.10-3.13 and [uv](https://docs.astral.sh/uv/) for dependency management
- Docker for running SeaweedFS (S3 storage) and optional services
- Basic understanding of Python and SQL

### Quick Start:
1. Clone this repository.
2. Install dependencies
3. Install and start SeaweedFS (see below)
4. Explore the data with the provided Jupyter Notebooks and Superset dashboards.

```sh
# change to the pipeline directory
cd src/pipelines/real-estate

# install dependencies
uv sync --all-extras

# start SeaweedFS (S3-compatible object store)
weed server -s3 -s3.config=seaweedfs-s3.json -dir=/tmp/seaweedfs
# or: make s3

# start dagster (in another terminal)
uv run dagster dev
```

Open http://127.0.0.1:3000 in your browser to access the Dagster UI.

### SeaweedFS (S3 Storage)

This project uses [SeaweedFS](https://github.com/seaweedfs/seaweedfs) as an S3-compatible object store, replacing the previously used MinIO. SeaweedFS is lightweight and provides full S3 API compatibility on port 8333.

**Docker (recommended):**
```sh
docker compose up seaweedfs -d
```

**Standalone (alternative):**
```sh
# Install via package manager or download from https://github.com/seaweedfs/seaweedfs/releases
# On Arch Linux:
yay -S seaweedfs

# Run with S3 enabled (from the real-estate directory)
weed server -s3 -s3.config=seaweedfs-s3.json -dir=/tmp/seaweedfs
# or: make s3
```

The `-s3.config` flag points to `seaweedfs-s3.json` which defines the `admin`/`admin` credentials. Without it, SeaweedFS rejects all authenticated requests.
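For reference, a minimal `seaweedfs-s3.json` granting those `admin`/`admin` credentials could look like the following. This is a sketch based on SeaweedFS's identity-based config format; the file shipped in this repository is the authoritative version:

```json
{
  "identities": [
    {
      "name": "admin",
      "credentials": [
        { "accessKey": "admin", "secretKey": "admin" }
      ],
      "actions": ["Admin", "Read", "Write", "List", "Tagging"]
    }
  ]
}
```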

**Configuration:**

The pipeline reads S3 credentials from environment variables with these defaults:

| Variable | Default | Description |
|---|---|---|
| `S3_ACCESS_KEY` | `admin` | S3 access key |
| `S3_SECRET_KEY` | `admin` | S3 secret key |
| `S3_ENDPOINT` | `http://127.0.0.1:8333` | S3 endpoint URL |

SeaweedFS auto-creates buckets on first write, so no manual bucket setup is needed.
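The defaults from the table above can be resolved in Python roughly like this. It is a sketch of the pattern, not the pipeline's actual resource code:

```python
import os


def s3_settings(env=os.environ) -> dict:
    """Resolve S3 connection settings, falling back to the documented defaults."""
    return {
        "access_key": env.get("S3_ACCESS_KEY", "admin"),
        "secret_key": env.get("S3_SECRET_KEY", "admin"),
        "endpoint": env.get("S3_ENDPOINT", "http://127.0.0.1:8333"),
    }
```

The resulting dict can then be handed to an S3 client (e.g. boto3) or a Dagster resource.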

**Migrating from MinIO:**

If you have existing data in MinIO, you can copy it to SeaweedFS using any S3-compatible tool:
```sh
# The aws cli supports only one --endpoint-url per command
# (there is no --source-endpoint-url flag), so sync in two steps:
aws s3 sync s3://real-estate ./minio-export --endpoint-url http://127.0.0.1:9000
aws s3 sync ./minio-export s3://real-estate --endpoint-url http://127.0.0.1:8333
```

### Apache Druid (Optional)

The pipeline includes an optional Druid ingestion step for OLAP analytics. The `docker-compose.yml` includes a full Druid cluster (Coordinator, Broker, Historical, MiddleManager, Router) with Zookeeper and a PostgreSQL metadata store.

```sh
# Start the full Druid stack
docker compose up druid-router -d

# Druid UI available at http://127.0.0.1:8888
```

Druid is configured to use SeaweedFS for deep storage. The `ingest_druid` op is available in the codebase but not yet wired into the main pipeline graph. To activate it, uncomment the relevant lines in `realestate/pipelines.py`.
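Once the stack is up, a native batch ingestion task can be submitted to Druid's task endpoint on the Router. The sketch below is illustrative only: the datasource name, S3 URI, and schema fields are hypothetical, and the real spec lives with the `ingest_druid` op:

```python
import json
from urllib import request


def build_ingestion_task(datasource: str, s3_uri: str) -> dict:
    """Minimal Druid index_parallel task spec (hypothetical field values)."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "scraped_at", "format": "iso"},
                "dimensionsSpec": {"useSchemaDiscovery": True},
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "uris": [s3_uri]},
                "inputFormat": {"type": "json"},
            },
        },
    }


def submit_task(task: dict, router: str = "http://127.0.0.1:8888") -> None:
    """POST the task to Druid's task API (requires the stack to be running)."""
    req = request.Request(
        f"{router}/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```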

### Running Tests

```sh
uv run pytest realestate_tests/ -v
```

## Visualizing the Pipeline

![Dagster UI – Practical Data Engineering Pipeline](images/dagster-practical-data-engineering-pipeline.png)

## Resources & Further Reading
- [Building a Data Engineering Project in 20 Minutes](https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/): Access the full blog post detailing the project's development, challenges, and solutions.
- [DevOps Repositories](https://github.com/sspaeti-com/data-engineering-devops): Explore the setup for Druid, SeaweedFS and other components.
- [Business Intelligence Meets Data Engineering with Emerging Technologies](https://www.ssp.sh/blog/business-intelligence-meets-data-engineering/): An earlier post that dives into some of the technologies used in this project.
- [Data Engineering Vault](https://vault.ssp.sh/): A collection of resources, tutorials, and guides for data engineering projects.
- [Open-Source Data Engineering Projects](https://www.ssp.sh/brain/open-source-data-engineering-projects/): A curated list of open-source data engineering projects to explore.

## Feedback
Your feedback is invaluable for improving this project. If you've built your own project based on this repository or have suggestions, please let me know by creating an Issue or a Pull Request.

---
Expand All @@ -97,5 +156,3 @@ Your feedback is invaluable to improve this project. If you've built your projec
<img src="https://sspaeti.com/blog/the-location-independent-lifestyle/europe/sspaeti_com_todays_office_033.jpg" width="600">

</p>


2 changes: 1 addition & 1 deletion src/pipelines/real-estate/.tool-versions
@@ -1 +1 @@
python 3.11.7
python latest
17 changes: 9 additions & 8 deletions src/pipelines/real-estate/Makefile
@@ -1,15 +1,16 @@
.DEFAULT_GOAL := run

run: ## Start dagster dev server
uv run dagster dev

install: ## Install all dependencies
uv sync --all-extras

s3: ## Start SeaweedFS with S3 API
weed server -s3 -s3.config=seaweedfs-s3.json -dir=/tmp/seaweedfs

test: ## Run tests
uv run pytest realestate_tests/ -v

help: ## Show all Makefile targets
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'

48 changes: 0 additions & 48 deletions src/pipelines/real-estate/dev-requirements.txt

This file was deleted.
