14 changes: 14 additions & 0 deletions .claude/settings.local.json
@@ -0,0 +1,14 @@
{
"permissions": {
"allow": [
"Bash(cd /home/sspaeti/git/data-engineering/practical-data-engineering/src/pipelines/real-estate && rm -f setup.py setup.cfg dev-requirements.txt tox.ini && rm -rf realestate.egg-info/ real_estate.egg-info/ spark-warehouse/)",
"Bash(rm -f uv.lock && uv lock 2>&1 | tail -30)",
"Bash(uv sync:*)",
"Bash(uv run:*)",
"Bash(pkill -f \"dagster dev\" 2>/dev/null; sleep 1; timeout 15 uv run dagster dev 2>&1 || true)",
"Bash(timeout 15 uv run dagster dev 2>&1; echo \"EXIT: $?\")",
"Read(//tmp/**)",
"Bash(curl -s -A \"Mozilla/5.0 \\(X11; Linux x86_64; rv:128.0\\) Gecko/20100101 Firefox/128.0\" \"https://www.immoscout24.ch/en/real-estate/buy/city-twann?r=0&map=1\" -o /tmp/immo_twann2.html && wc -c /tmp/immo_twann2.html)"
]
}
}
123 changes: 90 additions & 33 deletions README.md
@@ -6,11 +6,11 @@

[![Open Source Logos](https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/images/open-source-logos.png)](https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/)

This repository contains a practical implementation of a data engineering project that spans web-scraping real estate listings, processing with Spark and Delta Lake, adding data science with Jupyter Notebooks, ingesting data into Apache Druid, visualizing with Apache Superset, and managing workflows with Dagster—all orchestrated on Kubernetes.

**Built your own DE project or forked mine? Let me know in the comments; I'd be curious to hear more about it.**

## About This Project

This Practical Data Engineering project addresses common data engineering challenges while exploring innovative technologies. It should serve as a learning project but incorporate comprehensive real-world use cases. It's a guide to building a data application that collects real-estate data, enriches it with various metrics, and offers insights through machine learning and data visualization. This application helps you find your dream properties in your area and showcases how to handle a full-fledged data engineering pipeline using modern tools and frameworks.

@@ -22,68 +22,127 @@
### Key Features & Learnings:
- Scraping real estate listings with [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/index.html)
- Change Data Capture (CDC) mechanisms for efficient data updates
- Utilizing [SeaweedFS](https://github.com/seaweedfs/seaweedfs) as an S3-compatible object store for cloud-agnostic storage
- Implementing UPSERTs and ACID transactions with [Delta Lake](https://delta.io/)
- Integrating [Jupyter Notebooks](https://github.com/jupyter/notebook) for data science tasks
- Visualizing data with [Apache Superset](https://github.com/apache/superset)
- Orchestrating workflows with [Dagster](https://github.com/dagster-io/dagster/)
- Deploying on [Kubernetes](https://github.com/kubernetes/kubernetes) for scalability and cloud-agnostic architecture
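The CDC bullet above boils down to detecting which scraped listings are new or changed before writing them downstream. A minimal hash-based sketch of that idea in Python follows; it is an illustration of the pattern, not the project's actual implementation, and the `id` field is a hypothetical unique listing key:

```python
import hashlib
import json


def fingerprint(listing: dict) -> str:
    """Stable content hash of a listing, used to detect changes between scrapes."""
    payload = json.dumps(listing, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def detect_changes(previous: dict, scraped: list) -> tuple:
    """Split freshly scraped listings into new vs. changed, skipping unchanged ones.

    `previous` maps listing id -> fingerprint from the last run.
    """
    new, changed = [], []
    for listing in scraped:
        key = listing["id"]  # hypothetical unique listing id
        fp = fingerprint(listing)
        if key not in previous:
            new.append(listing)
        elif previous[key] != fp:
            changed.append(listing)
    return new, changed
```

Unchanged listings produce the same fingerprint and are dropped, so only the delta needs to be UPSERTed into the Delta table.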

### Technologies, Tools, and Frameworks:
This project leverages a vast array of open-source technologies including SeaweedFS, Delta Lake, Jupyter Notebooks, Apache Druid, Apache Superset, and Dagster—all running on Kubernetes to ensure scalability and cloud-agnostic deployment.

<p align="center">
<img src="https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/images/lakehouse-open-sourced.png" height="500">
</p>

### Project Evolution and Updates

This project started in November 2020 as a way for me to learn and teach about data engineering. I published the entire project in March 2021 (see the initial version on [branch `v1`](https://github.com/sspaeti-com/practical-data-engineering/tree/v1)). Three years later, it's interesting that the tools used in this project are still in use today. We always say how fast the Modern Data Stack changes, but if you choose wisely, good tools stand the test of time. In `March 2024`, I updated the project to the latest Dagster and representative tool versions. I kept most technologies, except Apache Spark, which was a nightmare to set up locally and to use with the Delta Lake SQL API. I replaced it with [delta-rs](https://github.com/delta-io/delta-rs), which is implemented in Rust and can edit and write Delta Tables directly from Python.

**March 2026 Update**: Migrated from Dagster 1.5.1 to 1.12.x with modern patterns (`import dagster as dg`, `ConfigurableResource`, `Definitions`). Replaced MinIO with [SeaweedFS](https://github.com/seaweedfs/seaweedfs) for S3-compatible storage. Added Docker Compose setup for Apache Druid. All ops/graphs/jobs modernized while keeping the same pipeline logic. Spark code is kept as commented reference for future reactivation via Dagster Pipes.


## Installation & Usage

### Prerequisites:
- Python 3.10-3.13 and [uv](https://docs.astral.sh/uv/) for dependency management
- Docker for running SeaweedFS (S3 storage) and optional services
- Basic understanding of Python and SQL

### Quick Start:
1. Clone this repository.
2. Install dependencies
3. Install and start SeaweedFS (see below)
4. Explore the data with the provided Jupyter Notebooks and Superset dashboards.

```sh
# change to the pipeline directory
cd src/pipelines/real-estate

# install dependencies
uv sync --all-extras

# start SeaweedFS (S3-compatible object store)
weed server -s3 -s3.config=seaweedfs-s3.json -dir=/tmp/seaweedfs
# or: make s3

# start dagster (in another terminal)
uv run dagster dev
```

Open http://127.0.0.1:3000 in your browser to access the Dagster UI.

### SeaweedFS (S3 Storage)

This project uses [SeaweedFS](https://github.com/seaweedfs/seaweedfs) as an S3-compatible object store, replacing the previously used MinIO. SeaweedFS is lightweight and provides full S3 API compatibility on port 8333.

**Docker (recommended):**
```sh
docker compose up seaweedfs -d
```

**Standalone (alternative):**
```sh
# Install via package manager or download from https://github.com/seaweedfs/seaweedfs/releases
# On Arch Linux:
yay -S seaweedfs

# Run with S3 enabled (from the real-estate directory)
weed server -s3 -s3.config=seaweedfs-s3.json -dir=/tmp/seaweedfs
# or: make s3
```

The `-s3.config` flag points to `seaweedfs-s3.json` which defines the `admin`/`admin` credentials. Without it, SeaweedFS rejects all authenticated requests.
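For reference, a minimal `seaweedfs-s3.json` granting those `admin`/`admin` credentials could look like the following. This is a sketch based on SeaweedFS's identity-based config format; the file shipped in this repository is the authoritative version:

```json
{
  "identities": [
    {
      "name": "admin",
      "credentials": [
        { "accessKey": "admin", "secretKey": "admin" }
      ],
      "actions": ["Admin", "Read", "Write", "List", "Tagging"]
    }
  ]
}
```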

**Configuration:**

The pipeline reads S3 credentials from environment variables with these defaults:

| Variable | Default | Description |
|---|---|---|
| `S3_ACCESS_KEY` | `admin` | S3 access key |
| `S3_SECRET_KEY` | `admin` | S3 secret key |
| `S3_ENDPOINT` | `http://127.0.0.1:8333` | S3 endpoint URL |

SeaweedFS auto-creates buckets on first write, so no manual bucket setup is needed.
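The defaults from the table above can be resolved in Python roughly like this. It is a sketch of the pattern, not the pipeline's actual resource code:

```python
import os


def s3_settings(env=os.environ) -> dict:
    """Resolve S3 connection settings, falling back to the documented defaults."""
    return {
        "access_key": env.get("S3_ACCESS_KEY", "admin"),
        "secret_key": env.get("S3_SECRET_KEY", "admin"),
        "endpoint": env.get("S3_ENDPOINT", "http://127.0.0.1:8333"),
    }
```

The resulting dict can then be handed to an S3 client (e.g. boto3) or a Dagster resource.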

**Migrating from MinIO:**

If you have existing data in MinIO, you can copy it to SeaweedFS using any S3-compatible tool:
```sh
# The aws cli supports only one --endpoint-url per command
# (there is no --source-endpoint-url flag), so sync in two steps:
aws s3 sync s3://real-estate ./minio-export --endpoint-url http://127.0.0.1:9000
aws s3 sync ./minio-export s3://real-estate --endpoint-url http://127.0.0.1:8333
```

### Apache Druid (Optional)

The pipeline includes an optional Druid ingestion step for OLAP analytics. The `docker-compose.yml` includes a full Druid cluster (Coordinator, Broker, Historical, MiddleManager, Router) with Zookeeper and a PostgreSQL metadata store.

```sh
# Start the full Druid stack
docker compose up druid-router -d

# Druid UI available at http://127.0.0.1:8888
```

Druid is configured to use SeaweedFS for deep storage. The `ingest_druid` op is available in the codebase but not yet wired into the main pipeline graph. To activate it, uncomment the relevant lines in `realestate/pipelines.py`.
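Once the stack is up, a native batch ingestion task can be submitted to Druid's task endpoint on the Router. The sketch below is illustrative only: the datasource name, S3 URI, and schema fields are hypothetical, and the real spec lives with the `ingest_druid` op:

```python
import json
from urllib import request


def build_ingestion_task(datasource: str, s3_uri: str) -> dict:
    """Minimal Druid index_parallel task spec (hypothetical field values)."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "scraped_at", "format": "iso"},
                "dimensionsSpec": {"useSchemaDiscovery": True},
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "uris": [s3_uri]},
                "inputFormat": {"type": "json"},
            },
        },
    }


def submit_task(task: dict, router: str = "http://127.0.0.1:8888") -> None:
    """POST the task to Druid's task API (requires the stack to be running)."""
    req = request.Request(
        f"{router}/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```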

### Running Tests

```sh
uv run pytest realestate_tests/ -v
```

## Visualizing the Pipeline

![Dagster UI – Practical Data Engineering Pipeline](images/dagster-practical-data-engineering-pipeline.png)

## Resources & Further Reading
- [Building a Data Engineering Project in 20 Minutes](https://www.ssp.sh/blog/data-engineering-project-in-twenty-minutes/): Access the full blog post detailing the project's development, challenges, and solutions.
- [DevOps Repositories](https://github.com/sspaeti-com/data-engineering-devops): Explore the setup for Druid, SeaweedFS and other components.
- [Business Intelligence Meets Data Engineering with Emerging Technologies](https://www.ssp.sh/blog/business-intelligence-meets-data-engineering/): An earlier post that dives into some of the technologies used in this project.
- [Data Engineering Vault](https://vault.ssp.sh/): A collection of resources, tutorials, and guides for data engineering projects.
- [Open-Source Data Engineering Projects](https://www.ssp.sh/brain/open-source-data-engineering-projects/): A curated list of open-source data engineering projects to explore.

## Feedback
Your feedback is invaluable for improving this project. If you've built your own project based on this repository or have suggestions, please let me know by creating an Issue or a Pull Request.

---
Expand All @@ -97,5 +156,3 @@ Your feedback is invaluable to improve this project. If you've built your projec
<img src="https://sspaeti.com/blog/the-location-independent-lifestyle/europe/sspaeti_com_todays_office_033.jpg" width="600">

</p>


2 changes: 1 addition & 1 deletion src/pipelines/real-estate/.tool-versions
@@ -1 +1 @@
python 3.11.7
python latest
17 changes: 9 additions & 8 deletions src/pipelines/real-estate/Makefile
@@ -1,15 +1,16 @@
.DEFAULT_GOAL := run

run: ## Start dagster dev server
uv run dagster dev

install: ## Install all dependencies
uv sync --all-extras

s3: ## Start SeaweedFS with S3 API
weed server -s3 -s3.config=seaweedfs-s3.json -dir=/tmp/seaweedfs

test: ## Run tests
uv run pytest realestate_tests/ -v

help: ## Show all Makefile targets
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'

48 changes: 0 additions & 48 deletions src/pipelines/real-estate/dev-requirements.txt

This file was deleted.
