A complete data engineering project for collecting, processing, and analyzing English Premier League data using Python, Airflow, PostgreSQL, BigQuery, and Docker.
This project builds an end-to-end football data pipeline. It scrapes data from BBC Sport and worldfootball.net, stores it in PostgreSQL and BigQuery, automates ETL with Airflow, and supports analysis via SQL.
- Select data sources (BBC & worldfootball.net)
- Scrape raw data using Python + BeautifulSoup (functions in scrape.py)
- Preview and verify the data structure in Jupyter Notebook
- Set up BigQuery & manually create partitioned tables
- Load transformed data to PostgreSQL and BigQuery (append mode with ingestion_time)
- Use Docker Compose to manage containers (Airflow, Postgres, Jupyter, etc.)
- Schedule daily/weekly scraping jobs in Airflow DAGs
- Analyze data directly in BigQuery using SQL
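
The scraping step in the list above can be sketched as follows. The contents of scrape.py are not shown here, so this is only an illustration: the `TableRowParser` class, the `parse_league_table` function, and the sample HTML are all hypothetical. To keep the sketch runnable without extra installs it uses the standard library's `html.parser` in place of BeautifulSoup, which the project itself uses.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_league_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRowParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample markup; the real source pages are structured differently.
SAMPLE_HTML = """
<table>
  <tr><th>Position</th><th>Team</th><th>Points</th></tr>
  <tr><td>1</td><td>Team A</td><td>30</td></tr>
  <tr><td>2</td><td>Team B</td><td>28</td></tr>
</table>
"""

records = parse_league_table(SAMPLE_HTML)
print(records)
```

Returning a list of plain dicts keeps the output easy to inspect in a notebook and easy to hand to whichever loader writes to PostgreSQL or BigQuery.
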
| Source | Data | Frequency |
|---|---|---|
| BBC Sport | League table & top scorers | Daily |
| worldfootball.net | Goal data, player info, historical stats | Weekly/Seasonal |
- Python
- Airflow
- PostgreSQL
- Google BigQuery
- Docker
- Jupyter Notebook
Airflow Dags/
├── init_full_load.py
├── scrape_daily_dag.py
└── scrape_weekly_dag.py
scrape.py
docker-compose.yaml
README.md
| DAG | Script | Frequency | Description |
|---|---|---|---|
| Init Load | init_full_load.py | Manual | One-time historical load |
| Daily Scrape | scrape_daily_dag.py | Daily at 06:00 | League table & scorers |
| Weekly Scrape | scrape_weekly_dag.py | Sunday | Historical/player data |
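
As a rough illustration of how the daily schedule in the table above might be expressed, here is a minimal sketch assuming Airflow 2.x (newer releases name the scheduling argument `schedule` instead of `schedule_interval`). The `dag_id`, task name, start date, and the midnight time for the weekly job are assumptions; the actual DAG files are not shown here.

```python
from datetime import datetime

# Cron expressions matching the schedules in the table.
DAILY_SCHEDULE = "0 6 * * *"    # every day at 06:00
WEEKLY_SCHEDULE = "0 0 * * 0"   # Sundays; midnight is an assumption


def scrape_league_table():
    """Placeholder task body; the real logic lives in scrape.py."""
    print("scraping league table and top scorers")


def build_daily_dag():
    """Build the daily DAG (hypothetical names throughout).

    Airflow is imported inside the function so the rest of this file can
    be imported without an Airflow installation.
    """
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dag = DAG(
        dag_id="scrape_daily",               # hypothetical DAG id
        schedule_interval=DAILY_SCHEDULE,    # Airflow 2.x argument name
        start_date=datetime(2024, 1, 1),     # arbitrary start date
        catchup=False,                       # do not backfill missed runs
    )
    PythonOperator(
        task_id="scrape_league_table",
        python_callable=scrape_league_table,
        dag=dag,
    )
    return dag
```

In a real DAG file the result would need to be assigned at module level (for example `dag = build_daily_dag()`) so the scheduler can discover it. Cron times are evaluated in Airflow's configured timezone, which is UTC by default.
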
Each table includes an ingestion_time timestamp column for partitioning.
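
A minimal sketch of what append-mode loading with an `ingestion_time` stamp could look like. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere; the `stamp_rows` and `append_rows` helpers and the `name`/`club`/`goals` columns are hypothetical, not the project's actual schema.

```python
import sqlite3
from datetime import datetime, timezone


def stamp_rows(rows, now=None):
    """Return copies of rows with a shared UTC ingestion_time added."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return [{**row, "ingestion_time": ts} for row in rows]


def append_rows(conn, rows):
    """Append rows to top_scorers; earlier loads are never overwritten."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS top_scorers "
        "(name TEXT, club TEXT, goals INTEGER, ingestion_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO top_scorers VALUES (:name, :club, :goals, :ingestion_time)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
sample = [{"name": "Player A", "club": "Team A", "goals": 10}]  # made-up data

# Loading the same snapshot on two different days produces two rows.
append_rows(conn, stamp_rows(sample, datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)))
append_rows(conn, stamp_rows(sample, datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)))

row_count = conn.execute("SELECT COUNT(*) FROM top_scorers").fetchone()[0]
latest = conn.execute("SELECT MAX(ingestion_time) FROM top_scorers").fetchone()[0]
print(row_count, latest)
```

Because every load adds rows instead of replacing them, queries that want only the current state need to filter on the most recent `ingestion_time`.
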
git clone https://github.com/yourusername/Premier-League-Data-Engineering-Project.git
cd Premier-League-Data-Engineering-Project
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
docker-compose up -d
Go to http://localhost:8080
SELECT Name, Club, COUNT(*) as goals
FROM `project.dataset.top_scorers`
WHERE ingestion_time >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
GROUP BY Name, Club
ORDER BY goals DESC
LIMIT 5;
Pull requests welcome. Submit issues or suggestions.
ZhenXIN
Data Engineer & Football Enthusiast ⚽
MIT License
