An ETL (Extract, Transform, Load) pipeline that extracts GDP data from Wikipedia, converts the values from millions to billions of USD, and stores the results in both a CSV file and a SQLite database.
- Extract: Write a data extraction function to retrieve GDP information from the Wikipedia URL
- Transform: Convert GDP information from 'Million USD' to 'Billion USD'
- Load: Store the transformed data in both CSV file and SQLite database formats
- URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29
- Source: Wikipedia - List of countries by GDP (nominal)
- CSV: `Countries_by_GDP.csv` - GDP data in CSV format
- Database: `World_Economies.db` - SQLite database with `Countries_by_GDP` table
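
The load step described above could look roughly like the following. This is a minimal sketch, not necessarily how `etl_gdp.py` implements it: the function names and signatures follow this README, while `index=False` and `if_exists="replace"` are assumptions about the intended behavior.

```python
import sqlite3

import pandas as pd


def load_to_csv(df, csv_path):
    """Write the DataFrame to a CSV file without the pandas index column."""
    df.to_csv(csv_path, index=False)


def load_to_db(df, sql_connection, table_name):
    """Write the DataFrame to a SQLite table.

    Assumption: an existing table of the same name is replaced on each run.
    """
    df.to_sql(table_name, sql_connection, if_exists="replace", index=False)


def run_query(query_statement, sql_connection):
    """Run a SQL query and return the result as a DataFrame."""
    return pd.read_sql(query_statement, sql_connection)
```

With a connection opened via `sqlite3.connect("World_Economies.db")`, calling `load_to_db(df, conn, "Countries_by_GDP")` followed by `run_query("SELECT * FROM Countries_by_GDP WHERE GDP_USD_billion >= 100", conn)` would return the economies at or above 100 billion USD.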
| Column | Description |
|---|---|
| Country | Country name |
| GDP_USD_billion | GDP value in billions USD (converted from millions) |
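
The conversion behind the `GDP_USD_billion` column is a division by 1,000. The sketch below illustrates it under two assumptions that this README does not confirm: the extracted column is named `GDP_USD_millions` (a hypothetical name), and the result is rounded to two decimal places.

```python
import pandas as pd


def transform(df):
    """Convert GDP from millions to billions USD.

    Assumptions: the input column is called GDP_USD_millions (hypothetical)
    and values are rounded to 2 decimals.
    """
    out = df.copy()
    # Coerce to numeric so stray non-numeric cells become NaN instead of raising.
    millions = pd.to_numeric(out["GDP_USD_millions"], errors="coerce")
    out["GDP_USD_billion"] = (millions / 1000).round(2)
    # Keep only the two columns listed in the output table.
    return out[["Country", "GDP_USD_billion"]]
```

For example, a row with 26,854,599 million USD becomes 26,854.6 billion USD.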
- `extract(url, table_attribs)` - Extracts GDP data from Wikipedia
- `transform(df)` - Converts GDP from millions to billions USD
- `load_to_csv(df, csv_path)` - Saves data to CSV file
- `load_to_db(df, sql_connection, table_name)` - Saves data to SQLite database
- `run_query(query_statement, sql_connection)` - Executes SQL queries
- `log_progress(message)` - Logs execution progress
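
As an illustration of the extraction step, here is one possible shape for `extract(url, table_attribs)`. It is a sketch under stated assumptions rather than the project's actual code: `table_attribs` is taken to be a list of two output column names, the data is assumed to sit in the first table with class `wikitable`, the country name in the first cell and the GDP figure (millions USD) in the third. The helper `parse_gdp_table` is hypothetical and exists only so the parsing can be exercised without a network call.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_gdp_table(html, table_attribs):
    """Parse (country, GDP in millions) pairs out of page HTML.

    Assumptions: first <table class="wikitable">, country in cell 0,
    GDP figure in cell 2, table_attribs is a list of two column names.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        return pd.DataFrame(columns=table_attribs)
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # header rows or rows that are too short
        country = cells[0].get_text(strip=True)
        gdp_text = cells[2].get_text(strip=True).replace(",", "")
        if not country or not gdp_text.isdigit():
            continue  # skip rows without a numeric figure
        records.append((country, int(gdp_text)))
    return pd.DataFrame(records, columns=table_attribs)


def extract(url, table_attribs):
    """Download the page and return the parsed GDP table as a DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return parse_gdp_table(response.text, table_attribs)
```

Splitting download from parsing keeps the network access in one place, so the parsing logic can be checked against a small HTML snippet.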
- Clone the repository
- Create virtual environment: `python -m venv venv`
- Activate virtual environment: `source venv/bin/activate` (Unix) or `venv\Scripts\activate` (Windows)
- Install dependencies: `pip install -r requirements.txt`
Run the ETL pipeline: `python etl_gdp.py`

- pandas - Data manipulation and analysis
- requests - HTTP library for web scraping
- beautifulsoup4 - HTML parsing
- numpy - Numerical computing
- sqlite3 - Database operations (built-in)