Production-grade ELT pipeline processing 8.85M healthcare provider records (9.9 GB) through a medallion architecture โ orchestrated with Airflow, modeled in dbt, and served from Snowflake.
| Dataset | NPPES National Provider Identifier Registry (CMS) |
| Scale | 8,850,000+ records ยท 9.9 GB raw data |
| Architecture | Medallion (Bronze โ Silver โ Gold) |
| Orchestration | Apache Airflow |
| Transformation | dbt Core (staging โ intermediate โ marts) |
| Warehouse | Snowflake |
| Storage | AWS S3 (data lake) |
| Processing | Polars + DuckDB |
| Code Quality | Ruff (pycodestyle, isort, flake8-bugbear, security rules) |
โโโโโโโโโโโโโโโโโโโโโโโ
โ CMS Public Data โ
โ (NPPES CSVs) โ
โโโโโโโโโโโโฌโโโโโโโโโโโ
โ Extract
โผ
โโโโโโโโโโโโโโโโโโโโโโโ
โ AWS S3 โ
โ Bronze Layer โ โ Raw data lake
โโโโโโโโโโโโฌโโโโโโโโโโโ
โ Polars + DuckDB cleaning
โผ
โโโโโโโโโโโโโโโโโโโโโโโ
โ AWS S3 โ
โ Silver Layer โ โ Cleaned Parquet files
โโโโโโโโโโโโฌโโโโโโโโโโโ
โ dbt Load
โผ
โโโโโโโโโโโโโโโโโโโโโโโ
โ Snowflake โ
โ Gold Layer โ โ Analytics-ready marts
โ โ
โ staging/ โ
โ โโ stg_providers โ
โ โโ stg_taxonomy โ
โ โโ stg_fips โ
โ โ
โ marts/ โ
โ โโ provider_summaryโ
โ โโ by_state โ
โ โโ by_county โ
โ โโ directory โ
โ โโ specialty_dist โ
โโโโโโโโโโโโฌโโโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโโโโโโโ
โ Airflow DAGs โ โ End-to-end orchestration
โ Analytics & โ
โ Visualizations โ
โโโโโโโโโโโโโโโโโโโโโโโ
- Extract โ Download raw NPPES data from CMS public registry
- Load โ Upload to AWS S3 Bronze layer (raw data lake)
- Clean โ Polars + DuckDB processing into Silver layer (Parquet)
- Transform โ dbt models build staging and mart layers in Snowflake
- Orchestrate โ Airflow DAGs manage end-to-end pipeline execution
- Analyze โ Query Gold layer for geographic and specialty insights
| Question | Why It Matters |
|---|---|
| How many providers operate in each state? | Healthcare access and capacity planning |
| What are the most common specialties by geography? | Identifying underserved specialties by region |
| Individual providers vs. organizations โ what's the split? | Market structure analysis |
| Which counties have the highest provider density? | Rural vs. urban healthcare access gaps |
- 8.85M providers across 54 states and territories, representing 842 unique specialties
- 78.3% individual providers vs. 21.7% organizations nationally
- Significant geographic concentration โ top 5 states account for ~40% of all providers
- Rural counties show dramatically lower provider-to-population ratios than urban CBSAs
| Model | Description |
|---|---|
stg_nppes_providers |
Cleaned provider records with standardized column names |
stg_taxonomy |
Healthcare specialty classification codes (NUCC) |
stg_fips |
Geographic reference data (state/county FIPS codes) |
| Model | Description |
|---|---|
mart_provider_summary |
Overall dataset statistics โ total providers, entity types, coverage |
mart_providers_by_state |
Provider counts and metrics aggregated by state |
mart_providers_by_county |
County-level provider density analysis |
mart_provider_directory |
Searchable provider directory with specialty info |
mart_specialty_distribution |
Medical specialty analysis across regions |
-- Top 10 states by provider count
SELECT
state,
total_providers,
individual_providers,
organization_providers
FROM mart_providers_by_state
ORDER BY total_providers DESC
LIMIT 10;
-- Provider directory search by specialty
SELECT
provider_name,
provider_type,
specialty_name,
city,
zip_code
FROM mart_provider_directory
WHERE state = 'CA'
AND specialty_classification = 'Physician'
LIMIT 20;
-- Overall dataset summary
SELECT * FROM mart_provider_summary;
Top 10 states by provider count โ California, Texas, and New York lead nationally.
Provider distribution across the top 15 states by county coverage.
Counties within Core-Based Statistical Areas (metropolitan regions).
NPPES/
โโโ data/
โ โโโ raw/ # Raw CSV files from NPPES
โ โโโ cleaned/ # Cleaned Parquet files (Silver layer)
โโโ nnpes/ # dbt project
โ โโโ models/
โ โ โโโ staging/ # 1:1 source transformations
โ โ โโโ marts/ # Business-ready analytics tables
โ โโโ dbt_project.yml
โ โโโ dev.duckdb
โโโ dags/ # Airflow DAG definitions
โโโ scripts/
โ โโโ clean_nppes.py # Polars data cleaning
โ โโโ upload_to_s3.py # S3 upload automation
โ โโโ visualize_data.py # Chart generation
โ โโโ explore_raw_data.py # Data exploration
โโโ src/
โ โโโ s3_services.py # AWS S3 utilities
โ โโโ my_logger.py # Custom logging
โโโ visualizations/ # Generated charts
โโโ query_examples.sql # Sample SQL queries
โโโ ruff.toml # Code quality configuration
โโโ requirements.txt
โโโ README.md
- Python 3.9+
- AWS account (for S3 features)
- Snowflake account
- ~10GB free disk space
# Clone the repo
git clone https://github.com/JingYou-data/NPPES.git
cd NPPES
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your AWS + Snowflake credentials
# Run data cleaning
python scripts/clean_nppes.py
# Run dbt transformations
cd nnpes
dbt run
dbt test- Incremental dbt models โ avoid reprocessing all 8.85M rows on each run
- Streamlit dashboard โ interactive provider density maps by county
- Geographic mapping โ Folium/Plotly choropleth showing provider access gaps
- Great Expectations โ comprehensive data quality validation beyond dbt tests
- Docker containerization โ reproducible local development environment
| Source | URL |
|---|---|
| NPPES Registry | CMS National Provider Identifier Registry |
| Taxonomy Codes | NUCC Healthcare Provider Taxonomy |
| FIPS Codes | US Census Bureau |
Jing You โ Data Analytics & Engineering
Production-grade ELT pipeline ยท NPPES CMS Public Data ยท Modern Data Stack