Skip to content

JingYou-data/NPPES

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

3 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿฅ NPPES Healthcare Provider Analytics

Production-grade ELT pipeline processing 8.85M healthcare provider records (9.9 GB) through a medallion architecture โ€” orchestrated with Airflow, modeled in dbt, and served from Snowflake.


๐Ÿ“Œ Project Overview

Dataset NPPES National Provider Identifier Registry (CMS)
Scale 8,850,000+ records ยท 9.9 GB raw data
Architecture Medallion (Bronze โ†’ Silver โ†’ Gold)
Orchestration Apache Airflow
Transformation dbt Core (staging โ†’ intermediate โ†’ marts)
Warehouse Snowflake
Storage AWS S3 (data lake)
Processing Polars + DuckDB
Code Quality Ruff (pycodestyle, isort, flake8-bugbear, security rules)

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   CMS Public Data   โ”‚
โ”‚   (NPPES CSVs)      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
           โ”‚ Extract
           โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     AWS S3          โ”‚
โ”‚   Bronze Layer      โ”‚  โ† Raw data lake
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
           โ”‚ Polars + DuckDB cleaning
           โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     AWS S3          โ”‚
โ”‚   Silver Layer      โ”‚  โ† Cleaned Parquet files
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
           โ”‚ dbt Load
           โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     Snowflake       โ”‚
โ”‚   Gold Layer        โ”‚  โ† Analytics-ready marts
โ”‚                     โ”‚
โ”‚  staging/           โ”‚
โ”‚  โ”œโ”€ stg_providers   โ”‚
โ”‚  โ”œโ”€ stg_taxonomy    โ”‚
โ”‚  โ””โ”€ stg_fips        โ”‚
โ”‚                     โ”‚
โ”‚  marts/             โ”‚
โ”‚  โ”œโ”€ provider_summaryโ”‚
โ”‚  โ”œโ”€ by_state        โ”‚
โ”‚  โ”œโ”€ by_county       โ”‚
โ”‚  โ”œโ”€ directory       โ”‚
โ”‚  โ””โ”€ specialty_dist  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
           โ”‚
           โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Airflow DAGs       โ”‚  โ† End-to-end orchestration
โ”‚  Analytics &        โ”‚
โ”‚  Visualizations     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Data Flow

  1. Extract โ€” Download raw NPPES data from CMS public registry
  2. Load โ€” Upload to AWS S3 Bronze layer (raw data lake)
  3. Clean โ€” Polars + DuckDB processing into Silver layer (Parquet)
  4. Transform โ€” dbt models build staging and mart layers in Snowflake
  5. Orchestrate โ€” Airflow DAGs manage end-to-end pipeline execution
  6. Analyze โ€” Query Gold layer for geographic and specialty insights

๐Ÿ” Business Questions Answered

Question Why It Matters
How many providers operate in each state? Healthcare access and capacity planning
What are the most common specialties by geography? Identifying underserved specialties by region
Individual providers vs. organizations โ€” what's the split? Market structure analysis
Which counties have the highest provider density? Rural vs. urban healthcare access gaps

๐Ÿ“Š Key Insights

  • 8.85M providers across 54 states and territories, representing 842 unique specialties
  • 78.3% individual providers vs. 21.7% organizations nationally
  • Significant geographic concentration โ€” top 5 states account for ~40% of all providers
  • Rural counties show dramatically lower provider-to-population ratios than urban CBSAs

๐Ÿงฎ dbt Data Models

Staging Layer (1:1 with sources)

Model Description
stg_nppes_providers Cleaned provider records with standardized column names
stg_taxonomy Healthcare specialty classification codes (NUCC)
stg_fips Geographic reference data (state/county FIPS codes)

Mart Layer (Business Logic)

Model Description
mart_provider_summary Overall dataset statistics โ€” total providers, entity types, coverage
mart_providers_by_state Provider counts and metrics aggregated by state
mart_providers_by_county County-level provider density analysis
mart_provider_directory Searchable provider directory with specialty info
mart_specialty_distribution Medical specialty analysis across regions

๐Ÿ“ Sample Queries

-- Top 10 states by provider count
SELECT
    state,
    total_providers,
    individual_providers,
    organization_providers
FROM mart_providers_by_state
ORDER BY total_providers DESC
LIMIT 10;

-- Provider directory search by specialty
SELECT
    provider_name,
    provider_type,
    specialty_name,
    city,
    zip_code
FROM mart_provider_directory
WHERE state = 'CA'
  AND specialty_classification = 'Physician'
LIMIT 20;

-- Overall dataset summary
SELECT * FROM mart_provider_summary;

๐Ÿ“ˆ Visualizations

Geographic Distribution

State Distribution Top 10 states by provider count โ€” California, Texas, and New York lead nationally.

County Analysis

Counties by State Provider distribution across the top 15 states by county coverage.

Metropolitan Coverage

CBSA Coverage Counties within Core-Based Statistical Areas (metropolitan regions).


๐Ÿ“ Repository Structure

NPPES/
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ raw/                        # Raw CSV files from NPPES
โ”‚   โ””โ”€โ”€ cleaned/                    # Cleaned Parquet files (Silver layer)
โ”œโ”€โ”€ nnpes/                          # dbt project
โ”‚   โ”œโ”€โ”€ models/
โ”‚   โ”‚   โ”œโ”€โ”€ staging/                # 1:1 source transformations
โ”‚   โ”‚   โ””โ”€โ”€ marts/                  # Business-ready analytics tables
โ”‚   โ”œโ”€โ”€ dbt_project.yml
โ”‚   โ””โ”€โ”€ dev.duckdb
โ”œโ”€โ”€ dags/                           # Airflow DAG definitions
โ”œโ”€โ”€ scripts/
โ”‚   โ”œโ”€โ”€ clean_nppes.py              # Polars data cleaning
โ”‚   โ”œโ”€โ”€ upload_to_s3.py             # S3 upload automation
โ”‚   โ”œโ”€โ”€ visualize_data.py           # Chart generation
โ”‚   โ””โ”€โ”€ explore_raw_data.py         # Data exploration
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ s3_services.py              # AWS S3 utilities
โ”‚   โ””โ”€โ”€ my_logger.py                # Custom logging
โ”œโ”€โ”€ visualizations/                 # Generated charts
โ”œโ”€โ”€ query_examples.sql              # Sample SQL queries
โ”œโ”€โ”€ ruff.toml                       # Code quality configuration
โ”œโ”€โ”€ requirements.txt
โ””โ”€โ”€ README.md

๐Ÿš€ Quick Start

Prerequisites

  • Python 3.9+
  • AWS account (for S3 features)
  • Snowflake account
  • ~10GB free disk space
# Clone the repo
git clone https://github.com/JingYou-data/NPPES.git
cd NPPES

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your AWS + Snowflake credentials

# Run data cleaning
python scripts/clean_nppes.py

# Run dbt transformations
cd nnpes
dbt run
dbt test

๐Ÿš€ Future Work

  • Incremental dbt models โ€” avoid reprocessing all 8.85M rows on each run
  • Streamlit dashboard โ€” interactive provider density maps by county
  • Geographic mapping โ€” Folium/Plotly choropleth showing provider access gaps
  • Great Expectations โ€” comprehensive data quality validation beyond dbt tests
  • Docker containerization โ€” reproducible local development environment

๐Ÿ“ฆ Data Sources

Source URL
NPPES Registry CMS National Provider Identifier Registry
Taxonomy Codes NUCC Healthcare Provider Taxonomy
FIPS Codes US Census Bureau

๐Ÿ‘ค Author

Jing You โ€” Data Analytics & Engineering LinkedIn GitHub Portfolio


Production-grade ELT pipeline ยท NPPES CMS Public Data ยท Modern Data Stack

Releases

No releases published

Packages

 
 
 

Contributors

Languages