@@ -0,0 +1,76 @@
# WHO Tuberculosis Estimated Incidence Rate Import

## Overview
This dataset contains national-level statistics for the Estimated Tuberculosis Incidence Rate (per 100,000 population).
Specifically, it provides incidence rates for two categories:
- Overall TB incidence
- HIV-positive TB incidence

The generated statistical variables capture the incidence rate for these conditions.
Examples of the statvars generated:
- `dcid:Count_MedicalConditionIncident_ConditionTuberculosis_AsAFractionOf_Count_Person`
- `dcid:Count_MedicalConditionIncident_ConditionTuberculosisAndHIV_AsAFractionOf_Count_Person`
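
The processor emits full StatisticalVariable nodes for these dcids. As an illustrative sketch only (the authoritative definitions are in the generated `tuberculosis_estimated_incidence_rate_output_stat_vars.mcf`; the property names below follow common Data Commons conventions and are not copied from this import's output), the overall-incidence variable might look like:

```
Node: dcid:Count_MedicalConditionIncident_ConditionTuberculosis_AsAFractionOf_Count_Person
typeOf: dcid:StatisticalVariable
populationType: dcid:MedicalConditionIncident
medicalCondition: dcid:Tuberculosis
measuredProperty: dcid:count
statType: dcid:measuredValue
measurementDenominator: dcid:Count_Person
```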

- **Type of place:** Country, Region (M49 codes), WHO Region, Overseas Territory, Special Administrative Region
- **Years:** 2000 to 2024
- **Place resolution:** Resolved to DCIDs (e.g., `dcid:country/FRA`, `dcid:country/IND`)
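
For plain countries, the dcid follows mechanically from the ISO3 code joined in by the download script. A minimal sketch (the helper name is hypothetical; WHO regions, M49 regions, and territories need their own lookups and are not covered here):

```python
def iso3_to_dcid(iso3: str) -> str:
    """Map an ISO3 country code to a Data Commons country dcid, e.g. 'FRA' -> 'country/FRA'."""
    return f"country/{iso3.strip().upper()}"


print(iso3_to_dcid("FRA"))  # country/FRA
```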

## Data Source
**Source URL:**
https://data.who.int/indicators/i/EB68992/2674B39

**Provenance Description:**
The data comes from the World Health Organization (WHO) master database and the public API. It tracks the estimated TB incidence rate globally (Indicator ID: `EB689922674B39`).
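
The API call made by `tb_data_download.py` is a simple OData query filtered to this indicator. A stdlib-only sketch of how that request URL is assembled (`build_tb_query` is a hypothetical helper, not part of the import):

```python
from urllib.parse import urlencode

API_URL = "https://xmart-api-public.who.int/DATA_/RELAY_TB_DATA"


def build_tb_query(indicator_id: str = "EB689922674B39") -> str:
    """Build the OData URL that filters the WHO dataset to one indicator, as CSV."""
    params = {"$filter": f"IND_ID eq '{indicator_id}'", "$format": "csv"}
    return f"{API_URL}?{urlencode(params)}"


print(build_tb_query())
```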

## Refresh Type
Automatic Refresh

To refresh the data, the import includes a Python script (`tb_data_download.py`) that automatically fetches the data from the WHO API, joins it with ISO3 geographic identifiers, and saves the formatted CSV.

## Data Publish Frequency
Release Frequency = Annual

## How To Download Input Data
To download the data, run the provided script:
```bash
python3 tb_data_download.py
```
This fetches the latest full dataset, joins in the ISO3 codes, and saves it locally as `input_files/Estimated_incidence_rate_per_100_000_population.csv`, making it available for StatVar processing.
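
A quick sanity check on the downloaded file is to confirm its header matches the columns the script writes (column list taken from `tb_data_download.py`; `missing_columns` is a hypothetical helper):

```python
import csv

# Column order produced by tb_data_download.py.
EXPECTED_COLUMNS = ["IND_ID", "INDICATOR_NAME", "YEAR", "COUNTRY", "iso3", "DISAGGR_1", "VALUE"]


def missing_columns(header: list) -> list:
    """Return the expected columns that are absent from a CSV header row."""
    return [c for c in EXPECTED_COLUMNS if c not in header]


# Usage, assuming the download has already run:
# with open("input_files/Estimated_incidence_rate_per_100_000_population.csv", newline="") as f:
#     print(missing_columns(next(csv.reader(f))))  # [] when the header is intact
```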

## Processing Instructions
To process the WHO Tuberculosis Incidence Rate data and generate statistical variables, use the following command from the import directory:

**For Data Run**
```bash
python3 ../../../tools/statvar_importer/stat_var_processor.py \
--input_data=input_files/* \
--pv_map=tuberculosis_estimated_incidence_rate_pvmap.csv \
--output_path=tuberculosis_estimated_incidence_rate_output \
--config_file=tuberculosis_estimated_incidence_rate_metadata.csv
```

This generates the following output files:
- tuberculosis_estimated_incidence_rate_output.csv
- tuberculosis_estimated_incidence_rate_output_stat_vars_schema.mcf
- tuberculosis_estimated_incidence_rate_output_stat_vars.mcf
- tuberculosis_estimated_incidence_rate_output.tmcf

**For Data Quality Checks and Validation**
Data validation is performed with the `lint` command of the Data Commons import tool.

```bash
java -jar datacommons-import-tool-0.1-jar-with-dependencies.jar lint \
  tuberculosis_estimated_incidence_rate_output_stat_vars_schema.mcf \
  tuberculosis_estimated_incidence_rate_output.csv \
  tuberculosis_estimated_incidence_rate_output.tmcf \
  tuberculosis_estimated_incidence_rate_output_stat_vars.mcf
```

This generates the following output files:
- report.json
- summary_report.csv
- summary_report.html

The report files can be analyzed to check for errors and warnings in the generated output before the import is submitted.
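
As a quick, structure-agnostic first pass over the report (a sketch only; the exact `report.json` schema is defined by the import tool and deliberately not assumed here), one can count error mentions anywhere in the file:

```python
import json


def report_mentions(path: str, needle: str = "ERROR") -> int:
    """Count occurrences of `needle` anywhere in the serialized report,
    without assuming anything about the report's internal schema."""
    with open(path) as f:
        return json.dumps(json.load(f)).count(needle)
```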

## Testing
Testing is performed using the provided `test_data` directory:
- Input: `test_data/tuberculosis_estimated_incidence_rate_input.csv`
- Output (expected): `test_data/tuberculosis_estimated_incidence_rate_output.csv`
- MCF (expected): `test_data/tuberculosis_estimated_incidence_rate_output.tmcf`
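
A fresh run can be compared against the expected files with a simple row-level diff (a sketch; `csvs_match` is a hypothetical helper, and the repository's test harness may perform a more detailed comparison):

```python
import csv


def csv_rows(path: str) -> list:
    """Read a CSV file into a list of rows."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def csvs_match(actual_path: str, expected_path: str) -> bool:
    """True when the generated output matches the golden file row for row."""
    return csv_rows(actual_path) == csv_rows(expected_path)
```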
@@ -0,0 +1,22 @@
{
  "import_specifications": [
    {
      "import_name": "WHO_TuberculosisEstimatedIncidenceRate",
      "curator_emails": ["support@datacommons.org"],
      "provenance_url": "https://data.who.int/indicators/i/EB68992/2674B39",
      "provenance_description": "Estimated number of new episodes of TB cases arising in a given year per 100 000 population.",
      "scripts": [
        "tb_data_download.py",
        "../../tools/statvar_importer/stat_var_processor.py --input_data=gs://unresolved_mcf/who/TB_Estimated_Incidence_Rate/input_files/* --pv_map=gs://unresolved_mcf/who/TB_Estimated_Incidence_Rate/tuberculosis_estimated_incidence_rate_pvmap.csv --config_file=gs://unresolved_mcf/who/TB_Estimated_Incidence_Rate/tuberculosis_estimated_incidence_rate_metadata.csv --output_path=gs://unresolved_mcf/who/TB_Estimated_Incidence_Rate/tuberculosis_estimated_incidence_rate_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf"
      ],
      "import_inputs": [
        {
          "template_mcf": "tuberculosis_estimated_incidence_rate_output.tmcf",
          "cleaned_csv": "tuberculosis_estimated_incidence_rate_output.csv"
        }
      ],
      "source_files": ["input_files/*.csv"],
      "cron_schedule": "15 22 15 12 *"
    }
  ]
}

@@ -0,0 +1,57 @@
import io
import os

import pandas as pd
import requests


def download_who_data():
    # 1. Fetch the indicator data from the WHO public API, filtered to this
    #    indicator ID and returned as CSV.
    api_url = "https://xmart-api-public.who.int/DATA_/RELAY_TB_DATA"
    params = {
        "$filter": "IND_ID eq 'EB689922674B39'",
        "$format": "csv",
    }

    print("1. Fetching estimated incidence data from WHO API...")
    api_response = requests.get(api_url, params=params, timeout=60)

    if api_response.status_code != 200:
        print(f"Failed to fetch API data. HTTP {api_response.status_code}")
        return

    # Load the API data into a pandas DataFrame.
    api_df = pd.read_csv(io.StringIO(api_response.text))

    # 2. Get ONLY the iso3 code from the master database.
    print("2. Fetching country iso3 codes from WHO master database...")
    master_url = "https://extranet.who.int/tme/generateCSV.asp?ds=notifications"

    # We only pull the 'country' (for matching) and 'iso3' columns.
    geo_columns = ['country', 'iso3']
    master_df = pd.read_csv(master_url, usecols=geo_columns).drop_duplicates()

    # 3. Merge the two datasets on the country name.
    print("3. Merging data and formatting...")
    # The API uses uppercase 'COUNTRY'; the master database uses lowercase 'country'.
    merged_df = pd.merge(api_df, master_df, left_on='COUNTRY', right_on='country', how='left')

    # Drop the duplicate lowercase 'country' column used for joining.
    merged_df = merged_df.drop(columns=['country'])

    # Reorder columns so the iso3 code sits right next to the country name.
    final_columns = [
        'IND_ID', 'INDICATOR_NAME', 'YEAR', 'COUNTRY', 'iso3', 'DISAGGR_1', 'VALUE'
    ]
    merged_df = merged_df[final_columns]

    # 4. Save to CSV in the input_files folder.
    output_dir = "input_files"
    filename = os.path.join(output_dir, "Estimated_incidence_rate_per_100_000_population.csv")

    os.makedirs(output_dir, exist_ok=True)

    # Save without the pandas index column.
    merged_df.to_csv(filename, index=False)
    print(f"Success! Data saved locally as '{filename}'")


if __name__ == "__main__":
    download_who_data()