@@ -0,0 +1,76 @@
# WHO Tuberculosis Estimated Incidence Rate Import

## Overview
This dataset contains national-level statistics for the Estimated Tuberculosis Incidence Rate (per 100,000 population).
Specifically, it provides incidence rates for two categories:
- Overall TB incidence
- HIV-positive TB incidence

The generated statistical variables capture the incidence rate for these conditions.
Examples of the statvars generated:
- `dcid:Count_MedicalConditionIncident_ConditionTuberculosis_AsAFractionOf_Count_Person`
- `dcid:Count_MedicalConditionIncident_ConditionTuberculosisAndHIV_AsAFractionOf_Count_Person`
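
The processor emits full StatisticalVariable nodes for these dcids. As an illustrative sketch only (the authoritative definitions are in the generated `tuberculosis_estimated_incidence_rate_output_stat_vars.mcf`; the property names below follow common Data Commons conventions and are not copied from this import's output), the overall-incidence variable might look like:

```
Node: dcid:Count_MedicalConditionIncident_ConditionTuberculosis_AsAFractionOf_Count_Person
typeOf: dcid:StatisticalVariable
populationType: dcid:MedicalConditionIncident
medicalCondition: dcid:Tuberculosis
measuredProperty: dcid:count
statType: dcid:measuredValue
measurementDenominator: dcid:Count_Person
```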

- **Type of place:** Country, Region (M49 codes), WHO Region, Overseas Territory, Special Administrative Region
- **Years:** 2000 to 2024
- **Place resolution:** Resolved to DCIDs (e.g., `dcid:country/FRA`, `dcid:country/IND`)
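
For plain countries, the dcid follows mechanically from the ISO3 code joined in by the download script. A minimal sketch (the helper name is hypothetical; WHO regions, M49 regions, and territories need their own lookups and are not covered here):

```python
def iso3_to_dcid(iso3: str) -> str:
    """Map an ISO3 country code to a Data Commons country dcid, e.g. 'FRA' -> 'country/FRA'."""
    return f"country/{iso3.strip().upper()}"


print(iso3_to_dcid("FRA"))  # country/FRA
```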

## Data Source
**Source URL:**
https://data.who.int/indicators/i/EB68992/2674B39

**Provenance Description:**
The data comes from the World Health Organization (WHO) master database and the public API. It tracks the estimated TB incidence rate globally (Indicator ID: `EB689922674B39`).
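
The API call made by `tb_data_download.py` is a simple OData query filtered to this indicator. A stdlib-only sketch of how that request URL is assembled (`build_tb_query` is a hypothetical helper, not part of the import):

```python
from urllib.parse import urlencode

API_URL = "https://xmart-api-public.who.int/DATA_/RELAY_TB_DATA"


def build_tb_query(indicator_id: str = "EB689922674B39") -> str:
    """Build the OData URL that filters the WHO dataset to one indicator, as CSV."""
    params = {"$filter": f"IND_ID eq '{indicator_id}'", "$format": "csv"}
    return f"{API_URL}?{urlencode(params)}"


print(build_tb_query())
```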

## Refresh Type
Automatic Refresh

To refresh the data, the import includes a Python script (`tb_data_download.py`) that automatically fetches the data from the WHO API, joins it with ISO3 geographic identifiers, and saves the formatted CSV.

## Data Publish Frequency
Release Frequency = Annual

## How To Download Input Data
To download the data, run the provided script:
```bash
python3 tb_data_download.py
```
This fetches the latest full dataset, joins in the ISO3 codes, and saves it locally as `input_files/Estimated_incidence_rate_per_100_000_population.csv`, making it available for StatVar processing.
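
A quick sanity check on the downloaded file is to confirm its header matches the columns the script writes (column list taken from `tb_data_download.py`; `missing_columns` is a hypothetical helper):

```python
import csv

# Column order produced by tb_data_download.py.
EXPECTED_COLUMNS = ["IND_ID", "INDICATOR_NAME", "YEAR", "COUNTRY", "iso3", "DISAGGR_1", "VALUE"]


def missing_columns(header: list) -> list:
    """Return the expected columns that are absent from a CSV header row."""
    return [c for c in EXPECTED_COLUMNS if c not in header]


# Usage, assuming the download has already run:
# with open("input_files/Estimated_incidence_rate_per_100_000_population.csv", newline="") as f:
#     print(missing_columns(next(csv.reader(f))))  # [] when the header is intact
```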

## Processing Instructions
To process the WHO Tuberculosis Incidence Rate data and generate statistical variables, use the following command from the import directory:

**For Data Run**
```bash
python3 ../../../tools/statvar_importer/stat_var_processor.py \
--input_data=input_files/* \
--pv_map=tuberculosis_estimated_incidence_rate_pvmap.csv \
--output_path=tuberculosis_estimated_incidence_rate_output \
--config_file=tuberculosis_estimated_incidence_rate_metadata.csv
```

This generates the following output files:
- tuberculosis_estimated_incidence_rate_output.csv
- tuberculosis_estimated_incidence_rate_output_stat_vars_schema.mcf
- tuberculosis_estimated_incidence_rate_output_stat_vars.mcf
- tuberculosis_estimated_incidence_rate_output.tmcf

**For Data Quality Checks and Validation**
Data validation is performed with the `lint` command of the Data Commons import tool.

```bash
java -jar datacommons-import-tool-0.1-jar-with-dependencies.jar lint \
  tuberculosis_estimated_incidence_rate_output_stat_vars_schema.mcf \
  tuberculosis_estimated_incidence_rate_output.csv \
  tuberculosis_estimated_incidence_rate_output.tmcf \
  tuberculosis_estimated_incidence_rate_output_stat_vars.mcf
```

This generates the following output files:
- report.json
- summary_report.csv
- summary_report.html

The report files can be analyzed to check for errors and warnings in the generated output before the import is submitted.
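
As a quick, structure-agnostic first pass over the report (a sketch only; the exact `report.json` schema is defined by the import tool and deliberately not assumed here), one can count error mentions anywhere in the file:

```python
import json


def report_mentions(path: str, needle: str = "ERROR") -> int:
    """Count occurrences of `needle` anywhere in the serialized report,
    without assuming anything about the report's internal schema."""
    with open(path) as f:
        return json.dumps(json.load(f)).count(needle)
```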

## Testing
Testing is performed using the provided `test_data` directory:
- Input: `test_data/tuberculosis_estimated_incidence_rate_input.csv`
- Output (expected): `test_data/tuberculosis_estimated_incidence_rate_output.csv`
- MCF (expected): `test_data/tuberculosis_estimated_incidence_rate_output.tmcf`
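
A fresh run can be compared against the expected files with a simple row-level diff (a sketch; `csvs_match` is a hypothetical helper, and the repository's test harness may perform a more detailed comparison):

```python
import csv


def csv_rows(path: str) -> list:
    """Read a CSV file into a list of rows."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def csvs_match(actual_path: str, expected_path: str) -> bool:
    """True when the generated output matches the golden file row for row."""
    return csv_rows(actual_path) == csv_rows(expected_path)
```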
@@ -0,0 +1,22 @@
{
  "import_specifications": [
    {
      "import_name": "WHO_TuberculosisEstimatedIncidenceRate",
      "curator_emails": ["support@datacommons.org"],
      "provenance_url": "https://data.who.int/indicators/i/EB68992/2674B39",
      "provenance_description": "Estimated number of new episodes of TB cases arising in a given year per 100 000 population.",
      "scripts": [
        "tb_data_download.py",
        "../../tools/statvar_importer/stat_var_processor.py --input_data=gs://unresolved_mcf/who/TB_Estimated_Incidence_Rate/input_files/* --pv_map=gs://unresolved_mcf/who/TB_Estimated_Incidence_Rate/tuberculosis_estimated_incidence_rate_pvmap.csv --config_file=gs://unresolved_mcf/who/TB_Estimated_Incidence_Rate/tuberculosis_estimated_incidence_rate_metadata.csv --output_path=gs://unresolved_mcf/who/TB_Estimated_Incidence_Rate/tuberculosis_estimated_incidence_rate_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf"
      ],
      "import_inputs": [
        {
          "template_mcf": "tuberculosis_estimated_incidence_rate_output.tmcf",
          "cleaned_csv": "tuberculosis_estimated_incidence_rate_output.csv"
        }
      ],
      "source_files": ["input_files/*.csv"],
      "cron_schedule": "15 22 15 12 *"
    }
  ]
}

@@ -0,0 +1,57 @@
import io
import os

import pandas as pd
import requests


def download_who_data():
    # 1. Fetch the indicator data from the WHO public API, filtered to this
    #    indicator ID and returned as CSV.
    api_url = "https://xmart-api-public.who.int/DATA_/RELAY_TB_DATA"
    params = {
        "$filter": "IND_ID eq 'EB689922674B39'",
        "$format": "csv",
    }

    print("1. Fetching estimated incidence data from WHO API...")
    api_response = requests.get(api_url, params=params, timeout=60)

    if api_response.status_code != 200:
        print(f"Failed to fetch API data. HTTP {api_response.status_code}")
        return

    # Load the API data into a pandas DataFrame.
    api_df = pd.read_csv(io.StringIO(api_response.text))

    # 2. Get ONLY the iso3 code from the master database.
    print("2. Fetching country iso3 codes from WHO master database...")
    master_url = "https://extranet.who.int/tme/generateCSV.asp?ds=notifications"

    # We only pull the 'country' (for matching) and 'iso3' columns.
    geo_columns = ['country', 'iso3']
    master_df = pd.read_csv(master_url, usecols=geo_columns).drop_duplicates()

    # 3. Merge the two datasets on the country name.
    print("3. Merging data and formatting...")
    # The API uses uppercase 'COUNTRY'; the master database uses lowercase 'country'.
    merged_df = pd.merge(api_df, master_df, left_on='COUNTRY', right_on='country', how='left')

    # Drop the duplicate lowercase 'country' column used for joining.
    merged_df = merged_df.drop(columns=['country'])

    # Reorder columns so the iso3 code sits right next to the country name.
    final_columns = [
        'IND_ID', 'INDICATOR_NAME', 'YEAR', 'COUNTRY', 'iso3', 'DISAGGR_1', 'VALUE'
    ]
    merged_df = merged_df[final_columns]

    # 4. Save to CSV in the input_files folder.
    output_dir = "input_files"
    filename = os.path.join(output_dir, "Estimated_incidence_rate_per_100_000_population.csv")

    os.makedirs(output_dir, exist_ok=True)

    # Save without the pandas index column.
    merged_df.to_csv(filename, index=False)
    print(f"Success! Data saved locally as '{filename}'")


if __name__ == "__main__":
    download_who_data()