Team ID: UIDAI 4732 | Hackathon 2026
📊 Live Power BI Dashboard:
View Live Interactive Dashboard
Birla Institute of Technology, Mesra | Department of Quantitative Economics and Data Science
| Name | ID |
|---|---|
| Rounak Kumar | IED/10026/22 |
| Dhruv | IED/10017/22 |
| Apurva Mishra | IED/10024/22 |
This project processes approximately 44 lakh records of Aadhaar Enrolment and Update datasets provided by the National Informatics Centre (NIC).
The raw data was fragmented across multiple split CSVs with inconsistent schemas, noisy geographic identifiers, and duplicates. Our solution consolidates this into a single, analysis-ready source of truth and visualizes it via an interactive Power BI dashboard to identify operational gaps, regional disparities, and lifecycle transition trends.
The raw Aadhaar operational datasets present several challenges that hinder direct analysis:
- Schema Inconsistencies: Split files (Biometric, Demographic, Enrolment) have different column structures.
- Noisy Geography: State and District names contain spelling variants (e.g., "Orissa" vs "Odisha", "Cuddapah" vs "YSR"), special characters, and casing issues.
- Lack of Metrics: Raw data provides counts but lacks performance indicators like "Growth Rate" or "Transition Continuity."
- Duplication: Repeated records exist at identical reporting granularities.
Objective: Construct a unified pipeline to clean, standardize, and enrich the data for district-level decision-making.
We implemented a 7-step ETL (Extract, Transform, Load) pipeline using Python (Pandas) and Power Query.
```mermaid
graph TD
    A[Raw CSV Ingestion] -->|Concat Split Files| B(Schema Alignment)
    B -->|Standardize Age Cols| C{Consolidation}
    C --> D[Geographic Cleaning]
    D -->|Regex & Mapping| E[Aggregation & Deduping]
    E --> F[Feature Engineering]
    F --> G[Final Dashboard Model]
```
- Read split CSVs from:
- Biometric folder
- Demographic folder
- Enrolment folder
Columns renamed to canonical formats:
- `age_0_5` (Bal Aadhaar)
- `age_5_17` (Mandatory Biometric Updates)
- `age_18_greater` (Adult Updates)
- Merged all sources into a master staging table.
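As a sketch, the ingestion, schema-alignment, and consolidation steps might look like the following. The folder names, raw column headers, and the `load_split_csvs` helper are illustrative assumptions, not the exact pipeline code:

```python
from pathlib import Path

import pandas as pd

# Raw headers assumed from the split files; adjust to the real schema.
RENAME_MAP = {
    "Bal Aadhaar": "age_0_5",
    "Mandatory Biometric Updates": "age_5_17",
    "Adult Updates": "age_18_greater",
}

def load_split_csvs(folders):
    """Read every CSV under each folder and align headers to the canonical schema."""
    frames = []
    for folder in folders:
        for csv_path in sorted(Path(folder).glob("*.csv")):
            part = pd.read_csv(csv_path).rename(columns=RENAME_MAP)
            part["source"] = folder  # keep provenance for auditing
            frames.append(part)
    # sort=False preserves column order; columns missing in a source become NaN
    return pd.concat(frames, ignore_index=True, sort=False)

# master = load_split_csvs(["biometric", "demographic", "enrolment"])
```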
- Removed numeric junk.
- Fixed casing issues involving "and".
- Mapped common variants (e.g., `&` to `and`).
- Applied a comprehensive correction dictionary to map legacy names to current administrative districts.
- Example: `Gurgaon` to `Gurugram`.
- Grouped by:
- Date
- State
- District
- Pincode
- Purpose: remove duplicate records.
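A minimal sketch of the dedup-and-aggregate step, using toy rows. The sample values and the duplicate-handling rule shown (drop identical repeats, then sum what remains per group) are assumptions:

```python
import pandas as pd

# Two identical rows at the same (date, state, district, pincode) granularity.
daily = pd.DataFrame({
    "date": ["2025-01-01"] * 2,
    "state": ["odisha"] * 2,
    "district": ["cuttack"] * 2,
    "pincode": [753001] * 2,
    "age_0_5": [3, 3],
    "age_5_17": [10, 10],
    "age_18_greater": [25, 25],
})

# Drop exact repeats first, then collapse to one row per reporting key.
deduped = daily.drop_duplicates()
agg = (
    deduped
    .groupby(["date", "state", "district", "pincode"], as_index=False)
    [["age_0_5", "age_5_17", "age_18_greater"]]
    .sum()
)
```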
- Calculated daily growth metrics.
- Derived lifecycle and transition ratios.
- Merged district-level performance bands back into the daily aggregated view.
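The daily growth metric can be sketched with a per-district `pct_change`; the column names and sample values here are assumptions:

```python
import pandas as pd

# Synthetic daily totals for one district.
df = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"]),
    "district": ["cuttack"] * 3,
    "total_updates": [100, 120, 90],
})

df = df.sort_values(["district", "date"])
# Day-over-day growth within each district; the first day has no prior value (NaN).
df["growth_rate"] = df.groupby("district")["total_updates"].pct_change()
```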
A significant portion of the effort focused on cleaning dirty text fields using vectorized string operations.

```python
df["state"] = (
    df["state"]
    .astype("string")
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.replace("&", "and", regex=False)
    .str.lower()
)

# Map legacy names to current administrative names.
correction_map = {
    "orissa": "odisha",
    "pondicherry": "puducherry",
    "allahabad": "prayagraj",
    "gurgaon": "gurugram",
    "cuddapah": "ysr",
}
df["district"] = df["district"].replace(correction_map)
```

Derived KPIs to evaluate district-level performance:
| Metric | Formula | Purpose |
|---|---|---|
| Total Updates | `age_0_5 + age_5_17 + age_18_greater` | Primary workload measure |
| Zero Activity Flag | `IF(total_updates == 0, 1, 0)` | Identifies service interruptions |
| Transition Ratio | `total_adult_updates / (total_child_updates + 1)` | Measures lifecycle continuity |
| Priority Index | `Transition Ratio < 1.5 OR Zero Days > 5` | Flags districts needing attention |
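The KPI formulas above translate to pandas roughly as follows; the sample values and the `zero_days` running count are illustrative assumptions:

```python
import pandas as pd

# Two illustrative district-day rows: one active, one with zero activity.
df = pd.DataFrame({
    "age_0_5": [4, 0],
    "age_5_17": [16, 0],
    "age_18_greater": [40, 0],
})

df["total_updates"] = df["age_0_5"] + df["age_5_17"] + df["age_18_greater"]
df["zero_activity_flag"] = (df["total_updates"] == 0).astype(int)

# Child updates combine the two minor bands; the +1 guards against division by zero.
child = df["age_0_5"] + df["age_5_17"]
df["transition_ratio"] = df["age_18_greater"] / (child + 1)

df["zero_days"] = [2, 7]  # assumed running count of zero-activity days
df["high_priority"] = (df["transition_ratio"] < 1.5) | (df["zero_days"] > 5)
```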
The Power BI solution is divided into two analytical views.
Goal
High-level monitoring of national and state trends.
Visuals
- Choropleth map of update activity by state
- Monthly growth rate trends
- Activity status distribution (Increasing vs Declining)
Goal
Deep dive into district-level operational gaps.
Visuals
- Priority list of districts flagged as High Priority
- Transition band donut chart (Low, Moderate, High continuity)
- Zero activity tracker for frequent zero-reporting districts
- Replace CSV dumps with direct UIDAI API integration.
- Use Isolation Forest to detect sudden drops in enrolment packets.
- Correlate Aadhaar saturation with Census 2021 and 2026 population data to estimate remaining demand.
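The proposed anomaly detection could be prototyped with scikit-learn's `IsolationForest`; the synthetic packet counts, injected drop, and `contamination` setting are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic daily enrolment-packet counts: mostly stable, one sudden drop.
rng = np.random.default_rng(42)
counts = rng.normal(1000, 30, size=60)
counts[45] = 120  # injected anomaly: sharp drop in packets

X = counts.reshape(-1, 1)
# contamination sets the expected share of anomalous days (a tuning assumption).
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # -1 marks anomalous days

anomalous_days = np.where(labels == -1)[0]
```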
- Language: Python 3.10+
- Libraries: pandas, numpy, regex
- Visualization: Microsoft Power BI
- Source Control: GitHub

