Official repository for Team UIDAI 4732. A comprehensive data analysis solution for the UIDAI Data Hackathon 2026.
UIDAI Aadhaar Data Analysis Project

Team ID: UIDAI 4732 | Hackathon 2026


📊 Dashboard Preview

🔗 Live Power BI Dashboard: View Live Interactive Dashboard

👥 Team Members

Birla Institute of Technology, Mesra | Department of Quantitative Economics and Data Science

Name          | ID
Rounak Kumar  | IED/10026/22
Dhruv         | IED/10017/22
Apurva Mishra | IED/10024/22

📖 Project Overview

This project processes approximately 44 lakh (4.4 million) records of Aadhaar Enrolment and Update datasets provided by the National Informatics Centre (NIC).

The raw data was fragmented across multiple split CSVs with inconsistent schemas, noisy geographic identifiers, and duplicates. Our solution consolidates this into a single, analysis-ready source of truth and visualizes it via an interactive Power BI dashboard to identify operational gaps, regional disparities, and lifecycle transition trends.


🚩 Problem Statement

The raw Aadhaar operational datasets present several challenges that hinder direct analysis:

  1. Schema Inconsistencies: Split files (Biometric, Demographic, Enrolment) have different column structures.
  2. Noisy Geography: State and District names contain spelling variants (e.g., "Orissa" vs "Odisha", "Cuddapah" vs "YSR"), special characters, and casing issues.
  3. Lack of Metrics: Raw data provides counts but lacks performance indicators like "Growth Rate" or "Transition Continuity."
  4. Duplication: Repeated records exist at identical reporting granularities.

Objective: Construct a unified pipeline to clean, standardize, and enrich the data for district-level decision-making.


โš™๏ธ The Approach: End-to-End Pipeline

We implemented a 7-step ETL (Extract, Transform, Load) pipeline using Python (Pandas) and Power Query.

graph TD
    A[Raw CSV Ingestion] -->|Concat Split Files| B(Schema Alignment)
    B -->|Standardize Age Cols| C{Consolidation}
    C --> D[Geographic Cleaning]
    D -->|Regex & Mapping| E[Aggregation & Deduping]
    E --> F[Feature Engineering]
    F --> G[Final Dashboard Model]

Pipeline Steps

1. Ingestion

  • Read split CSVs from:
    • Biometric folder
    • Demographic folder
    • Enrolment folder
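The ingestion step can be sketched as follows; the folder layout and file names here are illustrative, not the repository's actual paths. Each source folder's split CSVs are read and stacked into one DataFrame:

```python
# Sketch of the ingestion step: read every split CSV in a folder and
# stack the parts into one DataFrame (folder names are hypothetical).
import glob
import os
import tempfile

import pandas as pd

def load_split_csvs(folder: str) -> pd.DataFrame:
    """Concatenate all CSV parts found in `folder` into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    parts = [pd.read_csv(p) for p in paths]
    return pd.concat(parts, ignore_index=True)

# Demo on two synthetic split files standing in for e.g. the Biometric folder.
tmp = tempfile.mkdtemp()
pd.DataFrame({"state": ["Odisha"], "count": [10]}).to_csv(
    os.path.join(tmp, "part_1.csv"), index=False)
pd.DataFrame({"state": ["Kerala"], "count": [20]}).to_csv(
    os.path.join(tmp, "part_2.csv"), index=False)

biometric = load_split_csvs(tmp)
```

The same helper would be called once per source folder (Biometric, Demographic, Enrolment).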

2. Schema Alignment

Columns renamed to canonical formats:

  • age_0_5 (Bal Aadhaar)
  • age_5_17 (Mandatory Biometric Updates)
  • age_18_greater (Adult Updates)
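One way to perform the renaming is a single mapping applied to each source file; the pre-rename column headers below are hypothetical stand-ins for the inconsistent originals:

```python
# Align differing source headers onto the canonical age-band columns.
# The left-hand (source) names are illustrative, not the real headers.
import pandas as pd

CANONICAL = {
    "age 0-5": "age_0_5",          # Bal Aadhaar
    "age 5-17": "age_5_17",        # mandatory biometric updates
    "age 18+": "age_18_greater",   # adult updates
}

df = pd.DataFrame({"age 0-5": [3], "age 5-17": [7], "age 18+": [12]})
df = df.rename(columns=CANONICAL)
```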

3. Consolidation

  • Merged all sources into a master staging table.

4. Geographic Cleaning

State Normalization

  • Removed numeric junk.
  • Fixed casing issues with "and".
  • Mapped common variants (example: & to and).

District Normalization

  • Applied a comprehensive correction dictionary to map legacy names to current administrative districts.
  • Example: Gurgaon to Gurugram.

5. Aggregation

  • Grouped by:
    • Date
    • State
    • District
    • Pincode
  • Purpose: remove duplicate records.
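A minimal sketch of this dedup-by-aggregation on toy data: grouping on the four keys and summing collapses repeated rows at the same granularity into one record.

```python
# Collapse duplicate rows at the same reporting granularity by
# grouping on the four keys and summing the counts (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "date": ["2026-01-01", "2026-01-01"],
    "state": ["odisha", "odisha"],
    "district": ["khordha", "khordha"],
    "pincode": [751001, 751001],
    "age_0_5": [2, 3],
})

agg = (
    df.groupby(["date", "state", "district", "pincode"], as_index=False)
      .sum(numeric_only=True)
)
```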

6. Metric Engineering

  • Calculated daily growth metrics.
  • Derived lifecycle and transition ratios.
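The daily growth metric can be sketched as a per-district percentage change over date-ordered totals (toy numbers; the exact formula used in the pipeline may differ):

```python
# Daily growth rate per district: percentage change in total updates,
# computed within each district after sorting by date.
import pandas as pd

df = pd.DataFrame({
    "district": ["khordha"] * 3,
    "date": pd.to_datetime(["2026-01-01", "2026-01-02", "2026-01-03"]),
    "total_updates": [100, 110, 99],
})

df = df.sort_values(["district", "date"])
df["growth_rate"] = df.groupby("district")["total_updates"].pct_change()
```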

7. Enrichment

  • Merged district-level performance bands back into the daily aggregated view.
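A sketch of the enrichment join, assuming a district-level band table (column names hypothetical); a left join keeps every daily record even when a district has no band assigned:

```python
# Merge district-level performance bands back onto the daily rows.
import pandas as pd

daily = pd.DataFrame({
    "district": ["khordha", "cuttack"],
    "total_updates": [150, 90],
})
bands = pd.DataFrame({
    "district": ["khordha"],
    "performance_band": ["high"],
})

enriched = daily.merge(bands, on="district", how="left")
```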

🧹 Data Cleaning

A significant portion of execution focused on cleaning dirty text fields using vectorized string operations.

Example: Standardizing State Names

df["state"] = (
    df["state"]
    .astype("string")                       # nullable string dtype
    .str.strip()                            # trim surrounding whitespace
    .str.replace(r"\s+", " ", regex=True)   # collapse runs of whitespace
    .str.replace("&", "and", regex=False)   # normalize ampersands
    .str.lower()                            # case-fold for matching
)

Correction Mapping (Snippet)

correction_map = {
    "orissa": "odisha",
    "pondicherry": "puducherry",
    "allahabad": "prayagraj",
    "gurgaon": "gurugram",
    "cuddapah": "ysr"
}

df["district"] = df["district"].replace(correction_map)

🧮 Feature Engineering

Derived KPIs to evaluate district-level performance:

Metric             | Formula                                         | Purpose
Total Updates      | age_0_5 + age_5_17 + age_18_greater             | Primary workload measure
Zero Activity Flag | IF(total_updates == 0, 1, 0)                    | Identifies service interruptions
Transition Ratio   | total_adult_updates / (total_child_updates + 1) | Measures lifecycle continuity
Priority Index     | Transition Ratio < 1.5 OR Zero Days > 5         | Flags districts needing attention
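These KPIs can be sketched in pandas as below, treating age_18_greater as the adult updates and the two younger bands as child updates, and assuming a precomputed zero_days column (both assumptions, not confirmed by the pipeline code):

```python
# KPI formulas from the table above, sketched on toy data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age_0_5": [0, 5],
    "age_5_17": [0, 10],
    "age_18_greater": [0, 24],
    "zero_days": [7, 1],   # assumed precomputed count of zero-activity days
})

df["total_updates"] = df["age_0_5"] + df["age_5_17"] + df["age_18_greater"]
df["zero_activity_flag"] = np.where(df["total_updates"] == 0, 1, 0)
child_updates = df["age_0_5"] + df["age_5_17"]
df["transition_ratio"] = df["age_18_greater"] / (child_updates + 1)
df["priority"] = (
    (df["transition_ratio"] < 1.5) | (df["zero_days"] > 5)
).astype(int)
```

The first toy row is flagged both for zero activity and priority; the second clears both thresholds.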

📊 Dashboard Architecture

The Power BI solution is divided into two analytical views.

1. Executive Summary and Activity

Goal
High-level monitoring of national and state trends.

Visuals

  • Choropleth map of update activity by state
  • Monthly growth rate trends
  • Activity status distribution (Increasing vs Declining)

2. District Performance and Priority Analysis

Goal
Deep dive into district-level operational gaps.

Visuals

  • Priority list of districts flagged as High Priority
  • Transition band donut chart (Low, Moderate, High continuity)
  • Zero activity tracker for frequent zero-reporting districts

🚀 Future Scope

Real-time API

  • Replace CSV dumps with direct UIDAI API integration.

Anomaly Detection

  • Use Isolation Forest to detect sudden drops in enrolment packets.
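A minimal sketch of what that check could look like with scikit-learn's IsolationForest on one district's daily counts (scikit-learn is an assumed extra dependency, not part of the current stack; the numbers are synthetic):

```python
# Flag a sudden drop in daily update counts with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
counts = rng.normal(loc=1000, scale=30, size=60)  # typical days
counts[-1] = 50                                   # simulated sudden drop

model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(counts.reshape(-1, 1))  # -1 marks anomalies
```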

Census Overlay

  • Correlate Aadhaar saturation with Census 2021 and 2026 population data to estimate remaining demand.

๐Ÿ› ๏ธ Tech Stack

  • Language: Python 3.10+
  • Libraries: pandas, numpy, regex
  • Visualization: Microsoft Power BI
  • Source Control: GitHub
