A Python-based data cleaning and validation project that ensures the accuracy and integrity of Permanent Account Numbers (PAN) for Indian nationals.
The goal is to check that each PAN follows the official format and to categorize it as Valid or Invalid.
This project takes an input dataset containing PAN numbers (from a CSV/Excel file), performs cleaning and preprocessing, validates each PAN against the format rules below, and outputs:
- A list of PAN numbers, each marked Valid or Invalid
- A list of invalid PAN categories
- A summary report with counts
Data Cleaning & Preprocessing
- Handles missing values (removal or imputation).
- Removes duplicate PAN numbers.
- Strips leading/trailing spaces.
- Converts all PAN numbers to uppercase.
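The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_pan_column` and the column name `pan_number` are hypothetical, and missing values are handled by removal rather than imputation.

```python
import pandas as pd


def clean_pan_column(df: pd.DataFrame, column: str = "pan_number") -> pd.DataFrame:
    """Return a cleaned copy of df; `column` is a hypothetical column name."""
    # Drop rows where the PAN is missing (removal rather than imputation).
    out = df.dropna(subset=[column]).copy()
    # Strip leading/trailing spaces and convert to uppercase.
    out[column] = out[column].astype(str).str.strip().str.upper()
    # Rows that contained only whitespace are now empty strings; drop them too.
    out = out[out[column] != ""]
    # Remove duplicate PANs, keeping the first occurrence.
    return out.drop_duplicates(subset=[column]).reset_index(drop=True)


raw = pd.DataFrame(
    {"pan_number": [" ahgve1276f ", "AHGVE1276F", None, "   ", "xkzpq5829m"]}
)
cleaned = clean_pan_column(raw)
```

Normalizing case and whitespace before removing duplicates means that values such as ` ahgve1276f ` and `AHGVE1276F` are treated as the same PAN.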
PAN Format Validation Rules
- Exactly 10 characters long.
- Format: AAAAA1234A
- First 5 characters: uppercase alphabets.
  - No consecutive identical alphabets (e.g., AABCD ❌).
  - Not a sequential alphabet series (e.g., ABCDE ❌).
- Next 4 characters: digits.
  - No consecutive identical digits (e.g., 1123 ❌).
  - Not a sequential digit series (e.g., 1234 ❌).
- Last character: uppercase alphabet.
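One way to express these rules in code, using only the standard `re` module, is sketched below. The helper names are hypothetical, and the sketch assumes that "sequential series" means the whole letter block or digit block ascends one step at a time (as in ABCDE or 1234); the notebook may interpret the rule differently.

```python
import re

# Overall shape: 5 uppercase letters, 4 digits, 1 uppercase letter (10 characters).
PAN_PATTERN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def has_adjacent_repeat(block: str) -> bool:
    """True if any two neighbouring characters are identical (e.g. AABCD, 1123)."""
    return any(a == b for a, b in zip(block, block[1:]))


def is_sequential(block: str) -> bool:
    """True if every character is exactly one step after the previous one.

    Assumption: only a fully ascending block (e.g. ABCDE, 1234) counts.
    """
    return all(ord(b) - ord(a) == 1 for a, b in zip(block, block[1:]))


def is_valid_pan(pan: str) -> bool:
    """Check an already cleaned (trimmed, uppercased) PAN against the rules."""
    if PAN_PATTERN.fullmatch(pan) is None:
        return False
    letters, digits = pan[:5], pan[5:9]
    if has_adjacent_repeat(letters) or has_adjacent_repeat(digits):
        return False
    if is_sequential(letters) or is_sequential(digits):
        return False
    return True
```

Using `fullmatch` covers the length rule and the character-class rules in one step, so the repeat and sequence checks only run on strings that already have the right shape.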
Categorization
- Valid: Meets all format rules.
- Invalid: Violates any rule or contains non-alphanumeric characters.
- Observation: The category of invalid format the PAN falls into (blank if valid).
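The Valid/Invalid status and its observation could be produced together, as in the sketch below. The function name `categorize_pan` and the observation labels are hypothetical (the actual category names in the output files may differ), and the order of the checks is an assumption: the first rule that fails determines the observation.

```python
import re
import string

PAN_SHAPE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")
ADJACENT_REPEAT = re.compile(r"(.)\1")  # any character followed by itself


def categorize_pan(pan: str) -> tuple:
    """Return (status, observation); observation is blank for a valid PAN.

    The labels below are illustrative, not the project's official categories.
    """
    if not pan:
        return "Invalid", "Missing value"
    if not pan.isalnum():
        return "Invalid", "Non-alphanumeric characters"
    if len(pan) != 10:
        return "Invalid", "Incorrect length"
    if PAN_SHAPE.fullmatch(pan) is None:
        return "Invalid", "Incorrect format"
    letters, digits = pan[:5], pan[5:9]
    if ADJACENT_REPEAT.search(letters) or ADJACENT_REPEAT.search(digits):
        return "Invalid", "Adjacent repeated characters"
    # A block that appears inside A..Z or 0..9 is an ascending series.
    if letters in string.ascii_uppercase or digits in string.digits:
        return "Invalid", "Sequential characters"
    return "Valid", ""
```

With pandas, such a function could be applied to a cleaned column via `Series.apply` and the resulting pairs split into separate status and observation columns.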
Reporting
- Total records processed.
- Total valid PANs.
- Total invalid PANs.
- Total missing/incomplete PANs.
- Categorization of invalid PANs.
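A summary with these counts might be assembled as in the sketch below. It assumes a raw input Series and a results DataFrame with hypothetical `status` and `observation` columns, and it counts a record as missing/incomplete when the raw value is null or blank; the notebook may define that figure differently.

```python
import pandas as pd


def build_summary(raw: pd.Series, results: pd.DataFrame) -> pd.DataFrame:
    """Build a metric/count table from the raw input and the validated results."""
    # Assumption: "missing/incomplete" means null or whitespace-only raw values.
    missing = raw.isna() | (raw.astype(str).str.strip() == "")
    status_counts = results["status"].value_counts()
    return pd.DataFrame(
        {
            "metric": [
                "Total records processed",
                "Total valid PANs",
                "Total invalid PANs",
                "Total missing/incomplete PANs",
            ],
            "count": [
                len(raw),
                int(status_counts.get("Valid", 0)),
                int(status_counts.get("Invalid", 0)),
                int(missing.sum()),
            ],
        }
    )


def invalid_breakdown(results: pd.DataFrame) -> pd.Series:
    """Count invalid PANs per observation category."""
    invalid = results[results["status"] == "Invalid"]
    return invalid["observation"].value_counts()


raw = pd.Series(["AHGVE1276F", None, " ", "AABCD1276F", "AHGVE1276F"])
results = pd.DataFrame(
    {
        "pan_number": ["AHGVE1276F", "AABCD1276F"],
        "status": ["Valid", "Invalid"],
        "observation": ["", "Adjacent repeated characters"],
    }
)
summary = build_summary(raw, results)
breakdown = invalid_breakdown(results)
```

Both tables could then be saved with `DataFrame.to_excel`, which requires an Excel engine such as openpyxl to be installed.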
.
├── resources/
│   ├── PAN Number Validation Dataset.csv # Input dataset (csv)
│   ├── PAN Number Validation Dataset.xlsx # Input dataset (xlsx)
│   └── PAN Number Validation - Problem Statement.pdf # Problem statement (pdf)
├── analysis_raw.ipynb # Raw analysis notebook
├── analysis_final.ipynb # Ready-to-run notebook
├── README.md # Project documentation
└── output/
    ├── PAN_Validation_Results.xlsx
    ├── PAN_Validation_Summary.xlsx
    └── Valid_Invalid_Category.xlsx
1️⃣ Clone the Repository
git clone https://github.com/<your-username>/pan-number-validation.git
2️⃣ Place Your Dataset
- Put your PAN Number Validation Dataset.csv file inside the resources/ folder.
3️⃣ Run the Script
- Open analysis_final.ipynb and run all cells.
- Python (pandas, re)
- Excel/CSV for input/output
Shreyajyoti Dutta
🔗 LinkedIn Profile
📫 Open to opportunities in Data Analytics, Data Engineering, ETL and BI
Python, pandas, Data Cleaning, Data Preprocessing, Data Transformation, Business Insights, Data Analytics