DarkOracle10/CSV-Cleaner---Report-Generator

📊 CSV Cleaner & Report Generator

A Python command-line utility that cleans messy CSV data: it removes duplicate rows, fills missing values, standardizes date formats, and generates a before/after cleaning report.

✨ Features

  • 🧹 Remove Duplicates - Automatically detect and remove duplicate rows
  • πŸ”§ Fill Missing Values - Smart filling with mean/median/mode or custom values
  • πŸ“… Date Standardization - Auto-detect and format date columns consistently
  • πŸ“ Report Generation - Detailed text reports with before/after statistics
  • 🎯 Column Filtering - Clean specific columns or entire dataset
  • ⚑ Fast Processing - Pandas-powered for large datasets
  • 🔍 Data Quality Checks - Report completeness, duplicate rate, and date format consistency

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/DarkOracle10/CSV-Cleaner---Report-Generator.git
cd CSV-Cleaner---Report-Generator

# Install dependencies
pip install pandas numpy python-dateutil

# Or use requirements.txt
pip install -r requirements.txt

Basic Usage

# Clean a CSV file
python csv_cleaner.py input_data.csv

# Clean and generate report
python csv_cleaner.py input_data.csv --report

# Remove duplicates only
python csv_cleaner.py input_data.csv --remove-duplicates

# Fill missing values with mean
python csv_cleaner.py input_data.csv --fill-missing mean

📖 Usage Examples

Example 1: Basic Cleaning

python csv_cleaner.py messy_data.csv
# Output: messy_data_cleaned.csv

Example 2: Full Cleaning with Report

python csv_cleaner.py sales_data.csv --report --output cleaned_sales.csv
# Output: cleaned_sales.csv + cleaning_report.txt

Example 3: Python API

from csv_cleaner import CSVCleaner

# Initialize cleaner
cleaner = CSVCleaner('data.csv')

# Remove duplicates
cleaner.remove_duplicates()

# Fill missing values
cleaner.fill_missing_values(strategy='mean')

# Standardize dates
cleaner.standardize_dates()

# Generate report
report = cleaner.generate_report()
print(report)

# Save cleaned data
cleaner.save('cleaned_data.csv')

🎯 Features in Detail

Duplicate Removal

  • Identifies exact duplicate rows
  • Option to keep first/last occurrence
  • Reports number of duplicates removed
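
The behavior listed above can be sketched with pandas. This is a minimal illustration, not the tool's actual implementation: the helper name `drop_duplicate_rows` and its `keep` parameter are assumptions made for this example, built on the real `DataFrame.drop_duplicates` call.

```python
import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame, keep: str = "first"):
    """Hypothetical helper: drop exact duplicate rows.

    Returns the cleaned frame and the number of rows removed.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    # drop_duplicates compares every column, so only exact duplicates go
    cleaned = df.drop_duplicates(keep=keep).reset_index(drop=True)
    return cleaned, len(df) - len(cleaned)


df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "age": [30, 25, 30]})
cleaned, removed = drop_duplicate_rows(df, keep="first")
print(f"Removed {removed} duplicate rows")
```

With `keep="last"` the later occurrence survives, which changes the row order of the result but not the count removed.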

Missing Value Handling

Strategies:

  • mean - Fill with column mean (numeric only)
  • median - Fill with column median
  • mode - Fill with most frequent value
  • forward - Forward fill from previous row
  • backward - Backward fill from next row
  • Custom value
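
As a rough sketch of how these strategies map onto pandas, the hypothetical function below dispatches on the strategy name. The function name `fill_missing`, the `"custom"` keyword, and the `value` parameter are assumptions for illustration; the project's `fill_missing_values` method may differ.

```python
import pandas as pd


def fill_missing(series: pd.Series, strategy: str = "mean", value=None) -> pd.Series:
    """Hypothetical helper: fill missing entries of one column."""
    if strategy == "mean":
        return series.fillna(series.mean())      # numeric columns only
    if strategy == "median":
        return series.fillna(series.median())    # numeric columns only
    if strategy == "mode":
        modes = series.mode()                    # NaN is ignored by default
        return series.fillna(modes.iloc[0]) if not modes.empty else series
    if strategy == "forward":
        return series.ffill()                    # copy the previous row's value
    if strategy == "backward":
        return series.bfill()                    # copy the next row's value
    if strategy == "custom":
        if value is None:
            raise ValueError("a custom fill needs a value")
        return series.fillna(value)
    raise ValueError(f"unknown strategy: {strategy}")


ages = pd.Series([30.0, None, 40.0])
print(fill_missing(ages, "mean").tolist())
emails = pd.Series(["a@x.org", None])
print(fill_missing(emails, "custom", value="N/A").tolist())
```

Note that forward fill leaves a leading gap untouched and backward fill leaves a trailing gap untouched, since there is no neighboring row to copy from.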

Date Standardization

  • Auto-detects date columns
  • Converts to ISO 8601 format (YYYY-MM-DD)
  • Handles multiple date formats:
    • MM/DD/YYYY
    • DD-MM-YYYY
    • YYYY/MM/DD
    • And more...
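
One way to picture the conversion is to try a fixed list of formats and emit ISO 8601 on the first match. The sketch below uses only the standard library; the names `CANDIDATE_FORMATS` and `to_iso_date` are hypothetical, and the tool itself may rely on pandas or python-dateutil instead.

```python
from datetime import datetime
from typing import Optional

# Assumed format list covering the examples above, plus ISO input itself
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d")


def to_iso_date(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


for raw in ("02/05/2026", "05-02-2026", "2026/02/05", "not a date"):
    print(raw, "->", to_iso_date(raw))
```

In this sketch the separator decides between month-first and day-first input (slashes mean MM/DD/YYYY, hyphens mean DD-MM-YYYY). A value such as 02/05/2026 is inherently ambiguous, so any real detector has to pick a convention like this one.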

Report Generation

Includes:

  • Original dataset statistics
  • Cleaning operations performed
  • Before/after comparison
  • Data quality metrics
  • Processing time
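
The before/after comparison could be computed along these lines. This is a sketch under stated assumptions: `quality_metrics` is a hypothetical helper, completeness is taken to mean the share of non-missing cells, and the duplicate rate is the share of rows flagged by `DataFrame.duplicated`. The project may define its metrics differently.

```python
import pandas as pd


def quality_metrics(df: pd.DataFrame) -> dict:
    """Hypothetical helper: summary statistics for a cleaning report."""
    cells = df.size
    filled = int(df.notna().sum().sum())
    dupes = int(df.duplicated().sum())
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # Assumed definition: percentage of cells that are not missing
        "completeness_pct": round(100.0 * filled / cells, 1) if cells else 100.0,
        # Assumed definition: percentage of rows repeating an earlier row
        "duplicate_pct": round(100.0 * dupes / len(df), 1) if len(df) else 0.0,
    }


raw = pd.DataFrame({
    "age": [30.0, 30.0, None, 41.0],
    "email": ["a@x.org", "a@x.org", "b@x.org", None],
})
cleaned = raw.drop_duplicates().fillna({"age": 35.5, "email": "N/A"})

before, after = quality_metrics(raw), quality_metrics(cleaned)
for key in before:
    print(f"{key}: {after[key]} (before: {before[key]})")
```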

📁 Examples

See the examples/ directory for sample data:

examples/
├── messy_data.csv          # Input: Dataset with issues
├── cleaned_data.csv        # Output: After cleaning
└── cleaning_report.txt     # Report: Operations performed

🛠️ CLI Reference

usage: csv_cleaner.py [-h] [--remove-duplicates] [--fill-missing {mean,median,mode,forward,backward}]
                      [--standardize-dates] [--report] [--output OUTPUT] [--columns COLUMNS]
                      input_file

positional arguments:
  input_file            Path to input CSV file

optional arguments:
  -h, --help            Show this help message and exit
  --remove-duplicates   Remove duplicate rows
  --fill-missing STRATEGY
                        Fill missing values with strategy
  --standardize-dates   Standardize date formats
  --report              Generate cleaning report
  --output OUTPUT       Output file path (default: input_cleaned.csv)
  --columns COLUMNS     Comma-separated columns to clean (default: all)

examples:
  python csv_cleaner.py data.csv --remove-duplicates --fill-missing mean
  python csv_cleaner.py data.csv --report --output clean.csv
  python csv_cleaner.py data.csv --columns "Name,Email,Date"
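
For readers who want to see how such an interface is typically declared, here is an `argparse` sketch that accepts the same flags as the help text above. It is a reconstruction for illustration only: `build_parser` and `split_columns` are hypothetical names, and the real `csv_cleaner.py` may be organized differently.

```python
import argparse


def split_columns(text: str) -> list:
    """Turn "Name, Email,Date" into ["Name", "Email", "Date"]."""
    return [name.strip() for name in text.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="csv_cleaner.py")
    parser.add_argument("input_file", help="Path to input CSV file")
    parser.add_argument("--remove-duplicates", action="store_true",
                        help="Remove duplicate rows")
    parser.add_argument("--fill-missing", metavar="STRATEGY",
                        choices=["mean", "median", "mode", "forward", "backward"],
                        help="Fill missing values with strategy")
    parser.add_argument("--standardize-dates", action="store_true",
                        help="Standardize date formats")
    parser.add_argument("--report", action="store_true",
                        help="Generate cleaning report")
    parser.add_argument("--output", help="Output file path")
    # None means "clean every column"
    parser.add_argument("--columns", type=split_columns, default=None,
                        help="Comma-separated columns to clean (default: all)")
    return parser


args = build_parser().parse_args(
    ["data.csv", "--fill-missing", "mean", "--columns", "Name,Email,Date"]
)
print(args.input_file, args.fill_missing, args.columns)
```

Restricting `--fill-missing` with `choices` makes argparse reject unknown strategies before any data is read.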

📊 Sample Report

=== CSV Cleaning Report ===
Generated: 2026-02-05 10:30:45

Input File: messy_data.csv
Output File: messy_data_cleaned.csv

Dataset Statistics:
- Original Rows: 1,000
- Original Columns: 15
- Final Rows: 847
- Final Columns: 15

Operations Performed:
1. Removed 153 duplicate rows (15.3%)
2. Filled 45 missing values in column 'Age' with mean (35.2)
3. Filled 12 missing values in column 'Email' with 'N/A'
4. Standardized 1,000 dates in column 'Registration Date'

Data Quality Metrics:
- Completeness: 98.5% (before: 92.1%)
- Duplicates: 0% (before: 15.3%)
- Date Format Consistency: 100% (before: 78.4%)

Processing Time: 0.34 seconds

🧪 Testing

# Run tests
pytest tests/

# With coverage
pytest --cov=csv_cleaner tests/

🀝 Contributing

Contributions welcome! Ideas for improvement:

  • Graphical interface (GUI)
  • Excel file support (.xlsx)
  • Data type inference and conversion
  • Outlier detection and handling
  • Column rename suggestions
  • Data validation rules

📄 License

MIT License - See LICENSE file

👤 Author

Amir Aeiny


⭐ Found this useful? Star the repo!
