
EDA Improved


Exploratory data analysis, elevated — a deeply rigorous, statistically grounded EDA framework with publication-ready visualisations and actionable analytical commentary.


Topics: python · data-analysis · data-science · exploratory-data-analysis · machine-learning · matplotlib · pandas · seaborn · statistical-profiling · visualization

Overview

EDA Improved is a second-generation exploratory data analysis toolkit that addresses the limitations of typical automated EDA tools: shallow statistical coverage, generic visualisations that rarely surface genuine insight, and the absence of an analytical narrative connecting observations to actionable next steps. The framework grew out of hands-on experience with dozens of EDA tools and a desire for something that goes deeper in every dimension.

The statistical layer is substantially more thorough than standard EDA. Normality testing applies four tests simultaneously — Shapiro-Wilk (≤5,000 samples), D'Agostino-Pearson, Lilliefors, and Anderson-Darling — with a Bonferroni-corrected consensus verdict, rather than relying on a single test. Outlier detection uses three methods — IQR fence, Z-score with configurable threshold, and Isolation Forest — with a per-sample majority-vote outlier label. Stationarity testing for time-series datasets uses ADF, KPSS, and Phillips-Perron (PP) tests with conflict-resolution guidance.

The visualisation philosophy rejects chart maximalism: every plot has a specific analytical purpose, follows Tufte's data-ink ratio principle, and is accompanied by a one-sentence interpretation that makes the takeaway explicit. A figure should never require the viewer to think 'what am I supposed to notice here?' — the annotation does that work.


Motivation

The original EDA project was functional but generic. This improved version was motivated by a specific dissatisfaction: EDA tools that plot everything but explain nothing, that report statistics without interpretation, and that produce 50-page reports where the genuinely interesting patterns are buried under pages of boilerplate. EDA Improved embeds analytical judgment into the pipeline — producing fewer, more meaningful outputs with explicit interpretive commentary.


Architecture

Dataset Input (CSV / Excel / Parquet)
        │
  Quality Assessment Layer:
  ├── Four normality tests (with Bonferroni correction)
  ├── Three outlier methods (IQR, Z-score, Isolation Forest)
  └── Stationarity tests for datetime-indexed data
        │
  Statistical Characterisation:
  ├── Distributional family fitting (best-fit distribution)
  ├── Robust statistics (median, MAD, trimmed mean)
  └── Effect size computation (Cohen's d, Cramér's V)
        │
  Insight-Driven Visualisation:
  (each plot annotated with interpretive commentary)
        │
  Structured EDA Report (HTML with embedded narrative)

Features

Multi-Test Normality Assessment

Four normality tests (Shapiro-Wilk, D'Agostino-Pearson, Lilliefors, Anderson-Darling) with Bonferroni correction and consensus verdict — replacing unreliable single-test assessments.
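As an illustrative sketch (not the project's actual API — the function name is hypothetical), the four-test consensus with Bonferroni correction could be assembled from SciPy and statsmodels roughly like this:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

def normality_consensus(x, alpha=0.05):
    """Majority verdict from four normality tests, Bonferroni-corrected."""
    x = np.asarray(x, dtype=float)
    pvals = [
        stats.shapiro(x).pvalue,      # Shapiro-Wilk (reliable for n <= 5000)
        stats.normaltest(x).pvalue,   # D'Agostino-Pearson
        lilliefors(x)[1],             # Lilliefors (KS with estimated params)
    ]
    # Bonferroni: scale each p-value by the number of p-value-based tests.
    rejects = sum(min(p * len(pvals), 1.0) < alpha for p in pvals)
    # Anderson-Darling reports critical values instead of a p-value;
    # compare its statistic against the 5% critical value.
    ad = stats.anderson(x, dist="norm")
    idx = list(ad.significance_level).index(5.0)
    rejects += int(ad.statistic > ad.critical_values[idx])
    return "non-normal" if rejects > 2 else "normal"  # majority of 4 tests

rng = np.random.default_rng(0)
verdict = normality_consensus(rng.normal(size=500))
```

Because Bonferroni inflates each p-value before comparison, a single borderline test cannot flip the verdict on its own.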

Ensemble Outlier Detection

Per-sample outlier labelling via majority vote of IQR fence, Z-score threshold, and Isolation Forest — providing more robust outlier identification than any single method.
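A minimal sketch of the majority-vote idea (the function name and thresholds are illustrative, mirroring the defaults in the Configuration section):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def ensemble_outliers(x, z_thresh=3.0, contamination=0.05):
    """Flag a sample as an outlier when at least 2 of 3 methods agree."""
    x = np.asarray(x, dtype=float)
    # Method 1: IQR fence (1.5 * IQR beyond the quartiles)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    iqr_flag = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    # Method 2: Z-score threshold
    z_flag = np.abs((x - x.mean()) / x.std()) > z_thresh
    # Method 3: Isolation Forest
    if_flag = IsolationForest(
        contamination=contamination, random_state=0
    ).fit_predict(x.reshape(-1, 1)) == -1
    votes = iqr_flag.astype(int) + z_flag.astype(int) + if_flag.astype(int)
    return votes >= 2  # majority of the three methods

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 200), [8.0, -9.0]])
flags = ensemble_outliers(data)
```

The vote requirement means a point flagged only by the aggressive Isolation Forest (which always labels a `contamination` fraction) is not reported as an outlier.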

Distribution Family Fitting

Fit 15+ parametric distributions (normal, log-normal, exponential, gamma, beta, Weibull) to each numerical column via MLE, select best fit via AIC/BIC, and display fitted PDF over histogram.
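A reduced sketch of the fit-and-rank step using SciPy's MLE fitting and AIC (candidate list shortened for brevity; `floc=0` is an assumption to keep positively supported fits well-behaved):

```python
import numpy as np
from scipy import stats

def best_fit_distribution(x, candidates=("norm", "lognorm", "expon", "gamma")):
    """Fit each candidate by MLE and rank by AIC (lower is better)."""
    x = np.asarray(x, dtype=float)
    aic = {}
    for name in candidates:
        dist = getattr(stats, name)
        # Pin loc=0 for positive-support families to avoid degenerate fits.
        kwargs = {"floc": 0} if name in ("lognorm", "expon", "gamma") else {}
        params = dist.fit(x, **kwargs)                # maximum-likelihood fit
        loglik = np.sum(dist.logpdf(x, *params))
        aic[name] = 2 * len(params) - 2 * loglik      # AIC = 2k - 2 ln L
    return min(aic, key=aic.get), aic

rng = np.random.default_rng(2)
sample = rng.lognormal(mean=0.0, sigma=0.6, size=1000)
name, scores = best_fit_distribution(sample)
```

The winning distribution's fitted PDF can then be overlaid on the histogram, as described above.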

Effect Size Reporting

All group comparisons accompanied by effect size metrics: Cohen's d for continuous variables, Cramér's V for categorical associations — making statistical significance practically interpretable.
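Both metrics are straightforward to compute; the following is a minimal sketch (helper names are illustrative):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardised mean difference with pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def cramers_v(table):
    """Association strength for an r x c contingency table, in [0, 1]."""
    table = np.asarray(table, float)
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))
```

For example, a perfectly associated 2x2 table yields Cramér's V of 1.0, while a shift of one pooled standard deviation between groups yields Cohen's d of 1.0 — a "large" effect by the usual convention.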

Robust Statistics Module

Median, MAD, trimmed mean (10%), Winsorised mean alongside standard mean/std — providing outlier-resistant estimates that standard EDA tools omit.
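A small worked example of why these estimates matter — with one extreme value present, the robust statistics stay near the bulk of the data while the mean does not (the array is illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 100.0])

summary = {
    "mean": x.mean(),                                    # pulled up by the outlier
    "median": np.median(x),
    "mad": stats.median_abs_deviation(x),
    "trimmed_mean_10": stats.trim_mean(x, 0.10),         # drop 10% in each tail
    "winsorized_mean_10": float(
        stats.mstats.winsorize(x, limits=(0.10, 0.10)).mean()
    ),                                                   # clamp 10% in each tail
}
```

Here the mean is 12.5 while the median, 10% trimmed mean, and 10% Winsorised mean all sit at 3.0 — the robust estimates describe the typical value; the gap between them and the mean flags the outlier's influence.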

Annotated Visualisations

Every generated chart includes a one-sentence analytical interpretation in the figure caption, stating the key pattern to notice — implementing Tufte's principle that graphics should be self-explanatory.
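The annotation pattern might look like the following matplotlib sketch (column name, colours, and caption text are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
income = rng.lognormal(mean=10.0, sigma=0.5, size=300)

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(income, bins=30, color="#4c72b0", edgecolor="white")
for side in ("top", "right"):           # trim non-data ink (Tufte)
    ax.spines[side].set_visible(False)
ax.set_title("Income distribution")
# The interpretive caption travels with the figure, not the surrounding text.
caption = "Takeaway: right-skewed with a heavy upper tail; a log transform is advisable."
fig.text(0.01, -0.04, caption, fontsize=9, style="italic")
fig.savefig("income_hist.png", bbox_inches="tight")
```

The caption is attached to the figure itself, so the takeaway survives wherever the image is embedded.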

Time-Series Stationarity Analysis

ADF, KPSS, and Phillips-Perron unit root tests for datetime-indexed data, with conflict resolution guidance (integrated vs. trend-stationary distinction) and differencing recommendations.

Actionable EDA Summary

Structured conclusion section with prioritised list of preprocessing steps recommended before modelling: specific columns to transform, encode, impute, or drop — with justification.


Tech Stack

Library / Tool Role Why This Choice
pandas Data manipulation Type inference, group operations, time-series indexing
SciPy Statistical tests Normality tests, distribution fitting, KS test
statsmodels Time-series analysis ADF, KPSS, PP stationarity tests, ACF/PACF
scikit-learn Outlier detection Isolation Forest for ensemble outlier labelling
Plotly / Matplotlib Annotated visualisation Tufte-inspired charts with interpretive captions
Streamlit Interactive interface Dataset upload, column selection, report generation
pdfkit / WeasyPrint Report export HTML-to-PDF report generation

Getting Started

Prerequisites

  • Python 3.9+
  • A virtual environment manager (venv, conda, or equivalent)
  • Optionally, the environment variables listed in the Configuration section (all have defaults)

Installation

git clone https://github.com/Devanik21/EDA--improved.git
cd EDA--improved
python -m venv venv && source venv/bin/activate
pip install pandas scipy statsmodels scikit-learn plotly matplotlib streamlit
streamlit run app.py

Usage

# Launch improved EDA interface
streamlit run app.py

# Run full rigorous EDA from CLI
python eda.py --data housing.csv --target price --output eda_report.html

# Normality assessment only
python normality.py --data data.csv --columns age,income --alpha 0.05

# Outlier detection with ensemble method
python outliers.py --data data.csv --method ensemble --output flagged.csv

# Time-series stationarity analysis
python stationarity.py --data timeseries.csv --value_col price --date_col date

Configuration

Variable Default Description
NORMALITY_ALPHA 0.05 Significance level for normality tests
OUTLIER_Z_THRESHOLD 3.0 Z-score threshold for outlier flagging
OUTLIER_IF_CONTAMINATION 0.05 Isolation Forest contamination parameter
DIST_N_CANDIDATES 15 Number of distributions to fit and compare
TUFTE_MODE True Enable minimalist Tufte-style chart formatting

Copy .env.example to .env and populate required values before running.


Project Structure

EDA--improved/
├── README.md
├── Stackoverflow_Survey_Analysis-checkpoint.ipynb
└── ...

Roadmap

  • Causal discovery integration: PC algorithm for skeleton graph identification from observational data
  • Natural language EDA narrative generation: LLM summarises the full EDA in plain English for stakeholder reporting
  • Longitudinal EDA: track dataset statistics across dataset versions to detect drift over time
  • Bayesian hypothesis testing module: Bayes factors as alternatives to frequentist p-values
  • Multi-dataset comparative EDA: test for distribution shift between train/test or historical/current datasets

Contributing

Contributions, issues, and suggestions are welcome.

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-idea
  3. Commit your changes: git commit -m 'feat: add your idea'
  4. Push to your branch: git push origin feature/your-idea
  5. Open a Pull Request with a clear description

Please follow conventional commit messages and add documentation for new features.


Notes

The multi-test normality framework applies Bonferroni correction for multiple comparisons — individual test p-values are multiplied by the number of tests before applying the α threshold. This conservative approach reduces false positives at the cost of lower power. For small samples (<50), rely primarily on Shapiro-Wilk; for large samples (>5,000), all tests will detect trivial departures from normality that are practically irrelevant.


Author

Devanik Debnath
B.Tech, Electronics & Communication Engineering
National Institute of Technology Agartala

GitHub LinkedIn


License

This project is open source and available under the MIT License.


Built with curiosity, depth, and care — because good projects deserve good documentation.
