Exploratory data analysis, elevated — a deeply rigorous, statistically grounded EDA framework with publication-ready visualisations and actionable analytical commentary.
Topics: python · data-analysis · data-science · exploratory-data-analysis · machine-learning · matplotlib · pandas · seaborn · statistical-profiling · visualization
EDA Improved is a second-generation exploratory data analysis toolkit that addresses the limitations of typical automated EDA tools: shallow statistical coverage, generic visualisations that rarely surface genuine insight, and the absence of analytical narrative that connects observations to actionable next steps. This framework was built by someone who had used dozens of EDA tools and wanted something that went deeper in every dimension.
The statistical layer is substantially more thorough than standard EDA. Normality testing applies four tests simultaneously — Shapiro-Wilk (≤5,000 samples), D'Agostino-Pearson, Lilliefors, and Anderson-Darling — with a Bonferroni-corrected consensus verdict, rather than relying on a single test. Outlier detection uses three methods — IQR fence, Z-score with configurable threshold, and Isolation Forest — with a per-sample majority-vote outlier label. Stationarity testing for time-series datasets uses ADF, KPSS, and PP tests with conflict resolution guidance.
The visualisation philosophy rejects chart maximalism: every plot has a specific analytical purpose, follows Tufte's data-ink ratio principle, and is accompanied by a one-sentence interpretation that makes the takeaway explicit. A figure should never require the viewer to think 'what am I supposed to notice here?' — the annotation does that work.
The original EDA project was functional but generic. This improved version was motivated by a specific dissatisfaction: EDA tools that plot everything but explain nothing, that report statistics without interpretation, and that produce 50-page reports where the genuinely interesting patterns are buried under pages of boilerplate. EDA Improved embeds analytical judgment into the pipeline — producing fewer, more meaningful outputs with explicit interpretive commentary.
```
Dataset Input (CSV / Excel / Parquet)
                │
Quality Assessment Layer:
    ├── Four normality tests (with Bonferroni correction)
    ├── Three outlier methods (IQR, Z-score, Isolation Forest)
    └── Stationarity tests for datetime-indexed data
                │
Statistical Characterisation:
    ├── Distributional family fitting (best-fit distribution)
    ├── Robust statistics (median, MAD, trimmed mean)
    └── Effect size computation (Cohen's d, Cramér's V)
                │
Insight-Driven Visualisation:
    (each plot annotated with interpretive commentary)
                │
Structured EDA Report (HTML with embedded narrative)
```
Four normality tests (Shapiro-Wilk, D'Agostino-Pearson, Lilliefors, Anderson-Darling) with Bonferroni correction and consensus verdict — replacing unreliable single-test assessments.
Per-sample outlier labelling via majority vote of IQR fence, Z-score threshold, and Isolation Forest — providing more robust outlier identification than any single method.
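The majority vote can be sketched as follows — a minimal illustration under assumed defaults (function name, 1.5×IQR fence, and threshold parameters are illustrative, not the project's actual API):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

def ensemble_outliers(x, z_threshold=3.0, contamination=0.05):
    """Flag a sample as an outlier if at least 2 of 3 methods agree."""
    x = np.asarray(x, dtype=float)

    # Method 1: IQR fence (1.5 * IQR beyond the quartiles)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    iqr_flag = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

    # Method 2: Z-score with configurable threshold
    z_flag = np.abs(stats.zscore(x)) > z_threshold

    # Method 3: Isolation Forest (predicts -1 for outliers)
    if_flag = IsolationForest(
        contamination=contamination, random_state=0
    ).fit_predict(x.reshape(-1, 1)) == -1

    # Majority vote across the three boolean masks
    votes = iqr_flag.astype(int) + z_flag.astype(int) + if_flag.astype(int)
    return votes >= 2
```

The vote makes the label robust to any single method's blind spot: the IQR fence ignores multimodality, the Z-score is itself inflated by extreme points, and Isolation Forest always flags its contamination quota.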
Fit 15+ parametric distributions (normal, log-normal, exponential, gamma, beta, Weibull) to each numerical column via MLE, select best fit via AIC/BIC, and display fitted PDF over histogram.
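The selection step can be sketched like this, using a reduced candidate set for brevity (the helper name is an assumption; the tool itself compares 15+ families):

```python
import numpy as np
from scipy import stats

def best_fit_distribution(x, candidates=("norm", "lognorm", "expon", "gamma")):
    """MLE-fit each candidate family and pick the lowest-AIC fit."""
    x = np.asarray(x, dtype=float)
    results = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(x)                      # maximum-likelihood estimates
        loglik = np.sum(dist.logpdf(x, *params))  # log-likelihood at the MLE
        aic = 2 * len(params) - 2 * loglik        # Akaike information criterion
        results[name] = (aic, params)
    best = min(results, key=lambda n: results[n][0])
    return best, results[best][1]
```

AIC penalises each extra shape parameter, so heavier-tailed families only win when the data genuinely support them; swapping in BIC (`len(params) * np.log(len(x))` as the penalty) is a one-line change.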
All group comparisons accompanied by effect size metrics: Cohen's d for continuous variables, Cramér's V for categorical associations — making statistical significance practically interpretable.
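The two effect-size measures are simple to compute directly; a sketch with assumed helper names (Cohen's d with a pooled standard deviation, Cramér's V from a contingency table):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardised mean difference between two groups."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt(
        ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
        / (len(a) + len(b) - 2)
    )
    return (a.mean() - b.mean()) / pooled

def cramers_v(table):
    """Association strength (0..1) for a categorical contingency table."""
    table = np.asarray(table, float)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))
```

Unlike a p-value, neither measure shrinks as the sample grows, which is what makes a "significant but d = 0.02" result easy to dismiss as practically irrelevant.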
Median, MAD, trimmed mean (10%), Winsorised mean alongside standard mean/std — providing outlier-resistant estimates that standard EDA tools omit.
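A compact version of that summary, sketched with an illustrative function name:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def robust_summary(x):
    """Outlier-resistant location/scale estimates next to mean/std."""
    x = np.asarray(x, dtype=float)
    return {
        "median": np.median(x),
        # scale="normal" rescales MAD to be comparable with sigma under normality
        "mad": stats.median_abs_deviation(x, scale="normal"),
        "trimmed_mean_10": stats.trim_mean(x, proportiontocut=0.10),
        "winsorized_mean_10": winsorize(x, limits=(0.10, 0.10)).mean(),
        "mean": x.mean(),
        "std": x.std(ddof=1),
    }
```

Reporting both columns side by side makes contamination visible at a glance: when mean and trimmed mean diverge sharply, a handful of extreme values is driving the classical estimate.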
Every generated chart includes a one-sentence analytical interpretation in the figure caption, stating the key pattern to notice — implementing Tufte's principle that graphics should be self-explanatory.
ADF, KPSS, and Phillips-Perron unit root tests for datetime-indexed data, with conflict resolution guidance (integrated vs. trend-stationary distinction) and differencing recommendations.
Structured conclusion section with prioritised list of preprocessing steps recommended before modelling: specific columns to transform, encode, impute, or drop — with justification.
| Library / Tool | Role | Why This Choice |
|---|---|---|
| pandas | Data manipulation | Type inference, group operations, time-series indexing |
| SciPy | Statistical tests | Normality tests, distribution fitting, KS test |
| statsmodels | Time-series analysis | ADF, KPSS, PP stationarity tests, ACF/PACF |
| scikit-learn | Outlier detection | Isolation Forest for ensemble outlier labelling |
| Plotly / Matplotlib | Annotated visualisation | Tufte-inspired charts with interpretive captions |
| Streamlit | Interactive interface | Dataset upload, column selection, report generation |
| pdfkit / WeasyPrint | Report export | HTML-to-PDF report generation |
- Python 3.9+
- A virtual environment manager (`venv`, `conda`, or equivalent)
- Environment variables from the Configuration section (optional; defaults apply)
```bash
git clone https://github.com/Devanik21/EDA--improved.git
cd EDA--improved
python -m venv venv && source venv/bin/activate
pip install pandas scipy statsmodels scikit-learn plotly matplotlib streamlit
```

```bash
# Launch improved EDA interface
streamlit run app.py

# Run full rigorous EDA from CLI
python eda.py --data housing.csv --target price --output eda_report.html

# Normality assessment only
python normality.py --data data.csv --columns age,income --alpha 0.05

# Outlier detection with ensemble method
python outliers.py --data data.csv --method ensemble --output flagged.csv

# Time-series stationarity analysis
python stationarity.py --data timeseries.csv --value_col price --date_col date
```

| Variable | Default | Description |
|---|---|---|
| `NORMALITY_ALPHA` | `0.05` | Significance level for normality tests |
| `OUTLIER_Z_THRESHOLD` | `3.0` | Z-score threshold for outlier flagging |
| `OUTLIER_IF_CONTAMINATION` | `0.05` | Isolation Forest contamination parameter |
| `DIST_N_CANDIDATES` | `15` | Number of distributions to fit and compare |
| `TUFTE_MODE` | `True` | Enable minimalist Tufte-style chart formatting |
Copy `.env.example` to `.env` and populate required values before running.
```
EDA--improved/
├── README.md
├── Stackoverflow_Survey_Analysis-checkpoint.ipynb
└── ...
```
- Causal discovery integration: PC algorithm for skeleton graph identification from observational data
- Natural language EDA narrative generation: LLM summarises the full EDA in plain English for stakeholder reporting
- Longitudinal EDA: track dataset statistics across dataset versions to detect drift over time
- Bayesian hypothesis testing module: Bayes factors as alternatives to frequentist p-values
- Multi-dataset comparative EDA: test for distribution shift between train/test or historical/current datasets
Contributions, issues, and suggestions are welcome.
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-idea`
- Commit your changes: `git commit -m 'feat: add your idea'`
- Push to your branch: `git push origin feature/your-idea`
- Open a Pull Request with a clear description
Please follow conventional commit messages and add documentation for new features.
The multi-test normality framework applies Bonferroni correction for multiple comparisons — individual test p-values are multiplied by the number of tests before applying the α threshold. This conservative approach reduces false positives at the cost of lower power. For small samples (<50), rely primarily on Shapiro-Wilk; for large samples (>5,000), all tests will detect trivial departures from normality that are practically irrelevant.
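The consensus mechanism can be sketched as follows. This is an illustration, not the project's code: the function name, the majority-rule verdict, and the use of `statsmodels` for the Lilliefors and Anderson-Darling p-values are assumptions.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors, normal_ad

def normality_consensus(x, alpha=0.05):
    """Bonferroni-corrected consensus verdict over four normality tests."""
    x = np.asarray(x, dtype=float)
    pvals = {
        "shapiro": stats.shapiro(x)[1],            # intended for n <= 5000
        "dagostino_pearson": stats.normaltest(x)[1],
        "lilliefors": lilliefors(x, dist="norm")[1],
        "anderson_darling": normal_ad(x)[1],
    }
    # Bonferroni: multiply each p-value by the number of tests (capped at 1)
    corrected = {name: min(1.0, p * len(pvals)) for name, p in pvals.items()}
    rejections = sum(p < alpha for p in corrected.values())
    verdict = "non-normal" if rejections >= len(pvals) / 2 else "plausibly normal"
    return verdict, corrected
```

Because the four tests are applied to the same sample they are positively correlated, so Bonferroni is conservative here by construction, which is exactly the trade-off the note above describes.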
Devanik Debnath
B.Tech, Electronics & Communication Engineering
National Institute of Technology Agartala
This project is open source and available under the MIT License.
Built with curiosity, depth, and care — because good projects deserve good documentation.