Exploratory data analysis, elevated — a deeply rigorous, statistically grounded EDA framework with publication-ready visualisations and actionable analytical commentary.
Topics: python · data-analysis · data-science · exploratory-data-analysis · machine-learning · matplotlib · pandas · seaborn · statistical-profiling · visualization
EDA Improved is a second-generation exploratory data analysis toolkit that addresses the limitations of typical automated EDA tools: shallow statistical coverage, generic visualisations that rarely surface genuine insight, and the absence of analytical narrative that connects observations to actionable next steps. This framework was built by someone who had used dozens of EDA tools and wanted something that went deeper in every dimension.
The statistical layer is substantially more thorough than standard EDA. Normality testing applies four tests simultaneously — Shapiro-Wilk (≤5,000 samples), D'Agostino-Pearson, Lilliefors, and Anderson-Darling — with a Bonferroni-corrected consensus verdict, rather than relying on a single test. Outlier detection uses three methods — IQR fence, Z-score with configurable threshold, and Isolation Forest — with a per-sample majority-vote outlier label. Stationarity testing for time-series datasets uses ADF, KPSS, and PP tests with conflict resolution guidance.
The visualisation philosophy rejects chart maximalism: every plot has a specific analytical purpose, follows Tufte's data-ink ratio principle, and is accompanied by a one-sentence interpretation that makes the takeaway explicit. A figure should never require the viewer to think 'what am I supposed to notice here?' — the annotation does that work.
The original EDA project was functional but generic. This improved version was motivated by a specific dissatisfaction: EDA tools that plot everything but explain nothing, that report statistics without interpretation, and that produce 50-page reports where the genuinely interesting patterns are buried under pages of boilerplate. EDA Improved embeds analytical judgment into the pipeline — producing fewer, more meaningful outputs with explicit interpretive commentary.
```
Dataset Input (CSV / Excel / Parquet)
                │
Quality Assessment Layer:
    ├── Four normality tests (with Bonferroni correction)
    ├── Three outlier methods (IQR, Z-score, Isolation Forest)
    └── Stationarity tests for datetime-indexed data
                │
Statistical Characterisation:
    ├── Distributional family fitting (best-fit distribution)
    ├── Robust statistics (median, MAD, trimmed mean)
    └── Effect size computation (Cohen's d, Cramér's V)
                │
Insight-Driven Visualisation:
    (each plot annotated with interpretive commentary)
                │
Structured EDA Report (HTML with embedded narrative)
```
Four normality tests (Shapiro-Wilk, D'Agostino-Pearson, Lilliefors, Anderson-Darling) with Bonferroni correction and consensus verdict — replacing unreliable single-test assessments.
Per-sample outlier labelling via majority vote of IQR fence, Z-score threshold, and Isolation Forest — providing more robust outlier identification than any single method.
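The majority vote can be sketched as follows — a minimal illustration under assumed defaults (function name, 1.5×IQR fence, and threshold parameters are illustrative, not the project's actual API):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

def ensemble_outliers(x, z_threshold=3.0, contamination=0.05):
    """Flag a sample as an outlier if at least 2 of 3 methods agree."""
    x = np.asarray(x, dtype=float)

    # Method 1: IQR fence (1.5 * IQR beyond the quartiles)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    iqr_flag = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

    # Method 2: Z-score with configurable threshold
    z_flag = np.abs(stats.zscore(x)) > z_threshold

    # Method 3: Isolation Forest (predicts -1 for outliers)
    if_flag = IsolationForest(
        contamination=contamination, random_state=0
    ).fit_predict(x.reshape(-1, 1)) == -1

    # Majority vote across the three boolean masks
    votes = iqr_flag.astype(int) + z_flag.astype(int) + if_flag.astype(int)
    return votes >= 2
```

The vote makes the label robust to any single method's blind spot: the IQR fence ignores multimodality, the Z-score is itself inflated by extreme points, and Isolation Forest always flags its contamination quota.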
Fit 15+ parametric distributions (normal, log-normal, exponential, gamma, beta, Weibull) to each numerical column via MLE, select best fit via AIC/BIC, and display fitted PDF over histogram.
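The selection step can be sketched like this, using a reduced candidate set for brevity (the helper name is an assumption; the tool itself compares 15+ families):

```python
import numpy as np
from scipy import stats

def best_fit_distribution(x, candidates=("norm", "lognorm", "expon", "gamma")):
    """MLE-fit each candidate family and pick the lowest-AIC fit."""
    x = np.asarray(x, dtype=float)
    results = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(x)                      # maximum-likelihood estimates
        loglik = np.sum(dist.logpdf(x, *params))  # log-likelihood at the MLE
        aic = 2 * len(params) - 2 * loglik        # Akaike information criterion
        results[name] = (aic, params)
    best = min(results, key=lambda n: results[n][0])
    return best, results[best][1]
```

AIC penalises each extra shape parameter, so heavier-tailed families only win when the data genuinely support them; swapping in BIC (`len(params) * np.log(len(x))` as the penalty) is a one-line change.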
All group comparisons accompanied by effect size metrics: Cohen's d for continuous variables, Cramér's V for categorical associations — making statistical significance practically interpretable.
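The two effect-size measures are simple to compute directly; a sketch with assumed helper names (Cohen's d with a pooled standard deviation, Cramér's V from a contingency table):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardised mean difference between two groups."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt(
        ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
        / (len(a) + len(b) - 2)
    )
    return (a.mean() - b.mean()) / pooled

def cramers_v(table):
    """Association strength (0..1) for a categorical contingency table."""
    table = np.asarray(table, float)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))
```

Unlike a p-value, neither measure shrinks as the sample grows, which is what makes a "significant but d = 0.02" result easy to dismiss as practically irrelevant.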
Median, MAD, trimmed mean (10%), Winsorised mean alongside standard mean/std — providing outlier-resistant estimates that standard EDA tools omit.
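A compact version of that summary, sketched with an illustrative function name:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

def robust_summary(x):
    """Outlier-resistant location/scale estimates next to mean/std."""
    x = np.asarray(x, dtype=float)
    return {
        "median": np.median(x),
        # scale="normal" rescales MAD to be comparable with sigma under normality
        "mad": stats.median_abs_deviation(x, scale="normal"),
        "trimmed_mean_10": stats.trim_mean(x, proportiontocut=0.10),
        "winsorized_mean_10": winsorize(x, limits=(0.10, 0.10)).mean(),
        "mean": x.mean(),
        "std": x.std(ddof=1),
    }
```

Reporting both columns side by side makes contamination visible at a glance: when mean and trimmed mean diverge sharply, a handful of extreme values is driving the classical estimate.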
Every generated chart includes a one-sentence analytical interpretation in the figure caption, stating the key pattern to notice — implementing Tufte's principle that graphics should be self-explanatory.
ADF, KPSS, and Phillips-Perron unit root tests for datetime-indexed data, with conflict resolution guidance (integrated vs. trend-stationary distinction) and differencing recommendations.
Structured conclusion section with prioritised list of preprocessing steps recommended before modelling: specific columns to transform, encode, impute, or drop — with justification.
| Library / Tool | Role | Why This Choice |
|---|---|---|
| pandas | Data manipulation | Type inference, group operations, time-series indexing |
| SciPy | Statistical tests | Normality tests, distribution fitting, KS test |
| statsmodels | Time-series analysis | ADF, KPSS, PP stationarity tests, ACF/PACF |
| scikit-learn | Outlier detection | Isolation Forest for ensemble outlier labelling |
| Plotly / Matplotlib | Annotated visualisation | Tufte-inspired charts with interpretive captions |
| Streamlit | Interactive interface | Dataset upload, column selection, report generation |
| pdfkit / WeasyPrint | Report export | HTML-to-PDF report generation |
- Python 3.9+
- A virtual environment manager (`venv`, `conda`, or equivalent)
- Environment variables from the Configuration section (optional; defaults apply)
```bash
git clone https://github.com/Devanik21/EDA--improved.git
cd EDA--improved
python -m venv venv && source venv/bin/activate
pip install pandas scipy statsmodels scikit-learn plotly matplotlib streamlit
```

```bash
# Launch improved EDA interface
streamlit run app.py

# Run full rigorous EDA from CLI
python eda.py --data housing.csv --target price --output eda_report.html

# Normality assessment only
python normality.py --data data.csv --columns age,income --alpha 0.05

# Outlier detection with ensemble method
python outliers.py --data data.csv --method ensemble --output flagged.csv

# Time-series stationarity analysis
python stationarity.py --data timeseries.csv --value_col price --date_col date
```

| Variable | Default | Description |
|---|---|---|
| `NORMALITY_ALPHA` | `0.05` | Significance level for normality tests |
| `OUTLIER_Z_THRESHOLD` | `3.0` | Z-score threshold for outlier flagging |
| `OUTLIER_IF_CONTAMINATION` | `0.05` | Isolation Forest contamination parameter |
| `DIST_N_CANDIDATES` | `15` | Number of distributions to fit and compare |
| `TUFTE_MODE` | `True` | Enable minimalist Tufte-style chart formatting |
Copy `.env.example` to `.env` and populate required values before running.
```
EDA--improved/
├── README.md
├── Stackoverflow_Survey_Analysis-checkpoint.ipynb
└── ...
```
- Causal discovery integration: PC algorithm for skeleton graph identification from observational data
- Natural language EDA narrative generation: LLM summarises the full EDA in plain English for stakeholder reporting
- Longitudinal EDA: track dataset statistics across dataset versions to detect drift over time
- Bayesian hypothesis testing module: Bayes factors as alternatives to frequentist p-values
- Multi-dataset comparative EDA: test for distribution shift between train/test or historical/current datasets
Contributions, issues, and suggestions are welcome.
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-idea`
- Commit your changes: `git commit -m 'feat: add your idea'`
- Push to your branch: `git push origin feature/your-idea`
- Open a Pull Request with a clear description
Please follow conventional commit messages and add documentation for new features.
The multi-test normality framework applies Bonferroni correction for multiple comparisons — individual test p-values are multiplied by the number of tests before applying the α threshold. This conservative approach reduces false positives at the cost of lower power. For small samples (<50), rely primarily on Shapiro-Wilk; for large samples (>5,000), all tests will detect trivial departures from normality that are practically irrelevant.
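The consensus mechanism can be sketched as follows. This is an illustration, not the project's code: the function name, the majority-rule verdict, and the use of `statsmodels` for the Lilliefors and Anderson-Darling p-values are assumptions.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors, normal_ad

def normality_consensus(x, alpha=0.05):
    """Bonferroni-corrected consensus verdict over four normality tests."""
    x = np.asarray(x, dtype=float)
    pvals = {
        "shapiro": stats.shapiro(x)[1],            # intended for n <= 5000
        "dagostino_pearson": stats.normaltest(x)[1],
        "lilliefors": lilliefors(x, dist="norm")[1],
        "anderson_darling": normal_ad(x)[1],
    }
    # Bonferroni: multiply each p-value by the number of tests (capped at 1)
    corrected = {name: min(1.0, p * len(pvals)) for name, p in pvals.items()}
    rejections = sum(p < alpha for p in corrected.values())
    verdict = "non-normal" if rejections >= len(pvals) / 2 else "plausibly normal"
    return verdict, corrected
```

Because the four tests are applied to the same sample they are positively correlated, so Bonferroni is conservative here by construction, which is exactly the trade-off the note above describes.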
Devanik Debnath
B.Tech, Electronics & Communication Engineering
National Institute of Technology Agartala
This project is open source and available under the MIT License.
Built with curiosity, depth, and care — because good projects deserve good documentation.