Merged
25 changes: 0 additions & 25 deletions ..github/workflows/release.yaml

This file was deleted.

Binary file modified .coverage
Binary file not shown.
52 changes: 52 additions & 0 deletions .github/workflows/release.yaml
@@ -0,0 +1,52 @@
name: Build and Release

on:
  push:
    branches: [master]
    tags: ['v*']
  pull_request:
    branches: [master]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set Up Python
        uses: actions/setup-python@v5

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Install dependencies
        run: |
          uv sync --dev --all-extras

      - name: Run tests (small dataset)
        run: |
          uv run pytest -m "not large"
        env:
          TEST_SIZE: small

  build:
    needs: test
    runs-on: ubuntu-latest
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4

      - name: Set Up Python
        uses: actions/setup-python@v5

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Build package
        run: uv build --no-sources

      # - name: Publish to PyPI
      #   run: uv publish
      #   env:
      #     UV_PUBLISH_TOKEN: ${{ secrets.PYPI_TOKEN }}
31 changes: 31 additions & 0 deletions .github/workflows/test.yaml
@@ -0,0 +1,31 @@
name: Build and Test

on:
  push:
    branches: [dev]
    tags: ['v*']
  pull_request:
    branches: [dev]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set Up Python
        uses: actions/setup-python@v5

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Install dependencies
        run: |
          uv sync --dev --all-extras

      - name: Run tests (small dataset)
        run: |
          uv run pytest -m "not large"
        env:
          TEST_SIZE: small
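
Note on `TEST_SIZE`: the workflow exports `TEST_SIZE: small`, but the diff does not show how the test suite reads it. A minimal sketch of one way a fixture helper could consume the variable follows; `ROWS_BY_SIZE`, `dataset_rows` and the row counts are assumptions made for illustration, not names or values from this repository.

```python
import os

# Hypothetical row counts per dataset size; the real values are not shown in the diff.
ROWS_BY_SIZE = {"small": 100, "large": 100_000}


def dataset_rows(env=None) -> int:
    """Return how many rows a generated test dataset should have."""
    # Fall back to the process environment when no mapping is passed in.
    env = os.environ if env is None else env
    size = env.get("TEST_SIZE", "small").lower()
    if size not in ROWS_BY_SIZE:
        raise ValueError(f"Unsupported TEST_SIZE: {size!r}")
    return ROWS_BY_SIZE[size]
```

Passing the environment as an argument keeps the helper easy to test without mutating `os.environ`.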
35 changes: 30 additions & 5 deletions .gitignore
@@ -1,11 +1,36 @@
# Python-generated files
__pycache__/
*.py[oc]
wheels/

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
wheels/
*.egg-info
eggs/
lib/
lib64/
*.egg-info/
.installed.cfg
*.manifest

# Pytest cache
.pytest_cache/

# Virtual environment
.env/
venv/

# IDE specific files
.vscode/
.idea/

# Operating System files
.DS_Store
Thumbs.db

# Virtual environments
.venv
.old
.old/
84 changes: 77 additions & 7 deletions README.md
@@ -1,11 +1,81 @@
# 👨‍🔬 bloodhound-utils
<img width="550" height="188" alt="mindhunter-header" src="https://github.com/user-attachments/assets/47fbbe27-251b-4961-80dc-809c73020d10" />

My personal stash of Python utilities for statistical, probabilistic and data analysis.
# 🐯 mindhunter

## ℹ️ What does it include:
Extensions for DataFrames to make statistical and analysis operations much, *much* more comfortable and convenient. Turns your `DataFrame` into a `StatFrame`, composing Mindhunter's new features *over* it, supercharging its capabilities without sacrificing compatibility.

- `StatisticalObject`: a wrapper for pandas DataFrames that adds a bunch of common statistical analysis operations to that instance.
- Clean and normalize DataFrame column names to lower-case, snake-case text.
- Retrieve columns by name.
- Get several statistical values: (CV, Z-score, PSD, etc.)
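
As a rough illustration of the column-name cleanup mentioned in the list above, here is a minimal sketch using only the standard library. `to_snake_case` is a hypothetical helper written for this example; the library's actual implementation is not shown here and may differ.

```python
import re


def to_snake_case(name: str) -> str:
    """Normalize a column label to lower-case snake_case (illustrative helper)."""
    # Split camelCase boundaries: "unitPrice" -> "unit_Price".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace any run of non-alphanumeric characters with a single underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


columns = ["Order ID", "unitPrice", " Total-Amount (USD) "]
cleaned = [to_snake_case(c) for c in columns]
```

With pandas, the same function could be applied through `df.rename(columns=to_snake_case)`.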
---

## 📦 Installation

### 🗃️ From the repo:
You need `uv` to build the module.

- Clone the repository
- `chmod +x ./build.sh`
- `./build.sh`
- It will clear cache, build, install and test the module.


## 🧪 Testing
Mindhunter implements a fairly rudimentary setup for testing. It will look inside `tests` for any fixtures or tests inside files starting with `test_`. It uses `pytest` and `faker` to create a randomised dataset to test upon.

So far, coverage goes to the extent of making sure a `StatFrame` can be created and data can be obtained. More testing is being developed and it's coming soon.


## 📝 Features

### 📋 Meet `StatFrame` and the crew

- Your new `StatFrame` can now be used with Mindhunter's new **Analyzers, Plotters and Toolkits:**
  - `DistributionAnalyzer`: adds normal distribution utilities directly on top of the `DataFrame`.
  - `HypothesisAnalyzer`: adds hypothesis testing, binomial and related functionality.
  - `AnalyticalTools`: provides access to `scipy.stats` methods to generate and convert several values over a given `StatFrame`.
  - `StatPlotter`: adds ready-to-go plotting capabilities for many common values, like z-scores, Coefficient of Variation, Normal Distribution, and others, using `seaborn` and `matplotlib.pyplot`.
  - `StatVisualizer`: provides easy access to build common graphs and visualizations, returning ready-to-go graphs just by passing lists or a `StatFrame`.

### 💾 Quick stats and cached values
- `StatFrame` also holds a cache of the most commonly used values and variables, providing easy access to the values of not just a column, but of a whole set. It caches:
  - **Central Tendency:**
    - mean
    - median
    - mode
  - **Spread/Variability:**
    - std (standard deviation)
    - variance
    - range
    - iqr (interquartile range)
    - mad (median absolute deviation)
  - **Distribution Shape:**
    - skewness
    - kurtosis
  - **Data Quality:**
    - count
    - missing_count
    - missing_pct
  - **Extreme Values:**
    - min
    - max
    - q1
    - q3
  - **Key Ratios:**
    - cv (coefficient of variation)
    - sem (standard error of mean)
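
For reference, several of the cached values listed above can be computed for a single numeric column as in the sketch below. It uses only Python's `statistics` module; `quick_stats` is a hypothetical name for this example, and the sample standard deviation (n - 1 denominator) is an assumption chosen to match pandas' default, not something the README specifies.

```python
import math
import statistics


def quick_stats(values):
    """Compute a few of the summary values listed above for one numeric column."""
    data = sorted(values)
    n = len(data)
    mean = statistics.mean(data)
    median = statistics.median(data)
    std = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    # Quartiles with linear interpolation between the closest ranks.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": mean,
        "median": median,
        "std": std,
        "variance": statistics.variance(data),
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
        # Median of the absolute deviations from the median.
        "mad": statistics.median(abs(x - median) for x in data),
        "cv": std / mean,           # coefficient of variation
        "sem": std / math.sqrt(n),  # standard error of the mean
    }


stats = quick_stats([2, 4, 4, 4, 5, 5, 7, 9])
```

Note that `cv` is undefined when the mean is zero, so a real implementation would need to guard that case.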

### 🧹 Auto-cleanup:
- Mindhunter can also **automatically clean column names and drop NaN values and duplicates** from datasets. It also provides methods to **locate, analyze and remove zero-values** from your dataset.
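
The cleanup steps described above correspond roughly to the plain-pandas operations sketched below. This only illustrates the idea with standard pandas calls; it does not show Mindhunter's own methods, whose names and behavior are not documented in this README.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Order ID": [1, 2, 2, 3, 4],
        "Unit Price": [9.5, 0.0, 0.0, None, 12.0],
    }
)

# Normalize column labels to lower-case snake_case.
df.columns = df.columns.str.strip().str.lower().str.replace(r"\W+", "_", regex=True)

# Drop rows with missing values, then exact duplicate rows.
cleaned = df.dropna().drop_duplicates().reset_index(drop=True)

# Locate zero values: count them per column, then drop the rows that contain any.
zero_counts = (cleaned == 0).sum()
without_zeros = cleaned[(cleaned != 0).all(axis=1)]
```

Whether a zero is a legitimate measurement or a placeholder for missing data depends on the dataset, so it is worth inspecting `zero_counts` before removing anything.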

---

## ℹ️ But, why?

I've been studying data analysis and, over the months, I've been collecting a bunch of little methods and scripts to do my homework. It got to the point where there was an 800+ line cell in each Jupyter Notebook. It became a *bit* too much.

### 🏗️ How does it work on the inside:

In short: it uses basic OOP **composition**, against all advice, passing the `StatFrame` as an argument. That class holds the `DataFrame` itself, and all operations go through the `StatFrame` and act directly on the source DF. Calling `update()` re-triggers the caching process.
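
To make the composition idea concrete, here is a minimal sketch under stated assumptions. `StatFrameSketch` and `ZScoreAnalyzerSketch` are hypothetical stand-ins written for this example; only the general shape (a holder that owns the `DataFrame`, helpers that receive the holder as an argument, and an `update()` that rebuilds the cache) comes from the description above.

```python
import pandas as pd


class StatFrameSketch:
    """Holds a DataFrame and a cache of summary values (hypothetical stand-in)."""

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.cache = {}
        self.update()

    def update(self) -> None:
        """Recompute cached statistics from the current state of the DataFrame."""
        numeric = self.df.select_dtypes("number")
        self.cache = {
            "mean": numeric.mean().to_dict(),
            "std": numeric.std().to_dict(),
        }


class ZScoreAnalyzerSketch:
    """Receives the holder as an argument instead of inheriting from it."""

    def __init__(self, frame: StatFrameSketch):
        self.frame = frame

    def z_scores(self, column: str) -> pd.Series:
        # Read the cached values rather than recomputing them on every call.
        mean = self.frame.cache["mean"][column]
        std = self.frame.cache["std"][column]
        return (self.frame.df[column] - mean) / std


frame = StatFrameSketch(pd.DataFrame({"score": [1.0, 2.0, 3.0]}))
analyzer = ZScoreAnalyzerSketch(frame)
z = analyzer.z_scores("score")

frame.df.loc[len(frame.df)] = [10.0]  # mutate the source DataFrame in place
frame.update()                        # refresh the cache after the change
```

The trade-off of this design is that the cache can go stale: any in-place change to the `DataFrame` needs a matching `update()` call before cached values are read again.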

### 🔮 So, what's the future?


This library will be updated fairly regularly, as I start collecting and tidying up more and more little tools, and taking more advantage of the internal mechanisms. I am *much* more of a developer than a data analyst, so I need much more help knowing what the community *needs* for me to keep on improving the library. If you have any issue, suggestion or comment, feel free to create a new issue!
32 changes: 0 additions & 32 deletions a.py

This file was deleted.

3 changes: 2 additions & 1 deletion build.sh
@@ -1,6 +1,7 @@
#! /bin/sh
# for local builds only:

uv clean && rm -rf .pytest_cache .coverage htmlcov dist build *.egg-info
uv build
uv build --link-mode=copy
uv pip install .
uv run pytest -v
Binary file added docs/dw-mindhunter.png
Binary file added docs/mindhunter-header.png
35 changes: 17 additions & 18 deletions pyproject.toml
@@ -5,35 +5,34 @@ build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["src/mindhunter"]
dependencies = [
"pandas>=2.0.0",
"numpy>=1.24.0",
"seaborn>=0.12.0",
"matplotlib>=3.7.0",
"scipy>=1.10.0",
]

[project.optional-dependencies]
dev = [
"faker>=37.8.0",
"matplotlib>=3.10.6",
"numpy>=2.3.3",
"pandas>=2.3.3",
"seaborn>=0.13.2",
"pytest>=7.4.0",
"pytest-cov>=4.1.0",
"faker>=19.0.0",
"scipy>=1.10.0",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
addopts = "--cov=src/project_name --cov-report=term-missing"

[tool.uv]
dev-dependencies = [
"pytest>=7.4.0",
"pytest-cov>=4.1.0",
"faker>=19.0.0",
]
addopts = "--cov=src/mindhunter --cov-report=term-missing"

[project]
name = "mindhunter"
version = "0.1.0"
description = "DataFrame extensions for data analysis."
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"faker>=37.8.0",
"matplotlib>=3.10.6",
"numpy>=2.3.3",
"pandas>=2.3.3",
"seaborn>=0.13.2",
"pytest>=7.4.0",
"pytest-cov>=4.1.0",
"scipy>=1.10.0",
]
22 changes: 11 additions & 11 deletions src/mindhunter/__init__.py
@@ -5,26 +5,26 @@

"""
# core
from .core.analyzer import DataAnalyzer
from .mindhunter import StatFrame

# statistics
from .statistics.distributions import DistributionAnalyzer
from .statistics.hypothesis_tests import HypothesisTester
from .statistics.hypothesis import HypothesisAnalyzer

# utils
from .utils.toolkit import AnalysisToolkit
from .utils.toolkit import AnalyticalTools

# visualization
from .visualization.stat_plotter import StatisticalPlotter
from .visualization.plotter import Visualizer
from .visualization.stat_plotter import StatPlotter
from .visualization.visualizer import StatVisualizer

__version__ = '0.1.0'
__name__ = 'mindhunter'
__all__ = [
'DataAnalyzer',
'StatFrame',
'DistributionAnalyzer',
'HypothesisTester',
'AnalysisToolkit',
'StatisticalPlotter',
'Visualizer',
]
'HypothesisAnalyzer',
'AnalyticalTools',
'StatPlotter',
'StatVisualizer',
]
3 changes: 0 additions & 3 deletions src/mindhunter/core/__init__.py

This file was deleted.
