Commit 2fa248a

Finalizing release: Update README and exclude data files
1 parent 9ed81d0 commit 2fa248a

File tree: 4 files changed (+163, -1000008 lines)

.gitignore (2 additions, 1 deletion)

```diff
@@ -43,8 +43,9 @@ htmlcov/
 nosetests.xml
 coverage.xml
 
-# csv files
+# data files
 *.csv
+*.parquet
 
 # Translations
 *.mo
```

README.md (161 additions, 5 deletions)
```diff
@@ -5,11 +5,17 @@
 [![Rust](https://img.shields.io/badge/built%20with-Rust-orange)](https://www.rust-lang.org/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
-**Phaeton** is a high-performance, memory-efficient data cleaning engine for Python, powered by **Rust**.
+> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.0)**.
+> The core streaming engine is functional, but the library is under limited maintenance due to the author's personal schedule.
 
-It is designed to be the **"Gatekeeper"** of your data pipeline. Phaeton sanitizes, validates, and standardizes massive datasets (GBs/TBs) using a streaming architecture before they enter your analysis tools (like Pandas, Polars, or ML models).
 
-> **Why Phaeton?** Because cleaning 10GB of dirty CSVs shouldn't require 32GB of RAM.
+**Phaeton** is a specialized, Rust-powered preprocessing engine designed to sanitize raw data streams before they reach your analytical environment.
+
+It acts as the strictly typed **"Gatekeeper"** of your data pipeline. Unlike traditional DataFrame libraries that load entire datasets into RAM, Phaeton employs a **zero-copy streaming architecture**. It processes data chunk-by-chunk, filtering noise, fixing encodings, and standardizing formats, ensuring **O(1) memory complexity**.
+
+This allows you to process massive datasets (GBs/TBs) on standard hardware without memory spikes, delivering clean, high-quality data to downstream tools like Pandas, Polars, or ML models.
+
+> **The Philosophy:** Don't waste memory loading garbage. Clean the stream first, then analyze the gold.
 
 ---
```
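The O(1) memory claim in this hunk comes down to holding only one small chunk of rows in memory at a time. A minimal pure-Python sketch of the pattern (illustrative only, not Phaeton's Rust internals; the `transform` callback stands in for the pipeline stages):

```python
import csv
import io

def stream_clean(src, transform, chunk_size=2):
    """Read rows lazily and yield cleaned chunks: memory stays
    proportional to chunk_size, not to the size of the input."""
    reader = csv.reader(src)
    next(reader)  # skip the header row
    chunk = []
    for row in reader:
        chunk.append(transform(row))
        if len(chunk) >= chunk_size:
            yield chunk          # hand a small batch downstream
            chunk = []           # drop the reference; memory stays flat
    if chunk:
        yield chunk

# Toy input standing in for a multi-gigabyte CSV
raw = io.StringIO("name,city\n alice ,Chicago\nbob, Jakarta \n carol ,Shanghai\n")
chunks = list(stream_clean(raw, lambda r: [c.strip() for c in r]))
print(chunks)
```

A real streaming engine would read from disk and write each cleaned chunk out immediately instead of collecting the results in a list.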

```diff
@@ -18,7 +24,7 @@ It is designed to be the **"Gatekeeper"** of your data pipeline. Phaeton sanitiz
 * **Streaming Architecture:** Processes files chunk-by-chunk. Memory usage remains flat and low regardless of file size.
 * **Parallel Execution:** Utilizes all CPU cores via Rayon (Rust) for heavy lifting (Regex, Fuzzy Matching).
 * **Strict Quarantine:** Bad data isn't just dropped; it's quarantined into a separate file with a generated `_phaeton_reason` column for auditing.
-* **Smart Casting:** Automatically handles messy currency formats (e.g., `"Rp 5.000,00"` → `5000.0` float) without manual string parsing.
+* **Smart Casting:** Automatically handles messy currency formats (e.g., `"$ 5.000,00"` → `5000.0` float) without manual string parsing.
 * **Zero-Copy Logic:** Built on Rust's `Cow<str>` to minimize memory allocation during processing.
 
 ---
```
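The Smart Casting bullet above boils down to normalizing locale-specific separators before casting. A hedged pure-Python sketch of the idea (Phaeton's actual Rust implementation may differ; `smart_cast_int` is a hypothetical helper named here only for illustration, and it assumes European-style `.`-thousands / `,`-decimal input as in the README example):

```python
import re

def smart_cast_int(value):
    """Illustrative currency cast: strip symbols and whitespace, treat '.'
    as a thousands separator and ',' as the decimal mark, then cast.
    Returns None on failure (Phaeton would quarantine the row instead)."""
    s = re.sub(r"[^\d.,-]", "", value)        # drop currency symbols, spaces
    if not s:
        return None
    s = s.replace(".", "").replace(",", ".")  # "5.000,00" -> "5000.00"
    try:
        return int(float(s))
    except ValueError:
        return None

print(smart_cast_int("$ 5.000,00"))    # -> 5000
print(smart_cast_int("Rp 30.000,00"))  # -> 30000
print(smart_cast_int("Free"))          # -> None
```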
```diff
@@ -31,4 +37,154 @@ pip install phaeton
 
 ## ⚡ Key Features
 
-1. The Scenario
+**1. The Scenario**
```

The remainder of this hunk is pure addition; the new content is shown rendered:

**1. The Scenario**

You have a dirty CSV (`raw_data.csv`) with mixed encodings, typos in city names, and messy currency strings. You want a clean Parquet file for Pandas.

**2. The Code**
```python
import phaeton

# 1. Probe the file (auto-detect encoding, delimiter, headers, etc.)
info = phaeton.probe("raw_data.csv")
print(f"Detected: {info['encoding']} with delimiter '{info['delimiter']}'")

# 2. Initialize Engine (0 = use all CPU cores)
eng = phaeton.Engine(workers=0)

# 3. Build the Pipeline
pipeline = (
    eng.ingest("raw_data.csv")

    # GATEKEEPING: Fix encoding & standardize headers
    .decode(encoding=info['encoding'])
    .headers(style="snake")

    # ELIMINATION: Remove useless rows
    .prune(col="email")  # Drop rows with empty email
    .discard(col="status", match="BANNED", mode="exact")

    # TRANSFORMATION: Smart Cleaning
    # "$ 30.000,00" -> 30000 (integer)
    # If casting fails (e.g., "Free"), the row is sent to Quarantine
    .cast("salary", type="int", clean=True, on_error="quarantine")

    # FUZZY FIXING: Fix typos ("Cihcago" -> "Chicago")
    .fuzzyalign(
        col="city",
        ref=["Chicago", "Jakarta", "Shanghai"],
        threshold=0.85
    )

    # OUTPUT: Split into clean data & audit log
    .quarantine("bad_data_audit.csv")
    .dump("clean_data.parquet")
)

# 4. Execute (Rust takes over)
stats = eng.exec([pipeline])

print(f"Processed: {stats.processed} rows")
print(f"Saved: {stats.saved} | Quarantined: {stats.quarantined}")
```
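The `.fuzzyalign` step in the pipeline above snaps typos to a reference list. A rough stdlib approximation of the same idea using `difflib` (Phaeton uses Levenshtein/Jaro-Winkler in Rust, so its `threshold` is not directly comparable to `difflib`'s similarity ratio):

```python
import difflib

def fuzzy_align(value, ref, threshold=0.85):
    """Snap a possibly misspelled value to the closest reference entry,
    or keep it unchanged if nothing is similar enough."""
    matches = difflib.get_close_matches(value, ref, n=1, cutoff=threshold)
    return matches[0] if matches else value

cities = ["Chicago", "Jakarta", "Shanghai"]
print(fuzzy_align("Cihcago", cities))  # -> "Chicago"
print(fuzzy_align("Berlin", cities))   # no close match: returned unchanged
```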
<br>

## 📊 Performance Benchmark

Phaeton is optimized for "dirty data" scenarios (string parsing, regex filtering, fuzzy matching).

**Test Environment:**
- **Dataset:** 1 million rows (mixed dirty data: typos, currency strings, encoding issues).
- **Hardware:** Entry-level laptop.

**Result:**

| Metric | Phaeton |
| :---: | :---: |
| Speed | ~575,000 rows/sec |
| Memory Usage | ~50 MB (constant) |
| Strategy | Parallel Streaming |

<br>

> Note: Phaeton maintains a low memory footprint even when processing multi-gigabyte files, due to its zero-copy streaming architecture.

<br>
## 📚 API Reference

### Root Module <br>

| Method | Description |
| :--- | :--- |
| `phaeton.probe(path)` | Detects encoding (e.g., Windows-1252) and delimiter automatically. |

### Pipeline Methods <br>

These methods are chainable.

#### Transformation & Cleaning

Methods focused on data sanitization and ETL logic.

| Method | Description |
| :--- | :--- |
| `.decode(encoding)` | Fixes file encoding (e.g., `latin-1` or `cp1252`). **Mandatory** as the first step if encoding is broken. |
| `.scrub(col, mode)` | Basic string cleaning. <br> **Modes:** `'trim'`, `'lower'`, `'upper'`, `'currency'`, `'html'`. |
| `.fuzzyalign(col, ref, threshold)` | Fixes typos using *Levenshtein distance* against a reference list. |
| `.reformat(col, to, from)` | Standardizes date formats to ISO-8601 or custom formats. |
| `.cast(col, type, clean)` | **Smart Cast.** Converts types (`int`/`float`/`bool`). <br> Set `clean=True` to strip non-numeric chars before casting. |
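As a sketch of what `.reformat`-style date standardization does, parsing with a source pattern and re-emitting as ISO-8601 (the format tokens Phaeton itself accepts are not documented here; this assumes `strptime`-style strings):

```python
from datetime import datetime

def reformat_date(value, src_fmt, dst_fmt="%Y-%m-%d"):
    """Illustrative date standardization: parse with the source format,
    re-emit in the target format (ISO-8601 by default)."""
    return datetime.strptime(value, src_fmt).strftime(dst_fmt)

print(reformat_date("31/12/2023", "%d/%m/%Y"))     # -> "2023-12-31"
print(reformat_date("Dec 31, 2023", "%b %d, %Y"))  # -> "2023-12-31"
```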
#### Structure & Security

Methods for column management and data privacy.

| Method | Description |
| :--- | :--- |
| `.headers(style)` | Standardizes header casing. <br> **Styles:** `'snake'`, `'camel'`, `'pascal'`, `'kebab'`. |
| `.hash(col, salt)` | Applies hashing (SHA-256) to specific columns for PII anonymization. |
| `.rename(mapping)` | Renames specific columns using a dictionary mapping. |
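`.hash(col, salt)` is still listed as Planned in the roadmap below, but the intended behavior is presumably a deterministic salted digest per value, so anonymized columns can still be joined on. A sketch of that behavior (the salt-concatenation order is an assumption; hardened PII protection would favor HMAC or a key-derivation function over plain salted SHA-256):

```python
import hashlib

def hash_pii(value, salt):
    """Illustrative salted SHA-256: the same input + salt always maps to
    the same digest, so joins still work, but the raw value is gone."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

digest = hash_pii("alice@example.com", salt="s3cret")
print(digest)       # 64-character hex digest
print(len(digest))  # -> 64
```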
#### Output

Methods to save the final results or handle rejected data.

| Method | Description |
| :--- | :--- |
| `.quarantine(path)` | Saves rejected rows (with reasons) to a separate CSV file. |
| `.dump(path, format)` | Saves clean data to `.csv`, `.parquet`, or `.json` formats. |

---

## 🗺️ Roadmap

Phaeton is currently in **Beta (v0.2.0)**. Here is the status of our development pipeline:

| Feature | Status | Notes |
| :--- | :---: | :--- |
| **Parallel Streaming Engine** | ✅ Ready | Powered by Rayon |
| **Smart Type Casting** | ✅ Ready | Auto-clean numeric strings |
| **Quarantine Logic** | ✅ Ready | Audit logs for bad data |
| **Fuzzy Alignment** | ✅ Ready | Jaro-Winkler / Levenshtein |
| **SHA-256 Hashing** | 📝 Planned | Security for PII data |
| **Column Splitting & Combining** | 📝 Planned | - |
| **Imputation (`.fill()`)** | 📝 Planned | Mean/Median/Mode fill |
| **Parquet/Arrow Integration** | 📝 Planned | Native output support |

---
## 🤝 Contributing

This project is built with **Maturin** (PyO3 + Rust). Interested in contributing?

1. **Clone** this repository.
2. Ensure **Rust & Cargo** are installed.
3. Set up the environment and build:

```bash
# Set up a virtual environment (optional)
python -m venv .venv
source .venv/bin/activate

# Build & install the package in development mode
maturin develop --release
```
