
Commit c2a22da

chore(release): bump version to 0.3.0
Major update adding dedupe, hash, strict validation, and improved pipeline controls. See CHANGELOG.md for details.
1 parent 89fd977 commit c2a22da

File tree: 13 files changed (+1,330 additions, −443 deletions)

CHANGELOG.md

Lines changed: 49 additions & 0 deletions
# Changelog 0.3.0 - Phaeton Update

This release introduces comprehensive data transformation capabilities, enhanced pipeline observability, and strict schema validation.

## New Features

### Core Transformations
- **feat(pipeline):** Added `dedupe()` method.
  - Supports **Full Row** deduplication (default).
  - Supports **Single Column** deduplication.
  - Supports **Composite Key** deduplication (list of columns).
- **feat(pipeline):** Added `fill()` method for data imputation.
  - Supports `fixed` value replacement.
  - Supports `ffill` (streaming forward fill) for time-series/sequential data.
- **feat(pipeline):** Added `hash()` method for PII anonymization.
  - Uses the SHA-256 algorithm.
  - Supports an optional `salt` for security against rainbow-table attacks.
- **feat(pipeline):** Added `map()` for dictionary-based value mapping (VLOOKUP style).
- **feat(structure):** Added `rename()` for column remapping (`{old: new}`).
- **feat(structure):** Added `headers()` for casing normalization.
  - Supported styles: `snake`, `camel`, `pascal`, `kebab`, `constant`.
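The dedupe and hash semantics above can be sketched in plain Python (an illustrative standalone sketch, not Phaeton's Rust implementation; the row and field names are hypothetical):

```python
import hashlib

def dedupe(rows, key=None):
    """Yield rows, skipping duplicates.

    key=None  -> full-row dedupe (default)
    key=str   -> single-column dedupe
    key=list  -> composite-key dedupe
    """
    seen = set()
    for row in rows:
        if key is None:
            fingerprint = tuple(sorted(row.items()))
        elif isinstance(key, str):
            fingerprint = row[key]
        else:
            fingerprint = tuple(row[k] for k in key)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield row

def hash_pii(value, salt=""):
    """Salted SHA-256: the salt defeats precomputed rainbow tables."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

rows = [
    {"email": "a@x.com", "city": "Jakarta"},
    {"email": "a@x.com", "city": "Bandung"},   # dropped: duplicate email
    {"email": "b@x.com", "city": "Jakarta"},
]
unique = list(dedupe(rows, key="email"))
print(len(unique))  # 2
print(hash_pii("a@x.com", salt="s3cret"))
```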

### Filter & Logic Enhancements
- **feat(pipeline):** Enhanced `prune()` to support **List[str]**.
  - Applies `ANY` logic (drops row if *any* specified column is empty).
- **feat(pipeline):** Enhanced `keep()` and `discard()` to support **List[str]** inputs.
  - Allows filtering based on multiple exact matches (e.g., `match=["A", "B"]`).
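The list-based filter semantics can be illustrated in pure Python (a minimal sketch with hypothetical field names, not the library's implementation):

```python
def prune(rows, cols):
    """Drop a row if ANY of the listed columns is empty."""
    for row in rows:
        if all(str(row.get(c) or "").strip() for c in cols):
            yield row

def keep(rows, col, match):
    """Keep rows whose column exactly matches any of the given values."""
    allowed = set(match if isinstance(match, list) else [match])
    for row in rows:
        if row.get(col) in allowed:
            yield row

rows = [
    {"email": "a@x.com", "username": "alice", "status": "A"},
    {"email": "",        "username": "bob",   "status": "B"},  # pruned: empty email
    {"email": "c@x.com", "username": "cara",  "status": "C"},  # filtered: status not in ["A", "B"]
]
kept = list(keep(prune(rows, ["email", "username"]), "status", ["A", "B"]))
print([r["username"] for r in kept])  # ['alice']
```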

### Developer Experience (DX)
- **feat(engine):** Implemented **Strict Schema Validation**.
  - `Engine(strict=True)` now pre-validates column existence and parameter types before Rust execution.
- **feat(pipeline):** Enhanced `.fork()` with a `tag` parameter.
  - Enables human-readable lineage in logs (e.g., `PIPE-1 - Active Users`).
- **feat(exceptions):** Introduced a granular exception hierarchy.
  - Added `SchemaError`, `ConfigurationError`, `ValueError`, `StateError`, and `EngineError`.
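How a strict pre-validation pass can check a parameter against a `Literal` annotation is sketched below with `typing.get_args` (illustrative only — the `ScrubMode` alias and `validate_param` helper are hypothetical, and the builtin `ValueError` stands in for the library's exception hierarchy):

```python
from typing import Literal, get_args

# Hypothetical annotation for a scrub mode parameter
ScrubMode = Literal["trim", "lower", "upper", "currency", "html"]

def validate_param(value, annotation):
    """Pre-validate a value against a Literal annotation before execution."""
    choices = get_args(annotation)  # returns () for non-Literal annotations
    if choices and value not in choices:
        raise ValueError(f"{value!r} is not one of {choices}")

validate_param("trim", ScrubMode)   # passes silently
try:
    validate_param("shout", ScrubMode)
except ValueError as exc:
    print(exc)
```

Failing fast here, before any data streams through the Rust engine, turns a mid-run crash into an immediate, descriptive error.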

## Bug Fixes & Refactoring

- **fix(pipeline):** Rewrote `.peek()` implementation.
  - Now correctly executes a dry-run stream.
  - Added `col` parameter to preview specific columns only.
  - Fixed headers transformation not reflecting in preview.
- **fix(engine):** Fixed validation logic to correctly handle Enum/Literal types using `typing.get_args`.
- **refactor(validation):** Moved validation logic to `Pipeline._validate()` for better encapsulation and support for manual triggering.

## Breaking Changes

- `Engine` constructor now accepts a `strict` boolean parameter.
- `peek()` output format is now strictly controlled by `tabulate` with `disable_numparse=True`.

Cargo.lock

Lines changed: 10 additions & 9 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,6 @@
 [package]
 name = "phaeton"
-version = "0.2.3"
+version = "0.3.0"
 edition = "2021"
 authors = ["Zahraan Dzakii Tsaqiif <zahraandzakiits@gmail.com>"]
 description = "A high-performance preprocessing and ETL engine for sanitizing raw data streams, accelerated by Rust."
@@ -31,6 +31,7 @@ serde_json = "1.0"
 # Text Processing
 regex = "1.10"
 strsim = "0.11"
+heck = "0.4"

 # Encoding
 encoding_rs = "0.8"
@@ -41,7 +42,7 @@ chrono = "0.4"

 # Security
 sha2 = "0.10"
-base64 = "0.21"
+hex = "0.4"

 # Error Handling
 thiserror = "1.0"

README.md

Lines changed: 95 additions & 79 deletions
@@ -5,8 +5,8 @@
 [![Rust](https://img.shields.io/badge/built%20with-Rust-orange)](https://www.rust-lang.org/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

-> ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.3)**.
-> The core streaming engine is functional, but the library is currently under limited maintenance due to the author's personal schedule. So, some methods are still not working or are only dummy or mockup methods.
+> ⚠️ **Project Status:** Phaeton is currently in **Stable Beta (v0.3.0)**.
+> The core streaming engine is fully functional. However, please note that some auxiliary methods (marked in docs) are currently placeholders and will be implemented in future versions.

 **Phaeton** is a specialized, Rust-powered preprocessing and ETL engine designed to sanitize raw data streams before they reach your analytical environment.
@@ -25,8 +25,10 @@ This allows you to process massive datasets on standard hardware without memory
 * **Parallel Execution:** Utilizes all CPU cores via **Rust Rayon** to handle heavy lifting (Regex, Fuzzy Matching) without blocking Python.
 * **Strict Quarantine:** Bad data isn't just dropped silently; it's quarantined into a separate file with a generated `_phaeton_reason` column for auditing.
 * **Smart Casting:** Automatically handles messy formats (e.g., `"Rp 5.250.000,00"` → `5250000` int) without complex manual parsing.
+* **Privacy & Security:** Built-in email masking and SHA-256 hashing for PII compliance.
 * **Configurable Engine:** Full control over `batch_size` and worker threads to tune performance for low-memory devices or high-end servers.
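As a plain-Python illustration of the smart-casting behavior listed above (a sketch assuming European-style grouping separators; the `smart_cast_int` helper is hypothetical and the engine's actual parser lives in Rust):

```python
import re

def smart_cast_int(raw: str) -> int:
    """Strip currency symbols and separators, drop the decimal fraction, cast to int."""
    s = raw.strip()
    s = re.sub(r"[^\d.,-]", "", s)       # remove "Rp", "$", spaces, etc.
    s = re.sub(r"[.,]\d{1,2}$", "", s)   # drop a trailing 1-2 digit decimal fraction
    s = re.sub(r"[.,]", "", s)           # remove remaining grouping separators
    return int(s)

print(smart_cast_int("Rp 5.250.000,00"))  # 5250000
print(smart_cast_int("$ 50.000,00"))      # 50000
```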

+
 ---
## Performance Benchmark
@@ -35,8 +37,8 @@ Phaeton is optimized for "Dirty Data" scenarios involving heavy string parsing,

 **Test Scenario:**
-We generated a **Chaos Dataset** containing **1 Million Rows** of mixed dirty data:
-* **Operations:** Trim whitespace, Currency scrubbing (`$ 50.000,00` -> `50000`), Type casting, Fuzzy Alignment (Typo correction for City names), and Regex Filtering.
+* **Dataset:** 1 Million Rows of generated mixed dirty data.
+* **Operations:** Trim whitespace, Currency scrubbing (`$ 50.000,00` -> `50000`), Type casting, Fuzzy Alignment (Typo correction for City names), and Filtering.
 * **Hardware:** Entry-level Laptop (Intel Core i3-1220P, 16GB RAM).

 **Results:**
@@ -46,7 +48,13 @@ We generated a **Chaos Dataset** containing **1 Million Rows** of mixed dirty da
 | **Windows 11** | **~820,000 rows/s** | **1.21s** | **~70 MB/s** |
 | **Linux (Arch)** | ~575,000 rows/s | 1.73s | ~49 MB/s |

-> *Note: Phaeton maintains a low and predictable memory footprint (~10-20MB overhead) regardless of the input file size due to its streaming nature.*
+<br>
+
+> ⚠️ **Note on I/O Bottleneck:** The performance difference above is due to hardware configuration during testing.
+> * Windows: Ran on an internal NVMe SSD (high I/O speed).
+> * Linux: Ran on an external SSD via USB 3.2 enclosure (I/O bottleneck).
+
+In an equal hardware environment, Phaeton performs identically on Linux and Windows. The engine is heavily I/O bound; a faster disk means faster processing.

 ---
## Usage Example
@@ -55,47 +63,60 @@ Based on the features available in the current version.
 ```python
 import phaeton

-# 1. Initialize Engine (Auto-detect cores)
-eng = phaeton.Engine(workers=0, batch_size=25_000)
+# 1. Initialize Engine
+# 'strict=True' enables schema validation before execution starts.
+eng = phaeton.Engine(workers=0, batch_size=25_000, strict=True)

-# 2. Define Pipeline
-
-# Base Pipeline
+# 2. Define Base Pipeline (Shared Logic)
 base = (
     eng.ingest("dirty_data.csv")
-    .prune(col="email")                 # Drop rows if email is empty
-    .prune(col="salary")                # Drop rows if salary is empty
-    .scrub("username", "trim")          # Clean whitespace
-    .scrub("salary", "currency")        # Parse "Rp 5.000" to number
-    .cast("salary", "int", clean=True)  # Safely cast to Integer
-    .fuzzyalign("city",
-        ref=["Jakarta", "Bandung"],
-        threshold=0.85
-    )                                   # Fix typos
+    # Critical Data Filter: drop row if 'email' OR 'username' is missing
+    .prune(['email', 'username'])
+
+    # Deduplication: ensure email uniqueness across the dataset
+    .dedupe('email')
+
+    # Cleaning & Normalization
+    .scrub('username', mode='trim')      # Remove whitespace
+    .scrub('salary', mode='currency')    # Normalize format ("$ 5,000" -> "5000")
+
+    # Type Enforcement: validate data is integer, strip noise if needed
+    .cast('salary', dtype='int', clean=True)
+
+    # Imputation: fill missing status with a default value
+    .fill('status', value='UNKNOWN')
+
+    # Correction: fix typos using Jaro-Winkler distance
+    .fuzzyalign('city',
+        ref=['Jakarta', 'Minnesota'],
+        threshold=0.85
+    )                                    # e.g., "Jkarta" -> "Jakarta"
 )

-# 3 Pipeline branching using .fork() (Optional)
+# 3. Pipeline Branching using .fork()

-# Pipeline 1: Keep all rows except status 'BANNED'
+# Pipeline 1: Secure & Clean Active Users
 p1 = (
-    base.fork()
-    .discard("status", "BANNED", mode="exact")  # Filter specific values (BANNED)
-    .quarantine("quarantine_1.csv")             # Save bad data here
-    .dump("clean_data_1.csv")                   # Save good data here
+    base.fork('Active Users')
+    .keep('status', match='ACTIVE', mode='exact')
+    .hash('email', salt='s3cret')        # Anonymize PII (SHA-256)
+    .dump('clean_active.csv')
 )

-# Pipeline 2: Only rows with 'ACTIVE' status kept
+# Pipeline 2: Audit Banned Users
 p2 = (
-    base.fork()
-    .keep("status", "ACTIVE", mode="exact")      # Keep specific values (ACTIVE)
-    .quarantine("quarantine_output_2.csv")       # Save bad data here
-    .dump("cleaned_output_2.csv", format="csv")  # Save good data here
+    base.fork('Banned Analysis')
+    .keep('status', match='BANNED', mode='exact')
+    .quarantine('quarantine_banned.csv') # Isolate bad rows for review
+    .dump('clean_banned.csv')
 )

-# 4. Execute Two Pipelines in Parallel
-stats = engine.exec([p1, p2])
-print(f"Pipeline 1 = Processed: {stats[0].processed}, Saved: {stats[0].saved}")
-print(f"Pipeline 2 = Processed: {stats[1].processed}, Saved: {stats[1].saved}")
+# 4. Execute Pipelines in Parallel
+# Returns a list of result statistics
+stats = eng.exec([p1, p2])
+
+print(f"Pipeline 1 (Active) | Processed: {stats[0].processed}, Saved: {stats[0].saved}")
+print(f"Pipeline 2 (Banned) | Processed: {stats[1].processed}, Saved: {stats[1].saved}")
 ```

---
@@ -112,82 +133,77 @@ pip install phaeton

 ## API Reference

-### Root Module <br>
-| Method | Description
-| :---: | :---: |
-| `phaeton.probe(path)` | Detects encoding (e.g., Windows-1252) and delimiter automatically. |
-
-### Engine Methods <br>
-
-Methods to save the final results or handle rejected data.
-
+### 1. Engine & Diagnostics <br>
 | Method | Description |
-| :--- | :--- |
-| `.ingest(source)` | Creates a new data processing pipeline for a specific source file. |
-| `.exec(pipelines)` | Executes multiple pipelines in parallel. |
-| `.validate(pipelines)` | Performs a dry-run to validate schema compatibility. |
+| :--- | :--- |
+| `phaeton.probe(path)` | Detects encoding and delimiter automatically. |
+| `eng.ingest(source)` | Creates a new pipeline builder. |
+| `eng.exec(pipelines)` | Executes pipelines in parallel threads. |
+| `eng.validate(pipelines)` | Runs a schema dry-run check without executing data processing. |

-### Pipeline Methods <br>
-These methods are chainable.

-#### Transformation & Cleaning
-
-Methods focused on data sanitization and ETL logic.
+### 2. Pipeline: Cleaning & Transformation <br>
+Methods to sanitize data content.

 | Method | Description |
 | :--- | :--- |
 | `.decode(encoding)` | Fixes file encoding (e.g., `latin-1` or `cp1252`). **Mandatory** as the first step if encoding is broken. |
-| `.scrub(col, mode)` | Basic string cleaning. <br> **Modes:** `'trim'`, `'lower'`, `'upper'`, `'currency'`, `'html'`. |
-| `.fuzzyalign(col, ref, threshold)` | Fixes typos using *Levenshtein distance* against a reference list. |
-| `.reformat(col, to, from)` | Standardizes date formats to ISO-8601 or custom formats. |
-| `.cast(col, type, clean)` | **Smart Cast.** Converts types (`int`/`float`/`bool`). <br> Set `clean=True` to strip non-numeric chars before casting. |
-
-#### Structure & Security
+| `.scrub(col, mode)` | Basic string cleaning. <br> **Modes:** `'trim'`, `'lower'`, `'upper'`, `'currency'`, `'html'`, `'numeric_only'`, `'email'` (masking). |
+| `.fill(col, val, method)` | **Methods:** `fixed` (constant value) or `ffill` (forward fill). |
+| `.dedupe(col)` | Removes duplicates. `col` can be `None` (full row), `str` (single col), or `list` (composite key). |
+| `.fuzzyalign(col, ref, threshold)` | Fixes typos using Jaro-Winkler distance against a reference list. |
+| `.cast(col, dtype, clean)` | **Smart Cast.** Converts types (`int`/`float`/`bool`). <br> Set `clean=True` to strip non-numeric chars before casting. |
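The `fill()` semantics in the table above (`fixed` vs. `ffill`) can be sketched in streaming pure Python (illustrative only; the field names are hypothetical):

```python
def fill(rows, col, value=None, method="fixed"):
    """Impute missing values one row at a time.

    method='fixed' -> replace empties with a constant value
    method='ffill' -> carry the last seen value forward
    """
    last = None
    for row in rows:
        if row.get(col) in (None, ""):
            row = {**row, col: value if method == "fixed" else last}
        else:
            last = row[col]
        yield row

rows = [{"t": 1, "v": "10"}, {"t": 2, "v": ""}, {"t": 3, "v": "12"}]
print([r["v"] for r in fill(rows, "v", method="ffill")])  # ['10', '10', '12']
```

Because only the last seen value is retained, forward fill works in a single pass with constant memory, which matches a streaming engine's constraints.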

-Methods for column management and data privacy.
+### 3. Pipeline: Structure & Security
+Methods to manage columns and privacy.

 | Method | Description |
 | :--- | :--- |
-| `.headers(style)` | Standardizes header casing. <br> **Styles:** `'snake'`, `'camel'`, `'pascal'`, `'kebab'`. |
+| `.headers(style)` | Standardizes header casing. <br> **Styles:** `'snake'`, `'camel'`, `'pascal'`, `'kebab'`, `'constant'`. |
+| `.rename(mapping)` | Renames specific columns using a dictionary mapping (`{'old': 'new'}`). |
 | `.hash(col, salt)` | Applies hashing (SHA-256) to specific columns for PII anonymization. |
-| `.rename(mapping)` | Renames specific columns using a dictionary mapping. |
+| `.map(col, mapping)` | Maps values using a dictionary lookup (VLOOKUP style). |
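The `.headers(style)` casing styles can be sketched as pure string transforms (hypothetical standalone helpers, not the engine's Rust implementation):

```python
import re

def to_words(header: str) -> list[str]:
    """Split a header like 'userName' / 'user-name' / 'User Name' into lowercase words."""
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", header)  # break camelCase boundaries
    return [w.lower() for w in re.split(r"[\s_\-]+", s) if w]

def headers(header: str, style: str) -> str:
    w = to_words(header)
    if style == "snake":
        return "_".join(w)
    if style == "kebab":
        return "-".join(w)
    if style == "constant":
        return "_".join(x.upper() for x in w)
    if style == "pascal":
        return "".join(x.capitalize() for x in w)
    if style == "camel":
        return w[0] + "".join(x.capitalize() for x in w[1:])
    raise ValueError(f"unknown style: {style}")

print(headers("User Name", "snake"))    # user_name
print(headers("user_name", "camel"))    # userName
print(headers("userName", "constant"))  # USER_NAME
```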

-#### Output
-
-Methods to save the final results or handle rejected data.
+### 4. Pipeline: Output & Flow

-| Method | Description |
-| :--- | :--- |
-| `.quarantine(path)` | Saves rejected rows (with reasons) to a separate CSV file. |
-| `.dump(path, format)` | Saves clean data to `.csv`, `.parquet`, or `.json` formats. |
-
-#### Utility & Workflow
 Methods to save the final results or handle rejected data.

 | Method | Description |
 | :--- | :--- |
-| `.fork()` | Creates a deep copy of the current pipeline branch. Useful for splitting logic (e.g., saving to multiple formats or creating different clean levels) without rewriting steps. |
-| `.peek(n)` | Previews the first n rows. |
+| `.quarantine(path)` | Saves rejected rows (with reasons) to a separate CSV file. |
+| `.dump(path, format)` | Saves clean data to `.csv`. |
+| `.fork(tag)` | Creates a branch of the pipeline. |
+| `.peek(n, col)` | Runs a dry-run preview. `n`: row limit. `col`: specific column(s) to inspect (optional). |
+
+<br>
+
+> ⚠️ **Placeholder Methods (Coming Soon)**
+>
+> These methods are present in the API for compatibility but do not perform operations yet in v0.3.0.
+> * `reformat(col, ...)`: Date parsing/reformatting.
+> * `split(col, ...)`: Splitting columns.
+> * `combine(cols, ...)`: Merging columns.
 ---

 ## Roadmap

-Phaeton is currently in **Beta (v0.2.3)**. Here is the status of our development:
+Phaeton is currently in **Stable Beta (v0.3.0)**. Here is the status of our development:

 | Feature | Status | Implementation Notes |
 | :--- | :---: | :--- |
 | **Parallel Streaming Engine** | ✅ Ready | Powered by Rust Rayon (Multi-core) |
-| **Regex & Filter Logic** | ✅ Ready | `keep`, `discard`, `prune` implemented |
-| **Smart Type Casting** | ✅ Ready | Auto-clean numeric strings (`"Rp 5,000"` -> `5000`) |
+| **Filter Logic & Regex** | ✅ Ready | `keep`, `discard`, `prune` implemented |
+| **Text Scrubbing** | ✅ Ready | HTML, Currency, Email Masking, etc. |
+| **Type Enforcement** | ✅ Ready | Validates data types & scrubs noise for clean CSV output |
 | **Fuzzy Alignment** | ✅ Ready | Jaro-Winkler for typo correction |
 | **Quarantine System** | ✅ Ready | Full audit trail for rejected rows |
-| **Basic Text Scrubbing** | ✅ Ready | Trim, HTML strip, Case conversion |
+| **Deduplication** | ✅ Ready | Row-level & Column-level dedupe |
+| **Hashing & Anonymization** | ✅ Ready | SHA-256 for PII data |
+| **Header Normalization** | ✅ Ready | `snake_case`, `camelCase` conversions |
+| **Strict Schema Validation** | ✅ Ready | `Engine(strict=True)` |
 | **Inspector Engine** | 📝 Planned | Dedicated stream for data profiling (Read-Only) |
-| **Header Normalization** | 📝 Planned | `snake_case`, `camelCase` conversions |
 | **Date Normalization** | 📝 Planned | Auto-detect & reformat dates |
-| **Deduplication** | 📝 Planned | Row-level & Column-level dedupe |
-| **Hashing & Anonymization** | 📝 Planned | SHA-256 for PII data |
 | **Parquet/Arrow Support** | 📝 Planned | Native output integration |

 ---
