- > ⚠️ **Project Status:** Phaeton is currently in **Experimental Beta (v0.2.3)**.
- > The core streaming engine is functional, but the library is currently under limited maintenance due to the author's personal schedule. So, some methods are still not working or are only dummy or mockup methods.
+ > ⚠️ **Project Status:** Phaeton is currently in **Stable Beta (v0.3.0)**.
+ > The core streaming engine is fully functional. However, please note that some auxiliary methods (marked in docs) are currently placeholders and will be implemented in future versions.

**Phaeton** is a specialized, Rust-powered preprocessing and ETL engine designed to sanitize raw data streams before they reach your analytical environment.
@@ -25,8 +25,10 @@ This allows you to process massive datasets on standard hardware without memory
* **Parallel Execution:** Utilizes all CPU cores via **Rust Rayon** to handle heavy lifting (Regex, Fuzzy Matching) without blocking Python.
* **Strict Quarantine:** Bad data isn't just dropped silently; it's quarantined into a separate file with a generated `_phaeton_reason` column for auditing.
* **Privacy & Security:** Built-in email masking and SHA-256 hashing for PII compliance.
* **Configurable Engine:** Full control over `batch_size` and worker threads to tune performance for low-memory devices or high-end servers.
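
As a rough illustration of what email masking and salted SHA-256 hashing involve, here is a minimal standard-library Python sketch. It is not Phaeton's implementation: the helper names `mask_email` and `hash_value`, the masking rule (keep the first character and the domain), and the choice to prepend the salt are all assumptions made for this example.

```python
import hashlib


def hash_value(value: str, salt: str = "") -> str:
    """Return the hex SHA-256 digest of salt + value (hypothetical helper)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character of the local part and the domain; mask the rest."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return email  # not an email address; leave it unchanged
    return local[0] + "*" * (len(local) - 1) + "@" + domain


print(mask_email("alice@example.com"))                      # a****@example.com
print(len(hash_value("alice@example.com", salt="pepper")))  # 64
```

Using a salt means the same input hashed under two different salts yields unrelated digests, which makes precomputed lookup tables less useful against the anonymized column.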

+
---

## Performance Benchmark
@@ -35,8 +37,8 @@ Phaeton is optimized for "Dirty Data" scenarios involving heavy string parsing,

**Test Scenario:**
- We generated a **Chaos Dataset** containing **1 Million Rows** of mixed dirty data:
- * **Operations:** Trim whitespace, Currency scrubbing (`$ 50.000,00` -> `50000`), Type casting, Fuzzy Alignment (Typo correction for City names), and Regex Filtering.
+ * **Dataset:** 1 Million Rows of generated mixed dirty data.
+ * **Operations:** Trim whitespace, Currency scrubbing (`$ 50.000,00` -> `50000`), Type casting, Fuzzy Alignment (Typo correction for City names), and Filtering.
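
To make the currency scrubbing step concrete, the following Python sketch reproduces the `$ 50.000,00` -> `50000` example from the list above. It is only an illustration under stated assumptions, not the engine's Rust code: `scrub_currency` is a hypothetical name, the input is assumed to use `.` for thousands and `,` for decimals, and the decimal part is truncated rather than rounded.

```python
import re


def scrub_currency(raw: str) -> int:
    """Parse an amount such as '$ 50.000,00' into an integer (hypothetical helper)."""
    digits = re.sub(r"[^\d.,]", "", raw)  # drop symbols and whitespace
    whole = digits.split(",", 1)[0]        # discard the decimal part
    whole = whole.replace(".", "")         # remove thousands separators
    if not whole:
        raise ValueError(f"no digits found in {raw!r}")
    return int(whole)


print(scrub_currency("$ 50.000,00"))  # 50000
```

A value with no digits raises `ValueError`, which is the kind of row a pipeline would reject rather than silently coerce.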
|`.rename(mapping)`| Renames specific columns using a dictionary mapping `({'old': 'new'})`. |
|`.hash(col, salt)`| Applies hashing (SHA-256) to specific columns for PII anonymization. |
- |`.rename(mapping)`| Renames specific columns using a dictionary mapping. |
+ |`.map(col, mapping)`| Maps values using a dictionary lookup (VLOOKUP style).|

- #### Output

- Methods to save the final results or handle rejected data.
+ ### 4. Pipeline: Output & Flow

- | Method | Description |
- | :--- | :--- |
- |`.quarantine(path)`| Saves rejected rows (with reasons) to a separate CSV file. |
- |`.dump(path, format)`| Saves clean data to `.csv`, `.parquet`, or `.json` formats. |
-
- #### Utility & Workflow
Methods to save the final results or handle rejected data.

| Method | Description |
| :--- | :--- |
- |`.fork()`| Creates a deep copy of the current pipeline branch. Useful for splitting logic (e.g., saving to multiple formats or creating different clean levels) without rewriting steps. |
- |`.peek(n)`| Previews the first n rows. |
-
+ |`.quarantine(path)`| Saves rejected rows (with reasons) to a separate CSV file. |
+ |`.dump(path, format)`| Saves clean data to `.csv`. |
+ |`.fork(tag)`|Creates a branch of the pipeline.|
+ |`.peek(n, col)`| Runs a dry-run preview. `n`: rows limit. `col`: specific column(s) to inspect (optional). |
+
+ <br>
+
+ > ⚠️ **Placeholder Methods (Coming Soon)**
+ >
+ > These methods are present in the API for compatibility but do not perform operations yet in v0.3.0.
+ > * `reformat(col, ...)`: Date parsing/reformatting.
+ > * `split(col, ...)`: Splitting columns.
+ > * `combine(cols, ...)`: Merging columns.
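
The quarantine behaviour described for `.quarantine(path)` (rejected rows kept together with a reason) can be pictured with the plain-Python sketch below. It does not call Phaeton and should not be read as its internals: `split_rows` and `check_age` are hypothetical names, and only the `_phaeton_reason` column name is taken from the README.

```python
import csv
import io


def split_rows(rows, validate):
    """Partition rows into (clean, quarantined); rejected rows gain a reason column."""
    clean, quarantined = [], []
    for row in rows:
        reason = validate(row)  # None means the row is valid
        if reason is None:
            clean.append(row)
        else:
            quarantined.append({**row, "_phaeton_reason": reason})
    return clean, quarantined


def check_age(row):
    """Example rule: the 'age' field must consist of digits only."""
    return None if row["age"].isdigit() else "age is not a number"


source = io.StringIO("name,age\nAda,36\nBob,unknown\n")
clean, quarantined = split_rows(csv.DictReader(source), check_age)
print(clean)        # [{'name': 'Ada', 'age': '36'}]
print(quarantined)  # [{'name': 'Bob', 'age': 'unknown', '_phaeton_reason': 'age is not a number'}]
```

Writing `quarantined` out with `csv.DictWriter` would give an auditable file in which every rejected row carries the reason it was rejected.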
---

## Roadmap

- Phaeton is currently in **Beta (v0.2.3)**. Here is the status of our development:
+ Phaeton is currently in **Stable Beta (v0.3.0)**. Here is the status of our development:

| Feature | Status | Implementation Notes |
| :--- | :---: | :--- |
|**Parallel Streaming Engine**| ✅ Ready | Powered by Rust Rayon (Multi-core) |