|
1 | | -### `AliasDataFrame` – A Lightweight Wrapper for Pandas with Alias Support |
| 1 | +# AliasDataFrame – Hierarchical Lazy Evaluation for Pandas + ROOT |
2 | 2 |
|
3 | | -`AliasDataFrame` is a small utility that extends `pandas.DataFrame` functionality by enabling: |
| 3 | +`AliasDataFrame` is an extension of `pandas.DataFrame` that enables **named expression-based columns (aliases)** with: |
4 | 4 |
|
5 | | -* **Lazy evaluation of derived columns via named aliases** |
6 | | -* **Automatic dependency resolution across aliases** |
7 | | -* **Persistence via Parquet + JSON or ROOT TTree (via `uproot` + `PyROOT`)** |
8 | | -* **ROOT-compatible TTree export/import including alias metadata** |
| 5 | +* ✅ **Lazy evaluation** (on-demand computation) |
| 6 | +* ✅ **Automatic dependency resolution** (topological sort, cycle detection) |
| 7 | +* ✅ **Hierarchical aliasing** across **linked subframes** (e.g. clusters referencing tracks via index joins) |
| 8 | +* ✅ **Persistence** to Parquet and ROOT TTree formats, including full alias metadata |
| 9 | + |
| 10 | +It is designed for physics and data analysis workflows where derived quantities, calibration constants, and multi-table joins should remain symbolic until final export. |
9 | 11 |
|
10 | 12 | --- |
11 | 13 |
|
12 | | -#### 🔧 Example Usage |
| 14 | +## ✨ Core Features |
| 15 | + |
| 16 | +### ✅ Alias Definition & Lazy Evaluation |
| 17 | + |
| 18 | +Define symbolic columns as expressions involving other columns or aliases: |
13 | 19 |
|
14 | 20 | ```python |
15 | | -import pandas as pd |
16 | | -from AliasDataFrame import AliasDataFrame |
| 21 | +adf.add_alias("pt", "sqrt(px**2 + py**2)") |
| 22 | +adf.materialize_alias("pt") |
| 23 | +``` |
17 | 24 |
|
18 | | -# Base DataFrame |
19 | | -df = pd.DataFrame({"x": [1, 2], "y": [10, 20]}) |
20 | | -adf = AliasDataFrame(df) |
| 25 | +### ✅ Subframe Support (Hierarchical Dependencies) |
21 | 26 |
|
22 | | -# Add aliases (on-demand expressions) |
23 | | -adf.add_alias("z", "x + y") |
24 | | -adf.add_alias("w", "z * 2") |
| 27 | +Reference a subframe (e.g. per-cluster frame linked to a per-track frame): |
25 | 28 |
|
26 | | -# Materialize evaluated columns |
27 | | -adf.materialize_all() |
28 | | -print(adf.df) |
| 29 | +```python |
| 30 | +adf_clusters.register_subframe("T", adf_tracks, index_columns=["track_index"]) |
| 31 | +adf_clusters.add_alias("dX", "mX - T.mX") |
| 32 | +adf_clusters.materialize_alias("dX") |
29 | 33 | ``` |
30 | 34 |
|
31 | | ---- |
| 35 | +Under the hood, this performs a join using `track_index` between clusters and tracks, rewrites `T.mX` to the joined column, and evaluates in that context. |
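The join-rewrite-evaluate sequence described above can be sketched in plain pandas. This is only an illustration under assumptions, not the library's actual implementation: the toy data, the `T__` column prefix, and the regex-based rewrite are all hypothetical choices made for this example.

```python
import re

import pandas as pd

# Hypothetical toy data; column names mirror the README example.
tracks = pd.DataFrame({"track_index": [0, 1, 2], "mX": [1.0, 2.0, 3.0]})
clusters = pd.DataFrame({"track_index": [2, 0, 0, 1], "mX": [3.5, 0.5, 1.25, 2.0]})

# 1) Join clusters to tracks on the index column. The subframe columns get
#    an assumed "T__" prefix so they cannot clash with cluster columns.
sub = tracks.set_index("track_index").add_prefix("T__")
joined = clusters.join(sub, on="track_index")

# 2) Rewrite the qualified name "T.mX" to the joined column name "T__mX".
expr = re.sub(r"\bT\.", "T__", "mX - T.mX")

# 3) Evaluate the rewritten expression in the context of the joined frame.
dX = joined.eval(expr)
```

Each cluster row picks up the `mX` of the track it points to, so several clusters may reference the same track without duplicating the track table.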
32 | 36 |
|
33 | | -#### 📦 Persistence |
| 37 | +### ✅ Dependency Graph & Cycle Detection |
34 | 38 |
|
35 | | -##### Save to Parquet + Aliases JSON: |
| 39 | +* Automatically resolves dependency order |
| 40 | +* Detects and raises on circular alias definitions |
| 41 | +* Visualize with: |
36 | 42 |
|
37 | 43 | ```python |
38 | | -adf.save("mydata") |
| 44 | +adf.plot_alias_dependencies() |
39 | 45 | ``` |
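To make the dependency-ordering and cycle-detection idea concrete, here is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+). The README states that the library itself uses `networkx`; the helper names `evaluation_order` and `has_cycle` and the identifier-scanning regex are assumptions made for this illustration.

```python
import re
from graphlib import CycleError, TopologicalSorter


def evaluation_order(aliases):
    """Return alias names ordered so that dependencies come first."""
    graph = {}
    for name, expr in aliases.items():
        # Identifiers in the expression that are themselves aliases
        # are treated as dependencies of this alias.
        tokens = set(re.findall(r"[A-Za-z_]\w*", expr))
        graph[name] = {t for t in tokens if t in aliases}
    # static_order() raises CycleError if the graph contains a cycle.
    return list(TopologicalSorter(graph).static_order())


def has_cycle(aliases):
    """Report whether the alias definitions are circular."""
    try:
        evaluation_order(aliases)
    except CycleError:
        return True
    return False


order = evaluation_order({"w": "z * 2", "z": "x + y"})
circular = has_cycle({"a": "b + 1", "b": "a * 2"})
```

Here `w` depends on `z`, so `z` is scheduled first; the second alias set refers to itself through `a` and `b` and is reported as circular.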
40 | 46 |
|
41 | | -##### Load from disk: |
| 47 | +### ✅ Constant Aliases & Dtype Enforcement |
42 | 48 |
|
43 | 49 | ```python |
44 | | -adf2 = AliasDataFrame.load("mydata") |
45 | | -adf2.describe_aliases() |
| 50 | +adf.add_alias("scale", "1.5", dtype=np.float32, is_constant=True) |
46 | 51 | ``` |
47 | 52 |
|
48 | 53 | --- |
49 | 54 |
|
50 | | -#### 🌲 ROOT TTree Support |
| 55 | +## 💾 Persistence |
51 | 56 |
|
52 | | -##### Export to `.root` with aliases: |
| 57 | +### ➤ Save to Parquet |
53 | 58 |
|
54 | 59 | ```python |
55 | | -adf.export_tree("mytree.root", treename="myTree", dropAliasColumns=True) |
| 60 | +adf.save("data/my_frame") # Saves data + alias metadata |
56 | 61 | ``` |
57 | 62 |
|
58 | | -This uses `uproot` for writing columns and `PyROOT` to set alias metadata via `TTree::SetAlias`. |
| 63 | +### ➤ Load from Parquet |
59 | 64 |
|
60 | | -##### Read `.root` file back: |
| 65 | +```python |
| 66 | +adf2 = AliasDataFrame.load("data/my_frame") |
| 67 | +``` |
| 68 | + |
| 69 | +### ➤ Export to ROOT TTree (with aliases!) |
61 | 70 |
|
62 | 71 | ```python |
63 | | -adf2 = adf.read_tree("mytree.root", treename="myTree") |
| 72 | +adf.export_tree("output.root", treename="MyTree") |
64 | 73 | ``` |
65 | 74 |
|
| 75 | +### ➤ Import from ROOT TTree |
| 76 | + |
| 77 | +```python |
| 78 | +adf = AliasDataFrame.read_tree("output.root", treename="MyTree") |
| 79 | +``` |
| 80 | + |
| 81 | +Subframe alias metadata (including join indices) is preserved recursively. |
| 82 | + |
| 83 | +--- |
| 84 | + |
| 85 | +## 🧪 Unit-Tested Features |
| 86 | + |
| 87 | +Tests included for: |
| 88 | + |
| 89 | +* Basic alias chaining and materialization |
| 90 | +* Dtype conversion |
| 91 | +* Constant and hierarchical aliasing |
| 92 | +* Partial materialization |
| 93 | +* Subframe joins on index columns |
| 94 | +* Persistence round-trips for `.parquet` and `.root` |
| 95 | +* Error detection: cycles, invalid expressions, undefined symbols |
| 96 | + |
| 97 | +--- |
| 98 | + |
| 99 | +## 🧠 Internals |
| 100 | + |
| 101 | +* Expression evaluation via `eval()` with a restricted scope of math/NumPy functions |
| 102 | +* Dependency analysis via `networkx` |
| 103 | +* Subframes stored in a registry (`SubframeRegistry`) with index-aware entries |
| 104 | +* Subframe alias resolution is performed via on-the-fly joins using provided index columns |
| 105 | +* Metadata embedded into: |
| 106 | + |
| 107 | + * `.parquet` via Arrow schema metadata |
| 108 | + * `.root` via `TTree::SetAlias` and `TObjString` |
| 109 | + |
66 | 110 | --- |
67 | 111 |
|
68 | | -#### 🔍 Introspection |
| 112 | +## 🔍 Introspection & Debugging |
69 | 113 |
|
70 | 114 | ```python |
71 | | -adf.describe_aliases() |
| 115 | +adf.describe_aliases() # Print aliases, dependencies, broken ones |
| 116 | +adf.validate_aliases() # List broken/inconsistent aliases |
72 | 117 | ``` |
73 | 118 |
|
74 | | -Outputs: |
| 119 | +--- |
| 120 | + |
| 121 | +## 🧩 Requirements |
| 122 | + |
| 123 | +* `pandas`, `numpy`, `pyarrow`, `uproot`, `networkx`, `matplotlib`, `ROOT` |
| 124 | + |
| 125 | +--- |
| 126 | + |
| 127 | +## 🔁 Comparison with Other Tools |
| 128 | + |
| 129 | +| Feature | AliasDataFrame | pandas | Vaex | Awkward Arrays | polars | Dask | |
| 130 | +| ----------------------------- | -------------- | --------- | -------- | -------------- | --------- | --------- | |
| 131 | +| Lazy alias columns | ✅ Yes | ⚠️ Manual | ✅ Yes | ❌ | ✅ Partial | ✅ Partial | |
| 132 | +| Dependency tracking | ✅ Full graph | ❌ | ⚠️ Basic | ❌ | ❌ | ❌ | |
| 133 | +| Subframe hierarchy (joins) | ✅ Index-based | ❌ | ❌ | ⚠️ Nested only | ❌ | ⚠️ Manual | |
| 134 | +| Constant alias support | ✅ With dtype | ❌ | ❌ | ❌ | ❌ | ❌ | |
| 135 | +| Visualization of dependencies | ✅ `networkx` | ❌ | ❌ | ❌ | ❌ | ❌ | |
| 136 | +| Export to ROOT TTree | ✅ Optional | ❌ | ❌ | ✅ via uproot | ❌ | ❌ | |
| 137 | + |
| 138 | +--- |
| 139 | + |
| 140 | +## ❓ Why AliasDataFrame? |
| 141 | + |
| 142 | +In many data workflows, users recreate the same patterns again and again: |
| 143 | + |
| 144 | +* Manually compute derived columns with ad hoc logic |
| 145 | +* Scatter constants and correction factors in multiple files |
| 146 | +* Perform fragile joins between tables (e.g. clusters ↔ tracks) with little traceability |
| 147 | +* Lose transparency into what each column actually means |
| 148 | + |
| 149 | +**AliasDataFrame** turns these practices into a formalized, symbolic layer over your DataFrames: |
| 150 | + |
| 151 | +* 📐 Define all derived quantities as symbolic expressions |
| 152 | +* 🔗 Keep relations between DataFrames declarative, index-based, and reusable |
| 153 | +* 📊 Visualize dependency structures and broken logic automatically |
| 154 | +* 📦 Export the full state of your workflow (including symbolic metadata) |
| 155 | + |
| 156 | +It brings the clarity of a computation graph to structured table analysis, a common but under-supported need in `pandas`, `vaex`, or `polars` workflows. |
| 157 | + |
| 158 | +--- |
| 159 | + |
| 160 | +## 🛣 Roadmap Ideas |
| 161 | + |
| 162 | +* [ ] Secure expression parser (no raw `eval`) |
| 163 | +* [ ] Aliased column caching / invalidation strategy |
| 164 | +* [ ] Inter-subframe join strategies (e.g., key-based, 1\:n) |
| 165 | +* [ ] Jupyter widget or CLI tool for alias graph exploration |
| 166 | +* [ ] Broadcasting-aware joins or 2D index support |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +## 🧑‍🔬 Designed for... |
75 | 171 |
|
76 | | -* Defined aliases |
77 | | -* Broken/inconsistent aliases |
78 | | -* Dependency graph |
| 172 | +* Physics workflows (e.g. ALICE clusters ↔ tracks ↔ collisions) |
| 173 | +* Symbolic calibration / correction workflows |
| 174 | +* Structured data exports with traceable metadata |
79 | 175 |
|
80 | 176 | --- |
81 | 177 |
|
82 | | -#### 🧠 Notes |
| 178 | +**Author:** \[You] |
83 | 179 |
|
84 | | -* Dependencies across aliases are auto-resolved via topological sort. |
85 | | -* Cycles in alias definitions are detected and reported. |
86 | | -* Aliases are **not materialized** by default and **not stored** in `.parquet` unless requested. |
87 | | -* `float16` columns are auto-upcast to `float32` for ROOT compatibility. |
| 180 | +MIT License |