Commit 9b7a038

author
miranov25
committed
Updated documentation
1 parent 2a6bd71 commit 9b7a038

1 file changed

+135
-42
lines changed

@@ -1,87 +1,180 @@
1-
### `AliasDataFrame` – A Lightweight Wrapper for Pandas with Alias Support
1+
# AliasDataFrame – Hierarchical Lazy Evaluation for Pandas + ROOT
22

3-
`AliasDataFrame` is a small utility that extends `pandas.DataFrame` functionality by enabling:
3+
`AliasDataFrame` is an extension of `pandas.DataFrame` that enables **named expression-based columns (aliases)** with:
44

5-
* **Lazy evaluation of derived columns via named aliases**
6-
* **Automatic dependency resolution across aliases**
7-
* **Persistence via Parquet + JSON or ROOT TTree (via `uproot` + `PyROOT`)**
8-
* **ROOT-compatible TTree export/import including alias metadata**
5+
* **Lazy evaluation** (on-demand computation)
6+
* **Automatic dependency resolution** (topological sort, cycle detection)
7+
* **Hierarchical aliasing** across **linked subframes** (e.g. clusters referencing tracks via index joins)
8+
* **Persistence** to Parquet and ROOT TTree formats, including full alias metadata
9+
10+
It is designed for physics and data analysis workflows where derived quantities, calibration constants, and multi-table joins should remain symbolic until final export.
911

1012
---
1113

12-
#### 🔧 Example Usage
14+
## ✨ Core Features
15+
16+
### ✅ Alias Definition & Lazy Evaluation
17+
18+
Define symbolic columns as expressions involving other columns or aliases:
1319

1420
```python
15-
import pandas as pd
16-
from AliasDataFrame import AliasDataFrame
21+
adf.add_alias("pt", "sqrt(px**2 + py**2)")
22+
adf.materialize_alias("pt")
23+
```
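
To make the "lazy" part concrete, here is a minimal sketch of how an alias could stay symbolic until it is materialized. `TinyAliasFrame` is a hypothetical stand-in written for this illustration and is not the library's implementation. It assumes aliases are plain strings evaluated with `eval()` against a small NumPy scope plus the existing columns, and it does not resolve aliases that depend on other aliases.

```python
import numpy as np
import pandas as pd


class TinyAliasFrame:
    """Hypothetical stand-in for AliasDataFrame (illustration only)."""

    def __init__(self, df):
        self.df = df
        self.aliases = {}  # alias name -> expression string

    def add_alias(self, name, expr):
        # Only record the expression; nothing is computed yet.
        self.aliases[name] = expr

    def materialize_alias(self, name):
        # Restricted scope: a few NumPy functions plus the existing columns.
        scope = {"sqrt": np.sqrt, "log": np.log, "exp": np.exp}
        scope.update({col: self.df[col] for col in self.df.columns})
        # Builtins are disabled so the expression only sees the names in `scope`.
        self.df[name] = eval(self.aliases[name], {"__builtins__": {}}, scope)
        return self.df[name]


frame = TinyAliasFrame(pd.DataFrame({"px": [3.0, 6.0], "py": [4.0, 8.0]}))
frame.add_alias("pt", "sqrt(px**2 + py**2)")
still_symbolic = "pt" not in frame.df.columns  # True: no column exists yet
pt = frame.materialize_alias("pt")             # the column is computed here
```

Disabling builtins narrows what an expression can reach, but it is not a security boundary, which is consistent with the roadmap item about replacing raw `eval`.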
1724

18-
# Base DataFrame
19-
df = pd.DataFrame({"x": [1, 2], "y": [10, 20]})
20-
adf = AliasDataFrame(df)
25+
### ✅ Subframe Support (Hierarchical Dependencies)
2126

22-
# Add aliases (on-demand expressions)
23-
adf.add_alias("z", "x + y")
24-
adf.add_alias("w", "z * 2")
27+
Reference a subframe (e.g. per-cluster frame linked to a per-track frame):
2528

26-
# Materialize evaluated columns
27-
adf.materialize_all()
28-
print(adf.df)
29+
```python
30+
adf_clusters.register_subframe("T", adf_tracks, index_columns=["track_index"])
31+
adf_clusters.add_alias("dX", "mX - T.mX")
32+
adf_clusters.materialize_alias("dX")
2933
```
3034

31-
---
35+
Under the hood, this performs a join using `track_index` between clusters and tracks, rewrites `T.mX` to the joined column, and evaluates in that context.
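
The join-and-rewrite step described above can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's code: the `T_` column prefix, the string replacement used to rewrite `T.mX`, and the sample values are all hypothetical.

```python
import pandas as pd

clusters = pd.DataFrame({"track_index": [0, 1, 1], "mX": [1.5, 2.0, 2.6]})
tracks = pd.DataFrame({"track_index": [0, 1], "mX": [1.0, 2.5]})

# Prefix the subframe's value columns so "T.mX" can be rewritten to "T_mX"
# (hypothetical naming scheme).
tracks_prefixed = tracks.rename(columns={"mX": "T_mX"})

# A left join keeps one row per cluster and attaches the matching track values.
joined = clusters.merge(tracks_prefixed, on="track_index", how="left")

# Rewrite the alias expression against the joined frame and evaluate it there.
expr = "mX - T.mX".replace("T.", "T_")
clusters["dX"] = joined.eval(expr).to_numpy()
```

A left join is used here so that clusters without a matching track would keep their row and get `NaN` for `dX` rather than being dropped.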
3236

33-
#### 📦 Persistence
37+
### ✅ Dependency Graph & Cycle Detection
3438

35-
##### Save to Parquet + Aliases JSON:
39+
* Automatically resolves dependency order
40+
* Detects and raises on circular alias definitions
41+
* Visualize with:
3642

3743
```python
38-
adf.save("mydata")
44+
adf.plot_alias_dependencies()
3945
```
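
As a rough illustration of dependency ordering and cycle detection, the sketch below uses the standard library's `graphlib` instead of `networkx`, purely to keep the example dependency-free. The helper `evaluation_order` and the regex used to find symbols are assumptions made for this example; the aliases `z` and `w` are the ones from the earlier usage example.

```python
import re
from graphlib import CycleError, TopologicalSorter


def evaluation_order(aliases, columns):
    """Return alias names ordered so that dependencies come first."""
    graph = {}
    for name, expr in aliases.items():
        symbols = set(re.findall(r"[A-Za-z_]\w*", expr))
        # Only other aliases impose an ordering; plain columns already exist.
        graph[name] = {s for s in symbols if s in aliases and s not in columns}
    # static_order() raises CycleError if the graph contains a cycle.
    return list(TopologicalSorter(graph).static_order())


# "w" depends on "z", so "z" must be evaluated first.
order = evaluation_order({"w": "z * 2", "z": "x + y"}, columns={"x", "y"})

# A circular definition is rejected instead of looping forever.
try:
    evaluation_order({"a": "b + 1", "b": "a + 1"}, columns=set())
    cycle_found = False
except CycleError:
    cycle_found = True
```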
4046

41-
##### Load from disk:
47+
### ✅ Constant Aliases & Dtype Enforcement
4248

4349
```python
44-
adf2 = AliasDataFrame.load("mydata")
45-
adf2.describe_aliases()
50+
adf.add_alias("scale", "1.5", dtype=np.float32, is_constant=True)
4651
```
4752

4853
---
4954

50-
#### 🌲 ROOT TTree Support
55+
## 💾 Persistence
5156

52-
##### Export to `.root` with aliases:
57+
### ➤ Save to Parquet
5358

5459
```python
55-
adf.export_tree("mytree.root", treename="myTree", dropAliasColumns=True)
60+
adf.save("data/my_frame") # Saves data + alias metadata
5661
```
5762

58-
This uses `uproot` for writing columns and `PyROOT` to set alias metadata via `TTree::SetAlias`.
63+
### ➤ Load from Parquet
5964

60-
##### Read `.root` file back:
65+
```python
66+
adf2 = AliasDataFrame.load("data/my_frame")
67+
```
68+
69+
### ➤ Export to ROOT TTree (with aliases!)
6170

6271
```python
63-
adf2 = adf.read_tree("mytree.root", treename="myTree")
72+
adf.export_tree("output.root", treename="MyTree")
6473
```
6574

75+
### ➤ Import from ROOT TTree
76+
77+
```python
78+
adf = AliasDataFrame.read_tree("output.root", treename="MyTree")
79+
```
80+
81+
Subframe alias metadata (including join indices) is preserved recursively.
82+
83+
---
84+
85+
## 🧪 Unit-Tested Features
86+
87+
Tests included for:
88+
89+
* Basic alias chaining and materialization
90+
* Dtype conversion
91+
* Constant and hierarchical aliasing
92+
* Partial materialization
93+
* Subframe joins on index columns
94+
* Persistence round-trips for `.parquet` and `.root`
95+
* Error detection: cycles, invalid expressions, undefined symbols
96+
97+
---
98+
99+
## 🧠 Internals
100+
101+
* Expression evaluation via `eval()` with math/Numpy-safe scope
102+
* Dependency analysis via `networkx`
103+
* Subframes stored in a registry (`SubframeRegistry`) with index-aware entries
104+
* Subframe alias resolution is performed via on-the-fly joins using provided index columns
105+
* Metadata embedded into:
106+
107+
* `.parquet` via Arrow schema metadata
108+
* `.root` via `TTree::SetAlias` and `TObjString`
109+
66110
---
67111

68-
#### 🔍 Introspection
112+
## 🔍 Introspection & Debugging
69113

70114
```python
71-
adf.describe_aliases()
115+
adf.describe_aliases() # Print aliases, dependencies, broken ones
116+
adf.validate_aliases() # List broken/inconsistent aliases
72117
```
73118

74-
Outputs:
119+
---
120+
121+
## 🧩 Requirements
122+
123+
* `pandas`, `numpy`, `pyarrow`, `uproot`, `networkx`, `matplotlib`, `ROOT`
124+
125+
---
126+
127+
## 🔁 Comparison with Other Tools
128+
129+
| Feature                       | AliasDataFrame | pandas    | Vaex     | Awkward Arrays | polars    | Dask      |
130+
| ----------------------------- | -------------- | --------- | -------- | -------------- | --------- | --------- |
131+
| Lazy alias columns            | ✅ Yes          | ⚠️ Manual | ✅ Yes    |                | ✅ Partial | ✅ Partial |
132+
| Dependency tracking           | ✅ Full graph   |           | ⚠️ Basic |                |           |           |
133+
| Subframe hierarchy (joins)    | ✅ Index-based  |           |          | ⚠️ Nested only |           | ⚠️ Manual |
134+
| Constant alias support        | ✅ With dtype   |           |          |                |           |           |
135+
| Visualization of dependencies | `networkx`     |           |          |                |           |           |
136+
| Export to ROOT TTree          | ✅ Optional     |           |          | ✅ via uproot   |           |           |
137+
138+
---
139+
140+
## ❓ Why AliasDataFrame?
141+
142+
In many data workflows, users recreate the same patterns again and again:
143+
144+
* Manually compute derived columns with ad hoc logic
145+
* Scatter constants and correction factors in multiple files
146+
* Perform fragile joins between tables (e.g. clusters ↔ tracks) with little traceability
147+
* Lose transparency into what each column actually means
148+
149+
**AliasDataFrame** turns these practices into a formalized, symbolic layer over your DataFrames:
150+
151+
* 📐 Define all derived quantities as symbolic expressions
152+
* 🔗 Keep relations between DataFrames declarative, index-based, and reusable
153+
* 📊 Visualize dependency structures and broken logic automatically
154+
* 📦 Export the full state of your workflow (including symbolic metadata)
155+
156+
It brings the clarity of a computation graph to structured table analysis — a common but under-supported need in `pandas`, `vaex`, or `polars` workflows.
157+
158+
---
159+
160+
## 🛣 Roadmap Ideas
161+
162+
* [ ] Secure expression parser (no raw `eval`)
163+
* [ ] Aliased column caching / invalidation strategy
164+
* [ ] Inter-subframe join strategies (e.g., key-based, 1\:n)
165+
* [ ] Jupyter widget or CLI tool for alias graph exploration
166+
* [ ] Broadcasting-aware joins or 2D index support
167+
168+
---
169+
170+
## 🧑‍🔬 Designed for...
75171

76-
* Defined aliases
77-
* Broken/inconsistent aliases
78-
* Dependency graph
172+
* Physics workflows (e.g. ALICE clusters ↔ tracks ↔ collisions)
173+
* Symbolic calibration / correction workflows
174+
* Structured data exports with traceable metadata
79175

80176
---
81177

82-
#### 🧠 Notes
178+
**Author:** \[You]
83179

84-
* Dependencies across aliases are auto-resolved via topological sort.
85-
* Cycles in alias definitions are detected and reported.
86-
* Aliases are **not materialized** by default and **not stored** in `.parquet` unless requested.
87-
* `float16` columns are auto-upcast to `float32` for ROOT compatibility.
180+
MIT License
