Updated documentation

miranov25 · miranov25 · commit 9b7a03802e29 · 2025-06-11T06:18:13.000+02:00
diff --git a/UTILS/dfextensions/AliasDataFrame.md b/UTILS/dfextensions/AliasDataFrame.md
@@ -1,87 +1,180 @@
-### `AliasDataFrame` – A Lightweight Wrapper for Pandas with Alias Support
+# AliasDataFrame – Hierarchical Lazy Evaluation for Pandas + ROOT
 
-`AliasDataFrame` is a small utility that extends `pandas.DataFrame` functionality by enabling:
+`AliasDataFrame` is an extension of `pandas.DataFrame` that enables **named expression-based columns (aliases)** with:
 
-* **Lazy evaluation of derived columns via named aliases**
-* **Automatic dependency resolution across aliases**
-* **Persistence via Parquet + JSON or ROOT TTree (via `uproot` + `PyROOT`)**
-* **ROOT-compatible TTree export/import including alias metadata**
+* ✅ **Lazy evaluation** (on-demand computation)
+* ✅ **Automatic dependency resolution** (topological sort, cycle detection)
+* ✅ **Hierarchical aliasing** across **linked subframes** (e.g. clusters referencing tracks via index joins)
+* ✅ **Persistence** to Parquet and ROOT TTree formats, including full alias metadata
+
+It is designed for physics and data analysis workflows where derived quantities, calibration constants, and multi-table joins should remain symbolic until final export.
 
 ---
 
-#### 🔧 Example Usage
+## ✨ Core Features
+
+### ✅ Alias Definition & Lazy Evaluation
+
+Define symbolic columns as expressions involving other columns or aliases:
 
 ```python
-import pandas as pd
-from AliasDataFrame import AliasDataFrame
+adf.add_alias("pt", "sqrt(px**2 + py**2)")
+adf.materialize_alias("pt")
+```
 
-# Base DataFrame
-df = pd.DataFrame({"x": [1, 2], "y": [10, 20]})
-adf = AliasDataFrame(df)
+### ✅ Subframe Support (Hierarchical Dependencies)
 
-# Add aliases (on-demand expressions)
-adf.add_alias("z", "x + y")
-adf.add_alias("w", "z * 2")
+Reference a subframe (e.g. per-cluster frame linked to a per-track frame):
 
-# Materialize evaluated columns
-adf.materialize_all()
-print(adf.df)
+```python
+adf_clusters.register_subframe("T", adf_tracks, index_columns=["track_index"])
+adf_clusters.add_alias("dX", "mX - T.mX")
+adf_clusters.materialize_alias("dX")
 ```
 
----
+Under the hood, this performs a join using `track_index` between clusters and tracks, rewrites `T.mX` to the joined column, and evaluates in that context.
 
-#### 📦 Persistence
+### ✅ Dependency Graph & Cycle Detection
 
-##### Save to Parquet + Aliases JSON:
+* Automatically resolves dependency order
+* Detects and raises on circular alias definitions
+* Visualize with:
 
 ```python
-adf.save("mydata")
+adf.plot_alias_dependencies()
 ```
 
-##### Load from disk:
+### ✅ Constant Aliases & Dtype Enforcement
 
 ```python
-adf2 = AliasDataFrame.load("mydata")
-adf2.describe_aliases()
+import numpy as np
+
+# Register a constant alias and force its dtype on materialization
+adf.add_alias("scale", "1.5", dtype=np.float32, is_constant=True)
 ```
 
 ---
 
-#### 🌲 ROOT TTree Support
+## 💾 Persistence
 
-##### Export to `.root` with aliases:
+### ➤ Save to Parquet
 
 ```python
-adf.export_tree("mytree.root", treename="myTree", dropAliasColumns=True)
+adf.save("data/my_frame")  # Saves data + alias metadata
 ```
 
-This uses `uproot` for writing columns and `PyROOT` to set alias metadata via `TTree::SetAlias`.
+### ➤ Load from Parquet
 
-##### Read `.root` file back:
+```python
+adf2 = AliasDataFrame.load("data/my_frame")
+```
+
+### ➤ Export to ROOT TTree (with aliases!)
 
 ```python
-adf2 = adf.read_tree("mytree.root", treename="myTree")
+adf.export_tree("output.root", treename="MyTree")
 ```
 
+### ➤ Import from ROOT TTree
+
+```python
+adf = AliasDataFrame.read_tree("output.root", treename="MyTree")
+```
+
+Subframe alias metadata (including join indices) is preserved recursively.
+
+---
+
+## 🧪 Unit-Tested Features
+
+Tests are included for:
+
+* Basic alias chaining and materialization
+* Dtype conversion
+* Constant and hierarchical aliasing
+* Partial materialization
+* Subframe joins on index columns
+* Persistence round-trips for `.parquet` and `.root`
+* Error detection: cycles, invalid expressions, undefined symbols
+
+---
+
+## 🧠 Internals
+
+* Expression evaluation via `eval()` with math/NumPy-safe scope
+* Dependency analysis via `networkx`
+* Subframes stored in a registry (`SubframeRegistry`) with index-aware entries
+* Subframe alias resolution is performed via on-the-fly joins using provided index columns
+* Metadata embedded into:
+
+  * `.parquet` via Arrow schema metadata
+  * `.root` via `TTree::SetAlias` and `TObjString`
+
 ---
 
-#### 🔍 Introspection
+## 🔍 Introspection & Debugging
 
 ```python
-adf.describe_aliases()
+adf.describe_aliases()       # Print aliases, dependencies, broken ones
+adf.validate_aliases()       # List broken/inconsistent aliases
 ```
 
-Outputs:
+---
+
+## 🧩 Requirements
+
+* `pandas`, `numpy`, `pyarrow`, `uproot`, `networkx`, `matplotlib`, `ROOT`
+
+---
+
+## 🔁 Comparison with Other Tools
+
+| Feature                       | AliasDataFrame | pandas    | Vaex     | Awkward Arrays | polars    | Dask      |
+| ----------------------------- | -------------- | --------- | -------- | -------------- | --------- | --------- |
+| Lazy alias columns            | ✅ Yes          | ⚠️ Manual | ✅ Yes    | ❌              | ✅ Partial | ✅ Partial |
+| Dependency tracking           | ✅ Full graph   | ❌         | ⚠️ Basic | ❌              | ❌         | ❌         |
+| Subframe hierarchy (joins)    | ✅ Index-based  | ❌         | ❌        | ⚠️ Nested only | ❌         | ⚠️ Manual |
+| Constant alias support        | ✅ With dtype   | ❌         | ❌        | ❌              | ❌         | ❌         |
+| Visualization of dependencies | ✅ `networkx`   | ❌         | ❌        | ❌              | ❌         | ❌         |
+| Export to ROOT TTree          | ✅ Optional     | ❌         | ❌        | ✅ via uproot   | ❌         | ❌         |
+
+---
+
+## ❓ Why AliasDataFrame?
+
+In many data workflows, users recreate the same patterns again and again:
+
+* Manually compute derived columns with ad hoc logic
+* Scatter constants and correction factors in multiple files
+* Perform fragile joins between tables (e.g. clusters ↔ tracks) with little traceability
+* Lose transparency into what each column actually means
+
+**AliasDataFrame** turns these practices into a formalized, symbolic layer over your DataFrames:
+
+* 📐 Define all derived quantities as symbolic expressions
+* 🔗 Keep relations between DataFrames declarative, index-based, and reusable
+* 📊 Visualize dependency structures and broken logic automatically
+* 📦 Export the full state of your workflow (including symbolic metadata)
+
+It brings the clarity of a computation graph to structured table analysis — a common but under-supported need in `pandas`, `vaex`, or `polars` workflows.
+
+---
+
+## 🛣 Roadmap Ideas
+
+* [ ] Secure expression parser (no raw `eval`)
+* [ ] Aliased column caching / invalidation strategy
+* [ ] Inter-subframe join strategies (e.g., key-based, 1:n)
+* [ ] Jupyter widget or CLI tool for alias graph exploration
+* [ ] Broadcasting-aware joins or 2D index support
+
+---
+
+## 🧑‍🔬 Designed for...
 
-* Defined aliases
-* Broken/inconsistent aliases
-* Dependency graph
+* Physics workflows (e.g. ALICE clusters ↔ tracks ↔ collisions)
+* Symbolic calibration / correction workflows
+* Structured data exports with traceable metadata
 
 ---
 
-#### 🧠 Notes
+**Author:** \[You]
 
-* Dependencies across aliases are auto-resolved via topological sort.
-* Cycles in alias definitions are detected and reported.
-* Aliases are **not materialized** by default and **not stored** in `.parquet` unless requested.
-* `float16` columns are auto-upcast to `float32` for ROOT compatibility.
+MIT License
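
The updated README describes aliases that are resolved lazily, in dependency order, with cycle detection. The sketch below illustrates that idea only; it is not the `AliasDataFrame` implementation. The README says the library uses `networkx` and pandas, whereas this example substitutes the standard-library `graphlib` module and plain Python lists so that it runs without extra packages. The function names `alias_dependencies`, `resolution_order`, and `materialize` are hypothetical and do not appear in the commit.

```python
import ast
import math
from graphlib import CycleError, TopologicalSorter


def alias_dependencies(expr, aliases):
    """Return the names used in `expr` that are themselves aliases."""
    tree = ast.parse(expr, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return {name for name in names if name in aliases}


def resolution_order(aliases):
    """Order aliases so each one comes after the aliases it depends on.

    Raises graphlib.CycleError if the alias definitions are circular.
    """
    graph = {name: alias_dependencies(expr, aliases)
             for name, expr in aliases.items()}
    return list(TopologicalSorter(graph).static_order())


def materialize(columns, aliases):
    """Evaluate all aliases row by row and return a new dict of columns."""
    out = {name: list(values) for name, values in columns.items()}
    n_rows = len(next(iter(out.values())))
    # Only sqrt is exposed; builtins are hidden. This is still raw eval()
    # and must not be treated as safe for untrusted expressions.
    scope = {"__builtins__": {}, "sqrt": math.sqrt}
    for name in resolution_order(aliases):
        code = compile(aliases[name], "<alias>", "eval")
        out[name] = [
            eval(code, scope, {col: vals[i] for col, vals in out.items()})
            for i in range(n_rows)
        ]
    return out


columns = {"px": [3.0, 6.0], "py": [4.0, 8.0]}
# "pt2" is declared before "pt" on purpose: ordering is derived, not assumed.
aliases = {"pt2": "pt * 2", "pt": "sqrt(px**2 + py**2)"}

result = materialize(columns, aliases)
print(resolution_order(aliases))  # ['pt', 'pt2']
print(result["pt"])               # [5.0, 10.0]
print(result["pt2"])              # [10.0, 20.0]

try:
    resolution_order({"a": "b + 1", "b": "a + 1"})
except CycleError:
    print("circular alias definition detected")
```

Parsing each expression with `ast` to collect the names it uses is one way to build the dependency graph without evaluating anything, so a circular definition is reported before any column is computed.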