|
| 1 | +# contextLLM.md — ND Quantile Linear Fit (quick context) |
| 2 | + |
| 3 | +## TL;DR |
| 4 | + |
| 5 | +We fit a **local linear inverse quantile model** per channel and nuisance grid: |
| 6 | +$$ |
| 7 | +X(q,n) \approx a(q_0,n) + b(q_0,n)\,\underbrace{(q - q_0)}_{\Delta q},\quad b>0 |
| 8 | +$$ |
| 9 | + |
| 10 | +* Monotonic in **q** via $b \ge b_\text{min} > 0$. |
| 11 | +* Smooth in nuisance axes (e.g., **z**, later **η**, **time**) via separable interpolation. |
| 12 | +* **Discrete inputs** (tracks/clusters/Poisson): convert to **continuous ranks** (PIT or mid-ranks) *before* fitting. |
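A minimal pure-Python sketch of the local linear fit in one q window, under stated assumptions: `fit_local_linear` and its `b_min` default are hypothetical and are not the signature of `fit_quantile_linear_nd`, and the data are synthetic. It only illustrates centering on `q0`, ordinary least squares inside the window, and flooring the slope.

```python
import random

def fit_local_linear(q_vals, x_vals, q0, dq=0.05, b_min=1e-6):
    """OLS fit of X ~ a + b*(q - q0) using points with |q - q0| <= dq.

    Returns (a, b, n_used). The slope is floored at b_min (a placeholder
    value) so the local model stays increasing in q.
    """
    pts = [(q - q0, x) for q, x in zip(q_vals, x_vals) if abs(q - q0) <= dq]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points in the window")
    mean_d = sum(d for d, _ in pts) / n
    mean_x = sum(x for _, x in pts) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in pts)
    if sxx == 0.0:
        raise ValueError("no spread in q inside the window")
    sxy = sum((d - mean_d) * (x - mean_x) for d, x in pts)
    b = sxy / sxx
    a = mean_x - b * mean_d  # fitted X at q = q0
    return a, max(b, b_min), n

# Synthetic check: X = 10 + 4*q plus small Gaussian noise.
rng = random.Random(0)
q = [rng.random() for _ in range(5000)]
x = [10.0 + 4.0 * qi + rng.gauss(0.0, 0.05) for qi in q]
a, b, n = fit_local_linear(q, x, q0=0.5)
print(f"a={a:.3f} b={b:.3f} n={n}")
```

For this synthetic input the fit should land near `a = 12` and `b = 4`, the value and slope of the generating line at `q0 = 0.5`.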
| 13 | + |
| 14 | +## Key Files |
| 15 | + |
| 16 | +* `dfextensions/quantile_fit_nd/quantile_fit_nd.py` — core fitter + evaluator |
| 17 | +* `dfextensions/quantile_fit_nd/utils.py` — discrete→uniform helpers (PIT/mid-rank) |
| 18 | +* `dfextensions/quantile_fit_nd/test_quantile_fit_nd.py` — unit tests + rich diagnostics |
| 19 | +* `dfextensions/quantile_fit_nd/bench_quantile_fit_nd.py` — speed & precision benchmark, scaling plots |
| 20 | +* `dfextensions/quantile_fit_nd/quantile_fit_nd.md` — full spec (math, API, guarantees) |
| 21 | + |
| 22 | +## Core Assumptions & Policies |
| 23 | + |
| 24 | +* **Δq-centered OLS** per window $|Q - q_0| \le \Delta q$, default $\Delta q = 0.05$. |
| 25 | +* **Monotonicity**: enforce $b \ge b_\text{min}$ (configurable; “auto” heuristic or fixed). |
| 26 | +* **Nuisance interpolation**: separable (linear now; PCHIP later); only q must be monotone. |
| 27 | +* **Discrete inputs**: |
| 28 | + |
| 29 | + * Prefer **randomized PIT**: $U = F(k-1) + V\,[F(k) - F(k-1)]$, $V \sim \text{Unif}(0,1)$. |
| 30 | + * Or **mid-ranks**: $U = \tfrac{F(k-1) + F(k)}{2}$ (deterministic). |
| 31 | + * Helpers: `discrete_to_uniform_rank_poisson`, `discrete_to_uniform_rank_empirical`. |
| 32 | +* **Uncertainty**: $\sigma_Q \approx \sigma_{X|Q} / |b|$. Irreducible vs reducible split available downstream. |
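The two rank conversions for discrete inputs can be sketched for Poisson counts as follows. This is an illustration only: `poisson_cdf`, `pit_rank`, `mid_rank`, and `sample_poisson` are hypothetical stand-ins written with the standard library, not the `discrete_to_uniform_rank_*` helpers, whose signatures are not shown here.

```python
import math
import random

def poisson_cdf(k, lam):
    """P(K <= k) for K ~ Poisson(lam); 0.0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)

def pit_rank(k, lam, rng):
    """Randomized PIT: U = F(k-1) + V * [F(k) - F(k-1)], V ~ Unif(0,1)."""
    lo, hi = poisson_cdf(k - 1, lam), poisson_cdf(k, lam)
    return lo + rng.random() * (hi - lo)

def mid_rank(k, lam):
    """Deterministic mid-rank: U = [F(k-1) + F(k)] / 2."""
    return 0.5 * (poisson_cdf(k - 1, lam) + poisson_cdf(k, lam))

def sample_poisson(lam, rng):
    """Inversion sampling, used here only to generate test counts."""
    u = rng.random()
    k, term = 0, math.exp(-lam)
    cum = term
    while u > cum and term > 0.0:
        k += 1
        term *= lam / k
        cum += term
    return k

rng = random.Random(123456)
lam = 3.0
counts = [sample_poisson(lam, rng) for _ in range(20000)]
u_pit = [pit_rank(k, lam, rng) for k in counts]
u_mid = [mid_rank(k, lam) for k in counts]
print(sum(u_pit) / len(u_pit), len(set(u_mid)))
```

Randomized PIT yields continuous ranks, while mid-ranks take only one value per distinct count, which is the trade-off between determinism and rank spread.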
| 33 | + |
| 34 | +## Public API (stable) |
| 35 | + |
| 36 | +```python |
| 37 | +import numpy as np |
| 38 | +from dfextensions.quantile_fit_nd.quantile_fit_nd import fit_quantile_linear_nd, QuantileEvaluator |
| 39 | +table = fit_quantile_linear_nd( |
| 40 | + df, # columns: channel_id, Q, X, nuisance cols (e.g. z_vtx), is_outlier (optional) |
| 41 | + channel_key="channel_id", |
| 42 | + q_centers=np.arange(0, 1.0001, 0.025), |
| 43 | + dq=0.05, |
| 44 | + nuisance_axes={"z": "z_vtx"}, # later: {"z":"z_vtx","eta":"eta","time":"timestamp"} |
| 45 | + n_bins_axes={"z": 20}, |
| 46 | + mask_col="is_outlier", |
| 47 | + b_min_option="auto", # or "fixed" |
| 48 | +) |
| 49 | + |
| 50 | +evalr = QuantileEvaluator(table) |
| 51 | +q_hat = evalr.invert_rank(X=123.0, channel_id="ch0", z=1.2) |
| 52 | +a, b, sigmaQ = evalr.params(channel_id="ch0", q=0.4, z=0.0) |
| 53 | +``` |
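To make the inversion step concrete, here is a hedged sketch of how a rank could be recovered from `X` on a grid of local fits for a single channel and nuisance bin. `invert_rank_sketch` and `sigma_q` are hypothetical names; the real `QuantileEvaluator.invert_rank` may choose brackets and handle edges differently.

```python
from bisect import bisect_right

def invert_rank_sketch(x, centers):
    """Invert X -> q using local models X ~ a + b*(q - q0).

    centers: list of (q0, a, b) sorted by q0, with a increasing and b > 0.
    Picks the last center whose a is <= x, inverts its linear model,
    and clips the result to [0, 1].
    """
    a_vals = [a for _, a, _ in centers]
    i = bisect_right(a_vals, x) - 1
    i = min(max(i, 0), len(centers) - 1)  # stay inside the grid
    q0, a, b = centers[i]
    q = q0 + (x - a) / b
    return min(max(q, 0.0), 1.0)

def sigma_q(sigma_x_given_q, b):
    """Rank uncertainty from the local slope: sigma_Q ~ sigma_{X|Q} / |b|."""
    return sigma_x_given_q / abs(b)

# Synthetic grid for the exact line X = 10 + 4*q.
centers = [(q0, 10.0 + 4.0 * q0, 4.0) for q0 in (0.0, 0.25, 0.5, 0.75, 1.0)]
q_mid = invert_rank_sketch(11.2, centers)
q_low = invert_rank_sketch(9.0, centers)    # below the grid
q_high = invert_rank_sketch(20.0, centers)  # above the grid
print(q_mid, q_low, q_high, sigma_q(0.2, 4.0))
```

Values of `X` outside the fitted range map to the clipped ranks 0 and 1 in this sketch.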
| 54 | + |
| 55 | +### Output table (columns) |
| 56 | + |
| 57 | +`channel_id, q_center, <axis>_center..., a, b, sigma_Q, sigma_Q_irr (optional), dX_dN (optional), db_d<axis>..., fit_stats(json), timestamp(optional)` |
| 58 | + |
| 59 | +## Quickstart (clean run) |
| 60 | + |
| 61 | +```bash |
| 62 | +# 1) Unit tests with diagnostics |
| 63 | +pytest -q -s dfextensions/quantile_fit_nd/test_quantile_fit_nd.py |
| 64 | + |
| 65 | +# 2) Benchmark speed + precision + scaling (and plots) |
| 66 | +python dfextensions/quantile_fit_nd/bench_quantile_fit_nd.py --plot \ |
| 67 | + --dists uniform,poisson,gaussian --Ns 2000,5000,10000,20000,50000 --lam 50 |
| 68 | +``` |
| 69 | + |
| 70 | +* **Interpretation**: `rms_b ~ N^{-1/2}` (log-log slope α≈−0.5); `rms_rt ~ const` (α≈0), because round-trip error is per-event and does not average down with N. |
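The exponent α can be read as the slope of log(rms) against log(N). The sketch below shows that calculation on synthetic numbers only (not benchmark output); `scaling_exponent` is a hypothetical helper, and the benchmark script may compute the slope differently.

```python
import math

def scaling_exponent(ns, rms):
    """Least-squares slope of log(rms) vs log(N), i.e. alpha in rms ~ N**alpha."""
    lx = [math.log(n) for n in ns]
    ly = [math.log(r) for r in rms]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((v - mx) ** 2 for v in lx)
    sxy = sum((vx - mx) * (vy - my) for vx, vy in zip(lx, ly))
    return sxy / sxx

ns = [2000, 5000, 10000, 20000, 50000]
rms_b = [1.5 / math.sqrt(n) for n in ns]  # synthetic: statistical error on b
rms_rt = [0.01 for _ in ns]               # synthetic: per-event round-trip error
alpha_b = scaling_exponent(ns, rms_b)
alpha_rt = scaling_exponent(ns, rms_rt)
print(alpha_b, alpha_rt)
```

A tolerance such as `--scaling_tol` would then bound how far the fitted α may sit from its expected value.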
| 71 | + |
| 72 | +## Reproducibility knobs |
| 73 | + |
| 74 | +* RNG seed fixed in tests/bench (`RNG = np.random.default_rng(123456)`). |
| 75 | +* Poisson rank mode: randomized PIT (default) vs mid-rank (deterministic) — switch in test/bench helpers. |
| 76 | +* Scaling tolerances (`--scaling_tol`, `--rt_tol`) in the benchmark. |
| 77 | + |
| 78 | +## Known Limitations |
| 79 | + |
| 80 | +* Windows at extreme q (near 0 or 1) can be data-sparse; we store fit_stats and may skip non-informative windows. |
| 81 | +* With extremely discrete/uniform ranks (without PIT), the OLS fit is degenerate; the fitter flags `low_Q_spread`. |
| 82 | +* Current interpolation is linear; PCHIP (shape-preserving) can be enabled later. |
| 83 | +* Inversion uses a stable local linear model with bracketing; it works inside the grid and clips at the edges. |
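As an illustration of linear interpolation along one nuisance axis, the sketch below interpolates `(a, b)` between bin centers and clamps outside the grid. `interp_params` is a hypothetical helper, and clamping to the edge bins is an assumption about edge handling. Because linear interpolation is a convex combination of neighboring values, a positive `b` at both neighbors stays positive in between, so monotonicity in q is preserved.

```python
from bisect import bisect_right

def interp_params(z, z_centers, a_vals, b_vals):
    """Linearly interpolate (a, b) at z; clamp to the edge bins outside the grid.

    z_centers must be strictly increasing and have at least two entries.
    """
    if z <= z_centers[0]:
        return a_vals[0], b_vals[0]
    if z >= z_centers[-1]:
        return a_vals[-1], b_vals[-1]
    i = bisect_right(z_centers, z) - 1  # left neighbor index
    t = (z - z_centers[i]) / (z_centers[i + 1] - z_centers[i])
    a = (1.0 - t) * a_vals[i] + t * a_vals[i + 1]
    b = (1.0 - t) * b_vals[i] + t * b_vals[i + 1]
    return a, b

# Synthetic parameters on a three-bin z grid.
z_centers = [-10.0, 0.0, 10.0]
a_vals = [10.0, 12.0, 11.0]
b_vals = [4.0, 2.0, 3.0]
a_mid, b_mid = interp_params(5.0, z_centers, a_vals, b_vals)
a_edge, b_edge = interp_params(-20.0, z_centers, a_vals, b_vals)
print(a_mid, b_mid, a_edge, b_edge)
```

With several nuisance axes, a separable scheme would apply the same one-dimensional step along each axis in turn.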
| 84 | + |
| 85 | +## Next Steps (nice-to-have) |
| 86 | + |
| 87 | +* Optional robust fit (`fit_mode="huber"`), once outlier flags stabilize. |
| 88 | +* Add time as a nuisance axis or do time-sliced parallel fits + chain. |
| 89 | +* Export ROOT trees consistently (Parquet/Arrow already supported). |
| 90 | +* Add ML-friendly derivative grids (db/dz, db/dη) at higher resolution. |
| 91 | + |
| 92 | +## Troubleshooting |
| 93 | + |
| 94 | +* **ImportError in tests**: ensure `dfextensions/quantile_fit_nd/__init__.py` exists and you run from repo root. |
| 95 | +* **.idea committed**: add `.idea/` to repo-level `.gitignore` to avoid IDE noise. |
| 96 | +* **Poisson looks “nonsense”**: confirm PIT/mid-rank preprocessing of counts before calling `fit_*`. |
| 97 | + |
| 98 | +--- |
| 99 | + |