Commit 512323d

Author: miranov25
docs(quantile_fit_nd): add v3.1 Δq-centered ND quantile fitting spec
- Introduces per-channel, detector-agnostic model: X(Q,n) = a(q0,n) + b(q0,n)·(Q−q0), centered on Δq
- Defines inputs/outputs, fit steps, and monotonicity policy (b > b_min)
- Details nuisance-axis interpolation (linear/PCHIP) and uncertainty (σ_Q, σ_Q_irr)
- Provides API sketch (fit_quantile_linear_nd, QuantileEvaluator) and persistence (Parquet/Arrow/ROOT)
- Outlines unit tests, diagnostics, and performance expectations

Refs: calibration, multiplicity/flow estimator framework
1 parent 4ef6973 commit 512323d

1 file changed

+248
-0
lines changed

@@ -0,0 +1,248 @@
# quantile_fit_nd — Generic ND Quantile Linear Fitting Framework
**Version:** v3.1
**Status:** Implementation Ready

---

## 1. Overview

This module provides a detector-agnostic framework for **quantile-based linear fitting** used in calibration, combined multiplicity estimation, and flow monitoring.

We approximate the local inverse quantile function around each quantile grid point $q_0$ as:

$$
X(Q, \mathbf{n}) \;=\; a(q_0,\mathbf{n}) \;+\; b(q_0,\mathbf{n}) \cdot (Q - q_0)
$$

where:
- $Q$ is the quantile rank of the amplitude,
- $\mathbf{n}$ are nuisance coordinates (e.g., $z_{\mathrm{vtx}}, \eta, t$),
- $a$ is the OLS intercept, i.e., the fitted amplitude at $Q = q_0$,
- $b > 0$ is the local slope; positivity guarantees monotonicity in $Q$.

The framework outputs **tabulated coefficients and diagnostics** in a flat DataFrame for time-series monitoring, downstream ML use, and export to Parquet/Arrow/ROOT.

---

## 2. Directory contents

| File | Role |
|---|---|
| `quantile_fit_nd.py` | Implementation (fit, interpolation, evaluator, I/O) |
| `test_quantile_fit_nd.py` | Unit & synthetic tests |
| `quantile_fit_nd.md` | This design & usage document |

---

## 3. Goals

1. Fit a local linear inverse CDF per **channel** with monotonicity in $Q$.
2. Smooth over nuisance axes with separable interpolation (linear/PCHIP).
3. Provide **physics-driven** slope floors to avoid rank blow-ups.
4. Store results as **DataFrames** with rich diagnostics and metadata.
5. Keep the API **detector-independent** (no detector ID in core interface).

---

## 4. Required input columns

| Column | Description |
|---|---|
| `channel_id` | Unique local channel key |
| `Q` | Quantile rank (normalized by detector reference) |
| `X` | Measured amplitude (or normalized signal) |
| `z_vtx`, `eta`, `time` | Nuisance coordinates (configurable subset) |
| `is_outlier` | Optional boolean mask; `True` rows are excluded from fits |

> Upstream preprocessing (e.g., timing-outlier rejection) is expected to fill `is_outlier`.

---

## 5. Output table schema

The fit returns a flat, appendable table with explicit grid points.

| Column | Description |
|---|---|
| `channel_id` | Channel identifier |
| `q_center` | Quantile center of the local fit |
| `<axis>_center` | Centers of nuisance bins (e.g., `z_center`) |
| `a` | Intercept (from OLS at $q_0$) |
| `b` | Slope (clipped to $b_{\min}>0$ if needed) |
| `sigma_Q` | Total quantile uncertainty $\sigma_{X \mid Q} / \lvert b \rvert$ |
| `sigma_Q_irr` | Irreducible error (from multiplicity fluctuation) |
| `dX_dN` | Sensitivity to multiplicity proxy (optional) |
| `db_d<axis>` | Finite-difference derivative along each nuisance axis |
| `fit_stats` | JSON with `Npoints`, `RMS`, `chi2_ndf`, `masked_frac`, `clipped_frac` |
| `timestamp` | Calibration/run time (optional) |

**Example metadata stored in `DataFrame.attrs`:**
```json
{
  "model": "X = a + b*(Q - q_center)",
  "dq": 0.05,
  "b_min_option": "auto",
  "b_min_formula": "b_min = 0.25 * sigma_X / (2*dq)",
  "axes": ["q", "z"],
  "fit_mode": "ols",
  "kappa_w": 1.3
}
```
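As a rough illustration of the schema, one output row and its metadata could be assembled as below. The numeric values are invented for the example, only a subset of the columns is shown, and packing `fit_stats` with `json.dumps` is an assumption about how the JSON column is produced.

```python
import json

import pandas as pd

# One illustrative output row (all numbers are made up for this sketch).
row = {
    "channel_id": 42,
    "q_center": 0.40,
    "z_center": 2.0,
    "a": 35.0,
    "b": 50.0,
    "sigma_Q": 0.01,
    # Assumption: per-cell diagnostics are serialized into a JSON string.
    "fit_stats": json.dumps(
        {"Npoints": 101, "RMS": 0.5, "chi2_ndf": 1.0,
         "masked_frac": 0.01, "clipped_frac": 0.0}
    ),
}
table = pd.DataFrame([row])

# Table-level metadata travels in DataFrame.attrs.
table.attrs.update(
    {
        "model": "X = a + b*(Q - q_center)",
        "dq": 0.05,
        "b_min_option": "auto",
        "axes": ["q", "z"],
        "fit_mode": "ols",
        "kappa_w": 1.3,
    }
)

# Diagnostics can be unpacked again when needed.
stats = json.loads(table.loc[0, "fit_stats"])
```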

---

## 6. Fit procedure (per channel, per grid cell)

1. **Window selection**: select rows with $|Q - q_0| \le \Delta q$ (default $\Delta q = 0.05$).
2. **Masking**: use rows where `is_outlier == False`. Record masked fraction.
3. **Local regression**: OLS fit of $X$ vs. $(Q - q_0)$ → coefficients $(a, b)$.
4. **Uncertainty**:
   - Residual RMS → $\sigma_{X|Q}$
   - Total quantile uncertainty: $\sigma_Q = \sigma_{X|Q} / |b|$
   - Irreducible term: $\sigma_{Q,\mathrm{irr}} = |dX/dN| \cdot \sigma_N / |b|$ with $\sigma_N \approx \kappa_w \sqrt{N_{\text{proxy}}}$
5. **Monotonicity**:
   - Enforce $b > b_{\min}$.
   - Floor policy:
     - `"auto"`: $b_{\min} = 0.25 \cdot \sigma_X / (2\Delta q)$ (heuristic)
     - `"fixed"`: constant floor (default $10^{-6}$)
   - Record `clipped_frac` in `fit_stats`.
6. **Tabulation**: write row with coefficients, diagnostics, and centers of nuisance bins.

**Edge quantiles**: same $\Delta q$ policy near $q=0,1$ (no special gating by default).
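Steps 1 to 5 can be sketched for a single grid cell as follows. This is a minimal illustration, not the module's implementation: `fit_cell` is a hypothetical helper, the minimum of three points is an assumed skip rule, $\sigma_X$ in the `"auto"` floor is taken to be the spread of $X$ inside the window (the spec does not define it further), and the fixed floor is used as a lower guard on the auto floor.

```python
import numpy as np

def fit_cell(Q, X, is_outlier, q0, dq=0.05, b_min_option="auto", b_min_fixed=1e-6):
    """Fit X = a + b*(Q - q0) in one grid cell; returns a dict or None."""
    Q = np.asarray(Q, dtype=float)
    X = np.asarray(X, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)

    in_window = np.abs(Q - q0) <= dq              # step 1: window selection
    keep = in_window & ~is_outlier                # step 2: masking
    n_window, n_keep = int(in_window.sum()), int(keep.sum())
    if n_keep < 3:                                # assumption: skip sparse cells
        return None
    masked_frac = 1.0 - n_keep / n_window

    dQ = Q[keep] - q0
    b, a = np.polyfit(dQ, X[keep], deg=1)         # step 3: OLS, a is the value at q0
    resid = X[keep] - (a + b * dQ)
    sigma_x_given_q = float(np.sqrt(np.mean(resid ** 2)))   # step 4: residual RMS

    # step 5: slope floor. Assumption: sigma_X is the spread of X in the window.
    if b_min_option == "auto":
        b_min = 0.25 * float(np.std(X[keep])) / (2.0 * dq)
        b_min = max(b_min, b_min_fixed)           # guard against a zero floor
    else:
        b_min = b_min_fixed
    clipped = bool(b < b_min)
    b = max(float(b), b_min)

    return {
        "a": float(a),
        "b": b,
        "sigma_Q": sigma_x_given_q / abs(b),
        "masked_frac": masked_frac,
        "clipped": clipped,
        "Npoints": n_keep,
    }
```

A caller would loop this over channels, `q_center` values, and nuisance bins, writing one table row per returned dict (step 6).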

---

## 7. Interpolation and monotonicity preservation

* **Separable interpolation** along nuisance axes (e.g., `z`, `eta`, `time`) using linear or shape-preserving PCHIP.
* **Monotone axis**: $Q$. At evaluation: nearest or linear between adjacent `q_center` points.
* **Guarantee**: if all tabulated $b>0$ and nuisance interpolation does not cross zero, monotonicity in $Q$ is preserved. Any interpolated $b \le 0$ is clipped to $b_{\min}$.

Correlations between nuisance axes are **diagnosed** (scores stored) but **not** modeled by tensor interpolation in v3.1.
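A minimal sketch of separable interpolation over two nuisance axes, using linear interpolation only. The function names and the `(z, eta)` table layout are hypothetical; a PCHIP variant could swap `np.interp` for `scipy.interpolate.PchipInterpolator`.

```python
import numpy as np

def interp_separable(z, eta, z_centers, eta_centers, table):
    """Interpolate table[iz, ieta] at (z, eta), one axis at a time."""
    table = np.asarray(table, dtype=float)
    # Collapse the z axis first: one 1D interpolation per eta column.
    along_z = np.array(
        [np.interp(z, z_centers, table[:, j]) for j in range(table.shape[1])]
    )
    # Then interpolate the resulting 1D profile along eta.
    return float(np.interp(eta, eta_centers, along_z))

def slope_at(z, eta, z_centers, eta_centers, b_table, b_min):
    """Interpolated slope with the floor re-applied as a safeguard."""
    b = interp_separable(z, eta, z_centers, eta_centers, b_table)
    return max(b, b_min)

# Usage with a small illustrative slope table.
z_centers = np.array([-10.0, 0.0, 10.0])
eta_centers = np.array([-1.0, 1.0])
b_table = 2.0 + 0.1 * z_centers[:, None] + 0.5 * eta_centers[None, :]
b_mid = slope_at(5.0, 0.0, z_centers, eta_centers, b_table, b_min=1e-6)
```

Note that `np.interp` holds the edge values constant outside the grid, which is one possible extrapolation policy rather than something the spec prescribes.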

---

## 8. Public API (summary)

### Fitting

```python
import numpy as np

from quantile_fit_nd import fit_quantile_linear_nd

# df: input DataFrame with the columns listed in Section 4.
result_table = fit_quantile_linear_nd(
    df,
    channel_key="channel_id",
    q_centers=np.linspace(0, 1, 11),
    dq=0.05,
    nuisance_axes={"z": "z_vtx"},  # add {"eta": "eta"}, {"time": "timestamp"} later
    mask_col="is_outlier",
    b_min_option="auto",           # or "fixed"
    fit_mode="ols",                # "huber" optional in later versions
)  # returns a pandas.DataFrame
```

### Evaluation

```python
from quantile_fit_nd import QuantileEvaluator

evaluator = QuantileEvaluator(result_table)

# Interpolated parameters at coordinates:
a, b, sigma_Q = evaluator.params(channel_id=42, q=0.40, z=2.1)

# Invert amplitude to rank (clip to [0,1]):
Q = evaluator.invert_rank(X=123.0, channel_id=42, z=2.1)
```
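To make the inversion concrete: solving the local model for the rank gives $Q = q_0 + (X - a)/b$ in each cell. The sketch below is one possible reading of the "nearest" policy, not the module's actual `invert_rank`. The helper name is hypothetical, and it assumes the coefficients have already been interpolated to the requested nuisance coordinates and that every `b` is positive.

```python
import numpy as np

def invert_rank_nearest(X, q_centers, a, b):
    """Invert X -> Q using the cell whose center is closest to its own solution."""
    q_centers = np.asarray(q_centers, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Candidate rank from each local line: Q_k = q_k + (X - a_k) / b_k.
    q_candidates = q_centers + (X - a) / b
    # Keep the candidate that stays closest to its own expansion point.
    k = int(np.argmin(np.abs(q_candidates - q_centers)))
    return float(np.clip(q_candidates[k], 0.0, 1.0))

# Usage on an exactly linear toy channel, X = 10 + 50*Q.
q_centers = np.linspace(0.0, 1.0, 11)
a_lin = 10.0 + 50.0 * q_centers
b_lin = np.full_like(q_centers, 50.0)
q_est = invert_rank_nearest(31.0, q_centers, a_lin, b_lin)
```

For a curved inverse CDF the result is only as accurate as the local linearization, which is why the candidate closest to its own $q_0$ is preferred.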

### Persistence

```python
# Assumes the I/O helpers are exported by quantile_fit_nd.py (see Section 2).
from quantile_fit_nd import load_table, save_table

save_table(df, "calibration.parquet")
save_table(df, "calibration.arrow", fmt="arrow")
save_table(df, "calibration.root", fmt="root")  # requires uproot/PyROOT
df2 = load_table("calibration.parquet")
```

---

## 9. Derivatives & irreducible error

* **Finite differences** for `db_dz`, `db_deta` at grid centers (central where possible; forward/backward at edges).
* **Irreducible error** (stored as `sigma_Q_irr`):
  $\sigma_{Q,\mathrm{irr}} = |dX/dN| \cdot \sigma_N / |b|$, with $\sigma_N \approx \kappa_w \sqrt{N_{\text{proxy}}}$.
  `kappa_w` (default 1.3) reflects weight fluctuations (documented constant; can be overridden).

> For data without truth $N$, $dX/dN$ may be estimated against a stable multiplicity proxy from the combined estimator.
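Both quantities are short to compute. In the sketch below, `np.gradient` with `edge_order=1` gives central differences at interior grid points and one-sided differences at the edges, matching the policy above. The function names are hypothetical.

```python
import numpy as np

def slope_derivative(b, centers):
    """db/d<axis> at grid centers: central inside, one-sided at the edges."""
    b = np.asarray(b, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return np.gradient(b, centers, edge_order=1)

def sigma_q_irr(dX_dN, n_proxy, b, kappa_w=1.3):
    """Irreducible rank error |dX/dN| * sigma_N / |b|, sigma_N = kappa_w * sqrt(N_proxy)."""
    sigma_n = kappa_w * np.sqrt(n_proxy)
    return np.abs(dX_dN) * sigma_n / np.abs(b)

# Usage: a slope that varies linearly with z has a constant derivative.
z_centers = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
b_vals = 2.0 + 0.1 * z_centers
db_dz = slope_derivative(b_vals, z_centers)
irr = sigma_q_irr(dX_dN=0.5, n_proxy=100.0, b=2.0)
```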

---

## 10. QA & summaries

Optional **per-channel summary** rows per calibration period:

* mean/median of `sigma_Q`,
* percentage of cells clipped by `b_min`,
* masked fraction,
* residual RMS, `chi2_ndf`,
* counts of fitted vs. skipped cells.

Drift/stability analysis is expected in external tooling by **chaining** calibration tables over time.
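A per-channel summary along these lines could be built with a pandas groupby. The sketch assumes `fit_stats` is a JSON string per row and treats a cell as clipped when its `clipped_frac` is positive. Both are assumptions, as are the function and output column names. Skipped cells are not counted because they would not appear in the table.

```python
import json

import pandas as pd

def summarize_channels(table):
    """One summary row per channel from the flat fit table."""
    # Expand the JSON diagnostics into ordinary columns.
    stats = pd.DataFrame(
        [json.loads(s) for s in table["fit_stats"]], index=table.index
    )
    flat = pd.concat([table[["channel_id", "sigma_Q"]], stats], axis=1)
    return flat.groupby("channel_id").agg(
        sigma_Q_mean=("sigma_Q", "mean"),
        sigma_Q_median=("sigma_Q", "median"),
        clipped_pct=("clipped_frac", lambda s: 100.0 * (s > 0).mean()),
        masked_frac_mean=("masked_frac", "mean"),
        rms_mean=("RMS", "mean"),
        n_cells=("sigma_Q", "size"),
    )

# Usage with three illustrative cells across two channels.
demo = pd.DataFrame(
    {
        "channel_id": [1, 1, 2],
        "sigma_Q": [0.01, 0.03, 0.02],
        "fit_stats": [
            json.dumps({"RMS": 0.5, "masked_frac": 0.0, "clipped_frac": 0.0}),
            json.dumps({"RMS": 0.7, "masked_frac": 0.1, "clipped_frac": 1.0}),
            json.dumps({"RMS": 0.6, "masked_frac": 0.0, "clipped_frac": 0.0}),
        ],
    }
)
summary = summarize_channels(demo)
```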

---

## 11. Unit & synthetic tests (see `test_quantile_fit_nd.py`)

| Test ID | Purpose |
| ------- | ------- |
| T00 | Smoke test (single channel, $(q, z)$ grid) |
| T01 | Monotonicity enforcement (all $b > b_{\min}$) |
| T02 | Edge behavior near $q \in \{0, 1\}$ per policy |
| T03 | Outlier masking stability |
| T04 | $\sigma_Q$ scaling vs. injected noise |
| T05 | `db_dz` finite-diff accuracy on known slope |
| T06 | Round-trip $Q \to X \to Q$ small residual |
| T07 | Parquet/Arrow/ROOT save/load parity |
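As an illustration of what T04 might check, the sketch below injects Gaussian noise into a linear toy channel and verifies that the fitted $\sigma_Q$ tracks `noise / b`. It does not call the module; the helper, the sample size, and the tolerances are all choices made for this example.

```python
import numpy as np

def fitted_sigma_q(noise, b_true=50.0, q0=0.5, dq=0.05, n=20000, seed=0):
    """Fit one window of synthetic data and return sigma_Q = RMS / |b|."""
    rng = np.random.default_rng(seed)
    Q = rng.uniform(q0 - dq, q0 + dq, n)
    X = 10.0 + b_true * Q + rng.normal(0.0, noise, n)
    b, a = np.polyfit(Q - q0, X, deg=1)
    rms = np.sqrt(np.mean((X - (a + b * (Q - q0))) ** 2))
    return float(rms / abs(b))

def test_sigma_q_scales_with_noise():
    s1 = fitted_sigma_q(noise=0.5)
    s2 = fitted_sigma_q(noise=1.0)
    # Doubling the injected noise should double sigma_Q.
    assert abs(s2 / s1 - 2.0) < 0.1
    # The absolute scale should match noise / slope.
    assert abs(s1 - 0.5 / 50.0) < 0.001
```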

---

## 12. Performance expectations

| Aspect | Estimate |
| ------ | -------- |
| Complexity | $O(N \cdot \Delta q)$ per channel |
| CPU | $(q, z)$ fit: seconds; ND adds ~20–30% from interpolation |
| Parallelization | Natural via Pandas/Dask groupby |
| Table size | $O(\text{grid points} \times \text{channels})$, MB-scale |
| Storage | Parquet typically < 10 MB per calibration slice |

---

## 13. Configurable parameters

| Name | Default | Meaning |
| ---- | ------- | ------- |
| `dq` | 0.05 | Quantile window half-width |
| `b_min_option` | `auto` | Slope floor policy (`auto` or `fixed`) |
| `fit_mode` | `ols` | Regression type |
| `mask_col` | `is_outlier` | Outlier flag column |
| `kappa_w` | 1.3 | Weight-fluctuation factor (doc/override) |
| `nuisance_axes` | `{"z": "z_vtx"}` | Axes for smoothing |
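One way to carry these defaults around is a small frozen dataclass. `QuantileFitConfig` is a hypothetical name, not part of the documented API, and the validation bounds (for example `dq <= 0.5`) are assumptions made for this sketch.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class QuantileFitConfig:
    """Defaults from the table above, with light validation."""
    dq: float = 0.05
    b_min_option: str = "auto"
    fit_mode: str = "ols"
    mask_col: str = "is_outlier"
    kappa_w: float = 1.3
    nuisance_axes: dict = field(default_factory=lambda: {"z": "z_vtx"})

    def __post_init__(self):
        # Assumed bound: a half-width above 0.5 would span the whole rank range.
        if not 0.0 < self.dq <= 0.5:
            raise ValueError(f"dq must be in (0, 0.5], got {self.dq}")
        if self.b_min_option not in ("auto", "fixed"):
            raise ValueError(f"unknown b_min_option: {self.b_min_option!r}")

cfg = QuantileFitConfig()
wide = QuantileFitConfig(dq=0.10, b_min_option="fixed")
```

The fields could then be forwarded as keyword arguments, e.g. `dq=cfg.dq, mask_col=cfg.mask_col`.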

---

## 14. Future extensions

* Optional **Huber** robust regression mode.
* Degree-2 local fits with derivative-based monotonicity checks.
* Covariance modeling across nuisance axes.
* Adaptive time binning based on drift thresholds.
* ML-ready derivatives and cost-function integration.

---

## 15. References

* PWG-P context: combined multiplicity/flow estimator materials.
* RootInteractive / AliasDataFrame pipelines for calibration QA.

---
