| 1 | +# quantile_fit_nd — Generic ND Quantile Linear Fitting Framework |
| 2 | +**Version:** v3.1 |
| 3 | +**Status:** Implementation Ready |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## 1. Overview |
| 8 | + |
| 9 | +This module provides a detector-agnostic framework for **quantile-based linear fitting** used in calibration, combined multiplicity estimation, and flow monitoring. |
| 10 | + |
| 11 | +We approximate the local inverse quantile function around each quantile grid point $q_0$ as: |
| 12 | + |
| 13 | +$$ |
| 14 | +X(Q, \mathbf{n}) \;=\; a(q_0,\mathbf{n}) \;+\; b(q_0,\mathbf{n}) \cdot (Q - q_0) |
| 15 | +$$ |
| 16 | + |
| 17 | +where: |
| 18 | +- $Q$ is the quantile rank of the amplitude, |
| 19 | +- $\mathbf{n}$ are nuisance coordinates (e.g., $z_{\mathrm{vtx}}, \eta, t$), |
| 20 | +- $a$ is the OLS intercept at $q_0$, |
| 21 | +- $b>0$ is the local slope (monotonicity in $Q$). |
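
As a minimal sketch of the model above, the local line can be evaluated and, because $b>0$, inverted for $Q$. The helper names `x_from_q` and `q_from_x` and the numbers used are hypothetical illustrations, not part of the module's API.

```python
def x_from_q(q, a, b, q0):
    """Evaluate the local linear model X = a + b * (Q - q0)."""
    return a + b * (q - q0)


def q_from_x(x, a, b, q0):
    """Invert the local model for Q and clip the result to [0, 1]."""
    if b <= 0:
        raise ValueError("slope b must be positive for the inverse to exist")
    q = q0 + (x - a) / b
    return min(1.0, max(0.0, q))


# Illustrative coefficients only: a = 100, b = 50 at q0 = 0.4.
x = x_from_q(0.45, a=100.0, b=50.0, q0=0.4)   # 100 + 50 * 0.05
q = q_from_x(x, a=100.0, b=50.0, q0=0.4)      # recovers the input rank
```

The inverse is only well defined for a strictly positive slope, which is why the framework enforces a slope floor.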
| 22 | + |
| 23 | +The framework outputs **tabulated coefficients and diagnostics** in a flat DataFrame for time-series monitoring, ML downstream use, and export to Parquet/Arrow/ROOT. |
| 24 | + |
| 25 | +--- |
| 26 | + |
| 27 | +## 2. Directory contents |
| 28 | + |
| 29 | +| File | Role | |
| 30 | +|---|---| |
| 31 | +| `quantile_fit_nd.py` | Implementation (fit, interpolation, evaluator, I/O) | |
| 32 | +| `test_quantile_fit_nd.py` | Unit & synthetic tests | |
| 33 | +| `quantile_fit_nd.md` | This design & usage document | |
| 34 | + |
| 35 | +--- |
| 36 | + |
| 37 | +## 3. Goals |
| 38 | + |
| 39 | +1. Fit local linear inverse-CDF per **channel** with monotonicity in $Q$. |
| 40 | +2. Smooth over nuisance axes with separable interpolation (linear/PCHIP). |
| 41 | +3. Provide **physics-driven** slope floors to avoid rank blow-ups. |
| 42 | +4. Store results as **DataFrames** with rich diagnostics and metadata. |
| 43 | +5. Keep the API **detector-independent** (no detector ID in core interface). |
| 44 | + |
| 45 | +--- |
| 46 | + |
| 47 | +## 4. Required input columns |
| 48 | + |
| 49 | +| Column | Description | |
| 50 | +|---|---| |
| 51 | +| `channel_id` | Unique local channel key | |
| 52 | +| `Q` | Quantile rank (normalized by detector reference) | |
| 53 | +| `X` | Measured amplitude (or normalized signal) | |
| 54 | +| `z_vtx`, `eta`, `time` | Nuisance coordinates (configurable subset) | |
| 55 | +| `is_outlier` | Optional boolean mask; `True` rows are excluded from fits | |
| 56 | + |
| 57 | +> Preprocessing (e.g., timing outliers) is expected to fill `is_outlier`. |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +## 5. Output table schema |
| 62 | + |
| 63 | +The fit returns a flat, appendable table with explicit grid points. |
| 64 | + |
| 65 | +| Column | Description | |
| 66 | +|---|---| |
| 67 | +| `channel_id` | Channel identifier | |
| 68 | +| `q_center` | Quantile center of the local fit | |
| 69 | +| `<axis>_center` | Centers of nuisance bins (e.g., `z_center`) | |
| 70 | +| `a` | Intercept (from OLS at $q_0$) | |
| 71 | +| `b` | Slope (clipped to $b_{\min}>0$ if needed) | |
| 72 | +| `sigma_Q` | Total quantile uncertainty $\sigma_{X \mid Q} / \lvert b \rvert$ | |
| 73 | +| `sigma_Q_irr` | Irreducible error (from multiplicity fluctuation) | |
| 74 | +| `dX_dN` | Sensitivity to multiplicity proxy (optional) | |
| 75 | +| `db_d<axis>` | Finite-difference derivative along each nuisance axis | |
| 76 | +| `fit_stats` | JSON with `Npoints`, `RMS`, `chi2_ndf`, `masked_frac`, `clipped_frac` | |
| 77 | +| `timestamp` | Calibration/run time (optional) | |
| 78 | + |
| 79 | +**Example metadata stored in `DataFrame.attrs`:** |
| 80 | +```json |
| 81 | +{ |
| 82 | + "model": "X = a + b*(Q - q_center)", |
| 83 | + "dq": 0.05, |
| 84 | + "b_min_option": "auto", |
| 85 | + "b_min_formula": "b_min = 0.25 * sigma_X / (2*dq)", |
| 86 | + "axes": ["q", "z"], |
| 87 | + "fit_mode": "ols", |
| 88 | + "kappa_w": 1.3 |
| 89 | +} |
| 90 | +``` |
| 91 | + |
| 92 | +--- |
| 93 | + |
| 94 | +## 6. Fit procedure (per channel, per grid cell) |
| 95 | + |
| 96 | +1. **Window selection**: select rows with $|Q - q_0| \le \Delta q$ (default $\Delta q = 0.05$). |
| 97 | +2. **Masking**: use rows where `is_outlier == False`. Record the masked fraction. |
| 98 | +3. **Local regression**: OLS fit of $X$ vs. $(Q - q_0)$ → coefficients $a$, $b$. |
| 99 | +4. **Uncertainty**: |
| 100 | + |
| 101 | +   - Residual RMS → $\sigma_{X|Q}$ |
| 102 | +   - Total quantile uncertainty: $\sigma_Q = \sigma_{X|Q} / |b|$ |
| 103 | +   - Irreducible term: $\sigma_{Q,\mathrm{irr}} = |dX/dN| \cdot \sigma_N / |b|$ with $\sigma_N \approx \kappa_w \sqrt{N_{\text{proxy}}}$ |
| 104 | +5. **Monotonicity**: |
| 105 | + |
| 106 | +   - Enforce $b \ge b_{\min}$; slopes below the floor are clipped to $b_{\min}$. |
| 107 | +   - Floor policy: |
| 108 | + |
| 109 | +     - `"auto"`: $b_{\min} = 0.25 \cdot \sigma_X / (2\Delta q)$ (heuristic) |
| 110 | +     - `"fixed"`: constant floor (default $10^{-6}$) |
| 111 | +   - Record `clipped_frac` in `fit_stats`. |
| 112 | +6. **Tabulation**: write row with coefficients, diagnostics, and centers of nuisance bins. |
| 113 | + |
| 114 | +**Edge quantiles**: same $\Delta q$ policy near $q=0,1$ (no special gating by default). |
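
The per-cell steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the module's implementation: the function name `fit_cell`, the `min_points` guard, and the choice of $\sigma_X$ as the standard deviation of $X$ inside the window are all hypothetical, since the document does not define them.

```python
import numpy as np


def fit_cell(Q, X, q0, dq=0.05, is_outlier=None,
             b_min_option="auto", b_min_fixed=1e-6, min_points=3):
    """Fit X = a + b*(Q - q0) for one grid cell; returns a dict or None."""
    Q = np.asarray(Q, dtype=float)
    X = np.asarray(X, dtype=float)

    # Steps 1-2: window selection, then outlier masking.
    in_window = np.abs(Q - q0) <= dq
    keep = in_window.copy()
    if is_outlier is not None:
        keep &= ~np.asarray(is_outlier, dtype=bool)
    n_window, n_used = int(in_window.sum()), int(keep.sum())
    if n_used < min_points:
        return None  # too few points: the caller may count the cell as skipped

    # Step 3: OLS of X against (Q - q0); polyfit returns [slope, intercept].
    dQ, x = Q[keep] - q0, X[keep]
    b, a = np.polyfit(dQ, x, deg=1)

    # Step 4: residual RMS, used as sigma_{X|Q}.
    rms = float(np.sqrt(np.mean((x - (a + b * dQ)) ** 2)))

    # Step 5: slope floor. ASSUMPTION: sigma_X is the std of X in the window.
    if b_min_option == "auto":
        b_min = 0.25 * float(np.std(x)) / (2.0 * dq)
    else:
        b_min = b_min_fixed
    clipped = bool(b < b_min)
    b = max(float(b), b_min)

    return {
        "a": float(a),
        "b": b,
        "sigma_Q": rms / b if b > 0 else float("inf"),
        "masked_frac": 1.0 - n_used / n_window,
        "clipped": clipped,
    }


# Synthetic check: X = 100 + 40*Q plus Gaussian noise of width 0.5.
rng = np.random.default_rng(0)
Q = rng.uniform(0.0, 1.0, 5000)
X = 100.0 + 40.0 * Q + rng.normal(0.0, 0.5, 5000)
cell = fit_cell(Q, X, q0=0.5)
```

For this synthetic input the fit should return an intercept near 120 and a slope near 40, with `sigma_Q` close to the injected noise divided by the slope.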
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## 7. Interpolation and monotonicity preservation |
| 119 | + |
| 120 | +* **Separable interpolation** along nuisance axes (e.g., `z`, `eta`, `time`) using linear or shape-preserving PCHIP. |
| 121 | +* **Monotone axis**: $Q$. At evaluation, use the nearest `q_center` or interpolate linearly between adjacent `q_center` points. |
| 122 | +* **Guarantee**: if all tabulated $b>0$ and nuisance interpolation does not cross zero, monotonicity in $Q$ is preserved. Any interpolated $b \le 0$ is clipped to $b_{\min}$. |
| 123 | + |
| 124 | +Correlations between nuisance axes are **diagnosed** (scores stored) but **not** modeled by tensor interpolation in v3.1. |
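
A minimal sketch of separable interpolation on a two-axis grid, assuming linear interpolation via `np.interp` (PCHIP is not shown). The function name `interp_slope_2d` and the grid layout are hypothetical; the centers must be sorted in increasing order, and `np.interp` holds the edge value constant outside the grid.

```python
import numpy as np


def interp_slope_2d(z, eta, z_centers, eta_centers, b_grid, b_min):
    """Interpolate b at (z, eta); b_grid[i, j] sits at (z_centers[i], eta_centers[j])."""
    b_grid = np.asarray(b_grid, dtype=float)

    # Pass 1: interpolate along z separately for each eta center.
    b_at_z = np.array([
        np.interp(z, z_centers, b_grid[:, j])
        for j in range(b_grid.shape[1])
    ])

    # Pass 2: interpolate the remaining 1D profile along eta.
    b = float(np.interp(eta, eta_centers, b_at_z))

    # Safeguard from the text: a non-positive slope is replaced by the floor.
    return b if b > 0.0 else b_min


z_centers = [0.0, 1.0]
eta_centers = [0.0, 1.0]
b_grid = [[10.0, 20.0],
          [30.0, 40.0]]
b_mid = interp_slope_2d(0.5, 0.5, z_centers, eta_centers, b_grid, b_min=1e-6)
```

Linear interpolation between positive tabulated values is a convex combination and therefore stays positive, so the clip only acts when a tabulated slope is itself non-positive or when an interpolant that can overshoot is used.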
| 125 | + |
| 126 | +--- |
| 127 | + |
| 128 | +## 8. Public API (summary) |
| 129 | + |
| 130 | +### Fitting |
| 131 | + |
| 132 | +```python |
| 133 | +fit_quantile_linear_nd( |
| 134 | + df, |
| 135 | + channel_key="channel_id", |
| 136 | + q_centers=np.linspace(0, 1, 11), |
| 137 | + dq=0.05, |
| 138 | + nuisance_axes={"z": "z_vtx"}, # add {"eta": "eta"}, {"time": "timestamp"} later |
| 139 | + mask_col="is_outlier", |
| 140 | + b_min_option="auto", # or "fixed" |
| 141 | + fit_mode="ols" # "huber" optional in later versions |
| 142 | +) -> pandas.DataFrame |
| 143 | +``` |
| 144 | + |
| 145 | +### Evaluation |
| 146 | + |
| 147 | +```python |
| 148 | +evaluator = QuantileEvaluator(result_table)  # named to avoid shadowing the built-in eval() |
| 149 | + |
| 150 | +# Interpolated parameters at coordinates: |
| 151 | +a, b, sigma_Q = evaluator.params(channel_id=42, q=0.40, z=2.1) |
| 152 | + |
| 153 | +# Invert amplitude to rank (clip to [0,1]): |
| 154 | +Q = evaluator.invert_rank(X=123.0, channel_id=42, z=2.1) |
| 155 | +``` |
| 156 | + |
| 157 | +### Persistence |
| 158 | + |
| 159 | +```python |
| 160 | +save_table(df, "calibration.parquet") |
| 161 | +save_table(df, "calibration.arrow", fmt="arrow") |
| 162 | +save_table(df, "calibration.root", fmt="root") # requires uproot/PyROOT |
| 163 | +df2 = load_table("calibration.parquet") |
| 164 | +``` |
| 165 | + |
| 166 | +--- |
| 167 | + |
| 168 | +## 9. Derivatives & irreducible error |
| 169 | + |
| 170 | +* **Finite differences** for `db_dz`, `db_deta` at grid centers (central where possible; forward/backward at edges). |
| 171 | +* **Irreducible error** (stored as `sigma_Q_irr`): |
| 172 | +  $\sigma_{Q,\mathrm{irr}} = |dX/dN| \cdot \sigma_N / |b|$, with $\sigma_N = \kappa_w \sqrt{N_{\text{proxy}}}$. |
| 173 | +  `kappa_w` (default 1.3) reflects weight fluctuations (documented constant; can be overridden). |
| 174 | + |
| 175 | +> For data without truth $N$, $dX/dN$ may be estimated against a stable multiplicity proxy from the combined estimator. |
| 176 | + |
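
Both quantities in this section can be sketched in a few lines of NumPy. `np.gradient` uses central differences in the interior and one-sided differences at the edges, which matches the scheme described above. The grid values, the helper name `sigma_q_irr`, and the input numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical grid: slope b tabulated at five z centers, linear in z.
z_centers = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
b = 40.0 + 0.5 * z_centers

# Central differences in the interior, forward/backward at the two edges.
db_dz = np.gradient(b, z_centers)


def sigma_q_irr(dX_dN, n_proxy, b, kappa_w=1.3):
    """sigma_Q_irr = |dX/dN| * sigma_N / |b| with sigma_N = kappa_w * sqrt(N_proxy)."""
    sigma_n = kappa_w * np.sqrt(n_proxy)
    return np.abs(dX_dN) * sigma_n / np.abs(b)


# Illustrative numbers: 0.2 * (1.3 * 10) / 40
irr = sigma_q_irr(dX_dN=0.2, n_proxy=100.0, b=40.0)
```

Because the example slope is exactly linear in $z$, every entry of `db_dz` equals 0.5, which makes this a convenient reference case for a test such as T05.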
| 177 | +--- |
| 178 | + |
| 179 | +## 10. QA & summaries |
| 180 | + |
| 181 | +Optional **per-channel summary** rows per calibration period: |
| 182 | + |
| 183 | +* mean/median of `sigma_Q`, |
| 184 | +* `%` of cells clipped by `b_min`, |
| 185 | +* masked fraction, |
| 186 | +* residual RMS, `chi2_ndf`, |
| 187 | +* counts of fitted vs. skipped cells. |
| 188 | + |
| 189 | +Drift/stability analysis is expected in external tooling by **chaining** calibration tables over time. |
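
A per-channel summary of this kind could be assembled as sketched below, using only the standard library. The row layout mirrors the output schema, but the function name `summarize`, the sample values, and the rule that a cell counts as clipped when its `clipped_frac` is above zero are assumptions made for illustration.

```python
import json
from collections import defaultdict
from statistics import mean, median


def make_row(channel_id, sigma_q, masked_frac, clipped_frac):
    """Build one output-table row with fit_stats stored as a JSON string."""
    stats = {"masked_frac": masked_frac, "clipped_frac": clipped_frac}
    return {"channel_id": channel_id, "sigma_Q": sigma_q,
            "fit_stats": json.dumps(stats)}


def summarize(rows):
    """Aggregate fitted cells into one summary dict per channel."""
    by_channel = defaultdict(list)
    for row in rows:
        by_channel[row["channel_id"]].append(row)

    summary = {}
    for channel_id, cells in by_channel.items():
        stats = [json.loads(cell["fit_stats"]) for cell in cells]
        sigmas = [cell["sigma_Q"] for cell in cells]
        # ASSUMPTION: a cell is "clipped" when its clipped_frac is above zero.
        n_clipped = sum(1 for s in stats if s["clipped_frac"] > 0)
        summary[channel_id] = {
            "n_cells": len(cells),
            "sigma_Q_mean": mean(sigmas),
            "sigma_Q_median": median(sigmas),
            "pct_cells_clipped": 100.0 * n_clipped / len(cells),
            "masked_frac_mean": mean(s["masked_frac"] for s in stats),
        }
    return summary


rows = [
    make_row(1, 0.010, 0.0, 0.0),
    make_row(1, 0.020, 0.1, 0.0),
    make_row(1, 0.030, 0.2, 1.0),
    make_row(2, 0.050, 0.0, 0.0),
]
summary = summarize(rows)
```

Keeping the summary as plain dictionaries makes it easy to append one record per calibration period to an external drift-monitoring table.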
| 190 | + |
| 191 | +--- |
| 192 | + |
| 193 | +## 11. Unit & synthetic tests (see `test_quantile_fit_nd.py`) |
| 194 | + |
| 195 | +| Test ID | Purpose | |
| 196 | +| ------- | --------------------------------------------- | |
| 197 | +| T00     | Smoke test (single channel, $(q,z)$ grid)     | |
| 198 | +| T01     | Monotonicity enforcement (all $b \ge b_{\min}$) | |
| 199 | +| T02     | Edge behavior near $q \in \{0,1\}$ per policy | |
| 200 | +| T03     | Outlier masking stability                     | |
| 201 | +| T04     | $\sigma_Q$ scaling vs. injected noise         | |
| 202 | +| T05     | `db_dz` finite-diff accuracy on known slope   | |
| 203 | +| T06     | Round-trip $Q \to X \to Q$ small residual     | |
| 204 | +| T07 | Parquet/Arrow/ROOT save/load parity | |
| 205 | + |
| 206 | +--- |
| 207 | + |
| 208 | +## 12. Performance expectations |
| 209 | + |
| 210 | +| Aspect | Estimate | |
| 211 | +| --------------- | -------------------------------------------------------- | |
| 212 | +| Complexity      | $O(N \cdot \Delta q)$ per channel                        | |
| 213 | +| CPU             | $(q,z)$ fit: seconds; ND adds ~20–30% from interpolation | |
| 214 | +| Parallelization | Natural via Pandas/Dask groupby                          | |
| 215 | +| Table size      | $O(\text{grid points} \times \text{channels})$, MB-scale | |
| 216 | +| Storage | Parquet typically < 10 MB per calibration slice | |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +## 13. Configurable parameters |
| 221 | + |
| 222 | +| Name | Default | Meaning | |
| 223 | +| --------------- | ---------------- | ---------------------------------------- | |
| 224 | +| `dq` | 0.05 | Quantile window half-width | |
| 225 | +| `b_min_option` | `auto` | Slope floor policy (`auto` or `fixed`) | |
| 226 | +| `fit_mode` | `ols` | Regression type | |
| 227 | +| `mask_col` | `is_outlier` | Outlier flag column | |
| 228 | +| `kappa_w` | 1.3 | Weight-fluctuation factor (doc/override) | |
| 229 | +| `nuisance_axes` | `{"z": "z_vtx"}` | Axes for smoothing | |
| 230 | + |
| 231 | +--- |
| 232 | + |
| 233 | +## 14. Future extensions |
| 234 | + |
| 235 | +* Optional **Huber** robust regression mode. |
| 236 | +* Degree-2 local fits with derivative-based monotonicity checks. |
| 237 | +* Covariance modeling across nuisance axes. |
| 238 | +* Adaptive time binning based on drift thresholds. |
| 239 | +* ML-ready derivatives and cost-function integration. |
| 240 | + |
| 241 | +--- |
| 242 | + |
| 243 | +## 15. References |
| 244 | + |
| 245 | +* PWG-P context: combined multiplicity/flow estimator materials. |
| 246 | +* RootInteractive / AliasDataFrame pipelines for calibration QA. |
| 247 | + |
| 248 | +--- |