
Commit 57b3293

Author: miranov25
docs(groupby_regression): add Performance & Benchmarking section + fix Markdown tables
- Added new "Performance & Benchmarking" section describing benchmark usage, results, and interpretation
- Included CLion-compatible Markdown tables for output columns, example results, and recommendations
- Documented benchmark command line and sample outputs (50k rows / 10k groups)
- Clarified how sigmaCut and parallelization affect runtime
- Minor formatting and readability improvements across the file
1 parent cd63f42 commit 57b3293

File tree

1 file changed (+80, -36 lines)


UTILS/dfextensions/groupby_regression.md

Lines changed: 80 additions & 36 deletions
@@ -27,8 +27,8 @@ Performs group-wise **ordinary least squares (OLS)** regression fits.

* `(df_out, dfGB)`:

- * `df_out`: Original DataFrame with predictions (if enabled)
- * `dfGB`: Per-group statistics, including slopes, intercepts, medians, and bin counts
+ * `df_out`: Original DataFrame with predictions (if enabled)
+ * `dfGB`: Per-group statistics, including slopes, intercepts, medians, and bin counts

---

@@ -55,18 +55,18 @@ Performs **robust group-wise regression** using `HuberRegressor`, with optional
from groupby_regression import GroupByRegressor

df_out, dfGB = GroupByRegressor.make_parallel_fit(
-    df,
-    gb_columns=['detector_sector'],
-    fit_columns=['dEdx'],
-    linear_columns=['path_length', 'momentum'],
-    median_columns=['path_length'],
-    weights='w_dedx',
-    suffix='_calib',
-    selection=(df['track_quality'] > 0.9),
-    cast_dtype='float32',
-    addPrediction=True,
-    min_stat=[20, 20],
-    n_jobs=4
+    df,
+    gb_columns=['detector_sector'],
+    fit_columns=['dEdx'],
+    linear_columns=['path_length', 'momentum'],
+    median_columns=['path_length'],
+    weights='w_dedx',
+    suffix='_calib',
+    selection=(df['track_quality'] > 0.9),
+    cast_dtype='float32',
+    addPrediction=True,
+    min_stat=[20, 20],
+    n_jobs=4
)
```

@@ -128,16 +128,62 @@ df_out, dfGB = GroupByRegressor.make_parallel_fit(
* Exact recovery of known coefficients
* `cast_dtype` precision testing

- ## Tips
+ ## Performance & Benchmarking

- 💡 Use `cast_dtype='float16'` for storage savings, but ensure it's compatible with downstream numerical precision requirements.
- **Improvements for groupby\_regression.md**
+ ### Overview

- ---
+ To evaluate scaling and performance trade-offs, a dedicated benchmark tool is provided:

- ### Usage Example for `cast_dtype`
+ ```bash
+ python3 bench_groupby_regression.py \
+   --rows-per-group 5 --groups 10000 \
+   --n-jobs 10 --sigmaCut 5 --fitter ols \
+   --out bench_out --emit-csv
+ ```
+
+ Each run generates:
+
+ * `benchmark_report.txt` – human-readable summary
+ * `benchmark_results.json` / `.csv` – structured outputs for analysis
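As an editor's note on the "structured outputs for analysis" point above: the CSV can be post-processed directly with pandas. The sketch below is illustrative only; the column names (`scenario`, `wall_time_s`, `n_groups`) are assumptions, not the benchmark's documented schema.

```python
# Illustrative only: load the benchmark CSV and normalise timings.
# Column names ('scenario', 'wall_time_s', 'n_groups') are assumed;
# inspect benchmark_results.csv for the actual schema first.
import pandas as pd

res = pd.read_csv("bench_out/benchmark_results.csv")
print(res.columns.tolist())  # discover the real columns before relying on them

expected = {"scenario", "wall_time_s", "n_groups"}
if expected.issubset(res.columns):
    # normalise to the "seconds per 1k groups" figure used in the tables below
    res["s_per_1k_groups"] = res["wall_time_s"] / res["n_groups"] * 1000
    print(res[["scenario", "s_per_1k_groups"]].to_string(index=False))
```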
+
+ ### Example Results (50k rows / 10k groups ≈ 5 rows per group)
+
+ | Scenario                   | Config                  | Result                 | Notes           |
+ | -------------------------- | ----------------------- | ---------------------- | --------------- |
+ | Clean Data (Serial)        | `n_jobs=1, σCut=5, OLS` | **1.75 s / 1k groups** | Baseline        |
+ | Clean Data (Parallel 10)   | `n_jobs=10`             | **0.41 s / 1k groups** | ≈ 4.3× faster   |
+ | 10% Outliers (5σ, Serial)  | `n_jobs=1`              | **1.77 s / 1k groups** | ≈ same as clean |
+ | 5% Outliers (3σ, Serial)   | `n_jobs=1`              | **1.70 s / 1k groups** | Mild noise      |
+ | 10% Outliers (10σ, Serial) | `n_jobs=1`              | **1.81 s / 1k groups** | Still stable    |
+
+ *Hardware:* 12‑core Intel i7, Python 3.11, pandas 2.2, joblib 1.4
+ *Dataset:* synthetic (y = 2·x₁ + 3·x₂ + ε)
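For orientation, a dataset of this shape (10 000 groups × 5 rows, y = 2·x₁ + 3·x₂ + ε) can be reproduced in a few lines. This is a minimal sketch of the setup, not the benchmark's own generator; the column names and noise level are assumptions.

```python
# Sketch of a comparable synthetic dataset (names and noise level assumed).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_groups, rows_per_group = 10_000, 5
n = n_groups * rows_per_group

df_bench = pd.DataFrame({
    "group": np.repeat(np.arange(n_groups), rows_per_group),
    "x1": rng.uniform(0.0, 1.0, n),
    "x2": rng.uniform(0.0, 1.0, n),
})
# y = 2*x1 + 3*x2 + Gaussian noise, matching the stated model
df_bench["y"] = 2.0 * df_bench["x1"] + 3.0 * df_bench["x2"] + rng.normal(0.0, 0.1, n)
```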
+
+ ### Interpretation
+
+ * The **OLS path** scales linearly with group count.
+ * **Parallelization** provides 4–5× acceleration for thousands of small groups.
+ * Current synthetic *y‑only* outliers do **not** trigger re‑fitting overhead.
+ * Real‑data slowdowns (up to 25×) occur when **sigmaCut** forces iterative robust refits.
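To illustrate why a tight `sigmaCut` can dominate runtime, the sketch below shows the generic sigma-clipping loop such a refit implies: fit, reject residuals beyond `sigmaCut`·σ, then refit on the survivors. It is a simplified stand-in, not the library's internal implementation.

```python
# Generic sigma-clipping refit loop (illustrative; not GroupByRegressor internals).
import numpy as np

def fit_with_sigma_clip(x, y, sigma_cut=5.0, max_iter=3):
    """OLS line fit that iteratively rejects residuals beyond sigma_cut * sigma."""
    mask = np.ones_like(y, dtype=bool)
    for _ in range(max_iter):
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        resid = y - (slope * x + intercept)
        new_mask = np.abs(resid) < sigma_cut * resid[mask].std()
        if np.array_equal(new_mask, mask) or new_mask.sum() < 2:
            break                      # converged, or too few points to refit
        mask = new_mask                # every rejection pass costs another fit
    return slope, intercept, mask
```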

- In the `make_parallel_fit` and `make_linear_fit` functions, the `cast_dtype` parameter ensures consistent numeric precision for slope, intercept, and error terms. This is useful for long pipelines or for memory-sensitive applications.
+ ### Recommendations
+
+ | Use case                       | Suggested settings                                       |
+ | ------------------------------ | -------------------------------------------------------- |
+ | Clean data                     | `sigmaCut=100` (disable refit), use `n_jobs≈CPU cores`   |
+ | Moderate outliers              | `sigmaCut=5–10`, enable parallelization                  |
+ | Heavy outliers (detector data) | Use `fitter='robust'` or `huber` and accept higher cost  |
+ | Quick validation               | `bench_groupby_regression.py --quick`                    |
+
+ ### Future Work
+
+ A future extension will introduce **leverage‑outlier** generation (outliers in X and Y) to replicate the observed 25× slowdown and allow comparative testing of different robust fitters.
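As a preview of what such a generator could look like, the sketch below contaminates a fraction of rows in both the predictors and the response, so the outliers gain leverage over the fit. The function name, fraction, and shifts are assumptions, not an existing API.

```python
# Hypothetical leverage-outlier generator (assumed names; not an existing API).
import numpy as np

def add_leverage_outliers(x1, x2, y, frac=0.05, shift=10.0, seed=None):
    """Displace a random fraction of rows in X *and* Y so they pull the fit."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    x1, x2, y = x1.copy(), x2.copy(), y.copy()
    x1[idx] += shift            # move the predictors far from the bulk (leverage)
    x2[idx] -= shift
    y[idx] += 5.0 * shift       # and push the response off the true plane
    return x1, x2, y
```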
+
+ ## Tips
+
+ 💡 Use `cast_dtype='float16'` for storage savings, but ensure it is compatible with downstream numerical precision requirements.
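Before adopting `float16`, a quick look at the cast error on a representative fitted value shows whether roughly three decimal digits of precision are enough. A minimal sketch:

```python
# Minimal precision check for float16 casting of a fitted parameter.
import numpy as np

slope = np.float64(2.0031847)                 # example value; use your own fit results
print(np.float16(slope))                      # ~2.004 (float16 keeps ~3 decimal digits)
print(abs(float(np.float16(slope)) - slope))  # absolute error introduced by the cast
```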
+
+ ### Usage Example for `cast_dtype`

```python
import pandas as pd
@@ -146,24 +192,24 @@ from dfextensions.groupby_regression import GroupByRegressor

# Sample DataFrame
df = pd.DataFrame({
-    'group': ['A'] * 10 + ['B'] * 10,
-    'x': np.linspace(0, 1, 20),
-    'y': np.linspace(0, 2, 20) + np.random.normal(0, 0.1, 20),
-    'weight': 1.0,
+    'group': ['A'] * 10 + ['B'] * 10,
+    'x': np.linspace(0, 1, 20),
+    'y': np.linspace(0, 2, 20) + np.random.normal(0, 0.1, 20),
+    'weight': 1.0,
})

# Linear fit with casting to float32
df_out, dfGB = GroupByRegressor.make_parallel_fit(
-    df,
-    gb_columns=['group'],
-    fit_columns=['y'],
-    linear_columns=['x'],
-    median_columns=['x'],
-    weights='weight',
-    suffix='_f32',
-    selection=df['x'].notna(),
-    cast_dtype='float32',
-    addPrediction=True
+    df,
+    gb_columns=['group'],
+    fit_columns=['y'],
+    linear_columns=['x'],
+    median_columns=['x'],
+    weights='weight',
+    suffix='_f32',
+    selection=df['x'].notna(),
+    cast_dtype='float32',
+    addPrediction=True
)

# Check resulting data types
@@ -184,8 +230,6 @@ bin_count_f32 int64
dtype: object
```

-
-
## Recent Changes

* ✅ Unified `min_stat` interface for both OLS and robust fits
