
Commit 57b3293

Author: miranov25
docs(groupby_regression): add Performance & Benchmarking section + fix Markdown tables
- Added new "Performance & Benchmarking" section describing benchmark usage, results, and interpretation
- Included CLion-compatible Markdown tables for output columns, example results, and recommendations
- Documented benchmark command line and sample outputs (50k rows / 10k groups)
- Clarified how sigmaCut and parallelization affect runtime
- Minor formatting and readability improvements across the file
1 parent cd63f42 commit 57b3293

File tree

1 file changed (+80, -36 lines)


UTILS/dfextensions/groupby_regression.md

Lines changed: 80 additions & 36 deletions
@@ -27,8 +27,8 @@ Performs group-wise **ordinary least squares (OLS)** regression fits.

* `(df_out, dfGB)`:

- * `df_out`: Original DataFrame with predictions (if enabled)
- * `dfGB`: Per-group statistics, including slopes, intercepts, medians, and bin counts
+ * `df_out`: Original DataFrame with predictions (if enabled)
+ * `dfGB`: Per-group statistics, including slopes, intercepts, medians, and bin counts

---

@@ -55,18 +55,18 @@ Performs **robust group-wise regression** using `HuberRegressor`, with optional
from groupby_regression import GroupByRegressor

df_out, dfGB = GroupByRegressor.make_parallel_fit(
-    df,
-    gb_columns=['detector_sector'],
-    fit_columns=['dEdx'],
-    linear_columns=['path_length', 'momentum'],
-    median_columns=['path_length'],
-    weights='w_dedx',
-    suffix='_calib',
-    selection=(df['track_quality'] > 0.9),
-    cast_dtype='float32',
-    addPrediction=True,
-    min_stat=[20, 20],
-    n_jobs=4
+    df,
+    gb_columns=['detector_sector'],
+    fit_columns=['dEdx'],
+    linear_columns=['path_length', 'momentum'],
+    median_columns=['path_length'],
+    weights='w_dedx',
+    suffix='_calib',
+    selection=(df['track_quality'] > 0.9),
+    cast_dtype='float32',
+    addPrediction=True,
+    min_stat=[20, 20],
+    n_jobs=4
)
```

@@ -128,16 +128,62 @@ df_out, dfGB = GroupByRegressor.make_parallel_fit(
* Exact recovery of known coefficients
* `cast_dtype` precision testing

- ## Tips
+ ## Performance & Benchmarking

- 💡 Use `cast_dtype='float16'` for storage savings, but ensure it's compatible with downstream numerical precision requirements.
- **Improvements for groupby\_regression.md**
+ ### Overview

- ---
+ To evaluate scaling and performance trade-offs, a dedicated benchmark tool is provided:

- ### Usage Example for `cast_dtype`
+ ```bash
+ python3 bench_groupby_regression.py \
+   --rows-per-group 5 --groups 10000 \
+   --n-jobs 10 --sigmaCut 5 --fitter ols \
+   --out bench_out --emit-csv
+ ```
+
+ Each run generates:
+
+ * `benchmark_report.txt` – human-readable summary
+ * `benchmark_results.json` / `.csv` – structured outputs for analysis
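As an editor's note on the "structured outputs for analysis" point above: the CSV can be post-processed directly with pandas. The sketch below is illustrative only; the column names (`scenario`, `wall_time_s`, `n_groups`) are assumptions, not the benchmark's documented schema.

```python
# Illustrative only: load the benchmark CSV and normalise timings.
# Column names ('scenario', 'wall_time_s', 'n_groups') are assumed;
# inspect benchmark_results.csv for the actual schema first.
import pandas as pd

res = pd.read_csv("bench_out/benchmark_results.csv")
print(res.columns.tolist())  # discover the real columns before relying on them

expected = {"scenario", "wall_time_s", "n_groups"}
if expected.issubset(res.columns):
    # normalise to the "seconds per 1k groups" figure used in the tables below
    res["s_per_1k_groups"] = res["wall_time_s"] / res["n_groups"] * 1000
    print(res[["scenario", "s_per_1k_groups"]].to_string(index=False))
```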
+
+ ### Example Results (50k rows / 10k groups ≈ 5 rows per group)
+
+ | Scenario                   | Config                  | Result                 | Notes           |
+ | -------------------------- | ----------------------- | ---------------------- | --------------- |
+ | Clean Data (Serial)        | `n_jobs=1, σCut=5, OLS` | **1.75 s / 1k groups** | Baseline        |
+ | Clean Data (Parallel 10)   | `n_jobs=10`             | **0.41 s / 1k groups** | ≈ 4.3× faster   |
+ | 10% Outliers (5σ, Serial)  | `n_jobs=1`              | **1.77 s / 1k groups** | ≈ same as clean |
+ | 5% Outliers (3σ, Serial)   | `n_jobs=1`              | **1.70 s / 1k groups** | Mild noise      |
+ | 10% Outliers (10σ, Serial) | `n_jobs=1`              | **1.81 s / 1k groups** | Still stable    |
+
+ *Hardware:* 12‑core Intel i7, Python 3.11, pandas 2.2, joblib 1.4
+ *Dataset:* synthetic (y = 2·x₁ + 3·x₂ + ε)
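For orientation, a dataset of this shape (10 000 groups × 5 rows, y = 2·x₁ + 3·x₂ + ε) can be reproduced in a few lines. This is a minimal sketch of the setup, not the benchmark's own generator; the column names and noise level are assumptions.

```python
# Sketch of a comparable synthetic dataset (names and noise level assumed).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_groups, rows_per_group = 10_000, 5
n = n_groups * rows_per_group

df_bench = pd.DataFrame({
    "group": np.repeat(np.arange(n_groups), rows_per_group),
    "x1": rng.uniform(0.0, 1.0, n),
    "x2": rng.uniform(0.0, 1.0, n),
})
# y = 2*x1 + 3*x2 + Gaussian noise, matching the stated model
df_bench["y"] = 2.0 * df_bench["x1"] + 3.0 * df_bench["x2"] + rng.normal(0.0, 0.1, n)
```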
+
+ ### Interpretation
+
+ * The **OLS path** scales linearly with group count.
+ * **Parallelization** provides 4–5× acceleration for thousands of small groups.
+ * Current synthetic *y‑only* outliers do **not** trigger re‑fitting overhead.
+ * Real‑data slowdowns (up to 25×) occur when **sigmaCut** forces iterative robust refits.
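To illustrate why a tight `sigmaCut` can dominate runtime, the sketch below shows the generic sigma-clipping loop such a refit implies: fit, reject residuals beyond `sigmaCut`·σ, then refit on the survivors. It is a simplified stand-in, not the library's internal implementation.

```python
# Generic sigma-clipping refit loop (illustrative; not GroupByRegressor internals).
import numpy as np

def fit_with_sigma_clip(x, y, sigma_cut=5.0, max_iter=3):
    """OLS line fit that iteratively rejects residuals beyond sigma_cut * sigma."""
    mask = np.ones_like(y, dtype=bool)
    for _ in range(max_iter):
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        resid = y - (slope * x + intercept)
        new_mask = np.abs(resid) < sigma_cut * resid[mask].std()
        if np.array_equal(new_mask, mask) or new_mask.sum() < 2:
            break                      # converged, or too few points to refit
        mask = new_mask                # every rejection pass costs another fit
    return slope, intercept, mask
```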

- In the `make_parallel_fit` and `make_linear_fit` functions, the `cast_dtype` parameter ensures consistent numeric precision for slope, intercept, and error terms. This is useful for long pipelines or for memory-sensitive applications.
+ ### Recommendations
+
+ | Use case                       | Suggested settings                                       |
+ | ------------------------------ | -------------------------------------------------------- |
+ | Clean data                     | `sigmaCut=100` (disable refit), use `n_jobs≈CPU cores`   |
+ | Moderate outliers              | `sigmaCut=5–10`, enable parallelization                  |
+ | Heavy outliers (detector data) | Use `fitter='robust'` or `huber` and accept higher cost  |
+ | Quick validation               | `bench_groupby_regression.py --quick`                    |
+
+ ### Future Work
+
+ A future extension will introduce **leverage‑outlier** generation (outliers in X and Y) to replicate the observed 25× slowdown and allow comparative testing of different robust fitters.
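As a preview of what such a generator could look like, the sketch below contaminates a fraction of rows in both the predictors and the response, so the outliers gain leverage over the fit. The function name, fraction, and shifts are assumptions, not an existing API.

```python
# Hypothetical leverage-outlier generator (assumed names; not an existing API).
import numpy as np

def add_leverage_outliers(x1, x2, y, frac=0.05, shift=10.0, seed=None):
    """Displace a random fraction of rows in X *and* Y so they pull the fit."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    x1, x2, y = x1.copy(), x2.copy(), y.copy()
    x1[idx] += shift            # move the predictors far from the bulk (leverage)
    x2[idx] -= shift
    y[idx] += 5.0 * shift       # and push the response off the true plane
    return x1, x2, y
```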
+
+ ## Tips
+
+ 💡 Use `cast_dtype='float16'` for storage savings, but ensure it is compatible with downstream numerical precision requirements.
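Before adopting `float16`, a quick look at the cast error on a representative fitted value shows whether roughly three decimal digits of precision are enough. A minimal sketch:

```python
# Minimal precision check for float16 casting of a fitted parameter.
import numpy as np

slope = np.float64(2.0031847)                 # example value; use your own fit results
print(np.float16(slope))                      # ~2.004 (float16 keeps ~3 decimal digits)
print(abs(float(np.float16(slope)) - slope))  # absolute error introduced by the cast
```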
+
+ ### Usage Example for `cast_dtype`

```python
import pandas as pd
@@ -146,24 +192,24 @@ from dfextensions.groupby_regression import GroupByRegressor

# Sample DataFrame
df = pd.DataFrame({
-    'group': ['A'] * 10 + ['B'] * 10,
-    'x': np.linspace(0, 1, 20),
-    'y': np.linspace(0, 2, 20) + np.random.normal(0, 0.1, 20),
-    'weight': 1.0,
+    'group': ['A'] * 10 + ['B'] * 10,
+    'x': np.linspace(0, 1, 20),
+    'y': np.linspace(0, 2, 20) + np.random.normal(0, 0.1, 20),
+    'weight': 1.0,
})

# Linear fit with casting to float32
df_out, dfGB = GroupByRegressor.make_parallel_fit(
-    df,
-    gb_columns=['group'],
-    fit_columns=['y'],
-    linear_columns=['x'],
-    median_columns=['x'],
-    weights='weight',
-    suffix='_f32',
-    selection=df['x'].notna(),
-    cast_dtype='float32',
-    addPrediction=True
+    df,
+    gb_columns=['group'],
+    fit_columns=['y'],
+    linear_columns=['x'],
+    median_columns=['x'],
+    weights='weight',
+    suffix='_f32',
+    selection=df['x'].notna(),
+    cast_dtype='float32',
+    addPrediction=True
)

# Check resulting data types
@@ -184,8 +230,6 @@ bin_count_f32 int64
dtype: object
```

-
-
## Recent Changes

* ✅ Unified `min_stat` interface for both OLS and robust fits
