Commit bb51bc0

author
miranov25
committed
docs(restartContext): update with 5k/5 default, 30% outliers, and leverage-outlier plan
- Record new cross-platform results (Mac vs Linux) and observation that response-only outliers do not slow runtime
- Add action plan: leverage-outlier generator + refit counters + multi-target cost check
- Keep PR target; align benchmarks and docs with 5k/5 default
1 parent 7d215d3 commit bb51bc0

1 file changed

+133
-0
lines changed

@@ -0,0 +1,133 @@
1+
Short answer: **yes — a small update to `restartContext_groupby_regression.md` now will help** keep everyone aligned. We learned that response-only outliers don’t reproduce the slowdown; we set a new 5k/5 default; and we added a 30% scenario. That’s enough to warrant a “delta” refresh so Claude/others don’t chase the wrong thing.
2+
3+
Here’s a **drop-in replacement** you can paste over the current file (keeps the same structure, updated content). If you prefer to append, you can paste the “What changed” + “Action plan” sections at the top.
4+
5+
---
6+
7+
# Restart Context: GroupBy Regression Benchmarking & Documentation
8+
9+
**Date:** October 22, 2025
10+
**Project:** dfextensions (ALICE O2 Physics)
11+
**Focus:** `groupby_regression.py` — benchmarking & performance docs
12+
**Target:** PR by Oct 25, 2025
13+
**Collaborators:** GPT (Primary Coder), Claude (Reviewer), User (Approval)
14+
15+
---
16+
17+
## Executive Summary (updated)
18+
19+
Benchmarks on synthetic data show that **response-only outliers (shift in y)** do **not** slow down the OLS/robust path; runtime remains essentially unchanged even at **30% contamination**. Both Mac and Linux show similar **scaling** (Linux ≈2–2.5× slower wall time per 1k groups due to platform factors).
20+
The **real-data 25× slowdown** is therefore likely due to **sigmaCut-triggered robust re-fits driven by leverage outliers in X** and/or multi-target fits (e.g., `dX,dY,dZ`) compounding the cost.
21+
22+
**New default benchmark:** **5,000 groups × 5 rows/group** (fast, representative).
23+
**New scenarios:** include **30% outliers (5σ)** to demonstrate stability of response-only contamination.
24+
25+
---
26+
27+
## What changed since last update
28+
29+
* **Benchmark defaults:** `--rows-per-group 5 --groups 5000` adopted for docs & CI-friendly runs.
30+
* **Scenarios:** Added **30% outliers (5σ)** in serial + parallel.
31+
* **Findings:**

  * Mac (per 1k groups): serial ~**1.69 s**, parallel(10) ~**0.50 s**.
  * Linux (per 1k groups): serial ~**4.14 s**, parallel(10) ~**0.98 s**.
  * 5–30% response outliers: **no runtime increase** vs clean.
* **Conclusion:** The synthetic setup does not trigger the **re-fit loop**; real data likely has **leverage** characteristics or takes a different fitter path.
37+
38+
---
39+
40+
## Problem Statement (refined)
41+
42+
Observed **~25× slowdown** on real datasets when using `sigmaCut` robust filtering. Root cause is presumed **iterative re-fitting per group** when the mask updates (MAD-based) repeatedly exclude many points — common under **leverage outliers in X** or mixed contamination (X & y). Multi-target fitting (e.g., 3 targets) likely multiplies cost.
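To make the presumed mechanism concrete, here is a minimal sketch of a MAD-based sigma-cut refit loop for a single group. It is an illustration under stated assumptions, not the code in `groupby_regression.py`: the names `fit_line` and `sigma_cut_fit`, the straight-line model, and the stopping rule are all hypothetical. The point is only that every mask change costs one more fit, which is what the planned `n_refits` counter would expose.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def sigma_cut_fit(xs, ys, sigma_cut=5.0, max_refits=10):
    """Refit until the MAD-based inlier mask stops changing.

    Hypothetical helper (not the dfextensions implementation).
    Returns (intercept, slope, n_refits).
    """
    mask = [True] * len(xs)
    a, b = fit_line(xs, ys)
    n_refits = 0
    while n_refits < max_refits:
        resid = [y - (a + b * x) for x, y in zip(xs, ys)]
        kept = [r for r, m in zip(resid, mask) if m]
        med = statistics.median(kept)
        mad = statistics.median(abs(r - med) for r in kept)
        scale = 1.4826 * mad  # MAD scaled to a Gaussian-sigma equivalent
        new_mask = [abs(r - med) <= sigma_cut * scale for r in resid]
        # Stop when the mask is stable or too few points would remain.
        if new_mask == mask or sum(new_mask) < 3:
            break
        mask = new_mask
        a, b = fit_line([x for x, m in zip(xs, mask) if m],
                        [y for y, m in zip(ys, mask) if m])
        n_refits += 1
    return a, b, n_refits

# Deterministic toy group: y = 1 + 2x plus small bounded noise.
xs = [float(i) for i in range(20)]
clean = [1.0 + 2.0 * x + ((7 * i) % 20 - 9.5) / 50 for i, x in enumerate(xs)]
dirty = list(clean)
dirty[10] += 50.0  # one large response outlier

print("clean refits:", sigma_cut_fit(xs, clean)[2])
print("dirty refits:", sigma_cut_fit(xs, dirty)[2])
```

In this toy case the clean group needs no refit and the single response outlier costs exactly one extra fit. Whether leverage outliers in X drive the count higher on real data is the open question the instrumentation is meant to answer.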
43+
44+
---
45+
46+
## Cross-Platform Note
47+
48+
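Linux runs are **~2–2.5×** slower in absolute time than Mac, but **parallel speed-ups are of the same order** on both: from the per-1k-group timings above, ~3.4× on Mac (1.69 s → 0.50 s) and ~4.2× on Linux (4.14 s → 0.98 s) with 10 jobs. The differences are attributed to CPU/BLAS/spawn model (Apptainer), not algorithmic changes.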
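The ratios implied by the per-1k-group timings reported earlier can be checked with a few lines of arithmetic. The snippet below uses only the four numbers quoted in this document; the dictionary layout and function names are illustrative.

```python
# Per-1k-group wall times in seconds, as reported in this document.
timings = {
    "Mac": {"serial": 1.69, "parallel10": 0.50},
    "Linux": {"serial": 4.14, "parallel10": 0.98},
}

def speedup(platform: str) -> float:
    """Serial time divided by parallel (10 jobs) time for one platform."""
    t = timings[platform]
    return t["serial"] / t["parallel10"]

def platform_ratio(mode: str) -> float:
    """Linux time divided by Mac time for the given mode."""
    return timings["Linux"][mode] / timings["Mac"][mode]

for name in timings:
    print(f"{name}: parallel speed-up {speedup(name):.2f}x")
for mode in ("serial", "parallel10"):
    print(f"Linux/Mac ({mode}): {platform_ratio(mode):.2f}x")
```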
49+
50+
---
51+
52+
## Action Plan (next 48h)
53+
54+
1. **Add leverage-outlier generator** to benchmarks

   * API: `create_data_with_outliers(..., mode="response|leverage|both", x_mag=8.0)`
   * Goal: Reproduce sigmaCut re-fit slow path (target 10–25×).
2. **Instrument the fitter**

   * Add counters in `process_group_robust()`:

     * `n_refits`, `mask_fraction`, and per-group timings.
   * Emit aggregated stats in `dfGB` (or a side JSON) for diagnostics.
3. **Multi-target cost check**

   * Run with `fit_columns=['dX']`, then `['dX','dY','dZ']` to quantify multiplicative cost.
4. **Config toggles for mitigation** (document in perf section)

   * `sigmaCut=100` (disable re-fit) as a “fast path” when upstream filtering is trusted.
   * Optional `max_refits` (cap iterations), log a warning when hit.
   * Consider `fitter='huber'` fast-path if available.
5. **Finalize docs**

   * Keep 5k/5 as **doc default**; show Mac+Linux tables.
   * Add a **“Stress Test (Leverage)”** table once generator is merged.
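As a starting point for item 1, a leverage-outlier generator could look roughly like the sketch below. Only the `mode` values and `x_mag=8.0` come from the plan above; the function name `make_outlier_rows`, the remaining parameters, the linear truth `y = 1 + 2x`, and the row layout are assumptions made for illustration, and the real `create_data_with_outliers` signature is not shown in this document.

```python
import random

def make_outlier_rows(n_groups, rows_per_group, outlier_frac=0.3,
                      mode="response", y_mag=5.0, x_mag=8.0, seed=0):
    """Return rows (group, x, y, is_outlier) with y = 1 + 2*x + N(0, 1).

    Hypothetical sketch. mode="response" shifts y by y_mag sigma,
    mode="leverage" shifts x by x_mag sigma after y is computed,
    mode="both" applies both shifts.
    """
    if mode not in ("response", "leverage", "both"):
        raise ValueError(f"unknown mode: {mode!r}")
    rng = random.Random(seed)
    rows = []
    for g in range(n_groups):
        for _ in range(rows_per_group):
            x = rng.gauss(0.0, 1.0)
            y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)
            is_outlier = rng.random() < outlier_frac
            if is_outlier:
                if mode in ("leverage", "both"):
                    # Move the point far out in X while y stays put,
                    # so it sits off the true line with high leverage.
                    x += x_mag
                if mode in ("response", "both"):
                    y += y_mag
            rows.append((g, x, y, is_outlier))
    return rows

rows = make_outlier_rows(5000, 5, mode="leverage")
print(len(rows), sum(r[3] for r in rows))
```

Because the random draws do not depend on `mode`, the same seed yields the same base data in every mode, which makes a response-versus-leverage runtime comparison apples to apples.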
76+
77+
---
78+
79+
## Deliverables Checklist
80+
81+
* [x] Single-file benchmark with 5k/5 default & 30% outlier scenarios
82+
* [x] Performance section in `groupby_regression.md` (Mac/Linux tables)
83+
* [ ] **Leverage-outlier generator** (+ scenarios)
84+
* [ ] Fitter instrumentation (refit counters, timings)
85+
* [ ] Performance tests (CI thresholds for clean vs stress)
86+
* [ ] `BENCHMARKS.md` with full runs & environment capture
87+
88+
---
89+
90+
## Current Commands
91+
92+
**Default quick run (docs/CI):**
93+
94+
```bash
python3 bench_groupby_regression.py \
  --rows-per-group 5 --groups 5000 \
  --n-jobs 10 --sigmaCut 5 --fitter ols \
  --out bench_out --emit-csv
```
100+
101+
**Stress test placeholder (to be added):**
102+
103+
```bash
# Placeholder: --mode and --x-mag are planned flags, to be added with the leverage-outlier generator.
python3 bench_groupby_regression.py \
  --rows-per-group 5 --groups 5000 \
  --n-jobs 10 --sigmaCut 5 --fitter ols \
  --mode leverage --x-mag 8.0 \
  --out bench_out_stress --emit-csv
```
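For the planned flags, one possible wiring on the Python side is sketched below with `argparse`. The flag names are taken from the two commands above; the types, defaults, and `choices` are assumptions, since the argument handling of `bench_groupby_regression.py` is not shown in this document.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags used in the commands above (assumed types)."""
    p = argparse.ArgumentParser(prog="bench_groupby_regression.py")
    p.add_argument("--rows-per-group", type=int, default=5)
    p.add_argument("--groups", type=int, default=5000)
    p.add_argument("--n-jobs", type=int, default=1)
    p.add_argument("--sigmaCut", type=float, default=5.0)
    p.add_argument("--fitter", default="ols")
    p.add_argument("--out", default="bench_out")
    p.add_argument("--emit-csv", action="store_true")
    # Planned stress-test flags; choices and defaults are assumptions.
    p.add_argument("--mode", choices=("response", "leverage", "both"),
                   default="response")
    p.add_argument("--x-mag", type=float, default=8.0)
    return p

args = build_parser().parse_args(
    "--rows-per-group 5 --groups 5000 --n-jobs 10 --sigmaCut 5 "
    "--fitter ols --mode leverage --x-mag 8.0 "
    "--out bench_out_stress --emit-csv".split()
)
print(args.mode, args.x_mag, args.emit_csv)
```

Defaulting `--mode` to `response` would keep the existing quick run unchanged while making the stress test an explicit opt-in.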
110+
111+
---
112+
113+
## Risks & Open Questions
114+
115+
* What outlier **structure** in real data triggers the re-fit? (X leverage? heteroscedasticity? group size variance?)
116+
* Is the slowdown proportional to **targets × refits × groups**?
117+
* Do container spawn/backends (forkserver/spawn) amplify overhead for very small groups?
118+
119+
---
120+
121+
**Last updated:** Oct 22, 2025 (this revision)
122+
123+
---
124+
125+
### Commit message
126+
127+
```
docs(restartContext): update with 5k/5 default, 30% outliers, and leverage-outlier plan

- Record new cross-platform results (Mac vs Linux) and observation that response-only outliers do not slow runtime
- Add action plan: leverage-outlier generator + refit counters + multi-target cost check
- Keep PR target; align benchmarks and docs with 5k/5 default
```
