# PERF: Default OpenBLAS thread count causes up to 13x regression for decomposition-heavy operations #13766
## Summary
Default OpenBLAS thread settings cause severe performance regressions for decomposition-heavy MNE operations on multi-core Linux machines. Shrunk covariance estimation is 13.3x slower, Maxwell filtering (tSSS) is 6.5x slower, and Maxwell SSS is 9.2x slower at the default thread count compared to the optimal setting.
This affects any Linux user with OpenBLAS (the default NumPy BLAS on most systems) and ≥8 cores.
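To check whether your system is in the affected configuration, `threadpoolctl` (already an MNE dependency) can report which BLAS NumPy is linked against and its default thread count. A minimal sketch; the affected case is an `internal_api` of `openblas` with `num_threads` equal to the full core count:

```python
# Report the BLAS library NumPy loaded and its default thread count.
import numpy  # noqa: F401  (importing NumPy loads its BLAS library)
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    if pool.get("user_api") == "blas":
        print(pool["internal_api"], pool.get("version"),
              "num_threads:", pool["num_threads"])
```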
## Benchmarks
System: AMD EPYC 7R13, 16 vCPU, OpenBLAS 0.3.31, 60s of sample data.
| Operation | BLAS=1 | BLAS=2 | BLAS=4 | BLAS=8 | BLAS=16 (default) | Best threads | Regression (default vs. best) |
|---|---|---|---|---|---|---|---|
| Cov (shrunk) | 3.31s | 2.67s | 2.38s | 2.97s | 31.58s | 4 | 13.3x |
| Maxwell SSS | 1.02s | 0.83s | 0.75s | 0.72s | 6.59s | 8 | 9.2x |
| Maxwell tSSS | 4.42s | 2.90s | 2.33s | 2.40s | 15.14s | 4 | 6.5x |
| apply_inverse_epochs | 13.41s | 7.99s | 5.23s | 3.93s | 4.16s | 8 | 3.4x |
| Covariance (emp) | 0.58s | 0.46s | 0.41s | 0.43s | 0.96s | 4 | 2.3x |
| make_inverse_op | 1.59s | 1.20s | 1.01s | 0.90s | 1.11s | 8 | 1.8x |
| ICA (20 comp) | 1.60s | 1.18s | 1.01s | 0.92s | 1.14s | 8 | 1.7x |
| ICA (50 comp) | 7.95s | 6.80s | 6.50s | 5.92s | 8.23s | 8 | 1.4x |
| Bandpass filter | 0.29s | 0.27s | 0.29s | 0.28s | 0.29s | — | 1.0x |
| TFR morlet | 6.06s | 6.35s | 6.19s | 6.26s | 6.22s | — | 1.0x |
| PSD Welch | 0.22s | 0.21s | 0.22s | 0.21s | 0.21s | — | 1.0x |
| Source morph | 0.67s | 0.69s | 0.67s | 0.66s | 0.67s | — | 1.0x |
Operations fall into three categories:
- Catastrophic at default threads: Maxwell filter, shrunk covariance — dominated by SVD/QR/eigh on tall-skinny or small matrices where thread synchronization overhead exceeds compute benefit
- Moderate scaling (peak at 4-8 threads): ICA, inverse operator, empirical covariance
- BLAS-insensitive: bandpass/notch filter, TFR, PSD, source morph — these are FFT-dominated or sparse and completely unaffected
## Root cause
Per-operation benchmarks confirm the issue is specific to decomposition routines on medium-sized matrices:
| Operation | BLAS=1 | BLAS=4 | BLAS=8 | BLAS=16 |
|---|---|---|---|---|
| SVD (10000×306) | 230ms | 120ms | 99ms | 111ms ↑ |
| QR+pivot (10000×306) | 200ms | 86ms | 71ms | 76ms ↑ |
| matmul (306×306)@(306×10000) | 39ms | 13ms | 8ms | 7ms |
| matmul (5000×5000) | 4848ms | 1229ms | 645ms | 631ms |
SVD and QR on tall-skinny matrices (the kind used in Maxwell tSSS, shrunk covariance, and ICA) peak at 8 threads and regress at 16 due to OpenBLAS thread-synchronization overhead. Large square matmul keeps improving up to 8 threads and holds steady at 16, but the decomposition-heavy operations never benefit from that many threads.
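The tall-skinny SVD row above can be reproduced in a single process by varying the BLAS thread cap with `threadpoolctl`. A minimal sketch; note that `threadpool_limits` can only lower the cap below the library's current setting, not raise it, so it should be run at the default thread count:

```python
# Time a tall-skinny SVD under different BLAS thread caps.
import time

import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 306))  # tall-skinny, as in tSSS / shrunk cov

for n_threads in (1, 4, 8, 16):
    with threadpool_limits(limits=n_threads, user_api="blas"):
        t0 = time.perf_counter()
        np.linalg.svd(A, full_matrices=False)
        print(f"SVD (10000x306), {n_threads:2d} threads: "
              f"{time.perf_counter() - t0:.3f}s")
```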
## Immediate user workaround

```sh
export OPENBLAS_NUM_THREADS=4  # add to .bashrc
```

This single environment variable provides larger speedups than any code-level optimization for decomposition-heavy workflows.
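When exporting an environment variable is inconvenient (e.g. in a notebook where NumPy is already imported and OpenBLAS has already read its settings), the same cap can be applied programmatically. A minimal sketch using `threadpoolctl`, which MNE already depends on:

```python
# Programmatic equivalent of OPENBLAS_NUM_THREADS=4, scoped to a block.
# Unlike the environment variable, this works after NumPy is imported.
import numpy as np
from threadpoolctl import threadpool_limits

A = np.random.default_rng(0).standard_normal((2000, 306))

with threadpool_limits(limits=4, user_api="blas"):
    # Decomposition-heavy calls go here, e.g. mne.compute_covariance(...)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
```

Outside the `with` block the previous thread count is restored automatically.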
## Possible fixes

1. **Documentation**: add a note to the installation/performance docs recommending `OPENBLAS_NUM_THREADS=4` for multi-core Linux systems.
2. **Runtime detection**: use `threadpoolctl` (already an MNE dependency) to detect and warn when OpenBLAS is running with too many threads.
3. **Auto-tuning**: wrap decomposition-heavy functions with `threadpool_limits(limits=N)` to cap BLAS threads, as proposed in #10522 (Improve control over number of threads used in an mne call) but never implemented.

Option 3 is the most robust but also the most invasive. Option 1 is low-effort and immediately helpful.
## Reproduction

Benchmark script (requires MNE sample data):
"""Run with: OPENBLAS_NUM_THREADS=N python bench.py"""
import time, warnings, numpy as np
warnings.filterwarnings('ignore')
import mne
from mne.preprocessing import maxwell_filter
sample_path = mne.datasets.sample.data_path()
raw = mne.io.read_raw_fif(
sample_path / 'MEG/sample/sample_audvis_raw.fif', verbose=False)
raw.crop(tmax=60.).load_data(verbose=False)
# Maxwell tSSS
raw.fix_mag_coil_types()
t0 = time.perf_counter()
maxwell_filter(raw.copy(),
cross_talk=sample_path / 'SSS/ct_sparse_mgh.fif',
calibration=sample_path / 'SSS/sss_cal_mgh.dat',
st_duration=10., verbose=False)
print(f'Maxwell tSSS: {time.perf_counter()-t0:.3f}s')
# Shrunk covariance
raw2 = raw.copy().set_eeg_reference(projection=True, verbose=False)
events = mne.find_events(raw2, verbose=False)
epochs = mne.Epochs(raw2, events, tmin=-0.2, tmax=0.5,
baseline=(None, 0), verbose=False, preload=True)
t0 = time.perf_counter()
mne.compute_covariance(epochs, tmax=0., method='shrunk', verbose=False)
print(f'Cov (shrunk): {time.perf_counter()-t0:.3f}s')Compare: OPENBLAS_NUM_THREADS=4 python bench.py vs OPENBLAS_NUM_THREADS=16 python bench.py
## Related

- #10522 (Improve control over number of threads used in an mne call): proposed using `threadpoolctl` to control BLAS threads via `n_jobs` (closed, never implemented)