# PERF: Default OpenBLAS thread count causes up to 13x regression for decomposition-heavy operations #13766
## Summary
Default OpenBLAS thread settings cause severe performance regressions for decomposition-heavy MNE operations on multi-core Linux machines. Shrunk covariance estimation is 13.3x slower, Maxwell filtering (tSSS) is 6.5x slower, and Maxwell SSS is 9.2x slower at the default thread count compared to the optimal setting.
This affects any Linux user with OpenBLAS (the default NumPy BLAS on most systems) and ≥8 cores.
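To check whether your system is in the affected configuration, `threadpoolctl` (already an MNE dependency) can report which BLAS NumPy is linked against and its default thread count. A minimal sketch; the affected case is an `internal_api` of `openblas` with `num_threads` equal to the full core count:

```python
# Report the BLAS library NumPy loaded and its default thread count.
import numpy  # noqa: F401  (importing NumPy loads its BLAS library)
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    if pool.get("user_api") == "blas":
        print(pool["internal_api"], pool.get("version"),
              "num_threads:", pool["num_threads"])
```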
## Benchmarks
System: AMD EPYC 7R13, 16 vCPU, OpenBLAS 0.3.31, 60s of sample data.
| Operation | BLAS=1 | BLAS=2 | BLAS=4 | BLAS=8 | BLAS=16 (default) | Best threads | Regression (default vs. best) |
|---|---|---|---|---|---|---|---|
| Cov (shrunk) | 3.31s | 2.67s | 2.38s | 2.97s | 31.58s | 4 | 13.3x |
| Maxwell SSS | 1.02s | 0.83s | 0.75s | 0.72s | 6.59s | 8 | 9.2x |
| Maxwell tSSS | 4.42s | 2.90s | 2.33s | 2.40s | 15.14s | 4 | 6.5x |
| apply_inverse_epochs | 13.41s | 7.99s | 5.23s | 3.93s | 4.16s | 8 | 3.4x |
| Covariance (emp) | 0.58s | 0.46s | 0.41s | 0.43s | 0.96s | 4 | 2.3x |
| make_inverse_op | 1.59s | 1.20s | 1.01s | 0.90s | 1.11s | 8 | 1.8x |
| ICA (20 comp) | 1.60s | 1.18s | 1.01s | 0.92s | 1.14s | 8 | 1.7x |
| ICA (50 comp) | 7.95s | 6.80s | 6.50s | 5.92s | 8.23s | 8 | 1.4x |
| Bandpass filter | 0.29s | 0.27s | 0.29s | 0.28s | 0.29s | — | 1.0x |
| TFR morlet | 6.06s | 6.35s | 6.19s | 6.26s | 6.22s | — | 1.0x |
| PSD Welch | 0.22s | 0.21s | 0.22s | 0.21s | 0.21s | — | 1.0x |
| Source morph | 0.67s | 0.69s | 0.67s | 0.66s | 0.67s | — | 1.0x |
Operations fall into three categories:
- Catastrophic at default threads: Maxwell filter, shrunk covariance — dominated by SVD/QR/eigh on tall-skinny or small matrices where thread synchronization overhead exceeds compute benefit
- Moderate scaling (peak at 4-8 threads): ICA, inverse operator, empirical covariance
- BLAS-insensitive: bandpass/notch filter, TFR, PSD, source morph — these are FFT-dominated or sparse and completely unaffected
## Root cause
Per-operation benchmarks confirm the issue is specific to decomposition routines on medium-sized matrices:
| Operation | BLAS=1 | BLAS=4 | BLAS=8 | BLAS=16 |
|---|---|---|---|---|
| SVD (10000×306) | 230ms | 120ms | 99ms | 111ms ↑ |
| QR+pivot (10000×306) | 200ms | 86ms | 71ms | 76ms ↑ |
| matmul (306×306)@(306×10000) | 39ms | 13ms | 8ms | 7ms |
| matmul (5000×5000) | 4848ms | 1229ms | 645ms | 631ms |
SVD and QR on tall-skinny matrices (the kind used in Maxwell tSSS, shrunk covariance, and ICA) peak at 8 threads and regress at 16 due to OpenBLAS thread-synchronization overhead. Large square matmul keeps improving up to 8 threads and holds steady at 16, but the decomposition-heavy operations never benefit from that many threads.
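The tall-skinny SVD row above can be reproduced in a single process by varying the BLAS thread cap with `threadpoolctl`. A minimal sketch; note that `threadpool_limits` can only lower the cap below the library's current setting, not raise it, so it should be run at the default thread count:

```python
# Time a tall-skinny SVD under different BLAS thread caps.
import time

import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 306))  # tall-skinny, as in tSSS / shrunk cov

for n_threads in (1, 4, 8, 16):
    with threadpool_limits(limits=n_threads, user_api="blas"):
        t0 = time.perf_counter()
        np.linalg.svd(A, full_matrices=False)
        print(f"SVD (10000x306), {n_threads:2d} threads: "
              f"{time.perf_counter() - t0:.3f}s")
```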
## Immediate user workaround

```sh
export OPENBLAS_NUM_THREADS=4  # add to .bashrc
```

This single environment variable provides larger speedups than any code-level optimization for decomposition-heavy workflows.
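When exporting an environment variable is inconvenient (e.g. in a notebook where NumPy is already imported and OpenBLAS has already read its settings), the same cap can be applied programmatically. A minimal sketch using `threadpoolctl`, which MNE already depends on:

```python
# Programmatic equivalent of OPENBLAS_NUM_THREADS=4, scoped to a block.
# Unlike the environment variable, this works after NumPy is imported.
import numpy as np
from threadpoolctl import threadpool_limits

A = np.random.default_rng(0).standard_normal((2000, 306))

with threadpool_limits(limits=4, user_api="blas"):
    # Decomposition-heavy calls go here, e.g. mne.compute_covariance(...)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
```

Outside the `with` block the previous thread count is restored automatically.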
## Possible fixes

1. **Documentation**: add a note to the installation/performance docs recommending `OPENBLAS_NUM_THREADS=4` for multi-core Linux systems.
2. **Runtime detection**: use `threadpoolctl` (already an MNE dependency) to detect and warn when OpenBLAS is running with too many threads.
3. **Auto-tuning**: wrap decomposition-heavy functions with `threadpool_limits(limits=N)` to cap BLAS threads, as proposed in #10522 (Improve control over number of threads used in an mne call) but never implemented.

Option 3 is the most robust but also the most invasive. Option 1 is low-effort and immediately helpful.
## Reproduction

Benchmark script (requires MNE sample data):
"""Run with: OPENBLAS_NUM_THREADS=N python bench.py"""
import time, warnings, numpy as np
warnings.filterwarnings('ignore')
import mne
from mne.preprocessing import maxwell_filter
sample_path = mne.datasets.sample.data_path()
raw = mne.io.read_raw_fif(
sample_path / 'MEG/sample/sample_audvis_raw.fif', verbose=False)
raw.crop(tmax=60.).load_data(verbose=False)
# Maxwell tSSS
raw.fix_mag_coil_types()
t0 = time.perf_counter()
maxwell_filter(raw.copy(),
cross_talk=sample_path / 'SSS/ct_sparse_mgh.fif',
calibration=sample_path / 'SSS/sss_cal_mgh.dat',
st_duration=10., verbose=False)
print(f'Maxwell tSSS: {time.perf_counter()-t0:.3f}s')
# Shrunk covariance
raw2 = raw.copy().set_eeg_reference(projection=True, verbose=False)
events = mne.find_events(raw2, verbose=False)
epochs = mne.Epochs(raw2, events, tmin=-0.2, tmax=0.5,
baseline=(None, 0), verbose=False, preload=True)
t0 = time.perf_counter()
mne.compute_covariance(epochs, tmax=0., method='shrunk', verbose=False)
print(f'Cov (shrunk): {time.perf_counter()-t0:.3f}s')Compare: OPENBLAS_NUM_THREADS=4 python bench.py vs OPENBLAS_NUM_THREADS=16 python bench.py
## Related

- #10522 (Improve control over number of threads used in an mne call): proposed using `threadpoolctl` to control BLAS threads via `n_jobs` (closed, never implemented)