
PERF: Default OpenBLAS thread count causes up to 13x regression for decomposition-heavy operations #13766

@sharifhsn

Description


Summary

Default OpenBLAS thread settings cause severe performance regressions for decomposition-heavy MNE operations on multi-core Linux machines. Shrunk covariance estimation is 13.3x slower, Maxwell filtering (tSSS) is 6.5x slower, and Maxwell SSS is 9.2x slower at the default thread count compared to the optimal setting.

This affects any Linux user with OpenBLAS (the default NumPy BLAS on most systems) and ≥8 cores.

Benchmarks

System: AMD EPYC 7R13, 16 vCPU, OpenBLAS 0.3.31, 60s of sample data.

| Operation | BLAS=1 | BLAS=2 | BLAS=4 | BLAS=8 | BLAS=16 (default) | Best | Regression |
|---|---|---|---|---|---|---|---|
| Cov (shrunk) | 3.31s | 2.67s | 2.38s | 2.97s | 31.58s | 4 | 13.3x |
| Maxwell SSS | 1.02s | 0.83s | 0.75s | 0.72s | 6.59s | 8 | 9.2x |
| Maxwell tSSS | 4.42s | 2.90s | 2.33s | 2.40s | 15.14s | 4 | 6.5x |
| apply_inverse_epochs | 13.41s | 7.99s | 5.23s | 3.93s | 4.16s | 8 | 3.4x |
| Covariance (emp) | 0.58s | 0.46s | 0.41s | 0.43s | 0.96s | 4 | 2.3x |
| make_inverse_op | 1.59s | 1.20s | 1.01s | 0.90s | 1.11s | 8 | 1.8x |
| ICA (20 comp) | 1.60s | 1.18s | 1.01s | 0.92s | 1.14s | 8 | 1.7x |
| ICA (50 comp) | 7.95s | 6.80s | 6.50s | 5.92s | 8.23s | 8 | 1.4x |
| Bandpass filter | 0.29s | 0.27s | 0.29s | 0.28s | 0.29s | — | 1.0x |
| TFR morlet | 6.06s | 6.35s | 6.19s | 6.26s | 6.22s | — | 1.0x |
| PSD Welch | 0.22s | 0.21s | 0.22s | 0.21s | 0.21s | — | 1.0x |
| Source morph | 0.67s | 0.69s | 0.67s | 0.66s | 0.67s | — | 1.0x |

Operations fall into three categories:

  1. Catastrophic at default threads: Maxwell filter, shrunk covariance — dominated by SVD/QR/eigh on tall-skinny or small matrices where thread synchronization overhead exceeds compute benefit
  2. Moderate scaling (peak at 4-8 threads): ICA, inverse operator, empirical covariance
  3. BLAS-insensitive: bandpass/notch filter, TFR, PSD, source morph — these are FFT-dominated or sparse and completely unaffected

Root cause

Per-operation benchmarks confirm the issue is specific to decomposition routines on medium-sized matrices:

| Operation | BLAS=1 | BLAS=4 | BLAS=8 | BLAS=16 |
|---|---|---|---|---|
| SVD (10000×306) | 230ms | 120ms | 99ms | 111ms ↑ |
| QR+pivot (10000×306) | 200ms | 86ms | 71ms | 76ms ↑ |
| matmul (306×306)@(306×10000) | 39ms | 13ms | 8ms | 7ms |
| matmul (5000×5000) | 4848ms | 1229ms | 645ms | 631ms |

SVD and QR on tall-skinny matrices (the kind used in Maxwell tSSS, shrunk covariance, and ICA) peak at 4-8 threads and regress at 16 because OpenBLAS thread-synchronization overhead outweighs the extra compute. Large square matmul, by contrast, keeps improving up to 16 threads; the decomposition-heavy operations are not matmul-dominated, so they see no such benefit.
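The per-operation numbers above can be reproduced with a small driver script. Since `OPENBLAS_NUM_THREADS` must be set before NumPy loads its BLAS, each measurement runs in a fresh subprocess; this is an illustrative sketch, not the exact benchmark used for the table.

```python
# Minimal driver for the tall-skinny SVD case from the table. The env var
# must be set before NumPy first loads OpenBLAS, hence one subprocess per
# thread count.
import os
import subprocess
import sys

CHILD = """
import time
import numpy as np
A = np.random.default_rng(0).standard_normal((10000, 306))
t0 = time.perf_counter()
np.linalg.svd(A, full_matrices=False)
print(f"{time.perf_counter() - t0:.3f}")
"""

for n in (1, 4, 8, 16):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(n))
    result = subprocess.run(
        [sys.executable, "-c", CHILD], env=env,
        capture_output=True, text=True, check=True,
    )
    print(f"SVD (10000x306), {n:2d} threads: {result.stdout.strip()}s")
```

Absolute timings will differ by machine, but on an oversubscribed OpenBLAS build the 16-thread run should be visibly slower than 4 or 8.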

Immediate user workaround

```shell
export OPENBLAS_NUM_THREADS=4  # add to ~/.bashrc
```

This single environment variable provides larger speedups than any code-level optimization for decomposition-heavy workflows.
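If NumPy has already been imported (e.g. in a long-running notebook session), the environment variable no longer takes effect; a runtime cap via threadpoolctl is a possible alternative. A minimal sketch, with `limits=4` mirroring the sweet spot from the benchmarks above:

```python
# Runtime alternative to OPENBLAS_NUM_THREADS: cap BLAS threads with
# threadpoolctl (already an MNE dependency). The scoped form limits only
# the calls inside the with-block.
import numpy as np
from threadpoolctl import threadpool_limits

# Stand-in workload: a tall-skinny SVD like the ones in Maxwell tSSS.
X = np.random.default_rng(0).standard_normal((10000, 306))

with threadpool_limits(limits=4, user_api="blas"):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
```

Calling `threadpool_limits(limits=4, user_api="blas")` outside a `with` statement instead caps threads for the rest of the session.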

Possible fixes

  1. Documentation: Add a note to the installation/performance docs recommending OPENBLAS_NUM_THREADS=4 for multi-core Linux systems
  2. Runtime detection: Use threadpoolctl (already an MNE dependency) to detect and warn when OpenBLAS is running with too many threads
  3. Auto-tuning: Wrap decomposition-heavy functions with threadpool_limits(limits=N) to cap BLAS threads — as proposed in Improve control over number of threads used in an mne call #10522 but never implemented

Option 3 is the most robust but also the most invasive. Option 1 is low-effort and immediately helpful.
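Option 2 could look roughly like the sketch below. `threadpool_info()` and its `internal_api`/`num_threads` keys are real threadpoolctl API; the function name, the `pools` parameter, and the threshold of 8 are illustrative assumptions, not an MNE interface.

```python
# Hypothetical sketch of option 2: warn at import time when OpenBLAS is
# configured with more threads than the decomposition routines can use.
# The threshold of 8 is an assumption chosen from the benchmarks above.
import warnings

from threadpoolctl import threadpool_info


def warn_if_openblas_oversubscribed(pools=None, max_reasonable=8):
    """Emit a RuntimeWarning if OpenBLAS exceeds max_reasonable threads."""
    if pools is None:
        pools = threadpool_info()  # list of dicts, one per thread pool
    for pool in pools:
        if pool.get("internal_api") == "openblas" and pool["num_threads"] > max_reasonable:
            warnings.warn(
                f"OpenBLAS is using {pool['num_threads']} threads; "
                "decomposition-heavy operations (Maxwell filter, shrunk "
                "covariance) may be much slower. Consider setting "
                "OPENBLAS_NUM_THREADS=4.",
                RuntimeWarning,
            )


warn_if_openblas_oversubscribed()
```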

Reproduction

Benchmark script (requires MNE sample data)
"""Run with: OPENBLAS_NUM_THREADS=N python bench.py"""
import time, warnings, numpy as np
warnings.filterwarnings('ignore')
import mne
from mne.preprocessing import maxwell_filter

sample_path = mne.datasets.sample.data_path()
raw = mne.io.read_raw_fif(
    sample_path / 'MEG/sample/sample_audvis_raw.fif', verbose=False)
raw.crop(tmax=60.).load_data(verbose=False)

# Maxwell tSSS
raw.fix_mag_coil_types()
t0 = time.perf_counter()
maxwell_filter(raw.copy(),
    cross_talk=sample_path / 'SSS/ct_sparse_mgh.fif',
    calibration=sample_path / 'SSS/sss_cal_mgh.dat',
    st_duration=10., verbose=False)
print(f'Maxwell tSSS: {time.perf_counter()-t0:.3f}s')

# Shrunk covariance
raw2 = raw.copy().set_eeg_reference(projection=True, verbose=False)
events = mne.find_events(raw2, verbose=False)
epochs = mne.Epochs(raw2, events, tmin=-0.2, tmax=0.5,
                    baseline=(None, 0), verbose=False, preload=True)
t0 = time.perf_counter()
mne.compute_covariance(epochs, tmax=0., method='shrunk', verbose=False)
print(f'Cov (shrunk): {time.perf_counter()-t0:.3f}s')

Compare `OPENBLAS_NUM_THREADS=4 python bench.py` vs `OPENBLAS_NUM_THREADS=16 python bench.py`.
