
Add support for x86 SIMD (AVX2) #3019

Open
dhiltgen wants to merge 4 commits into ml-explore:main from dhiltgen:x86_simd

Conversation

@dhiltgen
Contributor

@dhiltgen dhiltgen commented Jan 19, 2026

Proposed changes

Implement AVX2 SIMD support for better performance on CPU-only x86 systems. This includes: quantized matmul leveraging int8 maddubs for 4-bit/8-bit weights, plus FP4 and FP8 support; fast implementations of SDPA, RoPE, norms, softmax, and reduce; thread-pool coordination with OpenBLAS/GCD to utilize all CPU cores; and JIT support for CPU SIMD.

Unless stated otherwise, all benchmarks were run with `mlx_lm.benchmark -p 2048 -g 128` (5 trials, averages reported).

Windows 11, AMD Ryzen 9 7950X (Zen 4)

4-bit Quantized

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-4bit | 375.9 | 23.0 | 1.131 |
| Qwen2.5-1.5B-4bit | 281.9 | 19.7 | 1.193 |
| Llama-3.2-3B-4bit | 134.0 | 8.1 | 2.504 |

8-bit Quantized

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-8bit | 366.8 | 17.1 | 1.747 |
| Qwen2.5-1.5B-8bit | 277.9 | 14.4 | 1.965 |

bf16 (Unquantized)

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-bf16 | 360.0 | 12.4 | 2.817 |
| Qwen2.5-1.5B-bf16 | 268.6 | 10.1 | 3.393 |
| Llama-3.2-3B-bf16 | 113.6 | 4.4 | 7.017 |

vs Upstream MLX (unoptimized)

Upstream is too slow for p2048/g128, so both sides use p16/g4

| Model | Benchmark | Upstream pp tok/s | Upstream gen tok/s |
|---|---|---|---|
| Llama-3.2-1B-4bit | p16/g4 | 0.020 | 0.022 |
| Llama-3.2-1B-bf16 | p16/g4 | 3.031 | 1.609 |

Linux, Intel Core i7-11700K @ 3.60GHz (Rocket Lake)

4-bit Quantized

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-4bit | 237.0 | 24.8 | 1.131 |
| Qwen2.5-1.5B-4bit | 171.1 | 20.9 | 1.193 |
| granite-3.3-2b-4bit | 90.8 | 11.0 | 1.895 |
| Llama-3.2-3B-4bit | 87.0 | 8.7 | 2.504 |

8-bit Quantized

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-8bit | 247.5 | 17.0 | 1.745 |
| Qwen2.5-1.5B-8bit | 179.8 | 14.2 | 1.965 |
| granite-3.3-2b-8bit | 91.2 | 8.0 | 3.162 |

bf16 (Unquantized)

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-bf16 | 221.8 | 10.9 | 2.816 |
| Qwen2.5-1.5B-bf16 | 161.5 | 9.1 | 3.393 |
| Llama-3.2-3B-bf16 | 82.9 | 4.0 | 7.017 |

MacOS 26.0, M3 Max (CPU only build)

Not the focus of this PR, but included to demonstrate a net improvement from the threading addition.

4-bit Quantized

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-4bit | 28.6 | 0.71 | 0.73 |
| Qwen2.5-1.5B-4bit | 22.1 | 0.57 | 0.89 |
| Llama-3.2-3B-4bit | 11.4 | 0.29 | 1.87 |

8-bit Quantized

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-8bit | 29.0 | 0.72 | 1.34 |
| Qwen2.5-1.5B-8bit | 22.2 | 0.58 | 1.66 |

bf16 (Unquantized)

| Model | Prompt tok/s | Gen tok/s | Peak GB |
|---|---|---|---|
| Llama-3.2-1B-bf16 | 94.1 | 5.47 | 2.50 |
| Qwen2.5-1.5B-bf16 | 73.8 | 5.07 | 3.11 |
| Llama-3.2-3B-bf16 | 58.6 | 2.37 | 6.48 |

vs Upstream MLX

Shorter settings used.

| Model | Settings | Upstream pp | Upstream tg | Our pp | Our tg | pp speedup | tg speedup |
|---|---|---|---|---|---|---|---|
| Llama-3.2-1B-4bit | p16/g4 n3 | ~0.04 | ~0.07 | 0.89 | 0.91 | ~22x | ~13x |
| Llama-3.2-1B-bf16 | p128/g32 n3 | 26.3 | 4.69 | 95.3 | 5.48 | 3.6x | 1.17x |

Checklist

Put an x in the boxes that apply.

- I have read the CONTRIBUTING document
- I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
- I have added tests that prove my fix is effective or that my feature works
- I have updated the necessary documentation (if needed)

@dhiltgen dhiltgen changed the title Add support for x85 SIMD (SSE, AVX2, AVX512) Add support for x86 SIMD (SSE, AVX2, AVX512) Jan 19, 2026
Comment on lines +15 to +23
```cpp
#if !defined(MLX_USE_ACCELERATE)
#if defined(__AVX512F__)
#include "mlx/backend/cpu/simd/avx512_simd.h"
#elif defined(__AVX2__)
#include "mlx/backend/cpu/simd/avx_simd.h"
#elif defined(__SSE4_2__)
#include "mlx/backend/cpu/simd/sse_simd.h"
#endif
#endif
```
Member


I'm wondering if this will break our Linux x86 distribution in some cases. If we build with AVX512 and someone then tries to run it on a machine which doesn't support AVX512, it will crash, right?

Member


Actually it looks like just the lowest level is enabled by default. So we should be ok.

@awni
Member

awni commented Jan 23, 2026

@dhiltgen what are you thinking for next steps here?

I might suggest we split this out into multiple PRs to make it easier to review and incorporate. The first PR could be the basic SSE backend for x86, which we should definitely integrate. Following that we could add the extra back-ends (there is a question of how to test those as well).

We will probably also want a NEON-only back-end for Linux ARM (i.e. not through Accelerate).

@dhiltgen
Contributor Author

Splitting this up into smaller chunks sounds like a reasonable approach.

I'll probably keep this in draft for a bit, while we focus on full GPU load for best performance.

@awni
Member

awni commented Jan 28, 2026

Sounds good!

@dhiltgen
Contributor Author

I've updated this branch with a more focused implementation targeting just AVX2, fleshed out to provide a real-world performance boost for mlx_lm models running on the CPU.

@dhiltgen dhiltgen marked this pull request as ready for review March 14, 2026 17:36
@dhiltgen dhiltgen changed the title Add support for x86 SIMD (SSE, AVX2, AVX512) Add support for x86 SIMD (AVX2) Mar 14, 2026
@zcbenz
Collaborator

zcbenz commented Mar 16, 2026

I think many of these changes could be submitted as separate PRs, for example the JIT compiler and allocator changes, which we could merge much faster.

dhiltgen added 4 commits May 14, 2026 08:39
Integrate BufferCache into the CPU allocator to enable memory reuse for CPU-only builds. Previously the no_gpu allocator called malloc/free on every allocation with no caching, while the Metal and CUDA backends had buffer caching for better performance.

Track cached buffers by their physical capacity when they are reused so get_cache_memory(), active memory, and cache limit enforcement continue to reflect retained memory. Add a regression test for reusing a larger cached block for a smaller request.

Changes:
- Add CpuCachedBuffer struct with intrusive freelist for object pooling
- Use BufferCache to recycle freed buffers with a 32MB default cache limit
- Preserve cached block capacity across reuse and avoid caching zero-size allocations
- Implement get_cache_memory(), set_cache_limit(), clear_cache() (were no-ops)
- Cache-first allocation path with fallback to OS malloc on cache miss
Leak the IO ThreadPool singletons and CPU CompilerCache using the same process-lifetime pattern already used by the Scheduler singleton.

The CompilerCache owns dlopen handles for JIT shared libraries. Destroying it during static teardown can dlclose generated code while stream worker threads may still be winding down. The IO loader thread pools have the same shutdown-order risk on Windows CRT teardown. These objects are process-lifetime infrastructure, and the OS reclaims their resources at exit.

Changes:
- Leak CompilerCache so JIT libraries remain mapped through process exit
- Leak IO ThreadPool singletons to avoid teardown-order races
- Clarify the Scheduler singleton comment that documents this pattern
Enable CPU mx.compile() on Windows by detecting and using clang-cl bundled with Visual Studio, or MSVC cl.exe, for JIT compilation.

Keep GPU compile availability independent from the CPU compiler probe so CPU+GPU builds do not disable GPU mx.compile() when a host C++ compiler is unavailable.

Changes:
- Add clang-cl detection via vswhere and prefer a compiler matching the build toolchain
- Add JitCompiler::available() to probe CPU JIT availability
- Emit and load .dll JIT libraries on Windows
- Support both MSVC and GCC/Clang preamble generation scripts, including optional SIMD flags
- Use WIN32 shell detection and pass preamble SIMD flags through CMake
- Define NOMINMAX/WIN32_LEAN_AND_MEAN on all WIN32 compilers
Add an AVX2 SIMD backend, CPU thread pool, and vectorized implementations for the major CPU operations used by CPU inference on x86_64.

This makes x86 CPU inference practical for small models and substantially improves CPU throughput versus the scalar baseline. Exact speedups depend on model, quantization, prompt/generation mix, BLAS implementation, and CPU power profile, so benchmark details belong in the PR notes rather than the commit message.

SIMD foundation:
- avx_simd.h: Simd<T,8> for float/double/int/float16/bfloat16 with F16C conversion, comparisons, and reductions
- x86_simd_macros.h: comparison predicates and boolean mask operations
- base_simd.h: int64/uint64 additions and x86 conditional includes
- math.h: x86 special functions with Newton-Raphson refinement

Thread pool:
- GCD backend on Apple platforms and persistent std::thread backend on Linux/Windows
- parallel_for with serialized dispatch and per-worker spin-then-sleep wakeup
- Physical-core default thread count, MLX_CPU_THREADS override, and optional OpenBLAS single-thread coordination

Vectorized ops:
- quantized.cpp/quantized_avx2.h: multi-column dequantize+FMA for Q4/Q8
- norms.cpp: RMSNorm and LayerNorm with SIMD parallel reduction
- rope.cpp/rope_avx2.h: AVX2 interleaved sin/cos rotation
- sdpa.cpp: tiled Q*K^T with online softmax, threaded across heads
- compiled.cpp: SIMD codegen plus parallel dispatch
- binary.h, unary.h, copy.cpp, indexing.cpp, reduce.cpp, softmax.cpp, gemms/: SIMD and threading improvements throughout
@dhiltgen
Contributor Author

@zcbenz I've split a few pieces out of this one and rebased it so it's ready for another look.
