Related to #25 (broad Windows + Linux x86_64 release-matrix work). This issue is the narrower native-Windows + CUDA build feasibility spike that de-risks the Windows portion of #25.
Goal
Produce a working native Windows build of mlxcel and mlxcel-server for single-node CUDA inference on NVIDIA GPUs via the existing cuda feature, targeting x86_64-pc-windows-msvc. This issue is both a feasibility spike (the first phase is a hard go/no-go gate) and, if the gate passes, the porting plan to a functional end-to-end binary.
This is not a supported configuration today. docs/installation.md lists Windows as "not documented here", the release workflow has no Windows job, and the CUDA build glue in build.rs is written for Linux. The binding constraint is upstream: the MLX C++ engine we statically link does not officially build on Windows.
Scope premise — distributed inference is OFF on Windows initially
Native Windows starts single-node only. Distributed inference — both tensor parallelism and pipeline parallelism — is disabled from the outset, not a stretch goal or a "nice to have if it falls out." The deliverable for this issue is single-host CUDA inference (one box, one or more local GPUs) with the distributed path compiled out on Windows.
This is a deliberate premise, not an accident of scoping. The src/distributed/* transport is built on Unix networking primitives (io_uring / kqueue / RawFd, RDMA capability probes) that are not portable to Windows without separate, substantial work; pulling that into the first build would expand the spike well past its go/no-go purpose. Restoring distributed support on Windows is explicitly out of scope here and tracked as a Phase 6 follow-up. Everything below — the blocker analysis, phases, and acceptance criteria — assumes this premise.
Context
mlxcel links MLX C++ statically. The MLX source is fetched via FetchContent and pinned in src/lib/mlx-cpp/CMakeLists.txt:91-92 (commit 84961223c02925bef6bef95d3a0a046779bde935, ~MLX v0.31.2). The same commit is asserted in src/lib/mlxcel-core/build.rs:137 (MLX_EXPECTED_COMMIT). MLX_BUILD_CUDA=ON is set for the cuda feature in build.rs:209-234.
The blockers split into three layers, ordered by how fundamental they are.
Layer 1 — Upstream MLX does not build on Windows (hard gate)
MLX's official build docs (0.31.2) cover macOS, Linux, and Linux+CUDA only; Windows is not mentioned. CUDA requirements: CUDA toolkit >= 12.0, NVIDIA driver >= 550.54.14, SM >= 7.5, cuDNN. Known Windows blockers from upstream issue ml-explore/mlx#1513:
- CMake cannot auto-detect OpenBLAS on Windows (needs hardcoded paths).
- MSVC has no _Complex, which lapack_complex_float is defined in terms of.
- make_compiled_preamble.sh (kernel preamble codegen) has no Windows implementation.
- io/load.h uses Unix system calls for I/O.
- MSVC discovery/invocation needed for mx.compile (JIT/NVRTC path).
There is partial Windows CUDA plumbing upstream (CUDA DLL delay-loading, see PR #1983 / Discussion #2422), but the common/CPU layer is not Windows-ready at our pinned commit.
Implication: Phase 1 below must succeed (build vanilla MLX at our pinned commit on Windows+CUDA, standalone) before any mlxcel-side work is meaningful. A documented "not feasible without carrying upstream patches / waiting for upstream" is a valid outcome of this issue.
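The CUDA minimums quoted above (toolkit >= 12.0, driver >= 550.54.14, SM >= 7.5) could be checked mechanically before attempting the Phase 1 build. The following is a minimal sketch, not existing mlxcel code: the helper names (parse_version, meets_minimum, unmet_requirements) are hypothetical, and it assumes the versions are already available as dotted strings. Whether the driver minimum, which reads like Linux driver numbering, maps directly onto Windows driver version strings is not established here.

```rust
/// Minimums quoted from the MLX 0.31.2 build docs referenced in this issue.
const MIN_CUDA_TOOLKIT: &str = "12.0";
const MIN_DRIVER: &str = "550.54.14";
const MIN_SM: &str = "7.5";

/// Parse a dotted version such as "550.54.14" into numeric components.
/// Returns None if any component is not a non-negative integer.
fn parse_version(s: &str) -> Option<Vec<u32>> {
    s.trim().split('.').map(|part| part.parse::<u32>().ok()).collect()
}

/// Some(true) when `found` is at least `minimum`, comparing component by
/// component. Missing trailing components count as zero, so "12" == "12.0".
/// Returns None when either string cannot be parsed.
fn meets_minimum(found: &str, minimum: &str) -> Option<bool> {
    let mut a = parse_version(found)?;
    let mut b = parse_version(minimum)?;
    let len = a.len().max(b.len());
    a.resize(len, 0);
    b.resize(len, 0);
    Some(a >= b)
}

/// Names of the requirements that are not satisfied (or not parseable).
fn unmet_requirements(toolkit: &str, driver: &str, sm: &str) -> Vec<&'static str> {
    let checks = [
        ("CUDA toolkit", toolkit, MIN_CUDA_TOOLKIT),
        ("NVIDIA driver", driver, MIN_DRIVER),
        ("compute capability (SM)", sm, MIN_SM),
    ];
    checks
        .iter()
        .filter(|(_, found, min)| meets_minimum(found, min) != Some(true))
        .map(|(name, _, _)| *name)
        .collect()
}
```

An empty result from unmet_requirements means all three minimums are met; an unparseable version is treated as unmet rather than silently accepted.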
Layer 2 — mlxcel build glue is Linux-only
Even with MLX building, src/lib/mlxcel-core/build.rs assumes Linux for CUDA:
- link_cuda() (build.rs:301-334) searches CUDA_HOME/lib64 or /usr/local/cuda/lib64, probes a stubs/ subdir, and links .so-style names (cudart, cublas, cublasLt, cufft, cuda, cudnn, nvrtc). Windows CUDA layout is %CUDA_PATH%\lib\x64, import libs are *.lib, and there is no stubs/. Needs a cfg(windows) branch.
- The CMake CUDA compiler hint (build.rs:215-216) falls back to /usr/local/cuda/bin/nvcc. On Windows this is nvcc.exe under %CUDA_PATH%\bin.
- CPU-backend BLAS is linked inside #[cfg(target_os = "linux")] (build.rs:95-101: stdc++, openblas, lapack). Windows needs its own branch and a BLAS source (ties into the upstream OpenBLAS-on-Windows blocker).
- The cxx-bridge compile flags are GCC/Clang syntax: -std=c++20 (build.rs:46), -O3/-ffast-math/-march=native (build.rs:54-66), -flto (macOS-gated). Under MSVC, flag_if_supported silently drops all of these, so MLX's required C++20 standard is never set and optimizations are lost. Need MSVC equivalents (/std:c++20, /O2, etc.).
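The Windows-specific choices listed above (library directory, nvcc location, compiler flags) can be sketched as pure functions keyed on the target rather than the host. This is an illustration only, not the current build.rs: every function name below is hypothetical, the library names are simply the ones listed for link_cuda(), and it is an assumption that all of them resolve as import libraries from a single directory (cuDNN in particular may be installed elsewhere).

```rust
use std::path::{Path, PathBuf};

/// CUDA libraries named in this issue. On Windows each would resolve to an
/// import library (`<name>.lib`); assumption: the same base names apply.
const CUDA_LIBS: [&str; 7] = ["cudart", "cublas", "cublasLt", "cufft", "cuda", "cudnn", "nvrtc"];

/// Library search directory under a CUDA root for the given target OS.
/// Windows: <root>\lib\x64. Otherwise: <root>/lib64.
fn cuda_lib_dir(cuda_root: &Path, target_os: &str) -> PathBuf {
    if target_os == "windows" {
        cuda_root.join("lib").join("x64")
    } else {
        cuda_root.join("lib64")
    }
}

/// Location of the CUDA compiler to hand to CMake.
fn nvcc_path(cuda_root: &Path, target_os: &str) -> PathBuf {
    let exe = if target_os == "windows" { "nvcc.exe" } else { "nvcc" };
    cuda_root.join("bin").join(exe)
}

/// C++ flags for the bridge code: MSVC spellings when the target env is
/// "msvc", GCC/Clang spellings otherwise.
fn cxx_flags(target_env: &str) -> Vec<&'static str> {
    if target_env == "msvc" {
        vec!["/std:c++20", "/O2"]
    } else {
        vec!["-std=c++20", "-O3"]
    }
}

/// Lines a build script would print to tell Cargo where and what to link.
fn cargo_link_directives(cuda_root: &Path, target_os: &str) -> Vec<String> {
    let dir = cuda_lib_dir(cuda_root, target_os);
    let mut out = vec![format!("cargo:rustc-link-search=native={}", dir.display())];
    out.extend(CUDA_LIBS.iter().map(|lib| format!("cargo:rustc-link-lib={lib}")));
    out
}
```

In a real build script the target_os and target_env arguments would most likely come from the CARGO_CFG_TARGET_OS and CARGO_CFG_TARGET_ENV environment variables that Cargo sets, since cfg attributes inside build.rs describe the host the script runs on rather than the compilation target. The cuda_root would come from CUDA_PATH on Windows, as described above.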
Layer 3 — Rust code has Unix dependencies
libc is a [target.'cfg(unix)'.dependencies] entry (Cargo.toml:154-158), so it is absent on Windows. There are ~55 Unix-only API usages (std::os::unix, io_uring, kqueue, RawFd/AsRawFd, /dev/fd/N) across:
- src/distributed/* (RDMA, tensor/pipeline parallel transport)
- src/server/media.rs, src/multimodal/video.rs
- src/downloader/, src/execution/runtime.rs
Some are already gated (src/server/media.rs:324 #[cfg(unix)], multiple #[cfg(unix)] in src/multimodal/video.rs, and a cfg!(target_os = "windows") branch already exists at src/distributed/rdma_capabilities.rs:39). The distributed (tensor/pipeline parallelism) path is deeply tied to Unix networking primitives and is compiled out on Windows for the initial scope (see the Scope premise above), then revisited as a Phase 6 follow-up.
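One way to compile the distributed path out on Windows while keeping call sites unchanged is to provide two versions of the same module behind cfg attributes. The sketch below is illustrative: the distributed module here is a stand-in, not the real src/distributed/* code, and the names init, AVAILABLE, and plan_run are hypothetical.

```rust
/// Stand-in for the Unix-only transport. On Unix targets the real
/// implementation would live here.
#[cfg(unix)]
mod distributed {
    pub const AVAILABLE: bool = true;

    pub fn init(world_size: usize) -> Result<String, String> {
        Ok(format!("distributed transport initialised for {world_size} ranks"))
    }
}

/// Stub compiled on every non-Unix target (including Windows). It keeps the
/// same signatures so callers need no cfg attributes of their own.
#[cfg(not(unix))]
mod distributed {
    pub const AVAILABLE: bool = false;

    pub fn init(_world_size: usize) -> Result<String, String> {
        Err("distributed inference is not available in this build".to_string())
    }
}

/// Single-node runs never touch the distributed module; anything larger goes
/// through the gate and fails with a clear error where it is compiled out.
fn plan_run(world_size: usize) -> Result<String, String> {
    if world_size <= 1 {
        return Ok("single-node".to_string());
    }
    distributed::init(world_size)
}
```

With this shape, a Windows build still accepts single-node requests and rejects multi-node ones at runtime with an explicit message, instead of failing to compile or silently ignoring the request.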
Prerequisites to prepare (toolchain)
Phased plan
Phase 0 — Environment
Phase 1 — GATE: build vanilla MLX on Windows+CUDA (standalone, no mlxcel)
Phase 2 — mlxcel build glue (build.rs)
Phase 3 — Rust platform gating
Phase 4 — dependent C++ crates
Phase 5 — end-to-end smoke test
Phase 6 — follow-ups (out of scope for first close, file separately)
- CI Windows job, packaging, optional code signing.
- Restore distributed features on Windows.
- Reconcile CUDA 12.x vs 13.x version matrix with the Linux release.
Acceptance criteria
This issue is complete when either:
- A native x86_64-pc-windows-msvc --features cuda build of mlxcel + mlxcel-server builds and passes the Phase 5 smoke tests on an NVIDIA Windows host (end-to-end functional single-node binary; per the Scope premise, distributed inference is compiled out on Windows, its absence does not block this issue, and it is tracked as a Phase 6 follow-up); or
- Phase 1 produces a documented go/no-go determination showing native MLX-on-Windows is not feasible at the pinned commit without unacceptable local patching, with the specific blockers and evidence recorded for a future re-attempt.
Risks & fallback
- Upstream MLX may be a hard blocker; carrying local MLX patches across the pinned-commit upgrade process adds maintenance cost.
- Distributed inference is unlikely to work natively on Windows initially.
- Fallback already supported: WSL2 (Ubuntu) + NVIDIA WSL CUDA driver runs the documented Linux CUDA path (cargo build --release --features cuda) with none of the above work. This issue exists specifically because a native .exe is wanted instead of the WSL2 path.
Prerequisites to prepare (toolchain)
- nvcc on Windows requires the MSVC cl.exe host compiler; MinGW/GNU is not a CUDA-supported host.
- nvcc, cudart, cublas, cufft, nvrtc.
- x86_64-pc-windows-msvc target (NOT the GNU target; must link MSVC-built CUDA/MLX libs).
Phased plan
Phase 0 — Environment
- nvcc --version, cl.exe, cmake --version, rustc --version --verbose (host = msvc), nvidia-smi.
Phase 1 — GATE: build vanilla MLX on Windows+CUDA (standalone, no mlxcel)
- [...] 84961223c02925bef6bef95d3a0a046779bde935 and attempt a CUDA build with -DMLX_BUILD_CUDA=ON directly (no mlxcel).
- [...] _Complex/lapack, preamble codegen, io/load.h, JIT compile host discovery).
Phase 2 — mlxcel build glue (build.rs)
- cfg(windows) CUDA link branch: %CUDA_PATH%\lib\x64, .lib import-lib names, no stubs/.
- [...] %CUDA_PATH%\bin\nvcc.exe).
- [...] /std:c++20, /O2) so C++20 + optimization are actually applied.
Phase 3 — Rust platform gating
- [...] cfg-gate or provide Windows fallbacks.
- mlxcel-server and CLI build with distributed gated off.
Phase 4 — dependent C++ crates
- sentencepiece (Cargo.toml:77, builds a C++ lib via CMake) compiles under MSVC. (tokenizers, llguidance, toktrie are pure Rust and should be fine.)
Phase 5 — end-to-end smoke test
- cargo build --release --target x86_64-pc-windows-msvc --features cuda produces both binaries.
- mlxcel --version / mlxcel-server --version.
- mlxcel download mlx-community/Qwen3-0.6B-4bit then mlxcel generate -m ... -p "Hello" -n 16 produces tokens on GPU.
- mlxcel-server serves a /v1/chat/completions request.
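The Phase 0 environment checks could be scripted so the go/no-go record includes the exact tool versions found. The following is a rough sketch under stated assumptions: probe, host_triple, is_msvc_host, and CHECKS are hypothetical names, only the commands listed for Phase 0 are used, and cl.exe is left out because the issue does not say how it should be probed.

```rust
use std::process::Command;

/// Phase 0 commands as (tool, args) pairs, taken from the list above.
const CHECKS: [(&str, &[&str]); 4] = [
    ("nvcc", &["--version"]),
    ("cmake", &["--version"]),
    ("rustc", &["--version", "--verbose"]),
    ("nvidia-smi", &[]),
];

/// Run `tool` with `args` and return the first line of stdout, or None if
/// the tool cannot be started (e.g. not on PATH) or exits unsuccessfully.
fn probe(tool: &str, args: &[&str]) -> Option<String> {
    let output = Command::new(tool).args(args).output().ok()?;
    if !output.status.success() {
        return None;
    }
    let text = String::from_utf8_lossy(&output.stdout);
    Some(text.lines().next().unwrap_or("").trim().to_string())
}

/// Extract the `host:` triple from `rustc --version --verbose` output.
fn host_triple(verbose_output: &str) -> Option<&str> {
    verbose_output
        .lines()
        .find_map(|line| line.strip_prefix("host: "))
        .map(str::trim)
}

/// True when the rustc host triple is an MSVC one, as Phase 0 requires.
fn is_msvc_host(verbose_output: &str) -> bool {
    host_triple(verbose_output).map_or(false, |triple| triple.ends_with("-msvc"))
}

/// Run every check and pair each tool name with what was found, if anything.
fn run_checks() -> Vec<(&'static str, Option<String>)> {
    CHECKS.iter().map(|(tool, args)| (*tool, probe(tool, args))).collect()
}
```

A caller would iterate over run_checks(), treat any None as a missing prerequisite, and separately feed the full rustc --version --verbose output to is_msvc_host to confirm the host = msvc condition.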