Skip to content

feat: Native Windows + CUDA build feasibility spike and porting plan (x86_64-pc-windows-msvc) #58

@inureyes

Description

@inureyes

Related to #25 (broad Windows + Linux x86_64 release-matrix work). This issue is the narrower native-Windows + CUDA build feasibility spike that de-risks the Windows portion of #25.

Goal

Produce a working native Windows build of mlxcel and mlxcel-server for single-node CUDA inference on NVIDIA GPUs via the existing cuda feature, targeting x86_64-pc-windows-msvc. This issue is both a feasibility spike (the first phase is a hard go/no-go gate) and, if the gate passes, the porting plan to a functional end-to-end binary.

This is not a supported configuration today. docs/installation.md lists Windows as "not documented here", the release workflow has no Windows job, and the CUDA build glue in build.rs is written for Linux. The binding constraint is upstream: the MLX C++ engine we statically link does not officially build on Windows.

Scope premise — distributed inference is OFF on Windows initially

Native Windows starts single-node only. Distributed inference — both tensor parallelism and pipeline parallelism — is disabled from the outset, not a stretch goal or a "nice to have if it falls out." The deliverable for this issue is single-host CUDA inference (one box, one or more local GPUs) with the distributed path compiled out on Windows.

This is a deliberate premise, not an accident of scoping. The src/distributed/* transport is built on Unix networking primitives (io_uring / kqueue / RawFd, RDMA capability probes) that are not portable to Windows without separate, substantial work; pulling that into the first build would expand the spike well past its go/no-go purpose. Restoring distributed support on Windows is explicitly out of scope here and tracked as a Phase 6 follow-up. Everything below — the blocker analysis, phases, and acceptance criteria — assumes this premise.

Context

mlxcel links MLX C++ statically. The MLX source is fetched via FetchContent and pinned in src/lib/mlx-cpp/CMakeLists.txt:91-92 (commit 84961223c02925bef6bef95d3a0a046779bde935, ~MLX v0.31.2). The same commit is asserted in src/lib/mlxcel-core/build.rs:137 (MLX_EXPECTED_COMMIT). MLX_BUILD_CUDA=ON is set for the cuda feature in build.rs:209-234.

The blockers split into three layers, ordered by how fundamental they are.

Layer 1 — Upstream MLX does not build on Windows (hard gate)

MLX's official build docs (0.31.2) cover macOS, Linux, and Linux+CUDA only; Windows is not mentioned. CUDA requirements: CUDA toolkit >= 12.0, NVIDIA driver >= 550.54.14, SM >= 7.5, cuDNN. Known Windows blockers from upstream issue ml-explore/mlx#1513:

  • CMake cannot auto-detect OpenBLAS on Windows (needs hardcoded paths).
  • MSVC has no _Complex, which lapack_complex_float is defined in terms of.
  • make_compiled_preamble.sh (kernel preamble codegen) has no Windows implementation.
  • io/load.h uses Unix system calls for I/O.
  • MSVC discovery/invocation needed for mx.compile (JIT/NVRTC path).

There is partial Windows CUDA plumbing upstream (CUDA DLL delay-loading, see PR #1983 / Discussion #2422), but the common/CPU layer is not Windows-ready at our pinned commit.

Implication: Phase 1 below must succeed (build vanilla MLX at our pinned commit on Windows+CUDA, standalone) before any mlxcel-side work is meaningful. A documented "not feasible without carrying upstream patches / waiting for upstream" is a valid outcome of this issue.

Layer 2 — mlxcel build glue is Linux-only

Even with MLX building, src/lib/mlxcel-core/build.rs assumes Linux for CUDA:

  • link_cuda() (build.rs:301-334) searches CUDA_HOME/lib64 or /usr/local/cuda/lib64, probes a stubs/ subdir, and links .so-style names (cudart, cublas, cublasLt, cufft, cuda, cudnn, nvrtc). Windows CUDA layout is %CUDA_PATH%\lib\x64, import libs are *.lib, and there is no stubs/. Needs a cfg(windows) branch.
  • The CMake CUDA compiler hint (build.rs:215-216) falls back to /usr/local/cuda/bin/nvcc. On Windows this is nvcc.exe under %CUDA_PATH%\bin.
  • CPU-backend BLAS is linked inside #[cfg(target_os = "linux")] (build.rs:95-101: stdc++, openblas, lapack). Windows needs its own branch and a BLAS source (ties into the upstream OpenBLAS-on-Windows blocker).
  • The cxx-bridge compile flags are GCC/Clang syntax: -std=c++20 (build.rs:46), -O3/-ffast-math/-march=native (build.rs:54-66), -flto (macOS-gated). Under MSVC, flag_if_supported silently drops all of these, so MLX's required C++20 standard is never set and optimizations are lost. Need MSVC equivalents (/std:c++20, /O2, etc.).

Layer 3 — Rust code has Unix dependencies

libc is a [target.'cfg(unix)'.dependencies] entry (Cargo.toml:154-158), so it is absent on Windows. There are ~55 Unix-only API usages (std::os::unix, io_uring, kqueue, RawFd/AsRawFd, /dev/fd/N) across:

  • src/distributed/* (RDMA, tensor/pipeline parallel transport)
  • src/server/media.rs, src/multimodal/video.rs
  • src/downloader/, src/execution/runtime.rs

Some are already gated (src/server/media.rs:324 #[cfg(unix)], multiple #[cfg(unix)] in src/multimodal/video.rs, and a cfg!(target_os = "windows") branch already exists at src/distributed/rdma_capabilities.rs:39). The distributed (tensor/pipeline parallelism) path is deeply tied to Unix networking primitives and is compiled out on Windows for the initial scope (see the Scope premise above), then revisited as a Phase 6 follow-up.

Prerequisites to prepare (toolchain)

  • Visual Studio 2022 with the "Desktop development with C++" workload (MSVC v143). CUDA's nvcc on Windows requires the MSVC cl.exe host compiler — MinGW/GNU is not a CUDA-supported host.
  • CUDA Toolkit 12.x or newer (MLX requires >= 12.0; our Linux releases target CUDA 13). Includes nvcc, cudart, cublas, cufft, nvrtc.
  • cuDNN matching the CUDA version.
  • NVIDIA driver >= 550.
  • CMake (latest), Git, and Git Bash (in case MLX's shell-based codegen scripts are still on the CUDA path).
  • Rust toolchain with the x86_64-pc-windows-msvc target (NOT the GNU target — must link MSVC-built CUDA/MLX libs).
  • OpenBLAS/LAPACK for Windows (e.g. via vcpkg) plus CMake hints so it is discoverable.

Phased plan

Phase 0 — Environment

  • Stand up a Windows host with an NVIDIA GPU and the toolchain above.
  • Confirm nvcc --version, cl.exe, cmake --version, rustc --version --verbose (host = msvc), nvidia-smi.

Phase 1 — GATE: build vanilla MLX on Windows+CUDA (standalone, no mlxcel)

  • Check out MLX at pinned commit 84961223c02925bef6bef95d3a0a046779bde935 and attempt a CUDA build with -DMLX_BUILD_CUDA=ON directly (no mlxcel).
  • Resolve / document each blocker hit (OpenBLAS detection, MSVC _Complex/lapack, preamble codegen, io/load.h, JIT compile host discovery).
  • Decision point: record whether vanilla MLX can be built (with how many local patches), or whether this is blocked pending upstream. If blocked, document evidence and stop; that is a valid close.

Phase 2 — mlxcel build glue (build.rs)

  • Add a cfg(windows) CUDA link branch: %CUDA_PATH%\lib\x64, .lib import-lib names, no stubs/.
  • Add Windows nvcc path resolution (%CUDA_PATH%\bin\nvcc.exe).
  • Add a Windows CPU/BLAS link branch (and decide BLAS source).
  • Add MSVC compile flags for the cxx bridge (/std:c++20, /O2) so C++20 + optimization are actually applied.

Phase 3 — Rust platform gating

  • Audit the ~55 Unix API usages; cfg-gate or provide Windows fallbacks.
  • Compile out the distributed (tensor/pipeline parallel) path on Windows per the Scope premise; track the Phase 6 follow-up to restore it.
  • Ensure mlxcel-server and CLI build with distributed gated off.

Phase 4 — dependent C++ crates

  • Verify sentencepiece (Cargo.toml:77, builds a C++ lib via CMake) compiles under MSVC. (tokenizers, llguidance, toktrie are pure Rust and should be fine.)

Phase 5 — end-to-end smoke test

  • cargo build --release --target x86_64-pc-windows-msvc --features cuda produces both binaries.
  • mlxcel --version / mlxcel-server --version.
  • mlxcel download mlx-community/Qwen3-0.6B-4bit then mlxcel generate -m ... -p "Hello" -n 16 produces tokens on GPU.
  • mlxcel-server serves a /v1/chat/completions request.

Phase 6 — follow-ups (out of scope for first close, file separately)

  • CI Windows job, packaging, optional code signing.
  • Restore distributed features on Windows.
  • Reconcile CUDA 12.x vs 13.x version matrix with the Linux release.

Acceptance criteria

This issue is complete when either:

  1. A native x86_64-pc-windows-msvc --features cuda build of mlxcel + mlxcel-server builds and passes the Phase 5 smoke tests on an NVIDIA Windows host (end-to-end functional single-node binary; per the Scope premise, distributed inference is compiled out on Windows and its absence does not block this issue — it is tracked as a Phase 6 follow-up); or
  2. Phase 1 produces a documented go/no-go determination showing native MLX-on-Windows is not feasible at the pinned commit without unacceptable local patching, with the specific blockers and evidence recorded for a future re-attempt.

Risks & fallback

  • Upstream MLX may be a hard blocker; carrying local MLX patches across the pinned-commit upgrade process adds maintenance cost.
  • Distributed inference is unlikely to work natively on Windows initially.
  • Fallback already supported: WSL2 (Ubuntu) + NVIDIA WSL CUDA driver runs the documented Linux CUDA path (cargo build --release --features cuda) with none of the above work. This issue exists specifically because a native .exe is wanted instead of the WSL2 path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:coremlxcel-core: MLX FFI, primitives, KV cache, layersarea:inferenceGeneration, sampling, decoding (incl. speculative, DRY)platform:windowsWindows (native) specificpriority:mediumMedium prioritystatus:investigationFeasibility spike / under investigationtype:enhancementNew features, capabilities, or significant additions

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions