feat: Native Windows + CUDA build feasibility spike and porting plan (x86_64-pc-windows-msvc)

> Related to #25 (broad Windows + Linux x86_64 release-matrix work). This issue is the narrower native-Windows + CUDA build feasibility spike that de-risks the Windows portion of #25.

## Goal

Produce a working **native Windows** build of `mlxcel` and `mlxcel-server` for **single-node CUDA inference** on NVIDIA GPUs via the existing `cuda` feature, targeting `x86_64-pc-windows-msvc`. This issue is both a **feasibility spike** (the first phase is a hard go/no-go gate) and, if the gate passes, the **porting plan** to a functional end-to-end binary.

This is not a supported configuration today. `docs/installation.md` lists Windows as "not documented here", the release workflow has no Windows job, and the CUDA build glue in `build.rs` is written for Linux. The binding constraint is upstream: the MLX C++ engine we statically link does not officially build on Windows.

## Scope premise — distributed inference is OFF on Windows initially

Native Windows starts **single-node only**. Distributed inference — both **tensor parallelism and pipeline parallelism** — is **disabled from the outset**, not a stretch goal or a "nice to have if it falls out." The deliverable for this issue is single-host CUDA inference (one box, one or more local GPUs) with the distributed path compiled out on Windows.

This is a deliberate premise, not an accident of scoping. The `src/distributed/*` transport is built on Unix networking primitives (io_uring / kqueue / `RawFd`, RDMA capability probes) that are not portable to Windows without separate, substantial work; pulling that into the first build would expand the spike well past its go/no-go purpose. Restoring distributed support on Windows is explicitly **out of scope** here and tracked as a Phase 6 follow-up. Everything below — the blocker analysis, phases, and acceptance criteria — assumes this premise.

## Context

`mlxcel` links MLX C++ statically. The MLX source is fetched via `FetchContent` and pinned in `src/lib/mlx-cpp/CMakeLists.txt:91-92` (commit `84961223c02925bef6bef95d3a0a046779bde935`, ~MLX v0.31.2). The same commit is asserted in `src/lib/mlxcel-core/build.rs:137` (`MLX_EXPECTED_COMMIT`). `MLX_BUILD_CUDA=ON` is set for the `cuda` feature in `build.rs:209-234`.

The blockers split into three layers, ordered by how fundamental they are.

### Layer 1 — Upstream MLX does not build on Windows (hard gate)

MLX's official build docs (0.31.2) cover macOS, Linux, and Linux+CUDA only; Windows is not mentioned. CUDA requirements: CUDA toolkit >= 12.0, NVIDIA driver >= 550.54.14, SM >= 7.5, cuDNN. Known Windows blockers from upstream issue [ml-explore/mlx#1513](https://github.com/ml-explore/mlx/issues/1513):

- CMake cannot auto-detect OpenBLAS on Windows (needs hardcoded paths).
- MSVC has no `_Complex`, which `lapack_complex_float` is defined in terms of.
- `make_compiled_preamble.sh` (kernel preamble codegen) has no Windows implementation.
- `io/load.h` uses Unix system calls for I/O.
- MSVC discovery/invocation needed for `mx.compile` (JIT/NVRTC path).

There is partial Windows CUDA plumbing upstream (CUDA DLL delay-loading, see [PR #1983](https://github.com/ml-explore/mlx/pull/1983) / [Discussion #2422](https://github.com/ml-explore/mlx/discussions/2422)), but the common/CPU layer is not Windows-ready at our pinned commit.

**Implication:** Phase 1 below must succeed (build vanilla MLX at our pinned commit on Windows+CUDA, standalone) before any `mlxcel`-side work is meaningful. A documented "not feasible without carrying upstream patches / waiting for upstream" is a valid outcome of this issue.

### Layer 2 — `mlxcel` build glue is Linux-only

Even with MLX building, `src/lib/mlxcel-core/build.rs` assumes Linux for CUDA:

- `link_cuda()` (`build.rs:301-334`) searches `CUDA_HOME/lib64` or `/usr/local/cuda/lib64`, probes a `stubs/` subdir, and links `.so`-style names (`cudart`, `cublas`, `cublasLt`, `cufft`, `cuda`, `cudnn`, `nvrtc`). Windows CUDA layout is `%CUDA_PATH%\lib\x64`, import libs are `*.lib`, and there is no `stubs/`. Needs a `cfg(windows)` branch.
- The CMake CUDA compiler hint (`build.rs:215-216`) falls back to `/usr/local/cuda/bin/nvcc`. On Windows this is `nvcc.exe` under `%CUDA_PATH%\bin`.
- CPU-backend BLAS is linked inside `#[cfg(target_os = "linux")]` (`build.rs:95-101`: `stdc++`, `openblas`, `lapack`). Windows needs its own branch and a BLAS source (ties into the upstream OpenBLAS-on-Windows blocker).
- The cxx-bridge compile flags are GCC/Clang syntax: `-std=c++20` (`build.rs:46`), `-O3`/`-ffast-math`/`-march=native` (`build.rs:54-66`), `-flto` (macOS-gated). Under MSVC, `flag_if_supported` silently drops all of these, so MLX's required C++20 standard is never set and optimizations are lost. Need MSVC equivalents (`/std:c++20`, `/O2`, etc.).

### Layer 3 — Rust code has Unix dependencies

`libc` is a `[target.'cfg(unix)'.dependencies]` entry (`Cargo.toml:154-158`), so it is absent on Windows. There are ~55 Unix-only API usages (`std::os::unix`, io_uring, kqueue, `RawFd`/`AsRawFd`, `/dev/fd/N`) across:

- `src/distributed/*` (RDMA, tensor/pipeline parallel transport)
- `src/server/media.rs`, `src/multimodal/video.rs`
- `src/downloader/`, `src/execution/runtime.rs`

Some are already gated (`src/server/media.rs:324` `#[cfg(unix)]`, multiple `#[cfg(unix)]` in `src/multimodal/video.rs`, and a `cfg!(target_os = "windows")` branch already exists at `src/distributed/rdma_capabilities.rs:39`). The distributed (tensor/pipeline parallelism) path is deeply tied to Unix networking primitives and is **compiled out on Windows for the initial scope** (see the Scope premise above), then revisited as a Phase 6 follow-up.

## Prerequisites to prepare (toolchain)

- [ ] Visual Studio 2022 with the "Desktop development with C++" workload (MSVC v143). CUDA's `nvcc` on Windows requires the MSVC `cl.exe` host compiler — MinGW/GNU is not a CUDA-supported host.
- [ ] CUDA Toolkit 12.x or newer (MLX requires >= 12.0; our Linux releases target CUDA 13). Includes `nvcc`, cudart, cublas, cufft, nvrtc.
- [ ] cuDNN matching the CUDA version.
- [ ] NVIDIA driver >= 550.
- [ ] CMake (latest), Git, and Git Bash (in case MLX's shell-based codegen scripts are still on the CUDA path).
- [ ] Rust toolchain with the `x86_64-pc-windows-msvc` target (NOT the GNU target — must link MSVC-built CUDA/MLX libs).
- [ ] OpenBLAS/LAPACK for Windows (e.g. via vcpkg) plus CMake hints so it is discoverable.

## Phased plan

### Phase 0 — Environment
- [ ] Stand up a Windows host with an NVIDIA GPU and the toolchain above.
- [ ] Confirm `nvcc --version`, `cl.exe`, `cmake --version`, `rustc --version --verbose` (host = msvc), `nvidia-smi`.

### Phase 1 — GATE: build vanilla MLX on Windows+CUDA (standalone, no mlxcel)
- [ ] Check out MLX at pinned commit `84961223c02925bef6bef95d3a0a046779bde935` and attempt a CUDA build with `-DMLX_BUILD_CUDA=ON` directly (no mlxcel).
- [ ] Resolve / document each blocker hit (OpenBLAS detection, MSVC `_Complex`/lapack, preamble codegen, `io/load.h`, JIT compile host discovery).
- [ ] **Decision point:** record whether vanilla MLX can be built (with how many local patches), or whether this is blocked pending upstream. If blocked, document evidence and stop; that is a valid close.

### Phase 2 — mlxcel build glue (`build.rs`)
- [ ] Add a `cfg(windows)` CUDA link branch: `%CUDA_PATH%\lib\x64`, `.lib` import-lib names, no `stubs/`.
- [ ] Add Windows nvcc path resolution (`%CUDA_PATH%\bin\nvcc.exe`).
- [ ] Add a Windows CPU/BLAS link branch (and decide BLAS source).
- [ ] Add MSVC compile flags for the cxx bridge (`/std:c++20`, `/O2`) so C++20 + optimization are actually applied.

### Phase 3 — Rust platform gating
- [ ] Audit the ~55 Unix API usages; `cfg`-gate or provide Windows fallbacks.
- [ ] Compile out the distributed (tensor/pipeline parallel) path on Windows per the Scope premise; track the Phase 6 follow-up to restore it.
- [ ] Ensure `mlxcel-server` and CLI build with distributed gated off.

### Phase 4 — dependent C++ crates
- [ ] Verify `sentencepiece` (`Cargo.toml:77`, builds a C++ lib via CMake) compiles under MSVC. (`tokenizers`, `llguidance`, `toktrie` are pure Rust and should be fine.)

### Phase 5 — end-to-end smoke test
- [ ] `cargo build --release --target x86_64-pc-windows-msvc --features cuda` produces both binaries.
- [ ] `mlxcel --version` / `mlxcel-server --version`.
- [ ] `mlxcel download mlx-community/Qwen3-0.6B-4bit` then `mlxcel generate -m ... -p "Hello" -n 16` produces tokens on GPU.
- [ ] `mlxcel-server` serves a `/v1/chat/completions` request.

### Phase 6 — follow-ups (out of scope for first close, file separately)
- CI Windows job, packaging, optional code signing.
- Restore distributed features on Windows.
- Reconcile CUDA 12.x vs 13.x version matrix with the Linux release.

## Acceptance criteria

This issue is complete when **either**:

1. A native `x86_64-pc-windows-msvc` `--features cuda` build of `mlxcel` + `mlxcel-server` builds and passes the Phase 5 smoke tests on an NVIDIA Windows host (end-to-end functional **single-node** binary; per the Scope premise, distributed inference is compiled out on Windows and its absence does not block this issue — it is tracked as a Phase 6 follow-up); **or**
2. Phase 1 produces a documented go/no-go determination showing native MLX-on-Windows is not feasible at the pinned commit without unacceptable local patching, with the specific blockers and evidence recorded for a future re-attempt.

## Risks & fallback

- Upstream MLX may be a hard blocker; carrying local MLX patches across the pinned-commit upgrade process adds maintenance cost.
- Distributed inference is unlikely to work natively on Windows initially.
- **Fallback already supported:** WSL2 (Ubuntu) + NVIDIA WSL CUDA driver runs the documented Linux CUDA path (`cargo build --release --features cuda`) with none of the above work. This issue exists specifically because a native `.exe` is wanted instead of the WSL2 path.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Native Windows + CUDA build feasibility spike and porting plan (x86_64-pc-windows-msvc) #58

Goal

Scope premise — distributed inference is OFF on Windows initially

Context

Layer 1 — Upstream MLX does not build on Windows (hard gate)

Layer 2 — `mlxcel` build glue is Linux-only

Layer 3 — Rust code has Unix dependencies

Prerequisites to prepare (toolchain)

Phased plan

Phase 0 — Environment

Phase 1 — GATE: build vanilla MLX on Windows+CUDA (standalone, no mlxcel)

Phase 2 — mlxcel build glue (`build.rs`)

Phase 3 — Rust platform gating

Phase 4 — dependent C++ crates

Phase 5 — end-to-end smoke test

Phase 6 — follow-ups (out of scope for first close, file separately)

Acceptance criteria

Risks & fallback

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat: Native Windows + CUDA build feasibility spike and porting plan (x86_64-pc-windows-msvc) #58

Description

Goal

Scope premise — distributed inference is OFF on Windows initially

Context

Layer 1 — Upstream MLX does not build on Windows (hard gate)

Layer 2 — mlxcel build glue is Linux-only

Layer 3 — Rust code has Unix dependencies

Prerequisites to prepare (toolchain)

Phased plan

Phase 0 — Environment

Phase 1 — GATE: build vanilla MLX on Windows+CUDA (standalone, no mlxcel)

Phase 2 — mlxcel build glue (build.rs)

Phase 3 — Rust platform gating

Phase 4 — dependent C++ crates

Phase 5 — end-to-end smoke test

Phase 6 — follow-ups (out of scope for first close, file separately)

Acceptance criteria

Risks & fallback

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Layer 2 — `mlxcel` build glue is Linux-only

Phase 2 — mlxcel build glue (`build.rs`)