Support Windows and Linux x86_64 binary builds and release artifacts

## Summary

Extend mlxcel's release build matrix to cover Linux x86_64 and Windows in addition to the current macOS aarch64 + Linux aarch64 (CUDA) targets. This aligns the distribution surface with the Backend.AI:GO desktop application and makes mlxcel deployable to customer sites that standardize on Linux x86_64 (the dominant CUDA host architecture in production).

## Background

Today's releases ship two artifact families:

- `mlxcel-macos-aarch64.zip` — Apple Silicon
- `mlxcel-linux-aarch64-cuda13-{gb10,gh200}` — Linux aarch64 + CUDA (Grace Blackwell / Grace Hopper)

The README prerequisites already claim "Linux (aarch64 or x86_64)" support, but no published binary exists for x86_64 today. Windows is not built at all, although the runtime distribution layout (MLX upstream's `mlx/backend/cuda/delayload.cpp` delay-load mechanism + NVIDIA DLL bundling, ~1.0–1.3 GB total) has been worked out.

Two motivations:

1. **Customer fit** — Most enterprise CUDA hosts (RTX, A100, L40, H100/H200 outside of GH200 Grace systems) are Linux x86_64. Without a published x86_64 binary, those sites cannot adopt mlxcel without building from source.
2. **Backend.AI:GO parity** — The Backend.AI:GO desktop application supports macOS + Windows + Linux x86_64. mlxcel should match so the same runtime can serve any Backend.AI:GO target.

## Proposed Solution

Split the work into three deliverables.

### A. Linux x86_64 + CUDA build

- Add a release artifact `mlxcel-linux-x86_64-cuda13.{tar.gz|zip}` produced from an x86_64 + NVIDIA host (build-only is acceptable for the first iteration; smoke test on a real GPU is preferred).
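- Decide on the target CUDA architectures. SM 80/86/89/90 covers Ampere, Ada, and Hopper; covering the RTX 50 series (Blackwell) listed under deliverable C would require adding a newer SM target. The existing CUDA arch matrix in `README.md` already documents non-Hopper quantization limitations.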
- Verify `src/lib/mlxcel-core/build.rs` and the bundled MLX C++ build produce equivalent output on x86_64 — expected to be mechanical because Linux paths already differentiate from macOS via `#[cfg(target_os = "linux")]`.
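
As an illustration of the architecture decision above, here is a minimal sketch of how a build script might turn the proposed SM list into a CMake setting. It assumes the bundled MLX C++ build honors the standard `CMAKE_CUDA_ARCHITECTURES` variable; the constant and function names are hypothetical and are not taken from the existing `build.rs`.

```rust
/// SM targets proposed in this issue (assumption: the final list may differ).
/// 80 = A100, 86 = RTX 30 series, 89 = Ada (RTX 40 / L40), 90 = Hopper.
const PROPOSED_SM_TARGETS: [u32; 4] = [80, 86, 89, 90];

/// Joins SM numbers into the semicolon-separated list that CMake expects
/// for CMAKE_CUDA_ARCHITECTURES.
fn cuda_arch_list(sms: &[u32]) -> String {
    sms.iter()
        .map(|sm| sm.to_string())
        .collect::<Vec<_>>()
        .join(";")
}

/// Formats the matching CMake command-line definition.
/// Hypothetical helper: the real build may pass this value differently.
fn cmake_cuda_arch_flag(sms: &[u32]) -> String {
    format!("-DCMAKE_CUDA_ARCHITECTURES={}", cuda_arch_list(sms))
}
```

Keeping the list in one constant would make it straightforward to add a Blackwell target later without touching the formatting logic.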

### B. Windows + CUDA build

- Extend `src/lib/mlxcel-core/build.rs` with `#[cfg(target_os = "windows")]` branches:
  - MSVC toolchain detection (vs. GCC/Clang on Linux)
  - `CUDA_PATH` env var (Windows convention) in addition to `CUDA_HOME`
  - Link against `cudart.lib` and friends; static-vs-dynamic linkage decision mirrors MLX's Windows build
- Add Windows handling in the bundled MLX C++ build for OpenBLAS FetchContent and the delay-load DLL compile definitions (`MLX_CUDA_BIN_DIR`, `MLX_CUDNN_BIN_DIR`).
- Package the Windows artifact (`mlxcel.exe` + `nvidia/cublas/bin/...` etc.) per the runtime distribution layout. Bundle size is expected to be ~1.0–1.3 GB because of cuBLAS + NVRTC + cuDNN — fits within the 2 GB-per-asset GitHub Release limit.
- Code signing — investigate `signtool.exe` / Azure Code Signing, or ship unsigned for the first release with a documented SmartScreen warning.
- Confirm all crates compile on `x86_64-pc-windows-msvc`: `tokenizers`, `safetensors`, `axum`, `cxx`, `sentencepiece-sys`, and the multimodal stack (ffmpeg / video frame extraction may need a Windows-friendly variant).
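
The `CUDA_PATH` / `CUDA_HOME` handling above could look roughly like the following sketch. It is not the current `build.rs`: the function names are hypothetical, and the lookup order (prefer `CUDA_PATH` on Windows, `CUDA_HOME` elsewhere, fall back to the other) is an assumption about the desired behavior.

```rust
use std::path::PathBuf;

/// Picks the CUDA toolkit root from already-read environment values.
/// Assumed policy: Windows prefers CUDA_PATH, other targets prefer CUDA_HOME,
/// and each falls back to the other variable. Blank values are ignored.
fn resolve_cuda_root(
    target_os: &str,
    cuda_path: Option<&str>,
    cuda_home: Option<&str>,
) -> Option<PathBuf> {
    let ordered = if target_os == "windows" {
        [cuda_path, cuda_home]
    } else {
        [cuda_home, cuda_path]
    };
    ordered
        .iter()
        .copied()
        .flatten()
        .find(|value| !value.trim().is_empty())
        .map(PathBuf::from)
}

/// Reads the real environment. Cargo sets CARGO_CFG_TARGET_OS for build
/// scripts; outside a build script this falls back to the host OS.
fn cuda_root_from_env() -> Option<PathBuf> {
    let target_os = std::env::var("CARGO_CFG_TARGET_OS")
        .unwrap_or_else(|_| std::env::consts::OS.to_string());
    let cuda_path = std::env::var("CUDA_PATH").ok();
    let cuda_home = std::env::var("CUDA_HOME").ok();
    resolve_cuda_root(&target_os, cuda_path.as_deref(), cuda_home.as_deref())
}
```

One design point worth noting: inside a build script, `#[cfg(target_os = ...)]` reflects the platform the script itself is compiled for (the host), while `CARGO_CFG_TARGET_OS` describes the compilation target. For native builds the two agree, so either approach works for the artifacts proposed here.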

### C. Documentation + README updates

- Update `README.md` prerequisites to enumerate Linux aarch64, Linux x86_64, and Windows distinctly rather than the current "Linux (aarch64 or x86_64)" shorthand.
- Add a `docs/windows-build-guide.md` for developers building from source on Windows.
- Update the CUDA architecture compatibility table to reflect the broader x86_64 GPU coverage (RTX 30/40/50 series, A100, L40, H100).

## Implementation Notes

- **Code surface that already touches Windows**
  - `src/distributed/rdma_capabilities.rs` already has `cfg!(target_os = "windows")` branches. The rest of `src/distributed/` should be audited for Windows-specific socket / transport quirks before claiming pipeline parallelism works on Windows.
  - No other `target_os = "windows"` gates exist today.
- **MLX upstream Windows status**: MLX has the delay-load CUDA mechanism (`mlx/backend/cuda/delayload.cpp`), but full correctness on Windows may require additional upstream patches.
- **Self-hosted runner needs**: an x86_64 + NVIDIA Linux runner (an RTX-class GPU is enough for building, but smoke testing benefits from Hopper or Blackwell). The Windows + NVIDIA runner should ideally have a comparable GPU.
- **Build-only vs. test-on-platform**: for the first iteration, a build plus a smoke test (model load + 10-token generate) on each platform is sufficient. Full benchmark + parity runs can land in a follow-up.
- **Rust target triples**
  - Linux x86_64: `x86_64-unknown-linux-gnu`
  - Windows: `x86_64-pc-windows-msvc`
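
To make the relationship between the target triples above and the artifact names in this issue concrete, a packaging step might use a lookup like the sketch below. The function is hypothetical; the release pipeline lives elsewhere and may name things differently. Only the two new targets are covered.

```rust
/// Maps a Rust target triple to the artifact base name proposed in this issue.
/// Returns None for triples this sketch does not cover.
fn artifact_base_name(triple: &str) -> Option<&'static str> {
    match triple {
        "x86_64-unknown-linux-gnu" => Some("mlxcel-linux-x86_64-cuda13"),
        "x86_64-pc-windows-msvc" => Some("mlxcel-windows-x86_64-cuda13"),
        _ => None,
    }
}

/// Appends the archive extension. Windows ships as .zip per the acceptance
/// criteria; using .tar.gz for Linux is an assumption, since the issue
/// leaves the choice between tar.gz and zip open.
fn artifact_file_name(triple: &str) -> Option<String> {
    let base = artifact_base_name(triple)?;
    let ext = if triple.contains("windows") { "zip" } else { "tar.gz" };
    Some(format!("{base}.{ext}"))
}
```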

The release pipeline lives in the development repository, not in this mirror; this issue tracks the user-visible deliverable. Implementation discussion is welcome here.

## Acceptance Criteria

- [ ] Release publishes `mlxcel-linux-x86_64-cuda13.{tar.gz|zip}`
- [ ] Release publishes `mlxcel-windows-x86_64-cuda13.zip` with bundled CUDA runtime DLLs
- [ ] Smoke test (`mlxcel generate` with a 1B 4-bit model, 10 tokens) succeeds on Linux x86_64 + NVIDIA GPU
- [ ] Smoke test succeeds on Windows + NVIDIA GPU
- [ ] `README.md` prerequisites section enumerates Linux aarch64, Linux x86_64, and Windows distinctly
- [ ] `docs/windows-build-guide.md` walks through a developer-side Windows build
- [ ] CUDA architecture compatibility table reflects x86_64 GPU coverage (Ampere, Ada, Hopper, Blackwell)
- [ ] Code-signing strategy decided for the Windows binary (signed / unsigned with documented warning / deferred)

---

## Original Suggestion

> Let's match the compatibility matrix with Backend.AI:GO desktop application, and also make it available to our customer sites which often use Linux x86-64 environments.

