Pr/cross platform cuda gradio Windows / Linux / CUDA cross-platform support + Gradio Explorer UI #1

Open
cronos3k wants to merge 10 commits into main from pr/cross-platform-cuda-gradio

@cronos3k
Owner

Following your video yesterday — watched it, spent the day on this,
here's what came out of it.

What's in this PR

10 commits, each self-contained:

  1. fix(compute/build) — CARGO_CFG_* env vars for correct cross-compilation platform detection
  2. fix(compute) — q4_dot.c: MSVC-safe memcpy + scalar x86 fallback alongside ARM NEON path
  3. feat(compute) — per-platform BLAS: system OpenBLAS on Linux, matrixmultiply on Windows (no install required)
  4. feat(compute) — CUDA/cuBLAS backend, --features cuda, priority: Metal > CUDA > CPU
  5. refactor — replace every hardcoded CpuBackend with default_backend() so GPU features actually fire
  6. feat(models) — Windows HF cache path fix, HF_HOME/HUGGINGFACE_HUB_CACHE support, auto-download via hf-hub
  7. ci — GitHub Actions Linux x86_64 build on ubuntu-22.04 (glibc 2.35, Bookworm-compatible)
  8. chore — .gitignore: exclude vindex dirs, model weights, demo logs
  9. feat(demo) — Gradio 6 Explorer UI: 6 tabs, HF Space Dockerfile, no Rust reimplementation
  10. docs — README: HF Space link

What is not changed

  • No LQL query language changes
  • No vindex format changes
  • No changes to extract-index or walk logic
  • macOS / Metal path is untouched

Tested on

  • Windows 11, MSVC, CUDA 12.x, RTX 3080
  • Ubuntu 22.04 (GitHub Actions CI, CPU only)
  • HuggingFace Spaces Docker (cpu-basic, glibc 2.35)

Live demo

https://huggingface.co/spaces/cronos3k/LARQL-Explorer
Happy to split into separate PRs or adjust anything you'd like to change.

ghmk added 10 commits April 16, 2026 13:16
…fe platform detection

build.rs previously used #[cfg(target_arch = "...")] attributes directly inside
the build script. A build script is compiled for and run on the host, so these
attributes reflect the HOST architecture, not the TARGET, which causes incorrect
behaviour during cross-compilation.

Changed to read CARGO_CFG_TARGET_ARCH and CARGO_CFG_TARGET_ENV environment
variables, which Cargo sets to the actual target platform.

Also added the missing cargo:rerun-if-changed directive so the build script
re-runs whenever q4_dot.c is modified.

In lib.rs, guarded `extern crate blas_src` with #[cfg(unix)] so Windows builds
do not attempt to link a BLAS library that is not present on that platform.
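
The host-versus-target distinction can be sketched as follows. This is a minimal illustration, not the PR's actual build.rs: the function names, the `q4_neon` / `q4_msvc` cfg names, and the `csrc/q4_dot.c` path are hypothetical. Only the `CARGO_CFG_TARGET_ARCH` / `CARGO_CFG_TARGET_ENV` variables and the `cargo:rerun-if-changed` / `cargo:rustc-cfg` directives are standard Cargo behaviour.

```rust
use std::env;

/// Build the list of directives a build script would print to stdout.
/// `target_arch` and `target_env` describe the TARGET, not the host.
fn build_directives(target_arch: &str, target_env: &str) -> Vec<String> {
    // Re-run the script whenever the C kernel changes (hypothetical path).
    let mut out = vec!["cargo:rerun-if-changed=csrc/q4_dot.c".to_string()];
    if target_arch == "aarch64" {
        // Hypothetical cfg name: enable the NEON path for AArch64 targets only.
        out.push("cargo:rustc-cfg=q4_neon".to_string());
    }
    if target_env == "msvc" {
        // Hypothetical cfg name: lets the crate special-case the MSVC toolchain.
        out.push("cargo:rustc-cfg=q4_msvc".to_string());
    }
    out
}

/// What a build.rs `main` would call: read the target from Cargo's environment
/// instead of relying on #[cfg(...)], which describes the host in a build script.
#[allow(dead_code)]
fn directives_from_cargo_env() -> Vec<String> {
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    let tenv = env::var("CARGO_CFG_TARGET_ENV").unwrap_or_default();
    build_directives(&arch, &tenv)
}
```

In a real build.rs, `main` would print each returned line with `println!` so that Cargo picks the directives up.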

No algorithm changes. No behaviour changes on macOS or existing Linux builds.

…allback

Two portability issues prevented the file from compiling on Windows (MSVC)
and on non-ARM platforms:

1. __builtin_memcpy is a GCC/Clang extension. MSVC does not recognise it.
   Replaced every use with the standard <string.h> memcpy, which is equally
   safe and is inlined by all major compilers at -O2 (/O2 on MSVC) or higher.

2. The ARM NEON dot-product path (vdotq_s32) was the only implementation.
   Added a portable scalar path in the #else branch of #ifdef __aarch64__, so
   it activates on any non-AArch64 target (x86_64 Windows, x86_64 Linux,
   RISC-V, etc.).
   The decode_f16 helper is now defined unconditionally and shared by both
   paths.
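
For readers unfamiliar with this kind of kernel, here is a rough Rust sketch of what a scalar fallback and a shared half-precision decoder can look like. It is not a port of q4_dot.c. The block layout is an assumption (32 four-bit weights packed into 16 bytes, stored with an offset of 8, one f16 scale per block, in the style of ggml's Q4_0), and the function `q4_block_dot_scalar` is hypothetical. Only the name `decode_f16` comes from the commit message.

```rust
/// Decode an IEEE 754 binary16 value, given as raw bits, into an f32.
fn decode_f16(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac * 2.0f32.powi(-24),
        // All-ones exponent: infinity or NaN.
        0x1f => {
            if frac == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal numbers: (1 + frac / 2^10) * 2^(exp - 15).
        _ => sign * (1.0 + frac / 1024.0) * 2.0f32.powi(exp - 15),
    }
}

/// Scalar dot product of one assumed Q4 block with 32 int8 activations.
/// Assumed layout: low nibble of byte i is element i, high nibble is
/// element i + 16, and each nibble stores (value + 8).
fn q4_block_dot_scalar(scale_bits: u16, packed: &[u8; 16], x: &[i8; 32]) -> f32 {
    let mut acc: i32 = 0;
    for (i, &byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0f) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        acc += lo * x[i] as i32 + hi * x[i + 16] as i32;
    }
    // The integer accumulation is exact; the scale is applied once per block.
    decode_f16(scale_bits) * acc as f32
}
```

Accumulating in integers and applying the scale once per block is what lets a scalar path agree with a SIMD integer dot product under this layout.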

No changes to the algorithm or numerical results on ARM. The scalar path
produces identical results to the ARM path, just without SIMD acceleration.

…ixmultiply on Windows

Previously the codebase assumed macOS Accelerate everywhere and would fail
to compile on Linux or Windows due to unconditional blas-src/accelerate
dependencies.

Changes per platform, using Cargo's [target.'cfg(...)'.dependencies] tables:

  macOS   — unchanged: Accelerate framework, ships with every Mac.
  Linux   — blas-src + openblas-src with the "system" feature flag.
             Links the installed system library (apt install libopenblas-dev).
             Does NOT build OpenBLAS from source (avoids 10+ minute CI builds).
  Windows — pure ndarray/matrixmultiply backend, no external library required.
             Performant enough for CPU-only extraction; BLAS can be added later
             via openblas-src with the "static" feature if needed.

cpu::device_info() updated to report the actual backend in use per OS.
Feature flags (metal, cuda) are also threaded through larql-cli/Cargo.toml
so they can be activated from the workspace root with --features metal/cuda.
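
As an illustration of reporting the backend per OS, a helper along these lines would do. The function names and the returned strings are hypothetical and are not taken from larql-compute. `std::env::consts::OS` is the standard-library constant naming the OS the binary was compiled for.

```rust
/// Map an OS name (as found in `std::env::consts::OS`) to the CPU BLAS
/// backend described in this commit. Strings are illustrative only.
fn blas_backend_for(target_os: &str) -> &'static str {
    match target_os {
        "macos" => "Accelerate",
        "linux" => "OpenBLAS (system)",
        "windows" => "matrixmultiply (pure Rust)",
        _ => "unknown",
    }
}

/// Hypothetical stand-in for a device_info()-style report.
fn cpu_device_info() -> String {
    format!("cpu [{}]", blas_backend_for(std::env::consts::OS))
}
```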

No changes to any algorithm or numerical behaviour.

Adds a new compute backend backed by cudarc + cuBLAS for f32 GEMM.
The CUDA path accelerates the two hottest operations during vindex
extraction: the down_meta projection and the embedding similarity pass.

Implementation notes:

  Row-major (ndarray) → column-major (cuBLAS) conversion:
    C[m×n] = A[m×k] · B[k×n]
    is equivalent to
    C^T[n×m] = B_colmaj[n×k] · A_colmaj[k×m]
  so A and B are swapped in the cuBLAS call with M and N also swapped,
  keeping OP_N for both operands. No explicit transpose is performed.
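
The operand swap can be checked on the CPU without any CUDA dependency. The sketch below uses a naive column-major GEMM as a stand-in for the cuBLAS call; both function names are hypothetical and this is not the PR's code. A row-major m×k buffer read as column-major is the k×m transpose, so passing B first and A second, with M and N swapped, writes C^T in column-major order, which is the same memory as C in row-major order.

```rust
/// Naive column-major GEMM, C = A * B, no transposes (the OP_N / OP_N case).
/// A is m x k with leading dimension lda, B is k x n with ldb, C is m x n with ldc.
#[allow(clippy::too_many_arguments)]
fn gemm_col_major(
    m: usize, n: usize, k: usize,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

/// Row-major C[m x n] = A[m x k] * B[k x n] using only the column-major
/// routine: swap the operands and swap M with N, with no explicit transpose.
fn matmul_row_major(m: usize, n: usize, k: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    // B (row-major k x n) is seen as column-major n x k, leading dimension n.
    // A (row-major m x k) is seen as column-major k x m, leading dimension k.
    // The n x m column-major result has leading dimension n.
    gemm_col_major(n, m, k, b, n, a, k, &mut c, n);
    c
}
```

For example, with A = [[1, 2, 3], [4, 5, 6]] and B = [[7, 8], [9, 10], [11, 12]], `matmul_row_major(2, 2, 3, &a, &b)` yields the row-major product [[58, 64], [139, 154]].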

  Q4 operations are not yet implemented on GPU; they fall back to the
  existing CPU scalar kernel automatically.

  CudaBackend::new() returns None if no CUDA device is found, allowing
  default_backend() to fall back to CPU transparently.

Feature flag: cargo build --release --features cuda
Requires: CUDA toolkit ≥ 12.0, cuBLAS, a CUDA-capable GPU.
Not available on macOS — use --features metal there instead.

default_backend() priority is now: Metal > CUDA > CPU.

All internal call sites that created larql_compute::CpuBackend directly
have been replaced with larql_compute::default_backend().

Without this change, building with --features cuda or --features metal
compiles the GPU backend but never uses it: every matmul inside larql-vindex
and larql-inference still dispatches to the CPU. This refactor closes that gap.

No algorithmic changes. On a CPU-only build, default_backend() returns
CpuBackend as before, so behaviour is identical for existing users.
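
The selection order can be pictured with the following sketch. It is a simplified model under stated assumptions, not larql-compute's API: the trait, the struct definitions, the `device_count` parameter, and `select_backend` are all hypothetical, and the real code presumably gates the GPU types behind Cargo features. Only the names CpuBackend and CudaBackend, and the fact that CudaBackend::new() returns None when no device is found, come from the commit messages.

```rust
/// Hypothetical minimal backend interface.
trait ComputeBackend {
    fn name(&self) -> &'static str;
}

struct CpuBackend;
struct CudaBackend;
struct MetalBackend;

impl ComputeBackend for CpuBackend {
    fn name(&self) -> &'static str { "cpu" }
}
impl ComputeBackend for CudaBackend {
    fn name(&self) -> &'static str { "cuda" }
}
impl ComputeBackend for MetalBackend {
    fn name(&self) -> &'static str { "metal" }
}

impl CudaBackend {
    /// Returns None when no device is available (device_count is a stand-in
    /// for a real driver query).
    fn new(device_count: usize) -> Option<Self> {
        if device_count > 0 { Some(CudaBackend) } else { None }
    }
}

/// Priority: Metal > CUDA > CPU. The CPU backend is always available.
fn select_backend(
    metal: Option<MetalBackend>,
    cuda: Option<CudaBackend>,
) -> Box<dyn ComputeBackend> {
    if let Some(m) = metal {
        return Box::new(m);
    }
    if let Some(c) = cuda {
        return Box::new(c);
    }
    Box::new(CpuBackend)
}
```

Because call sites only see the boxed trait object, a build without GPU features (or a machine without a device) ends up on the CPU path without any change to the callers.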

Also adds #[cfg(unix)] guards to the blas_src extern crate declaration in
example and bench files that previously assumed a Unix host.

… auto-download

Three related improvements to model resolution in safetensors.rs:

1. Windows HF cache path
   The previous code used $HOME/.cache/huggingface/hub, which does not resolve
   on Windows (HOME is normally unset there; the equivalent is USERPROFILE).
   Resolution order is
   now: HUGGINGFACE_HUB_CACHE → HF_HOME/hub → $HOME/.cache/huggingface/hub
   (Unix) / %USERPROFILE%\.cache\huggingface\hub (Windows). This matches
   the behaviour of the official huggingface-hub Python and Rust libraries.

2. HF_HOME / HUGGINGFACE_HUB_CACHE support
   Both env vars are now respected per the HuggingFace caching spec, so
   users with non-default cache locations don't need to copy files.

3. Auto-download via hf-hub
   When a model string looks like a HuggingFace repo ID (contains '/') and
   is not found in the local cache, the model is now downloaded automatically
   using the hf-hub crate. HF_TOKEN is forwarded if set. This removes the
   need to manually download models before running larql extract-index.
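
The resolution order described in items 1 and 2 can be sketched as below. This is an illustration rather than the code in safetensors.rs: the function names are hypothetical, the environment lookup is injected so the logic can be exercised without touching the real environment, and treating empty variables as unset is an assumption. The `looks_like_repo_id` helper shows only the "contains '/'" heuristic from item 3; the local-cache check and the download itself are omitted.

```rust
use std::path::PathBuf;

/// Resolve the Hugging Face hub cache directory.
/// Order: HUGGINGFACE_HUB_CACHE, then HF_HOME/hub, then the per-user default
/// under HOME (Unix) or USERPROFILE (Windows).
fn resolve_hf_cache_with<F>(lookup: F, is_windows: bool) -> Option<PathBuf>
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: an empty value is treated the same as an unset variable.
    let non_empty = |key: &str| lookup(key).filter(|v| !v.is_empty());

    if let Some(dir) = non_empty("HUGGINGFACE_HUB_CACHE") {
        return Some(PathBuf::from(dir));
    }
    if let Some(home) = non_empty("HF_HOME") {
        return Some(PathBuf::from(home).join("hub"));
    }
    let home_var = if is_windows { "USERPROFILE" } else { "HOME" };
    non_empty(home_var)
        .map(|h| PathBuf::from(h).join(".cache").join("huggingface").join("hub"))
}

/// Convenience wrapper that reads the real process environment.
#[allow(dead_code)]
fn resolve_hf_cache() -> Option<PathBuf> {
    resolve_hf_cache_with(|k| std::env::var(k).ok(), cfg!(windows))
}

/// Heuristic from the commit message: a model string containing '/'
/// is treated as a candidate repo ID.
fn looks_like_repo_id(model: &str) -> bool {
    model.contains('/')
}
```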

larql-models/Cargo.toml: added hf-hub = "0.5" dependency (was already
present in larql-vindex; aligning both crates).

Adds a workflow that builds the larql CLI binary for Linux x86_64 on
every push to main and on manual dispatch.

Runner: ubuntu-22.04
  Pinned to 22.04 rather than latest to produce a binary linked against
  glibc 2.35, which is compatible with Debian Bookworm, Ubuntu 22.04+,
  and other currently-supported distributions. A glibc 2.39 binary
  (from ubuntu-24.04) would typically fail to load on Bookworm, which
  ships glibc 2.36.

Dependencies installed: libopenblas-dev, pkg-config, libssl-dev.

Artefacts:
  - Binary uploaded as a workflow artefact (90-day retention).
  - A rolling GitHub prerelease tagged latest-linux is created/updated
    with the binary, so it can be fetched from a stable URL by external
    tools (e.g. a HuggingFace Space Dockerfile).

Requires the GITHUB_TOKEN secret, which is provided automatically by
GitHub Actions — no additional setup needed.

…logs

Added entries for artefacts that are generated locally and should not be
tracked:

  models/       — downloaded HuggingFace model weights (can be multi-GB)
  *.vindex/     — extracted vindex directories (binary data)
  demo/*.log    — Gradio and subprocess logs written by the demo app

Also fixes a missing newline at end of file from the original .gitignore.

… config

Adds a self-contained web interface for exploring vindexes interactively,
located in demo/. No changes to any Rust crate.

Six tabs:

  Walk Explorer     — per-layer FFN feature activation for a prompt
  Knowledge Probe   — compare how three prompts encode at the same layer
  LQL Console       — run raw LQL queries against the vindex
  Vindex Info       — metadata, layer count, model family from index.json
  Extract           — trigger larql extract-index from the UI
  Setup & About     — build instructions, binary check, environment info

Key implementation details:

  demo/app.py       — Gradio 6 Blocks app, ~640 lines. Calls the larql
                      binary as a subprocess; no Python reimplementation
                      of any Rust logic. Results are parsed and displayed
                      via gr.HTML (avoids Gradio DataFrame JS issues).
  demo/utils.py     — Output parsers for larql walk, verify, lql output.
                      Also provides vindex discovery and index.json loading.
  demo/Dockerfile   — Docker image for HuggingFace Spaces (Docker SDK).
                      Downloads the pre-built Linux binary at image build
                      time from the latest-linux GitHub release; no Rust
                      toolchain required in the image.
  demo/setup.sh     — Local setup helper: builds the binary and installs
                      Python deps.
  demo/hf_space/    — Minimal HuggingFace Space configuration template.
                      Maintainers wanting to deploy their own Space can copy
                      this directory and adjust the repo URLs.

The demo auto-downloads a small demo vindex from HuggingFace Hub on first
start if no local vindex is found (requires huggingface_hub Python package).

Live reference deployment: https://huggingface.co/spaces/cronos3k/LARQL-Explorer

Adds a one-line reference to the live LARQL Explorer Space so users
reading the README can try the tool without building from source.
@workturnedplay

I'm a bit confused: was this PR meant to be made on the original repo instead of on your own fork? I might be missing something here.
