Pr/cross platform cuda gradio Windows / Linux / CUDA cross-platform support + Gradio Explorer UI#1
Open
cronos3k wants to merge 10 commits into
added 10 commits
April 16, 2026 13:16
…fe platform detection
build.rs previously used #[cfg(target_arch = "...")] attributes directly inside the build script. A build script is compiled for and run on the host, so these attributes reflect the HOST architecture, not the TARGET, which causes incorrect behaviour during cross-compilation. Changed to read the CARGO_CFG_TARGET_ARCH and CARGO_CFG_TARGET_ENV environment variables, which Cargo sets to describe the actual target platform.
Also added the missing cargo:rerun-if-changed directive so the build script re-runs whenever q4_dot.c is modified.
In lib.rs, guarded `extern crate blas_src` with #[cfg(unix)] so Windows builds do not attempt to link a BLAS library that is not present on that platform.
No algorithm changes. No behaviour changes on macOS or existing Linux builds.
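The target-detection change described above could look roughly like the sketch below. It is an illustration only, not the PR's actual build.rs: the helper names `c_flags_for` and `run_build_script`, the compiler flags, and the `src/q4_dot.c` path are assumptions made for the example. Only the two `CARGO_CFG_*` variables and the `cargo:rerun-if-changed` directive come from the commit message.

```rust
use std::env;

/// Hypothetical helper: choose C compiler flags from the *target* platform.
/// The flag choices are illustrative assumptions, not taken from the PR.
pub fn c_flags_for(target_arch: &str, target_env: &str) -> Vec<&'static str> {
    let mut flags = Vec::new();
    if target_env == "msvc" {
        // MSVC uses slash-style options.
        flags.push("/O2");
    } else {
        flags.push("-O3");
        if target_arch == "aarch64" {
            // GCC/Clang need the dot-product extension enabled for vdotq_s32.
            flags.push("-march=armv8.2-a+dotprod");
        }
    }
    flags
}

/// What a build script's `main` might do, written as a plain function so it
/// can also be called outside Cargo (the variables are then simply empty).
pub fn run_build_script() -> Vec<&'static str> {
    // Cargo sets these for the platform being compiled *for*. A bare
    // #[cfg(target_arch = "...")] inside build.rs would describe the host.
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    let tenv = env::var("CARGO_CFG_TARGET_ENV").unwrap_or_default();

    // Re-run the script whenever the C source changes (path is an assumption).
    println!("cargo:rerun-if-changed=src/q4_dot.c");

    c_flags_for(&arch, &tenv)
}
```

In a real build.rs, `main` would call `run_build_script()` and hand the flags to whatever compiles the C file. Keeping the decision in a pure function of two strings makes the cross-compilation cases easy to check without a second toolchain.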
…allback
Two portability issues prevented the file from compiling on Windows (MSVC) and on non-ARM platforms:
1. __builtin_memcpy is a GCC/Clang extension. MSVC does not recognise it. Replaced every use with the standard <string.h> memcpy, which is safe and equally inlined by all major compilers at -O2 or higher.
2. The ARM NEON dot-product path (vdotq_s32) was the only implementation. Added a portable scalar path that activates on any non-ARM platform (x86_64 Windows, x86_64 Linux, RISC-V, etc.) via #ifdef __aarch64__. The decode_f16 helper is now defined unconditionally and shared by both paths.
No changes to the algorithm or numerical results on ARM. The scalar path produces identical results to the ARM path, just without SIMD acceleration.
…ixmultiply on Windows
Previously the codebase assumed macOS Accelerate everywhere and would fail
to compile on Linux or Windows due to unconditional blas-src/accelerate
dependencies.
Changes per platform, using Cargo's [target.'cfg(...)'.dependencies] tables:
macOS — unchanged: Accelerate framework, ships with every Mac.
Linux — blas-src + openblas-src with the "system" feature flag.
Links the installed system library (apt install libopenblas-dev).
Does NOT build OpenBLAS from source (avoids 10+ minute CI builds).
Windows — pure ndarray/matrixmultiply backend, no external library required.
Performant enough for CPU-only extraction; BLAS can be added later
via openblas-src with the "static" feature if needed.
cpu::device_info() updated to report the actual backend in use per OS.
Feature flags (metal, cuda) are also threaded through larql-cli/Cargo.toml
so they can be activated from the workspace root with --features metal/cuda.
No changes to any algorithm or numerical behaviour.
Adds a new compute backend backed by cudarc + cuBLAS for f32 GEMM.
The CUDA path accelerates the two hottest operations during vindex
extraction: the down_meta projection and the embedding similarity pass.
Implementation notes:
Row-major (ndarray) → column-major (cuBLAS) conversion:
C[m×n] = A[m×k] · B[k×n]
is equivalent to
C^T[n×m] = B_colmaj[n×k] · A_colmaj[k×m]
so A and B are swapped in the cuBLAS call with M and N also swapped,
keeping OP_N for both operands. No explicit transpose is performed.
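The operand swap above can be checked on the CPU without any GPU. The sketch below uses a plain column-major GEMM as a stand-in for the cuBLAS call (same argument roles, no transposes, alpha = 1, beta = 0) and feeds it row-major buffers with A and B swapped and M and N swapped. The function names are hypothetical and nothing here is taken from the PR's CUDA code.

```rust
/// Reference column-major GEMM, C[m x n] = A[m x k] * B[k x n], no transposes.
/// Stand-in for a BLAS-style sgemm with OP_N on both operands.
#[allow(clippy::too_many_arguments)]
pub fn gemm_colmajor(
    m: usize, n: usize, k: usize,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Column-major indexing: element (row, col) lives at row + col * ld.
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

/// Row-major C[m x n] = A[m x k] * B[k x n] using only the column-major routine.
pub fn matmul_rowmajor_via_colmajor(
    m: usize, n: usize, k: usize, a: &[f32], b: &[f32],
) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    // A row-major A[m x k] buffer, read as column-major, is A^T [k x m] (ld = k).
    // A row-major B[k x n] buffer, read as column-major, is B^T [n x k] (ld = n).
    // C^T [n x m] = B^T * A^T, and column-major C^T (ld = n) is row-major C.
    gemm_colmajor(n, m, k, b, n, a, k, &mut c, n);
    c
}

/// Straightforward row-major triple loop, used only as a cross-check.
pub fn matmul_rowmajor_naive(
    m: usize, n: usize, k: usize, a: &[f32], b: &[f32],
) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    c
}
```

With a 2×3 matrix A and a 3×4 matrix B, both functions return the same row-major 2×4 result, which is the property the commit relies on to avoid an explicit transpose.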
Q4 operations are not yet implemented on GPU; they fall back to the
existing CPU scalar kernel automatically.
CudaBackend::new() returns None if no CUDA device is found, allowing
default_backend() to fall back to CPU transparently.
Feature flag: cargo build --release --features cuda
Requires: CUDA toolkit ≥ 12.0, cuBLAS, a CUDA-capable GPU.
Not available on macOS — use --features metal there instead.
default_backend() priority is now: Metal > CUDA > CPU.
All internal call sites that created larql_compute::CpuBackend directly have been replaced with larql_compute::default_backend().
Without this change, building with --features cuda or --features metal compiles the GPU backend but never uses it: every matmul inside larql-vindex and larql-inference still dispatches to the CPU. This refactor closes that gap.
No algorithmic changes. On a CPU-only build, default_backend() returns CpuBackend as before, so behaviour is identical for existing users.
Also adds #[cfg(unix)] guards to the blas_src extern crate declaration in example and bench files that previously assumed a Unix host.
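The fallback chain described in the two commits above (a constructor that returns None when no device is present, and a default_backend() that tries Metal, then CUDA, then CPU) might be sketched as follows. The trait shape, the `name` method, and the boolean "device found" parameters are assumptions introduced so the example runs anywhere; the real constructors presumably probe the driver and take no arguments.

```rust
/// Minimal stand-in for the compute backend trait (hypothetical shape).
pub trait ComputeBackend {
    fn name(&self) -> &'static str;
}

pub struct CpuBackend;
pub struct CudaBackend;
pub struct MetalBackend;

impl ComputeBackend for CpuBackend {
    fn name(&self) -> &'static str { "cpu" }
}
impl ComputeBackend for CudaBackend {
    fn name(&self) -> &'static str { "cuda" }
}
impl ComputeBackend for MetalBackend {
    fn name(&self) -> &'static str { "metal" }
}

impl CudaBackend {
    /// Returns None when no device is available. The bool replaces a real
    /// device probe so the sketch can run on machines without a GPU.
    pub fn new(device_found: bool) -> Option<Self> {
        device_found.then_some(CudaBackend)
    }
}

impl MetalBackend {
    /// Same convention as CudaBackend::new (assumed for symmetry).
    pub fn new(device_found: bool) -> Option<Self> {
        device_found.then_some(MetalBackend)
    }
}

/// Priority order from the commit message: Metal > CUDA > CPU.
pub fn default_backend(metal_found: bool, cuda_found: bool) -> Box<dyn ComputeBackend> {
    if let Some(b) = MetalBackend::new(metal_found) {
        return Box::new(b);
    }
    if let Some(b) = CudaBackend::new(cuda_found) {
        return Box::new(b);
    }
    // No GPU backend could be constructed: fall back transparently.
    Box::new(CpuBackend)
}
```

Call sites that previously constructed `CpuBackend` directly would instead hold the boxed trait object, so the same matmul call dispatches to whichever backend was selected at startup.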
… auto-download
Three related improvements to model resolution in safetensors.rs:
1. Windows HF cache path
The previous code used $HOME/.cache/huggingface/hub, which does not exist on Windows (the env var is USERPROFILE, not HOME). Resolution order is now:
HUGGINGFACE_HUB_CACHE → HF_HOME/hub → $HOME/.cache/huggingface/hub (Unix) / %USERPROFILE%\.cache\huggingface\hub (Windows).
This matches the behaviour of the official huggingface-hub Python and Rust libraries.
2. HF_HOME / HUGGINGFACE_HUB_CACHE support
Both env vars are now respected per the HuggingFace caching spec, so users with non-default cache locations don't need to copy files.
3. Auto-download via hf-hub
When a model string looks like a HuggingFace repo ID (contains '/') and is not found in the local cache, the model is now downloaded automatically using the hf-hub crate. HF_TOKEN is forwarded if set. This removes the need to manually download models before running larql extract-index.
larql-models/Cargo.toml: added hf-hub = "0.5" dependency (was already present in larql-vindex; aligning both crates).
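The resolution order listed above can be expressed as a small pure function. This is a sketch under assumptions, not the code in safetensors.rs: the function names are hypothetical, and the environment lookup is passed in as a closure (and the platform as a flag) so the precedence can be exercised without touching the real environment.

```rust
use std::path::PathBuf;

/// Resolve the HuggingFace hub cache directory using the order from the
/// commit message: HUGGINGFACE_HUB_CACHE, then HF_HOME/hub, then the
/// per-user default under HOME (Unix) or USERPROFILE (Windows).
/// `get_env` is a hypothetical injection point for environment lookups.
pub fn hf_hub_cache_dir<F>(get_env: F, is_windows: bool) -> Option<PathBuf>
where
    F: Fn(&str) -> Option<String>,
{
    if let Some(dir) = get_env("HUGGINGFACE_HUB_CACHE") {
        // Highest priority: points directly at the hub cache.
        return Some(PathBuf::from(dir));
    }
    if let Some(home) = get_env("HF_HOME") {
        // HF_HOME is the parent directory; the hub cache is its "hub" child.
        return Some(PathBuf::from(home).join("hub"));
    }
    let home_var = if is_windows { "USERPROFILE" } else { "HOME" };
    get_env(home_var).map(|home| {
        PathBuf::from(home)
            .join(".cache")
            .join("huggingface")
            .join("hub")
    })
}

/// The commit's heuristic for "looks like a repo ID": the string contains '/'.
pub fn looks_like_repo_id(model: &str) -> bool {
    model.contains('/')
}
```

In production code the call would be something like `hf_hub_cache_dir(|k| std::env::var(k).ok(), cfg!(windows))`. Returning `Option` leaves the caller to decide what to do when none of the variables is set.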
Adds a workflow that builds the larql CLI binary for Linux x86_64 on
every push to main and on manual dispatch.
Runner: ubuntu-22.04
Pinned to 22.04 rather than latest to produce a binary linked against
glibc 2.35, which is compatible with Debian Bookworm, Ubuntu 22.04+,
and other currently-supported distributions. A glibc 2.39 binary
(from ubuntu-24.04) would not run on Bookworm.
Dependencies installed: libopenblas-dev, pkg-config, libssl-dev.
Artefacts:
- Binary uploaded as a workflow artefact (90-day retention).
- A rolling GitHub prerelease tagged latest-linux is created/updated
with the binary, so it can be fetched from a stable URL by external
tools (e.g. a HuggingFace Space Dockerfile).
Requires the GITHUB_TOKEN secret, which is provided automatically by
GitHub Actions — no additional setup needed.
…logs
Added entries for artefacts that are generated locally and should not be tracked:
models/ - downloaded HuggingFace model weights (can be multi-GB)
*.vindex/ - extracted vindex directories (binary data)
demo/*.log - Gradio and subprocess logs written by the demo app
Also fixes a missing newline at end of file from the original .gitignore.
… config
Adds a self-contained web interface for exploring vindexes interactively,
located in demo/. No changes to any Rust crate.
Six tabs:
Walk Explorer — per-layer FFN feature activation for a prompt
Knowledge Probe — compare how three prompts encode at the same layer
LQL Console — run raw LQL queries against the vindex
Vindex Info — metadata, layer count, model family from index.json
Extract — trigger larql extract-index from the UI
Setup & About — build instructions, binary check, environment info
Key implementation details:
demo/app.py — Gradio 6 Blocks app, ~640 lines. Calls the larql
binary as a subprocess; no Python reimplementation
of any Rust logic. Results are parsed and displayed
via gr.HTML (avoids Gradio DataFrame JS issues).
demo/utils.py — Output parsers for larql walk, verify, lql output.
Also provides vindex discovery and index.json loading.
demo/Dockerfile — Docker image for HuggingFace Spaces (Docker SDK).
Downloads the pre-built Linux binary at image build
time from the latest-linux GitHub release; no Rust
toolchain required in the image.
demo/setup.sh — Local setup helper: builds the binary and installs
Python deps.
demo/hf_space/ — Minimal HuggingFace Space configuration template.
Maintainers wanting to deploy their own Space can copy
this directory and adjust the repo URLs.
The demo auto-downloads a small demo vindex from HuggingFace Hub on first
start if no local vindex is found (requires huggingface_hub Python package).
Live reference deployment: https://huggingface.co/spaces/cronos3k/LARQL-Explorer
Adds a one-line reference to the live LARQL Explorer Space so users reading the README can try the tool without building from source.
I'm a bit confused, was this PR meant to be made on the original repo instead of on your own fork? I might be missing something here
Following your video yesterday — watched it, spent the day on this,
here's what came out of it.
What's in this PR
10 commits, each self-contained:
- fix(compute/build) - CARGO_CFG_* env vars for correct cross-compilation platform detection
- fix(compute) - q4_dot.c: MSVC-safe memcpy + scalar x86 fallback alongside ARM NEON path
- feat(compute) - per-platform BLAS: system OpenBLAS on Linux, matrixmultiply on Windows (no install required)
- feat(compute) - CUDA/cuBLAS backend, --features cuda, priority: Metal > CUDA > CPU
- refactor - replace every hardcoded CpuBackend with default_backend() so GPU features actually fire
- feat(models) - Windows HF cache path fix, HF_HOME/HUGGINGFACE_HUB_CACHE support, auto-download via hf-hub
- ci - GitHub Actions Linux x86_64 build on ubuntu-22.04 (glibc 2.35, Bookworm-compatible)
- chore - .gitignore: exclude vindex dirs, model weights, demo logs
- feat(demo) - Gradio 6 Explorer UI: 6 tabs, HF Space Dockerfile, no Rust reimplementation
- docs - README: HF Space link
What is not changed
Tested on
Live demo
https://huggingface.co/spaces/cronos3k/LARQL-Explorer
Happy to split into separate PRs or adjust anything you'd like to change.