
feat: CUDA/NVIDIA port — Qwen3.5-397B on single GPU at 5.35 tok/s (5.86 peak)#7

Open
ssubbotin wants to merge 26 commits into danveloper:main from ssubbotin:cuda

Conversation


@ssubbotin ssubbotin commented Mar 22, 2026

Summary

Complete CUDA inference engine that runs Qwen3.5-397B-A17B on a single NVIDIA GPU, streaming 209GB of expert weights from NVMe SSD at 5.35 tokens/second (RTX 4090, 5.86 peak).

Port of the Metal/Apple Silicon engine to x86/NVIDIA hardware with significant enhancements:

  • 15 CUDA kernels ported from Metal (FMA-optimized vec4 4-bit dequant matvec, GatedDeltaNet, attention, SwiGLU, RMS norm)
  • Frequency-weighted VRAM expert cache (~17GB LRU on RTX 4090): hot experts served instantly, ~95% hit rate, +135% throughput
  • Three-tier caching: VRAM cache -> OS page cache -> NVMe SSD
  • HTTP server with both OpenAI (/v1/chat/completions) and Anthropic (/v1/messages) APIs
  • Tool calling with <tool_call> parsing and OpenAI/Anthropic response formats
  • Multi-turn sessions: KV cache persists across requests
  • System prompt caching: prefilled once at startup, snapshot restored per request
  • Only 16GB+ system RAM required (process uses 5.5GB)

Multi-Hardware Benchmarks

| GPU | VRAM | RAM | Disk | Avg tok/s | Peak tok/s |
|---|---|---|---|---|---|
| RTX 4090 | 24 GB | 64 GB | NVMe 7 GB/s | 5.35 | 5.86 |
| RTX 3060 | 12 GB | 755 GB | NVMe 9 GB/s | 2.92 | 3.23 |
| RTX 2080 Ti | 11 GB | 16 GB | virtio 520 MB/s | 0.51 | 0.54 |
| Apple M3 Max | 48 GB unified | -- | NVMe 17.5 GB/s | 4.36 | -- |

VRAM Cache Warm-Up (RTX 4090)

| Request | tok/s | Improvement |
|---|---|---|
| 1 (cold) | 2.49 | baseline |
| 2 | 3.22 | +29% |
| 4 | 5.25 | +111% |
| 8 (hot) | 5.86 | +135% |

Comparison with Other Solutions

| System | tok/s | RAM Required | Approach |
|---|---|---|---|
| Flash-MoE CUDA | 5.35 | 16GB+ | VRAM cache + page cache + SSD |
| KTransformers* | ~14 | 384GB | CPU expert compute (AMX) |
| llama.cpp | ~1-2 | 256GB | CPU/GPU layer split |

*KTransformers numbers for Qwen3-235B; 397B single-GPU not published.

Key Architecture Decisions

  • VRAM expert cache: Frequency-weighted LRU using 17GB of unused VRAM. ~2500 expert slots. 95% hit rate. +135% throughput.
  • vec4 FMA kernel: uint4 (128-bit) loads + bit shifts + __ldg(). All divisions eliminated. Consistent 5+ tok/s.
  • Page cache over GDS: pread populates OS page cache; beats GDS for sustained generation.
  • I/O-compute overlap: Shared expert runs on GPU while experts load from SSD.
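The cache-hierarchy decision above can be checked with a simple amortized-latency model. The per-layer numbers (8.3 ms pread+cudaMemcpy for K=4 experts, ~95% hit rate) are the PR's own measurements from bench_transfer.cu; the formula and function name are my illustration, not code from the PR:

```python
# Rough amortized per-layer expert-fetch cost under the three-tier cache.
# Timing inputs are the PR's measurements; the model itself is a sketch.
def amortized_fetch_ms(hit_rate, t_vram_ms=0.0, t_miss_ms=8.3):
    """Expected per-layer fetch time: VRAM hits are treated as free,
    misses pay the pread+cudaMemcpy cost (8.3 ms for K=4 experts)."""
    return hit_rate * t_vram_ms + (1.0 - hit_rate) * t_miss_ms

cold = amortized_fetch_ms(0.0)    # every access falls through to page cache / SSD
hot = amortized_fetch_ms(0.95)    # ~95% of accesses served from VRAM
```

At a 95% hit rate the expected fetch cost drops from 8.3 ms to ~0.4 ms per layer, which is consistent with the I/O bottleneck (and most of the +135% gain) disappearing once the cache is warm.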

Features

  • HTTP server (--serve PORT): OpenAI + Anthropic SSE streaming
  • Tool calling: <tool_call> parsing, OpenAI tool_calls / Anthropic tool_use
  • Multi-turn sessions: session_id maintains conversation state
  • Claude Code compatible: native /v1/messages, just set ANTHROPIC_BASE_URL
  • Per-layer timing: --timing shows phase breakdown

Files

  • cuda_infer/infer.cu -- Complete engine + HTTP server (~2000 lines)
  • cuda_infer/kernels.cuh -- 15 CUDA kernels (~570 lines)
  • cuda_infer/README.md -- Full documentation
  • bench_transfer.cu -- Transfer path benchmarks

Test Plan

  • Correct output on RTX 4090, RTX 3060, RTX 2080 Ti
  • 5.35 tok/s avg, 5.86 peak on RTX 4090
  • 2.92 tok/s avg on RTX 3060 (12GB VRAM, 755GB RAM)
  • 0.51 tok/s on RTX 2080 Ti (11GB VRAM, 16GB RAM, slow disk)
  • HTTP server: OpenAI + Anthropic SSE streaming
  • Tool calling: get_weather({"location": "Tokyo"})
  • Multi-turn: model remembers "My name is Alice"
  • System prompt caching: 4s prefill, instant restore
  • VRAM cache warm-up: 2.49 -> 5.86 tok/s
  • Per-layer timing diagnostic

Complete CUDA inference engine that runs the full 397B parameter MoE model
on a single RTX 4090 (24GB VRAM) + 64GB RAM + NVMe SSD.

Key components:
- cuda_infer/infer.cu: Full inference engine (~1400 lines)
  Model loading (mmap + GPU upload), 60-layer forward pass,
  GatedDeltaNet linear attention, full attention with KV cache,
  MoE routing + expert SSD streaming, tokenizer integration.

- cuda_infer/kernels.cuh: 15 CUDA kernels ported from Metal
  FMA-optimized 4-bit dequant matvec, SwiGLU, RMS norm,
  attention (Q@K^T, softmax, scores@V), GatedDeltaNet recurrence,
  conv1d, MoE combine+residual.

- bench_transfer.cu: Transfer path benchmarks
  Measured GDS (5.3ms), pread+cudaMemcpy (8.3ms), warm cache (2.7ms)
  per layer for K=4 experts.

Performance: 2.45 tok/s (RTX 4090, Samsung 990 EVO Plus, PCIe 4.0 x4)
Comparison: requires only 64GB RAM vs 256-384GB for llama.cpp/KTransformers

NVIDIA GPUDirect Storage (GDS) enables direct NVMe-to-GPU DMA,
providing 37% speedup over traditional pread+cudaMemcpy path.
@ssubbotin
Author

Correction: works on 16 GB RAM + 16 GB VRAM;
see https://gist.github.com/ssubbotin/e778f97ebf1f76b8fd423c0bc3961a1a

The running process actually uses 6 GB RAM and 6 GB VRAM to run the 397B MoE.

Add --serve PORT mode to the CUDA inference engine. Implements:
- POST /v1/chat/completions with SSE streaming (token-by-token)
- GET /v1/models (OpenAI model list)
- GET /health (status check)
- CORS headers for browser clients

ChatML tokenization for user messages, state reset between requests.
Tested at 2.68 tok/s streaming via curl.
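The token-by-token streaming described above follows the public OpenAI chat-completions chunk shape. A minimal sketch of formatting one token as an SSE chunk (the helper name and default model string are my own, not the PR's code):

```python
import json

# Illustrative helper: wrap one generated token as an OpenAI-style SSE
# chunk, as streamed by /v1/chat/completions. Field layout follows the
# public chat.completion.chunk schema; this is a sketch, not the PR's code.
def sse_chunk(token, model="qwen3.5-397b", finish_reason=None):
    chunk = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {} if finish_reason else {"content": token},
            "finish_reason": finish_reason,
        }],
    }
    return f"data: {json.dumps(chunk)}\n\n"

# A two-token stream terminated the way OpenAI clients expect.
stream = sse_chunk("Hello") + sse_chunk("", finish_reason="stop") + "data: [DONE]\n\n"
```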
Add tool/function calling to the HTTP server:
- Accept "tools" array in /v1/chat/completions requests
- Inject tool definitions into prompt using Qwen Hermes format
- Parse <tool_call> tags from model output
- Return OpenAI-compatible tool_calls SSE chunks
- Handle tool results via role="tool" messages
- Build full ChatML conversation from messages array

Tested: model correctly calls get_weather({"location": "Tokyo"})
when given the tool definition and asked about weather.

Known issues: model doesn't stop after tool call, special tokens
leak into content stream. Will fix in follow-up.
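The `<tool_call>` parsing step above can be sketched as a regex extraction over the model output. The regex, helper name, and error handling here are my own illustration, not the PR's code:

```python
import json
import re

# Sketch of parsing Qwen's Hermes-style <tool_call> tags from model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    """Return (plain_text, calls): the output with tool-call tags removed,
    plus each JSON object the model emitted between <tool_call> tags."""
    calls = [json.loads(body) for body in TOOL_CALL_RE.findall(text)]
    plain = TOOL_CALL_RE.sub("", text).strip()
    return plain, calls

out = ('Let me check.\n<tool_call>\n'
       '{"name": "get_weather", "arguments": {"location": "Tokyo"}}\n'
       '</tool_call>')
plain, calls = extract_tool_calls(out)
```

A server would then map each parsed call onto an OpenAI `tool_calls` chunk or an Anthropic `tool_use` block, per the formats listed above.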
- Stop generation immediately after </tool_call> is detected
  (was continuing to generate 200 tokens after the tool call)
- Filter special tokens by ID (151643-151654) and by decoded text
  (<|im_end|>, <|im_start|>, <|endoftext|>, <think>/</think>)
- Stop on <|im_end|> in decoded text (model generates these as
  regular tokens, not just special token IDs)
- Clean output: "Hello there, friend!" with finish_reason="stop"
- Tool calls: immediate stop with finish_reason="tool_calls"
…ments

Update cuda_infer/README.md with:
- HTTP server usage (--serve PORT)
- Tool calling examples with curl
- Sending tool results back (multi-turn tool use)
- Claude Code integration via litellm proxy
- OpenAI Python SDK, aider, continue.dev examples
- Custom system prompt (~/.flash-moe/system.md)
- Corrected RAM requirements: 16GB min, 32GB recommended
  (process uses only 5.5GB; GDS bypasses RAM for expert data)
Add POST /v1/messages endpoint implementing the Anthropic Messages API
with SSE streaming, eliminating the need for a litellm proxy.

Supports:
- message_start/content_block_start/content_block_delta/content_block_stop/
  message_delta/message_stop event sequence
- Text content blocks with text_delta streaming
- Tool use: tool_use content blocks with input_json_delta
- stop_reason: "end_turn" for normal completion, "tool_use" for tool calls
- System prompt as top-level field
- Array content blocks (text + tool_result)
- Anthropic tool format (input_schema)

Both APIs now available simultaneously:
  POST /v1/chat/completions  (OpenAI format)
  POST /v1/messages          (Anthropic format)

Tested: basic chat and tool calling both produce correct
Anthropic SSE event streams at 2.6-2.8 tok/s.
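The event sequence listed above can be sketched as a small generator. Event names follow the Anthropic Messages streaming sequence described in this commit; the payloads are abbreviated and the function name is my own:

```python
# Illustrative generator for the Anthropic Messages SSE event sequence
# for a single text content block. Payloads abbreviated for clarity.
def anthropic_text_events(text_pieces, stop_reason="end_turn"):
    yield "message_start"
    yield "content_block_start"
    for piece in text_pieces:
        yield ("content_block_delta", {"type": "text_delta", "text": piece})
    yield "content_block_stop"
    yield ("message_delta", {"stop_reason": stop_reason})
    yield "message_stop"

events = list(anthropic_text_events(["Hel", "lo"]))
```

For tool calls the same skeleton applies, with a `tool_use` content block streamed via `input_json_delta` and `stop_reason` set to `"tool_use"`.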
System prompt pre-caching:
- Tokenize and prefill system prompt at server startup (~4s)
- Snapshot all 60 layers of KV cache + delta-net + conv state
- Restore from snapshot on each request instead of resetting to zero
- Saves ~4s per request (no more re-prefilling system prompt)

Fixed special token IDs for this model (MLX 4-bit quantization):
- <|endoftext|> = 248044 (was 151643)
- <|im_start|> = 248045 (was 151644)
- <|im_end|> = 248046 (was 151645)
- <think>/</think> = 248068/248069

Prompt builders now only generate user turn content since
system prompt is already in the KV cache from the snapshot.

Custom system prompt: ~/.flash-moe/system.md (loaded at startup)
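The snapshot/restore idea above, prefill once, deep-copy all per-layer state, restore that copy per request, can be sketched as follows. The class layout and field names are illustrative, not the engine's actual data structures:

```python
import copy

# Sketch of system-prompt pre-caching: run the system prompt through the
# model once, snapshot per-layer state, restore the snapshot per request.
class LayerState:
    def __init__(self):
        self.kv_cache = []      # (key, value) entries appended per token
        self.deltanet = [0.0]   # recurrent linear-attention state
        self.conv = [0.0]       # conv1d sliding-window state

def snapshot(layers):
    return copy.deepcopy(layers)

def restore(snap):
    # Fresh copy each time so a request can't mutate the shared snapshot.
    return copy.deepcopy(snap)

layers = [LayerState() for _ in range(3)]
layers[0].kv_cache.append(("k_sys", "v_sys"))   # stand-in for system-prompt prefill
snap = snapshot(layers)

session = restore(snap)                          # per-request restore
session[0].kv_cache.append(("k_user", "v_user"))
```

The real engine snapshots 60 layers of KV cache plus delta-net and conv state on the GPU, which is why restore is effectively instant compared with the ~4 s re-prefill it replaces.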
Keep KV cache and attention state across requests in the same session:
- Pass "session_id" in request body to maintain conversation state
- Same session_id: continue from where the last response ended (no re-prefill)
- Different/no session_id: restore from system prompt snapshot (new conversation)
- Single active session at a time (one GPU = one conversation)
- Also supports x-session-id header for Anthropic endpoint

Tested: Turn 1 "My name is Alice" → Turn 2 (same session) "What is my name?"
→ "Your name is Alice." New session → "I don't know your name yet!"

Also fixed special token IDs for MLX 4-bit model:
  <|endoftext|>=248044, <|im_start|>=248045, <|im_end|>=248046
Add detailed per-layer timing when --timing flag is used:
  norm, attn, oproj, route, shared, io, expert, combine

Measured on RTX 4090 + Samsung 990 EVO Plus (PCIe 4.0 x4):
  norm=0.02 attn=0.28 oproj=0.02 route=0.04
  shared=0.04 io=5.79 expert=0.13 combine=0.01 ms/layer

Key finding: 87% of per-layer time is SSD I/O (5.8ms).
GPU compute is only 0.5ms — pipelining across layers
would save at most 8%, not worth the complexity.
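The "at most 8%" conclusion follows directly from the breakdown above; a quick arithmetic check using the PR's own per-layer numbers (ms/layer):

```python
# Per-layer phase timings measured on RTX 4090 + Samsung 990 EVO Plus.
phases = {"norm": 0.02, "attn": 0.28, "oproj": 0.02, "route": 0.04,
          "shared": 0.04, "io": 5.79, "expert": 0.13, "combine": 0.01}
total = sum(phases.values())
compute = total - phases["io"]
# Perfect cross-layer pipelining could at best hide all GPU compute
# behind I/O, so the maximum saving is compute / total.
max_saving = compute / total
```

With compute at ~0.54 ms against a ~6.33 ms layer, the ceiling on pipelining gains is roughly 8.5%, matching the commit's conclusion that the complexity is not worth it.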
GDS bypasses the OS page cache, leaving 58GB of RAM unused.
pread populates the page cache, so hot experts stay in RAM (~3ms)
instead of always hitting SSD (~5.3ms via GDS).

Measured improvement with warm cache:
  pread + page cache: 2.52 tok/s (best burst: 4.56 tok/s)
  GDS direct:         2.41 tok/s (constant, no cache benefit)

GDS is still available via ENABLE_GDS=1 env var for systems
with less than 32GB RAM where page cache isn't beneficial.

Page cache grows to ~50GB during sustained generation, caching
roughly half the 203GB expert data and accelerating repeat accesses.
LRU cache of recently-used experts in GPU VRAM. Uses ~17GB of the
24GB RTX 4090 VRAM (remaining after model weights + scratch buffers).
Holds ~2,500 experts; after a few requests, ~95% of expert accesses
hit the cache and skip SSD/page-cache entirely.

Three-tier caching hierarchy:
  1. VRAM cache (~17GB): instant access, LRU eviction
  2. OS page cache (~50GB): pread populates it, ~10 GB/s
  3. NVMe SSD: cold misses only, ~5-7 GB/s

Performance progression in server mode:
  Request 1 (cold):  2.49 tok/s
  Request 2 (warm):  3.22 tok/s (+29%)
  Request 3:         3.24 tok/s (+30%)
  Request 4 (hot):   3.55 tok/s (+43%)

Cache misses use async D2D copy to fill the VRAM slot in the
background while expert forward runs from the temp buffer.

Set DISABLE_VRAM_CACHE=1 to disable (saves 17GB VRAM for other uses).
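The VRAM expert cache above can be sketched as a capacity-bounded map with scored eviction. The class, the `FREQ_WEIGHT` value, and the logical-clock bookkeeping are my own illustration; the eviction score (access count weighted against recency) is the scheme the PR describes in the next commit:

```python
# Sketch of a frequency-weighted LRU expert cache. The real cache holds
# ~2,500 expert slots in ~17 GB of VRAM; values here are illustrative.
class ExpertCache:
    FREQ_WEIGHT = 10   # assumed weight; the PR calls this W

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # expert_id -> (access_count, last_used, payload)
        self.clock = 0      # logical time for recency

    def get(self, expert_id, load_fn):
        self.clock += 1
        if expert_id in self.entries:
            count, _, payload = self.entries[expert_id]
            self.entries[expert_id] = (count + 1, self.clock, payload)
            return payload, True                      # VRAM hit
        if len(self.entries) >= self.capacity:
            # Evict the lowest score = access_count * W + last_used,
            # so frequently-used experts survive topic changes.
            victim = min(self.entries,
                         key=lambda e: self.entries[e][0] * self.FREQ_WEIGHT
                                       + self.entries[e][1])
            del self.entries[victim]
        payload = load_fn(expert_id)                  # page cache / SSD miss
        self.entries[expert_id] = (1, self.clock, payload)
        return payload, False
```

With pure LRU, a burst of new experts would evict a hot expert just because it wasn't the most recent; the frequency term keeps it resident.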
@ssubbotin ssubbotin changed the title feat: CUDA/NVIDIA port — Qwen3.5-397B on single RTX 4090 at 2.45 tok/s feat: CUDA/NVIDIA port — Qwen3.5-397B on single RTX 4090 at 3.55 tok/s Mar 22, 2026
Three optimizations combined:

1. Frequency-weighted VRAM cache eviction:
   - Eviction score = access_count * FREQ_WEIGHT + last_used
   - Hot experts (high access_count) survive topic changes
   - Pure LRU peak: 4.74 tok/s → freq-weighted peak: 5.86 tok/s

2. uint4 vectorized loads in dequant kernel:
   - Load 128 bits (4 × uint32 = 32 nibbles) per instruction
   - #pragma unroll over 4 words for better instruction scheduling
   - __ldg() intrinsic for read-through L1 cache on weights/scales

3. Eliminated all runtime divisions and branches:
   - All /8 /64 /4 *8 → bit shifts (>>3 >>6 >>2 <<3)
   - Removed if-branch in launch helper (vec4 always used)
   - More consistent execution: 5.12-5.86 range vs 5.01-6.30

Performance progression:
  Original (GDS):              2.45 tok/s
  + page cache:                2.52 tok/s (+3%)
  + VRAM cache (pure LRU):     3.55 tok/s (+45%)
  + freq-weighted LRU:         4.74 tok/s peak
  + vec4 + shifts + __ldg:     5.35 tok/s avg, 5.86 peak (+118%)

Now 23% faster than Apple Silicon version (4.36 tok/s).
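The shift-based index math in point 3 is easy to verify outside CUDA. For 4-bit values packed 8 per 32-bit word, `/8`, `%8`, and `*4` become `>>3`, `&7`, and `<<2`; a pure-Python check of the equivalence (illustrative, not the kernel itself):

```python
# Nibble extraction from packed 32-bit words using only bit operations,
# mirroring the division-free indexing the vec4 kernel uses.
def unpack_nibble(words, i):
    """i-th 4-bit value from a list of packed 32-bit words."""
    w = words[i >> 3]          # i / 8  -> which 32-bit word
    shift = (i & 7) << 2       # (i % 8) * 4 -> bit offset within the word
    return (w >> shift) & 0xF

packed = [0x87654321, 0x0000000F]
vals = [unpack_nibble(packed, i) for i in range(9)]
```

On the GPU the same word is fetched as part of a 128-bit `uint4` load, so one memory transaction feeds 32 nibbles.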
@ssubbotin ssubbotin changed the title feat: CUDA/NVIDIA port — Qwen3.5-397B on single RTX 4090 at 3.55 tok/s feat: CUDA/NVIDIA port — Qwen3.5-397B on single GPU at 5.35 tok/s (5.86 peak) Mar 28, 2026
Paper (paper/flash_moe_cuda.tex):
- Expanded Related Work to 17 references (PowerInfer, Pre-gated MoE,
  DeepSpeed-MoE, S-LoRA, LRFU, ARC, Mixtral, DeepSeek-V3, etc.)
- Positioned against PowerInfer hot/cold partitioning
- Clarified "sustained" → "steady-state" with cold-start numbers
- Labeled RTX 2080 Ti virtualized storage as non-comparable
- Paper now 7 pages, IEEE two-column format

Review (paper/flash_moe_cuda_review.md):
- Full 5-reviewer peer review with editorial decision
- Revision roadmap with 7 required + 7 suggested items

Code (cuda_infer/infer.cu):
- Added expert logging for profiling (EXPERT_LOG env var)
…urve, S5 kernel metrics

R4: Expert profiling expanded to 1,290 tokens across 3 diverse prompts
    (science, code, creative). 309,600 routing decisions confirm:
    26.6% temporal locality, 0.8% cross-layer correlation (stable).

S1: W parameter sensitivity — tested W=0,1,5,10,20,50.
    All W>=1 within 2% of each other (4.80-4.94 tok/s).
    Not sensitive — any W>=1 works.

S3: Working set curve (cache hit rate vs size) from 1290-token data.
    Static top-N: 500 experts=20%, 2500=48.6%.
    Runtime LRU achieves 95% because active working set is smaller.

S5: CUDA kernel metrics from ncu profiling:
    28% DRAM throughput, 16-56% occupancy, 37 regs/thread.

Also added context-length degradation data (2.55→1.86 tok/s
over 10 sequential requests with growing context).

Paper now 8 pages with all reviewer-requested data.
All model constants now guarded with #ifndef, allowing override
via -D flags at compile time. Expert offsets computed from dimensions
instead of hardcoded.

Added configure.py: reads model_weights.json config section and
generates the correct nvcc -D flags or a per-model Makefile.

Workflow for any MoE model:
  python3 configure.py --manifest model_weights.json --print-cmd
  # outputs: nvcc -DHIDDEN_DIM=3072 -DNUM_LAYERS=48 ...

Default build (no -D flags) targets Qwen3.5-397B-A17B.
Each model gets its own binary with exact-sized arrays — no
wasted memory from MAX_LAYERS or runtime indirection.
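The flag-generation step of configure.py can be sketched in a few lines. The config key names here are illustrative; the real script derives them from the model_weights.json config section:

```python
# Sketch of deriving nvcc -D flags from a model config dict, in the
# spirit of configure.py. Key names are illustrative.
def nvcc_flags(config):
    return " ".join(f"-D{key.upper()}={value}"
                    for key, value in sorted(config.items()))

flags = nvcc_flags({"hidden_dim": 3072, "num_layers": 48})
cmd = f"nvcc {flags} infer.cu -o infer"
```

Because the constants are compile-time `-D` definitions guarded by `#ifndef`, each model gets exact-sized static arrays with no runtime indirection, as described above.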
Add dequant_matvec_q4k kernel for GGML Q4_K quantization format,
enabling direct use of GGUF model files without format conversion.

Q4_K format: 256-element super-blocks with packed 6-bit scales,
fp16 super-block scale/min, 4-bit quantized values.

Optimizations applied:
- Precompute all 8 scale/min pairs (no branch in inner loop)
- uint32 loads for qs array (4 bytes = 8 nibbles per load)
- FMA optimization: fma(nibble, ds*x, -ms*x)
- __ldg() for read-through L1 cache
- All divisions replaced with bit shifts
- Full #pragma unroll

Benchmark vs MLX affine 4-bit (RTX 4090):
  gate/up [1024, 4096]: 1.06x (near parity)
  routing [512, 4096]:  1.08x (near parity)
  lm_head [248320, 4096]: 1.34x
  down [4096, 1024]: 1.70x (narrow input, few blocks/row)

Net impact: ~5% throughput reduction vs MLX format.
GGUF users skip the 209GB safetensors download.
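The FMA rewrite above, expressing each dequantized weight as `d*q - m` so every dot-product term becomes `fma(q, d*x, -m*x)`, can be illustrated with a simplified affine 4-bit block. Real Q4_K adds 256-element super-blocks with packed 6-bit sub-scales; this sketch uses a single (d, m) pair per block for clarity and is not the kernel's code:

```python
# Simplified affine 4-bit dequant dot product illustrating the FMA form:
# w = d*q - m, so each term is fma(q, d*x, -m*x). One scale/min pair per
# block here; real Q4_K packs 8 six-bit sub-scales per 256-wide super-block.
def dequant_dot(quants, d, m, x):
    acc = 0.0
    for q, xi in zip(quants, x):
        acc += q * (d * xi) + (-m) * xi   # fma(q, d*x, -m*x)
    return acc

q = [0, 15, 8, 3]                  # 4-bit quantized values
x = [1.0, 1.0, 1.0, 1.0]           # activations
y = dequant_dot(q, d=0.5, m=1.0, x=x)
```

Precomputing `d*x` and `-m*x` style factors once per sub-block is what removes the branch from the inner loop in the optimized kernel.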
@FNsi

FNsi commented Mar 29, 2026

How many experts are you using compared with llama.cpp?

repack_experts.py no longer has hardcoded sizes for Qwen3.5-397B.
Component sizes, expert count, and layer count are auto-detected
from expert_index.json at runtime. Works for any MoE model:

  python3 build_expert_index.py --model /path/to/safetensors --output index.json
  python3 repack_experts.py --index index.json

Tested formats:
  Qwen3.5-397B-A17B: 512 experts, 7,077,888 bytes/expert
  Qwen3.5-122B-A10B: 256 experts, different dimensions (auto-detected)
Added Section 4.2: llama.cpp vs Flash-MoE on identical RTX 4090 + 64GB RAM.
Same model (Qwen3.5-397B at 4-bit), same prompt:

  Flash-MoE CUDA (warm):  5.35 tok/s, 5.5 GB RAM
  llama.cpp -ngl 99:      OOM (228GB > 24GB VRAM)
  llama.cpp -ngl 0:       <0.05 tok/s (2h+ for 20 tokens, 54GB RAM)

The comparison demonstrates Flash-MoE's fundamental advantage:
expert-level streaming with VRAM caching vs whole-model mmap.
When the model doesn't fit in RAM, llama.cpp falls back to OS
paging which thrashes catastrophically. Flash-MoE streams only
the active experts (~27MB/layer) and caches hot ones in VRAM.
- Add 5-run measurements with std dev (5.57 ± 0.12 tok/s, n=15)
- Update warm-up table with honest diverse-prompt data
- Add Limitations subsection (batched serving, warm-up, W, multi-GPU)
- Cite all 17 references in text (Mixtral, DeepSeek-V3, MoE-Gen, FloE)
- Add AI disclosure statement
- Add measurement methodology note
- Reduce em dash density, add Table 2 footnote

Co-Authored-By: Sergey Subbotin <ssubbotin@gmail.com>
@ssubbotin
Author

Both Flash-MoE and llama.cpp activate the same number of experts — K=4 out of 512 per layer (plus 1 shared expert). This is determined by the model's router, not the inference engine.

The difference is how those experts are loaded:

| | Flash-MoE CUDA | llama.cpp |
|---|---|---|
| Experts per layer | K=4 (same) | K=4 (same) |
| Data loaded per layer | ~27 MB (4 experts × 6.75 MB) | N/A (entire model mmap'd) |
| Total model in memory | 5.5 GB process + VRAM cache | 228 GB mmap (54 GB resident) |
| Expert access | Stream from SSD → page cache → VRAM cache | Page fault from mmap'd file |

Flash-MoE only reads the 4 active experts each layer needs. llama.cpp memory-maps the entire 228 GB GGUF file, so the OS has to page in/out continuously with only 64 GB of physical RAM — that's why it thrashes and gets <0.05 tok/s.

The VRAM expert cache (~17 GB, ~2565 expert slots) means ~95% of expert accesses hit GPU memory at 1008 GB/s after warm-up, which is where the 5.57 tok/s comes from.
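The per-layer figures above also give a back-of-envelope throughput ceiling. The inputs (4 experts × 6.75 MB, 60 layers, 7 GB/s NVMe, 95% hit rate) are from this PR; the model is mine and assumes every layer is MoE with no overlap or page-cache help:

```python
# Back-of-envelope I/O ceiling for expert streaming. Inputs are the PR's
# figures; the model assumes all 60 layers are MoE and no I/O overlap.
experts_per_layer = 4
expert_mb = 6.75
layers = 60
per_token_gb = experts_per_layer * expert_mb * layers / 1024   # ~1.58 GB/token

ssd_gbps = 7.0
uncached_ceiling = ssd_gbps / per_token_gb        # every read hits SSD
cached_per_token_gb = per_token_gb * (1 - 0.95)   # 95% served from VRAM
cached_ceiling = ssd_gbps / cached_per_token_gb
```

Uncached, SSD bandwidth alone caps throughput around 4.4 tok/s, close to the observed cold numbers; with the 95% VRAM hit rate the I/O ceiling rises well past the measured 5.57 tok/s, so GPU compute and the remaining misses become the bottleneck instead.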

- Switch from twocolumn to single-column (tables no longer overflow)
- Replace all 11 prose em dashes with commas, parentheses, semicolons,
  or colons for cleaner academic style

Co-Authored-By: Sergey Subbotin <ssubbotin@gmail.com>
@FNsi

FNsi commented Mar 29, 2026

Looks great! I definitely know nothing about this, but if 10 experts were active, I think the speed would be around 1 tok/s? Also, it seems highly dependent on SSD speed, though in that case it would actually hit GPU VRAM speed first, theoretically. I think it might be possible to store some experts, or even just diffs, assuming it can actually predict and prepare for the next layer? Again, I have no clue and no knowledge around this, and you did great.

GGUF support (WIP — loads and runs, output correctness in progress):
- gguf_extract.py: parse GGUF, extract non-expert weights + repack experts
- Q4_K, Q5_K, Q6_K, F32 dequant matvec kernels in kernels.cuh
- Runtime format dispatch in infer.cu (g_quant_format=0 MLX, 1 GGUF)
- GGUF expert loading with per-component quant types from layout.json
- 397B MLX path completely untouched — no regression

Model-agnostic tooling:
- extract_weights.py: reads config.json (auto-detects text_config nesting)
- build_expert_index.py: flexible regex, auto-detected sizes
- repack_experts.py: universal (no hardcoded dimensions)
- configure.py: compile-time flags from manifest

Paper:
- TMLR submission format (flash_moe_cuda_tmlr.tex + references.bib)
- arXiv preprint format (flash_moe_cuda_arxiv.tex)

Co-Authored-By: Sergey Subbotin <ssubbotin@gmail.com>
GGUF models store norm weights, dt_bias, conv1d weights as F32 but the
engine kernels read them as bf16. Add upload_tensor_f32_as_bf16() to
convert during weight upload.

Also: increase VRAM cache reserve to 1GB, cap cache at total expert count,
fix GGUF config extraction (linear_num_value_heads from ssm.time_step_rank,
not ssm.state_size), fix full-attention GGUF to use separate Q/K/V.

Status: 35B model loads and runs at 36-44 tok/s. Layers 0-4 produce
varied output. Layer 5+ accumulates numerical errors leading to NaN by
layer 6. Root cause is likely remaining F32/bf16 mismatch or delta-net
state dimension issue. Debugging continues.

Co-Authored-By: Sergey Subbotin <ssubbotin@gmail.com>