Hey folks. I ran into a hard crash while running some heavy embedding workloads on Windows using the CUDA backend. It looks like it's tied to the VMM allocator.
The Problem
When running a large indexing job (about 32,000 chunks via qmd), the process dies with a CUDA out of memory error.
Digging into the debug logs, the exact failure happens at `ggml-cuda.cu:97`. It aborts inside `ggml_cuda_pool_vmm::alloc` (around line 476) when calling:

```cpp
cuMemAddressReserve(&pool_addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
```
Why it's failing
I'm on an RTX 3090 (24GB). In `ggml-cuda.cu`, `CUDA_POOL_VMM_MAX_SIZE` is hardcoded to reserve 32 GB of virtual address space. Even with plenty of actual VRAM available, that reservation fails on my setup, and instead of gracefully falling back to a non-VMM pool, the whole process hard-aborts.
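To confirm it's the address-space reservation itself and not an actual VRAM shortage, here's a small standalone driver-API probe that attempts the same 32 GB reservation outside of ggml. This is my own sketch: the `1ull << 35` constant mirrors the hardcoded pool size described above, and nothing else in it is taken from `ggml-cuda.cu`.

```cpp
// probe_vmm.cpp: standalone reproduction of the failing reservation.
// Build (illustrative): nvcc probe_vmm.cpp -o probe_vmm -lcuda
#include <cuda.h>
#include <cstdio>

int main() {
    CUdevice dev;
    CUcontext ctx;
    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
        cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS) {
        fprintf(stderr, "failed to initialize the CUDA driver API\n");
        return 1;
    }

    const size_t size = 1ull << 35;  // 32 GB, matching the pool's max size
    CUdeviceptr addr = 0;
    CUresult res = cuMemAddressReserve(&addr, size, 0, 0, 0);
    if (res == CUDA_SUCCESS) {
        printf("reservation succeeded at %p\n", (void *) addr);
        cuMemAddressFree(addr, size);  // release the probe reservation
    } else {
        const char *msg = nullptr;
        cuGetErrorString(res, &msg);
        printf("reservation failed: %s (%d)\n", msg ? msg : "?", (int) res);
    }

    cuCtxDestroy(ctx);
    return 0;
}
```

On an affected WDDM machine this should print the failure without any model loaded, which would rule out real memory pressure as the cause.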
The Workaround
I managed to bypass this locally by compiling node-llama-cpp from source with VMM disabled:

```
GGML_CUDA_NO_VMM=ON
```
With that flag, the exact same embedding job finishes perfectly and memory usage stays stable.
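For anyone who wants to verify the same thing against plain llama.cpp: the equivalent CMake build looks like the following. (The flags are llama.cpp's own build options; the exact invocation here is illustrative rather than copied from any project docs.)

```shell
# Illustrative: build llama.cpp's CUDA backend with the VMM pool disabled.
# GGML_CUDA enables the CUDA backend; GGML_CUDA_NO_VMM forces the legacy
# cudaMalloc-based pool instead of the virtual-memory pool.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build --config Release
```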
The Request
Would it be possible to add a runtime fallback here? If `cuMemAddressReserve` fails (which seems to happen on some Windows/WDDM setups), it would be great if it logged a warning and fell back to the standard allocator instead of crashing. That would make the prebuilt binaries a lot more stable for Windows users hitting this edge case.
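Rough shape of what I mean. `ggml_cuda_pool_vmm` and `ggml_cuda_pool_leg` are the existing pool types in `ggml-cuda.cu`; the probe-and-free step, the `device_supports_vmm()` helper, and this exact control flow are my sketch, not the current code:

```cpp
// Sketch of a runtime fallback in the CUDA pool selection.
// Hypothetical: device_supports_vmm() and this flow are illustrative.
static std::unique_ptr<ggml_cuda_pool> new_pool_for_device(int device) {
#if !defined(GGML_CUDA_NO_VMM)
    if (device_supports_vmm(device)) {
        CUdeviceptr addr = 0;
        // Probe the reservation instead of hard-asserting on it.
        CUresult res = cuMemAddressReserve(&addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0);
        if (res == CUDA_SUCCESS) {
            // Release the probe; the pool does its own reservation later.
            cuMemAddressFree(addr, CUDA_POOL_VMM_MAX_SIZE);
            return std::make_unique<ggml_cuda_pool_vmm>(device);
        }
        fprintf(stderr, "ggml_cuda: warning: cuMemAddressReserve failed (%d), "
                        "falling back to the non-VMM pool\n", (int) res);
    }
#endif
    return std::make_unique<ggml_cuda_pool_leg>(device);
}
```

The key point is that the reservation result is checked once at pool creation, so affected machines get the slower-but-working legacy pool and everyone else keeps the VMM path.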
My Environment
- OS: Windows 11 Pro N (10.0.22631)
- GPU: RTX 3090 24GB (Driver 591.44)
- CUDA: 13.1
- Node: v24.13.0
- node-llama-cpp: 3.17.1