Windows CUDA: cuMemAddressReserve failure in VMM pool causes hard abort (GGML_CUDA_NO_VMM workaround) #580

@enceos

Description

Hey folks. I ran into a hard crash while running some heavy embedding workloads on Windows using the CUDA backend. It looks like it's tied to the VMM allocator.

The Problem
When running a large indexing job (about 32,000 chunks via qmd), the process dies with a CUDA out of memory error.

Digging into the debug logs, the abort is triggered from ggml-cuda.cu:97, inside ggml_cuda_pool_vmm::alloc (around line 476), on this call:
cuMemAddressReserve(&pool_addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)

Why it's failing
I'm on an RTX 3090 (24GB). In ggml-cuda.cu, CUDA_POOL_VMM_MAX_SIZE is hardcoded to 32GB, so the VMM pool tries to reserve 32GB of virtual address space per device. Even with plenty of actual VRAM free, that virtual address space reservation fails on my setup. And instead of gracefully falling back to a non-VMM pool, the whole process hard-aborts.

The Workaround
I bypassed this locally by compiling node-llama-cpp from source with the VMM pool disabled:
GGML_CUDA_NO_VMM=ON
With that flag, the exact same embedding job finishes cleanly and memory usage stays stable.
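For reference, GGML_CUDA_NO_VMM is an upstream llama.cpp CMake option. A minimal sketch of the equivalent standalone llama.cpp build (the exact mechanism for passing CMake options through node-llama-cpp's source build may differ, so treat this as illustrative):

```shell
# Build llama.cpp's CUDA backend with the VMM pool disabled.
# GGML_CUDA_NO_VMM=ON forces the legacy (non-VMM) memory pool.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build --config Release
```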

The Request
Would it be possible to add a runtime fallback here? If cuMemAddressReserve fails (which seems to happen on some Windows/WDDM setups), it would be great if the backend logged a warning and fell back to the standard (non-VMM) allocator instead of crashing. That would make the prebuilt binaries a lot more stable for Windows users hitting this edge case.
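To illustrate the shape of the fallback I'm asking for, here is a minimal sketch of the pattern. The names (reserve_address_space, choose_pool, g_reserve_should_fail) are hypothetical stand-ins so it can run without a GPU; in the real code this would be the cuMemAddressReserve call inside ggml_cuda_pool_vmm, and the fallback target would be ggml's legacy pool:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the CUDA driver API return code,
// so the fallback pattern can be demonstrated without a GPU.
enum ReserveResult { RESERVE_SUCCESS = 0, RESERVE_FAILED = 1 };

// Test knob: simulate a setup where the reservation fails.
static bool g_reserve_should_fail = false;

// Stub: pretend to reserve a large virtual address range
// (the role cuMemAddressReserve plays in the real allocator).
static ReserveResult reserve_address_space(std::size_t size) {
    (void)size;
    return g_reserve_should_fail ? RESERVE_FAILED : RESERVE_SUCCESS;
}

enum PoolKind { POOL_VMM, POOL_LEGACY };

// The requested behavior: if the VMM address-space reservation fails,
// log a warning and fall back to the legacy (non-VMM) pool instead of
// aborting the whole process.
static PoolKind choose_pool(std::size_t vmm_reserve_size) {
    if (reserve_address_space(vmm_reserve_size) == RESERVE_SUCCESS) {
        return POOL_VMM;
    }
    std::fprintf(stderr,
        "warning: virtual address reservation of %zu bytes failed; "
        "falling back to non-VMM pool\n", vmm_reserve_size);
    return POOL_LEGACY;
}
```

With this structure, machines where the reservation succeeds keep the faster VMM pool, and the failing Windows/WDDM setups degrade gracefully instead of crashing.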

My Environment

  • OS: Windows 11 Pro N (10.0.22631)
  • GPU: RTX 3090 24GB (Driver 591.44)
  • CUDA: 13.1
  • Node: v24.13.0
  • node-llama-cpp: 3.17.1
