NVIDIA Open GPU Kernel Modules Version
595.71.05 (Arch package nvidia-open-dkms 595.71.05-2)
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Arch Linux (rolling release)
Kernel Release
Linux host 7.0.3-arch1-2 #1 SMP PREEMPT_DYNAMIC Fri, 01 May 2026 15:49:22 +0000 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-)
Describe the bug
On a single-GPU RTX 3090 desktop running Linux 7.0.3 with
nvidia-open-dkms 595.71.05, the kernel logged a resource sanity check
warning naming __nv_drm_gem_nvkms_map as the caller of an mmap that
"spans more than" the device's BAR1 region. The same instant, the GPU
took an MMU fault on Copy Engine 2 (Xid 31) and the driver self-declared
the GPU unrecoverable (Xid 154, "Node Reboot Required") with
uvm encountered global fatal error 0x60. GSP RPC then timed out
(Xid 175). The display compositor's vblank stalled, the screen froze, and
neither nvidia-smi nor systemctl reboot could complete; recovery
required a hardware power-cycle. The trigger workload was a
Chromium-based browser (Brave) starting a new renderer process.
To Reproduce
- Wayland compositor (Hyprland) running, ~2 hours uptime since boot
- Brave (Chromium-based browser) open with several tabs
- Brave subprocess started a new renderer/GPU process — call stack shows
Chromium worker thread deep in kperfBoostSet_IMPL → rpcRmApiControl_GSP →
_kgspRpcRecvPoll, consistent with a GPU-frequency-boost RPC during
renderer spin-up
- No CUDA process active; no userspace had /dev/nvidia-uvm open
- System RAM healthy: 7.6 GiB / 61 GiB used, no swap pressure
- Single occurrence so far; not yet a deterministic reproducer
- See "Smoking-gun evidence" and "Fault sequence" in More Info below
Bug Incidence
Once
nvidia-bug-report.log.gz
nvidia-bug-report.log.gz
More Info
Note: I have not tested with the proprietary nvidia-dkms package, so I have
left the proprietary-driver-confirmation checkbox unchecked. The kernel's
own resource sanity check warning names __nv_drm_gem_nvkms_map+0x99/0xf0 [nvidia_drm] as the caller, which is specific to nvidia-open's DRM layer.
I am happy to test the proprietary driver if maintainers think it would
help isolate the regression.
Smoking-gun evidence
Single line, logged by the kernel core (not by NVRM) at t = 0:
resource: resource sanity check: requesting [mem 0x000000fccfdd0000-0x000000fcd00fffff], which spans more than 0000:01:00.0 [mem 0xfcc0000000-0xfccfffffff 64bit pref]
caller __nv_drm_gem_nvkms_map+0x99/0xf0 [nvidia_drm] mapping multiple BARs
The requested range is ~3.2 MiB long: it starts ~2.2 MiB below the end of
BAR1 (0xfcc0000000-0xfccfffffff) and runs ~1 MiB past it, into BAR3
(which starts at 0xfcd0000000 and spans 32 MiB). The kernel's PCI resource
validation flags the request, and the subsequent
[drm:_nv_drm_gem_nvkms_map] ERROR Failed to map NvKmsKapiMemory 0x00000000616506ff
confirms the map failed.
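For anyone who wants to double-check the arithmetic, here is a minimal
standalone sketch (plain C; the constants are copied from the sanity-check
line above, and nothing in it is driver code) that recomputes the offsets:

```c
/* Recompute the offsets in the "resource sanity check" line.
 * Standalone illustration only -- not driver code. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t req_start  = 0xfccfdd0000ULL; /* requested range start (from log)  */
    const uint64_t req_end    = 0xfcd00fffffULL; /* requested range end, inclusive    */
    const uint64_t bar1_start = 0xfcc0000000ULL; /* 0000:01:00.0 BAR1 start (from log) */
    const uint64_t bar1_end   = 0xfccfffffffULL; /* 0000:01:00.0 BAR1 end, inclusive  */
    const double   mib        = 1024.0 * 1024.0;

    printf("request size:          %.2f MiB\n", (req_end - req_start + 1) / mib);
    printf("start below BAR1 end:  %.2f MiB\n", (bar1_end + 1 - req_start) / mib);
    printf("overrun past BAR1 end: %.2f MiB\n", (req_end - bar1_end) / mib);
    printf("fits inside BAR1:      %s\n",
           (req_start >= bar1_start && req_end <= bar1_end) ? "yes" : "no");
    return 0;
}
```

It reports a ~3.19 MiB request that begins ~2.19 MiB below the top of BAR1
and overruns it by exactly 1 MiB, which is what the "mapping multiple BARs"
warning is complaining about.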
Immediately preceding this, NVRM logged ~25 repetitions of:
NVRM: dmaAllocMapping_GM107: can't alloc VA space for mapping.
NVRM: nvAssertOkFailedNoLog: ... [NV_ERR_NO_MEMORY] (0x00000051) ... @ mapping_reuse.c:273
... @ kern_bus_gm107.c:3141 // ("pBar1VaInfo->reuseDb")
so BAR1 VA space was being repeatedly exhausted in the seconds leading up
to the bad-range request. That suggests the bad mapping is a fallback (or
an arithmetic mistake) on the BAR1-VA-exhausted path rather than a
random misuse of pci_resource*.
Fault sequence
All times relative to t = 0 (the resource sanity check line above).
Full redacted log in kernel-log-excerpt.txt.
| Offset | Event |
| --- | --- |
| t+0:00:00 | resource sanity check, __nv_drm_gem_nvkms_map ... mapping multiple BARs, Failed to map NvKmsKapiMemory. |
| t+0:00:00 | Xid 31 — MMU Fault: ENGINE CE2 HUBCLIENT_CE0 faulted @ 0x1_21000000, FAULT_PTE ACCESS_TYPE_VIRT_WRITE. |
| t+0:00:00 | nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover. |
| t+0:00:00 | Xid 154 — GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required). |
| t+0:00:00 | Brave GPU subprocess receives SIGILL (trap invalid opcode ... in brave[...]). |
| t+0:00:01 | [drm:nv_drm_atomic_apply_modeset_config] Failed to initialize semaphore for plane fence, nv_drm_atomic_commit Error code: -11. |
| t+0:01:15 | _kgspIsHeartbeatTimedOut: diff 75117 timeout 5200. GSP heartbeat lost. |
| t+0:01:45 | Memory Subsystem Error detected. kgmmuInvalidateTlb failed. |
| t+0:01:45 | Xid 175 — Timeout after 75s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL). Originating thread name ThreadPoolSingl (Chromium worker). |
| t+0:01:48 | Call trace dumped: _kgspRpcRecvPoll → _issueRpcAndWait → rpcRmApiControl_GSP → kperfBoostSet_IMPL → resControl_IMPL → ... → nvidia_unlocked_ioctl. |
| t+0:01:48 onward | RC watchdog: GPU is probably locked! Notify Timeout Seconds: 7 repeats every 30-60 s. Hundreds of NV_ERR_RESET_REQUIRED assertions firing as the fullchip-reset path itself fails its preconditions. |
| t+0:06:18 | Xid 16, Head 00000003 Count ..., RM has detected that 7 Seconds without a Vblank Counter Update on head:D0. Display visibly froze. |
| t+0:12:48 | Second Xid 16 / vblank-watchdog. |
Recovery
- nvidia-smi accepted the ioctl but never returned (hung indefinitely;
  killed after ~5 min).
- The driver's own RC path tried FULLCHIP_RESET repeatedly; every
  attempt failed with NV_ERR_RESET_REQUIRED precondition assertions —
  the chip-reset path itself was wedged.
- systemctl reboot was invoked from an SSH session and hung at
  nvidia_drm module teardown for >5 minutes without progress.
- Recovery required holding the hardware power button.
The system was otherwise functional throughout: SSH stayed up, the
Wayland compositor's main thread was alive in do_epoll_wait, no
processes were in D-state. The wedge is entirely below nvidia_drm.
What I have ruled out
- Hardware fault on the GPU. This 3090 had been stable for many
  months on the previous linux 6.19.11 + nvidia-open 595.58.03 stack
  with the same workload. After the hardware power-cycle, the system
  came up cleanly on the same 7.0.3 + 595.71.05 stack and has so far
  been stable.
- Host OOM. 7.6 GiB / 61 GiB host RAM in use at fault time. No swap
  pressure. No oom_reaper activity in the journal. The only
  out-of-memory condition was GPU BAR1 VA space, not host RAM.
- Userspace-only fault. The kernel core's resource sanity check was
  emitted from inside nvidia_drm's __nv_drm_gem_nvkms_map. The
  subsequent Xid 31 MMU fault is a consequence of the bad mapping
  being used. The Brave SIGILL came after the kernel error and looks
  like a downstream consequence of the GPU buffer the renderer
  expected becoming inaccessible.
- DKMS build mismatch / firmware mismatch. DKMS built nvidia-open
  595.71.05 cleanly for both kernels at upgrade time; the modules load
  cleanly; the firmware version matches the driver's expectations
  (linux-firmware-nvidia 20260410-1).
I cannot yet say which component regressed, and I want to be careful not to
overclaim. The kernel and the driver were both upgraded in the same
transaction, so this could be a bug in nvidia-open's PCI BAR-range
arithmetic, a kernel-side change to the resource validation that nvidia-open
is the first to trip, or a problem in the combination (e.g. a pci_resource_*
semantic change in 7.0.x that nvidia-open has not yet adopted). I have not
yet had the opportunity to bisect.
Open questions
If you (or anyone reading) have seen this signature before, I'd value
pointers on any of:
- Does this reproduce on nvidia-dkms (the proprietary kernel module) at
  595.71.05, holding kernel 7.0.3 fixed?
- Does this reproduce on kernel 6.19.11 with nvidia-open 595.71.05?
- Does disabling Chromium-side GPU acceleration (e.g.
  --disable-gpu-rasterization, --disable-gpu) prevent it on
  7.0.3 + 595.71.05?
- Does the resource sanity check line precede every freeze of this
  form, or are there freezes without it? (I have only this one
  occurrence.)