
Fix EGL context creation on headless NVIDIA (EGL_BAD_ACCESS)#13332

Open
sam-kpm wants to merge 3 commits into Comfy-Org:master from sam-kpm:fix/egl-headless-nvidia-device-enumeration
@sam-kpm sam-kpm commented Apr 9, 2026

Problem

On headless Linux with an NVIDIA GPU and no display server (no $DISPLAY / $WAYLAND_DISPLAY), the GLSL shader node fails with:

RuntimeError: Failed to create OpenGL context.
Backend errors:
  GLFW: glfw.create_window() failed
  EGL: EGLError(err=EGL_BAD_ACCESS, baseOperation=eglInitialize, ...)
  OSMesa: 'GLXPlatform' object has no attribute 'OSMesa'

This is a common setup: cloud VMs, remote GPU servers, and Docker containers with NVIDIA GPUs typically have no display server.

Root cause: eglInitialize(EGL_DEFAULT_DISPLAY) requires a running X or Wayland compositor. On a bare headless system, NVIDIA's EGL returns EGL_BAD_ACCESS. The correct approach for headless GPU rendering is the EGL_EXT_platform_device extension — enumerate EGL devices and obtain a display from a specific device handle.

There are two additional complications:

  1. eglInitialize raises EGLError rather than returning False in some PyOpenGL versions/EGL vendor combinations — the original code only checked the return value.
  2. PyOpenGL's egl_get_devices() wrapper does not reliably resolve the eglQueryDevicesEXT function pointer in headless NVIDIA scenarios, so the fallback must call libEGL.so.1 directly via ctypes.
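
The first complication can be illustrated with a small wrapper that treats a raised exception and a falsy return value as the same failure. This is a minimal sketch, not the PR's code: `initialize_or_false` and the three stand-in functions are hypothetical names, and the broad `except Exception` is an assumption that real code would probably narrow to PyOpenGL's `EGLError`.

```python
def initialize_or_false(egl_initialize, display, major, minor):
    """Return True only if egl_initialize reports success without raising."""
    try:
        return bool(egl_initialize(display, major, minor))
    except Exception:  # assumption: real code may narrow this to EGLError
        return False


# Hypothetical stand-ins for the two failure styles and the success case.
def returns_false(display, major, minor):
    return False


def raises_error(display, major, minor):
    raise RuntimeError("EGL_BAD_ACCESS")


def succeeds(display, major, minor):
    return True


results = [
    initialize_or_false(fn, None, None, None)
    for fn in (returns_false, raises_error, succeeds)
]
```

With a wrapper like this, the caller needs only one failure branch before moving on to the device-enumeration fallback.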

Fix

When eglInitialize(EGL_DEFAULT_DISPLAY) fails (either by returning False or raising EGLError), fall back to device enumeration:

  1. Load eglQueryDevicesEXT and eglGetPlatformDisplayEXT directly from libEGL.so.1 via ctypes (bypassing PyOpenGL's broken wrapper)
  2. Enumerate available EGL devices
  3. Obtain a display via eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT, device, NULL)
  4. Proceed with normal EGL context setup
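
Steps 1 to 3 could look roughly like the sketch below. It is an illustration under assumptions rather than the code in this PR: the helper name `load_device_entry_points` is hypothetical, the library is located with `ctypes.util.find_library("EGL")`, and the entry points are resolved through `eglGetProcAddress`. The constant `0x313F` is the value of `EGL_PLATFORM_DEVICE_EXT` in the `EGL_EXT_platform_device` extension.

```python
import ctypes
import ctypes.util

EGL_PLATFORM_DEVICE_EXT = 0x313F  # from EGL_EXT_platform_device


def load_device_entry_points():
    """Return (query_devices, get_platform_display), or None if unavailable."""
    path = ctypes.util.find_library("EGL")
    if path is None:
        return None
    try:
        egl = ctypes.CDLL(path)
    except OSError:
        return None

    get_proc = getattr(egl, "eglGetProcAddress", None)
    if get_proc is None:
        return None
    get_proc.restype = ctypes.c_void_p
    get_proc.argtypes = [ctypes.c_char_p]

    query_ptr = get_proc(b"eglQueryDevicesEXT")
    display_ptr = get_proc(b"eglGetPlatformDisplayEXT")
    if not query_ptr or not display_ptr:
        return None

    # EGLBoolean eglQueryDevicesEXT(EGLint, EGLDeviceEXT *, EGLint *)
    query_devices = ctypes.CFUNCTYPE(
        ctypes.c_uint32,
        ctypes.c_int32,
        ctypes.POINTER(ctypes.c_void_p),
        ctypes.POINTER(ctypes.c_int32),
    )(query_ptr)
    # EGLDisplay eglGetPlatformDisplayEXT(EGLenum, void *, const EGLint *)
    get_platform_display = ctypes.CFUNCTYPE(
        ctypes.c_void_p,
        ctypes.c_uint32,
        ctypes.c_void_p,
        ctypes.POINTER(ctypes.c_int32),
    )(display_ptr)
    return query_devices, get_platform_display


entry_points = load_device_entry_points()
# With a device handle in hand, step 3 would then be:
#   display = get_platform_display(EGL_PLATFORM_DEVICE_EXT, device, None)
```

The function returns `None` rather than raising when libEGL or either entry point is missing, so a caller can fall through to the next backend. Note that a non-NULL pointer from `eglGetProcAddress` does not by itself guarantee that the extension is supported, so a failed call still has to be handled.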

Testing

Verified on Ubuntu 24.04, NVIDIA driver 580.65.06, no display server, using the built-in GLSL shader node.

On headless Linux with NVIDIA GPUs and no display server, eglInitialize()
with EGL_DEFAULT_DISPLAY fails with EGL_BAD_ACCESS. The fix falls back to
EGL_EXT_platform_device: enumerate EGL devices and obtain a display via
eglGetPlatformDisplayEXT(EGL_PLATFORM_DEVICE_EXT, ...).

PyOpenGL's egl_get_devices() wrapper doesn't reliably resolve the
eglQueryDevicesEXT function pointer in this scenario, so both functions
are called directly from libEGL.so.1 via ctypes.

Also handles the case where eglInitialize raises EGLError rather than
returning False, which varies by PyOpenGL version and EGL vendor.
coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Added ctypes and a new helper _egl_device_display(eglInitialize) that uses eglGetProcAddress to load eglQueryDevicesEXT and eglGetPlatformDisplayEXT from libEGL, enumerates EGL devices, obtains an EGLDisplay per device via EGL_EXT_platform_device, and attempts eglInitialize on each device, returning the first successful (display, major, minor). _init_egl() now tries eglGetDisplay(EGL_DEFAULT_DISPLAY) and eglInitialize first but treats failures as non‑fatal and falls back to _egl_device_display; debug logging was added for missing entry points, empty device lists, and per‑device init outcomes.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main fix: resolving EGL context creation failures on headless NVIDIA systems with the EGL_BAD_ACCESS error. |
| Description check | ✅ Passed | The description comprehensively explains the problem, root cause, solution approach, and testing verification, all directly related to the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
comfy_extras/nodes_glsl.py (1)

256-259: Consider using c_uint32 instead of c_bool for the EGLboolean return type of eglQueryDevicesEXT.

EGLboolean is defined as unsigned int (32-bit) in the EGL specification, whereas ctypes.c_bool maps to C99 _Bool (typically 1 byte). While this works in practice due to calling conventions, using c_uint32 is more semantically correct and matches the actual EGL header definition.

🔧 Suggested fix
             _query_devices = ctypes.CFUNCTYPE(
-                ctypes.c_bool,
+                ctypes.c_uint32,
                 ctypes.c_int32, ctypes.POINTER(ctypes.c_void_p), ctypes.POINTER(ctypes.c_int32),
             )(_query_devices_ptr)
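
The size difference behind this nitpick can be checked directly with `ctypes.sizeof`. The snippet below is only a quick illustration; the variable names are made up for the example.

```python
import ctypes

# c_bool maps to C99 _Bool; c_uint32 matches EGLBoolean (unsigned int).
bool_size = ctypes.sizeof(ctypes.c_bool)
uint32_size = ctypes.sizeof(ctypes.c_uint32)

print(f"c_bool: {bool_size} byte(s), c_uint32: {uint32_size} byte(s)")
```

Declaring the return type as `c_uint32` makes ctypes read the full 32-bit return value, which is the width the EGL headers declare.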
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_glsl.py` around lines 256 - 259, The EGL function wrapper
_query_devices currently uses ctypes.c_bool for the EGLboolean return, but
EGLboolean is a 32-bit unsigned int; update the CFUNCTYPE signature for
_query_devices (and any similar wrappers like eglQueryDevicesEXT) to use
ctypes.c_uint32 as the return type instead of ctypes.c_bool so the ctypes
signature matches the EGL header and avoids size/mapping mismatches.


📥 Commits

Reviewing files that changed from the base of the PR and between b615af1 and 8e0558c.

📒 Files selected for processing (1)
  • comfy_extras/nodes_glsl.py

- Extract device enumeration into _egl_device_display() helper
- Use ctypes.util.find_library("EGL") instead of hardcoded libEGL.so.1
- Fix eglGetDisplay(EGL_DEFAULT_DISPLAY) failure also falling through to
  device enumeration (previously raised immediately, skipping the fallback)
- Two-pass eglQueryDevicesEXT to avoid arbitrary device count cap
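
The two-pass pattern mentioned in the last bullet can be sketched as follows: the first call passes a NULL array to learn the device count, and the second call fills an array of exactly that size. This is a hedged illustration, not the PR's helper. `enumerate_devices` is a hypothetical name, and `_fake_query` is a stand-in for the real `eglQueryDevicesEXT` so the example runs without a GPU.

```python
import ctypes

# Prototype matching eglQueryDevicesEXT(EGLint, EGLDeviceEXT *, EGLint *).
QueryDevicesFn = ctypes.CFUNCTYPE(
    ctypes.c_uint32,
    ctypes.c_int32,
    ctypes.POINTER(ctypes.c_void_p),
    ctypes.POINTER(ctypes.c_int32),
)


def enumerate_devices(query_devices):
    """Return all device handles using a count pass and a fill pass."""
    count = ctypes.c_int32(0)
    # Pass 1: NULL array, only the count is written.
    if not query_devices(0, None, ctypes.byref(count)) or count.value <= 0:
        return []
    devices = (ctypes.c_void_p * count.value)()
    # Pass 2: array sized from pass 1, so no arbitrary cap is needed.
    if not query_devices(count.value, devices, ctypes.byref(count)):
        return []
    return [devices[i] for i in range(count.value)]


def _fake_query(max_devices, devices, num_devices):
    """Stand-in for the driver: reports two made-up device handles."""
    handles = [0x1000, 0x2000]
    if devices:  # non-NULL array: fill pass
        n = min(max_devices, len(handles))
        for i in range(n):
            devices[i] = handles[i]
        num_devices[0] = n
    else:  # NULL array: count pass
        num_devices[0] = len(handles)
    return 1


fake_query = QueryDevicesFn(_fake_query)
found = enumerate_devices(fake_query)
```

In real use, `query_devices` would be the function pointer resolved from libEGL instead of `fake_query`.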

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_glsl.py`:
- Around line 245-254: The current code uses raw_devices[0] and only attempts
the first EGL device; change this to iterate over raw_devices and try each
device in turn by calling _get_platform_display(EGL_PLATFORM_DEVICE_EXT, device,
None), casting result to ctypes.c_void_p as display, then calling
eglInitialize(display, major, minor) for each until one returns true; on first
successful eglInitialize stop and use that display, and if none succeed raise a
RuntimeError indicating initialization failed for all enumerated devices
(include device info if available) — update references in this block for
EGL_PLATFORM_DEVICE_EXT, _get_platform_display, raw_devices, display, and
eglInitialize.
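
The per-device loop this comment asks for might be structured like the sketch below. It is written against injected callables so it runs without EGL. `first_working_display`, `fake_displays`, and `fake_initialize` are hypothetical names, and raising `RuntimeError` when every device fails follows the suggestion above rather than the code that was finally committed.

```python
def first_working_display(devices, get_display, initialize):
    """Return (index, display) for the first device that initializes."""
    failures = []
    for index, device in enumerate(devices):
        display = get_display(device)
        if not display:
            failures.append(f"device {index}: no display")
            continue
        try:
            if initialize(display):
                return index, display
            failures.append(f"device {index}: eglInitialize returned false")
        except Exception as exc:  # assumption: real code may catch EGLError
            failures.append(f"device {index}: {type(exc).__name__}: {exc}")
    raise RuntimeError(
        "EGL initialization failed for all devices: " + "; ".join(failures)
    )


# Stand-ins: dev0 yields no display, dev1 raises on init, dev2 works.
fake_displays = {"dev0": None, "dev1": "disp1", "dev2": "disp2"}


def fake_initialize(display):
    if display == "disp1":
        raise OSError("busy")
    return True


chosen = first_working_display(
    ["dev0", "dev1", "dev2"], fake_displays.get, fake_initialize
)
```

Collecting a reason per skipped device keeps the final error message useful when no device can be initialized.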

📥 Commits

Reviewing files that changed from the base of the PR and between 8e0558c and e24d0f0.

📒 Files selected for processing (1)
  • comfy_extras/nodes_glsl.py

- Use c_uint32 for EGLboolean return type (unsigned int per EGL spec, not _Bool)
- Try all enumerated EGL devices in order rather than only the first;
  skip devices where eglGetPlatformDisplayEXT or eglInitialize fails

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
comfy_extras/nodes_glsl.py (1)

200-264: Well-implemented EGL device enumeration fallback.

The implementation correctly addresses the headless NVIDIA issue by:

  • Loading EGL extensions directly via ctypes when PyOpenGL's wrapper is unreliable
  • Using c_uint32 for EGLboolean return type per EGL spec
  • Iterating all enumerated devices rather than just the first (addresses the prior review feedback)

One minor suggestion: the exception handler at lines 260-261 silently discards the exception. Logging it at debug level would help diagnose edge cases where eglInitialize raises instead of returning False.

Optional: Log caught exception for debugging
         try:
             if eglInitialize(display, major, minor):
                 logger.debug(f"_egl_device_display: device {i} succeeded, EGL version {major.value}.{minor.value}")
                 return display, major, minor
-        except Exception:
-            pass
+        except Exception as e:
+            logger.debug(f"_egl_device_display: device {i} eglInitialize raised {type(e).__name__}: {e}")
         logger.debug(f"_egl_device_display: device {i} eglInitialize failed, skipping")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_glsl.py` around lines 200 - 264, In _egl_device_display,
don't silently swallow exceptions from the eglInitialize call; change the except
block that currently catches Exception and passes to log the exception at debug
level (include the device index and exception info) so you can diagnose failures
where eglInitialize raises instead of returning False; update the except
Exception handler around the call to eglInitialize(display, major, minor) to
capture the exception as e and call logger.debug (or logger.debug(...,
exc_info=True)) with a short message referencing the device index and the
exception.


📥 Commits

Reviewing files that changed from the base of the PR and between e24d0f0 and 9e28569.

📒 Files selected for processing (1)
  • comfy_extras/nodes_glsl.py
