GLSLShader refactoring: singleton factory architecture #12587
Lex-DRL wants to merge 15 commits into Comfy-Org:master

Conversation
📝 Walkthrough

The PR refactors OpenGL handling in comfy_extras/nodes_glsl.py by introducing a singleton GLContext that selects a concrete backend (fallback order: GLFW → EGL → OSMesa) at first use. OpenGL imports occur after context creation, and a VAO is created during init. Backend-specific classes implement …

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3

Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

comfy_extras/nodes_glsl.py (1)

251-264: ⚠️ Potential issue | 🔴 Critical

NameError if glGenVertexArrays itself raises.

vao is only assigned inside the try body. If glGenVertexArrays throws (e.g., on systems where VAOs are unsupported), the name is never bound and the `if vao:` guard in the except block raises NameError, masking the original exception.

🐛 Proposed fix

```diff
+vao = None
 try:
     vao = gl.glGenVertexArrays(1)
     gl.glBindVertexArray(vao)
```
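The sentinel discipline the fix relies on can be shown in isolation. This is a sketch, not the PR's code; `fake_gen_vertex_arrays`, `fake_delete_vertex_arrays`, and `create_vao` are hypothetical stand-ins for the real GL calls:

```python
# Stand-in for gl.glGenVertexArrays that always fails, to simulate
# a system without VAO support.
def fake_gen_vertex_arrays(n):
    raise RuntimeError("VAOs unsupported on this system")

def fake_delete_vertex_arrays(n, vao):
    pass

def create_vao():
    vao = None  # sentinel: the name is bound even if the call below raises
    try:
        vao = fake_gen_vertex_arrays(1)
        return vao
    except Exception:
        # Without the sentinel, referencing `vao` here would raise
        # NameError and mask the original RuntimeError.
        if vao is not None:
            fake_delete_vertex_arrays(1, vao)
        raise

try:
    create_vao()
except Exception as e:
    caught = type(e).__name__

print(caught)  # RuntimeError
```

The original exception propagates intact; with a truthiness check on an unbound name, the caller would instead see an unrelated NameError.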
```diff
 def __new__(cls):
-    if cls._instance is None:
-        cls._instance = super().__new__(cls)
-    return cls._instance
+    # Since ``GLContext`` is a singleton anyway, we should store it
+    # explicitly in ``GLContext.__instance``, NOT in ``cls.__instance``.
+    if GLContext.__instance is None:
+        GLContext.__instance = GLContext.__new_instance_using_concrete_class_fallback_order()
+        assert isinstance(GLContext.__instance, GLContext)
+    return GLContext.__instance
```
Missing lock on singleton initialization creates a TOCTOU race.

Two concurrent callers can both observe GLContext.__instance is None, both invoke __new_instance_using_concrete_class_fallback_order, and both create a real GL context. The second assignment on line 161 silently overwrites the first, leaking the earlier context and leaving whichever thread received the first instance holding a now-abandoned GL context object.

🔒 Suggested fix: guard with a class-level lock

```diff
+import threading

 class GLContext:
     ...
     __instance: 'GLContext' = None
+    __init_lock: threading.Lock = threading.Lock()

     def __new__(cls):
-        if GLContext.__instance is None:
-            GLContext.__instance = GLContext.__new_instance_using_concrete_class_fallback_order()
-            assert isinstance(GLContext.__instance, GLContext)
-        return GLContext.__instance
+        if GLContext.__instance is None:
+            with GLContext.__init_lock:
+                if GLContext.__instance is None:  # double-checked locking
+                    GLContext.__instance = GLContext.__new_instance_using_concrete_class_fallback_order()
+        return GLContext.__instance
```

The assert can be dropped: it is vacuously true by construction and is silently removed under `python -O`.
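The double-checked locking pattern in the suggested fix can be exercised without any GL context at all. Below is a minimal sketch with a stand-in class; `SlowSingleton` and its artificially slow constructor are illustrative, not part of the PR:

```python
import threading
import time

class SlowSingleton:
    _instance = None
    _init_lock = threading.Lock()
    created = 0  # counts how many real instances were ever built

    def __new__(cls):
        if cls._instance is None:              # fast path: no lock once initialized
            with cls._init_lock:
                if cls._instance is None:      # re-check under the lock
                    time.sleep(0.05)           # widen the race window on purpose
                    obj = super().__new__(cls)
                    cls.created += 1
                    cls._instance = obj
        return cls._instance

results = []
threads = [threading.Thread(target=lambda: results.append(SlowSingleton()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(SlowSingleton.created)               # 1: only one instance was ever constructed
print(all(r is results[0] for r in results))  # True
```

Without the lock, several of the eight threads would typically pass the `None` check during the sleep and each build its own instance, which is exactly the leak the review describes.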
```python
# Pad with black images for unused outputs
black_img = np.zeros((height, width, 4), dtype=np.float32)
for _ in range(num_outputs, MAX_OUTPUTS):
    batch_outputs.append(black_img)
```
Padded outputs within a single batch share the same np.zeros buffer.

All unused output slots in one batch iteration point to the same black_img array. torch.from_numpy shares the buffer rather than copying, so the tensors in all_outputs are aliased until torch.stack materializes them. If any code path between these lines and torch.stack were to write into one of those tensors, it would corrupt the others. Constructing a fresh array per slot (or using .copy()) eliminates the alias.

✨ Suggested fix

```diff
-black_img = np.zeros((height, width, 4), dtype=np.float32)
 for _ in range(num_outputs, MAX_OUTPUTS):
-    batch_outputs.append(black_img)
+    batch_outputs.append(np.zeros((height, width, 4), dtype=np.float32))
```
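The aliasing the review describes is reproducible with numpy alone; the small shapes below are arbitrary stand-ins for the real height/width:

```python
import numpy as np

# Appending the same array object creates aliases, not copies:
black_img = np.zeros((2, 2, 4), dtype=np.float32)
padded = [black_img for _ in range(3)]
padded[0][0, 0, 0] = 1.0      # a write through one "slot"...
print(padded[2][0, 0, 0])     # ...is visible through every other slot: 1.0

# A fresh buffer per slot (or black_img.copy()) breaks the alias:
safe = [np.zeros((2, 2, 4), dtype=np.float32) for _ in range(3)]
safe[0][0, 0, 0] = 1.0
print(safe[2][0, 0, 0])       # 0.0, each slot owns its buffer
```

torch.from_numpy preserves exactly this sharing, since it wraps the numpy buffer without copying, which is why the aliasing survives into the tensor list until torch.stack materializes fresh storage.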
```python
logger.debug("_init_backend_concrete (OSMesa): starting")
os.environ["PYOPENGL_PLATFORM"] = "osmesa"
```
PYOPENGL_PLATFORM is set as a permanent process-wide side effect.
os.environ["PYOPENGL_PLATFORM"] = "osmesa" persists for the lifetime of the process even if OSMesa initialization subsequently fails. Any other component in the process that later imports OpenGL.GL (before or after this node) will inherit the OSMesa platform selection. The effect is intentional here (the env var must precede the OpenGL.GL import), but the lack of cleanup on failure is worth noting — a context manager or try/finally that restores the original value on failure would be safer.
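The save-and-restore behavior the comment asks for can be packaged as a small context manager. This is a sketch under the assumption that the variable should persist on success and be reverted only on failure; `temp_env` is a hypothetical helper, not code from the PR:

```python
import os
from contextlib import contextmanager

@contextmanager
def temp_env(key, value):
    """Set os.environ[key] = value; revert only if the body raises."""
    sentinel = object()
    saved = os.environ.get(key, sentinel)
    os.environ[key] = value
    try:
        yield
    except Exception:
        # Initialization failed: restore the pre-existing state,
        # deleting the key if it was absent before.
        if saved is sentinel:
            os.environ.pop(key, None)
        else:
            os.environ[key] = saved
        raise

before = os.environ.get("PYOPENGL_PLATFORM")
try:
    with temp_env("PYOPENGL_PLATFORM", "osmesa"):
        raise RuntimeError("simulated OSMesa init failure")
except RuntimeError:
    pass

# The failed attempt left no trace:
print(os.environ.get("PYOPENGL_PLATFORM") == before)  # True
```

On the success path the `with` body completes, nothing is reverted, and the env var remains set for the lifetime of the process, matching the intent that the value must precede the OpenGL.GL import.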
christian-byrne left a comment:
Is this going to work with process isolation and async nodes? i.e., each node being in a separate process.
I haven't changed the original behavior in any way. I merely rearranged the existing code. Originally, the backend initialization was done entirely in …

What I did is extract the body of those try-excepts into special class-specific functions, to be called from …

If anything, the initialization became more synchronous, i.e. more thread-safe. Though, as I mentioned, I couldn't fully test it because the node in its current implementation (before refactoring) doesn't work for me.
What I can tell is that the existing implementation seems to be an MVP with a HUUUUUGE room for improvement, so asynchronous execution probably wasn't considered by the original author (@pythongosssss).
Test Evidence Check

If this PR modifies behavior that requires testing, a test explanation is required. PRs lacking applicable test explanations may not be reviewed until added. Please add test explanations to ensure code quality and prevent regressions.

If this PR changes user-facing behavior, visual proof (a screen recording or screenshot) is required. PRs without applicable visual documentation may not be reviewed until provided. You can add it by:
♻️ Duplicate comments (1)

comfy_extras/nodes_glsl.py (1)

157-163: ⚠️ Potential issue | 🟠 Major

Protect singleton creation with a lock to avoid double-initialization races.

Two concurrent first callers can both observe GLContext.__instance is None and create competing concrete contexts; one gets overwritten, leaking resources and returning inconsistent instances across threads.

🔒 Suggested fix

```diff
+import threading

 class GLContext:
 @@
     __instance: 'GLContext' = None  # The singleton
+    __init_lock: threading.Lock = threading.Lock()
 @@
     def __new__(cls):
         # Since ``GLContext`` is a singleton anyway, we should store it
         # explicitly in ``GLContext.__instance``, NOT in ``cls.__instance``.
         if GLContext.__instance is None:
-            GLContext.__instance = GLContext.__new_instance_using_concrete_class_fallback_order()
-            assert isinstance(GLContext.__instance, GLContext)
+            with GLContext.__init_lock:
+                if GLContext.__instance is None:
+                    GLContext.__instance = GLContext.__new_instance_using_concrete_class_fallback_order()
         return GLContext.__instance
```
After thinking for a while, I introduced a second class before the PR is reviewed: …

Again, no changes to the underlying behavior. My refactor might look like overkill for now, but the overall goal is to build a foundation for ALL the future OpenGL-based rendering in ComfyUI. I already was working on it as a pet project (though, relying on ModernGL). @pythongosssss just implemented an MVP sooner than me.
Force-pushed from 23b53e0 to eac8703 (Compare)
Actionable comments posted: 1

♻️ Duplicate comments (2)

comfy_extras/nodes_glsl.py (2)

775-778: ⚠️ Potential issue | 🟡 Minor

Padded outputs share one numpy buffer per batch.

All padded slots reference the same black_img, so these tensors are aliased until stacking.

✨ Suggested fix

```diff
-black_img = np.zeros((height, width, 4), dtype=np.float32)
 for _ in range(num_outputs, MAX_OUTPUTS):
-    batch_outputs.append(black_img)
+    batch_outputs.append(np.zeros((height, width, 4), dtype=np.float32))
```
157-163: ⚠️ Potential issue | 🟠 Major

Protect singleton creation with a lock to prevent double context initialization.

Concurrent callers can both pass the None check and build separate contexts; the later assignment overwrites the first and leaks resources.

🔒 Suggested fix

```diff
+import threading
 ...
 class GLContext:
     __instance: 'GLContext' = None  # The singleton
+    __init_lock: threading.Lock = threading.Lock()

     def __new__(cls):
-        if GLContext.__instance is None:
-            GLContext.__instance = GLContext.__new_instance_using_concrete_class_fallback_order()
-            assert isinstance(GLContext.__instance, GLContext)
+        if GLContext.__instance is None:
+            with GLContext.__init_lock:
+                if GLContext.__instance is None:
+                    GLContext.__instance = GLContext.__new_instance_using_concrete_class_fallback_order()
         return GLContext.__instance
```
```python
try:
    instance: GLContext = object.__new__(cls)
    # Since this code is called while in `__new__()`, we need to manually call `__init__()`, too.
    # Otherwise, Python would call it only AFTER `__new__()`, causing init errors outside our try-except check.
    instance.__init__()
    logger.debug(f"GLContext.__init__: {name} backend succeeded. The singleton is: {cls!r}")
    logger.info(f"Concrete GLSL context initialized as: {name}")
    return instance
except Exception as e:
    logger.debug(f"GLContext.__init__: {name} backend failed: {e}")
    errors.append((name, e))
```
Fallback retries can leak partially initialized backend resources.

If backend setup succeeds but a later init step fails, this catch block retries the next backend without tearing down resources already created by the failed instance.

🧹 Suggested fix

```diff
 class GLContext:
+    def _cleanup_backend_concrete(self):
+        """Best-effort cleanup for partially initialized backend state."""
+        pass
     ...
     try:
         instance: GLContext = object.__new__(cls)
         instance.__init__()
         logger.debug(f"GLContext.__init__: {name} backend succeeded. The singleton is: {cls!r}")
         logger.info(f"Concrete GLSL context initialized as: {name}")
         return instance
     except Exception as e:
+        try:
+            instance._cleanup_backend_concrete()
+        except Exception as cleanup_err:
+            logger.debug(f"GLContext.__init__: cleanup after {name} failure also failed: {cleanup_err}")
         logger.debug(f"GLContext.__init__: {name} backend failed: {e}")
         errors.append((name, e))
```

Implement _cleanup_backend_concrete() in each concrete backend (_GLContextGLFW, _GLContextEGL, _GLContextOSMesa) to release created resources.
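The fallback-with-cleanup shape of the suggested fix can be sketched with dummy backends. Every name here (`FailingBackend`, `WorkingBackend`, `create_with_fallback`) is illustrative, not the PR's actual code:

```python
import logging

logger = logging.getLogger("glsl-fallback-demo")
cleanup_calls = []  # records which backends got cleaned up

class FailingBackend:
    """Allocates a fake resource, then fails mid-init."""
    def __init__(self):
        self.resource = "fake-gl-context"   # allocated before the failure
        raise RuntimeError("init failed after allocation")

    def _cleanup_backend_concrete(self):
        cleanup_calls.append(self.resource)

class WorkingBackend:
    def __init__(self):
        self.resource = "real-gl-context"

    def _cleanup_backend_concrete(self):
        cleanup_calls.append(self.resource)

def create_with_fallback(backends):
    errors = []
    for name, cls in backends:
        instance = object.__new__(cls)
        try:
            instance.__init__()  # called manually, mirroring the PR's __new__ flow
            return instance
        except Exception as e:
            # Tear down whatever the failed backend already allocated
            # before trying the next one.
            try:
                instance._cleanup_backend_concrete()
            except Exception as cleanup_err:
                logger.debug("cleanup after %s also failed: %s", name, cleanup_err)
            errors.append((name, e))
    raise RuntimeError(f"all backends failed: {errors}")

ctx = create_with_fallback([("failing", FailingBackend), ("working", WorkingBackend)])
print(cleanup_calls)            # ['fake-gl-context']: the failed backend was torn down
print(type(ctx).__name__)       # WorkingBackend
```

The key point is that `self.resource` exists on the partially constructed instance even though `__init__` raised, so the cleanup hook can still release it.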
Hi, thanks for this. I'll give it a more thorough look over; however, we're currently considering switching the backend system to use ANGLE (https://github.com/google/angle) as a single unified backend instead of needing multiple fallbacks (note this is not the fixed direction, it is currently in exploration), so at present I think it might be worth pausing on any additional changes to this. If you have any other thoughts on that, please let me know. The issue with ModernGL was the current lack of prebuilt support for py3.14.
Is there a discussion thread on Discord on the subject? Could you add me to it? (I'm the one who started this thread.)
The current GLSLShader node implementation is designed (vibe coded?) in such a way that it contains runtime conditional logic for various backends (which is checked at each execution). This causes GLFW/EGL/OSMesa-specific code to be spread across the whole module and mixed together in the same functions, i.e. it relies on globals, is bug-prone, and is hard to maintain/extend in the future (e.g., to add a new OGL backend).

I've refactored the entire module to adhere to the singleton factory pattern instead: GLContext is still a singleton, but it's an "abstract" one (it follows the same approach as WebDriver in Selenium) and works like this: a concrete backend is selected at first use, and all further interaction happens through GLContext instance methods. No changes to the logic itself.
Though, I wasn't able to test it because the current implementation doesn't work for me either: #12584
But at least the error I get is the same as the one I got before refactoring.
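The pattern described above, an abstract singleton whose `__new__` picks a concrete backend subclass on first use, can be sketched roughly like this. The class names and the fake backend probes are illustrative stand-ins, not the PR's actual code:

```python
class Context:
    """Abstract singleton: Context() always returns the same concrete instance."""
    _instance = None

    def __new__(cls):
        if Context._instance is None:
            Context._instance = Context._first_available_backend()
        return Context._instance

    @staticmethod
    def _first_available_backend():
        # Fallback order mirrors the PR: GLFW -> EGL -> OSMesa.
        errors = []
        for backend_cls in (_ContextGLFW, _ContextEGL, _ContextOSMesa):
            try:
                instance = object.__new__(backend_cls)
                instance.__init__()  # called manually: we're still inside __new__
                return instance
            except Exception as e:
                errors.append((backend_cls.__name__, e))
        raise RuntimeError(f"no usable backend: {errors}")

    def render(self):
        raise NotImplementedError  # backend-specific

class _ContextGLFW(Context):
    def __init__(self):
        raise RuntimeError("pretend GLFW is unavailable")

class _ContextEGL(Context):
    def __init__(self):
        self.name = "EGL"  # pretend EGL initialized fine (idempotent on re-init)

    def render(self):
        return f"rendered via {self.name}"

class _ContextOSMesa(Context):
    def __init__(self):
        self.name = "OSMesa"

ctx = Context()
print(ctx.render())          # rendered via EGL
print(ctx is Context())      # True: same instance every time
```

Callers never name a concrete backend; they just construct `Context()` and use its instance methods, which is what lets a new backend be added by writing one subclass and registering it in the fallback tuple.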