[DO-NOT-MERGE] Diagnostic: JS-progress watchdog for UnitTests hangs#1681
bghgary wants to merge 103 commits into BabylonJS:master from
Conversation
- Add Tests.ExternalTexture.Render.cpp: end-to-end test that renders a texture array through a ShaderMaterial to an external render target, verifying each slice (red, green, blue) via pixel readback.
- Add tests.externalTexture.render.ts: JS test with sampler2DArray shader.
- Add RenderDoc.h/cpp to UnitTests for optional GPU capture support.
- Add Utils helpers: CreateTestTextureArrayWithData, CreateRenderTargetTexture, ReadBackRenderTarget, DestroyRenderTargetTexture (D3D11, Metal, stubs for D3D12/OpenGL).
- Fix ExternalTexture_Shared.h: pass m_impl->NumLayers() instead of hardcoded 1 in Attach(), preserving texture array metadata.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove layerIndex parameter from Impl::Update declaration to match the updated signature in ExternalTexture_Shared.h. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate HeadlessScreenshotApp, StyleTransferApp, and PrecompiledShaderTest from AddToContextAsync (promise-based) to CreateForJavaScript (synchronous). This removes the extra frame pump and promise callbacks that were previously required. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llers

- Move RenderDoc.h/cpp to D3D11-only CMake block with HAS_RENDERDOC define
- Guard RenderDoc calls with HAS_RENDERDOC instead of WIN32
- Update StyleTransferApp and PrecompiledShaderTest to use CreateForJavaScript

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On OpenGL, TextureT is unsigned int (not a pointer), so reinterpret_cast fails. Add NativeHandleToUintPtr helper using if constexpr to handle both pointer types (D3D11/Metal/D3D12) and integer types (OpenGL). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
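A minimal sketch of the if constexpr dispatch this commit describes (the template shape and the assert-based usage are assumptions based on the commit text, not the repository's actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// TextureT is a pointer on D3D11/Metal/D3D12 but an unsigned integer handle
// on OpenGL, so a single reinterpret_cast cannot cover both cases.
// if constexpr selects the legal conversion at compile time; the dead branch
// is never instantiated, so each backend compiles cleanly.
template<typename TextureT>
uintptr_t NativeHandleToUintPtr(TextureT handle)
{
    if constexpr (std::is_pointer_v<TextureT>)
    {
        return reinterpret_cast<uintptr_t>(handle); // pointer handle (D3D11/Metal/D3D12)
    }
    else
    {
        static_assert(std::is_integral_v<TextureT>, "expected pointer or integer handle");
        return static_cast<uintptr_t>(handle);      // integer handle (OpenGL GLuint)
    }
}
```

Both instantiations resolve at compile time, so there is no runtime branch in the generated code for either backend.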
- RenderDoc.h/cpp now accept void* device instead of ID3D11Device*
- Move RenderDoc to WIN32 block (not D3D11-only) since it works with any API
- Fix OpenGL build: use NativeHandleToUintPtr helper for TextureT cast
- Add Linux support (dlopen librenderdoc.so)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move DestroyTestTexture after FinishRenderingCurrentFrame so bgfx::frame() processes the texture creation command before the native resource is released. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ensure the JS startup dispatch completes before calling deviceUpdate.Finish() and FinishRenderingCurrentFrame(). This guarantees that bgfx::frame() processes the CreateForJavaScript texture creation command, making the texture available for subsequent render frames. The old async API had an implicit sync point (addToContext.wait) that the new sync API lost. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eation

- Rename CreateForJavaScriptWithTextureArray to CreateForJavaScript and use arraySize=1 since texture array rendering is covered by RenderTextureArray. The old test crashed on CI (STATUS_BREAKPOINT in bgfx when creating texture arrays via encoder on WARP).
- Revert two-step create+override approach back to single createTexture2D call with _external parameter (overrideInternal from JS thread doesn't work since the D3D11 resource isn't created until bgfx::frame).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CreateForJavaScript already exists in Tests.ExternalTexture.cpp, causing a linker duplicate symbol error. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… test)

The D3D11-specific CreateForJavaScript test crashed on CI due to bgfx assertions when calling createTexture2D with _external on the encoder thread. The cross-platform CreateForJavaScript test in Tests.ExternalTexture.cpp already covers this functionality. The texture array rendering is covered by RenderTextureArray.

Also revert app startup ordering to Finish->Wait (matching the pattern used by HeadlessScreenshotApp on master).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bgfx callback's fatal() handler was silently calling debugBreak() on DebugCheck assertions with no output, making CI crashes impossible to diagnose. Now logs the file, line, error code and message to stderr before breaking, so the assertion details appear in CI logs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
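For illustration, a hedged sketch of what the logging fatal handler could look like. The parameter list mirrors bgfx::CallbackI::fatal (file path, line, error code, message); FormatFatal and OnFatal are invented names for this sketch, not the PR's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Build the diagnostic line so the formatting can be tested in isolation.
std::string FormatFatal(const char* filePath, uint16_t line, int code, const char* str)
{
    char buffer[512];
    std::snprintf(buffer, sizeof(buffer), "[bgfx fatal] %s(%u): code=%d %s",
                  filePath, static_cast<unsigned>(line), code, str);
    return buffer;
}

// Sketch of the callback body: log to stderr first, so the assertion details
// reach CI logs even when the subsequent debug break kills the process.
void OnFatal(const char* filePath, uint16_t line, int code, const char* str)
{
    std::fprintf(stderr, "%s\n", FormatFatal(filePath, line, code, str).c_str());
    // debugBreak() / abort would follow here, as before.
}
```

Writing to stderr (unbuffered) rather than stdout avoids losing the line when the process is terminated by the debug break.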
Add DISM/d3dconfig step to CI to enable D3D debug layer, which will provide detailed D3D11 validation messages for the CreateShaderResourceView E_INVALIDARG failure. Kept the _external createTexture2D path (reverted the AfterRenderScheduler approach) so we can see the actual D3D debug output that explains the SRV mismatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bgfx _external texture path triggers E_INVALIDARG in CreateShaderResourceView on CI's WARP D3D11 driver. The overrideInternal alternative doesn't support full array textures (hardcodes ArraySize=1). Since the _external path works on real GPUs, skip the render test on CI via BABYLON_NATIVE_SKIP_RENDER_TESTS and keep the direct _external path. Also adds D3D debug layer enablement to CI for future diagnostics, and logs bgfx fatal errors to stderr before crashing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from _external parameter (crashes on WARP) to create+overrideInternal two-step approach. The overrideInternal path is compatible with WARP but sets ArraySize=1 in the SRV, so the RenderTextureArray test (which needs full array access) is skipped on CI. The render test works on real GPUs where the _external path succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The overrideInternal call fires on AfterRenderScheduler after the first bgfx::frame(). An additional frame pump ensures the native texture backing is applied before the scene render. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove deleted Tests.ExternalTexture.D3D11.cpp from Install/Test/CMakeLists.txt
- Add extra frame pump after CreateForJavaScript in HeadlessScreenshotApp and StyleTransferApp so overrideInternal has time to apply the native texture backing before the first render.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er FrameCompletionScope acquisition.
…ompletionScope instead of deferred member release.
This reverts commit 0d315bb.
…cquired FrameCompletionScope in SubmitCommands.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…x CI"

The DRI3/LIBGL hypothesis was wrong. The 4-30x runtime variance is fixed by 140894b (hold frame gate open during pump wait); LIBGL_ALWAYS_SOFTWARE is a no-op for the actual root cause (FrameCompletionScope scheduler race) and adds confusion. CI matrix kept trimmed to Linux JSC for now (separate decision).

[Created by Copilot on behalf of @bghgary]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
[Responded by Copilot on behalf of @bghgary] Correction — earlier root-cause comment was wrong. The DRI3 / LIBGL hypothesis was not the cause.
Actual root cause

The unit-test pump was changed to:

    while (!ready) {
        Start;          // open gate, notify_all
        Finish;         // wait scopes==0, close gate, bgfx::frame()
        wait_for(16ms);
    }

The window between Start's notify_all and the pump re-entering Finish is effectively zero, so a JS thread waking to land its FrameCompletionScope must win a scheduler race before the gate closes again.

Fix

Invert the pump so the gate stays open during the wait:

    Start;
    while (true) {
        wait_for(16ms); // gate OPEN here, JS thread can submit freely
        if (ready) break;
        Finish;
        Start;
    }

Validation

Applied the fix on bghgary/unittest-watchdog (140894b): all jobs ran the Unit Tests step in 4 s, total job duration 352–393 s — back in master baseline range. Variance gone.

Open question

Should the fix live in the test pump only (committed), or also propose a change to the …

Reverted the unrelated …
The unit-test pump has been calling Start; Finish back-to-back per loop iteration with the wait_for AFTER Finish, which collapses the FrameCompletionScope gate's open window to ~zero. With the new shared-gate model on rework-thread-model, JS-thread SubmitCommands (and similar) acquire a scope that blocks on the gate's CV; whenever they wake on notify_all from Start, they have to win a scheduler race against the pump thread re-entering Finish to land their scope before the gate closes. On contended Linux CI runners that race is lost frequently, accumulating 16ms+ per missed acquisition into 30+ minute test runtimes (vs ~5 minutes on master).

Restructure the pump to follow the "Start ASAP / Finish; Start tick / final Finish" pattern: open the frame immediately after device creation so the JS thread can submit at any time, then in the pump loop just tick bgfx (Finish; Start) once per iteration, and rely on the existing trailing Finish (after nativeCanvas.reset()) to close the gate at shutdown. Drops the dead Start; Finish priming pair and the redundant post-loop Start.

Validated locally (Win32 RelWithDebInfo): UnitTests --gtest_filter=JavaScript.* passes in 2.4s with all 24 sub-tests green. Validated on bghgary fork CI (4 parallel runs x 2 jobs, Linux JSC, prior version of this fix on bghgary/unittest-watchdog 140894b): all 8 jobs ran the Unit Tests step in 4 seconds consistently with total job duration 352-393s — back in master baseline range.

[Created by Copilot on behalf of @bghgary]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…abylonJS#1646

In BabylonJS#1674's combination of rework-thread-model (BabylonJS#1652) and the ExternalTexture sync API (BabylonJS#1646), three apps swapped the original async AddToContextAsync flow for synchronous CreateForJavaScript but kept the two-frame skeleton from the async version. The result is that the loader.Dispatch which runs the JS startup() callback now gets queued in frame 1 even though its observable wait (startup.get_future().wait()) is in frame 2 — the dispatch can land in either frame depending on JS-thread scheduling, breaking the "each Start/Finish pair wraps a phase of work" pattern. On rework-thread-model the JS thread will block on the closed gate between frames 1 and 2 anyway, so it functionally works, but the ordering is wrong.

Move the dispatch into the second frame so the texture creation, startup() call, and the wait that observes them all run inside the same frame:

    frame 1: load scripts (no dispatch)
    frame 2: dispatch[startup] -> wait
    frame 3+: per-asset / render-loop

Three apps had this issue; UnitTests' ExternalTexture tests already do it correctly. On bare rework-thread-model these apps still use AddToContextAsync and don't have the bug.

Files:
- Apps/HeadlessScreenshotApp/Win32/App.cpp
- Apps/PrecompiledShaderTest/Source/App.cpp
- Apps/StyleTransferApp/Win32/App.cpp

Local build verification (Win32 RelWithDebInfo):
- HeadlessScreenshotApp: builds cleanly
- StyleTransferApp: builds cleanly
- PrecompiledShaderTest: not configured in this Build/Win32; change is structurally identical to HeadlessScreenshotApp which compiled.

[Created by Copilot on behalf of @bghgary]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	Apps/UnitTests/Source/Tests.JavaScript.cpp
Resolves four real conflicts from the integration of master's recent canvas/blit fixes (BabylonJS#1683) with the threading-model rework (BabylonJS#1652):

* Core/Graphics/InternalInclude/Babylon/Graphics/DeviceContext.h: keep rework's AcquireNewViewId() (no encoder param), add master's PeekNextViewId() const.
* Core/Graphics/Source/DeviceImpl.h: same — keep rework's AcquireNewViewId / IncrementPendingFrameScopes / DecrementPendingFrameScopes / SetActiveEncoder / GetActiveEncoder, add master's PeekNextViewId() const.
* Plugins/NativeEngine/Source/NativeEngine.cpp::CopyTexture: adopt master's BabylonJS#1683 fix structurally — fetch the encoder at the top of the function, schedule the blit on PeekNextViewId() so it runs after every view used so far (matching bgfx's blit-in-numeric-view-order semantics). Substitute rework's GetEncoder() member for master's GetUpdateToken().GetEncoder() since UpdateToken is a no-op shim on rework.
* Polyfills/Canvas/Source/Context.cpp::Flush: keep both — master's EnsureFontsLoaded() at top (CPU-only nanovg font registration, no GPU/encoder dependency), then rework's optional FrameCompletionScope acquisition for callers that hit Flush outside a frame cycle, then GetActiveEncoder + discard.

Local validation: Win32 RelWithDebInfo UnitTests --gtest_filter=JavaScript.* passes in 2.3s, all 24 sub-tests green.

[Created by Copilot on behalf of @bghgary]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The X11 Validation Tests flake on Linux JSC was caused by a gap in FrameCompletionScope coverage between two scope acquisitions in one logical JS-side update. SubmitCommands acquired a scope locally, processed its command stream, and released the scope when the function returned. If a single JS task made multiple submit calls (or did any non-bgfx work between them), the scope count dropped to 0 between calls. A concurrent FinishRenderingCurrentFrame on the render thread could then proceed through the !m_pendingFrameScopes wait, close the gate, and run bgfx::frame() with only a partial scene submitted -- producing the empty/clear-color screenshots observed in the flake's pixel diffs.

Fix: capture the scope into an m_runtime.Dispatch lambda so it survives to the end of the current JS-thread task and is released when the runtime services its queue next. Continuation work in the same JS frame sees count > 0 and remains protected from concurrent Finish. Same shape as the existing RAF dispatcher's "prevent_frame" pattern, applied at the SubmitCommands funnel which is reached by every render path.

Validated locally on Win32 RelWithDebInfo (UnitTests --gtest_filter=JavaScript.* passes in 3.3s, all 24 sub-tests green).

[Created by Copilot on behalf of @bghgary]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fresh docs

The std::move(scope) capture in 7f98c7c produced a move-only lambda. Dispatchable at the AppRuntime layer is move-only and accepts it, but JsRuntime downstream type-erases the callable through std::function, which requires copy-constructibility:

    std_function.h:439:18: error: static assertion failed due to requirement
    'is_copy_constructible<...>::value': std::function target must be copy-constructible

MSVC accepted this; gcc 14 (Ubuntu CI) rejects it. Wrap the scope in shared_ptr so the lambda is copyable, matching the existing RAF dispatcher pattern in ScheduleRequestAnimationFrameCallbacks.

Also refresh the comments on FrameCompletionScope and AcquireFrameCompletionScope. The previous class-level comment listed three usage patterns, but one of them ("NativeEngine::GetEncoder() acquires lazily") didn't match the code — GetEncoder doesn't acquire a scope. Replace with the two patterns that are actually in use:

1. JS-frame scoped via m_runtime.Dispatch capture (RAF, SubmitCommands)
2. Block scoped on the stack (Canvas::Flush fallback, ReadTextureAsync)

[Created by Copilot on behalf of @bghgary]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
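The constraint is easy to reproduce in isolation. In this minimal sketch (FrameScope and MakeDispatchable are illustrative stand-ins, not the PR's code), a lambda capturing a shared_ptr stays copyable, so std::function accepts it, while the underlying scope is still released exactly once when the last copy goes away:

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Stand-in for FrameCompletionScope: the resource whose lifetime must span
// the JS-thread task that queued it.
struct FrameScope { };

std::function<void()> MakeDispatchable(std::shared_ptr<FrameScope> scope)
{
    // A move-only capture like [s = std::move(uniquePtrScope)] would make the
    // lambda non-copyable, and std::function's target must be
    // copy-constructible (gcc enforces this via static_assert).
    // Capturing a shared_ptr by value keeps the lambda copyable; the scope is
    // destroyed only when the last queued copy is dropped.
    return [scope]() { /* scope stays alive until the queue drops the lambda */ };
}
```

The copy made when the dispatcher stores or forwards the callable bumps the reference count instead of failing to compile, which is exactly the RAF-dispatcher shape the commit mirrors.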
Schedule the FrameCompletionScope-releasing lambda before the command stream is processed, not after. Equivalent in the success path, but makes the JS-frame coverage exception-safe — if the command processing throws, the lambda is already queued and the scope survives to the next dispatch cycle, instead of being destroyed at the throw and dropping the count to 0 on the render thread's view. Also drops the explicit local in favour of an inline make_shared in the lambda capture — the local was only there to be captured. [Created by Copilot on behalf of @bghgary] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Status: DO-NOT-MERGE — diagnostic only
[Created by Copilot on behalf of @bghgary]
Verdict
Heartbeat-fed watchdog confirms: slow progress on Ubuntu llvmpipe JSC, not a deadlock.
Run 25026066984: 27✅ 1❌. The lone ❌ is Ubuntu_GCC_JSC failing at Playground rendering with a pixel-diff (First pixel off at 108872 ... Pixel difference: 17308 pixels) — a pre-existing render flake that aborted the job before UnitTests ran. Unrelated to the watchdog.

Ubuntu_Clang_JSC (the previously-failing job) passed JavaScript.All in 141.5 s with 3 PASSED, 4 SKIPPED (ExternalTexture). No watchdog abort.

WSL local run: 209 s with [watchdog-heartbeat] pump_frames=... lines visible between mocha output gaps.

Why the previous "60 → 600 s default" wasn't enough
The earlier run on this branch hit a 600-s console-output gap in Materials/Empty-ShaderMaterial and dumped a core. Post-mortem showed gtest in wait_for(16ms), all bgfx threads idle, JS thread blocked in IncrementPendingFrameScopes, DeviceImpl {m_frameBlocked=true, m_pendingFrameScopes=0, m_rendering=false, m_firstFrameStarted=true} — the same quiescent inter-frame snapshot you'd see in either slow-progress or deadlock.

A single core dump can't distinguish the two: the gtest pump loop spends almost all its wall time in wait_for(16ms), so any random-time capture lands there in either case. We needed a tiebreaker — does the render thread actually keep ticking?

Mechanism (this PR)
Two-source watchdog:

- g_pumpFrameCount (atomic), incremented after each Start/Finish cycle in the gtest pump loop.
- Watchdog thread polls the atomic every 1 s. When it advances, the "last activity" timestamp resets — so a slow-but-progressing test never trips. Every 30 s, the watchdog emits [watchdog-heartbeat] pump_frames=N so we can see progress in the log even during shader-compile silence. If both sources stall for the timeout window, abort with a message that names both.

This means a real deadlock — render thread stuck in bgfx::frame() or FinishRenderingCurrentFrame() — would now trip cleanly, and slow Materials shader compiles do not.

Default restored to 60 s since false positives are eliminated.
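The reset-on-progress rule above can be sketched deterministically. In this hedged model (PumpWatchdog and its members are illustrative names, not the PR's actual code), the clock is injected so the decision logic can be exercised without a 1 s polling thread:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Model of the watchdog decision rule: if the pump-frame counter advanced
// since the last poll, record the progress and reset the "last activity"
// timestamp; only report a stall when the counter has stayed flat for longer
// than the timeout window.
class PumpWatchdog
{
public:
    explicit PumpWatchdog(std::chrono::seconds timeout) : m_timeout{timeout} {}

    // Returns true when the watchdog should abort (no progress for > timeout).
    bool Check(uint64_t pumpFrameCount, std::chrono::steady_clock::time_point now)
    {
        if (pumpFrameCount != m_lastCount)
        {
            m_lastCount = pumpFrameCount; // progress observed: reset the timer
            m_lastActivity = now;
            return false;
        }
        return now - m_lastActivity > m_timeout; // flat counter past the window
    }

private:
    std::chrono::seconds m_timeout;
    uint64_t m_lastCount{0};
    std::chrono::steady_clock::time_point m_lastActivity{std::chrono::steady_clock::now()};
};
```

The real watchdog additionally prints the [watchdog-heartbeat] line every 30 s and, on abort, emits a message naming both stalled sources.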
Where it changes
Apps/UnitTests/Source/Tests.JavaScript.cpp — atomic + watchdog dual-source logic + pump-loop increment.

Follow-ups (other PRs / repos, not this one)
The slow Materials/* shader compiles are real and consume ~20–30% of the JSC job budget. See [DO-NOT-MERGE] Partner integration test: #1652 + #1646 #1674 for partner-side discussion. Possible mitigations: skip the slowest cases on llvmpipe, bump the JSC job timeout, or investigate bgfx-side llvmpipe compile cost.