
[DO-NOT-MERGE] Diagnostic: JS-progress watchdog for UnitTests hangs#1681

Draft
bghgary wants to merge 103 commits into BabylonJS:master from bghgary:unittest-watchdog

Conversation


@bghgary bghgary commented Apr 27, 2026

Status: DO-NOT-MERGE — diagnostic only

[Created by Copilot on behalf of @bghgary]

Verdict

Heartbeat-fed watchdog confirms: slow progress on Ubuntu llvmpipe JSC, not a deadlock.

Run 25026066984: 27 ✅, 1 ❌. The lone ❌ is Ubuntu_GCC_JSC failing at Playground rendering with a pixel diff (First pixel off at 108872 ... Pixel difference: 17308 pixels), a pre-existing render flake that aborted the job before UnitTests ran. It is unrelated to the watchdog.

Ubuntu_Clang_JSC (the previously failing job) passed JavaScript.All in 141.5 s with 3 PASSED, 4 SKIPPED (ExternalTexture) and no watchdog abort. A local WSL run took 209 s, with [watchdog-heartbeat] pump_frames=... lines visible in the gaps between mocha output.

Why the previous "60 → 600 s default" wasn't enough

The earlier run on this branch hit a 600-s console-output gap in Materials/Empty-ShaderMaterial and dumped a core. Post-mortem showed gtest in wait_for(16ms), all bgfx threads idle, JS thread blocked in IncrementPendingFrameScopes, DeviceImpl{m_frameBlocked=true, m_pendingFrameScopes=0, m_rendering=false, m_firstFrameStarted=true} — the same quiescent inter-frame snapshot you'd see in either slow-progress or deadlock.

A single core dump can't distinguish the two: the gtest pump loop spends almost all its wall time in wait_for(16ms), so any random-time capture lands there in either case. We needed a tiebreaker — does the render thread actually keep ticking?

Mechanism (this PR)

Two-source watchdog:

  1. Console output activity (mocha lines on stdout). Existing channel.
  2. Render-thread frame progress. g_pumpFrameCount (atomic), incremented after each Start/Finish cycle in the gtest pump loop.

Watchdog thread polls the atomic every 1 s. When it advances, the "last activity" timestamp resets — so a slow-but-progressing test never trips. Every 30 s, the watchdog emits [watchdog-heartbeat] pump_frames=N so we can see progress in the log even during shader-compile silence. If both sources stall for the timeout window, abort with a message that names both.
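A minimal sketch of the dual-source logic described above. Only g_pumpFrameCount and the [watchdog-heartbeat] line come from this PR; the function name, parameters, and the console-activity plumbing are illustrative assumptions, not the actual implementation in Tests.JavaScript.cpp:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>

// Frame counter from the PR: bumped by the gtest pump loop after each
// Start/Finish cycle.
std::atomic<uint64_t> g_pumpFrameCount{0};

// Sketch of the dual-source watchdog. lastConsoleActivityMs would be updated
// by a stdout capture hook (hypothetical here); the frame counter is the
// second source. Only when BOTH stall past `timeout` does it abort.
void WatchdogLoop(std::atomic<bool>& stop,
                  std::atomic<int64_t>& lastConsoleActivityMs,
                  std::chrono::milliseconds poll,
                  std::chrono::milliseconds heartbeat,
                  std::chrono::milliseconds timeout)
{
    using clock = std::chrono::steady_clock;
    auto nowMs = [] {
        return std::chrono::duration_cast<std::chrono::milliseconds>(
            clock::now().time_since_epoch()).count();
    };

    uint64_t lastFrames = g_pumpFrameCount.load();
    int64_t lastActivity = nowMs();
    int64_t lastHeartbeat = lastActivity;

    while (!stop.load())
    {
        std::this_thread::sleep_for(poll);
        const int64_t t = nowMs();
        const uint64_t frames = g_pumpFrameCount.load();

        // Progress on either source resets the "last activity" clock.
        if (frames != lastFrames || lastConsoleActivityMs.load() > lastActivity)
        {
            lastFrames = frames;
            lastActivity = t;
        }
        if (t - lastHeartbeat >= heartbeat.count())
        {
            std::printf("[watchdog-heartbeat] pump_frames=%llu\n",
                        static_cast<unsigned long long>(frames));
            lastHeartbeat = t;
        }
        if (t - lastActivity >= timeout.count())
        {
            std::fprintf(stderr,
                         "watchdog: no console output and no frame progress "
                         "for %lld ms, aborting\n",
                         static_cast<long long>(timeout.count()));
            std::abort();
        }
    }
}
```

A slow-but-progressing test keeps bumping the counter, so the timeout never fires even while stdout is silent during shader compiles.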

This means a real deadlock (render thread stuck in bgfx::frame() or FinishRenderingCurrentFrame()) would now trip the watchdog cleanly, while slow Materials shader compiles would not.

Default restored to 60 s since false positives are eliminated.

Where it changes

  • Apps/UnitTests/Source/Tests.JavaScript.cpp — atomic + watchdog dual-source logic + pump-loop increment.

Follow-ups (other PRs / repos, not this one)

  • Slow llvmpipe shader compile chains in Babylon Materials/* are real and consume ~20–30% of the JSC job budget. See #1674 ("[DO-NOT-MERGE] Partner integration test: #1652 + #1646") for partner-side discussion. Possible mitigations: skip the slowest cases on llvmpipe, bump the JSC job timeout, or investigate bgfx-side llvmpipe compile cost.

bghgary and others added 30 commits March 24, 2026 22:05
- Add Tests.ExternalTexture.Render.cpp: end-to-end test that renders a
  texture array through a ShaderMaterial to an external render target,
  verifying each slice (red, green, blue) via pixel readback.
- Add tests.externalTexture.render.ts: JS test with sampler2DArray shader.
- Add RenderDoc.h/cpp to UnitTests for optional GPU capture support.
- Add Utils helpers: CreateTestTextureArrayWithData, CreateRenderTargetTexture,
  ReadBackRenderTarget, DestroyRenderTargetTexture (D3D11, Metal, stubs for
  D3D12/OpenGL).
- Fix ExternalTexture_Shared.h: pass m_impl->NumLayers() instead of
  hardcoded 1 in Attach(), preserving texture array metadata.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove layerIndex parameter from Impl::Update declaration to match
the updated signature in ExternalTexture_Shared.h.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate HeadlessScreenshotApp, StyleTransferApp, and PrecompiledShaderTest
from AddToContextAsync (promise-based) to CreateForJavaScript (synchronous).
This removes the extra frame pump and promise callbacks that were previously
required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llers

- Move RenderDoc.h/cpp to D3D11-only CMake block with HAS_RENDERDOC define
- Guard RenderDoc calls with HAS_RENDERDOC instead of WIN32
- Update StyleTransferApp and PrecompiledShaderTest to use CreateForJavaScript

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On OpenGL, TextureT is unsigned int (not a pointer), so reinterpret_cast
fails. Add NativeHandleToUintPtr helper using if constexpr to handle both
pointer types (D3D11/Metal/D3D12) and integer types (OpenGL).
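The commit describes an if constexpr helper for this; a sketch of how such a helper might look (the actual signature in the repo may differ):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Convert a backend texture handle to uintptr_t whether the backend's
// TextureT is a pointer (D3D11/Metal/D3D12) or an unsigned integer (OpenGL).
// reinterpret_cast is ill-formed for integer-to-integer conversions, so
// branch at compile time instead.
template <typename TextureT>
std::uintptr_t NativeHandleToUintPtr(TextureT handle)
{
    if constexpr (std::is_pointer_v<TextureT>)
    {
        return reinterpret_cast<std::uintptr_t>(handle);
    }
    else
    {
        static_assert(std::is_integral_v<TextureT>,
                      "TextureT must be a pointer or an integer handle");
        return static_cast<std::uintptr_t>(handle);
    }
}
```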

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- RenderDoc.h/cpp now accept void* device instead of ID3D11Device*
- Move RenderDoc to WIN32 block (not D3D11-only) since it works with any API
- Fix OpenGL build: use NativeHandleToUintPtr helper for TextureT cast
- Add Linux support (dlopen librenderdoc.so)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move DestroyTestTexture after FinishRenderingCurrentFrame so bgfx::frame()
processes the texture creation command before the native resource is released.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ensure the JS startup dispatch completes before calling
deviceUpdate.Finish() and FinishRenderingCurrentFrame(). This
guarantees that bgfx::frame() processes the CreateForJavaScript
texture creation command, making the texture available for subsequent
render frames. The old async API had an implicit sync point
(addToContext.wait) that the new sync API lost.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eation

- Rename CreateForJavaScriptWithTextureArray to CreateForJavaScript and
  use arraySize=1 since texture array rendering is covered by
  RenderTextureArray. The old test crashed on CI (STATUS_BREAKPOINT in
  bgfx when creating texture arrays via encoder on WARP).
- Revert two-step create+override approach back to single createTexture2D
  call with _external parameter (overrideInternal from JS thread doesn't
  work since the D3D11 resource isn't created until bgfx::frame).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CreateForJavaScript already exists in Tests.ExternalTexture.cpp,
causing a linker duplicate symbol error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… test)

The D3D11-specific CreateForJavaScript test crashed on CI due to
bgfx assertions when calling createTexture2D with _external on
the encoder thread. The cross-platform CreateForJavaScript test
in Tests.ExternalTexture.cpp already covers this functionality.
The texture array rendering is covered by RenderTextureArray.

Also revert app startup ordering to Finish->Wait (matching the
pattern used by HeadlessScreenshotApp on master).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bgfx callback's fatal() handler was silently calling debugBreak()
on DebugCheck assertions with no output, making CI crashes impossible
to diagnose. Now logs the file, line, error code and message to stderr
before breaking, so the assertion details appear in CI logs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add DISM/d3dconfig step to CI to enable D3D debug layer, which will
provide detailed D3D11 validation messages for the CreateShaderResourceView
E_INVALIDARG failure. Kept the _external createTexture2D path (reverted
the AfterRenderScheduler approach) so we can see the actual D3D debug
output that explains the SRV mismatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bgfx _external texture path triggers E_INVALIDARG in
CreateShaderResourceView on CI's WARP D3D11 driver. The overrideInternal
alternative doesn't support full array textures (hardcodes ArraySize=1).
Since the _external path works on real GPUs, skip the render test on CI
via BABYLON_NATIVE_SKIP_RENDER_TESTS and keep the direct _external path.

Also adds D3D debug layer enablement to CI for future diagnostics, and
logs bgfx fatal errors to stderr before crashing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from _external parameter (crashes on WARP) to create+overrideInternal
two-step approach. The overrideInternal path is compatible with WARP but
sets ArraySize=1 in the SRV, so the RenderTextureArray test (which needs
full array access) is skipped on CI. The render test works on real GPUs
where the _external path succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The overrideInternal call fires on AfterRenderScheduler after the first
bgfx::frame(). An additional frame pump ensures the native texture
backing is applied before the scene render.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove deleted Tests.ExternalTexture.D3D11.cpp from Install/Test/CMakeLists.txt
- Add extra frame pump after CreateForJavaScript in HeadlessScreenshotApp
  and StyleTransferApp so overrideInternal has time to apply the native
  texture backing before the first render.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ompletionScope instead of deferred member release.
…cquired FrameCompletionScope in SubmitCommands.
bghgary and others added 5 commits April 28, 2026 17:54
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…x CI"

The DRI3/LIBGL hypothesis was wrong. The 4-30x runtime variance is fixed by 140894b (hold frame gate open during pump wait); LIBGL_ALWAYS_SOFTWARE is a no-op for the actual root cause (FrameCompletionScope scheduler race) and adds confusion.

CI matrix kept trimmed to Linux JSC for now (separate decision).

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

bghgary commented Apr 29, 2026

[Responded by Copilot on behalf of @bghgary]

Correction — earlier root-cause comment was wrong.

The DRI3 / LIBGL_ALWAYS_SOFTWARE hypothesis didn't pan out. After comparing master vs rework-thread-model directly, the regression is branch-specific to anything carrying #1652, not a runner-pool issue:

Actual root cause

SafeTimespanGuarantor (per-named-update independent gates) was replaced in #1652 with a single shared mutex+CV+counter FrameCompletionScope gate. JS-thread SubmitCommands and similar must acquire a scope, which blocks on m_frameSyncCV while m_frameBlocked == true.

The unit-test pump was changed to:

while (!ready) {
    Start;        // open gate, notify_all
    Finish;       // wait scopes==0, close gate, bgfx::frame()
    wait_for(16ms);
}

The window between Start and Finish is ~zero. JS thread woken by notify_all has to win a scheduler race against the pump thread re-acquiring the mutex inside Finish to land its scope before the gate closes. On contended runners JS loses that race repeatedly → 16 ms + bgfx::frame() per missed scope → tests that should run 4 s take 30+ min.

Fix

Invert the pump so the gate stays open during wait_for:

Start;
while (true) {
    wait_for(16ms);   // gate OPEN here, JS thread can submit freely
    if (ready) break;
    Finish;
    Start;
}
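The race and its fix can be modeled with a toy gate. The class and function names below are illustrative stand-ins, not Babylon Native's actual FrameCompletionScope types; the point is only the pump shape, where the gate stays open during the 16 ms wait so the submitter never has to win a wake-up race against Finish:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy model of the shared mutex+CV+counter gate described above.
class FrameGate
{
public:
    void Start() // open the gate and wake blocked submitters
    {
        std::lock_guard<std::mutex> lock{m_mutex};
        m_open = true;
        m_cv.notify_all();
    }

    void Finish() // wait for outstanding scopes, then close the gate
    {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv.wait(lock, [this] { return m_scopes == 0; });
        m_open = false;
        // bgfx::frame() would run here in the real code.
    }

    void AcquireScope() // submitter side: blocks while the gate is closed
    {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv.wait(lock, [this] { return m_open; });
        ++m_scopes;
    }

    void ReleaseScope()
    {
        std::lock_guard<std::mutex> lock{m_mutex};
        if (--m_scopes == 0)
        {
            m_cv.notify_all();
        }
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_open{false};
    int m_scopes{0};
};

// Fixed pump shape: the gate is OPEN during the sleep, so the JS thread can
// land its scope at any time instead of only in the ~zero window between
// Start and Finish.
void Pump(FrameGate& gate, const std::atomic<bool>& done)
{
    gate.Start();
    while (!done.load())
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(16));
        gate.Finish();
        gate.Start();
    }
    gate.Finish();
}
```

With the original ordering (Start; Finish; sleep), AcquireScope would only succeed if the submitter's thread were scheduled inside the instant between notify_all and Finish re-taking the mutex.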

Validation

Applied the fix on bghgary/unittest-watchdog (commit 140894be). 4 parallel runs × 2 jobs each = 8 jobs:

Run          Job    Total  Unit Tests step
25085490269  gcc    359 s  4 s
25085490269  clang  393 s  4 s
25085504626  gcc    355 s  4 s
25085504626  clang  384 s  4 s
25085506892  gcc    364 s  4 s
25085506892  clang  352 s  4 s
25085508987  gcc    362 s  4 s
25085508987  clang  377 s  4 s

All 4-second Unit Tests, total 352–393 s — back in master baseline range. Variance gone.

Open question

Should the fix live in the test pump only (committed), or also propose a change to the FrameCompletionScope gate model in #1652? Real-world consumers (Playground apps) drive Start/Finish from OS event loops with natural delays, so they likely don't hit this race the way the test pump does. Test-only is small and provably fixes CI; gate-model change is the more architectural fix but is in scope of #1652 itself. Will discuss with @bghgary on his return.

Reverted the unrelated LIBGL_ALWAYS_SOFTWARE env-var fix (it was a no-op for this).

bghgary and others added 24 commits April 29, 2026 15:59
The unit-test pump has been calling Start; Finish back-to-back per
loop iteration with the wait_for AFTER Finish, which collapses the
FrameCompletionScope gate's open window to ~zero. With the new
shared-gate model on rework-thread-model, JS-thread SubmitCommands
(and similar) acquire a scope that blocks on the gate's CV; whenever
they wake on notify_all from Start, they have to win a scheduler race
against the pump thread re-entering Finish to land their scope before
the gate closes. On contended Linux CI runners that race is lost
frequently, accumulating 16ms+ per missed acquisition into 30+ minute
test runtimes (vs ~5 minutes on master).

Restructure the pump to follow the "Start ASAP / Finish; Start tick /
final Finish" pattern: open the frame immediately after device
creation so the JS thread can submit at any time, then in the pump
loop just tick bgfx (Finish; Start) once per iteration, and rely on
the existing trailing Finish (after nativeCanvas.reset()) to close
the gate at shutdown. Drops the dead Start; Finish priming pair and
the redundant post-loop Start.

Validated locally (Win32 RelWithDebInfo): UnitTests --gtest_filter=
JavaScript.* passes in 2.4s with all 24 sub-tests green.

Validated on bghgary fork CI (4 parallel runs x 2 jobs, Linux JSC,
prior version of this fix on bghgary/unittest-watchdog 140894b):
all 8 jobs ran the Unit Tests step in 4 seconds consistently with
total job duration 352-393s — back in master baseline range.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…abylonJS#1646

In BabylonJS#1674's combination of rework-thread-model (BabylonJS#1652) and the
ExternalTexture sync API (BabylonJS#1646), three apps swapped the original
async AddToContextAsync flow for synchronous CreateForJavaScript but
kept the two-frame skeleton from the async version. The result is
that the loader.Dispatch which runs the JS startup() callback now
gets queued in frame 1 even though its observable wait
(startup.get_future().wait()) is in frame 2 — the dispatch can land
in either frame depending on JS-thread scheduling, breaking the
"each Start/Finish pair wraps a phase of work" pattern. On
rework-thread-model the JS thread will block on the closed gate
between frames 1 and 2 anyway, so it functionally works, but the
ordering is wrong.

Move the dispatch into the second frame so the texture creation,
startup() call, and the wait that observes them all run inside the
same frame:

  frame 1: load scripts            (no dispatch)
  frame 2: dispatch[startup] -> wait
  frame 3+: per-asset / render-loop

Three apps had this issue; UnitTests' ExternalTexture tests already
do it correctly. On bare rework-thread-model these apps still use
AddToContextAsync and don't have the bug.

Files:
- Apps/HeadlessScreenshotApp/Win32/App.cpp
- Apps/PrecompiledShaderTest/Source/App.cpp
- Apps/StyleTransferApp/Win32/App.cpp

Local build verification (Win32 RelWithDebInfo):
- HeadlessScreenshotApp: builds cleanly
- StyleTransferApp: builds cleanly
- PrecompiledShaderTest: not configured in this Build/Win32; change
  is structurally identical to HeadlessScreenshotApp which compiled.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	Apps/UnitTests/Source/Tests.JavaScript.cpp
Resolves four real conflicts from the integration of master's recent
canvas/blit fixes (BabylonJS#1683) with the threading-model rework (BabylonJS#1652):

* Core/Graphics/InternalInclude/Babylon/Graphics/DeviceContext.h:
  keep rework's AcquireNewViewId() (no encoder param), add master's
  PeekNextViewId() const.

* Core/Graphics/Source/DeviceImpl.h: same — keep rework's
  AcquireNewViewId / IncrementPendingFrameScopes /
  DecrementPendingFrameScopes / SetActiveEncoder / GetActiveEncoder,
  add master's PeekNextViewId() const.

* Plugins/NativeEngine/Source/NativeEngine.cpp::CopyTexture: adopt
  master's BabylonJS#1683 fix structurally — fetch encoder at top of function,
  schedule blit on PeekNextViewId() so it runs after every view used
  so far (matching bgfx's blit-in-numeric-view-order semantics).
  Substitute rework's GetEncoder() member for master's
  GetUpdateToken().GetEncoder() since UpdateToken is a no-op shim on
  rework.

* Polyfills/Canvas/Source/Context.cpp::Flush: keep both — master's
  EnsureFontsLoaded() at top (CPU-only nanovg font registration, no
  GPU/encoder dependency), then rework's optional FrameCompletionScope
  acquisition for callers that hit Flush outside a frame cycle, then
  GetActiveEncoder + discard.

Local validation: Win32 RelWithDebInfo UnitTests --gtest_filter=
JavaScript.* passes in 2.3s, all 24 sub-tests green.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The X11 Validation Tests flake on Linux JSC was caused by a gap in
FrameCompletionScope coverage between two scope acquisitions in one
logical JS-side update.

SubmitCommands acquired a scope locally, processed its command stream,
and released the scope when the function returned. If a single JS task
made multiple submit calls (or did any non-bgfx work between them), the
scope count dropped to 0 between calls. A concurrent
FinishRenderingCurrentFrame on the render thread could then proceed
through the !m_pendingFrameScopes wait, close the gate, and run
bgfx::frame() with only a partial scene submitted -- producing the
empty/clear-color screenshots observed in the flake's pixel diffs.

Fix: capture the scope into an m_runtime.Dispatch lambda so it survives
to the end of the current JS-thread task and is released when the
runtime services its queue next. Continuation work in the same JS frame
sees count > 0 and remains protected from concurrent Finish.

Same shape as the existing RAF dispatcher's "prevent_frame" pattern,
applied at the SubmitCommands funnel which is reached by every render
path.
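The capture-into-Dispatch pattern can be sketched with a toy model. Device, ScopeGuard, and the queue below are illustrative stand-ins for DeviceImpl, FrameCompletionScope, and m_runtime.Dispatch, not the actual types:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>

// Toy model of the fix: instead of a stack-scoped guard that releases at the
// end of SubmitCommands, the scope is captured into a dispatched task and
// released when the runtime next services its queue.
struct Device
{
    int pendingFrameScopes = 0;
    void Acquire() { ++pendingFrameScopes; }
    void Release() { --pendingFrameScopes; }
};

struct ScopeGuard
{
    explicit ScopeGuard(Device& d) : device(d) { device.Acquire(); }
    ~ScopeGuard() { device.Release(); }
    Device& device;
};

std::queue<std::function<void()>> g_dispatchQueue; // stand-in for m_runtime.Dispatch

void SubmitCommands(Device& device)
{
    // shared_ptr makes the lambda copyable, which std::function requires.
    auto scope = std::make_shared<ScopeGuard>(device);
    g_dispatchQueue.push([scope] { /* released when the queue is serviced */ });
    // ... process the command stream; the scope count stays > 0 throughout,
    // so a concurrent Finish cannot slip a bgfx::frame() in between calls.
}

void ServiceQueue() // the runtime servicing its JS-thread task queue
{
    while (!g_dispatchQueue.empty())
    {
        g_dispatchQueue.front()();
        g_dispatchQueue.pop(); // destroys the lambda, dropping the last ref
    }
}
```

Two SubmitCommands calls in one JS task leave the count at 2 until the queue is drained, which is exactly the window the flake exploited when the count dropped to 0.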

Validated locally on Win32 RelWithDebInfo (UnitTests --gtest_filter=
JavaScript.* passes in 3.3s, all 24 sub-tests green).

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fresh docs

The std::move(scope) capture in 7f98c7c produced a move-only lambda.
Dispatchable at the AppRuntime layer is move-only and accepts it, but
JsRuntime downstream type-erases the callable through std::function
which requires copy-constructibility:

  std_function.h:439:18: error: static assertion failed due to
  requirement 'is_copy_constructible<...>::value':
  std::function target must be copy-constructible

MSVC accepted this; gcc 14 (Ubuntu CI) rejects it. Wrap the scope in
shared_ptr so the lambda is copyable, matching the existing RAF
dispatcher pattern in ScheduleRequestAnimationFrameCallbacks.
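The copyability issue can be shown in isolation. This is a generic C++ demonstration with a unique_ptr standing in for the move-only scope, not the actual FrameCompletionScope code:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <type_traits>

// A lambda that move-captures a move-only object is itself move-only, and
// std::function's target must be copy-constructible (the gcc 14 error quoted
// above). Wrapping the object in shared_ptr restores copyability.
int DemoCopyableCapture()
{
    auto uptr = std::make_unique<int>(1);
    auto moveOnly = [p = std::move(uptr)] {};
    static_assert(!std::is_copy_constructible_v<decltype(moveOnly)>);
    // std::function<void()> f{std::move(moveOnly)}; // would fail to compile

    auto sptr = std::make_shared<int>(2);
    auto copyable = [sptr] {};
    static_assert(std::is_copy_constructible_v<decltype(copyable)>);
    std::function<void()> f{copyable}; // OK: copyable target
    f();
    return *sptr;
}
```

MSVC's std::function historically did not diagnose this, which is why the break only surfaced on the Ubuntu gcc CI leg.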

Also refresh the comments on FrameCompletionScope and
AcquireFrameCompletionScope. The previous class-level comment listed
three usage patterns, but one of them ("NativeEngine::GetEncoder()
acquires lazily") didn't match the code — GetEncoder doesn't acquire a
scope. Replace with the two patterns that are actually in use:

  1. JS-frame scoped via m_runtime.Dispatch capture (RAF, SubmitCommands)
  2. Block scoped on the stack (Canvas::Flush fallback, ReadTextureAsync)

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Schedule the FrameCompletionScope-releasing lambda before the command
stream is processed, not after. Equivalent in the success path, but
makes the JS-frame coverage exception-safe — if the command processing
throws, the lambda is already queued and the scope survives to the next
dispatch cycle, instead of being destroyed at the throw and dropping
the count to 0 on the render thread's view.

Also drops the explicit local in favour of an inline make_shared in
the lambda capture — the local was only there to be captured.
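The ordering argument can be demonstrated with a toy throw. Scope, the counter, and the queue below are illustrative stand-ins, not the actual Babylon Native types:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <stdexcept>

// Toy illustration of why the release lambda is queued BEFORE the command
// stream is processed: if processing throws, the queued copy of the
// shared_ptr keeps the scope alive until the next dispatch cycle instead of
// the count dropping to 0 during unwinding.
int g_scopeCount = 0;

struct Scope
{
    Scope() { ++g_scopeCount; }
    ~Scope() { --g_scopeCount; }
};

std::queue<std::function<void()>> g_queue;

void SubmitThatThrows()
{
    auto scope = std::make_shared<Scope>();
    g_queue.push([scope] {});       // queued first: this copy survives the throw
    throw std::runtime_error{"bad command stream"}; // processing fails mid-way
    // With the old order (queue AFTER processing), the local shared_ptr would
    // be the only owner here and the scope would die during unwinding.
}
```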

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>