
Harden Playground snippet-fetch retry against transient failures #1690

Open
bghgary wants to merge 2 commits into BabylonJS:master from bghgary:harden-playground-retry


@bghgary (Contributor) commented May 7, 2026

Context

The Win32_x64_D3D11 / build Playground job in BabylonNative CI has been failing intermittently with this exact pattern, on commits that don't touch playground code:

[Error] SyntaxError: JSON.parse Error: Unexpected input at position:0
[Error] SyntaxError: JSON.parse Error: Unexpected input at position:0
[Error] SyntaxError: JSON.parse Error: Unexpected input at position:0
[Error] SyntaxError: JSON.parse Error: Unexpected input at position:0
[Error] SyntaxError: JSON.parse Error: Unexpected input at position:0
[Log] Running the playground failed.
##[error]Process completed with exit code -1.

Five identical errors: all five attempts allowed by maxRetry=5 fail at the same parse step.

Root cause

Apps/Playground/Scripts/validation_native.js:178-247 fetches each test's snippet from https://snippet.babylonjs.com/<id>/<rev> and unconditionally calls JSON.parse(xmlHttp.responseText) on readyState === 4, without checking xmlHttp.status. When the snippet service returns a transient error (503, 429, gateway timeout, empty body), responseText is empty or HTML and the parse fails with "Unexpected input at position:0". The catch path correctly retries, but the retry policy is too weak:

  • maxRetry = 5, retryTime = 500ms, no backoff.
  • Total retry budget: about 2 seconds (four 500ms waits between five attempts).

Two seconds is nowhere near enough to ride out a normal CDN/upstream blip (typically 10–30s).
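A minimal sketch of the failure mode described above, assuming the pre-fix code simply calls JSON.parse on whatever body arrives. The helpers parseWithoutStatusCheck and failsToParse are hypothetical and are not taken from validation_native.js; the budget arithmetic assumes the fixed wait happens between attempts, which matches the 2-second figure above.

```javascript
// Hypothetical illustration of the pre-fix behavior: the body is parsed without
// looking at the HTTP status, so an empty body or an HTML error page (503, 429,
// gateway timeout) throws a SyntaxError at position 0.
// The exact error message text depends on the JavaScript engine.
function parseWithoutStatusCheck(responseText) {
  return JSON.parse(responseText); // no status check
}

// Returns true if parsing `body` the pre-fix way throws a SyntaxError.
function failsToParse(body) {
  try {
    parseWithoutStatusCheck(body);
    return false;
  } catch (e) {
    return e instanceof SyntaxError;
  }
}

// Pre-fix retry budget (assumed: a fixed wait between attempts),
// i.e. 4 waits of 500ms for 5 attempts.
const OLD_MAX_RETRY = 5;
const OLD_RETRY_TIME_MS = 500;
const oldBudgetMs = (OLD_MAX_RETRY - 1) * OLD_RETRY_TIME_MS; // 2000
```

Every retry of a transient error page fails the same way, which is why the log shows the same SyntaxError once per attempt.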

Change

Three small fixes to the snippet-fetch retry path:

  1. Status check before parse. Check xmlHttp.status === 200 before calling JSON.parse. Non-200 responses are logged with the status code and the playground id, then routed to onError instead of bubbling up as a misleading SyntaxError.

  2. More retries. Increase maxRetry from 5 → 8.

  3. Exponential backoff. Replace the fixed 500ms delay with min(500ms × 2^(retry-1), 30s). Schedule:

    Retry #   Delay
    1         500ms
    2         1s
    3         2s
    4         4s
    5         8s
    6         16s
    7         30s (capped)

    Total budget grows from ~2s to ~60s (the seven delays above sum to 61.5s). Persistent outages still fail fast, just at ~60s instead of ~2s; transient blips now have room to recover.
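The three changes can be sketched together as follows. This is not the code from the PR diff (which is not shown on this page): retryDelayMs, parseSnippetResponse and loadSnippet are hypothetical names, and the request and scheduling functions are injected so the loop can run without a network or real timers. The sketch assumes maxRetry counts total attempts, so 8 attempts produce the 7 waits in the schedule above.

```javascript
// Hypothetical constants mirroring the values described in this PR.
const MAX_RETRY = 8; // total attempts (assumption)
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30000;

// Wait before retry number `retry` (1-based): min(500ms * 2^(retry-1), 30s).
function retryDelayMs(retry) {
  return Math.min(BASE_DELAY_MS * 2 ** (retry - 1), MAX_DELAY_MS);
}

// Change 1: check the HTTP status before parsing, so a non-200 response is
// reported with its status code and playground id instead of a SyntaxError.
function parseSnippetResponse(status, responseText, playgroundId) {
  if (status !== 200) {
    throw new Error(`Snippet ${playgroundId} request failed with HTTP status ${status}`);
  }
  return JSON.parse(responseText);
}

// Changes 2 and 3: up to MAX_RETRY attempts with exponential backoff.
// `request(url, done)` calls done(status, responseText); `schedule(fn, ms)`
// would be setTimeout in a real runner.
function loadSnippet(url, playgroundId, request, schedule, onSuccess, onError) {
  let attempt = 0;
  const tryOnce = () => {
    attempt++;
    request(url, (status, responseText) => {
      let snippet;
      try {
        snippet = parseSnippetResponse(status, responseText, playgroundId);
      } catch (e) {
        if (attempt < MAX_RETRY) {
          schedule(tryOnce, retryDelayMs(attempt));
        } else {
          onError(new Error(`Giving up after ${attempt} attempts: ${e.message}`));
        }
        return;
      }
      // Called outside the try so an exception in onSuccess is not retried.
      onSuccess(snippet);
    });
  };
  tryOnce();
}
```

Injecting `schedule` keeps the delays observable in tests. With a request that always returns 503, the waits come out as 500, 1000, 2000, 4000, 8000, 16000 and 30000 ms (61.5s in total) before onError fires after the eighth attempt.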

Validation against upstream

The canonical snippet loader in BabylonJS/Babylon.js — packages/tools/snippetLoader/src/fetchSnippet.ts — also checks response.ok before calling response.json(). The status-check approach is consistent with the upstream pattern.

Scope

Targeted at the playgroundId retry path (validation_native.js:152-259), which is the only path we have evidence of flaking. The scriptToRun branch and BABYLON.Tools.LoadFile calls are unchanged: they have their own (retry-less) error handling, which can be hardened in a follow-up if they ever flake.

Observed failure example

CI run on PR #1687, commit d3cdfeab (a comment-only change to a different test file): job 25524000188 failed with the snippet-parse-error pattern above. Re-run passed without any code change.

[Created by Copilot on behalf of @bghgary]

The Win32_x64_D3D11 Playground job fails intermittently with five
identical "[Error] SyntaxError: JSON.parse Error: Unexpected input at
position:0" lines followed by "[Log] Running the playground failed."
on commits that don't touch playground code.

Root cause: validation_native.js fetches each test's snippet from
https://snippet.babylonjs.com/<id>/<rev> and unconditionally calls
JSON.parse(xmlHttp.responseText) on readyState === 4, ignoring
xmlHttp.status. When the snippet service returns a transient error
(5xx, 429, gateway timeout, empty body), the parse fails and falls
through to the catch which calls onError. The retry policy is
maxRetry=5 with a fixed 500ms delay -- a 2-second total budget that
cannot ride out a normal CDN/upstream blip.

Three changes:

1. Check xmlHttp.status === 200 before parsing. Non-200 responses are
   logged with the status code and the playground id, then routed to
   onError instead of bubbling up as a misleading SyntaxError.
2. Increase maxRetry from 5 to 8.
3. Replace the fixed 500ms delay with exponential backoff capped at
   30 seconds (500ms, 1s, 2s, 4s, 8s, 16s, 30s). Total budget grows
   from ~2s to ~60s, which is sufficient to ride out typical service
   blips without changing the eventual fail-fast behavior on persistent
   outages.

The status check is consistent with the canonical snippet loader in
BabylonJS/Babylon.js (packages/tools/snippetLoader/src/fetchSnippet.ts),
which also checks response.ok before calling response.json().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 22:26

Copilot AI left a comment

Pull request overview

Improves the robustness of BabylonNative’s Playground visual validation runner by hardening the snippet-fetch path against transient HTTP failures from snippet.babylonjs.com, reducing flaky CI failures unrelated to code changes.

Changes:

  • Add an HTTP status check before attempting to parse the snippet response as JSON.
  • Increase retry attempts (5 → 8) and implement exponential backoff (capped at 30s).
  • Improve the final failure log message to reflect the number of attempts.

Review comment thread on Apps/Playground/Scripts/validation_native.js (outdated)
The readystatechange listener is registered via addEventListener, not via the onreadystatechange property; these are separate slots in the XHR API. Setting the property to null therefore does not detach the listener, as the reviewer observed. The spec also guarantees that readystatechange fires only once on the transition to DONE, so the line was a misleading no-op.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
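The two-slot behavior described in this commit can be illustrated with a small mock. FakeXhr is a hypothetical stand-in, not XMLHttpRequest itself: it only models the fact that the onreadystatechange property is backed by its own handler, separate from listeners added with addEventListener. It assumes the EventTarget and Event globals (available in browsers and in Node.js 15+).

```javascript
// Hypothetical stand-in for XMLHttpRequest. It only models one detail of the
// real API: the `onreadystatechange` property is its own handler slot, separate
// from listeners added with addEventListener.
class FakeXhr extends EventTarget {
  constructor() {
    super();
    this.readyState = 0;
    this._onreadystatechange = null;
    // The property-backed handler is invoked through its own internal listener.
    this.addEventListener("readystatechange", (event) => {
      if (typeof this._onreadystatechange === "function") {
        this._onreadystatechange.call(this, event);
      }
    });
  }
  get onreadystatechange() {
    return this._onreadystatechange;
  }
  set onreadystatechange(handler) {
    this._onreadystatechange = handler;
  }
  // Simulate the single transition to DONE (readyState 4).
  finishRequest() {
    this.readyState = 4;
    this.dispatchEvent(new Event("readystatechange"));
  }
}

const calls = [];
const xhr = new FakeXhr();
xhr.addEventListener("readystatechange", () => calls.push("addEventListener"));
xhr.onreadystatechange = () => calls.push("property");
// Clearing the property removes only the property handler...
xhr.onreadystatechange = null;
xhr.finishRequest();
// ...so the addEventListener listener still runs: calls is ["addEventListener"].
```

Because the listener is only ever attached through addEventListener, clearing the property has no effect on it, and removing it would require removeEventListener with the same function reference.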