Jittered backoff and safe-transport retry for remoting client #1487

@michaellwest

Summary

After a Sitecore app-pool recycle or deployment, every SPE client running against that CM retries on the same wall-clock tick because every retry sleep is a flat Start-Sleep -Seconds N. The synchronized burst from a fleet of CI runners, scheduled scripts, and automation hits the CM during its most fragile window, exactly when it is least able to absorb it. This issue replaces the flat sleeps with jittered backoff at four retry sites and adds a classifier that retries unambiguously safe transport failures (DNS, connection refused, 502) under -MaxRetries. Phase 1 is in branch feature/jittered-backoff. Phase 2 (under consideration) adds an opt-in switch for ambiguous failures.

Problem

A task deploy, an IIS app-pool recycle, a Docker compose up, or any Sitecore restart leaves the CM unable to serve requests for roughly 5-30 seconds. Any SPE client running against that CM during that window hits one of three retry paths, all of which sleep a flat number of seconds:

  • Wait-RemoteConnection: flat 2s between warmup probes.
  • Invoke-RemoteScript: sleeps exactly Retry-After on 429/503.
  • Wait-RemoteScriptSession: flat -Delay on TransportError or legacy poll.

When a fleet of clients was already running before the recycle (CI pipelines, scheduled remoting scripts, the warmup loop in task test, an operator's local PowerShell session), they all start probing again on the same tick after the outage. The CM gets a synchronized burst of N requests every 2s while it is still warming up. Worst case: the CM finishes warming, accepts the burst, overloads, looks unhealthy, and recycles again.

Separately, when the CM is genuinely unreachable (DNS hiccup mid-recycle, connection refused on app-pool restart, 502 from Traefik when the upstream dies), Invoke-RemoteScript fails immediately even with -MaxRetries 5, because today's retry path only handles 429/503 status codes. Operators have to script their own retry loop on top, defeating the purpose of -MaxRetries.

Phase 1: jittered backoff and safe-transport retry

Two private helpers in SPE.psm1:

  • Get-SpeBackoffDelay. With a Retry-After hint, jitters +/-20% so concurrent clients do not rebound on the same tick. Without a hint, full jitter: rand(0.1, min(cap, base * 2^min(attempt,6))).
  • Test-SpeSafeTransportError. Returns $true only when the failure is unambiguously pre-send: HTTP 502, or SocketException codes ConnectionRefused (10061), HostNotFound (11001), TryAgain (11002), NoData (11004). Mid-stream resets, 504, post-send timeouts, generic IOException deliberately return $false.
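The two delay modes described for Get-SpeBackoffDelay can be sketched as follows. This is a minimal illustration, not the code in SPE.psm1: the function name Get-BackoffDelaySketch, its parameter names, the defaults, and the convention that a hint of 0 means "no hint" are all assumptions. The sketch also applies no cap to the hinted branch, which the real helper may do.

```powershell
function Get-BackoffDelaySketch {
    [CmdletBinding()]
    param(
        # Zero-based attempt counter (assumed convention).
        [int]$Attempt = 0,
        [double]$BaseSeconds = 2,
        [double]$CapSeconds = 15,
        # Server-supplied Retry-After hint in seconds; 0 or less means "no hint" (assumed).
        [double]$RetryAfterSeconds = 0
    )

    if ($RetryAfterSeconds -gt 0) {
        # Hinted mode: spread the server's hint by +/-20%.
        $low  = $RetryAfterSeconds * 0.8
        $high = $RetryAfterSeconds * 1.2
    }
    else {
        # Unhinted mode: full jitter between a 0.1s floor and the capped exponential ceiling.
        $exponent = [Math]::Min($Attempt, 6)
        $low  = 0.1
        $high = [Math]::Min($CapSeconds, $BaseSeconds * [Math]::Pow(2, $exponent))
    }

    # With double bounds, Get-Random returns a double >= Minimum and < Maximum.
    Get-Random -Minimum $low -Maximum $high
}

# Example: delay before the fourth probe of a warmup loop (base 2s, cap 15s).
$delay = Get-BackoffDelaySketch -Attempt 3 -BaseSeconds 2 -CapSeconds 15
```

A caller would then sleep with something like Start-Sleep -Milliseconds ([int]($delay * 1000)), so that fractional seconds are preserved.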

Wired into four retry sites:

| Site | Behavior change |
| --- | --- |
| Invoke-RemoteScript | Jitter on 429/502/503 retry; new safe-transport retry under -MaxRetries. Also: switched .Result to GetAwaiter().GetResult() so the catch sees unwrapped exceptions. |
| Invoke-RemoteWait | Jitter on 429/502/503 retry. |
| Wait-RemoteScriptSession | Jitter on TransportError sleep + jittered legacy-poll cadence. |
| Wait-RemoteConnection | Jittered backoff (base 2s, cap 15s) in the deploy-warmup loop. |

Why 502 is in Phase 1 and 504 is not

502 from Traefik or any fronting proxy means the request never reached the application, so retry cannot double-execute a non-idempotent script. 504 means the request might have run and the response got lost. Silent retry on 504 is a disaster waiting to happen for any cmdlet with side effects (New-Item, Send-MailMessage, Publish-Item).
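The classification rule (502 and the four listed SocketException codes are safe, everything else is not) could look roughly like the sketch below. The function name Test-SafeTransportErrorSketch and its parameters are hypothetical; the real Test-SpeSafeTransportError may take different inputs. The sketch assumes the caller has already extracted the HTTP status code, if any, and it walks the InnerException chain because the socket error is usually wrapped by higher-level exceptions.

```powershell
function Test-SafeTransportErrorSketch {
    param(
        [System.Exception]$Exception,
        # HTTP status code of the response, or 0 when no response was received (assumed convention).
        [int]$StatusCode = 0
    )

    # 502 is treated as pre-send, following the issue's reasoning; 504 falls through to $false.
    if ($StatusCode -eq 502) { return $true }

    $safeCodes = @(
        [System.Net.Sockets.SocketError]::ConnectionRefused,  # 10061
        [System.Net.Sockets.SocketError]::HostNotFound,       # 11001
        [System.Net.Sockets.SocketError]::TryAgain,           # 11002
        [System.Net.Sockets.SocketError]::NoData              # 11004
    )

    # Walk the wrapper chain until a SocketException is found.
    $current = $Exception
    while ($null -ne $current) {
        if ($current -is [System.Net.Sockets.SocketException]) {
            return ($safeCodes -contains $current.SocketErrorCode)
        }
        $current = $current.InnerException
    }

    # No SocketException in the chain (generic IOException, timeout, null): not provably pre-send.
    return $false
}

# A wrapped ConnectionRefused is classified as safe; a ConnectionReset is not.
$refused = [System.Net.Sockets.SocketException]::new(10061)
$wrapped = [System.InvalidOperationException]::new('connect failed', $refused)
$reset   = [System.Net.Sockets.SocketException]::new(10054)
$wrappedIsSafe = Test-SafeTransportErrorSketch -Exception $wrapped
$resetIsSafe   = Test-SafeTransportErrorSketch -Exception $reset
```

Defaulting to $false keeps the classifier conservative: anything it does not recognize is left to the caller rather than retried silently.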

Cmdlet idempotency map

The Phase 2 design needs this to decide what should auto-retry vs require an explicit opt-in. Server endpoints land in three buckets:

| Cmdlet | Endpoint | Mutates server? | Retry-safe? |
| --- | --- | --- | --- |
| Test-RemoteConnection | GET probe | No | Yes |
| Receive-RemoteItem (file/media) | GET /-/script/file/... or /media/... | No | Yes (but starts from byte 0; see #1488 for resume support) |
| Invoke-RemoteWait | GET /-/script/wait/... | No | Yes (long-poll, designed for re-attempts) |
| Wait-RemoteScriptSession | composes Invoke-RemoteWait + final receive | No | Yes |
| Wait-RemoteSitecoreJob | composes Invoke-RemoteScript poll loop | No | Yes |
| Stop-ScriptSession | POST /-/script/script/?action=cleanup | Yes (removes session) | Effectively yes - re-cleanup of an already-removed session is a no-op |
| Send-RemoteItem (file) | POST /-/script/file/... | Yes (overwrites) | Conditional - same bytes + fixed destination = same end state, network-wasteful |
| Send-RemoteItem (media) | POST /-/script/media/... | Yes | Not when versioned media is on (Settings.Media.UploadAsVersionableByDefault); each retry creates a new version |
| Invoke-RemoteScript | POST /-/script/script/?action=execute | Depends entirely on script body | No by default - script could Send-MailMessage, Publish-Item, increment counters |
| Invoke-RemoteScript -AsJob | same | Yes (starts a job) | No - a second POST starts a second job, even if the first ran |

This refines the Phase 2 question of "what should -RetryOnConnectionFailure cover":

| Cmdlet | Suggested ambiguous-failure behavior |
| --- | --- |
| Receive-RemoteItem, Invoke-RemoteWait, Wait-*, Stop-ScriptSession | Auto-retry under -MaxRetries. Safe by construction. Could land as Phase 1.5 ahead of the opt-in switch. |
| Send-RemoteItem non-versioned + fixed destination | Per-call opt-in (caller declares idempotency) |
| Send-RemoteItem versioned media | User opt-in only - duplicate versions are a real cost |
| Invoke-RemoteScript | User opt-in only (the existing Phase 2 design) |

Phase 2 (under consideration): opt-in for ambiguous failures

Adds a switch (working name -RetryOnConnectionFailure) to Invoke-RemoteScript. Shares the existing -MaxRetries budget. When set, also retries:

| Failure | Why it is ambiguous |
| --- | --- |
| HTTP 504 Gateway Timeout | Upstream may have processed before the timeout. |
| SocketException ConnectionReset (10054) | Reset could be either side of the request boundary. |
| IOException mid-read | Connection died between request and response. |
| TaskCanceledException from HttpClient timeout | Request may or may not have completed server-side. |

The user is responsible for ensuring scripts are idempotent before opting in, and the docstring will say so explicitly. This matches the AWS SDK / Polly idiom.

Open questions before committing to the design:

  • Whether to land "Phase 1.5" first - auto-retry under -MaxRetries for the safe-by-construction cmdlets (Receive-RemoteItem, Invoke-RemoteWait, Wait-*, Stop-ScriptSession), which today do not retry at all on transport blips. This is strictly additive and does not need an opt-in switch.
  • Switch name for the script / upload path. Options considered: -RetryOnConnectionFailure, -RetryUnsafeFailures, or a separate -IdempotentScript switch that broadens what -MaxRetries covers.
  • Whether the opt-in switch lives only on Invoke-RemoteScript and Send-RemoteItem, or whether Send-RemoteItem's case is split further by versioned-media vs non-versioned.
  • Default backoff cap when ambiguous-retry kicks in. Same 10s ceiling, or longer because 504 likely indicates a longer-running upstream?
  • Whether to log a Warning on each ambiguous retry to flag duplicate-execution risk to operators.

Heuristic values (re-tune triggers)

| Constant | Value | Why | When to revisit |
| --- | --- | --- | --- |
| Retry-After jitter spread | +/-20% | Breaks wall-clock sync without ignoring the server hint | Observed retry storms still cluster |
| Wait-RemoteConnection base/cap | 2s / 15s | Cold containers settle in 10-60s | Cold-start times shift (e.g. heavier Sitecore baseline) |
| 502/503 cap | 10s | Existing $retryCeiling503, kept for back-compat | CMs unavailable longer than 10s per probe |
| Full-jitter exponent ceiling | 2^min(attempt, 6) | Prevents pathological growth at large attempts | Only matters above 6 attempts; -MaxRetries validates 0-10 |
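To see how these constants interact, the snippet below evaluates the upper bound of the unhinted delay, min(cap, base * 2^min(attempt, 6)), for the Wait-RemoteConnection values (base 2s, cap 15s). The zero-based attempt numbering is an assumption; the real helper may count from 1.

```powershell
# Upper bound of the full-jitter window per attempt: min(cap, base * 2^min(attempt, 6)).
$base = 2
$cap  = 15

$ceilings = foreach ($attempt in 0..7) {
    [Math]::Min($cap, $base * [Math]::Pow(2, [Math]::Min($attempt, 6)))
}

$ceilings -join ', '   # 2, 4, 8, 15, 15, 15, 15, 15
```

With these values the 15s cap already binds from attempt 3 onward (2 * 2^3 = 16), so the 2^6 exponent ceiling acts as a backstop and would only become the limiting term if the cap were raised above base * 64.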

Compatibility

| Path | Phase 1 behavior |
| --- | --- |
| Default (-MaxRetries=0) | Identical: no retry on anything. |
| -MaxRetries N against 429/503 | Identical retry count; sleep is now jittered around Retry-After. |
| -MaxRetries N against 502 / DNS / connection refused | Retries (was: immediate fail). |
| Concurrent clients on shared Retry-After | Spread by +/-20% instead of rebounding in lockstep. |

No template changes, no serialized item migrations, no auth surface changes.

Tests

Unit (tests/unit/SPE.ClientRetry.Tests.ps1):

  • 50-sample bounds checks for Get-SpeBackoffDelay (both modes, cap honored).
  • Test-SpeSafeTransportError covers 502, 504-not-safe, all four SocketException codes, wrapped exception chain, null.
  • 502 retry: with and without -MaxRetries.
  • Transport-error retry: ConnectionRefused retried, ConnectionReset not retried, default-off without -MaxRetries.
  • Existing 429/503 timing assertion relaxed to accept the new jitter floor (~0.8s).

Integration (tests/integration/Remoting.ClientRetry.Tests.ps1):

  • Group 5: closed loopback port (TcpListener bound, then Stop-ed). Asserts no-retry baseline < 5s and retry adds >= 1.5s.
  • Group 6: HttpListener stub returns 502 on call 1 and 200 on call 2. Verifies end-to-end behavior through the real HttpClient and socket layer.
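The closed-loopback-port setup used by Group 5 can be reproduced with a few lines. This is a sketch of the general technique, not the code in the test file; variable names are illustrative. It binds a TcpListener to port 0 so the OS picks a free port, stops the listener, and then connects to the now-closed port, which is expected to fail with a SocketException (normally ConnectionRefused).

```powershell
# Bind to port 0 so the OS assigns a free port, remember it, then release it.
$listener = [System.Net.Sockets.TcpListener]::new([System.Net.IPAddress]::Loopback, 0)
$listener.Start()
$port = ([System.Net.IPEndPoint]$listener.LocalEndpoint).Port
$listener.Stop()

# Nothing listens on $port any more, so the connect attempt should be refused.
$client = [System.Net.Sockets.TcpClient]::new()
$socketError = $null
try {
    $client.Connect([System.Net.IPAddress]::Loopback, $port)
}
catch {
    # PowerShell wraps the .NET exception; GetBaseException() returns the innermost one.
    $socketError = $_.Exception.GetBaseException()
}
finally {
    $client.Close()
}

# Inspect the failure; on a refused connection this is a SocketException.
if ($socketError -is [System.Net.Sockets.SocketException]) {
    $socketError.SocketErrorCode
}
```

There is a small window in which another process could claim the port after Stop(), so a test built on this trick is best kept tolerant of an occasional unexpected success.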

Bundled bug fix

Invoke-RemoteScript -ConnectionUri ... (inline, no -Session) was silently broken: $UseDefaultCredentials was never initialized in the URI-only branch, and New-SpeHttpClient rejected the empty string when binding [bool]. The pre-jitter retry path consumed the resulting null-reference error, so the user just saw "No response returned" instead of a connect error. Fixed by defaulting to $false. Surfaced while writing the closed-port integration test.

Branch

feature/jittered-backoff, branched off release/9.0. Three commits, which will be rebased so that each commit message carries the #1487: prefix.
