Summary
After a Sitecore app-pool recycle or deployment, every SPE client running against that CM retries on the same wall-clock tick because every retry sleep is a flat Start-Sleep -Seconds N. The synchronized burst from a fleet of CI runners, scheduled scripts, and automation hits the CM during its most fragile window, exactly when it is least able to absorb it. This issue replaces the flat sleeps with jittered backoff at four retry sites and adds a classifier that retries unambiguously safe transport failures (DNS, connection refused, 502) under -MaxRetries. Phase 1 is in branch feature/jittered-backoff. Phase 2 (under consideration) adds an opt-in switch for ambiguous failures.
Problem
A task deploy, an IIS app-pool recycle, a Docker compose up, or any Sitecore restart leaves the CM unable to serve requests for roughly 5-30 seconds. Any SPE client running against that CM during that window hits one of three retry paths, all of which sleep a flat number of seconds:
- Wait-RemoteConnection: flat 2s between warmup probes.
- Invoke-RemoteScript: sleeps exactly Retry-After on 429/503.
- Wait-RemoteScriptSession: flat -Delay on TransportError or legacy poll.
When a fleet of clients was already running before the recycle (CI pipelines, scheduled remoting scripts, the warmup loop in task test, an operator's local PowerShell session), they all start probing again on the same tick after the outage. CM gets a synchronized burst of N requests every 2s while it is still warming up. Worst case: CM finishes warming, accepts the burst, overloads, looks unhealthy, recycles again.
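The lockstep effect can be seen in a small simulation. This is only a sketch: the client count and the 2-second sleep are illustrative numbers rather than measurements, and the +/-20% spread is the one proposed later in this issue.

```powershell
# Illustrative only: 20 hypothetical clients that all saw the outage end at t = 0.
$clientCount = 20
$sleepSeconds = 2.0

# Flat sleep: every client wakes at exactly the same offset.
$flatWakeups = 1..$clientCount | ForEach-Object { $sleepSeconds }

# Jittered sleep: each client draws its own offset within +/-20% of the same value.
$jitteredWakeups = 1..$clientCount | ForEach-Object {
    Get-Random -Minimum ($sleepSeconds * 0.8) -Maximum ($sleepSeconds * 1.2)
}

$flatDistinct = @($flatWakeups | Sort-Object -Unique).Count
$jitterStats = $jitteredWakeups | Measure-Object -Minimum -Maximum

"Flat:     $flatDistinct distinct wake-up time(s) for $clientCount clients"
"Jittered: wake-ups spread between {0:N2}s and {1:N2}s" -f $jitterStats.Minimum, $jitterStats.Maximum
```

With the flat sleep all 20 requests land on one tick; with jitter they are spread across a window of up to 0.8s around the same nominal delay.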
Separately, when the CM is genuinely unreachable (DNS hiccup mid-recycle, connection refused on app-pool restart, 502 from Traefik when the upstream dies), Invoke-RemoteScript fails immediately even with -MaxRetries 5, because today's retry path only handles 429/503 status codes. Operators have to script their own retry loop on top, defeating the purpose of -MaxRetries.
Phase 1: jittered backoff and safe-transport retry
Two private helpers in SPE.psm1:
- Get-SpeBackoffDelay. With a Retry-After hint, jitters +/-20% so concurrent clients do not rebound on the same tick. Without a hint, full jitter: rand(0.1, min(cap, base * 2^min(attempt,6))).
- Test-SpeSafeTransportError. Returns $true only when the failure is unambiguously pre-send: HTTP 502, or SocketException codes ConnectionRefused (10061), HostNotFound (11001), TryAgain (11002), NoData (11004). Mid-stream resets, 504, post-send timeouts, and generic IOException deliberately return $false.
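As a minimal sketch of the two delay modes described above (not the code in the branch): the function name Get-ExampleBackoffDelay, its parameter names, and the zero-based attempt counter are all assumptions made for illustration. It also assumes the caller has already clamped any Retry-After value to its own ceiling.

```powershell
function Get-ExampleBackoffDelay {
    # Hypothetical stand-in for the helper described above; names and defaults are assumed.
    [CmdletBinding()]
    param(
        [int]$Attempt = 0,                 # assumed to start at 0
        [double]$BaseSeconds = 2.0,
        [double]$CapSeconds = 15.0,
        [double]$RetryAfterSeconds = 0     # 0 means "no server hint"
    )

    if ($RetryAfterSeconds -gt 0) {
        # Hint mode: stay within +/-20% of what the server asked for.
        return (Get-Random -Minimum ($RetryAfterSeconds * 0.8) -Maximum ($RetryAfterSeconds * 1.2))
    }

    # Full-jitter mode: rand(0.1, min(cap, base * 2^min(attempt, 6))).
    $exponent = [Math]::Min($Attempt, 6)
    $upper = [Math]::Min($CapSeconds, $BaseSeconds * [Math]::Pow(2, $exponent))

    # Get-Random requires Minimum < Maximum, so fall back to the floor for tiny bounds.
    if ($upper -le 0.1) { return 0.1 }
    return (Get-Random -Minimum 0.1 -Maximum $upper)
}

# Usage: one delay per mode.
$hinted = Get-ExampleBackoffDelay -RetryAfterSeconds 5
$unhinted = Get-ExampleBackoffDelay -Attempt 3
"Hinted delay: {0:N2}s, unhinted delay: {1:N2}s" -f $hinted, $unhinted
```

The 0.1s floor keeps a client from retrying immediately, while drawing from the whole range (rather than adding a small offset to a fixed delay) is what spreads a fleet across the window.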
Wired into four retry sites:
| Site | Behavior change |
| --- | --- |
| Invoke-RemoteScript | Jitter on 429/502/503 retry; new safe-transport retry under -MaxRetries. Also: switched .Result to GetAwaiter().GetResult() so the catch sees unwrapped exceptions. |
| Invoke-RemoteWait | Jitter on 429/502/503 retry. |
| Wait-RemoteScriptSession | Jitter on TransportError sleep + jittered legacy-poll cadence. |
| Wait-RemoteConnection | Jittered backoff (base 2s, cap 15s) in the deploy-warmup loop. |
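The .Result to GetAwaiter().GetResult() change matters because the classifier needs to see the original exception type. The following sketch shows the difference on an already-faulted task; Get-ExceptionChain is a hypothetical helper written for this example, and the faulted task is built by hand rather than taken from a real HttpClient call.

```powershell
function Get-ExceptionChain {
    # Hypothetical helper: lists the exception types from outermost to innermost.
    param([System.Exception]$Exception)
    $names = @()
    while ($null -ne $Exception) {
        $names += $Exception.GetType().Name
        $Exception = $Exception.InnerException
    }
    return ($names -join ' -> ')
}

# A task that has already faulted with a connection-refused socket error.
$refused = [int][System.Net.Sockets.SocketError]::ConnectionRefused
$tcs = [System.Threading.Tasks.TaskCompletionSource[int]]::new()
$tcs.SetException([System.Net.Sockets.SocketException]::new($refused))
$task = $tcs.Task

# .Result surfaces the failure wrapped in an AggregateException.
try { $null = $task.Result }
catch { $viaResult = Get-ExceptionChain $_.Exception }

# GetAwaiter().GetResult() rethrows the original exception instead.
try { $null = $task.GetAwaiter().GetResult() }
catch { $viaGetResult = Get-ExceptionChain $_.Exception }

"Via .Result:     $viaResult"
"Via GetResult(): $viaGetResult"
```

PowerShell adds its own outer wrapper in both cases, but only the .Result path puts an AggregateException between the catch block and the SocketException.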
Why 502 is in Phase 1 and 504 is not
502 from Traefik or any fronting proxy means the request never reached the application, so retry cannot double-execute a non-idempotent script. 504 means the request might have run and the response got lost. Silent retry on 504 is a disaster waiting to happen for any cmdlet with side effects (New-Item, Send-MailMessage, Publish-Item).
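The split between safe and ambiguous failures could be expressed roughly as follows. This is a sketch under assumptions, not the branch's implementation: Test-ExampleSafeTransportError and its parameters are hypothetical names, and the status code is passed in separately because how the real helper receives it is not shown in this issue.

```powershell
function Test-ExampleSafeTransportError {
    # Hypothetical stand-in for the classifier described above.
    param(
        [System.Exception]$Exception,
        [int]$StatusCode = 0
    )

    # 502 is treated as pre-send, per the reasoning above; 504 is deliberately not listed.
    if ($StatusCode -eq 502) { return $true }

    $safeCodes = @(
        [System.Net.Sockets.SocketError]::ConnectionRefused,  # 10061
        [System.Net.Sockets.SocketError]::HostNotFound,       # 11001
        [System.Net.Sockets.SocketError]::TryAgain,           # 11002
        [System.Net.Sockets.SocketError]::NoData              # 11004
    )

    # Walk the wrapped-exception chain until a SocketException turns up.
    $current = $Exception
    while ($null -ne $current) {
        if ($current -is [System.Net.Sockets.SocketException]) {
            return ($safeCodes -contains $current.SocketErrorCode)
        }
        $current = $current.InnerException
    }

    # Anything else (generic IOException, timeouts, null) is not provably pre-send.
    return $false
}

# Usage: a refused connection wrapped in an outer exception is still recognized.
$refused = [System.Net.Sockets.SocketException]::new([int][System.Net.Sockets.SocketError]::ConnectionRefused)
$wrapped = [System.Exception]::new('request failed', $refused)
Test-ExampleSafeTransportError -Exception $wrapped
```

Returning $false by default keeps the classifier conservative: a new or unrecognized failure mode is never retried silently.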
Cmdlet idempotency map
The Phase 2 design needs this to decide what should auto-retry vs require an explicit opt-in. Server endpoints land in three buckets:
| Cmdlet | Endpoint | Mutates server? | Retry-safe? |
| --- | --- | --- | --- |
| Test-RemoteConnection | GET probe | No | Yes |
| Receive-RemoteItem (file/media) | GET /-/script/file/... or /media/... | No | Yes (but starts from byte 0; see #1488 for resume support) |
| Invoke-RemoteWait | GET /-/script/wait/... | No | Yes (long-poll, designed for re-attempts) |
| Wait-RemoteScriptSession | composes Invoke-RemoteWait + final receive | No | Yes |
| Wait-RemoteSitecoreJob | composes Invoke-RemoteScript poll loop | No | Yes |
| Stop-ScriptSession | POST /-/script/script/?action=cleanup | Yes (removes session) | Effectively yes - re-cleanup of an already-removed session is a no-op |
| Send-RemoteItem (file) | POST /-/script/file/... | Yes (overwrites) | Conditional - same bytes + fixed destination = same end state, network-wasteful |
| Send-RemoteItem (media) | POST /-/script/media/... | Yes | Not when versioned media is on (Settings.Media.UploadAsVersionableByDefault); each retry creates a new version |
| Invoke-RemoteScript | POST /-/script/script/?action=execute | Depends entirely on script body | No by default - script could Send-MailMessage, Publish-Item, increment counters |
| Invoke-RemoteScript -AsJob | same | Yes (starts a job) | No - a second POST starts a second job, even if the first ran |
This refines the Phase 2 question of "what should -RetryOnConnectionFailure cover":
| Cmdlet | Suggested ambiguous-failure behavior |
| --- | --- |
| Receive-RemoteItem, Invoke-RemoteWait, Wait-*, Stop-ScriptSession | Auto-retry under -MaxRetries. Safe by construction. Could land as Phase 1.5 ahead of the opt-in switch. |
| Send-RemoteItem non-versioned + fixed destination | Per-call opt-in (caller declares idempotency) |
| Send-RemoteItem versioned media | User opt-in only - duplicate versions are a real cost |
| Invoke-RemoteScript | User opt-in only (the existing Phase 2 design) |
Phase 2 (under consideration): opt-in for ambiguous failures
Adds a switch (working name -RetryOnConnectionFailure) to Invoke-RemoteScript. Shares the existing -MaxRetries budget. When set, also retries:
| Failure | Why it is ambiguous |
| --- | --- |
| HTTP 504 Gateway Timeout | Upstream may have processed before the timeout. |
| SocketException ConnectionReset (10054) | Reset could be either side of the request boundary. |
| IOException mid-read | Connection died between request and response. |
| TaskCanceledException from HttpClient timeout | Request may or may not have completed server-side. |
User is responsible for ensuring scripts are idempotent before opting in. The docstring will say so explicitly. Matches AWS SDK / Polly idiom.
Open questions before committing to the design:
- Whether to land "Phase 1.5" first - auto-retry under -MaxRetries for the safe-by-construction cmdlets (Receive-RemoteItem, Invoke-RemoteWait, Wait-*, Stop-ScriptSession), which today do not retry at all on transport blips. This is strictly additive and does not need an opt-in switch.
- Switch name for the script / upload path. Options considered: -RetryOnConnectionFailure, -RetryUnsafeFailures, or a separate -IdempotentScript switch that broadens what -MaxRetries covers.
- Whether the opt-in switch lives only on Invoke-RemoteScript and Send-RemoteItem, or whether Send-RemoteItem's case is split further by versioned-media vs non-versioned.
- Default backoff cap when ambiguous-retry kicks in. Same 10s ceiling, or longer because 504 likely indicates a longer-running upstream?
- Whether to log a Warning on each ambiguous retry to flag duplicate-execution risk to operators.
Heuristic values (re-tune triggers)
| Constant | Value | Why | When to revisit |
| --- | --- | --- | --- |
| Retry-After jitter spread | +/-20% | Breaks wall-clock sync without ignoring the server hint | Observed retry storms still cluster |
| Wait-RemoteConnection base/cap | 2s / 15s | Cold containers settle in 10-60s | Cold-start times shift (e.g. heavier Sitecore baseline) |
| 502/503 cap | 10s | Existing $retryCeiling503, kept for back-compat | CMs unavailable longer than 10s per probe |
| Full-jitter exponent ceiling | 2^min(attempt, 6) | Prevents pathological growth at large attempts | Only matters above 6 attempts; -MaxRetries validates 0-10 |
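To show how these constants interact, the snippet below computes the upper bound of the full-jitter window per attempt for the Wait-RemoteConnection values (base 2s, cap 15s). It assumes attempts are counted from 0, which this issue does not state.

```powershell
$baseSeconds = 2.0
$capSeconds = 15.0

# Upper bound of the jitter window: min(cap, base * 2^min(attempt, 6)).
$schedule = 0..4 | ForEach-Object {
    $exponent = [Math]::Min($_, 6)
    [pscustomobject]@{
        Attempt           = $_
        UpperBoundSeconds = [Math]::Min($capSeconds, $baseSeconds * [Math]::Pow(2, $exponent))
    }
}

$schedule | Format-Table -AutoSize
# Upper bounds for attempts 0..4: 2, 4, 8, 15, 15
```

With these values the cap takes over at attempt 3 (2 * 2^3 = 16 exceeds 15), so the exponent ceiling of 6 would only become the limiting factor for a cap above 128 times the base.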
Compatibility
| Path | Phase 1 behavior |
| --- | --- |
| Default (-MaxRetries=0) | Identical: no retry on anything. |
| -MaxRetries N against 429/503 | Identical retry count; sleep is now jittered around Retry-After. |
| -MaxRetries N against 502 / DNS / connection refused | Retries (was: immediate fail). |
| Concurrent clients on shared Retry-After | Spread by +/-20% instead of rebounding in lockstep. |
No template changes, no serialized item migrations, no auth surface changes.
Tests
Unit (tests/unit/SPE.ClientRetry.Tests.ps1):
- 50-sample bounds checks for Get-SpeBackoffDelay (both modes, cap honored).
- Test-SpeSafeTransportError covers 502, 504-not-safe, all four SocketException codes, wrapped exception chain, null.
- 502 retry: with and without -MaxRetries.
- Transport-error retry: ConnectionRefused retried, ConnectionReset not retried, default-off without -MaxRetries.
- Existing 429/503 timing assertion relaxed to accept the new jitter floor (~0.8s).
Integration (tests/integration/Remoting.ClientRetry.Tests.ps1):
- Group 5: closed loopback port (TcpListener bound, then Stop-ed). Asserts no-retry baseline < 5s and retry adds >= 1.5s.
- Group 6: HttpListener stub returns 502 on call 1 and 200 on call 2. Verifies end-to-end behavior through the real HttpClient and socket layer.
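The closed-loopback-port technique used by Group 5 can be sketched in isolation as below. This is not the test file's code: it uses a plain TcpClient instead of the SPE cmdlets, and the variable names are made up for the example.

```powershell
# Bind to port 0 so the OS picks a free port, remember it, then close the listener.
$listener = [System.Net.Sockets.TcpListener]::new([System.Net.IPAddress]::Loopback, 0)
$listener.Start()
$closedPort = ([System.Net.IPEndPoint]$listener.LocalEndpoint).Port
$listener.Stop()

# Nothing is listening on that port any more, so a connect attempt should be refused.
$client = [System.Net.Sockets.TcpClient]::new()
$socketError = $null
try {
    $client.Connect([System.Net.IPAddress]::Loopback, $closedPort)
}
catch {
    # PowerShell wraps the .NET exception; the SocketException is the innermost one.
    $socketError = $_.Exception.GetBaseException().SocketErrorCode
}
finally {
    $client.Close()
}

"Connect to closed port $closedPort failed with: $socketError"
```

Asking the OS for a port and then releasing it avoids hard-coding a port number that might be in use on a CI runner; there is a small window in which another process could claim the port, which is usually acceptable for a test.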
Bundled bug fix
Invoke-RemoteScript -ConnectionUri ... (inline, no -Session) was silently broken: $UseDefaultCredentials was never initialized in the URI-only branch, and New-SpeHttpClient rejected the empty string when binding [bool]. The pre-jitter retry path consumed the resulting null-reference error, so the user just saw "No response returned" instead of a connect error. Fixed by defaulting to $false. Surfaced while writing the closed-port integration test.
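The binding failure behind this bug can be reproduced outside SPE. In the sketch below, Test-ExampleBoolBinding is a hypothetical function standing in for New-SpeHttpClient, and the empty string stands in for whatever the uninitialized variable had become by the time it reached the call; the exact path inside SPE is not shown here.

```powershell
function Test-ExampleBoolBinding {
    # Hypothetical stand-in for a function with a [bool] parameter.
    [CmdletBinding()]
    param([bool]$UseDefaultCredentials)
    "bound as: $UseDefaultCredentials"
}

# An empty string is rejected by the parameter binder for a [bool] parameter.
$bindingError = $null
try { Test-ExampleBoolBinding -UseDefaultCredentials '' }
catch { $bindingError = $_.Exception.Message }

# Defaulting to $false before the call binds cleanly.
$result = Test-ExampleBoolBinding -UseDefaultCredentials $false

"Empty string: $bindingError"
"Explicit `$false: $result"
```

Note that this differs from a cast: [bool]'' evaluates to $false, but parameter binding for [bool] accepts only Boolean values and numbers, which is why the missing default surfaced as an error rather than as a silent $false.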
Branch
feature/jittered-backoff off release/9.0. Three commits; they will be rebased to carry the #1487: prefix.