Jittered backoff and safe-transport retry for remoting client

## Summary

After a Sitecore app-pool recycle or deployment, every SPE client running against that CM retries on the same wall-clock tick, because every retry sleep is a flat `Start-Sleep -Seconds N`. The synchronized burst from a fleet of CI runners, scheduled scripts, and automation hits the CM while it is still warming up, exactly when it is least able to absorb load. This issue replaces the flat sleeps with jittered backoff at four retry sites and adds a classifier that retries unambiguously safe transport failures (DNS, connection refused, 502) under `-MaxRetries`. Phase 1 is in branch `feature/jittered-backoff`. Phase 2 (under consideration) adds an opt-in switch for ambiguous failures.

## Problem

A `task deploy`, an IIS app-pool recycle, a Docker compose `up`, or any Sitecore restart leaves the CM unable to serve requests for roughly 5-30 seconds. Any SPE client running against that CM during that window hits one of three retry paths, all of which sleep a flat number of seconds:

- `Wait-RemoteConnection`: flat 2s between warmup probes.
- `Invoke-RemoteScript`: sleeps exactly `Retry-After` on 429/503.
- `Wait-RemoteScriptSession`: flat `-Delay` on TransportError or legacy poll.

When a fleet of clients was already running before the recycle (CI pipelines, scheduled remoting scripts, the warmup loop in `task test`, an operator's local PowerShell session), they all start probing again on the *same* tick after the outage. The CM gets a synchronized burst of N requests every 2s while it is still warming up. Worst case: the CM finishes warming, accepts the burst, overloads, looks unhealthy, and recycles again.

Separately, when the CM is genuinely unreachable (DNS hiccup mid-recycle, connection refused on app-pool restart, 502 from Traefik when the upstream dies), `Invoke-RemoteScript` fails immediately even with `-MaxRetries 5`, because today's retry path only handles 429/503 status codes. Operators have to script their own retry loop on top, defeating the purpose of `-MaxRetries`.

## Phase 1: jittered backoff and safe-transport retry

Two private helpers in `SPE.psm1`:

- `Get-SpeBackoffDelay`. With a `Retry-After` hint, jitters +/-20% so concurrent clients do not rebound on the same tick. Without a hint, full jitter: `rand(0.1, min(cap, base * 2^min(attempt,6)))`.
- `Test-SpeSafeTransportError`. Returns `$true` only when the failure is unambiguously pre-send: HTTP 502, or `SocketException` codes `ConnectionRefused` (10061), `HostNotFound` (11001), `TryAgain` (11002), `NoData` (11004). Mid-stream resets, 504, post-send timeouts, and generic `IOException` deliberately return `$false`.
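
The two delay modes described for `Get-SpeBackoffDelay` can be sketched roughly as follows. This is a minimal illustration, not the code in `SPE.psm1`: the function name `Get-SpeBackoffDelaySketch`, its parameter names, and its default base and cap are assumptions made for the example.

```powershell
# Hypothetical sketch; names and defaults are assumptions, not the SPE.psm1 source.
function Get-SpeBackoffDelaySketch {
    param(
        [int]$Attempt = 0,
        [double]$BaseSeconds = 1.0,
        [double]$CapSeconds = 10.0,
        [double]$RetryAfterSeconds = 0   # 0 means "no server hint"
    )

    if ($RetryAfterSeconds -gt 0) {
        # Server hint present: honor it, but spread clients across +/-20% of it.
        $low = $RetryAfterSeconds * 0.8
        $high = $RetryAfterSeconds * 1.2
        return Get-Random -Minimum $low -Maximum $high
    }

    # No hint: full jitter between a 0.1s floor and the capped exponential ceiling.
    $exponent = [Math]::Min($Attempt, 6)
    $ceiling = [Math]::Min($CapSeconds, $BaseSeconds * [Math]::Pow(2, $exponent))
    if ($ceiling -le 0.1) { return 0.1 }   # avoid an empty random range
    return Get-Random -Minimum 0.1 -Maximum $ceiling
}
```

A caller would then sleep for the returned value, for example `Start-Sleep -Milliseconds ([int]((Get-SpeBackoffDelaySketch -Attempt 2) * 1000))`.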

Wired into four retry sites:

| Site | Behavior change |
|---|---|
| `Invoke-RemoteScript` | Jitter on 429/502/503 retry; new safe-transport retry under `-MaxRetries`. Also: switched `.Result` to `GetAwaiter().GetResult()` so the catch sees unwrapped exceptions. |
| `Invoke-RemoteWait` | Jitter on 429/502/503 retry. |
| `Wait-RemoteScriptSession` | Jitter on TransportError sleep + jittered legacy-poll cadence. |
| `Wait-RemoteConnection` | Jittered backoff (base 2s, cap 15s) in the deploy-warmup loop. |
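
The `.Result` to `GetAwaiter().GetResult()` switch can be illustrated with a faulted task standing in for a failed `HttpClient` call. The helper `Get-ExceptionChainNames` is hypothetical and exists only for this example. PowerShell still adds its own invocation wrapper in both cases, so a classifier would presumably walk `InnerException` either way; the difference is whether an `AggregateException` layer sits in between.

```powershell
# A faulted task standing in for a failed HttpClient call (illustrative only).
$tcs = [System.Threading.Tasks.TaskCompletionSource[int]]::new()
$tcs.SetException([System.Net.Sockets.SocketException]::new(10061))

# Hypothetical helper: list the type names along the InnerException chain.
function Get-ExceptionChainNames {
    param([System.Exception]$Exception)
    $names = @()
    while ($null -ne $Exception) {
        $names += $Exception.GetType().Name
        $Exception = $Exception.InnerException
    }
    return ,$names
}

$viaResult = @()
$viaAwaiter = @()

# .Result surfaces the failure wrapped in an AggregateException.
try { $null = $tcs.Task.Result }
catch { $viaResult = Get-ExceptionChainNames $_.Exception }

# GetAwaiter().GetResult() rethrows the original exception without that layer.
try { $null = $tcs.Task.GetAwaiter().GetResult() }
catch { $viaAwaiter = Get-ExceptionChainNames $_.Exception }

"via .Result:                   $($viaResult -join ' -> ')"
"via GetAwaiter().GetResult():  $($viaAwaiter -join ' -> ')"
```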

### Why 502 is in Phase 1 and 504 is not

502 from Traefik or any fronting proxy means *the request never reached the application*, so retry cannot double-execute a non-idempotent script. 504 means the request might have run and the response got lost. Silent retry on 504 is a disaster waiting to happen for any cmdlet with side effects (`New-Item`, `Send-MailMessage`, `Publish-Item`).
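
The safe/unsafe split can be sketched as a small classifier. This is an illustration under stated assumptions, not the helper in `SPE.psm1`: the name `Test-SpeSafeTransportErrorSketch` and its two parameters are invented for the example, and the real signature may differ.

```powershell
# Hypothetical sketch; the real helper's signature in SPE.psm1 may differ.
function Test-SpeSafeTransportErrorSketch {
    param(
        [System.Exception]$Exception,
        [int]$StatusCode = 0
    )

    # 502: the proxy could not reach the upstream, so the script never started.
    if ($StatusCode -eq 502) { return $true }

    $safeCodes = @(
        [System.Net.Sockets.SocketError]::ConnectionRefused   # 10061
        [System.Net.Sockets.SocketError]::HostNotFound        # 11001
        [System.Net.Sockets.SocketError]::TryAgain            # 11002
        [System.Net.Sockets.SocketError]::NoData              # 11004
    )

    # Walk the wrapper chain until a SocketException is found.
    $current = $Exception
    while ($null -ne $current) {
        if ($current -is [System.Net.Sockets.SocketException]) {
            return ($safeCodes -contains $current.SocketErrorCode)
        }
        $current = $current.InnerException
    }

    # 504, timeouts, generic IOException, and null all fall through as unsafe.
    return $false
}
```

Defaulting to `$false` keeps the classifier conservative: anything it does not positively recognize as pre-send is treated as possibly executed.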

## Cmdlet idempotency map

The Phase 2 design needs this to decide what should auto-retry vs require an explicit opt-in. Server endpoints land in three buckets:

| Cmdlet | Endpoint | Mutates server? | Retry-safe? |
|---|---|---|---|
| `Test-RemoteConnection` | GET probe | No | Yes |
| `Receive-RemoteItem` (file/media) | `GET /-/script/file/...` or `/media/...` | No | Yes (but starts from byte 0; see #1488 for resume support) |
| `Invoke-RemoteWait` | `GET /-/script/wait/...` | No | Yes (long-poll, designed for re-attempts) |
| `Wait-RemoteScriptSession` | composes `Invoke-RemoteWait` + final receive | No | Yes |
| `Wait-RemoteSitecoreJob` | composes `Invoke-RemoteScript` poll loop | No | Yes |
| `Stop-ScriptSession` | `POST /-/script/script/?action=cleanup` | Yes (removes session) | Effectively yes - re-cleanup of an already-removed session is a no-op |
| `Send-RemoteItem` (file) | `POST /-/script/file/...` | Yes (overwrites) | Conditional - same bytes + fixed destination = same end state, network-wasteful |
| `Send-RemoteItem` (media) | `POST /-/script/media/...` | Yes | Not when versioned media is on (`Settings.Media.UploadAsVersionableByDefault`); each retry creates a new version |
| `Invoke-RemoteScript` | `POST /-/script/script/?action=execute` | Depends entirely on script body | No by default - script could `Send-MailMessage`, `Publish-Item`, increment counters |
| `Invoke-RemoteScript -AsJob` | same | Yes (starts a job) | No - a second POST starts a second job, even if the first ran |

This refines the Phase 2 question of "what should `-RetryOnConnectionFailure` cover":

| Cmdlet | Suggested ambiguous-failure behavior |
|---|---|
| `Receive-RemoteItem`, `Invoke-RemoteWait`, `Wait-*`, `Stop-ScriptSession` | Auto-retry under `-MaxRetries`. Safe by construction. Could land as Phase 1.5 ahead of the opt-in switch. |
| `Send-RemoteItem` non-versioned + fixed destination | Per-call opt-in (caller declares idempotency) |
| `Send-RemoteItem` versioned media | User opt-in only - duplicate versions are a real cost |
| `Invoke-RemoteScript` | User opt-in only (the existing Phase 2 design) |
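
One way to read the two tables together is as a per-cmdlet policy lookup. The sketch below is purely illustrative: the bucket names `Auto` and `OptIn`, the `$retryPolicy` table, and the function `Test-AmbiguousRetryAllowed` are assumptions for the example and are not part of SPE.

```powershell
# Illustrative policy table; bucket names are assumptions, not SPE identifiers.
$retryPolicy = @{
    'Test-RemoteConnection'    = 'Auto'    # retry-safe per the idempotency map
    'Receive-RemoteItem'       = 'Auto'
    'Invoke-RemoteWait'        = 'Auto'
    'Wait-RemoteScriptSession' = 'Auto'
    'Wait-RemoteSitecoreJob'   = 'Auto'
    'Stop-ScriptSession'       = 'Auto'
    'Send-RemoteItem'          = 'OptIn'   # caller must declare idempotency
    'Invoke-RemoteScript'      = 'OptIn'   # script body may have side effects
}

# Hypothetical helper: may this cmdlet retry an ambiguous failure?
function Test-AmbiguousRetryAllowed {
    param(
        [string]$Cmdlet,
        [switch]$OptedIn
    )
    $bucket = $retryPolicy[$Cmdlet]
    if ($bucket -eq 'Auto') { return $true }
    if ($bucket -eq 'OptIn') { return $OptedIn.IsPresent }
    return $false   # unknown cmdlets never retry ambiguous failures
}
```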

## Phase 2 (under consideration): opt-in for ambiguous failures

Adds a switch (working name `-RetryOnConnectionFailure`) to `Invoke-RemoteScript`. Shares the existing `-MaxRetries` budget. When set, also retries:

| Failure | Why it is ambiguous |
|---|---|
| HTTP 504 Gateway Timeout | Upstream may have processed before the timeout. |
| `SocketException` ConnectionReset (10054) | Reset could be either side of the request boundary. |
| `IOException` mid-read | Connection died between request and response. |
| `TaskCanceledException` from HttpClient timeout | Request may or may not have completed server-side. |

The user is responsible for ensuring scripts are idempotent before opting in, and the docstring will say so explicitly. This matches the AWS SDK / Polly idiom.

**Open questions** before committing to the design:

- Whether to land "Phase 1.5" first - auto-retry under `-MaxRetries` for the safe-by-construction cmdlets (`Receive-RemoteItem`, `Invoke-RemoteWait`, `Wait-*`, `Stop-ScriptSession`), which today do not retry at all on transport blips. This is strictly additive and does not need an opt-in switch.
- Switch name for the script / upload path. Options considered: `-RetryOnConnectionFailure`, `-RetryUnsafeFailures`, or a separate `-IdempotentScript` switch that broadens what `-MaxRetries` covers.
- Whether the opt-in switch lives only on `Invoke-RemoteScript` and `Send-RemoteItem`, or whether `Send-RemoteItem`'s case is split further by versioned-media vs non-versioned.
- Default backoff cap when ambiguous-retry kicks in. Same 10s ceiling, or longer because 504 likely indicates a longer-running upstream?
- Whether to log a Warning on each ambiguous retry to flag duplicate-execution risk to operators.

## Heuristic values (re-tune triggers)

| Constant | Value | Why | When to revisit |
|---|---|---|---|
| Retry-After jitter spread | +/-20% | Breaks wall-clock sync without ignoring the server hint | Observed retry storms still cluster |
| Wait-RemoteConnection base/cap | 2s / 15s | Cold containers settle in 10-60s | Cold-start times shift (e.g. heavier Sitecore baseline) |
| 502/503 cap | 10s | Existing `$retryCeiling503`, kept for back-compat | CMs unavailable longer than 10s per probe |
| Full-jitter exponent ceiling | 2^min(attempt, 6) | Prevents pathological growth at large attempts | Only matters above 6 attempts; `-MaxRetries` validates 0-10 |
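
To see how the cap and the exponent ceiling interact, the upper bound of the full-jitter window can be computed per attempt from the formula `min(cap, base * 2^min(attempt, 6))`. The sketch uses the `Wait-RemoteConnection` values from the table; numbering attempts from 1 is an assumption.

```powershell
# Upper bound of the full-jitter window per attempt (base 2s, cap 15s).
# Attempt numbering starting at 1 is an assumption for this illustration.
$base = 2.0
$cap = 15.0

$ceilings = foreach ($attempt in 1..8) {
    [Math]::Min($cap, $base * [Math]::Pow(2, [Math]::Min($attempt, 6)))
}

# The same bound without the cap, showing where the exponent ceiling takes over.
$uncapped = foreach ($attempt in 1..8) {
    $base * [Math]::Pow(2, [Math]::Min($attempt, 6))
}

$ceilings -join ', '   # 4, 8, 15, 15, 15, 15, 15, 15
$uncapped -join ', '   # 4, 8, 16, 32, 64, 128, 128, 128
```

With these values the 15s cap binds from the third attempt onward, well before the exponent ceiling at 6 has any effect.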

## Compatibility

| Path | Phase 1 behavior |
|---|---|
| Default (`-MaxRetries 0`) | Identical: no retry on anything. |
| `-MaxRetries N` against 429/503 | Identical retry count; sleep is now jittered around `Retry-After`. |
| `-MaxRetries N` against 502 / DNS / connection refused | Retries (was: immediate fail). |
| Concurrent clients on shared `Retry-After` | Spread by +/-20% instead of rebounding in lockstep. |

No template changes, no serialized item migrations, no auth surface changes.

## Tests

**Unit** (`tests/unit/SPE.ClientRetry.Tests.ps1`):

- 50-sample bounds checks for `Get-SpeBackoffDelay` (both modes, cap honored).
- `Test-SpeSafeTransportError` covers 502, 504-not-safe, all four `SocketException` codes, wrapped exception chain, null.
- 502 retry: with and without `-MaxRetries`.
- Transport-error retry: `ConnectionRefused` retried, `ConnectionReset` not retried, default-off without `-MaxRetries`.
- Existing 429/503 timing assertion relaxed to accept the new jitter floor (~0.8s).

**Integration** (`tests/integration/Remoting.ClientRetry.Tests.ps1`):

- Group 5: closed loopback port (`TcpListener` bound, then `Stop`-ed). Asserts no-retry baseline < 5s and retry adds >= 1.5s.
- Group 6: `HttpListener` stub returns 502 on call 1 and 200 on call 2. Verifies end-to-end behavior through the real `HttpClient` and socket layer.
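
The closed-loopback-port setup in Group 5 can be sketched as follows. This is a standalone illustration of the technique, not the contents of `Remoting.ClientRetry.Tests.ps1`; it uses a raw `TcpClient` instead of the remoting cmdlets, and the variable names are chosen for the example.

```powershell
# Bind to an OS-assigned loopback port, then release it so nothing is listening.
$listener = [System.Net.Sockets.TcpListener]::new([System.Net.IPAddress]::Loopback, 0)
$listener.Start()
$port = $listener.LocalEndpoint.Port
$listener.Stop()

$client = [System.Net.Sockets.TcpClient]::new()
$socketError = $null
try {
    $client.Connect([System.Net.IPAddress]::Loopback, $port)
}
catch {
    # PowerShell wraps the failure; walk down to the SocketException.
    $ex = $_.Exception
    while ($null -ne $ex -and $ex -isnot [System.Net.Sockets.SocketException]) {
        $ex = $ex.InnerException
    }
    if ($null -ne $ex) { $socketError = $ex.SocketErrorCode }
}
finally {
    $client.Close()
}

"Connect to closed port $port failed with: $socketError"
```

Because the port was valid a moment earlier, the failure is a genuine connection refusal rather than a routing or DNS error, which makes it a convenient fixture for the safe-transport retry path.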

## Bundled bug fix

`Invoke-RemoteScript -ConnectionUri ...` (inline, no `-Session`) was silently broken: `$UseDefaultCredentials` was never initialized in the URI-only branch, and `New-SpeHttpClient` rejected the empty string when binding `[bool]`. The pre-jitter retry path consumed the resulting null-reference error, so the user just saw "No response returned" instead of a connect error. Fixed by defaulting to `$false`. Surfaced while writing the closed-port integration test.

## Branch

`feature/jittered-backoff` off `release/9.0`. Three commits, will rebase to `#1487:` prefix.



