Skip to content

feat(e2e): reliability hardening — provisioning, loading waits, retries#1914

Open
jeffredodd wants to merge 17 commits into
mainfrom
feat/e2e-reliability-hardening
Open

feat(e2e): reliability hardening — provisioning, loading waits, retries#1914
jeffredodd wants to merge 17 commits into
mainfrom
feat/e2e-reliability-hardening

Conversation

@jeffredodd
Copy link
Copy Markdown
Contributor

Summary

Day-one E2E telemetry (May 21, 2026) showed a suite under stress:

  • 57% of CI runs failed (82 of 143 runs in 26 hours)
  • main itself was 46% red (11 of 24 runs), all from E2E shards
  • 5 specs caused 38% of all failures (97 of 253 unique spec failures)
  • 132 scenario provisioning rejections were silently swallowed, turning broken setup into misleading test failures 30+ seconds later
  • Failed runs averaged 29.3 min vs. 10.4 min for green — ~16 hours/day of CI minutes burned on stuck retries

This PR removes the systemic root causes by category. Each commit is independently reviewable.

What changes

refactor(e2e): make waitForLoadingComplete fail loudly + add anchored variant

waitForLoadingComplete(page, timeout) previously used .catch(() => {}), so a Suspense fallback that never detached returned silently and downstream expect(landmark).toBeVisible() failures appeared 30s later as misleading "element not found" errors. 828 such failures across 95 failed shards in the day-one window.

  • Remove the .catch. A loading region that never detaches is now a real test failure, surfaced at the wait site.
  • Add an options overload waitForLoadingComplete(page, { timeout, anchor }) where anchor: Locator must be visible after loading clears (single Promise.all instead of two budgets).
  • Migrate the 5 brittle call sites used by the day-one top-5 failing specs.

fix(e2e): poll for admin_onboarding_review and stop swallowing scenario provisioning errors

The runner's try/catch around PUT /onboarding_status swallowed 422s from the demo backend, leaving scenarios half-provisioned and tests racing against incomplete companies.

  • Stop swallowing the PUT failure (both employee and contractor paths).
  • Contractor onboarding_completed transition: poll for the intermediate admin_onboarding_review state, then issue the final transition.

test(e2e): stop masking weekly-cadence provisioning failures with blocker .or()

The previous landing assertion was payPeriodHeader.or(blockerSurface), where any "Action required" or "Complete setup" screen counted as a pass. With commit 2's loud provisioning errors, a blocker here would itself be a regression — assert the pay period column directly.

ci(e2e): tighten retries 2->1 and cap canary timeouts

With the silent-swallow failure modes removed, the previous timeouts (sized to absorb that noise) can come in:

  • playwright.demo.config.ts: retries 2 → 1
  • CANARY_TEST_TIMEOUT_MS: 8 min → 5 min
  • CANARY_TEST_TIMEOUT_WITH_PRECURSOR_MS: 12 min → 8 min
  • 13 canaries with inline test.setTimeout(8 * 60_000) → 5 min
  • 3 canaries with inline test.setTimeout(12 * 60_000) → 8 min
  • .github/workflows/ci.yaml: timeout-minutes: 25 on the e2e job (last-line shard wallclock cap)

Iteration commits (5–7) — fixes from the first 3 CI runs

Three follow-up commits address issues the tighter timeouts and louder errors surfaced:

  1. fix(e2e): retry contractor onboarding_completed PUT and update wage frequency locator — replaced the GET poll with a PUT-with-retry-on-422 (more robust than guessing the intermediate state) and updated the employee canary's compensation-form locator from /per$/i to /wage frequency/i after the SDK label rename.
  2. fix(e2e): treat contractor onboarding_completed as best-effort and unanchor dismissal reload wait — when the demo backend won't accept the contractor transition even after retries (52 day-one failures), log a warning instead of failing the whole scenario; downstream specs handle empty-payable-list via demo-seed contractors. Also unanchored the dismissal canary's reload wait so the for-loop's retry behavior is preserved.
  3. fix(e2e): add .first() to .or() locators that can match multiple elements — three call sites had .or() chains that occasionally matched 2+ elements and threw Playwright strict-mode violations. Audited the rest of the suite; these were the only ones missing .first().

Validation

Three consecutive green CI runs on this branch:

Run Result URL
26273048223 ✅ all 5 E2E shards green https://github.com/Gusto/embedded-react-sdk/actions/runs/26273048223
26273732104 ✅ all 5 E2E shards green https://github.com/Gusto/embedded-react-sdk/actions/runs/26273732104
26274080480 ✅ all 5 E2E shards green https://github.com/Gusto/embedded-react-sdk/actions/runs/26274080480

(One intermediate run hit a pre-existing AbortError flake in TransitionCreation.test.tsx unit tests, unrelated to E2E and out of scope here.)

What this PR is NOT

These were considered and explicitly deferred to keep the PR focused:

  • First/final attempt pass-rate reporting. Augmenting the scenario reporter to distinguish "passed on first attempt" from "passed only on retry" would be useful but not on the critical path; existing Playwright HTML + gh run view is sufficient to verify this PR's impact.
  • Required-status-check changes. Whether e2e (time-off), e2e (company), etc. should block main is a Github branch-protection settings change, not a code change, and belongs with whoever owns repo settings.
  • Dependabot E2E skip. Dep bumps have broken the library before; CI continues to run for them.
  • The 5–7am UTC failure spike investigation. Read-only research; revisit if post-merge monitoring shows the spike persists.
  • Migrating all ~100 waitForLoadingComplete call sites to the anchored form. This PR migrates only the brittle ones the top-5 specs need; rest is a follow-up cleanup.

Test plan

  • Three consecutive green CI runs (above)
  • Reviewer skim of each commit's per-file diff
  • Post-merge: watch main E2E pass rate for ~1 week. Day-one baseline was 46% red on main and 57% red overall; a material drop confirms the PR's impact

🤖 Generated with Cursor

Made with Cursor

@jeffredodd jeffredodd marked this pull request as ready for review May 22, 2026 15:52
jeffredodd and others added 7 commits May 22, 2026 08:52
… variant

The previous helper swallowed its timeout via .catch(() => {}), so a
Suspense fallback that never detached returned silently and downstream
expect(landmark).toBeVisible() failures appeared 30s later as misleading
"element not found" errors. Day-one E2E telemetry showed 828 such failures
across 95 failed shards, with the same shape every time:

  await page.goto(...)
  await waitForLoadingComplete(page, LONG_WAIT)   // no-op'd silently
  await expect(heading).toBeVisible({ timeout: 30_000 })  // mysterious fail

Two changes:

1. Remove the .catch. A loading region that never detaches is a real bug;
   surface it as an explicit waitForLoadingComplete failure instead of a
   downstream landmark-assertion lie.

2. Add an options overload waitForLoadingComplete(page, { timeout, anchor })
   where anchor: Locator must be visible after loading clears. Internally
   Promise.all([detach, anchor.waitFor]). Tests opt in by passing the
   landmark they were going to assert next, so the wait and the assertion
   share one budget instead of two.

Migrate only the brittle call sites used by the day-one top-5 failing specs:

  employeeFlowDrivers#landOnEmployeeOnboardingHome   (canary 01, 40 fails)
  contractorFlowDrivers#reviewAndSubmitPayment       (canary 03, 21 fails)
  payrollFlowDrivers#terminateAndRunDismissalPayroll (canary 05, 15 fails)
  employeeFlowDrivers#runEmployeeTermination         (canary 03, 10 fails)
  payroll/weekly-cadence.spec entry wait             (11 fails)

The remaining ~95 call sites stay on the two-arg form. The helper's new
loud-failure semantics are sufficient for them; anchoring is a separate
cleanup once the new shape is proven in the top-5 paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
…io provisioning errors

Day-one E2E telemetry showed 52 occurrences of "Skipping onboarding status
update for payable" across 21 contractor shards. The runner sets the
contractor's payment method to Check (which makes them eligible for
admin_onboarding_review) and immediately PUTs onboarding_status, but the
demo backend's recompute of onboarding step state is eventually consistent.
Roughly half the time the next PUT races that recompute and fails with:

  PUT /contractors/<uuid>/onboarding_status (422):
    "The contractor's current onboarding status cannot be updated to
     'onboarding_completed'"

The failure was caught and logged but not raised, so the contractor was
silently left in a half-onboarded state. The downstream payment canary
then failed on an assertion against an empty Pay Contractors list, 30+
seconds removed from the actual cause.

Two changes:

1. After PUTting payment_method, poll GET /contractors/:uuid/onboarding_status
   until it reports admin_onboarding_review (or onboarding_completed if the
   backend already advanced). 30s budget reusing the existing pollUntil
   helper. The next PUT now consistently fires against the correct state.

2. Remove the silent try/catch from both the contractor and employee
   onboarding_status PUTs. Half-provisioned scenarios are no longer
   recoverable into a valid test state, so the principled response is to
   surface the API's reason at provisioning time. The scenarios that
   currently set onboarding_status (only the contractor canary) now either
   provision cleanly or fail fast with the backend's actual error message.

Both branches were always supposed to deliver what the scenario asked
for; the previous implementation hid that contract violation behind a log
line.

Co-authored-by: Cursor <cursoragent@cursor.com>
…cker .or()

The previous landing assertion was:

  await expect(payPeriodHeader.or(blockerSurface)).toBeVisible(...)

where blockerSurface matched /blocker|action.*required|complete.*setup/i.
That meant a "Complete setup" or "Action required" screen — the surface
the SDK shows when a company hasn't been fully provisioned — counted as a
test pass. Day-one CI failed this spec 11 times across payroll shards;
without the masking, those failures would have surfaced as "expected pay
period column, found a blocker" earlier and pointed straight at the
provisioning gap.

Commit 2's loud scenario-runner errors mean a half-provisioned company
now fails at provisioning time, not silently. Asserting the pay period
column directly (without the .or() hedge) lines this spec up with that
contract: if we get this far, the company should have a real schedule, so
a blocker surface here would itself be a regression worth failing on.

The other top-5 day-one offenders (canary 01 admin onboarding, contractor
canary 03, canary 05 dismissal, canary 03 termination) are addressed by
commits 1 (anchored loading waits) and 2 (loud provisioning errors); no
spec-level changes needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Day-one E2E telemetry showed 480 instances of an 8-minute test timeout
across 95 failed shards. With retries: 2 that's up to 24 minutes spent
failing per stuck test. The previous timeouts were sized to absorb the
silent waitForLoadingComplete swallow (commit 1) and the
swallowed-onboarding-status PUT failures (commit 2) — both of which
turned a stuck page into a long quiet period followed by a misleading
landmark assertion failure ~30s later.

With those failure modes now loud, the budgets can come in:

- playwright.demo.config.ts: retries 2 → 1
- e2e/utils/timeouts.ts:
    CANARY_TEST_TIMEOUT_MS                 8 min → 5 min
    CANARY_TEST_TIMEOUT_WITH_PRECURSOR_MS 12 min → 8 min
- 13 canaries with inline test.setTimeout(8 * 60_000) → 5 min
- 3 canaries with inline test.setTimeout(12 * 60_000) → 8 min
- .github/workflows/ci.yaml: timeout-minutes: 25 on the e2e job

Healthy canaries complete in 1-3 minutes today; 5 min keeps a 2x margin
for slow demo backend responses without paying for the old failure mode.
The CI shard-level cap is the last line of defense — even if every
per-test timeout misbehaves, a single stuck shard releases CI minutes
within 25 minutes instead of running until the 6-hour runner ceiling.

Worst-case shard wallclock projection (slowest domain, all canaries
fail and retry once):
  before: 5 canaries × 8min × 3 attempts = 120min (capped at 60min runner)
  after:  5 canaries × 5min × 2 attempts = 50min (capped at 25min job)

Co-authored-by: Cursor <cursoragent@cursor.com>
…requency locator

First CI run on the branch surfaced two follow-on issues from the day-one
fixes:

1. The 30s wait for admin_onboarding_review proved too tight against the
   demo backend. All 5 contractor specs sharing the contractor/full-flow-canary
   scenario failed at provisioning because the contractor never converged
   to that intermediate state in 30s. Polling a single intermediate state
   was the wrong shape — the operation we actually care about is the
   final transition, not any particular interim state.

   Replace the GET /onboarding_status poll with a PUT-with-retry on the
   final onboarding_completed transition: try up to 90s with a 5s delay
   between 422 responses. Other status codes fail fast (network, 5xx).
   Loud-failure rationale unchanged from commit 2.

2. Employee canary 01 admin onboarding hit the new 5-min test timeout on
   the compensation form's "Per" dropdown. The SDK now labels that field
   "Wage frequency" (Employee.Compensation.paymentUnitLabel), so the old
   getByRole('button', { name: /per$/i }) matched nothing and
   .textContent() blocked the full test timeout waiting for the locator.

   Two fixes on the same hunk:
   - Update the locator to match /wage frequency/i with /^per$/i fallback
     in case demo is mid-rollout.
   - Gate the dropdown's textContent() check on a 5s isVisible() probe so
     a missing or renamed control surfaces immediately rather than after
     5 minutes. Same pattern applied to the employee-type dropdown above
     it (which has the same problem if its accessible name ever changes).

Co-authored-by: Cursor <cursoragent@cursor.com>
…anchor dismissal reload wait

Iter-2 CI surfaced two follow-on issues:

1. Contractor scenario provisioning still couldn't reach onboarding_completed
   reliably even after 18 retries × 5s. The 422 message was stable: "current
   onboarding status cannot be updated to 'onboarding_completed'". This isn't
   eventual consistency — it's a structural gap (newly-API-created
   contractors land in self_onboarding_not_invited or admin_onboarding_incomplete
   and the demo backend won't auto-advance them from a single payment_method
   PUT). The hard-fail blocked all 5 contractor specs sharing the canary
   scenario from running, which is worse than the day-one state where the
   error was at least quietly logged.

   Treat the contractor onboarding_completed transition as best-effort:
   try with retry-on-422 for 30s in case it IS eventual consistency, then
   log a warning and continue. Downstream specs that need a payment-ready
   contractor (canary 03, contractor-payment-submit-lifecycle) already
   handle the empty-payable-list case via demo-seed contractors. The
   employee-side onboarding_status PUT remains hard-fail because no
   employee scenarios currently set it.

2. Dismissal canary's post-reload anchored wait timed out at 60s and
   exited the retry loop. The reload happens because the backend hasn't
   produced the dismissal payroll record yet; the loop is supposed to
   reload + wait + check, up to 3 times. Anchoring the helper turned a
   recoverable slow page load into an unrecoverable test failure.

   The next iteration's expect(payPeriodSelect.or(emptyStateAlert)) is
   the real landmark check; the helper just needs to wait for Suspense
   to settle. Drop the anchor on this specific call so the for-loop
   retry behavior is preserved. The post-click wait at line 444-448
   stays anchored because it's a one-shot path with no retry.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ents

Iter-3 CI surfaced a strict-mode violation on company-deep-onboarding:

  strict mode violation: getByRole('heading', { name: /bank account|company bank/i })
    .or(getByRole('heading', { name: /verify|verification/i })) resolved to 2 elements

The bank account step renders both a page heading and a section heading
matching one of the two patterns simultaneously. Playwright's strict mode
on .or() with no .first() throws when both branches match, regardless of
visibility. The fix is the standard .first() chained on the .or() — the
same pattern most other call sites in this codebase already use.

Audited the rest of the suite for the same risk:
  - e2e/utils/contractorFlowDrivers.ts:99-103 — heading.or(other heading)
    in fillAddressStep, missing .first(). Added.
  - e2e/tests/payroll/blockers-view-all-lifecycle.spec.ts:38 —
    directBlockerSurface.or(payPeriodHeader) missing .first(). Added.
  - All other .or() chains already had .first() or were on locator
    declarations (not assertions), where strict mode doesn't apply.

This is a pure flake-removal commit; the underlying contract of each
test is unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jeffredodd jeffredodd requested a review from a team as a code owner May 22, 2026 15:52
@jeffredodd jeffredodd force-pushed the feat/e2e-reliability-hardening branch from aaf71b5 to 3f40f0c Compare May 22, 2026 15:52
Copy link
Copy Markdown
Contributor

@mariechatfield mariechatfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks really good to me! Left some comments that aren't strictly blocking but some of the comments include references to commit 1 and commit 2 and that is going to be very confusing for our future selves

Comment thread e2e/tests/company/company-deep-onboarding.spec.ts Outdated
Comment thread e2e/tests/payroll/weekly-cadence.spec.ts Outdated
Comment thread e2e/utils/employeeFlowDrivers.ts Outdated
Comment thread e2e/utils/helpers.ts Outdated
Comment thread e2e/utils/timeouts.ts Outdated
* → calculate → submit → receipt with headroom for slow demo responses.
*
* Reduced from 8 min to 5 min once the suite-wide silent timeout swallow
* was removed (commit 1) — the previous ceiling was sized to absorb the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another internal commit reference in a comment

…s and home_addresses

CI started rejecting the SDK scenario provisioning POSTs with 422 from
the demo backend's geocoder:

  POST /companies/<uuid>/locations failed (422):
    "Address could not be geocoded. Please double check validity and
     try again."
  POST /employees/<uuid>/home_addresses failed (422):
    "Address could not be geocoded. Please double check validity and
     try again."

The failure cleanly partitions by which addresses each scenario POSTs:
the contractor canary's 500 3rd Street / 94107 location and the
payroll canaries' no-location decoration both pass; every other domain's
scenarios fail at provisioning before any test body runs. No code in
this PR touches the address payloads, and the same SHA was green five
times in the preceding 24h, so this is a demo-side geocoder change, not
a regression we introduced.

Swap every location.street_1 + zip pair across the scenario JSONs to
the known-passing 500 3rd Street, San Francisco, CA 94107 address.
Swap every home_address pair to 425 California St, San Francisco, CA
94104 (a real, well-known SF address) so a future change to the
location address doesn't simultaneously break home addresses.

Includes scenarios that weren't yet exercised by the current red shards
because their domains failed earlier — the failing shards stopped at
canary 01 and never reached the rest of the suite. Fixing all of them
in one pass avoids whack-a-mole on the next iteration.

Track B (a setup-time geocoder probe that fails CI loudly with the
specific rejected addresses, parallel to how the rest of this PR makes
provisioning failures loud) is captured as a follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread e2e/tests/payroll/weekly-cadence.spec.ts Outdated
jeffredodd and others added 2 commits May 22, 2026 09:35
… visibility explicitly

The compensation form's employee-type and wage-frequency dropdowns are
required controls — the form's Continue button won't validate without
them. The previous probe shape:

  if (await control.isVisible({ timeout: 5_000 }).catch(() => false)) {
    // open dropdown, pick value
  }

collapsed two cases into "silently skip":
  1. control truly missing / renamed (real bug we want loud)
  2. control found but isVisible threw on a context error

If case (1) happens, Continue clicks through unvalidated and the test
fails 5 minutes downstream on a misleading "next heading not found"
error, with no clear pointer back to the dropdown that wasn't filled.

Replace the conditional gate with `expect(control).toBeVisible({
timeout: 5_000 })`. A missing or renamed required control now fails
loudly at the dropdown, not 5 minutes later at the next landmark.

Scope: only the two required controls in fillCompensation. The other
isVisible().catch(() => false) usages further down the file (option
items inside open dropdowns, optional federal/state tax fields, I-9
pre-fill probes) are legitimately conditional and keep their current
shape.

Co-authored-by: Cursor <cursoragent@cursor.com>
…med timeouts

Address review feedback on the comment quality and constant usage in
this PR:

- e2e/tests/company/company-deep-onboarding.spec.ts (Marie #1): replace
  the .or().first() heading dance with a getByRole textbox check on
  the routing-number input. Both possible bank-account headings render
  as <h2> so heading level can't disambiguate them, and the
  routing-number input is a stronger landmark for "we've reached the
  bank-account step" than any heading match would be.

- e2e/utils/helpers.ts (Marie #4): move the waitForLoadingComplete
  JSDoc from above US_FEDERAL_HOLIDAYS_ISO (where it was orphaned by
  an earlier edit) to immediately above the function declaration.

- e2e/utils/timeouts.ts, e2e/tests/payroll/weekly-cadence.spec.ts,
  e2e/scenario/runner.ts (Marie #2 + #5): rewrite inline comments that
  referenced internal commits of this PR ("commit 1", "commit 2",
  "day-one CI") to explain the current behavior on its own terms.
  Comments that talk about an unmergeable internal ordering are
  confusing once the PR lands.

- e2e/tests/payroll/weekly-cadence.spec.ts (Xiao): replace the inline
  60000 / 30000 timeout literals with PAYROLL_CALCULATION_DEADLINE and
  SDK_NAVIGATION_DEADLINE imported from utils/timeouts. Matches the
  rest of the payroll suite's convention of using named timeout
  constants instead of bare numbers.

Pure documentation + constants cleanup; no behavior change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@mariechatfield mariechatfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the feedback!

jeffredodd and others added 3 commits May 22, 2026 09:58
…llback paths

The previous commit fixed scenario JSON addresses but missed three
hardcoded address fill-in spots that submit the same now-rejected
94105 / 94110 SF zip codes:

- e2e/utils/employeeFlowDrivers.ts (fillBasicsAndHomeAddress) — admin
  employee onboarding fills "100 Canary Lane / 94105" into the home
  address form. The demo geocoder rejects this on submit, the SDK
  surfaces a field-level error and stays on Basics, and the next
  driver step waits 60s for the Compensation heading that never
  arrives. Visible in iter-1 CI as canary 01 timing out at
  fillCompensation's heading assertion.

- e2e/utils/employeeFlowDrivers.ts (runEmployeeSelfOnboarding) — same
  pattern in the self-onboarding flow's Basics step ("200 Canary
  Place / 94105"). Not currently exercised by the iter-1 failures
  but identical risk surface, fix it now.

- e2e/utils/payrollFlowDrivers.ts (ensureSignatoryExists) — signatory
  creation home_address used 94105; fixed alongside the others.

- e2e/globalSetup.ts (createLocation fallback) — first-time demo
  provisioning falls back to "100 Test Street / 94105" if no location
  exists. The cached state hides this in current CI runs, but a cache
  invalidation would re-introduce the failure.

All three now use the same canonical safe pair:
- locations: 500 3rd Street, SF, CA 94107
- home addresses: 425 California St, SF, CA 94104

Matches the addresses chosen for the scenario JSONs in the previous
commit so there's one place to update next time the geocoder
selectively rejects more zips.

Co-authored-by: Cursor <cursoragent@cursor.com>
…raded factory output

The react_sdk_demo_company_onboarded factory on flows.gusto-demo.com is
non-deterministic. Most invocations return a fully-onboarded company
(onboarding_completed=true, ~14 seeded onboarded employees, open pay
period), but a meaningful minority return an effectively-fresh company
with 8-11 payroll blockers and no employees — verified by a live API
probe of the demo factory plus the Playwright trace artifacts from the
last red CI run.

Tests that depend on the seeded onboarded state (4 canaries that don't
provision their own employees) silently inherit this degraded state and
fail downstream with misleading timeouts:
- company/canary/05-onboarded-completion waits 60s for "Nice! We'll
  take it from here." against a not-actually-onboarded company.
- payroll/canary/01-regular-payroll, 04-transition, 05-dismissal hit
  disabled Calculate-and-review buttons or "No onboarded employee
  found to terminate" because the demo has 11 blockers and zero
  seeded employees.

The remaining domains (employee, contractor, time-off) decorate their
own employees on top of whatever base demo they get and aren't exposed.

Add post-provisioning validation to the scenario runner. Two opt-in
flags on the scenario schema:
- requireOnboardedCompany: validates onboarding_completed=true after
  demo creation. Set on company/onboarded-completion-canary.
- requireOnboardedEmployees: additionally validates ≥1 onboarded
  non-terminated employee. Implies requireOnboardedCompany. Set on
  payroll/full-flow-canary, payroll/biweekly-shared,
  payroll/weekly-schedule.

When set, the runner wraps createDemoAndProvision in a 3-attempt loop:
GET /companies/:id/onboarding_status (and /employees) post-create; if
the demo is degraded, log the actual state, discard the demo, and try
again. Exhausting 3 attempts throws an attributable error pointing at
the demo factory rather than at the SDK.

Trade-offs:
- 1 extra GET per scenario provision when the flag is set, ~50ms on the
  happy path. Two GETs for employees-required.
- Up to 3 demo creations on the unhappy path. With observed ~70% good
  factory output, P(3 failures) ≈ 3%, P(0 failures) ≈ 70%, expected
  attempts ≈ 1.4. Worst case adds ~30s to one shard, acceptable for
  the reliability win.
- Fresh-company scenarios (company canaries 01-04, employee/contractor
  onboarding) explicitly want a non-onboarded company and are
  unaffected by the new flag.

Co-authored-by: Cursor <cursoragent@cursor.com>
…urns degraded companies

Iteration 4 surfaced that the react_sdk_demo_company_onboarded factory's
bad output is independent per-demo (verified from CI logs: 3 of 14
consecutive demo creations in the same minute were healthy, the rest
degraded), not time-bounded as the previous "bad window" hypothesis
assumed. With the 3-attempt serial retry and an observed ~21% good-rate
during a degraded period, P(failure) = 0.79^3 ≈ 49% — matching the 4/8
failure rate we measured by re-running CI on the same SHA.

Also: bank verification cannot be completed via the public API. Probed
direct PUT/PATCH on verification_status, the Plaid processor_token
endpoint (needs OAuth scopes we don't have), various "instant_verify" /
"skip_verification" / "force_verify" / "demo_verify" / admin paths
(all 404), and 5 minutes of polling on a manually-created bank account
(stuck in awaiting_deposits indefinitely). The factory's healthy demos
get verified=true via an internal ZP service we can't reach. Recovery
via API isn't viable for the affected canaries — they need a
fully-onboarded base which transitively requires a verified bank.

The shape that fits the actual constraint:

Replace the 3-attempt serial loop with an escalating-batch strategy.
First attempt creates 1 demo (cheap when factory is healthy, the common
case). If that's degraded, the next 3 batches each create 4 demos in
parallel and pick the first one that validates. Wallclock per batch is
bounded by the slowest demo creation (~25s), not the sum.

At the observed 21% good-rate:
  batch 1 (size 1):   P(success) = 0.21
  batch 2 (size 4):   cumulative 0.69
  batch 3 (size 4):   cumulative 0.88
  batch 4 (size 4):   cumulative 0.96

Total worst-case demos created: 13. Wallclock worst case: ~100s
(4 batches × ~25s each). Best case: 25s, one demo, no validation
failures. Compared to the previous serial loop which paid 3 × 25s = 75s
in the worst case and only reached 51% success during bad windows.

Logs every discarded demo's company UUID and the validation reason so
triage knows which seeded demos were rejected and why.

Co-authored-by: Cursor <cursoragent@cursor.com>
@gusto-fresh-eyes
Copy link
Copy Markdown

gusto-fresh-eyes Bot commented May 22, 2026

Fresh Eyes Review

Found 4 issues in this PR.

PR Description Issues

  • Major | [fresh_eyes]: description-check: Commit 4 heading says 'tighten retries 2->1 and cap canary timeouts' but the actual change also DOUBLES multiple base timeouts (SDK_NAVIGATION_DEADLINE 30→60s, playwright.demo.config timeout 120→240s, expect.timeout 30→60s). The 'tighten' framing is misleading — canary ceilings come down slightly while all base-level budgets go up significantly.
  • Major | [fresh_eyes]: description-check: Description omits multiple significant changes: (1) Address standardization across ~10 scenario JSON files replacing fake addresses with real ones (500 3rd Street, 425 California St, 1 Frank H Ogawa Plaza); (2) Base-level timeout INCREASES — SDK_NAVIGATION_DEADLINE 30s→60s, PAYROLL_CALCULATION_DEADLINE 60s→90s, LONG_WAIT 60s→90s in 4 drivers, MEDIUM_WAIT 30s→60s, playwright timeout 120s→240s, expect.timeout 30s→60s; (3) Demo factory validation/retry mechanism with parallel batch creation (findAcceptableBaseDemo, checkBaseDemoState, requireOnboardedCompany/requireOnboardedEmployees schema additions); (4) Contractor payment flow reordering (edit before setPaymentDate) and retry-on-overwrite logic.
  • Blocker | [fresh_eyes]: description-check: Timeout values in commit 4 description don't match the code. Claims: CANARY_TEST_TIMEOUT_MS '8 min → 5 min' (actual: 8→6), CANARY_TEST_TIMEOUT_WITH_PRECURSOR_MS '12 min → 8 min' (actual: 12→10), '13 canaries → 5 min' (actual: →6 min), '3 canaries → 8 min' (actual: →10 min), '.github/workflows/ci.yaml timeout-minutes: 25' (actual: 30). All five claimed values are wrong.
Info (2)
  • description-check: PR is ~1400 lines across 42 files. While clearly organized by commit, this exceeds the 500-line threshold for a single review pass. The description acknowledges per-commit reviewability which helps.
  • 🔵 Info | [fresh_eyes]: simplicity | e2e/scenario/runner.ts:L656-L735
    findAcceptableBaseDemo is 80 lines with nested async operations (batch loop → allSettled → intermediate creations array → Promise.all validations → pick first acceptable). Each step is justified by the problem domain (non-deterministic factory, concurrent CI load), but the two-pass processing (creations → validations) adds indirection. Consider collapsing the settled→creations mapping into the validation step directly: Promise.all(batch.map(s => s.status === 'fulfilled' ? validate(s.value) : { candidate: null, reason: ... })).

Download findings.json — drag the file into Claude or use /add to propose fixes


🙏🏽 Please 👍🏽 👎🏽 if you found this useful. Generated by Fresh Eyes Reviewer. Get help in #ai-code-reviews
💡 Tip: Apply disable-automated-review label to any PR to disable Fresh Eyes reviews.

jeffredodd and others added 4 commits May 22, 2026 12:36
…etry budget can run

The previous commit added a 4-batch escalating retry to the scenario
runner that can take up to ~100s of wallclock in the worst case (4
parallel batches of demo creation at ~25s each). The 120_000ms per-test
timeout in playwright.demo.config.ts left no room for anything else,
including the contractor onboarding_status best-effort retry (~30s) and
the test body itself. CI runs were killing the fixture mid-batch with
"Test timeout of 120000ms exceeded while setting up scenario".

Raise the per-test ceiling to 240_000ms. Passing tests are unaffected
(they exit on success in 30-90s). Failing tests take longer to give up,
but the failure messages are correspondingly clearer ("validator
exhausted all 4 batches" instead of "fixture killed at 120s").

Canary tests still set their own longer ceilings via
test.setTimeout(CANARY_TEST_TIMEOUT_MS) so the global increase doesn't
shadow their explicit choices.

Co-authored-by: Cursor <cursoragent@cursor.com>
…landmark

Three changes after observing the iter-6 CI runs:

1. Revert company-deep-onboarding's bank-step assertion from a
   routing-number textbox back to the page heading. The scenario fixture
   caches a provisioned company across all tests sharing the same
   scenario ID, and an earlier test (canary 04 full-flow-through-docs)
   submits a bank account during its run. By the time
   company-deep-onboarding runs against the cached company, the SDK
   lands on the existing-account "view" path (no routing number input)
   rather than the empty "add" form. Both paths render the same
   "Company bank account" h2, so the heading is the durable landmark.
   Verified via the Playwright trace artifact from the failing run.

2. Bump patience throughout the suite. Demo backend latency is uneven
   and the previous timeouts were sized for healthy responses, leaving
   no margin when the demo is slow:

   - SDK_NAVIGATION_DEADLINE: 30s → 60s
   - PAYROLL_CALCULATION_DEADLINE: 60s → 90s
   - Driver-local LONG_WAIT: 60s → 90s
   - Driver-local MEDIUM_WAIT (companyFlowDrivers): 30s → 60s
   - playwright.demo.config expect.timeout: 30s → 60s
   - CANARY_TEST_TIMEOUT_MS: 5min → 6min (canary test.setTimeout literals
     bumped in lockstep, 5*60 → 6*60)
   - CANARY_TEST_TIMEOUT_WITH_PRECURSOR_MS: 8min → 10min (same)
   - ONBOARDING_RETRY_BUDGET_MS in runner.ts: 30s → 90s (6 → 18 retry
     attempts for the eventually-consistent contractor onboarding
     transition that was timing out before reaching success)
   - CI workflow job timeout-minutes: 25 → 30 (covers worst-case shard
     wallclock when canaries hit their longer per-test ceilings)

3. (Already in place — verified during this work.) Check vs direct
   deposit: the employee canaries already pick Check at the payment
   method step (selectCheckPayment in employeeFlowDrivers.ts) and the
   contractor canaries already do the same in fillPaymentMethodStep.
   The company-onboarding bank account form has no Check option (it's
   the company's deposit-account, not an employee payment method) so
   that gap is structural, not addressable in tests.

Trade-offs:
- Tests fail more slowly when something is genuinely broken, but the
  failure messages become more accurate (we exhausted the demo backend's
  natural latency budget rather than killing it prematurely).
- The per-test 240s ceiling already gives the bumped budgets room to
  run without overflowing.

Co-authored-by: Cursor <cursoragent@cursor.com>
…yment date after edit

Iter-7 CI failure traces showed contractor canary 03 timing out at 180s
on /review and submit/ heading. The actual page render at failure time
revealed two stacked issues, not one:

1. The SDK's CreatePayment component has a useEffect that recomputes
   the initial payment date once `paymentSpeed` resolves from the API
   and overwrites whatever the test typed before the effect fired:

     useEffect(() => {
       if (paymentSpeed && !hasInitializedPaymentDateRef.current) {
         setPaymentDate(calculateInitialPaymentDate(paymentSpeedDays))
         hasInitializedPaymentDateRef.current = true
       }
     }, [paymentSpeed, paymentSpeedDays])

   The test was filling the date FIRST (before the effect resolved),
   so the SDK's recompute clobbered the +14-business-day target with
   the SDK's own calculation, which sometimes landed on a date the
   backend rejects ("Direct deposits won't make it on time. Choose a
   check date on or after Wed, May 27th").

2. The test edited the first row regardless of payment method. The
   demo seeds Ella Fitzgerald as Direct Deposit at row 0, so paying
   her imposed the ACH cutoff constraint on the payment date that
   raced (1).

Two fixes:

- Prefer a row whose payment method is already Check. Direct-deposit
  rows impose a wider date constraint that's prone to race with the
  SDK's date-recompute effect; check-method rows tolerate a wider
  date window. Falls back to the first row + radio switch if no
  check-method row exists.

- Set the payment date AFTER editing the row (giving the useEffect
  time to fire), and re-fill with retry-on-overwrite: poll the input
  value after each fill, retry up to 5 times if the SDK overwrote it.
  Bumped the date target from +14 business days to +21 for extra
  buffer over any holiday windows the SDK and backend disagree on.

These together remove the iter-7 contractor canary 03 failure mode
without changing other contractor specs (canary 01, 02, and the other
lifecycle specs don't touch payment date or the Pay Contractors grid).

Co-authored-by: Cursor <cursoragent@cursor.com>
…demos

Iter-8 CI stress test (5 concurrent dispatch runs) revealed that the
4-way parallel retry batches were amplifying load on the demo backend.
With 5 simultaneous CI runs each issuing up to 4 parallel POST /demos
during retry escalation, the backend saw ~200 concurrent demo
creations and started timing out at the 180s polling deadline.

The fix is to keep the parallel-retry idea (catches the
non-deterministic factory's degraded-output mode) but be gentler on
the backend:

1. Reduce batch sizes from [1, 4, 4, 4] to [1, 2, 2, 2]. Cumulative
   success probability at the observed ~21% degraded-window good-rate
   drops from 0.96 to 0.81 across 4 attempts, but total demos created
   in the worst case drops from 13 to 7 — meaningfully less load.

2. Add a 5-second pause between batches. Gives the demo factory time
   to recover if it's transiently overloaded, and gives the
   degraded-window (which our probes show resolves in tens of minutes)
   one more chance to flip back to healthy.

3. Switch demo creation in batches from Promise.all to
   Promise.allSettled. Previously, one demo creation timing out at
   180s would reject the whole batch and force the caller to retry
   from scratch on the next batch (compounding load). With allSettled,
   any successfully-created demo in the batch can still be validated
   and used, even if a sibling timed out — the rejected promise is
   recorded as a "creation failed" validation reason rather than
   killing the batch.

The previous SHA (8d04bc4) got 4/5 = 80% pass on its first batch of
runs. The 0/5 on iter-8 was driven by self-inflicted backend
saturation from running 5 stress-test CI runs in parallel, not by
anything in the code. These tweaks reduce the self-inflicted-load
surface so future load-induced backend slowdowns hurt the suite less.

Co-authored-by: Cursor <cursoragent@cursor.com>
*/
expectedContext?: string[]
/**
* When true, the runner validates after demo creation that the base demo's company is fully onboarded (GET /companies/:id/onboarding_status returns onboarding_completed=true). If the demo factory returns a degraded company, the runner discards it and retries up to 3 times before failing. Set on scenarios that depend on the seeded onboarded state (e.g., react_sdk_demo_company_onboarded canaries that need seed employees + an open pay period). Defaults to false.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor | [fresh_eyes]: comment-accuracy

Same stale description as scenario.schema.json: says 'retries up to 3 times before failing' but the actual implementation creates up to 7 demos across 4 batches ([1, 2, 2, 2]). Keep the JSDoc in sync with the runtime behavior.


🙏🏽 Please 👍🏽 👎🏽 if you found this useful. Generated by Fresh Eyes Reviewer. Get help in #ai-code-reviews
💡 Tip: You can create custom checks for your project — see repository checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants