ci: experiment with parallel Windows shard setup + build by midleman · Pull Request #13828 · posit-dev/positron

midleman · 2026-05-27T21:17:46Z

Summary

Experimental fan-out: drop needs: [setup-windows] on e2e-windows shards so their ~15 min of artifact-independent setup (Python, R, Quarto, Graphviz, secrets) runs in parallel with the build job. Shards poll for the build artifact internally before downloading + extracting.

Build runs as before (~45 min)
Shards start at workflow trigger and do their setup in parallel
Shards wait for build, then download + extract, then run tests

QA Notes

Goal: ~15 min off Windows critical path on test-merge
Built on top of mi/cache-e2e-test-deps (ci: Windows build-once + Linux PR worker isolation #13615); validate that branch first
The test-merge.yml temporary pull_request trigger from the parent branch is still active, so this PR will fire test-merge.yml on push for validation

…EMFILE errors Retries EMFILE (too many open files) errors during gulp vscode-win32-x64 packaging. Same fix VS Code uses in their build tooling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… fix

fromLocalWebpack was constructing File objects for all extension files at once and passing them via es.readArray(), which opened thousands of ReadStreams simultaneously. On Windows, this hit the OS EMFILE (too many open files) limit when packaging positron-python's 11,784 python_files. Switch to createSequentialFileStream (already used by fromLocalNormal) which opens files one at a time via a pump queue, matching the pattern that fromLocalNormal already relied on. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The EMFILE fix in build/lib/extensions.ts (sequential file opening via createSequentialFileStream) validated successfully on Windows CI. The spike workflow and the original graceful-fs patch are no longer needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add test-e2e-windows-build.yml that compiles Positron once and uploads the full workspace (node_modules + compiled output) as a tar artifact. Strip the build steps from test-e2e-windows-run.yml so each shard just downloads and extracts that artifact instead of compiling independently. Update test-merge.yml (3 shards), test-full-suite.yml (4 shards), and test-pull-request.yml to call the build workflow first, then pass the artifact to the run shards. Also fix a pre-existing bug: test-merge.yml's slack-notify-win-extensions was checking needs.e2e-windows-electron.outputs.extensions_failed but the job is named e2e-windows. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Each Playwright worker now gets its own Xvfb display (:10 + parallelIndex) so workers don't race on a shared X server. The setup-xvfb action accepts a workers input and starts that many displays at :10..:(10+N-1). Linux PR e2e-electron workers bumped from 2 to 3, expected to drop the job from ~31m to ~25m on cold cache / ~19m on warm cache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 workers saved ~2m off Linux PR e2e. Trying 4 to see if there's another 1-2m available before scaling hits a ceiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The spike confirmed zstd is available on windows-latest-8x runners (191ms compress vs gzip's 1228ms, 320ms decompress vs 416ms). The production change landed in test-e2e-windows-build.yml; this validator is no longer needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fail fast if WORKERS is not a positive integer so an invalid value can't silently start zero Xvfb displays. Also verify zstd is on PATH before extracting the .tar.zst Windows build artifact so a missing tool fails loudly instead of with a cryptic decompression error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 and 4 workers both surface monaco-workbench readiness timeouts on ubuntu-latest-8x (4 Electrons cold-starting at once exceeds the 30s checkPositronReady window). The savings (~2m vs 2 workers) is too small to justify the added flake risk. Keeping the per-worker Xvfb infrastructure so future runners with more cores can dial workers up without code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… to 4) Linux electron merge now runs 3 shards (was 2). Windows merge now runs 4 shards (was 3), benefiting from the build-once pattern shipped in this branch where setup-windows produces one artifact that all shards consume. Adjust max_failures to keep the per-shard ratio comparable (20 total / N shards rounded up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

REVERT THIS COMMIT BEFORE MERGING. - Add pull_request trigger to test-merge.yml, gated on changes to that file - Comment out slack-notify and slack-notify-win-extensions so the PR run doesn't spam #positron-rstats-test-results Goal: prove the new shard counts (Linux 3, Windows 4) actually run as expected before main starts using them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…setup Two Windows shard wins, both targeting the 8-15min of setup overhead before tests even start: 1. Add Defender exclusions for the workspace and temp paths before extracting the build artifact. Defender's real-time scanning of the ~50k small files in node_modules trees is the dominant cost of the extract step (currently ~8 min). Excluding these paths typically cuts extract time 3-5x on Windows GitHub-hosted runners. 2. Drop the 'Set up Quarto' action. Positron bundles its own Quarto 1.9.38 into the build artifact (build/lib/quarto.ts) and uses it at runtime. The CI-installed Quarto was never used by Positron or exercised by e2e tests, which interact through the UI rather than shelling out to the quarto CLI. Saves ~2 min per shard. Both changes are reversible — Defender exclusions only apply to the ephemeral runner, and Quarto can be re-added if tests turn out to need a system Quarto we haven't found yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hromium Push each suite to 6 parallel shards on test-merge to see if we hit new bottlenecks (build-once artifact, predictive sharding distribution, or runner availability) before settling on a permanent ceiling. max_failures: 4 across all suites (20 / 6 = 3.33, rounded up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply the same Defender path exclusion to setup-windows/build-windows as we added to the e2e shards. This covers npm install (100k+ files, currently ~22 min), cache restore, compile (writes thousands of .js), and Playwright browser install (~600 MB). Realistic upside is 5-10 minutes off the 43 min build job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

6 chromium shards was a copy of the Linux/Windows bump but chromium has a smaller test footprint that doesn't justify the extra shard. 5 keeps it under the artifact-contention concern raised by the multi-shard download throttling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The earlier Defender exclusion only covered github.workspace (D:\a\positron\positron) but the extract step writes to dirname() of that — D:\a\positron — which Defender was still scanning. That's why extract only dropped from 8m to 6m instead of the projected 2-3m. Now also exclude: - Split-Path -Parent of github.workspace (actual extract destination) - D:\a (GitHub Actions Windows runner workspace root) - Keep C:\a for safety Also log Get-MpPreference.ExclusionPath after applying to verify each path actually got registered. Applies to both build-windows and the e2e shard run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous commit (c02396e) removed this on the assumption that Positron's bundled Quarto 1.9.38 was sufficient for e2e tests. That was wrong: the e2e tests under test/e2e/tests/quarto/ failed without a system Quarto installed. Restoring the action as-is. The cost is real (~2m/shard × 6 = ~12 runner-min/merge) but tests pass is non-negotiable. Worth revisiting later as a separate effort, possibly via cache or runner image, after we have more data on what specifically needs system Quarto vs the bundled binary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

EXPERIMENT branched from mi/cache-e2e-test-deps. Today the e2e-windows shards wait for setup-windows to complete before doing anything, then spend ~15 min on Python/R/Quarto/Graphviz setup that doesn't actually need the build artifact. That 15 min sits idle on the critical path. This change overlaps that 15 min of setup with the ~45 min build: 1. Drop `needs: [setup-windows]` on e2e-windows in test-merge.yml so shards start at workflow trigger time, in parallel with build. 2. Move the artifact-dependent steps (download, Defender exclusion, extract) to AFTER all the artifact-independent setup. Insert a new polling step that waits for the build job to complete before downloading. Expected wallclock saving: ~15 min off the Windows critical path on test-merge runs, since shard setup now happens during the build window instead of after it. Caveats: - Build job failure: the polling step detects this and fails early - More runner-min consumed (shard waits idle if its setup finishes before the build does) vs the current model where the shard hadn't even started yet - Brittle to upstream job name changes (currently matches "build-windows") Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

midleman and others added 27 commits May 27, 2026 08:45

ci: cache test/e2e node_modules on Windows e2e shards

63db958

fix(build): patch fs with graceful-fs in Win32 gulp tasks to prevent …

4ad0011

…EMFILE errors Retries EMFILE (too many open files) errors during gulp vscode-win32-x64 packaging. Same fix VS Code uses in their build tooling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

spike: add Windows packaged-build workflow to test graceful-fs EMFILE…

cfa1b0d

… fix

spike: trigger packaged-build workflow on PR to test graceful-fs fix

b50ee30

ci: add build/lib/extensions.ts to spike workflow paths filter

b17ba8d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

spike: test zstd availability and performance on Windows runner

5f5425d

spike: trigger zstd test on pull_request

bb511a8

spike: fix missing mkdir for gzip dest in comparison step

4c7ba52

ci: use zstd compression for Windows build artifact

8c1c031

ci: bump Linux PR workers from 3 to 4

e7164c6

3 workers saved ~2m off Linux PR e2e. Trying 4 to see if there's another 1-2m available before scaling hits a ceiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ci: trigger workflows on this experimental PR

2df656f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: experiment with parallel Windows shard setup + build#13828

ci: experiment with parallel Windows shard setup + build#13828
midleman wants to merge 27 commits into
mainfrom
mi/windows-parallel-setup

midleman commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

midleman commented May 27, 2026

Summary

QA Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant