Add Buildkite pipeline for AI E2E tests (simulator-llm-pilot gem) #25444

Merged
crazytonyli merged 12 commits into trunk from iangmaia/ci-ai-e2e-tests-gem
May 10, 2026
Conversation

@iangmaia
Contributor

@iangmaia iangmaia commented Mar 24, 2026

Summary

  • Adds a Buildkite command script and pipeline step for running AI E2E tests using the simulator-llm-pilot gem
  • Checks for "Testing" label on PR (skips if missing to save CI resources)
  • Downloads build artifacts, installs app on simulator, installs the gem from GitHub, runs tests
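
The label gate in the second bullet can be sketched as follows. This is a minimal sketch, not the script from this PR: it assumes the labels arrive as a comma-separated string without spaces around the commas, for example from Buildkite's `BUILDKITE_PULL_REQUEST_LABELS` variable, and `has_label` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds when the comma-separated list in $1
# contains a label exactly equal to $2.
has_label() {
  local labels="$1"
  local wanted="$2"
  # Wrapping both sides in commas makes the match exact per list entry,
  # so "Needs Testing" does not count as "Testing".
  case ",${labels}," in
    *",${wanted},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumption: labels come from BUILDKITE_PULL_REQUEST_LABELS (empty when unset).
if has_label "${BUILDKITE_PULL_REQUEST_LABELS:-}" "Testing"; then
  echo "Testing label present, running AI E2E tests."
else
  echo "Testing label missing, skipping AI E2E tests."
fi
```

In a real step the `else` branch would typically `exit 0` so the job is reported as skipped rather than failed.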

The gem handles everything internally: simulator detection, WDA lifecycle, agent loop with sandboxed tools, context window compression, verification/cleanup enforcement, and structured results.

Alternative approach: see #25443 for a Claude Code + wrapper scripts version of the same pipeline.

Ref: AINFRA-2176

Test plan

  • Run .buildkite/commands/run-ai-e2e-tests.sh locally with a booted simulator and test site credentials
  • Run a simple test case (users-screen-loads.md) end-to-end
  • Verify results.md is written with correct pass/fail status

🤖 Generated with Claude Code

@dangermattic
Collaborator

1 Message
📖 This PR is still a Draft: some checks will be skipped.

Generated by 🚫 Danger

@wpmobilebot
Contributor

wpmobilebot commented Mar 24, 2026

📲 You can test the changes from this Pull Request in WordPress by scanning the QR code below to install the corresponding build.
App Name: WordPress
Configuration: Release-Alpha
Build Number: 32185
Version: PR #25444
Bundle ID: org.wordpress.alpha
Commit: 146daa1
Installation URL: 7l99o9dpqifu8
Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

@wpmobilebot
Contributor

wpmobilebot commented Mar 24, 2026

📲 You can test the changes from this Pull Request in Jetpack by scanning the QR code below to install the corresponding build.
App Name: Jetpack
Configuration: Release-Alpha
Build Number: 32185
Version: PR #25444
Bundle ID: com.jetpack.alpha
Commit: 146daa1
Installation URL: 04hfgpqarjii0
Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

@iangmaia iangmaia self-assigned this Mar 24, 2026
@iangmaia iangmaia added the Testing Unit and UI Tests and Tooling label Mar 25, 2026
@iangmaia iangmaia force-pushed the iangmaia/ci-ai-e2e-tests-gem branch 2 times, most recently from 1602fa9 to 8589139 Compare March 30, 2026 17:25
@sonarqubecloud

@crazytonyli
Contributor

Hi @iangmaia , shall we land this and start running nightly jobs?

iangmaia and others added 12 commits May 8, 2026 17:16
The gem provides a sandboxed agent that drives the simulator through a
fixed set of tools (tap, swipe, type, REST API) with no arbitrary code
execution. It handles WDA lifecycle, session management, context window
compression, and verification/cleanup enforcement internally.

The Buildkite step:
- Checks for "Testing" label (skips if missing)
- Downloads build artifacts and installs app on simulator
- Installs the simulator-llm-pilot gem from GitHub
- Runs all test cases in Tests/AgentTests/ui-tests/

Ref: AINFRA-2176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gem build resolves spec file paths relative to cwd, so
bin/simulator-llm-pilot wasn't found when building from the
wordpress-ios repo root.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
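
The fix described in the commit message above amounts to running `gem build` from the directory that holds the gemspec. A minimal sketch of that pattern follows; `run_in_dir` is a hypothetical helper, not a function from this PR, and the usage line is left as a comment because it needs the gem checkout to exist.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command with its working directory set to $1.
# The subshell keeps the caller's own working directory unchanged.
run_in_dir() {
  local dir="$1"
  shift
  ( cd "$dir" && "$@" )
}

# Usage sketch (not executed here): build the gem from inside its checkout so
# that paths listed in the gemspec resolve against the gem's own directory.
#   run_in_dir "$build_dir" gem build simulator-llm-pilot.gemspec
```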
Extract WDA build to a separate build-wda.sh script for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gem no longer hardcodes WordPress login flow in its system prompt.
Add app-instructions.md with the WordPress/Jetpack login flow and pass
it via --app-instructions-file. Also pass --app-name so the LLM knows
the app's display name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iangmaia iangmaia force-pushed the iangmaia/ci-ai-e2e-tests-gem branch from efadbc8 to 146daa1 Compare May 8, 2026 15:16
@iangmaia iangmaia marked this pull request as ready for review May 8, 2026 15:17
Copilot AI review requested due to automatic review settings May 8, 2026 15:18
@iangmaia
Contributor Author

iangmaia commented May 8, 2026

@crazytonyli Hey Tony! With all the recent changes and updates this got left behind 😓 sorry about that. There's not much work left to start running it and iterating on the tests IMO, so that's the good side.
There are a couple of open questions in paaHJt-9Te-p2 related to the tests themselves (one of them always failed iinm).

As mentioned in the P2, I think that this PR + simulator-llm-pilot is the way to go for E2E AI tests, but it would be nice to make it fully 🟢 to start with.

Contributor

Copilot AI left a comment

Pull request overview

Adds a Buildkite CI step to run AI-driven end-to-end UI tests on an iOS Simulator using the simulator-llm-pilot gem, including helper scripts for installing the gem, locating/booting a simulator, and building WebDriverAgent.

Changes:

  • Adds a new Buildkite pipeline step (PR-only) to run AI E2E tests and upload Tests/AgentTests/results/** artifacts.
  • Introduces CI scripts to install simulator-llm-pilot, find a booted simulator, build WebDriverAgent, install the app, and run the test suite.
  • Updates AI test/navigation skill docs and adds app login instructions used by the test runner.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File | Description
Tests/AgentTests/app-instructions.md | Adds login-flow instructions for the agent-runner to avoid unsafe/manual credential entry.
Scripts/ci/install-simulator-llm-pilot.sh | Installs simulator-llm-pilot by building from a local checkout or cloning from GitHub.
Scripts/ci/find-booted-simulator.rb | Helper to return a booted simulator UDID (optionally waiting/polling).
.claude/skills/ios-sim-navigation/SKILL.md | Aligns documentation placeholder naming (<APP_BUNDLE_ID>).
.claude/skills/ai-test-runner/SKILL.md | Aligns documentation placeholder naming (<APP_BUNDLE_ID>).
.buildkite/pipeline.yml | Adds a new “AI E2E Tests” Buildkite step gated to PR builds.
.buildkite/commands/run-ai-e2e-tests.sh | Orchestrates artifact download, simulator/app setup, WDA build, and simulator-llm-pilot run.
.buildkite/commands/build-wda.sh | Clones/builds WebDriverAgent and skips rebuild when artifacts already exist.

Comment on lines +8 to +32
SIMULATOR_LLM_PILOT_REPO_URL="${SIMULATOR_LLM_PILOT_REPO_URL:-https://github.com/Automattic/simulator-llm-pilot.git}"
SIMULATOR_LLM_PILOT_SOURCE_PATH="${SIMULATOR_LLM_PILOT_SOURCE_PATH:-}"

build_dir="$(mktemp -d)"
trap 'rm -rf "$build_dir"' EXIT

source_path="${SIMULATOR_LLM_PILOT_SOURCE_PATH}"
if [[ -z "$source_path" && -f "${DEFAULT_LOCAL_GEM_PATH}/simulator-llm-pilot.gemspec" ]]; then
  source_path="${DEFAULT_LOCAL_GEM_PATH}"
fi

if [[ -n "$source_path" ]]; then
  echo "Using local simulator-llm-pilot source at ${source_path}"
  if [[ -d "${source_path}/.git" ]]; then
    source_revision="$(git -C "${source_path}" rev-parse HEAD)"
    git -C "${source_path}" archive HEAD | tar -x -C "$build_dir"
  else
    source_revision="local-filesystem"
    tar -cf - -C "${source_path}" . | tar -xf - -C "$build_dir"
  fi
else
  echo "Cloning simulator-llm-pilot from ${SIMULATOR_LLM_PILOT_REPO_URL}"
  git clone --depth 1 "${SIMULATOR_LLM_PILOT_REPO_URL}" "$build_dir"
  source_revision="$(git -C "$build_dir" rev-parse HEAD)"
fi
Comment on lines +24 to +25
WEBDRIVERAGENT_REPO_URL="${WEBDRIVERAGENT_REPO_URL:-https://github.com/appium/WebDriverAgent.git}"
WEBDRIVERAGENT_REF="${WEBDRIVERAGENT_REF:-}"
Comment on lines +51 to +55
ensure_wda_checkout

if [[ -d "$WDA_PROJECT" ]] && has_built_artifacts; then
  echo "WebDriverAgent already built, skipping."
  exit 0
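
The excerpt above calls `has_built_artifacts` without showing its body. One way such a check could look is sketched below, under the assumption that a finished build leaves at least one `.xctestrun` file in its products directory; the function here is a hypothetical stand-in that takes the directory as an argument, and the real script may test for something else.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for has_built_artifacts: succeeds when at least one
# .xctestrun file exists anywhere under the directory given as $1.
has_built_artifacts() {
  local products_dir="$1"
  # A missing directory means nothing has been built yet.
  [ -d "$products_dir" ] || return 1
  # find prints matching paths; a non-empty result means artifacts exist.
  [ -n "$(find "$products_dir" -name '*.xctestrun' -print 2>/dev/null | head -n 1)" ]
}
```

Checking for a concrete output file rather than only the project directory avoids skipping the build after a clone that never finished compiling.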
Comment on lines +126 to +127
TIMESTAMP="$(date +%Y-%m-%d-%H%M)"
RESULTS_DIR="Tests/AgentTests/results/${TIMESTAMP}"
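
With a minute-resolution timestamp, two runs started within the same minute resolve to the same results directory. A minimal sketch of one way to make the name more specific follows; `results_dir_for` is a hypothetical helper, and using `BUILDKITE_BUILD_NUMBER` as the suffix is an assumption rather than something this PR does.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build a results directory name with second-level
# resolution and an optional suffix such as a CI build number.
results_dir_for() {
  local suffix="${1:-}"
  local timestamp
  timestamp="$(date +%Y-%m-%d-%H%M%S)"
  if [ -n "$suffix" ]; then
    echo "Tests/AgentTests/results/${timestamp}-${suffix}"
  else
    echo "Tests/AgentTests/results/${timestamp}"
  fi
}

# Assumption: BUILDKITE_BUILD_NUMBER is set on CI and empty on a local run.
RESULTS_DIR="$(results_dir_for "${BUILDKITE_BUILD_NUMBER:-}")"
echo "$RESULTS_DIR"
```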
Comment on lines +94 to +98
UDID="$(ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" 2>/dev/null || true)"
if [[ -z "$UDID" ]]; then
  echo "No booted simulator named '$SIMULATOR_NAME' found. Booting..."
  xcrun simctl boot "$SIMULATOR_NAME" 2>/dev/null || true
  UDID="$(ruby Scripts/ci/find-booted-simulator.rb "$SIMULATOR_NAME" 30 1 2>/dev/null || true)"
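
The second lookup above passes two extra arguments, which the file summary describes as optional waiting/polling. The general pattern can be sketched in shell as below. This is an illustration only: `poll_until` is a hypothetical helper, and reading the two arguments as an attempt count and an interval in seconds is an assumption about the Ruby script.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and succeed as soon as the command succeeds.
poll_until() {
  local attempts="$1"
  local interval="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final attempt so failure is reported promptly.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage sketch (not executed here), with a hypothetical check function:
#   poll_until 30 1 simulator_is_booted "$SIMULATOR_NAME"
```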
Comment on lines +14 to +15
output, status = Open3.capture2('xcrun', 'simctl', 'list', 'devices', 'booted', '-j')
exit 1 unless status.success?
@crazytonyli crazytonyli added this pull request to the merge queue May 10, 2026
@crazytonyli
Contributor

@iangmaia I have removed the media block test in #25550. I think it'd be good to get the nightly job running, so that we have a sense of how this thing works in the real world. If it works well, we can invest more in it, adding more test cases, reducing test execution time, etc.

Merged via the queue into trunk with commit 92fd2b4 May 10, 2026
30 checks passed
@crazytonyli crazytonyli deleted the iangmaia/ci-ai-e2e-tests-gem branch May 10, 2026 21:29
Labels

[Status] DO NOT MERGE
Testing (Unit and UI Tests and Tooling)

Projects

None yet

5 participants