Two commands for autonomous testing: same TDD engine, different mindsets.

| Command | Mindset | Team |
|---|---|---|
| `npx agentic-loop uat` | "Does this work correctly?" | recon, happy-path, edge-cases |
| `npx agentic-loop chaos-agent` | "Can we break it?" | recon, chaos, security |

Both use Agent Teams for coordinated browser exploration, then strict TDD per test case.
┌──────────────────────────────────────────────────────────────────┐
│  Phase 1: DISCOVER + PLAN                                        │
│                                                                  │
│  Agent team explores the live app with Playwright:               │
│    - Recon agent maps routes, forms, endpoints                   │
│    - Specialist agents test their areas                          │
│    - Agents share intel via SendMessage                          │
│    - Team lead merges findings → plan.json                       │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│  Plan Review                                                     │
│                                                                  │
│  You review the plan before anything executes:                   │
│  [Y] Execute | [n] Cancel | [e] Edit in $EDITOR                  │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│  Phase 2: TDD LOOP (per test case)                               │
│                                                                  │
│  ┌─ RED ──────────────────────────────┐                          │
│  │ Claude writes the test ONLY        │                          │
│  │ No app changes allowed             │                          │
│  │ Test must have content assertions  │                          │
│  └────────────────────────────────────┘                          │
│                    │                                             │
│                    ▼                                             │
│  Test passes? → App already correct → commit, next case          │
│  Test fails?  → Classify: test bug or app bug?                   │
│                    │                                             │
│                    ▼ (app bug)                                   │
│  ┌─ GREEN ────────────────────────────┐                          │
│  │ Claude fixes the app ONLY          │                          │
│  │ No test changes allowed            │                          │
│  │ Regression check before commit     │                          │
│  └────────────────────────────────────┘                          │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│  Phase 3: REPORT                                                 │
│                                                                  │
│  Summary: cases passed/failed, bugs found/fixed, files changed   │
└──────────────────────────────────────────────────────────────────┘
Verifies that features work correctly for real users.
npx agentic-loop uat

| Agent | Role |
|---|---|
| recon | Maps all routes/endpoints, catalogs forms with selectors, identifies tech stack and auth |
| happy-path-{area} | One per feature area. Completes primary user journeys, records correct behavior as ground truth |
| edge-cases | Tests boundary conditions: empty fields, long input, validation, back button, refresh mid-flow |

Only spawns agents for areas that exist. No forms? No forms specialist.
  _   _   _  _____   _
 | | | | / \|_   _| | |    ___   ___  _ __
 | | | |/ _ \ | |   | |   / _ \ / _ \| '_ \
 | |_| / ___ \| |   | |__| (_) | (_) | |_) |
  \___/_/   \_\_|   |_____\___/ \___/| .__/
                                     |_|
Phase 1: Discovery + Plan Generation
⟳ Browser navigate to http://localhost:3000
⟳ Reading .ralph/config.json
⟳ Browser take_screenshot
...
✓ Plan generated: 8 test cases
Execute this plan? [Y/n/e(dit)]
Tries to break the app with XSS, injection, chaos inputs, auth bypass.
npx agentic-loop chaos-agent

By default, the chaos-agent spins up an isolated Docker copy of your app so your live dev server is never touched. It parses your docker-compose.yml, offsets all host ports by 10000 (e.g., :5173 becomes :15173), and runs the red team against the isolated copy.
$ npx agentic-loop chaos-agent
Starting isolated Docker environment...
→ Parsing docker-compose.yml for port mappings
→ Generated override: api 8001→18001, web 5173→15173
→ Isolated environment ready (frontend: :15173, api: :18001)
Phase 1: Red team exploring your app for vulnerabilities
→ Agents attack http://localhost:15173 (your :5173 is untouched)
Tearing down isolated environment...
→ Done. Your dev server was never touched.
Requirements: Docker and a compose file (docker-compose.yml, compose.yml, etc.)
Fallback: If Docker isn't available or no compose file exists, the chaos-agent falls back to testing against your live app with non-destructive guardrails (no DELETE endpoints, no mass mutations, etc.).
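
The port offsetting described above can be sketched roughly as follows. This is an illustration only: `offsetPortMapping` is a hypothetical helper, not the tool's actual code, and it assumes plain `host:container` mappings like the ones in the sample output.

```typescript
// Hypothetical helper (not part of agentic-loop): shifts the host side of a
// "host:container" port mapping by a fixed offset and leaves the container port alone.
function offsetPortMapping(mapping: string, offset: number = 10000): string {
  const match = /^(\d+):(\d+)$/.exec(mapping);
  if (match === null) {
    throw new Error(`Unsupported port mapping: ${mapping}`);
  }
  const hostPort = Number(match[1]) + offset;
  if (hostPort > 65535) {
    // TCP ports stop at 65535, so a large base port cannot be offset safely.
    throw new Error(`Offset host port ${hostPort} is out of range`);
  }
  return `${hostPort}:${match[2]}`;
}

console.log(offsetPortMapping("5173:5173")); // 15173:5173
console.log(offsetPortMapping("8001:8001")); // 18001:8001
```

Only the host port moves, so services inside the isolated copy can keep talking to each other on their usual container ports.
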
| Agent | Role |
|---|---|
| recon | Attack surface mapping. Catalogs every input, auth mechanism, API endpoint. Shares intel: "login uses JWT in localStorage" |
| chaos | Chaos testing. For every input: empty strings, 10000-char payloads, special characters, unicode/emoji, null bytes. Double-submit, missing fields, rapid-fire interactions |
| security | XSS in every input, SQL injection, auth bypass via direct URL, IDOR via ID manipulation, sensitive data in localStorage/console/page source, missing CSRF tokens |

Agents coordinate: recon shares discoveries, security acts on them.
# Acceptance testing
npx agentic-loop uat # Full: team → plan → TDD loop
npx agentic-loop uat --plan-only # Generate plan only
npx agentic-loop uat --focus auth # Focus on auth tests only
npx agentic-loop uat --focus UAT-003 # Focus on specific test case
npx agentic-loop uat --no-fix # Write tests but don't fix app bugs
npx agentic-loop uat --max 10 # Limit to 10 iterations
npx agentic-loop uat --quiet # Suppress activity feed
npx agentic-loop uat --review # Re-review existing plan
# Adversarial testing
npx agentic-loop chaos-agent # Full: red team → plan → TDD loop
npx agentic-loop chaos-agent --plan-only # Generate chaos plan only
npx agentic-loop chaos-agent --no-fix # Find vulnerabilities without fixing
npx agentic-loop chaos-agent --focus security # Focus on security tests
npx agentic-loop chaos-agent --quiet # Suppress activity feed

| Flag | Description |
|---|---|
| `--plan-only` | Generate plan without executing the TDD loop |
| `--focus <id\|category>` | Run a specific test case (UAT-003) or category (auth, security, forms) |
| `--no-fix` | Write failing tests as documented bugs; skip the GREEN phase |
| `--max N` | Limit total iterations (default: 20) |
| `--quiet` | Suppress the live activity feed |
| `--review` | Re-review an existing plan before executing |

Both commands share the same strict TDD loop.
Claude writes the test file. Constraints enforced:
- No app changes — if Claude modifies app code, changes are rolled back and the phase retries
- Content assertions required — tests that only check "page loads" are rejected as shallow
- Minimum 2 assertions — at least one content assertion (toContain, toHaveText, toBe, etc.)
- Input-output pattern — e2e tests must fill/click AND verify the result
If the test passes in RED, the app is already correct — commit and move on. If the test fails with an assertion error, it's an app bug — commit the RED test and transition to GREEN. If the test fails with a syntax/import error, it's a test bug — retry RED with feedback.
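
The three RED outcomes above amount to a small classification step. The sketch below is one way to express it. The outcome labels and the error-string patterns are assumptions made for illustration, and the tool's real classifier may look at different signals.

```typescript
// Outcome labels are illustrative names, not identifiers taken from the tool.
type RedOutcome = "already-correct" | "app-bug" | "test-bug";

interface TestRun {
  passed: boolean;
  output: string; // combined stdout/stderr from the test runner
}

// Assumption: syntax and import problems show up as these strings in the output.
const TEST_BUG_PATTERNS: RegExp[] = [/SyntaxError/, /Cannot find module/];

function classifyRedRun(run: TestRun): RedOutcome {
  if (run.passed) {
    return "already-correct"; // commit the test and move to the next case
  }
  const isTestBug = TEST_BUG_PATTERNS.some((pattern) => pattern.test(run.output));
  // A broken test retries RED with feedback; any other failure is treated as an app bug.
  return isTestBug ? "test-bug" : "app-bug";
}
```

In this sketch every failure that is not recognisably a broken test is handed to GREEN, which is a simplification of "fails with an assertion error".
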
Claude fixes the application code. Constraints enforced:
- No test changes — if Claude modifies the test file, it's restored from the last commit
- Regression check — existing tests must still pass after the fix
- Rollback on regression — if the fix breaks other tests, all changes are rolled back
If a test case exceeds the retry limit (default: 5 combined RED + GREEN retries), it's skipped and flagged for human attention.
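
A rough sketch of how the retry limit and resume behaviour could fit together, using the `passes`, `phase`, `redRetries`, and `greenRetries` fields that plan.json documents. The function `nextStep` is hypothetical, and treating "limit reached" as `>=` is an assumption.

```typescript
// Field names follow the plan.json field table; the function itself is hypothetical.
interface PlanCase {
  id: string;
  passes: boolean;
  phase?: null | "red";
  redRetries?: number;
  greenRetries?: number;
}

interface NextStep {
  testCase: PlanCase;
  startPhase: "red" | "green";
}

function nextStep(cases: PlanCase[], maxCaseRetries: number = 5): NextStep | null {
  for (const testCase of cases) {
    if (testCase.passes) continue; // already completed
    const retries = (testCase.redRetries ?? 0) + (testCase.greenRetries ?? 0);
    // Assumption: a case at or over the combined limit is skipped for human attention.
    if (retries >= maxCaseRetries) continue;
    // phase "red" means RED is already done, so the loop resumes at GREEN.
    return { testCase, startPhase: testCase.phase === "red" ? "green" : "red" };
  }
  return null; // nothing left to run
}
```
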
Generated during Phase 1 discovery. Stored at .ralph/uat/plan.json or .ralph/chaos/plan.json.
{
"testSuite": {
"name": "UAT Loop",
"generatedAt": "2026-02-09T10:30:00Z",
"status": "pending",
"discoveryMethod": "uat-team"
},
"testCases": [
{
"id": "UAT-001",
"title": "Login — valid credentials redirect to dashboard",
"category": "auth",
"type": "e2e",
"userStory": "As a user, I can log in and see my dashboard",
"testApproach": "Fill login form, verify redirect and welcome message",
"testFile": "tests/e2e/auth/login.spec.ts",
"targetFiles": ["src/pages/login.tsx", "src/api/auth.ts"],
"edgeCases": ["Empty password", "Wrong credentials", "SQL injection in email"],
"assertions": [
{
"input": "Fill email='user@test.com', password='pass123', submit",
"expected": "Redirects to /dashboard, shows 'Welcome, User'",
"strategy": "keyword"
},
{
"input": "Submit with empty password",
"expected": "Shows 'Password is required'",
"strategy": "keyword"
},
{
"input": "Fill email with XSS payload, submit",
"expected": "Payload displayed as text, no script execution",
"strategy": "security"
}
],
"passes": false,
"retryCount": 0,
"source": "uat-team:happy-path-auth"
}
]
}

| Field | Description |
|---|---|
| `discoveryMethod` | "uat-team" or "chaos-agent" |
| `type` | "e2e" (browser) or "integration" (API-only) |
| `assertions` | Input/expected pairs that become expect() calls; at least 3 per case |
| `passes` | Tracks completion state through the TDD loop |
| `phase` | null (start at RED) or "red" (RED done, resume at GREEN) |
| `redRetries` / `greenRetries` | Per-phase retry counts for the circuit breaker |
| `source` | Which agent discovered this test case |

In .ralph/config.json:
{
"uat": {
"sessionSeconds": 1800,
"maxIterations": 20,
"maxCaseRetries": 5,
"maxSessionSeconds": 600
},
"chaos": {
"sessionSeconds": 1800,
"maxIterations": 20,
"maxCaseRetries": 5,
"maxSessionSeconds": 600,
"isolate": true,
"docker": {
"portOffset": 10000,
"healthTimeout": 120,
"build": true
}
},
"docker": {
"composeFile": "docker-compose.yml"
}
}

| Field | Default | Description |
|---|---|---|
| `*.sessionSeconds` | 1800 | Timeout for the Phase 1 discovery session (Agent Teams) |
| `*.maxIterations` | 20 | Maximum TDD loop iterations |
| `*.maxCaseRetries` | 5 | Circuit breaker: max combined RED + GREEN retries per case |
| `*.maxSessionSeconds` | 600 | Timeout per individual Claude session (RED or GREEN) |

| Field | Default | Description |
|---|---|---|
| `chaos.isolate` | true | Enable Docker isolation for chaos-agent runs |
| `chaos.docker.portOffset` | 10000 | Offset added to host ports (e.g., 5173 → 15173) |
| `chaos.docker.healthTimeout` | 120 | Seconds to wait for containers to be healthy |
| `chaos.docker.build` | true | Pass --build to rebuild images each run |
| `docker.composeFile` | auto-detect | Path to compose file (checks docker-compose.yml, compose.yml, etc.) |

UAT and Chaos Agent are configured independently, so you can give Chaos Agent more retries or a longer discovery timeout.
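
To illustrate what independent configuration means in practice, here is a minimal sketch of resolving one section's settings against the documented defaults. `resolveSettings` and the type names are hypothetical; only the field names and default values come from the tables above.

```typescript
interface LoopSettings {
  sessionSeconds: number;
  maxIterations: number;
  maxCaseRetries: number;
  maxSessionSeconds: number;
}

// Defaults as listed in the configuration table.
const DEFAULTS: LoopSettings = {
  sessionSeconds: 1800,
  maxIterations: 20,
  maxCaseRetries: 5,
  maxSessionSeconds: 600,
};

// Hypothetical shape: only the shared fields of each section are modelled here.
type RalphConfig = {
  uat?: Partial<LoopSettings>;
  chaos?: Partial<LoopSettings>;
};

// Hypothetical helper: values set in one section never leak into the other.
function resolveSettings(config: RalphConfig, section: "uat" | "chaos"): LoopSettings {
  return { ...DEFAULTS, ...config[section] };
}

const config: RalphConfig = { chaos: { maxCaseRetries: 8, sessionSeconds: 3600 } };
console.log(resolveSettings(config, "chaos").maxCaseRetries); // 8
console.log(resolveSettings(config, "uat").maxCaseRetries); // 5
```
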
.ralph/
  uat/                    # UAT acceptance testing
    plan.json             # Generated test plan
    progress.txt          # Activity log
    last_failure.txt      # Failure context for retries
    UAT-PROMPT.md         # Project-specific testing guide (generated)
    screenshots/          # Screenshots from discovery
    last_test_output.log  # Most recent test output
  chaos/                  # Chaos Agent adversarial testing
    plan.json             # Generated attack plan
    progress.txt          # Activity log
    last_failure.txt      # Failure context for retries
    UAT-PROMPT.md         # Project-specific red team guide (generated)
    screenshots/          # Screenshots from discovery
    last_test_output.log  # Most recent test output
UAT and Chaos Agent use separate directories — they don't clobber each other. Both share the same lockfile (.ralph/.lock) so they can't run simultaneously, since both modify app code via git.
If you stop mid-run (Ctrl+C or npx agentic-loop stop), re-running the same command resumes where you left off:
npx agentic-loop uat # Resumes existing plan, picks up next incomplete case
npx agentic-loop chaos-agent # Same: resumes from last position

To start fresh, delete the plan and re-run:
rm .ralph/uat/plan.json && npx agentic-loop uat
rm .ralph/chaos/plan.json && npx agentic-loop chaos-agent

Ralph rejects shallow tests that only check structure. A test must verify content: the right data, not just that the page loads.

| Rejected (shallow) | Accepted (content) |
|---|---|
| `expect(page).toHaveURL('/dashboard')` | `expect(page.getByText('Welcome, John')).toBeVisible()` |
| `expect(form).toBeVisible()` | `expect(page.getByText('Email is required')).toBeVisible()` |
| `expect(response.status).toBe(200)` | `expect((await response.json()).user.name).toBe('John')` |

When a test is rejected as shallow, Ralph saves specific feedback about what's wrong and retries with guidance.
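
As a rough illustration of the two countable RED-phase rules (at least two assertions, at least one content assertion), a text-based check might look like the sketch below. `shallowReason` and its regular expressions are assumptions. The real check is evidently stricter, since it also rejects status-only assertions such as `toBe(200)`, which this sketch would accept.

```typescript
// Assumption: content assertions are recognised by matcher name; getByText is
// included because the accepted examples locate elements by their text.
const CONTENT_MATCHERS = /\.(toContain|toHaveText|toBe)\(|getByText\(/;

// Returns feedback for a shallow test, or null when both rules are met.
function shallowReason(testSource: string): string | null {
  const assertionCount = (testSource.match(/\bexpect\(/g) ?? []).length;
  if (assertionCount < 2) {
    return `found ${assertionCount} assertion(s); at least 2 are required`;
  }
  if (!CONTENT_MATCHERS.test(testSource)) {
    return "no content assertion found (e.g. toContain, toHaveText, toBe)";
  }
  return null;
}
```

Returning a reason string rather than a boolean mirrors the idea of saving specific feedback for the retry.
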
- Code Check — Verification pipeline after each story
- How Ralph Works — Full architecture details
- Cheatsheet — All commands at a glance