Commit 05ad8d7

203-cycle grounded evolution run: score ceiling 39.0, self-tuning mutation weights, test quality tracking, summary report, conclusions
1 parent e812cb2 commit 05ad8d7

4 files changed

Lines changed: 416 additions & 90 deletions

README.md

Lines changed: 45 additions & 61 deletions
@@ -27,10 +27,12 @@
2727

2828
<!-- EVOLUTION_STATUS_START -->
2929

30-
> **Last Evolution Cycle:** 2026-05-28T19:25:56.130374+00:00 UTC
31-
> **Generation:** 50
32-
> **Best Score:** 96.0
30+
> **Last Evolution Cycle:** 2026-05-28T21:29:06+00:00
31+
> **Generations:** 203
32+
> **Best Score:** 39.0
3333
> **Population Size:** 50
34+
> **Benchmarks:** 7
35+
> **Test Quality:** 4–5 real assertions per cycle
3436
3537
<!-- EVOLUTION_STATUS_END -->
3638

@@ -284,21 +286,21 @@ Each cycle:
284286
12. Wait 10 seconds, repeat
285287
```
286288

287-
### Mutation Engine: `mutation_engine.py` (55 lines)
289+
### Mutation Engine: `mutation_engine.py` (136 lines)
288290

289-
25 mutation operations that transform prompts:
291+
39 mutation operations that transform prompts, with **self-tuning weights**:
290292

291293
```python
292294
MUTATIONS = [
293-
"Add stronger modularity requirements",
294-
"Require async support",
295-
"Require retry handling with exponential backoff",
296-
"Require comprehensive tests with pytest",
297-
"Add input validation using Pydantic or dataclasses",
298-
# ... 20 more
295+
{"desc": "Add stronger modularity requirements", "weight": 1.0},
296+
{"desc": "Require async support", "weight": 1.0},
297+
{"desc": "Require comprehensive tests with pytest", "weight": 1.0},
298+
# ... 36 more, weights auto-adjusted by success rate
299299
]
300300
```
301301

302+
Mutations that consistently produce negative score deltas have their probability reduced (down to 0.1x). Successful mutations maintain or increase their weight. Weights persist to `memory/mutation_weights.json`.
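
The weight adjustment described above could be sketched as follows. This is a minimal illustration, not the contents of `mutation_engine.py`: the function names, the additive step of 0.1, and the upper cap of 2.0 are assumptions. Only the 0.1 floor, the `desc`/`weight` fields, and JSON persistence come from the README.

```python
import json
import random
from pathlib import Path

MIN_WEIGHT = 0.1  # floor stated in the README (0.1x)
MAX_WEIGHT = 2.0  # assumed cap; the README does not give one
STEP = 0.1        # assumed adjustment step


def pick_mutation(mutations, rng=random):
    """Sample one mutation with probability proportional to its weight."""
    weights = [m["weight"] for m in mutations]
    return rng.choices(mutations, weights=weights, k=1)[0]


def update_weight(mutation, score_delta):
    """Lower the weight after a negative delta, raise it after a positive one."""
    if score_delta < 0:
        mutation["weight"] = max(MIN_WEIGHT, mutation["weight"] - STEP)
    elif score_delta > 0:
        mutation["weight"] = min(MAX_WEIGHT, mutation["weight"] + STEP)
    return mutation["weight"]


def save_weights(mutations, path):
    """Persist weights as a {description: weight} JSON object."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({m["desc"]: m["weight"] for m in mutations}, indent=2))


def load_weights(mutations, path):
    """Restore saved weights in place; unknown mutations keep their defaults."""
    path = Path(path)
    if path.exists():
        saved = json.loads(path.read_text())
        for m in mutations:
            m["weight"] = saved.get(m["desc"], m["weight"])
```

Because the floor is above zero, a downweighted mutation can still be sampled occasionally, so it has a chance to recover if it starts producing positive deltas.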
303+
302304
Also provides `crossover_prompts()` for genetic recombination between two prompts.
303305

304306
### Population Manager: `population_manager.py` (50 lines)
@@ -420,61 +422,43 @@ git checkout prompt.txt
420422

421423
## Results
422424

423-
### Current Snapshot
425+
### Grounded Evolution Results (203 Cycles)
424426

425427
| Metric | Value |
426428
|--------|-------|
427-
| **Generations** | 218 |
428-
| **Population** | 218 prompts |
429-
| **Best Lexical Score** | **1000 / 1000** |
430-
| **Score Range** | 35 → 1000 (28.6×) |
431-
| **Ceiling Progression** | 500 → 862 → 1000 |
432-
| **Grounded Best** | 96.0 / 100 |
433-
434-
> **Note: Lexical Plateau at 862/1000 (now broken).** A bug in `mutate.py`'s `get_missing_keywords`
435-
> function was ignoring all 286 single-keyword scoring conditions (those without `and`/`or`),
436-
> blocking 180 uncovered signals. After fix: **896 → 914 → 932 → 950 → 968 → 986 → 1000** in 30 cycles.
437-
> 6 prompts now score the maximum. The grounded loop remains the next frontier.
438-
439-
### Top 10 Prompts
440-
441-
| Rank | File | Score | Key Differentiator |
442-
|------|------|-------|-------------------|
443-
| 1 | `prompt_131.txt` | 862.0 | Full production-ready agent structure |
444-
| 2 | `prompt_132.txt` | 862.0 | Comprehensive error handling + logging |
445-
| 3 | `prompt_133.txt` | 862.0 | Async-first with complete test suite |
446-
| 4 | `prompt_134.txt` | 862.0 | Docker + CI/CD + observability |
447-
| 5 | `prompt_135.txt` | 862.0 | Multi-source RAG + embedding pipeline |
448-
| 6 | `prompt_136.txt` | 862.0 | LangGraph + tool calling + streaming |
449-
| 7 | `prompt_137.txt` | 862.0 | Security + auth + rate limiting |
450-
| 8 | `prompt_138.txt` | 862.0 | Kubernetes + Terraform + monitoring |
451-
| 9 | `prompt_140.txt` | 862.0 | Full microservice architecture |
452-
| 10 | `prompt_121.txt` | 840.0 | LangGraph + Ollama + comprehensive testing |
453-
454-
### Score Distribution
429+
| **Cycles** | 203 |
430+
| **Best Execution Score** | **39.0 / 80** |
431+
| **Score Range** | 17.0 → 39.0 |
432+
| **Average Score** | 30.9 |
433+
| **Test Quality** | 4–5 real assertions/cycle (after prompt fix) |
434+
| **Hidden Tests Passed** | 0 / 203 |
435+
| **Total LLM Tokens** | ~1,000,000 |
436+
| **Mutation Operators** | 127 uses |
437+
| **Crossover Operators** | 76 uses |
438+
| **Process Stability** | 100% (no crashes) |
455439

456-
```mermaid
457-
pie title Prompt Score Distribution (150 prompts)
458-
"800–862 (elite)" : 30
459-
"600–799 (strong)" : 25
460-
"500–599 (good)" : 35
461-
"300–499 (developing)" : 20
462-
"100–299 (emerging)" : 25
463-
"35–99 (baseline)" : 15
464-
```
440+
### What We Learned
441+
442+
**1. Score plateau at 39/80** — Across 203 cycles, the execution score never exceeded 39. The system converged to a fitness plateau: prompts evolved to produce projects that pass basic structural checks (syntax, imports, file count) but consistently fail benchmark-specific behavioral tests. This suggests the mutation operators search over *prompt wording* rather than *functional correctness*, and a change in wording does not reliably change what the generated code does.
443+
444+
**2. Test quality is directly controllable via prompt engineering** — The single most impactful change was improving the LLM system prompt from *"Generate clean code"* to *"Generate real tests with assertions, no placeholders"*. This moved test quality from 0 real assertions to 4–5 per cycle instantly. The generator system prompt is a critical lever.
445+
446+
**3. Self-tuning mutation weights work but converge** — The weight adjustment system successfully downweighted 5 consistently harmful mutations to 0.1× probability. However, this also reduced exploration diversity — the same few mutations were repeatedly selected, narrowing the search space.
447+
448+
**4. Hidden behavioral tests remain unsolved** — Across 203 cycles, not a single generated project passed benchmark-specific hidden tests. The generated code produces correct *structure* (right function signatures, right file layout) but not correct *behavior* (the functions don't actually work as specified). This is the fundamental open problem.
449+
450+
**5. LLM generation is reliable** — The Mistral API completed all 203 generations without a single failure. Average generation time was stable at ~60s/cycle.
451+
452+
**6. Population converges without diversity preservation** — The greedy elitist selection (keep top 50) caused the population to converge to near-identical prompts producing 3-file projects. A diversity-preserving mechanism (e.g., novelty search or fitness sharing) is needed.
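
As one possible reading of the fitness-sharing suggestion in point 6, the sketch below divides each prompt's raw score by a niche count, so near-duplicate prompts split their fitness. It is illustrative only and not part of this repository: the function names, the token-set Jaccard distance, and the sharing radius `sigma=0.5` are all assumptions.

```python
def tokens(prompt):
    """Lowercased whitespace tokens of a prompt, as a set."""
    return set(prompt.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def shared_fitness(prompts, scores, sigma=0.5):
    """Divide each raw score by its niche count.

    The niche count sums a triangular sharing function over all prompts
    closer than `sigma` in Jaccard distance. Each prompt counts itself,
    so the divisor is always at least 1.
    """
    token_sets = [tokens(p) for p in prompts]
    shared = []
    for i, score in enumerate(scores):
        niche = 0.0
        for j in range(len(prompts)):
            distance = 1.0 - jaccard(token_sets[i], token_sets[j])
            if distance < sigma:
                niche += 1.0 - distance / sigma
        shared.append(score / niche)
    return shared


prompts = ["add tests", "add tests", "use async io"]
print(shared_fitness(prompts, [30.0, 30.0, 30.0]))  # [15.0, 15.0, 30.0]
```

Ranking the population by shared fitness instead of raw score before keeping the top 50 would penalize clusters of near-identical prompts. Whether that lifts the 39.0 ceiling is untested here.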
453+
454+
### Open Challenges
465455

466-
### What Top Prompts Generate
467-
468-
When fed to the grounded generator, top prompts produce:
469-
- Full `src/package/` layouts with 20+ modules
470-
- LangGraph ReAct loops with Ollama local models
471-
- Pydantic v2 validation + type hints everywhere
472-
- Async/await, streaming, SSE/websocket support
473-
- OpenTelemetry + Prometheus + Grafana stacks
474-
- OAuth2/JWT auth with rate limiting
475-
- pytest property-based, snapshot, benchmark tests
476-
- Docker + Kubernetes + systemd deployment
477-
- CI/CD with GitHub Actions + pre-commit
456+
| Challenge | Impact | Hypothesis |
457+
|-----------|--------|------------|
458+
| **Hidden test failure** | Blocks scores above 40 | Need behavioral validation mid-generation, not post-hoc |
459+
| **Population convergence** | Stagnation after ~50 cycles | Add novelty search or multi-objective optimization |
460+
| **Token cost** | ~5,000 tokens/cycle | Cache similar prompts, use cheaper models for pre-filtering |
461+
| **Mutation granularity** | Most mutations are neutral | Add structured mutations that modify specific prompt sections |
478462

479463
---
480464
