|
27 | 27 |
|
28 | 28 | <!-- EVOLUTION_STATUS_START --> |
29 | 29 |
|
30 | | -> **Last Evolution Cycle:** 2026-05-28T19:25:56.130374+00:00 UTC |
31 | | -> **Generation:** 50 |
32 | | -> **Best Score:** 96.0 |
| 30 | +> **Last Evolution Cycle:** 2026-05-28T21:29:06+00:00 UTC |
| 31 | +> **Generations:** 203 |
| 32 | +> **Best Score:** 39.0 |
33 | 33 | > **Population Size:** 50 |
| 34 | +> **Benchmarks:** 7 |
| 35 | +> **Test Quality:** 4–5 real assertions per cycle |
34 | 36 |
|
35 | 37 | <!-- EVOLUTION_STATUS_END --> |
36 | 38 |
|
@@ -284,21 +286,21 @@ Each cycle: |
284 | 286 | 12. Wait 10 seconds, repeat |
285 | 287 | ``` |
286 | 288 |
|
287 | | -### Mutation Engine: `mutation_engine.py` (55 lines) |
| 289 | +### Mutation Engine: `mutation_engine.py` (136 lines) |
288 | 290 |
|
289 | | -25 mutation operations that transform prompts: |
| 291 | +39 mutation operations that transform prompts, with **self-tuning weights**: |
290 | 292 |
|
291 | 293 | ```python |
292 | 294 | MUTATIONS = [ |
293 | | - "Add stronger modularity requirements", |
294 | | - "Require async support", |
295 | | - "Require retry handling with exponential backoff", |
296 | | - "Require comprehensive tests with pytest", |
297 | | - "Add input validation using Pydantic or dataclasses", |
298 | | - # ... 20 more |
| 295 | + {"desc": "Add stronger modularity requirements", "weight": 1.0}, |
| 296 | + {"desc": "Require async support", "weight": 1.0}, |
| 297 | + {"desc": "Require comprehensive tests with pytest", "weight": 1.0}, |
| 298 | + # ... 36 more, weights auto-adjusted by success rate |
299 | 299 | ] |
300 | 300 | ``` |
301 | 301 |
|
| 302 | +Mutations that consistently produce negative score deltas have their probability reduced (down to 0.1x). Successful mutations maintain or increase their weight. Weights persist to `memory/mutation_weights.json`. |
| 303 | + |
302 | 304 | Also provides `crossover_prompts()` for genetic recombination between two prompts. |
303 | 305 |
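The weight adjustment described above could be sketched roughly as follows. This is a minimal illustration, not the contents of `mutation_engine.py`: the function names `pick_mutation`, `update_weight`, `save_weights`, and `load_weights`, the step size `STEP`, and the upper bound `MAX_WEIGHT` are all assumptions made for the example. Only the 0.1 floor and the JSON persistence come from the description.

```python
import json
import random
from pathlib import Path

MIN_WEIGHT = 0.1  # floor from the description: probability reduced "down to 0.1x"
MAX_WEIGHT = 3.0  # assumed cap; the README does not state an upper bound
STEP = 0.1        # assumed adjustment per observed score delta


def pick_mutation(mutations, rng=random):
    """Sample one mutation with probability proportional to its weight."""
    weights = [m["weight"] for m in mutations]
    return rng.choices(mutations, weights=weights, k=1)[0]


def update_weight(mutation, score_delta):
    """Lower the weight after a negative delta, raise it after a positive one."""
    if score_delta < 0:
        mutation["weight"] = max(MIN_WEIGHT, mutation["weight"] - STEP)
    elif score_delta > 0:
        mutation["weight"] = min(MAX_WEIGHT, mutation["weight"] + STEP)
    return mutation["weight"]


def save_weights(mutations, path):
    """Persist weights as a {description: weight} JSON object."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {m["desc"]: m["weight"] for m in mutations}
    path.write_text(json.dumps(payload, indent=2))


def load_weights(mutations, path):
    """Restore saved weights in place; unknown mutations keep their default."""
    path = Path(path)
    if not path.exists():
        return
    saved = json.loads(path.read_text())
    for m in mutations:
        m["weight"] = saved.get(m["desc"], m["weight"])
```

In a cycle, one would call `pick_mutation(MUTATIONS)`, score the mutated prompt, pass the score delta to `update_weight`, and then call `save_weights(MUTATIONS, "memory/mutation_weights.json")`. Because the floor is above zero, a downweighted mutation can still be sampled occasionally.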
|
304 | 306 | ### Population Manager: `population_manager.py` (50 lines) |
@@ -420,61 +422,43 @@ git checkout prompt.txt |
420 | 422 |
|
421 | 423 | ## Results |
422 | 424 |
|
423 | | -### Current Snapshot |
| 425 | +### Grounded Evolution Results (203 Cycles) |
424 | 426 |
|
425 | 427 | | Metric | Value | |
426 | 428 | |--------|-------| |
427 | | -| **Generations** | 218 | |
428 | | -| **Population** | 218 prompts | |
429 | | -| **Best Lexical Score** | **1000 / 1000** | |
430 | | -| **Score Range** | 35 → 1000 (28.6×) | |
431 | | -| **Ceiling Progression** | 500 → 862 → 1000 | |
432 | | -| **Grounded Best** | 96.0 / 100 | |
433 | | - |
434 | | -> **Note: Lexical Plateau at 862/1000 (now broken).** A bug in `mutate.py`'s `get_missing_keywords` |
435 | | -> function was ignoring all 286 single-keyword scoring conditions (those without `and`/`or`), |
436 | | -> blocking 180 uncovered signals. After fix: **896 → 914 → 932 → 950 → 968 → 986 → 1000** in 30 cycles. |
437 | | -> 6 prompts now score the maximum. The grounded loop remains the next frontier. |
438 | | -
|
439 | | -### Top 10 Prompts |
440 | | - |
441 | | -| Rank | File | Score | Key Differentiator | |
442 | | -|------|------|-------|-------------------| |
443 | | -| 1 | `prompt_131.txt` | 862.0 | Full production-ready agent structure | |
444 | | -| 2 | `prompt_132.txt` | 862.0 | Comprehensive error handling + logging | |
445 | | -| 3 | `prompt_133.txt` | 862.0 | Async-first with complete test suite | |
446 | | -| 4 | `prompt_134.txt` | 862.0 | Docker + CI/CD + observability | |
447 | | -| 5 | `prompt_135.txt` | 862.0 | Multi-source RAG + embedding pipeline | |
448 | | -| 6 | `prompt_136.txt` | 862.0 | LangGraph + tool calling + streaming | |
449 | | -| 7 | `prompt_137.txt` | 862.0 | Security + auth + rate limiting | |
450 | | -| 8 | `prompt_138.txt` | 862.0 | Kubernetes + Terraform + monitoring | |
451 | | -| 9 | `prompt_140.txt` | 862.0 | Full microservice architecture | |
452 | | -| 10 | `prompt_121.txt` | 840.0 | LangGraph + Ollama + comprehensive testing | |
453 | | - |
454 | | -### Score Distribution |
| 429 | +| **Cycles** | 203 | |
| 430 | +| **Best Execution Score** | **39.0 / 80** | |
| 431 | +| **Score Range** | 17.0 → 39.0 | |
| 432 | +| **Average Score** | 30.9 | |
| 433 | +| **Test Quality** | 4–5 real assertions/cycle (after prompt fix) | |
| 434 | +| **Hidden Tests Passed** | 0 / 203 | |
| 435 | +| **Total LLM Tokens** | ~1,000,000 | |
| 436 | +| **Mutation Operators** | 127 uses | |
| 437 | +| **Crossover Operators** | 76 uses | |
| 438 | +| **Process Stability** | 100% (no crashes) | |
455 | 439 |
|
456 | | -```mermaid |
457 | | -pie title Prompt Score Distribution (150 prompts) |
458 | | - "800–862 (elite)" : 30 |
459 | | - "600–799 (strong)" : 25 |
460 | | - "500–599 (good)" : 35 |
461 | | - "300–499 (developing)" : 20 |
462 | | - "100–299 (emerging)" : 25 |
463 | | - "35–99 (baseline)" : 15 |
464 | | -``` |
| 440 | +### What We Learned |
| 441 | + |
| 442 | +**1. Score plateau at 39/80.** Across 203 cycles, the execution score never exceeded 39. Prompts evolved to produce projects that pass basic structural checks (syntax, imports, file count) but consistently fail benchmark-specific behavioral tests. This suggests the mutation operators search the space of *prompt text* variation rather than the space of *functional correctness*, and progress in the first does not imply progress in the second. |
| 443 | +
| 444 | +**2. Test quality is directly controllable via prompt engineering.** The single most impactful change was rewording the LLM system prompt from *"Generate clean code"* to *"Generate real tests with assertions, no placeholders"*. Test quality went from 0 real assertions to 4–5 per cycle immediately. The generator system prompt is a critical lever. |
| 445 | +
| 446 | +**3. Self-tuning mutation weights work but narrow the search.** The weight adjustment system downweighted 5 consistently harmful mutations to 0.1× probability. This also reduced exploration diversity: the same few mutations were selected repeatedly, narrowing the search space. |
| 447 | +
| 448 | +**4. Hidden behavioral tests remain unsolved.** Across 203 cycles, no generated project passed the benchmark-specific hidden tests. The generated code has the correct *structure* (right function signatures, right file layout) but not the correct *behavior* (the functions do not work as specified). This is the fundamental open problem. |
| 449 | +
| 450 | +**5. LLM generation is reliable.** The Mistral API completed all 203 generations without a single failure. Average generation time was stable at ~60s/cycle. |
| 451 | +
| 452 | +**6. Population converges without diversity preservation.** Greedy elitist selection (keep the top 50) caused the population to converge to near-identical prompts producing 3-file projects. A diversity-preserving mechanism (e.g., novelty search or fitness sharing) is needed. |
| 453 | + |
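As one illustration of the diversity-preserving mechanism mentioned in point 6, the sketch below applies classic fitness sharing before elitist selection. It is not part of the current system: the function names, the token-level Jaccard distance, and the niche radius `sigma` are all assumptions chosen to keep the example small. Each prompt's score is divided by a niche count, so clusters of near-identical prompts split their fitness among themselves.

```python
def jaccard_similarity(a, b):
    """Token-set overlap between two prompts, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def shared_fitness(prompts, scores, sigma=0.5):
    """Divide each score by its niche count (triangular sharing kernel).

    sigma is the assumed niche radius: prompts at distance >= sigma
    do not affect each other.
    """
    adjusted = []
    for i, prompt in enumerate(prompts):
        niche_count = 0.0
        for other in prompts:
            distance = 1.0 - jaccard_similarity(prompt, other)
            if distance < sigma:
                niche_count += 1.0 - distance / sigma
        # niche_count >= 1 because every prompt is at distance 0 from itself
        adjusted.append(scores[i] / niche_count)
    return adjusted


def select_survivors(prompts, scores, keep, sigma=0.5):
    """Elitist selection on shared fitness instead of raw score."""
    adjusted = shared_fitness(prompts, scores, sigma)
    ranked = sorted(range(len(prompts)), key=lambda i: adjusted[i], reverse=True)
    return [prompts[i] for i in ranked[:keep]]
```

With two identical prompts scoring 39 and one distinct prompt scoring 30, the duplicates each drop to 19.5 while the distinct prompt keeps 30, so it survives selection ahead of either copy. Token overlap is a crude distance; an embedding-based distance could be substituted without changing the structure.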
| 454 | +### Open Challenges |
465 | 455 |
|
466 | | -### What Top Prompts Generate |
467 | | - |
468 | | -When fed to the grounded generator, top prompts produce: |
469 | | -- Full `src/package/` layouts with 20+ modules |
470 | | -- LangGraph ReAct loops with Ollama local models |
471 | | -- Pydantic v2 validation + type hints everywhere |
472 | | -- Async/await, streaming, SSE/websocket support |
473 | | -- OpenTelemetry + Prometheus + Grafana stacks |
474 | | -- OAuth2/JWT auth with rate limiting |
475 | | -- pytest property-based, snapshot, benchmark tests |
476 | | -- Docker + Kubernetes + systemd deployment |
477 | | -- CI/CD with GitHub Actions + pre-commit |
| 456 | +| Challenge | Impact | Hypothesis | |
| 457 | +|-----------|--------|------------| |
| 458 | +| **Hidden test failure** | Blocks scores above 40 | Need behavioral validation mid-generation, not post-hoc | |
| 459 | +| **Population convergence** | Stagnation after ~50 cycles | Add novelty search or multi-objective optimization | |
| 460 | +| **Token cost** | ~5,000 tokens/cycle | Cache similar prompts, use cheaper models for pre-filtering | |
| 461 | +| **Mutation granularity** | Most mutations are neutral | Add structured mutations that modify specific prompt sections |
478 | 462 |
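For the token-cost row, a minimal sketch of the caching idea is shown below. It only handles prompts that are identical after whitespace and case normalization, which is a simplification of "similar"; the names `prompt_key` and `GenerationCache`, and the `generate` callable standing in for the LLM call, are hypothetical.

```python
import hashlib


def prompt_key(prompt):
    """Hash a prompt after collapsing whitespace and lowercasing."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class GenerationCache:
    """Memoize generator output per normalized prompt."""

    def __init__(self, generate):
        self._generate = generate  # callable: prompt text -> generated output
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, prompt):
        key = prompt_key(prompt)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._generate(prompt)
        return self._store[key]
```

One trade-off to keep in mind: a cache hit reuses an earlier sample, so any variation the generator would have produced on a repeated call is lost. Whether that is acceptable depends on how much of the score variance comes from sampling rather than from the prompt.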
|
479 | 463 | --- |
480 | 464 |
|
|