Evolving full agent blueprints through execution-grounded genetic algorithms — not just prompts, but tools, memory, planning, and self-evaluation.
Grounded Agent Forge is the next evolution of execution-grounded prompt optimization. Where the original grounded_evolution evolved text prompts to generate better code, this project evolves complete agent blueprints — full specifications for autonomous AI agents including system prompts, tool definitions, memory architectures, planning strategies, and self-evaluation mechanisms.
timeline
title The Evolution of Agent Evolution
autoresearch-ai-agent-skeleton : Lexical-only prompt scoring (400+ keyword signals)
grounded_evolution : Execution-grounded validation (AST + pytest + flake8)
grounded_agent_forge : Full agent blueprint evolution in Docker sandbox
| Feature | Impact |
|---|---|
| 🧬 Agent-Level Evolution | Not just prompts — entire agent architectures evolve through genetic algorithms |
| 📦 Docker Sandboxing | Every generated agent executes in an isolated container; real execution metrics drive fitness |
| 🎯 Multi-Objective Fitness | Agents scored on correctness, efficiency, tool-use accuracy, planning depth, and self-evaluation |
| 🔄 Meta-Evolution | The evolutionary strategy itself evolves: crossover rates, mutation operators, and selection pressure adapt over time |
| 🧩 Task Specialization | Populations diversify into specialist agents for different problem domains |
| 📊 Real-Time Dashboard | Web-based visualization of evolution progress, agent scores, and population dynamics |
┌──────────────────────────────────────────────────────────────────┐
│ grounded_agent_forge │
│ (THIS REPO) │
│ Evolves full agent blueprints (prompt + tools + memory + │
│ planning + self-eval) in Docker sandbox with multi-objective │
│ fitness, meta-evolution, and task specialization. │
│ │
│ 🏗️ Agent-level evolution 📦 Docker sandboxed execution │
│ 🎯 8+ fitness dimensions 🔄 Self-tuning meta-evolution │
│ 📊 Real-time dashboard 🧩 Task specialization │
└──────────────────────────────────────────────────────────────────┘
▲
│ builds on · evolves from
┌──────────────────────────────────────────────────────────────────┐
│ grounded_evolution │
│ (github.com/NullLabTests/grounded_evolution) │
│ Evolves text prompts with execution-grounded validation via AST │
│ parse, pytest, and flake8. Two-loop system: lexical + grounded. │
│ │
│ 📝 203 evolution cycles 🏆 Best score: 39/80 │
│ 🔬 7 benchmark tasks 🔄 127 mutations + 76 crossovers │
└──────────────────────────────────────────────────────────────────┘
▲
│ builds on · evolves from
┌──────────────────────────────────────────────────────────────────┐
│ autoresearch-ai-agent-skeleton │
│ Lexical-only prompt evolution with 400+ keyword signals across │
│ 19 categories. 5 genetic mutation strategies. Meta-signal │
│ injection via auto_evolve.py. │
│ │
│ 📝 218 prompts evolved 🏆 Best lexical score: 1000/1000 │
│ 🔤 400+ keyword signals 🧬 5 mutation strategies │
└──────────────────────────────────────────────────────────────────┘
| Capability | Lexical-Only | Grounded Evolution | 🚀 Grounded Agent Forge |
|---|---|---|---|
| Keyword prompt scoring | ✅ 400+ signals | ✅ 400+ signals | ✅ 400+ signals |
| Execution-grounded validation | ❌ | ✅ AST + pytest + flake8 | ✅ Full Docker sandbox |
| Evolves prompts | ✅ | ✅ | ✅ |
| Evolves agent blueprints | ❌ | ❌ | ✅ |
| Docker sandbox isolation | ❌ | ❌ | ✅ |
| Multi-objective fitness | ❌ | ❌ | ✅ (8+ dimensions) |
| Meta-evolution | ✅ signal injection | ✅ signal injection | ✅ full strategy evolution |
| Task specialization | ❌ | ❌ | ✅ |
| Real-time dashboard | ❌ | ❌ | ✅ |
| Self-evaluation in agents | ❌ | ❌ | ✅ |
| Tool-use validation | ❌ | ❌ | ✅ |
| Planning depth scoring | ❌ | ❌ | ✅ |
| Infinite research loop | ❌ (finite) | ✅ | ✅ |
| Auto-commit on improvement | ❌ | ✅ | ✅ |
This project was built using DeepSeek V4 as the primary coding model.
┌──────────────────────────────────────────────────────────────────────┐
│ GROUNDED AGENT FORGE │
│ │
│ ┌──────────────────────────┐ ┌────────────────────────────────┐ │
│ │ orchestrator.py │───▶│ agent_spec_generator.py │ │
│ │ ─ Main evolution loop │ │ ─ Generates agent blueprints │ │
│ │ ─ Selection & mutation │ │ ─ System prompt + tools │ │
│ │ ─ Parallel generation │ │ ─ Memory + planning config │ │
│ └───────────┬──────────────┘ └───────────────┬────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────┐ ┌────────────────────────────────┐ │
│ │ full_agent_evaluator │ │ Docker Sandbox │ │
│ │ ─ Multi-objective score │───▶│ ─ Isolated container exec │ │
│ │ ─ 8 fitness dimensions │ │ ─ Tool-use validation │ │
│ │ ─ Benchmark execution │ │ ─ Planning evaluation │ │
│ └───────────┬──────────────┘ └────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ meta_evolver.py │───▶ Self-tuning evolution strategy │
│ │ ─ Adaptive mutation │ │
│ │ ─ Weight optimization │ │
│ │ ─ Novelty-driven explore│ │
│ └──────────────────────────┘ │
│ │
│ ┌──────────────────────────┐ │
│ │ dashboard/ │───▶ Real-time evolution visualization │
│ │ main.py │ (FastAPI + Web UI) │
│ └──────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
graph TB
subgraph Forge["⚒️ Agent Forge Loop"]
direction TB
A["🧬 Agent Blueprint<br/>Population"] --> B["🎯 orchestrator.py<br/>Select + Mutate"]
B --> C["🤖 agent_spec_generator.py<br/>LLM → Full Agent Spec"]
C --> D["📦 Docker Sandbox<br/>Build + Run Agent"]
D --> E["📊 full_agent_evaluator.py<br/>Multi-Objective Score"]
E --> F["🧠 meta_evolver.py<br/>Tune Evolution Strategy"]
F --> G["💾 Update Population<br/>+ Persist to DB"]
G --> A
end
subgraph Dashboard["📈 Real-Time Visualization"]
DASH["🖥️ dashboard/main.py<br/>FastAPI + Charts"]
end
E -->|"fitness data"| DASH
DASH -->|"control signals"| B
quadrantChart
title Fitness Dimension Weights
x-axis "Low Impact" --> "High Impact"
y-axis "Easy to Measure" --> "Hard to Measure"
quadrant-1 "Core Metrics"
quadrant-2 "Quality Signals"
quadrant-3 "Secondary"
quadrant-4 "Long-term"
Correctness: [0.9, 0.3]
Tool-Use: [0.6, 0.5]
Planning: [0.5, 0.7]
Code-Quality: [0.4, 0.2]
Memory: [0.3, 0.6]
Self-Eval: [0.3, 0.8]
Efficiency: [0.2, 0.4]
Prompt-Quality: [0.1, 0.1]
| Dimension | Weight | What It Measures |
|---|---|---|
| 🎯 Correctness | 30% | Does the agent solve the task correctly? |
| 🔧 Tool-Use Accuracy | 15% | Does the agent call tools with valid arguments? |
| 🧩 Planning Depth | 15% | Does the agent decompose problems into steps? |
| 📝 Code Quality | 10% | AST validity, project structure, linting |
| 🧠 Memory Effectiveness | 10% | Does the agent use memory to maintain context? |
| 🔍 Self-Evaluation | 10% | Does the agent correctly assess its own outputs? |
| ⚡ Efficiency | 5% | Token efficiency, round-trips to completion |
| 📖 Prompt Quality | 5% | Lexical signal coverage (legacy metric) |
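The weights in the table can be combined into a single fitness value. The sketch below assumes a plain weighted sum over per-dimension scores normalized to [0, 1]; the README lists only the weights, so the aggregation rule, the dictionary keys, and the name `aggregate_fitness` are illustrative assumptions rather than the project's actual code.

```python
# Weights copied from the fitness table; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.30,
    "tool_use": 0.15,
    "planning": 0.15,
    "code_quality": 0.10,
    "memory": 0.10,
    "self_eval": 0.10,
    "efficiency": 0.05,
    "prompt_quality": 0.05,
}


def aggregate_fitness(scores: dict) -> float:
    """Weighted sum of per-dimension scores (hypothetical helper).

    Assumes each score is normalized to [0, 1]. Missing dimensions count
    as 0, and out-of-range values are clamped.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, scores.get(name, 0.0)))
        total += weight * value
    return total


# An agent that is only correct, with no signal on other dimensions.
partial = aggregate_fitness({"correctness": 1.0})
```

Under this assumption, an agent that solves every task but scores zero elsewhere tops out at 0.30, which shows how heavily the remaining dimensions shape the final ranking.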
- Python 3.12+
- Docker (for sandboxed agent execution)
- LLM API key — DeepSeek, OpenAI, or any OpenAI-compatible provider
# Clone the repository
git clone git@github.com:NullLabTests/grounded_agent_forge.git
cd grounded_agent_forge
# Create virtual environment
python -m venv .venv && source .venv/bin/activate
# Install base + forge extras
pip install -e ".[forge]"
# Configure your LLM provider
cp .env.example .env
# Edit .env with your API key and model preferences

# Start the infinite agent evolution loop (two ways):
python -m agent_forge.orchestrator
# OR use the shell wrapper:
bash run_forge_loop.sh

uvicorn dashboard.main:app --reload --port 8000
# Open → http://localhost:8000

| Variable | Default | Description |
|---|---|---|
| `LLM_API_KEY` | (none) | LLM provider API key |
| `LLM_MODEL` | `deepseek-chat` | Model name |
| `LLM_BASE_URL` | `https://api.deepseek.com/v1` | API endpoint |
| `FORGE_DB_URL` | `sqlite+aiosqlite:///forge_population.db` | Population database |
| `SANDBOX_TIMEOUT` | `300` | Docker sandbox timeout (seconds) |
| `MAX_PARALLEL_GENERATIONS` | `3` | Concurrent agent generations |
| `HUMAN_APPROVAL` | `false` | Require manual approval before execution |
| `DASHBOARD_PORT` | `8000` | Dashboard server port |
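As a rough illustration of how these variables and defaults could be read at startup, here is a minimal sketch using only the standard library. `ForgeConfig` and `load_config` are hypothetical names, and the project may load its settings differently (for example through a `.env` loader), so treat this as an assumption-laden example rather than the actual implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ForgeConfig:  # hypothetical container, not part of the repo
    llm_api_key: Optional[str]
    llm_model: str
    llm_base_url: str
    forge_db_url: str
    sandbox_timeout: int
    max_parallel_generations: int
    human_approval: bool
    dashboard_port: int


def load_config(env: Optional[Mapping[str, str]] = None) -> ForgeConfig:
    """Read settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    truthy = {"1", "true", "yes", "on"}  # assumed spelling of boolean flags
    return ForgeConfig(
        llm_api_key=env.get("LLM_API_KEY"),  # no default: must be supplied
        llm_model=env.get("LLM_MODEL", "deepseek-chat"),
        llm_base_url=env.get("LLM_BASE_URL", "https://api.deepseek.com/v1"),
        forge_db_url=env.get(
            "FORGE_DB_URL", "sqlite+aiosqlite:///forge_population.db"
        ),
        sandbox_timeout=int(env.get("SANDBOX_TIMEOUT", "300")),
        max_parallel_generations=int(env.get("MAX_PARALLEL_GENERATIONS", "3")),
        human_approval=env.get("HUMAN_APPROVAL", "false").strip().lower() in truthy,
        dashboard_port=int(env.get("DASHBOARD_PORT", "8000")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
config = load_config({"SANDBOX_TIMEOUT": "60", "HUMAN_APPROVAL": "true"})
```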
orchestrator.py is the central evolution loop coordinator, the brain of the forge.
┌──────────────────────────────────────┐
│ orchestrator.py │
│ │
│ ┌─────────┐ ┌──────────┐ ┌─────┐ │
│ │ Load │─▶│ Select │─▶│ Mu- │ │
│ │ pop │ │ champion │ │ tate│ │
│ └─────────┘ └──────────┘ └──┬──┘ │
│ ▼ │
│ ┌─────────┐ ┌──────────┐ ┌─────┐ │
│ │ Per- │◀─│ Track │◀─│ Eval│ │
│ │ sist │ │ fitness │ │ uate│ │
│ └─────────┘ └──────────┘ └─────┘ │
└──────────────────────────────────────┘
- Loads/persists agent blueprint population from database
- Tournament selection with elitism
- Mutation and crossover scheduling
- Parallel generation management
- Fitness tracking and convergence detection
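Tournament selection with elitism, as listed above, can be sketched in a few lines. The function names, the tournament size `k`, and the `elite_count` parameter are illustrative assumptions; the orchestrator's real signatures and its crossover scheduling are not shown in this README.

```python
import random


def tournament_select(population, fitness, k=3, rng=random):
    """Sample k individuals at random and return the fittest of them."""
    contenders = rng.sample(population, k=min(k, len(population)))
    return max(contenders, key=fitness)


def next_generation(population, fitness, mutate, elite_count=1, k=3, rng=random):
    """Build a new population of the same size (hypothetical helper).

    The top `elite_count` individuals are copied over unchanged (elitism);
    the rest are mutated copies of tournament winners.
    """
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:elite_count]
    children = []
    while len(elites) + len(children) < len(population):
        parent = tournament_select(population, fitness, k=k, rng=rng)
        children.append(mutate(parent))
    return elites + children


# Toy run: individuals are plain numbers and fitness is the number itself.
rng = random.Random(0)
population = [1, 5, 3, 9, 2]
new_population = next_generation(
    population, fitness=lambda x: x, mutate=lambda x: x + 1, rng=rng
)
```

Elitism guarantees the best blueprint is never lost between generations, while a small tournament size keeps selection pressure moderate so weaker but novel blueprints still get a chance to reproduce.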
agent_spec_generator.py generates full agent specifications from evolved blueprints. An agent spec includes:
| Component | Description |
|---|---|
| 🧠 System Prompt | Core identity, behavior instructions, and constraints |
| 🛠️ Tool Definitions | Function schemas the agent can call (JSON schema) |
| 💾 Memory Architecture | Short-term, long-term, and working memory configuration |
| 🗺️ Planning Strategy | Chain-of-thought, ReAct, or tree-of-thought configuration |
| 🔍 Self-Evaluation Criteria | How the agent judges its own outputs |
| 📐 Output Schema | Expected response format and structure |
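To make the six components concrete, the sketch below models a blueprint as a dataclass that serializes to JSON. The class name `AgentBlueprint`, its field names, and the example values (including the `run_tests` tool) are all hypothetical; the actual spec format produced by the generator is not documented here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentBlueprint:  # hypothetical shape, one field per component in the table
    system_prompt: str
    tools: list = field(default_factory=list)        # JSON-schema function definitions
    memory: dict = field(default_factory=dict)       # short-term / long-term / working config
    planning_strategy: str = "react"                 # or "chain_of_thought", "tree_of_thought"
    self_eval_criteria: list = field(default_factory=list)
    output_schema: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the blueprint so it can be stored or passed to a sandbox."""
        return json.dumps(asdict(self), indent=2)


blueprint = AgentBlueprint(
    system_prompt="You are a careful coding agent.",
    tools=[
        {
            "name": "run_tests",  # illustrative tool, not from the repo
            "description": "Run the test suite at the given path.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        }
    ],
    memory={"short_term_turns": 8, "long_term_enabled": False},
    self_eval_criteria=["All tests pass", "No unused tool calls"],
    output_schema={"type": "object", "properties": {"answer": {"type": "string"}}},
)
spec_json = blueprint.to_json()
```

Keeping every component in one serializable structure means mutation and crossover operators can target individual fields (for example swapping only the planning strategy) without touching the rest of the agent.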
full_agent_evaluator.py is the multi-objective fitness evaluator, the forge's quality gate.
Agent Spec
│
▼
┌─────────────────────────────┐
│ Build Docker Container │
│ └─ Install dependencies │
│ └─ Configure environment │
└──────────┬──────────────────┘
▼
┌─────────────────────────────┐
│ Execute Against Benchmarks │
│ └─ Task completion check │
│ └─ Tool call validation │
│ └─ Planning analysis │
└──────────┬──────────────────┘
▼
┌─────────────────────────────┐
│ Score Across 8 Dimensions │
│ └─ Correctness (30%) │
│ └─ Tool-Use (15%) │
│ └─ Planning (15%) │
│ └─ + 5 more metrics │
└─────────────────────────────┘
- Builds Docker containers from agent specs
- Executes agents against benchmark tasks
- Scores across 8+ fitness dimensions
- Handles sandbox timeouts and failures gracefully
- Logs detailed per-dimension metrics
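One way to handle sandbox timeouts and failures gracefully is to wrap the container invocation so that it always returns a structured result instead of raising. The sketch below uses `subprocess.run` with a timeout and the standard `docker run --rm --network none` flags; the helper names, the result dictionary, and the idea of shelling out to the Docker CLI are assumptions, since the evaluator's real mechanism is not described in this README.

```python
import subprocess
import sys


def docker_command(image, agent_args=()):
    """Build a `docker run` command for an isolated, auto-removed container."""
    return ["docker", "run", "--rm", "--network", "none", image, *agent_args]


def run_with_timeout(cmd, timeout_s=300):
    """Run a command and always return a result dict (hypothetical helper).

    Note: on timeout only the local process is killed. If `cmd` is a Docker
    CLI call, the container itself may keep running and would need to be
    stopped separately.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None, "stdout": "", "stderr": ""}
    except OSError as exc:  # e.g. the executable is not installed
        return {"status": "error", "returncode": None, "stdout": "", "stderr": str(exc)}
    status = "ok" if proc.returncode == 0 else "failed"
    return {
        "status": status,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


# Demonstration with the local Python interpreter instead of Docker,
# so the example runs on machines without a Docker daemon.
result = run_with_timeout([sys.executable, "-c", "print('agent finished')"], timeout_s=30)
```

A caller could then map any status other than `"ok"` to a low fitness score rather than aborting the whole evolution loop; that mapping is a design choice this sketch leaves open.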
meta_evolver.py is the evolution strategy optimizer, the forge that forges itself.
┌──────────────────────────────────┐
│ meta_evolver.py │
│ │
│ Input: population fitness deltas│
│ │
│ ┌────────────────────────────┐ │
│ │ Track operator success │ │
│ │ per operator │ │
│ └──────────┬─────────────────┘ │
│ ▼ │
│ ┌────────────────────────────┐ │
│ │ Adjust probabilities │ │
│ │ up-weight winners │ │
│ │ down-weight losers │ │
│ └──────────┬─────────────────┘ │
│ ▼ │
│ ┌────────────────────────────┐ │
│ │ Detect stagnation │ │
│ │ if flat → novelty search │ │
│ └──────────┬─────────────────┘ │
│ ▼ │
│ Output: new evolution config │
└──────────────────────────────────┘
- Tracks which mutation/crossover operators produce the best fitness gains
- Adjusts operator probabilities in real-time (self-tuning weights)
- Evolves the evolution strategy itself (meta-level adaptation)
- Detects stagnation and introduces novelty-driven exploration
- Persists strategy state across runs
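The operator tracking and stagnation detection described above can be approximated with a small credit-assignment scheme. This sketch uses an exponential moving average of fitness gains per operator and a flat-window stagnation check; the class name, the learning rate, the probability floor, and the window size are all illustrative assumptions, not the meta-evolver's actual algorithm.

```python
import random


class AdaptiveOperatorWeights:  # hypothetical name
    """Up-weight operators that produce fitness gains, down-weight the rest."""

    def __init__(self, operators, learning_rate=0.2, floor=0.05):
        self.learning_rate = learning_rate
        self.floor = floor  # keeps every operator selectable
        self.credit = {op: 1.0 for op in operators}

    def record(self, operator, fitness_delta):
        """Blend the latest gain into the operator's running credit."""
        reward = max(0.0, fitness_delta)  # losses earn no credit
        old = self.credit[operator]
        self.credit[operator] = (1 - self.learning_rate) * old + self.learning_rate * reward

    def probabilities(self):
        """Normalize floored credits into a probability distribution."""
        raw = {op: max(self.floor, c) for op, c in self.credit.items()}
        total = sum(raw.values())
        return {op: value / total for op, value in raw.items()}

    def choose(self, rng=random):
        probs = self.probabilities()
        operators = list(probs)
        return rng.choices(operators, weights=[probs[op] for op in operators], k=1)[0]


def is_stagnant(best_history, window=10, epsilon=1e-3):
    """True when the best fitness has stayed flat over the last `window` entries."""
    if len(best_history) < window:
        return False
    recent = best_history[-window:]
    return max(recent) - min(recent) < epsilon


weights = AdaptiveOperatorWeights(["mutate", "crossover"])
for _ in range(20):
    weights.record("mutate", 0.5)       # consistently helpful
    weights.record("crossover", -0.1)   # consistently harmful
probs = weights.probabilities()
```

When `is_stagnant` returns true, a caller could temporarily flatten the probabilities or inject random blueprints as a simple stand-in for the novelty-driven exploration step.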
dashboard/main.py is a FastAPI-based web dashboard providing:
| Feature | Description |
|---|---|
| 📊 Population View | Real-time visualization of the agent population |
| 📈 Fitness Trajectory | Score over time across all dimensions |
| 🔍 Agent Inspector | Compare blueprint specs side-by-side |
| 🎯 Dimension Breakdown | Per-dimension score distribution |
| 🎮 Evolution Controls | Pause, resume, and manual trigger |
grounded_agent_forge/
├── README.md # This file
├── LICENSE # MIT license
├── pyproject.toml # Project metadata + dependencies
├── AGENTS.md # Agent collaboration conventions
├── CHANGELOG.md # Release history
├── CONTRIBUTING.md # How to contribute
├── SECURITY.md # Security policy
├── .env.example # Environment template
├── .gitignore # Git ignore rules
│
├── agent_forge/ # ⚒️ Core forge modules (primary)
│ ├── __init__.py
│ ├── orchestrator.py # Evolution loop coordinator
│ ├── agent_spec_generator.py # Agent blueprint generator
│ ├── full_agent_evaluator.py # Multi-objective fitness evaluator
│ └── meta_evolver.py # Strategy adaptation
│
├── dashboard/ # 📊 Real-time web dashboard
│ └── main.py # FastAPI application
│
├── run_forge_loop.sh # Shell automation wrapper
│
├── .github/ # 🔄 CI/CD + community
│ ├── workflows/
│ │ ├── ci.yml # Lint + import checks
│ │ └── badge.yml # Dynamic score badge
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.md
│ │ ├── feature_request.md
│ │ └── config.yml
│ ├── dependabot.yml
│ ├── FUNDING.yml
│ └── CODEOWNERS
│
├── docs/ # 📚 Documentation
├── experiments/ # 🔬 Experiment outputs
├── benchmarks/ # 📋 Task definitions
│
├── evaluator/ # (legacy) Grounded evolution evaluator
├── population/ # (legacy) Evolved prompts
├── memory/ # (legacy) Evolution state
├── analysis/ # (legacy) Visualization scripts
├── generator.py # (legacy) LLM code generation
├── infinite_research_loop.py # (legacy) Grounded evolution loop
├── mutation_engine.py # (legacy) Prompt mutation operators
└── population_manager.py # (legacy) Population persistence
Note: Modules marked "(legacy)" are carried forward from grounded_evolution. They remain functional, but the primary development focus is on agent_forge/.
Grounded Agent Forge explores the frontier of evolutionary software optimization:
| Research Direction | Description |
|---|---|
| 🧬 Blueprint-Level Evolution | Moving from prompt text optimization to full agent architecture evolution |
| 📦 Execution-Grounded Multi-Objective Fitness | Real Docker sandbox execution across 8+ fitness dimensions |
| 🔄 Meta-Evolutionary Adaptation | The evolutionary strategy itself evolves, preventing stagnation |
| 🧩 Task Specialization | Populations naturally diversify into domain-specific agent archetypes |
| 🔍 Self-Evaluating Agents | Agents that can assess their own output quality are rewarded |
mindmap
root((Agent Forge))
Blueprint Evolution
System prompts
Tool definitions
Memory architectures
Planning strategies
Execution Grounding
Docker sandbox
Real execution metrics
Multi-objective scoring
Meta Evolution
Self-tuning weights
Strategy adaptation
Novelty search
Task Specialization
Domain clustering
Niche formation
Pareto optimization
Dashboard
Real-time viz
Population analysis
Control interface
Grounded Agent Forge is not:
- ❌ A claim of AGI or sentience
- ❌ A self-conscious or self-aware system
- ❌ Runaway recursive self-improvement
✅ It is a well-scoped experimental system for studying how genetic algorithms can evolve complete agent architectures — with real execution validation in isolated sandboxes.
We welcome contributions! See CONTRIBUTING.md for details.
Quick start for contributors:
# Fork & clone
git clone git@github.com:YOUR_USERNAME/grounded_agent_forge.git
# Install dev dependencies
pip install -e ".[forge]" ruff
# Lint your code
ruff check agent_forge/ dashboard/
# Open a PR

MIT. See LICENSE.
| Contribution | Link |
|---|---|
| 🧬 Predecessor | grounded_evolution — execution-grounded prompt evolution platform with 203 evolution cycles |
| 📜 Inspiration | autoresearch by Andrej Karpathy — the original lexical prompt evolution concept |
| 🤖 Built Using | DeepSeek V4 as the primary coding model for this project |