EvoMAS - Evolutionary Generation of Multi-Agent Systems

This repository implements the following paper:

Evolutionary Generation of Multi-Agent Systems. Yuntong Hu, Yuting Zhang, Matthew Trager, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto. ICML 2026. [arXiv]

EvoMAS uses an LLM meta-model as an evolutionary operator: it selects, mutates, and crosses over multi-agent system (MAS) configurations from a pool, and evaluates the results on target benchmarks. This README covers the short path: install, then run one of the benchmark scripts.
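The select / mutate / cross over / evaluate cycle can be pictured with the toy loop below. This is only a sketch of the general evolutionary pattern, not the repository's implementation: in EvoMAS the operators are carried out by the LLM meta-model, whereas here they are replaced by simple greedy and random stand-ins. `MASConfig`, `select_parents`, `crossover`, `mutate`, and `evolve` are hypothetical names, and the defaults only echo the `MAX_STEPS`, `NUM_PARENTS`, and `SEED` defaults listed under Editing parameters.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MASConfig:
    """Hypothetical stand-in for one multi-agent system configuration."""
    agents: tuple  # illustrative only: a non-empty tuple of model or role names


def select_parents(scores, num_parents):
    # Stand-in for meta-model selection: keep the top-scoring configs.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_parents]


def crossover(a, b, rng):
    # Join a prefix of one parent with the suffix of the other.
    shortest = min(len(a.agents), len(b.agents))
    cut = rng.randint(1, max(1, shortest - 1))
    return MASConfig(a.agents[:cut] + b.agents[cut:])


def mutate(config, palette, rng):
    # Swap one agent for a random entry from the model palette.
    agents = list(config.agents)
    agents[rng.randrange(len(agents))] = rng.choice(palette)
    return MASConfig(tuple(agents))


def evolve(pool, evaluate, palette, max_steps=2, num_parents=2, seed=42):
    rng = random.Random(seed)
    scores = {config: evaluate(config) for config in pool}
    for _ in range(max_steps):
        parents = select_parents(scores, num_parents)
        child = mutate(crossover(parents[0], parents[-1], rng), palette, rng)
        if child not in scores:
            scores[child] = evaluate(child)
    # Return the best configuration seen so far.
    return max(scores, key=scores.get)
```

Calling `evolve(pool, evaluate, palette)` with any scoring function returns the highest-scoring configuration found, which is never worse than the best member of the initial pool.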

Setup

conda create -n mas python=3.11 -y
conda activate mas
pip install -r requirements.txt

# API keys — only fill in the providers you'll use
cp .env.example .env
# then edit .env

AWS Bedrock models use the standard credential chain (aws configure, env vars, or an EC2 instance role). No .env entry is needed when AWS CLI credentials are already set up.

Note on model availability: Some model IDs (e.g., on Bedrock) referenced in the default configuration and MAS pool files may be deprecated or retired over time. If you encounter model-not-found errors, update the model IDs to currently available versions. You can override model IDs without editing source code via META_MODEL, JUDGE_MODEL, and AGENT_MODELS environment variables (see Editing parameters), or pass --meta-model-id / --model-list flags to main.py directly.

Download datasets

Task data is not shipped with the repo. Run the one-click preparation script to download every benchmark into the paths the pipeline expects:

scripts/prepare_datasets.sh

This performs four idempotent steps:

SWE-bench source repos — clones 15 upstream projects into dataset/repos/ at the pinned commits listed in src/dataset/repos.txt.
BBEH — downloads all 23 subsets from google-deepmind/bbeh into dataset/bbeh/benchmark_tasks/ and rebuilds bbeh_mini (20 stride-sampled tasks per subset = 460 total).
WorkBench — clones olly-styles/WorkBench and stages per-domain data.csv + test.json under dataset/workbench/<domain>/.
SWE-bench task metadata — downloads Lite (300 tasks) and Verified (500 tasks) from HuggingFace into dataset/swe_bench_{lite,verified}/.
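The bbeh_mini arithmetic above (20 tasks from each of 23 subsets) can be checked with a small sketch. It assumes "stride-sampled" means taking evenly spaced indices, which is one plausible reading; the actual sampling rule in scripts/prepare_datasets.sh may differ. The subset names and the 200-task subset size are placeholder data.

```python
def stride_sample(tasks, k=20):
    """Pick up to k evenly spaced items (assumed meaning of 'stride-sampled')."""
    if len(tasks) <= k:
        return list(tasks)
    stride = len(tasks) // k
    return [tasks[i * stride] for i in range(k)]


# Placeholder data: 23 subsets of 200 tasks each (the size is an assumption).
subsets = {f"subset_{i}": list(range(200)) for i in range(23)}
mini = [task for tasks in subsets.values() for task in stride_sample(tasks)]
print(len(mini))  # 460
```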

Useful flags:

scripts/prepare_datasets.sh --force         # re-fetch everything
scripts/prepare_datasets.sh --skip-repos    # everything except the ~GB repo clones
scripts/prepare_datasets.sh --bbeh-only     # target a single dataset family
scripts/prepare_datasets.sh --workbench-only
scripts/prepare_datasets.sh --swe-only
scripts/prepare_datasets.sh --repos-only

Running a benchmark

Three benchmark scripts live in scripts/. Each accepts a subset or variant as either a positional argument or an env var, and writes outputs under $OUTPUT_ROOT/<dataset>_main/ (OUTPUT_ROOT defaults to output_paper).

# BBEH: 24 subsets (the 23 BBEH subsets plus mini) available under dataset/bbeh/benchmark_tasks/
scripts/run_bbeh.sh mini
scripts/run_bbeh.sh boolean_expressions
SUBSET=word_sorting scripts/run_bbeh.sh

# WorkBench — 6 subdomains
scripts/run_workbench.sh email                # default
scripts/run_workbench.sh calendar
SUBDOMAIN=multi_domain scripts/run_workbench.sh

# SWE-bench — verified (default) or lite
scripts/run_swebench.sh verified
scripts/run_swebench.sh lite
VARIANT=lite scripts/run_swebench.sh

Each script auto-resolves NUM_EVAL_TASKS to the full task count of the chosen subset, domain, or variant (e.g. 460 for bbeh_mini, 500 for swe_bench_verified). To run on a smaller sample, override it via env var (see Editing parameters).
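The fallback behaviour can be sketched as follows. This is a Python rendering of the idea only; the scripts do this in shell and their exact logic may differ. `FULL_TASK_COUNTS` and `resolve_num_eval_tasks` are hypothetical names, and the table holds only the counts stated in this README.

```python
import os

# Task counts stated in this README; other datasets are left out of this sketch.
FULL_TASK_COUNTS = {
    "bbeh_mini": 460,
    "swe_bench_lite": 300,
    "swe_bench_verified": 500,
}


def resolve_num_eval_tasks(dataset, env=None):
    """Use NUM_EVAL_TASKS when it is set, otherwise the full task count."""
    env = os.environ if env is None else env
    override = env.get("NUM_EVAL_TASKS", "")
    if override:
        return int(override)
    return FULL_TASK_COUNTS[dataset]
```

With no override, `resolve_num_eval_tasks("swe_bench_verified", {})` yields the full 500 tasks; passing `{"NUM_EVAL_TASKS": "5"}` yields 5.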

There is also a batch-size ablation script:

DATASET=bbeh_mini scripts/run_ablation_batch_size.sh     # sweep BS ∈ {1, 10, 460}

Editing parameters

All tunable parameters are env vars with defaults in scripts/common.sh. Override any of them before invoking a script:

| Variable | Default | Description |
| --- | --- | --- |
| `NUM_EVAL_TASKS` | (full dataset) | Number of tasks to run |
| `MAX_STEPS` | `2` | Evolutionary iterations per batch |
| `NUM_PARENTS` | `2` | Parent configs the meta-model selects per batch |
| `SEED` | `42` | Random seed |
| `BATCH_SIZE` | `1` | Tasks per evolution batch (1 = per-query; `N` = one shared trajectory over N tasks) |
| `WORKERS` | `16` | Parallel batches (ThreadPoolExecutor) |
| `META_MODEL` | `bedrock:global.anthropic.claude-sonnet-4-5-...` | Evolutionary-operator LLM |
| `JUDGE_MODEL` | same as `META_MODEL` | LLM-as-judge for reward |
| `AGENT_MODELS` | `bedrock:us.anthropic.claude-3-5-sonnet-... bedrock:qwen.qwen3-235b-... bedrock:qwen.qwen3-coder-480b-...` | Space-separated worker model palette |
| `MEMORY_EVOLUTION` | `true` | Persist meta-model memory updates |
| `MEMORY_PATH` | (auto) | Explicit memory JSON path; empty = `dataset/<subset-path>/memory_<ts>.json` |
| `OUTPUT_ROOT` | `output_paper` | Root directory for run outputs |
| `CONDA_ENV` | `mas` | conda environment name |
| `EVOMAS_REPOS_DIR` | `dataset/repos` | Directory containing cloned SWE-bench source repos |
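To make the interaction between `BATCH_SIZE` and `WORKERS` concrete, the sketch below splits task IDs into batches and runs the batches in a thread pool. It illustrates the general pattern only, under the assumption that each batch is an independent unit of work; `make_batches`, `run_batch`, and `run_all` are hypothetical names, and `run_batch` is a placeholder for whatever the pipeline does per batch.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(task_ids, batch_size):
    # Consecutive chunks of at most batch_size tasks.
    return [task_ids[i:i + batch_size] for i in range(0, len(task_ids), batch_size)]


def run_batch(batch):
    # Placeholder for one evolutionary trajectory shared by the tasks in a batch.
    return {"tasks": batch, "num_tasks": len(batch)}


def run_all(task_ids, batch_size=1, workers=16):
    batches = make_batches(task_ids, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves batch order in the returned results.
        return list(pool.map(run_batch, batches))
```

Under this reading, 460 tasks give 460 batches at `BATCH_SIZE=1`, 46 batches at `BATCH_SIZE=10`, and a single batch at `BATCH_SIZE=460`, so `WORKERS` above the batch count adds no parallelism.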

Examples:

# Small smoke test
NUM_EVAL_TASKS=1 MAX_STEPS=1 WORKERS=1 scripts/run_bbeh.sh mini

# Larger parallel run on workbench_email
WORKERS=32 scripts/run_workbench.sh email

# Swap the worker palette to a single model
AGENT_MODELS="bedrock:us.anthropic.claude-3-5-sonnet-20241022-v2:0" \
    scripts/run_bbeh.sh boolean_expressions

# Continue a prior memory file instead of starting fresh
MEMORY_PATH=dataset/bbeh/benchmark_tasks/bbeh_mini/memory_20260423_101510.json \
    scripts/run_bbeh.sh mini

Direct `main.py` invocation

Under the hood, each script runs python main.py with the resolved arguments. For ad-hoc configurations not covered by the scripts, call main.py directly:

python main.py --dataset bbeh_boolean_expressions \
    --num-eval-tasks 50 \
    --batch-size 1 \
    --workers 8 \
    --max-steps 2 \
    --meta-model-id bedrock:global.anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --llm-as-judge bedrock:global.anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --model-list bedrock:us.anthropic.claude-3-5-sonnet-20241022-v2:0 \
                 bedrock:qwen.qwen3-235b-a22b-2507-v1:0 \
                 bedrock:qwen.qwen3-coder-480b-a35b-v1:0

python main.py --help

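Most CLI flags mirror the env-var names (--num-eval-tasks, --batch-size, --workers, --memory-path, --memory-evolution, --task-ids, etc.). The model settings are named differently: as in the example above, META_MODEL, JUDGE_MODEL, and AGENT_MODELS correspond to --meta-model-id, --llm-as-judge, and --model-list. --task-ids accepts int indices (BBEH / WorkBench) or string instance IDs (SWE-bench, e.g. astropy__astropy-12907).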
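One way a flag can accept both int indices and string instance IDs is an argparse type function that falls back to the raw string. This is an illustrative sketch, not the parsing code in main.py; `parse_task_id` and `build_parser` are hypothetical names, and the real flag may take its values in a different form (for example comma-separated).

```python
import argparse


def parse_task_id(raw):
    """Return an int for index-style IDs, otherwise the original string."""
    try:
        return int(raw)
    except ValueError:
        return raw


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumes space-separated values; the real CLI may differ.
    parser.add_argument("--task-ids", nargs="+", type=parse_task_id, default=None)
    return parser


args = build_parser().parse_args(["--task-ids", "3", "astropy__astropy-12907"])
print(args.task_ids)  # [3, 'astropy__astropy-12907']
```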
