Research prototype for query-conditional, capability-aware sampling of multi-agent operator DAGs. Built on top of bingreeky/MaAS (ICML'25 Oral).
Large language model-based multi-agent systems have demonstrated strong capabilities in mathematical reasoning and code generation, and their effectiveness is heavily influenced by the underlying agent architecture. However, manually designing effective architectures is labor-intensive and task-dependent, making automated multi-agent architecture generation an important research direction. Existing methods typically formulate this problem as a sequential, incremental construction process, which favors local architectural decisions over global planning and provides limited query-specific structural adaptation.
To address this issue, QACS formulates multi-agent architecture search as query-conditioned Directed Acyclic Graph (DAG) generation over a learned capability space. Specifically, QACS first pretrains a capability-aware representation that summarizes each agent operator's reasoning profile in a shared latent space; conditioned on this representation, a lightweight sampler then generates a tailored DAG of agent calls for each input query, allowing the resulting pipeline to adapt to per-query reasoning demands.
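As a rough illustration of what "query-conditioned DAG generation" means, the toy sketch below scores operators against a query vector, samples nodes one at a time, and only ever adds edges from earlier nodes to later ones, so the result is acyclic by construction. It is not the UnifiedDAGSampler implementation: the capability vectors, `sample_dag`, `stop_prob` and `edge_prob` are hypothetical stand-ins for the learned components.

```python
import math
import random

# Hypothetical 3-dimensional capability vectors; the real ones are learned.
CAPABILITIES = {
    "Generate":    [1.0, 0.0, 0.0],
    "GenerateCoT": [0.6, 0.8, 0.0],
    "Programmer":  [0.0, 0.3, 0.9],
    "SelfRefine":  [0.2, 0.5, 0.4],
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_dag(query_need, max_nodes=4, stop_prob=0.3, edge_prob=0.5, seed=0):
    """Sample (nodes, edges); edges always point from an earlier node to a later one."""
    rng = random.Random(seed)
    names = list(CAPABILITIES)
    # Score each operator by how well its capabilities match the query's needs.
    scores = [sum(q * c for q, c in zip(query_need, CAPABILITIES[n])) for n in names]
    probs = softmax(scores)

    nodes, edges = [], []
    for i in range(max_nodes):
        nodes.append(rng.choices(names, weights=probs, k=1)[0])
        # Parents are drawn from earlier indices only, so no cycle can form.
        parents = [j for j in range(i) if rng.random() < edge_prob]
        if i > 0 and not parents:
            parents = [i - 1]  # keep every non-root node connected
        edges.extend((j, i) for j in parents)
        # Fixed stop probability here; a learned stop head would replace this.
        if rng.random() < stop_prob:
            break
    return nodes, edges


nodes, edges = sample_dag(query_need=[0.1, 0.4, 0.9])
print(nodes, edges)
```

Changing `query_need` shifts the operator distribution, which is the per-query adaptation the paragraph above describes.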
.
├── examples/maas/
│   ├── optimize.py                        # legacy MaAS entry (prompt search)
│   └── dag_optimize.py                    # QACS entry; this is what the README uses
├── maas/
│   ├── ext/maas/
│   │   ├── models/
│   │   │   ├── unified_dag_sampler.py     # UnifiedDAGSampler (W_Q/W_K/need/stop/edge)
│   │   │   ├── dag.py                     # DAGPlan / DAGNode classes
│   │   │   └── utils.py                   # SentenceEncoder, helpers
│   │   ├── scripts/
│   │   │   ├── pretrain_capability.py     # Stage 1 pretraining CLI
│   │   │   ├── dag_optimizer.py           # Stage 2 REINFORCE driver
│   │   │   ├── dag_evaluator.py           # Per-dataset evaluation driver
│   │   │   └── optimized/
│   │   │       ├── GSM8K/train/dag_graph.py
│   │   │       ├── MATH/train/dag_graph.py
│   │   │       └── HumanEval/train/dag_graph.py
│   │   ├── benchmark/experiment_configs.py   # operator pool per dataset
│   │   └── data/                          # *.jsonl benchmark files
│   ├── actions/ prompts/ tools/ utils/    # inherited from MetaGPT
│   └── configs/
├── config/
│   ├── config2.example.yaml               # template: copy this
│   └── config2.yaml                       # YOUR keys, gitignored
├── requirements.txt
└── README.md
- Python 3.10 or newer
- A CUDA-capable GPU (tested on CUDA 11.8 with torch==2.1.0+cu118)
- sentence-transformers pulls down all-MiniLM-L6-v2 on first run
git clone git@github.com:Roderick-Stinson/QACS.git
cd QACS
# option A: pip
pip install -r requirements.txt
# option B: uv (faster, self-contained venv)
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

QACS uses the same JSONL layout as upstream MaAS. Place each dataset under
maas/ext/maas/data/:
maas/ext/maas/data/
├── gsm8k_train.jsonl
├── gsm8k_test.jsonl
├── math_test.jsonl
├── humaneval_train.jsonl
├── humaneval_test.jsonl
└── humaneval_public_test.jsonl
Each line is an object with at least these fields:
| Dataset | Fields |
|---|---|
| GSM8K | { "question": str, "answer": str, "cot": str, "id": str } |
| MATH | { "problem": str, "solution": str, "level": str, ... } |
| HumanEval | { "prompt": str, "canonical_solution": str, "test": str, "task_id": str } |
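Before a long run it can be worth checking that every record carries the fields listed in the table. The sketch below is a minimal stand-alone checker, not part of the repository; `REQUIRED_FIELDS` and `find_invalid_lines` are hypothetical names, and the field sets are copied from the table above.

```python
import json

# Required keys per dataset, copied from the table above.
REQUIRED_FIELDS = {
    "GSM8K": {"question", "answer", "cot", "id"},
    "MATH": {"problem", "solution", "level"},
    "HumanEval": {"prompt", "canonical_solution", "test", "task_id"},
}


def find_invalid_lines(lines, dataset):
    """Return (line_number, missing_keys) for each record lacking a required key."""
    required = REQUIRED_FIELDS[dataset]
    problems = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        record = json.loads(raw)
        missing = required - set(record)
        if missing:
            problems.append((number, sorted(missing)))
    return problems


sample = [
    '{"question": "2+2?", "answer": "4", "cot": "2+2=4", "id": "q1"}',
    '{"question": "3+3?", "answer": "6"}',
]
print(find_invalid_lines(sample, "GSM8K"))  # [(2, ['cot', 'id'])]
```

Because the function accepts any iterable of lines, an open file handle for one of the dataset files can be passed in place of `sample`.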
Copy the example and fill in your credentials (this file is gitignored so keys stay local):
cp config/config2.example.yaml config/config2.yaml
$EDITOR config/config2.yaml

Minimum required:
llm:
  api_type: "openai"
  model: "gpt-4o-mini"
  base_url: "" # or your gateway URL
  api_key: "sk-..." # your key
models:
  gpt-4o-mini:
    api_type: "openai"
    model: "gpt-4o-mini"
    base_url: ""
    api_key: "sk-..."

--opt_model_name and --exec_model_name on the CLI below are looked up in
the models: block.
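The lookup can be pictured as a plain dictionary access on the parsed config. The sketch below assumes the YAML has already been loaded into a dict; `resolve_model` is a hypothetical helper for illustration, not the project's actual loader.

```python
# Parsed form of the config2.yaml shown above (keys left as placeholders).
config = {
    "llm": {
        "api_type": "openai",
        "model": "gpt-4o-mini",
        "base_url": "",
        "api_key": "sk-...",
    },
    "models": {
        "gpt-4o-mini": {
            "api_type": "openai",
            "model": "gpt-4o-mini",
            "base_url": "",
            "api_key": "sk-...",
        },
    },
}


def resolve_model(config, name):
    """Hypothetical lookup: return the entry in the models: block named `name`."""
    models = config.get("models", {})
    if name not in models:
        known = ", ".join(sorted(models)) or "<none>"
        raise KeyError(f"model {name!r} is not in the models: block (known: {known})")
    return models[name]


print(resolve_model(config, "gpt-4o-mini")["model"])
```

The practical consequence is that a name passed on the command line needs a matching key under models:, even when the same model is already configured under llm:.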
python -m maas.ext.maas.scripts.pretrain_capability \
--k 8 \
--epochs-stage1 50 \
--epochs-stage2 20 \
--output pretrained_sampler.pt

This writes pretrained_sampler.pt in the project root (containing
W_Q / W_K / need_mlp / stop_net / W_edge). You only need to do this once;
all three benchmarks reuse the same file.
python -m examples.maas.dag_optimize \
--dataset GSM8K --round 1 --sample 4 \
--exec_model_name gpt-4o-mini \
--lr 0.01 --max_nodes 8

Checkpoints, CSV results and per-query traces are written under
maas/ext/maas/scripts/optimized/GSM8K/train/round_1/.
python -m examples.maas.dag_optimize \
--dataset GSM8K --round 1 --sample 4 \
--exec_model_name gpt-4o-mini --is_test

Test-time traces land under .../optimized/GSM8K/test/round_1/traces/.
TBD. The Stage-2 training pipeline is currently under diagnostic review. Early rounds show signs of a query-invariant routing prior: the capability matrix C is differentiated across operators but nearly constant across queries, and stop_net tends to terminate the DAG at a single node. We are deliberately withholding numbers until the routing is validated, to avoid publishing misleading scores dominated by the fallback Programmer/Test verification operators.
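One simple way to quantify this symptom is to compare how much the capability scores vary across queries with how much they vary across operators. The sketch below is an illustrative check, not the diagnostic used in this repository; `routing_variance` is a hypothetical helper and the matrix is made-up toy data, not a measured result.

```python
from statistics import mean, pvariance


def routing_variance(C):
    """C[q][o] is the score of operator o for query q.

    Returns (across_queries, across_operators): the mean variance of each
    operator's score over queries, and the mean variance over operators
    within a single query.
    """
    num_ops = len(C[0])
    across_queries = mean(pvariance([row[o] for row in C]) for o in range(num_ops))
    across_operators = mean(pvariance(row) for row in C)
    return across_queries, across_operators


# Toy matrix: operators differ from one another, but the rows are nearly identical.
C = [
    [0.90, 0.10, 0.50],
    [0.91, 0.11, 0.49],
    [0.89, 0.10, 0.51],
]
across_queries, across_operators = routing_variance(C)
print(across_queries, across_operators)
```

An across-queries variance that is orders of magnitude below the across-operators variance is the pattern described above: the sampler separates operators but barely reacts to the query.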
Operators are defined per task type in maas/ext/maas/benchmark/experiment_configs.py:
| Task type | Operators available to the sampler |
|---|---|
| Math (GSM8K, MATH) | Generate, GenerateCoT, MultiGenerateCoT, ScEnsemble, Programmer, SelfRefine |
| Code (HumanEval) | Generate, GenerateCoT, MultiGenerateCoT, ScEnsemble, Test, SelfRefine |
EarlyStop is defined in the pool but filtered out by DAGOptimizer —
stopping is now handled by stop_net instead.
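The filtering step amounts to dropping one name from the pool before it reaches the sampler. The sketch below shows the idea on the math pool; `sampler_operators` is a hypothetical helper, and the position of EarlyStop in the list is an assumption, since only its presence in the pool is stated above.

```python
# Math pool from the table above, plus EarlyStop (its position is assumed).
MATH_POOL = [
    "Generate", "GenerateCoT", "MultiGenerateCoT",
    "ScEnsemble", "Programmer", "SelfRefine", "EarlyStop",
]


def sampler_operators(pool, excluded=("EarlyStop",)):
    """Return the operators exposed to the sampler, preserving pool order."""
    return [op for op in pool if op not in excluded]


print(sampler_operators(MATH_POOL))
```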
QACS is a research fork of bingreeky/MaAS
(ICML'25 Oral); the dataset pipelines, operator implementations, benchmark
wrappers and most of the execution infrastructure are inherited directly from
it. The capability sampler (UnifiedDAGSampler), the two-stage training
(pretrain_capability.py, dag_optimizer.py) and the per-dataset
DAGWorkflow wrappers are the contributions of this branch.
Like the upstream, QACS also uses prompts and operator designs adapted from ADAS, AgentSquare and AFLOW.