A tree-search framework for systematically discovering failure modes of text-to-image (T2I) models.
For each (entity × attribute) combination, Failure Atlas generates images, evaluates them with a vision-language model (VLM), prunes branches whose ancestors already fail, and (optionally) trains an online predictor that prioritizes likely-failing branches first.
Paper: FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration · Accepted to CVPR 2026 · arXiv:2509.21995
Mainstream T2I models (e.g., SDXL, FLUX) are typically evaluated only on benchmarks. While such evaluations can reflect overall model performance, they provide limited insight for improving the model or its training data. FailureAtlas focuses on model failures and formulates failure discovery as a search problem. It explores a structured (entity × attribute) space, uses a VLM to assess each generated image, prunes branches that consistently fail, and employs an online predictor to allocate compute toward branches that are most likely to expose errors. In this way, FailureAtlas efficiently constructs a much broader “failure map” specific to a given model.
Each failure discovered by the framework corresponds to a minimal (entity + attribute) combination that reliably triggers poor model behavior when used as input. By systematically uncovering these failure patterns, FailureAtlas helps developers better understand model capability boundaries and provides actionable guidance for data refinement and model improvement. This repository contains the open-source implementation of the full pipeline.
中文文档见 README_zh.md
- Service-oriented: Image generation and VLM evaluation run as standalone HTTP services. The driver script is a thin Python client that does multi-IP round-robin and concurrent requests, so you can scale by launching more servers and keep the driver host lightweight.
- Pluggable T2I backends: Ships with reference servers for
sdxl-turboandflux. Add a new model by writing one adapter class. - Online predictor (optional): A small Transformer scores each candidate node from T5 attribute embeddings; the search visits low-score (likely-failing) branches first. Re-trained every
train_interval_nodes. - Boolean-list evaluation: Strict VLM prompt → returns
[True, False, ...]for entity + each attribute, easy to extend to new attribute types. - Resumable: Search results are appended to JSONL atomically; restart skips finished nodes.
┌────────────────────────────────────────────────┐
│ scripts/run_search.py │
│ │
│ build_tree → for each layer: │
│ ├─ predictor.predict() (optional, sort) │
│ ├─ t2i_client.generate(prompts) ─→ HTTP │
│ ├─ vlm_client.evaluate(images, attrs)─→ HTTP │
│ └─ prune & save (jsonl) │
│ │
│ every N nodes: predictor.train(history) │
└────────────────────────────────────────────────┘
↓ HTTP ↓ HTTP (round-robin)
┌──────────────────┐ ┌──────────────────────────┐
│ T2I server(s) │ │ VLM server(s) │
│ servers/t2i_*.py│ │ user-deployed vLLM │
│ POST /generate │ │ /v1/chat/completions │
└──────────────────┘ └──────────────────────────┘
git clone <this-repo> failure_atlas
cd failure_atlas
pip install -r requirements.txtBefore launching real GPU services, you can run the bundled mock T2I/VLM demo to verify the search loop end to end:
bash examples/run_demo.shThis starts local mock servers, runs scripts.run_search on examples/prompt_demo.jsonl, and writes a tiny JSONL result under outputs/demo/.
Single GPU:
# sdxl-turbo on GPU 0, port 9000
CUDA_VISIBLE_DEVICES=0 python -m servers.t2i_server \
--model sdxl-turbo --model-path stabilityai/sdxl-turbo \
--host 0.0.0.0 --port 9000Multiple GPUs at once (one process per GPU, sequential ports):
NUM_GPUS=4 START_PORT=9000 \
MODEL_NAME=sdxl-turbo \
MODEL_PATH=stabilityai/sdxl-turbo \
bash servers/launch_t2i.sh
# logs are written under logs/t2i_<model>_<port>.logSwap MODEL_NAME=flux MODEL_PATH=black-forest-labs/FLUX.1-schnell for FLUX, or run two batches on disjoint GPUs/ports to mix backends.
CUDA_VISIBLE_DEVICES=2,3 vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--port 8080 --served-model-name Qwen2.5-VL-72B-Instruct \
--tensor-parallel-size 2 --dtype bfloat16A convenience wrapper with the same defaults is also provided:
MODEL=Qwen/Qwen2.5-VL-72B-Instruct TP=2 PORT=8080 bash servers/launch_vlm.shThe VLM endpoint must accept image inputs via the OpenAI chat schema; a text-only LLM will fail every evaluation request.
Replace the <T2I_HOST_*> / <VLM_HOST> placeholders in configs/t2i_server.yaml :: servers and configs/vlm_client.yaml :: servers with the IPs/ports of the services launched above. Each T2I server's model field must match a key under defaults: in the same file (e.g. sdxl-turbo, flux).
The online predictor in configs/predictor.yaml needs a one-time embedding cache over your corpus. Skip this step if you set predictor.enable: false.
python -m scripts.build_t5_emb \
--corpus-dir corpus \
--t5-model t5-base \
--output outputs/predictor/t5_emb.pklRebuild only when the corpus (entities or attribute values) changes.
python -m scripts.run_search --config configs/default.yamlResults are appended as JSONL to outputs/search/<task_name>.jsonl, where <task_name> is taken from task_name: in the YAML (override on the CLI with --task-name my_run). Re-running the same task resumes from where it left off as long as io.resume: true.
To split the workload across N independent driver hosts (e.g. several login nodes pointing at the same service pool), pass --total-tasks N --task-id i (i ∈ [0, N)). Each shard processes a disjoint slice of the tree and writes to its own output file.
Traversal order. The included configs/default.yaml uses layer-major traversal: finish layer N across every entity before moving one layer deeper, which is useful for broad-coverage runs. To finish all layers of one entity before moving to the next, switch to entity-major:
python -m scripts.run_search --config configs/default.yaml --traversal entity
# equivalent to setting search.traversal: "entity" in configs/default.yamlBoth modes share the same parent-pruning, resume cache and predictor. In layer-major mode the predictor ranks candidates across all entities at once, but the resulting jsonl rows are interleaved across entities.
Four small YAMLs in configs/:
| File | Purpose |
|---|---|
default.yaml |
Main search parameters (corpus paths, batch size, prune threshold, traversal: entity|layer, output dir, sharding). |
t2i_server.yaml |
List of T2I endpoints + per-model defaults (steps, guidance, server_batch_size, resolution). |
vlm_client.yaml |
List of VLM endpoints + client knobs: workers, timeout, temperature, top_k, enable_thinking, image_max_side (long-edge resize before upload — keeps tokens bounded). |
predictor.yaml |
Online predictor: enable, t5_emb_file, warmup_nodes, train_interval_nodes, lr, epochs_per_train, checkpoint dir. |
Files under corpus/:
| File | Shipped? | Description |
|---|---|---|
entity_en.json |
yes | Taxonomy: upper_class → sub_class → [entity, ...]. |
entity_en_sample.json |
yes | One representative entity per sub_class (default search set). |
attribute_en_{base,background,image}.json |
yes | Attribute schema (Base / Background / Image families). |
match.json |
yes (~1.6 MB) | Compact whitelist upper → sub → entity → attr → [value, ...] that prunes implausible Base-attribute combinations. Auto-loaded by tools/build_corpus.py. |
prompt.json |
no (≥100 MB) | Pre-built search tree (one JSON node per line). Either download or rebuild locally. |
Option A — Download. Place the file at corpus/prompt.json. The release link will be announced alongside the paper.
Option B — Rebuild locally. All required schemas (entity_en.json, attribute_en_*.json, match.json) are already shipped, so you can go straight to build_corpus:
python -m tools.build_corpus \
--corpus-dir corpus \
--output corpus/prompt.json \
--use-sample-onlyIf you also want to regenerate match.json (e.g. after editing the entity/attribute schema), run the LLM-scoring stage first — it requires a configured VLM/LLM endpoint:
python -m tools.build_match \
--corpus-dir corpus \
--vlm-config configs/vlm_client.yaml \
--log-file outputs/build_match/scores.jsonl \
--output corpus/match.json \
--use-sample-only --workers 32Full options and resume/aggregate flows are documented in tools/README.md.
{
"idx": 123,
"entity": "Apple",
"value": [["Base: Colors", "Red"], ["Background: Time/Period", "Night"]],
"parent": [12],
"layer": 2,
"upper_class": "...", "sub_class": "...",
"caption": "An image of a red apple. The background time is night.",
"valid": true,
"tested": true,
"acc": 0.625,
"eval": [[true, true, false], [true, false, true], ...],
"eval_raw": [[true, true, false], ...],
"predictor_score": 0.71
}acc <= prune_threshold → marked valid=false; descendants are pruned without generation.
If you find this work useful, please cite:
@article{chen2025failureatlas,
title = {FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration},
author = {Chen, Muxi and Zhang, Zhaohua and Zhao, Chenchen and Chen, Mingyang and
Jiang, Wenyu and Jiang, Tianwen and Zhuo, Jianhuan and Tang, Yu and
Xiao, Qiuyong and Zhang, Jihong and others},
journal = {arXiv preprint arXiv:2509.21995},
year = {2025}
}To appear at CVPR 2026.
Apache-2.0. See LICENSE.
