Failure Atlas

A tree-search framework for systematically discovering failure modes of text-to-image (T2I) models.

For each (entity × attribute) combination, Failure Atlas generates images, evaluates them with a vision-language model (VLM), prunes branches whose ancestors already fail, and (optionally) trains an online predictor that prioritizes likely-failing branches first.

Paper: FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration · Accepted to CVPR 2026 · arXiv:2509.21995

Mainstream T2I models (e.g., SDXL, FLUX) are typically evaluated only on benchmarks. While such evaluations can reflect overall model performance, they provide limited insight for improving the model or its training data. FailureAtlas focuses on model failures and formulates failure discovery as a search problem. It explores a structured (entity × attribute) space, uses a VLM to assess each generated image, prunes branches that consistently fail, and employs an online predictor to allocate compute toward branches that are most likely to expose errors. In this way, FailureAtlas efficiently constructs a much broader “failure map” specific to a given model.

Each failure discovered by the framework corresponds to a minimal (entity + attribute) combination that reliably triggers poor model behavior when used as input. By systematically uncovering these failure patterns, FailureAtlas helps developers better understand model capability boundaries and provides actionable guidance for data refinement and model improvement. This repository contains the open-source implementation of the full pipeline.

中文文档见 README_zh.md

Example Failure Cases

Highlights

Service-oriented: Image generation and VLM evaluation run as standalone HTTP services. The driver script is a thin Python client that does multi-IP round-robin and concurrent requests, so you can scale by launching more servers and keep the driver host lightweight.
Pluggable T2I backends: Ships with reference servers for sdxl-turbo and flux. Add a new model by writing one adapter class.
Online predictor (optional): A small Transformer scores each candidate node from T5 attribute embeddings; the search visits low-score (likely-failing) branches first. Re-trained every train_interval_nodes.
Boolean-list evaluation: Strict VLM prompt → returns [True, False, ...] for entity + each attribute, easy to extend to new attribute types.
Resumable: Search results are appended to JSONL atomically; restart skips finished nodes.

Architecture

              ┌────────────────────────────────────────────────┐
              │            scripts/run_search.py               │
              │                                                │
              │  build_tree → for each layer:                  │
              │   ├─ predictor.predict()  (optional, sort)     │
              │   ├─ t2i_client.generate(prompts)  ─→ HTTP     │
              │   ├─ vlm_client.evaluate(images, attrs)─→ HTTP │
              │   └─ prune & save (jsonl)                      │
              │                                                │
              │  every N nodes: predictor.train(history)       │
              └────────────────────────────────────────────────┘
                       ↓ HTTP                ↓ HTTP (round-robin)
              ┌──────────────────┐  ┌──────────────────────────┐
              │  T2I server(s)   │  │  VLM server(s)           │
              │  servers/t2i_*.py│  │  user-deployed vLLM      │
              │  POST /generate  │  │  /v1/chat/completions    │
              └──────────────────┘  └──────────────────────────┘

Installation

git clone <this-repo> failure_atlas
cd failure_atlas
pip install -r requirements.txt

CPU-only Smoke Test

Before launching real GPU services, you can run the bundled mock T2I/VLM demo to verify the search loop end to end:

bash examples/run_demo.sh

This starts local mock servers, runs scripts.run_search on examples/prompt_demo.jsonl, and writes a tiny JSONL result under outputs/demo/.

Quick Start

1. Launch T2I service(s) (one process per GPU)

Single GPU:

# sdxl-turbo on GPU 0, port 9000
CUDA_VISIBLE_DEVICES=0 python -m servers.t2i_server \
  --model sdxl-turbo --model-path stabilityai/sdxl-turbo \
  --host 0.0.0.0 --port 9000

Multiple GPUs at once (one process per GPU, sequential ports):

NUM_GPUS=4 START_PORT=9000 \
MODEL_NAME=sdxl-turbo \
MODEL_PATH=stabilityai/sdxl-turbo \
  bash servers/launch_t2i.sh
# logs are written under logs/t2i_<model>_<port>.log

Swap MODEL_NAME=flux MODEL_PATH=black-forest-labs/FLUX.1-schnell for FLUX, or run two batches on disjoint GPUs/ports to mix backends.

2. Launch a VLM service (any OpenAI-compatible vLLM)

CUDA_VISIBLE_DEVICES=2,3 vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --port 8080 --served-model-name Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 2 --dtype bfloat16

A convenience wrapper with the same defaults is also provided:

MODEL=Qwen/Qwen2.5-VL-72B-Instruct TP=2 PORT=8080 bash servers/launch_vlm.sh

The VLM endpoint must accept image inputs via the OpenAI chat schema; a text-only LLM will fail every evaluation request.

3. Edit `configs/default.yaml` (and the included sub-configs)

Replace the <T2I_HOST_*> / <VLM_HOST> placeholders in configs/t2i_server.yaml :: servers and configs/vlm_client.yaml :: servers with the IPs/ports of the services launched above. Each T2I server's model field must match a key under defaults: in the same file (e.g. sdxl-turbo, flux).

4. (Optional) Pre-compute T5 embeddings for the predictor

The online predictor in configs/predictor.yaml needs a one-time embedding cache over your corpus. Skip this step if you set predictor.enable: false.

python -m scripts.build_t5_emb \
  --corpus-dir corpus \
  --t5-model t5-base \
  --output outputs/predictor/t5_emb.pkl

Rebuild only when the corpus (entities or attribute values) changes.

5. Run the search

python -m scripts.run_search --config configs/default.yaml

Results are appended as JSONL to outputs/search/<task_name>.jsonl, where <task_name> is taken from task_name: in the YAML (override on the CLI with --task-name my_run). Re-running the same task resumes from where it left off as long as io.resume: true.

To split the workload across N independent driver hosts (e.g. several login nodes pointing at the same service pool), pass --total-tasks N --task-id i (i ∈ [0, N)). Each shard processes a disjoint slice of the tree and writes to its own output file.

Traversal order. The included configs/default.yaml uses layer-major traversal: finish layer N across every entity before moving one layer deeper, which is useful for broad-coverage runs. To finish all layers of one entity before moving to the next, switch to entity-major:

python -m scripts.run_search --config configs/default.yaml --traversal entity
# equivalent to setting search.traversal: "entity" in configs/default.yaml

Both modes share the same parent-pruning, resume cache and predictor. In layer-major mode the predictor ranks candidates across all entities at once, but the resulting jsonl rows are interleaved across entities.

Configuration

Four small YAMLs in configs/:

File	Purpose
`default.yaml`	Main search parameters (corpus paths, batch size, prune threshold, `traversal: entity\|layer`, output dir, sharding).
`t2i_server.yaml`	List of T2I endpoints + per-model defaults (steps, guidance, server_batch_size, resolution).
`vlm_client.yaml`	List of VLM endpoints + client knobs: `workers`, `timeout`, `temperature`, `top_k`, `enable_thinking`, `image_max_side` (long-edge resize before upload — keeps tokens bounded).
`predictor.yaml`	Online predictor: `enable`, `t5_emb_file`, `warmup_nodes`, `train_interval_nodes`, `lr`, `epochs_per_train`, checkpoint dir.

Corpus

Files under corpus/:

File	Shipped?	Description
`entity_en.json`	yes	Taxonomy: `upper_class → sub_class → [entity, ...]`.
`entity_en_sample.json`	yes	One representative entity per sub_class (default search set).
`attribute_en_{base,background,image}.json`	yes	Attribute schema (Base / Background / Image families).
`match.json`	yes (~1.6 MB)	Compact whitelist `upper → sub → entity → attr → [value, ...]` that prunes implausible Base-attribute combinations. Auto-loaded by `tools/build_corpus.py`.
`prompt.json`	no (≥100 MB)	Pre-built search tree (one JSON node per line). Either download or rebuild locally.

Get `prompt.json`

Option A — Download. Place the file at corpus/prompt.json. The release link will be announced alongside the paper.

Option B — Rebuild locally. All required schemas (entity_en.json, attribute_en_*.json, match.json) are already shipped, so you can go straight to build_corpus:

python -m tools.build_corpus \
    --corpus-dir corpus \
    --output corpus/prompt.json \
    --use-sample-only

If you also want to regenerate match.json (e.g. after editing the entity/attribute schema), run the LLM-scoring stage first — it requires a configured VLM/LLM endpoint:

python -m tools.build_match \
    --corpus-dir corpus \
    --vlm-config configs/vlm_client.yaml \
    --log-file outputs/build_match/scores.jsonl \
    --output corpus/match.json \
    --use-sample-only --workers 32

Full options and resume/aggregate flows are documented in tools/README.md.

Output schema (per node)

{
  "idx": 123,
  "entity": "Apple",
  "value": [["Base: Colors", "Red"], ["Background: Time/Period", "Night"]],
  "parent": [12],
  "layer": 2,
  "upper_class": "...", "sub_class": "...",
  "caption": "An image of a red apple. The background time is night.",
  "valid": true,
  "tested": true,
  "acc": 0.625,
  "eval": [[true, true, false], [true, false, true], ...],
  "eval_raw": [[true, true, false], ...],
  "predictor_score": 0.71
}

acc <= prune_threshold → marked valid=false; descendants are pruned without generation.

Citation

If you find this work useful, please cite:

@article{chen2025failureatlas,
  title   = {FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration},
  author  = {Chen, Muxi and Zhang, Zhaohua and Zhao, Chenchen and Chen, Mingyang and
             Jiang, Wenyu and Jiang, Tianwen and Zhuo, Jianhuan and Tang, Yu and
             Xiao, Qiuyong and Zhang, Jihong and others},
  journal = {arXiv preprint arXiv:2509.21995},
  year    = {2025}
}

To appear at CVPR 2026.

License

Apache-2.0. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Failure Atlas

Example Failure Cases

Highlights

Architecture

Installation

CPU-only Smoke Test

Quick Start

1. Launch T2I service(s) (one process per GPU)

2. Launch a VLM service (any OpenAI-compatible vLLM)

3. Edit `configs/default.yaml` (and the included sub-configs)

4. (Optional) Pre-compute T5 embeddings for the predictor

5. Run the search

Configuration

Corpus

Get `prompt.json`

Output schema (per node)

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
asserts		asserts
configs		configs
corpus		corpus
examples		examples
scripts		scripts
servers		servers
src		src
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Failure Atlas

Example Failure Cases

Highlights

Architecture

Installation

CPU-only Smoke Test

Quick Start

1. Launch T2I service(s) (one process per GPU)

2. Launch a VLM service (any OpenAI-compatible vLLM)

3. Edit configs/default.yaml (and the included sub-configs)

4. (Optional) Pre-compute T5 embeddings for the predictor

5. Run the search

Configuration

Corpus

Get prompt.json

Output schema (per node)

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

3. Edit `configs/default.yaml` (and the included sub-configs)

Get `prompt.json`

Packages