The Structured Output Benchmark

SOB · A multi-source benchmark for evaluating structured-output quality in LLMs

Paper · HF Dataset · Blog · Leaderboard · Interfaze · License

Leaderboard · Quickstart · Installation · Inference · Evaluation · Submit a model · Citation


SOB measures value-level correctness of LLM-generated JSON, not just whether the JSON is valid. We evaluate models across three source modalities — text, images, and audio — under a single unified evaluation framework.
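
For intuition, here is a toy illustration (not the benchmark's evaluation code; the field names are made up): an output can parse and match the expected structure while still being wrong at the value level, which is exactly what SOB penalizes.

import json

# Toy example: the model output is valid JSON with the right keys,
# but one value is wrong; value-level scoring catches this.
gold = {"invoice_id": "INV-0042", "total": 129.50, "currency": "USD"}
model_output = '{"invoice_id": "INV-0042", "total": 1295.0, "currency": "USD"}'

parsed = json.loads(model_output)                    # JSON-validity check passes
keys_ok = set(parsed) == set(gold)                   # structural check passes
values_ok = all(parsed[k] == gold[k] for k in gold)  # fails on "total"
print(keys_ok, values_ok)  # True False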

🏆 Leaderboard

Top 5 by Overall (coverage-adjusted aggregate across text + image + audio). The full live leaderboard is on the SOB Leaderboard Space — it auto-updates whenever a model PR lands.

| Rank | Model | Overall | Val. Acc. | Faithful. | JSON Pass | Path Rec. | Str. Cov. | Type Saf. | Perfect |
|------|-------|---------|-----------|-----------|-----------|-----------|-----------|-----------|---------|
| 1 | GPT-5.4 | 0.870 | 0.798 | 0.869 | 0.993 | 0.988 | 0.981 | 0.993 | 0.469 |
| 2 | GLM-4.7 | 0.861 | 0.804 | 0.868 | 0.965 | 0.959 | 0.957 | 0.965 | 0.508 |
| 3 | Qwen3.5-35B | 0.861 | 0.801 | 0.863 | 0.969 | 0.962 | 0.960 | 0.969 | 0.500 |
| 4 | Gemini-2.5-Flash | 0.860 | 0.796 | 0.856 | 0.972 | 0.967 | 0.961 | 0.972 | 0.498 |
| 5 | Qwen3-235B | 0.857 | 0.786 | 0.854 | 0.978 | 0.970 | 0.968 | 0.978 | 0.463 |

Per-modality bests: text 0.830 (GLM-4.7) · image 0.672 (Gemma-4-31B) · audio 0.237 (Gemini-2.5-Flash) — see paper Tables 2–4. Perfect Response is aggregated over text + image only.

All 21 rows + per-modality leaderboards → interfaze-ai/sob-leaderboard

Quickstart

Load the dataset directly:

from datasets import load_dataset
text  = load_dataset("interfaze-ai/sob", split="test")           # 5,000 records
image = load_dataset("interfaze-ai/sob", "image", split="train") #   209 records
audio = load_dataset("interfaze-ai/sob", "audio", split="train") #   115 records
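
Before a full run it can help to peek at a record. The snippet below is a small sketch that prints the keys rather than assuming any particular field names, since those differ per config:

from datasets import load_dataset

# Preview one text record; print the keys instead of hard-coding field names.
text = load_dataset("interfaze-ai/sob", split="test")
record = text[0]
print(sorted(record.keys()))
print({k: str(v)[:80] for k, v in record.items()})  # truncated preview of each value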

Or run a 5-record smoke test end-to-end:

git clone https://github.com/JigsawStack/sob && cd sob
make install
export OPENROUTER_API_KEY=...
python -m sob.run --provider openrouter --modality text \
    --model-id google/gemma-4-31b-it --sample-size 5
python evaluate.py data/text_responses/response_google_gemma-4-31b-it.jsonl

Installation

Python 3.12, clean virtualenv:

git clone https://github.com/JigsawStack/sob && cd sob
uv venv && source .venv/bin/activate
make install

make install uses uv sync if available, otherwise falls back to pip install -r requirements.txt. Other targets:

make format    # ruff format .
make lint      # ruff check .

For local vLLM inference (NVIDIA GPU, CUDA 12.8, ≥ 24 GB VRAM):

uv pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128

API keys

export OPENROUTER_API_KEY=...
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export HF_TOKEN=...               # only if the dataset is private
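
A small pre-flight check like the one below (illustrative only, not part of the repo) avoids starting a run with a missing key for the provider you chose:

import os
import sys

# Map each --provider choice to the environment variable it needs.
REQUIRED_KEY = {
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

provider = sys.argv[1] if len(sys.argv) > 1 else "openrouter"
key = REQUIRED_KEY.get(provider)
if key is None:
    sys.exit(f"unknown provider '{provider}'")
if not os.environ.get(key):
    sys.exit(f"{key} is not set; export it before running inference with '{provider}'")
print(f"{key} is set; '{provider}' runs should authenticate")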

Git LFS

Response files and per-model evaluations under data/ are LFS-tracked:

git lfs install

Running inference

--modality text runs the test split (5,000 records); image and audio each use the single train split of their config (209 and 115 records, respectively). Omit --sample-size for the full run.

OpenRouter:

python -m sob.run --provider openrouter --modality text \
  --model-id google/gemma-4-31b-it --sample-size 100

OpenAI:

python -m sob.run --provider openai --modality image --model-id gpt-5

Anthropic:

python -m sob.run --provider anthropic --modality audio --model-id claude-sonnet-4-6

Gemini:

python -m sob.run --provider gemini --modality text --model-id gemini-2.5-flash

vLLM (open-weight, your GPU):

python -m sob.run --provider vllm --modality text \
  --model-id Qwen/Qwen3.5-35B-A3B --use-structured-decoding

--use-structured-decoding is the schema-constrained ablation from paper §6.2; the headline leaderboard runs without it.
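
For a feel of what schema-constrained decoding does, here is a generic vLLM guided-decoding sketch, not the repo's --use-structured-decoding implementation: vLLM's OpenAI-compatible server (started with vllm serve) accepts a JSON schema via the guided_json extra parameter and constrains generation to it.

from openai import OpenAI

# Generic sketch against a local vLLM OpenAI-compatible server; the model name
# and question are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["name", "year"],
}

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",  # whatever model the server is hosting
    messages=[{"role": "user", "content": "Extract the tool name and release year."}],
    extra_body={"guided_json": schema},  # vLLM constrains output to this schema
)
print(resp.choices[0].message.content)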

Outputs:

  • data/text_responses/response_<model>.jsonl
  • data/images_responses/response_<model>_image.jsonl
  • data/audio_responses/response_<model>_audio.jsonl
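
To sanity-check a run before full evaluation, you can count how many raw responses parse as JSON at all. The per-record field name below is an assumption; inspect one line of your JSONL and adjust if it differs.

import json

# Rough parse-rate check over a response file.
path = "data/text_responses/response_google_gemma-4-31b-it.jsonl"
total = parsed = 0
with open(path) as f:
    for line in f:
        record = json.loads(line)
        total += 1
        try:
            json.loads(record.get("response", ""))  # assumed field name
            parsed += 1
        except (json.JSONDecodeError, TypeError):
            pass
print(f"{parsed}/{total} responses parse as JSON")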

Evaluation

Score a single response file:

python evaluate.py data/text_responses/response_google_gemma-4-31b-it.jsonl

Produces data/evaluation/<modality>/<model>/{eval_records.jsonl, eval_summary.json} — every paper number is reproducible from these summaries. Or score a whole directory:

python evaluate.py data/text_responses/                # all response_*.jsonl
python evaluate.py data/audio_responses/ --modality audio
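
For a quick local view of the summaries without building the full leaderboard, a short sketch like this walks the evaluation tree and prints whatever numeric metrics each summary exposes, without assuming their names:

import json
from pathlib import Path

# Walk data/evaluation/<modality>/<model>/eval_summary.json and print the
# top-level numeric metrics found in each file.
for summary_path in sorted(Path("data/evaluation").glob("*/*/eval_summary.json")):
    modality, model = summary_path.parts[-3], summary_path.parts[-2]
    summary = json.loads(summary_path.read_text())
    metrics = {k: v for k, v in summary.items() if isinstance(v, (int, float))}
    print(modality, model, metrics)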

Submitting a new model

The leaderboard is rebuilt from data/evaluation/ by scripts/build_leaderboard.py on every push to main, published to the interfaze-ai/sob-leaderboard dataset, and rendered by the Space.

  1. Fork, run inference + evaluate.py for one or more modalities, and drop the resulting eval_summary.json files into data/evaluation/{text,image,audio}/<your_model_dir>/.
  2. Add an entry for <your_model_dir> in data/evaluation/display_names.json. The _comment key is ignored — paste your "<dir>": "<Pretty Name>" alongside the others (a small helper sketch follows this list).
  3. Open a PR — CI builds the leaderboard JSON and posts a top-10 preview comment to verify the row before merge.
  4. On merge to main, the publish workflow uploads a fresh leaderboard.json to the dataset and the Space picks it up.
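
A minimal way to add the display-name entry from step 2 without hand-editing JSON; the directory and pretty name below are placeholders for your own model:

import json
from pathlib import Path

# Insert a "<dir>": "<Pretty Name>" mapping into display_names.json.
path = Path("data/evaluation/display_names.json")
names = json.loads(path.read_text())
names["my_model_dir"] = "My Model 7B"  # must match data/evaluation/*/my_model_dir/
path.write_text(json.dumps(names, indent=2) + "\n")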

Preview locally before opening a PR:

python scripts/build_leaderboard.py --output leaderboard.json

License

MIT License. Source datasets retain their original licenses: HotpotQA (CC-BY-SA-4.0), AMI Meeting Corpus (CC-BY-4.0), olmOCR-bench / olmOCR (ODC-BY / Apache-2.0).

Acknowledgments

Thanks to the HotpotQA team, the AMI Meeting Corpus team, and the Allen AI olmOCR team for releasing their datasets.

Contact

Open an issue or reach the authors at {abhinav, harsha, yoeven, vineet}@interfaze.ai.
