Leaderboard · Quickstart · Installation · Inference · Evaluation · Submit a model · Citation
SOB measures value-level correctness of LLM-generated JSON, not just whether the JSON is valid. We evaluate models across three source modalities — text, images, and audio — under a single unified evaluation framework.
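To make that concrete: a response can parse as valid JSON, match the schema, and still be wrong where it counts. A minimal sketch of value-level scoring (illustrative only; not SOB's actual scorer, whose metrics appear in the table below):

```python
# Minimal sketch of value-level scoring (illustrative only; not SOB's scorer).
# A response can be valid JSON with the right structure and still be wrong
# at the value level.
import json

def leaves(obj, path=""):
    """Yield (json_path, leaf_value) pairs for a nested JSON object."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from leaves(v, f"{path}.{k}" if path else k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from leaves(v, f"{path}[{i}]")
    else:
        yield path, obj

gold = {"invoice": {"total": 412.50, "currency": "USD"}}
pred = json.loads('{"invoice": {"total": 41.25, "currency": "USD"}}')  # parses fine

gold_leaves, pred_leaves = dict(leaves(gold)), dict(leaves(pred))
correct = sum(pred_leaves.get(p) == v for p, v in gold_leaves.items())
print(f"value accuracy: {correct}/{len(gold_leaves)}")  # 1/2 despite valid JSON
```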
## Leaderboard

Top 5 by Overall (coverage-adjusted aggregate across text + image + audio). The full live leaderboard is on the SOB Leaderboard Space; it auto-updates whenever a model PR lands.
| Rank | Model | Overall | Val. Acc. | Faithful. | JSON Pass | Path Rec. | Str. Cov. | Type Saf. | Perfect |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | 0.870 | 0.798 | 0.869 | 0.993 | 0.988 | 0.981 | 0.993 | 0.469 |
| 2 | GLM-4.7 | 0.861 | 0.804 | 0.868 | 0.965 | 0.959 | 0.957 | 0.965 | 0.508 |
| 3 | Qwen3.5-35B | 0.861 | 0.801 | 0.863 | 0.969 | 0.962 | 0.960 | 0.969 | 0.500 |
| 4 | Gemini-2.5-Flash | 0.860 | 0.796 | 0.856 | 0.972 | 0.967 | 0.961 | 0.972 | 0.498 |
| 5 | Qwen3-235B | 0.857 | 0.786 | 0.854 | 0.978 | 0.970 | 0.968 | 0.978 | 0.463 |
Per-modality bests: text 0.830 (GLM-4.7) · image 0.672 (Gemma-4-31B) · audio 0.237 (Gemini-2.5-Flash) — see paper Tables 2–4. Perfect Response is aggregated over text + image only.
All 21 rows + per-modality leaderboards → interfaze-ai/sob-leaderboard
## Quickstart

Load the dataset directly:

```python
from datasets import load_dataset
text = load_dataset("interfaze-ai/sob", split="test") # 5,000 records
image = load_dataset("interfaze-ai/sob", "image", split="train") # 209 records
audio = load_dataset("interfaze-ai/sob", "audio", split="train")  # 115 records
```
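To see what a record looks like before running anything (field names vary by config, so inspect rather than assume):

```python
# Continues from the snippet above; shows the structure of one text record.
rec = text[0]
print(sorted(rec.keys()))
print({k: str(v)[:80] for k, v in rec.items()})  # truncated preview of each field
```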
Or run a 5-record smoke test end-to-end:

```bash
git clone https://github.com/JigsawStack/sob && cd sob
make install
export OPENROUTER_API_KEY=...
python -m sob.run --provider openrouter --modality text \
--model-id google/gemma-4-31b-it --sample-size 5
python evaluate.py data/text_responses/response_google_gemma-4-31b-it.jsonl
```
## Installation

Python 3.12, clean virtualenv:

```bash
git clone https://github.com/JigsawStack/sob && cd sob
uv venv && source .venv/bin/activate
make install
```

`make install` uses `uv sync` if available, otherwise falls back to `pip install -r requirements.txt`. Other targets:

```bash
make format # ruff format .
make lint    # ruff check .
```

For local vLLM inference (NVIDIA GPU, CUDA 12.8, ≥ 24 GB VRAM):
```bash
uv pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
```

Set API keys for the providers you plan to use:

```bash
export OPENROUTER_API_KEY=...
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export HF_TOKEN=...  # only if the dataset is private
```

Response files and per-model evaluations under `data/` are LFS-tracked:

```bash
git lfs install
```

## Inference

`--modality text` runs the test split (5,000 records); `image` and `audio` use the single train split for those configs (209 / 115 records). Omit `--sample-size` for the full run.
OpenRouter:

```bash
python -m sob.run --provider openrouter --modality text \
    --model-id google/gemma-4-31b-it --sample-size 100
```

OpenAI:

```bash
python -m sob.run --provider openai --modality image --model-id gpt-5
```

Anthropic:

```bash
python -m sob.run --provider anthropic --modality audio --model-id claude-sonnet-4-6
```

Gemini:

```bash
python -m sob.run --provider gemini --modality text --model-id gemini-2.5-flash
```

vLLM (open-weight, your GPU):

```bash
python -m sob.run --provider vllm --modality text \
    --model-id Qwen/Qwen3.5-35B-A3B --use-structured-decoding
```

`--use-structured-decoding` is the schema-constrained ablation from paper §6.2; the headline leaderboard runs without it.
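For reference, this is roughly what schema-constrained decoding looks like through vLLM's guided-decoding API; the actual flag wiring lives in the repo, and the schema below is a made-up example (a sketch, assuming a recent vLLM version):

```python
# Sketch of schema-constrained generation with vLLM guided decoding.
# Illustrative only: not how sob.run wires --use-structured-decoding.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {  # hypothetical schema, stand-in for a SOB task schema
    "type": "object",
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["title", "year"],
}
llm = LLM(model="Qwen/Qwen3.5-35B-A3B")
params = SamplingParams(
    max_tokens=256,
    guided_decoding=GuidedDecodingParams(json=schema),  # constrain tokens to the schema
)
out = llm.generate(["Return the paper title and year as JSON: ..."], params)
print(out[0].outputs[0].text)
```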
Outputs:

- `data/text_responses/response_<model>.jsonl`
- `data/images_responses/response_<model>_image.jsonl`
- `data/audio_responses/response_<model>_audio.jsonl`
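A quick way to sanity-check a response file before scoring it (per-line fields are whatever `sob.run` wrote, so print the keys rather than trusting this sketch):

```python
# Count records and peek at the per-line structure of a response file.
import json

with open("data/text_responses/response_google_gemma-4-31b-it.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows), "records; keys:", sorted(rows[0].keys()))
```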
## Evaluation

Score a single response file:

```bash
python evaluate.py data/text_responses/response_google_gemma-4-31b-it.jsonl
```

Produces `data/evaluation/<modality>/<model>/{eval_records.jsonl, eval_summary.json}`; every paper number is reproducible from these summaries. Or score a whole directory:

```bash
python evaluate.py data/text_responses/ # all response_*.jsonl
python evaluate.py data/audio_responses/ --modality audio
```

The leaderboard is rebuilt from `data/evaluation/` by `scripts/build_leaderboard.py` on every push to `main`, published to the interfaze-ai/sob-leaderboard dataset, and rendered by the Space.
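Once a few models are scored, the summaries are easy to eyeball side by side. A sketch (the metric key names inside `eval_summary.json` are assumptions here; print one summary to see the real ones):

```python
# Print every numeric metric from each text eval summary, one model per line.
import json
from pathlib import Path

for path in sorted(Path("data/evaluation/text").glob("*/eval_summary.json")):
    summary = json.loads(path.read_text())
    metrics = {k: round(v, 3) for k, v in summary.items() if isinstance(v, (int, float))}
    print(f"{path.parent.name}: {metrics}")
```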
## Submit a model

- Fork, run inference + `evaluate.py` for one or more modalities, and drop the resulting `eval_summary.json` files into `data/evaluation/{text,image,audio}/<your_model_dir>/`.
- Add an entry for `<your_model_dir>` in `data/evaluation/display_names.json`. The `_comment` key is ignored; paste your `"<dir>": "<Pretty Name>"` alongside the others. (A pre-PR sanity-check sketch follows this list.)
- Open a PR; CI builds the leaderboard JSON and posts a top-10 preview comment to verify the row before merge.
- On merge to `main`, the publish workflow uploads a fresh `leaderboard.json` to the dataset and the Space picks it up.
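The sanity check mentioned above, as a sketch (`my-model` is a hypothetical directory name; this is not part of the repo's CI):

```python
# Verify summaries exist and display_names.json has an entry before opening a PR.
import json
from pathlib import Path

model_dir = "my-model"  # hypothetical: replace with your <your_model_dir>
for modality in ("text", "image", "audio"):
    p = Path("data/evaluation") / modality / model_dir / "eval_summary.json"
    print(f"{modality}: {'found' if p.exists() else 'missing (fine if not submitted)'} {p}")

names = json.loads(Path("data/evaluation/display_names.json").read_text())
assert model_dir in names, f'add "{model_dir}" to display_names.json'
```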
Preview locally before opening a PR:
```bash
python scripts/build_leaderboard.py --output leaderboard.json
```

## License

MIT License. Source datasets retain their original licenses: HotpotQA (CC-BY-SA-4.0), AMI Meeting Corpus (CC-BY-4.0), olmOCR-bench / olmOCR (ODC-BY / Apache-2.0).
## Acknowledgements

Thanks to the HotpotQA team, the AMI Meeting Corpus team, and the Allen AI olmOCR team for releasing their datasets.
## Contact

Open an issue or reach the authors at {abhinav, harsha, yoeven, vineet}@interfaze.ai.