SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding [ACL 2026]

📖 arXiv | 🤗 SWE-QA-Pro Bench


📢 News

  • πŸ”₯ [2026-3-20] SWE-QA-Pro Bench is publicly released! Please check our paper and benchmark. The model and code will be released soon.

📘 Introduction

SWE-QA-Pro is a benchmark and training framework for agentic repository-level code understanding, enabling models to explore, reason over, and verify real-world codebases. This work targets key limitations in existing evaluations:

  • Limited diversity: benchmarks focus on popular repositories, missing long-tail software tasks
  • Knowledge leakage: many questions can be solved without interacting with the codebase
  • Weak tool necessity: unclear whether agentic workflows are actually required

To address these challenges, we introduce two key components:

  1. SWE-QA-Pro Benchmark:

    A repository-level QA benchmark built from diverse long-tail repositories with executable environments.

    • Questions are seeded from real issues and grouped via clustering to ensure topic diversity
    • Each item is grounded in actual code with human verification
    • A difficulty calibration pipeline filters out questions solvable by direct-answer models

    This results in a benchmark where agentic exploration is necessary, with up to a ~13-point performance gap between tool-using agents and direct answering. A minimal loading sketch also appears after this list.

  2. Agentic Training Pipeline & Models:

    A scalable framework for learning repository-level agentic reasoning.

    • Generates synthetic tool-use trajectories and grounded supervision
    • Trains models with a two-stage recipe (SFT β†’ RLAIF)
    • Enables small open models to learn multi-step reasoning, tool usage, and code navigation

    Models trained with this pipeline achieve strong performance, with our SWE-QA-Pro 8B model surpassing GPT-4o by +2.3 points on SWE-QA-Pro and substantially narrowing the gap to state-of-the-art proprietary models.
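
The loading sketch referenced above: once the benchmark is live on Hugging Face, it should be loadable with the datasets library. This is a minimal sketch assuming the dataset ID mirrors this repository and that items expose repo, question, and reference fields; the released schema may differ.

from datasets import load_dataset

# Assumed dataset ID and split; check the Hugging Face page once released.
bench = load_dataset("TIGER-AI-Lab/SWE-QA-Pro", split="test")

for item in bench.select(range(3)):
    # Assumed fields: the grounding repository plus the question text.
    print(item["repo"], "->", item["question"][:80])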
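
The difficulty-calibration step can likewise be pictured as a filter over candidate questions: any question a repository-blind, direct-answer model already gets right is treated as leaked and dropped. Below is a minimal sketch of that idea, where direct_answer and is_correct are hypothetical stand-ins, not the authors' released code.

def calibrate(candidates, direct_answer, is_correct):
    """Keep only questions a model cannot answer without exploring the repo."""
    kept = []
    for q in candidates:
        guess = direct_answer(q["question"])       # model sees the question text only
        if not is_correct(guess, q["reference"]):  # failed without the repo -> keep it
            kept.append(q)
    return kept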

πŸ› οΈ TODO

  • Release the dataset
  • Release the evaluation code
  • Release the model
  • Release the training code

📬 Contact


📖 Citation

BibTeX:

@article{cai2026sweqapro,
  title={SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding},
  author={Songcheng Cai and Zhiheng Lyu and Yuansheng Ni and Xiangchao Chen and Baichuan Zhou and Shenzhe Zhu and Yi Lu and Haozhe Wang and Chi Ruan and Benjamin Schneider and Weixu Zhang and Xiang Li and Andy Zheng and Yuyu Zhang and Ping Nie and Wenhu Chen},
  journal={arXiv preprint arXiv:2603.16124},
  year={2026}
}
