Official repository for VRSLU, the benchmark proposed in “Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding”. VRSLU is the first spoken language understanding benchmark to incorporate visual scene information and explicit reasoning for joint intent detection and slot filling.
The TeleVRSLU dataset is available at: https://huggingface.co/datasets/Tele-AI/TeleVRSLU
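For reference, below is a minimal sketch of loading the dataset with the Hugging Face `datasets` library. It assumes TeleVRSLU is published as a standard loadable dataset repo; the split names and record schema are not specified here, so inspect the loaded object to see the actual fields.

```python
# Minimal sketch: load TeleVRSLU from the Hugging Face Hub.
# Assumption: the repo at Tele-AI/TeleVRSLU is loadable with the standard
# `datasets` API; split names and example fields below are illustrative only.
from datasets import load_dataset

dataset = load_dataset("Tele-AI/TeleVRSLU")

# Print the available splits and one example to check the actual schema.
print(dataset)
first_split = next(iter(dataset))
print(first_split, dataset[first_split][0])
```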
If you find VRSLU helpful to your research, please consider citing the following papers, including the original ProSLU paper. We sincerely thank the ProSLU authors for their valuable contributions.
@article{wu2025introducing,
  title={Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding},
  author={Wu, Di and Jiang, Liting and Fang, Ruiyu and Xie, Hongyan and Su, Haoxiang and Huang, Hao and He, Zhongjiang and Song, Shuangyong and Li, Xuelong and others},
  journal={arXiv preprint arXiv:2511.19005},
  year={2025}
}

@inproceedings{xu2022text,
  title={Text is no more enough! a benchmark for profile-based spoken language understanding},
  author={Xu, Xiao and Qin, Libo and Chen, Kaiji and Wu, Guoxing and Li, Linlin and Che, Wanxiang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={36},
  number={10},
  pages={11575--11585},
  year={2022}
}