Long-context modeling has drawn increasing attention in the field of Large Language Models (LLMs). Continual training on long-context data has become the de facto method for equipping LLMs with the ability to process long inputs. However, measuring the quality of long-context training data remains an open challenge. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
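For intuition, the sketch below scores a document by the attention mass its tokens assign to distant context, a generic proxy for the kind of contextual dependency LADM measures. It is not the paper's exact CDS computation; the model choice, the averaging over layers and heads, and the `window` threshold are all illustrative assumptions.

```python
# Illustrative sketch only: a generic attention-based long-range dependency
# proxy, NOT the paper's exact CDS formula. We average the attention mass
# that each query token assigns to context more than `window` tokens away;
# higher values suggest stronger long-range dependencies.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that returns attentions works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def long_range_attention_score(text: str, window: int = 256) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer;
    # average over layers and heads to get a single (seq, seq) map.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]
    seq_len = attn.size(0)
    idx = torch.arange(seq_len)
    # keep only attention to keys more than `window` tokens before the query
    distant = (idx[:, None] - idx[None, :]) > window
    return attn[distant].sum().item() / max(seq_len, 1)
```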

Please prepare a long-context pre-training dataset, truncated to 32k tokens, in the following format (see here for examples).
```
DatasetDict({
    train: Dataset({
        features: ['text', 'meta', 'input_ids', 'index'],
        num_rows: 32
    })
})
```
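A minimal sketch of building a dataset in this format with the `datasets` library; the tokenizer, the `meta` contents, and the output path are placeholder assumptions, and the 32k truncation follows the note above.

```python
# Hedged sketch: assemble a DatasetDict with the expected columns.
# Tokenizer name, `meta` fields, and save path are placeholders.
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # swap in your model's tokenizer

texts = ["your long document here"]  # raw long documents
records = []
for i, text in enumerate(texts):
    # truncate to 32k tokens as required above
    input_ids = tokenizer(text, truncation=True, max_length=32768)["input_ids"]
    records.append({
        "text": text,
        "meta": {"source": "example"},  # placeholder metadata
        "input_ids": input_ids,
        "index": i,
    })

dataset = DatasetDict({"train": Dataset.from_list(records)})
dataset.save_to_disk("./my_longctx_dataset")
```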
You can use our Long Attention Calculator or other LLMs with long-context modeling capability.
If you run the following script on our toy dataset, you will get similar CDS scores in the file ./toy_scores.json:
```
bash launch_toy.sh
```
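To inspect the resulting scores, a sketch such as the one below can rank samples by CDS score; the exact JSON layout of ./toy_scores.json (an index-to-score mapping) is an assumption here, so adapt the parsing to the actual output.

```python
# Hedged sketch: select the highest-scoring samples by CDS score.
# Assumed layout of ./toy_scores.json: {"<index>": <cds_score>, ...}
import json

with open("./toy_scores.json") as f:
    scores = json.load(f)

top_k = 16
selected = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
print("selected sample indices:", [idx for idx, _ in selected])
```

Higher-scoring samples are the ones the framework would keep for continual training.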
For full usage:

```
bash launch.sh
```

Our training mainly follows the Hugging Face Trainer code base. Please refer to that repo for more details.
If you find this repo useful for your research, please consider citing the paper:
```
@article{chen2025ladm,
    title={LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs},
    author={Chen, Jianghao and Wu, Junhong and Xu, Yangyifan and Zhang, Jiajun},
    journal={arXiv preprint arXiv:2503.02502},
    year={2025}
}
```