Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
Hongzheng Yang*1, Yongqiang Chen*2,3, Zeyu Qin4, Tongliang Liu5, Chaowei Xiao6, Kun Zhang2,3, Bo Han1
1TMLR Group, HKBU 2MBZUAI 3CMU 4HKUST 5SAIC Centre, USYD 6JHU
*Equal contribution.
COCA is built on top of LLaMA-Factory. The original training, data loading, model loading, and trainer infrastructure is preserved, while this codebase adds LoFIT-style representation intervention as a fine-tuning method and provides the COCA safety-alignment experiment configurations.
Published in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, PMLR 306.
Representation intervention aims to localize and modify the representations that encode the underlying concepts in large language models (LLMs) to elicit aligned and expected behaviors. Despite its empirical success, it has never been examined whether one can localize the faithful concepts for intervention. In this work, we explore this question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose COncept ConcentrAtion (COCA). Instead of identifying the faithful locations to intervene, COCA refactors the training data with an explicit reasoning process, which first identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates while maintaining strong performance on regular tasks such as math and code generation.
The implementation keeps LLaMA-Factory as the base framework and adds COCA/LoFIT-specific logic in a small number of places:
- src/llamafactory/model/lofit.py implements the LoFIT attention intervention module, selected-head scoring, trainable-parameter selection, and top-head export.
- src/llamafactory/hparams/finetuning_args.py adds finetuning_type: lofit and LoFIT arguments such as lofit_component, lofit_topk_heads, lofit_heads_path, and lofit_applied_layers.
- src/llamafactory/model/adapter.py registers LoFIT as a fine-tuning method, inserts LoFIT layers into attention modules, freezes the base model, and enables only the selected LoFIT parameters.
- src/llamafactory/model/loader.py applies the model-loading compatibility settings needed before LoFIT adapters are attached.
- src/llamafactory/train/sft/trainer.py adds LofitSFTTrainer, which supports the head-selection loss and saves the selected heads for the second LoFIT stage.
- examples/lofit/ contains the experiment YAML files for COCA LoFIT runs.
- data/dataset_info.json registers the COCA safety, benign, and evaluation datasets used by the experiment configurations.
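To make the "selected-head scoring" and "top-head export" steps listed above more concrete, here is a minimal sketch of top-k head selection. It assumes each attention head has already been reduced to one scalar score; how this repository computes that score is not shown here. The function name select_top_heads, the (layer, head) tuple format, and the toy numbers are all hypothetical and are not taken from lofit.py.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index); an assumed representation


def select_top_heads(scores: Dict[Head, float], k: int) -> List[Head]:
    """Return the k heads with the largest scores, highest score first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort by descending score; break ties by (layer, head) so the
    # result is deterministic across runs.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [head for head, _ in ranked[:k]]


# Toy scores for a 2-layer, 2-head model (made-up numbers).
toy_scores = {(0, 0): 0.10, (0, 1): 0.90, (1, 0): 0.40, (1, 1): 0.90}
top_heads = select_top_heads(toy_scores, k=2)
print(top_heads)
```

In a real run, the number of heads kept would presumably come from lofit_topk_heads and the result would be written to lofit_heads_path; the sketch only illustrates the ranking step.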
The core LLaMA-Factory training workflow is otherwise left intact, so standard LLaMA-Factory commands and configuration conventions still apply.
COCA uses the LLaMA-Factory SFT path with finetuning_type: lofit. The standard LoFIT experiment has two phases.
First, run head selection with lofit_component: A. This trains the head-selection parameters and writes the selected attention heads to lofit_heads_path.

    llamafactory-cli train examples/lofit/head.yaml

Then run the bias/intervention tuning phase with lofit_component: v. This loads the selected heads from the first phase and trains only the corresponding LoFIT intervention parameters.

    llamafactory-cli train examples/lofit/llama3_lofit_bias_tuning.yaml

We also implement other COCA variants. They are configured in examples/lofit/normal.yaml, examples/lofit/normal_warm.yaml, and examples/lofit/think_warm.yaml. You can skip the first phase for simplicity.
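The standard two-phase run described above can also be scripted. The following is a small sketch that chains the two llamafactory-cli train commands in order and stops if the first one fails. The helper names build_commands and run_phases are hypothetical, and the script defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List

# Config paths taken from the two commands above.
PHASE_CONFIGS = [
    "examples/lofit/head.yaml",                      # phase 1: head selection
    "examples/lofit/llama3_lofit_bias_tuning.yaml",  # phase 2: bias tuning
]


def build_commands(configs: List[str]) -> List[List[str]]:
    """Build one `llamafactory-cli train <config>` command per config file."""
    return [["llamafactory-cli", "train", cfg] for cfg in configs]


def run_phases(configs: List[str], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, run it in order."""
    for cmd in build_commands(configs):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so phase 2 never starts if phase 1 fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_phases(PHASE_CONFIGS, dry_run=True)
```

Call run_phases(PHASE_CONFIGS, dry_run=False) from the repository root to launch training.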
The dataset registry is defined in data/dataset_info.json. The COCA-specific entries include:
- beaver_think and wizardlm_think_* for COCA-refactored reasoning data.
- safety_beaver_enhanced_alpaca, safety_beaver_normal_alpaca, and safety_beaver_ablation_alpaca for safety-alignment variants.
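To check which of these entries are present in a local checkout, a short script can filter the registry by name. This sketch assumes data/dataset_info.json is a JSON object keyed by dataset name, as in upstream LLaMA-Factory. The helper names and the demo registry are hypothetical; in particular, wizardlm_think_example and some_other_dataset are placeholder names, not real entries.

```python
import json
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Dict, List, Sequence

# Name patterns derived from the entries listed above.
COCA_PATTERNS = ("beaver_think", "wizardlm_think_*", "safety_beaver_*_alpaca")


def coca_dataset_names(
    registry: Dict[str, dict], patterns: Sequence[str] = COCA_PATTERNS
) -> List[str]:
    """Return the sorted registry keys that match any COCA name pattern."""
    return sorted(
        name for name in registry
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    )


def load_registry(path: str = "data/dataset_info.json") -> Dict[str, dict]:
    """Load the dataset registry from disk (run from the repository root)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# In-memory demo so the sketch runs without the repository checked out.
demo_registry = {
    "beaver_think": {},
    "wizardlm_think_example": {},      # placeholder name
    "safety_beaver_normal_alpaca": {},
    "some_other_dataset": {},          # placeholder name, should be filtered out
}
print(coca_dataset_names(demo_registry))
```

With the repository available, coca_dataset_names(load_registry()) lists the matching entries that are actually registered.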
If this code is useful for your research, please cite:
@inproceedings{yang2026coca,
title={Concept Concentration for Faithful Representation Intervention},
author={Yang, Hongzheng and Chen, Yongqiang and Qin, Zeyu and Liu, Tongliang and Xiao, Chaowei and Zhang, Kun and Han, Bo},
booktitle={Proceedings of the 43rd International Conference on Machine Learning},
year={2026},
series={Proceedings of Machine Learning Research},
volume={306}
}

If you have any questions, feel free to contact hzyang05@gmail.com.
This repository is based on LLaMA-Factory and benefits from the Hugging Face Transformers, PEFT, TRL, and PyTorch ecosystems.