tmlr-group/COCA

Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

Hongzheng Yang*1, Yongqiang Chen*2,3, Zeyu Qin4, Tongliang Liu5, Chaowei Xiao6, Kun Zhang2,3, Bo Han1

1TMLR Group, HKBU   2MBZUAI   3CMU   4HKUST   5SAIC Centre, USYD   6JHU

*Equal contribution.

COCA is built on top of LLaMA-Factory. The original training, data loading, model loading, and trainer infrastructure is preserved, while this codebase adds LoFIT-style representation intervention as a fine-tuning method and provides the COCA safety-alignment experiment configurations.

Paper

Published in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, PMLR 306.

Abstract

Representation intervention aims to localize and modify the representations that encode the underlying concepts in large language models (LLMs) in order to elicit aligned, expected behaviors. Despite its empirical success, it has never been examined whether one can localize faithful concepts for intervention. In this work, we explore this question in the context of safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle this issue, we propose COncept ConcentrAtion (COCA). Instead of identifying the faithful locations to intervene, COCA refactors the training data with an explicit reasoning process that first identifies potentially unsafe concepts and then decides on the response. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates while maintaining strong performance on regular tasks such as math and code generation.

Code Overview

The implementation keeps LLaMA-Factory as the base framework and adds COCA/LoFIT-specific logic in a small number of places:

  • src/llamafactory/model/lofit.py implements the LoFIT attention intervention module, selected-head scoring, trainable-parameter selection, and top-head export.
  • src/llamafactory/hparams/finetuning_args.py adds finetuning_type: lofit and LoFIT arguments such as lofit_component, lofit_topk_heads, lofit_heads_path, and lofit_applied_layers.
  • src/llamafactory/model/adapter.py registers LoFIT as a fine-tuning method, inserts LoFIT layers into attention modules, freezes the base model, and enables only the selected LoFIT parameters.
  • src/llamafactory/model/loader.py applies the model-loading compatibility settings needed before LoFIT adapters are attached.
  • src/llamafactory/train/sft/trainer.py adds LofitSFTTrainer, which supports the head-selection loss and saves the selected heads for the second LoFIT stage.
  • examples/lofit/ contains the experiment YAML files for COCA LoFIT runs.
  • data/dataset_info.json registers the COCA safety, benign, and evaluation datasets used by the experiment configurations.

The core LLaMA-Factory training workflow is otherwise left intact, so standard LLaMA-Factory commands and configuration conventions still apply.
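
The head scoring and intervention logic in src/llamafactory/model/lofit.py is not reproduced here. As a rough illustration only, the sketch below assumes the scheme described in the LoFIT paper: each attention head gets a learned scaling vector, heads are ranked by the norm of that vector, and the top-k heads then receive a bias vector added to their output. The function names (`head_norms`, `select_top_heads`, `apply_bias`) and the nested-list layout are hypothetical, and the repository's actual scoring may differ.

```python
import math


def head_norms(scales):
    """Map (layer, head) -> L2 norm of that head's scaling vector.

    `scales[layer][head]` is a list of floats (hypothetical layout).
    """
    return {
        (layer, head): math.sqrt(sum(x * x for x in vec))
        for layer, heads in enumerate(scales)
        for head, vec in enumerate(heads)
    }


def select_top_heads(scales, k):
    """Return the k (layer, head) pairs with the largest scaling-vector norm."""
    norms = head_norms(scales)
    # Sort by descending norm; ties are broken by (layer, head) for determinism.
    ranked = sorted(norms, key=lambda lh: (-norms[lh], lh))
    return ranked[:k]


def apply_bias(head_outputs, biases):
    """Add a bias vector to the outputs of the selected heads only.

    `biases` maps (layer, head) -> vector; heads without an entry are copied unchanged.
    """
    result = []
    for layer, heads in enumerate(head_outputs):
        new_heads = []
        for head, vec in enumerate(heads):
            bias = biases.get((layer, head))
            if bias is None:
                new_heads.append(list(vec))
            else:
                new_heads.append([x + b for x, b in zip(vec, bias)])
        result.append(new_heads)
    return result
```

In this toy form, the first phase corresponds to learning the scaling vectors and calling `select_top_heads`, and the second phase corresponds to learning the entries of `biases` for the selected heads while everything else stays frozen.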

LoFIT Workflow

COCA uses the LLaMA-Factory SFT path with finetuning_type: lofit. The standard LoFIT experiment has two phases.

First, run head selection with lofit_component: A. This trains the head-selection parameters and writes the selected attention heads to lofit_heads_path.

llamafactory-cli train examples/lofit/head.yaml

Then run the bias/intervention tuning phase with lofit_component: v. This loads the selected heads from the first phase and trains only the corresponding LoFIT intervention parameters.

llamafactory-cli train examples/lofit/llama3_lofit_bias_tuning.yaml
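
Because the second phase depends on the heads file written by the first, it can be convenient to run both commands from one script that stops if head selection fails. The following is a minimal sketch, not part of the repository: `PHASE_CONFIGS`, `build_commands`, and `run_phases` are hypothetical names, and the only assumption about the CLI is the `llamafactory-cli train <config>` form shown above.

```python
import subprocess

# Config paths taken from the two commands above, in execution order.
PHASE_CONFIGS = [
    "examples/lofit/head.yaml",                      # phase 1: head selection
    "examples/lofit/llama3_lofit_bias_tuning.yaml",  # phase 2: bias/intervention tuning
]


def build_commands(configs=PHASE_CONFIGS):
    """Build one `llamafactory-cli train` argument list per config file."""
    return [["llamafactory-cli", "train", cfg] for cfg in configs]


def run_phases(configs=PHASE_CONFIGS, runner=subprocess.run):
    """Run the phases in order.

    With the default runner, check=True raises CalledProcessError on a
    non-zero exit code, so phase 2 never starts after a failed phase 1.
    """
    for cmd in build_commands(configs):
        runner(cmd, check=True)
```

Calling `run_phases()` from the repository root would launch both phases back to back; the `runner` argument exists only so the command sequence can be inspected without starting a training job.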

We also implement other COCA variants, configured in examples/lofit/normal.yaml, examples/lofit/normal_warm.yaml, and examples/lofit/think_warm.yaml. For simplicity, you can skip the first (head-selection) phase.

Data

The dataset registry is defined in data/dataset_info.json. The COCA-specific entries include:

  • beaver_think and wizardlm_think_* for COCA-refactored reasoning data.
  • safety_beaver_enhanced_alpaca, safety_beaver_normal_alpaca, and safety_beaver_ablation_alpaca for safety-alignment variants.
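
Before launching a run, it may help to confirm that these entries are actually present in the registry. The sketch below assumes data/dataset_info.json is a JSON object keyed by dataset name, as in LLaMA-Factory; `COCA_PATTERNS`, `load_registry`, and `find_coca_entries` are hypothetical helpers, and the `wizardlm_think_*` wildcard is matched with `fnmatch` because the concrete suffixes are not listed here.

```python
import fnmatch
import json

# Entry names and the wildcard pattern as listed above.
COCA_PATTERNS = [
    "beaver_think",
    "wizardlm_think_*",
    "safety_beaver_enhanced_alpaca",
    "safety_beaver_normal_alpaca",
    "safety_beaver_ablation_alpaca",
]


def load_registry(path="data/dataset_info.json"):
    """Load the dataset registry (assumed to be a dict keyed by dataset name)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_coca_entries(registry, patterns=COCA_PATTERNS):
    """Return the sorted registry keys that match any COCA pattern."""
    return sorted(
        name
        for name in registry
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
    )
```

For example, `find_coca_entries(load_registry())` run from the repository root lists the matching dataset names that can be referenced in the `dataset` field of an experiment YAML.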

Citation

If this code is useful for your research, please cite:

@inproceedings{yang2026coca,
  title={Concept Concentration for Faithful Representation Intervention},
  author={Yang, Hongzheng and Chen, Yongqiang and Qin, Zeyu and Liu, Tongliang and Xiao, Chaowei and Zhang, Kun and Han, Bo},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026},
  series={Proceedings of Machine Learning Research},
  volume={306}
}

If you have any questions, feel free to contact hzyang05@gmail.com.

Acknowledgement

This repository is based on LLaMA-Factory and benefits from the Hugging Face Transformers, PEFT, TRL, and PyTorch ecosystems.
