COCA: Concept Concentration for Faithful Representation Intervention

Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

Hongzheng Yang*¹, Yongqiang Chen*²˒³, Zeyu Qin⁴, Tongliang Liu⁵, Chaowei Xiao⁶, Kun Zhang²˒³, Bo Han¹

¹TMLR Group, HKBU ²MBZUAI ³CMU ⁴HKUST ⁵SAIC Centre, USYD ⁶JHU

*Equal contribution.

COCA is built on top of LLaMA-Factory. The original training, data loading, model loading, and trainer infrastructure is preserved, while this codebase adds LoFIT-style representation intervention as a fine-tuning method and provides the COCA safety-alignment experiment configurations.

Paper

Published in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, PMLR 306.

Abstract

Representation intervention aims to localize and modify the representations that encode the underlying concepts in large language models (LLMs) to elicit aligned and expected behaviors. Despite its empirical success, it has never been examined whether one can localize the faithful concepts for intervention. In this work, we explore this question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle this issue, we propose COncept ConcentrAtion (COCA). Instead of identifying the faithful locations to intervene, COCA refactors the training data with an explicit reasoning process, which first identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates while maintaining strong performance on regular tasks such as math and code generation.
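To make the phrase "linear erasure" concrete, here is a minimal toy sketch. It is not the procedure used in the paper or in this codebase; it assumes the harmful concept lies along a single direction, estimated here as the difference between class means, and removes that direction by projection. All names (`mean`, `unit`, `erase`) and the 2-D data are hypothetical.

```python
import math

def mean(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def erase(v, direction):
    """Remove the component of v along a unit-norm direction."""
    dot = sum(a * b for a, b in zip(v, direction))
    return [a - dot * b for a, b in zip(v, direction)]

# Toy representations (assumption): coordinate 0 carries the "harmful"
# concept, coordinate 1 carries unrelated benign information.
harmful = [[2.0, 1.0], [2.0, -1.0]]
benign = [[0.0, 1.0], [0.0, -1.0]]

# Estimate the concept direction as the difference of class means.
direction = unit([h - b for h, b in zip(mean(harmful), mean(benign))])

erased_harmful = [erase(h, direction) for h in harmful]
erased_benign = [erase(b, direction) for b in benign]
```

In this linearly separable toy case the erased harmful vectors coincide with the benign ones, and the benign vectors are unchanged. When the two classes are not separated along one direction, a single projection like this no longer suffices, which is the situation the abstract describes as the general non-linear setting.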

Code Overview

The implementation keeps LLaMA-Factory as the base framework and adds COCA/LoFIT-specific logic in a small number of places:

src/llamafactory/model/lofit.py implements the LoFIT attention intervention module, selected-head scoring, trainable-parameter selection, and top-head export.
src/llamafactory/hparams/finetuning_args.py adds finetuning_type: lofit and LoFIT arguments such as lofit_component, lofit_topk_heads, lofit_heads_path, and lofit_applied_layers.
src/llamafactory/model/adapter.py registers LoFIT as a fine-tuning method, inserts LoFIT layers into attention modules, freezes the base model, and enables only the selected LoFIT parameters.
src/llamafactory/model/loader.py applies the model-loading compatibility settings needed before LoFIT adapters are attached.
src/llamafactory/train/sft/trainer.py adds LofitSFTTrainer, which supports the head-selection loss and saves the selected heads for the second LoFIT stage.
examples/lofit/ contains the experiment YAML files for COCA LoFIT runs.
data/dataset_info.json registers the COCA safety, benign, and evaluation datasets used by the experiment configurations.

The core LLaMA-Factory training workflow is otherwise left intact, so standard LLaMA-Factory commands and configuration conventions still apply.
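The head scoring and intervention steps listed above can be pictured with a small pure-Python sketch. This is not the code in src/llamafactory/model/lofit.py; it assumes that each attention head is scored by the L2 norm of a learned per-head vector and that the intervention adds a bias vector to the outputs of the selected heads. The function names and toy values are hypothetical.

```python
import math

def head_scores(scaling):
    """Score each (layer, head) key by the L2 norm of its learned vector."""
    return {key: math.sqrt(sum(x * x for x in vec)) for key, vec in scaling.items()}

def select_top_heads(scaling, k):
    """Return the k highest-scoring heads, ties broken by (layer, head) order."""
    scores = head_scores(scaling)
    ranked = sorted(scores, key=lambda key: (-scores[key], key))
    return ranked[:k]

def apply_bias(head_outputs, biases):
    """Add a bias vector to selected heads; leave all other heads untouched."""
    steered = {}
    for key, vec in head_outputs.items():
        if key in biases:
            steered[key] = [x + b for x, b in zip(vec, biases[key])]
        else:
            steered[key] = list(vec)
    return steered

# Toy learned vectors for three heads, keyed by (layer, head). Assumed values.
scaling = {(0, 0): [0.1, 0.0], (0, 1): [0.9, 0.4], (1, 0): [0.3, 0.3]}
top_heads = select_top_heads(scaling, k=2)

# Assumed bias vectors, defined only for the selected heads.
biases = {key: [0.5, -0.5] for key in top_heads}
head_outputs = {key: [1.0, 1.0] for key in scaling}
steered = apply_bias(head_outputs, biases)
```

Only the heads in `top_heads` receive a bias, which mirrors the idea of freezing the base model and training a small set of per-head parameters.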

LoFIT Workflow

COCA uses the LLaMA-Factory SFT path with finetuning_type: lofit. The standard LoFIT experiment has two phases.

First, run head selection with lofit_component: A. This trains the head-selection parameters and writes the selected attention heads to lofit_heads_path.

llamafactory-cli train examples/lofit/head.yaml

Then run the bias/intervention tuning phase with lofit_component: v. This loads the selected heads from the first phase and trains only the corresponding LoFIT intervention parameters.

llamafactory-cli train examples/lofit/llama3_lofit_bias_tuning.yaml

Other COCA variants are also implemented. They are configured in examples/lofit/normal.yaml, examples/lofit/normal_warm.yaml, and examples/lofit/think_warm.yaml. For simplicity, you can skip the first (head-selection) phase.
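As a rough picture of how the two phases relate, the sketch below builds the LoFIT-specific part of each phase's configuration as a Python dict and renders it as flat YAML text. The keys come from this README and from standard LLaMA-Factory conventions (`stage: sft`); the heads path, its file extension, the `lofit_topk_heads` value, and the helper names are assumptions, and the shipped YAML files in examples/lofit/ remain the authoritative settings.

```python
# Settings shared by both phases. "stage: sft" follows LLaMA-Factory convention.
BASE = {"stage": "sft", "finetuning_type": "lofit"}

# Hypothetical location; the real path is set in the example YAML files.
HEADS_PATH = "saves/lofit/top_heads.json"

def phase_config(component, heads_path, topk_heads=None):
    """Build the LoFIT-specific settings for one phase."""
    cfg = dict(BASE)
    cfg["lofit_component"] = component
    cfg["lofit_heads_path"] = heads_path
    if topk_heads is not None:
        cfg["lofit_topk_heads"] = topk_heads
    return cfg

def to_yaml(cfg):
    """Render a flat dict of scalars as simple 'key: value' YAML lines."""
    return "\n".join(f"{key}: {value}" for key, value in cfg.items()) + "\n"

def train_command(config_path):
    """Command line for one phase, as an argument list (not executed here)."""
    return ["llamafactory-cli", "train", config_path]

# Phase 1 writes the selected heads; the value 32 is an assumed example.
selection = phase_config("A", HEADS_PATH, topk_heads=32)
# Phase 2 reads the same heads file and trains the intervention parameters.
tuning = phase_config("v", HEADS_PATH)

commands = [
    train_command("examples/lofit/head.yaml"),
    train_command("examples/lofit/llama3_lofit_bias_tuning.yaml"),
]
```

The point of the sketch is the data flow: both phases point at the same `lofit_heads_path`, so the second phase depends on the file written by the first.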

Data

The dataset registry is defined in data/dataset_info.json. The COCA-specific entries include:

beaver_think and wizardlm_think_* for COCA-refactored reasoning data.
safety_beaver_enhanced_alpaca, safety_beaver_normal_alpaca, and safety_beaver_ablation_alpaca for safety-alignment variants.
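To check which COCA entries are present in a registry, a small helper along these lines may be useful. It is a sketch, not part of the repository: the patterns are taken from the names listed above, while the in-memory registry, its file names, and the `wizardlm_think_PLACEHOLDER` suffix are hypothetical stand-ins for the real contents of data/dataset_info.json.

```python
import fnmatch
import json

# Name patterns derived from the entries listed in this README.
COCA_PATTERNS = ["beaver_think", "wizardlm_think_*", "safety_beaver_*_alpaca"]

def load_registry(path="data/dataset_info.json"):
    """Read the dataset registry from disk (not called in this sketch)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def coca_entries(registry):
    """Return the sorted registry names that match any COCA pattern."""
    return sorted(
        name
        for name in registry
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in COCA_PATTERNS)
    )

# Hypothetical registry contents, used only for illustration.
registry = {
    "beaver_think": {"file_name": "beaver_think.json"},
    "wizardlm_think_PLACEHOLDER": {"file_name": "wizardlm_think.json"},
    "safety_beaver_normal_alpaca": {"file_name": "safety_normal.json"},
    "alpaca_en_demo": {"file_name": "alpaca_en_demo.json"},
}

found = coca_entries(registry)
```

With the real file, `coca_entries(load_registry())` would list the registered COCA datasets, assuming the command is run from the repository root.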

Citation

If this code is useful for your research, please cite:

@inproceedings{yang2026coca,
  title={Concept Concentration for Faithful Representation Intervention},
  author={Yang, Hongzheng and Chen, Yongqiang and Qin, Zeyu and Liu, Tongliang and Xiao, Chaowei and Zhang, Kun and Han, Bo},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026},
  series={Proceedings of Machine Learning Research},
  volume={306}
}

If you have any questions, feel free to contact hzyang05@gmail.com.

Acknowledgement

This repository is based on LLaMA-Factory and benefits from the Hugging Face Transformers, PEFT, TRL, and PyTorch ecosystems.
