claws-lab/sysformer

Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

This is the supplementary code for the paper published at ICLR 2026.

SysFormer is a comprehensive defense framework designed to protect large language models (LLMs) against jailbreak attacks. Built on the JailbreakBench benchmark, this system provides multiple defense mechanisms and evaluation tools for assessing robustness against adversarial prompts and jailbreak techniques.

Overview

This project implements various defense strategies against LLM jailbreaks, including:

  • SystemFormer: A transformer-based defense mechanism that processes system prompts and user inputs
  • SysEmbedder: Embedding-based defense using semantic representations
  • Circuit Breaker: A mechanism to detect and interrupt harmful request processing
  • Baseline Defenses: Including perturbation-based methods (random swap, random patch)

The framework supports training, evaluation, and benchmarking against diverse attack strategies and datasets.


Setup

Requirements

pip install jailbreakbench
pip install -r requirements.txt

or

conda env create -f environment.yaml

The project requires:

  • PyTorch with CUDA support
  • Transformers library
  • Hydra for configuration management
  • Weights & Biases (wandb) for experiment tracking
  • Various NLP and evaluation libraries

Project Structure

.
├── src/                              # Core implementation
│   ├── sysformer.py                 # SystemFormer model implementation
│   ├── trainer.py                   # Training loop and coordination
│   ├── dataset.py                   # Dataset loading and processing
│   ├── losses.py                    # Loss functions for training
│   ├── evaluators.py                # Evaluation metrics and judges
│   ├── lm.py                        # Language model utilities
│   ├── attacks.py                   # Attack implementations
│   ├── baselines.py                 # Baseline defense methods
│   ├── prompts.py                   # Prompt templates
│   └── utils.py                     # Utility functions
├── configs/                          # Configuration files
│   ├── config_train.yaml            # Training configuration preset
│   ├── dataset/                     # Dataset configs (JBB, HarmBench, LLM Safeguard, etc.)
│   ├── attack/                      # Attack strategy configs
│   ├── defense/                     # Defense mechanism configs
│   ├── judge/                       # Judge/evaluator configs (GPT-4o, Llama-3, etc.)
│   └── language_model/              # LLM configs with PEFT options
├── jailbreakbench/                  # JailbreakBench integration
│   ├── src/jailbreakbench/         # JailbreakBench library code
│   └── examples/                    # Usage examples
├── train.py                         # Legacy training script
├── run.py                           # Main training entry point
├── test.py                          # Evaluation/testing entry point
├── requirements.txt                 # Python dependencies
├── environment.yaml                 # Conda environment yaml
└── README.md                        # This file

Key Components

SysFormer Model (src/sysformer.py)

The core defense mechanism that uses a transformer-based architecture to process system prompts and user inputs. It includes:

  • SystemFormer: Main model class that combines text encoding, transformer layers, and language projection
  • LlamaGuardEncoder: Integration with Meta's Llama Guard for additional safety encoding

Trainer (src/trainer.py)

Orchestrates the entire training pipeline:

  • Data loading and preprocessing
  • Model initialization and setup
  • Training loop with validation
  • Loss computation and optimization
  • Model checkpointing and evaluation
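
The pipeline above can be sketched as a minimal training loop. This is an illustrative skeleton only: the class name, method signatures, and the scalar stand-ins for losses are assumptions for exposition, not the repository's actual `src/trainer.py` API.

```python
# Illustrative skeleton of the training pipeline described above.
# All names and signatures are assumptions, not the repository's API.

class TrainerSketch:
    def __init__(self, model, train_data, val_data, num_epochs=10, validate_after=2):
        self.model = model
        self.train_data = train_data
        self.val_data = val_data
        self.num_epochs = num_epochs
        self.validate_after = validate_after
        self.best_val = float("inf")
        self.checkpoints = []

    def train_step(self, batch):
        # The real trainer would compute the combined losses (SFT, refusal,
        # circuit-breaker) and back-propagate; here we fake a scalar loss.
        return sum(batch) / len(batch)

    def validate(self):
        # Stand-in for running the judge/evaluator on held-out prompts.
        return sum(self.val_data) / len(self.val_data)

    def fit(self):
        for epoch in range(1, self.num_epochs + 1):
            epoch_loss = sum(self.train_step(b) for b in self.train_data)
            if epoch % self.validate_after == 0:          # periodic validation
                val_loss = self.validate()
                if val_loss < self.best_val:               # checkpoint on improvement
                    self.best_val = val_loss
                    self.checkpoints.append(epoch)
        return self.checkpoints
```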

Dataset Handler (src/dataset.py)

Manages dataset loading and processing:

  • LabeledDataset: Custom dataset class supporting multiple jailbreak datasets
  • Integration with JailbreakBench datasets
  • Support for adversarial examples and self-safe training
  • Attention mask and position tracking for harmful/safe examples
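
A minimal sketch of what such a labeled dataset might look like: harmful prompts are paired with a refusal target, while safe prompt/answer pairs preserve helpfulness (the "self-safe" idea above). The class, field names, and refusal string are illustrative assumptions, not the repository's `LabeledDataset`.

```python
# Illustrative sketch of a labeled jailbreak dataset. Field names and the
# refusal string are assumptions, not the repository's actual schema.

REFUSAL = "I cannot help with that request."

class LabeledDatasetSketch:
    def __init__(self, harmful_prompts, safe_pairs):
        # Harmful prompts train toward refusals; safe (prompt, answer) pairs
        # keep the model helpful on benign inputs (self-safe-style training).
        self.items = [
            {"prompt": p, "target": REFUSAL, "harmful": True}
            for p in harmful_prompts
        ] + [
            {"prompt": p, "target": a, "harmful": False}
            for p, a in safe_pairs
        ]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]
```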

Losses (src/losses.py)

Multiple loss functions for different training objectives:

  • ResponseSFTLoss: Supervised fine-tuning loss
  • RefusalLoss: Specialized loss for training refusal responses
  • CBHarmLoss/CBSafeLoss: Circuit breaker losses
  • HarmClassificationLoss: Harm classification objective
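
To make the SFT objective concrete: it is a token-level negative log-likelihood computed only over the response span. The real loss operates on logits from the frozen LLM; this sketch takes pre-computed per-token probabilities as a stand-in, and the function name and arguments are assumptions for illustration.

```python
import math

def response_sft_loss(token_probs, response_mask):
    """Token-level negative log-likelihood, averaged over response tokens only.

    token_probs   -- model probability assigned to each gold token
    response_mask -- 1 where the token belongs to the response, 0 for the prompt
    (Both are illustrative stand-ins for the tensors a real loss would use.)
    """
    losses = [-math.log(p) for p, m in zip(token_probs, response_mask) if m]
    return sum(losses) / len(losses)
```

Masking out the prompt tokens is what distinguishes this from plain language-model loss: only the target (e.g. refusal) tokens contribute gradient signal.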

Evaluators (src/evaluators.py)

Evaluation metrics and judge models:

  • Integration with GPT-4o, GPT-4o-mini, and Llama models
  • Automatic and manual evaluation modes
  • Support for both jailbreak and refusal assessment

Baselines (src/baselines.py)

Simple baseline defense mechanisms:

  • RandomSwapPerturbation: Random character swaps in input
  • RandomPatchPerturbation: Random character patches in input
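
The random-swap idea is simple enough to sketch in a few lines: replace a random fraction of input characters before the prompt reaches the model, so character-level adversarial suffixes lose their effect. The function name and parameters below are illustrative assumptions, not the repository's `RandomSwapPerturbation` API.

```python
import random

def random_swap_perturbation(text, swap_frac=0.1,
                             alphabet="abcdefghijklmnopqrstuvwxyz", seed=None):
    """Replace a random fraction of characters with random ones.

    A minimal re-implementation of the random-swap idea; parameter names
    are assumptions, not the repository's API.
    """
    rng = random.Random(seed)
    chars = list(text)
    n_swaps = max(1, int(len(chars) * swap_frac))
    for i in rng.sample(range(len(chars)), n_swaps):  # pick positions to perturb
        chars[i] = rng.choice(alphabet)
    return "".join(chars)
```

A patch perturbation is the contiguous analogue: one random substring is overwritten instead of scattered single characters.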

Configuration

The project uses Hydra for configuration management. Configuration files are organized in the configs/ directory:

Dataset Options

  • jbb_behaviors: JailbreakBench behaviors dataset (200 behaviors)
  • harm_bench: HarmBench dataset
  • llm_safeguard: LLM Safeguard dataset
  • strong_reject: Strong Reject dataset

Attack Options

  • none: No attack applied
  • strong_reject: Strong Reject attack variants
  • Additional custom attack variants provided as separate config files

Defense Options

  • sysformer: Full SystemFormer defense
  • sysformer_small / sysformer_lg: Smaller and larger SystemFormer variants
  • sysembedder: Embedding-only defense
  • none: No defense baseline

Judge Options

  • gpt4o: OpenAI GPT-4o
  • gpt4o-mini: OpenAI GPT-4o-mini

Language Models

  • Llama 2/3 variants (7B, 13B)
  • Mistral 7B
  • Phi models
  • HuggingFace Zephyr

Support for PEFT (Parameter-Efficient Fine-Tuning) as a baseline:

  • LoRA (Low-Rank Adaptation)
  • AdaLora
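
The groups above compose into a single run config via Hydra's defaults list. A hypothetical top-level file, with structure and field names inferred from the override examples elsewhere in this README rather than copied from the repository, might look like:

```yaml
# configs/config_train.yaml (illustrative sketch, not the actual file)
defaults:
  - dataset: jbb_behaviors
  - attack: none
  - defense: sysformer
  - judge: gpt4o-mini
  - language_model: llama3.1-8b

train_params:
  num_epochs: 10
  batch_size: 16
  learning_rate: 1e-4
```

Any of these fields can then be swapped from the command line (e.g. `dataset=harm_bench` or `train_params.batch_size=8`) without editing the file.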

Usage

Training

Train a defense model using a specific configuration:

python run.py --config-name config_train_strong.yaml \
  dataset=jbb_behaviors \
  attack=none \
  judge=gpt4o-mini \
  language_model=llama3.1-8b

Example with custom parameters:

python run.py --config-name config_train_strong_add.yaml \
  dataset=jbb_behaviors \
  attack=strong_reject_train \
  judge=gpt4o \
  language_model=llama2-7b \
  language_model.peft=lora

Testing

Evaluate a trained defense model:

python test.py --config-name config_train_strong.yaml \
  dataset=jbb_behaviors \
  attack=none \
  judge=gpt4o-mini \
  language_model=llama3.1-8b

Configuration Override

You can override any configuration parameter from the command line:

python run.py --config-name config_train.yaml \
  dataset=jbb_behaviors \
  train_params.num_epochs=10 \
  train_params.batch_size=16 \
  train_params.learning_rate=1e-4

Batch Training with run_all.sh

The run_all.sh script automates comprehensive training runs across multiple LLMs, datasets, and defense mechanisms with various hyperparameter configurations. This is useful for conducting large-scale experiments and ablation studies.

Script Configuration (editable variables):

  • datasets: Array of datasets to train on (e.g., jbb_behaviors)
  • llms: Array of language models (e.g., llama3.1-8b, llama2-7b, mistral7b_instruct2, zephyr7b, phi3.5mini)
  • defenses: Array of defense mechanisms (e.g., sysformer, sysembedder)
  • LR: Learning rate (default: 0.0001)
  • CONFIG: Configuration file name (default: config_train)
  • NUM_EPOCHS: Number of training epochs (default: 10)
  • VALIDATE_AFTER: Validation frequency (default: 2)
  • BATCH_SIZE: Batch size (default: 8)
  • ATTACK: Attack type (default: none)
  • SAVE_PRE: Prefix for save directories

Tested Configurations: The script trains with multiple loss weight combinations:

  • Safe loss weights: 0.2, 0.5, 1.0
  • Self-safe training: enabled/disabled variants
  • SFT (Supervised Fine-Tuning): enabled with add_sft=true
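
The sweep amounts to a Cartesian product over these settings. The sketch below enumerates the grid in Python to make the run count explicit; the variable names mirror the script's arrays, but treat the exact combination logic as an assumption rather than a line-by-line translation of run_all.sh.

```python
from itertools import product

# Illustrative sketch of the sweep run_all.sh performs: one training run per
# combination of model, defense, safe-loss weight, and self-safe variant.
llms = ["llama3.1-8b", "llama2-7b", "mistral7b_instruct2", "zephyr7b", "phi3.5mini"]
defenses = ["sysformer", "sysembedder"]
safe_loss_weights = [0.2, 0.5, 1.0]
self_safe = [True, False]

runs = [
    {"llm": m, "defense": d, "safe_w": w, "self_safe": s, "add_sft": True}
    for m, d, w, s in product(llms, defenses, safe_loss_weights, self_safe)
]
print(len(runs))  # 5 models x 2 defenses x 3 weights x 2 variants = 60 runs
```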

Usage:

bash run_all.sh

To modify configurations, edit the variables at the top of the script before running.

Batch Testing with test_jb.sh

The test_jb.sh script evaluates trained defense models across multiple datasets, LLMs, and defense mechanisms. It systematically tests models against various jailbreak attack strategies and measures their robustness.

Script Configuration (editable variables):

  • datasets: Array of test datasets (e.g., jbb_behaviors, strong_reject)
  • llms: Array of language models to evaluate (e.g., phi3.5mini, mistral7b_instruct2, zephyr7b, llama3.1-8b, llama2-7b)
  • defenses: Array of defense mechanisms (e.g., sysformer, sysembedder, none)
  • LR: Learning rate (default: 0.0001)
  • NUM_EPOCHS: Number of epochs (default: 10)
  • BATCH_SIZE: Batch size (default: 16)
  • VALIDATE_AFTER: Validation frequency (default: 2)
  • CONFIG: Configuration file (default: config_train_strong_add)
  • ATTACK: Attack strategy (default: strong_reject)
  • SAVE_PRE: Prefix for model save directories (default: JB)

Testing Strategy:

  1. First tests baseline model (defense=none) for each dataset and LLM combination
  2. Then evaluates each defense mechanism with multiple loss weight configurations:
    • Safe loss weights: 0.2, 0.5, 1.0
    • Self-safe training: enabled/disabled variants
    • All with SFT enabled

Usage:

bash test_jb.sh

Output: Test results are logged to W&B (Weights & Biases) and saved in configured directories. Results include:

  • Attack success rates for each defense
  • Comparison metrics across different hyperparameter settings
  • Per-dataset and per-LLM performance statistics

Datasets

The framework integrates multiple jailbreak and harm datasets:

  • JailbreakBench (JBB): Comprehensive benchmark of 200 behaviors (100 harmful, 100 benign)
  • Strong Reject: The StrongREJECT benchmark of harmful prompts

Each dataset includes:

  • Harmful behaviors/prompts
  • Associated metadata and classifications
  • Jailbreak variants (for robustness testing)

Models

Supported language models span various sizes and architectures:

  • Llama: 7B (Llama 2 and 3.1)
  • Mistral: 7B, Instruct variants
  • Phi: 3.5-mini, 4-mini
  • Zephyr: 7B

Models can be fine-tuned with PEFT methods for efficiency.

Defense Mechanisms

SystemFormer

A multi-layer defense that:

  1. Encodes input text using semantic encoders
  2. Applies transformer layers to detect and mitigate harmful patterns
  3. Projects representations back to language model space
  4. Filters or modifies generations based on detected risks
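
The data flow of these four stages can be sketched as a composition of functions. The toy implementations below (character codes as "embeddings", a mean score as "risk") are pure stand-ins to show how the stages chain together; none of these function bodies reflect the actual model.

```python
# Toy data-flow sketch of the four stages above. The real model operates on
# tensors inside a frozen LLM; every function body here is an illustrative
# stand-in, not the actual implementation.

def encode(text):                      # 1. semantic encoding of the input
    return [float(ord(c)) for c in text[:8]]

def transform(embedding):              # 2. transformer layers (stand-in: scaling)
    return [x / 255.0 for x in embedding]

def project_to_lm_space(hidden):       # 3. projection back to LM dimensions
    return {"soft_prompt": hidden}

def filter_generation(lm_input, risk_threshold=0.5):   # 4. risk-based filtering
    risk = sum(lm_input["soft_prompt"]) / max(len(lm_input["soft_prompt"]), 1)
    return "refuse" if risk > risk_threshold else "respond"

decision = filter_generation(project_to_lm_space(transform(encode("hello"))))
```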

SysEmbedder

Lightweight defense using only embedding-based representations without full transformer layers.

Default System Prompt

A no-training baseline that relies on the language model's default system prompt as its only defense.

Features

  • Modular Design: Easy to swap components (models, datasets, losses)
  • Hydra Integration: Flexible configuration without code changes
  • Weights & Biases: Automatic experiment tracking and logging
  • Multi-GPU Support: Distributed training capabilities
  • Multiple Evaluation Modes: Manual and automatic evaluation with various judges
  • Checkpointing: Model saving and resumption
  • Extensible: Easy to add new defenses, attacks, and datasets

Notes

  • Requires GPU for efficient training and inference
  • Experiment tracking requires Weights & Biases account (can be disabled)
  • Some evaluations require API access (GPT-4o via OpenAI)
  • Configuration files are composable - combine base configs with overrides
  • Training configurations can be customized by creating new YAML files in configs/

References

This project builds on the JailbreakBench benchmark framework. For more information on jailbreak attacks and defenses, refer to their documentation and papers.

Citation

@article{sharma2025sysformer,
  title={Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts},
  author={Sharma, Kartik and Jin, Yiqiao and Rakesh, Vineeth and Dou, Yingtong and Pan, Menghai and Das, Mahashweta and Kumar, Srijan},
  journal={International Conference on Learning Representations (ICLR)},
  year={2026}
}
