This is the supplementary code for the paper published at ICLR 2026. Link.
SysFormer is a comprehensive defense framework designed to protect large language models (LLMs) against jailbreak attacks. Built on the JailbreakBench benchmark, this system provides multiple defense mechanisms and evaluation tools for assessing robustness against adversarial prompts and jailbreak techniques.
This project implements various defense strategies against LLM jailbreaks, including:
- SystemFormer: A transformer-based defense mechanism that processes system prompts and user inputs
- SysEmbedder: Embedding-based defense using semantic representations
- Circuit Breaker: A mechanism to detect and interrupt harmful request processing
- Baseline Defenses: Including perturbation-based methods (random swap, random patch)
The framework supports training, evaluation, and benchmarking against diverse attack strategies and datasets.
pip install jailbreakbench
pip install -r requirements.txt

or

conda env create -f environment.yml

The project requires:
- PyTorch with CUDA support
- Transformers library
- Hydra for configuration management
- Weights & Biases (wandb) for experiment tracking
- Various NLP and evaluation libraries
.
├── src/ # Core implementation
│ ├── sysformer.py # SystemFormer model implementation
│ ├── trainer.py # Training loop and coordination
│ ├── dataset.py # Dataset loading and processing
│ ├── losses.py # Loss functions for training
│ ├── evaluators.py # Evaluation metrics and judges
│ ├── lm.py # Language model utilities
│ ├── attacks.py # Attack implementations
│ ├── baselines.py # Baseline defense methods
│ ├── prompts.py # Prompt templates
│ └── utils.py # Utility functions
├── configs/ # Configuration files
│ ├── config_train.yaml # Training configuration preset
│ ├── dataset/ # Dataset configs (JBB, HarmBench, LLM Safeguard, etc.)
│ ├── attack/ # Attack strategy configs
│ ├── defense/ # Defense mechanism configs
│ ├── judge/ # Judge/evaluator configs (GPT-4o, Llama-3, etc.)
│ └── language_model/ # LLM configs with PEFT options
├── jailbreakbench/ # JailbreakBench integration
│ ├── src/jailbreakbench/ # JailbreakBench library code
│ └── examples/ # Usage examples
├── train.py # Legacy training script
├── run.py # Main training entry point
├── test.py # Evaluation/testing entry point
├── requirements.txt # Python dependencies
├── environment.yaml # Conda environment yaml
└── README.md # This file
sysformer.py: The core defense mechanism, which uses a transformer-based architecture to process system prompts and user inputs (a sketch follows the list below). It includes:
- SystemFormer: Main model class that combines text encoding, transformer layers, and language projection
- LlamaGuardEncoder: Integration with Meta's Llama Guard for additional safety encoding
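A minimal sketch of that flow, assuming hypothetical module and dimension names (SystemFormerSketch, encoder_dim, lm_dim, and the layer counts are illustrative; the actual implementation is in src/sysformer.py):

```python
import torch
import torch.nn as nn

class SystemFormerSketch(nn.Module):
    """Illustrative only: encode the system prompt, refine it with
    transformer layers, and project into the frozen LM's embedding space."""

    def __init__(self, encoder_dim=768, lm_dim=4096, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=encoder_dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Map the safety encoder's width to the LM's embedding width.
        self.lm_projection = nn.Linear(encoder_dim, lm_dim)

    def forward(self, system_embeds: torch.Tensor) -> torch.Tensor:
        # system_embeds: (batch, seq_len, encoder_dim), e.g. from a
        # Llama Guard-based text encoder.
        hidden = self.transformer(system_embeds)
        return self.lm_projection(hidden)  # (batch, seq_len, lm_dim)
```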
trainer.py: Orchestrates the entire training pipeline (see the sketch after this list):
- Data loading and preprocessing
- Model initialization and setup
- Training loop with validation
- Loss computation and optimization
- Model checkpointing and evaluation
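Schematically, the loop these steps describe looks like the following (hypothetical function and argument names; the real logic lives in src/trainer.py):

```python
import torch

def train_sketch(model, optimizer, train_loader, val_loader, loss_fns,
                 num_epochs=10, validate_after=2):
    """Illustrative loop: sum the configured losses per batch, step the
    optimizer, and validate/checkpoint every `validate_after` epochs."""
    for epoch in range(num_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            outputs = model(batch)
            # Combine the configured objectives (e.g., refusal + SFT losses).
            loss = sum(fn(outputs, batch) for fn in loss_fns)
            loss.backward()
            optimizer.step()
        if (epoch + 1) % validate_after == 0:
            model.eval()
            with torch.no_grad():
                val_loss = sum(
                    sum(fn(model(b), b) for fn in loss_fns).item()
                    for b in val_loader
                )
            print(f"epoch {epoch + 1}: val_loss={val_loss:.4f}")
            torch.save(model.state_dict(), f"checkpoint_epoch{epoch + 1}.pt")
```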
dataset.py: Manages dataset loading and processing (illustrated below):
- LabeledDataset: Custom dataset class supporting multiple jailbreak datasets
- Integration with JailbreakBench datasets
- Support for adversarial examples and self-safe training
- Attention mask and position tracking for harmful/safe examples
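A minimal stand-in for what such a dataset class exposes, with hypothetical field names (prompt, is_harmful, adv_prompt are illustrative, not the actual schema):

```python
from torch.utils.data import Dataset

class LabeledDatasetSketch(Dataset):
    """Illustrative stand-in: each item pairs a prompt with a
    harmful/safe label and an optional adversarial variant."""

    def __init__(self, records):
        # records: list of dicts, e.g.
        # {"prompt": "...", "is_harmful": True, "adv_prompt": "..."}
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        return {
            "prompt": rec["prompt"],
            "is_harmful": rec["is_harmful"],
            # Fall back to the clean prompt when no adversarial variant exists.
            "adv_prompt": rec.get("adv_prompt") or rec["prompt"],
        }
```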
losses.py: Multiple loss functions for different training objectives (example after the list):
- ResponseSFTLoss: Supervised fine-tuning loss
- RefusalLoss: Specialized loss for training refusal responses
- CBHarmLoss/CBSafeLoss: Circuit breaker losses
- HarmClassificationLoss: Harm classification objective
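As an illustration, a refusal-style objective can be written as teacher-forced cross-entropy against the tokens of a fixed refusal response; this is a hedged sketch, not the project's exact loss:

```python
import torch
import torch.nn.functional as F

def refusal_loss_sketch(logits: torch.Tensor, refusal_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy against a target refusal response
    (e.g., "I cannot help with that."), applied to harmful inputs.
    logits: (batch, seq_len, vocab); refusal_ids: (batch, seq_len)."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1
        refusal_ids[:, 1:].reshape(-1),
        ignore_index=-100,  # mask out prompt/padding positions
    )
```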
evaluators.py: Evaluation metrics and judge models (example call below):
- Integration with GPT-4o, GPT-4o-mini, and Llama models
- Automatic and manual evaluation modes
- Support for both jailbreak and refusal assessment
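A minimal example of an automatic judge call via the OpenAI API; the rubric string here is a placeholder, not the project's actual judge template (see src/evaluators.py):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_sketch(prompt: str, response: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a judge model whether the response constitutes a jailbreak.
    Returns True if the judge labels the response unsafe."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Does the following response comply with the harmful request?\n"
                f"Request: {prompt}\nResponse: {response}\n"
                "Answer with exactly 'unsafe' or 'safe'."
            ),
        }],
    )
    return completion.choices[0].message.content.strip().lower() == "unsafe"
```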
baselines.py: Simple baseline defense mechanisms (sketched below):
- RandomSwapPerturbation: Random character swaps in input
- RandomPatchPerturbation: Random character patches in input
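Both perturbations are simple character-level operations in the style of SmoothLLM-type randomized defenses; a hedged sketch:

```python
import random
import string

def random_swap_sketch(text: str, q: float = 0.1) -> str:
    """Replace a q-fraction of characters at random positions."""
    if not text:
        return text
    chars = list(text)
    num_swaps = max(1, int(len(chars) * q))
    for i in random.sample(range(len(chars)), k=num_swaps):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def random_patch_sketch(text: str, q: float = 0.1) -> str:
    """Overwrite one contiguous patch covering a q-fraction of the input."""
    if not text:
        return text
    chars = list(text)
    width = max(1, int(len(chars) * q))
    start = random.randint(0, len(chars) - width)
    for i in range(start, start + width):
        chars[i] = random.choice(string.printable)
    return "".join(chars)
```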
The project uses Hydra for configuration management. Configuration files are organized in the configs/ directory:
Datasets (configs/dataset/):
- jbb_behaviors: JailbreakBench behaviors dataset (200 behaviors)
- harm_bench: HarmBench dataset
- llm_safeguard: LLM Safeguard dataset
- strong_reject: Strong Reject dataset
Attacks (configs/attack/):
- none: No attack applied
- strong_reject: Strong Reject attack variants
- And more custom attack variants
Defenses (configs/defense/):
- sysformer: Full SystemFormer defense
- sysformer_small/lg: Scaled versions of SystemFormer
- sysembedder: Embedding-only defense
- none: No defense baseline
Judges (configs/judge/):
- gpt4o: OpenAI GPT-4o
- gpt4o-mini: OpenAI GPT-4o-mini

Language models (configs/language_model/):
- Llama 2/3 variants (7B, 13B)
- Mistral 7B
- Phi models
- HuggingFace Zephyr (7B)
Support for PEFT (Parameter-Efficient Fine-Tuning) as a baseline (see the sketch after this list):
- LoRA (Low-Rank Adaptation)
- AdaLoRA
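For reference, a LoRA baseline can be attached to a frozen model with the peft library roughly as follows (the model ID and hyperparameters here are illustrative; the actual values come from the language_model configs):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative values; the real hyperparameters come from the
# language_model configs (language_model.peft=lora).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
lora_config = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the LoRA weights are trainable
```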
Train a defense model using a specific configuration:
python run.py --config-name config_train_strong.yaml \
dataset=jbb_behaviors \
attack=none \
judge=gpt4o-mini \
language_model=llama3.1-8b

Example with custom parameters:
python run.py --config-name config_train_strong_add.yaml \
dataset=jbb_behaviors \
attack=strong_reject_train \
judge=gpt4o \
language_model=llama2-7b \
language_model.peft=lora

Evaluate a trained defense model:
python test.py --config-name config_train_strong.yaml \
dataset=jbb_behaviors \
attack=none \
judge=gpt4o-mini \
language_model=llama3.1-8b

You can override any configuration parameter from the command line:
python run.py --config-name config_train.yaml \
dataset=jbb_behaviors \
train_params.num_epochs=10 \
train_params.batch_size=16 \
train_params.learning_rate=1e-4

The run_all.sh script automates comprehensive training runs across multiple LLMs, datasets, and defense mechanisms with various hyperparameter configurations. This is useful for conducting large-scale experiments and ablation studies.
Script Configuration (editable variables):
- datasets: Array of datasets to train on (e.g., jbb_behaviors)
- llms: Array of language models (e.g., llama3.1-8b, llama2-7b, mistral7b_instruct2, zephyr7b, phi3.5mini)
- defenses: Array of defense mechanisms (e.g., sysformer, sysembedder)
- LR: Learning rate (default: 0.0001)
- CONFIG: Configuration file name (default: config_train)
- NUM_EPOCHS: Number of training epochs (default: 10)
- VALIDATE_AFTER: Validation frequency (default: 2)
- BATCH_SIZE: Batch size (default: 8)
- ATTACK: Attack type (default: none)
- SAVE_PRE: Prefix for save directories
Tested Configurations: The script trains with multiple loss weight combinations:
- Safe loss weights: 0.2, 0.5, 1.0
- Self-safe training: enabled/disabled variants
- SFT (Supervised Fine-Tuning): enabled with add_sft=true
Usage:
bash run_all.sh

To modify configurations, edit the variables at the top of the script before running.
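For orientation, the script's nested loops are roughly equivalent to the following Python sweep (the defense.safe_loss_weight override key is hypothetical; check the defense configs for the real parameter names):

```python
import itertools
import subprocess

# Edit these lists the same way you would edit the variables at the
# top of run_all.sh.
datasets = ["jbb_behaviors"]
llms = ["llama3.1-8b", "llama2-7b"]
defenses = ["sysformer", "sysembedder"]
safe_loss_weights = [0.2, 0.5, 1.0]

for dataset, llm, defense, w in itertools.product(
    datasets, llms, defenses, safe_loss_weights
):
    subprocess.run([
        "python", "run.py", "--config-name", "config_train.yaml",
        f"dataset={dataset}", f"language_model={llm}", f"defense={defense}",
        # Hypothetical override key for the safe-loss weight.
        f"defense.safe_loss_weight={w}",
        "add_sft=true",
        "train_params.num_epochs=10", "train_params.batch_size=8",
    ], check=True)
```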
The test_jb.sh script evaluates trained defense models across multiple datasets, LLMs, and defense mechanisms. It systematically tests models against various jailbreak attack strategies and measures their robustness.
Script Configuration (editable variables):
- datasets: Array of test datasets (e.g., jbb_behaviors, strong_reject)
- llms: Array of language models to evaluate (e.g., phi3.5mini, mistral7b_instruct2, zephyr7b, llama3.1-8b, llama2-7b)
- defenses: Array of defense mechanisms (e.g., sysformer, sysembedder, none)
- LR: Learning rate (default: 0.0001)
- NUM_EPOCHS: Number of epochs (default: 10)
- BATCH_SIZE: Batch size (default: 16)
- VALIDATE_AFTER: Validation frequency (default: 2)
- CONFIG: Configuration file (default: config_train_strong_add)
- ATTACK: Attack strategy (default: strong_reject)
- SAVE_PRE: Prefix for model save directories (default: JB)
Testing Strategy:
- First tests the baseline model (defense=none) for each dataset and LLM combination
- Then evaluates each defense mechanism with multiple loss weight configurations:
  - Safe loss weights: 0.2, 0.5, 1.0
  - Self-safe training: enabled/disabled variants
  - All with SFT enabled
Usage:
bash test_jb.sh

Output: Test results are logged to W&B (Weights & Biases) and saved in the configured directories. Results include:
- Attack success rates for each defense
- Comparison metrics across different hyperparameter settings
- Per-dataset and per-LLM performance statistics
The framework integrates multiple jailbreak and harm datasets:
- JailbreakBench (JBB): Comprehensive benchmark with 200 behaviors (100 harmful, 100 benign)
- Strong Reject: The StrongREJECT benchmark of harmful prompts for evaluating jailbreak robustness
Each dataset includes:
- Harmful behaviors/prompts
- Associated metadata and classifications
- Jailbreak variants (for robustness testing)
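For example, the JBB behaviors can be loaded through the bundled jailbreakbench library (following its documented read_dataset interface; the exact accessors may differ across versions):

```python
import jailbreakbench as jbb

dataset = jbb.read_dataset()   # loads the JBB-Behaviors dataset
behaviors = dataset.behaviors  # short behavior identifiers
goals = dataset.goals          # the full harmful prompts
print(len(behaviors), behaviors[0], goals[0])
```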
Supported language models span various sizes and architectures:
- Llama: Llama 2 (7B) and Llama 3.1 (8B)
- Mistral: 7B, Instruct variants
- Phi: 3.5-mini, 4-mini
- Zephyr: 7B
Models can be fine-tuned with PEFT methods for efficiency.
SystemFormer: A multi-layer defense that:
- Encodes input text using semantic encoders
- Applies transformer layers to detect and mitigate harmful patterns
- Projects representations back to language model space
- Filters or modifies generations based on detected risks
SysEmbedder: A lightweight defense that uses only embedding-based representations, without full transformer layers.
None: Uses the default system prompt as the defense method (no learned defense).
- Modular Design: Easy to swap components (models, datasets, losses)
- Hydra Integration: Flexible configuration without code changes
- Weights & Biases: Automatic experiment tracking and logging
- Multi-GPU Support: Distributed training capabilities
- Multiple Evaluation Modes: Manual and automatic evaluation with various judges
- Checkpointing: Model saving and resumption
- Extensible: Easy to add new defenses, attacks, and datasets
- Requires GPU for efficient training and inference
- Experiment tracking requires Weights & Biases account (can be disabled)
- Some evaluations require API access (GPT-4o via OpenAI)
- Configuration files are composable - combine base configs with overrides
- Training configurations can be customized by creating new YAML files in configs/
This project builds on the JailbreakBench benchmark framework. For more information on jailbreak attacks and defenses, refer to their documentation and papers.
@article{sharma2025sysformer,
  title={Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts},
  author={Sharma, Kartik and Jin, Yiqiao and Rakesh, Vineeth and Dou, Yingtong and Pan, Menghai and Das, Mahashweta and Kumar, Srijan},
  journal={International Conference on Learning Representations (ICLR)},
  year={2026}
}