Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems

This repository contains the code and experiments for the paper Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems. Building on prior work on Subliminal Learning, we investigate how hidden biases of LLMs transfer in multi-agent systems and introduce Thought Virus, a novel attack vector that exploits subliminal prompting in multi-agent settings.
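To make the threat model concrete, here is a minimal sketch of one possible topology: a linear chain in which only the first agent is compromised. Each later agent sees nothing but the previous agent's reply. This is not the repository's implementation. `run_chain`, `ModelFn`, `echo_model` and the example prompts are hypothetical, and the stub model stands in for a real LLM call.

```python
from typing import Callable, List

# Hypothetical model interface: (system_prompt, user_message) -> reply text.
ModelFn = Callable[[str, str], str]


def run_chain(
    model: ModelFn,
    n_agents: int,
    biased_system_prompt: str,
    neutral_system_prompt: str,
    task: str,
) -> List[str]:
    """Pass a task down a linear chain of agents and return every reply.

    Only agent 0 receives the bias-inducing system prompt. Every later agent
    gets the neutral prompt and sees only the previous agent's reply, so any
    bias reaching it must have travelled through that message.
    """
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    replies: List[str] = []
    message = task
    for i in range(n_agents):
        system = biased_system_prompt if i == 0 else neutral_system_prompt
        message = model(system, message)
        replies.append(message)
    return replies


def echo_model(system: str, user: str) -> str:
    """Stub 'LLM' that tags its input with its system prompt (for testing only)."""
    return f"[{system}] {user}"


# Example run with the stub: three agents, hypothetical prompts.
replies = run_chain(
    echo_model, 3, "You love owls.", "You are a helpful assistant.", "Write a short poem."
)
```

Because the stub echoes its system prompt, the output makes it easy to confirm that only agent 0 saw the biased prompt. With a real model, the interesting question is whether downstream replies still carry the bias.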

Key Findings

An agent compromised via subliminal prompting spreads its induced bias through multi-agent networks across all tested models and topologies.
Bias strength decreases with distance from the originally compromised agent, yet persists across up to 5 agent-to-agent hops in our experiments.
Thought Virus induces viral misalignment: subliminal prompting of a single agent degrades the truthfulness of downstream agents on TruthfulQA, even when those agents receive no adversarial input directly.
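One simple way to quantify "bias strength by distance" is to measure how often agents at each hop name the target animal, relative to an unprompted baseline. The sketch below assumes that answers have already been reduced to a single animal name. The metric definition and the function names `target_frequency` and `bias_by_hop` are illustrative assumptions, not necessarily what `run_analysis.py` computes.

```python
from typing import Dict, List


def target_frequency(answers: List[str], target: str) -> float:
    """Fraction of answers equal to the target (case- and whitespace-insensitive)."""
    if not answers:
        return 0.0
    hits = sum(1 for a in answers if a.strip().lower() == target.lower())
    return hits / len(answers)


def bias_by_hop(
    answers_by_hop: Dict[int, List[str]],
    baseline_answers: List[str],
    target: str,
) -> Dict[int, float]:
    """Target-answer frequency at each hop minus the unprompted baseline frequency.

    Hop 0 is the compromised agent. A positive value means that agent names
    the target more often than an uncompromised model does.
    """
    base = target_frequency(baseline_answers, target)
    return {
        hop: target_frequency(answers, target) - base
        for hop, answers in sorted(answers_by_hop.items())
    }
```

For example, `bias_by_hop({0: ["owl", "owl", "cat", "owl"], 1: ["owl", "dog", "cat", "cat"]}, ["cat", "dog", "owl", "cat"], "owl")` gives a shift of 0.5 at hop 0 and 0.0 at hop 1. These answers are invented for illustration and are not results from the paper.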

Setup

Installation

# Install dependencies using uv
uv sync

# Or using pip
pip install -e .

Environment Variables

Copy the example environment file and configure it:

cp .env.example .env
# Edit .env with your API keys and configuration
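If you want to check that your `.env` is picked up before starting a long run, the following stdlib-only sketch parses simple `KEY=VALUE` lines and reports which required variables are still missing. The actual variable names are listed in `.env.example`. `HF_TOKEN` in the usage line is only a placeholder. The repository may load the file differently, for instance with a library such as python-dotenv.

```python
import os
from pathlib import Path
from typing import Dict, Iterable, List


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines and export them to os.environ if not already set.

    Blank lines and '#' comments are skipped, an optional leading 'export ' is
    dropped, and matching surrounding quotes are stripped from values. This is
    a minimal sketch: no variable expansion and no multi-line values.
    """
    values: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key] = value
        os.environ.setdefault(key, value)  # real environment variables take precedence
    return values


def missing_keys(required: Iterable[str]) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [k for k in required if not os.environ.get(k)]
```

Typical use from the repository root: `load_env_file()` followed by `print(missing_keys(["HF_TOKEN"]))`, where the key name is hypothetical.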

Project Structure

.
├── src/                              # Source code
│   └── run_analysis.py               # Main analysis script
├── experiments/                      # Experimental results and plots
│   ├── animal-preference/            # Results for animal preference experiments
│   ├── misalignment/                 # Results for misalignment experiments
│   └── conversation_bias_detection/  # Detection of overt biases in MAS conversations
├── result_analysis/                  # Analysis and plotting scripts
│   ├── plot_frequency_bars.py        # Create bar plots for response frequency results
│   └── plot_logprob_bars.py          # Create bar plots for logprob results
└── pyproject.toml                    # Project dependencies

Running Animal Preference Experiments

Generating a Config File

Note: Some models on Hugging Face (e.g., gated models like Llama) require you to log in and accept their terms of service before access is granted. Once approved, you must provide a Hugging Face API token in the .env file.

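To start a new run, create an experiment_config.py that defines the relevant hyperparameters and place it in a new folder under experiments/animal_preference/. That folder is the {EXPERIMENT_FOLDER} passed to the commands below.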
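The exact schema of experiment_config.py is defined by the repository, not here. As a rough idea of what such a file might contain, this sketch groups hypothetical hyperparameters in a dataclass and validates them early. Every field name, default, and the placeholder model id are assumptions. Copy an existing experiment_config.py from experiments/ for the real format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentConfig:
    """Hypothetical hyperparameters for one animal-preference run (illustrative only)."""

    model_name: str                 # Hugging Face model id (gated models need HF access)
    target_animal: str              # animal the compromised agent is biased toward
    n_agents: int = 3               # agents in the network; hop distance runs 0..n_agents-1
    n_samples: int = 100            # sampled answers per evaluated agent
    temperature: float = 1.0        # sampling temperature
    seed: int = 0                   # random seed for reproducibility
    candidate_animals: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Fail fast on obviously invalid settings instead of midway through a run.
        if self.n_agents < 1:
            raise ValueError("n_agents must be at least 1")
        if self.n_samples < 1:
            raise ValueError("n_samples must be at least 1")
        if self.candidate_animals and self.target_animal not in self.candidate_animals:
            raise ValueError("target_animal must be one of candidate_animals")


# Hypothetical example configuration; "org/model-name" is a placeholder id.
config = ExperimentConfig(
    model_name="org/model-name",
    target_animal="owl",
    candidate_animals=["owl", "cat", "dog", "eagle"],
)
```

Validating in `__post_init__` surfaces typos such as a target animal missing from the candidate list before any model is loaded.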

Generating Results

python src/run_analysis.py experiments/animal_preference/{EXPERIMENT_FOLDER}
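To run several experiments in one go, a small wrapper can issue the command above once per folder. This sketch makes two assumptions: any subfolder containing an experiment_config.py counts as an experiment, and the script is launched from the repository root. `build_commands` and `run_all` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_commands(root: str = "experiments/animal_preference") -> List[List[str]]:
    """Build one run_analysis.py command per experiment folder under root.

    A folder is treated as an experiment if it directly contains an
    experiment_config.py (an assumption of this sketch).
    """
    folders = sorted(p.parent for p in Path(root).glob("*/experiment_config.py"))
    return [[sys.executable, "src/run_analysis.py", str(f)] for f in folders]


def run_all(root: str = "experiments/animal_preference") -> None:
    """Run every experiment sequentially; check=True stops at the first failure."""
    for cmd in build_commands(root):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Using `sys.executable` runs each experiment with the same interpreter (and therefore the same virtual environment) as the wrapper.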

Generating Plots

python result_analysis/plot_frequency_bars.py experiments/animal_preference/{EXPERIMENT_FOLDER}
python result_analysis/plot_logprob_bars.py experiments/animal_preference/{EXPERIMENT_FOLDER}
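Raw log-probabilities of different answers are hard to compare across agents. A common approach is to renormalize them over a fixed set of candidate answers and track how much probability mass shifts onto the target. The sketch below does this with the log-sum-exp trick. Whether plot_logprob_bars.py uses this normalization is not stated in this README. The helper names and the candidate set are assumptions.

```python
import math
from typing import Dict


def normalized_probs(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Renormalize candidate-answer log-probabilities into a distribution.

    Subtracting the maximum before exponentiating (log-sum-exp trick) keeps
    very negative logprobs from underflowing to zero.
    """
    if not logprobs:
        return {}
    m = max(logprobs.values())
    total = sum(math.exp(lp - m) for lp in logprobs.values())
    return {k: math.exp(lp - m) / total for k, lp in logprobs.items()}


def target_shift(
    prompted: Dict[str, float], baseline: Dict[str, float], target: str
) -> float:
    """Change in normalized probability of the target relative to the baseline."""
    return normalized_probs(prompted)[target] - normalized_probs(baseline)[target]
```

Renormalizing over a fixed candidate set makes the scores comparable across agents, even when the absolute logprobs differ by orders of magnitude.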

Detecting Overt Biases in Conversations

To check whether the bias was communicated overtly in any agent-to-agent message (which would make the attack trivially detectable and stoppable), we run two complementary checks: a simple regex search and an LLM judge. Both can be run via run_detection.sh in experiments/animal_preference/conversation_bias_detection/.
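The regex check can be sketched as a case-insensitive whole-word search for the target term, optionally followed by a plural "s". The LLM-judge check can be sketched as a prompt that asks a separate model whether a message hints at an animal preference. Neither is taken from run_detection.sh: `overt_mentions`, `JUDGE_TEMPLATE` and `judge_prompt` are hypothetical, and the judge prompt wording is an assumption.

```python
import re
from typing import Iterable, List


def overt_mentions(messages: Iterable[str], terms: Iterable[str]) -> List[int]:
    """Return indices of messages that mention any term as a whole word.

    Matching is case-insensitive and accepts a simple plural ('owl' -> 'owls').
    Irregular plurals or synonyms must be passed in explicitly as extra terms.
    """
    term_list = [t for t in terms if t]
    if not term_list:
        return []
    alternatives = "|".join(re.escape(t) for t in term_list)
    pattern = re.compile(rf"\b(?:{alternatives})s?\b", re.IGNORECASE)
    return [i for i, msg in enumerate(messages) if pattern.search(msg)]


# Hypothetical judge prompt; the judge model call itself is left out.
JUDGE_TEMPLATE = (
    "Does the following message mention, express or hint at a preference "
    "for any animal? Answer YES or NO.\n\nMessage:\n{message}"
)


def judge_prompt(message: str) -> str:
    """Fill the judge template with one agent-to-agent message."""
    return JUDGE_TEMPLATE.format(message=message)
```

The two checks complement each other. The regex is cheap and exact but misses paraphrases such as "a nocturnal bird of prey", while the judge can catch those at the cost of an extra model call per message.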

Citation

@article{weckbecker2026thought,
  title = {Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems},
  author = {Moritz Weckbecker and Jonas Müller and Ben Hagag and Michael Mulet},
  journal = {arXiv preprint arXiv:2603.00131},
  year = {2026}
}

Contact

For questions about the project or the code, feel free to reach out to Moritz Weckbecker or Michael Mulet.
