Merged
43 changes: 43 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,43 @@
name: Deploy Documentation

on:
  push:
    branches: [main]
    paths:
      - 'docs/**'
      - 'mkdocs.yml'
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install mkdocs-material "mkdocstrings[python]>=0.24"
      - run: pip install -e .
      - run: mkdocs build
      - uses: actions/upload-pages-artifact@v3
        with:
          path: site/

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4
22 changes: 12 additions & 10 deletions .github/workflows/python-tests.yml
@@ -1,8 +1,10 @@
 name: Python Tests

 on:
+  push:
+    branches: [main]
   pull_request:
-    branches: [ main ]
+    branches: [main]

 jobs:
   test:
@@ -12,24 +14,24 @@ jobs:
         python-version: ["3.12"]

     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4

       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
           cache: 'pip'

       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
           pip install -e ".[dev]"

       - name: Lint with ruff
         run: |
-          ruff format --check --diff .
-          ruff check --select I .
+          ruff format --check --diff lettucedetect/ tests/
+          ruff check lettucedetect/ tests/ --extend-exclude lettucedetect/integrations/

       - name: Test with pytest
         run: |
-          pytest tests/test_inference_pytest.py -v
+          pytest tests/test_inference_pytest.py -v
6 changes: 5 additions & 1 deletion .gitignore
@@ -170,6 +6,9 @@ cython_debug/
 # PyPI configuration file
 .pypirc

+# macOS
+.DS_Store
+
 # data/
 data/

@@ -178,4 +181,5 @@ output/
 temp/

 # cache/
-lettucedetect/cache/
+lettucedetect/cache/
+testing/
2 changes: 1 addition & 1 deletion docs/EUROBERT.md
@@ -1,7 +1,7 @@
# 🥬 LettuceDetect Goes Multilingual: Fine-tuning EuroBERT on Synthetic RAGTruth Translations

<p align="center">
-<img src="https://github.com/KRLabsOrg/LettuceDetect/blob/feature/cn_llm_eval/assets/lettuce_detective_multi.png?raw=true" alt="LettuceDetect Multilingual Task Force" width="520"/>
+<img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/lettuce_detective_multi.png?raw=true" alt="LettuceDetect Multilingual Task Force" width="520"/>
<br>
<em>Expanding hallucination detection across languages for RAG pipelines.</em>
</p>
13 changes: 0 additions & 13 deletions docs/EVALUATION.md

This file was deleted.

13 changes: 13 additions & 0 deletions docs/api/datasets.md
@@ -0,0 +1,13 @@
# Datasets

## HallucinationSample

::: lettucedetect.datasets.hallucination_dataset.HallucinationSample

## HallucinationData

::: lettucedetect.datasets.hallucination_dataset.HallucinationData

## HallucinationDataset

::: lettucedetect.datasets.hallucination_dataset.HallucinationDataset
13 changes: 13 additions & 0 deletions docs/api/detectors.md
@@ -0,0 +1,13 @@
# Detectors

## Factory

::: lettucedetect.detectors.factory.make_detector

## Base Detector

::: lettucedetect.detectors.base.BaseDetector

## Transformer Detector

::: lettucedetect.detectors.transformer.TransformerDetector
5 changes: 5 additions & 0 deletions docs/api/inference.md
@@ -0,0 +1,5 @@
# Inference

The main entry point for hallucination detection.

::: lettucedetect.models.inference.HallucinationDetector
9 changes: 9 additions & 0 deletions docs/api/training.md
@@ -0,0 +1,9 @@
# Training

## Trainer

::: lettucedetect.models.trainer.Trainer

## Evaluator

::: lettucedetect.models.evaluator
43 changes: 43 additions & 0 deletions docs/benchmarks.md
@@ -0,0 +1,43 @@
# Benchmarks

## RAGTruth (English)

Evaluated on the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) test set. This benchmark measures how well models detect hallucinations in LLM-generated text across QA, summarization, and data-to-text tasks.

### Example-Level Detection

Binary classification: does the answer contain any hallucination?

| Model | Type | Overall F1 |
|-------|------|-----------|
| GPT-4 | LLM (zero-shot) | 63.4% |
| Luna | Encoder | 65.4% |
| **lettucedetect-base-v1** | **Encoder (149M)** | **76.8%** |
| Llama-2-13B (fine-tuned) | LLM | 78.7% |
| **lettucedetect-large-v1** | **Encoder (395M)** | **79.2%** |
| RAG-HAT (Llama-3-8B) | LLM | 83.9% |

### Span-Level Detection

Span-level evaluation measures how precisely a model can locate the exact hallucinated text within an answer. On this harder task, LettuceDetect achieves state-of-the-art results among models that report span-level metrics, outperforming fine-tuned Llama-2-13B.

## What These Numbers Mean

- **Example F1** — Can the model tell if an answer has *any* hallucination? Higher is better.
- **Span F1** — Can the model point to *exactly which parts* are hallucinated? This is the harder task and where LettuceDetect excels relative to its size.
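To make the two bullets concrete, here is a minimal sketch of both metrics in Python (an illustration, not the official RAGTruth scorer; span F1 is scored here by character overlap):

```python
def example_f1(preds, golds):
    """Example-level F1: binary 'any hallucination?' per answer."""
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(g and not p for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)


def span_f1(pred_spans, gold_spans):
    """Span-level F1 over (start, end) character offsets, scored by
    how many characters the predicted and gold spans share."""
    pred = {c for s, e in pred_spans for c in range(s, e)}
    gold = {c for s, e in gold_spans for c in range(s, e)}
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(pred), overlap / len(gold)
    return 2 * prec * rec / (prec + rec)


print(span_f1([(10, 20)], [(12, 22)]))  # ~0.8: 8 shared chars out of 10 each
```

An answer can score high on example F1 while scoring poorly on span F1, which is why the two are reported separately.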

LettuceDetect models are 50-500x smaller than LLM-based detectors while achieving competitive or better accuracy.

## Citation

```bibtex
@misc{Kovacs:2025,
title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
author={Ádám Kovács and Gábor Recski},
year={2025},
eprint={2502.17125},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.17125},
}
```
154 changes: 154 additions & 0 deletions docs/code-hallucination/architecture-research.md
@@ -0,0 +1,154 @@
# Architecture Research: Detection Models for Code Hallucination

Research notes on model architectures for training on the code hallucination dataset. We compare four approaches ranging from fast encoder-based classifiers to generative span detectors.

## Approach A: Token Classification (Encoder)

**Architecture:** ModernBERT/EuroBERT + linear classification head

The current LettuceDetect approach. Each answer token gets a binary label (0=supported, 1=hallucinated). Consecutive hallucinated tokens are merged into spans at inference.

```
Input: [CLS] context [SEP] question [SEP] answer [SEP]
Output: [-100, -100, ..., 0, 0, 1, 1, 1, 0, 0, ...]
^^^^^^^^^ hallucinated span
```
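The merge-at-inference step can be sketched in a few lines. The offsets below are hypothetical, and this is an illustration of the idea rather than LettuceDetect's actual implementation:

```python
def labels_to_spans(labels, offsets):
    """Merge runs of tokens labeled 1 into character-level spans.

    labels:  per-token predictions (0 = supported, 1 = hallucinated,
             -100 = ignored context/question tokens)
    offsets: per-token (start, end) character offsets, e.g. a fast HF
             tokenizer's offset_mapping
    """
    spans, current = [], None
    for label, (start, end) in zip(labels, offsets):
        if label == 1:
            if current is None:
                current = [start, end]    # open a new span
            else:
                current[1] = end          # extend the open span
        elif current is not None:
            spans.append(tuple(current))  # any non-1 label closes it
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


labels = [-100, -100, 0, 0, 1, 1, 1, 0]
offsets = [(0, 0), (0, 0), (0, 4), (5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]
print(labels_to_spans(labels, offsets))  # [(10, 24)]
```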

| Property | Value |
|----------|-------|
| **Models** | ModernBERT-base (149M), ModernBERT-large (395M), EuroBERT (210M-2.1B) |
| **Context** | 8K tokens |
| **Inference** | Single forward pass, 30-60 samples/sec on A100 |
| **Training** | Standard token classification, CrossEntropyLoss |
| **Validated by** | LettuceDetect (79.2% F1), HaluGate (vLLM), PsiloQA (EMNLP 2025) |

**Strengths:** Fast, simple, production-ready. Handles long contiguous spans well.
**Weaknesses:** No code-specific pretraining. Cannot explain *why* something is hallucinated.

---

## Approach B: Token Classification (Decoder LLM)

**Architecture:** Qwen3.5-2B + bidirectional attention (LLM2Vec) + linear head

Use a decoder LLM pretrained on massive code corpora, convert to bidirectional encoder via [LLM2Vec](https://arxiv.org/abs/2404.05961), then add a token classification head.

```
Step 1: Load Qwen3.5-2B base (2B params, code-heavy pretraining)
Step 2: Enable bidirectional attention (remove causal mask)
Step 3: Short MNTP adaptation (masked next token prediction with LoRA)
Step 4: Add linear head (hidden_dim=2048 → 2 classes)
Step 5: Fine-tune on code hallucination dataset with LoRA
```

| Property | Value |
|----------|-------|
| **Model** | Qwen3.5-2B (2B params) |
| **Context** | 262K native (practically limited by GPU memory) |
| **Inference** | Single forward pass, ~5-15 samples/sec |
| **VRAM** | ~5-8GB in bf16 |
| **Reference** | [Looking Right is Sometimes Right (ACL 2024)](https://arxiv.org/abs/2401.14556) — 0.947 F1 on NER with mask removal |

**Strengths:** Deep code understanding from pretraining. Bidirectional attention after conversion.
**Weaknesses:** 5x larger than ModernBERT. Requires LLM2Vec conversion step. Novel (unvalidated for hallucination detection).

**Key insight:** The [ACL 2024 paper](https://arxiv.org/abs/2401.14556) showed decoder LLMs with causal mask removal reach 0.947 F1 on NER, significantly above RoBERTa-large (0.900). The gains come from combining rich pretrained representations with bidirectional context.
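What Step 2 ("enable bidirectional attention") changes can be illustrated with plain boolean masks, independent of any real model code:

```python
def attention_mask(seq_len, causal=True):
    """mask[i][j] is True if position i may attend to position j.
    A causal decoder only looks left (j <= i); dropping that
    constraint, LLM2Vec-style, makes every position visible."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


causal_mask = attention_mask(4)              # lower-triangular
bidi_mask = attention_mask(4, causal=False)  # all True

# Position 0 in a causal decoder sees only itself:
print(causal_mask[0])  # [True, False, False, False]
```

With the bidirectional mask, every token's representation can draw on both left and right context, which is what makes per-token classification over the full answer well-posed.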

---

## Approach C: Chunk Verification (Reranker-style)

**Architecture:** Qwen3.5-2B or Qwen3-0.6B, reranker-style yes/no scoring

Inspired by [Qwen3-Reranker](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B). Split the answer into chunks (lines, statements), then ask the model for each chunk: "Is this code correct given the context?"

```
Input: "Given this source code, is this line correct? yes/no"
Output: P(yes) = 0.12 → hallucinated
P(yes) = 0.95 → supported
```

No architectural modifications. Uses the LLM's native next-token prediction to classify.
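The scoring step reduces to a softmax over just the logits of the "yes" and "no" tokens at the final position. A sketch with hypothetical logit values (the numbers are illustrative, not actual Qwen output):

```python
import math

def p_yes(yes_logit, no_logit):
    """Two-way softmax over the 'yes'/'no' token logits."""
    e_yes, e_no = math.exp(yes_logit), math.exp(no_logit)
    return e_yes / (e_yes + e_no)

# Hypothetical logits for two answer chunks:
print(round(p_yes(-1.0, 1.0), 2))   # 0.12 -> flag chunk as hallucinated
print(round(p_yes(3.0, 0.05), 2))   # 0.95 -> supported
```

A threshold on P(yes) then yields the per-chunk binary decision.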

| Property | Value |
|----------|-------|
| **Models** | Qwen3-0.6B (tiny, fast) or Qwen3.5-2B |
| **Inference** | N forward passes per sample (one per chunk) |
| **Training** | Standard SFT with yes/no labels |
| **Reference** | [MiniCheck (EMNLP 2024)](https://arxiv.org/abs/2404.10774) — GPT-4-level at 400x lower cost |

**Strengths:** No architecture changes. Uses LLM code reasoning directly. Can work with tiny models.
**Weaknesses:** Slowest inference (N passes per sample). Chunk boundary sensitivity. No sub-chunk granularity.

---

## Approach D: Generative Span Detection

**Architecture:** Qwen3.5-2B, standard SFT, generates JSON with hallucinated spans

The model directly outputs which spans are hallucinated and why. This is the reverse of the hallucination injection process.

```
Input: "Given the source code and answer, identify hallucinated spans."
Output: {
"hallucinated_spans": [
{"text": "response.json_decode()", "explanation": "method is json(), not json_decode()"}
]
}
```

| Property | Value |
|----------|-------|
| **Models** | Qwen3.5-2B or larger |
| **Inference** | Single generation (autoregressive, slower than forward pass) |
| **Training** | Standard SFT with LoRA |
| **SOTA** | [RL4HS (Oct 2025)](https://arxiv.org/abs/2510.02173) — 58.3 F1 on RAGTruth, beats GPT-5 (42.2) and o3 (51.2) |

**Strengths:**

- No architecture changes — pure text generation
- Free explanations alongside span detection
- Naturally handles variable span counts
- Can leverage the LLM's code knowledge ("this API doesn't exist")
- Training data format already matches (reverse of injection pipeline)
- Current SOTA approach (RL4HS)

**Weaknesses:** Autoregressive generation is slower. Risk of hallucinating in the detector itself. String matching needed to map spans back to character offsets.
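The offset-mapping step from the weaknesses above can be done with exact string matching; `spans_to_offsets` and the sample answer are illustrative, not part of an existing pipeline:

```python
import json

def spans_to_offsets(model_output, answer):
    """Map generated span texts back to (start, end) character offsets
    in the answer via exact string matching; spans with no exact match
    (i.e. the detector itself hallucinated) are dropped."""
    spans = []
    for item in json.loads(model_output)["hallucinated_spans"]:
        start = answer.find(item["text"])
        if start != -1:
            spans.append((start, start + len(item["text"])))
    return spans


answer = "data = response.json_decode()"
output = json.dumps({"hallucinated_spans": [
    {"text": "response.json_decode()",
     "explanation": "method is json(), not json_decode()"}]})
print(spans_to_offsets(output, answer))  # [(7, 29)]
```

In practice fuzzy matching helps when the generator slightly paraphrases a span, at the cost of occasional misalignment.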

**RL enhancement:** [RL4HS](https://arxiv.org/abs/2510.02173) shows that adding reinforcement learning (GRPO with span-level rewards) on top of SFT dramatically improves performance. SFT alone is a strong baseline; RL pushes it to SOTA.
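A simplified span-overlap reward in this spirit can be written directly against character offsets (an assumption-laden sketch, not RL4HS's actual reward function):

```python
def span_reward(pred_spans, gold_spans):
    """Scalar reward for one sampled completion: span-level F1 by
    character overlap, with full credit for correctly predicting
    'no hallucination'."""
    if not pred_spans and not gold_spans:
        return 1.0
    pred = {c for s, e in pred_spans for c in range(s, e)}
    gold = {c for s, e in gold_spans for c in range(s, e)}
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(pred), overlap / len(gold)
    return 2 * prec * rec / (prec + rec)


print(span_reward([(0, 10)], [(5, 15)]))  # 0.5: half the chars overlap
```

GRPO then compares this reward across a group of sampled completions for the same input to form the policy-gradient signal.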

---

## Comparison

| | A. Encoder token | B. LLM token | C. Chunk verifier | D. Generative span |
|---|---|---|---|---|
| **Base model** | ModernBERT-large | Qwen3.5-2B | Qwen3-0.6B | Qwen3.5-2B |
| **Parameters** | 395M | 2B | 0.6B | 2B |
| **Architecture mods** | None | Mask removal | None | None |
| **Inference speed** | Fastest | Medium | Slowest | Medium-slow |
| **Explainable** | No | No | No | Yes |
| **Code understanding** | Limited | Deep | Deep | Deep |
| **Training complexity** | Simple | LLM2Vec + LoRA | Simple SFT | Simple SFT |
| **SOTA reference** | LettuceDetect, HaluGate | ACL 2024 paper | MiniCheck | RL4HS |

## Recommended Experiments

1. **A vs D** — Token classification (ModernBERT) vs generative span detection (Qwen3.5-2B). The core comparison: fast encoder vs reasoning LLM, both trained on the same dataset.

2. **A vs B** — Does code pretraining help token classification? Same task, different backbone.

3. **D with RL** — If SFT results are promising, add GRPO with span-overlap rewards (following RL4HS).

## Key References

- [LettuceDetect (arXiv:2502.17125)](https://arxiv.org/abs/2502.17125) — Encoder token classification baseline
- [HaluGate (vLLM, Dec 2025)](https://blog.vllm.ai/2025/12/14/halugate.html) — Production ModernBERT + NLI pipeline
- [RL4HS (arXiv:2510.02173)](https://arxiv.org/abs/2510.02173) — SOTA generative span detection with RL
- [FAVA (COLM 2024)](https://arxiv.org/abs/2401.06855) — Generative hallucination editing
- [PsiloQA (EMNLP 2025)](https://arxiv.org/abs/2510.04849) — Multilingual encoder-based span detection
- [Looking Right is Sometimes Right (ACL 2024)](https://arxiv.org/abs/2401.14556) — Decoder LLMs for token classification
- [LLM2Vec (2024)](https://arxiv.org/abs/2404.05961) — Converting decoders to bidirectional encoders
- [MiniCheck (EMNLP 2024)](https://arxiv.org/abs/2404.10774) — Sentence-level fact checking
- [Qwen3-Reranker](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B) — LLM-based yes/no classification
- [CodeMirage (2024)](https://arxiv.org/abs/2408.08333) — Code hallucination taxonomy (snippet-level only)