Merged
43 changes: 43 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,43 @@
name: Deploy Documentation

on:
  push:
    branches: [main]
    paths:
      - 'docs/**'
      - 'mkdocs.yml'
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install mkdocs-material "mkdocstrings[python]>=0.24"
      - run: pip install -e .
      - run: mkdocs build
      - uses: actions/upload-pages-artifact@v3
        with:
          path: site/

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4
22 changes: 12 additions & 10 deletions .github/workflows/python-tests.yml
@@ -1,8 +1,10 @@
 name: Python Tests

 on:
+  push:
+    branches: [main]
   pull_request:
-    branches: [ main ]
+    branches: [main]

 jobs:
   test:
@@ -12,24 +14,24 @@ jobs:
         python-version: ["3.12"]

     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4

       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
           cache: 'pip'

       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
           pip install -e ".[dev]"

       - name: Lint with ruff
         run: |
-          ruff format --check --diff .
-          ruff check --select I .
+          ruff format --check --diff lettucedetect/ tests/
+          ruff check lettucedetect/ tests/ --extend-exclude lettucedetect/integrations/

       - name: Test with pytest
         run: |
-          pytest tests/test_inference_pytest.py -v
+          pytest tests/test_inference_pytest.py -v
6 changes: 5 additions & 1 deletion .gitignore
@@ -170,6 +6,9 @@ cython_debug/
 # PyPI configuration file
 .pypirc

+# macOS
+.DS_Store
+
 # data/
 data/

@@ -178,4 +181,5 @@ output/
 temp/

 # cache/
-lettucedetect/cache/
+lettucedetect/cache/
+testing/
2 changes: 1 addition & 1 deletion docs/EUROBERT.md
@@ -1,7 +1,7 @@
# 🥬 LettuceDetect Goes Multilingual: Fine-tuning EuroBERT on Synthetic RAGTruth Translations

<p align="center">
-<img src="https://github.com/KRLabsOrg/LettuceDetect/blob/feature/cn_llm_eval/assets/lettuce_detective_multi.png?raw=true" alt="LettuceDetect Multilingual Task Force" width="520"/>
+<img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/lettuce_detective_multi.png?raw=true" alt="LettuceDetect Multilingual Task Force" width="520"/>
<br>
<em>Expanding hallucination detection across languages for RAG pipelines.</em>
</p>
13 changes: 0 additions & 13 deletions docs/EVALUATION.md

This file was deleted.

13 changes: 13 additions & 0 deletions docs/api/datasets.md
@@ -0,0 +1,13 @@
# Datasets

## HallucinationSample

::: lettucedetect.datasets.hallucination_dataset.HallucinationSample

## HallucinationData

::: lettucedetect.datasets.hallucination_dataset.HallucinationData

## HallucinationDataset

::: lettucedetect.datasets.hallucination_dataset.HallucinationDataset
13 changes: 13 additions & 0 deletions docs/api/detectors.md
@@ -0,0 +1,13 @@
# Detectors

## Factory

::: lettucedetect.detectors.factory.make_detector

## Base Detector

::: lettucedetect.detectors.base.BaseDetector

## Transformer Detector

::: lettucedetect.detectors.transformer.TransformerDetector
5 changes: 5 additions & 0 deletions docs/api/inference.md
@@ -0,0 +1,5 @@
# Inference

The main entry point for hallucination detection.

::: lettucedetect.models.inference.HallucinationDetector
9 changes: 9 additions & 0 deletions docs/api/training.md
@@ -0,0 +1,9 @@
# Training

## Trainer

::: lettucedetect.models.trainer.Trainer

## Evaluator

::: lettucedetect.models.evaluator
43 changes: 43 additions & 0 deletions docs/benchmarks.md
@@ -0,0 +1,43 @@
# Benchmarks

## RAGTruth (English)

Evaluated on the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) test set. This benchmark measures how well models detect hallucinations in LLM-generated text across QA, summarization, and data-to-text tasks.

### Example-Level Detection

Binary classification: does the answer contain any hallucination?

| Model | Type | Overall F1 |
|-------|------|-----------|
| GPT-4 | LLM (zero-shot) | 63.4% |
| Luna | Encoder | 65.4% |
| **lettucedetect-base-v1** | **Encoder (149M)** | **76.8%** |
| Llama-2-13B (fine-tuned) | LLM | 78.7% |
| **lettucedetect-large-v1** | **Encoder (395M)** | **79.2%** |
| RAG-HAT (Llama-3-8B) | LLM | 83.9% |

### Span-Level Detection

Span-level evaluation measures how precisely a model can locate the exact hallucinated text within an answer. On this harder task, LettuceDetect achieves state-of-the-art results among models that report span-level metrics, outperforming fine-tuned Llama-2-13B.

## What These Numbers Mean

- **Example F1** — Can the model tell if an answer has *any* hallucination? Higher is better.
- **Span F1** — Can the model point to *exactly which parts* are hallucinated? This is the harder task and where LettuceDetect excels relative to its size.
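To make the two bullets concrete, here is a minimal sketch of both metrics in Python (an illustration, not the official RAGTruth scorer; span F1 is scored here by character overlap):

```python
def example_f1(preds, golds):
    """Example-level F1: binary 'any hallucination?' per answer."""
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(g and not p for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)


def span_f1(pred_spans, gold_spans):
    """Span-level F1 over (start, end) character offsets, scored by
    how many characters the predicted and gold spans share."""
    pred = {c for s, e in pred_spans for c in range(s, e)}
    gold = {c for s, e in gold_spans for c in range(s, e)}
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(pred), overlap / len(gold)
    return 2 * prec * rec / (prec + rec)


print(span_f1([(10, 20)], [(12, 22)]))  # ~0.8: 8 shared chars out of 10 each
```

An answer can score high on example F1 while scoring poorly on span F1, which is why the two are reported separately.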

LettuceDetect models are 50-500x smaller than LLM-based detectors while achieving competitive or better accuracy.

## Citation

```bibtex
@misc{Kovacs:2025,
title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
author={Ádám Kovács and Gábor Recski},
year={2025},
eprint={2502.17125},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.17125},
}
```
154 changes: 154 additions & 0 deletions docs/code-hallucination/architecture-research.md
@@ -0,0 +1,154 @@
# Architecture Research: Detection Models for Code Hallucination

Research notes on model architectures for training on the code hallucination dataset. We compare four approaches ranging from fast encoder-based classifiers to generative span detectors.

## Approach A: Token Classification (Encoder)

**Architecture:** ModernBERT/EuroBERT + linear classification head

The current LettuceDetect approach. Each answer token gets a binary label (0=supported, 1=hallucinated). Consecutive hallucinated tokens are merged into spans at inference.

```
Input: [CLS] context [SEP] question [SEP] answer [SEP]
Output: [-100, -100, ..., 0, 0, 1, 1, 1, 0, 0, ...]
^^^^^^^^^ hallucinated span
```
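The merge-at-inference step can be sketched in a few lines. The offsets below are hypothetical, and this is an illustration of the idea rather than LettuceDetect's actual implementation:

```python
def labels_to_spans(labels, offsets):
    """Merge runs of tokens labeled 1 into character-level spans.

    labels:  per-token predictions (0 = supported, 1 = hallucinated,
             -100 = ignored context/question tokens)
    offsets: per-token (start, end) character offsets, e.g. a fast HF
             tokenizer's offset_mapping
    """
    spans, current = [], None
    for label, (start, end) in zip(labels, offsets):
        if label == 1:
            if current is None:
                current = [start, end]    # open a new span
            else:
                current[1] = end          # extend the open span
        elif current is not None:
            spans.append(tuple(current))  # any non-1 label closes it
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


labels = [-100, -100, 0, 0, 1, 1, 1, 0]
offsets = [(0, 0), (0, 0), (0, 4), (5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]
print(labels_to_spans(labels, offsets))  # [(10, 24)]
```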

| Property | Value |
|----------|-------|
| **Models** | ModernBERT-base (149M), ModernBERT-large (395M), EuroBERT (210M-2.1B) |
| **Context** | 8K tokens |
| **Inference** | Single forward pass, 30-60 samples/sec on A100 |
| **Training** | Standard token classification, CrossEntropyLoss |
| **Validated by** | LettuceDetect (79.2% F1), HaluGate (vLLM), PsiloQA (EMNLP 2025) |

**Strengths:** Fast, simple, production-ready. Handles long contiguous spans well.
**Weaknesses:** No code-specific pretraining. Cannot explain *why* something is hallucinated.

---

## Approach B: Token Classification (Decoder LLM)

**Architecture:** Qwen3.5-2B + bidirectional attention (LLM2Vec) + linear head

Use a decoder LLM pretrained on massive code corpora, convert to bidirectional encoder via [LLM2Vec](https://arxiv.org/abs/2404.05961), then add a token classification head.

```
Step 1: Load Qwen3.5-2B base (2B params, code-heavy pretraining)
Step 2: Enable bidirectional attention (remove causal mask)
Step 3: Short MNTP adaptation (masked next token prediction with LoRA)
Step 4: Add linear head (hidden_dim=2048 → 2 classes)
Step 5: Fine-tune on code hallucination dataset with LoRA
```

| Property | Value |
|----------|-------|
| **Model** | Qwen3.5-2B (2B params) |
| **Context** | 262K native (practically limited by GPU memory) |
| **Inference** | Single forward pass, ~5-15 samples/sec |
| **VRAM** | ~5-8GB in bf16 |
| **Reference** | [Looking Right is Sometimes Right (ACL 2024)](https://arxiv.org/abs/2401.14556) — 0.947 F1 on NER with mask removal |

**Strengths:** Deep code understanding from pretraining. Bidirectional attention after conversion.
**Weaknesses:** 5x larger than ModernBERT. Requires LLM2Vec conversion step. Novel (unvalidated for hallucination detection).

**Key insight:** The [ACL 2024 paper](https://arxiv.org/abs/2401.14556) showed decoder LLMs with causal mask removal reach 0.947 F1 on NER, significantly above RoBERTa-large (0.900). The gains come from combining rich pretrained representations with bidirectional context.
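What Step 2 ("enable bidirectional attention") changes can be illustrated with plain boolean masks, independent of any real model code:

```python
def attention_mask(seq_len, causal=True):
    """mask[i][j] is True if position i may attend to position j.
    A causal decoder only looks left (j <= i); dropping that
    constraint, LLM2Vec-style, makes every position visible."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


causal_mask = attention_mask(4)              # lower-triangular
bidi_mask = attention_mask(4, causal=False)  # all True

# Position 0 in a causal decoder sees only itself:
print(causal_mask[0])  # [True, False, False, False]
```

With the bidirectional mask, every token's representation can draw on both left and right context, which is what makes per-token classification over the full answer well-posed.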

---

## Approach C: Chunk Verification (Reranker-style)

**Architecture:** Qwen3.5-2B or Qwen3-0.6B, reranker-style yes/no scoring

Inspired by [Qwen3-Reranker](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B). Split the answer into chunks (lines, statements), then ask the model for each chunk: "Is this code correct given the context?"

```
Input: "Given this source code, is this line correct? yes/no"
Output: P(yes) = 0.12 → hallucinated
P(yes) = 0.95 → supported
```

No architectural modifications. Uses the LLM's native next-token prediction to classify.
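The scoring step reduces to a softmax over just the logits of the "yes" and "no" tokens at the final position. A sketch with hypothetical logit values (the numbers are illustrative, not actual Qwen output):

```python
import math

def p_yes(yes_logit, no_logit):
    """Two-way softmax over the 'yes'/'no' token logits."""
    e_yes, e_no = math.exp(yes_logit), math.exp(no_logit)
    return e_yes / (e_yes + e_no)

# Hypothetical logits for two answer chunks:
print(round(p_yes(-1.0, 1.0), 2))   # 0.12 -> flag chunk as hallucinated
print(round(p_yes(3.0, 0.05), 2))   # 0.95 -> supported
```

A threshold on P(yes) then yields the per-chunk binary decision.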

| Property | Value |
|----------|-------|
| **Models** | Qwen3-0.6B (tiny, fast) or Qwen3.5-2B |
| **Inference** | N forward passes per sample (one per chunk) |
| **Training** | Standard SFT with yes/no labels |
| **Reference** | [MiniCheck (EMNLP 2024)](https://arxiv.org/abs/2404.10774) — GPT-4-level at 400x lower cost |

**Strengths:** No architecture changes. Uses LLM code reasoning directly. Can work with tiny models.
**Weaknesses:** Slowest inference (N passes per sample). Chunk boundary sensitivity. No sub-chunk granularity.

---

## Approach D: Generative Span Detection

**Architecture:** Qwen3.5-2B, standard SFT, generates JSON with hallucinated spans

The model directly outputs which spans are hallucinated and why. This is the reverse of the hallucination injection process.

```
Input: "Given the source code and answer, identify hallucinated spans."
Output: {
"hallucinated_spans": [
{"text": "response.json_decode()", "explanation": "method is json(), not json_decode()"}
]
}
```

| Property | Value |
|----------|-------|
| **Models** | Qwen3.5-2B or larger |
| **Inference** | Single generation (autoregressive, slower than forward pass) |
| **Training** | Standard SFT with LoRA |
| **SOTA** | [RL4HS (Oct 2025)](https://arxiv.org/abs/2510.02173) — 58.3 F1 on RAGTruth, beats GPT-5 (42.2) and o3 (51.2) |

**Strengths:**

- No architecture changes — pure text generation
- Free explanations alongside span detection
- Naturally handles variable span counts
- Can leverage the LLM's code knowledge ("this API doesn't exist")
- Training data format already matches (reverse of injection pipeline)
- Current SOTA approach (RL4HS)

**Weaknesses:** Autoregressive generation is slower. Risk of hallucinating in the detector itself. String matching needed to map spans back to character offsets.
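The offset-mapping step from the weaknesses above can be done with exact string matching; `spans_to_offsets` and the sample answer are illustrative, not part of an existing pipeline:

```python
import json

def spans_to_offsets(model_output, answer):
    """Map generated span texts back to (start, end) character offsets
    in the answer via exact string matching; spans with no exact match
    (i.e. the detector itself hallucinated) are dropped."""
    spans = []
    for item in json.loads(model_output)["hallucinated_spans"]:
        start = answer.find(item["text"])
        if start != -1:
            spans.append((start, start + len(item["text"])))
    return spans


answer = "data = response.json_decode()"
output = json.dumps({"hallucinated_spans": [
    {"text": "response.json_decode()",
     "explanation": "method is json(), not json_decode()"}]})
print(spans_to_offsets(output, answer))  # [(7, 29)]
```

In practice fuzzy matching helps when the generator slightly paraphrases a span, at the cost of occasional misalignment.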

**RL enhancement:** [RL4HS](https://arxiv.org/abs/2510.02173) shows that adding reinforcement learning (GRPO with span-level rewards) on top of SFT dramatically improves performance. SFT alone is a strong baseline; RL pushes it to SOTA.
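A simplified span-overlap reward in this spirit can be written directly against character offsets (an assumption-laden sketch, not RL4HS's actual reward function):

```python
def span_reward(pred_spans, gold_spans):
    """Scalar reward for one sampled completion: span-level F1 by
    character overlap, with full credit for correctly predicting
    'no hallucination'."""
    if not pred_spans and not gold_spans:
        return 1.0
    pred = {c for s, e in pred_spans for c in range(s, e)}
    gold = {c for s, e in gold_spans for c in range(s, e)}
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(pred), overlap / len(gold)
    return 2 * prec * rec / (prec + rec)


print(span_reward([(0, 10)], [(5, 15)]))  # 0.5: half the chars overlap
```

GRPO then compares this reward across a group of sampled completions for the same input to form the policy-gradient signal.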

---

## Comparison

| | A. Encoder token | B. LLM token | C. Chunk verifier | D. Generative span |
|---|---|---|---|---|
| **Base model** | ModernBERT-large | Qwen3.5-2B | Qwen3-0.6B | Qwen3.5-2B |
| **Parameters** | 395M | 2B | 0.6B | 2B |
| **Architecture mods** | None | Mask removal | None | None |
| **Inference speed** | Fastest | Medium | Slowest | Medium-slow |
| **Explainable** | No | No | No | Yes |
| **Code understanding** | Limited | Deep | Deep | Deep |
| **Training complexity** | Simple | LLM2Vec + LoRA | Simple SFT | Simple SFT |
| **SOTA reference** | LettuceDetect, HaluGate | ACL 2024 paper | MiniCheck | RL4HS |

## Recommended Experiments

1. **A vs D** — Token classification (ModernBERT) vs generative span detection (Qwen3.5-2B). The core comparison: fast encoder vs reasoning LLM, both trained on the same dataset.

2. **A vs B** — Does code pretraining help token classification? Same task, different backbone.

3. **D with RL** — If SFT results are promising, add GRPO with span-overlap rewards (following RL4HS).

## Key References

- [LettuceDetect (arXiv:2502.17125)](https://arxiv.org/abs/2502.17125) — Encoder token classification baseline
- [HaluGate (vLLM, Dec 2025)](https://blog.vllm.ai/2025/12/14/halugate.html) — Production ModernBERT + NLI pipeline
- [RL4HS (arXiv:2510.02173)](https://arxiv.org/abs/2510.02173) — SOTA generative span detection with RL
- [FAVA (COLM 2024)](https://arxiv.org/abs/2401.06855) — Generative hallucination editing
- [PsiloQA (EMNLP 2025)](https://arxiv.org/abs/2510.04849) — Multilingual encoder-based span detection
- [Looking Right is Sometimes Right (ACL 2024)](https://arxiv.org/abs/2401.14556) — Decoder LLMs for token classification
- [LLM2Vec (2024)](https://arxiv.org/abs/2404.05961) — Converting decoders to bidirectional encoders
- [MiniCheck (EMNLP 2024)](https://arxiv.org/abs/2404.10774) — Sentence-level fact checking
- [Qwen3-Reranker](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B) — LLM-based yes/no classification
- [CodeMirage (2024)](https://arxiv.org/abs/2408.08333) — Code hallucination taxonomy (snippet-level only)