PDF to AI Agent Knowledge Bridge
Convert PDFs to enhanced, AI-agent-ready Markdown/JSON documents.
Features • Quick Start • Installation • Usage
Anji bridges the gap between PDFs designed for human reading and the structured, semantic text that AI agents require. It leverages:
- PaddleOCR-VL for high-quality PDF-to-Markdown conversion
- Ovis2.5-9B Vision-Language Model for intelligent image analysis
- Mistune for flexible AST manipulation
| Feature | Description |
|---|---|
| Smart OCR | Extracts text, tables, and images with layout awareness |
| VLM Image Analysis | Generates captions and descriptions for embedded images |
| Decorative Filtering | Removes logos, watermarks, and noise automatically |
| Heading Correction(developing) | Fixes OCR-generated heading hierarchy issues |
| Multi-Format Output | Export to Markdown, JSON, or structured data |
| Batch Processing | Efficiently process multiple PDFs |
| Flexible Pipeline | Run full pipeline or individual steps |
| Base64 Embedding | Embed images as base64 data URLs in markdown |
# Install
pip install -e .
# Convert a PDF
anji pipeline document.pdf output/
# Embed images as base64 (single portable file)
anji pipeline document.pdf output/ --embed-base64
# Or use as a Python library
python -c "
from anji import run_full_pipeline
run_full_pipeline('document.pdf', 'output/')
"# Basic installation
pip install -e .
# With development dependencies
pip install -e ".[dev]"Anji requires two external services running:
Requires GPU. Run using Docker:
docker run \
-it \
--rm \
--gpus all \
--network host \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllmRequires GPU with ~16GB VRAM. Run with vLLM:
vllm serve AIDC-AI/Ovis2.5-9B \
--trust-remote-code \
--port 8000 \
--gpu-memory-utilization 0.4Note: If you encounter
RuntimeError: Exception from the 'vlm' worker: only 0-dimensional arrays can be converted to Python scalars, installnumpy==1.26.4.
# Full pipeline
anji pipeline input.pdf output_dir
# Batch processing
anji batch output_base file1.pdf file2.pdf file3.pdf
# Individual steps
anji pdf input.pdf output_dir # PDF → Markdown
anji image input.md output.md # Analyze images
anji md enhance input.md output.md # Enhance AST
anji md export input.md out --format json # Export# Keep images folder (default: enabled)
anji pipeline input.pdf output/ --keep-images
# Disable images folder
anji pipeline input.pdf output/ --no-keep-images
# Embed images as base64 (single portable markdown file)
anji pipeline input.pdf output/ --embed-base64
# Combine options
anji pipeline input.pdf output/ --embed-base64 --no-keep-imagesfrom anji import Pipeline, run_full_pipeline, batch_pipeline
# Simple usage
run_full_pipeline("document.pdf", "output/")
# Advanced usage
pipeline = Pipeline(
paddleocr_server_url="http://localhost:8118/v1",
vlm_server_url="http://localhost:8000/v1"
)
outputs = pipeline.run(
input_path="document.pdf",
output_folder="output",
output_format="both", # markdown, json, structured, or both
keep_images=True, # keep imgs folder
embed_base64=False, # or True for single file
)
# Batch processing
batch_pipeline(
input_paths=["doc1.pdf", "doc2.pdf"],
output_base_folder="batch_output"
)
pipeline.close()output/
└── document_name/
└── enhanced/
├── document.md # Enhanced Markdown
├── document.json # JSON AST (optional)
└── imgs/ # Extracted images (optional)
├── image1.jpg
└── image2.jpg
With --embed-base64, images are embedded directly in the markdown file as base64 data URLs.
Anji processes PDFs through 4 stages:
- PDF → Markdown - Uses PaddleOCR-VL to extract text, tables, and images
- Markdown → AST - Parses markdown into an abstract syntax tree using Mistune
- Enhance - Analyzes images with VLM, fixes heading levels, filters decorative elements
- Export - Outputs as Markdown, JSON, or structured data
| Variable | Default | Description |
|---|---|---|
API_BASE_URL |
http://localhost:8000/v1 |
VLM server URL |
API_KEY |
abc-123 |
VLM API key |
MODEL_NAME |
AIDC-AI/Ovis2.5-9B |
VLM model name |
anji pipeline input.pdf output/ \
--format markdown|json|structured|both \
--no-enhance \
--no-fix-headings \
--no-filter-decorative \
--no-enrich-images \
--keep-images \
--embed-base64 \
--dummy # Test without API calls# Code formatting
black anji/
# Linting
ruff check anji/
# Type checking
mypy anji/
# Testing
pytestanji/
├── anji/ # Main package
│ ├── __init__.py # Exports
│ ├── main.py # CLI entry point
│ ├── cli.py # Command-line interface
│ ├── pipeline.py # Pipeline orchestration
│ ├── pdf_converter.py # PDF → Markdown
│ ├── image_analyzer.py # VLM image analysis
│ ├── ast_handler.py # AST manipulation
│ ├── enhancement.py # AST enhancement
│ └── exporters.py # Export utilities
├── pyproject.toml # Package configuration
├── README.md # English documentation
├── README_CN.md # Chinese documentation
├── CLAUDE.md # Claude Code context
└── .gitignore
MIT License. See LICENSE for details.
Contributions are welcome! Please read CLAUDE.md for development guidelines.
Built for AI agents, by AI agents