Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -198,8 +198,8 @@ bench-download-mineru: bench-setup ## Create MinerU isolated venv + download pip
@$(BENCH_DIR)/.venvs/mineru/bin/mineru-models-download \
--source huggingface --model_type pipeline

bench-compare-all: build bench-setup ## Compare EdgeParse against ALL engines: opendataloader, docling, marker, mineru, pymupdf4llm, markitdown, edgequake
$(call log,Running full multi-engine comparison — all 8 engines ...)
bench-compare-all: build bench-setup ## Compare EdgeParse against ALL engines: opendataloader, docling, marker, mineru, pymupdf4llm, markitdown, edgequake, liteparse
$(call log,Running full multi-engine comparison — all 9 engines ...)
@cd $(BENCH_DIR) && uv run python compare_all.py --all --install

bench-compare-fast: build bench-setup ## Quick comparison: EdgeParse + pymupdf4llm + markitdown (installs missing engines)
Expand Down
106 changes: 70 additions & 36 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Rust](https://img.shields.io/badge/Rust-1.85%2B-orange.svg)](https://www.rust-lang.org/)

EdgeParse converts any digital PDF into Markdown, JSON (with bounding boxes), HTML, or plain text — deterministically, without a JVM, without a GPU, and with best-in-class accuracy on the 200-document benchmark suite included in this repository.
EdgeParse converts any digital PDF into Markdown, JSON (with bounding boxes), HTML, or plain text — deterministically, without a JVM, without a GPU, without OCR models, and with **best-in-class accuracy** among non-OCR tools on the 200-document benchmark suite included in this repository.

Available as a **Rust library**, **CLI binary**, **Python package** (`edgeparse`), and **Node.js package** (`@edgeparse/pdf`).

Expand All @@ -21,6 +21,9 @@ Available as a **Rust library**, **CLI binary**, **Python package** (`edgeparse`
- [Node.js SDK](#nodejs-sdk)
- [Architecture](#architecture)
- [Benchmark](#benchmark)
- [Why it matters](#why-it-matters)
- [Results on 200-document benchmark suite](#results-on-200-document-benchmark-suite)
- [Running the benchmark](#running-the-benchmark)
- [Documentation](#documentation)
- [Project Layout](#project-layout)
- [Contributing](#contributing)
Expand Down Expand Up @@ -386,17 +389,66 @@ Stages marked `par_map_pages` run in parallel via Rayon; cross-page stages run s

## Benchmark

The `benchmark/` directory contains a full evaluation suite against real-world PDFs — academic papers, multi-column layouts, tables, scanned pages — with ground-truth Markdown and element annotations.
### Why it matters

### Metrics
Most PDF parsers were designed for one thing: **handle scanned documents with OCR at any cost**. That means pulling in deep-learning stacks (PaddleOCR, Surya, EasyOCR, layout detection models), Python-heavy runtimes, and GPU dependencies — even when processing a born-digital PDF that contains perfectly legible text. The result is tools that are **slow, large to install, and brittle in production**.

| Metric | Description |
|--------|-------------|
| **NID** | Normalised Index Distance — reading order accuracy |
| **TEDS** | Tree-Edit-Distance-based Similarity — table structure accuracy |
| **MHS** | Markdown Heading Similarity — heading hierarchy accuracy |
| **Table Detection F1** | Precision / recall of table presence detection |
| **Speed** | Seconds per document |
The reality is that the vast majority of business, research, and enterprise PDFs are **born-digital**: they have embedded fonts, vector text, and structured content. OCR is unnecessary. What they need is precision — correct reading order, accurate table extraction, and reliable heading detection.

EdgeParse is built on this insight. It uses **zero ML models, zero OCR, zero GPU**, and achieves top-tier accuracy through first-principles PDF parsing: font decoding, layout geometry, ruling-line analysis, and XY-Cut++ reading order. The result is a parser that is **fastest in class** and **dominant among all non-OCR tools** on every benchmark metric.

### Results on 200-document benchmark suite

Evaluated on 200 real-world PDFs spanning academic papers, financial reports, multi-column layouts, complex tables, and mixed-language documents, running on Apple M4 Max.

#### Against non-OCR tools (apples-to-apples)

Tools that require no OCR or deep-learning model inference. EdgeParse wins on **every metric** including speed.

| Engine | NID ↑ | TEDS ↑ | MHS ↑ | Overall ↑ | Speed ↓ |
|--------|-------:|-------:|------:|----------:|--------:|
| **EdgeParse** ✅ | **0.911** | **0.783** | **0.821** | **0.881** | **0.023 s/doc** |
| OpenDataLoader | 0.912 | 0.494 | 0.760 | 0.844 | 0.048 s/doc |
| PyMuPDF4LLM | 0.888 | 0.540 | 0.774 | 0.833 | 0.310 s/doc |
| Microsoft MarkItDown | 0.844 | 0.273 | 0.000 | 0.589 | 0.078 s/doc |
| LiteParse (LlamaIndex) | 0.857 | 0.000 | 0.000 | 0.569 | 0.214 s/doc |

> **NID** = reading order accuracy (normalised index distance), **TEDS** = table structure accuracy, **MHS** = heading hierarchy accuracy, **Overall** = geometric mean of all metrics. Higher is better (↑), lower is better for speed (↓).

EdgeParse is **13× faster than PyMuPDF4LLM** and **2× faster than OpenDataLoader**, while delivering significantly better table and heading accuracy. MarkItDown and LiteParse produce zero MHS and near-zero TEDS, meaning they extract raw text only with no structural understanding.

#### Against ML/OCR-based tools

Tools that rely on deep-learning models, OCR engines, or GPU inference. Included for reference — they carry significant deployment weight.

| Engine | NID ↑ | TEDS ↑ | MHS ↑ | Overall ↑ | Speed ↓ | Requires |
|--------|-------:|-------:|------:|----------:|--------:|---------|
| **EdgeParse** ✅ | **0.911** | **0.783** | **0.821** | **0.881** | **0.023 s/doc** | Nothing |
| MinerU | 0.953 | — | 0.858 | 0.906 | 20.8 s/doc | PaddleOCR + layout models |
| IBM Docling | 0.899 | **0.887** | 0.824 | 0.882 | 0.424 s/doc | Layout + OCR models |
| Marker | 0.866 | 0.825 | 0.794 | 0.846 | 30.3 s/doc | Surya OCR + GPU |

EdgeParse is within rounding distance of Docling's **MHS** (0.821 vs 0.824) and **Overall** (0.881 vs 0.882) — while being **18× faster** and requiring zero model downloads. It outperforms Marker on all metrics while being **1,300× faster**. MinerU leads on NID and MHS but at **900× the latency** and requires a full OCR + layout model stack.

The tradeoff is TEDS: Docling's layout models give it an edge on complex borderless tables (0.887 vs 0.783). If your pipeline is dominated by complex scanned tables, weigh that against the 18× speed penalty and model dependencies.

#### Summary

| Condition | Recommendation |
|-----------|---------------|
| Born-digital PDFs, latency-sensitive, production deployment | **EdgeParse** — best accuracy/speed tradeoff, zero dependencies |
| Complex scanned tables, GPU available, batch offline processing | Consider Docling or MinerU |
| Scanned documents requiring full OCR | Use a dedicated OCR pipeline |

### Metrics explained

| Metric | What it measures |
|--------|-----------------|
| **NID** | Reading order accuracy — how well content follows the logical reading sequence |
| **TEDS** | Table structure accuracy — tree-edit distance between extracted and ground-truth table trees |
| **MHS** | Heading hierarchy accuracy — correctness of document structure and section titles |
| **Overall** | Geometric mean of NID, TEDS, and MHS |
| **Speed** | Wall-clock seconds per document (full pipeline, 200 docs, parallel) |

### Running the benchmark

Expand All @@ -410,22 +462,19 @@ cargo build --release
cd benchmark
uv sync

# 3. Run all documents
# 3. Run EdgeParse on all 200 documents
uv run python run.py

# 4. Run against a single engine
uv run python run.py --engine edgeparse

# 5. Compare engines
# 4. Compare against other engines
uv run python compare_all.py
```

Results are written to `benchmark/prediction/edgeparse/`.
HTML reports are written to `benchmark/reports/`.

### Threshold file
### Regression thresholds

`benchmark/thresholds.json` defines minimum acceptable scores:
`benchmark/thresholds.json` defines minimum acceptable scores for CI:

```json
{
Expand All @@ -437,21 +486,6 @@ HTML reports are written to `benchmark/reports/`.
}
```

### Supported engines

The benchmark can compare multiple engines side by side:

| Engine | Notes |
|--------|-------|
| `edgeparse` | This project (default) |
| `docling` | IBM Docling |
| `marker` | VikParuchuri/marker |
| `markitdown` | Microsoft MarkItDown |
| `mineru` | MinerU |
| `pymupdf4llm` | PyMuPDF4LLM |
| `opendataloader` | OpenDataLoader PDF |
| `edgequake` | EdgeQuake service |

---

## Documentation
Expand Down Expand Up @@ -494,15 +528,15 @@ edgeparse/
│ └── src/ # index.ts, types.ts, cli.ts
├── benchmark/ # Evaluation suite
│ ├── run.py # Benchmark runner
│ ├── compare_all.py # Multi-engine comparison
│ ├── run.py # Benchmark runner (EdgeParse)
│ ├── compare_all.py # Multi-engine comparison (9 engines)
│ ├── pyproject.toml
│ ├── thresholds.json # Regression thresholds
│ ├── pdfs/ # Benchmark PDFs
│ ├── pdfs/ # Benchmark PDFs (200 docs)
│ ├── ground-truth/ # Reference Markdown and JSON annotations
│ ├── prediction/ # Per-engine output directories
│ ├── reports/ # HTML benchmark reports
│ └── src/ # Python evaluators and engine parsers
│ └── src/ # Python evaluators and engine adapters
├── docs/ # Technical documentation (Markdown)
Expand Down
Loading
Loading