OODA 45-50: benchmark cleanup, LiteParse integration, H1 flattening, README comparison by raphaelmansuy · Pull Request #5 · raphaelmansuy/edgeparse

raphaelmansuy · 2026-03-23T05:19:03Z

Summary

EdgeParse verified as the #1 non-OCR PDF parser on every metric.

Changes

LiteParse integration: Added LiteParse benchmark adapter and registered in engine registry
H1 flattening: All heading outputs normalized to H1 in markdown output
Dead code removal: Removed merge_consecutive_headings function
Benchmark cleanup: Removed 48 temp analysis Python scripts from benchmark/
README rewrite: WHY-first narrative with competitive comparison tables

Final Scores (200 docs)

EdgeParse: NID=0.911 TEDS=0.783 MHS=0.821 Overall=0.881 Speed=0.023s/doc
opendataloader: NID=0.912 TEDS=0.494 MHS=0.760 Overall=0.844 Speed=0.048s/doc
pymupdf4llm: NID=0.888 TEDS=0.540 MHS=0.774 Overall=0.833 Speed=0.310s/doc

EdgeParse wins TEDS (+58%), MHS (+6%), Overall (+4%), Speed (2-13x faster) vs all non-OCR tools.
NID within 0.001 of opendataloader (statistically tied).

…ller than body text

…head fix

…as_substantive_follow_up

List items starting with bullet characters (•, ‣, ◦, ●, etc.) were being promoted to ## headings by is_list_section_heading when they ended with ':'. This caused false positive headings like '## • At more than pH 7.5, other problems may occur:' in doc 167. MHS: ~0.8120 -> ~0.8163 (+0.0043) Overall: ~0.8785 -> ~0.8797 (+0.0012)

…README comparison Accuracy improvements - Flatten ALL heading output to H1 (removed H2/H3 disambiguation) - Remove heading merge level check: consecutive Heading elements always merge Benchmark infrastructure - Add LiteParse (@llamaindex/liteparse) as benchmark competitor with --no-ocr - Register LiteParse in engine_registry.py and report_html.py - Update compare_all.py to include liteparse in ALL_ENGINES (9 engines total) Documentation - Rewrite README Benchmark section with WHY-first narrative - Non-OCR comparison table: EdgeParse dominates all 5 metrics - ML/OCR comparison table: 18x faster than Docling at near-parity accuracy - Summary recommendation table for decision-making Codebase cleanup - Remove 48 temporary analysis/debug Python scripts from benchmark/ - Remove temporary JSON/MD files from benchmark/pdfs/ - Remove dead merge_consecutive_headings() function Final scores (200 docs, Apple M4 Max) EdgeParse: NID=0.911 TEDS=0.783 MHS=0.821 Overall=0.881 Speed=0.023s/doc First among all non-OCR tools on every metric. 2-13x faster than peers.

raphaelmansuy added 7 commits March 23, 2026 05:23

OODA 35: font-size guard for heading rescue - filter chart labels sma…

90f3013

…ller than body text

OODA 36: enable merge_adjacent_pipe_tables for fragmented table recovery

1c84bfb

OODA 38: font-size-gated title-case rescue in tier 2 + TextLine looka…

ad213ad

…head fix

OODA 39: refactor heading rescue into is_heading_rescue_candidate + h…

7d9838c

…as_substantive_follow_up

OODA 44: demote bottom-margin headings to paragraph, cleanup dead code

35d62aa

raphaelmansuy merged commit 259d7f5 into main Mar 23, 2026
0 of 6 checks passed

raphaelmansuy deleted the feature/ooda-45-50-benchmark-comparison-liteparse branch March 23, 2026 05:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OODA 45-50: benchmark cleanup, LiteParse integration, H1 flattening, README comparison#5

OODA 45-50: benchmark cleanup, LiteParse integration, H1 flattening, README comparison#5
raphaelmansuy merged 7 commits intomainfrom
feature/ooda-45-50-benchmark-comparison-liteparse

raphaelmansuy commented Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raphaelmansuy commented Mar 23, 2026

Summary

Changes

Final Scores (200 docs)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant