OODA 45-50: benchmark cleanup, LiteParse integration, H1 flattening, README comparison#5
Merged
raphaelmansuy merged 7 commits intomainfrom Mar 23, 2026
Conversation
…ller than body text
…as_substantive_follow_up
List items starting with bullet characters (•, ‣, ◦, ●, etc.) were being promoted to ## headings by is_list_section_heading when they ended with ':'. This caused false positive headings like '## • At more than pH 7.5, other problems may occur:' in doc 167. MHS: ~0.8120 -> ~0.8163 (+0.0043) Overall: ~0.8785 -> ~0.8797 (+0.0012)
…README comparison Accuracy improvements - Flatten ALL heading output to H1 (removed H2/H3 disambiguation) - Remove heading merge level check: consecutive Heading elements always merge Benchmark infrastructure - Add LiteParse (@llamaindex/liteparse) as benchmark competitor with --no-ocr - Register LiteParse in engine_registry.py and report_html.py - Update compare_all.py to include liteparse in ALL_ENGINES (9 engines total) Documentation - Rewrite README Benchmark section with WHY-first narrative - Non-OCR comparison table: EdgeParse dominates all 5 metrics - ML/OCR comparison table: 18x faster than Docling at near-parity accuracy - Summary recommendation table for decision-making Codebase cleanup - Remove 48 temporary analysis/debug Python scripts from benchmark/ - Remove temporary JSON/MD files from benchmark/pdfs/ - Remove dead merge_consecutive_headings() function Final scores (200 docs, Apple M4 Max) EdgeParse: NID=0.911 TEDS=0.783 MHS=0.821 Overall=0.881 Speed=0.023s/doc First among all non-OCR tools on every metric. 2-13x faster than peers.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
EdgeParse verified as the #1 non-OCR PDF parser on every metric.
Changes
Final Scores (200 docs)
EdgeParse: NID=0.911 TEDS=0.783 MHS=0.821 Overall=0.881 Speed=0.023s/doc
opendataloader: NID=0.912 TEDS=0.494 MHS=0.760 Overall=0.844 Speed=0.048s/doc
pymupdf4llm: NID=0.888 TEDS=0.540 MHS=0.774 Overall=0.833 Speed=0.310s/doc
EdgeParse wins TEDS (+58%), MHS (+6%), Overall (+4%), Speed (2-13x faster) vs all non-OCR tools.
NID within 0.001 of opendataloader (statistically tied).