Tutorial 01 — CLI Quick-Start

Goal: Install the edgeparse CLI, convert PDFs to every supported format, and master every flag.

→ Next: Python SDK · Node.js SDK · Rust library

Installation
First conversion
Output formats
Page ranges
Multiple files and batch output
Table detection methods
Image extraction
Reading order control
Encrypted PDFs
Content safety filters
Layout options
Page separators
Complete flag reference

1. Installation

Option A — Install from crates.io (recommended)

cargo install edgeparse-cli

This puts edgeparse on your $PATH. Requires Rust 1.85+.

Verify:

edgeparse --version
# edgeparse 0.1.0

Option B — Build from source

git clone https://github.com/raphaelmansuy/edgeparse.git
cd edgeparse
cargo build --release

The binary is at target/release/edgeparse. Either add it to your $PATH or run it as ./target/release/edgeparse.

./target/release/edgeparse --version
# edgeparse 0.1.0

Option C — Python package brings a CLI too

If you already have the Python SDK installed, it registers the same edgeparse command:

pip install edgeparse
edgeparse --version

The Python-installed CLI is a Python wrapper; the Rust binary from cargo install edgeparse-cli is typically 5–10× faster for large batches.
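Because the cargo and pip installs both register a command named edgeparse, it can help to check which one your shell actually resolves. The sketch below assumes cargo's default install location ($CARGO_HOME/bin, falling back to ~/.cargo/bin) and no custom --root; edgeparse_origin is a hypothetical helper name, not part of EdgeParse.

```shell
# Classify a resolved executable path as "cargo", "other", or "missing".
# Assumption: cargo installs binaries under ${CARGO_HOME:-$HOME/.cargo}/bin.
edgeparse_origin() {
  cargo_bin="${CARGO_HOME:-$HOME/.cargo}/bin"
  case "$1" in
    "$cargo_bin"/*) echo "cargo" ;;
    "")             echo "missing" ;;
    *)              echo "other" ;;
  esac
}

# command -v prints the path the shell would run, or nothing if not found.
edgeparse_origin "$(command -v edgeparse)"
```

If this prints "other" and you expected the Rust binary, check the order of the directories in your $PATH.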

2. First Conversion

Convert a PDF to Markdown and write the result to output/:

edgeparse examples/pdf/lorem.pdf --format markdown --output-dir output/

Expected output file: output/lorem.md

Content:

# Lorem

# Ipsum

Lorem ipsum dolor sit amet, incididunt ut labore et dolore exercitation ullamco
laboris dolor in reprehenderit ...

To print to stdout instead of writing a file, omit --output-dir:

edgeparse examples/pdf/lorem.pdf --format markdown

3. Output Formats

EdgeParse supports four base output formats plus two Markdown variants, six -f / --format values in total:

Markdown (default)

edgeparse examples/pdf/lorem.pdf -f markdown -o output/

Output: output/lorem.md — headings, paragraphs, and GFM tables.

JSON with bounding boxes

edgeparse examples/pdf/lorem.pdf -f json -o output/

Output: output/lorem.json — full structured data including element type, bounding box coordinates, font information, and page number. Example:

{
  "file name": "lorem.pdf",
  "number of pages": 1,
  "kids": [
    {
      "type": "paragraph",
      "id": 1,
      "page number": 1,
      "bounding box": [200.89, 706.94, 299.69, 745.09],
      "font": "Pretendard-Regular",
      "font size": 32.0,
      "content": "Lorem"
    }
  ]
}

See Tutorial 05 — Output Formats for the full JSON schema.

HTML

edgeparse examples/pdf/lorem.pdf -f html -o output/

Output: output/lorem.html — HTML5 document with semantic <p>, <h1>–<h6>, and <table> elements.

Plain text

edgeparse examples/pdf/lorem.pdf -f text -o output/

Output: output/lorem.txt — UTF-8 text preserving reading order.

Multiple formats at once

Combine formats with a comma. One output file per format:

edgeparse examples/pdf/lorem.pdf -f markdown,json,html,text -o output/
# Writes: output/lorem.md  output/lorem.json  output/lorem.html  output/lorem.txt
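When scripting around multi-format runs, it can be handy to predict the output paths up front. The following is a minimal sketch that uses only the four extensions shown in this tutorial (.md, .json, .html, .txt); format_ext and expected_outputs are hypothetical helper names, and the extensions used by the Markdown variants are not documented here, so they are reported as unknown.

```shell
# Map a documented format name to the file extension shown in this tutorial.
format_ext() {
  case "$1" in
    markdown) echo "md" ;;
    json)     echo "json" ;;
    html)     echo "html" ;;
    text)     echo "txt" ;;
    *)        return 1 ;;   # extension not documented in this tutorial
  esac
}

# Print the expected output path for each format in a comma-separated list.
# usage: expected_outputs <input.pdf> <formats> <output-dir>
expected_outputs() {
  stem=$(basename "$1" .pdf)
  dir=${3%/}                 # drop one trailing slash, if present
  old_ifs=$IFS
  IFS=,                      # split the format list on commas
  for fmt in $2; do
    if ext=$(format_ext "$fmt"); then
      echo "$dir/$stem.$ext"
    else
      echo "unknown extension for format: $fmt" >&2
    fi
  done
  IFS=$old_ifs
}

expected_outputs examples/pdf/lorem.pdf markdown,json,html,text output/
# output/lorem.md
# output/lorem.json
# output/lorem.html
# output/lorem.txt
```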

Markdown variants

| Value | Description |
| --- | --- |
| `markdown` | Standard Markdown with GFM tables |
| `markdown-with-html` | Markdown with HTML `<table>` fallback for complex tables |
| `markdown-with-images` | Markdown with images linked or embedded |

edgeparse examples/pdf/1901.03003.pdf -f markdown-with-html -o output/

4. Page Ranges

Extract a subset of pages with --pages:

# Only pages 1 and 2
edgeparse examples/pdf/1901.03003.pdf -f markdown --pages "1-2" -o output/

# Pages 1, 3, and 5 through 7
edgeparse examples/pdf/1901.03003.pdf -f markdown --pages "1,3,5-7" -o output/

# Just page 1
edgeparse examples/pdf/1901.03003.pdf -f markdown --pages "1" -o output/

Pages are 1-indexed. Out-of-range pages are silently skipped.
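To make the range syntax concrete, here is a small sketch that expands a spec such as "1,3,5-7" into individual page numbers and drops any page beyond a given page count, mirroring the skip behaviour described above. It illustrates the syntax only and is not EdgeParse's own parser; expand_pages is a hypothetical helper, and it assumes a well-formed, numeric spec.

```shell
# Expand a page spec such as "1,3,5-7" into one page number per line.
# Pages outside 1..<page-count> are dropped.
# usage: expand_pages <spec> <page-count>
expand_pages() {
  max=$2
  old_ifs=$IFS
  IFS=,                      # split the spec on commas
  for part in $1; do
    case "$part" in
      *-*) first=${part%-*}; last=${part#*-} ;;   # a range "a-b"
      *)   first=$part;      last=$part ;;        # a single page
    esac
    p=$first
    while [ "$p" -le "$last" ]; do
      if [ "$p" -ge 1 ] && [ "$p" -le "$max" ]; then
        echo "$p"
      fi
      p=$((p + 1))
    done
  done
  IFS=$old_ifs
}

# A 6-page document: page 7 is out of range and is skipped.
expand_pages "1,3,5-7" 6
# 1
# 3
# 5
# 6
```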

5. Multiple Files and Batch Output

Pass multiple files — EdgeParse processes them in parallel:

edgeparse examples/pdf/*.pdf -f markdown -o output/

Each PDF produces its own output file in output/.

Tip: EdgeParse uses Rayon for per-file and per-page parallelism. On an M-series Mac, processing 200 PDFs averages 0.023 s/doc.

Suppress the per-file log messages with --quiet / -q:

edgeparse examples/pdf/*.pdf -f markdown -o output/ -q
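If you would rather know exactly which inputs failed, one option is to run each file in its own invocation. The sketch below assumes that edgeparse exits with a non-zero status when a conversion fails, which this tutorial does not confirm; convert_each is a hypothetical helper name. Note the trade-off: one process per file gives up the cross-file parallelism of a single invocation.

```shell
# Convert each PDF in its own process and report the ones that failed.
# Assumption: edgeparse exits non-zero on failure.
# usage: convert_each <output-dir> <file.pdf>...
convert_each() {
  out_dir=$1
  shift
  failed=0
  for pdf in "$@"; do
    if ! edgeparse "$pdf" -f markdown -o "$out_dir" -q; then
      echo "failed: $pdf" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every file converted.
  [ "$failed" -eq 0 ]
}
```

A typical call would be convert_each output/ examples/pdf/*.pdf, with the failed paths written to stderr.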

6. Table Detection Methods

EdgeParse has two table detection modes:

| Flag | Description | Best for |
| --- | --- | --- |
| `--table-method default` | Ruling-line detection | PDFs with visible table borders |
| `--table-method cluster` | Geometric clustering | Borderless/whitespace tables |

# Default (ruling lines)
edgeparse document.pdf -f markdown --table-method default -o output/

# Cluster (borderless tables)
edgeparse document.pdf -f markdown --table-method cluster -o output/

Most academic papers and reports use ruled tables — the default is correct for those. Use cluster for spreadsheet exports or looser layouts.

7. Image Extraction

Control how images are handled with --image-output:

| Value | Description | Output |
| --- | --- | --- |
| `off` | Skip images entirely | No image data |
| `external` | Extract to files (default) | PNG/JPEG in `--image-dir` |
| `embedded` | Base64 data URIs in text | Inline in Markdown/HTML |

External images (default)

edgeparse examples/pdf/lorem.pdf -f markdown \
  --image-output external \
  --image-dir output/images/ \
  -o output/

Images are saved as output/images/page1_img1.png, etc. The Markdown file references them with relative paths.

Embedded images (data URIs)

edgeparse examples/pdf/lorem.pdf -f markdown \
  --image-output embedded \
  -o output/

Images are base64-encoded inline in the Markdown — useful for self-contained documents.

Image format

# PNG (default, lossless)
edgeparse document.pdf --image-output external --image-format png -o output/

# JPEG (smaller, lossy)
edgeparse document.pdf --image-output external --image-format jpeg -o output/

8. Reading Order Control

Select the reading-order algorithm with --reading-order. XY-Cut++ (xycut) is the default. The off mode outputs text in the order it appears in the PDF content stream, which is useful for debugging or for PDFs with non-standard layouts:

# Default: XY-Cut++ reading order (correct for most PDFs)
edgeparse document.pdf -f markdown --reading-order xycut -o output/

# Raw content-stream order
edgeparse document.pdf -f markdown --reading-order off -o output/

9. Encrypted PDFs

Pass the password with -p / --password:

edgeparse secure.pdf -f markdown -p "mypassword" -o output/
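Typing a password literally on the command line usually leaves it in your shell history. One possible workaround is to keep it in an environment variable and pass it through, as in the sketch below. EDGEPARSE_PDF_PASSWORD and convert_encrypted are names invented for this example; the CLI itself is not documented to read any such variable, and the password is still visible in the process list while edgeparse runs.

```shell
# Convert an encrypted PDF using a password held in $EDGEPARSE_PDF_PASSWORD.
# The variable name is hypothetical; it is read by this wrapper, not by the CLI.
# usage: convert_encrypted <file.pdf> <output-dir>
convert_encrypted() {
  if [ -z "${EDGEPARSE_PDF_PASSWORD:-}" ]; then
    echo "EDGEPARSE_PDF_PASSWORD is not set" >&2
    return 2
  fi
  edgeparse "$1" -f markdown -p "$EDGEPARSE_PDF_PASSWORD" -o "$2"
}
```

After setting the variable (for example from a secrets manager), call convert_encrypted secure.pdf output/.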

10. Content Safety Filters

EdgeParse removes hidden, off-page, and invisible text by default to prevent prompt-injection attacks in RAG pipelines. Disable specific filters with --content-safety-off:

| Flag value | Filter disabled |
| --- | --- |
| `hidden-text` | Rendering-mode hidden text (`Tr=3`) |
| `off-page` | Text outside the visible page box |
| `tiny` | Text smaller than 1 pt |
| `hidden-ocg` | Text in hidden optional content groups |
| `all` | All safety filters |

# Disable all filters (e.g., for debugging)
edgeparse document.pdf -f markdown --content-safety-off all -o output/

# Disable only off-page filter
edgeparse document.pdf -f markdown --content-safety-off off-page -o output/

Security note: Only disable safety filters if you control and trust the PDF source.

11. Layout Options

Preserve line breaks

By default, EdgeParse joins soft line breaks within paragraphs. Use --keep-line-breaks to preserve them:

edgeparse document.pdf -f markdown --keep-line-breaks -o output/

Include headers and footers

Headers and footers are filtered by default. Include them with:

edgeparse document.pdf -f markdown --include-header-footer -o output/

Tagged PDF structure tree

For tagged PDFs (accessibility-compliant), use the structure tree to improve heading and list detection:

edgeparse document.pdf -f markdown --use-struct-tree -o output/

PII sanitization

Redact common PII patterns (emails, phone numbers, SSNs, credit card numbers):

edgeparse document.pdf -f markdown --sanitize -o output/

Invalid character replacement

Replace characters that cannot be decoded with a custom string (default: space):

edgeparse document.pdf -f text --replace-invalid-chars "?" -o output/

12. Page Separators

Insert a custom string between pages in multi-page documents:

# Markdown separator
edgeparse document.pdf -f markdown \
  --markdown-page-separator $'\n---\n' \
  -o output/

# Text separator
edgeparse document.pdf -f text \
  --text-page-separator $'\n=====\n' \
  -o output/

# HTML separator
edgeparse document.pdf -f html \
  --html-page-separator '<hr class="page-break">' \
  -o output/
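The $'...' quoting used in the Markdown and text examples above turns \n into a real newline, but not every shell supports it (plain POSIX sh implementations may not). A portable alternative, sketched here, is to build the separator from a literal newline; nl and md_sep are just illustrative variable names.

```shell
# A variable holding a single literal newline.
nl='
'
# Same value as $'\n---\n': newline, three dashes, newline.
md_sep="${nl}---${nl}"

# Avoid md_sep=$(printf '\n---\n'): command substitution strips
# trailing newlines, so the separator would lose its final "\n".
printf 'separator length: %s\n' "${#md_sep}"
# separator length: 5
```

The variable can then be passed quoted, for example --markdown-page-separator "$md_sep", so the newlines reach the CLI intact.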

13. Complete Flag Reference

edgeparse [OPTIONS] <INPUT>...

Arguments:
  <INPUT>...   One or more PDF file paths (required)

Core options:
  -o, --output-dir <DIR>     Write output files here (default: input file directory)
  -f, --format <FMT>         Output formats, comma-separated:
                               json | text | html | markdown | markdown-with-html
                               | markdown-with-images  (default: json)
  -p, --password <PW>        Password for encrypted PDFs
      --pages <RANGE>        Page range, e.g. "1,3,5-7"
  -q, --quiet                Suppress log output

Layout options:
      --reading-order <ALG>  xycut (default) | off
      --table-method <M>     default (ruling lines) | cluster (borderless)
      --keep-line-breaks     Preserve original line breaks
      --use-struct-tree      Use tagged PDF structure tree
      --include-header-footer  Include headers/footers
      --sanitize             Enable PII sanitisation
      --replace-invalid-chars <CH>  Replace invalid chars (default: space)

Image options:
      --image-output <MODE>  off | external (default) | embedded
      --image-format <FMT>   png (default) | jpeg
      --image-dir <DIR>      Directory for extracted image files

Page separator options:
      --markdown-page-separator <STR>
      --text-page-separator <STR>
      --html-page-separator <STR>

Safety options:
      --content-safety-off <FLAGS>   all | hidden-text | off-page | tiny | hidden-ocg

Hybrid backend options (advanced):
      --hybrid <BACKEND>     off (default) | docling-fast
      --hybrid-mode <MODE>   auto (default) | full
      --hybrid-url <URL>     Hybrid backend service URL
      --hybrid-timeout <MS>  Timeout in milliseconds (default: 30000)
      --hybrid-fallback      Fall back to local extraction on error

Standard:
  -V, --version              Print version
  -h, --help                 Print help

Quick Cheat Sheet

# Markdown to stdout
edgeparse doc.pdf -f markdown

# JSON with bounding boxes
edgeparse doc.pdf -f json -o out/

# Multiple formats, pages 1-5
edgeparse doc.pdf -f markdown,json --pages "1-5" -o out/

# Borderless-table PDF
edgeparse doc.pdf -f markdown --table-method cluster -o out/

# Extract images to files
edgeparse doc.pdf -f markdown --image-output external --image-dir out/images/ -o out/

# Encrypted PDF
edgeparse secure.pdf -f markdown -p "password" -o out/

# Batch all PDFs in a folder
edgeparse pdfs/*.pdf -f markdown -o out/ -q

→ Continue: Python SDK Tutorial
