feat: Expand Document Intelligence converter — custom models, queryFields, structured field front matter by chienyuanchang · Pull Request #1913 · microsoft/markitdown

chienyuanchang · 2026-05-26T17:00:55Z

Summary

Brings the Azure Document Intelligence (DI) converter up to feature parity with the recently merged Azure Content Understanding converter (#1865):

- Custom / prebuilt model IDs via `docintel_model_id` (e.g. `prebuilt-invoice`, `prebuilt-receipt`, custom models). Previously hard-coded to `prebuilt-layout`.
- queryFields add-on via `docintel_query_fields=["FieldA", "FieldB", ...]`, only applied when the file type supports OCR add-ons (PDF/images).
- Structured field extraction as YAML front matter, mirroring the CU converter's output shape so downstream LLM pipelines can parse both uniformly.
- API version bumped to `2024-11-30` (GA) from the older `2024-07-31-preview`.
- User-Agent telemetry (`markitdown-docintel/<version>`) added to the DI client.

Motivation

Today the DI converter only runs prebuilt-layout and discards everything in AnalyzeResult.documents[*].fields. This means users who want invoice/receipt totals, vendor names, or custom-model fields have to call the DI SDK directly and lose MarkItDown's converter chain, plugin model, and YAML front matter convention.

The new CU converter (#1865) already exposes structured fields the way LLM pipelines want them. This PR gives DI users the same ergonomics without changing default behavior — MarkItDown(docintel_endpoint=...) with no extra args still runs prebuilt-layout and returns plain markdown.

What's included

| File | Change |
| --- | --- |
| `converters/_doc_intel_converter.py` | New `model_id` and `query_fields` constructor args; field-to-YAML serialization; `QUERY_FIELDS` feature toggled only for OCR file types; API version bump; User-Agent header. |
| `_markitdown.py` | Wire new `docintel_model_id` / `docintel_query_fields` kwargs through to the converter. |
| `__main__.py` | New CLI flags `--docintel-model-id` and `--docintel-query-fields` (comma-separated). |
| `tests/test_docintel_converter.py` | 23 new unit tests (no network calls) covering field serialization (scalars, currency, arrays, nested objects), YAML quoting/escaping, front-matter generation, `queryFields` feature gating per file type, and end-to-end kwarg plumbing from `MarkItDown(...)`. |
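
To make the comma-separated `--docintel-query-fields` flag concrete, here is a minimal sketch of how such a value could be turned into the list that `docintel_query_fields` expects. The helper name `parse_query_fields`, the whitespace trimming, and the treatment of an empty value as `None` are assumptions for illustration, not taken from the PR diff.

```python
import argparse
from typing import List, Optional


def parse_query_fields(raw: Optional[str]) -> Optional[List[str]]:
    """Split a comma-separated flag value into field names (hypothetical helper)."""
    if raw is None:
        return None
    # Trim whitespace and drop empty entries such as a trailing comma.
    fields = [part.strip() for part in raw.split(",") if part.strip()]
    return fields or None


def build_parser() -> argparse.ArgumentParser:
    # Only the two flags named in this PR; the real CLI has many more.
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("--docintel-model-id", default="prebuilt-layout")
    parser.add_argument("--docintel-query-fields", default=None)
    return parser


args = build_parser().parse_args(
    ["--docintel-query-fields", "Title, Author,PublicationDate"]
)
query_fields = parse_query_fields(args.docintel_query_fields)
```

Keeping the split in a small pure function makes it easy to test without invoking the CLI.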

Key design decisions

- Non-breaking by default. `docintel_model_id` defaults to `"prebuilt-layout"` and `docintel_query_fields` defaults to `None`. Existing callers see no behavior change other than the API version bump and the new User-Agent header sent to the service.
- queryFields is OCR-only. The DI service rejects queryFields for Office and HTML inputs (DOCX/PPTX/XLSX/HTML). The converter inspects `stream_info` and only adds `DocumentAnalysisFeature.QUERY_FIELDS` (and only forwards `query_fields=`) for OCR-supported MIME types. Non-OCR types silently skip it instead of failing.
- YAML front matter shape mirrors CU. Output is:
  ```
  ---
  modelId: prebuilt-invoice
  fields:
    VendorName: Contoso Ltd.
    InvoiceTotal: 1250.0 USD
    Items:
      - Description: Widget
        Amount: 250.0 USD
  ---

  # ... markdown body ...
  ```
  `modelId` is omitted when the value is empty; `fields:` is omitted entirely when DI returned no documents or no non-empty fields (so default `prebuilt-layout` output is unchanged).
- Field value extraction is typed-first. `_field_value()` checks DI's typed accessors (`value_string`, `value_number`, `value_date`, `value_currency`, `value_address`, `value_array`, `value_object`) in priority order before falling back to raw `content`. Dates serialize via `isoformat()`; currency renders as `"<amount> <code>"`.
- Custom YAML emitter (no PyYAML dep). A minimal dump function handles the scalar/list/dict shapes DI returns, with conservative quoting for strings containing YAML metacharacters, leading/trailing whitespace, or YAML reserved words (`null`, `true`, `yes`, etc.). This avoids adding a new dependency to markitdown core.
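
The typed-first extraction and the dependency-free YAML emitter described above can be pictured with the following sketch. It is not the PR's code: the function names, the exact quoting character set, the `currency_code` attribute on the stand-in currency object, and the choice to return an empty string when no fields survive are all assumptions, and `value_address` is left out for brevity. `SimpleNamespace` objects stand in for the SDK's field objects so the example runs without Azure packages.

```python
from datetime import date
from types import SimpleNamespace
from typing import Any, Dict, List

_RESERVED = {"null", "true", "false", "yes", "no", "on", "off", "~"}


def field_value(field: Any) -> Any:
    """Return a plain value for a DI-like field, trying typed accessors first."""
    for name in ("value_string", "value_number", "value_date"):
        value = getattr(field, name, None)
        if value is not None:
            return value.isoformat() if isinstance(value, date) else value
    currency = getattr(field, "value_currency", None)
    if currency is not None:
        return f"{currency.amount} {currency.currency_code}"
    array = getattr(field, "value_array", None)
    if array is not None:
        return [field_value(item) for item in array]
    obj = getattr(field, "value_object", None)
    if obj is not None:
        return {key: field_value(item) for key, item in obj.items()}
    return getattr(field, "content", None)  # fall back to the raw text


def scalar(value: Any) -> str:
    """Render one scalar, quoting strings that YAML could misread."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    risky = (
        text == ""
        or text != text.strip()
        or text[0] in "-?"
        or text.lower() in _RESERVED
        or any(ch in text for ch in ":#{}[],&*!|>'\"%@`\n")
    )
    if not risky:
        return text
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def dump(value: Any, indent: int = 0) -> List[str]:
    """Emit YAML lines for dicts, lists of dicts or scalars, and scalars."""
    pad = " " * indent
    lines: List[str] = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(dump(item, indent + 2))
            else:
                lines.append(f"{pad}{key}: {scalar(item)}")
    elif isinstance(value, list):
        for item in value:
            nested = dump(item, indent + 2) if isinstance(item, dict) else []
            if nested:
                # Fold the first key onto the dash line: "- key: value".
                lines.append(f"{pad}- {nested[0].lstrip()}")
                lines.extend(nested[1:])
            else:
                lines.append(f"{pad}- {scalar(item)}")
    return lines


def front_matter(model_id: str, fields: Dict[str, Any]) -> str:
    """Build the front matter block, or '' when no non-empty field exists."""
    values = {name: field_value(f) for name, f in fields.items()}
    values = {k: v for k, v in values.items() if v not in (None, "", [], {})}
    if not values:
        return ""  # assumed: plain layout output keeps no front matter
    doc: Dict[str, Any] = {}
    if model_id:
        doc["modelId"] = model_id
    doc["fields"] = values
    return "---\n" + "\n".join(dump(doc)) + "\n---\n\n"


ns = SimpleNamespace  # stand-in for the SDK's field objects


def usd(amount: float) -> SimpleNamespace:
    return ns(value_currency=ns(amount=amount, currency_code="USD"))


item = {"Description": ns(value_string="Widget"), "Amount": usd(250.0)}
invoice = {
    "VendorName": ns(value_string="Contoso Ltd."),
    "InvoiceTotal": usd(1250.0),
    "Items": ns(value_array=[ns(value_object=item)]),
}
print(front_matter("prebuilt-invoice", invoice))
```

With these stand-in inputs the sketch is meant to reproduce the front matter shape shown in the example output; a production emitter would need to cover more cases (nested lists, numeric-looking strings, addresses).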

Usage

```bash
# CLI: custom model
markitdown receipt.pdf --use-docintel \
  -e "https://<resource>.cognitiveservices.azure.com/" \
  --docintel-model-id prebuilt-invoice

# CLI: queryFields add-on with default prebuilt-layout
markitdown paper.pdf --use-docintel \
  -e "https://<resource>.cognitiveservices.azure.com/" \
  --docintel-query-fields "Title,Author,PublicationDate"
```

```python
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://<resource>.cognitiveservices.azure.com/",
    docintel_model_id="prebuilt-invoice",
    docintel_query_fields=["CustomerName", "PaymentMethod"],
)
print(md.convert("receipt.pdf").markdown)
```

Testing

```bash
pip install -e "packages/markitdown[az-doc-intel,dev]"
python -m pytest packages/markitdown/tests/test_docintel_converter.py -v
```

Unit tests: 23 passed, no network required. Coverage includes default model ID, custom model override, every typed-scalar branch, currency, fallback to content, array-of-scalars, empty-documents/empty-fields short-circuit, YAML quoting for special characters and reserved words, nested dict emission, queryFields feature inclusion for PDF/image and exclusion for Office formats, and kwarg propagation from MarkItDown(...).
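
The queryFields gating that these tests exercise can be illustrated without any network access. The sketch below is an assumption-laden stand-in, not the converter's code: the MIME prefix list, the function names, and the use of the plain string `"queryFields"` in place of `DocumentAnalysisFeature.QUERY_FIELDS` are all illustrative.

```python
from typing import Any, Dict, List, Optional

# Assumed set of OCR-capable MIME prefixes; the real converter's list may differ.
OCR_MIME_PREFIXES = ("application/pdf", "image/")


def supports_query_fields(mimetype: Optional[str]) -> bool:
    """True when the (hypothetical) MIME type is one the OCR add-ons accept."""
    return bool(mimetype) and mimetype.lower().startswith(OCR_MIME_PREFIXES)


def build_analyze_kwargs(
    mimetype: Optional[str],
    model_id: str = "prebuilt-layout",
    query_fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble request kwargs, adding queryFields only for OCR file types."""
    kwargs: Dict[str, Any] = {"model_id": model_id}
    if query_fields and supports_query_fields(mimetype):
        # Plain string used here as a stand-in for the SDK enum member.
        kwargs["features"] = ["queryFields"]
        kwargs["query_fields"] = list(query_fields)
    return kwargs


docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
pdf_kwargs = build_analyze_kwargs("application/pdf", query_fields=["Title"])
docx_kwargs = build_analyze_kwargs(docx, query_fields=["Title"])
```

Because the gating is a pure function of the MIME type and the requested fields, it can be asserted on directly, which is the property that lets this kind of check run with no service calls.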

Live validation against https://<resource>.cognitiveservices.azure.com/ (DefaultAzureCredential):

| Scenario | Result |
| --- | --- |
| Default `prebuilt-layout` on a paper PDF | Plain markdown, no front matter (unchanged behavior). |
| `docintel_model_id="prebuilt-invoice"` on a receipt PDF | YAML front matter with `InvoiceDate`, `InvoiceTotal`, nested `Items` (Description / Amount / UnitPrice). |
| `docintel_query_fields=["Title","Author","PublicationDate"]` on a paper PDF | API call accepted with `queryFields` feature; output matches plain layout (paper has no matching fields populated). |
| `prebuilt-invoice` + `docintel_query_fields=["CustomerName","PaymentMethod"]` combo | Front matter includes both invoice fields and queryFields. |

Backward compatibility

- No public API removed or renamed.
- Default model is still `prebuilt-layout`.
- Output for existing callers is byte-identical. The only differences are in the request sent to the service: the API version (`2024-11-30` GA instead of `2024-07-31-preview`) and the new User-Agent header.
- The `[az-doc-intel]` optional extra is unchanged; no new dependencies.

Relationship to #1865 (CU converter)

This PR intentionally mirrors the CU converter's front-matter shape and CLI flag naming convention (--docintel-* / --cu-*). DI and CU coexist in MarkItDown(...); when both endpoints are provided, the existing registration order (DI first, then CU) is preserved so CU takes precedence for overlapping formats — no change from #1865.

first version

2a6c257

chienyuanchang marked this pull request as ready for review May 26, 2026 17:01

chienyuanchang added 2 commits May 26, 2026 10:11

fix black

97c19b6

fix black

35c4848

chienyuanchang wants to merge 3 commits into microsoft:main from chienyuanchang:chienyuanchang/di_convertor_update
