feat: Expand Document Intelligence converter — custom models, queryFields, structured field front matter#1913

Open
chienyuanchang wants to merge 3 commits into
microsoft:mainfrom
chienyuanchang:chienyuanchang/di_convertor_update

Conversation

@chienyuanchang
Member

Summary

Brings the Azure Document Intelligence (DI) converter up to feature parity with the recently merged Azure Content Understanding converter (#1865):

  • Custom / prebuilt model IDs via docintel_model_id (e.g. prebuilt-invoice, prebuilt-receipt, custom models). Previously hard-coded to prebuilt-layout.
  • queryFields add-on via docintel_query_fields=["FieldA", "FieldB", ...], only applied when the file type supports OCR add-ons (PDF/images).
  • Structured field extraction as YAML front matter, mirroring the CU converter's output shape so downstream LLM pipelines can parse both uniformly.
  • API version bumped to 2024-11-30 (GA) from the older 2024-07-31-preview.
  • User-Agent telemetry (markitdown-docintel/<version>) added to the DI client.

Motivation

Today the DI converter only runs prebuilt-layout and discards everything in AnalyzeResult.documents[*].fields. This means users who want invoice/receipt totals, vendor names, or custom-model fields have to call the DI SDK directly and lose MarkItDown's converter chain, plugin model, and YAML front matter convention.

The new CU converter (#1865) already exposes structured fields the way LLM pipelines want them. This PR gives DI users the same ergonomics without changing default behavior — MarkItDown(docintel_endpoint=...) with no extra args still runs prebuilt-layout and returns plain markdown.

What's included

| File | Change |
| --- | --- |
| converters/_doc_intel_converter.py | New model_id and query_fields ctor args; field-to-YAML serialization; QUERY_FIELDS feature toggled only for OCR file types; API version bump; User-Agent header. |
| _markitdown.py | Wire new docintel_model_id / docintel_query_fields kwargs through to the converter. |
| __main__.py | New CLI flags --docintel-model-id and --docintel-query-fields (comma-separated). |
| tests/test_docintel_converter.py | 23 new unit tests (no network calls) covering field serialization (scalars, currency, arrays, nested objects), YAML quoting/escaping, front-matter generation, queryFields feature gating per file type, and end-to-end kwarg plumbing from MarkItDown(...). |
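
For illustration, here is a minimal sketch of how a comma-separated `--docintel-query-fields` value could be turned into the list that `docintel_query_fields` expects. The helper name `parse_query_fields`, the program name, and the handling of whitespace and empty entries are assumptions, not taken from the PR's `__main__.py`.

```python
import argparse


def parse_query_fields(raw):
    """Split a comma-separated flag value into a list of names.

    Hypothetical helper: trims whitespace, drops empty entries, and
    returns None when nothing usable remains.
    """
    names = [name.strip() for name in raw.split(",")]
    names = [name for name in names if name]
    return names or None


def build_parser():
    # Only the two flags named in the PR are modelled here.
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("--docintel-model-id", default="prebuilt-layout")
    parser.add_argument(
        "--docintel-query-fields", type=parse_query_fields, default=None
    )
    return parser


args = build_parser().parse_args(
    ["--docintel-query-fields", "Title, Author,,PublicationDate"]
)
print(args.docintel_model_id, args.docintel_query_fields)
```

Returning `None` for an empty value keeps the "no query fields" case identical to not passing the flag at all, which fits the non-breaking default described in this PR.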

Key design decisions

  • Non-breaking by default. docintel_model_id defaults to "prebuilt-layout" and docintel_query_fields defaults to None. Existing callers see no behavior change other than the API version bump.

  • queryFields is OCR-only. The DI service rejects queryFields for Office formats (DOCX/PPTX/XLSX/HTML). The converter inspects stream_info and only adds DocumentAnalysisFeature.QUERY_FIELDS (and only forwards query_fields=) for OCR-supported MIME types. Office types silently skip it instead of failing.

  • YAML front matter shape mirrors CU. Output is:

    ---
    modelId: prebuilt-invoice
    fields:
      VendorName: Contoso Ltd.
      InvoiceTotal: 1250.0 USD
      Items:
        - Description: Widget
          Amount: 250.0 USD
    ---
    
    # ... markdown body ...

    modelId is omitted when the value is empty; fields: is omitted entirely when DI returned no documents or no non-empty fields (so default prebuilt-layout output is unchanged).

  • Field value extraction is typed-first. _field_value() checks DI's typed accessors (value_string, value_number, value_date, value_currency, value_address, value_array, value_object) in priority order before falling back to raw content. Dates serialize via isoformat(); currency renders as "<amount> <code>".

  • Custom YAML emitter (no PyYAML dep). A minimal dump function handles the scalar/list/dict shapes DI returns, with conservative quoting for strings containing YAML metacharacters, leading/trailing whitespace, or YAML reserved words (null, true, yes, etc.). Avoids adding a new dependency to markitdown core.
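
The typed-first extraction and the dependency-free YAML emitter described above can be sketched roughly as follows. This is not the PR's code: `field_value`, `yaml_scalar`, `dump`, and `front_matter` are illustrative names, the fields are `SimpleNamespace` stand-ins rather than real SDK objects, only a subset of the typed accessors is covered, keys are assumed to be plain identifiers, and returning an empty string when there are no fields is an assumption about how the "unchanged default output" rule is met.

```python
from types import SimpleNamespace

# Typed accessors tried in order; a subset of those listed in the PR.
_SCALAR_ATTRS = ("value_string", "value_number", "value_date")
_RESERVED = {"null", "true", "false", "yes", "no", "on", "off", "~"}
_META = set(":#{}[],&*!|>'\"%@`")


def field_value(field):
    """Return a plain Python value, preferring typed accessors over content."""
    for attr in _SCALAR_ATTRS:
        value = getattr(field, attr, None)
        if value is not None:
            return value.isoformat() if hasattr(value, "isoformat") else value
    currency = getattr(field, "value_currency", None)
    if currency is not None:
        return f"{currency.amount} {currency.currency_code}"
    items = getattr(field, "value_array", None)
    if items is not None:
        return [field_value(item) for item in items]
    obj = getattr(field, "value_object", None)
    if obj is not None:
        return {key: field_value(sub) for key, sub in obj.items()}
    return getattr(field, "content", None)


def yaml_scalar(value):
    """Render one scalar, double-quoting anything YAML might misread."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, (list, dict)):  # only reached for empty containers
        return "[]" if isinstance(value, list) else "{}"
    text = str(value)
    try:
        float(text)
        numeric = True
    except ValueError:
        numeric = False
    risky = (
        text == "" or text != text.strip() or numeric
        or text.lower() in _RESERVED or text[0] in "-?"
        or "\n" in text or any(ch in _META for ch in text)
    )
    if not risky:
        return text
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def dump(value, indent=0):
    """Return YAML lines for nested dicts and lists of scalars."""
    pad = " " * indent
    lines = []
    if isinstance(value, dict):
        for key, sub in value.items():
            if isinstance(sub, (dict, list)) and sub:
                lines.append(f"{pad}{key}:")
                lines.extend(dump(sub, indent + 2))
            else:
                lines.append(f"{pad}{key}: {yaml_scalar(sub)}")
    else:  # list
        for item in value:
            if isinstance(item, (dict, list)) and item:
                nested = dump(item, indent + 2)
                lines.append(f"{pad}- {nested[0].lstrip()}")  # first key joins the dash
                lines.extend(nested[1:])
            else:
                lines.append(f"{pad}- {yaml_scalar(item)}")
    return lines


def front_matter(model_id, fields):
    """Build the front matter block, or "" when no field has a value."""
    values = {name: field_value(f) for name, f in fields.items()}
    values = {k: v for k, v in values.items() if v not in (None, "", [], {})}
    if not values:
        return ""
    data = {"modelId": model_id} if model_id else {}
    data["fields"] = values
    return "\n".join(["---", *dump(data), "---"]) + "\n\n"


def _money(amount):  # stand-in for a DI currency field
    return SimpleNamespace(
        value_currency=SimpleNamespace(amount=amount, currency_code="USD"))


item = SimpleNamespace(value_object={
    "Description": SimpleNamespace(value_string="Widget"), "Amount": _money(250.0)})
fields = {"VendorName": SimpleNamespace(value_string="Contoso Ltd."),
          "InvoiceTotal": _money(1250.0), "Items": SimpleNamespace(value_array=[item])}
print(front_matter("prebuilt-invoice", fields))
```

With these stand-in fields the sketch prints the same front matter shape shown in the example above. Quoting is deliberately conservative: a string that merely looks like a number or a reserved word is quoted rather than risk a parser reading it as a different type.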

Usage

# CLI — custom model
markitdown receipt.pdf --use-docintel \
  -e "https://<resource>.cognitiveservices.azure.com/" \
  --docintel-model-id prebuilt-invoice

# CLI — queryFields add-on with default prebuilt-layout
markitdown paper.pdf --use-docintel \
  -e "https://<resource>.cognitiveservices.azure.com/" \
  --docintel-query-fields "Title,Author,PublicationDate"

# Python API: custom model plus queryFields
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://<resource>.cognitiveservices.azure.com/",
    docintel_model_id="prebuilt-invoice",
    docintel_query_fields=["CustomerName", "PaymentMethod"],
)
print(md.convert("receipt.pdf").markdown)
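
Because the structured fields arrive as a `---`-delimited block ahead of the markdown, a downstream consumer may want to separate the two. The following is a minimal sketch under that assumption; `split_front_matter` is a hypothetical helper and not part of MarkItDown, and it leaves the YAML text unparsed so that no YAML library is needed.

```python
def split_front_matter(markdown):
    """Return (front_matter_text, body).

    front_matter_text is None when the text does not start with a
    '---' line or when no closing '---' line is found.
    """
    lines = markdown.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return None, markdown
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front = "".join(lines[1:index])
            body = "".join(lines[index + 1:]).lstrip("\n")
            return front, body
    # No closing delimiter: treat everything as body.
    return None, markdown


sample = (
    "---\n"
    "modelId: prebuilt-invoice\n"
    "fields:\n"
    "  VendorName: Contoso Ltd.\n"
    "---\n"
    "\n"
    "# Invoice\n"
)
front, body = split_front_matter(sample)
print(front)
print(body)
```

One caveat of this simple approach: a plain-layout document whose body happens to open with a `---` horizontal rule and contains a second one later would be misread as having front matter, so a stricter check may be warranted in practice.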

Testing

pip install -e "packages/markitdown[az-doc-intel,dev]"
python -m pytest packages/markitdown/tests/test_docintel_converter.py -v

Unit tests: 23 passed, no network required. Coverage includes default model ID, custom model override, every typed-scalar branch, currency, fallback to content, array-of-scalars, empty-documents/empty-fields short-circuit, YAML quoting for special characters and reserved words, nested dict emission, queryFields feature inclusion for PDF/image and exclusion for Office formats, and kwarg propagation from MarkItDown(...).
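
The queryFields gating that these tests exercise can be pictured with a small stand-alone sketch. Everything here is illustrative: the MIME prefixes, the extension list, and the function names are assumptions rather than the converter's actual tables, and the plain string `"queryFields"` stands in for the SDK's `DocumentAnalysisFeature.QUERY_FIELDS` so the example runs without the Azure package.

```python
# Assumed OCR-capable inputs; the converter's real list may differ.
OCR_MIME_PREFIXES = ("application/pdf", "image/")
OCR_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tiff")


def supports_query_fields(mimetype=None, extension=None):
    """True when the input looks like a PDF or image."""
    if mimetype and mimetype.lower().startswith(OCR_MIME_PREFIXES):
        return True
    if extension and extension.lower() in OCR_EXTENSIONS:
        return True
    return False


def build_analyze_kwargs(model_id, query_fields, mimetype=None, extension=None):
    """Assemble keyword arguments for an analyze call.

    The feature flag and the field names are added together, and only
    for OCR-capable inputs; other inputs silently skip both.
    """
    kwargs = {"model_id": model_id}
    if query_fields and supports_query_fields(mimetype, extension):
        kwargs["features"] = ["queryFields"]
        kwargs["query_fields"] = list(query_fields)
    return kwargs


DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
print(build_analyze_kwargs("prebuilt-layout", ["Title"], mimetype="application/pdf"))
print(build_analyze_kwargs("prebuilt-layout", ["Title"], mimetype=DOCX))
```

Keeping the flag and the field list behind the same condition avoids sending one without the other, which is the failure mode the Office-format exclusion is meant to prevent.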

Live validation against https://<resource>.cognitiveservices.azure.com/ (DefaultAzureCredential):

| Scenario | Result |
| --- | --- |
| Default prebuilt-layout on a paper PDF | Plain markdown, no front matter (unchanged behavior). |
| docintel_model_id="prebuilt-invoice" on a receipt PDF | YAML front matter with InvoiceDate, InvoiceTotal, nested Items (Description / Amount / UnitPrice). |
| docintel_query_fields=["Title","Author","PublicationDate"] on a paper PDF | API call accepted with queryFields feature; output matches plain layout (paper has no matching fields populated). |
| prebuilt-invoice + docintel_query_fields=["CustomerName","PaymentMethod"] combo | Front matter includes both invoice fields and queryFields. |

Backward compatibility

  • No public API removed or renamed.
  • Default model is still prebuilt-layout.
  • Output for existing callers is byte-identical except for the API version sent to the service (2024-11-30 GA instead of 2024-07-31-preview).
  • The [az-doc-intel] optional extra is unchanged — no new dependencies.

Relationship to #1865 (CU converter)

This PR intentionally mirrors the CU converter's front-matter shape and CLI flag naming convention (--docintel-* / --cu-*). DI and CU coexist in MarkItDown(...); when both endpoints are provided, the existing registration order (DI first, then CU) is preserved so CU takes precedence for overlapping formats — no change from #1865.

@chienyuanchang chienyuanchang marked this pull request as ready for review May 26, 2026 17:01