feat: Expand Document Intelligence converter - custom models, queryFields, structured field front matter #1913
Open
chienyuanchang wants to merge 3 commits into
Summary
Brings the Azure Document Intelligence (DI) converter up to feature parity with the recently merged Azure Content Understanding converter (#1865):
- `docintel_model_id` (e.g. `prebuilt-invoice`, `prebuilt-receipt`, custom models). Previously hard-coded to `prebuilt-layout`.
- `queryFields` add-on via `docintel_query_fields=["FieldA", "FieldB", ...]`, only applied when the file type supports OCR add-ons (PDF/images).
- API version bumped to `2024-11-30` (GA) from the older `2024-07-31-preview`.
- User-Agent header (`markitdown-docintel/<version>`) added to the DI client.

Motivation
Today the DI converter only runs `prebuilt-layout` and discards everything in `AnalyzeResult.documents[*].fields`. This means users who want invoice/receipt totals, vendor names, or custom-model fields have to call the DI SDK directly and lose MarkItDown's converter chain, plugin model, and YAML front matter convention.
The new CU converter (#1865) already exposes structured fields the way LLM pipelines want them. This PR gives DI users the same ergonomics without changing default behavior: `MarkItDown(docintel_endpoint=...)` with no extra args still runs `prebuilt-layout` and returns plain markdown.

What's included
- `converters/_doc_intel_converter.py`: `model_id` and `query_fields` ctor args; field-to-YAML serialization; `QUERY_FIELDS` feature toggled only for OCR file types; API version bump; User-Agent header.
- `_markitdown.py`: passes the `docintel_model_id` / `docintel_query_fields` kwargs through to the converter.
- `__main__.py`: `--docintel-model-id` and `--docintel-query-fields` (comma-separated) CLI flags.
- `tests/test_docintel_converter.py`: unit tests, including `queryFields` feature gating per file type and end-to-end kwarg plumbing from `MarkItDown(...)`.

Key design decisions
- Non-breaking by default. `docintel_model_id` defaults to `"prebuilt-layout"` and `docintel_query_fields` defaults to `None`. Existing callers see no behavior change other than the API version bump.
- `queryFields` is OCR-only. The DI service rejects `queryFields` for Office formats (DOCX/PPTX/XLSX/HTML). The converter inspects `stream_info` and only adds `DocumentAnalysisFeature.QUERY_FIELDS` (and only forwards `query_fields=`) for OCR-supported MIME types. Office types silently skip it instead of failing.
- YAML front matter shape mirrors CU. In the output, `modelId` is omitted when the value is empty; `fields:` is omitted entirely when DI returned no documents or no non-empty fields (so default `prebuilt-layout` output is unchanged).
- Field value extraction is typed-first. `_field_value()` checks DI's typed accessors (`value_string`, `value_number`, `value_date`, `value_currency`, `value_address`, `value_array`, `value_object`) in priority order before falling back to raw `content`. Dates serialize via `isoformat()`; currency renders as `"<amount> <code>"`.
- Custom YAML emitter (no PyYAML dep). A minimal dump function handles the scalar/list/dict shapes DI returns, with conservative quoting for strings containing YAML metacharacters, leading/trailing whitespace, or YAML reserved words (`null`, `true`, `yes`, etc.). Avoids adding a new dependency to `markitdown` core.

Usage
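As a rough illustration of how the two CLI flags could be parsed, here is a minimal `argparse` sketch. The flag names and the `prebuilt-layout` default come from this description; the `split_fields` helper, the `markitdown-sketch` program name, and the whitespace handling are assumptions made for the example and are not taken from `__main__.py`.

```python
import argparse


def split_fields(raw: str) -> list[str]:
    """Split a comma-separated flag value, dropping blanks and stray spaces."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Flag names follow the PR description; everything else here is assumed.
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("--docintel-model-id", default="prebuilt-layout")
    parser.add_argument("--docintel-query-fields", type=split_fields, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = build_parser().parse_args(
    [
        "--docintel-model-id", "prebuilt-invoice",
        "--docintel-query-fields", "CustomerName, PaymentMethod",
    ]
)
print(args.docintel_model_id, args.docintel_query_fields)
```

With no flags given, the parser falls back to `prebuilt-layout` and `None`, which matches the non-breaking defaults described above.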
Testing
    pip install -e "packages/markitdown[az-doc-intel,dev]"
    python -m pytest packages/markitdown/tests/test_docintel_converter.py -v

Unit tests: 23 passed, no network required. Coverage includes default model ID, custom model override, every typed-scalar branch, currency, fallback to `content`, array-of-scalars, empty-documents/empty-fields short-circuit, YAML quoting for special characters and reserved words, nested dict emission, `queryFields` feature inclusion for PDF/image and exclusion for Office formats, and kwarg propagation from `MarkItDown(...)`.
Live validation against `https://<resource>.cognitiveservices.azure.com/` (DefaultAzureCredential):
- `prebuilt-layout` on a paper PDF
- `docintel_model_id="prebuilt-invoice"` on a receipt PDF: `InvoiceDate`, `InvoiceTotal`, nested `Items` (Description / Amount / UnitPrice).
- `docintel_query_fields=["Title","Author","PublicationDate"]` on a paper PDF: `queryFields` feature; output matches plain layout (paper has no matching fields populated).
- `prebuilt-invoice` + `docintel_query_fields=["CustomerName","PaymentMethod"]` combo

Backward compatibility
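- The default model remains `prebuilt-layout`.
- The only behavior change for existing callers is the API version bump (`2024-11-30` GA instead of `2024-07-31-preview`).
- The `[az-doc-intel]` optional extra is unchanged: no new dependencies.

Relationship to #1865 (CU converter)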
This PR intentionally mirrors the CU converter's front-matter shape and CLI flag naming convention (`--docintel-*` / `--cu-*`). DI and CU coexist in `MarkItDown(...)`; when both endpoints are provided, the existing registration order (DI first, then CU) is preserved so CU takes precedence for overlapping formats, with no change from #1865.
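For reviewers who want a concrete picture of the typed-first extraction listed under Key design decisions, the following is a minimal sketch and not the converter's actual `_field_value()`. The `Field` and `Currency` dataclasses are hypothetical stand-ins for the DI SDK types, `value_address` is left out for brevity, and the handling of a missing currency code is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Optional


@dataclass
class Currency:  # hypothetical stand-in for DI's currency value
    amount: float
    currency_code: Optional[str] = None


@dataclass
class Field:  # hypothetical stand-in for a DI document field
    content: Optional[str] = None
    value_string: Optional[str] = None
    value_number: Optional[float] = None
    value_date: Optional[date] = None
    value_currency: Optional[Currency] = None
    value_array: Optional[list] = None
    value_object: Optional[dict] = None


def field_value(field: Field) -> Any:
    """Return the first populated typed value, else the raw content."""
    if field.value_string is not None:
        return field.value_string
    if field.value_number is not None:
        return field.value_number
    if field.value_date is not None:
        return field.value_date.isoformat()  # dates serialize as ISO strings
    if field.value_currency is not None:
        cur = field.value_currency
        # Assumption: drop the code suffix when no currency code is present.
        return f"{cur.amount} {cur.currency_code}" if cur.currency_code else str(cur.amount)
    if field.value_array is not None:
        return [field_value(item) for item in field.value_array]
    if field.value_object is not None:
        return {key: field_value(val) for key, val in field.value_object.items()}
    return field.content  # fallback when no typed accessor is set


invoice_total = Field(content="$12.50", value_currency=Currency(12.5, "USD"))
print(field_value(invoice_total))
```

Checking the typed accessors before `content` means a field that carries both a typed value and raw text is serialized from the typed value, which is what keeps dates and currency amounts in a normalized form.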