
feat: add gRPC protobuf definitions and conversion utility #546

Open
krickert wants to merge 7 commits into docling-project:main from ai-pipestream:feat/add-protobuf
@krickert

feat: add gRPC protobuf definitions and Pydantic conversion utility

Description

This PR introduces official Protocol Buffer definitions for the DoclingDocument model and a high-performance conversion utility to map between Docling's Pydantic models and Protobuf representations.

By moving the Protobuf source of truth into docling-core, we enable:

  1. Cross-language support: Standardized schema for clients in Go, Java, Rust, etc.
  2. Efficient Serialization: Significant reduction in payload size and faster (de)serialization compared to JSON.
  3. Architectural Decoupling: Separation of the document schema from the transport layer (docling-serve).

Key Changes

1. Protobuf Definitions (/proto)

  • Added ai/docling/core/v1/docling_document.proto.
  • Fully mirrors the DoclingDocument Pydantic model, including:
    • Text items (Titles, Headers, Paragraphs, etc.).
    • Structured items (Tables, Pictures, Key-Value pairs).
    • Metadata (Provenance, Bounding Boxes, Image references).
    • New field types: field_regions, field_items, field_heading, and field_value.

2. Conversion Utility (docling_core/utils/conversion.py)

  • Implemented docling_document_to_proto: A surgical, field-by-field mapper.
  • Handles complex types like google.protobuf.Struct for custom metadata.
  • Validates enum mappings for DocItemLabel, GroupLabel, and CoordOrigin.
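As a rough illustration of the enum validation idea, the sketch below builds a name-based mapping and fails loudly when a Python enum member has no proto counterpart. Both enum classes and the `build_enum_map` helper are stand-ins invented for this example; the real code would use the generated proto enums and the actual Docling enums.

```python
from enum import Enum, IntEnum


class CoordOrigin(str, Enum):
    """Stand-in for the Pydantic-side enum (illustrative only)."""
    TOPLEFT = "TOPLEFT"
    BOTTOMLEFT = "BOTTOMLEFT"


class ProtoCoordOrigin(IntEnum):
    """Hypothetical stand-in for a generated proto enum."""
    COORD_ORIGIN_UNSPECIFIED = 0
    COORD_ORIGIN_TOPLEFT = 1
    COORD_ORIGIN_BOTTOMLEFT = 2


def build_enum_map(py_enum, proto_enum, prefix):
    """Map every Python member to a proto member by name; raise on gaps."""
    mapping, missing = {}, []
    for member in py_enum:
        proto_name = f"{prefix}{member.name}"
        if proto_name in proto_enum.__members__:
            mapping[member] = proto_enum[proto_name]
        else:
            missing.append(member.name)
    if missing:
        raise ValueError(
            f"{py_enum.__name__} has no proto counterpart for: {missing}"
        )
    return mapping


# Built once at import time, so a missing mapping surfaces immediately.
COORD_ORIGIN_MAP = build_enum_map(CoordOrigin, ProtoCoordOrigin, "COORD_ORIGIN_")
```

Building the map eagerly means a newly added Python enum member without a proto value shows up as an import-time error rather than a silent default on the wire.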

3. Tooling & Dependencies

  • Added protobuf as a core dependency.
  • Added grpcio-tools to the dev dependency group for local development.
  • Added scripts/gen_proto.py to automate code generation using uv.
  • Integrated buf linting and formatting standards.
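For readers unfamiliar with grpcio-tools, a generation script of this kind usually boils down to one `grpc_tools.protoc` invocation. The sketch below only assembles that command; it is not the contents of scripts/gen_proto.py, and the `gen` output directory and the `build_protoc_command` helper are assumptions made for illustration.

```python
import sys
from pathlib import Path


def build_protoc_command(proto_root, out_dir, proto_files):
    """Assemble a grpc_tools.protoc call (running it needs grpcio-tools)."""
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        f"-I{proto_root}",            # import root for .proto files
        f"--python_out={out_dir}",    # generated *_pb2.py modules
        f"--pyi_out={out_dir}",       # generated type stubs
        *[str(p) for p in proto_files],
    ]


cmd = build_protoc_command(
    Path("proto"),
    Path("gen"),  # placeholder output directory
    [Path("proto/ai/docling/core/v1/docling_document.proto")],
)
print(" ".join(cmd))
# To actually generate code: subprocess.run(cmd, check=True)
```

Under uv, the same command would typically be launched through `uv run` so that the dev dependency group provides grpcio-tools.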

Validation Performed

Unit Tests

  • Added test/test_proto_conversion.py to verify:
    • Minimal document conversion.
    • Rich text and title mapping.
    • Consistency of default field names (e.g., _root_).

Integration Testing (via docling-serve)

  • Verified against the docling-serve gRPC suite.
  • Successfully processed the full array of standard Docling test PDFs via gRPC conversion.
  • Verified schema consistency using the docling-serve startup schema validator, ensuring 100% parity between Pydantic and Proto schemas.

Related Issues/PRs

Protocol buffer integration:

  • Created docling_core/proto/__init__.py to centralize imports for DoclingDocument protocol buffer definitions and conversion utilities.

@github-actions
Contributor

github-actions Bot commented Mar 16, 2026

DCO Check Passed

Thanks @krickert, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented Mar 16, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for:

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: e233312

Signed-off-by: Kristian Rickert <krickert@gmail.com>
@krickert
Author

Docling Team,

I have updated this implementation and moved the core definitions and mapping into this repository. I debated keeping the mapping in docling-serve, but feel this might be the best place since it's so model-specific.

Specifically, I’ve moved the Protobuf definitions and the Pydantic-to-Proto conversion logic into docling-core to ensure that the document schema remains the single source of truth for all downstream services.

  • Protobuf Definitions: The schemas strictly follow buf lint conventions and are designed as 1:1 mirrors of the current Pydantic models. I have been very careful to ensure full field parity and it works with all protobuf standard tools (buf was just for linting because it has the strongest linting standards).
  • Dynamic Schema Validation: At startup, the server crawls the Pydantic models and validates them against the Protobuf descriptors. Any deltas are logged as warnings, ensuring we never silently drift out of sync.
  • Resilient Mapping: If new fields are added to the Pydantic models during development, the server gracefully maps them to a custom_fields section, ensuring the gRPC server remains operational while the Proto is being updated.
  • Extensive Validation: I have verified this implementation using both Python and Java clients to ensure the generated stubs are idiomatic and performant across languages. I intend to try more... because why not?
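To make the drift check and the custom_fields fallback concrete, here is a minimal sketch using only the standard library. The dataclass, the `PROTO_FIELDS` set, and both helper functions are hypothetical stand-ins; the actual validator lives on the docling-serve side and is not reproduced here.

```python
import logging
from dataclasses import dataclass, fields

logger = logging.getLogger("schema_parity")


@dataclass
class TextItemModel:
    """Stand-in for a Pydantic model; `confidence` is missing from the proto."""
    self_ref: str
    text: str
    confidence: float = 1.0


# Hypothetical stand-in for the field names a proto descriptor would expose.
PROTO_FIELDS = {"self_ref", "text", "custom_fields"}


def find_drift(model_cls, proto_fields):
    """Return model fields with no proto counterpart, logging a warning each."""
    missing = sorted(
        f.name for f in fields(model_cls) if f.name not in proto_fields
    )
    for name in missing:
        logger.warning("%s.%s has no proto field", model_cls.__name__, name)
    return missing


def to_proto_dict(item, proto_fields):
    """Copy known fields directly; park unknown ones under custom_fields."""
    out = {"custom_fields": {}}
    for f in fields(item):
        value = getattr(item, f.name)
        if f.name in proto_fields:
            out[f.name] = value
        else:
            out["custom_fields"][f.name] = value
    return out


drift = find_drift(TextItemModel, PROTO_FIELDS)
message = to_proto_dict(TextItemModel("#/texts/0", "hi", 0.5), PROTO_FIELDS)
```

With real classes, the same comparison could presumably be driven by Pydantic's `model_fields` and the generated message's `DESCRIPTOR.fields_by_name` instead of the stand-ins used here.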

I am currently running this implementation through a stress test of 80k+ PDFs over the coming week to verify stability at scale. I would love to start a discussion on how we can expand this into native streaming functionality, and in the meantime I’d be happy to contribute language-specific tutorials.

Looking forward to your feedback... you built an incredible product and I'm happy to contribute.

@krickert
Author

This merge is tied together with docling-project/docling-serve#504 - this is the model definition and mapping while the other project is the gRPC server option.

@krickert
Author

Bump.. is this enough to start a review? Just an initial pass.. I can convert to draft if we need a few rounds of discussion. On my side, I'm getting ready to test this against a large corpus in common-crawl @dolfim-ibm

@krickert
Author

@dolfim-ibm updated latest commit with main to keep it up to date. The latest protobufs were sync'ed (the grpc server was working without it though, but the new model changes have been added and properly mapped)

@krickert
Author

Pushed the changes to docling-project/docling-serve/pull/504 as these two are tightly coupled.

Apply design review of DoclingDocument proto against the Pydantic source
of truth and lock the parity contract in PARITY.md.

Proto changes:
- PictureItem: make self_ref required and parent optional, matching
  the Pydantic shape and other DocItem subclasses.
- CodeItem: inline TextItemBase fields directly instead of nesting a
  base wrapper, since CodeItem overrides the meta field with
  FloatingMeta and the wrapper would force two coexisting meta fields.
- Formatting: add script_raw fallback so unrecognized Script enum
  values round-trip as strings, matching the policy used for label,
  picture_class, code_language, and modality.
- TrackSource: drop the redundant kind field; the Pydantic
  Literal["track"] discriminator is already represented by the
  SourceType oneof tag in proto.

Conversion utility updated to populate the inlined CodeItem fields,
emit script_raw on unrecognized scripts, and stop writing TrackSource.kind.

PARITY.md documents intentional Pydantic vs proto differences,
including computed fields surfaced for JSON parity (TableData.grid)
and Pydantic-only discriminators absorbed by oneof tags.

Pre-release wire stability is explicitly out of scope until the gRPC
PR ships, so renames are still permitted; this is documented in the
sync procedure on the docling-serve side.
@krickert
Author

Update: proto parity tightening and sync with main

Pushed f2c8145 on top of the existing branch. The branch is up to date with upstream/main and the test suite is green (chunker tests skipped due to an unrelated tree_sitter env dependency).

I manually went through everything in docling_document.proto against Pydantic's model. No new features, just sharper parity and improvements so future drift is loud.

Proto changes (still backed by Pydantic):

  • PictureItem: self_ref is now required and parent is optional, matching the Pydantic shape and bringing it in line with every other DocItem subclass.
  • CodeItem: TextItemBase is now inlined instead of nested as a base wrapper. CodeItem overrides meta with FloatingMeta in Pydantic, so the wrapper would have forced two coexisting meta fields (base.meta: BaseMeta and meta: FloatingMeta), which is an anti-pattern on the wire. Inlining preserves field semantics 1:1 with Pydantic and removes the ambiguity.
  • Formatting: added script_raw fallback so unrecognized Script enum values round-trip as strings. This matches the policy already used for label, picture_class, code_language, and modality, giving every growable enum the same forward-compat story.
  • TrackSource: dropped the redundant kind field. The Pydantic Literal["track"] discriminator is already represented by the SourceType oneof tag in proto, so emitting it as a string was double-encoding the same information.

docling_core/utils/conversion.py is updated to populate the inlined CodeItem fields, emit script_raw on unrecognized scripts, and stop writing TrackSource.kind.
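The `script_raw` fallback can be pictured with the sketch below: a recognized value travels as the enum, and anything else travels as UNSPECIFIED plus the original string, so it survives a round trip. The `ProtoScript` enum, the dict-shaped message, and both functions are illustrative assumptions, not the code in conversion.py.

```python
from enum import IntEnum


class ProtoScript(IntEnum):
    """Hypothetical stand-in for the generated Script enum."""
    SCRIPT_UNSPECIFIED = 0
    SCRIPT_BASELINE = 1
    SCRIPT_SUB = 2
    SCRIPT_SUPER = 3


def encode_script(value: str) -> dict:
    """Return the enum when recognized, else UNSPECIFIED plus the raw string."""
    name = f"SCRIPT_{value.upper()}"
    if name in ProtoScript.__members__ and name != "SCRIPT_UNSPECIFIED":
        return {"script": ProtoScript[name], "script_raw": ""}
    return {"script": ProtoScript.SCRIPT_UNSPECIFIED, "script_raw": value}


def decode_script(msg: dict) -> str:
    """Invert encode_script so unrecognized values come back unchanged."""
    if msg["script_raw"]:
        return msg["script_raw"]
    return msg["script"].name.removeprefix("SCRIPT_").lower()


known = encode_script("sub")
unknown = encode_script("smallcaps")  # not in the enum, kept as a raw string
```

An older client that does not know a newer enum value still receives the string, which is the forward-compatibility property the `*_raw` policy is after.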

Documentation
I added proto/ai/docling/core/v1/PARITY.md

Documents the parity contract between Pydantic and proto. Covers the *_raw policy for forward-compatible enums, intentional Pydantic-only constructs absorbed by proto's oneof tags (e.g. TrackSource.kind), and computed fields surfaced for JSON parity (e.g. TableData.grid). Future contributors get a single page that explains why something is or is not 1:1, so the review surface for the next sync stays small.

Wire stability

Since this is pre-release, renames in the model are easy to keep up with. Once clients are pinned, though, renames should get some thought, especially if they're small or not needed. Something to discuss, but not that big of a deal: it'll just prevent head scratching if a name change breaks the protobuf contract and we handle it in mapping.

Verification

  • python scripts/gen_proto.py regenerates clean against main.
  • pytest test/test_proto_conversion.py is green.
  • The runtime schema validator on the docling-serve side (feat: Grpc native converter docling-serve#504) reports zero warnings against this branch.

I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: f2c8145

Signed-off-by: Kristian Rickert <krickert@gmail.com>
@krickert
Author

I've made a repository with examples demonstrating how to run docling via gRPC.

Here: ai-pipestream/docling-grpc-examples

| Language / Environment | Tooling & Stack Details |
| --- | --- |
| Python | uv + grpcio |
| Go | protoc-gen-go + grpc-go |
| Java (Vanilla) | Gradle + protobuf-gradle-plugin + grpc-java |
| Node.js (TypeScript) | @grpc/grpc-js + grpc-tools (TypeScript via tsx) |
| Rust | tonic + prost (compile-time stub gen via tonic-build) |
