This page is for people who:
- Want to extend ExStruct's internal implementation
- Want to add new extraction targets (shapes, SmartArt, comments, etc.)
- Want to extend a backend (Openpyxl / COM / LibreOffice / future XML)
- Are trying to submit a PR but are unsure which files to touch
```text
src/exstruct/core/
├── pipeline.py          # Orchestrates the overall flow
├── backends/            # Backend abstractions and runtime-specific adapters
│   ├── openpyxl_backend.py
│   ├── com_backend.py
│   └── libreoffice_backend.py
├── libreoffice.py       # LibreOffice runtime/session helper
├── ooxml_drawing.py     # OOXML drawing/chart parser for best-effort rich extraction
├── modeling.py          # Final data integration
├── workbook.py          # Workbook lifecycle management
├── cells.py             # Cell/table analysis (mainly openpyxl)
└── utils.py             # Shared utilities
```
- Do not put Excel parsing logic in Pipeline
- Limit Pipeline's responsibilities to only the following:
  - Calling order of backends
  - Fallback decisions
  - Artifact management
  - Handoff to Modeling
Decision criterion
Is this code directly reading Excel content? If so, it should not be in Pipeline.
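As a rough illustration of that boundary, the sketch below shows a pipeline function that only decides call order and fallback, then hands off to modeling. The names `run_pipeline`, `basic_extract`, `rich_extract`, and `build_model`, and the choice of `RuntimeError` as the "backend unavailable" signal, are assumptions made for this example and are not ExStruct's actual API.

```python
from typing import Callable

# Hypothetical raw payload type; the real backends may return richer objects.
RawData = dict[str, list]


def run_pipeline(
    rich_extract: Callable[[], RawData],
    basic_extract: Callable[[], RawData],
    build_model: Callable[[RawData], dict],
) -> tuple[dict, list[str]]:
    """Decide call order and fallback only; never read Excel content here."""
    fallbacks: list[str] = []
    raw = dict(basic_extract())  # baseline extraction runs first
    try:
        raw.update(rich_extract())  # optional richer data from COM/LibreOffice
    except RuntimeError as exc:
        # The fallback decision is made here, with the reason recorded.
        fallbacks.append(f"rich backend unavailable: {exc}")
    # Interpretation and integration are delegated to modeling.
    return build_model(raw), fallbacks
```

Note that nothing in the function body touches a workbook object: every Excel read happens inside the callables passed in.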
Backend exists for pure extraction.
- Excel → raw data
- No interpretation
- No integration
- Avoid side effects as much as possible
Allowed in a Backend
- Reading cell values
- Reading shape positions
- Calling COM APIs
- Raising exceptions
Not allowed in a Backend
- Building WorkbookData / SheetData
- Bringing in concerns about the output format
- Fallback logging (this is Pipeline's responsibility)
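To make "pure extraction" concrete, here is a minimal sketch of a toy backend that reads from an in-memory mapping standing in for a worksheet. `InMemoryBackend` and the `RawCell` tuple are hypothetical and exist only for this example. The point is what the class does not do: it builds no models, knows nothing about output formats, and logs nothing.

```python
from typing import Mapping

# Hypothetical raw record: (row, column, value). Not ExStruct's real type.
RawCell = tuple[int, int, object]


class InMemoryBackend:
    """Toy backend over a {(row, col): value} mapping standing in for a sheet."""

    def __init__(self, sheet: Mapping[tuple[int, int], object]) -> None:
        self._sheet = sheet

    def extract_cells(self) -> list[RawCell]:
        # Return what is there in a stable order: no merging, no type
        # coercion, no knowledge of the final output shape.
        return [(r, c, v) for (r, c), v in sorted(self._sheet.items())]

    def extract_shapes(self) -> list[dict]:
        # Unsupported capability: raise and let the caller decide what to do.
        raise NotImplementedError("shapes are not available in this backend")
```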
Only Modeling should integrate results from multiple backends into a single semantic structure.
- Combine Openpyxl + COM / LibreOffice results
- Normalize coordinates, directions, and types
- Fill in missing data
The only layer that may know the final JSON/YAML/TOON shape is Modeling.
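The following sketch illustrates the kind of work that belongs in Modeling: combining raw results, normalizing types, and filling in missing data. The `Shape` and `Sheet` dataclasses and `build_sheet` are simplified, hypothetical stand-ins; the real models are Pydantic classes and are likely richer.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical, simplified stand-ins for the real models.
@dataclass
class Shape:
    text: str
    left: int
    top: int


@dataclass
class Sheet:
    cells: dict[str, object] = field(default_factory=dict)
    shapes: list[Shape] = field(default_factory=list)


def build_sheet(raw_cells: dict[str, object], raw_shapes: Optional[list[dict]]) -> Sheet:
    """Integrate raw output from several backends into one structure."""
    shapes = [
        Shape(
            text=str(s.get("text", "")),          # fill in missing text
            left=round(float(s.get("left", 0))),  # normalize coordinates to int
            top=round(float(s.get("top", 0))),
        )
        for s in (raw_shapes or [])  # None means the rich backend fell back
    ]
    return Sheet(cells=dict(raw_cells), shapes=shapes)
```

Because the backends only hand over raw dictionaries, a change to the final shape stays inside this one function and the model classes.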
1. Add an extraction method to `Backend`

   ```python
   class Backend(Protocol):
       def extract_comments(self, ...): ...
   ```

2. Implement it in `OpenpyxlBackend` / `ComBackend`
   - One side is enough. Raise `NotImplementedError` if not implemented.
3. Add the call to `pipeline.py`
   - Explicitly state whether to include it as a fallback target.
4. Integrate into WorkbookData in `modeling.py`
5. Add tests
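The steps above can be sketched end to end with toy classes. The class bodies, the parameterless `extract_comments` signature, its `list[dict]` return type, and the `collect_comments` helper are all assumptions made for illustration; only the class and method names come from this guide.

```python
from typing import Protocol


class Backend(Protocol):
    # Parameters and return type are assumptions for this sketch.
    def extract_comments(self) -> list[dict]: ...


class OpenpyxlBackend:
    """Toy version: holds pre-read comments instead of a real workbook."""

    def __init__(self, comments: list[dict]) -> None:
        self._comments = comments

    def extract_comments(self) -> list[dict]:
        return list(self._comments)


class ComBackend:
    """Toy version: the new target is not implemented on this side."""

    def extract_comments(self) -> list[dict]:
        raise NotImplementedError("comments are not implemented for COM")


def collect_comments(backends: list[Backend]) -> list[dict]:
    """Pipeline-side call: use the first backend that supports comments."""
    for backend in backends:
        try:
            return backend.extract_comments()
        except NotImplementedError:
            continue  # try the next backend in the call order
    return []  # no backend supports comments; modeling sees an empty list
```

For example, `collect_comments([ComBackend(), OpenpyxlBackend(comments)])` skips the COM side and returns the openpyxl result.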
1. Implement `Backend` and/or `RichBackend` from `src/exstruct/core/backends/base.py` in a new backend module

   ```python
   class XmlBackend:
       def extract_cells(self, *, include_links: bool): ...
       def extract_shapes(self, *, mode: str): ...
   ```

2. Add backend selection to Pipeline
   - Minimize changes to existing backends.
3. Keep Modeling unchanged if possible
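One way to add backend selection without touching existing backends is a small name-to-factory registry. This is only a sketch: `BACKEND_FACTORIES` and `select_backend` are hypothetical, the class bodies are placeholders, and the actual selection logic in `pipeline.py` may work differently.

```python
from typing import Callable


class OpenpyxlBackend:
    """Stand-in for an existing backend; it is left untouched."""

    def extract_cells(self, *, include_links: bool) -> list[dict]:
        return [{"cell": "A1", "value": 1}]


class XmlBackend:
    """The new backend; returns nothing yet in this placeholder."""

    def extract_cells(self, *, include_links: bool) -> list[dict]:
        return []


# Hypothetical registry: adding a backend is one new entry here.
BACKEND_FACTORIES: dict[str, Callable[[], object]] = {
    "openpyxl": OpenpyxlBackend,
    "xml": XmlBackend,
}


def select_backend(name: str) -> object:
    """Return a new backend instance for a registered name."""
    try:
        factory = BACKEND_FACTORIES[name]
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None
    return factory()
```

With this shape, the only change outside the new module is a single registry entry, and Modeling keeps receiving the same kind of raw data.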
- This is the most fragile type of change
- Limit changes to `modeling.py` and the Pydantic model
- Do not change the backend
- Do not change Pipeline
- COM or LibreOffice runtime being unavailable is the normal case
- Do not treat fallback as an exception
- Always provide a `FallbackReason`
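For illustration, `FallbackReason` and `log_fallback` could be defined roughly as follows. The two member names are the ones used in this guide; the string values, the logger name, and the choice of WARNING level are assumptions, and the real definitions may differ.

```python
import logging
from enum import Enum

logger = logging.getLogger("exstruct.fallback")  # logger name is an assumption


class FallbackReason(str, Enum):
    # Member names come from this guide; the string values are assumed.
    COM_UNAVAILABLE = "com_unavailable"
    LIBREOFFICE_UNAVAILABLE = "libreoffice_unavailable"


def log_fallback(*, reason: FallbackReason, message: str) -> None:
    """Record an expected downgrade as a warning, not as an error."""
    logger.warning("fallback [%s]: %s", reason.value, message)
```

Keyword-only arguments make it impossible to call the helper without naming a reason, which matches the rule that a fallback always carries one.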
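```python
# COM runtime is missing: record the reason and continue without it.
log_fallback(
    reason=FallbackReason.COM_UNAVAILABLE,
    message="COM backend not available",
)
# LibreOffice runtime is missing: same pattern, different reason.
log_fallback(
    reason=FallbackReason.LIBREOFFICE_UNAVAILABLE,
    message="LibreOffice backend not available",
)
```

| Layer | Test focus |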
|---|---|
| Backend | extraction correctness |
| Pipeline | fallback / branching |
| Modeling | integration logic |
Tests to avoid
- Fragile tests that depend heavily on a real Excel instance
- Massive tests that couple Backend and Modeling all at once
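A Pipeline-level test can exercise fallback and branching with stub callables instead of a real Excel instance. In the sketch below, `extract_with_fallback` is a hypothetical stand-in for the function under test, and using `OSError` to simulate a missing runtime is an assumption.

```python
# Hypothetical function under test; stands in for a real Pipeline entry point.
def extract_with_fallback(rich, basic):
    try:
        return rich(), None
    except OSError:
        # Rich runtime is missing: use the basic result and report why.
        return basic(), "rich_unavailable"


def test_falls_back_when_rich_backend_is_missing():
    def rich():
        raise OSError("runtime not installed")  # simulate missing COM/LibreOffice

    data, reason = extract_with_fallback(rich, lambda: {"cells": []})
    assert data == {"cells": []}
    assert reason == "rich_unavailable"


def test_uses_rich_backend_when_available():
    data, reason = extract_with_fallback(lambda: {"shapes": [1]}, lambda: {})
    assert data == {"shapes": [1]}
    assert reason is None
```

Each test covers one branch and touches one layer, so a failure points directly at the fallback logic.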
- No Excel parsing logic in Pipeline
- No interpretation logic in Backend
- Modeling is the single source of truth for the final structure
- Fallback reason is explicit
- Tests have been added
- If the public API changed, docs have been updated
Anti-patterns to avoid
- Building WorkbookData inside Backend
- Calling openpyxl / xlwings directly from Pipeline
- Ad-hoc logic that "just handles it here"
- Catch-all exceptions with no fallback reason
- Excel is fragile
- COM is powerful but unstable
- LLM/RAG use cases need a stable structure first

Therefore, separate responsibilities and localize failure points.