feat(export): native DOCX export via html-to-docx (opt-in)#7568
JohnMcLear merged 23 commits into ether:develop
Review Summary by Qodo (Agentic_describe updated until commit 17bf820)
Native DOCX/PDF export and DOCX import without LibreOffice (soffice-optional)
Walkthroughs
Description
• **Native DOCX/PDF export and DOCX import without LibreOffice:** Replaces flag-gated DOCX path with complete soffice-free import/export story using pure-JavaScript converters (html-to-docx, pdfkit, mammoth)
• **Soffice-first dispatch model:** Single cascade in ExportHandler and ImportHandler: uses soffice if available, falls back to native converters when soffice = null, fails clearly with 5xx errors if conversion fails
• **HTML sanitization hardening:** New stripRemoteImages sanitizer removes remote <img> tags (http/https/protocol-relative) to close SSRF vulnerability; applied to both DOCX and PDF export branches
• **Pure-JavaScript PDF export:** pdfkit + htmlparser2 walker (~170 LOC) supporting block elements, inline styling, links, data URI images, text alignment, and list nesting; no jsdom or native binaries
• **Native DOCX import via mammoth:** Converts DOCX buffers to HTML with alignment preservation and base64 image embedding; integrates into existing setPadHTML pipeline
• **Comprehensive test coverage:** Unit and integration tests for sanitizer, PDF walker, DOCX import, export endpoints, round-trip fidelity, and negative cases for unsupported formats
• **UI always shows DOCX/PDF links:** Removed conditional hiding since native converters are built-in; ODT remains gated on soffice availability
• **Documentation and design specs:** Added implementation plan and design specification detailing problem statement, selection model, error handling, and out-of-scope follow-ups
• **Minimal runtime overhead:** ~10–12 MB added dependencies vs. ~500 MB for LibreOffice or ~200 MB for puppeteer
Diagram
flowchart LR
A["Export/Import Request"] --> B{"soffice Available?"}
B -->|"yes"| C["Use soffice<br/>all formats"]
B -->|"withoutPDF<br/>Windows"| D["soffice for most<br/>native PDF"]
B -->|"no"| E["Native converters"]
E --> F["DOCX: html-to-docx"]
E --> G["PDF: pdfkit walker"]
E --> H["DOCX import: mammoth"]
F --> I["HTML Sanitization<br/>stripRemoteImages"]
G --> I
H --> J["HTML Pipeline"]
I --> K["Convert to Buffer"]
J --> L["Set Pad Content"]
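
For reviewers who want a concrete picture of the stripRemoteImages step in the diagram, here is a minimal sketch. It is not the code from this PR: the function name `stripRemoteImagesSketch`, the helper `isRemoteSrc`, and the regex-based approach are all assumptions made for illustration, and a real sanitizer would more likely walk a parsed DOM than match tags with regexes. It follows the stricter reading of the rule (keep an `<img>` only when its `src` is `data:` or relative).

```typescript
// Hypothetical sketch; names and regex approach are illustrative assumptions.
// Matches a whole <img ...> tag (naive: assumes no '>' inside attribute values).
const IMG_TAG = /<img\b[^>]*>/gi;
// Captures the src value whether double-quoted, single-quoted, or unquoted.
// Requiring leading whitespace avoids matching attributes such as data-src.
const SRC_ATTR = /\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;

// A src is treated as remote when it is protocol-relative (//host/...)
// or carries any URL scheme other than data:.
const isRemoteSrc = (src: string): boolean => {
  const s = src.trim().toLowerCase();
  if (s.startsWith('data:')) return false;
  if (s.startsWith('//')) return true;
  return /^[a-z][a-z0-9+.-]*:/.test(s);
};

// Removes every <img> whose src is remote; leaves all other markup untouched.
const stripRemoteImagesSketch = (html: string): string =>
  html.replace(IMG_TAG, (tag: string): string => {
    const m = SRC_ATTR.exec(tag);
    if (m == null) return tag; // no src attribute, so nothing would be fetched
    const src = m[1] ?? m[2] ?? m[3] ?? '';
    return isRemoteSrc(src) ? '' : tag;
  });
```

The point of running this before conversion is that the converter never sees a URL it could be tricked into fetching server-side; embedded `data:` images and relative paths pass through unchanged.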
File Changes
1. src/tests/backend/specs/export.ts
Summary
Closes #7538. Replaces the original flag-gated DOCX-only path with a complete soffice-free import/export story:
• With settings.soffice = null, pads can be exported as html, txt, etherpad, docx, pdf: all in-process, no subprocess, no native binaries.
• Import accepts .html, .txt, .etherpad, .docx: all in-process.
Selection model
A single cascade in ExportHandler.ts and ImportHandler.ts:
• sofficeAvailable() === 'yes' → existing soffice path
• 'withoutPDF' (Windows) → soffice for everything except pdf, which goes native
• 'no' (soffice null) → native DOCX/PDF (export) and native DOCX (import); ODT/DOC/RTF (and PDF import) remain blocked with a clear message
No fallback chain: if a native converter throws, the response is 5xx with the error logged. This mirrors the spec's "soffice if installed, native otherwise, fail clearly" decision.
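
The cascade can be summarised as a small pure function. This is only a sketch of the routing rules listed above, not the handler code itself: `pickBackend`, the `Backend` and `ConvertedFormat` types, and the `NATIVE` table are hypothetical names, and the handling of PDF import under 'withoutPDF' (blocked here) is an assumption the description does not spell out. It also assumes html, txt and etherpad are handled before this point, so only formats that need conversion reach it.

```typescript
// Hypothetical names; only the three availability values and the routing
// rules are taken from the PR description.
type SofficeAvailability = 'yes' | 'withoutPDF' | 'no';
type Direction = 'export' | 'import';
type ConvertedFormat = 'docx' | 'pdf' | 'odt' | 'doc' | 'rtf';
type Backend = 'soffice' | 'native' | 'blocked';

// Formats that have a native (in-process) converter, per direction.
const NATIVE: Record<Direction, ReadonlySet<ConvertedFormat>> = {
  export: new Set<ConvertedFormat>(['docx', 'pdf']),
  import: new Set<ConvertedFormat>(['docx']),
};

const pickBackend = (
  availability: SofficeAvailability,
  direction: Direction,
  format: ConvertedFormat,
): Backend => {
  const nativeOrBlocked: Backend = NATIVE[direction].has(format) ? 'native' : 'blocked';
  // soffice installed and fully usable: keep the existing path.
  if (availability === 'yes') return 'soffice';
  // Windows: soffice for everything except pdf. PDF export goes native;
  // PDF import has no native converter, so this sketch assumes it is blocked.
  if (availability === 'withoutPDF') return format === 'pdf' ? nativeOrBlocked : 'soffice';
  // No soffice: native where a converter exists, otherwise blocked.
  return nativeOrBlocked;
};
```

Because there is no fallback chain, a caller would invoke exactly one backend per request and let any converter error surface as a 5xx rather than retrying with another backend.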
Native converters
• DOCX export: html-to-docx
• PDF export: pdfkit + htmlparser2
• DOCX import: mammoth
Total runtime install added: ~10–12 MB. Compared with ~500 MB for LibreOffice or ~200 MB for puppeteer, this is the right tradeoff for the structural-fidelity bar #7538 calls out.
Hardening
New stripRemoteImages sanitizer removes any <img> whose src is not data: or relative. Both DOCX and PDF export branches run plugin-modified HTML through it before conversion, closing Qodo's SSRF finding (#4) on the existing html-to-docx path and preventing the equivalent issue on PDF.
Out of scope (follow-ups)
Test plan
• pnpm run ts-check clean
• SOFFICE=null, export DOCX and PDF; both produce valid files
• SOFFICE=null, import the fixture .docx and verify pad content
Closes #7538