
feat(export): native DOCX export via html-to-docx (opt-in)#7568

Merged
JohnMcLear merged 23 commits into ether:develop from
JohnMcLear:feat/native-docx-export-7538
May 8, 2026

Conversation

@JohnMcLear
Member

@JohnMcLear JohnMcLear commented Apr 20, 2026

Summary

Closes #7538. Replaces the original flag-gated DOCX-only path with a complete soffice-free import/export story:

  • Export: With settings.soffice = null, pads can be exported as html, txt, etherpad, docx, pdf — all in-process, no subprocess, no native binaries.
  • Import: .html, .txt, .etherpad, .docx — all in-process.
  • soffice configured: behavior is unchanged bit-for-bit. There is no opt-in flag.

Selection model

A single cascade in ExportHandler.ts and ImportHandler.ts:

  • sofficeAvailable() === 'yes' → existing soffice path
  • 'withoutPDF' (Windows) → soffice for everything except pdf, which goes native
  • 'no' (soffice null) → native DOCX/PDF (export) and native DOCX (import); ODT/DOC/RTF (and PDF import) remain blocked with a clear message

No fallback chain — if a native converter throws, the response is 5xx with the error logged. Mirrors the spec's "soffice if installed, native otherwise, fail clearly" decision.
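The cascade above can be sketched as a small pure function. This is only an illustration, assuming the three states that `sofficeAvailable()` is described as returning; `pickRoute`, `Route`, `Direction`, and `NATIVE_FORMATS` are hypothetical names and do not come from the PR, and the sketch models only the office formats (html, txt, and etherpad never need a converter).

```typescript
// Hypothetical sketch of the selection cascade; not the PR's actual code.
type SofficeState = 'yes' | 'withoutPDF' | 'no';
type Direction = 'export' | 'import';
type Route = 'soffice' | 'native' | 'blocked';

// Office formats the description says have in-process converters.
const NATIVE_FORMATS: Record<Direction, string[]> = {
  export: ['docx', 'pdf'],
  import: ['docx'],
};

function pickRoute(state: SofficeState, direction: Direction, format: string): Route {
  // soffice configured: behavior is unchanged.
  if (state === 'yes') return 'soffice';
  // 'withoutPDF': soffice for everything except PDF export.
  // (Assumption: PDF import in this state still goes to soffice.)
  if (state === 'withoutPDF' && !(direction === 'export' && format === 'pdf')) {
    return 'soffice';
  }
  // No soffice (or PDF export on 'withoutPDF'): native converter or a clear refusal.
  return NATIVE_FORMATS[direction].indexOf(format) !== -1 ? 'native' : 'blocked';
}
```

Because there is no fallback chain, a caller would treat a thrown error on the `'native'` route as a 5xx rather than retrying through another route.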

Native converters

| Format | Library | Approach |
| --- | --- | --- |
| DOCX export | html-to-docx | already in PR |
| PDF export | pdfkit + htmlparser2 | small walker (~170 LOC); pure JS, no jsdom |
| DOCX import | mammoth | mammoth → HTML → existing setPadHTML pipeline |

Total runtime install added: ~10–12 MB. Compared with ~500 MB for LibreOffice or ~200 MB for puppeteer, this is the right tradeoff for the structural-fidelity bar #7538 calls out.

Hardening

New stripRemoteImages sanitizer removes any <img> whose src is not data: or relative. Both DOCX and PDF export branches run plugin-modified HTML through it before conversion, closing Qodo's SSRF finding (#4) on the existing html-to-docx path and preventing the equivalent issue on PDF.
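To make the sanitizer's rule concrete, here is a simplified, regex-based sketch of the behavior described above: keep `<img>` tags whose `src` is a `data:` URI or a relative path, and replace any other image with its alt text. The function name `stripRemoteImagesSketch` is hypothetical, and the PR's real `stripRemoteImages` may well be implemented differently (for example, on top of an HTML parser).

```typescript
// Simplified sketch, not the PR's implementation and not a hardened
// sanitizer: it assumes attribute values contain no ">" characters.
function stripRemoteImagesSketch(html: string): string {
  return html.replace(/<img\b[^>]*>/gi, (tag: string): string => {
    const srcMatch = /\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i.exec(tag);
    if (!srcMatch) return tag; // no src attribute, so nothing would be fetched
    const src = (srcMatch[1] || srcMatch[2] || srcMatch[3] || '').trim();
    const isData = /^data:/i.test(src);
    const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(src);
    const isProtocolRelative = /^\/\//.test(src);
    // Embedded (data:) and relative sources are kept untouched.
    if (isData || (!hasScheme && !isProtocolRelative)) return tag;
    // Remote image: drop the tag and keep its alt text, escaping angle brackets.
    const altMatch = /\salt\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
    const alt = altMatch ? (altMatch[1] || altMatch[2] || '') : '';
    return alt.replace(/</g, '&lt;').replace(/>/g, '&gt;');
  });
}
```

The point of running this before conversion is that the converter never sees a URL it could be tempted to fetch server-side.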

Out of scope (follow-ups)

  • Native ODT export — no maintained pure-JS writer in the ecosystem.
  • Native PDF / ODT / DOC / RTF import — no mature pure-JS readers.
  • Memory/timeout caps on conversion — add when production signal warrants.

Test plan

  • pnpm run ts-check clean
  • Backend tests pass on a clean checkout: 1012 / 1012
  • Unit + integration coverage for sanitizer, walker, mammoth wrapper, DOCX/PDF export endpoints, DOCX import endpoint, ODT negative cases
  • CI green
  • Manual: with SOFFICE=null, export DOCX and PDF; both produce valid files
  • Manual: with SOFFICE=null, import the fixture .docx and verify pad content

Closes #7538

@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Apr 20, 2026

Review Summary by Qodo

(Agentic_describe updated until commit 17bf820)

Native DOCX/PDF export and DOCX import without LibreOffice (soffice-optional)

✨ Enhancement 🧪 Tests 📝 Documentation


Walkthroughs

Description
• **Native DOCX/PDF export and DOCX import without LibreOffice:** Replaces flag-gated DOCX path with
  complete soffice-free import/export story using pure-JavaScript converters (html-to-docx,
  pdfkit, mammoth)
• **Soffice-first dispatch model:** Single cascade in ExportHandler and ImportHandler — uses
  soffice if available, falls back to native converters when soffice = null, fails clearly with 5xx
  errors if conversion fails
• **HTML sanitization hardening:** New stripRemoteImages sanitizer removes remote <img> tags
  (http/https/protocol-relative) to close SSRF vulnerability; applied to both DOCX and PDF export
  branches
• **Pure-JavaScript PDF export:** pdfkit + htmlparser2 walker (~170 LOC) supporting block
  elements, inline styling, links, data URI images, text alignment, and list nesting — no jsdom or
  native binaries
• **Native DOCX import via mammoth:** Converts DOCX buffers to HTML with alignment preservation and
  base64 image embedding; integrates into existing setPadHTML pipeline
• **Comprehensive test coverage:** Unit and integration tests for sanitizer, PDF walker, DOCX
  import, export endpoints, round-trip fidelity, and negative cases for unsupported formats
• **UI always shows DOCX/PDF links:** Removed conditional hiding since native converters are
  built-in; ODT remains gated on soffice availability
• **Documentation and design specs:** Added implementation plan and design specification detailing
  problem statement, selection model, error handling, and out-of-scope follow-ups
• **Minimal runtime overhead:** ~10–12 MB added dependencies vs. ~500 MB for LibreOffice or ~200 MB
  for puppeteer
Diagram
flowchart LR
  A["Export/Import Request"] --> B{"soffice Available?"}
  B -->|"yes"| C["Use soffice<br/>all formats"]
  B -->|"withoutPDF<br/>Windows"| D["soffice for most<br/>native PDF"]
  B -->|"no"| E["Native converters"]
  E --> F["DOCX: html-to-docx"]
  E --> G["PDF: pdfkit walker"]
  E --> H["DOCX import: mammoth"]
  F --> I["HTML Sanitization<br/>stripRemoteImages"]
  G --> I
  H --> J["HTML Pipeline"]
  I --> K["Convert to Buffer"]
  J --> L["Set Pad Content"]


File Changes

1. src/tests/backend/specs/export.ts 🧪 Tests +536/-1

Comprehensive test coverage for native export and sanitization

• Added comprehensive test suite for native DOCX export with settings.soffice = null, verifying
 ZIP signature and content-type
• Added native PDF export tests validating %PDF- header and application/pdf content-type
• Added negative test for ODT export without soffice
• Added unit tests for stripRemoteImages sanitizer covering data URIs, relative URLs, and remote
 URL removal
• Added unit tests for HTML sanitization helpers: extractBody, wrapLooseLines,
 dropEmptyBlocks, collapseRedundantBrAfterBlocks, separateAdjacentHeadingBlocks,
 applyMonospaceToCode
• Added integration tests for htmlToPdfBuffer covering text rendering, links, images, alignment,
 and monospace fonts
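The signature checks mentioned in the first two bullets can be sketched without any dependencies. This is an illustrative helper, not code from the test suite; `startsWithBytes`, `looksLikeZip`, and `looksLikePdf` are hypothetical names. A DOCX file is a ZIP container, so the check only confirms the ZIP local-file-header magic, not that the archive is a valid Word document.

```typescript
// Hypothetical helpers for magic-number checks on exported buffers.
function startsWithBytes(buf: Uint8Array, signature: number[]): boolean {
  if (buf.length < signature.length) return false;
  for (let i = 0; i < signature.length; i++) {
    if (buf[i] !== signature[i]) return false;
  }
  return true;
}

// "PK\x03\x04": ZIP local file header, which every DOCX starts with.
function looksLikeZip(buf: Uint8Array): boolean {
  return startsWithBytes(buf, [0x50, 0x4b, 0x03, 0x04]);
}

// "%PDF-": the header every PDF file begins with.
function looksLikePdf(buf: Uint8Array): boolean {
  return startsWithBytes(buf, [0x25, 0x50, 0x44, 0x46, 0x2d]);
}
```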

src/tests/backend/specs/export.ts


2. src/tests/backend/specs/import.ts 🧪 Tests +473/-0

Native DOCX import and round-trip fidelity tests

• New file with complete import test suite for native DOCX import via mammoth
• Tests docxBufferToHtml conversion preserving headings, paragraphs, lists, and alignment
• Tests end-to-end DOCX import without soffice and rejection of ODT when soffice is null
• Tests round-trip fidelity for txt, etherpad, html, and docx formats
• Tests HTML import with adjacent headings and blank-line preservation
• Tests heading-style content round-trip integrity through DOCX export/import cycle

src/tests/backend/specs/import.ts


3. src/node/utils/ExportSanitizeHtml.ts ✨ Enhancement +215/-0

HTML sanitization and transformation for export converters

• New module providing HTML sanitization and transformation utilities for export converters
• extractBody pulls <body> content from full HTML documents, dropping <head> and styles
• stripRemoteImages removes <img> tags with remote URLs (http/https/protocol-relative),
 replacing with alt text
• wrapLooseLines wraps loose text in <p> tags and converts <br><br> sequences to paragraph
 breaks with empty <p></p> markers for blank lines
• dropEmptyBlocks iteratively removes empty heading/code/div blocks while preserving empty <p>
 markers
• collapseRedundantBrAfterBlocks removes <br> immediately after closing block tags
• separateAdjacentHeadingBlocks inserts <br> between adjacent heading-style blocks for proper
 line separation
• applyMonospaceToCode converts <code>, <pre>, <tt>, <kbd>, <samp> to Courier-styled
 spans, handling block-level alignment and preserving nested anchors

src/node/utils/ExportSanitizeHtml.ts


4. src/node/utils/ExportPdfNative.ts ✨ Enhancement +248/-0

Pure-JavaScript PDF export renderer using pdfkit

• New module implementing pure-JavaScript PDF export via pdfkit and htmlparser2
• htmlToPdfBuffer parses HTML with SAX-style event stream and renders to PDF using pdfkit
• Supports block elements (p, h1-h6, ul/ol/li, blockquote, pre, div), inline styling (bold, italic,
 underline, strike), links with annotations, and data URI images
• Handles text alignment (left, center, right, justify) on paragraphs and code blocks
• Maintains style stack for nested elements and list nesting with bullets/numbers
• Skips head/style/script/title/meta/link/noscript tags to prevent metadata leakage
• Collapses whitespace in text content and decodes base64 data URIs for image embedding

src/node/utils/ExportPdfNative.ts


5. src/node/utils/ImportDocxNative.ts ✨ Enhancement +83/-0

Native DOCX import via mammoth with alignment preservation

• New module implementing native DOCX import via mammoth library
• docxBufferToHtml converts DOCX buffers to HTML using mammoth with empty paragraph preservation
• Extracts paragraph alignment from DOCX <w:jc> elements and applies as CSS text-align styles to
 output HTML
• Embeds images as base64 data URIs to avoid external fetches
• Maps Word alignment values (left, center, right, justify, distribute) to CSS equivalents
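The alignment mapping in the last bullet could look roughly like the sketch below. It is an assumption-laden illustration rather than the module's code: `JC_TO_CSS`, `jcToTextAlign`, and `alignStyleAttr` are hypothetical names, and the table assumes that Word's `both` value in `<w:jc>` is its spelling of justified text and that `distribute` is close enough to `justify` for pad content.

```typescript
// Hypothetical mapping from <w:jc w:val="..."> values to CSS text-align.
const JC_TO_CSS: {[jc: string]: string} = {
  left: 'left',
  center: 'center',
  right: 'right',
  both: 'justify',       // assumption: Word's name for justified text
  distribute: 'justify', // assumption: nearest CSS equivalent
};

// Returns the CSS value, or null when the value is unknown or absent.
function jcToTextAlign(jc: string | undefined): string | null {
  if (jc === undefined) return null;
  return Object.prototype.hasOwnProperty.call(JC_TO_CSS, jc) ? JC_TO_CSS[jc] : null;
}

// Builds the attribute text to splice into a <p> tag, or '' for no alignment.
function alignStyleAttr(jc: string | undefined): string {
  const align = jcToTextAlign(jc);
  return align === null ? '' : ` style="text-align: ${align}"`;
}
```

Unknown values fall back to no style at all, which leaves the paragraph at the pad's default alignment instead of guessing.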

src/node/utils/ImportDocxNative.ts


6. src/node/handler/ExportHandler.ts ✨ Enhancement +66/-1

Soffice-first export dispatch with native DOCX/PDF paths

• Replaced flag-gated DOCX export with soffice-first cascade dispatch model
• When sofficeAvailable() === 'no', routes DOCX to html-to-docx and PDF to native pdfkit
 walker
• When sofficeAvailable() === 'withoutPDF' (Windows), routes PDF to native converter while other
 formats use soffice
• Applies HTML sanitization pipeline (stripRemoteImages, extractBody, dropEmptyBlocks,
 applyMonospaceToCode, wrapLooseLines) before native conversion
• Native conversion errors surface as 5xx with logged error details; no fallback chain
• Sets correct content-type headers for DOCX
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document) and PDF
 (application/pdf)

src/node/handler/ExportHandler.ts


7. src/node/handler/ImportHandler.ts ✨ Enhancement +80/-1

Native DOCX import with soffice-first cascade and sanitization

• Added soffice-first cascade for import format selection
• When soffice == null and file is .docx, routes to native mammoth converter producing HTML
• Detects whether ep_headings2 registers h1-h6 as block elements server-side; applies
 separateAdjacentHeadingBlocks workaround when missing
• Rejects .pdf, .odt, .doc, .rtf with explicit error when soffice is null (no silent
 fallback)
• Applies collapseRedundantBrAfterBlocks sanitization to HTML imports and soffice-converted
 outputs to prevent blank-line duplication

src/node/handler/ImportHandler.ts


8. src/node/hooks/express/importexport.ts ✨ Enhancement +4/-2

Route guard allows native DOCX and PDF export paths

• Tightened export route guard to reject only ['odt', 'doc'] when soffice is disabled (was
 ['odt', 'pdf', 'doc', 'docx'])
• PDF and DOCX now fall through to ExportHandler which dispatches to native converters when
 soffice is null
• Updated comment to clarify that native paths handle DOCX and PDF
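As a rough sketch of the tightened guard, assuming a boolean that says whether soffice is configured: `isExportBlocked` and `SOFFICE_ONLY_EXPORTS` are hypothetical names, and the real route code may be structured differently.

```typescript
// Hypothetical sketch of the export route guard after this change.
// Before the change the blocked list was ['odt', 'pdf', 'doc', 'docx'].
const SOFFICE_ONLY_EXPORTS: string[] = ['odt', 'doc'];

function isExportBlocked(type: string, sofficeConfigured: boolean): boolean {
  // With soffice present nothing is blocked; without it only the
  // formats that have no native converter are rejected.
  return !sofficeConfigured && SOFFICE_ONLY_EXPORTS.indexOf(type) !== -1;
}
```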

src/node/hooks/express/importexport.ts


9. src/static/js/pad_impexp.ts ✨ Enhancement +7/-13

Always show DOCX and PDF export links in UI

• Removed conditional hiding of DOCX and PDF export links based on exportAvailable flag
• DOCX and PDF links now always visible since native converters are built-in
• ODT link remains gated on exportAvailable === 'yes' (soffice required)
• Simplified UI logic by removing withoutPDF branch special handling

src/static/js/pad_impexp.ts


10. docs/superpowers/specs/2026-05-08-native-docx-pdf-export-import-design.md 📝 Documentation +230/-0

Design specification for native export/import without LibreOffice

• New comprehensive design specification for native DOCX/PDF export and DOCX import without
 LibreOffice
• Documents problem statement, goals, non-goals, and selection model for soffice-first dispatch
• Specifies route guard and UI capability changes
• Details native PDF export approach using pdfkit walker with bail-out criterion (~500 lines max)
• Specifies HTML sanitization defense-in-depth against SSRF via stripRemoteImages
• Documents native DOCX import via mammoth wrapper
• Includes error handling strategy, test plan, file manifest, and dependency summary
• Lists out-of-scope follow-ups (ODT export, PDF/ODT/DOC/RTF import, memory caps)

docs/superpowers/specs/2026-05-08-native-docx-pdf-export-import-design.md


11. src/package.json Dependencies +5/-0

Add native export/import library dependencies

• Added html-to-docx (^1.8.0) for native DOCX export
• Added htmlparser2 (^12.0.0) for HTML parsing in sanitizer and PDF walker
• Added mammoth (^1.12.0) for native DOCX import
• Added pdfkit (^0.18.0) for native PDF export
• Added @types/pdfkit (^0.17.6) to dev dependencies for TypeScript support

src/package.json


12. doc/docker.md 📝 Documentation +1/-1

Document soffice configuration and native converter fallback

• Updated SOFFICE environment variable documentation to clarify behavior with and without
 LibreOffice
• Explains that when soffice is configured, all advanced formats use it
• Documents that when soffice is null, in-process converters handle DOCX/PDF export and DOCX import
• Clarifies that ODT/DOC/RTF and PDF import remain unavailable without soffice

doc/docker.md


13. pnpm-lock.yaml Dependencies +700/-16

Add native document conversion library dependencies

• Added html-to-docx@1.8.0 dependency for native DOCX export functionality
• Added htmlparser2@12.0.0 dependency for HTML parsing in conversion workflows
• Added mammoth@1.12.0 dependency for native DOCX import capability
• Added pdfkit@0.18.0 dependency for native PDF export generation
• Added @types/pdfkit@0.17.6 TypeScript type definitions for pdfkit
• Added transitive dependencies for font handling, compression, DOM manipulation, and image
 processing

pnpm-lock.yaml


14. docs/superpowers/plans/2026-05-08-native-docx-pdf-export-import.md 📝 Documentation +1543/-0

Implementation plan for native office format converters

• Comprehensive implementation plan for native DOCX/PDF export and DOCX import without soffice
• Detailed task breakdown (Tasks 0–10) with step-by-step instructions, code snippets, and test
 expectations
• Covers dependency management, HTML sanitization, PDF walker via pdfkit, DOCX import via mammoth,
 handler refactoring, route guard updates, UI changes, and settings cleanup
• Includes self-review checklist, bail-out criterion for PDF walker complexity, and Qodo security
 finding responses

docs/superpowers/plans/2026-05-08-native-docx-pdf-export-import.md


15. src/tests/backend/specs/fixtures/sample.docx 🧪 Tests +0/-0

Test fixture for DOCX import/export validation

• New DOCX fixture file containing heading, paragraph, and bullet list content
• Generated deterministically via html-to-docx library to support import/export test cases
• Binary OOXML format (ZIP-based) with standard Word document structure

src/tests/backend/specs/fixtures/sample.docx




@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Apr 20, 2026

Code Review by Qodo

🐞 Bugs (6) 📘 Rule violations (2) 📎 Requirement gaps (1)




Action required

1. Native DOCX/PDF lacks flag 📘 Rule violation ☼ Reliability ⭐ New
Description
Native DOCX/PDF export (and DOCX import) is enabled automatically when settings.soffice is null,
which is the documented default, so the feature is effectively enabled by default without a
dedicated feature flag. This violates the requirement that new features be flag-gated and disabled
by default to avoid unexpected behavior changes for existing deployments.
Code

src/node/handler/ExportHandler.ts[R93-155]

+    // Soffice-first dispatch (issue #7538). When soffice is configured
+    // we keep the legacy convert-via-tempfile path; when it's not, we
+    // hand DOCX to html-to-docx and PDF to our pdfkit walker — both
+    // pure-JS, in-process. No fallback chain: native errors surface as
+    // 5xx so admins see real failures instead of silent shadowing.
+    const {sofficeAvailable} = require('../utils/Settings');
+    const sofState = sofficeAvailable();
+    const goNative = sofState === 'no'
+        || (sofState === 'withoutPDF' && type === 'pdf');
+
+    if (goNative) {
+      const {
+        stripRemoteImages, extractBody, wrapLooseLines, dropEmptyBlocks,
+        applyMonospaceToCode,
+      } = require('../utils/ExportSanitizeHtml');
+      // The HTML pipeline returns a full document (head, style, body); the
+      // legacy soffice path renders that fine, but the in-process
+      // converters need just the body content to avoid leaking CSS into
+      // the output and to drop the document-level whitespace that creates
+      // stray paragraph breaks at the top of the result.
+      // dropEmptyBlocks strips heading-styled blank-line wrappers that
+      // ep_headings2 emits between every styled line.
+      const bodyHtml = dropEmptyBlocks(stripRemoteImages(extractBody(html)));
+      html = null;
+      try {
+        if (type === 'docx') {
+          // applyMonospaceToCode strips `<code>`/`<pre>`/`<tt>` wrappers
+          // (html-to-docx ignores them AND has a bug where it drops
+          // `<a href>` children of those tags) and emits styled
+          // monospace spans, forwarding any block-level alignment style
+          // to a wrapping `<p>`. Run BEFORE wrapLooseLines so the
+          // resulting `<p>` lands at the loose-line boundary instead
+          // of getting double-wrapped.
+          //
+          // wrapLooseLines then handles `<br>` semantics: bare `<br>`
+          // outside `<p>` becomes a soft break, `<br><br>` becomes a
+          // paragraph boundary plus blank-line markers.
+          const docxHtml = wrapLooseLines(applyMonospaceToCode(bodyHtml));
+          const htmlToDocx = require('html-to-docx');
+          const buf = await htmlToDocx(docxHtml);
+          res.contentType(
+              'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+          res.send(buf);
+          return;
+        }
+        if (type === 'pdf') {
+          const {htmlToPdfBuffer} = require('../utils/ExportPdfNative');
+          const buf = await htmlToPdfBuffer(bodyHtml);
+          res.contentType('application/pdf');
+          res.send(buf);
+          return;
+        }
+        // soffice-only formats (odt, doc) are blocked at the route guard
+        // when soffice is null; reaching here means the guard is wrong.
+        res.status(500).send(`Cannot export ${type} without soffice configured`);
+        return;
+      } catch (err) {
+        console.error(
+            `native ${type} export failed for pad "${padId}":`,
+            err && (err as Error).stack ? (err as Error).stack : err);
+        res.status(500).send(`Failed to export pad as ${type}.`);
+        return;
+      }
Evidence
PR Compliance ID 8 requires new functionality to be behind a feature flag and disabled by default.
The diff adds an automatic dispatch to native DOCX/PDF when sofficeAvailable() reports no
(meaning settings.soffice == null), and documentation explicitly states the default SOFFICE
value is null and that null enables native converters for DOCX/PDF export and DOCX import.

src/node/handler/ExportHandler.ts[93-155]
doc/docker.md[200-200]
src/node/hooks/express/importexport.ts[39-43]
src/static/js/pad_impexp.ts[147-160]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX/PDF export (and DOCX import) is activated automatically when `settings.soffice` is `null` (the documented default), so the new feature is enabled by default and is not controlled by an explicit feature flag.

## Issue Context
PR Compliance ID 8 requires new features to be behind a feature flag and disabled by default, with pre-change behavior preserved when the flag is disabled.

## Fix Focus Areas
- src/node/handler/ExportHandler.ts[93-155]
- src/node/hooks/express/importexport.ts[39-43]
- src/static/js/pad_impexp.ts[147-160]
- doc/docker.md[200-200]



2. PDF drops single spaces 🐞 Bug ≡ Correctness ⭐ New
Description
htmlToPdfBuffer() drops any text node that collapses to a single space, but Etherpad’s HTML export
can legitimately emit a standalone space between inline tags (e.g., style boundary around a space).
This can cause words to concatenate in native PDF exports.
Code

src/node/utils/ExportPdfNative.ts[R203-210]

+      ontext(text) {
+        if (skipDepth > 0) return;
+        // Collapse consecutive whitespace to a single space, the way an
+        // HTML renderer would. Without this, literal newlines and tabs in
+        // pretty-printed source HTML show up as runs of " " in the PDF.
+        const collapsed = text.replace(/[\s ]+/g, ' ');
+        if (collapsed === ' ') return;  // pure-whitespace runs are dropped
+        writeText(collapsed);
Evidence
The PDF walker collapses whitespace then returns early if the result is exactly ' ', so an
inter-tag space like </strong> <em> is removed. Etherpad’s HTML exporter appends escaped text
(including regular spaces) directly into the HTML stream while opening/closing tags based on
attribute spans; combined with _processSpaces (which preserves interior regular spaces), it is
possible for a space character to exist as its own text node between tags, and dropping it changes
the rendered text.

src/node/utils/ExportPdfNative.ts[203-210]
src/node/utils/ExportHtml.ts[217-260]
src/node/utils/ExportHtml.ts[536-575]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`src/node/utils/ExportPdfNative.ts` drops whitespace-only text nodes that collapse to a single space. In HTML, a single inter-tag space between inline elements is significant for rendering; dropping it can merge adjacent words.

### Issue Context
Etherpad’s HTML export can produce inline-tag boundaries around characters (including spaces) due to attribute span transitions, and `_processSpaces()` preserves interior normal spaces.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[203-210]

### Implementation direction
Adjust the `ontext()` handling to *not* blindly drop `collapsed === ' '`. Suggested approach:
- Track whether the current PDF line/run is at a “text start” (or whether the last emitted character was whitespace).
- Emit a single space when it is needed to separate tokens (e.g., if the previous emitted character is non-whitespace and the next text will be non-whitespace), while still avoiding indentation/pretty-print whitespace.
A minimal improvement is to keep `collapsed === ' '` unless you are at the beginning of a line/run (immediately after `breakLine()`/`flushLine()`) or you already emitted a space.
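One way to read the suggested approach is the small state machine below. It is a sketch only: `TextRun`, `write`, `breakLine`, and `result` are hypothetical stand-ins for the walker's `writeText`/`breakLine` machinery, and output is accumulated into a string instead of being drawn with pdfkit. Note that JavaScript's `\s` also matches non-breaking spaces, so they are collapsed here too.

```typescript
// Hypothetical sketch: keep a lone inter-tag space unless it would start
// a line or follow another space.
class TextRun {
  private out = '';
  private atLineStart = true;
  private lastWasSpace = false;

  write(text: string): void {
    // Collapse whitespace runs the way an HTML renderer would.
    let s = text.replace(/\s+/g, ' ');
    // A leading space is redundant at line start or after an emitted space.
    if ((this.atLineStart || this.lastWasSpace) && s.charAt(0) === ' ') {
      s = s.slice(1);
    }
    if (s === '') return;
    this.out += s;
    this.atLineStart = false;
    this.lastWasSpace = s.charAt(s.length - 1) === ' ';
  }

  breakLine(): void {
    // Drop a trailing space before ending the line.
    this.out = this.out.replace(/ $/, '') + '\n';
    this.atLineStart = true;
    this.lastWasSpace = false;
  }

  result(): string {
    return this.out;
  }
}
```

With this shape, the text nodes of `<strong>Hello</strong> <em>world</em>` arrive as `'Hello'`, `' '`, `'world'` and the space survives, while indentation after a line break is still discarded.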



3. PDF adds extra breaks 🐞 Bug ≡ Correctness ⭐ New
Description
Etherpad’s HTML export appends a <br> after every line, including lines wrapped in block tags like
<h1>/<h2>. The PDF walker inserts a line break on both </h1..> close and on every <br>, which
adds extra vertical spacing for heading/block lines in native PDF exports.
Code

src/node/utils/ExportPdfNative.ts[R188-190]

+          case 'br':
+            breakLine();
+            break;
Evidence
ExportHtml unconditionally appends <br> after each line. ExportHtml can emit block tags for
headings (h1, h2) based on heading1/heading2 attributes, so a heading line becomes
<h1>...</h1><br>. In the PDF walker, </h1> triggers breakLine() and then the following <br>
triggers another breakLine(), effectively doubling spacing for those lines.

src/node/utils/ExportHtml.ts[51-53]
src/node/utils/ExportHtml.ts[489-506]
src/node/utils/ExportPdfNative.ts[188-190]
src/node/utils/ExportPdfNative.ts[223-227]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Native PDF export can add extra vertical spacing when Etherpad emits block-level tags for a line (e.g., `<h1>...</h1>`) and still appends the per-line `<br>` separator. The PDF walker breaks on both the block close and the `<br>`.

### Issue Context
`ExportHtml` always appends `<br>` between pad lines. Headings are exported as `<h1>`/`<h2>` (block tags) when heading attributes are present.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[188-190]
- src/node/utils/ExportPdfNative.ts[223-227]
- (optional pre-processing alternative) src/node/handler/ExportHandler.ts[138-143]

### Implementation direction
Choose one (or combine):
1. **Walker-side fix:** In `onopentag('br')`, if `pendingNewline` is true (meaning a block close already ended the line), treat the `<br>` as a no-op (or just clear `pendingNewline` without calling `breakLine()`).
2. **Pre-processing fix:** Before calling `htmlToPdfBuffer()`, run the existing `collapseRedundantBrAfterBlocks()` sanitizer on `bodyHtml` so sequences like `</h1><br>` become `</h1>`.
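The pre-processing option can be illustrated with a single regular expression. This is a sketch of the idea, not the existing `collapseRedundantBrAfterBlocks()` implementation; the function name `collapseBrAfterBlocksSketch` and the list of block tags are assumptions.

```typescript
// Hypothetical sketch: remove one <br> that immediately follows a block close.
// The tag list is an assumption about which elements already end a line.
const BLOCK_TAGS = 'h[1-6]|p|div|ul|ol|li|blockquote|pre';
const BR_AFTER_BLOCK = new RegExp(
    `(</(?:${BLOCK_TAGS})\\s*>)\\s*<br\\s*/?>`, 'gi');

function collapseBrAfterBlocksSketch(html: string): string {
  // Keep the closing tag ($1) and drop only the first <br> after it, so a
  // deliberate blank line written as a second <br> is preserved.
  return html.replace(BR_AFTER_BLOCK, '$1');
}
```

Running this on `bodyHtml` before `htmlToPdfBuffer()` would mean the walker sees `</h1>` without the per-line `<br>`, so only one line break is produced for a heading line.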



4. DOCX export still needs soffice 📎 Requirement gap ≡ Correctness
Description
The new DOCX export path is opt-in (nativeDocxExport defaults to false) and explicitly falls
back to the existing LibreOffice/soffice path on error, so DOCX export is not fully free of a
LibreOffice runtime dependency. This fails the requirement to support DOCX export without requiring
LibreOffice for these formats.
Code

src/node/handler/ExportHandler.ts[R97-110]

+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }
Evidence
PR Compliance ID 1 requires DOCX/PDF support using native/local tooling with no runtime dependency
on LibreOffice. The added DOCX branch is gated behind settings.nativeDocxExport and, if conversion
fails, logs a warning and falls through to the LibreOffice export path, meaning LibreOffice remains
a required backstop in the DOCX export flow.

Native DOCX/PDF import/export support without Abiword/LibreOffice dependency
src/node/handler/ExportHandler.ts[97-110]
src/node/utils/Settings.ts[419-426]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Compliance requires DOCX export to work without a LibreOffice/`soffice` runtime dependency. The new native DOCX export is opt-in (disabled by default) and explicitly falls back to the LibreOffice path on error, so LibreOffice is still required as a backstop for DOCX export.
## Issue Context
Current implementation uses `html-to-docx` when `nativeDocxExport` is enabled, but catches errors and falls through to LibreOffice. This violates the stated objective of having DOCX export not depend on LibreOffice.
## Fix Focus Areas
- src/node/handler/ExportHandler.ts[97-110]
- src/node/utils/Settings.ts[419-426]



5. DOCX blocked without soffice 🐞 Bug ≡ Correctness
Description
Even with settings.nativeDocxExport=true, DOCX exports are rejected and hidden when settings.soffice
is null because exportAvailable() gates docx behind soffice in both the /export route and UI. This
makes the new native DOCX branch in ExportHandler unreachable in the documented no-LibreOffice
configuration.
Code

src/node/handler/ExportHandler.ts[R90-111]

+    // Native DOCX path (issue #7538) — when `nativeDocxExport` is enabled,
+    // convert the HTML export into a Word document in-process with
+    // `html-to-docx` instead of shelling out to LibreOffice. Saves admins
+    // from having to install `soffice` and avoids per-export subprocess
+    // latency. On failure we fall through to the LibreOffice path below
+    // so the change is strictly additive (opt-in via setting, auto-fallback
+    // if the converter throws).
+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }
+    }
Evidence
The PR adds a native DOCX branch, but requests for /export/docx are blocked earlier when LibreOffice
is disabled (soffice=null), and the UI removes the DOCX link under the same condition.
exportAvailable() only reflects soffice availability, so enabling nativeDocxExport alone won’t
expose or allow DOCX export.

src/node/handler/ExportHandler.ts[90-111]
src/node/hooks/express/importexport.ts[27-48]
src/static/js/pad_impexp.ts[147-166]
src/node/utils/Settings.ts[700-709]
doc/docker.md[190-194]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX export is implemented but is effectively unreachable in the intended “no soffice installed” configuration because the server route guard and client UI still treat `docx` as requiring LibreOffice.
## Issue Context
- Server-side guard blocks `docx` when `exportAvailable() === 'no'`.
- `exportAvailable()` currently only reflects `soffice` presence.
- Client UI removes the Word export link when `clientVars.exportAvailable === 'no'`.
- Docs say setting `SOFFICE` to `null` disables LibreOffice (typical for no-soffice deployments).
## Fix Focus Areas
- Update server export guard to allow `docx` when `settings.nativeDocxExport === true`, even if `soffice` is null:
- src/node/hooks/express/importexport.ts[27-48]
- Add a dedicated capability flag for “Word export available” (or “nativeDocxExport enabled”) into clientVars so the UI can show Word export even when other converter-based exports remain disabled:
- src/node/handler/PadMessageHandler.ts[1113-1118]
- src/static/js/pad_impexp.ts[147-166]
- Avoid incorrectly enabling PDF/ODT links when only native DOCX is available (introduce a new state or separate flags rather than reusing `exportAvailable`).
- src/node/utils/Settings.ts[700-709]




Remediation recommended

6. Committed generated sample.docx 📘 Rule violation ⚙ Maintainability ⭐ New
Description
A binary .docx file is added under src/tests/backend/specs/fixtures/, which appears to be a
generated export artifact (ZIP PK header with generator metadata/timestamps). Committing generated
artifacts can bloat the repo and create noisy diffs, violating the prohibition on generated files.
Code

src/tests/backend/specs/fixtures/sample.docx[R1-20]

+PK��
+�����Lm�\����������������_rels/PK��
+�����Lm�\������������
+���_rels/.rels<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
+<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
+  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
+  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target="docProps/core.xml"/>
+</Relationships>
+PK [binary ZIP header bytes] docProps/
+PK [binary ZIP header bytes] docProps/core.xml<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
+<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
+  <dc:title/>
+  <dc:subject/>
+  <dc:creator>html-to-docx</dc:creator>
+  <cp:keywords>html-to-docx</cp:keywords>
+  <dc:description/>
+  <cp:lastModifiedBy>html-to-docx</cp:lastModifiedBy>
+  <cp:revision>1</cp:revision>
+  <dcterms:created xsi:type="dcterms:W3CDTF">2026-05-08T13:42:24.892Z</dcterms:created>
+  <dcterms:modified xsi:type="dcterms:W3CDTF">2026-05-08T13:42:24.892Z</dcterms:modified>
Evidence
PR Compliance ID 10 disallows committing build- or runtime-generated artifacts. The PR adds a
.docx fixture file whose content begins with the ZIP PK signature and includes DOCX XML
parts/metadata, indicating it is a generated document archive rather than hand-authored source.

src/tests/backend/specs/fixtures/sample.docx[1-20]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A generated `.docx` binary fixture is committed to the repository.

## Issue Context
To comply with the no-generated-files rule, prefer generating such artifacts during the test run (or storing a non-generated source representation) rather than checking in the produced binary.

## Fix Focus Areas
- src/tests/backend/specs/fixtures/sample.docx[1-20]
- src/tests/backend/specs/import.ts[35-51]



7. skipDepth stuck on voids 🐞 Bug ☼ Reliability ⭐ New
Description
The PDF walker increments skipDepth for meta/link in SKIP_TAGS, but these are void elements
and may not emit a close event, leaving skipDepth > 0 and skipping the rest of the document. This
can produce empty PDFs and unbounded styleStack growth if such tags appear in the HTML stream.
Code

src/node/utils/ExportPdfNative.ts[R29-33]

+// Tags whose text content must never appear in the rendered PDF (CSS,
+// scripts, document metadata). The walker maintains a depth counter so that
+// nested elements inside one of these are ignored too.
+const SKIP_TAGS = new Set(['head', 'style', 'script', 'title', 'meta', 'link', 'noscript']);
+
Evidence
SKIP_TAGS includes meta and link, and the walker increments skipDepth on open and decrements
only on close. Elsewhere in the codebase, meta and link are explicitly treated as void tags (no
closing tag), so decrement may never happen, causing the parser to remain in skip mode indefinitely
and also push onto styleStack without a corresponding pop.

src/node/utils/ExportPdfNative.ts[29-33]
src/node/utils/ExportPdfNative.ts[125-131]
src/node/utils/ExportPdfNative.ts[213-220]
src/node/utils/ExportSanitizeHtml.ts[174-177]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ExportPdfNative` treats `meta` and `link` as skip-depth delimiters, but they are void elements and might not trigger `onclosetag`, which can leave `skipDepth` permanently > 0.

### Issue Context
The codebase already models `meta`/`link` as void tags in `ExportSanitizeHtml`.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[29-33]
- src/node/utils/ExportPdfNative.ts[125-131]
- src/node/utils/ExportSanitizeHtml.ts[174-177]

### Implementation direction
- Remove `meta` and `link` from `SKIP_TAGS`, **or**
- Only increment `skipDepth` for non-void tags (maintain a `VOID_TAGS` set in `ExportPdfNative`, or reuse logic), **or**
- If you keep them, immediately undo the increment for void tags in `onopentag`.
Also ensure `styleStack` remains balanced when skipping content so it cannot grow without bound.
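
One way to read the second and third options is sketched below. This is a minimal, hypothetical model of the walker's skip state, not the code in `ExportPdfNative.ts`: the class name `SkipTracker`, its method names, and the contents of `VOID_TAGS` are assumptions made for illustration. Void tags never touch `skipDepth`, so the counter stays balanced whether or not the parser reports a close event for them.

```typescript
// Hypothetical sketch: names and tag lists are illustrative assumptions.
const SKIP_TAGS = new Set(['head', 'style', 'script', 'title', 'noscript']);
const VOID_TAGS = new Set(['meta', 'link', 'br', 'hr', 'img', 'input']);

class SkipTracker {
  private skipDepth = 0;
  readonly text: string[] = [];

  get depth(): number {
    return this.skipDepth;
  }

  onOpenTag(name: string): void {
    // Void elements have no content, so they never open a skip region.
    if (VOID_TAGS.has(name)) return;
    // Enter a skip region, or go one level deeper inside an existing one.
    if (this.skipDepth > 0 || SKIP_TAGS.has(name)) this.skipDepth++;
  }

  onCloseTag(name: string): void {
    // Ignore close events for void elements in case the parser emits them.
    if (VOID_TAGS.has(name)) return;
    if (this.skipDepth > 0) this.skipDepth--;
  }

  onText(data: string): void {
    // Text is kept only outside every skip region.
    if (this.skipDepth === 0) this.text.push(data);
  }
}

// Feed a hand-written event sequence for:
// <head><meta><title>ignored</title></head><p>visible</p>
const tracker = new SkipTracker();
tracker.onOpenTag('head');
tracker.onOpenTag('meta'); // no close event is fed for this void tag
tracker.onOpenTag('title');
tracker.onText('ignored');
tracker.onCloseTag('title');
tracker.onCloseTag('head');
tracker.onOpenTag('p');
tracker.onText('visible');
tracker.onCloseTag('p');
```

The same rule would presumably apply to `styleStack`: push only for tags whose matching pop is guaranteed, which this sketch does not model.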



8. Native DOCX test bypass 🐞 Bug ☼ Reliability
Description
The new native DOCX tests set settings.soffice='false' (a non-null string), which prevents
exportAvailable() from returning 'no' and sidesteps the server-side DOCX export block. This can make
tests pass while a real deployment with soffice=null (as documented) still cannot export DOCX.
Code

src/tests/backend/specs/export.ts[R36-39]

+    before(function () {
+      settings.soffice = 'false';
+      settings.nativeDocxExport = true;
+    });
Evidence
The tests configure soffice with a non-null string, but the documented way to disable LibreOffice is
null. Additionally, Settings reload logic will null out invalid soffice paths, meaning the test
configuration doesn’t reflect real behavior; the server route guard blocks docx when
exportAvailable() is 'no'.

src/tests/backend/specs/export.ts[32-39]
doc/docker.md[190-194]
src/node/utils/Settings.ts[700-709]
src/node/utils/Settings.ts[1019-1030]
src/node/hooks/express/importexport.ts[37-48]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The native DOCX tests use `settings.soffice = 'false'`, which is non-null and therefore does not simulate a true “no soffice” deployment (`soffice: null`). This can let the tests pass even if the feature is broken for real deployments.
## Issue Context
- Docs describe disabling LibreOffice by setting `SOFFICE` to `null`.
- Server-side export route blocks docx when `exportAvailable() === 'no'`.
## Fix Focus Areas
- Update the native DOCX tests to simulate a real no-soffice deployment (`settings.soffice = null`) and assert DOCX export still succeeds when `nativeDocxExport = true`:
- src/tests/backend/specs/export.ts[32-65]
- After fixing the route/UI gating (see other finding), add a regression assertion that `/export/docx` works with `soffice = null` and fails (or is blocked) appropriately when nativeDocxExport is false.
- src/tests/backend/specs/export.ts[22-66]
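
The difference between the two test configurations can be illustrated with a simplified model. This is a sketch under assumptions: `exportAvailableSketch` and `docxBlockedSketch` are hypothetical stand-ins, written only from the behaviour this finding describes (availability reflects `soffice` presence, and the route guard blocks `docx` on `'no'`), not copies of the code in `Settings.ts` or `importexport.ts`.

```typescript
type Availability = 'yes' | 'withoutPDF' | 'no';

// Assumed simplification: only null (or undefined) counts as "not configured".
const exportAvailableSketch = (
  soffice: string | null,
  isWindows = false,
): Availability => {
  if (soffice == null) return 'no';
  return isWindows ? 'withoutPDF' : 'yes';
};

// Hypothetical guard mirroring the described pre-fix behaviour.
const docxBlockedSketch = (soffice: string | null): boolean =>
  exportAvailableSketch(soffice) === 'no';

// The string 'false' is non-null, so the guard is never exercised.
const blockedWithStringFalse = docxBlockedSketch('false');
// null is the documented "no LibreOffice" value and does trigger the guard.
const blockedWithNull = docxBlockedSketch(null);
```

Under this model a test that sets `soffice = 'false'` never reaches the blocked branch, which is why the finding asks for `soffice = null`.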



9. Unrestricted HTML-to-DOCX I/O 🐞 Bug ⛨ Security
Description
ExportHandler passes exported HTML directly into html-to-docx and buffers the entire DOCX in memory
for res.send(), and the dependency graph includes image-to-base64→node-fetch enabling outbound
network access from conversion code. Because HTML export can be plugin-modified, enabling
nativeDocxExport can allow untrusted pad/plugin output to trigger server-side requests and increase
memory pressure.
Code

src/node/handler/ExportHandler.ts[R99-105]

+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
Evidence
The new code converts HTML in-process via html-to-docx and sends the resulting buffer. The lockfile
shows html-to-docx includes image-to-base64 (node-fetch), and ExportHtml provides a plugin hook that
can modify generated HTML, meaning untrusted plugin/pad output can influence the converter input and
potentially induce server-side I/O.

src/node/handler/ExportHandler.ts[97-105]
pnpm-lock.yaml[8709-8718]
pnpm-lock.yaml[8804-8807]
src/node/utils/ExportHtml.ts[321-337]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX export converts plugin-modifiable HTML via `html-to-docx` with no constraints. The dependency tree includes `node-fetch`, so conversion code may perform outbound network access, and the handler buffers the full DOCX in memory.
## Issue Context
- ExportHandler calls `htmlToDocx(html)` and `res.send(docxBuffer)`.
- ExportHtml allows plugins to modify exported HTML.
- pnpm-lock indicates `html-to-docx` pulls in `image-to-base64` and `node-fetch`.
## Fix Focus Areas
- Investigate `html-to-docx` options to disable remote fetching / external resource resolution (or strip/deny `<img src>` and other fetchable URLs from HTML before conversion).
- src/node/handler/ExportHandler.ts[97-105]
- Add guardrails: size limits for generated DOCX, timeouts/cancellation, and (if possible) run conversion in a constrained environment (worker/thread or sandbox) to reduce SSRF and DoS impact.
- src/node/handler/ExportHandler.ts[97-111]
- Consider writing the buffer to a temp file and using `res.sendFile()` (or streaming) to reduce peak memory usage.
- src/node/handler/ExportHandler.ts[99-105]
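
As an illustration of the "strip/deny `<img src>`" option, the URL check itself might look like the sketch below. The function name `isLocalImageSrc` is hypothetical, and the rules are assumptions based on the PR summary's wording (keep `data:` and relative sources, drop everything else). Locating the `<img>` tags should be left to a real HTML parser rather than regular expressions; that part is deliberately omitted.

```typescript
// Hypothetical predicate: true when an <img src> cannot trigger a remote fetch.
const isLocalImageSrc = (rawSrc: string): boolean => {
  const src = rawSrc.trim();
  // Inline data is already embedded in the document.
  if (/^data:/i.test(src)) return true;
  // Any other explicit scheme (http:, https:, file:, ftp:, ...) is rejected.
  if (/^[a-z][a-z0-9+.-]*:/i.test(src)) return false;
  // Protocol-relative URLs (//host/path, also with backslashes) name a remote host.
  if (/^[\\/]{2}/.test(src)) return false;
  // Everything else is treated as a relative reference.
  return true;
};
```

A sanitizer built on this would drop every `<img>` whose `src` fails the check before the HTML reaches the converter. The check is intentionally conservative, so some harmless values (for example a Windows drive path) are also rejected.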


Previous review results

Review updated until commit 17bf820

Results up to commit 7e5a73c


🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (1)


Action required
1. DOCX export still needs soffice 📎 Requirement gap ≡ Correctness
Description
The new DOCX export path is opt-in (nativeDocxExport defaults to false) and explicitly falls
back to the existing LibreOffice/soffice path on error, so DOCX export is not fully free of a
LibreOffice runtime dependency. This fails the requirement to support DOCX export without requiring
LibreOffice for these formats.
Code

src/node/handler/ExportHandler.ts[R97-110]

+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }
Evidence
PR Compliance ID 1 requires DOCX/PDF support using native/local tooling with no runtime dependency
on LibreOffice. The added DOCX branch is gated behind settings.nativeDocxExport and, if conversion
fails, logs a warning and falls through to the LibreOffice export path, meaning LibreOffice remains
a required backstop in the DOCX export flow.

Native DOCX/PDF import/export support without Abiword/LibreOffice dependency
src/node/handler/ExportHandler.ts[97-110]
src/node/utils/Settings.ts[419-426]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Compliance requires DOCX export to work without a LibreOffice/`soffice` runtime dependency. The new native DOCX export is opt-in (disabled by default) and explicitly falls back to the LibreOffice path on error, so LibreOffice is still required as a backstop for DOCX export.

## Issue Context
Current implementation uses `html-to-docx` when `nativeDocxExport` is enabled, but catches errors and falls through to LibreOffice. This violates the stated objective of having DOCX export not depend on LibreOffice.

## Fix Focus Areas
- src/node/handler/ExportHandler.ts[97-110]
- src/node/utils/Settings.ts[419-426]



2. DOCX blocked without soffice 🐞 Bug ≡ Correctness
Description
Even with settings.nativeDocxExport=true, DOCX exports are rejected and hidden when settings.soffice
is null because exportAvailable() gates docx behind soffice in both the /export route and UI. This
makes the new native DOCX branch in ExportHandler unreachable in the documented no-LibreOffice
configuration.
Code

src/node/handler/ExportHandler.ts[R90-111]

+    // Native DOCX path (issue #7538) — when `nativeDocxExport` is enabled,
+    // convert the HTML export into a Word document in-process with
+    // `html-to-docx` instead of shelling out to LibreOffice. Saves admins
+    // from having to install `soffice` and avoids per-export subprocess
+    // latency. On failure we fall through to the LibreOffice path below
+    // so the change is strictly additive (opt-in via setting, auto-fallback
+    // if the converter throws).
+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }
+    }
Evidence
The PR adds a native DOCX branch, but requests for /export/docx are blocked earlier when LibreOffice
is disabled (soffice=null), and the UI removes the DOCX link under the same condition.
exportAvailable() only reflects soffice availability, so enabling nativeDocxExport alone won’t
expose or allow DOCX export.

src/node/handler/ExportHandler.ts[90-111]
src/node/hooks/express/importexport.ts[27-48]
src/static/js/pad_impexp.ts[147-166]
src/node/utils/Settings.ts[700-709]
doc/docker.md[190-194]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX export is implemented but is effectively unreachable in the intended “no soffice installed” configuration because the server route guard and client UI still treat `docx` as requiring LibreOffice.

## Issue Context
- Server-side guard blocks `docx` when `exportAvailable() === 'no'`.
- `exportAvailable()` currently only reflects `soffice` presence.
- Client UI removes the Word export link when `clientVars.exportAvailable === 'no'`.
- Docs say setting `SOFFICE` to `null` disables LibreOffice (typical for no-soffice deployments).

## Fix Focus Areas
- Update server export guard to allow `docx` when `settings.nativeDocxExport === true`, even if `soffice` is null:
 - src/node/hooks/express/importexport.ts[27-48]
- Add a dedicated capability flag for “Word export available” (or “nativeDocxExport enabled”) into clientVars so the UI can show Word export even when other converter-based exports remain disabled:
 - src/node/handler/PadMessageHandler.ts[1113-1118]
 - src/static/js/pad_impexp.ts[147-166]
- Avoid incorrectly enabling PDF/ODT links when only native DOCX is available (introduce a new state or separate flags rather than reusing `exportAvailable`).
 - src/node/utils/Settings.ts[700-709]




Remediation recommended
3. Native DOCX test bypass 🐞 Bug ☼ Reliability
Description
The new native DOCX tests set settings.soffice='false' (a non-null string), which prevents
exportAvailable() from returning 'no' and sidesteps the server-side DOCX export block. This can make
tests pass while a real deployment with soffice=null (as documented) still cannot export DOCX.
Code

src/tests/backend/specs/export.ts[R36-39]

+    before(function () {
+      settings.soffice = 'false';
+      settings.nativeDocxExport = true;
+    });
Evidence
The tests configure soffice with a non-null string, but the documented way to disable LibreOffice is
null. Additionally, Settings reload logic will null out invalid soffice paths, meaning the test
configuration doesn’t reflect real behavior; the server route guard blocks docx when
exportAvailable() is 'no'.

src/tests/backend/specs/export.ts[32-39]
doc/docker.md[190-194]
src/node/utils/Settings.ts[700-709]
src/node/utils/Settings.ts[1019-1030]
src/node/hooks/express/importexport.ts[37-48]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The native DOCX tests use `settings.soffice = 'false'`, which is non-null and therefore does not simulate a true “no soffice” deployment (`soffice: null`). This can let the tests pass even if the feature is broken for real deployments.

## Issue Context
- Docs describe disabling LibreOffice by setting `SOFFICE` to `null`.
- Server-side export route blocks docx when `exportAvailable() === 'no'`.

## Fix Focus Areas
- Update the native DOCX tests to simulate a real no-soffice deployment (`settings.soffice = null`) and assert DOCX export still succeeds when `nativeDocxExport = true`:
 - src/tests/backend/specs/export.ts[32-65]
- After fixing the route/UI gating (see other finding), add a regression assertion that `/export/docx` works with `soffice = null` and fails (or is blocked) appropriately when nativeDocxExport is false.
 - src/tests/backend/specs/export.ts[22-66]



4. Unrestricted HTML-to-DOCX I/O 🐞 Bug ⛨ Security
Description
ExportHandler passes exported HTML directly into html-to-docx and buffers the entire DOCX in memory
for res.send(), and the dependency graph includes image-to-base64→node-fetch enabling outbound
network access from conversion code. Because HTML export can be plugin-modified, enabling
nativeDocxExport can allow untrusted pad/plugin output to trigger server-side requests and increase
memory pressure.
Code

src/node/handler/ExportHandler.ts[R99-105]

+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
Evidence
The new code converts HTML in-process via html-to-docx and sends the resulting buffer. The lockfile
shows html-to-docx includes image-to-base64 (node-fetch), and ExportHtml provides a plugin hook that
can modify generated HTML, meaning untrusted plugin/pad output can influence the converter input and
potentially induce server-side I/O.

src/node/handler/ExportHandler.ts[97-105]
pnpm-lock.yaml[8709-8718]
pnpm-lock.yaml[8804-8807]
src/node/utils/ExportHtml.ts[321-337]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX export converts plugin-modifiable HTML via `html-to-docx` with no constraints. The dependency tree includes `node-fetch`, so conversion code may perform outbound network access, and the handler buffers the full DOCX in memory.

## Issue Context
- ExportHandler calls `htmlToDocx(html)` and `res.send(docxBuffer)`.
- ExportHtml allows plugins to modify exported HTML.
- pnpm-lock indicates `html-to-docx` pulls in `image-to-base64` and `node-fetch`.

## Fix Focus Areas
- Investigate `html-to-docx` options to disable remote fetching / external resource resolution (or strip/deny `<img src>` and other fetchable URLs from HTML before conversion).
 - src/node/handler/ExportHandler.ts[97-105]
- Add guardrails: size limits for generated DOCX, timeouts/cancellation, and (if possible) run conversion in a constrained environment (worker/thread or sandbox) to reduce SSRF and DoS impact.
 - src/node/handler/ExportHandler.ts[97-111]
- Consider writing the buffer to a temp file and using `res.sendFile()` (or streaming) to reduce peak memory usage.
 - src/node/handler/ExportHandler.ts[99-105]




Comment thread src/node/handler/ExportHandler.ts Outdated
Comment on lines +97 to +110
    if (type === 'docx' && settings.nativeDocxExport) {
      try {
        const htmlToDocx = require('html-to-docx');
        const docxBuffer = await htmlToDocx(html);
        html = null;
        res.contentType(
            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
        res.send(docxBuffer);
        return;
      } catch (err) {
        console.warn(
            `native-docx export failed for pad "${padId}", falling back to ` +
            `LibreOffice: ${(err as Error).message || err}`);
      }

Action required

1. Docx export still needs soffice 📎 Requirement gap ≡ Correctness

The new DOCX export path is opt-in (nativeDocxExport defaults to false) and explicitly falls
back to the existing LibreOffice/soffice path on error, so DOCX export is not fully free of a
LibreOffice runtime dependency. This fails the requirement to support DOCX export without requiring
LibreOffice for these formats.
Agent Prompt
## Issue description
Compliance requires DOCX export to work without a LibreOffice/`soffice` runtime dependency. The new native DOCX export is opt-in (disabled by default) and explicitly falls back to the LibreOffice path on error, so LibreOffice is still required as a backstop for DOCX export.

## Issue Context
Current implementation uses `html-to-docx` when `nativeDocxExport` is enabled, but catches errors and falls through to LibreOffice. This violates the stated objective of having DOCX export not depend on LibreOffice.

## Fix Focus Areas
- src/node/handler/ExportHandler.ts[97-110]
- src/node/utils/Settings.ts[419-426]


Comment thread src/node/handler/ExportHandler.ts Outdated
Comment on lines +90 to +111
    // Native DOCX path (issue #7538) — when `nativeDocxExport` is enabled,
    // convert the HTML export into a Word document in-process with
    // `html-to-docx` instead of shelling out to LibreOffice. Saves admins
    // from having to install `soffice` and avoids per-export subprocess
    // latency. On failure we fall through to the LibreOffice path below
    // so the change is strictly additive (opt-in via setting, auto-fallback
    // if the converter throws).
    if (type === 'docx' && settings.nativeDocxExport) {
      try {
        const htmlToDocx = require('html-to-docx');
        const docxBuffer = await htmlToDocx(html);
        html = null;
        res.contentType(
            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
        res.send(docxBuffer);
        return;
      } catch (err) {
        console.warn(
            `native-docx export failed for pad "${padId}", falling back to ` +
            `LibreOffice: ${(err as Error).message || err}`);
      }
    }

Action required

2. Docx blocked without soffice 🐞 Bug ≡ Correctness

Even with settings.nativeDocxExport=true, DOCX exports are rejected and hidden when settings.soffice
is null because exportAvailable() gates docx behind soffice in both the /export route and UI. This
makes the new native DOCX branch in ExportHandler unreachable in the documented no-LibreOffice
configuration.
Agent Prompt
## Issue description
Native DOCX export is implemented but is effectively unreachable in the intended “no soffice installed” configuration because the server route guard and client UI still treat `docx` as requiring LibreOffice.

## Issue Context
- Server-side guard blocks `docx` when `exportAvailable() === 'no'`.
- `exportAvailable()` currently only reflects `soffice` presence.
- Client UI removes the Word export link when `clientVars.exportAvailable === 'no'`.
- Docs say setting `SOFFICE` to `null` disables LibreOffice (typical for no-soffice deployments).

## Fix Focus Areas
- Update server export guard to allow `docx` when `settings.nativeDocxExport === true`, even if `soffice` is null:
  - src/node/hooks/express/importexport.ts[27-48]
- Add a dedicated capability flag for “Word export available” (or “nativeDocxExport enabled”) into clientVars so the UI can show Word export even when other converter-based exports remain disabled:
  - src/node/handler/PadMessageHandler.ts[1113-1118]
  - src/static/js/pad_impexp.ts[147-166]
- Avoid incorrectly enabling PDF/ODT links when only native DOCX is available (introduce a new state or separate flags rather than reusing `exportAvailable`).
  - src/node/utils/Settings.ts[700-709]


@JohnMcLear JohnMcLear force-pushed the feat/native-docx-export-7538 branch from 7e5a73c to b98dfba Compare April 20, 2026 08:44
@JohnMcLear
Member Author

I feel like we can't drop pdf so need to have a conversation here...

@JohnMcLear JohnMcLear marked this pull request as draft April 26, 2026 19:02
JohnMcLear added a commit to JohnMcLear/etherpad that referenced this pull request May 8, 2026
Captures the agreed scope expansion of PR ether#7568: replace the flag-gated
native DOCX path with a soffice-first selection cascade, add native PDF
export via pdfkit + a small htmlparser2-driven walker, and add native
DOCX import via mammoth. Also defines a shared HTML sanitizer
(stripRemoteImages) used by both export converters to close the
SSRF surface that Qodo flagged on the html-to-docx path.

The spec drops the nativeDocxExport setting and its env var; with
soffice configured, behavior is unchanged, and with soffice null,
docx/pdf export and docx import all work in-process. odt/doc/rtf
(and pdf import) keep needing soffice and are documented as such.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnMcLear and others added 4 commits May 8, 2026 14:36
Addresses ether#7538. The current DOCX export path shells out to LibreOffice,
which means every deployment that wants a Word download either installs
soffice (~500 MB) or loses that export. This PR adds a pure-JS
alternative: render the HTML via the existing exporthtml pipeline, then
feed it to the `html-to-docx` library in-process to produce a valid
.docx buffer — no soffice required, no subprocess spawn, no temp file
dance for the DOCX case.

Behavior:
- `settings.nativeDocxExport` (default `false`) gates the new path so
  existing deployments see zero behavior change.
- When enabled, `type === 'docx'` requests skip the LibreOffice branch,
  run `html-to-docx(html)`, and return the buffer with the
  `application/vnd.openxmlformats-officedocument.wordprocessingml.document`
  content-type.
- If the native converter throws, the handler falls through to the
  existing LibreOffice path, so flipping the flag on is safe even in a
  mixed installation where soffice is still present as a backstop.
- Other export formats (pdf, odt, rtf, txt, html, etherpad) are
  unchanged.

Files:
- `src/package.json`: `html-to-docx` dep (pure JS, no binary reqs)
- `src/node/handler/ExportHandler.ts`: new DOCX branch gated on the
  setting, with fall-through on error
- `src/node/utils/Settings.ts`, `settings.json.template`,
  `settings.json.docker`, `doc/docker.md`: wire up the new setting +
  env var (`NATIVE_DOCX_EXPORT`)
- `src/tests/backend/specs/export.ts`: two new tests — asserts the
  exported buffer is a valid ZIP (PK\x03\x04 signature) and the
  response carries the correct content-type — both with
  `settings.soffice = 'false'` to prove the path doesn't need soffice
  at all.

Out of scope for this PR:
- Native PDF export (would need a PDF rendering step — separate
  undertaking, and the issue acknowledges the `pdfkit`/puppeteer size
  trade-off).

Closes ether#7538

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The upgrade-from-latest-release CI job installs deps from the previous
release's package.json (before this PR adds html-to-docx) and then
git-checkouts this branch's code without re-running pnpm install.
Under that one workflow the new test can't find the module and fails
on the LibreOffice fallback, masking that the native path actually
works in every normal install.

Guard the describe block with require.resolve('html-to-docx'); Mocha's
this.skip() on before cascades to the sibling its. Regular backend
tests (pnpm install against this branch's lockfile) still exercise it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the agreed scope expansion of PR ether#7568: replace the flag-gated
native DOCX path with a soffice-first selection cascade, add native PDF
export via pdfkit + a small htmlparser2-driven walker, and add native
DOCX import via mammoth. Also defines a shared HTML sanitizer
(stripRemoteImages) used by both export converters to close the
SSRF surface that Qodo flagged on the html-to-docx path.

The spec drops the nativeDocxExport setting and its env var; with
soffice configured, behavior is unchanged, and with soffice null,
docx/pdf export and docx import all work in-process. odt/doc/rtf
(and pdf import) keep needing soffice and are documented as such.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD task breakdown of the soffice-free export/import work:
rebase, deps, sanitizer, PDF walker, mammoth wrapper, ExportHandler
cascade, route guard, ImportHandler branch, UI fix, flag rollback,
verification + Qodo reply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear JohnMcLear force-pushed the feat/native-docx-export-7538 branch from 2d7995e to cec693f Compare May 8, 2026 13:37
JohnMcLear and others added 9 commits May 8, 2026 14:37
Pure-JS, no native binaries:
- pdfkit ^0.18.0  (PDF rendering)
- htmlparser2 ^12 (SAX parser used by walker + sanitizer)
- mammoth ^1.12   (DOCX -> HTML for native import)
- @types/pdfkit ^0.17 (dev)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops <img src=> elements pointing at non-data, non-relative URLs to
prevent the DOCX/PDF converters from making outbound requests via
plugin-modified HTML. Closes Qodo finding ether#4 against the
html-to-docx path; will be wired into both export branches in
the cascade refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renders pad HTML to a PDF Buffer in-process: headings, paragraphs,
lists, links, inline emphasis, data:-URI images. Remote images are
explicitly skipped at the walker (defense-in-depth on top of the
shared stripRemoteImages sanitizer).

PDFs are emitted with compress:false so accessibility/SEO indexers
that don't FlateDecode can still extract text. Pads are small enough
that the size cost is negligible.

Walker is 167 LOC, well under the spec's 500-LOC bail-out
threshold for switching to pdfmake+html-to-pdfmake+jsdom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps mammoth.convertToHtml so a soffice-less Etherpad can ingest
.docx files. Images are coerced to data: URIs at the converter
boundary so the import pipeline never sees a remote src=.

Includes a tiny generated DOCX fixture (heading, paragraph, list)
under tests/backend/specs/fixtures/ for both this wrapper test and
the upcoming end-to-end import test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the flag-gated DOCX branch with a deterministic dispatch:
soffice if configured, native DOCX/PDF otherwise, 5xx on native
error. Both native paths run plugin-modified HTML through
stripRemoteImages first.
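
The dispatch described above can be modelled as a small pure function. This is a sketch under assumptions, not the code in `ExportHandler.ts`: the type names and `chooseExportRoute` are hypothetical, and the three availability values and the blocked formats are taken from the PR summary's selection model.

```typescript
type SofficeAvailability = 'yes' | 'withoutPDF' | 'no';
// Only formats that need a converter; html/txt/etherpad are out of scope here.
type ConvertedFormat = 'docx' | 'pdf' | 'odt' | 'doc' | 'rtf';
type ExportRoute = 'soffice' | 'native' | 'blocked';

// Formats with an in-process converter, per the PR summary.
const NATIVE_FORMATS: ReadonlySet<ConvertedFormat> =
  new Set<ConvertedFormat>(['docx', 'pdf']);

const chooseExportRoute = (
  availability: SofficeAvailability,
  format: ConvertedFormat,
): ExportRoute => {
  // soffice configured: behaviour stays on the existing path.
  if (availability === 'yes') return 'soffice';
  // Windows case: soffice handles everything except PDF.
  if (availability === 'withoutPDF') return format === 'pdf' ? 'native' : 'soffice';
  // No soffice: native where a converter exists, otherwise blocked.
  return NATIVE_FORMATS.has(format) ? 'native' : 'blocked';
};
```

There is deliberately no second attempt in this model: once a route is chosen, a converter error surfaces as a 5xx response instead of falling back to another route.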

Test changes:
- existing native DOCX block now sets soffice=null (was 'false', a
  truthy non-null string that sidestepped the route guard); fixes
  Qodo finding #3.
- new native PDF integration tests assert %PDF- header and
  application/pdf content-type with soffice=null.
- new negative test: with soffice=null, /export/odt still returns
  the 'not enabled' message.
- the legacy 500-on-export-error test now uses /bin/false so it
  exercises the soffice error path explicitly (the cascade dropped
  the ad-hoc 'false' string; .doc has no native path so this still
  works as a soffice error probe).

Integration tests for native DOCX/PDF currently fail because the
/export route guard still treats both formats as soffice-only;
the next commit fixes that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tightens the no-soffice block to ['odt','doc'] only — formats with
no native path. docx and pdf are handed to ExportHandler, which
dispatches to the in-process converters. Closes Qodo finding #2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
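The cascade and the tightened route guard from the two commits above can be summarised as one pure decision function. This is a sketch under assumptions: `pickExportPath`, `ExportPath` and `NATIVE_TYPES` are hypothetical names, and only the office formats (docx, pdf, odt, doc) are modelled, not html/txt/etherpad.

```typescript
type SofficeState = 'yes' | 'withoutPDF' | 'no';
type ExportPath = 'soffice' | 'native' | 'blocked';

// Formats with an in-process converter, per the PR description.
const NATIVE_TYPES: string[] = ['docx', 'pdf'];

// Hypothetical helper: decides which path serves an office-format export.
function pickExportPath(sofState: SofficeState, type: string): ExportPath {
  if (sofState === 'yes') return 'soffice';
  if (sofState === 'withoutPDF') return type === 'pdf' ? 'native' : 'soffice';
  // soffice is null: native where a converter exists, otherwise blocked.
  return NATIVE_TYPES.indexOf(type) !== -1 ? 'native' : 'blocked';
}
```

There is deliberately no branch that retries another path after a failure, matching the "no fallback chain" decision.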
When soffice is null and the upload is .docx, run mammoth and feed
the resulting HTML through setPadHTML. Other office formats
(pdf/odt/doc/rtf) are explicitly rejected with uploadFailed instead
of silently falling through to the ASCII-only path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Native paths (ether#7538) make DOCX and PDF available regardless of
soffice presence, so unconditionally render those links. ODT still
gates on exportAvailable. Closes Qodo finding #2 on the UI side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Selection is now purely soffice-presence-driven (cascade in
ExportHandler). The opt-in setting and its NATIVE_DOCX_EXPORT env
var are no longer needed -- soffice configured means soffice path;
soffice null means native path (DOCX, PDF, and DOCX import).
Reverts the additive surface introduced earlier in this PR.

Also updates the SOFFICE doc row to reflect that null no longer
means 'plain text and HTML only' -- docx/pdf export and docx
import now work natively without soffice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear
Copy link
Copy Markdown
Member Author

JohnMcLear commented May 8, 2026

Qodo follow-up — addressed all four findings, plus expanded scope to close #7538 properly:

  1. Requirement gap (DOCX still needs soffice) — fixed. Removed the nativeDocxExport flag entirely. Selection is now purely soffice-presence-driven: soffice configured → soffice; soffice null → native (html-to-docx for DOCX, pdfkit for PDF). No fallback chain.
  2. DOCX blocked without soffice — fixed. Tightened the route guard to ['odt','doc'] only when exportAvailable() === 'no'; pdf/docx fall through to ExportHandler's native dispatch. UI in pad_impexp.ts always shows DOCX + PDF links now.
  3. Native DOCX test bypass — fixed. Tests use settings.soffice = null (was 'false') so they exercise the real no-soffice deployment shape.
  4. Unrestricted HTML-to-DOCX I/O — fixed. New stripRemoteImages sanitizer drops non-data:/non-relative <img src> before either DOCX or PDF conversion. The PDF walker also rejects remote <img> at its own boundary as defense-in-depth. No converter ever sees a remote URL.

Also added (per my "we can't drop pdf" comment):

  • Native PDF export via pdfkit + htmlparser2 walker (~170 LOC, well under the 500-LOC bail-out threshold defined in the spec)
  • Native DOCX import via mammoth so a soffice-less deployment can also ingest .docx files

Design + plan committed to the branch:

  • docs/superpowers/specs/2026-05-08-native-docx-pdf-export-import-design.md
  • docs/superpowers/plans/2026-05-08-native-docx-pdf-export-import.md

Comment thread src/tests/backend/specs/export.ts Fixed
CodeQL flagged the loose 'raw.includes("etherpad.org")' as
'incomplete URL substring sanitization' (a false positive in test
context, but worth fixing). Match the full /URI (host) form
instead -- it's both more accurate (we're verifying the PDF link
annotation structure) and CodeQL-clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear JohnMcLear requested a review from SamTV12345 May 8, 2026 14:06
DOCX:
- New extractBody helper drops <head>/<style> and the leading
  newline inside <body> so html-to-docx doesn't render CSS or
  prefix paragraphs with empty space.
- New wrapLooseLines pre-processor wraps loose pad lines in <p>
  before the converter sees them. html-to-docx renders <br>
  outside <p> as a new <w:p> (full empty line in Word); inside
  <p> it correctly emits <w:br/> (soft break). Etherpad's HTML
  uses bare <br> for every line, so this was making single
  Enters look like double Enters in the Word output.

PDF:
- Walker SKIP_TAGS rejects head/style/script/title/meta/link
  content -- prior version dumped CSS into the rendered PDF.
- New breakLine() helper combines flushLine() with moveDown(1).
  pdfkit's text('', false) closes the continued run but does
  NOT advance the cursor, so consecutive runs were stacking at
  the same y-coordinate. <br>, end-of-block, and list items
  now use breakLine().
- ontext collapses runs of whitespace and drops pure-whitespace
  text nodes so pretty-printed source HTML doesn't render its
  formatting newlines.

Round-trip:
- New backend test: pad text -> DOCX export -> DOCX import ->
  new pad. Asserts content survives the trip.
- New PDF sanity test: extracts visible text from the PDF stream
  and asserts the source pad text appears verbatim.
- 6 new unit tests for extractBody and wrapLooseLines plus 1 for
  PDF walker SKIP_TAGS coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/node/utils/ExportSanitizeHtml.ts Fixed
JohnMcLear and others added 3 commits May 8, 2026 15:33
- BR_PARA_RE was /(?:\s*<br>\s*){2,}/ -- two adjacent \s* runs can
  match the same chars, so on '<br>\t<br>\t<br>...' the regex
  backtracks exponentially. Re-anchored to match a fixed first <br>
  followed by one or more additional <br>s, so each whitespace run
  has exactly one home.
- import.ts: fetchBuffer was typed Promise<Buffer> but call sites
  chained .expect(200) on it, which only works on supertest's Test
  object. Return the Test (typed any) so the chain is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
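To illustrate the regex change in the commit above: the old pattern `/(?:\s*<br>\s*){2,}/` lets the trailing `\s*` of one repetition and the leading `\s*` of the next compete for the same whitespace. One plausible re-anchored shape is sketched below; the exact pattern in the PR is not shown in this thread, so `BR_RUN_RE` and `countBrRuns` are assumptions.

```typescript
// Assumed shape: a fixed first <br>, then one or more (\s*<br>) repeats.
// Every whitespace run can only be consumed by the \s* directly before a <br>.
const BR_RUN_RE = /<br>(?:\s*<br>)+/g;

// Hypothetical helper: counts runs of two or more <br>s.
function countBrRuns(html: string): number {
  const m = html.match(BR_RUN_RE);
  return m === null ? 0 : m.length;
}
```

Because the first `<br>` is a literal anchor, a long `<br>\t<br>\t...` input is scanned in a single forward pass.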
ep_headings2/ep_align emit one heading-styled blank-line block after
every styled line in the pad ('<h1 style=text-align:right></h1>'),
which both html-to-docx and our pdfkit walker render as a full empty
paragraph. Plus the pdfkit walker had no support for text-align or
monospace, so right/center alignment and 'code' lines rendered the
same as plain body text.

- New dropEmptyBlocks helper strips empty h1-h6/p/code/pre/div/
  blockquote wrappers in preprocessing. Iterates so nested empties
  collapse too. Applied before both DOCX and PDF conversion.
- PDF walker now reads style='text-align:left|center|right|justify'
  on block elements (h1-6, p, div) and passes it as pdfkit's align
  option. align is applied once per continued run, then reset on
  flushLine so the next block can pick up its own value.
- PDF walker handles <code>, <tt>, <kbd>, <samp> as inline monospace
  (Courier) and <pre> as block monospace (Courier + breakLine on
  open/close).

11 new unit tests:
- 4 for dropEmptyBlocks (heading wrappers, code, nesting,
  pass-through)
- 1 for PDF text-align (compares the BT matrix x for left vs right)
- 2 for Courier in <code> and <pre>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-trip bug: ep_headings2 emits <h1>/<h2>/<code> in pad HTML.
Mammoth round-trips them as adjacent <h1>A</h1><h2>B</h2><p>C</p>
on import. Etherpad's server-side content collector has a default
_blockElems set of just {div, p, pre, li}, and ep_headings2 only
registers the CLIENT-side aceRegisterBlockElements -- not the
server-side ccRegisterBlockElements. So h1/h2/code end up being
treated as inline by the importer, and adjacent blocks merge into
a single pad line.

Fix: insert <br> after </h1>...</h6>/</code> when followed by
another block. Server-side workaround keeps this PR self-contained
regardless of plugin version. The right long-term fix is to extend
ep_plugin_helpers' lineAttribute factory to register both hooks
(filed as a follow-up).

Tests:
- 5 unit tests for separateAdjacentHeadingBlocks
- New end-to-end round-trip test asserts H1+H2+P land on three
  separate pad lines after the import path.

Also included here, carried over from the prior PDF
text-align/Courier/code commit:
- code/tt/kbd/samp inherit text-align from style attribute
- pre inherits text-align too

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
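A rough sketch of the workaround described in the commit above, assuming a single regex pass. The list of tags that count as "another block" in the lookahead is an assumption, and the real separateAdjacentHeadingBlocks may differ.

```typescript
// Closing heading/code tag, optional whitespace, then the start of another block.
// The block list in the lookahead is assumed for illustration.
const AFTER_HEADING_RE =
  /(<\/(?:h[1-6]|code)>)(\s*)(?=<(?:h[1-6]|p|div|pre|code|ul|ol|blockquote)\b)/gi;

function separateAdjacentHeadingBlocks(html: string): string {
  // Insert an explicit <br> so the server-side collector starts a new pad line.
  return html.replace(AFTER_HEADING_RE, '$1<br>$2');
}
```

With this sketch, `<h1>A</h1><h2>B</h2><p>C</p>` becomes `<h1>A</h1><br><h2>B</h2><br><p>C</p>`, while a heading followed by plain text is left alone.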
DOCX round-trip dropping blank pad lines:
- wrapLooseLines now emits an explicit <p></p> marker for each blank
  line in a <br> run, instead of collapsing all gaps into a single
  paragraph break. (N consecutive <br>s -> 1 paragraph boundary +
  N-2 empty <p></p> markers, mapping to N-1 blank pad lines.)
- mammoth's docxBufferToHtml now passes ignoreEmptyParagraphs:false
  so the empty <w:p> entries survive the import side. mammoth's
  default of true was silently dropping them.
- dropEmptyBlocks no longer strips <p></p> -- that's the meaningful
  marker for the round-trip. Empty <h1>/<code>/<pre>/<div>/
  <blockquote> are still stripped (plugin noise).

DOCX <code> rendering as monospace:
- New applyMonospaceToCode wraps code/tt/kbd/samp/pre content in a
  <span style="font-family:'Courier New', monospace">. html-to-docx
  honors that and emits <w:rFonts w:ascii="Courier New".../>, which
  Word renders as Courier. The bare <code> tag is otherwise just
  a no-op for html-to-docx.
- Applied only on the DOCX export path (PDF walker already handles
  monospace via Courier font selection).

Round-trip tests:
- New a==c suite: txt, etherpad, html, docx -- export from src,
  import to dst, re-export and compare against the meaningful
  invariant (line text for binary formats; trimmed body for HTML).
- HTML test tolerates one trailing <br> per round-trip because
  setPadHTML appends a final <p> on import; this is pre-existing
  core behavior, not our bug.
- DOCX test normalizes trailing newline run (same reason).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnMcLear and others added 3 commits May 8, 2026 16:23
mammoth doesn't expose Word's paragraph alignment (`<w:jc>`) when it
converts a docx to HTML -- there's no equivalent in its style-mapping
machinery. To keep alignment through DOCX round-trips we walk the
docx's document.xml directly, pull the `w:val` from each `<w:p>`'s
`<w:jc>`, and inject `style="text-align:..."` onto the matching
block element in mammoth's output by document order.

Word's w:jc accepts more values than CSS text-align; we map left/
start, center, right/end, both/justify/distribute and skip the rest
(start/end take left/right because we don't track ltr/rtl from the
docx for now).

Combines with the upstream ep_align PR (ether/ep_align#183) for the
full round-trip: this PR makes the mammoth output carry the
alignment style; ep_align#183 makes the importer pick it up.

Closes the alignment side of ether#7538.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
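The value mapping described in the commit above can be sketched as a small lookup. The function name `jcToTextAlign` is hypothetical, and folding both/justify/distribute into CSS `justify` is an assumption read from the grouping in the commit message.

```typescript
type TextAlign = 'left' | 'center' | 'right' | 'justify';

// Maps a w:jc value to a CSS text-align value; undefined means "skip".
function jcToTextAlign(jc: string): TextAlign | undefined {
  switch (jc) {
    case 'left':
    case 'start':        // no ltr/rtl tracking, so start is treated as left
      return 'left';
    case 'center':
      return 'center';
    case 'right':
    case 'end':          // likewise, end is treated as right
      return 'right';
    case 'both':
    case 'justify':
    case 'distribute':
      return 'justify';
    default:
      return undefined;  // other w:jc values are skipped
  }
}
```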
html-to-docx silently drops <a href> children of <code>/<pre> tags
(and of styled <span>s, but the <code> wrapper is the active offender
here). The pad-export HTML produced by ep_headings2 + ep_align uses
<code style='text-align:right'>...<a href>...</a>...</code> for each
'Code'-style line, which lost its links on every DOCX export.

Workaround: applyMonospaceToCode now drops the code/pre/tt/kbd/samp
wrapper entirely. The non-anchor content gets wrapped in monospace
spans; anchors are emitted unstyled so they keep their hyperlink. For
block-level usage (<pre>, or <code> with an inline style attr) we
emit a wrapping <p> and forward the text-align style. Run BEFORE
wrapLooseLines so the <p> doesn't get double-wrapped.

Tests added:
- inline <code> -> just a styled span (no <code> wrapper)
- <code style='text-align:right'> -> <p style> wrap
- <pre> -> always block-wrapped
- <tt>/<kbd>/<samp> -> inline span only
- regression: <a href> inside <code> survives html-to-docx round-trip
  with both the URL in word/_rels/document.xml.rels AND a <w:hyperlink>
  in the document body

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Etherpad's HTML export wraps each pad line in <p>...</p> (or <h1>,
<code>, etc.) and then appends a <br> between lines. The closing
block tag already ends the line for contentcollector, so the trailing
<br> is redundant -- and on import the server collector counts BOTH
as line breaks, doubling every blank line between paragraphs and
inserting an extra blank between adjacent headings.

Two fixes, both gated on the runtime block-element registry so they
don't double-trigger when the underlying plugin already handles
adjacency:

1. HTML import path now runs the new collapseRedundantBrAfterBlocks
   helper before setPadHTML. Drops a single <br> immediately
   following </p>/</h1-6>/</code>/</pre>/</div>/</blockquote>/</ul>/
   </ol>/</li>/</table>/</tr>/</td>/</th>. Multiple consecutive
   <br>s after a block keep all but the first (the rest still
   represent intentional blank lines).

2. The DOCX-import separateAdjacentHeadingBlocks workaround now
   checks whether 'h1' is in the runtime ccRegisterBlockElements
   set before inserting <br>s. When ep_headings2 has the new server
   hook (per ep_plugin_helpers#14 + the upcoming ep_headings2 PR),
   the workaround correctly stays out of the way -- otherwise it
   adds an extra blank line per heading transition.

Also fixed a subtle ts-check failure on the import.ts test changes
and a leftover implicit-any in ImportDocxNative's alignment
preserver.

Tests added:
- collapseRedundantBrAfterBlocks: 5 unit tests (each block tag,
  whitespace tolerance, multiple <br> keeping intentional blanks)
- HTML import: 'does not introduce a blank line between H1 and H2'
  and 'preserves blank-line count between H1 and H2 (realistic
  shape)'; the latter reproduces the 5-blanks-where-2-expected bug
  from the user's round-trip pad.

1054 backend tests pass locally (the 6 failures are the pre-existing
favicon/webaccess send@1.x dotfile-path issue caused by running under
.claude/, which doesn't reach CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
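A sketch of fix 1 from the commit above, assuming a single regex pass over the tag list given in the message. It omits the runtime block-element-registry gate the commit mentions, and the real collapseRedundantBrAfterBlocks may be written differently.

```typescript
// A closing block tag, optional whitespace, then exactly one <br>.
const BR_AFTER_BLOCK_RE =
  /(<\/(?:p|h[1-6]|code|pre|div|blockquote|ul|ol|li|table|tr|td|th)>)\s*<br\s*\/?>/gi;

function collapseRedundantBrAfterBlocks(html: string): string {
  // Only the first <br> after the block is consumed; later ones are kept
  // because they represent intentional blank lines.
  return html.replace(BR_AFTER_BLOCK_RE, '$1');
}
```

Since the replacement scan resumes after each match, `</p><br><br>` loses one `<br>` and keeps the other.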
JohnMcLear added a commit to ether/ep_plugin_helpers that referenced this pull request May 8, 2026
Etherpad core's import-side content collector keeps its own
`_blockElems` set (`{div, p, pre, li}` by default), separate from
the editor's. `aceRegisterBlockElements` registers tags on the
editor side only -- so plugins built on `lineAttribute` /
`tagAttribute` (`ep_headings2`, `ep_subscript_and_superscript`,
etc.) tell the editor that `h1..h4` / `code` / `sub` / `sup` are
block elements but DON'T tell the importer.

Symptom: round-tripping a pad with `<h1>` / `<h2>` / `<code>` lines
through any HTML import path collapses adjacent heading-style
blocks into a single pad line. This just bit native DOCX export+
import in ether/etherpad#7568.

Fix: have `createLineAttribute` and `createTagAttribute` return a
`ccRegisterBlockElements` function with the same tag set as
`aceRegisterBlockElements`. Plugins re-export this from their
`ep.json` under `"ccRegisterBlockElements"` to register the same
tags on the import side.

Tests added for both factories.

Closes #13
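As an illustration of the fix above: both Etherpad hooks return a list of tag names, so one factory can hand out the same list for the editor side and the import side. `makeBlockElementHooks` is a hypothetical stand-in, not the ep_plugin_helpers API, whose actual factory signatures are not shown here.

```typescript
type BlockElementsHook = () => string[];

// Hypothetical helper: returns matching editor-side and import-side hooks.
function makeBlockElementHooks(tags: string[]): {
  aceRegisterBlockElements: BlockElementsHook;
  ccRegisterBlockElements: BlockElementsHook;
} {
  // Return a copy each time so callers cannot mutate the shared tag list.
  const hook: BlockElementsHook = () => tags.slice();
  return {aceRegisterBlockElements: hook, ccRegisterBlockElements: hook};
}

// Example tag set, mirroring the h1-h4 + code set named for ep_headings2.
const headingHooks = makeBlockElementHooks(['h1', 'h2', 'h3', 'h4', 'code']);
```

A plugin would still need to list `ccRegisterBlockElements` in its `ep.json` for the server-side hook to be called.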
JohnMcLear added a commit to ether/ep_align that referenced this pull request May 8, 2026
* fix: read text-align from inline style on import

When a pad is HTML-exported, ep_align's getLineHTMLForExport wraps
content in <p style='text-align:...'> (and modifies <h1..h6> tags
in-place to add the same style). Importing that HTML or any HTML
that uses style='text-align:..' on block elements should re-apply
the corresponding line attribute -- but collectContentPre was only
reading the legacy <left>/<center>/<right>/<justify> tag names, so
imports silently dropped alignment.

Pick up text-align from the inline style attribute (etherpad core's
contentcollector already passes the parsed style as context.styl)
so a round-trip through HTML or DOCX preserves alignment.

Closes ether/etherpad#7538 (alignment side of the round-trip)
Refs: PR ether/etherpad#7568

* test: cover style-attribute alignment round-trip

Add backend test cases that set pad HTML using
<p style="text-align:..."> directly (the modern form) and assert
the re-exported HTML preserves the alignment. Covers all four
values: left, center, right, justify.

Without the collectContentPre style-parsing fix in the previous
commit, these would all fail because contentcollector was passing
the inline style through context.styl but ep_align was only
reading the legacy <left>/<center>/<right>/<justify> tag names.
The two new HTML-import-adjacency tests assume ep_headings2 (or
another plugin) has registered h1/h2 as server-side block elements
via ccRegisterBlockElements. Without that, contentcollector treats
<h1>/<h2> as inline and adjacent ones merge into a single pad line
-- making the assertions inapplicable.

CI's backend-tests job runs without plugins installed, so guard
the describe block with a runtime hooks.callAll() check and skip
when h1 isn't a registered block. Local dev with ep_headings2 (and
the local plugin patch wiring ccRegisterBlockElements) still
exercises both tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear JohnMcLear marked this pull request as ready for review May 8, 2026 17:25
@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 8, 2026

Persistent review updated to latest commit 17bf820

// hand DOCX to html-to-docx and PDF to our pdfkit walker — both
// pure-JS, in-process. No fallback chain: native errors surface as
// 5xx so admins see real failures instead of silent shadowing.
const {sofficeAvailable} = require('../utils/Settings');
Member

At some point we should migrate to ESM. It's such a pain in editors.

Member Author

Noted, I'll add it to the list :)

@JohnMcLear JohnMcLear merged commit c47ffd5 into ether:develop May 8, 2026
21 checks passed
Comment on lines +93 to +155
// Soffice-first dispatch (issue #7538). When soffice is configured
// we keep the legacy convert-via-tempfile path; when it's not, we
// hand DOCX to html-to-docx and PDF to our pdfkit walker — both
// pure-JS, in-process. No fallback chain: native errors surface as
// 5xx so admins see real failures instead of silent shadowing.
const {sofficeAvailable} = require('../utils/Settings');
const sofState = sofficeAvailable();
const goNative = sofState === 'no'
    || (sofState === 'withoutPDF' && type === 'pdf');

if (goNative) {
  const {
    stripRemoteImages, extractBody, wrapLooseLines, dropEmptyBlocks,
    applyMonospaceToCode,
  } = require('../utils/ExportSanitizeHtml');
  // The HTML pipeline returns a full document (head, style, body); the
  // legacy soffice path renders that fine, but the in-process
  // converters need just the body content to avoid leaking CSS into
  // the output and to drop the document-level whitespace that creates
  // stray paragraph breaks at the top of the result.
  // dropEmptyBlocks strips heading-styled blank-line wrappers that
  // ep_headings2 emits between every styled line.
  const bodyHtml = dropEmptyBlocks(stripRemoteImages(extractBody(html)));
  html = null;
  try {
    if (type === 'docx') {
      // applyMonospaceToCode strips `<code>`/`<pre>`/`<tt>` wrappers
      // (html-to-docx ignores them AND has a bug where it drops
      // `<a href>` children of those tags) and emits styled
      // monospace spans, forwarding any block-level alignment style
      // to a wrapping `<p>`. Run BEFORE wrapLooseLines so the
      // resulting `<p>` lands at the loose-line boundary instead
      // of getting double-wrapped.
      //
      // wrapLooseLines then handles `<br>` semantics: bare `<br>`
      // outside `<p>` becomes a soft break, `<br><br>` becomes a
      // paragraph boundary plus blank-line markers.
      const docxHtml = wrapLooseLines(applyMonospaceToCode(bodyHtml));
      const htmlToDocx = require('html-to-docx');
      const buf = await htmlToDocx(docxHtml);
      res.contentType(
          'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
      res.send(buf);
      return;
    }
    if (type === 'pdf') {
      const {htmlToPdfBuffer} = require('../utils/ExportPdfNative');
      const buf = await htmlToPdfBuffer(bodyHtml);
      res.contentType('application/pdf');
      res.send(buf);
      return;
    }
    // soffice-only formats (odt, doc) are blocked at the route guard
    // when soffice is null; reaching here means the guard is wrong.
    res.status(500).send(`Cannot export ${type} without soffice configured`);
    return;
  } catch (err) {
    console.error(
        `native ${type} export failed for pad "${padId}":`,
        err && (err as Error).stack ? (err as Error).stack : err);
    res.status(500).send(`Failed to export pad as ${type}.`);
    return;
  }

Action required

1. Native docx/pdf lacks flag 📘 Rule violation ☼ Reliability

Native DOCX/PDF export (and DOCX import) is enabled automatically when settings.soffice is null,
which is the documented default, so the feature is effectively enabled by default without a
dedicated feature flag. This violates the requirement that new features be flag-gated and disabled
by default to avoid unexpected behavior changes for existing deployments.
Agent Prompt
## Issue description
Native DOCX/PDF export (and DOCX import) is activated automatically when `settings.soffice` is `null` (the documented default), so the new feature is enabled by default and is not controlled by an explicit feature flag.

## Issue Context
PR Compliance ID 8 requires new features to be behind a feature flag and disabled by default, with pre-change behavior preserved when the flag is disabled.

## Fix Focus Areas
- src/node/handler/ExportHandler.ts[93-155]
- src/node/hooks/express/importexport.ts[39-43]
- src/static/js/pad_impexp.ts[147-160]
- doc/docker.md[200-200]

Comment on lines +203 to +210
ontext(text) {
if (skipDepth > 0) return;
// Collapse consecutive whitespace to a single space, the way an
// HTML renderer would. Without this, literal newlines and tabs in
// pretty-printed source HTML show up as runs of " " in the PDF.
const collapsed = text.replace(/[\s ]+/g, ' ');
if (collapsed === ' ') return; // pure-whitespace runs are dropped
writeText(collapsed);

Action required

2. Pdf drops single spaces 🐞 Bug ≡ Correctness

htmlToPdfBuffer() drops any text node that collapses to a single space, but Etherpad’s HTML export
can legitimately emit a standalone space between inline tags (e.g., style boundary around a space).
This can cause words to concatenate in native PDF exports.
Agent Prompt
### Issue description
`src/node/utils/ExportPdfNative.ts` drops whitespace-only text nodes that collapse to a single space. In HTML, a single inter-tag space between inline elements is significant to how the text renders; dropping it can merge adjacent words.

### Issue Context
Etherpad’s HTML export can produce inline-tag boundaries around characters (including spaces) due to attribute span transitions, and `_processSpaces()` preserves interior normal spaces.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[203-210]

### Implementation direction
Adjust the `ontext()` handling to *not* blindly drop `collapsed === ' '`. Suggested approach:
- Track whether the current PDF line/run is at a “text start” (or whether the last emitted character was whitespace).
- Emit a single space when it is needed to separate tokens (e.g., if the previous emitted character is non-whitespace and the next text will be non-whitespace), while still avoiding indentation/pretty-print whitespace.
A minimal improvement is to keep `collapsed === ' '` unless you are at the beginning of a line/run (immediately after `breakLine()`/`flushLine()`) or you already emitted a space.
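
One way to read the "minimal improvement" above is sketched here as a small state holder, separate from pdfkit. The class name `TextCollapser` and its two flags are hypothetical; wiring it into the walker's `ontext`, `breakLine()` and `flushLine()` is left as an assumption.

```typescript
// Hypothetical helper: decides what each HTML text node contributes to the PDF line.
class TextCollapser {
  private atLineStart = true;
  private lastWasSpace = false;

  // Returns the text to emit for one text node, or '' to emit nothing.
  push(text: string): string {
    let out = text.replace(/\s+/g, ' ');
    // Drop a leading space only when it would be redundant:
    // at the start of a line, or right after an already emitted space.
    if ((this.atLineStart || this.lastWasSpace) && out.charAt(0) === ' ') {
      out = out.slice(1);
    }
    if (out === '') return '';
    this.atLineStart = false;
    this.lastWasSpace = out.charAt(out.length - 1) === ' ';
    return out;
  }

  // Call whenever the walker breaks or flushes the current line.
  lineBreak(): void {
    this.atLineStart = true;
    this.lastWasSpace = false;
  }
}
```

With this sketch, the node sequence `Hello`, ` `, `world` yields `Hello world`, while indentation whitespace at the start of a line still yields nothing.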

Comment on lines +188 to +190
case 'br':
breakLine();
break;

Action required

3. Pdf adds extra breaks 🐞 Bug ≡ Correctness

Etherpad’s HTML export appends a <br> after every line, including lines wrapped in block tags like
<h1>/<h2>. The PDF walker inserts a line break on both </h1..> close and on every <br>, which
adds extra vertical spacing for heading/block lines in native PDF exports.
Agent Prompt
### Issue description
Native PDF export can add extra vertical spacing when Etherpad emits block-level tags for a line (e.g., `<h1>...</h1>`) and still appends the per-line `<br>` separator. The PDF walker breaks on both the block close and the `<br>`.

### Issue Context
`ExportHtml` always appends `<br>` between pad lines. Headings are exported as `<h1>`/`<h2>` (block tags) when heading attributes are present.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[188-190]
- src/node/utils/ExportPdfNative.ts[223-227]
- (optional pre-processing alternative) src/node/handler/ExportHandler.ts[138-143]

### Implementation direction
Choose one (or combine):
1. **Walker-side fix:** In `onopentag('br')`, if `pendingNewline` is true (meaning a block close already ended the line), treat the `<br>` as a no-op (or just clear `pendingNewline` without calling `breakLine()`).
2. **Pre-processing fix:** Before calling `htmlToPdfBuffer()`, run the existing `collapseRedundantBrAfterBlocks()` sanitizer on `bodyHtml` so sequences like `</h1><br>` become `</h1>`.

JohnMcLear added a commit to ether/ep_headings2 that referenced this pull request May 8, 2026
ep_headings2 already registers h1-h4 + code as block elements in
the editor (aceRegisterBlockElements). It did NOT register them
on the server-side content collector (ccRegisterBlockElements),
so any HTML import path treated those tags as inline -- and
adjacent <h1>/<h2>/<code> blocks merged into a single pad line.

Fix: re-export the new ccRegisterBlockElements function from
ep_plugin_helpers' lineAttribute factory (added in
ether/ep_plugin_helpers#14) and wire it up via ep.json. This
treats h1-h4 + code as block elements on the import side too,
matching their editor behavior.

Bumps the ep_plugin_helpers minimum to ^0.5.2 (the version with
the new export).

Test added covers the regression: a pad with adjacent <h1> and
<h2> (no separator) survives a setHTML/getHTML round-trip with
each heading on its own line and no merged 'AlphaBeta' content.

Refs ether/etherpad#7568 -- this is the missing piece for the
DOCX/HTML round-trip story landing in core.
JohnMcLear added a commit to ether/ep_font_color that referenced this pull request May 8, 2026
ep_font_color emits 'class="color:<name>"' on export (via
inlineAttributeExport.getLineHTMLForExport). The import side, via
inlineAttribute.collectContentPre, also reads the class attribute --
so an export → import round-trip works.

But external HTML (Word/Docx imports via mammoth, pasted from a
browser, etc.) uses the standard CSS form 'style="color:red"'. The
inlineAttribute factory's collectContentPre does not touch
context.styl, so any externally-pasted color was silently dropped.

Read 'color:...' from the inline style attribute (etherpad core's
contentcollector exposes it as context.styl) and apply the
matching toolbar-palette value. Hex values that map to named
palette colors are folded back to the name; everything else falls
through.

Refs ether/etherpad#7568 -- closes the inline-color hole in the
DOCX/HTML round-trip story.

Tests added: round-trip via 'style="color:<name>"' for red,
green, blue.
JohnMcLear added a commit to ether/ep_font_size that referenced this pull request May 8, 2026
ep_font_size emits 'class="font-size:<N>"' on export. The factory
also reads class on import, so the export → import round-trip
works for our own output.

External HTML (Word/DOCX via mammoth, pasted markup) uses the
standard CSS form 'style="font-size:14px"'. The factory does not
touch context.styl, so any externally-pasted size was silently
dropped.

Read 'font-size:...' from the inline style attribute and snap the
value to the nearest supported toolbar size. Handles px, pt, em,
rem with light-touch unit conversion.

Refs ether/etherpad#7568. Sister PRs: ep_align#183 (text-align,
merged), ep_font_color#150 (color).
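The snapping described above might look roughly like the following. The toolbar size list and the 16px base for em/rem are assumptions made for illustration; the plugin's actual supported sizes and conversion factors are not stated in this thread.

```typescript
// Assumed toolbar sizes, for illustration only.
const TOOLBAR_SIZES: number[] = [8, 9, 10, 11, 12, 14, 16, 18, 20, 24];

// Converts a CSS length to px; returns undefined for unsupported input.
function toPx(value: string): number | undefined {
  const m = /^\s*([0-9]*\.?[0-9]+)\s*(px|pt|em|rem)\s*$/i.exec(value);
  if (m === null) return undefined;
  const n = parseFloat(m[1]);
  switch (m[2].toLowerCase()) {
    case 'px': return n;
    case 'pt': return n * 4 / 3;  // CSS defines 1pt = 1/72in and 1px = 1/96in
    default: return n * 16;       // em/rem against an assumed 16px base
  }
}

// Hypothetical helper: snaps a CSS font-size to the nearest toolbar size.
function snapFontSize(value: string, sizes: number[] = TOOLBAR_SIZES): number | undefined {
  const px = toPx(value);
  if (px === undefined) return undefined;
  let best = sizes[0];
  for (const s of sizes) {
    if (Math.abs(s - px) < Math.abs(best - px)) best = s;
  }
  return best;
}
```

Keyword sizes such as `large` return `undefined` in this sketch, so the caller can leave the text unstyled rather than guess.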
JohnMcLear added a commit to ether/ep_font_family that referenced this pull request May 9, 2026
* fix: read font-family from inline style on HTML import

ep_font_family stores the font as a custom tag (<fontarial>...) but
its getLineHTMLForExport rewrites those tags into standard CSS
'<span style="font-family:arial">...</span>' on export. The import
side, via tagAttribute.collectContentPre, only looks for the tag
form -- so any round-trip through HTML or DOCX silently lost the
font.

Read 'font-family:...' from the inline style attribute (etherpad
core's contentcollector exposes context.styl), normalize the value
back to one of the toolbar tag names ('font' + lowercase + spaces
to hyphens), and apply the matching attribute. Handles quoted
values, the first font in a fallback list, and 'monospace' as a
generic family.

Refs ether/etherpad#7568. Sister PRs: ep_align#183 (text-align,
merged), ep_font_color#150, ep_font_size#132.

Tests added: round-trip via 'style="font-family:<value>"' for
Arial, 'Times New Roman', courier.
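
The normalization rule quoted above ('font' + lowercase + spaces to hyphens) can be sketched as follows. `fontFamilyToTag` is a hypothetical name, and this sketch does not check the result against the toolbar's actual font list, which the plugin presumably does.

```typescript
// Hypothetical helper: turns a CSS font-family value into a tag-style name.
function fontFamilyToTag(styleValue: string): string | undefined {
  // Take the first family in a fallback list and strip surrounding quotes.
  const first = styleValue.split(',')[0].trim().replace(/^['"]|['"]$/g, '');
  if (first === '') return undefined;
  return 'font' + first.toLowerCase().replace(/\s+/g, '-');
}
```

For example, `"Times New Roman", serif` maps to `fonttimes-new-roman` and `Arial` maps to `fontarial` under this rule.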

* test: drop unused expectedRe loop variable

CodeQL flagged the unused destructured param. The actual regex is
built inline below; the array slot wasn't doing anything.
JohnMcLear added a commit that referenced this pull request May 9, 2026
Native DOCX export, PDF export, and DOCX import shipped in #7568
via pure-JS in-process converters -- LibreOffice/soffice is no
longer required for those formats. Stale comments in
settings.json.template and settings.json.docker still implied
otherwise ("will only allow plain text and HTML import/exports"),
and the docker docs told users to configure soffice for DOCX as
well. Update them to match what's actually in core:

- soffice present: handles all office formats (existing behavior)
- soffice null: docx export, pdf export, docx import work
  natively; odt/doc/rtf export and pdf import still need soffice

Touches:
- settings.json.template (soffice + docxExport comments)
- settings.json.docker (same)
- doc/docker.md ("Office-format import/export" section)
- doc/docker.adoc (same section + the SOFFICE table row,
  matching what doc/docker.md already says since #7568)

No code changes, no behavior change -- documentation only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Support docx/pdf import/export natively

3 participants