feat(export): native DOCX export via html-to-docx (opt-in) by JohnMcLear · Pull Request #7568 · ether/etherpad

JohnMcLear · 2026-04-20T08:34:14Z

Summary

Closes #7538. Replaces the original flag-gated DOCX-only path with a complete soffice-free import/export story:

Export: With settings.soffice = null, pads can be exported as html, txt, etherpad, docx, pdf — all in-process, no subprocess, no native binaries.
Import: .html, .txt, .etherpad, .docx — all in-process.
soffice configured: behavior is unchanged bit-for-bit. There is no opt-in flag.

Selection model

A single cascade in ExportHandler.ts and ImportHandler.ts:

sofficeAvailable() === 'yes' → existing soffice path
'withoutPDF' (Windows) → soffice for everything except pdf, which goes native
'no' (soffice null) → native DOCX/PDF (export) and native DOCX (import); ODT/DOC/RTF (and PDF import) remain blocked with a clear message

No fallback chain — if a native converter throws, the response is 5xx with the error logged. Mirrors the spec's "soffice if installed, native otherwise, fail clearly" decision.
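
The cascade above can be read as a pure decision function. The following is a minimal sketch under stated assumptions: `SofficeState`, `ExportType`, `ExportPath`, and `chooseExportPath` are hypothetical names used for illustration and are not taken from the PR, which performs the dispatch inline in the handlers.

```typescript
// Hypothetical types and function; only the three state strings
// ('yes', 'withoutPDF', 'no') come from the PR description.
type SofficeState = 'yes' | 'withoutPDF' | 'no';
type ExportType = 'docx' | 'pdf' | 'odt' | 'doc';
type ExportPath = 'soffice' | 'native' | 'blocked';

function chooseExportPath(state: SofficeState, type: ExportType): ExportPath {
  // soffice configured and fully usable: nothing changes.
  if (state === 'yes') return 'soffice';
  // Windows case: soffice handles everything except PDF.
  if (state === 'withoutPDF') return type === 'pdf' ? 'native' : 'soffice';
  // state === 'no': only formats with a native converter are served.
  // There is no fallback; other formats are rejected up front.
  return type === 'docx' || type === 'pdf' ? 'native' : 'blocked';
}
```

Keeping the decision free of side effects makes each row of the cascade checkable in isolation; conversion errors are a separate concern and would still surface as 5xx in the caller.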

Native converters

| Format | Library | Approach |
| --- | --- | --- |
| DOCX export | `html-to-docx` | already in PR |
| PDF export | `pdfkit` + `htmlparser2` | small walker (~170 LOC); pure JS, no jsdom |
| DOCX import | `mammoth` | mammoth → HTML → existing setPadHTML pipeline |

Total runtime install added: ~10–12 MB. Compared with ~500 MB for LibreOffice or ~200 MB for puppeteer, this is the right tradeoff for the structural-fidelity bar #7538 calls out.

Hardening

New stripRemoteImages sanitizer removes any <img> whose src is not data: or relative. Both DOCX and PDF export branches run plugin-modified HTML through it before conversion, closing Qodo's SSRF finding (#4) on the existing html-to-docx path and preventing the equivalent issue on PDF.
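
To illustrate the rule described above (keep `data:` and relative sources, drop everything else), here is a simplified regex-based sketch. It is not the PR's implementation: `stripRemoteImagesSketch` and `isLocalSrc` are hypothetical names, the real sanitizer may parse the HTML properly, and this version simply drops the tag rather than substituting alt text.

```typescript
// Hypothetical helper: decides whether an <img> source is safe to keep.
const isLocalSrc = (src: string): boolean => {
  const s = src.trim().toLowerCase();
  if (s.startsWith('data:')) return true;
  // Protocol-relative URLs (//host/x) resolve to a remote host.
  if (s.startsWith('//')) return false;
  // Anything else carrying a URL scheme (http:, https:, file:, ...) is
  // treated as remote; a source without a scheme is treated as relative.
  return !/^[a-z][a-z0-9+.-]*:/.test(s);
};

// Simplified sketch. Limitation: the tag regex assumes attribute values
// do not contain '>', which a real HTML parser would not need to assume.
function stripRemoteImagesSketch(html: string): string {
  return html.replace(/<img\b[^>]*>/gi, (tag) => {
    const m = /\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i.exec(tag);
    const src = m ? (m[1] ?? m[2] ?? m[3] ?? '') : '';
    // Assumption: images with no usable src are dropped as well.
    return src !== '' && isLocalSrc(src) ? tag : '';
  });
}
```

The point of running this before conversion is that the converter never sees a URL it might fetch on the server's behalf.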

Out of scope (follow-ups)

Native ODT export — no maintained pure-JS writer in the ecosystem.
Native PDF / ODT / DOC / RTF import — no mature pure-JS readers.
Memory/timeout caps on conversion — add when production signal warrants.

Test plan

pnpm run ts-check clean
Backend tests pass on a clean checkout: 1012 / 1012
Unit + integration coverage for sanitizer, walker, mammoth wrapper, DOCX/PDF export endpoints, DOCX import endpoint, ODT negative cases
CI green
Manual: with SOFFICE=null, export DOCX and PDF; both produce valid files
Manual: with SOFFICE=null, import the fixture .docx and verify pad content


qodo-free-for-open-source-projects · 2026-04-20T08:34:35Z

Review Summary by Qodo

(Agentic_describe updated until commit `17bf820`)

Native DOCX/PDF export and DOCX import without LibreOffice (soffice-optional)

✨ Enhancement 🧪 Tests 📝 Documentation

Walkthroughs

Description

• **Native DOCX/PDF export and DOCX import without LibreOffice:** Replaces flag-gated DOCX path with
  complete soffice-free import/export story using pure-JavaScript converters (html-to-docx,
  pdfkit, mammoth)
• **Soffice-first dispatch model:** Single cascade in ExportHandler and ImportHandler — uses
  soffice if available, falls back to native converters when soffice = null, fails clearly with 5xx
  errors if conversion fails
• **HTML sanitization hardening:** New stripRemoteImages sanitizer removes remote <img> tags
  (http/https/protocol-relative) to close SSRF vulnerability; applied to both DOCX and PDF export
  branches
• **Pure-JavaScript PDF export:** pdfkit + htmlparser2 walker (~170 LOC) supporting block
  elements, inline styling, links, data URI images, text alignment, and list nesting — no jsdom or
  native binaries
• **Native DOCX import via mammoth:** Converts DOCX buffers to HTML with alignment preservation and
  base64 image embedding; integrates into existing setPadHTML pipeline
• **Comprehensive test coverage:** Unit and integration tests for sanitizer, PDF walker, DOCX
  import, export endpoints, round-trip fidelity, and negative cases for unsupported formats
• **UI always shows DOCX/PDF links:** Removed conditional hiding since native converters are
  built-in; ODT remains gated on soffice availability
• **Documentation and design specs:** Added implementation plan and design specification detailing
  problem statement, selection model, error handling, and out-of-scope follow-ups
• **Minimal runtime overhead:** ~10–12 MB added dependencies vs. ~500 MB for LibreOffice or ~200 MB
  for puppeteer

Diagram

flowchart LR
  A["Export/Import Request"] --> B{"soffice Available?"}
  B -->|"yes"| C["Use soffice<br/>all formats"]
  B -->|"withoutPDF<br/>Windows"| D["soffice for most<br/>native PDF"]
  B -->|"no"| E["Native converters"]
  E --> F["DOCX: html-to-docx"]
  E --> G["PDF: pdfkit walker"]
  E --> H["DOCX import: mammoth"]
  F --> I["HTML Sanitization<br/>stripRemoteImages"]
  G --> I
  H --> J["HTML Pipeline"]
  I --> K["Convert to Buffer"]
  J --> L["Set Pad Content"]

File Changes

1. src/tests/backend/specs/export.ts 🧪 Tests +536/-1

Comprehensive test coverage for native export and sanitization

• Added comprehensive test suite for native DOCX export with settings.soffice = null, verifying
 ZIP signature and content-type
• Added native PDF export tests validating %PDF- header and application/pdf content-type
• Added negative test for ODT export without soffice
• Added unit tests for stripRemoteImages sanitizer covering data URIs, relative URLs, and remote
 URL removal
• Added unit tests for HTML sanitization helpers: extractBody, wrapLooseLines,
 dropEmptyBlocks, collapseRedundantBrAfterBlocks, separateAdjacentHeadingBlocks,
 applyMonospaceToCode
• Added integration tests for htmlToPdfBuffer covering text rendering, links, images, alignment,
 and monospace fonts

src/tests/backend/specs/export.ts
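
The signature checks mentioned above (ZIP signature for DOCX, `%PDF-` header for PDF) amount to comparing the first few bytes of the response body. A minimal sketch, assuming the body is available as a `Uint8Array`; `startsWithAscii`, `looksLikeDocx`, and `looksLikePdf` are hypothetical names and not the helpers used in the PR's test suite:

```typescript
// Hypothetical helper: true when `bytes` begins with the ASCII `prefix`.
function startsWithAscii(bytes: Uint8Array, prefix: string): boolean {
  if (bytes.length < prefix.length) return false;
  for (let i = 0; i < prefix.length; i++) {
    if (bytes[i] !== prefix.charCodeAt(i)) return false;
  }
  return true;
}

// A DOCX file is a ZIP archive, and ZIP local file headers begin with "PK".
const looksLikeDocx = (body: Uint8Array): boolean => startsWithAscii(body, 'PK');
// PDF files begin with "%PDF-" followed by the version number.
const looksLikePdf = (body: Uint8Array): boolean => startsWithAscii(body, '%PDF-');
```

A header check only shows that the converter produced the right container; it says nothing about whether the content inside is correct.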

2. src/tests/backend/specs/import.ts 🧪 Tests +473/-0

Native DOCX import and round-trip fidelity tests

• New file with complete import test suite for native DOCX import via mammoth
• Tests docxBufferToHtml conversion preserving headings, paragraphs, lists, and alignment
• Tests end-to-end DOCX import without soffice and rejection of ODT when soffice is null
• Tests round-trip fidelity for txt, etherpad, html, and docx formats
• Tests HTML import with adjacent headings and blank-line preservation
• Tests heading-style content round-trip integrity through DOCX export/import cycle

src/tests/backend/specs/import.ts

3. src/node/utils/ExportSanitizeHtml.ts ✨ Enhancement +215/-0

HTML sanitization and transformation for export converters

• New module providing HTML sanitization and transformation utilities for export converters
• extractBody pulls <body> content from full HTML documents, dropping <head> and styles
• stripRemoteImages removes <img> tags with remote URLs (http/https/protocol-relative),
 replacing with alt text
• wrapLooseLines wraps loose text in <p> tags and converts <br><br> sequences to paragraph
 breaks with empty <p></p> markers for blank lines
• dropEmptyBlocks iteratively removes empty heading/code/div blocks while preserving empty <p>
 markers
• collapseRedundantBrAfterBlocks removes <br> immediately after closing block tags
• separateAdjacentHeadingBlocks inserts <br> between adjacent heading-style blocks for proper
 line separation
• applyMonospaceToCode converts <code>, <pre>, <tt>, <kbd>, <samp> to Courier-styled
 spans, handling block-level alignment and preserving nested anchors

src/node/utils/ExportSanitizeHtml.ts


4. src/node/utils/ExportPdfNative.ts ✨ Enhancement +248/-0

Pure-JavaScript PDF export renderer using pdfkit

• New module implementing pure-JavaScript PDF export via pdfkit and htmlparser2
• htmlToPdfBuffer parses HTML with SAX-style event stream and renders to PDF using pdfkit
• Supports block elements (p, h1-h6, ul/ol/li, blockquote, pre, div), inline styling (bold, italic,
 underline, strike), links with annotations, and data URI images
• Handles text alignment (left, center, right, justify) on paragraphs and code blocks
• Maintains style stack for nested elements and list nesting with bullets/numbers
• Skips head/style/script/title/meta/link/noscript tags to prevent metadata leakage
• Collapses whitespace in text content and decodes base64 data URIs for image embedding

src/node/utils/ExportPdfNative.ts

5. src/node/utils/ImportDocxNative.ts ✨ Enhancement +83/-0

Native DOCX import via mammoth with alignment preservation

• New module implementing native DOCX import via mammoth library
• docxBufferToHtml converts DOCX buffers to HTML using mammoth with empty paragraph preservation
• Extracts paragraph alignment from DOCX <w:jc> elements and applies as CSS text-align styles to
 output HTML
• Embeds images as base64 data URIs to avoid external fetches
• Maps Word alignment values (left, center, right, justify, distribute) to CSS equivalents

src/node/utils/ImportDocxNative.ts

6. src/node/handler/ExportHandler.ts ✨ Enhancement +66/-1

Soffice-first export dispatch with native DOCX/PDF paths

• Replaced flag-gated DOCX export with soffice-first cascade dispatch model
• When sofficeAvailable() === 'no', routes DOCX to html-to-docx and PDF to native pdfkit
 walker
• When sofficeAvailable() === 'withoutPDF' (Windows), routes PDF to native converter while other
 formats use soffice
• Applies HTML sanitization pipeline (stripRemoteImages, extractBody, dropEmptyBlocks,
 applyMonospaceToCode, wrapLooseLines) before native conversion
• Native conversion errors surface as 5xx with logged error details; no fallback chain
• Sets correct content-type headers for DOCX
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document) and PDF
 (application/pdf)

src/node/handler/ExportHandler.ts

7. src/node/handler/ImportHandler.ts ✨ Enhancement +80/-1

Native DOCX import with soffice-first cascade and sanitization

• Added soffice-first cascade for import format selection
• When soffice == null and file is .docx, routes to native mammoth converter producing HTML
• Detects whether ep_headings2 registers h1-h6 as block elements server-side; applies
 separateAdjacentHeadingBlocks workaround when missing
• Rejects .pdf, .odt, .doc, .rtf with explicit error when soffice is null (no silent
 fallback)
• Applies collapseRedundantBrAfterBlocks sanitization to HTML imports and soffice-converted
 outputs to prevent blank-line duplication

src/node/handler/ImportHandler.ts

8. src/node/hooks/express/importexport.ts ✨ Enhancement +4/-2

Route guard allows native DOCX and PDF export paths

• Tightened export route guard to reject only ['odt', 'doc'] when soffice is disabled (was
 ['odt', 'pdf', 'doc', 'docx'])
• PDF and DOCX now fall through to ExportHandler which dispatches to native converters when
 soffice is null
• Updated comment to clarify that native paths handle DOCX and PDF

src/node/hooks/express/importexport.ts

9. src/static/js/pad_impexp.ts ✨ Enhancement +7/-13

Always show DOCX and PDF export links in UI

• Removed conditional hiding of DOCX and PDF export links based on exportAvailable flag
• DOCX and PDF links now always visible since native converters are built-in
• ODT link remains gated on exportAvailable === 'yes' (soffice required)
• Simplified UI logic by removing withoutPDF branch special handling

src/static/js/pad_impexp.ts

10. docs/superpowers/specs/2026-05-08-native-docx-pdf-export-import-design.md 📝 Documentation +230/-0

Design specification for native export/import without LibreOffice

• New comprehensive design specification for native DOCX/PDF export and DOCX import without
 LibreOffice
• Documents problem statement, goals, non-goals, and selection model for soffice-first dispatch
• Specifies route guard and UI capability changes
• Details native PDF export approach using pdfkit walker with bail-out criterion (~500 lines max)
• Specifies HTML sanitization defense-in-depth against SSRF via stripRemoteImages
• Documents native DOCX import via mammoth wrapper
• Includes error handling strategy, test plan, file manifest, and dependency summary
• Lists out-of-scope follow-ups (ODT export, PDF/ODT/DOC/RTF import, memory caps)

docs/superpowers/specs/2026-05-08-native-docx-pdf-export-import-design.md

11. src/package.json Dependencies +5/-0

Add native export/import library dependencies

• Added html-to-docx (^1.8.0) for native DOCX export
• Added htmlparser2 (^12.0.0) for HTML parsing in sanitizer and PDF walker
• Added mammoth (^1.12.0) for native DOCX import
• Added pdfkit (^0.18.0) for native PDF export
• Added @types/pdfkit (^0.17.6) to dev dependencies for TypeScript support

src/package.json

12. doc/docker.md 📝 Documentation +1/-1

Document soffice configuration and native converter fallback

• Updated SOFFICE environment variable documentation to clarify behavior with and without
 LibreOffice
• Explains that when soffice is configured, all advanced formats use it
• Documents that when soffice is null, in-process converters handle DOCX/PDF export and DOCX import
• Clarifies that ODT/DOC/RTF and PDF import remain unavailable without soffice

doc/docker.md

13. pnpm-lock.yaml Dependencies +700/-16

Add native document conversion library dependencies

• Added html-to-docx@1.8.0 dependency for native DOCX export functionality
• Added htmlparser2@12.0.0 dependency for HTML parsing in conversion workflows
• Added mammoth@1.12.0 dependency for native DOCX import capability
• Added pdfkit@0.18.0 dependency for native PDF export generation
• Added @types/pdfkit@0.17.6 TypeScript type definitions for pdfkit
• Added transitive dependencies for font handling, compression, DOM manipulation, and image
 processing

pnpm-lock.yaml

14. docs/superpowers/plans/2026-05-08-native-docx-pdf-export-import.md 📝 Documentation +1543/-0

Implementation plan for native office format converters

• Comprehensive implementation plan for native DOCX/PDF export and DOCX import without soffice
• Detailed task breakdown (Tasks 0–10) with step-by-step instructions, code snippets, and test
 expectations
• Covers dependency management, HTML sanitization, PDF walker via pdfkit, DOCX import via mammoth,
 handler refactoring, route guard updates, UI changes, and settings cleanup
• Includes self-review checklist, bail-out criterion for PDF walker complexity, and Qodo security
 finding responses

docs/superpowers/plans/2026-05-08-native-docx-pdf-export-import.md

15. src/tests/backend/specs/fixtures/sample.docx 🧪 Tests +0/-0

Test fixture for DOCX import/export validation

• New DOCX fixture file containing heading, paragraph, and bullet list content
• Generated deterministically via html-to-docx library to support import/export test cases
• Binary OOXML format (ZIP-based) with standard Word document structure

src/tests/backend/specs/fixtures/sample.docx

qodo-free-for-open-source-projects · 2026-04-20T08:34:36Z

Code Review by Qodo

🐞 Bugs (6) 📘 Rule violations (2) 📎 Requirement gaps (1)

Context used

✅ Tickets: 🎫 Support Headings, etc. 🎫 Support docx/pdf import/export natively

1. Native DOCX/PDF lacks flag 📘 Rule violation ☼ Reliability ⭐ New

Description

Native DOCX/PDF export (and DOCX import) is enabled automatically when settings.soffice is null,
which is the documented default, so the feature is effectively enabled by default without a
dedicated feature flag. This violates the requirement that new features be flag-gated and disabled
by default to avoid unexpected behavior changes for existing deployments.

Code

src/node/handler/ExportHandler.ts[R93-155]

+    // Soffice-first dispatch (issue #7538). When soffice is configured
+    // we keep the legacy convert-via-tempfile path; when it's not, we
+    // hand DOCX to html-to-docx and PDF to our pdfkit walker — both
+    // pure-JS, in-process. No fallback chain: native errors surface as
+    // 5xx so admins see real failures instead of silent shadowing.
+    const {sofficeAvailable} = require('../utils/Settings');
+    const sofState = sofficeAvailable();
+    const goNative = sofState === 'no'
+        || (sofState === 'withoutPDF' && type === 'pdf');
+
+    if (goNative) {
+      const {
+        stripRemoteImages, extractBody, wrapLooseLines, dropEmptyBlocks,
+        applyMonospaceToCode,
+      } = require('../utils/ExportSanitizeHtml');
+      // The HTML pipeline returns a full document (head, style, body); the
+      // legacy soffice path renders that fine, but the in-process
+      // converters need just the body content to avoid leaking CSS into
+      // the output and to drop the document-level whitespace that creates
+      // stray paragraph breaks at the top of the result.
+      // dropEmptyBlocks strips heading-styled blank-line wrappers that
+      // ep_headings2 emits between every styled line.
+      const bodyHtml = dropEmptyBlocks(stripRemoteImages(extractBody(html)));
+      html = null;
+      try {
+        if (type === 'docx') {
+          // applyMonospaceToCode strips `<code>`/`<pre>`/`<tt>` wrappers
+          // (html-to-docx ignores them AND has a bug where it drops
+          // `<a href>` children of those tags) and emits styled
+          // monospace spans, forwarding any block-level alignment style
+          // to a wrapping `<p>`. Run BEFORE wrapLooseLines so the
+          // resulting `<p>` lands at the loose-line boundary instead
+          // of getting double-wrapped.
+          //
+          // wrapLooseLines then handles `<br>` semantics: bare `<br>`
+          // outside `<p>` becomes a soft break, `<br><br>` becomes a
+          // paragraph boundary plus blank-line markers.
+          const docxHtml = wrapLooseLines(applyMonospaceToCode(bodyHtml));
+          const htmlToDocx = require('html-to-docx');
+          const buf = await htmlToDocx(docxHtml);
+          res.contentType(
+              'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+          res.send(buf);
+          return;
+        }
+        if (type === 'pdf') {
+          const {htmlToPdfBuffer} = require('../utils/ExportPdfNative');
+          const buf = await htmlToPdfBuffer(bodyHtml);
+          res.contentType('application/pdf');
+          res.send(buf);
+          return;
+        }
+        // soffice-only formats (odt, doc) are blocked at the route guard
+        // when soffice is null; reaching here means the guard is wrong.
+        res.status(500).send(`Cannot export ${type} without soffice configured`);
+        return;
+      } catch (err) {
+        console.error(
+            `native ${type} export failed for pad "${padId}":`,
+            err && (err as Error).stack ? (err as Error).stack : err);
+        res.status(500).send(`Failed to export pad as ${type}.`);
+        return;
+      }

Evidence
PR Compliance ID 8 requires new functionality to be behind a feature flag and disabled by default.
The diff adds an automatic dispatch to native DOCX/PDF when sofficeAvailable() reports no
(meaning settings.soffice == null), and documentation explicitly states the default SOFFICE
value is null and that null enables native converters for DOCX/PDF export and DOCX import.
src/node/handler/ExportHandler.ts[93-155]
doc/docker.md[200-200]
src/node/hooks/express/importexport.ts[39-43]
src/static/js/pad_impexp.ts[147-160]
Best Practice: Repository guidelines

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX/PDF export (and DOCX import) is activated automatically when `settings.soffice` is `null` (the documented default), so the new feature is enabled by default and is not controlled by an explicit feature flag.

## Issue Context
PR Compliance ID 8 requires new features to be behind a feature flag and disabled by default, with pre-change behavior preserved when the flag is disabled.

## Fix Focus Areas
- src/node/handler/ExportHandler.ts[93-155]
- src/node/hooks/express/importexport.ts[39-43]
- src/static/js/pad_impexp.ts[147-160]
- doc/docker.md[200-200]


2. PDF drops single spaces 🐞 Bug ≡ Correctness ⭐ New

Description

htmlToPdfBuffer() drops any text node that collapses to a single space, but Etherpad’s HTML export
can legitimately emit a standalone space between inline tags (e.g., style boundary around a space).
This can cause words to concatenate in native PDF exports.

Code

src/node/utils/ExportPdfNative.ts[R203-210]

+      ontext(text) {
+        if (skipDepth > 0) return;
+        // Collapse consecutive whitespace to a single space, the way an
+        // HTML renderer would. Without this, literal newlines and tabs in
+        // pretty-printed source HTML show up as runs of " " in the PDF.
+        const collapsed = text.replace(/[\s ]+/g, ' ');
+        if (collapsed === ' ') return;  // pure-whitespace runs are dropped
+        writeText(collapsed);

Evidence

The PDF walker collapses whitespace then returns early if the result is exactly ' ', so an
inter-tag space like </strong> <em> is removed. Etherpad’s HTML exporter appends escaped text
(including regular spaces) directly into the HTML stream while opening/closing tags based on
attribute spans; combined with _processSpaces (which preserves interior regular spaces), it is
possible for a space character to exist as its own text node between tags, and dropping it changes
the rendered text.

src/node/utils/ExportPdfNative.ts[203-210]
src/node/utils/ExportHtml.ts[217-260]
src/node/utils/ExportHtml.ts[536-575]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`src/node/utils/ExportPdfNative.ts` drops whitespace-only text nodes that collapse to a single space. In HTML, a single inter-tag space between inline elements is significant for both meaning and rendering; dropping it can merge adjacent words.

### Issue Context
Etherpad’s HTML export can produce inline-tag boundaries around characters (including spaces) due to attribute span transitions, and `_processSpaces()` preserves interior normal spaces.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[203-210]

### Implementation direction
Adjust the `ontext()` handling to *not* blindly drop `collapsed === ' '`. Suggested approach:
- Track whether the current PDF line/run is at a “text start” (or whether the last emitted character was whitespace).
- Emit a single space when it is needed to separate tokens (e.g., if the previous emitted character is non-whitespace and the next text will be non-whitespace), while still avoiding indentation/pretty-print whitespace.
A minimal improvement is to keep `collapsed === ' '` unless you are at the beginning of a line/run (immediately after `breakLine()`/`flushLine()`) or you already emitted a space.
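
The "minimal improvement" above can be sketched as a small state machine that keeps a lone space unless it falls at the start of a line or directly after another space. This is an illustration only: `InlineTextSketch` is a hypothetical class that writes into a string, whereas the real walker writes through pdfkit.

```typescript
// Hypothetical stand-in for the walker's text handling.
class InlineTextSketch {
  private out: string[] = [];
  private atLineStart = true;
  private lastWasSpace = false;

  // Called once per text node, as ontext() would be.
  text(raw: string): void {
    // Collapse whitespace runs the way an HTML renderer would.
    let s = raw.replace(/\s+/g, ' ');
    // A leading space is redundant at line start or after an emitted space;
    // this also discards indentation from pretty-printed source HTML.
    if (s.startsWith(' ') && (this.atLineStart || this.lastWasSpace)) {
      s = s.slice(1);
    }
    if (s === '') return;
    this.out.push(s);
    this.atLineStart = false;
    this.lastWasSpace = s.endsWith(' ');
  }

  // Called wherever the walker would end the current line.
  breakLine(): void {
    this.out.push('\n');
    this.atLineStart = true;
    this.lastWasSpace = false;
  }

  result(): string {
    return this.out.join('');
  }
}
```

With this rule, the text nodes of `<strong>foo</strong> <em>bar</em>` (`foo`, a single space, `bar`) come out as two separate words, while whitespace-only nodes between block elements are still discarded.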


3. PDF adds extra breaks 🐞 Bug ≡ Correctness ⭐ New

Description

Etherpad’s HTML export appends a <br> after every line, including lines wrapped in block tags like
<h1>/<h2>. The PDF walker inserts a line break both when a heading tag such as </h1> closes and on
every <br>, which adds extra vertical spacing for heading/block lines in native PDF exports.

Code

src/node/utils/ExportPdfNative.ts[R188-190]

+          case 'br':
+            breakLine();
+            break;

Evidence
ExportHtml unconditionally appends <br> after each line. ExportHtml can emit block tags for
headings (h1, h2) based on heading1/heading2 attributes, so a heading line becomes
<h1>...</h1><br>. In the PDF walker, </h1> triggers breakLine() and then the following <br>
triggers another breakLine(), effectively doubling spacing for those lines.
src/node/utils/ExportHtml.ts[51-53]
src/node/utils/ExportHtml.ts[489-506]
src/node/utils/ExportPdfNative.ts[188-190]
src/node/utils/ExportPdfNative.ts[223-227]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Native PDF export can add extra vertical spacing when Etherpad emits block-level tags for a line (e.g., `<h1>...</h1>`) and still appends the per-line `<br>` separator. The PDF walker breaks on both the block close and the `<br>`.

### Issue Context
`ExportHtml` always appends `<br>` between pad lines. Headings are exported as `<h1>`/`<h2>` (block tags) when heading attributes are present.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[188-190]
- src/node/utils/ExportPdfNative.ts[223-227]
- (optional pre-processing alternative) src/node/handler/ExportHandler.ts[138-143]

### Implementation direction
Choose one (or combine):
1. **Walker-side fix:** In `onopentag('br')`, if `pendingNewline` is true (meaning a block close already ended the line), treat the `<br>` as a no-op (or just clear `pendingNewline` without calling `breakLine()`).
2. **Pre-processing fix:** Before calling `htmlToPdfBuffer()`, run the existing `collapseRedundantBrAfterBlocks()` sanitizer on `bodyHtml` so sequences like `</h1><br>` become `</h1>`.
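
A rough sketch of the pre-processing option, assuming a regex is acceptable for the HTML Etherpad emits. `collapseBrAfterBlocksSketch` and the `BLOCK_TAGS` list are hypothetical; the PR's `collapseRedundantBrAfterBlocks` may cover a different tag set or be implemented differently.

```typescript
// Hypothetical list of block tags whose close already ends the line.
const BLOCK_TAGS = [
  'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'p', 'div', 'blockquote', 'pre', 'ul', 'ol', 'li',
];

// Matches a closing block tag followed by exactly one <br>, <br/> or <br />.
const BR_AFTER_BLOCK = new RegExp(
    `(</(?:${BLOCK_TAGS.join('|')})>)\\s*<br\\s*/?>`, 'gi');

// Removes only the first <br> after each closing block tag, so a
// deliberate blank line (a second <br>) is preserved.
function collapseBrAfterBlocksSketch(html: string): string {
  return html.replace(BR_AFTER_BLOCK, '$1');
}
```

Because only one `<br>` per block close is consumed, `</h1><br><br>` keeps a single `<br>`, which lets the blank line the author intended survive.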



4. DOCX export still needs soffice 📎 Requirement gap ≡ Correctness

Description

The new DOCX export path is opt-in (nativeDocxExport defaults to false) and explicitly falls
back to the existing LibreOffice/soffice path on error, so DOCX export is not fully free of a
LibreOffice runtime dependency. This fails the requirement to support DOCX export without requiring
LibreOffice for these formats.

Code

src/node/handler/ExportHandler.ts[R97-110]

+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }

Evidence
PR Compliance ID 1 requires DOCX/PDF support using native/local tooling with no runtime dependency
on LibreOffice. The added DOCX branch is gated behind settings.nativeDocxExport and, if conversion
fails, logs a warning and falls through to the LibreOffice export path, meaning LibreOffice remains
a required backstop in the DOCX export flow.
Native DOCX/PDF import/export support without Abiword/LibreOffice dependency
src/node/handler/ExportHandler.ts[97-110]
src/node/utils/Settings.ts[419-426]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Compliance requires DOCX export to work without a LibreOffice/`soffice` runtime dependency. The new native DOCX export is opt-in (disabled by default) and explicitly falls back to the LibreOffice path on error, so LibreOffice is still required as a backstop for DOCX export.
## Issue Context
Current implementation uses `html-to-docx` when `nativeDocxExport` is enabled, but catches errors and falls through to LibreOffice. This violates the stated objective of having DOCX export not depend on LibreOffice.
## Fix Focus Areas
- src/node/handler/ExportHandler.ts[97-110]
- src/node/utils/Settings.ts[419-426]


5. DOCX blocked without soffice 🐞 Bug ≡ Correctness

Description

Even with settings.nativeDocxExport=true, DOCX exports are rejected and hidden when settings.soffice
is null because exportAvailable() gates docx behind soffice in both the /export route and UI. This
makes the new native DOCX branch in ExportHandler unreachable in the documented no-LibreOffice
configuration.

Code

src/node/handler/ExportHandler.ts[R90-111]

+    // Native DOCX path (issue #7538) — when `nativeDocxExport` is enabled,
+    // convert the HTML export into a Word document in-process with
+    // `html-to-docx` instead of shelling out to LibreOffice. Saves admins
+    // from having to install `soffice` and avoids per-export subprocess
+    // latency. On failure we fall through to the LibreOffice path below
+    // so the change is strictly additive (opt-in via setting, auto-fallback
+    // if the converter throws).
+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }
+    }

Evidence
The PR adds a native DOCX branch, but requests for /export/docx are blocked earlier when LibreOffice
is disabled (soffice=null), and the UI removes the DOCX link under the same condition.
exportAvailable() only reflects soffice availability, so enabling nativeDocxExport alone won’t
expose or allow DOCX export.
src/node/handler/ExportHandler.ts[90-111]
src/node/hooks/express/importexport.ts[27-48]
src/static/js/pad_impexp.ts[147-166]
src/node/utils/Settings.ts[700-709]
doc/docker.md[190-194]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX export is implemented but is effectively unreachable in the intended “no soffice installed” configuration because the server route guard and client UI still treat `docx` as requiring LibreOffice.
## Issue Context
- Server-side guard blocks `docx` when `exportAvailable() === 'no'`.
- `exportAvailable()` currently only reflects `soffice` presence.
- Client UI removes the Word export link when `clientVars.exportAvailable === 'no'`.
- Docs say setting `SOFFICE` to `null` disables LibreOffice (typical for no-soffice deployments).
## Fix Focus Areas
- Update server export guard to allow `docx` when `settings.nativeDocxExport === true`, even if `soffice` is null:
- src/node/hooks/express/importexport.ts[27-48]
- Add a dedicated capability flag for “Word export available” (or “nativeDocxExport enabled”) into clientVars so the UI can show Word export even when other converter-based exports remain disabled:
- src/node/handler/PadMessageHandler.ts[1113-1118]
- src/static/js/pad_impexp.ts[147-166]
- Avoid incorrectly enabling PDF/ODT links when only native DOCX is available (introduce a new state or separate flags rather than reusing `exportAvailable`).
- src/node/utils/Settings.ts[700-709]


6. Committed generated sample.docx 📘 Rule violation ⚙ Maintainability ⭐ New

Description

A binary .docx file is added under src/tests/backend/specs/fixtures/, which appears to be a
generated export artifact (ZIP PK header with generator metadata/timestamps). Committing generated
artifacts can bloat the repo and create noisy diffs, violating the prohibition on generated files.

Code

src/tests/backend/specs/fixtures/sample.docx[R1-20]

+PK��
+�����Lm�\����������������_rels/PK��
+�����Lm�\������������
+���_rels/.rels<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
+<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
+  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
+  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target="docProps/core.xml"/>
+</Relationships>PK��
+�����Lm�\������������	���docProps/PK��
+�����Lm�\�0�������������docProps/core.xml<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
+<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
+  <dc:title/>
+  <dc:subject/>
+  <dc:creator>html-to-docx</dc:creator>
+  <cp:keywords>html-to-docx</cp:keywords>
+  <dc:description/>
+  <cp:lastModifiedBy>html-to-docx</cp:lastModifiedBy>
+  <cp:revision>1</cp:revision>
+  <dcterms:created xsi:type="dcterms:W3CDTF">2026-05-08T13:42:24.892Z</dcterms:created>
+  <dcterms:modified xsi:type="dcterms:W3CDTF">2026-05-08T13:42:24.892Z</dcterms:modified>

Evidence

PR Compliance ID 10 disallows committing build- or runtime-generated artifacts. The PR adds a
.docx fixture file whose content begins with the ZIP PK signature and includes DOCX XML
parts/metadata, indicating it is a generated document archive rather than hand-authored source.

src/tests/backend/specs/fixtures/sample.docx[1-20]
Best Practice: Repository guidelines

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A generated `.docx` binary fixture is committed to the repository.

## Issue Context
To comply with the no-generated-files rule, prefer generating such artifacts during the test run (or storing a non-generated source representation) rather than checking in the produced binary.

## Fix Focus Areas
- src/tests/backend/specs/fixtures/sample.docx[1-20]
- src/tests/backend/specs/import.ts[35-51]

7. skipDepth stuck on voids 🐞 Bug ☼ Reliability ⭐ New

Description

The PDF walker increments skipDepth for meta/link in SKIP_TAGS, but these are void elements
and may not emit a close event, leaving skipDepth > 0 and skipping the rest of the document. This
can produce empty PDFs and unbounded styleStack growth if such tags appear in the HTML stream.

Code

src/node/utils/ExportPdfNative.ts[R29-33]

+// Tags whose text content must never appear in the rendered PDF (CSS,
+// scripts, document metadata). The walker maintains a depth counter so that
+// nested elements inside one of these are ignored too.
+const SKIP_TAGS = new Set(['head', 'style', 'script', 'title', 'meta', 'link', 'noscript']);
+

Evidence
SKIP_TAGS includes meta and link, and the walker increments skipDepth on open and decrements
only on close. Elsewhere in the codebase, meta and link are explicitly treated as void tags (no
closing tag), so decrement may never happen, causing the parser to remain in skip mode indefinitely
and also push onto styleStack without a corresponding pop.
src/node/utils/ExportPdfNative.ts[29-33]
src/node/utils/ExportPdfNative.ts[125-131]
src/node/utils/ExportPdfNative.ts[213-220]
src/node/utils/ExportSanitizeHtml.ts[174-177]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ExportPdfNative` treats `meta` and `link` as skip-depth delimiters, but they are void elements and might not trigger `onclosetag`, which can leave `skipDepth` permanently > 0.

### Issue Context
The codebase already models `meta`/`link` as void tags in `ExportSanitizeHtml`.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[29-33]
- src/node/utils/ExportPdfNative.ts[125-131]
- src/node/utils/ExportSanitizeHtml.ts[174-177]

### Implementation direction
- Remove `meta` and `link` from `SKIP_TAGS`, **or**
- Only increment `skipDepth` for non-void tags (maintain a `VOID_TAGS` set in `ExportPdfNative`, or reuse logic), **or**
- If you keep them, immediately undo the increment for void tags in `onopentag`.
Also ensure `styleStack` remains balanced when skipping content so it cannot grow without bound.
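
As a rough illustration of the second option above, the depth counter can ignore void elements entirely, so it stays balanced whether or not the parser reports a close event for them. This is a minimal sketch, not the PR's walker: `VOID_TAGS`, `has`, and `SkipTracker` are hypothetical names, the void list is shortened for the example, and `styleStack` handling is not modeled.

```typescript
// Hypothetical sketch: names and the shortened void list are assumptions,
// not code from ExportPdfNative.ts.
const SKIP_TAGS: ReadonlyArray<string> =
    ['head', 'style', 'script', 'title', 'noscript', 'meta', 'link'];
const VOID_TAGS: ReadonlyArray<string> = ['meta', 'link', 'br', 'hr', 'img', 'input'];

const has = (list: ReadonlyArray<string>, tag: string): boolean =>
    list.indexOf(tag) !== -1;

class SkipTracker {
  private depth = 0;

  // Call from the parser's open-tag callback.
  open(tag: string): void {
    // Void elements have no children, so they never start a skipped region.
    if (has(VOID_TAGS, tag)) return;
    // Once inside a skipped region, count every nested element as well.
    if (this.depth > 0 || has(SKIP_TAGS, tag)) this.depth++;
  }

  // Call from the parser's close-tag callback.
  close(tag: string): void {
    // Ignore close events for void elements, whether or not the parser emits them.
    if (has(VOID_TAGS, tag)) return;
    if (this.depth > 0) this.depth--;
  }

  isSkipping(): boolean {
    return this.depth > 0;
  }
}
```

With this shape, a stray `<meta>` or `<link>` outside `<head>` leaves the tracker at depth 0, while content inside `<head>`, `<style>`, or `<script>` is still skipped until the matching close.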

8. Native DOCX test bypass 🐞 Bug ☼ Reliability

Description

The new native DOCX tests set settings.soffice='false' (a non-null string), which prevents
exportAvailable() from returning 'no' and sidesteps the server-side DOCX export block. This can make
tests pass while a real deployment with soffice=null (as documented) still cannot export DOCX.

Code

src/tests/backend/specs/export.ts[R36-39]

+    before(function () {
+      settings.soffice = 'false';
+      settings.nativeDocxExport = true;
+    });

Evidence
The tests configure soffice with a non-null string, but the documented way to disable LibreOffice is
null. Additionally, Settings reload logic will null out invalid soffice paths, meaning the test
configuration doesn’t reflect real behavior; the server route guard blocks docx when
exportAvailable() is 'no'.
src/tests/backend/specs/export.ts[32-39]
doc/docker.md[190-194]
src/node/utils/Settings.ts[700-709]
src/node/utils/Settings.ts[1019-1030]
src/node/hooks/express/importexport.ts[37-48]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The native DOCX tests use `settings.soffice = 'false'`, which is non-null and therefore does not simulate a true “no soffice” deployment (`soffice: null`). This can let the tests pass even if the feature is broken for real deployments.
## Issue Context
- Docs describe disabling LibreOffice by setting `SOFFICE` to `null`.
- Server-side export route blocks docx when `exportAvailable() === 'no'`.
## Fix Focus Areas
- Update the native DOCX tests to simulate a real no-soffice deployment (`settings.soffice = null`) and assert DOCX export still succeeds when `nativeDocxExport = true`:
- src/tests/backend/specs/export.ts[32-65]
- After fixing the route/UI gating (see other finding), add a regression assertion that `/export/docx` works with `soffice = null` and fails (or is blocked) appropriately when nativeDocxExport is false.
- src/tests/backend/specs/export.ts[22-66]
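
The bypass comes down to a null check. A minimal sketch, assuming the availability check only tests whether `soffice` is non-null; the real `Settings.ts` logic is not shown on this page and may differ, and `exportAvailableSketch` is a hypothetical stand-in:

```typescript
type Availability = 'yes' | 'no';

// Hypothetical stand-in for exportAvailable(): any non-null value counts as configured.
const exportAvailableSketch = (soffice: string | null): Availability =>
    soffice != null ? 'yes' : 'no';

// The string 'false' is non-null, so the route guard for 'no' is never exercised.
const withStringFalse = exportAvailableSketch('false'); // 'yes'

// null is the documented no-soffice shape and is what the tests should use.
const withNull = exportAvailableSketch(null); // 'no'
```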

9. Unrestricted HTML-to-DOCX I/O 🐞 Bug ⛨ Security

Description

ExportHandler passes exported HTML directly into html-to-docx and buffers the entire DOCX in memory
for res.send(), and the dependency graph includes image-to-base64→node-fetch enabling outbound
network access from conversion code. Because HTML export can be plugin-modified, enabling
nativeDocxExport can allow untrusted pad/plugin output to trigger server-side requests and increase
memory pressure.

Code

src/node/handler/ExportHandler.ts[R99-105]

+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;

Evidence

The new code converts HTML in-process via html-to-docx and sends the resulting buffer. The lockfile
shows html-to-docx includes image-to-base64 (node-fetch), and ExportHtml provides a plugin hook that
can modify generated HTML, meaning untrusted plugin/pad output can influence the converter input and
potentially induce server-side I/O.

src/node/handler/ExportHandler.ts[97-105]
pnpm-lock.yaml[8709-8718]
pnpm-lock.yaml[8804-8807]
src/node/utils/ExportHtml.ts[321-337]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX export converts plugin-modifiable HTML via `html-to-docx` with no constraints. The dependency tree includes `node-fetch`, so conversion code may perform outbound network access, and the handler buffers the full DOCX in memory.
## Issue Context
- ExportHandler calls `htmlToDocx(html)` and `res.send(docxBuffer)`.
- ExportHtml allows plugins to modify exported HTML.
- pnpm-lock indicates `html-to-docx` pulls in `image-to-base64` and `node-fetch`.
## Fix Focus Areas
- Investigate `html-to-docx` options to disable remote fetching / external resource resolution (or strip/deny `<img src>` and other fetchable URLs from HTML before conversion).
- src/node/handler/ExportHandler.ts[97-105]
- Add guardrails: size limits for generated DOCX, timeouts/cancellation, and (if possible) run conversion in a constrained environment (worker/thread or sandbox) to reduce SSRF and DoS impact.
- src/node/handler/ExportHandler.ts[97-111]
- Consider writing the buffer to a temp file and using `res.sendFile()` (or streaming) to reduce peak memory usage.
- src/node/handler/ExportHandler.ts[99-105]
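
For the first bullet, one possible shape of a pre-conversion filter is sketched below. It is a simplified, regex-based illustration rather than a parser-based sanitizer, so it would miss edge cases such as a `>` inside an attribute value; `isAllowedImgSrc` and `stripRemoteImagesSketch` are hypothetical names. It keeps `<img>` tags whose `src` is a `data:` URI or a relative path and drops the rest.

```typescript
// Hypothetical helper: true for data: URIs and relative paths only.
const isAllowedImgSrc = (src: string): boolean => {
  const s = src.trim();
  if (s === '') return false;
  if (/^data:/i.test(s)) return true;
  // Reject anything with a URL scheme (http:, https:, file:, ...) and
  // protocol-relative URLs (//host/path), which could still trigger a fetch.
  if (/^[a-z][a-z0-9+.-]*:/i.test(s) || /^\/\//.test(s)) return false;
  return true;
};

// Simplified regex pass; a production sanitizer should use a real HTML parser.
const stripRemoteImagesSketch = (html: string): string =>
    html.replace(/<img\b[^>]*>/gi, (tag: string): string => {
      // Pull out the src value whether it is double-quoted, single-quoted, or bare.
      const m = /\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i.exec(tag);
      const src = m ? (m[1] || m[2] || m[3] || '') : '';
      return isAllowedImgSrc(src) ? tag : '';
    });
```

Running the filter before the converter sees the HTML means the decision does not depend on whatever fetch behavior the conversion library has.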

Previous review results

Review updated until commit 17bf820

Results up to commit 7e5a73c

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (1)

1. DOCX export still needs soffice 📎 Requirement gap ≡ Correctness

Description

The new DOCX export path is opt-in (nativeDocxExport defaults to false) and explicitly falls
back to the existing LibreOffice/soffice path on error, so DOCX export is not fully free of a
LibreOffice runtime dependency. This fails the requirement to support DOCX export without requiring
LibreOffice for these formats.

Code

src/node/handler/ExportHandler.ts[R97-110]

+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }

Evidence
PR Compliance ID 1 requires DOCX/PDF support using native/local tooling with no runtime dependency
on LibreOffice. The added DOCX branch is gated behind settings.nativeDocxExport and, if conversion
fails, logs a warning and falls through to the LibreOffice export path, meaning LibreOffice remains
a required backstop in the DOCX export flow.
Native DOCX/PDF import/export support without Abiword/LibreOffice dependency
src/node/handler/ExportHandler.ts[97-110]
src/node/utils/Settings.ts[419-426]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Compliance requires DOCX export to work without a LibreOffice/`soffice` runtime dependency. The new native DOCX export is opt-in by default and explicitly falls back to the LibreOffice path on error, so LibreOffice is still required as a backstop for DOCX export.

## Issue Context
Current implementation uses `html-to-docx` when `nativeDocxExport` is enabled, but catches errors and falls through to LibreOffice. This violates the stated objective of having DOCX export not depend on LibreOffice.

## Fix Focus Areas
- src/node/handler/ExportHandler.ts[97-110]
- src/node/utils/Settings.ts[419-426]

2. DOCX blocked without soffice 🐞 Bug ≡ Correctness

Description

Even with settings.nativeDocxExport=true, DOCX exports are rejected and hidden when settings.soffice
is null because exportAvailable() gates docx behind soffice in both the /export route and UI. This
makes the new native DOCX branch in ExportHandler unreachable in the documented no-LibreOffice
configuration.

Code

src/node/handler/ExportHandler.ts[R90-111]

+    // Native DOCX path (issue #7538) — when `nativeDocxExport` is enabled,
+    // convert the HTML export into a Word document in-process with
+    // `html-to-docx` instead of shelling out to LibreOffice. Saves admins
+    // from having to install `soffice` and avoids per-export subprocess
+    // latency. On failure we fall through to the LibreOffice path below
+    // so the change is strictly additive (opt-in via setting, auto-fallback
+    // if the converter throws).
+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }
+    }

Evidence
The PR adds a native DOCX branch, but requests for /export/docx are blocked earlier when LibreOffice
is disabled (soffice=null), and the UI removes the DOCX link under the same condition.
exportAvailable() only reflects soffice availability, so enabling nativeDocxExport alone won’t
expose or allow DOCX export.
src/node/handler/ExportHandler.ts[90-111]
src/node/hooks/express/importexport.ts[27-48]
src/static/js/pad_impexp.ts[147-166]
src/node/utils/Settings.ts[700-709]
doc/docker.md[190-194]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX export is implemented but is effectively unreachable in the intended “no soffice installed” configuration because the server route guard and client UI still treat `docx` as requiring LibreOffice.

## Issue Context
- Server-side guard blocks `docx` when `exportAvailable() === 'no'`.
- `exportAvailable()` currently only reflects `soffice` presence.
- Client UI removes the Word export link when `clientVars.exportAvailable === 'no'`.
- Docs say setting `SOFFICE` to `null` disables LibreOffice (typical for no-soffice deployments).

## Fix Focus Areas
- Update server export guard to allow `docx` when `settings.nativeDocxExport === true`, even if `soffice` is null:
 - src/node/hooks/express/importexport.ts[27-48]
- Add a dedicated capability flag for “Word export available” (or “nativeDocxExport enabled”) into clientVars so the UI can show Word export even when other converter-based exports remain disabled:
 - src/node/handler/PadMessageHandler.ts[1113-1118]
 - src/static/js/pad_impexp.ts[147-166]
- Avoid incorrectly enabling PDF/ODT links when only native DOCX is available (introduce a new state or separate flags rather than reusing `exportAvailable`).
 - src/node/utils/Settings.ts[700-709]

3. Native DOCX test bypass 🐞 Bug ☼ Reliability

Description

The new native DOCX tests set settings.soffice='false' (a non-null string), which prevents
exportAvailable() from returning 'no' and sidesteps the server-side DOCX export block. This can make
tests pass while a real deployment with soffice=null (as documented) still cannot export DOCX.

Code

src/tests/backend/specs/export.ts[R36-39]

+    before(function () {
+      settings.soffice = 'false';
+      settings.nativeDocxExport = true;
+    });

Evidence
The tests configure soffice with a non-null string, but the documented way to disable LibreOffice is
null. Additionally, Settings reload logic will null out invalid soffice paths, meaning the test
configuration doesn’t reflect real behavior; the server route guard blocks docx when
exportAvailable() is 'no'.
src/tests/backend/specs/export.ts[32-39]
doc/docker.md[190-194]
src/node/utils/Settings.ts[700-709]
src/node/utils/Settings.ts[1019-1030]
src/node/hooks/express/importexport.ts[37-48]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The native DOCX tests use `settings.soffice = 'false'`, which is non-null and therefore does not simulate a true “no soffice” deployment (`soffice: null`). This can let the tests pass even if the feature is broken for real deployments.

## Issue Context
- Docs describe disabling LibreOffice by setting `SOFFICE` to `null`.
- Server-side export route blocks docx when `exportAvailable() === 'no'`.

## Fix Focus Areas
- Update the native DOCX tests to simulate a real no-soffice deployment (`settings.soffice = null`) and assert DOCX export still succeeds when `nativeDocxExport = true`:
 - src/tests/backend/specs/export.ts[32-65]
- After fixing the route/UI gating (see other finding), add a regression assertion that `/export/docx` works with `soffice = null` and fails (or is blocked) appropriately when nativeDocxExport is false.
 - src/tests/backend/specs/export.ts[22-66]

4. Unrestricted HTML-to-DOCX I/O 🐞 Bug ⛨ Security

Description

ExportHandler passes exported HTML directly into html-to-docx and buffers the entire DOCX in memory
for res.send(), and the dependency graph includes image-to-base64→node-fetch enabling outbound
network access from conversion code. Because HTML export can be plugin-modified, enabling
nativeDocxExport can allow untrusted pad/plugin output to trigger server-side requests and increase
memory pressure.

Code

src/node/handler/ExportHandler.ts[R99-105]

+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;

Evidence

The new code converts HTML in-process via html-to-docx and sends the resulting buffer. The lockfile
shows html-to-docx includes image-to-base64 (node-fetch), and ExportHtml provides a plugin hook that
can modify generated HTML, meaning untrusted plugin/pad output can influence the converter input and
potentially induce server-side I/O.

src/node/handler/ExportHandler.ts[97-105]
pnpm-lock.yaml[8709-8718]
pnpm-lock.yaml[8804-8807]
src/node/utils/ExportHtml.ts[321-337]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Native DOCX export converts plugin-modifiable HTML via `html-to-docx` with no constraints. The dependency tree includes `node-fetch`, so conversion code may perform outbound network access, and the handler buffers the full DOCX in memory.

## Issue Context
- ExportHandler calls `htmlToDocx(html)` and `res.send(docxBuffer)`.
- ExportHtml allows plugins to modify exported HTML.
- pnpm-lock indicates `html-to-docx` pulls in `image-to-base64` and `node-fetch`.

## Fix Focus Areas
- Investigate `html-to-docx` options to disable remote fetching / external resource resolution (or strip/deny `<img src>` and other fetchable URLs from HTML before conversion).
 - src/node/handler/ExportHandler.ts[97-105]
- Add guardrails: size limits for generated DOCX, timeouts/cancellation, and (if possible) run conversion in a constrained environment (worker/thread or sandbox) to reduce SSRF and DoS impact.
 - src/node/handler/ExportHandler.ts[97-111]
- Consider writing the buffer to a temp file and using `res.sendFile()` (or streaming) to reduce peak memory usage.
 - src/node/handler/ExportHandler.ts[99-105]

qodo-free-for-open-source-projects · 2026-04-20T08:39:44Z

+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }


1. Docx export still needs soffice 📎 Requirement gap ≡ Correctness

The new DOCX export path is opt-in (nativeDocxExport defaults to false) and explicitly falls back to the existing LibreOffice/soffice path on error, so DOCX export is not fully free of a LibreOffice runtime dependency. This fails the requirement to support DOCX export without requiring LibreOffice for these formats.

Agent Prompt

## Issue description
Compliance requires DOCX export to work without a LibreOffice/`soffice` runtime dependency. The new native DOCX export is opt-in by default and explicitly falls back to the LibreOffice path on error, so LibreOffice is still required as a backstop for DOCX export.

## Issue Context
Current implementation uses `html-to-docx` when `nativeDocxExport` is enabled, but catches errors and falls through to LibreOffice. This violates the stated objective of having DOCX export not depend on LibreOffice.

## Fix Focus Areas
- src/node/handler/ExportHandler.ts[97-110]
- src/node/utils/Settings.ts[419-426]

qodo-free-for-open-source-projects · 2026-04-20T08:39:44Z

+    // Native DOCX path (issue #7538) — when `nativeDocxExport` is enabled,
+    // convert the HTML export into a Word document in-process with
+    // `html-to-docx` instead of shelling out to LibreOffice. Saves admins
+    // from having to install `soffice` and avoids per-export subprocess
+    // latency. On failure we fall through to the LibreOffice path below
+    // so the change is strictly additive (opt-in via setting, auto-fallback
+    // if the converter throws).
+    if (type === 'docx' && settings.nativeDocxExport) {
+      try {
+        const htmlToDocx = require('html-to-docx');
+        const docxBuffer = await htmlToDocx(html);
+        html = null;
+        res.contentType(
+            'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+        res.send(docxBuffer);
+        return;
+      } catch (err) {
+        console.warn(
+            `native-docx export failed for pad "${padId}", falling back to ` +
+            `LibreOffice: ${(err as Error).message || err}`);
+      }
+    }


2. Docx blocked without soffice 🐞 Bug ≡ Correctness

Even with settings.nativeDocxExport=true, DOCX exports are rejected and hidden when settings.soffice is null because exportAvailable() gates docx behind soffice in both the /export route and UI. This makes the new native DOCX branch in ExportHandler unreachable in the documented no-LibreOffice configuration.

Agent Prompt

## Issue description
Native DOCX export is implemented but is effectively unreachable in the intended “no soffice installed” configuration because the server route guard and client UI still treat `docx` as requiring LibreOffice.

## Issue Context
- Server-side guard blocks `docx` when `exportAvailable() === 'no'`.
- `exportAvailable()` currently only reflects `soffice` presence.
- Client UI removes the Word export link when `clientVars.exportAvailable === 'no'`.
- Docs say setting `SOFFICE` to `null` disables LibreOffice (typical for no-soffice deployments).

## Fix Focus Areas
- Update server export guard to allow `docx` when `settings.nativeDocxExport === true`, even if `soffice` is null:
 - src/node/hooks/express/importexport.ts[27-48]
- Add a dedicated capability flag for “Word export available” (or “nativeDocxExport enabled”) into clientVars so the UI can show Word export even when other converter-based exports remain disabled:
 - src/node/handler/PadMessageHandler.ts[1113-1118]
 - src/static/js/pad_impexp.ts[147-166]
- Avoid incorrectly enabling PDF/ODT links when only native DOCX is available (introduce a new state or separate flags rather than reusing `exportAvailable`).
 - src/node/utils/Settings.ts[700-709]

JohnMcLear · 2026-04-20T16:11:21Z

I feel like we can't drop pdf so need to have a conversation here...

Captures the agreed scope expansion of PR ether#7568: replace the flag-gated native DOCX path with a soffice-first selection cascade, add native PDF export via pdfkit + a small htmlparser2-driven walker, and add native DOCX import via mammoth. Also defines a shared HTML sanitizer (stripRemoteImages) used by both export converters to close the SSRF surface that Qodo flagged on the html-to-docx path. The spec drops the nativeDocxExport setting and its env var; with soffice configured, behavior is unchanged, and with soffice null, docx/pdf export and docx import all work in-process. odt/doc/rtf (and pdf import) keep needing soffice and are documented as such. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses ether#7538. The current DOCX export path shells out to LibreOffice, which means every deployment that wants a Word download either installs soffice (~500 MB) or loses that export. This PR adds a pure-JS alternative: render the HTML via the existing exporthtml pipeline, then feed it to the `html-to-docx` library in-process to produce a valid .docx buffer, with no soffice required, no subprocess spawn, and no temp file dance for the DOCX case.

Behavior:
- `settings.nativeDocxExport` (default `false`) gates the new path so existing deployments see zero behavior change.
- When enabled, `type === 'docx'` requests skip the LibreOffice branch, run `html-to-docx(html)`, and return the buffer with the `application/vnd.openxmlformats-officedocument.wordprocessingml.document` content-type.
- If the native converter throws, the handler falls through to the existing LibreOffice path, so flipping the flag on is safe even on a mixed installation where soffice is still present as a backstop.
- Other export formats (pdf, odt, rtf, txt, html, etherpad) are unchanged.

Files:
- `src/package.json`: `html-to-docx` dep (pure JS, no binary reqs)
- `src/node/handler/ExportHandler.ts`: new DOCX branch gated on the setting, with fall-through on error
- `src/node/utils/Settings.ts`, `settings.json.template`, `settings.json.docker`, `doc/docker.md`: wire up the new setting + env var (`NATIVE_DOCX_EXPORT`)
- `src/tests/backend/specs/export.ts`: two new tests, which assert that the exported buffer is a valid ZIP (PK\x03\x04 signature) and that the response carries the correct content-type, both with `settings.soffice = 'false'` to prove the path doesn't need soffice at all.

Out of scope for this PR:
- Native PDF export (would need a PDF rendering step; a separate undertaking, and the issue acknowledges the `pdfkit`/puppeteer size trade-off).

Closes ether#7538

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The upgrade-from-latest-release CI job installs deps from the previous release's package.json (before this PR adds html-to-docx) and then git-checkouts this branch's code without re-running pnpm install. Under that one workflow the new test can't find the module and fails on the LibreOffice fallback, masking that the native path actually works in every normal install. Guard the describe block with require.resolve('html-to-docx'); Mocha's this.skip() on before cascades to the sibling its. Regular backend tests (pnpm install against this branch's lockfile) still exercise it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bite-sized TDD task breakdown of the soffice-free export/import work: rebase, deps, sanitizer, PDF walker, mammoth wrapper, ExportHandler cascade, route guard, ImportHandler branch, UI fix, flag rollback, verification + Qodo reply. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure-JS, no native binaries:
- pdfkit ^0.18.0 (PDF rendering)
- htmlparser2 ^12 (SAX parser used by walker + sanitizer)
- mammoth ^1.12 (DOCX -> HTML for native import)
- @types/pdfkit ^0.17 (dev)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops <img src=> elements pointing at non-data, non-relative URLs to prevent the DOCX/PDF converters from making outbound requests via plugin-modified HTML. Closes Qodo finding ether#4 against the html-to-docx path; will be wired into both export branches in the cascade refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Renders pad HTML to a PDF Buffer in-process: headings, paragraphs, lists, links, inline emphasis, data:-URI images. Remote images are explicitly skipped at the walker (defense-in-depth on top of the shared stripRemoteImages sanitizer). PDFs are emitted with compress:false so accessibility/SEO indexers that don't FlateDecode can still extract text. Pads are small enough that the size cost is negligible. Walker is 167 LOC, well under the spec's 500-LOC bail-out threshold for switching to pdfmake+html-to-pdfmake+jsdom. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wraps mammoth.convertToHtml so a soffice-less Etherpad can ingest .docx files. Images are coerced to data: URIs at the converter boundary so the import pipeline never sees a remote src=. Includes a tiny generated DOCX fixture (heading, paragraph, list) under tests/backend/specs/fixtures/ for both this wrapper test and the upcoming end-to-end import test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the flag-gated DOCX branch with a deterministic dispatch: soffice if configured, native DOCX/PDF otherwise, 5xx on native error. Both native paths run plugin-modified HTML through stripRemoteImages first.

Test changes:
- existing native DOCX block now sets soffice=null (was 'false', a truthy non-null string that sidestepped the route guard); fixes Qodo finding #3.
- new native PDF integration tests assert %PDF- header and application/pdf content-type with soffice=null.
- new negative test: with soffice=null, /export/odt still returns the 'not enabled' message.
- the legacy 500-on-export-error test now uses /bin/false so it exercises the soffice error path explicitly (the cascade dropped the ad-hoc 'false' string; .doc has no native path so this still works as a soffice error probe).

Integration tests for native DOCX/PDF currently fail because the /export route guard still treats both formats as soffice-only; the next commit fixes that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
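
The header checks mentioned in the test changes can be expressed as simple byte-prefix comparisons. A minimal sketch with hypothetical helper names (`hasPrefix`, `looksLikeZip`, `looksLikePdf`); the PR's actual assertions are not shown on this page and may be written differently:

```typescript
// Hypothetical helpers; the PR's real tests may assert this differently.
const hasPrefix = (buf: Uint8Array, sig: ReadonlyArray<number>): boolean =>
    buf.length >= sig.length && sig.every((byte, i) => buf[i] === byte);

// DOCX files are ZIP archives, which start with the bytes "PK\x03\x04".
const looksLikeZip = (buf: Uint8Array): boolean =>
    hasPrefix(buf, [0x50, 0x4b, 0x03, 0x04]);

// PDF files start with the ASCII bytes "%PDF-".
const looksLikePdf = (buf: Uint8Array): boolean =>
    hasPrefix(buf, [0x25, 0x50, 0x44, 0x46, 0x2d]);
```

Because a Node.js `Buffer` is a `Uint8Array`, helpers of this shape can be applied directly to a response body received as a buffer.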

Tightens the no-soffice block to ['odt','doc'] only — formats with no native path. docx and pdf are handed to ExportHandler, which dispatches to the in-process converters. Closes Qodo finding #2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When soffice is null and the upload is .docx, run mammoth and feed the resulting HTML through setPadHTML. Other office formats (pdf/odt/doc/rtf) are explicitly rejected with uploadFailed instead of silently falling through to the ASCII-only path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Native paths (ether#7538) make DOCX and PDF available regardless of soffice presence, so unconditionally render those links. ODT still gates on exportAvailable. Closes Qodo finding #2 on the UI side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Selection is now purely soffice-presence-driven (cascade in ExportHandler). The opt-in setting and its NATIVE_DOCX_EXPORT env var are no longer needed -- soffice configured means soffice path; soffice null means native path (DOCX, PDF, and DOCX import). Reverts the additive surface introduced earlier in this PR. Also updates the SOFFICE doc row to reflect that null no longer means 'plain text and HTML only' -- docx/pdf export and docx import now work natively without soffice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JohnMcLear · 2026-05-08T14:00:34Z

Qodo follow-up — addressed all four findings, plus expanded scope to close #7538 properly:

Requirement gap (DOCX still needs soffice) — fixed. Removed the nativeDocxExport flag entirely. Selection is now purely soffice-presence-driven: soffice configured → soffice; soffice null → native (html-to-docx for DOCX, pdfkit for PDF). No fallback chain.
DOCX blocked without soffice — fixed. Tightened the route guard to ['odt','doc'] only when exportAvailable() === 'no'; pdf/docx fall through to ExportHandler's native dispatch. UI in pad_impexp.ts always shows DOCX + PDF links now.
Native DOCX test bypass — fixed. Tests use settings.soffice = null (was 'false') so they exercise the real no-soffice deployment shape.
Unrestricted HTML-to-DOCX I/O — fixed. New stripRemoteImages sanitizer drops non-data:/non-relative <img src> before either DOCX or PDF conversion. The PDF walker also rejects remote <img> at its own boundary as defense-in-depth. No converter ever sees a remote URL.
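
The sanitizer rule described in the last item (keep `<img>` only when `src` is `data:` or relative) can be sketched roughly as follows. This is a hypothetical illustration, not the PR's `stripRemoteImages`: the names `stripRemoteImagesSketch` and `isSafeSrc` are made up here, and the regex-based tag matching is a simplification that a real implementation may well replace with a proper HTML parser.

```typescript
// Hypothetical sketch of the "drop remote <img>" rule. Not the PR's code.
const IMG_TAG_RE = /<img\b[^>]*>/gi;
const SRC_ATTR_RE = /\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;

// A src is treated as safe when it is a data: URI or a plain relative path.
const isSafeSrc = (src: string): boolean => {
  const s = src.trim();
  if (/^data:/i.test(s)) return true;
  // Any other explicit scheme (http:, https:, file:, ...) is rejected.
  if (/^[a-z][a-z0-9+.-]*:/i.test(s)) return false;
  // Protocol-relative URLs (//host/path) would still reach the network.
  if (s.startsWith('//')) return false;
  return true;
};

const stripRemoteImagesSketch = (html: string): string =>
  html.replace(IMG_TAG_RE, (tag: string): string => {
    const m = SRC_ATTR_RE.exec(tag);
    if (m == null) return tag; // no src attribute, nothing to fetch
    const src = m[1] ?? m[2] ?? m[3] ?? '';
    return isSafeSrc(src) ? tag : '';
  });
```

Running the sanitizer once, before either converter, means the converters never have to make their own network decisions; the PDF walker's second check is then purely defense-in-depth, as the comment above says.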

Also added (per my "we can't drop pdf" comment):

Native PDF export via pdfkit + htmlparser2 walker (~170 LOC, well under the 500-LOC bail-out threshold defined in the spec)
Native DOCX import via mammoth so a soffice-less deployment can also ingest .docx files

Design + plan committed to the branch:

docs/superpowers/specs/2026-05-08-native-docx-pdf-export-import-design.md
docs/superpowers/plans/2026-05-08-native-docx-pdf-export-import.md

CodeQL flagged the loose 'raw.includes("etherpad.org")' as 'incomplete URL substring sanitization' (a false positive in test context, but worth fixing). Match the full /URI (host) form instead -- it's both more accurate (we're verifying the PDF link annotation structure) and CodeQL-clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DOCX:
- New extractBody helper drops <head>/<style> and the leading newline inside <body> so html-to-docx doesn't render CSS or prefix paragraphs with empty space.
- New wrapLooseLines pre-processor wraps loose pad lines in <p> before the converter sees them. html-to-docx renders <br> outside <p> as a new <w:p> (full empty line in Word); inside <p> it correctly emits <w:br/> (soft break). Etherpad's HTML uses bare <br> for every line, so this was making single Enters look like double Enters in the Word output.

PDF:
- Walker SKIP_TAGS rejects head/style/script/title/meta/link content -- prior version dumped CSS into the rendered PDF.
- New breakLine() helper combines flushLine() with moveDown(1). pdfkit's text('', false) closes the continued run but does NOT advance the cursor, so consecutive runs were stacking at the same y-coordinate. <br>, end-of-block, and list items now use breakLine().
- ontext collapses runs of whitespace and drops pure-whitespace text nodes so pretty-printed source HTML doesn't render its formatting newlines.

Round-trip:
- New backend test: pad text -> DOCX export -> DOCX import -> new pad. Asserts content survives the trip.
- New PDF sanity test: extracts visible text from the PDF stream and asserts the source pad text appears verbatim.
- 6 new unit tests for extractBody and wrapLooseLines plus 1 for PDF walker SKIP_TAGS coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- BR_PARA_RE was /(?:\s*<br>\s*){2,}/ -- two adjacent \s* runs can match the same chars, so on '<br>\t<br>\t<br>...' the regex backtracks exponentially. Re-anchored to match a fixed first <br> followed by one or more additional <br>s, so each whitespace run has exactly one home.
- import.ts: fetchBuffer was typed Promise<Buffer> but call sites chained .expect(200) on it, which only works on supertest's Test object. Return the Test (typed any) so the chain is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
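
One way to read "a fixed first <br> followed by one or more additional <br>s" is the pattern below. It is an assumed shape, not the regex that landed in the PR; `BR_RUN_RE` and `countBrRuns` are illustrative names. The point is that every whitespace run can only be consumed by the single `\s*` that precedes the next `<br>`, so the matcher has no ambiguous split to backtrack over.

```typescript
// Assumed re-anchored shape: one literal <br>, then one or more
// "<optional whitespace><br>" groups. No two \s* quantifiers are adjacent.
const BR_RUN_RE = /<br>(?:\s*<br>)+/gi;

// Returns the number of <br> tags in each run of two or more.
const countBrRuns = (html: string): number[] =>
  (html.match(BR_RUN_RE) || []).map((run: string) => run.split(/<br>/i).length - 1);

// A single <br> is not a run; whitespace between <br>s is tolerated.
const runs = countBrRuns('a<br>b<br>\t<br>c<br> <br>\n<br>d');
```

With the sample input, `runs` holds one entry per run of consecutive breaks (a run of two and a run of three); the lone `<br>` after `a` is ignored.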

ep_headings2/ep_align emit one heading-styled blank-line block after every styled line in the pad ('<h1 style=text-align:right></h1>'), which both html-to-docx and our pdfkit walker render as a full empty paragraph. Plus the pdfkit walker had no support for text-align or monospace, so right/center alignment and 'code' lines rendered the same as plain body text.

- New dropEmptyBlocks helper strips empty h1-h6/p/code/pre/div/blockquote wrappers in preprocessing. Iterates so nested empties collapse too. Applied before both DOCX and PDF conversion.
- PDF walker now reads style='text-align:left|center|right|justify' on block elements (h1-6, p, div) and passes it as pdfkit's align option. align is applied once per continued run, then reset on flushLine so the next block can pick up its own value.
- PDF walker handles <code>, <tt>, <kbd>, <samp> as inline monospace (Courier) and <pre> as block monospace (Courier + breakLine on open/close).

11 new unit tests:
- 4 for dropEmptyBlocks (heading wrappers, code, nesting, pass-through)
- 1 for PDF text-align (compares the BT matrix x for left vs right)
- 2 for Courier in <code> and <pre>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
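
The "iterates so nested empties collapse too" behaviour can be illustrated with a fixed-point loop. This is a minimal sketch using the tag list from this commit message (a later commit in this PR stops stripping `<p></p>`); the name `dropEmptyBlocksSketch` and the regex approach are assumptions, not the PR's implementation.

```typescript
// Matches an element from the list whose content is empty or whitespace-only.
// \1 forces the closing tag to match the opening tag name.
const EMPTY_BLOCK_RE = /<(h[1-6]|p|code|pre|div|blockquote)\b[^>]*>\s*<\/\1>/gi;

const dropEmptyBlocksSketch = (html: string): string => {
  let out = html;
  let prev: string;
  // Repeat until nothing changes: removing an inner empty block can leave
  // its parent empty, which only the next pass can see.
  do {
    prev = out;
    out = out.replace(EMPTY_BLOCK_RE, '');
  } while (out !== prev);
  return out;
};
```

For an input such as `<div><h1 style=text-align:right></h1></div><p>x</p>`, the first pass removes the empty `<h1>`, the second removes the now-empty `<div>`, and the third pass is a no-op that ends the loop.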

Round-trip bug: ep_headings2 emits <h1>/<h2>/<code> in pad HTML. Mammoth round-trips them as adjacent <h1>A</h1><h2>B</h2><p>C</p> on import. Etherpad's server-side content collector has a default _blockElems set of just {div, p, pre, li}, and ep_headings2 only registers the CLIENT-side aceRegisterBlockElements -- not the server-side ccRegisterBlockElements. So h1/h2/code end up being treated as inline by the importer, and adjacent blocks merge into a single pad line.

Fix: insert <br> after </h1>...</h6>/</code> when followed by another block. Server-side workaround keeps this PR self-contained regardless of plugin version. The right long-term fix is to extend ep_plugin_helpers' lineAttribute factory to register both hooks (filed as a follow-up).

Tests:
- 5 unit tests for separateAdjacentHeadingBlocks
- New end-to-end round-trip test asserts H1+H2+P land on three separate pad lines after the import path.

Plus the prior PDF text-align/Courier/code commit also included here:
- code/tt/kbd/samp inherit text-align from style attribute
- pre inherits text-align too

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
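
A rough sketch of the workaround described above, inserting `<br>` after a closing heading or `</code>` when another block follows. The function name and the exact set of "following block" tags are assumptions made for illustration; the PR's `separateAdjacentHeadingBlocks` may use a different set.

```typescript
// Closing </h1>..</h6> or </code>, optional whitespace, then a lookahead for
// the opening tag of another block. The lookahead leaves that tag in place.
const ADJACENT_BLOCK_RE =
    /(<\/(?:h[1-6]|code)>)(\s*)(?=<(?:h[1-6]|code|p|div|pre|ul|ol|blockquote)\b)/gi;

const separateAdjacentHeadingBlocksSketch = (html: string): string =>
  html.replace(ADJACENT_BLOCK_RE, '$1<br>$2');

// '<h1>A</h1><h2>B</h2><p>C</p>' gains a <br> after each heading close, so an
// importer that treats h1/h2 as inline still sees three separate lines.
const separated = separateAdjacentHeadingBlocksSketch('<h1>A</h1><h2>B</h2><p>C</p>');
```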

DOCX round-trip dropping blank pad lines:
- wrapLooseLines now emits an explicit <p></p> marker for each blank line in a <br> run, instead of collapsing all gaps into a single paragraph break. (N consecutive <br>s -> 1 paragraph boundary + N-2 empty <p></p> markers, mapping to N-1 blank pad lines.)
- mammoth's docxBufferToHtml now passes ignoreEmptyParagraphs:false so the empty <w:p> entries survive the import side. mammoth's default of true was silently dropping them.
- dropEmptyBlocks no longer strips <p></p> -- that's the meaningful marker for the round-trip. Empty <h1>/<code>/<pre>/<div>/<blockquote> are still stripped (plugin noise).

DOCX <code> rendering as monospace:
- New applyMonospaceToCode wraps code/tt/kbd/samp/pre content in a <span style="font-family:'Courier New', monospace">. html-to-docx honors that and emits <w:rFonts w:ascii="Courier New".../>, which Word renders as Courier. The bare <code> tag is otherwise just a no-op for html-to-docx.
- Applied only on the DOCX export path (PDF walker already handles monospace via Courier font selection).

Round-trip tests:
- New a==c suite: txt, etherpad, html, docx -- export from src, import to dst, re-export and compare against the meaningful invariant (line text for binary formats; trimmed body for HTML).
- HTML test tolerates one trailing <br> per round-trip because setPadHTML appends a final <p> on import; this is pre-existing core behavior, not our bug.
- DOCX test normalizes trailing newline run (same reason).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mammoth doesn't expose Word's paragraph alignment (`<w:jc>`) when it converts a docx to HTML -- there's no equivalent in its style-mapping machinery. To keep alignment through DOCX round-trips we walk the docx's document.xml directly, pull the `w:val` from each `<w:p>`'s `<w:jc>`, and inject `style="text-align:..."` onto the matching block element in mammoth's output by document order.

Word's w:jc accepts more values than CSS text-align; we map left/start, center, right/end, both/justify/distribute and skip the rest (start/end take left/right because we don't track ltr/rtl from the docx for now).

Combines with the upstream ep_align PR (ether/ep_align#183) for the full round-trip: this PR makes the mammoth output carry the alignment style; ep_align#183 makes the importer pick it up.

Closes the alignment side of ether#7538.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
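
The w:jc to text-align mapping listed above, plus a per-paragraph scan of document.xml, might look roughly like this. It is a sketch under stated assumptions: the names `JC_TO_CSS` and `paragraphAlignments` are hypothetical, regex scanning stands in for whatever XML handling the PR actually uses, and nested paragraphs (for example inside text boxes) are not handled.

```typescript
type CssAlign = 'left' | 'center' | 'right' | 'justify';

// Mapping taken from the commit message; any other w:jc value is skipped.
const JC_TO_CSS = new Map<string, CssAlign>([
  ['left', 'left'], ['start', 'left'],
  ['center', 'center'],
  ['right', 'right'], ['end', 'right'],
  ['both', 'justify'], ['justify', 'justify'], ['distribute', 'justify'],
]);

// Matches <w:p .../> (empty paragraph) or <w:p ...>...</w:p>. Requiring
// whitespace, "/>" or ">" right after "<w:p" keeps <w:pPr> from matching.
const PARA_RE = /<w:p(?:\s[^>]*?)?(?:\/>|>([\s\S]*?)<\/w:p>)/g;
const JC_RE = /<w:jc\b[^>]*\bw:val="([^"]+)"/;

// One entry per paragraph, in document order; null means "no usable alignment".
const paragraphAlignments = (documentXml: string): (CssAlign | null)[] => {
  const out: (CssAlign | null)[] = [];
  PARA_RE.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = PARA_RE.exec(documentXml)) !== null) {
    const jc = JC_RE.exec(m[1] || '');
    out.push(jc ? (JC_TO_CSS.get(jc[1]) ?? null) : null);
  }
  return out;
};
```

The resulting array can then be zipped against the block elements in mammoth's HTML output by position, which is the "by document order" matching the commit message describes.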

html-to-docx silently drops <a href> children of <code>/<pre> tags (and of styled <span>s, but the <code> wrapper is the active offender here). The pad-export HTML produced by ep_headings2 + ep_align uses <code style='text-align:right'>...<a href>...</a>...</code> for each 'Code'-style line, which lost its links on every DOCX export.

Workaround: applyMonospaceToCode now drops the code/pre/tt/kbd/samp wrapper entirely. The non-anchor content gets wrapped in monospace spans; anchors are emitted unstyled so they keep their hyperlink. For block-level usage (<pre>, or <code> with an inline style attr) we emit a wrapping <p> and forward the text-align style. Run BEFORE wrapLooseLines so the <p> doesn't get double-wrapped.

Tests added:
- inline <code> -> just a styled span (no <code> wrapper)
- <code style='text-align:right'> -> <p style> wrap
- <pre> -> always block-wrapped
- <tt>/<kbd>/<samp> -> inline span only
- regression: <a href> inside <code> survives html-to-docx round-trip with both the URL in word/_rels/document.xml.rels AND a <w:hyperlink> in the document body

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Etherpad's HTML export wraps each pad line in <p>...</p> (or <h1>, <code>, etc.) and then appends a <br> between lines. The closing block tag already ends the line for contentcollector, so the trailing <br> is redundant -- and on import the server collector counts BOTH as line breaks, doubling every blank line between paragraphs and inserting an extra blank between adjacent headings.

Two fixes, both gated on the runtime block-element registry so they don't double-trigger when the underlying plugin already handles adjacency:

1. HTML import path now runs the new collapseRedundantBrAfterBlocks helper before setPadHTML. Drops a single <br> immediately following </p>/</h1-6>/</code>/</pre>/</div>/</blockquote>/</ul>/</ol>/</li>/</table>/</tr>/</td>/</th>. Multiple consecutive <br>s after a block keep all but the first (the rest still represent intentional blank lines).
2. The DOCX-import separateAdjacentHeadingBlocks workaround now checks whether 'h1' is in the runtime ccRegisterBlockElements set before inserting <br>s. When ep_headings2 has the new server hook (per ep_plugin_helpers#14 + the upcoming ep_headings2 PR), the workaround correctly stays out of the way -- otherwise it adds an extra blank line per heading transition.

Also fixed a subtle ts-check failure on the import.ts test changes and a leftover implicit-any in ImportDocxNative's alignment preserver.

Tests added:
- collapseRedundantBrAfterBlocks: 5 unit tests (each block tag, whitespace tolerance, multiple <br> keeping intentional blanks)
- HTML import: 'does not introduce a blank line between H1 and H2', 'preserves blank-line count between H1 and H2 (realistic shape)' reproduces the 5-blanks-where-2-expected bug from the user's round-trip pad.

1054 backend tests pass locally (the 6 failures are the pre-existing favicon/webaccess send@1.x dotfile-path issue from running under .claude/, doesn't reach CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
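
The "drop a single <br> after a closing block tag, keep the rest" rule from fix 1 can be sketched as a single replace. The function name carries a `Sketch` suffix because this is an illustration of the rule as described, not the PR's helper, and it leaves out the runtime block-element gating mentioned above.

```typescript
// A closing block tag, optional whitespace, then exactly one <br> (or <br/>).
// Only that first <br> is consumed; any further <br>s stay in the output.
const REDUNDANT_BR_RE =
    /(<\/(?:p|h[1-6]|code|pre|div|blockquote|ul|ol|li|table|tr|td|th)>)\s*<br\s*\/?>/gi;

const collapseRedundantBrAfterBlocksSketch = (html: string): string =>
  html.replace(REDUNDANT_BR_RE, '$1');

// The per-line separator disappears; two intentional blank lines survive.
const collapsed = collapseRedundantBrAfterBlocksSketch('<p>a</p> <br><br><br><p>b</p>');
```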

Etherpad core's import-side content collector keeps its own `_blockElems` set (`{div, p, pre, li}` by default), separate from the editor's. `aceRegisterBlockElements` registers tags on the editor side only -- so plugins built on `lineAttribute` / `tagAttribute` (`ep_headings2`, `ep_subscript_and_superscript`, etc.) tell the editor that `h1..h4` / `code` / `sub` / `sup` are block elements but DON'T tell the importer.

Symptom: round-tripping a pad with `<h1>` / `<h2>` / `<code>` lines through any HTML import path collapses adjacent heading-style blocks into a single pad line. This just bit native DOCX export+import in ether/etherpad#7568.

Fix: have `createLineAttribute` and `createTagAttribute` return a `ccRegisterBlockElements` function with the same tag set as `aceRegisterBlockElements`. Plugins re-export this from their `ep.json` under `"ccRegisterBlockElements"` to register the same tags on the import side.

Tests added for both factories.

Closes #13
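
In outline, the factory change amounts to returning the same tag list from two hooks. The sketch below is a simplified stand-in: `createLineAttributeSketch` and its single `tags` parameter are hypothetical, and the real `createLineAttribute` in ep_plugin_helpers takes different options and returns more hooks than shown.

```typescript
interface BlockElementHooks {
  // Editor side: which tags the editor treats as block elements.
  aceRegisterBlockElements: () => string[];
  // Import side: the same tags for the server-side content collector.
  ccRegisterBlockElements: () => string[];
}

// Hypothetical, reduced factory: both hooks are derived from one tag list so
// the editor and the importer cannot drift apart.
const createLineAttributeSketch = (tags: readonly string[]): BlockElementHooks => ({
  aceRegisterBlockElements: () => tags.slice(),
  ccRegisterBlockElements: () => tags.slice(),
});

const headingHooks = createLineAttributeSketch(['h1', 'h2', 'h3', 'h4', 'code']);
```

Deriving both hooks from a single list is the design point: a plugin that adds a tag later gets editor and importer behaviour updated together.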

* fix: read text-align from inline style on import

  When a pad is HTML-exported, ep_align's getLineHTMLForExport wraps content in <p style='text-align:...'> (and modifies <h1..h6> tags in-place to add the same style). Importing that HTML or any HTML that uses style='text-align:..' on block elements should re-apply the corresponding line attribute -- but collectContentPre was only reading the legacy <left>/<center>/<right>/<justify> tag names, so imports silently dropped alignment.

  Pick up text-align from the inline style attribute (etherpad core's contentcollector already passes the parsed style as context.styl) so a round-trip through HTML or DOCX preserves alignment.

  Closes ether/etherpad#7538 (alignment side of the round-trip)
  Refs: PR ether/etherpad#7568

* test: cover style-attribute alignment round-trip

  Add backend test cases that set pad HTML using <p style="text-align:..."> directly (the modern form) and assert the re-exported HTML preserves the alignment. Covers all four values: left, center, right, justify.

  Without the collectContentPre style-parsing fix in the previous commit, these would all fail because contentcollector was passing the inline style through context.styl but ep_align was only reading the legacy <left>/<center>/<right>/<justify> tag names.
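
Reading the alignment out of an inline style string could be done along these lines. This is a minimal sketch: `alignFromStyle` is a hypothetical helper, and it assumes the style arrives as a plain string such as the `context.styl` value mentioned above; how ep_align actually applies the resulting attribute is not shown.

```typescript
// text-align as its own declaration (start of string or after ';'), limited to
// the four values the commit message says are covered.
const TEXT_ALIGN_RE =
    /(?:^|;)\s*text-align\s*:\s*(left|center|right|justify)\s*(?:;|$)/i;

const alignFromStyle = (styl: string | null | undefined): string | null => {
  if (!styl) return null;
  const m = TEXT_ALIGN_RE.exec(styl);
  return m ? m[1].toLowerCase() : null;
};
```

Values outside the four supported keywords (for example `start`) yield `null`, so the caller can fall back to the legacy tag-name check.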

The two new HTML-import-adjacency tests assume ep_headings2 (or another plugin) has registered h1/h2 as server-side block elements via ccRegisterBlockElements. Without that, contentcollector treats <h1>/<h2> as inline and adjacent ones merge into a single pad line -- making the assertions inapplicable. CI's backend-tests job runs without plugins installed, so guard the describe block with a runtime hooks.callAll() check and skip when h1 isn't a registered block. Local dev with ep_headings2 (and the local plugin patch wiring ccRegisterBlockElements) still exercises both tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qodo-code-review · 2026-05-08T17:25:16Z

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

qodo-free-for-open-source-projects · 2026-05-08T17:25:58Z

Persistent review updated to latest commit 17bf820

SamTV12345 · 2026-05-08T17:27:07Z

+    // hand DOCX to html-to-docx and PDF to our pdfkit walker — both
+    // pure-JS, in-process. No fallback chain: native errors surface as
+    // 5xx so admins see real failures instead of silent shadowing.
+    const {sofficeAvailable} = require('../utils/Settings');


At some point we should migrate to ESM. It's such a pain in editors.

Noted, I'll add it to the list :)

qodo-free-for-open-source-projects · 2026-05-08T17:35:35Z

+    // Soffice-first dispatch (issue #7538). When soffice is configured
+    // we keep the legacy convert-via-tempfile path; when it's not, we
+    // hand DOCX to html-to-docx and PDF to our pdfkit walker — both
+    // pure-JS, in-process. No fallback chain: native errors surface as
+    // 5xx so admins see real failures instead of silent shadowing.
+    const {sofficeAvailable} = require('../utils/Settings');
+    const sofState = sofficeAvailable();
+    const goNative = sofState === 'no'
+        || (sofState === 'withoutPDF' && type === 'pdf');
+
+    if (goNative) {
+      const {
+        stripRemoteImages, extractBody, wrapLooseLines, dropEmptyBlocks,
+        applyMonospaceToCode,
+      } = require('../utils/ExportSanitizeHtml');
+      // The HTML pipeline returns a full document (head, style, body); the
+      // legacy soffice path renders that fine, but the in-process
+      // converters need just the body content to avoid leaking CSS into
+      // the output and to drop the document-level whitespace that creates
+      // stray paragraph breaks at the top of the result.
+      // dropEmptyBlocks strips heading-styled blank-line wrappers that
+      // ep_headings2 emits between every styled line.
+      const bodyHtml = dropEmptyBlocks(stripRemoteImages(extractBody(html)));
+      html = null;
+      try {
+        if (type === 'docx') {
+          // applyMonospaceToCode strips `<code>`/`<pre>`/`<tt>` wrappers
+          // (html-to-docx ignores them AND has a bug where it drops
+          // `<a href>` children of those tags) and emits styled
+          // monospace spans, forwarding any block-level alignment style
+          // to a wrapping `<p>`. Run BEFORE wrapLooseLines so the
+          // resulting `<p>` lands at the loose-line boundary instead
+          // of getting double-wrapped.
+          //
+          // wrapLooseLines then handles `<br>` semantics: bare `<br>`
+          // outside `<p>` becomes a soft break, `<br><br>` becomes a
+          // paragraph boundary plus blank-line markers.
+          const docxHtml = wrapLooseLines(applyMonospaceToCode(bodyHtml));
+          const htmlToDocx = require('html-to-docx');
+          const buf = await htmlToDocx(docxHtml);
+          res.contentType(
+              'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
+          res.send(buf);
+          return;
+        }
+        if (type === 'pdf') {
+          const {htmlToPdfBuffer} = require('../utils/ExportPdfNative');
+          const buf = await htmlToPdfBuffer(bodyHtml);
+          res.contentType('application/pdf');
+          res.send(buf);
+          return;
+        }
+        // soffice-only formats (odt, doc) are blocked at the route guard
+        // when soffice is null; reaching here means the guard is wrong.
+        res.status(500).send(`Cannot export ${type} without soffice configured`);
+        return;
+      } catch (err) {
+        console.error(
+            `native ${type} export failed for pad "${padId}":`,
+            err && (err as Error).stack ? (err as Error).stack : err);
+        res.status(500).send(`Failed to export pad as ${type}.`);
+        return;
+      }


1. Native docx/pdf lacks flag 📘 Rule violation ☼ Reliability

Native DOCX/PDF export (and DOCX import) is enabled automatically when settings.soffice is null, which is the documented default, so the feature is effectively enabled by default without a dedicated feature flag. This violates the requirement that new features be flag-gated and disabled by default to avoid unexpected behavior changes for existing deployments.

Agent Prompt

## Issue description
Native DOCX/PDF export (and DOCX import) is activated automatically when `settings.soffice` is `null` (the documented default), so the new feature is enabled by default and is not controlled by an explicit feature flag.

## Issue Context
PR Compliance ID 8 requires new features to be behind a feature flag and disabled by default, with pre-change behavior preserved when the flag is disabled.

## Fix Focus Areas
- src/node/handler/ExportHandler.ts[93-155]
- src/node/hooks/express/importexport.ts[39-43]
- src/static/js/pad_impexp.ts[147-160]
- doc/docker.md[200-200]

qodo-free-for-open-source-projects · 2026-05-08T17:35:35Z

+      ontext(text) {
+        if (skipDepth > 0) return;
+        // Collapse consecutive whitespace to a single space, the way an
+        // HTML renderer would. Without this, literal newlines and tabs in
+        // pretty-printed source HTML show up as runs of " " in the PDF.
+        const collapsed = text.replace(/[\s ]+/g, ' ');
+        if (collapsed === ' ') return;  // pure-whitespace runs are dropped
+        writeText(collapsed);


2. Pdf drops single spaces 🐞 Bug ≡ Correctness

htmlToPdfBuffer() drops any text node that collapses to a single space, but Etherpad’s HTML export can legitimately emit a standalone space between inline tags (e.g., style boundary around a space). This can cause words to concatenate in native PDF exports.

Agent Prompt

### Issue description
`src/node/utils/ExportPdfNative.ts` drops whitespace-only text nodes that collapse to a single space. In HTML, a single inter-tag space between inline elements is significant both semantically and for rendering; dropping it can merge adjacent words.

### Issue Context
Etherpad’s HTML export can produce inline-tag boundaries around characters (including spaces) due to attribute span transitions, and `_processSpaces()` preserves interior normal spaces.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[203-210]

### Implementation direction
Adjust the `ontext()` handling to *not* blindly drop `collapsed === ' '`. Suggested approach:
- Track whether the current PDF line/run is at a “text start” (or whether the last emitted character was whitespace).
- Emit a single space when it is needed to separate tokens (e.g., if the previous emitted character is non-whitespace and the next text will be non-whitespace), while still avoiding indentation/pretty-print whitespace.

A minimal improvement is to keep `collapsed === ' '` unless you are at the beginning of a line/run (immediately after `breakLine()`/`flushLine()`) or you already emitted a space.
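
The "minimal improvement" suggested above (keep a lone space unless at line start or directly after another space) can be sketched as a small state holder. `TextCollapser` is a hypothetical class written for illustration; it is not taken from `ExportPdfNative.ts`, and whether the PR adopted this direction is not stated here.

```typescript
// Tracks just enough state to decide whether a leading space is meaningful.
class TextCollapser {
  private atLineStart = true;
  private lastWasSpace = false;

  // Returns the text that should be written for one HTML text node.
  feed(text: string): string {
    let collapsed = text.replace(/\s+/g, ' ');
    // A leading space is redundant at the start of a line or after a space.
    if (this.atLineStart || this.lastWasSpace) collapsed = collapsed.replace(/^ /, '');
    if (collapsed === '') return '';
    this.atLineStart = false;
    this.lastWasSpace = collapsed.endsWith(' ');
    return collapsed;
  }

  // Call wherever the walker ends a line (e.g. from breakLine()).
  lineBreak(): void {
    this.atLineStart = true;
    this.lastWasSpace = false;
  }
}

// A standalone ' ' between two inline spans is kept, so words do not merge.
const tc = new TextCollapser();
const line = ['<b>Hello</b>', ' ', 'world'].map((t) => tc.feed(t.replace(/<[^>]+>/g, ''))).join('');
```

Pretty-print whitespace is still dropped because it arrives either right after a line break or right after another space, which are exactly the two cases the state flags cover.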

qodo-free-for-open-source-projects · 2026-05-08T17:35:35Z

+          case 'br':
+            breakLine();
+            break;


3. Pdf adds extra breaks 🐞 Bug ≡ Correctness

Etherpad’s HTML export appends a <br> after every line, including lines wrapped in block tags like <h1>/<h2>. The PDF walker inserts a line break on both </h1..> close and on every <br>, which adds extra vertical spacing for heading/block lines in native PDF exports.

Agent Prompt

### Issue description
Native PDF export can add extra vertical spacing when Etherpad emits block-level tags for a line (e.g., `<h1>...</h1>`) and still appends the per-line `<br>` separator. The PDF walker breaks on both the block close and the `<br>`.

### Issue Context
`ExportHtml` always appends `<br>` between pad lines. Headings are exported as `<h1>`/`<h2>` (block tags) when heading attributes are present.

### Fix Focus Areas
- src/node/utils/ExportPdfNative.ts[188-190]
- src/node/utils/ExportPdfNative.ts[223-227]
- (optional pre-processing alternative) src/node/handler/ExportHandler.ts[138-143]

### Implementation direction
Choose one (or combine):
1. **Walker-side fix:** In `onopentag('br')`, if `pendingNewline` is true (meaning a block close already ended the line), treat the `<br>` as a no-op (or just clear `pendingNewline` without calling `breakLine()`).
2. **Pre-processing fix:** Before calling `htmlToPdfBuffer()`, run the existing `collapseRedundantBrAfterBlocks()` sanitizer on `bodyHtml` so sequences like `</h1><br>` become `</h1>`.

ep_headings2 already registers h1-h4 + code as block elements in the editor (aceRegisterBlockElements). It did NOT register them on the server-side content collector (ccRegisterBlockElements), so any HTML import path treated those tags as inline -- and adjacent <h1>/<h2>/<code> blocks merged into a single pad line. Fix: re-export the new ccRegisterBlockElements function from ep_plugin_helpers' lineAttribute factory (added in ether/ep_plugin_helpers#14) and wire it up via ep.json. This treats h1-h4 + code as block elements on the import side too, matching their editor behavior. Bumps the ep_plugin_helpers minimum to ^0.5.2 (the version with the new export). Test added covers the regression: a pad with adjacent <h1> and <h2> (no separator) survives a setHTML/getHTML round-trip with each heading on its own line and no merged 'AlphaBeta' content. Refs ether/etherpad#7568 -- this is the missing piece for the DOCX/HTML round-trip story landing in core.

ep_font_color emits 'class="color:<name>"' on export (via inlineAttributeExport.getLineHTMLForExport). The import side, via inlineAttribute.collectContentPre, also reads the class attribute -- so an export → import round-trip works. But external HTML (Word/Docx imports via mammoth, pasted from a browser, etc.) uses the standard CSS form 'style="color:red"'. The inlineAttribute factory's collectContentPre does not touch context.styl, so any externally-pasted color was silently dropped. Read 'color:...' from the inline style attribute (etherpad core's contentcollector exposes it as context.styl) and apply the matching toolbar-palette value. Hex values that map to named palette colors are folded back to the name; everything else falls through. Refs ether/etherpad#7568 -- closes the inline-color hole in the DOCX/HTML round-trip story. Tests added: round-trip via 'style="color:<name>"' for red, green, blue.

ep_font_size emits 'class="font-size:<N>"' on export. The factory also reads class on import, so the export → import round-trip works for our own output. External HTML (Word/DOCX via mammoth, pasted markup) uses the standard CSS form 'style="font-size:14px"'. The factory does not touch context.styl, so any externally-pasted size was silently dropped. Read 'font-size:...' from the inline style attribute and snap the value to the nearest supported toolbar size. Handles px, pt, em, rem with light-touch unit conversion. Refs ether/etherpad#7568. Sister PRs: ep_align#183 (text-align, merged), ep_font_color#150 (color).
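
Snapping an arbitrary CSS size to the nearest supported value might look like the sketch below. Everything specific in it is an assumption: the `TOOLBAR_SIZES` list is invented for the example (the plugin's real size list is not given here), em/rem are converted against an assumed 16px base, and `snapFontSize` is a hypothetical name. Only the pt conversion (1pt = 4/3 px) is standard CSS.

```typescript
// Hypothetical toolbar sizes in px; the real ep_font_size list may differ.
const TOOLBAR_SIZES = [8, 10, 12, 14, 16, 18, 20, 24, 28, 32];

// Converts a CSS length to px, or null for units/keywords not handled here.
const toPx = (value: string): number | null => {
  const m = /^\s*(\d*\.?\d+)\s*(px|pt|em|rem)\s*$/i.exec(value);
  if (m == null) return null;
  const n = parseFloat(m[1]);
  switch (m[2].toLowerCase()) {
    case 'px': return n;
    case 'pt': return n * 4 / 3; // CSS: 1pt = 1/72in, 1px = 1/96in
    default: return n * 16;      // em/rem against an ASSUMED 16px base
  }
};

// Picks the supported size with the smallest absolute distance.
const snapFontSize = (value: string): number | null => {
  const px = toPx(value);
  if (px == null) return null;
  return TOOLBAR_SIZES.reduce(
      (best, s) => (Math.abs(s - px) < Math.abs(best - px) ? s : best));
};
```

Keyword sizes such as `large` return `null` in this sketch, which corresponds to leaving the size attribute unset rather than guessing.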

* fix: read font-family from inline style on HTML import

  ep_font_family stores the font as a custom tag (<fontarial>...) but its getLineHTMLForExport rewrites those tags into standard CSS '<span style="font-family:arial">...</span>' on export. The import side, via tagAttribute.collectContentPre, only looks for the tag form -- so any round-trip through HTML or DOCX silently lost the font.

  Read 'font-family:...' from the inline style attribute (etherpad core's contentcollector exposes context.styl), normalize the value back to one of the toolbar tag names ('font' + lowercase + spaces to hyphens), and apply the matching attribute. Handles quoted values, the first font in a fallback list, and 'monospace' as a generic family.

  Refs ether/etherpad#7568. Sister PRs: ep_align#183 (text-align, merged), ep_font_color#150, ep_font_size#132.

  Tests added: round-trip via 'style="font-family:<value>"' for Arial, 'Times New Roman', courier.

* test: drop unused expectedRe loop variable

  CodeQL flagged the unused destructured param. The actual regex is built inline below; the array slot wasn't doing anything.
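
The normalization rule quoted above ('font' + lowercase + spaces to hyphens, first entry of a fallback list, quotes stripped) can be sketched as follows. `fontTagFromStyle` is a hypothetical helper; checking the result against the plugin's actual list of toolbar fonts is left out.

```typescript
// Extracts the first font-family entry from an inline style string and turns
// it into a tag-style name such as 'fontarial' or 'fonttimes-new-roman'.
const fontTagFromStyle = (styl: string): string | null => {
  const m = /(?:^|;)\s*font-family\s*:\s*([^;]+)/i.exec(styl);
  if (m == null) return null;
  const first = m[1]
      .split(',')[0]               // first font in a fallback list
      .trim()
      .replace(/^['"]|['"]$/g, '') // strip surrounding quotes
      .trim();
  if (first === '') return null;
  return 'font' + first.toLowerCase().replace(/\s+/g, '-');
};
```

For `font-family:Arial` this yields `fontarial`, matching the custom tag form mentioned in the commit message.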

Native DOCX export, PDF export, and DOCX import shipped in #7568 via pure-JS in-process converters -- LibreOffice/soffice is no longer required for those formats. Stale comments in settings.json.template and settings.json.docker still implied otherwise ("will only allow plain text and HTML import/exports"), and the docker docs told users to configure soffice for DOCX as well.

Update them to match what's actually in core:
- soffice present: handles all office formats (existing behavior)
- soffice null: docx export, pdf export, docx import work natively; odt/doc/rtf export and pdf import still need soffice

Touches:
- settings.json.template (soffice + docxExport comments)
- settings.json.docker (same)
- doc/docker.md ("Office-format import/export" section)
- doc/docker.adoc (same section + the SOFFICE table row, matching what doc/docker.md already says since #7568)

No code changes, no behavior change -- documentation only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qodo-free-for-open-source-projects Bot reviewed Apr 20, 2026

JohnMcLear force-pushed the feat/native-docx-export-7538 branch from 7e5a73c to b98dfba Compare April 20, 2026 08:44

JohnMcLear marked this pull request as draft April 26, 2026 19:02

JohnMcLear and others added 4 commits May 8, 2026 14:36

JohnMcLear force-pushed the feat/native-docx-export-7538 branch from 2d7995e to cec693f Compare May 8, 2026 13:37

JohnMcLear and others added 9 commits May 8, 2026 14:37

github-advanced-security AI found potential problems May 8, 2026

Comment thread src/tests/backend/specs/export.ts Fixed

JohnMcLear requested a review from SamTV12345 May 8, 2026 14:06

github-advanced-security AI found potential problems May 8, 2026

Comment thread src/node/utils/ExportSanitizeHtml.ts Fixed

JohnMcLear and others added 3 commits May 8, 2026 15:33

JohnMcLear mentioned this pull request May 8, 2026

lineAttribute() should also expose ccRegisterBlockElements (server-side) ether/ep_plugin_helpers#13

Closed

This was referenced May 8, 2026

fix: read text-align from inline style on HTML import ether/ep_align#183

Merged

feat(attributes): expose ccRegisterBlockElements ether/ep_plugin_helpers#14

Merged

JohnMcLear and others added 3 commits May 8, 2026 16:23

JohnMcLear marked this pull request as ready for review May 8, 2026 17:25

SamTV12345 reviewed May 8, 2026

JohnMcLear mentioned this pull request May 8, 2026

feat: register heading tags as server-side block elements ether/ep_headings2#173

Merged

SamTV12345 approved these changes May 8, 2026

JohnMcLear mentioned this pull request May 8, 2026

fix: read text-color from inline style on HTML import ether/ep_font_color#150

Merged

JohnMcLear merged commit c47ffd5 into ether:develop May 8, 2026
21 checks passed

JohnMcLear mentioned this pull request May 8, 2026

fix: read font-size from inline style on HTML import ether/ep_font_size#132

Merged

qodo-free-for-open-source-projects Bot reviewed May 8, 2026

JohnMcLear mentioned this pull request May 8, 2026

fix: read font-family from inline style on HTML import ether/ep_font_family#143

Merged

JohnMcLear mentioned this pull request May 9, 2026

docs(7538): soffice is now optional for docx/pdf #7707

Merged

3 participants
