fix: preserve line breaks when converting HTML to markdown #79
pchuri merged 1 commit into pchuri:main
Pull request overview
This PR fixes Markdown output formatting issues when converting Confluence storage-format HTML to Markdown, specifically preserving multi-line paragraph content and ensuring block-level elements don’t concatenate without blank lines.
Changes:
- Wrap Confluence code-macro conversions in surrounding newlines so adjacent blocks naturally separate.
- Update `<p>` conversion to use the dotAll regex flag to preserve paragraph content containing embedded newlines, and emit surrounding newlines.
- Add unit tests covering block separation (code/mermaid/lists/tables) and multi-line paragraph preservation for both `storageToMarkdown()` and `htmlToMarkdown()`.
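
To illustrate the dotAll change, here is a minimal sketch. The two patterns are simplified, hypothetical stand-ins for the `<p>` regex and are not copied from `lib/confluence-client.js`; they only show how the `s` flag changes what `.` matches.

```javascript
// Hypothetical, simplified patterns (assumption: the real <p> regex is more involved).
const withoutDotAll = /<p>(.*?)<\/p>/g;
const withDotAll = /<p>(.*?)<\/p>/gs;

const html = '<p>first line\nsecond line</p>';

// Without the `s` flag, `.` does not match "\n", so the pattern never
// matches this paragraph and the input comes back unchanged.
const before = html.replace(withoutDotAll, '\n$1\n');

// With the `s` flag, `.` also matches "\n", so the whole paragraph body
// is captured and wrapped in surrounding newlines.
const after = html.replace(withDotAll, '\n$1\n');

console.log(JSON.stringify(before));
console.log(JSON.stringify(after));
```

In this sketch the unflagged pattern simply fails to match; what happens to the unmatched `<p>` afterward depends on the later passes in the real pipeline.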
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/confluence-client.test.js | Adds regression tests for multi-line paragraphs and blank-line separation between block elements. |
| lib/confluence-client.js | Adjusts code-macro and paragraph conversions to preserve line breaks and introduce blank-line separation. |
      // Convert Confluence code macros to markdown
      markdown = markdown.replace(/<ac:structured-macro ac:name="code"[^>]*>[\s\S]*?<ac:parameter ac:name="language">([^<]*)<\/ac:parameter>[\s\S]*?<ac:plain-text-body><!\[CDATA\[([\s\S]*?)\]\]><\/ac:plain-text-body>[\s\S]*?<\/ac:structured-macro>/g, (_, lang, code) => {
    -   return `\`\`\`${lang}\n${code}\n\`\`\``;
    +   return `\n\`\`\`${lang}\n${code}\n\`\`\`\n`;
      });

      // Convert code macros without language parameter
      markdown = markdown.replace(/<ac:structured-macro ac:name="code"[^>]*>[\s\S]*?<ac:plain-text-body><!\[CDATA\[([\s\S]*?)\]\]><\/ac:plain-text-body>[\s\S]*?<\/ac:structured-macro>/g, (_, code) => {
    -   return `\`\`\`\n${code}\n\`\`\``;
    +   return `\n\`\`\`\n${code}\n\`\`\`\n`;
      });
I think this is out of the change's scope.
pchuri left a comment
Thanks for the PR! The dotAll flag fix on the <p> regex is a great catch — silently dropping multi-line paragraph content was a subtle but impactful bug. The test coverage is thorough too, with both per-element and complex integration cases.
A few observations:
1. Inconsistent block separation for lists and tables
Code blocks and <p> now emit \n…\n, but <ul>, <ol>, and <table> still use only a leading \n (e.g. '\n' + listItems). This works today because the preceding <p> contributes its trailing \n, but if two block elements appear back-to-back without a <p> in between (e.g. a list immediately followed by a table), there won't be a blank line separating them. Applying the same \n…\n pattern to all block elements would make the output more robust and the code more consistent.
2. Code block content can be mutated by htmlToMarkdown()
(Also flagged by Copilot) storageToMarkdown() converts code macros into fenced Markdown blocks before htmlToMarkdown() runs its catch-all HTML tag stripping (/<(?!\/?(details|summary)\b)[^>]+>/g). This means any <div>, <span>, etc. inside code examples will be silently removed. Not necessarily in scope for this PR, but worth a follow-up — e.g. replacing fenced blocks with placeholder tokens before the HTML strip pass and restoring them afterward.
3. Minor: leading \n on first <p>
Adding a leading \n to every <p> means the very first element produces an extra newline at the start of the output. The final markdown.trim() handles this, so there's no user-visible issue — just something to be aware of in the intermediate state.
Overall this is a solid fix. Once the list/table block separation consistency (item 1) is addressed (or confirmed acceptable), this looks good to merge.
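
The placeholder idea from observation 2 could look roughly like the sketch below. The helper names `protectFencedBlocks` and `restoreFencedBlocks` and the `@@CODE_BLOCK_n@@` token format are hypothetical and do not exist in the project; only the tag-stripping regex is taken from the comment above.

```javascript
// Build the fence string programmatically so this example stays easy to embed.
const FENCE = '`'.repeat(3);

// Hypothetical helper: swap each fenced block for a token that contains no "<" or ">".
function protectFencedBlocks(markdown) {
  const blocks = [];
  const fencePattern = new RegExp(FENCE + '[\\s\\S]*?' + FENCE, 'g');
  const text = markdown.replace(fencePattern, (block) => {
    blocks.push(block);
    return `@@CODE_BLOCK_${blocks.length - 1}@@`;
  });
  return { text, blocks };
}

// Hypothetical helper: put the saved blocks back in place of their tokens.
function restoreFencedBlocks(text, blocks) {
  return text.replace(/@@CODE_BLOCK_(\d+)@@/g, (_, index) => blocks[Number(index)]);
}

const input = 'Intro <span>text</span>\n' + FENCE + 'html\n<div>kept</div>\n' + FENCE + '\n';

const { text, blocks } = protectFencedBlocks(input);
// The catch-all strip quoted in the review now only sees non-code content.
const stripped = text.replace(/<(?!\/?(details|summary)\b)[^>]+>/g, '');
const output = restoreFencedBlocks(stripped, blocks);

console.log(output);
```

The `<span>` outside the fence is stripped while the `<div>` inside it survives. A real implementation would also need a token that cannot collide with page content, which this sketch does not attempt.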
Thanks for the thorough review! Per observation 1, the implicit trailing […]

Observations 2 and 3 are both valid, but they're beyond the scope of this fix and would warrant a more substantial restructuring of the conversion pipeline. Happy to track them as separate issues if that's useful.
pchuri left a comment
Good point on the list items — the implicit trailing newline does give us the same shape, so no change needed there. And agreed on 2 & 3 being separate concerns. LGTM!
## [1.27.4](v1.27.3...v1.27.4) (2026-03-19)

### Bug Fixes

* preserve line breaks when converting HTML to markdown ([#79](#79)) ([c39f388](c39f388))
🎉 This PR is included in version 1.27.4 🎉

Your semantic-release bot 📦🚀
Description
Fixes content being dropped and block elements running together when converting Confluence storage format to markdown (`read --format markdown`).

Root causes:
- `<p>` regex was missing the `s` (dotAll) flag, so paragraph content with embedded newlines was silently dropped

Fix: each block element now emits `\n…\n`, so adjacent blocks naturally produce a blank line between them.
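
A small sketch of why the `\n…\n` convention yields blank lines between blocks. The three strings are made-up stand-ins for converted block output, not actual results from `storageToMarkdown()`.

```javascript
// Hypothetical converted blocks, each wrapped in a leading and trailing newline.
const paragraph = '\nSome text\n';
const list = '\n- one\n- two\n';
const table = '\n| a |\n|---|\n| 1 |\n';

// The trailing "\n" of one block meets the leading "\n" of the next,
// which gives "\n\n": exactly one blank line between adjacent blocks.
// The final trim() drops the extra newline at the very start and end.
const joined = (paragraph + list + table).trim();

console.log(joined);
```

Because every block carries both newlines, the separation does not depend on which kind of block happens to come before or after it.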