
fix(parity): DOM-aware HTML extraction and heading-line protection #92

Merged
dacharyc merged 1 commit into main from fix/markdown-content-party-dom-aware-rewrite
May 4, 2026
dacharyc commented May 4, 2026

Summary

Closes #90 #91.

Test plan

  • 2 new unit tests in markdown-content-parity.test.ts (one per issue)
  • Full unit suite: 1294 tests pass
  • npm run lint clean, tsc clean
  • Compared per-page parity output across 20 real sites (baseline vs. post-fix); zero regressions, four pages on dacharycarey.com improved

Replaces the flat-text + regex pipeline in extractHtmlText with a DOM
walker that re-parses <pre> rawText to expose syntax-highlighter markup
as DOM nodes. This eliminates the inline `<code>` / `<main>` / `<title>`
ambiguity that issue #90 reported: tag mentions in prose now flow
through as literal text instead of being deleted by the tag-stripping
regex. The HTML_TAG_NAMES set is no longer needed.
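
A minimal sketch of why walking text nodes avoids the ambiguity, assuming a hand-built tree in place of real parser output. The `DomNode` shape, `collectText`, `stripTagsWithRegex`, and `SKIPPED_TAGS` are hypothetical names for illustration and are not taken from `extractHtmlText`:

```typescript
// Hypothetical minimal DOM shape; a real HTML parser would produce these nodes.
type TextNode = { kind: "text"; text: string };
type ElementNode = { kind: "element"; tag: string; children: DomNode[] };
type DomNode = TextNode | ElementNode;

// Assumed set of elements whose content should never count as page text.
const SKIPPED_TAGS = new Set(["script", "style"]);

// Walk the tree and concatenate text-node content.
// Tags are structure here, so text that merely looks like a tag is left alone.
function collectText(node: DomNode): string {
  if (node.kind === "text") return node.text;
  if (SKIPPED_TAGS.has(node.tag)) return "";
  return node.children.map(collectText).join("");
}

// Flat-text approach for contrast: delete anything shaped like a tag.
function stripTagsWithRegex(decodedText: string): string {
  return decodedText.replace(/<\/?[a-zA-Z][^>]*>/g, "");
}

// Stands in for: <p>Wrap content in <code>&lt;main&gt;</code> for parity.</p>
const paragraph: DomNode = {
  kind: "element",
  tag: "p",
  children: [
    { kind: "text", text: "Wrap content in " },
    { kind: "element", tag: "code", children: [{ kind: "text", text: "<main>" }] },
    { kind: "text", text: " for parity." },
  ],
};

// The walker keeps the literal tag mention.
const walked = collectText(paragraph);

// The regex cannot tell a prose mention from markup once entities are decoded.
const flattened = stripTagsWithRegex("Wrap content in <main> for parity.");
```

Under these assumptions, `walked` keeps the `<main>` mention while `flattened` loses it, which is the kind of false "missing" segment issue #90 describes.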

Adds heading-line placeholder protection in extractMarkdownText
(restored after list-marker strips) so leading "1. " in numbered
headings like "### 1. How well..." is preserved instead of being
stripped as a numbered-list marker (issue #91).
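
The placeholder approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `extractMarkdownText`: the function name, the sentinel format, and the marker regexes are all hypothetical.

```typescript
// Hypothetical sentinel delimiters; real code would use a token that cannot occur in content.
const PLACEHOLDER_PREFIX = "\u0000HEADING_";
const PLACEHOLDER_SUFFIX = "\u0000";

function stripListMarkersKeepingHeadings(markdown: string): string {
  const headings: string[] = [];

  // 1. Replace each ATX heading line with a placeholder and remember its text,
  //    so the list-marker strips below never see "1. " at the start of a line.
  const protectedText = markdown.replace(
    /^#{1,6}[ \t]+(.*)$/gm,
    (_match, text: string) => {
      headings.push(text.trim());
      return `${PLACEHOLDER_PREFIX}${headings.length - 1}${PLACEHOLDER_SUFFIX}`;
    },
  );

  // 2. Strip numbered and bulleted list markers from the remaining lines.
  const stripped = protectedText
    .replace(/^[ \t]*\d+\.[ \t]+/gm, "")
    .replace(/^[ \t]*[-*+][ \t]+/gm, "");

  // 3. Restore the saved heading text in place of each placeholder.
  return stripped.replace(
    /\u0000HEADING_(\d+)\u0000/g,
    (_match, index: string) => headings[Number(index)] ?? "",
  );
}

const input = [
  "### 1. How well does it work?",
  "1. First item",
  "- Second item",
].join("\n");

const output = stripListMarkersKeepingHeadings(input);
```

With this input, the heading keeps its leading "1. " while the two list items lose their markers. The ordering matters: protection has to happen before the strips, and restoration after them.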

Validated against 20 doc sites from PARITY-CHECK-NOTES.md: 3 sites
improved (mongodb 8% to 2%, resend 4% to 1%, posthog warn to pass), 0
regressed. Issue repro page (audit-conclusions) goes from 6 missing
segments / warn to 0 missing / pass.

Closes #90 #91
dacharyc merged commit 34b651b into main on May 4, 2026
2 checks passed
dacharyc deleted the fix/markdown-content-party-dom-aware-rewrite branch on May 4, 2026 02:38
Development

Successfully merging this pull request may close these issues.

markdown-content-parity: inline <tag> code spans get text-stripped, causing false 'missing'