Replaces the flat-text + regex pipeline in extractHtmlText with a DOM walker that re-parses <pre> rawText to expose syntax-highlighter markup as DOM nodes. This eliminates the inline `<code>` / `<main>` / `<title>` ambiguity that issue #90 reported: tag mentions in prose now flow through as literal text instead of being deleted by the tag-stripping regex. The HTML_TAG_NAMES set is no longer needed.

Adds heading-line placeholder protection in extractMarkdownText (restored after list-marker strips) so leading "1. " in numbered headings like "### 1. How well..." is preserved instead of being stripped as a numbered-list marker (issue #91).

Validated against 20 doc sites from PARITY-CHECK-NOTES.md: 3 sites improved (mongodb 8% to 2%, resend 4% to 1%, posthog warn to pass), 0 regressed. Issue repro page (audit-conclusions) goes from 6 missing segments / warn to 0 missing / pass.

Closes #90 #91
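To illustrate the #90 ambiguity, here is a minimal sketch contrasting a flat-text tag-stripping regex with a walk over text nodes. The `SketchNode` type, both function names, and the regex are hypothetical stand-ins chosen for this example; they are not the PR's actual code, and the sketch does not cover the `<pre>` re-parse step.

```typescript
// Hypothetical minimal node shape (an assumption for this sketch);
// the real code walks whatever DOM its HTML parser produces.
type SketchNode =
  | { kind: "text"; text: string }
  | { kind: "element"; tag: string; children: SketchNode[] };

// Flat-text approach: once entities are decoded into one string,
// a tag *mention* is indistinguishable from real markup.
function stripTagsWithRegex(flatText: string): string {
  return flatText.replace(/<\/?[a-zA-Z][^>]*>/g, "");
}

// DOM-walker approach: only text nodes contribute, and their
// content is never re-interpreted as markup.
function collectText(node: SketchNode): string {
  if (node.kind === "text") return node.text;
  return node.children.map(collectText).join("");
}

// Tree for: <p>Use the <code>&lt;main&gt;</code> element.</p>
const paragraph: SketchNode = {
  kind: "element",
  tag: "p",
  children: [
    { kind: "text", text: "Use the " },
    { kind: "element", tag: "code", children: [{ kind: "text", text: "<main>" }] },
    { kind: "text", text: " element." },
  ],
};

// The regex deletes the mention; the walker keeps it as literal text.
const viaRegex = stripTagsWithRegex("Use the <main> element.");
const viaWalker = collectText(paragraph);
console.log(JSON.stringify(viaRegex), JSON.stringify(viaWalker));
```

In this sketch the regex path loses the `<main>` mention, which is the kind of loss that would show up as a false "missing" segment, while the walker path returns the sentence unchanged.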
Summary
- Replaces the flat-text + regex pipeline in `extractHtmlText` with a DOM walker that re-parses `<pre>` rawText to expose syntax-highlighter markup as DOM nodes. This eliminates the inline `<code>` / `<main>` / `<title>` ambiguity reported in #90 ("markdown-content-parity: inline `<tag>` code spans get text-stripped, causing false 'missing'"): tag mentions in prose now flow through as literal text instead of being deleted by the tag-stripping regex. `HTML_TAG_NAMES` is removed.
- Adds heading-line placeholder protection in `extractMarkdownText` (restored after list-marker strips) so the leading `1.` in numbered headings like `### 1. How well...` is preserved (#91, "markdown-content-parity: numbered-list regex strips leading '1. ' from headings, causing false 'missing'").
- Validated against 20 doc sites from `PARITY-CHECK-NOTES.md`: 3 sites improved (mongodb 8%→2%, resend 4%→1%, posthog warn→pass), 0 regressed. The issue-repro page (audit-conclusions) goes from 6 missing segments / warn to 0 missing / pass.

Closes #90 #91.
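The heading-line placeholder protection for #91 can be pictured roughly as follows. This is a sketch under stated assumptions: the function name, the `@@HEADING_n@@` placeholder format, and both regexes are invented for illustration and may differ from what `extractMarkdownText` actually does.

```typescript
// Hypothetical helper (not the PR's code): protect heading lines with
// placeholders, strip list markers, then restore the heading text.
function stripListMarkersKeepingHeadings(markdown: string): string {
  const headings: string[] = [];

  // 1. Swap each ATX heading line for a placeholder, remembering its text.
  const protectedLines = markdown.split("\n").map((line) => {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (!match) return line;
    headings.push(match[1] ?? "");
    return `@@HEADING_${headings.length - 1}@@`;
  });

  // 2. Strip bullet and numbered-list markers. Placeholders start with "@",
  //    so a heading's leading "1. " is out of the regex's reach here.
  const stripped = protectedLines.map((line) =>
    line.replace(/^\s*(?:[-*+]|\d+\.)\s+/, "")
  );

  // 3. Restore the heading text in place of each placeholder.
  return stripped
    .map((line) =>
      line.replace(/^@@HEADING_(\d+)@@$/, (_m, i: string) => headings[Number(i)] ?? "")
    )
    .join("\n");
}

const sample = "### 1. How well does it work?\n1. first item\n- second item";
const result = stripListMarkersKeepingHeadings(sample);
console.log(result);
```

Running the strip between the protect and restore steps is what keeps the heading's `1.` intact while genuine list items still lose their markers.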
Test plan
- Tests in `markdown-content-parity.test.ts` (one per issue)
- `npm run lint` clean, `tsc` clean