fix(parity): DOM-aware HTML extraction and heading-line protection by dacharyc · Pull Request #92 · agent-ecosystem/afdocs

dacharyc · 2026-05-04T02:34:15Z

Summary

Replace the flat-text + regex pipeline in extractHtmlText with a DOM walker that re-parses <pre> rawText to expose syntax-highlighter markup as DOM nodes. This eliminates the inline `<code>` / `<main>` / `<title>` ambiguity reported in #90 (markdown-content-parity: inline <tag> code spans get text-stripped, causing false 'missing'): tag mentions in prose now flow through as literal text instead of being deleted by the tag-stripping regex. HTML_TAG_NAMES is removed.
Add heading-line placeholder protection in extractMarkdownText: heading lines are swapped for placeholders before the list-marker strips and restored afterwards, so the leading 1. in numbered headings like ### 1. How well... is preserved (#91, markdown-content-parity: numbered-list regex strips leading '1. ' from headings, causing false 'missing').
Validated against the 20 doc sites tracked in PARITY-CHECK-NOTES.md: 3 sites improved (mongodb 8%→2%, resend 4%→1%, posthog warn→pass), 0 regressed. The issue-repro page (audit-conclusions) goes from 6 missing segments / warn to 0 missing / pass.
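
For readers who want to see the heading-line protection idea in isolation, here is a minimal sketch. It is not the code from this PR: the function name, the placeholder format, and both regexes are assumptions made for illustration. It shows only the general pattern of shielding heading lines, stripping list markers, then restoring the headings.

```typescript
// Hypothetical sketch of heading-line placeholder protection.
// Names, regexes, and the placeholder format are illustrative assumptions,
// not the actual extractMarkdownText implementation.
function stripListMarkersKeepingHeadings(markdown: string): string {
  const headings: string[] = [];

  // Step 1: swap each ATX heading line for a placeholder that records its index.
  const shielded = markdown.replace(/^#{1,6}[ \t]+(.*)$/gm, (_line, text: string) => {
    headings.push(text);
    return `@@HEADING_${headings.length - 1}@@`;
  });

  // Step 2: strip numbered and bulleted list markers.
  // Placeholder lines start with "@", so they never match here.
  const stripped = shielded.replace(/^[ \t]*(?:\d+\.|[-*+])[ \t]+/gm, "");

  // Step 3: restore the heading text exactly as it was captured.
  return stripped.replace(/@@HEADING_(\d+)@@/g, (_m, i: string) => headings[Number(i)] ?? "");
}

const sample = "### 1. How well does it work\n\n1. first item\n- second item";
console.log(stripListMarkersKeepingHeadings(sample));
```

Without step 1, a pipeline that drops the leading ### and then applies the list-marker regex would also drop the "1. " from the heading, which is the failure described in #91.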

Closes #90 #91.

Test plan

2 new unit tests in markdown-content-parity.test.ts (one per issue)
Full unit suite: 1294 tests pass
npm run lint clean, tsc clean
Compared per-page parity output across 20 real sites (baseline vs. post-fix); zero regressions, four pages on dacharycarey.com improved

Replaces the flat-text + regex pipeline in extractHtmlText with a DOM walker that re-parses <pre> rawText to expose syntax-highlighter markup as DOM nodes. This eliminates the inline `<code>` / `<main>` / `<title>` ambiguity that issue #90 reported: tag mentions in prose now flow through as literal text instead of being deleted by the tag-stripping regex. The HTML_TAG_NAMES set is no longer needed.

Adds heading-line placeholder protection in extractMarkdownText (restored after list-marker strips) so leading "1. " in numbered headings like "### 1. How well..." is preserved instead of being stripped as a numbered-list marker (issue #91).

Validated against 20 doc sites from PARITY-CHECK-NOTES.md: 3 sites improved (mongodb 8% to 2%, resend 4% to 1%, posthog warn to pass), 0 regressed. Issue repro page (audit-conclusions) goes from 6 missing segments / warn to 0 missing / pass.

Closes #90 #91.
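
The tag-mention ambiguity can be reproduced in a few lines. The sketch below is an assumption about how such a bug can arise, not a copy of the old or new extractHtmlText: it assumes the flat-text approach decoded entities before stripping tags, and it stands in for a real DOM walker with a simple split on markup. All function names are hypothetical.

```typescript
// Hypothetical illustration of the ambiguity behind #90.
// Neither function is the afdocs implementation; the "structure-first"
// version uses a plain split as a stand-in for a real DOM walker.
function decodeEntities(s: string): string {
  // Decode &amp; last so "&amp;lt;" becomes "&lt;" rather than "<".
  return s.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&amp;/g, "&");
}

const TAG = /<\/?[a-zA-Z][^>]*>/;

// Flat-text style: once entities are decoded, an escaped "<main>" in prose
// is indistinguishable from real markup, so the tag-stripping regex deletes it.
function flatTextExtract(html: string): string {
  return decodeEntities(html).replace(new RegExp(TAG.source, "g"), "");
}

// Structure-first style: separate markup from text while entities are still
// encoded, then decode only the text pieces.
function structureFirstExtract(html: string): string {
  return html.split(TAG).map(decodeEntities).join("");
}

const html = "<p>Wrap content in <code>&lt;main&gt;</code> tags.</p>";
console.log(JSON.stringify(flatTextExtract(html)));
console.log(JSON.stringify(structureFirstExtract(html)));
```

The first call loses the tag mention and the second keeps it as literal text. A real DOM walker gives the same guarantee more robustly, because text nodes are already separated from element nodes before any decoding happens.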

dacharyc merged commit 34b651b into main May 4, 2026
2 checks passed

dacharyc deleted the fix/markdown-content-party-dom-aware-rewrite branch May 4, 2026 02:38
