
fix(web): replace regex HTML sanitization with sanitize-html #171

Merged
williamzujkowski merged 1 commit into main from fix/codeql-html-sanitizer on May 12, 2026

@williamzujkowski
Collaborator

Summary

Closes the remaining 15 CodeQL findings in apps/web/src/lib/github.ts:

  • 12 × js/incomplete-multi-character-sanitization
  • 2 × js/bad-tag-filter
  • 1 × js/double-escaping

All three rules flag the same underlying problem: regex-based HTML stripping can be bypassed. sanitize-html (v2.17) uses a real HTML parser (htmlparser2), so malformed or nested tags and encoded entities cannot escape the allowlist.
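
To make the bypass concrete, here is a minimal sketch of the failure mode behind js/incomplete-multi-character-sanitization. The function naiveStripScript is hypothetical and is not the code removed from github.ts; it only illustrates why a single-pass regex replace is unsafe.

```typescript
// Hypothetical single-pass stripper, NOT the code removed from github.ts.
// It deletes every literal "<script>" it can see in one pass.
function naiveStripScript(input: string): string {
  return input.replace(/<script>/gi, "");
}

// The attacker nests the tag inside itself. Removing the inner match
// splices the surrounding fragments together into a fresh "<script>".
const payload = "<scr<script>ipt>alert(1)";
const result = naiveStripScript(payload);

console.log(result); // "<script>alert(1)"
```

A parser-based sanitizer avoids this class of bug because it tokenizes the input once and then decides per element, instead of editing the string and hoping no new tag appears.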

Two configs preserve existing contracts

  • sanitizeContent → allowedTags: [] (plain text). Used by DiffViewer.svelte for content fetched from the us-code repo via the GitHub API.
  • sanitizeExcerpt → allowedTags: ['mark'] (Pagefind highlight wrapper). Rendered via {@html ...} in SearchBar.svelte.
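
The two configurations could look roughly like the option objects below. This is a sketch, not the code from the PR: only the allowedTags values come from the description above. The constant names, the local SanitizerOptions type, and the empty allowedAttributes maps are assumptions added for illustration.

```typescript
// Structural subset of the sanitize-html options, declared locally so the
// sketch runs without the dependency. The real option type ships with
// @types/sanitize-html.
interface SanitizerOptions {
  allowedTags: string[];
  allowedAttributes: Record<string, string[]>;
}

// Plain text: no tags survive. (allowedAttributes: {} is an assumption.)
const plainTextOptions: SanitizerOptions = {
  allowedTags: [],
  allowedAttributes: {},
};

// Excerpts: only the <mark> highlight wrapper survives, with no attributes.
// (allowedAttributes: {} is an assumption.)
const excerptOptions: SanitizerOptions = {
  allowedTags: ["mark"],
  allowedAttributes: {},
};

console.log(plainTextOptions.allowedTags.length); // 0
console.log(excerptOptions.allowedTags); // [ 'mark' ]
```

Each object would be passed as the second argument to the library's sanitizeHtml(input, options) call. Keeping the two option sets separate means the {@html ...} call site in SearchBar.svelte gets the narrowest allowlist that still renders Pagefind highlights.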

Test change worth flagging

One test changed its expectation. The prior regex decoded entities such as &lt;script&gt;... and then stripped the resulting tags, whereas sanitize-html preserves entities as entities. That is the safer behavior: entity-encoded text renders as literal text and is never parsed as HTML. The test now asserts that the output contains no raw < or > characters rather than matching the specific decoded form, which captures the security property without coupling to the legacy regex's "decode then strip" pattern.
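
The property the updated test checks can be sketched as follows. The helper hasRawAngleBrackets and the sample strings are hypothetical; the actual test in the repository may be written differently.

```typescript
// Hypothetical helper: true when the string contains a literal "<" or ">".
function hasRawAngleBrackets(output: string): boolean {
  return /[<>]/.test(output);
}

// Entity-encoded markup, as a sanitizer that preserves entities returns it.
const preserved = "&lt;script&gt;alert(1)&lt;/script&gt;";

// Illustration of a "decode first" step: entities become real markup again,
// so everything then depends on the strip step that follows.
const decoded = preserved.replace(/&lt;/g, "<").replace(/&gt;/g, ">");

console.log(hasRawAngleBrackets(preserved)); // false
console.log(hasRawAngleBrackets(decoded)); // true
```

Asserting on the absence of raw angle brackets holds for any sanitizer that keeps the output inert, so the test does not need to change if the exact entity handling changes again.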

Test plan

  • pnpm install resolves cleanly (added: sanitize-html 2.17.3 + @types/sanitize-html 2.16.1)
  • pnpm build (Astro SSG + Pagefind index) passes — sanitize-html survives the SSR/bundle path
  • pnpm test (269 tests) green
  • pnpm typecheck clean
  • pnpm lint clean
  • CI build passes
  • Post-merge CodeQL re-scan: 15 alerts in github.ts should clear

🤖 Generated with Claude Code

Closes the remaining CodeQL findings in apps/web/src/lib/github.ts:
12 × js/incomplete-multi-character-sanitization, 2 × js/bad-tag-filter,
1 × js/double-escaping — all flagging the fundamental fact that
regex-based HTML stripping is bypass-able.

`sanitize-html` (v2.17) uses a real HTML parser (htmlparser2) so
malformed/nested tags and encoded entities can't escape the allowlist.
Two configurations preserve current call sites' contracts:

- sanitizeContent → allowedTags: [] (plain text, used in DiffViewer
  for content fetched via GitHub API).
- sanitizeExcerpt → allowedTags: ['mark'] (Pagefind highlight wrapper,
  rendered via {@html} in SearchBar).

One existing test changed expectation: the prior regex would decode
HTML entities and re-strip — sanitize-html preserves entities as
entities, which is the actually-safe behavior (entity-encoded text
renders as literal text, never as parsed HTML). Test updated to
assert no raw `<`/`>` in the output rather than the specific decoded
form, which captures the security property.

Verified: 269 → 269 tests pass (one updated assertion), pnpm build /
typecheck / lint all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@williamzujkowski williamzujkowski requested a review from a team as a code owner May 12, 2026 12:26
@williamzujkowski williamzujkowski merged commit 23297e9 into main May 12, 2026
3 checks passed
@williamzujkowski williamzujkowski deleted the fix/codeql-html-sanitizer branch May 12, 2026 12:43