fix(web): replace regex HTML sanitization with sanitize-html by williamzujkowski · Pull Request #171 · civic-source/us-code-tracker

williamzujkowski · 2026-05-12T12:26:31Z

Summary

Closes the remaining 15 CodeQL findings in apps/web/src/lib/github.ts:

12 × js/incomplete-multi-character-sanitization
2 × js/bad-tag-filter
1 × js/double-escaping

All 15 flag the same underlying problem: regex-based HTML stripping can be bypassed. sanitize-html (v2.17) uses a real HTML parser (htmlparser2), so malformed/nested tags and encoded entities cannot escape the allowlist.
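
To illustrate the class of bypass behind js/incomplete-multi-character-sanitization, here is a minimal sketch. The function `naiveStripScript` and the payload are hypothetical stand-ins, not the code that was in github.ts: a single-pass regex removes the inner tag, and the leftover fragments join into a new one.

```typescript
// Hypothetical stand-in for regex-based stripping (not the actual github.ts code).
// Each replace() makes one pass and never re-examines text that is
// reassembled after a match is removed.
function naiveStripScript(input: string): string {
  return input.replace(/<script>/gi, "").replace(/<\/script>/gi, "");
}

// Nested payload: removing the inner tags splices the outer fragments together.
const payload = "<scr<script>ipt>alert(1)</scr</script>ipt>";
const stripped = naiveStripScript(payload);

console.log(stripped); // <script>alert(1)</script>
```

A parser-based sanitizer avoids this because it works on parsed nodes and serializes only allowlisted ones, instead of deleting substrings from the raw input.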

Two configs preserve existing contracts

sanitizeContent → allowedTags: [] (plain text). Used by DiffViewer.svelte for content fetched from the us-code repo via GitHub API.
sanitizeExcerpt → allowedTags: ['mark'] (Pagefind highlight wrapper). Rendered via {@html ...} in SearchBar.svelte.
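
The two configs could be sketched as plain option objects like the ones below. Only the `allowedTags` values come from this PR; the `SanitizeOptions` interface, the constant names, and the empty `allowedAttributes` are assumptions made so the sketch type-checks without the dependency installed. In the real module, objects of this shape would be passed as the second argument to `sanitizeHtml(input, options)`.

```typescript
// Hypothetical minimal subset of the sanitize-html options shape, declared
// locally so this sketch compiles on its own.
interface SanitizeOptions {
  allowedTags: string[];
  allowedAttributes: Record<string, string[]>;
}

// sanitizeContent contract: plain text only, so no tag is allowed.
const contentOptions: SanitizeOptions = {
  allowedTags: [],
  allowedAttributes: {},
};

// sanitizeExcerpt contract: only Pagefind's <mark> highlight wrapper is allowed.
// Assumption: no attributes are permitted on it (the PR text does not say).
const excerptOptions: SanitizeOptions = {
  allowedTags: ["mark"],
  allowedAttributes: {},
};
```

Keeping the two allowlists as separate named objects makes the difference between the call sites explicit: DiffViewer receives text only, while SearchBar may render `<mark>` through {@html ...}.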

Test change worth flagging

One test's expectation changed: the prior regex would decode `&lt;script&gt;...` and then strip it, whereas sanitize-html preserves entities as entities, which is the actually safe behavior (entity-encoded text renders as literal text, never as parsed HTML). The test now asserts that the output contains no raw `<`/`>` rather than matching the specific decoded form. This captures the security property without coupling to the legacy regex's "decode then strip" pattern.
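
A property-style check of this kind might look like the sketch below. The helper `hasRawAngleBrackets` and both sample strings are hypothetical illustrations, not the repository's test code or recorded sanitizer output.

```typescript
// Hypothetical helper: true if the string contains a literal < or >.
function hasRawAngleBrackets(output: string): boolean {
  return /[<>]/.test(output);
}

// Illustrative strings only (assumed, not captured from sanitize-html):
// entity-encoded text contains no raw angle brackets, so it passes the check.
const entityPreserved = "&lt;script&gt;alert(1)&lt;/script&gt;";
// A decoded form would contain raw angle brackets and fail the check.
const decodedForm = "<script>alert(1)</script>";

console.log(hasRawAngleBrackets(entityPreserved)); // false
console.log(hasRawAngleBrackets(decodedForm)); // true
```

Asserting the property rather than an exact string means the test keeps passing if the sanitizer changes how it serializes entities, as long as nothing parseable as a tag gets through.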

Test plan

pnpm install resolves cleanly (added: sanitize-html 2.17.3 + @types/sanitize-html 2.16.1)
pnpm build (Astro SSG + Pagefind index) passes — sanitize-html survives the SSR/bundle path
pnpm test (269 tests) green
pnpm typecheck clean
pnpm lint clean
CI build passes
Post-merge CodeQL re-scan: 15 alerts in github.ts should clear

🤖 Generated with Claude Code


Closes the remaining CodeQL findings in apps/web/src/lib/github.ts: 12 × js/incomplete-multi-character-sanitization, 2 × js/bad-tag-filter, 1 × js/double-escaping, all flagging the fundamental fact that regex-based HTML stripping is bypass-able. `sanitize-html` (v2.17) uses a real HTML parser (htmlparser2) so malformed/nested tags and encoded entities can't escape the allowlist.

Two configurations preserve current call sites' contracts:

- sanitizeContent → allowedTags: [] (plain text, used in DiffViewer for content fetched via GitHub API).
- sanitizeExcerpt → allowedTags: ['mark'] (Pagefind highlight wrapper, rendered via {@html} in SearchBar).

One existing test changed expectation: the prior regex would decode HTML entities and re-strip, whereas sanitize-html preserves entities as entities, which is the actually-safe behavior (entity-encoded text renders as literal text, never as parsed HTML). Test updated to assert no raw `<`/`>` in the output rather than the specific decoded form, which captures the security property.

Verified: 269 → 269 tests pass (one updated assertion), pnpm build / typecheck / lint all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

williamzujkowski requested a review from a team as a code owner May 12, 2026 12:26

williamzujkowski merged commit 23297e9 into main May 12, 2026
3 checks passed

williamzujkowski deleted the fix/codeql-html-sanitizer branch May 12, 2026 12:43

williamzujkowski merged 1 commit into main from fix/codeql-html-sanitizer
