Skip to content

feat: add content sanitization for user-provided content#210

Open
github-actions[bot] wants to merge 1 commit into
developfrom
pi/issue209-1779619931630
Open

feat: add content sanitization for user-provided content#210
github-actions[bot] wants to merge 1 commit into
developfrom
pi/issue209-1779619931630

Conversation

@github-actions
Copy link
Copy Markdown
Contributor

Summary

Implements content sanitization for all user-provided text before it enters the LLM prompt context. Closes #209.

Approach: No 3rd party library

Regarding the question about using 3rd party libraries — after evaluating the options, a self-contained utility is the better fit here:

  • sanitize-html: 7 dependencies (including postcss, htmlparser2), designed for HTML sanitization — overkill for stripping HTML comments from Markdown text
  • deghost: The most relevant library (zero-dep invisible Unicode stripper), but it's v0.0.1 with a single publisher and zero track record. That's a supply-chain risk for a GitHub Action.
  • Our implementation: 3 stable regex patterns covering well-established Unicode and HTML standards that won't change. ~25 lines, zero dependencies, zero maintenance burden.

The regexes target immutable standards (Unicode character categories, HTML comment syntax) — they require the same "maintenance" as knowing that \n means newline. No library update will ever improve <!--[\s\S]*?-->.

Changes

New file: src/platform/github/sanitize.ts

  • sanitizeContent(text: string): string — strips three categories of hidden content:
    1. HTML comments (<!-- ... -->) — invisible in rendered Markdown but visible to LLMs
    2. Invisible Unicode characters — zero-width spaces, joiners, directional markers, BOM
    3. ASCII control characters — preserves meaningful whitespace (\n, \r, \t)

Integration points

  • src/platform/github/context.ts — sanitizes issue/PR bodies (via CONTEXT_EXTRACTORS) and comment bodies (in getComment())
  • src/platform/github/tools/thread.ts — sanitizes:
    • Issue/PR body in buildThreadResult()
    • Issue comment bodies in transformComment()
    • PR review comment bodies in fetchPRReviewComments()

Tests: tests/platform/github/sanitize.spec.ts

22 test cases covering:

  • HTML comment removal (single-line, multi-line, multiple, nested)
  • Individual invisible Unicode character removal (U+200B-F, U+2028-F, U+2060-F, U+FEFF)
  • Control character removal (null, BEL, BS, DEL, form feed, vertical tab)
  • Meaningful whitespace preservation (\n, \r, \t)
  • Unicode content preservation (emoji, CJK, Arabic)
  • Realistic attack scenarios (hidden instructions + zero-width chars)

Validation

✅ ESLint — pass
✅ TypeScript — pass
✅ Prettier — pass
✅ 22/22 new tests — pass

Co-authored-by: shaftoe <shaftoe@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Content sanitization for user-provided content

0 participants