Replace regex-based HTML entity parsing with handwritten code #2510

jakebailey · 2026-01-14T23:00:07Z

I'm on a quest to eliminate regexes from the shipped code as much as possible.

This one is not trivial, but it's not too bad either when split up.

Copilot

Pull request overview

This PR replaces regex-based HTML entity parsing with handwritten code to eliminate the regexp2 dependency from the JSX transformer. The change maintains the same functionality while using a manual string parsing approach.

Changes:

Removed dependency on github.com/dlclark/regexp2 package
Replaced htmlEntityMatcher regex and htmlEntityReplacer function with handwritten parsing logic
Split implementation into decodeEntities (main parsing loop) and decodeEntity (single entity decoder)
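
To make the split described above concrete, here is a minimal sketch of what a regex-free decoder could look like. It is not the code from this PR: the two function names are taken from the overview, but the bodies, the tiny entities table, the isEntityChar helper, and the choice to accept only a lowercase x for hex references are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// entities is a tiny illustrative table (assumption); a real table is far larger.
var entities = map[string]rune{
	"amp": '&', "lt": '<', "gt": '>', "quot": '"', "nbsp": '\u00a0',
}

// isEntityChar reports whether c may appear between '&' and ';'.
// Hypothetical helper, not taken from the PR.
func isEntityChar(c byte) bool {
	return c == '#' || c == '_' ||
		(c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
}

// decodeEntity decodes the text between '&' and ';' (both excluded).
func decodeEntity(entity string) (rune, bool) {
	if entity == "" {
		return 0, false
	}
	if entity[0] == '#' {
		digits, base := entity[1:], 10
		if strings.HasPrefix(digits, "x") {
			digits, base = digits[1:], 16
		}
		// ParseUint rejects empty input, signs, and stray characters.
		n, err := strconv.ParseUint(digits, base, 32)
		if err != nil || n > utf8.MaxRune {
			return 0, false
		}
		return rune(n), true
	}
	r, ok := entities[entity]
	return r, ok
}

// decodeEntities is the main loop: copy literal text, try to decode at each '&'.
func decodeEntities(text string) string {
	var b strings.Builder
	for {
		i := strings.IndexByte(text, '&')
		if i < 0 {
			b.WriteString(text)
			return b.String()
		}
		b.WriteString(text[:i])
		text = text[i:] // text now starts at '&'

		// Scan the candidate entity name that follows the '&'.
		end := 1
		for end < len(text) && isEntityChar(text[end]) {
			end++
		}
		if end < len(text) && text[end] == ';' {
			if r, ok := decodeEntity(text[1:end]); ok {
				b.WriteRune(r)
				text = text[end+1:]
				continue
			}
		}
		// Not a recognized entity: keep the '&' literally and resume after it.
		b.WriteByte('&')
		text = text[1:]
	}
}

func main() {
	fmt.Println(decodeEntities("a &lt; b &amp;&amp; c &bogus; &;"))
}
```

In this sketch an unrecognized or malformed entity is passed through unchanged, so "&;" and "&bogus;" survive as written, while "&amp;" and "&#65;" are decoded.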

Copilot · 2026-01-14T23:03:45Z

internal/transformers/jsxtransforms/jsx.go

-	if word != nil && word.Capture.String() != "" {
-		res, ok := entities[word.Capture.String()]
+
+		entity := text[1:semi]


The slice operation text[1:semi] is incorrect. At this point, text has already been sliced to start at the & character (line 871: text = text[i:]), and semi is the index of ; within this already-sliced text. Therefore, text[1:semi] correctly extracts the entity name between & and ;. However, on line 883, when the entity is not recognized, the code writes text[:semi+1] which includes the & at position 0 through the ; at position semi. This is correct. But consider the edge case where text = "&;" (empty entity). In this case, semi = 1, so entity = text[1:1] which is an empty string, and decodeEntity("") returns (0, false). Then line 883 writes text[:2] which is "&;". This preserves the invalid entity correctly. The logic appears sound upon closer inspection.

Hilarious circle

internal/transformers/jsxtransforms/jsx.go

weswigham

You know, reading this is making me realize we probably need to be doing this in the parser stage so the .Text() of JsxText as read by the checker is post-entity-decoding - I think the only reason it ends up not mattering is because the literal type of a JsxText element isn't currently accessible - but if we ever did start typing jsx children better, it'd matter.

jakebailey · 2026-01-26T22:00:04Z

Yeah, that would match other string values and so on...

Replace regex-based HTML entity parsing with handwritten code

6481d15

jakebailey requested review from Copilot and weswigham January 14, 2026 23:00

Copilot started reviewing on behalf of jakebailey January 14, 2026 23:01

Copilot AI reviewed Jan 14, 2026

PR feedback

49579fd

weswigham approved these changes Jan 26, 2026

jakebailey added this pull request to the merge queue Jan 26, 2026

Merged via the queue into main with commit 0a92d37 Jan 26, 2026
22 checks passed

jakebailey deleted the jabaile/jsx-entity-no-regex branch January 26, 2026 22:11

3 participants
