-
Notifications
You must be signed in to change notification settings - Fork 806
Replace regex-based HTML entity parsing with handwritten code #2510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR replaces regex-based HTML entity parsing with handwritten code to eliminate the regexp2 dependency from the JSX transformer. The change maintains the same functionality while using a manual string parsing approach.
Changes:
- Removed dependency on
github.com/dlclark/regexp2package - Replaced
htmlEntityMatcherregex andhtmlEntityReplacerfunction with handwritten parsing logic - Split implementation into
decodeEntities(main parsing loop) anddecodeEntity(single entity decoder)
| if word != nil && word.Capture.String() != "" { | ||
| res, ok := entities[word.Capture.String()] | ||
|
|
||
| entity := text[1:semi] |
Copilot
AI
Jan 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The slice operation text[1:semi] is incorrect. At this point, text has already been sliced to start at the & character (line 871: text = text[i:]), and semi is the index of ; within this already-sliced text. Therefore, text[1:semi] correctly extracts the entity name between & and ;. However, on line 883, when the entity is not recognized, the code writes text[:semi+1] which includes the & at position 0 through the ; at position semi. This is correct. But consider the edge case where text = \"&;\" (empty entity). In this case, semi = 1, so entity = text[1:1] which is an empty string, and decodeEntity(\"\") returns (0, false). Then line 883 writes text[:2] which is \"&;\". This preserves the invalid entity correctly. The logic appears sound upon closer inspection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hilarious circle
weswigham
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know, reading this is making me realize we probably need to be doing this in the parser stage so the .Text() of JsxText as read by the checker is post-entity-decoding - I think the only reason it ends up not mattering is because the literal type of a JsxText element isn't currently accessible - but if we ever did start typing jsx children better, it'd matter.
|
Yeah, that would match other string values and so on... |
I'm on a quest to eliminate regexes from the shipped code as much as possible.
This one is not trivial, but it's not too bad either when split up.