Unicode support for XML and HTML tag names #196

@afu-dev

Tempest version

3.0

PHP version

8.5

Operating system

macOS

Description

Currently, for XML and HTML, the best we capture is \w, which translates to [a-zA-Z0-9_]. But tag names can contain almost any Unicode character after the first one (the first character must be [a-zA-Z] for HTML; XML is a bit more permissive).
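To illustrate the limitation, here is a minimal JavaScript sketch. The pattern `asciiTagName` and the helper `capturedName` are hypothetical stand-ins for a `\w`-based rule, not the actual Tempest pattern. In JavaScript, `\w` is ASCII-only, which matches the [a-zA-Z0-9_] behaviour described above.

```javascript
// Hypothetical tag-name capture in the spirit of a `\w`-based rule.
// This is NOT the actual Tempest pattern, only an illustration.
const asciiTagName = /<([\w-]+)/;

// Returns the captured tag name, or null when nothing matches.
function capturedName(html) {
  const match = asciiTagName.exec(html);
  return match ? match[1] : null;
}

console.log(capturedName('<div>'));              // "div"
console.log(capturedName('<math-α>'));           // "math-" (the α is dropped)
console.log(capturedName('<emotion-😍-emoji>')); // "emotion-" (cut at the emoji)
```

The capture stops at the first non-ASCII character, so the rest of the tag name is never treated as part of the name.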

Here are the specs I've found for each language:

The HTML spec even gives us a valid regex: /^(?:[A-Za-z][^\0\t\n\f\r\u0020/>]*|[:_\u0080-\u{10FFFF}][A-Za-z0-9-.:_\u0080-\u{10FFFF}]*)$/u
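As a quick sanity check, the regex can be run against a few candidate names in JavaScript. This is only a sketch: the regex is copied verbatim from above, while the helper name `isValidTagName` and the sample names are assumptions made for illustration.

```javascript
// Regex copied verbatim from the issue text above.
const validElementName = /^(?:[A-Za-z][^\0\t\n\f\r\u0020/>]*|[:_\u0080-\u{10FFFF}][A-Za-z0-9-.:_\u0080-\u{10FFFF}]*)$/u;

// Hypothetical helper name, used only for this illustration.
function isValidTagName(name) {
  return validElementName.test(name);
}

// Print each candidate name alongside the result of the check.
for (const name of ['div', 'math-α', 'emotion-😍-emoji', '1abc', 'my tag']) {
  console.log(name, isValidTagName(name));
}
```

With this regex, `div`, `math-α` and `emotion-😍-emoji` are accepted, while `1abc` (starts with a digit) and `my tag` (contains a space) are rejected.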

So the following are valid in HTML:

<math-α></math-α>
<emotion-😍-emoji></emotion-😍-emoji>

(And I guess even GitHub doesn't render them correctly 🥲)

Here's how Gecko (Firefox) is rendering the above HTML:

Image
