Support Unicode property classes (\p{…} / \P{…}) #7

@DaliVana

Summary

Regex.compile rejects all Unicode property escapes (\p{…} and \P{…}). This includes general categories (\p{L}, \p{N}, \p{Lu}, \p{Nd}, …), scripts (\p{Greek}, \p{Han}, …), blocks, and binary properties (\p{White_Space}, \p{Alphabetic}, …). These are standardized in UTS #18 (Unicode Regular Expressions) and are required to express tokenizer pre-tokenization patterns faithfully.

Version

  • zig-regex: 0.1.1
  • Zig: 0.16.0

Reproduction

const std = @import("std");
const Regex = @import("regex").Regex;

test "unicode property classes" {
    // Every pattern below currently fails in Regex.compile (see "Actual").
    inline for (.{ "\\p{L}+", "\\p{N}+", "\\p{Lu}", "\\P{L}", "[^\\p{L}\\p{N}]" }) |pat| {
        var re = try Regex.compile(std.testing.allocator, pat);
        defer re.deinit();
    }
}

Expected
All Unicode property forms compile and match per the Unicode Character Database (UCD), mirroring the UTS #18 / PCRE / fancy-regex semantics:

  • General categories: one- and two-letter (\p{L}, \p{Lu}, \p{N}, \p{Nd}, \p{P}, \p{Z}, …); see UAX #44 §5.7.1 General_Category
  • Negation: \P{…} and \p{^…}
  • Scripts and script extensions: \p{Greek}, \p{Script=Han}, \p{scx=Latin}, …; see UAX #24 (Script property)
  • Binary properties: \p{White_Space}, \p{Alphabetic}, \p{Emoji}, …
  • Usable both standalone and inside character classes, e.g. [^\p{L}\p{N}], with property and value names matched loosely per UAX44-LM3 (ignoring case, underscores, hyphens, and spaces)
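To make the loose-matching requirement concrete, here is a minimal sketch of a comparison that ignores case, underscores, hyphens, and spaces. `looseEql` and `isIgnorable` are hypothetical names, not part of zig-regex, and the sketch leaves out the UAX44-LM3 rule about an initial "is" prefix.

```zig
const std = @import("std");

/// Characters skipped during comparison in this sketch.
fn isIgnorable(c: u8) bool {
    return c == '_' or c == '-' or c == ' ';
}

/// Hypothetical helper: compares two ASCII property names loosely,
/// ignoring case and the characters accepted by isIgnorable.
fn looseEql(a: []const u8, b: []const u8) bool {
    var i: usize = 0;
    var j: usize = 0;
    while (true) {
        // Skip separators on both sides before comparing.
        while (i < a.len and isIgnorable(a[i])) i += 1;
        while (j < b.len and isIgnorable(b[j])) j += 1;
        // Equal only if both names run out at the same time.
        if (i == a.len or j == b.len) return i == a.len and j == b.len;
        if (std.ascii.toLower(a[i]) != std.ascii.toLower(b[j])) return false;
        i += 1;
        j += 1;
    }
}

test "loose property name matching (sketch)" {
    try std.testing.expect(looseEql("White_Space", "whitespace"));
    try std.testing.expect(looseEql("Script_Extensions", "script extensions"));
    try std.testing.expect(!looseEql("L", "Lu"));
}
```

A generated alias table could instead store names pre-normalized so that lookup becomes a plain byte comparison; which is better depends on how the tables end up being emitted.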

Actual
Compilation fails. The lexer has no case for p/P after a backslash, so it falls through to else => RegexError.InvalidEscapeSequence (src/parser.zig:107). Inside [...] the class parser likewise rejects it (parseCharClass → InvalidCharacterClass, src/parser.zig:722-723).

Why this matters
The canonical GPT-4 BPE split pattern is:

'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+

Without Unicode property classes this can only be approximated with ASCII [A-Za-z]/[0-9], which silently mis-tokenizes any non-ASCII text (accented Latin, CJK, Cyrillic, emoji, etc.). zig-regex already supports the harder parts of this pattern (lookahead), so property classes are the main remaining blocker for a faithful port, and they're broadly useful well beyond tokenization.

Proposed scope
Full \p{…}/\P{…} support backed by generated UCD tables, targeting UTS #18 "Basic Unicode Support" (RL1.2):

  • General categories: both the two-letter abbreviations (Lu, Nd, …) and the one-letter super-categories (L, M, N, P, S, Z, C).
  • Negation: \P{…} and the \p{^…} form.
  • Scripts / script extensions: \p{Script=…} / \p{sc=…} and \p{Script_Extensions=…} / \p{scx=…}, plus the bare \p{Greek} shorthand.
  • Binary properties: at minimum White_Space, Alphabetic, Uppercase, Lowercase, Noncharacter_Code_Point; ideally the full set.
  • Loose name matching per UAX44-LM3.
  • Works identically standalone and inside [...], and composes with negated classes ([^\p{L}\p{N}]) and existing quantifiers.
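As a rough illustration of the syntax above, the braces-and-caret forms could be recognised along these lines. Everything here is hypothetical: `PropertyEscape`, `parsePropertyEscape`, and the local error set only mirror the names in this issue, not zig-regex's actual parser types. Treating `\P{^…}` as a double negation is an assumption of the sketch.

```zig
const std = @import("std");

/// Hypothetical result type; the real AST node in zig-regex will differ.
const PropertyEscape = struct {
    name: []const u8, // text between the braces, without a leading '^'
    negated: bool,
    len: usize, // bytes consumed, counted from the 'p' / 'P'
};

const ParseError = error{InvalidEscapeSequence};

/// Parses `p{Name}`, `P{Name}` or `p{^Name}`.
/// `input` is assumed to start right after the backslash.
fn parsePropertyEscape(input: []const u8) ParseError!PropertyEscape {
    if (input.len < 2) return error.InvalidEscapeSequence;
    if (input[0] != 'p' and input[0] != 'P') return error.InvalidEscapeSequence;
    if (input[1] != '{') return error.InvalidEscapeSequence;

    var negated = input[0] == 'P';
    var start: usize = 2;
    if (start < input.len and input[start] == '^') {
        negated = !negated; // assumption: `\P{^L}` cancels out to `\p{L}`
        start += 1;
    }

    // Scan for the closing brace; reject unterminated or empty names.
    var close = start;
    while (close < input.len and input[close] != '}') close += 1;
    if (close == input.len or close == start) return error.InvalidEscapeSequence;

    return PropertyEscape{
        .name = input[start..close],
        .negated = negated,
        .len = close + 1,
    };
}

test "parse property escape (sketch)" {
    const esc = try parsePropertyEscape("p{^Greek}+");
    try std.testing.expectEqualStrings("Greek", esc.name);
    try std.testing.expect(esc.negated);
    try std.testing.expect(esc.len == 9);
}
```

The `Script=…` / `sc=…` forms would need one more step that splits `name` on `=` before the loose-matched lookup.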
Implementation suggestion: vendor a small code-generator that emits codepoint range tables from the UCD data files (DerivedGeneralCategory.txt, Scripts.txt, ScriptExtensions.txt, PropList.txt, PropertyAliases.txt, PropertyValueAliases.txt) at build time, pin the Unicode version, and resolve \p{…} to range sets the existing matcher already understands. Phasing is fine: general categories + negation first (covers the tokenizer use case), scripts/binary properties next.
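For illustration, a generated table and its lookup might look roughly like the sketch below. `Range`, `contains`, `matches`, and `nd_excerpt` are hypothetical names, and the table is a hand-written excerpt of a few Nd (decimal digit) ranges rather than real generator output.

```zig
const std = @import("std");

/// Inclusive code point range.
const Range = struct { lo: u21, hi: u21 };

/// Illustrative excerpt only. A real table would be generated from
/// DerivedGeneralCategory.txt and contain every Nd range.
const nd_excerpt = [_]Range{
    .{ .lo = 0x0030, .hi = 0x0039 }, // ASCII digits
    .{ .lo = 0x0660, .hi = 0x0669 }, // Arabic-Indic digits
    .{ .lo = 0x06F0, .hi = 0x06F9 }, // Extended Arabic-Indic digits
    .{ .lo = 0x0966, .hi = 0x096F }, // Devanagari digits
    .{ .lo = 0xFF10, .hi = 0xFF19 }, // Fullwidth digits
};

/// Binary search over sorted, non-overlapping ranges.
fn contains(table: []const Range, cp: u21) bool {
    var left: usize = 0;
    var right: usize = table.len;
    while (left < right) {
        const mid = left + (right - left) / 2;
        const r = table[mid];
        if (cp < r.lo) {
            right = mid;
        } else if (cp > r.hi) {
            left = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

/// `\P{…}` and `\p{^…}` reuse the same table with the result inverted.
fn matches(table: []const Range, negated: bool, cp: u21) bool {
    return contains(table, cp) != negated;
}

test "range table lookup (sketch)" {
    try std.testing.expect(contains(&nd_excerpt, '7'));
    try std.testing.expect(contains(&nd_excerpt, 0x0665));
    try std.testing.expect(!contains(&nd_excerpt, 'a'));
    try std.testing.expect(matches(&nd_excerpt, true, 'a'));
}
```

If the existing matcher already works on range sets, negation could alternatively be resolved once at compile time by complementing the table, which keeps the per-character path branch-free.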

I'm happy to test a branch against a real BPE tokenizer port (ASCII and non-ASCII corpora) if that helps validate correctness.

References
UTS #18 — Unicode Regular Expressions (property syntax & conformance levels)
UAX #44 — Unicode Character Database (General_Category, loose matching UAX44-LM3)
UAX #24 — Unicode Script Property
UCD data files (latest)
Prior art: Rust regex crate Unicode support, PCRE2 \p docs
