Support Unicode property classes (\p{…} / \P{…})

### Summary

`Regex.compile` rejects all Unicode property escapes (`\p{…}` and `\P{…}`). This includes general categories (`\p{L}`, `\p{N}`, `\p{Lu}`, `\p{Nd}`, …), scripts (`\p{Greek}`, `\p{Han}`, …), blocks, and binary properties (`\p{White_Space}`, `\p{Alphabetic}`, …). These are standardized in [UTS #18 (Unicode Regular Expressions)](https://www.unicode.org/reports/tr18/) and are required to express tokenizer pre-tokenization patterns faithfully.

### Version

- zig-regex: 0.1.1
- Zig: 0.16.0

### Reproduction

```zig
const std = @import("std");
const Regex = @import("regex").Regex;

test "unicode property classes" {
    // Every pattern below should compile; currently each one returns an error.
    inline for (.{ "\\p{L}+", "\\p{N}+", "\\p{Lu}", "\\P{L}", "[^\\p{L}\\p{N}]" }) |pat| {
        var re = try Regex.compile(std.testing.allocator, pat);
        defer re.deinit();
    }
}
```

### Expected

All Unicode property forms compile and match per the [Unicode Character Database (UCD)](https://www.unicode.org/reports/tr44/), mirroring the [UTS #18](https://www.unicode.org/reports/tr18/) / PCRE / fancy-regex semantics:

- General categories: one- and two-letter (`\p{L}`, `\p{Lu}`, `\p{N}`, `\p{Nd}`, `\p{P}`, `\p{Z}`, …); see [UAX #44 §5.7.1 General_Category](https://www.unicode.org/reports/tr44/#General_Category_Values).
- Negation: `\P{…}` and `\p{^…}`.
- Scripts and script extensions: `\p{Greek}`, `\p{Script=Han}`, `\p{scx=Latin}`, …; see [UAX #24 (Script property)](https://www.unicode.org/reports/tr24/).
- Binary properties: `\p{White_Space}`, `\p{Alphabetic}`, `\p{Emoji}`, …
- Usable both standalone and inside character classes, e.g. `[^\p{L}\p{N}]`, with property and value names matched loosely per [UAX #44-LM3 (loose matching)](https://www.unicode.org/reports/tr44/#UAX44-LM3), i.e. insensitive to case, `_`, `-`, and spaces.
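To make the loose-matching expectation concrete, here is a minimal sketch of a name comparison that ignores case, spaces, `_`, and `-`. It is a simplification of UAX44-LM3 (it does not handle the optional `is` prefix, the medial-hyphen exception, or whitespace other than the space character), and `looseEql` / `isIgnorable` are hypothetical names, not part of zig-regex.

```zig
const std = @import("std");

/// Hypothetical helper: true for characters the simplified comparison skips.
fn isIgnorable(c: u8) bool {
    return c == ' ' or c == '_' or c == '-';
}

/// Hypothetical helper: compares two property names, ignoring ASCII case
/// and any ' ', '_' or '-' characters (a simplified take on UAX44-LM3).
fn looseEql(a: []const u8, b: []const u8) bool {
    var i: usize = 0;
    var j: usize = 0;
    while (true) {
        // Skip ignorable characters on both sides before comparing.
        while (i < a.len and isIgnorable(a[i])) : (i += 1) {}
        while (j < b.len and isIgnorable(b[j])) : (j += 1) {}
        // If either side is exhausted, they match only if both are.
        if (i == a.len or j == b.len) return i == a.len and j == b.len;
        if (std.ascii.toLower(a[i]) != std.ascii.toLower(b[j])) return false;
        i += 1;
        j += 1;
    }
}

test "loose name comparison" {
    try std.testing.expect(looseEql("White_Space", "whitespace"));
    try std.testing.expect(!looseEql("Lu", "L"));
}
```

A real implementation would more likely normalize the name once and look it up in a table generated from `PropertyAliases.txt` and `PropertyValueAliases.txt`; the pairwise comparison above only illustrates which differences are ignored.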

### Actual

Compilation fails. The lexer has no case for `p`/`P` after a backslash, so it falls through to `else => RegexError.InvalidEscapeSequence` (`src/parser.zig:107`). Inside `[...]` the class parser likewise rejects it (`parseCharClass` → `InvalidCharacterClass`, `src/parser.zig:722-723`).
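As a rough illustration of the missing lexer case, the sketch below parses the text that follows a backslash into a property name and a negation flag. It is standalone and does not use zig-regex internals: `PropertyEscape`, `ParseError`, and `parsePropertyEscape` are hypothetical names, and treating `\P{^…}` as a double negation is an assumption modeled on PCRE-style behavior. The brace-less single-letter form (`\pL`) is deliberately not handled.

```zig
const std = @import("std");

/// Hypothetical result type; not part of zig-regex.
const PropertyEscape = struct {
    name: []const u8, // text between the braces, without a leading '^'
    negated: bool,
    len: usize, // bytes consumed, counted from the 'p' or 'P'
};

const ParseError = error{InvalidEscapeSequence};

/// Parses `p{...}` or `P{...}`; `input` starts right after the backslash.
fn parsePropertyEscape(input: []const u8) ParseError!PropertyEscape {
    if (input.len < 2) return error.InvalidEscapeSequence;
    var negated = switch (input[0]) {
        'p' => false,
        'P' => true,
        else => return error.InvalidEscapeSequence,
    };
    if (input[1] != '{') return error.InvalidEscapeSequence;

    var start: usize = 2;
    if (start < input.len and input[start] == '^') {
        negated = !negated; // assumption: \P{^L} cancels out to \p{L}
        start += 1;
    }

    // Scan to the closing brace; reject unterminated or empty names.
    var end = start;
    while (end < input.len and input[end] != '}') : (end += 1) {}
    if (end == input.len or end == start) return error.InvalidEscapeSequence;

    return .{ .name = input[start..end], .negated = negated, .len = end + 1 };
}

test "parse a property escape" {
    const esc = try parsePropertyEscape("p{Lu}+");
    try std.testing.expectEqualStrings("Lu", esc.name);
    try std.testing.expect(!esc.negated);
}
```

The same routine could serve both the top-level lexer and the character-class parser, which would keep `\p{…}` behavior identical in both contexts.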

### Why this matters

The canonical GPT-4 BPE split pattern is:

`'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+`

Without Unicode property classes this can only be approximated with ASCII `[A-Za-z]`/`[0-9]`, which silently mis-tokenizes any non-ASCII text (accented Latin, CJK, Cyrillic, emoji, etc.). zig-regex already supports the harder parts of this pattern (lookahead), so property classes are the main remaining blocker for a faithful port. They are also broadly useful well beyond tokenization.

### Proposed scope

Full `\p{…}`/`\P{…}` support backed by generated UCD tables, targeting [UTS #18](https://www.unicode.org/reports/tr18/) "Basic Unicode Support" (RL1.2):

- General categories: both the one-letter super-categories (`L`, `M`, `N`, `P`, `S`, `Z`, `C`) and the two-letter subcategories (`Lu`, `Nd`, …).
- Negation: `\P{…}` and the `\p{^…}` form.
- Scripts / script extensions: `\p{Script=…}` / `\p{sc=…}` and `\p{Script_Extensions=…}` / `\p{scx=…}`, plus the bare `\p{Greek}` shorthand.
- Binary properties: at minimum `White_Space`, `Alphabetic`, `Uppercase`, `Lowercase`, `Noncharacter_Code_Point`; ideally the full set.
- Loose name matching per [UAX #44-LM3](https://www.unicode.org/reports/tr44/#UAX44-LM3).
- Works identically standalone and inside `[...]`, and composes with negated classes (`[^\p{L}\p{N}]`) and existing quantifiers.
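One possible way to cover `\P{…}` and negated classes, assuming properties resolve to sorted, non-overlapping code point ranges, is to complement the range set over `0..0x10FFFF`. The sketch below is illustrative only: `Range` and `complement` are hypothetical names, the caller supplies the output buffer, and surrogate code points are not treated specially.

```zig
const std = @import("std");

/// Hypothetical inclusive code point range.
const Range = struct { lo: u21, hi: u21 };

/// Writes the complement of `ranges` (sorted, non-overlapping) into `out`
/// and returns the filled prefix. `out` needs room for ranges.len + 1 items.
fn complement(ranges: []const Range, out: []Range) []Range {
    var n: usize = 0;
    var next: u32 = 0; // lowest code point not yet accounted for
    for (ranges) |r| {
        if (r.lo > next) {
            // Gap before this range belongs to the complement.
            const lo: u21 = @intCast(next);
            out[n] = .{ .lo = lo, .hi = r.lo - 1 };
            n += 1;
        }
        next = @as(u32, r.hi) + 1;
    }
    if (next <= 0x10FFFF) {
        // Tail after the last range, up to the end of the code space.
        const lo: u21 = @intCast(next);
        out[n] = .{ .lo = lo, .hi = 0x10FFFF };
        n += 1;
    }
    return out[0..n];
}

test "complement of ASCII digits" {
    const digits = [_]Range{.{ .lo = 0x30, .hi = 0x39 }};
    var buf: [2]Range = undefined;
    const out = complement(&digits, &buf);
    try std.testing.expectEqual(@as(usize, 2), out.len);
    try std.testing.expectEqual(@as(u21, 0x2F), out[0].hi);
    try std.testing.expectEqual(@as(u21, 0x3A), out[1].lo);
}
```

With this shape, `[^\p{L}\p{N}]` could be handled by merging the `L` and `N` range sets and complementing the result once, rather than negating each property separately.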
Implementation suggestion: vendor a small code generator that emits code point range tables from the [UCD data files](https://www.unicode.org/Public/UCD/latest/ucd/) (`DerivedGeneralCategory.txt`, `Scripts.txt`, `ScriptExtensions.txt`, `PropList.txt`, `PropertyAliases.txt`, `PropertyValueAliases.txt`) at build time, pin the Unicode version, and resolve `\p{…}` to range sets the existing matcher already understands. Phasing is fine: general categories + negation first (covers the tokenizer use case), scripts/binary properties next.
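For illustration, a generated table could look like a sorted array of inclusive ranges queried with binary search. The sketch below hard-codes a small excerpt of `Nd` (Decimal_Number) by hand; a real table would be emitted by the generator from `DerivedGeneralCategory.txt` and would contain many more ranges. `Range`, `nd_excerpt`, and `contains` are hypothetical names.

```zig
const std = @import("std");

/// Hypothetical inclusive code point range.
const Range = struct { lo: u21, hi: u21 };

/// Illustrative excerpt only: four of the many Nd ranges, sorted by `lo`.
const nd_excerpt = [_]Range{
    .{ .lo = 0x0030, .hi = 0x0039 }, // ASCII digits
    .{ .lo = 0x0660, .hi = 0x0669 }, // Arabic-Indic digits
    .{ .lo = 0x06F0, .hi = 0x06F9 }, // Extended Arabic-Indic digits
    .{ .lo = 0x0966, .hi = 0x096F }, // Devanagari digits
};

/// Binary search over a sorted, non-overlapping range table.
fn contains(table: []const Range, cp: u21) bool {
    var lo: usize = 0;
    var hi: usize = table.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        const r = table[mid];
        if (cp < r.lo) {
            hi = mid;
        } else if (cp > r.hi) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

test "lookup in the Nd excerpt" {
    try std.testing.expect(contains(&nd_excerpt, '7'));
    try std.testing.expect(contains(&nd_excerpt, 0x0663)); // ARABIC-INDIC DIGIT THREE
    try std.testing.expect(!contains(&nd_excerpt, 'a'));
}
```

Keeping the tables as plain sorted ranges means they can be handed to the existing character-class matcher without a new matching primitive, which is the point of the suggestion above.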

I'm happy to test a branch against a real BPE tokenizer port (ASCII and non-ASCII corpora) if that helps validate correctness.

### References

- [UTS #18: Unicode Regular Expressions](https://www.unicode.org/reports/tr18/) (property syntax & conformance levels)
- [UAX #44: Unicode Character Database](https://www.unicode.org/reports/tr44/) ([General_Category](https://www.unicode.org/reports/tr44/#General_Category_Values), [loose matching UAX44-LM3](https://www.unicode.org/reports/tr44/#UAX44-LM3))
- [UAX #24: Unicode Script Property](https://www.unicode.org/reports/tr24/)
- [UCD data files (latest)](https://www.unicode.org/Public/UCD/latest/ucd/)
- Prior art: [Rust regex crate Unicode support](https://docs.rs/regex/latest/regex/#unicode-features), [PCRE2 \p docs](https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC5)
