Support inline flag groups / scoped modifiers — (?flags:…) and (?flags) (e.g. (?i:…)) #9

@DaliVana

Summary

Regex.compile rejects inline flag groups such as (?i:[sdmt]|ll|ve|re) (scoped case-insensitive group) and inline flag toggles like (?i) / (?-i). Only a global compileWithFlags(.{ .case_insensitive = true }) is available, which cannot express case-insensitivity (or any flag) scoped to a sub-expression — a standard feature in PCRE2, Perl, .NET, and the Rust regex / fancy-regex crates.

Version

  • zig-regex: 0.1.1
  • Zig: 0.16.0

Reproduction

const std = @import("std");
const Regex = @import("regex").Regex;

// On zig-regex 0.1.1 every pattern below fails to compile
// with RegexError.UnexpectedCharacter.
test "inline flag groups" {
    inline for (.{
        "(?i:abc)",             // scoped case-insensitive group
        "(?i)abc",              // inline flag toggle (rest of pattern)
        "(?i:[sdmt]|ll|ve|re)", // the GPT-4 contraction sub-pattern
        "a(?-i:b)c",            // scoped flag *un*set
    }) |pat| {
        var re = try Regex.compile(std.testing.allocator, pat);
        defer re.deinit();
    }
}

Expected
Inline flags compile and apply with correct scoping, mirroring Rust regex "grouping and flags" / fancy-regex (the engine rustbpe actually uses), PCRE2 "internal option setting", and Perl perlre modifiers:

  • Scoped group (?flags:expr): flags apply only inside the group.
  • Scoped unset (?-flags:expr) and combined (?flags-flags:expr).
  • Inline toggle (?flags) / (?-flags): applies from that point to the end of the enclosing group.
  • At minimum the common flags: i (case-insensitive), m (multiline), s (dot-all), x (extended/verbose); ideally also u/U.

Actual

Compilation fails. The (? group dispatcher in parseGroup (src/parser.zig:477-552) only recognizes (?:, (?=, (?!, (?P<name>, (?<=, (?<!, and (?<name>. Any other character after (?, including i, m, s, x, and -, falls through to else => return RegexError.UnexpectedCharacter (src/parser.zig:545-547). There is no lexer/parser path for flag letters in a group prefix, and no mechanism to scope flags to a sub-expression, even though a global case_insensitive flag already exists in common.CompileFlags.

Why this matters
The canonical GPT-4 BPE split pattern (from karpathy/rustbpe, via fancy-regex) is:
'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+
The leading (?i:[sdmt]|ll|ve|re) makes only the contraction alternative case-insensitive (so 'S, 'LL, 'Ve match like 's, 'll, 've) while the rest of the pattern stays case-sensitive. A global case-insensitive flag is not a substitute: it would wrongly fold case across the entire pattern. The only faithful workaround today is manual case expansion ((?:[sdmtSDMT]|[lL][lL]|[vV][eE]|[rR][eE])), which is error-prone and does not generalize to larger case-insensitive sub-patterns.
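
To make the cost of that workaround concrete, here is a minimal sketch of the manual expansion as a helper. `expandAsciiCase` is a hypothetical name, not part of zig-regex; it handles ASCII letters only and does no regex escaping, which is exactly why this approach does not scale to larger sub-patterns.

```zig
const std = @import("std");

/// Hypothetical helper (not part of zig-regex): expands an ASCII literal into
/// a case-insensitive pattern fragment, e.g. "ve" -> "[vV][eE]".
/// Non-letters are copied through unchanged; no regex escaping is attempted.
/// Returns the written prefix of `out`.
fn expandAsciiCase(literal: []const u8, out: []u8) error{NoSpaceLeft}![]const u8 {
    var n: usize = 0;
    for (literal) |c| {
        if (std.ascii.isAlphabetic(c)) {
            // Each letter becomes a two-member character class: 4 bytes.
            if (out.len - n < 4) return error.NoSpaceLeft;
            out[n] = '[';
            out[n + 1] = std.ascii.toLower(c);
            out[n + 2] = std.ascii.toUpper(c);
            out[n + 3] = ']';
            n += 4;
        } else {
            if (out.len - n < 1) return error.NoSpaceLeft;
            out[n] = c;
            n += 1;
        }
    }
    return out[0..n];
}

test "manual case expansion of contraction suffixes" {
    var buf: [64]u8 = undefined;
    try std.testing.expectEqualStrings("[lL][lL]", try expandAsciiCase("ll", &buf));
    try std.testing.expectEqualStrings("[vV][eE]", try expandAsciiCase("ve", &buf));
}
```

Every alternative has to be expanded and re-joined by hand, and any character class inside the sub-pattern (such as [sdmt]) needs separate treatment, so a scoped (?i:…) remains the cleaner fix.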

Proposed scope
  • Parse a flag-letter run immediately after the opening (?, of the form [imsxuU]* optionally followed by -[imsxuU]+, terminated by either : (scoped group) or ) (inline toggle).
  • Scoped group (?flags-flags:expr): push the modified flag set for the duration of expr, and restore the previous set on the closing ).
  • Inline toggle (?flags-flags): apply to the remainder of the current group/alternation scope (standard Perl/PCRE2 semantics).
  • Flags compose with the existing global CompileFlags and nest correctly (inner groups override outer; unset via -).
  • Minimum viable subset for the tokenizer use case: just i in the (?i:expr) scoped form. m/s/x and inline toggles can follow.
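
As a rough illustration of the first item, the flag prefix could be lexed along these lines. This is a sketch under assumptions, not the library's code: `Flags`, `FlagKind`, `FlagPrefix`, `FlagError`, and `parseFlagPrefix` are hypothetical names, only the i/m/s/x letters are handled (u/U are left out), and the input is assumed to be the text immediately following (? after the existing (?:, (?=, (?! … branches have been ruled out.

```zig
const std = @import("std");

/// Hypothetical flag set; only a global case_insensitive flag is known to
/// exist in zig-regex today (common.CompileFlags).
const Flags = struct {
    case_insensitive: bool = false, // i
    multiline: bool = false, // m
    dot_all: bool = false, // s
    extended: bool = false, // x
};

const FlagKind = enum { scoped_group, inline_toggle };

const FlagPrefix = struct {
    flags: Flags, // flag set after applying the prefix
    kind: FlagKind, // ':' => scoped group, ')' => inline toggle
    len: usize, // bytes consumed, including the terminator
};

const FlagError = error{ UnexpectedCharacter, UnexpectedEnd, EmptyNegation };

fn setFlag(flags: *Flags, c: u8, value: bool) FlagError!void {
    switch (c) {
        'i' => flags.case_insensitive = value,
        'm' => flags.multiline = value,
        's' => flags.dot_all = value,
        'x' => flags.extended = value,
        else => return error.UnexpectedCharacter,
    }
}

/// Parses `flags`, `-flags`, or `flags-flags` followed by ':' or ')',
/// starting from the flag set inherited from the enclosing scope.
fn parseFlagPrefix(rest: []const u8, inherited: Flags) FlagError!FlagPrefix {
    var flags = inherited;
    var negate = false;
    var negated_count: usize = 0;
    for (rest, 0..) |c, i| {
        switch (c) {
            ':', ')' => {
                // A trailing '-' with no flag letters after it is malformed.
                if (negate and negated_count == 0) return error.EmptyNegation;
                const kind: FlagKind = if (c == ':') .scoped_group else .inline_toggle;
                return FlagPrefix{ .flags = flags, .kind = kind, .len = i + 1 };
            },
            '-' => {
                if (negate) return error.UnexpectedCharacter; // second '-'
                negate = true;
            },
            else => {
                try setFlag(&flags, c, !negate);
                if (negate) negated_count += 1;
            },
        }
    }
    return error.UnexpectedEnd;
}

test "scoped and inline flag prefixes" {
    const scoped = try parseFlagPrefix("i:abc)", .{});
    try std.testing.expect(scoped.flags.case_insensitive);
    try std.testing.expectEqual(FlagKind.scoped_group, scoped.kind);

    const toggle = try parseFlagPrefix("-i)", .{ .case_insensitive = true });
    try std.testing.expect(!toggle.flags.case_insensitive);
    try std.testing.expectEqual(FlagKind.inline_toggle, toggle.kind);
}
```

Returning the consumed length lets the caller advance its cursor and then either parse the group body with the new flag set (scoped group) or simply update the current scope (inline toggle).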
Implementation note: the engine already has a case_insensitive matcher path; the main work is (a) lexing the flag prefix and (b) carrying a flag-state stack through parseGroup/parseAlternation so flags can be scoped per-node rather than only globally.
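
The flag-state stack in (b) can be modeled in isolation. The sketch below is a toy, not the parser: `caseMask` is a hypothetical function that understands only plain ( and ), plus the four forms (?i:, (?-i:, (?i), and (?-i), and treats every other byte as a literal. It records which literals would be matched case-insensitively, which is the per-node information the real parser would need to attach.

```zig
const std = @import("std");

/// Toy model of flag scoping (hypothetical, not zig-regex code).
/// Writes one bool per literal byte of `pattern`: true if case-insensitivity
/// is in effect at that position. Returns the written prefix of `out`.
fn caseMask(pattern: []const u8, global_ci: bool, out: []bool) ![]const bool {
    var stack: [32]bool = undefined; // saved flag state, one per open group
    var depth: usize = 0;
    var ci = global_ci; // starts from the global compile flag
    var n: usize = 0;
    var i: usize = 0;
    while (i < pattern.len) {
        const rest = pattern[i..];
        const scoped = std.mem.startsWith(u8, rest, "(?i:") or
            std.mem.startsWith(u8, rest, "(?-i:");
        const toggle = std.mem.startsWith(u8, rest, "(?i)") or
            std.mem.startsWith(u8, rest, "(?-i)");
        if (scoped or toggle) {
            const set = rest[2] == 'i'; // "(?i" sets, "(?-i" unsets
            const skip: usize = if (set) 4 else 5;
            if (scoped) {
                // Scoped group: save the outer state, restored at ')'.
                if (depth == stack.len) return error.TooDeep;
                stack[depth] = ci;
                depth += 1;
            }
            // Inline toggle: no push, so it lasts until the enclosing ')'.
            ci = set;
            i += skip;
        } else if (rest[0] == '(') {
            if (depth == stack.len) return error.TooDeep;
            stack[depth] = ci;
            depth += 1;
            i += 1;
        } else if (rest[0] == ')') {
            if (depth == 0) return error.UnbalancedParen;
            depth -= 1;
            ci = stack[depth]; // restore the enclosing scope's flags
            i += 1;
        } else {
            if (n == out.len) return error.NoSpaceLeft;
            out[n] = ci;
            n += 1;
            i += 1;
        }
    }
    return out[0..n];
}

test "inline toggle and scoped unset" {
    var buf: [16]bool = undefined;
    // Literals a, b, c, d: the toggle covers b and d, the scoped unset covers c.
    const mask = try caseMask("a(?i)b(?-i:c)d", false, &buf);
    try std.testing.expectEqualSlices(bool, &[_]bool{ false, true, false, true }, mask);
}
```

Because an inline toggle only overwrites the current state and every closing ) restores the saved one, a toggle inside a group cannot leak past that group, and a global case_insensitive flag simply becomes the initial state.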

I'm happy to test a branch against a real BPE tokenizer port if that helps.

References
  • Rust regex crate: grouping and flags ((?flags:exp), (?flags); flags i m s U u x)
  • fancy-regex crate docs (the regex flavor rustbpe/the GPT-4 pattern targets)
  • PCRE2 pcre2pattern: internal option setting & option-setting groups
  • Perl perlre: modifiers and (?adlupimnsx-imnsx:pattern)
  • .NET miscellaneous constructs: inline options (?imnsx-imnsx:subexpression)
  • Oniguruma RE syntax: (?imxWDSPy-imx:subexp)
