Regex_Search : Add tokenizer #20

AbhishekRai456 · 2026-01-26T14:52:05Z

Adds initial regex tokenizer with support for:

Literals, operators, grouping and anchors (^, $)
Character classes with ranges and shorthands (\d, \w, \s and negations)
Quantifiers {m,n}, {m,}, {m}
Implicit concatenation insertion
Error reporting with position tracking

This is a draft for review and future integration with postfix conversion and NFA construction.
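For reviewers unfamiliar with the implicit concatenation step listed above, here is a minimal sketch of the idea: an explicit concatenation token is inserted wherever one operand ends and another begins, so that the later postfix conversion can treat concatenation like any other binary operator. The `Kind`, `Tok`, and `insert_concat` names are hypothetical and are not taken from this PR. Anchors and `{m,n}` quantifier tokens are left out for brevity.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token kinds for illustration; the PR's real token type may differ.
enum class Kind { Literal, CharClass, LParen, RParen, Star, Plus, Question, Alt, Concat };

struct Tok {
  Kind kind;
  char ch;
};

// True if a token can end an operand (a following operand concatenates onto it).
static bool ends_operand(Kind k) {
  return k == Kind::Literal || k == Kind::CharClass || k == Kind::RParen ||
         k == Kind::Star || k == Kind::Plus || k == Kind::Question;
}

// True if a token can begin an operand.
static bool starts_operand(Kind k) {
  return k == Kind::Literal || k == Kind::CharClass || k == Kind::LParen;
}

// Returns a copy of `in` with an explicit Concat token between adjacent operands.
std::vector<Tok> insert_concat(const std::vector<Tok>& in) {
  std::vector<Tok> out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (i > 0 && ends_operand(in[i - 1].kind) && starts_operand(in[i].kind))
      out.push_back(Tok{Kind::Concat, '.'});
    out.push_back(in[i]);
  }
  return out;
}
```

Under these assumptions, the tokens for `a*b` become `a`, `*`, Concat, `b`, while the tokens for `a|b` are left unchanged because `|` neither ends nor begins an operand.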

Revert accidental formatting changes Revert accidental formatting changes in exact module Final fixes l

AviShahCode · 2026-01-26T16:07:40Z

libpz/regex/RegexTokenizer.cpp

+
+  auto read_int = [&]() -> int {
+    skip_spaces();
+    int val = 0;


Use unsigned int here; a quantifier bound is never negative.
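A minimal sketch of what an unsigned reader could look like, written as a free function rather than the lambda in the diff. The `read_uint` name, its parameters, and the use of `std::overflow_error` are assumptions for illustration; the PR reports errors through `PzError::report_error`, and its `skip_spaces()` call is omitted here.

```cpp
#include <cctype>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Reads a run of decimal digits starting at `pos` and advances `pos` past them.
// Sets `found` to false when no digit is present, so the caller can decide
// whether a missing number is an error.
unsigned int read_uint(const std::string& s, std::size_t& pos, bool& found) {
  unsigned int val = 0;
  found = false;
  while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
    unsigned int digit = static_cast<unsigned int>(s[pos] - '0');
    // Reject the digit if val * 10 + digit would wrap around.
    if (val > (std::numeric_limits<unsigned int>::max() - digit) / 10)
      throw std::overflow_error("quantifier bound too large");
    val = val * 10 + digit;
    found = true;
    ++pos;
  }
  return val;
}
```

Returning `found` separately keeps "no number present" distinct from "the number is 0", which matters for the open-ended quantifier forms.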

AviShahCode · 2026-01-26T16:08:08Z

libpz/regex/RegexTokenizer.cpp

+    if (!found)
+      PzError::report_error(PzError::PzErrorType::PZ_INVALID_INPUT,
+                            "Expected number in quantifier at position " +
+                                std::to_string(t.pos));


{,9} implicitly means {0,9}, so a missing lower bound should not be reported as an error.

AviShahCode · 2026-01-26T16:13:05Z

libpz/regex/RegexTokenizer.cpp

+    return t;
+  }
+
+  t.max = read_int();


This should not throw an error. For example, {1,} means 1 or more.
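One way to cover both open-ended forms raised in the comments above is sketched below. It is an illustration under stated assumptions, not the PR's code: `Quantifier`, `parse_quantifier`, and the `unbounded` flag are hypothetical names, `std::invalid_argument` stands in for `PzError::report_error`, `pos` is assumed to point just past the opening `{`, and `{,}` is treated as an error. Whitespace skipping and overflow checks are omitted.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

struct Quantifier {
  unsigned int min = 0;
  unsigned int max = 0;
  bool unbounded = false;  // true for {m,}: no upper bound
};

// Parses "m}", "m,}", ",n}" or "m,n}" starting at `pos`, advancing past '}'.
Quantifier parse_quantifier(const std::string& s, std::size_t& pos) {
  auto read_digits = [&](bool& found) -> unsigned int {
    unsigned int v = 0;
    found = false;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
      v = v * 10 + static_cast<unsigned int>(s[pos] - '0');
      found = true;
      ++pos;
    }
    return v;
  };
  auto fail = [&](const std::string& msg) {
    throw std::invalid_argument(msg + " at position " + std::to_string(pos));
  };

  Quantifier q;
  bool has_min = false, has_max = false;
  unsigned int lo = read_digits(has_min);
  q.min = has_min ? lo : 0;  // {,n} defaults the lower bound to 0

  if (pos < s.size() && s[pos] == ',') {
    ++pos;
    unsigned int hi = read_digits(has_max);
    if (!has_min && !has_max) fail("Expected number in quantifier");
    if (has_max) q.max = hi;
    else q.unbounded = true;  // {m,} means m or more
  } else {
    if (!has_min) fail("Expected number in quantifier");
    q.max = q.min;  // {m} means exactly m
  }
  if (pos >= s.size() || s[pos] != '}') fail("Expected '}' in quantifier");
  ++pos;
  if (!q.unbounded && q.max < q.min) fail("Quantifier max is less than min");
  return q;
}
```

A separate `unbounded` flag avoids reserving a sentinel value of `max` for "no upper bound".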

AviShahCode · 2026-01-26T16:49:03Z

libpz/regex/RegexTokenizer.cpp

+  const char MIN_CHAR = '\0';   // ascii index 0
+  const char MAX_CHAR = '\x7F'; // ascii index 127


Use the standard types from pz_types here, and do the same for the other types.

Add regex tokenizer

a40f05d


AbhishekRai456 force-pushed the regex branch from 50daf4d to a40f05d Compare January 26, 2026 14:55

