[FLINK-39064][Table SQL / API] Add built-in REGEXP_SPLIT function to split string by regular expression pattern #27577

Myracle · 2026-02-11T03:18:13Z

What is the purpose of the change

This pull request adds a new built-in function REGEXP_SPLIT to Flink SQL and Table API, which splits a string by a regular expression pattern and returns an array of substrings. This function is commonly available in other SQL engines (e.g., Spark, Presto, Hive) and provides users with more powerful string manipulation capabilities using regex patterns.

Brief change log

- Added REGEXP_SPLIT function definition in BuiltInFunctionDefinitions with proper input/output type strategies
- Implemented RegexpSplitFunction as a scalar function with regex pattern caching for performance optimization
- Added regexpSplit() method to BaseExpressions for Table API support
- Added comprehensive test cases in RegexpFunctionsITCase covering various scenarios including null handling, empty regex, invalid regex patterns, and edge cases

Verifying this change

This change added tests and can be verified as follows:

Added integration tests in RegexpFunctionsITCase that cover:
- Basic regex split functionality (e.g., splitting by digit patterns [0-9]+)
- Null input handling (both null string and null pattern)
- Empty regex pattern (split by each character)
- Multi-character delimiter regex patterns (e.g., [,;|])
- Whitespace regex patterns (e.g., \\s+)
- No match scenarios (returns original string as single-element array)
- Invalid regex pattern handling (returns null)
- Input validation errors for non-string type inputs
- SQL signature validation errors
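
The behaviors listed above can be approximated with plain `java.util.regex`. The following is only a minimal sketch under stated assumptions: the class name `RegexpSplitSketch` and the `String[]` return type are hypothetical stand-ins (the actual function returns a Flink array type), and the empty-regex case is deliberately left out because the PR handles it separately.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical stand-in for illustration; not the PR's RegexpSplitFunction. */
final class RegexpSplitSketch {

    /** Returns null for a null input or an invalid regex, otherwise the split parts. */
    static String[] regexpSplit(String str, String regex) {
        if (str == null || regex == null) {
            return null;
        }
        try {
            // Limit -1 keeps trailing empty strings instead of dropping them.
            return Pattern.compile(regex).split(str, -1);
        } catch (PatternSyntaxException e) {
            // An invalid pattern such as "(" yields null rather than an exception.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(regexpSplit("a1b22c", "[0-9]+"))); // [a, b, c]
        System.out.println(Arrays.toString(regexpSplit("a,b;c|d", "[,;|]"))); // [a, b, c, d]
        System.out.println(Arrays.toString(regexpSplit("hello", "[0-9]+")));  // [hello]
        System.out.println(Arrays.toString(regexpSplit("abc", "(")));         // null
    }
}
```

This sketch compiles the pattern on every call; the cached lookup discussed in the review is what avoids that cost in the real function.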

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): yes (BaseExpressions is @PublicEvolving, added regexpSplit() method)
The serializers: no
The runtime per-record code paths (performance sensitive): no (new function only, with pattern caching for optimization)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? yes
If yes, how is the feature documented? JavaDocs (function usage examples are documented in RegexpSplitFunction class JavaDoc)

flinkbot · 2026-02-11T03:22:53Z

CI report:

f3f463b Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

nateab · 2026-02-11T07:36:08Z

...ntime/src/main/java/org/apache/flink/table/runtime/functions/scalar/RegexpSplitFunction.java

+        }
+
+        try {
+            // Cache the compiled pattern to improve performance


Every other REGEXP_* function uses SqlFunctionUtils.getRegexpMatcher() which delegates to the shared REGEXP_PATTERN_CACHE (a ThreadLocalCache)

see for example https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/functions/scalar/RegexpSubstrFunction.java#L42

@nateab Thanks for taking the time to review. Your suggestions were valuable, and I have updated the code accordingly.
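
For readers unfamiliar with the idea behind a per-thread pattern cache, here is a minimal sketch. It is not Flink's ThreadLocalCache or REGEXP_PATTERN_CACHE; the class name `PatternCacheSketch`, the LRU policy, and the capacity of 64 are all assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical per-thread LRU cache of compiled patterns; not Flink's implementation. */
final class PatternCacheSketch {

    // Assumed capacity, picked arbitrarily for this sketch.
    private static final int MAX_ENTRIES = 64;

    // Each thread gets its own access-ordered map, so no locking is needed.
    private static final ThreadLocal<Map<String, Pattern>> CACHE =
            ThreadLocal.withInitial(
                    () ->
                            new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
                                @Override
                                protected boolean removeEldestEntry(
                                        Map.Entry<String, Pattern> eldest) {
                                    // Evict the least recently used entry once over capacity.
                                    return size() > MAX_ENTRIES;
                                }
                            });

    /** Compiles the regex on first use per thread and reuses the Pattern afterwards. */
    static Pattern get(String regex) {
        // Pattern.compile throws PatternSyntaxException for invalid input;
        // in that case nothing is stored in the cache.
        return CACHE.get().computeIfAbsent(regex, Pattern::compile);
    }
}
```

Within one thread, repeated calls to `PatternCacheSketch.get("[0-9]+")` return the same `Pattern` instance, which is the property that makes sharing one cache across the REGEXP_* functions attractive.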

nateab · 2026-02-11T07:52:29Z

...ntime/src/main/java/org/apache/flink/table/runtime/functions/scalar/RegexpSplitFunction.java

+import java.util.regex.PatternSyntaxException;
+
+/**
+ * Implementation of {@link BuiltInFunctionDefinitions#REGEXP_SPLIT}.


also see https://issues.apache.org/jira/browse/FLINK-6810 for general instructions on what else you need to add in order to contribute builtin functions, for example which docs to add, what other considerations to make

nateab · 2026-02-11T08:03:23Z

...le-planner/src/test/java/org/apache/flink/table/planner/functions/RegexpFunctionsITCase.java

+                                $("f0").regexpSplit("("),
+                                "REGEXP_SPLIT(f0, '(')",
+                                null,
+                                DataTypes.ARRAY(DataTypes.STRING()).notNull())


This seems inconsistent: the expected value is null, but the expected type is declared as notNull()?


nateab

Thanks for the fixes, almost lgtm just one comment

nateab · 2026-02-12T09:21:42Z

...ntime/src/main/java/org/apache/flink/table/runtime/functions/scalar/RegexpSplitFunction.java

+            return new GenericArrayData(result);
+        }
+
+        Pattern pattern = getRegexpPattern(regexStr);


Nice, thanks for using SqlFunctionUtils. Is there a reason you added getRegexpPattern instead of just using the existing getRegexpMatcher?

Thanks for the review!

The reason I added getRegexpPattern() instead of using getRegexpMatcher() is that REGEXP_SPLIT needs to call Pattern.split(str, -1), and the split() method is on the Pattern class, not the Matcher class.

The existing getRegexpMatcher() returns a Matcher object which is designed for matching operations like find(), group(), etc. - this works perfectly for other REGEXP_* functions like REGEXP_SUBSTR, REGEXP_COUNT, REGEXP_INSTR that need to iterate through matches.

However, REGEXP_SPLIT doesn't need to iterate through matches - it needs to split the input string by the pattern, which requires direct access to the Pattern object.
That said, if you prefer, I could inline the cache access directly in RegexpSplitFunction to avoid adding a new utility method:

    Pattern pattern;
    try {
        pattern = SqlFunctionUtils.REGEXP_PATTERN_CACHE.get(regexStr);
    } catch (PatternSyntaxException e) {
        return null;
    }

Please let me know which approach you'd prefer:

1. Keep getRegexpPattern() as a reusable utility (current approach), which could be useful for future functions that need direct Pattern access
2. Inline the cache access directly in RegexpSplitFunction
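
To make the Pattern versus Matcher distinction concrete, here is a small JDK-only sketch. The class and method names are hypothetical. It shows `Pattern.split` with and without the `-1` limit, and a hand-written `Matcher` loop that gives the same parts for patterns that never match the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of splitting via Pattern versus iterating a Matcher. */
final class PatternVsMatcherSketch {

    /** Pattern.split with limit -1 keeps trailing empty strings. */
    static String[] splitKeepTrailing(String input, String regex) {
        return Pattern.compile(regex).split(input, -1);
    }

    /** Pattern.split with the default limit (0) drops trailing empty strings. */
    static String[] splitDefault(String input, String regex) {
        return Pattern.compile(regex).split(input);
    }

    /**
     * Manual split using only Matcher.find(). Intended for patterns that
     * cannot match the empty string; zero-width matches are not handled.
     */
    static String[] splitWithMatcher(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> parts = new ArrayList<>();
        int last = 0;
        while (m.find()) {
            parts.add(input.substring(last, m.start()));
            last = m.end();
        }
        parts.add(input.substring(last));
        return parts.toArray(new String[0]);
    }
}
```

For the input `"a,b,,"` and the regex `","`, `splitKeepTrailing` yields four elements (`a`, `b`, and two empty strings), while `splitDefault` yields only `a` and `b`. The manual loop reproduces the four-element result but needs extra bookkeeping, which is why calling `split` on the `Pattern` is the more direct route.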

nateab reviewed Feb 11, 2026


github-actions bot added the community-reviewed PR has been reviewed by the community. label Feb 11, 2026

Myracle added 2 commits February 12, 2026 14:41

[FLINK-39064][Table SQL / API] Add built-in REGEXP_SPLIT function to …

64f2e29

…split string by regular expression pattern

hotfix

f3f463b

Myracle force-pushed the FLINK-39064-REGEXP_SPLIT branch from 7588cee to f3f463b Compare February 12, 2026 07:19

nateab reviewed Feb 12, 2026
