Conversation

Contributor

@Myracle Myracle commented Feb 11, 2026

What is the purpose of the change

This pull request adds a new built-in function, REGEXP_SPLIT, to Flink SQL and the Table API. It splits a string by a regular expression pattern and returns an array of substrings. Equivalent functionality is available in other SQL engines (e.g., Spark, Presto, Hive), and it gives users more powerful regex-based string manipulation.

Brief change log

  • Added REGEXP_SPLIT function definition in BuiltInFunctionDefinitions with proper input/output type strategies
  • Implemented RegexpSplitFunction as a scalar function with regex pattern caching for performance optimization
  • Added regexpSplit() method to BaseExpressions for Table API support
  • Added comprehensive test cases in RegexpFunctionsITCase covering various scenarios including null handling, empty regex, invalid regex patterns, and edge cases

Verifying this change

This change added tests and can be verified as follows:

  • Added integration tests in RegexpFunctionsITCase that cover:
    • Basic regex split functionality (e.g., splitting by digit patterns [0-9]+)
    • Null input handling (both null string and null pattern)
    • Empty regex pattern (split by each character)
    • Multi-character delimiter regex patterns (e.g., [,;|])
    • Whitespace regex patterns (e.g., \\s+)
    • No match scenarios (returns original string as single-element array)
    • Invalid regex pattern handling (returns null)
    • Input validation errors for non-string type inputs
    • SQL signature validation errors
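
For readers who want to see the listed scenarios concretely, here is a minimal plain-JDK sketch of the behavior described above. It is not the code in this PR: the class name RegexpSplitSketch and the method regexpSplit are hypothetical, it works on String rather than Flink's internal data types, and the per-character handling of an empty pattern is an assumption based on the test description.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical plain-JDK sketch of the REGEXP_SPLIT semantics listed above. */
final class RegexpSplitSketch {

    /** Returns the pieces of str split by regex, or null for null input or an invalid pattern. */
    static String[] regexpSplit(String str, String regex) {
        if (str == null || regex == null) {
            return null; // null string or null pattern yields null
        }
        if (regex.isEmpty()) {
            // Assumption: an empty pattern splits the input into single characters.
            String[] chars = new String[str.length()];
            for (int i = 0; i < str.length(); i++) {
                chars[i] = String.valueOf(str.charAt(i));
            }
            return chars;
        }
        try {
            // Limit -1 keeps trailing empty strings instead of dropping them.
            return Pattern.compile(regex).split(str, -1);
        } catch (PatternSyntaxException e) {
            return null; // invalid pattern yields null rather than an exception
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(regexpSplit("abc123def456ghi", "[0-9]+")));
        System.out.println(Arrays.toString(regexpSplit("hello", "[0-9]+")));
        System.out.println(Arrays.toString(regexpSplit("abc", "(")));
    }
}
```

This sketch compiles the pattern on every call; the PR itself uses pattern caching, which is left out here to keep the example short.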

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes (BaseExpressions is @PublicEvolving, added regexpSplit() method)
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no (new function only, with pattern caching for optimization)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs (function usage examples are documented in RegexpSplitFunction class JavaDoc)

Collaborator

flinkbot commented Feb 11, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

}

try {
// Cache the compiled pattern to improve performance
Contributor

Every other REGEXP_* function uses SqlFunctionUtils.getRegexpMatcher(), which delegates to the shared REGEXP_PATTERN_CACHE (a ThreadLocalCache). Could this function reuse that cache instead of maintaining its own?

See, for example, https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/functions/scalar/RegexpSubstrFunction.java#L42
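
To illustrate the general idea of a shared per-thread pattern cache, here is a minimal plain-JDK sketch. It is not Flink's ThreadLocalCache or REGEXP_PATTERN_CACHE: the class name PatternCacheSketch, the size limit, and the LRU eviction policy are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical per-thread cache of compiled patterns; not Flink's implementation. */
final class PatternCacheSketch {

    // Assumed bound on cached patterns per thread.
    private static final int MAX_ENTRIES = 64;

    // Each thread gets its own access-ordered map, so no synchronization is needed.
    private static final ThreadLocal<Map<String, Pattern>> CACHE =
            ThreadLocal.withInitial(
                    () ->
                            new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
                                @Override
                                protected boolean removeEldestEntry(
                                        Map.Entry<String, Pattern> eldest) {
                                    // Evict the least recently used pattern once over the limit.
                                    return size() > MAX_ENTRIES;
                                }
                            });

    /** Returns a compiled pattern, compiling it only on the first request per thread. */
    static Pattern get(String regex) {
        // A PatternSyntaxException from Pattern.compile propagates and nothing is cached.
        return CACHE.get().computeIfAbsent(regex, Pattern::compile);
    }

    public static void main(String[] args) {
        Pattern first = get("[0-9]+");
        Pattern second = get("[0-9]+");
        System.out.println(first == second); // same instance on the same thread
    }
}
```

Sharing one cache across all regex functions means a pattern compiled for one function can be reused by another on the same thread, which is the benefit a function-local cache gives up.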

Contributor Author

@nateab Thanks for taking the time to review. The suggestions are valuable, and I have updated the code accordingly.

import java.util.regex.PatternSyntaxException;

/**
* Implementation of {@link BuiltInFunctionDefinitions#REGEXP_SPLIT}.
Contributor

Also see https://issues.apache.org/jira/browse/FLINK-6810 for general instructions on what else is needed when contributing built-in functions, for example which docs to add and what other considerations to make.

$("f0").regexpSplit("("),
"REGEXP_SPLIT(f0, '(')",
null,
DataTypes.ARRAY(DataTypes.STRING()).notNull())
Contributor

This seems inconsistent: the expected return value is null, but the declared result type is DataTypes.ARRAY(DataTypes.STRING()).notNull().

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Feb 11, 2026
@Myracle Myracle force-pushed the FLINK-39064-REGEXP_SPLIT branch from 7588cee to f3f463b Compare February 12, 2026 07:19
Contributor

@nateab nateab left a comment

Thanks for the fixes. Almost LGTM, just one comment.

return new GenericArrayData(result);
}

Pattern pattern = getRegexpPattern(regexStr);
Contributor

Nice, thanks for using SqlFunctionUtils. Is there a reason you added getRegexpPattern instead of just using the existing getRegexpMatcher?

Contributor Author

Thanks for the review!

The reason I added getRegexpPattern() instead of using getRegexpMatcher() is that REGEXP_SPLIT needs to call Pattern.split(str, -1), and the split() method is on the Pattern class, not the Matcher class.

The existing getRegexpMatcher() returns a Matcher object, which is designed for matching operations such as find() and group(). This works well for other REGEXP_* functions like REGEXP_SUBSTR, REGEXP_COUNT, and REGEXP_INSTR, which need to iterate through matches.

However, REGEXP_SPLIT doesn't need to iterate through matches. It needs to split the input string by the pattern, which requires direct access to the Pattern object.

That said, if you prefer, I could inline the cache access directly in RegexpSplitFunction to avoid adding a new utility method:

Pattern pattern;
try {
    pattern = SqlFunctionUtils.REGEXP_PATTERN_CACHE.get(regexStr);
} catch (PatternSyntaxException e) {
    return null;
}

Please let me know which approach you'd prefer:

  1. Keep getRegexpPattern() as a reusable utility (current approach) - could be useful for future functions that need direct Pattern access
  2. Inline the cache access directly in RegexpSplitFunction
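
As background on the Pattern.split(str, -1) call mentioned above, the following plain-JDK sketch shows how the limit argument changes the result. The class and method names are hypothetical and exist only for this illustration; the split calls themselves are standard java.util.regex API.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical demo of how the limit argument of Pattern.split affects trailing empties. */
final class SplitLimitSketch {

    /** A negative limit keeps every piece, including trailing empty strings. */
    static String[] splitKeepingTrailing(Pattern pattern, String input) {
        return pattern.split(input, -1);
    }

    /** The one-argument overload behaves like limit 0 and drops trailing empty strings. */
    static String[] splitDroppingTrailing(Pattern pattern, String input) {
        return pattern.split(input);
    }

    public static void main(String[] args) {
        Pattern comma = Pattern.compile(",");
        System.out.println(Arrays.toString(splitKeepingTrailing(comma, "a,b,,")));
        System.out.println(Arrays.toString(splitDroppingTrailing(comma, "a,b,,")));
    }
}
```

With the input "a,b,," the first call returns four elements (two of them empty) and the second returns only "a" and "b", so a negative limit preserves every field, including empty trailing ones.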
