-
Notifications
You must be signed in to change notification settings - Fork 13.9k
[FLINK-39064][Table SQL / API] Add built-in REGEXP_SPLIT function to split string by regular expression pattern #27577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| } | ||
|
|
||
| try { | ||
| // Cache the compiled pattern to improve performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every other REGEXP_* function uses SqlFunctionUtils.getRegexpMatcher() which delegates to the shared REGEXP_PATTERN_CACHE (a ThreadLocalCache)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nateab Thanks for your time to review. The suggestions are valuable and I have modified the code.
| import java.util.regex.PatternSyntaxException; | ||
|
|
||
| /** | ||
| * Implementation of {@link BuiltInFunctionDefinitions#REGEXP_SPLIT}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also see https://issues.apache.org/jira/browse/FLINK-6810 for general instructions on what else you need to add in order to contribute builtin functions, for example which docs to add, what other considerations to make
| $("f0").regexpSplit("("), | ||
| "REGEXP_SPLIT(f0, '(')", | ||
| null, | ||
| DataTypes.ARRAY(DataTypes.STRING()).notNull()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems inconsistent, since we expect the return value to be null?
…split string by regular expression pattern
7588cee to
f3f463b
Compare
nateab
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes, almost lgtm just one comment
| return new GenericArrayData(result); | ||
| } | ||
|
|
||
| Pattern pattern = getRegexpPattern(regexStr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice thanks for using the SqlFunctionUttils, but is there a reason you added getRegexpPattern instead of just using the existing getRegexpMatcher?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
The reason I added getRegexpPattern() instead of using getRegexpMatcher() is that REGEXP_SPLIT needs to call Pattern.split(str, -1), and the split() method is on the Pattern class, not the Matcher class.
The existing getRegexpMatcher() returns a Matcher object which is designed for matching operations like find(), group(), etc. - this works perfectly for other REGEXP_* functions like REGEXP_SUBSTR, REGEXP_COUNT, REGEXP_INSTR that need to iterate through matches.
However, REGEXP_SPLIT doesn't need to iterate through matches - it needs to split the input string by the pattern, which requires direct access to the Pattern object.
That said, if you prefer, I could inline the cache access directly in RegexpSplitFunction to avoid adding a new utility method:
Pattern pattern;
try {
pattern = SqlFunctionUtils.REGEXP_PATTERN_CACHE.get(regexStr);
} catch (PatternSyntaxException e) {
return null;
}
Please let me know which approach you'd prefer:
- Keep getRegexpPattern() as a reusable utility (current approach) - could be useful for future functions that need direct Pattern access
- Inline the cache access directly in RegexpSplitFunction
What is the purpose of the change
This pull request adds a new built-in function
REGEXP_SPLITto Flink SQL and Table API, which splits a string by a regular expression pattern and returns an array of substrings. This function is commonly available in other SQL engines (e.g., Spark, Presto, Hive) and provides users with more powerful string manipulation capabilities using regex patterns.Brief change log
REGEXP_SPLITfunction definition inBuiltInFunctionDefinitionswith proper input/output type strategiesRegexpSplitFunctionas a scalar function with regex pattern caching for performance optimizationregexpSplit()method toBaseExpressionsfor Table API supportRegexpFunctionsITCasecovering various scenarios including null handling, empty regex, invalid regex patterns, and edge casesVerifying this change
This change added tests and can be verified as follows:
RegexpFunctionsITCasethat cover:[0-9]+)[,;|])\\s+)Does this pull request potentially affect one of the following parts:
@Public(Evolving): yes (BaseExpressionsis@PublicEvolving, addedregexpSplit()method)Documentation
RegexpSplitFunctionclass JavaDoc)