[Feature](func) Support REGEXP_EXTRACT_ALL_ARRAY#61156
[Feature](func) Support REGEXP_EXTRACT_ALL_ARRAY#61156linrrzqqq wants to merge 4 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
fc8f23d to
71b7276
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces a new scalar function regexp_extract_all_array to complement regexp_extract_all, returning a true ARRAY<STRING> (instead of a string-formatted array) so array functions like array_size() work correctly.
Changes:
- Add FE/Nereids builtin scalar function definition and visitor hook for
regexp_extract_all_array. - Add BE implementation by refactoring
regexp_extract_alloutput handling and adding an array-typed output path. - Add regression coverage and a BE unit test for the new function.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/query_p0/sql_functions/string_functions/test_string_function_regexp.groovy | Adds query cases for regexp_extract_all_array in regexp function regression suite. |
| regression-test/data/query_p0/sql_functions/string_functions/test_string_function_regexp.out | Expected outputs for the new regression queries. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/ScalarFunctionVisitor.java | Adds visitor method for the new scalar function node. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/RegexpExtractAllArray.java | New Nereids scalar function class + signatures returning Array<String>. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinScalarFunctions.java | Registers regexp_extract_all_array as a builtin scalar function. |
| be/test/exprs/function/function_like_test.cpp | Adds a BE unit test for regexp_extract_all_array. |
| be/src/exprs/function/function_regexp.cpp | Implements regexp_extract_all_array and refactors regexp_extract_all to share extraction logic with different output handlers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| TEST(FunctionLikeTest, regexp_extract_all_array) { | ||
| std::string func_name = "regexp_extract_all_array"; | ||
| auto str_type = std::make_shared<DataTypeString>(); | ||
| auto return_type = make_nullable( | ||
| std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>()))); | ||
|
|
||
| auto run_case = [&](const std::string& str, const std::string& pattern, | ||
| const std::string& expected, bool expect_null = false) { | ||
| auto col_str = ColumnString::create(); | ||
| col_str->insert_data(str.data(), str.size()); | ||
| auto col_pattern = ColumnString::create(); | ||
| col_pattern->insert_data(pattern.data(), pattern.size()); | ||
|
|
||
| Block block; | ||
| block.insert({std::move(col_str), str_type, "str"}); | ||
| block.insert({ColumnConst::create(std::move(col_pattern), 1), str_type, "pattern"}); | ||
| block.insert({nullptr, return_type, "result"}); | ||
|
|
||
| ColumnsWithTypeAndName arg_cols = {block.get_by_position(0), block.get_by_position(1)}; | ||
| auto func = | ||
| SimpleFunctionFactory::instance().get_function(func_name, arg_cols, return_type); | ||
| ASSERT_TRUE(func != nullptr); | ||
|
|
||
| std::vector<DataTypePtr> arg_types = {str_type, str_type}; | ||
| FunctionUtils fn_utils({}, arg_types, false); | ||
| auto* fn_ctx = fn_utils.get_fn_ctx(); | ||
| fn_ctx->set_constant_cols( | ||
| {nullptr, std::make_shared<ColumnPtrWrapper>(block.get_by_position(1).column)}); | ||
|
|
||
| ASSERT_EQ(Status::OK(), func->open(fn_ctx, FunctionContext::FRAGMENT_LOCAL)); | ||
| ASSERT_EQ(Status::OK(), func->open(fn_ctx, FunctionContext::THREAD_LOCAL)); | ||
| ASSERT_EQ(Status::OK(), func->execute(fn_ctx, block, {0, 1}, 2, 1)); | ||
|
|
||
| auto result_col = block.get_by_position(2).column; | ||
| ASSERT_TRUE(result_col.get() != nullptr); | ||
| if (expect_null) { | ||
| EXPECT_TRUE(result_col->is_null_at(0)); | ||
| } else { | ||
| ASSERT_FALSE(result_col->is_null_at(0)); | ||
| auto result_str = return_type->to_string(*result_col, 0); | ||
| EXPECT_EQ(expected, result_str) | ||
| << "input: '" << str << "', pattern: '" << pattern << "'"; | ||
| } | ||
|
|
||
| static_cast<void>(func->close(fn_ctx, FunctionContext::THREAD_LOCAL)); | ||
| static_cast<void>(func->close(fn_ctx, FunctionContext::FRAGMENT_LOCAL)); | ||
| }; | ||
|
|
||
| run_case("x=a3&x=18abc&x=2&y=3&x=4&x=17bcd", "x=([0-9]+)([a-z]+)", "[\"18\", \"17\"]"); | ||
| run_case("x=a3&x=18abc&x=2&y=3&x=4", "^x=([a-z]+)([0-9]+)", "[\"a\"]"); | ||
| run_case("http://a.m.baidu.com/i41915173660.htm", "i([0-9]+)", "[\"41915173660\"]"); | ||
| run_case("http://a.m.baidu.com/i41915i73660.htm", "i([0-9]+)", "[\"41915\", \"73660\"]"); | ||
| run_case("hitdecisiondlist", "(i)(.*?)(e)", "[\"i\"]"); | ||
| run_case("no_match_here", "x=([0-9]+)", "[]"); | ||
| run_case("abc", "([a-z]+)", "[\"abc\"]"); | ||
| } |
There was a problem hiding this comment.
The new regexp_extract_all_array unit test only covers non-null inputs with a constant pattern. Since the function is marked always-nullable and should propagate NULLs / handle invalid patterns, add a couple of cases that assert (1) NULL input string, (2) NULL pattern, and ideally (3) invalid pattern behavior (error or NULL result depending on constness) so the null/exception semantics are covered at the unit-test level too.
a3dcc64 to
3d76940
Compare
|
run buildall |
TPC-H: Total hot run time: 27924 ms |
TPC-DS: Total hot run time: 154360 ms |
FE UT Coverage ReportIncrement line coverage |
|
run beut |
|
/review |
Code Review Summary: Feature Support REGEXP_EXTRACT_ALL_ARRAYOverall Assessment: No significant issues found. The PR is well-structured, clean, and follows existing conventions. Critical Checkpoint Conclusions
|
| for (size_t i = 0; i < input_rows_count; ++i) { | ||
| ReturnNull ? StringOP::push_null_string(i, result_data, result_offset, null_map) | ||
| : StringOP::push_empty_string(i, result_data, result_offset); | ||
| if (null_map[i]) { |
|
run buildall |
TPC-H: Total hot run time: 26818 ms |
TPC-DS: Total hot run time: 168082 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 27041 ms |
TPC-DS: Total hot run time: 169609 ms |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1 similar comment
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
Code Review Summary — PR #61156: Feature Support REGEXP_EXTRACT_ALL_ARRAYOverviewThis PR adds a new function Critical Checkpoint Conclusions
VerdictNo blocking issues found. The PR is well-structured with good test coverage and clean refactoring. |
|
run p0 |
What problem does this PR solve?
Issue Number: close #xxx
doc: apache/doris-website#3460
function
regexp_extract_allreturnsstringtype,regexp_extract_all_arraybehaves the same but returns anarray<string>type.Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)