Skip to content

[Feature](func) Support REGEXP_EXTRACT_ALL_ARRAY#61156

Open
linrrzqqq wants to merge 4 commits intoapache:masterfrom
linrrzqqq:regexp-extract-all-array
Open

[Feature](func) Support REGEXP_EXTRACT_ALL_ARRAY#61156
linrrzqqq wants to merge 4 commits intoapache:masterfrom
linrrzqqq:regexp-extract-all-array

Conversation

@linrrzqqq
Copy link
Contributor

@linrrzqqq linrrzqqq commented Mar 9, 2026

What problem does this PR solve?

Issue Number: close #xxx

doc: apache/doris-website#3460

function regexp_extract_all returns string type, regexp_extract_all_array behaves the same but returns an array<string> type.

-- returns a String type
Doris> SELECT regexp_extract_all('AbCdE', '([[:lower:]]+)C([[:lower:]]+)');
+--------------------------------------------------------------+
| regexp_extract_all('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') |
+--------------------------------------------------------------+
| ['b']                                                        |
+--------------------------------------------------------------+
1 row in set (0.13 sec)

Doris> SELECT array_size(regexp_extract_all('AbCdE', '([[:lower:]]+)C([[:lower:]]+)'));
ERROR 1105 (HY000): errCode = 2, detailMessage = Can not find the compatibility function signature: cardinality(VARCHAR(65533))

-- returns an Array<String> type
Doris> SELECT regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)');
+--------------------------------------------------------------------+
| regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') |
+--------------------------------------------------------------------+
| ["b"]                                                              |
+--------------------------------------------------------------------+
1 row in set (1.08 sec)

Doris> SELECT array_size(regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)'));
+--------------------------------------------------------------------------------+
| array_size(regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)')) |
+--------------------------------------------------------------------------------+
|                                                                              1 |
+--------------------------------------------------------------------------------+

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@linrrzqqq
Copy link
Contributor Author

run buildall

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 6.67% (1/15) 🎉
Increment coverage report
Complete coverage report

Copilot AI review requested due to automatic review settings March 10, 2026 01:03
@linrrzqqq linrrzqqq force-pushed the regexp-extract-all-array branch from fc8f23d to 71b7276 Compare March 10, 2026 01:03
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new scalar function regexp_extract_all_array to complement regexp_extract_all, returning a true ARRAY<STRING> (instead of a string-formatted array) so array functions like array_size() work correctly.

Changes:

  • Add FE/Nereids builtin scalar function definition and visitor hook for regexp_extract_all_array.
  • Add BE implementation by refactoring regexp_extract_all output handling and adding an array-typed output path.
  • Add regression coverage and a BE unit test for the new function.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
regression-test/suites/query_p0/sql_functions/string_functions/test_string_function_regexp.groovy Adds query cases for regexp_extract_all_array in regexp function regression suite.
regression-test/data/query_p0/sql_functions/string_functions/test_string_function_regexp.out Expected outputs for the new regression queries.
fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/ScalarFunctionVisitor.java Adds visitor method for the new scalar function node.
fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/RegexpExtractAllArray.java New Nereids scalar function class + signatures returning Array<String>.
fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinScalarFunctions.java Registers regexp_extract_all_array as a builtin scalar function.
be/test/exprs/function/function_like_test.cpp Adds a BE unit test for regexp_extract_all_array.
be/src/exprs/function/function_regexp.cpp Implements regexp_extract_all_array and refactors regexp_extract_all to share extraction logic with different output handlers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +252 to +307
TEST(FunctionLikeTest, regexp_extract_all_array) {
std::string func_name = "regexp_extract_all_array";
auto str_type = std::make_shared<DataTypeString>();
auto return_type = make_nullable(
std::make_shared<DataTypeArray>(make_nullable(std::make_shared<DataTypeString>())));

auto run_case = [&](const std::string& str, const std::string& pattern,
const std::string& expected, bool expect_null = false) {
auto col_str = ColumnString::create();
col_str->insert_data(str.data(), str.size());
auto col_pattern = ColumnString::create();
col_pattern->insert_data(pattern.data(), pattern.size());

Block block;
block.insert({std::move(col_str), str_type, "str"});
block.insert({ColumnConst::create(std::move(col_pattern), 1), str_type, "pattern"});
block.insert({nullptr, return_type, "result"});

ColumnsWithTypeAndName arg_cols = {block.get_by_position(0), block.get_by_position(1)};
auto func =
SimpleFunctionFactory::instance().get_function(func_name, arg_cols, return_type);
ASSERT_TRUE(func != nullptr);

std::vector<DataTypePtr> arg_types = {str_type, str_type};
FunctionUtils fn_utils({}, arg_types, false);
auto* fn_ctx = fn_utils.get_fn_ctx();
fn_ctx->set_constant_cols(
{nullptr, std::make_shared<ColumnPtrWrapper>(block.get_by_position(1).column)});

ASSERT_EQ(Status::OK(), func->open(fn_ctx, FunctionContext::FRAGMENT_LOCAL));
ASSERT_EQ(Status::OK(), func->open(fn_ctx, FunctionContext::THREAD_LOCAL));
ASSERT_EQ(Status::OK(), func->execute(fn_ctx, block, {0, 1}, 2, 1));

auto result_col = block.get_by_position(2).column;
ASSERT_TRUE(result_col.get() != nullptr);
if (expect_null) {
EXPECT_TRUE(result_col->is_null_at(0));
} else {
ASSERT_FALSE(result_col->is_null_at(0));
auto result_str = return_type->to_string(*result_col, 0);
EXPECT_EQ(expected, result_str)
<< "input: '" << str << "', pattern: '" << pattern << "'";
}

static_cast<void>(func->close(fn_ctx, FunctionContext::THREAD_LOCAL));
static_cast<void>(func->close(fn_ctx, FunctionContext::FRAGMENT_LOCAL));
};

run_case("x=a3&x=18abc&x=2&y=3&x=4&x=17bcd", "x=([0-9]+)([a-z]+)", "[\"18\", \"17\"]");
run_case("x=a3&x=18abc&x=2&y=3&x=4", "^x=([a-z]+)([0-9]+)", "[\"a\"]");
run_case("http://a.m.baidu.com/i41915173660.htm", "i([0-9]+)", "[\"41915173660\"]");
run_case("http://a.m.baidu.com/i41915i73660.htm", "i([0-9]+)", "[\"41915\", \"73660\"]");
run_case("hitdecisiondlist", "(i)(.*?)(e)", "[\"i\"]");
run_case("no_match_here", "x=([0-9]+)", "[]");
run_case("abc", "([a-z]+)", "[\"abc\"]");
}
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new regexp_extract_all_array unit test only covers non-null inputs with a constant pattern. Since the function is marked always-nullable and should propagate NULLs / handle invalid patterns, add a couple of cases that assert (1) NULL input string, (2) NULL pattern, and ideally (3) invalid pattern behavior (error or NULL result depending on constness) so the null/exception semantics are covered at the unit-test level too.

Copilot uses AI. Check for mistakes.
@linrrzqqq linrrzqqq force-pushed the regexp-extract-all-array branch from a3dcc64 to 3d76940 Compare March 10, 2026 01:32
@linrrzqqq
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 27924 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3d769400e12fa6fb9571f5cd7d8bdd6877ccddc2, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17655	4525	4283	4283
q2	q3	10644	767	515	515
q4	4676	354	255	255
q5	7540	1191	1043	1043
q6	177	177	155	155
q7	781	821	663	663
q8	9300	1463	1372	1372
q9	4889	4848	4844	4844
q10	6348	1913	1675	1675
q11	498	281	253	253
q12	741	580	472	472
q13	18098	3289	2188	2188
q14	234	230	218	218
q15	940	798	823	798
q16	751	729	690	690
q17	707	867	440	440
q18	5893	5337	5230	5230
q19	1135	1004	615	615
q20	512	489	403	403
q21	4539	2252	1532	1532
q22	376	346	280	280
Total cold run time: 96434 ms
Total hot run time: 27924 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4670	4693	4619	4619
q2	q3	3876	4382	3836	3836
q4	916	1201	789	789
q5	4063	4376	4344	4344
q6	189	179	151	151
q7	1814	1654	1613	1613
q8	2487	2719	2570	2570
q9	7593	7537	7446	7446
q10	3716	3960	3700	3700
q11	542	456	429	429
q12	498	646	458	458
q13	2895	3196	2373	2373
q14	302	294	260	260
q15	837	799	833	799
q16	735	806	736	736
q17	1211	1410	1458	1410
q18	7210	6791	6793	6791
q19	1002	907	909	907
q20	2076	2202	2044	2044
q21	3992	3513	3460	3460
q22	444	416	388	388
Total cold run time: 51068 ms
Total hot run time: 49123 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 154360 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3d769400e12fa6fb9571f5cd7d8bdd6877ccddc2, data reload: false

query5	4318	663	540	540
query6	335	239	211	211
query7	4219	482	270	270
query8	334	258	250	250
query9	8744	2798	2780	2780
query10	518	425	345	345
query11	7313	5837	5710	5710
query12	186	131	124	124
query13	1271	461	358	358
query14	5692	3936	3690	3690
query14_1	2889	2891	2863	2863
query15	208	196	175	175
query16	1033	494	454	454
query17	1114	732	625	625
query18	2488	452	362	362
query19	229	215	184	184
query20	137	138	132	132
query21	226	149	129	129
query22	4792	4895	4560	4560
query23	15998	15570	15430	15430
query23_1	15536	16127	15968	15968
query24	7818	1690	1292	1292
query24_1	1316	1286	1263	1263
query25	647	490	482	482
query26	1368	276	170	170
query27	2822	529	311	311
query28	4835	2009	1926	1926
query29	892	592	529	529
query30	339	258	224	224
query31	1452	1593	1349	1349
query32	82	77	78	77
query33	522	351	321	321
query34	955	999	637	637
query35	645	685	604	604
query36	1082	1136	974	974
query37	131	92	85	85
query38	2960	3034	2874	2874
query39	902	879	831	831
query39_1	815	834	825	825
query40	228	154	137	137
query41	63	61	59	59
query42	310	313	308	308
query43	247	278	227	227
query44	
query45	200	192	179	179
query46	892	997	625	625
query47	2117	2111	2008	2008
query48	309	328	233	233
query49	631	490	386	386
query50	715	289	220	220
query51	4056	4140	4087	4087
query52	293	297	282	282
query53	288	345	290	290
query54	302	276	277	276
query55	98	86	87	86
query56	309	328	312	312
query57	1346	1335	1271	1271
query58	292	282	280	280
query59	1336	1432	1273	1273
query60	366	389	332	332
query61	153	148	151	148
query62	639	581	544	544
query63	321	283	286	283
query64	5040	1277	1005	1005
query65	
query66	1450	464	366	366
query67	16282	16452	16398	16398
query68	
query69	390	308	293	293
query70	1007	970	964	964
query71	356	320	313	313
query72	2876	2685	2529	2529
query73	562	571	338	338
query74	10091	9963	9750	9750
query75	2887	2781	2488	2488
query76	2286	1058	708	708
query77	374	396	338	338
query78	11137	11321	10608	10608
query79	1148	816	596	596
query80	1353	664	585	585
query81	556	295	254	254
query82	999	161	125	125
query83	344	279	255	255
query84	261	147	107	107
query85	976	580	501	501
query86	427	339	306	306
query87	3210	3107	2986	2986
query88	3651	2690	2692	2690
query89	438	380	357	357
query90	2056	187	188	187
query91	172	164	142	142
query92	75	76	73	73
query93	972	869	505	505
query94	634	332	307	307
query95	598	397	318	318
query96	654	537	232	232
query97	2478	2512	2428	2428
query98	231	222	226	222
query99	995	939	928	928
Total cold run time: 234754 ms
Total hot run time: 154360 ms

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 6.67% (1/15) 🎉
Increment coverage report
Complete coverage report

@linrrzqqq
Copy link
Contributor Author

run beut

@zclllyybb
Copy link
Contributor

/review

@github-actions
Copy link
Contributor

Code Review Summary: Feature Support REGEXP_EXTRACT_ALL_ARRAY

Overall Assessment: No significant issues found. The PR is well-structured, clean, and follows existing conventions.

Critical Checkpoint Conclusions

  1. Goal and Test Coverage: The goal is to add a regexp_extract_all_array function that returns Array<String> instead of a string-formatted array. The code accomplishes this goal. Tests include BE unit tests (7 match cases, null input propagation, invalid pattern), regression tests (11 cases covering NULL, table columns, empty results, 0 capture groups, non-const patterns), which provide good coverage.

  2. Modification size/clarity: The change is well-scoped. The refactoring from monolithic execute_impl to a Handler-based template strategy pattern is a clean approach that avoids code duplication between the string output and array output variants. The RegexpExtractImpl refactoring (moving execute logic from the wrapper class into the Impl struct) is a separate concern but keeps the two Impl structs consistent.

  3. Concurrency: No concurrency concerns. The function execution is single-threaded per pipeline task.

  4. Lifecycle management: No special lifecycle issues. Function state uses shared_ptr<void> for automatic cleanup.

  5. Configuration items: None added.

  6. Incompatible changes / upgrade compatibility: No incompatible changes. This is a purely additive new function.

  7. Parallel code paths: The existing regexp_extract_all (string output) is refactored to use the same template infrastructure, maintaining behavioral compatibility. The string formatting logic (['a','b']) is preserved identically.

  8. Special conditional checks: None.

  9. Test coverage: Comprehensive - BE unit tests cover normal matches, empty results, NULL propagation, invalid patterns. Regression tests cover NULL, table columns with NULL values, non-const patterns (concat), 0 capture groups, and multiple match scenarios.

  10. Observability: Not applicable for a simple scalar function.

  11. Transaction/persistence: Not applicable.

  12. FE-BE variable passing: The new function is registered in both FE (BuiltinScalarFunctions, ScalarFunctionVisitor, new RegexpExtractAllArray.java) and BE (register_function_regexp_extract). The FE class correctly follows the pattern of the existing RegexpExtractAll with ArrayType return type.

  13. Performance: The Handler/template approach introduces zero runtime overhead (all dispatch is compile-time). The std::visit + make_bool_variant pattern for const/non-const dispatch is standard in the codebase. No redundant operations or anti-patterns.

  14. Other observations: No issues found. The FE AlwaysNullable trait is appropriate. The array column construction (ColumnArray with nested ColumnNullable) is correct. The push_back-based offset tracking in RegexpExtractAllArrayOutput correctly builds the array structure sequentially.

for (size_t i = 0; i < input_rows_count; ++i) {
ReturnNull ? StringOP::push_null_string(i, result_data, result_offset, null_map)
: StringOP::push_empty_string(i, result_data, result_offset);
if (null_map[i]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when will this be true?

@linrrzqqq
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 26818 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3ae71b102760b34da1d43fe44ecfb74b5a1b463f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17677	4475	4286	4286
q2	q3	10645	800	525	525
q4	4683	371	261	261
q5	7556	1189	1023	1023
q6	173	179	149	149
q7	782	829	672	672
q8	9327	1453	1401	1401
q9	4928	4696	4691	4691
q10	6302	1896	1671	1671
q11	487	261	248	248
q12	731	572	468	468
q13	18054	2919	2151	2151
q14	232	239	215	215
q15	q16	751	733	666	666
q17	712	825	458	458
q18	6195	5362	5240	5240
q19	1128	976	616	616
q20	520	514	376	376
q21	4571	1826	1418	1418
q22	531	373	283	283
Total cold run time: 95985 ms
Total hot run time: 26818 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4795	4524	4769	4524
q2	q3	3883	4329	3836	3836
q4	879	1201	799	799
q5	4115	4416	4379	4379
q6	193	177	142	142
q7	1836	1661	1557	1557
q8	2508	2735	2534	2534
q9	7538	7466	7337	7337
q10	3808	4036	3618	3618
q11	498	450	413	413
q12	468	586	448	448
q13	2676	3376	2322	2322
q14	272	285	269	269
q15	q16	725	761	727	727
q17	1157	1376	1596	1376
q18	7310	6952	6690	6690
q19	911	898	887	887
q20	2030	2167	1964	1964
q21	3897	3471	3391	3391
q22	512	436	383	383
Total cold run time: 50011 ms
Total hot run time: 47596 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 168082 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3ae71b102760b34da1d43fe44ecfb74b5a1b463f, data reload: false

query5	4335	645	517	517
query6	338	231	210	210
query7	4217	458	259	259
query8	338	247	239	239
query9	8720	2763	2727	2727
query10	496	378	349	349
query11	6992	5076	4870	4870
query12	201	128	124	124
query13	1267	458	346	346
query14	5768	3755	3472	3472
query14_1	2842	2808	2820	2808
query15	211	198	190	190
query16	1006	473	444	444
query17	906	717	619	619
query18	2474	454	376	376
query19	203	200	179	179
query20	131	127	126	126
query21	210	129	113	113
query22	13367	13990	14579	13990
query23	16121	15830	15631	15631
query23_1	15620	15633	15644	15633
query24	7703	1618	1215	1215
query24_1	1224	1213	1243	1213
query25	525	463	420	420
query26	1254	262	148	148
query27	2779	496	293	293
query28	4511	1844	1855	1844
query29	795	554	507	507
query30	295	235	189	189
query31	1025	949	860	860
query32	84	71	67	67
query33	508	320	283	283
query34	902	862	522	522
query35	622	669	588	588
query36	1088	1119	938	938
query37	130	95	86	86
query38	2952	2936	2846	2846
query39	855	837	800	800
query39_1	798	793	787	787
query40	243	151	135	135
query41	66	60	59	59
query42	260	254	256	254
query43	233	250	224	224
query44	
query45	194	191	186	186
query46	871	975	595	595
query47	2106	2148	1982	1982
query48	310	308	230	230
query49	639	459	383	383
query50	680	281	209	209
query51	4017	4025	4033	4025
query52	263	263	250	250
query53	292	336	276	276
query54	292	274	253	253
query55	96	87	83	83
query56	309	348	313	313
query57	1925	1843	1683	1683
query58	281	286	274	274
query59	2773	2986	2775	2775
query60	335	342	321	321
query61	160	147	154	147
query62	618	584	536	536
query63	311	275	274	274
query64	4931	1255	981	981
query65	
query66	1478	462	348	348
query67	24142	24213	24070	24070
query68	
query69	413	307	287	287
query70	945	983	893	893
query71	341	308	294	294
query72	2849	2690	2604	2604
query73	542	542	319	319
query74	9648	9581	9369	9369
query75	2854	2753	2480	2480
query76	2287	1039	686	686
query77	366	389	319	319
query78	10903	11017	10441	10441
query79	2557	770	577	577
query80	1794	631	578	578
query81	552	267	242	242
query82	1009	155	121	121
query83	340	274	248	248
query84	305	127	112	112
query85	971	569	437	437
query86	413	326	286	286
query87	3161	3098	3039	3039
query88	3541	2669	2645	2645
query89	440	365	349	349
query90	2004	176	179	176
query91	171	162	140	140
query92	73	74	73	73
query93	1129	827	490	490
query94	647	319	285	285
query95	578	398	323	323
query96	647	519	228	228
query97	2454	2456	2415	2415
query98	255	224	215	215
query99	1002	984	890	890
Total cold run time: 251025 ms
Total hot run time: 168082 ms

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.69% (19739/37463)
Line Coverage 36.24% (184249/508370)
Region Coverage 32.37% (142159/439180)
Branch Coverage 33.56% (62150/185169)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.30% (26880/36669)
Line Coverage 56.60% (286772/506657)
Region Coverage 53.80% (238447/443229)
Branch Coverage 55.54% (103119/185651)

@linrrzqqq
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 27041 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 57ab120538f06ee3f6fad174d286ec33854ca6fa, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17655	4498	4348	4348
q2	q3	10647	771	514	514
q4	4674	344	248	248
q5	7557	1220	1024	1024
q6	174	171	147	147
q7	786	837	678	678
q8	9874	1493	1348	1348
q9	5038	4737	4683	4683
q10	6329	1931	1641	1641
q11	495	265	252	252
q12	742	580	465	465
q13	18033	2929	2174	2174
q14	224	235	215	215
q15	q16	727	728	675	675
q17	735	844	431	431
q18	5893	5332	5305	5305
q19	1374	983	608	608
q20	541	504	387	387
q21	4590	1851	1592	1592
q22	413	340	306	306
Total cold run time: 96501 ms
Total hot run time: 27041 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4705	4673	4711	4673
q2	q3	3874	4359	3807	3807
q4	891	1184	803	803
q5	4097	4486	4412	4412
q6	191	177	146	146
q7	1784	1642	1514	1514
q8	2483	2776	2543	2543
q9	7709	7333	7315	7315
q10	3779	4046	3687	3687
q11	507	431	414	414
q12	476	576	447	447
q13	2824	3078	2239	2239
q14	279	292	306	292
q15	q16	733	786	757	757
q17	1131	1366	1622	1366
q18	7088	6895	6757	6757
q19	895	928	890	890
q20	2051	2183	2050	2050
q21	3945	3511	3280	3280
q22	467	420	385	385
Total cold run time: 49909 ms
Total hot run time: 47777 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 169609 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 57ab120538f06ee3f6fad174d286ec33854ca6fa, data reload: false

query5	4346	651	535	535
query6	351	239	221	221
query7	4228	472	267	267
query8	351	250	238	238
query9	8740	2717	2742	2717
query10	502	405	355	355
query11	6997	5116	4939	4939
query12	196	136	127	127
query13	1281	467	345	345
query14	5788	3734	3539	3539
query14_1	2914	2882	2832	2832
query15	208	199	178	178
query16	970	456	468	456
query17	1075	744	639	639
query18	2456	460	358	358
query19	226	218	190	190
query20	142	135	132	132
query21	216	140	113	113
query22	13223	14630	14557	14557
query23	16163	15893	15610	15610
query23_1	15692	15801	15833	15801
query24	7827	1630	1254	1254
query24_1	1260	1253	1271	1253
query25	622	458	416	416
query26	1236	271	154	154
query27	2757	489	295	295
query28	4448	1835	1856	1835
query29	853	572	484	484
query30	300	221	190	190
query31	1002	934	878	878
query32	96	80	68	68
query33	501	343	291	291
query34	889	894	543	543
query35	643	702	614	614
query36	1077	1129	958	958
query37	143	94	90	90
query38	2909	2912	2821	2821
query39	852	846	803	803
query39_1	806	826	799	799
query40	231	160	135	135
query41	63	58	62	58
query42	264	258	253	253
query43	239	259	224	224
query44	
query45	196	200	185	185
query46	896	1003	613	613
query47	2111	2570	2015	2015
query48	312	318	232	232
query49	627	472	391	391
query50	701	288	230	230
query51	4095	4033	4034	4033
query52	267	267	275	267
query53	293	344	284	284
query54	327	289	277	277
query55	102	91	86	86
query56	326	328	307	307
query57	1919	1900	1564	1564
query58	301	283	280	280
query59	2812	3037	2787	2787
query60	352	343	336	336
query61	152	156	155	155
query62	629	599	544	544
query63	313	282	280	280
query64	5009	1267	1017	1017
query65	
query66	1464	453	358	358
query67	24207	24253	24209	24209
query68	
query69	414	335	296	296
query70	983	959	968	959
query71	358	332	328	328
query72	2788	2648	2516	2516
query73	567	555	331	331
query74	9596	9582	9401	9401
query75	2844	2754	2460	2460
query76	2277	1067	703	703
query77	368	411	321	321
query78	10918	11110	10495	10495
query79	2739	785	585	585
query80	1726	644	556	556
query81	550	255	239	239
query82	1001	154	119	119
query83	318	264	242	242
query84	258	119	94	94
query85	901	492	450	450
query86	428	308	320	308
query87	3162	3175	3019	3019
query88	3586	2667	2668	2667
query89	431	382	352	352
query90	1999	188	175	175
query91	176	160	145	145
query92	79	81	69	69
query93	1178	864	506	506
query94	649	316	285	285
query95	602	346	389	346
query96	654	528	232	232
query97	2438	2473	2382	2382
query98	249	228	232	228
query99	1014	1016	905	905
Total cold run time: 252154 ms
Total hot run time: 169609 ms

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 6.67% (1/15) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.71% (19791/37544)
Line Coverage 36.23% (184700/509867)
Region Coverage 32.49% (143037/440236)
Branch Coverage 33.67% (62542/185772)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.19% (21017/36747)
Line Coverage 40.24% (204502/508153)
Region Coverage 37.04% (164540/444282)
Branch Coverage 37.77% (70353/186254)

1 similar comment
@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.19% (21017/36747)
Line Coverage 40.24% (204502/508153)
Region Coverage 37.04% (164540/444282)
Branch Coverage 37.77% (70353/186254)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.24% (21033/36747)
Line Coverage 40.27% (204619/508153)
Region Coverage 37.03% (164539/444282)
Branch Coverage 37.80% (70399/186254)

@zclllyybb
Copy link
Contributor

/review

@github-actions
Copy link
Contributor

Code Review Summary — PR #61156: Feature Support REGEXP_EXTRACT_ALL_ARRAY

Overview

This PR adds a new function regexp_extract_all_array that returns Array<String> type instead of the string-formatted array ['a','b'] returned by regexp_extract_all. The implementation is well-structured, using a strategy/handler pattern to share the core regex matching logic between the two output formats. The refactoring also consolidates RegexpReplaceImpl/RegexpReplaceOneImpl into a single template and lifts execute_impl logic from FunctionRegexpFunctionality into the Impl structs.

Critical Checkpoint Conclusions

  1. Goal & correctness: The goal is to add regexp_extract_all_array returning Array<Nullable<String>>. The code accomplishes this. Both BE unit tests and regression tests cover the new function. The refactoring of existing functions (regexp_replace, regexp_replace_one, regexp_extract, regexp_extract_or_null) is a reasonable consolidation. Tests exist for the new function.

  2. Modification scope: The PR is focused on the new function plus a clean refactoring of adjacent regexp functions. The refactoring is larger than strictly necessary for the feature but improves code quality by eliminating duplication. Acceptable scope.

  3. Concurrency: No concurrency concerns. Function execution is single-threaded per fragment instance. The THREAD_LOCAL function state pattern is correctly followed.

  4. Lifecycle management: No special lifecycle concerns. The RegexpExtractEngine is either pre-compiled in open() and stored via set_function_state, or created per-row in the inner loop with unique_ptr. Both patterns are correct.

  5. Configuration items: No new configuration items added. N/A.

  6. Incompatible changes / rolling upgrades: No incompatible changes. The new function is additive. Existing regexp_extract_all behavior is preserved (verified by existing tests in the same file).

  7. Parallel code paths: The handler pattern (RegexpExtractAllStringOutput / RegexpExtractAllArrayOutput) correctly shares the core regexp_extract_all_inner_loop. Both handlers are exercised. The null_map check removal from RegexpReplaceImpl::execute_impl is correct — the null_map is initialized to all-zeros and the IFunction framework handles nullable column unwrapping.

  8. Special conditional checks: N/A.

  9. Test coverage: Good coverage.

    • BE unit tests: cover basic matching, null input propagation (null string, null pattern), invalid pattern handling, no-match case, empty result (no capturing groups).
    • Regression tests: 11 test cases covering constant patterns, column patterns, non-constant patterns (concat('^', k)), NULL inputs, empty match, joins with other columns. Tests use qt_ prefix with ORDER BY for deterministic ordering.
    • Missing: no test for regexp_extract_all_array with enable_extended_regex=true (boost regex path). The existing regexp_extract_all has such tests but the array variant does not. This is minor since the matching engine is shared.
  10. Observability: N/A for a new scalar function.

  11. Transaction / persistence: N/A.

  12. Data writes: N/A.

  13. FE-BE variable passing: The new function is registered in BuiltinScalarFunctions.java, RegexpExtractAllArray.java and ScalarFunctionVisitor.java. The FE signature correctly maps to the BE function name regexp_extract_all_array. No new thrift variables needed.

  14. Performance:

    • The RegexpExtractAllArrayOutput::push_matches uses insert_data per match element, which is straightforward. For very large result sets this could be optimized with batch insertion, but for typical regex match counts this is fine.
    • The RegexpExtractAllStringOutput::push_matches uses a pre-calculated buffer with manual serialization, which is an improvement over the old code that used string concatenation (res += "'" + res_matches[j] + "'").
    • The std::visit + make_bool_variant pattern for const/non-const dispatch is consistent with the rest of the codebase.
  15. Other observations: No issues found. The code follows existing patterns in the codebase, the FE class mirrors RegexpExtractAll.java closely, and the regression tests follow the required conventions (table dropped before use, qt_ prefix with ORDER BY).

Verdict

No blocking issues found. The PR is well-structured with good test coverage and clean refactoring.

@linrrzqqq
Copy link
Contributor Author

run p0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants