branch-4.1: [feat](tvf) add Parquet metadata TVF (#58972)(#60938)(#56603) #61474
xylaaaaa wants to merge 4 commits into apache:branch-4.1
run buildall
Pull request overview
Adds a new Parquet metadata table-valued function (TVF) that lets users inspect Parquet footer/schema/row-group stats (plus KV metadata and bloom-filter probing) via SQL, wiring it through FE planning + Thrift + BE metadata scanning, and introduces regression/UT coverage.
Changes:
- Introduce `parquet_meta` TVF (and companion names `parquet_file_metadata`, `parquet_kv_metadata`, `parquet_bloom_probe`) with FE parameter validation + scan-range construction.
- Add BE-side `ParquetMetadataReader` and Parquet utility helpers/tests to read and expose Parquet footer metadata.
- Add regression suite + baselines + Parquet test artifacts; plus related const_cast cleanup / const-access adjustments in vec columns.
Reviewed changes
Copilot reviewed 32 out of 36 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| regression-test/suites/external_table_p0/tvf/test_parquet_meta_tvf.groovy | Adds end-to-end regression coverage for Parquet metadata TVFs (S3/HDFS/local + error cases). |
| regression-test/data/external_table_p0/tvf/test_parquet_meta_tvf.out | Baseline output for the new regression suite queries. |
| regression-test/data/external_table_p0/tvf/meta.parquet | Regression Parquet file used for metadata assertions. |
| regression-test/data/external_table_p0/tvf/kvmeta.parquet | Regression Parquet file containing KV metadata. |
| regression-test/data/external_table_p0/tvf/empty.parquet | Regression Parquet file representing an empty file case. |
| regression-test/data/external_table_p0/tvf/bloommeta.parquet | Regression Parquet file containing bloom-filter metadata for probing. |
| gensrc/thrift/Types.thrift | Adds TMetadataType.PARQUET for Parquet metadata scans. |
| gensrc/thrift/PlanNodes.thrift | Adds TParquetMetadataParams and wires it into TMetaScanRange. |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/TableValuedFunctionIf.java | Registers new TVF names and maps convenience names to mode via param copy. |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/ParquetMetadataTableValuedFunction.java | Implements FE TVF analysis/validation, storage property handling, and glob expansion. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/TableValuedFunctionVisitor.java | Adds Nereids visitor hook for ParquetMeta. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetMeta.java | Nereids TVF expression for parquet_meta. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetKvMetadata.java | Nereids TVF expression for parquet_kv_metadata. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetFileMetadata.java | Nereids TVF expression for parquet_file_metadata. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetBloomProbe.java | Nereids TVF expression for parquet_bloom_probe. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinTableValuedFunctions.java | Registers new Parquet TVFs as built-in table-valued functions. |
| be/test/vec/exec/format/parquet/parquet_utils_test.cpp | Adds UT coverage for new Parquet utility helpers. |
| be/src/vec/exec/scan/meta_scanner.cpp | Routes TMetadataType::PARQUET to the new BE Parquet metadata reader. |
| be/src/vec/exec/format/table/parquet_utils.h | Declares Parquet metadata column layouts + helper APIs. |
| be/src/vec/exec/format/table/parquet_utils.cpp | Implements Parquet utility helpers (type formatting, stats decoding, inserts, etc.). |
| be/src/vec/exec/format/table/parquet_metadata_reader.h | Declares BE metadata reader for Parquet footer/schema/stats/bloom probe modes. |
| be/src/vec/exec/format/table/parquet_metadata_reader.cpp | Implements Parquet footer reading and row emission for each mode. |
| be/src/vec/columns/subcolumn_tree.h | Adds non-const lookup helpers and adjusts mutable access (used by Variant changes). |
| be/src/vec/columns/column_vector.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_variant.h | Adjusts Variant root-type enforcement API signature. |
| be/src/vec/columns/column_variant.cpp | Removes some const_cast usage; tweaks Variant subcolumn access and root-type enforcement. |
| be/src/vec/columns/column_struct.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_string.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_nullable.cpp | Adds explanatory comments for const_cast in serialization; removes one const_cast in selector filtering. |
| be/src/vec/columns/column_map.cpp | Removes unnecessary const_cast in recursive map dedup; adds serialization comment. |
| be/src/vec/columns/column_decimal.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_const.h | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_array.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column.h | Adds non-const check_and_get_column overload; removes const_cast in check_and_get_column_ptr. |
| be/src/vec/columns/column.cpp | Adds explanatory comment for using const_cast to call non-const subcolumn traversal. |
| be/src/olap/rowset/segment_v2/column_reader.cpp | Removes const_cast by using non-const check_and_get_column overload for nullable defaults insertion. |
Comments suppressed due to low confidence (1)
be/src/vec/columns/column_variant.cpp:2113
`ensure_root_node_type(...)` is declared `const`, but it mutates the root subcolumn (casts and rewrites `root.data[...]`, `root.data_types[...]`, etc.). Making this `const` forces other APIs (`SubcolumnsTree::get_mutable_root`) to become mutable-through-const too. Please make `ensure_root_node_type` non-const (and keep mutable accessors non-const), or otherwise encapsulate the mutation behind clearly `mutable` state with strong justification.
void ColumnVariant::ensure_root_node_type(const DataTypePtr& expected_root_type) const {
auto& root = subcolumns.get_mutable_root()->data;
if (!root.get_least_common_type()->equals(*expected_root_type)) {
// make sure the root type is alawys as expected
ColumnPtr casted_column;
protected TableValuedFunctionIf toCatalogFunction() {
    try {
        Map<String, String> arguments = getTVFProperties().getMap();
        arguments.put("mode", "parquet_file_metadata");
        return new ParquetMetadataTableValuedFunction(arguments);
    } catch (Throwable t) {

protected TableValuedFunctionIf toCatalogFunction() {
    try {
        Map<String, String> arguments = getTVFProperties().getMap();
        arguments.put("mode", "parquet_kv_metadata");
        return new ParquetMetadataTableValuedFunction(arguments);
    } catch (Throwable t) {

protected TableValuedFunctionIf toCatalogFunction() {
    try {
        Map<String, String> arguments = getTVFProperties().getMap();
        arguments.put("mode", "parquet_bloom_probe");
        return new ParquetMetadataTableValuedFunction(arguments);
    } catch (Throwable t) {
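The three `toCatalogFunction()` snippets above each write `mode` directly into the map returned by `getTVFProperties().getMap()`. Whether that map is shared or unmodifiable is not shown in this PR excerpt, so the following is only a minimal sketch, under the assumption that it might be, of setting the mode on a copy instead. `ModeAliasSketch` and `withMode` are hypothetical names, not part of Doris.

```java
import java.util.HashMap;
import java.util.Map;

class ModeAliasSketch {
    // Hypothetical helper: returns a fresh map with "mode" set,
    // leaving the caller's map untouched.
    static Map<String, String> withMode(Map<String, String> props, String mode) {
        Map<String, String> copy = new HashMap<>(props);
        copy.put("mode", mode);
        return copy;
    }

    public static void main(String[] args) {
        // Map.of returns an unmodifiable map, standing in for a shared property map.
        Map<String, String> shared = Map.of("file_path", "/tmp/meta.parquet");
        Map<String, String> arguments = withMode(shared, "parquet_kv_metadata");
        System.out.println(arguments.get("mode"));
        System.out.println(shared.containsKey("mode"));
    }
}
```

With a copy, each convenience name (`parquet_file_metadata`, `parquet_kv_metadata`, `parquet_bloom_probe`) can set its own mode without one call leaking into another through a shared map.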
String scheme = null;
try {
    scheme = new URI(parsedPath).getScheme();
} catch (URISyntaxException e) {
    scheme = null;
}
if (uriProvided) {
    if (Strings.isNullOrEmpty(scheme)) {
        throw new AnalysisException("Property 'uri' must contain a scheme for parquet_meta");
    }
} else if (!Strings.isNullOrEmpty(scheme)) {
    throw new AnalysisException("Property 'file_path' must not contain a scheme for parquet_meta");
}
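The scheme check above can be exercised in isolation. The sketch below mirrors its logic using only the JDK; `SchemeCheckSketch`, `schemeOf`, and `validate` are hypothetical names, and `IllegalArgumentException` stands in for Doris's `AnalysisException`.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SchemeCheckSketch {
    // Returns the URI scheme, or null when the path has none or fails to parse.
    static String schemeOf(String path) {
        try {
            return new URI(path).getScheme();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Mirrors the rule in the snippet: 'uri' needs a scheme, 'file_path' must not have one.
    static void validate(String path, boolean uriProvided) {
        String scheme = schemeOf(path);
        boolean hasScheme = scheme != null && !scheme.isEmpty();
        if (uriProvided && !hasScheme) {
            throw new IllegalArgumentException(
                    "Property 'uri' must contain a scheme for parquet_meta");
        }
        if (!uriProvided && hasScheme) {
            throw new IllegalArgumentException(
                    "Property 'file_path' must not contain a scheme for parquet_meta");
        }
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("s3://bucket/dir/meta.parquet"));
        System.out.println(schemeOf("/tmp/meta.parquet"));
        // Braces are not legal in java.net.URI, so this parse fails and yields null.
        System.out.println(schemeOf("s3://bucket/{a,b}.parquet"));
    }
}
```

One consequence of catching `URISyntaxException` and falling back to `null`: a path containing characters that `java.net.URI` rejects, such as the braces in a `{a,b}` glob, is treated as having no scheme even when it starts with `s3://`.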
test {
    sql """
        select * from parquet_meta(
            "file_path" = "${outFilePath}/__parquet_meta_tvf_no_match_*.parquet",
            "mode" = "parquet_metadata"
        );
    """
    exception "failed to glob"
}
@@ -332,13 +343,13 @@ class SubcolumnsTree {
     }

 private:
-    const Node* find_impl(const PathInData& path, bool find_exact) const {
+    Node* find_impl(const PathInData& path, bool find_exact) const {
         if (!root) {
             return nullptr;
         }

         const auto& parts = path.get_parts();
-        const Node* current_node = root.get();
+        Node* current_node = root.get();
ColumnVariant::Subcolumn* ColumnVariant::get_subcolumn(const PathInData& key, size_t key_index) {
    // Since the cache stores const types, non-const versions cannot be used. const_cast must be employed to
    // eliminate const semantics. As all nodes are created via std::make_shared<Node>, modifying them will
    // not result in uninitialized behavior

     // ensure root node is a certain type
-    void ensure_root_node_type(const DataTypePtr& type);
+    void ensure_root_node_type(const DataTypePtr& type) const;
List<String> expanded =
        expandSingleGlob(inputPath, storageProperties, storageParams, fileType);
if (expanded.isEmpty()) {
    throw new AnalysisException("No files matched parquet_meta path patterns: " + inputPath);
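The "expand, then fail on no match" pattern in the snippet above can be illustrated with the JDK's glob matcher. This is only a sketch over an in-memory candidate list: the real `expandSingleGlob` lists files from the configured storage and its implementation is not shown here. `GlobExpandSketch` and `expand` are hypothetical names, and `IllegalStateException` stands in for `AnalysisException`.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class GlobExpandSketch {
    // Filters candidate paths by a glob pattern and rejects an empty result,
    // so a pattern that matches nothing is reported instead of silently ignored.
    static List<String> expand(String pattern, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> matched = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                matched.add(candidate);
            }
        }
        if (matched.isEmpty()) {
            throw new IllegalStateException(
                    "No files matched parquet_meta path patterns: " + pattern);
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> files = List.of("data/meta.parquet", "data/notes.txt");
        System.out.println(expand("data/*.parquet", files));
    }
}
```

Failing at expansion time keeps the error on the FE side and includes the offending pattern in the message, which is what the regression suite's no-match case relies on an error for.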
Cloud UT Coverage Report: Increment line coverage, Increment coverage report
BE UT Coverage Report: Increment line coverage, Increment coverage report
BE Regression && UT Coverage Report: Increment line coverage, Increment coverage report
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
Expose Parquet file metadata via a table-valued function for inspection and debugging.
- Add a Parquet metadata TVF so users can query Parquet file metadata via SQL.
- Backend adds a Parquet metadata reader and scan path; frontend wires the TVF definition.
- Enables easy inspection of partitions/row groups/column stats to aid troubleshooting.

…kets (apache#60938)

## Summary
- adjust `test_parquet_meta_tvf` S3-mode checks to compare only stable columns
- avoid asserting `file_name` / full S3 URI fields that vary by pipeline bucket
- update the corresponding `.out` baseline for the changed query projections

## Why
Different CI pipelines may use different bucket names, which causes false failures when full URI/file name columns are compared.

## Test
- attempted: `./run-regression-test.sh --run -f external_table_p0/tvf/test_parquet_meta_tvf -forceGenOut`
- in this environment it failed with S3 `FORBIDDEN` while reading regression parquet files

…pache#56603)

go through whole be/ and find all const_cast

Issue Number: apache#55057

Problem Summary:
1. remove useless const_cast
2. explain why using const_cast does not result in undefined behavior
3. don't modify some const_cast
   (1) some code in DBUG_EXECUTE_IF or test file
   (2) underlying data structures, such as cow
   (3) const_cast<const T*>
d59219f to 20e8e8a
run buildall
Replacement for #61446 because the original PR head branch could not be updated from this environment.
This branch contains:
Please review this PR instead of #61446.