
branch-4.1: [feat](tvf) add Parquet metadata TVF (#58972)(#60938)(#56603) #61474

Open
xylaaaaa wants to merge 4 commits into apache:branch-4.1 from xylaaaaa:fix/parquet-meta-tvf-branch-4.1-20260318


@xylaaaaa
Contributor

Replacement for #61446 because the original PR head branch could not be updated from this environment.

This branch contains:

Please review this PR instead of #61446.

@xylaaaaa xylaaaaa requested a review from yiguolei as a code owner March 18, 2026 08:39
Copilot AI review requested due to automatic review settings March 18, 2026 08:39
@Thearas
Contributor

Thearas commented Mar 18, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@xylaaaaa
Contributor Author

run buildall


Copilot AI left a comment

Pull request overview

Adds a new Parquet metadata table-valued function (TVF) that lets users inspect Parquet footer/schema/row-group stats (plus KV metadata and bloom-filter probing) via SQL, wiring it through FE planning + Thrift + BE metadata scanning, and introduces regression/UT coverage.

Changes:

  • Introduce parquet_meta TVF (and companion names parquet_file_metadata, parquet_kv_metadata, parquet_bloom_probe) with FE parameter validation + scan-range construction.
  • Add BE-side ParquetMetadataReader and Parquet utility helpers/tests to read and expose Parquet footer metadata.
  • Add regression suite + baselines + Parquet test artifacts; plus related const_cast cleanup / const-access adjustments in vec columns.

Reviewed changes

Copilot reviewed 32 out of 36 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| regression-test/suites/external_table_p0/tvf/test_parquet_meta_tvf.groovy | Adds end-to-end regression coverage for Parquet metadata TVFs (S3/HDFS/local + error cases). |
| regression-test/data/external_table_p0/tvf/test_parquet_meta_tvf.out | Baseline output for the new regression suite queries. |
| regression-test/data/external_table_p0/tvf/meta.parquet | Regression Parquet file used for metadata assertions. |
| regression-test/data/external_table_p0/tvf/kvmeta.parquet | Regression Parquet file containing KV metadata. |
| regression-test/data/external_table_p0/tvf/empty.parquet | Regression Parquet file representing an empty file case. |
| regression-test/data/external_table_p0/tvf/bloommeta.parquet | Regression Parquet file containing bloom-filter metadata for probing. |
| gensrc/thrift/Types.thrift | Adds TMetadataType.PARQUET for Parquet metadata scans. |
| gensrc/thrift/PlanNodes.thrift | Adds TParquetMetadataParams and wires it into TMetaScanRange. |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/TableValuedFunctionIf.java | Registers new TVF names and maps convenience names to mode via param copy. |
| fe/fe-core/src/main/java/org/apache/doris/tablefunction/ParquetMetadataTableValuedFunction.java | Implements FE TVF analysis/validation, storage property handling, and glob expansion. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/TableValuedFunctionVisitor.java | Adds Nereids visitor hook for ParquetMeta. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetMeta.java | Nereids TVF expression for parquet_meta. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetKvMetadata.java | Nereids TVF expression for parquet_kv_metadata. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetFileMetadata.java | Nereids TVF expression for parquet_file_metadata. |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/table/ParquetBloomProbe.java | Nereids TVF expression for parquet_bloom_probe. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinTableValuedFunctions.java | Registers new Parquet TVFs as built-in table-valued functions. |
| be/test/vec/exec/format/parquet/parquet_utils_test.cpp | Adds UT coverage for new Parquet utility helpers. |
| be/src/vec/exec/scan/meta_scanner.cpp | Routes TMetadataType::PARQUET to the new BE Parquet metadata reader. |
| be/src/vec/exec/format/table/parquet_utils.h | Declares Parquet metadata column layouts + helper APIs. |
| be/src/vec/exec/format/table/parquet_utils.cpp | Implements Parquet utility helpers (type formatting, stats decoding, inserts, etc.). |
| be/src/vec/exec/format/table/parquet_metadata_reader.h | Declares BE metadata reader for Parquet footer/schema/stats/bloom probe modes. |
| be/src/vec/exec/format/table/parquet_metadata_reader.cpp | Implements Parquet footer reading and row emission for each mode. |
| be/src/vec/columns/subcolumn_tree.h | Adds non-const lookup helpers and adjusts mutable access (used by Variant changes). |
| be/src/vec/columns/column_vector.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_variant.h | Adjusts Variant root-type enforcement API signature. |
| be/src/vec/columns/column_variant.cpp | Removes some const_cast usage; tweaks Variant subcolumn access and root-type enforcement. |
| be/src/vec/columns/column_struct.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_string.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_nullable.cpp | Adds explanatory comments for const_cast in serialization; removes one const_cast in selector filtering. |
| be/src/vec/columns/column_map.cpp | Removes unnecessary const_cast in recursive map dedup; adds serialization comment. |
| be/src/vec/columns/column_decimal.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_const.h | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column_array.cpp | Adds explanatory comment for an existing const_cast in serialization path. |
| be/src/vec/columns/column.h | Adds non-const check_and_get_column overload; removes const_cast in check_and_get_column_ptr. |
| be/src/vec/columns/column.cpp | Adds explanatory comment for using const_cast to call non-const subcolumn traversal. |
| be/src/olap/rowset/segment_v2/column_reader.cpp | Removes const_cast by using non-const check_and_get_column overload for nullable defaults insertion. |

Comments suppressed due to low confidence (1)

be/src/vec/columns/column_variant.cpp:2113

  • ensure_root_node_type(...) is declared const, but it mutates the root subcolumn (casts and rewrites root.data[...], root.data_types[...], etc.). Making this const forces other APIs (SubcolumnsTree::get_mutable_root) to become mutable-through-const too. Please make ensure_root_node_type non-const (and keep mutable accessors non-const), or otherwise encapsulate the mutation behind clearly mutable state with strong justification.
void ColumnVariant::ensure_root_node_type(const DataTypePtr& expected_root_type) const {
    auto& root = subcolumns.get_mutable_root()->data;
    if (!root.get_least_common_type()->equals(*expected_root_type)) {
        // make sure the root type is alawys as expected
        ColumnPtr casted_column;


Comment on lines +41 to +46
    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_file_metadata");
            return new ParquetMetadataTableValuedFunction(arguments);
        } catch (Throwable t) {

Comment on lines +41 to +46
    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_kv_metadata");
            return new ParquetMetadataTableValuedFunction(arguments);
        } catch (Throwable t) {

Comment on lines +41 to +46
    protected TableValuedFunctionIf toCatalogFunction() {
        try {
            Map<String, String> arguments = getTVFProperties().getMap();
            arguments.put("mode", "parquet_bloom_probe");
            return new ParquetMetadataTableValuedFunction(arguments);
        } catch (Throwable t) {
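
The three convenience TVFs quoted above differ only in the `mode` value they inject before delegating to `ParquetMetadataTableValuedFunction`. The following is a minimal sketch of doing that injection on a copy of the properties. It assumes (the source does not confirm this) that the map returned by `getTVFProperties().getMap()` may be shared or unmodifiable. `ModeArgs` and `withMode` are hypothetical names for illustration, not Doris APIs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Doris. */
final class ModeArgs {
    private ModeArgs() {}

    /** Returns a new mutable map holding the caller's properties plus the given mode. */
    static Map<String, String> withMode(Map<String, String> properties, String mode) {
        // Copy first so the caller's map is never modified.
        Map<String, String> copy = new HashMap<>(properties);
        // The injected mode replaces any "mode" entry the user supplied.
        copy.put("mode", mode);
        return copy;
    }

    public static void main(String[] args) {
        // Map.of returns an immutable map, so calling put on it directly would throw.
        Map<String, String> props = Map.of("file_path", "/tmp/meta.parquet");
        System.out.println(withMode(props, "parquet_kv_metadata").get("mode"));
    }
}
```

With a helper of this shape, each `toCatalogFunction()` would reduce to a single call that differs only in the mode string.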

Comment on lines +200 to +212
        String scheme = null;
        try {
            scheme = new URI(parsedPath).getScheme();
        } catch (URISyntaxException e) {
            scheme = null;
        }
        if (uriProvided) {
            if (Strings.isNullOrEmpty(scheme)) {
                throw new AnalysisException("Property 'uri' must contain a scheme for parquet_meta");
            }
        } else if (!Strings.isNullOrEmpty(scheme)) {
            throw new AnalysisException("Property 'file_path' must not contain a scheme for parquet_meta");
        }
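
The scheme check quoted above can be exercised in isolation. This is a rough stand-alone sketch that mirrors its logic using only `java.net.URI`. `SchemeCheck`, `schemeOf`, and `validate` are hypothetical names, and `IllegalArgumentException` stands in for Doris's `AnalysisException`.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical stand-alone version of the scheme check; not the Doris implementation. */
final class SchemeCheck {
    private SchemeCheck() {}

    /** Returns the URI scheme, or null when the path has none or is not a valid URI. */
    static String schemeOf(String path) {
        try {
            return new URI(path).getScheme();
        } catch (URISyntaxException e) {
            return null; // same fallback as the quoted snippet
        }
    }

    /** Throws if the path does not fit the property it was supplied through. */
    static void validate(String path, boolean uriProvided) {
        String scheme = schemeOf(path);
        boolean hasScheme = scheme != null && !scheme.isEmpty();
        if (uriProvided && !hasScheme) {
            throw new IllegalArgumentException("Property 'uri' must contain a scheme");
        }
        if (!uriProvided && hasScheme) {
            throw new IllegalArgumentException("Property 'file_path' must not contain a scheme");
        }
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("s3://bucket/dir/meta.parquet")); // scheme present
        System.out.println(schemeOf("/tmp/meta.parquet"));            // no scheme
        System.out.println(schemeOf("s3://bucket/{a,b}.parquet"));    // braces are not legal URI characters
    }
}
```

One property of this approach is that `java.net.URI` rejects characters such as `{` and `}`, so a path containing them takes the `catch` branch and is treated as having no scheme. If brace patterns are meant to be accepted through `uri`, that case may deserve its own check or error message.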

Comment on lines +516 to +524
    test {
        sql """
            select * from parquet_meta(
                "file_path" = "${outFilePath}/__parquet_meta_tvf_no_match_*.parquet",
                "mode" = "parquet_metadata"
            );
        """
        exception "failed to glob"
    }

Comment on lines 300 to +352
@@ -332,13 +343,13 @@ class SubcolumnsTree {
     }

 private:
-    const Node* find_impl(const PathInData& path, bool find_exact) const {
+    Node* find_impl(const PathInData& path, bool find_exact) const {
         if (!root) {
             return nullptr;
         }

         const auto& parts = path.get_parts();
-        const Node* current_node = root.get();
+        Node* current_node = root.get();

ColumnVariant::Subcolumn* ColumnVariant::get_subcolumn(const PathInData& key, size_t key_index) {
    // Since the cache stores const types, non-const versions cannot be used. const_cast must be employed to
    // eliminate const semantics. As all nodes are created via std::make_shared<Node>, modifying them will
    // not result in uninitialized behavior

     // ensure root node is a certain type
-    void ensure_root_node_type(const DataTypePtr& type);
+    void ensure_root_node_type(const DataTypePtr& type) const;

        List<String> expanded =
                expandSingleGlob(inputPath, storageProperties, storageParams, fileType);
        if (expanded.isEmpty()) {
            throw new AnalysisException("No files matched parquet_meta path patterns: " + inputPath);
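
The snippet above rejects a glob that expands to nothing. As a rough illustration of that fail-fast behavior, here is a sketch for the local-filesystem case only, built on `java.nio.file.Files.newDirectoryStream`. `LocalGlob`, `expand`, and `expandOrFail` are hypothetical names. Doris's `expandSingleGlob` also takes storage properties and a file type, which this sketch does not model.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical local-only glob helper; not the Doris implementation. */
final class LocalGlob {
    private LocalGlob() {}

    /** Lists entries directly under dir whose file name matches the glob, sorted by path. */
    static List<String> expand(Path dir, String glob) {
        List<String> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                matches.add(p.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(matches); // directory iteration order is unspecified
        return matches;
    }

    /** Same as expand, but treats an empty result as an error. */
    static List<String> expandOrFail(Path dir, String glob) {
        List<String> matches = expand(dir, glob);
        if (matches.isEmpty()) {
            throw new IllegalArgumentException("No files matched pattern: " + dir + "/" + glob);
        }
        return matches;
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println(expand(tmp, "__parquet_meta_tvf_no_match_*.parquet").isEmpty());
    }
}
```

The quoted regression test expects a message containing `failed to glob`, while the quoted FE snippet raises `No files matched parquet_meta path patterns`. The source does not show which layer produces each message, so the wording in this sketch is illustrative only.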
@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.15% (1788/2259)
Line Coverage 64.40% (31931/49583)
Region Coverage 65.24% (15982/24499)
Branch Coverage 55.78% (8499/15236)

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 18.09% (195/1078) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.81% (19386/36707)
Line Coverage 36.17% (181251/501118)
Region Coverage 32.66% (140037/428788)
Branch Coverage 33.67% (61059/181345)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 18.11% (195/1077) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.23% (25593/35930)
Line Coverage 53.99% (270033/500188)
Region Coverage 51.52% (223095/433034)
Branch Coverage 52.94% (96344/181974)

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 18, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

xylaaaaa and others added 4 commits March 19, 2026 09:11
Expose Parquet file metadata via a table-valued function for inspection
and debugging.

- Add a Parquet metadata TVF so users can query Parquet file metadata
via SQL.
- Backend adds a Parquet metadata reader and scan path; frontend wires
the TVF definition.
- Enables easy inspection of partitions/row groups/column stats to aid
troubleshooting.
…kets (apache#60938)

## Summary
- adjust `test_parquet_meta_tvf` S3-mode checks to compare only stable
columns
- avoid asserting `file_name` / full S3 URI fields that vary by pipeline
bucket
- update the corresponding `.out` baseline for the changed query
projections

## Why
Different CI pipelines may use different bucket names, which causes
false failures when full URI/file name columns are compared.

## Test
- attempted: `./run-regression-test.sh --run -f
external_table_p0/tvf/test_parquet_meta_tvf -forceGenOut`
- in this environment it failed with S3 `FORBIDDEN` while reading
regression parquet files
…pache#56603)

Go through all of be/ and review every const_cast.

Issue Number: apache#55057

Problem Summary:
1. Remove unnecessary const_cast uses.
2. Add comments explaining why the remaining const_cast uses do not result in undefined behavior.
3. Leave some const_cast uses unchanged:
    (1) code inside DBUG_EXECUTE_IF or test files
    (2) underlying data structures, such as COW
    (3) const_cast<const T*>
@xylaaaaa xylaaaaa force-pushed the fix/parquet-meta-tvf-branch-4.1-20260318 branch from d59219f to 20e8e8a Compare March 19, 2026 01:11
@xylaaaaa
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.06% (1786/2259)
Line Coverage 64.42% (31939/49583)
Region Coverage 65.26% (15989/24499)
Branch Coverage 55.82% (8505/15236)

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 18.09% (195/1078) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.77% (19384/36731)
Line Coverage 36.15% (181233/501400)
Region Coverage 32.60% (139884/429117)
Branch Coverage 33.63% (61033/181471)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 18.11% (195/1077) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.19% (25595/35954)
Line Coverage 53.96% (270049/500470)
Region Coverage 51.38% (222666/433363)
Branch Coverage 52.81% (96164/182100)


Labels

approved Indicates a PR has been approved by one committer. reviewed


8 participants