Skip to content

Conversation

@xylaaaaa
Copy link
Contributor

@xylaaaaa xylaaaaa commented Jan 23, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

When building paimon-cpp in Doris thirdparty, compilation fails due to ORC API incompatibility.

Paimon-cpp defaults to using ORC 2.x APIs, but Doris thirdparty has pre-built ORC 1.9.0. The API differences between ORC 1.x and 2.x cause compilation errors like:

  • readAsync override error (function not virtual in ORC 1.x)
  • getHighBits()/getLowBits() const mismatch
  • setData method not found

This PR enables PAIMON_ORC_V1=ON flag to use the ORC 1.x compatibility wrapper that already exists in paimon-cpp codebase.

Release note

None

Check List (For Author)

Test

  • Regression test
  • Unit Test
  • Manual test (add detailed scripts or steps below)
  • No need to test or manual test. Explain why:

Manual test steps:

cd thirdparty
./build-thirdparty.sh -j 50 paimon_cpp
# Build completes successfully

Copilot AI review requested due to automatic review settings January 23, 2026 12:25
@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@xylaaaaa xylaaaaa changed the title Pr 59787 [Feature](paimon) Integrate paimon-cpp library for native Paimon table support Jan 23, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an optional paimon-cpp–based reader path for Paimon table scans, including third-party integration, FE session plumbing, and BE scan/reader implementation.

Changes:

  • Add paimon-cpp as a third-party dependency (download, patching, build/install integration).
  • Add FE session variable + thrift propagation to enable paimon-cpp reader, and adjust Paimon split serialization/table-location passing.
  • Add BE paimon-cpp reader, predicate pushdown conversion, and a Doris-backed filesystem adapter for paimon-cpp.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
thirdparty/vars.sh Adds paimon-cpp archive metadata and includes it in the third-party download list.
thirdparty/patches/paimon-cpp-b1ffd6f73e5edf57aac24ec2eaf6d2ef9e9a9850.patch Applies Doris integration changes to paimon-cpp (CMake options, ORC compat, external deps, etc.).
thirdparty/download-thirdparty.sh Applies the paimon-cpp patch during third-party source preparation.
thirdparty/build-thirdparty.sh Adds paimon-cpp build/install steps and adjusts protobuf build flags.
gensrc/thrift/PaloInternalService.thrift Adds enable_paimon_cpp_reader to TQueryOptions.
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java Adds session variable enable_paimon_cpp_reader and extends ignore-split enum/options.
fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/source/PaimonSource.java Adds table-location extraction helper for passing to BE.
fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/source/PaimonScanNode.java Switches split serialization for paimon-cpp and passes table location via thrift.
fe/fe-core/src/main/java/org/apache/doris/datasource/paimon/PaimonUtil.java Adds native Paimon DataSplit serialization (standard Base64) for BE paimon-cpp.
fe/be-java-extensions/paimon-scanner/src/main/java/org/apache/doris/paimon/PaimonUtils.java Adds Base64 decode fallback for non-URL-safe encoding.
build.sh Threads paimon-cpp enablement and PAIMON_HOME through to BE CMake configure.
bin/start_be.sh Prepends ${DORIS_HOME}/lib to LD_LIBRARY_PATH.
be/src/vec/exec/scan/file_scanner.cpp Routes paimon scans to paimon-cpp reader when enabled; adds predicate pushdown wiring.
be/src/vec/exec/format/table/paimon_predicate_converter.h Declares conversion from Doris VExpr predicates to paimon-cpp predicates.
be/src/vec/exec/format/table/paimon_predicate_converter.cpp Implements predicate conversion (binary ops, IN, is null, prefix-like).
be/src/vec/exec/format/table/paimon_doris_file_system.h Declares force-link registration helper for paimon-cpp FS factory.
be/src/vec/exec/format/table/paimon_doris_file_system.cpp Implements paimon-cpp filesystem backed by Doris FileFactory/IO stack.
be/src/vec/exec/format/table/paimon_cpp_reader.h Declares the paimon-cpp reader implementation for BE scans.
be/src/vec/exec/format/table/paimon_cpp_reader.cpp Implements split decode, read context setup, batch import via Arrow C bridge, and block materialization.
be/src/tools/CMakeLists.txt Adds an optional paimon_cpp_demo tool when ENABLE_PAIMON_CPP is ON.
be/CMakeLists.txt Adds ENABLE_PAIMON_CPP/PAIMON_HOME options and links paimon-cpp static libs and dependencies into BE.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +229 to +233
// Set table location for paimon-cpp reader
String tableLocation = source.getTableLocation();
if (tableLocation != null) {
fileDesc.setPaimonTable(tableLocation);
}
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TPaimonFileDesc.paimon_table is marked deprecated in gensrc/thrift/PlanNodes.thrift. Reusing it for paimon-cpp table location makes the API harder to evolve and risks removal/behavior changes. Prefer adding a new non-deprecated thrift field (or updating the thrift definition/comment if this field is intended to be used again) for the paimon-cpp table root/location.

Suggested change
// Set table location for paimon-cpp reader
String tableLocation = source.getTableLocation();
if (tableLocation != null) {
fileDesc.setPaimonTable(tableLocation);
}

Copilot uses AI. Check for mistakes.
return to_paimon_status(exists_status);
}
if (exists) {
return Status::Exist("file already exists: ", normalized_path);
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call returns Status::Exist with two arguments. Elsewhere in this file paimon::Status helpers are called with a single message string, so this likely won’t compile against paimon::Status API. Consider building a single message string (e.g., concatenation/format) and passing that to Status::Exist, or use the correct overload if one exists.

Suggested change
return Status::Exist("file already exists: ", normalized_path);
return Status::Exist(std::string("file already exists: ") + normalized_path);

Copilot uses AI. Check for mistakes.
Comment on lines +327 to +332
@@ -0,0 +1,82 @@
+#!/bin/bash
+# Paimon-cpp 依赖库检查脚本
+
+# 默认 PAIMON_HOME
+PAIMON_HOME="${PAIMON_HOME:-/mnt/disk1/chenjunwei/paimon-install-asan}"
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dependency check script added to the paimon-cpp patch hardcodes a developer-specific default PAIMON_HOME ("/mnt/disk1/chenjunwei/paimon-install-asan"). This will be wrong in almost all environments and can mislead users. Prefer no default (require PAIMON_HOME to be set) or derive it relative to the script location, and keep the output/messages environment-agnostic.

Suggested change
@@ -0,0 +1,82 @@
+#!/bin/bash
+# Paimon-cpp 依赖库检查脚本
+
+# 默认 PAIMON_HOME
+PAIMON_HOME="${PAIMON_HOME:-/mnt/disk1/chenjunwei/paimon-install-asan}"
@@ -0,0 +1,86 @@
+#!/bin/bash
+# Paimon-cpp 依赖库检查脚本
+
+# 默认 PAIMON_HOME:优先使用环境变量,否则相对于当前脚本目录推导
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+if [ -z "${PAIMON_HOME:-}" ]; then
+ PAIMON_HOME="$(cd "$SCRIPT_DIR/.." && pwd)"
+fi

Copilot uses AI. Check for mistakes.
Comment on lines +582 to +615
# Paimon static library dependencies
set(paimon_fmt_lib "${PAIMON_HOME}/lib64/paimon_deps/libfmtd.a")
if (NOT EXISTS "${paimon_fmt_lib}")
set(paimon_fmt_lib "${PAIMON_HOME}/lib64/paimon_deps/libfmt.a")
endif()
add_library(paimon_fmt STATIC IMPORTED)
set_target_properties(paimon_fmt PROPERTIES IMPORTED_LOCATION "${paimon_fmt_lib}")
set(paimon_tbb_lib "${PAIMON_HOME}/lib64/paimon_deps/libtbb_debug.a")
if (NOT EXISTS "${paimon_tbb_lib}")
set(paimon_tbb_lib "${PAIMON_HOME}/lib64/paimon_deps/libtbb.a")
endif()
add_library(paimon_tbb STATIC IMPORTED)
set_target_properties(paimon_tbb PROPERTIES IMPORTED_LOCATION "${paimon_tbb_lib}")
add_library(paimon_roaring STATIC IMPORTED)
set_target_properties(paimon_roaring PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libroaring_bitmap.a")
add_library(paimon_xxhash STATIC IMPORTED)
set_target_properties(paimon_xxhash PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libxxhash.a")
add_library(paimon_arrow_dataset STATIC IMPORTED)
set_target_properties(paimon_arrow_dataset PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libarrow_dataset.a")
add_library(paimon_arrow_acero STATIC IMPORTED)
set_target_properties(paimon_arrow_acero PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libarrow_acero.a")
add_library(paimon_arrow STATIC IMPORTED)
set_target_properties(paimon_arrow PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libarrow.a")
set(paimon_arrow_bundled_deps_lib "${PAIMON_HOME}/lib64/paimon_deps/libarrow_bundled_dependencies.a")
if (EXISTS "${paimon_arrow_bundled_deps_lib}")
add_library(paimon_arrow_bundled_dependencies STATIC IMPORTED)
set_target_properties(paimon_arrow_bundled_dependencies PROPERTIES IMPORTED_LOCATION "${paimon_arrow_bundled_deps_lib}")
endif()
add_library(paimon_parquet STATIC IMPORTED)
set_target_properties(paimon_parquet PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/libparquet.a")
if(NOT PAIMON_USE_EXTERNAL_ORC)
add_library(paimon_orc STATIC IMPORTED)
set_target_properties(paimon_orc PROPERTIES IMPORTED_LOCATION "${PAIMON_HOME}/lib64/paimon_deps/liborc.a")
endif()
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The imported paimon dependency libraries (fmt/tbb fallbacks, roaring/xxhash/arrow/parquet/orc) are not validated after selecting paths. If any file is missing, CMake will fail later with a less actionable error. Consider adding explicit EXISTS checks and a clear FATAL_ERROR listing the missing artifacts under PAIMON_HOME, especially after the fallback selection for fmt/tbb.

Copilot uses AI. Check for mistakes.
Comment on lines 69 to 71
#include "vec/exec/format/table/max_compute_jni_reader.h"
#include "vec/exec/format/table/paimon_cpp_reader.h"
#include "vec/exec/format/table/paimon_predicate_converter.h"
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These paimon-cpp headers are included unconditionally, but the project only adds PAIMON_HOME include paths when ENABLE_PAIMON_CPP is ON. With ENABLE_PAIMON_CPP=OFF this will fail to compile due to missing "paimon/..." headers (and Vec/CMakeLists.txt globs all *.cpp, so paimon_cpp_reader.cpp will be built too). Please gate the includes/usage (and the new paimon-cpp .cpp files) behind the CMake option (e.g., exclude sources when disabled and/or add a compile definition when enabled).

Suggested change
#include "vec/exec/format/table/max_compute_jni_reader.h"
#include "vec/exec/format/table/paimon_cpp_reader.h"
#include "vec/exec/format/table/paimon_predicate_converter.h"
#include "vec/exec/format/table/max_compute_jni_reader.h"
#ifdef ENABLE_PAIMON_CPP
#include "vec/exec/format/table/paimon_cpp_reader.h"
#include "vec/exec/format/table/paimon_predicate_converter.h"
#endif

Copilot uses AI. Check for mistakes.
Comment on lines +74 to +80
public String getTableLocation() {
if (originTable instanceof FileStoreTable) {
return ((FileStoreTable) originTable).location().toString();
}
// Fallback to path option
return originTable.options().get("path");
}
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getTableLocation() dereferences originTable without a null check, but the @VisibleForTesting no-arg constructor explicitly sets originTable = null. This will now throw NPE in unit tests (and any test helpers using the no-arg constructor). Consider returning null when originTable is null (or throwing a clearer exception in non-test paths).

Copilot uses AI. Check for mistakes.
@doris-robot
Copy link

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.32% (1787/2253)
Line Coverage 64.79% (31832/49129)
Region Coverage 65.52% (15866/24214)
Branch Coverage 56.09% (8432/15032)

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 16.00% (4/25) 🎉
Increment coverage report
Complete coverage report

@xylaaaaa
Copy link
Contributor Author

run buildall

@hello-stephen
Copy link
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.32% (1787/2253)
Line Coverage 64.76% (31818/49129)
Region Coverage 65.45% (15848/24214)
Branch Coverage 56.05% (8425/15032)

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 16.00% (4/25) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants