Skip to content

[fix](be) Restore mutable ownership in COW paths#63001

Draft
zclllyybb wants to merge 1 commit intoapache:masterfrom
zclllyybb:cow
Draft

[fix](be) Restore mutable ownership in COW paths#63001
zclllyybb wants to merge 1 commit intoapache:masterfrom
zclllyybb:cow

Conversation

@zclllyybb
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary: assume_mutable now asserts exclusive ownership, so paths that moved columns out of blocks or borrowed shared subcolumns must return ownership explicitly or mutate through owning COW handles. This restores COW-safe mutation in block reuse, variant extraction, parquet/orc conversion, memtable aggregation, compaction readers, result buffering, and affected BE unit tests.

### What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary: assume_mutable now asserts exclusive ownership, so paths that moved columns out of blocks or borrowed shared subcolumns must return ownership explicitly or mutate through owning COW handles. This restores COW-safe mutation in block reuse, variant extraction, parquet/orc conversion, memtable aggregation, compaction readers, result buffering, and affected BE unit tests.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-be-ut.sh --run -j 100
    - ./run-be-ut.sh --run --filter=LocalExchangerTest.*
    - ./run-be-ut.sh --run --filter=ArrowResultBlockBufferTest.*:BitUtil.CountZero:Parameters/TestRowIdConversion.*
    - ./run-be-ut.sh --run --filter=VariantColumnWriterReaderTest.*:HierarchicalDataIteratorTest.*
    - ./run-be-ut.sh --run --filter=VariantColumnWriterReaderTest.test_nested_iter_nullable:VariantColumnWriterReaderTest.test_streaming_compaction_writer_streams_regular_array_paths_across_batches:TabletCooldownTest.normal
    - PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.bun/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.codex/tmp/arg0/codex-arg0w0dN9c:/mnt/disk6/common/node-v24.14.1-linux-x64/lib/node_modules/@openai/codex/node_modules/@openai/codex-linux-x64/vendor/x86_64-unknown-linux-musl/path:/mnt/disk3/zhaochangle/.bun/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk3/zhaochangle/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/usr/share/Modules/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin ./build-support/clang-format.sh
    - git diff --cached --check
    - PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.bun/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.codex/tmp/arg0/codex-arg0w0dN9c:/mnt/disk6/common/node-v24.14.1-linux-x64/lib/node_modules/@openai/codex/node_modules/@openai/codex-linux-x64/vendor/x86_64-unknown-linux-musl/path:/mnt/disk3/zhaochangle/.bun/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk3/zhaochangle/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/usr/share/Modules/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin ./build-support/check-format.sh (attempted; blocked by pre-existing formatting drift in be/src/exec/operator/distinct_streaming_aggregation_operator.cpp)
    - PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.bun/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.codex/tmp/arg0/codex-arg0w0dN9c:/mnt/disk6/common/node-v24.14.1-linux-x64/lib/node_modules/@openai/codex/node_modules/@openai/codex-linux-x64/vendor/x86_64-unknown-linux-musl/path:/mnt/disk3/zhaochangle/.bun/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/mnt/disk3/zhaochangle/.opencode/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk3/zhaochangle/.local/bin:/mnt/disk3/zhaochangle/bin:/mnt/disk6/common/apache-maven-3.9.14/bin:/mnt/disk6/common/ldb_toolchain_028/bin:/mnt/disk6/common/jdk-17.0.16/bin:/mnt/disk6/common/node-v24.14.1-linux-x64/bin:/usr/share/Modules/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin ./build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (attempted; blocked by existing clang-tidy analyzer errors from be/src/util/jni-util.h static_assert(false) and pre-existing diagnostics)
- Behavior changed: No
- Does this need documentation: No
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zclllyybb
Copy link
Copy Markdown
Contributor Author

run buildall

@zclllyybb
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a blocking COW correctness issue in complex-type deserialization. The PR changes these paths from mutating the existing child through assume_mutable() to mutating a possibly cloned child, but the cloned child is not written back to the parent complex column. That means deserialization can update offsets while dropping child data when nested children are shared, which is exactly the ownership case this PR is trying to make safe.

Critical checkpoint conclusions:

  • Goal/test: The goal is to restore COW-safe mutation after assume_mutable() enforces exclusive ownership. The broad BE unit-test list helps, but the complex-type deserialize path still has an uncovered shared-child case.
  • Scope/focus: The change is focused on COW ownership, but several deserialize updates need the same write-back pattern used elsewhere after mutating detached children.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, configuration, or storage-format compatibility concern identified in the reviewed changes.
  • Parallel paths: Array, Map, and Struct deserialize have the same detached-child pattern and all need to be fixed consistently.
  • Tests: Please add or adjust coverage for deserializing complex columns whose child ColumnPtr is shared so this does not regress.
  • Observability/persistence/write correctness/performance: No additional observability, persistence, transaction, or performance blocker found beyond the data-correctness issue above.

User focus: No additional user-provided review focus was specified.

// children
auto nested_column = data_column->get_data_ptr()->assume_mutable();
auto nested_column = std::move(*data_column->get_data_ptr()).mutate();
buf = get_nested_type()->deserialize(buf, &nested_column, be_exec_version);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now mutates a local nested_column, but if get_data_ptr() is shared, mutate() clones the child and deserialization fills only that clone. Since the clone is never assigned back to the ColumnArray, the array keeps the old child while offsets have already been resized/copied, producing an inconsistent deserialized array. Please store nested_column back into data_column->get_data_ptr() after get_nested_type()->deserialize(...).

auto nested_keys_column = map_column->get_keys_ptr()->assume_mutable();
auto nested_values_column = map_column->get_values_ptr()->assume_mutable();
auto nested_keys_column = std::move(*map_column->get_keys_ptr()).mutate();
auto nested_values_column = std::move(*map_column->get_values_ptr()).mutate();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same COW ownership issue here: mutate() may return cloned key/value children when the map subcolumns are shared, but the deserialized nested_keys_column and nested_values_column are dropped instead of being written back to map_column. That leaves updated offsets with stale keys/values. Please assign both mutated children back to get_keys_ptr() and get_values_ptr() after deserializing them.

for (size_t i = 0; i < elems.size(); ++i) {
auto child_column = struct_column->get_column_ptr(i)->assume_mutable();
auto child_column = std::move(*struct_column->get_column_ptr(i)).mutate();
buf = elems[i]->deserialize(buf, &child_column, be_exec_version);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a struct child is shared, this mutate() clones it; elems[i]->deserialize() then writes into the clone, but the clone is discarded at the end of the loop iteration. The parent ColumnStruct continues to reference the old child. Please write child_column back to struct_column->get_column_ptr(i) after deserialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants