[Bug] Doris BE crash: SIGSEGV address not mapped to object (@0x0) #60172

@vortual

Search before asking

  • I had searched in the issues and found no similar issues.

Version

4.0.2

What's Wrong?

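The following is an AI-assisted diagnostic analysis.

1. Problem Overview

The Doris BE (Backend) process crashed with SIGSEGV at around 22:30 on 2026-01-22, interrupting an in-progress Mongo CDC sync job. The BE process restarted automatically and then returned to normal.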


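2. Crash Timeline

| Time | Event | Source |
|---|---|---|
| 22:30:11.973 | Previous batch of stream loads completed normally (txn 109883, 109893) | BE INFO log |
| 22:30:12.170 | Two stream loads reach the FE in the same millisecond | FE log |
| 22:30:12.173 | FE begins the transaction for txn 109908 | FE log |
| 22:30:12.181 | BE begins executing the stream load | BE INFO log |
| 22:30:12.189 | HTTP header handling completes | BE INFO log |
| 22:30:1x~50 | 💀 BE crashes with SIGSEGV | be.out |
| 22:30:50.105 | BE restart completes | be.out |
| 22:31:10.347 | FE detects the BE restart and rolls back txn 109908 | FE log |
| 22:31:18.608 | Mongo CDC automatically retries the stream load | FE log |
| 22:31:34.672 | Retry transaction 109929 commits successfully | FE log |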

3. Log Evidence

3.1 Crash Stack (be.out)

File: /data/apache-doris-4.0.2-bin-x64/be/log/be.out

*** Query id: 82486cd1d80c3022-81101df6583bb08e ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1769092212 (unix time) try "date -d @1769092212" if you are using GNU date ***
*** Current BE git commitID: 30d2df0459 ***
*** SIGSEGV address not mapped to object (@0x0) received by PID 2408066 (TID 2461123 OR 0x7f65349bc640) from PID 0; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:420
 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /opt/jdk-17.0.17+10/lib/server/libjvm.so
 2# JVM_handle_linux_signal in /opt/jdk-17.0.17+10/lib/server/libjvm.so
 3# 0x00007F795AA3FC30 in /lib64/libc.so.6
 4# 0x0000561F10943A42 in /data/apache-doris-4.0.2-bin-x64/be/lib/doris_be
 5# brpc::Controller::call_id() in /data/apache-doris-4.0.2-bin-x64/be/lib/doris_be
 6# doris::DummyBrpcCallback<doris::PTabletWriterAddBlockResult>::DummyBrpcCallback() at /home/zcp/repo_center/doris_release/doris/be/src/util/brpc_closure.h:39
 7# doris::vectorized::VNodeChannel::init(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:575
 8# doris::vectorized::IndexChannel::init(doris::RuntimeState*, std::vector<doris::TTabletWithPartition, std::allocator<doris::TTabletWithPartition> > const&, bool) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:151
 9# doris::vectorized::VTabletWriter::_init(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:1630
10# doris::vectorized::VTabletWriter::open(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:1389
11# doris::vectorized::AsyncResultWriter::process_block(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/async_result_writer.cpp:119
12# std::_Function_handler<void (), doris::vectorized::AsyncResultWriter::start_writer(doris::RuntimeState*, doris::RuntimeProfile*)::$_0>::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
13# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/util/threadpool.cpp:623
14# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:461
15# start_thread in /lib64/libc.so.6
16# __GI___clone3 in /lib64/libc.so.6

Analysis:

  • Error type: SIGSEGV address not mapped to object (@0x0), i.e. a null pointer dereference
  • Query ID: 82486cd1d80c3022-81101df6583bb08e
  • Crash location: VNodeChannel::init() → DummyBrpcCallback() → brpc::Controller::call_id()
  • Root cause: the bRPC Controller object is NULL
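
The "Aborted at 1769092212 (unix time)" line in the trace can be converted to wall-clock time to narrow the crash window given in the timeline. A minimal Python sketch, assuming the host clock is China Standard Time (UTC+8), which the "CST 2026" start-time line in be.out suggests but does not prove; the names CST, abort_time, crash, and restart are illustrative only:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the BE host runs on China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8), name="CST")


def abort_time(unix_seconds: int) -> datetime:
    """Convert the 'Aborted at <unix time>' value from be.out to local time."""
    return datetime.fromtimestamp(unix_seconds, tz=CST)


crash = abort_time(1769092212)
# Restart time taken from the StdoutLogger line in be.out, truncated to seconds.
restart = datetime(2026, 1, 22, 22, 30, 50, tzinfo=CST)

print(crash.isoformat())
print((restart - crash).total_seconds())
```

Under that timezone assumption the abort falls at 22:30:12, the same second in which the BE began executing the stream load, and 38 seconds before the restart line in be.out.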

3.2 Query That Triggered the Crash (BE INFO log)

File: /data/apache-doris-4.0.2-bin-x64/be/log/be.INFO.log.20260122-193049

I20260122 22:30:12.172951 2409869 stream_load.cpp:218] new income streaming load request.id=82486cd1d80c3022-81101df6583bb08e, job_id=-1, txn_id=-1, label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, elapse(s)=0, db=ods, tbl=table2, group_commit=0, HTTP headers=...

I20260122 22:30:12.173527 2409884 stream_load.cpp:218] new income streaming load request.id=ab4c7fe784d16dc7-f873f51fc204c8b1, job_id=-1, txn_id=-1, label=lb_mongo_doris_ods_table1_0_3866_8e69c489-c858-4d5b-98bd-035b3f8df879, elapse(s)=0, db=ods, tbl=table1, group_commit=0, HTTP headers=...

I20260122 22:30:12.181313 2409869 stream_load_executor.cpp:74] begin to execute stream load. label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, txn_id=109908, query_id=82486cd1d80c3022-81101df6583bb08e

I20260122 22:30:12.189524 2409869 stream_load.cpp:225] finished to handle HTTP header, id=82486cd1d80c3022-81101df6583bb08e, job_id=-1, txn_id=109908, label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, elapse(s)=0

Analysis:

  • Triggering query ID: 82486cd1d80c3022-81101df6583bb08e
  • Target table: table2
  • Transaction ID: 109908
  • Source: Mongo CDC sync job
  • Note: the two stream loads arrived within 0.6 ms of each other (22:30:12.172 and 22:30:12.173)
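
The 0.6 ms figure can be checked directly from the two "new income streaming load request" lines quoted above. A small Python sketch, assuming the glog-style prefix (severity letter, date, then time with microseconds) seen in those lines; glog_time is a hypothetical helper name:

```python
from datetime import datetime, timedelta


def glog_time(stamp: str) -> datetime:
    """Parse a prefix such as 'I20260122 22:30:12.172951' (severity letter skipped)."""
    return datetime.strptime(stamp[1:], "%Y%m%d %H:%M:%S.%f")


first = glog_time("I20260122 22:30:12.172951")   # request for table2
second = glog_time("I20260122 22:30:12.173527")  # request for table1

# Integer division by one microsecond keeps the result exact.
gap_us = (second - first) // timedelta(microseconds=1)
print(f"{gap_us / 1000} ms between the two requests")
```

The gap works out to 576 microseconds, which matches the rounded 0.6 ms stated in the note.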

3.3 Concurrent Stream Loads Arriving in the Same Millisecond (FE log)

File: /data/apache-doris-4.0.2-bin-x64/fe/log/fe.log.20260122-1

2026-01-22 22:30:12,170 INFO (qtp149820420-710|710) [LoadAction.streamLoad():106] streamload action, db: ods, tbl: table1, headers: ...label:lb_mongo_doris_ods_table1_0_3866_8e69c489-c858-4d5b-98bd-035b3f8df879...

2026-01-22 22:30:12,170 INFO (qtp149820420-10209|10209) [LoadAction.streamLoad():106] streamload action, db: ods, tbl: table2, headers: ...label:lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb...

2026-01-22 22:30:12,173 INFO (thrift-server-pool-470|12411) [DatabaseTransactionMgr.beginTransaction():382] begin transaction: txn id 109908 with label lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb from coordinator BE: 10.66.7.1, listener id: -1

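Key findings:

  • The two stream loads reached the FE in exactly the same millisecond (22:30:12,170)
  • The two requests were handled by threads 710 and 10209 respectively
  • The high concurrency triggered a race condition in bRPC initialization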

3.6 BE Restart Record (be.out)

File: /data/apache-doris-4.0.2-bin-x64/be/log/be.out

INFO: java_cmd /opt/jdk-17.0.17+10/bin/java
INFO: jdk_version 17
StdoutLogger 2026-01-22 22:30:50,105 Start time: Thu Jan 22 10:30:50 PM CST 2026
INFO: java_cmd /opt/jdk-17.0.17+10/bin/java
INFO: jdk_version 17
OpenJDK 64-Bit Server VM warning: Option CriticalJNINatives was deprecated in version 16.0 and will likely be removed in a future release.
...
start BE in local mode

Analysis:

  • The BE finished restarting at 22:30:50
  • About 38 seconds elapsed between the crash and the restart

3.7 Concurrency Count (BE INFO log)

Command:

grep "table1" be.INFO.log.20260122-193049 | grep "22:30:1" | wc -l

Result: 9

Analysis:

  • Within the 10-second window 22:30:1x, there were 9 stream load records related to table1
  • This indicates a high-concurrency scenario
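
Note that the grep above counts every log line that mentions the table name in that window, including follow-up lines whose label merely contains the name, so the figure is an upper bound on distinct requests. A stricter count could look like the Python sketch below. The regular expression, the count_requests helper, and the SAMPLE lines (abbreviated copies of the table2 lines quoted in section 3.2) are illustrative assumptions, not part of Doris:

```python
import re

# Matches only "new income streaming load request" lines and captures
# the HH:MM:SS time and the tbl= field.
REQUEST = re.compile(
    r"^I\d{8} (\d{2}:\d{2}:\d{2})\.\d+ .*new income streaming load request.*\btbl=(\w+)"
)


def count_requests(lines, table, time_prefix):
    """Count stream load requests for `table` whose time starts with `time_prefix`."""
    count = 0
    for line in lines:
        m = REQUEST.match(line)
        if m and m.group(2) == table and m.group(1).startswith(time_prefix):
            count += 1
    return count


# Abbreviated sample lines; "..." marks text cut from the original log.
SAMPLE = [
    "I20260122 22:30:12.172951 2409869 stream_load.cpp:218] new income streaming load request.id=..., label=lb_mongo_doris_ods_table2_0_3866_..., db=ods, tbl=table2, group_commit=0",
    "I20260122 22:30:12.181313 2409869 stream_load_executor.cpp:74] begin to execute stream load. label=lb_mongo_doris_ods_table2_0_3866_..., txn_id=109908",
    "I20260122 22:30:12.189524 2409869 stream_load.cpp:225] finished to handle HTTP header, label=lb_mongo_doris_ods_table2_0_3866_..., elapse(s)=0",
]

grep_like = sum(1 for line in SAMPLE if "table2" in line and "22:30:1" in line)
strict = count_requests(SAMPLE, "table2", "22:30:1")
print(grep_like, strict)
```

On this sample the grep-style count is 3 while only 1 line is an actual request, so the 9 reported for table1 may correspond to fewer distinct stream loads.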

3.8 Ruling Out OOM (system log)

Command:

dmesg | grep -i "killed process"

Result: empty

Analysis:

  • No OOM kill records were found

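4. Root Cause Analysis

4.1 Direct Cause

While creating the DummyBrpcCallback, VNodeChannel::init() accessed an uninitialized brpc::Controller object, causing a null pointer dereference (SIGSEGV @0x0).

4.2 Trigger Condition

Two stream load requests reached the FE in exactly the same millisecond (22:30:12,170) and were forwarded to the same BE node, where they initialized VNodeChannel concurrently and triggered a bRPC-related race condition.

4.3 Call Stack Analysis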

AsyncResultWriter::process_block()
  └→ VTabletWriter::open()
       └→ VTabletWriter::_init()
            └→ IndexChannel::init()
                 └→ VNodeChannel::init()           // initialize the node channel
                      └→ DummyBrpcCallback()       // create the bRPC callback
                           └→ brpc::Controller::call_id()  // 💥 null pointer

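4.4 Preliminary Assessment

Race condition: when multiple stream loads initialize VNodeChannel at the same time, they contend for a shared bRPC Controller resource, and one of the threads ends up accessing an uninitialized object.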

What You Expected?

The BE should keep running normally.

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
