Description
Search before asking
- I had searched in the issues and found no similar issues.
Version
4.0.2
What's Wrong?
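The following diagnosis and analysis was produced with the help of AI.
1. Problem overview
The Doris BE (Backend) process crashed with SIGSEGV at around 22:30 on 2026-01-22, interrupting a running Mongo CDC sync job. The BE process restarted automatically and then returned to normal.
2. Crash timeline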
| Time | Event | Source |
|---|---|---|
| 22:30:11.973 | Previous batch of stream loads completes normally (txn 109883, 109893) | BE INFO log |
| 22:30:12.170 | Two stream loads reach the FE in the same millisecond | FE log |
| 22:30:12.173 | FE begins the transaction for txn 109908 | FE log |
| 22:30:12.181 | BE starts executing the stream load | BE INFO log |
| 22:30:12.189 | HTTP header handling finishes | BE INFO log |
| 22:30:1x to 22:30:50 | 💀 BE crashes with SIGSEGV | be.out |
| 22:30:50.105 | BE restart completes | be.out |
| 22:31:10.347 | FE detects the BE restart and rolls back txn 109908 | FE log |
| 22:31:18.608 | Mongo CDC automatically retries the stream load | FE log |
| 22:31:34.672 | Retry transaction 109929 commits successfully | FE log |
3. Log evidence
3.1 Crash stack trace (be.out)
File: /data/apache-doris-4.0.2-bin-x64/be/log/be.out
*** Query id: 82486cd1d80c3022-81101df6583bb08e ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1769092212 (unix time) try "date -d @1769092212" if you are using GNU date ***
*** Current BE git commitID: 30d2df0459 ***
*** SIGSEGV address not mapped to object (@0x0) received by PID 2408066 (TID 2461123 OR 0x7f65349bc640) from PID 0; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /opt/jdk-17.0.17+10/lib/server/libjvm.so
2# JVM_handle_linux_signal in /opt/jdk-17.0.17+10/lib/server/libjvm.so
3# 0x00007F795AA3FC30 in /lib64/libc.so.6
4# 0x0000561F10943A42 in /data/apache-doris-4.0.2-bin-x64/be/lib/doris_be
5# brpc::Controller::call_id() in /data/apache-doris-4.0.2-bin-x64/be/lib/doris_be
6# doris::DummyBrpcCallback<doris::PTabletWriterAddBlockResult>::DummyBrpcCallback() at /home/zcp/repo_center/doris_release/doris/be/src/util/brpc_closure.h:39
7# doris::vectorized::VNodeChannel::init(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:575
8# doris::vectorized::IndexChannel::init(doris::RuntimeState*, std::vector<doris::TTabletWithPartition, std::allocator<doris::TTabletWithPartition> > const&, bool) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:151
9# doris::vectorized::VTabletWriter::_init(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:1630
10# doris::vectorized::VTabletWriter::open(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:1389
11# doris::vectorized::AsyncResultWriter::process_block(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/async_result_writer.cpp:119
12# std::_Function_handler<void (), doris::vectorized::AsyncResultWriter::start_writer(doris::RuntimeState*, doris::RuntimeProfile*)::$_0>::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
13# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/util/threadpool.cpp:623
14# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:461
15# start_thread in /lib64/libc.so.6
16# __GI___clone3 in /lib64/libc.so.6
Analysis:
- Error type: `SIGSEGV address not mapped to object (@0x0)`, i.e. a null pointer dereference
- Query ID: `82486cd1d80c3022-81101df6583bb08e`
- Crash location: `VNodeChannel::init()` → `DummyBrpcCallback()` → `brpc::Controller::call_id()`
- Root cause: the bRPC Controller object is NULL
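The crash location above is read off the be.out frames by hand. As a minimal sketch, the same call chain can be extracted mechanically. The helper names `parse_frames` and `call_chain` are hypothetical, and the regular expression assumes every frame follows the `<index># <symbol> at|in <location>` shape seen in this trace.

```python
import re

# Frames 5-7 copied verbatim from the be.out trace above.
FRAMES = """\
5# brpc::Controller::call_id() in /data/apache-doris-4.0.2-bin-x64/be/lib/doris_be
6# doris::DummyBrpcCallback<doris::PTabletWriterAddBlockResult>::DummyBrpcCallback() at /home/zcp/repo_center/doris_release/doris/be/src/util/brpc_closure.h:39
7# doris::vectorized::VNodeChannel::init(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/writer/vtablet_writer.cpp:575
"""

# "<index># <symbol> (at|in) <location>". The symbol may contain spaces
# (for example argument lists), so it is matched lazily.
FRAME_RE = re.compile(r"^\s*(\d+)# (.+?) (at|in) (\S+)$")


def parse_frames(text):
    """Return (index, symbol, location) tuples, innermost frame first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2), m.group(4)))
    return frames


def call_chain(frames):
    """Return symbols with the outermost caller first, i.e. in call order."""
    return [symbol for _, symbol, _ in reversed(frames)]


if __name__ == "__main__":
    for symbol in call_chain(parse_frames(FRAMES)):
        print(symbol)
```

Note that frame 4, the frame that actually faulted, has no symbol in the trace (only the address `0x0000561F10943A42`), so the chain identifies the callers of the faulting code rather than the faulting function itself.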
3.2 Query that triggered the crash (BE INFO log)
File: /data/apache-doris-4.0.2-bin-x64/be/log/be.INFO.log.20260122-193049
I20260122 22:30:12.172951 2409869 stream_load.cpp:218] new income streaming load request.id=82486cd1d80c3022-81101df6583bb08e, job_id=-1, txn_id=-1, label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, elapse(s)=0, db=ods, tbl=table2, group_commit=0, HTTP headers=...
I20260122 22:30:12.173527 2409884 stream_load.cpp:218] new income streaming load request.id=ab4c7fe784d16dc7-f873f51fc204c8b1, job_id=-1, txn_id=-1, label=lb_mongo_doris_ods_table1_0_3866_8e69c489-c858-4d5b-98bd-035b3f8df879, elapse(s)=0, db=ods, tbl=table1, group_commit=0, HTTP headers=...
I20260122 22:30:12.181313 2409869 stream_load_executor.cpp:74] begin to execute stream load. label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, txn_id=109908, query_id=82486cd1d80c3022-81101df6583bb08e
I20260122 22:30:12.189524 2409869 stream_load.cpp:225] finished to handle HTTP header, id=82486cd1d80c3022-81101df6583bb08e, job_id=-1, txn_id=109908, label=lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb, elapse(s)=0
Analysis:
- Triggering query ID: `82486cd1d80c3022-81101df6583bb08e`
- Target table: `table2`
- Transaction ID: 109908
- Source: Mongo CDC sync job
- Note: the two stream loads arrived within 0.6 ms of each other (22:30:12.172 and 22:30:12.173)
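The 0.6 ms figure can be checked directly from the two "new income streaming load request" lines. This is a small sketch that parses the timestamp prefix of each line. The helper name `glog_time` is hypothetical, and the two strings are prefixes of the BE INFO lines above, cut off after the message text.

```python
from datetime import datetime, timedelta


def glog_time(line):
    """Parse the 'YYYYMMDD HH:MM:SS.ffffff' timestamp that follows the severity letter."""
    # line[0] is the severity letter ("I"); the next 24 characters are the timestamp.
    return datetime.strptime(line[1:25], "%Y%m%d %H:%M:%S.%f")


# Prefixes of the two BE INFO lines (table2 request first, table1 request second).
first = "I20260122 22:30:12.172951 2409869 stream_load.cpp:218] new income streaming load request."
second = "I20260122 22:30:12.173527 2409884 stream_load.cpp:218] new income streaming load request."

gap = glog_time(second) - glog_time(first)
gap_ms = gap / timedelta(milliseconds=1)

if __name__ == "__main__":
    print(f"gap between arrivals: {gap_ms:.3f} ms")  # 0.576 ms
```

The gap is 576 microseconds, which matches the rounded 0.6 ms quoted in the analysis.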
3.3 Concurrent stream loads arriving in the same millisecond (FE log)
File: /data/apache-doris-4.0.2-bin-x64/fe/log/fe.log.20260122-1
2026-01-22 22:30:12,170 INFO (qtp149820420-710|710) [LoadAction.streamLoad():106] streamload action, db: ods, tbl: table1, headers: ...label:lb_mongo_doris_ods_table1_0_3866_8e69c489-c858-4d5b-98bd-035b3f8df879...
2026-01-22 22:30:12,170 INFO (qtp149820420-10209|10209) [LoadAction.streamLoad():106] streamload action, db: ods, tbl: table2, headers: ...label:lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb...
2026-01-22 22:30:12,173 INFO (thrift-server-pool-470|12411) [DatabaseTransactionMgr.beginTransaction():382] begin transaction: txn id 109908 with label lb_mongo_doris_ods_table2_0_3866_29d47f74-d2e7-4add-ac00-0ab0a131d7fb from coordinator BE: 10.66.7.1, listener id: -1
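Key findings:
- The two stream loads reached the FE in exactly the same millisecond (22:30:12,170)
- The two requests were handled by threads 710 and 10209 respectively
- High concurrency triggered a race condition in bRPC initialization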
3.6 BE restart record (be.out)
File: /data/apache-doris-4.0.2-bin-x64/be/log/be.out
INFO: java_cmd /opt/jdk-17.0.17+10/bin/java
INFO: jdk_version 17
StdoutLogger 2026-01-22 22:30:50,105 Start time: Thu Jan 22 10:30:50 PM CST 2026
INFO: java_cmd /opt/jdk-17.0.17+10/bin/java
INFO: jdk_version 17
OpenJDK 64-Bit Server VM warning: Option CriticalJNINatives was deprecated in version 16.0 and will likely be removed in a future release.
...
start BE in local mode
Analysis:
- The BE finished restarting at 22:30:50
- About 38 seconds elapsed between the crash and the restart
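The 38-second figure can be cross-checked against the abort timestamp that be.out prints with the stack trace. The sketch below assumes the host clock is China Standard Time (UTC+8), which is what the "CST" in the restart banner suggests. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in the restart banner means China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

# From "*** Aborted at 1769092212 (unix time) ..." in be.out.
abort_epoch = 1769092212
crash = datetime.fromtimestamp(abort_epoch, tz=CST)

# From "StdoutLogger 2026-01-22 22:30:50,105 Start time: ..." in be.out.
restart = datetime(2026, 1, 22, 22, 30, 50, 105000, tzinfo=CST)

downtime = restart - crash

if __name__ == "__main__":
    print(crash.isoformat())                    # 2026-01-22T22:30:12+08:00
    print(f"{downtime.total_seconds():.3f} s")  # 38.105 s
```

Under that time zone assumption the abort timestamp resolves to 22:30:12, the same second in which the BE INFO log records the start of the stream load for txn 109908.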
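3.7 Concurrency count (BE INFO log)
Command:
grep "table1" be.INFO.log.20260122-193049 | grep "22:30:1" | wc -l
Result: 9
Analysis:
- Within the 10-second window 22:30:1x
- There are 9 stream load records related to `table1`
- This demonstrates a high-concurrency scenario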
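The grep pipeline counts log lines, not stream loads. As section 3.2 shows, a single load writes several lines (request received, execution started, header handled), so counting distinct labels gives a closer estimate of how many loads ran. The sketch below illustrates both counts. The sample lines and the labels in them are hypothetical and shortened, and `count_matching` and `distinct_labels` are illustrative helper names.

```python
import re

# Matches the "label=<value>" field seen in the stream load log lines.
LABEL_RE = re.compile(r"label=([^,\s]+)")


def count_matching(lines, *needles):
    """Count lines containing every needle, like `grep a | grep b | wc -l`."""
    return sum(1 for line in lines if all(n in line for n in needles))


def distinct_labels(lines, *needles):
    """Collect the distinct load labels among the matching lines."""
    labels = set()
    for line in lines:
        if all(n in line for n in needles):
            m = LABEL_RE.search(line)
            if m:
                labels.add(m.group(1))
    return labels


# Hypothetical, shortened sample lines; the labels are made up for illustration.
SAMPLE = [
    "I20260122 22:30:12.173527 ... new income streaming load request. label=lb_table1_a, tbl=table1",
    "I20260122 22:30:12.181313 ... begin to execute stream load. label=lb_table1_a, txn_id=1",
    "I20260122 22:30:15.000000 ... new income streaming load request. label=lb_table1_b, tbl=table1",
    "I20260122 22:30:12.172951 ... new income streaming load request. label=lb_table2_a, tbl=table2",
    "I20260122 22:29:59.000000 ... new income streaming load request. label=lb_table1_c, tbl=table1",
]

if __name__ == "__main__":
    print(count_matching(SAMPLE, "table1", "22:30:1"))        # 3 matching lines
    print(len(distinct_labels(SAMPLE, "table1", "22:30:1")))  # 2 distinct loads
```

On the real log, the 9 matching lines are therefore an upper bound on the number of `table1` loads in that window, not an exact count.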
3.8 Ruling out OOM (system log)
Command:
dmesg | grep -i "killed process"
Result: empty
Analysis:
- No OOM kill records were found
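4. Root cause analysis
4.1 Direct cause
When DummyBrpcCallback is created in VNodeChannel::init(), an uninitialized brpc::Controller object is accessed, causing a null pointer dereference (SIGSEGV @0x0).
4.2 Trigger condition
Two stream load requests reached the FE in exactly the same millisecond (22:30:12,170) and were forwarded to the same BE node, which initialized VNodeChannel for both concurrently and triggered a bRPC-related race condition.
4.3 Call stack analysis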
AsyncResultWriter::process_block()
└→ VTabletWriter::open()
└→ VTabletWriter::_init()
└→ IndexChannel::init()
└→ VNodeChannel::init() // initialize the node channel
└→ DummyBrpcCallback() // create the bRPC callback
└→ brpc::Controller::call_id() // 💥 null pointer
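4.4 Preliminary assessment
Race condition under concurrency: when multiple stream loads initialize VNodeChannel at the same time, they contend for a shared bRPC Controller resource, so one of the threads ends up accessing an uninitialized object.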
What You Expected?
The BE keeps running normally.
How to Reproduce?
No response
Anything Else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct