Before Creating the Bug Report
Runtime platform environment
ubuntu
RocketMQ version
develop
JDK Version
1.8
Describe the Bug
Motivation
When switching from the file-based timer engine to the RocksDB timer engine via switchTimerEngine, the
checkAndReviseMetrics scheduled task in TimerMessageStore keeps running because nothing guards it against
an engine switch. As a result, RocksDB-side timer metrics are incorrectly overwritten.
Root Cause
- Shared TimerMetrics: Both TimerMessageStore (file-based) and TimerMessageRocksDBStore (RocksDB)
  share the same TimerMetrics object.
- No switch guard in scheduler: The checkAndReviseMetrics scheduled task registered in
  TimerMessageStore.start() has no check for timerStopEnqueue or timerRocksDBEnable. After
  switchTimerEngine(ROCKSDB_TIMELINE) sets timerStopEnqueue=true, the scheduler still fires.
- Overwrite via putAll: checkAndReviseMetrics() only traverses timerLog (file-based data) to
  rebuild metric counts for "small" topics, then calls timerMetrics.getTimingCount().putAll(newSmallOnes).
  Since RocksDB-side data is not in timerLog, any topic with metrics from RocksDB gets overwritten to 0
  (or loses the RocksDB portion for shared topics).
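The overwrite can be illustrated with a minimal, self-contained sketch. It is not the RocketMQ code: the class name, the reviseFromFileLog helper, and the use of plain Long counters in place of the real metric objects are assumptions made for illustration. It only shows why rebuilding counts from file-based data and then calling putAll on a shared map discards whatever the RocksDB side contributed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PutAllOverwriteDemo {

    // Hypothetical model of the revise step: recount every known topic using
    // only the file-based log, then write the result back with putAll.
    static Map<String, Long> reviseFromFileLog(Map<String, Long> sharedCounts,
                                               Map<String, Long> fileLogCounts) {
        Map<String, Long> newSmallOnes = new HashMap<>();
        for (String topic : sharedCounts.keySet()) {
            // Topics that are absent from the file-based log are recounted as 0.
            newSmallOnes.put(topic, fileLogCounts.getOrDefault(topic, 0L));
        }
        // putAll replaces existing values, including the RocksDB contribution.
        sharedCounts.putAll(newSmallOnes);
        return sharedCounts;
    }

    public static void main(String[] args) {
        // Shared metrics map as seen after the switch (illustrative values).
        Map<String, Long> shared = new ConcurrentHashMap<>();
        shared.put("topicA", 5L); // all 5 messages live in RocksDB
        shared.put("topicB", 7L); // 3 in the file-based log + 4 in RocksDB

        // What a traversal of the file-based log alone would find.
        Map<String, Long> fileLog = new HashMap<>();
        fileLog.put("topicB", 3L);

        reviseFromFileLog(shared, fileLog);

        System.out.println("topicA=" + shared.get("topicA")); // prints topicA=0
        System.out.println("topicB=" + shared.get("topicB")); // prints topicB=3
    }
}
```

In this model topicA drops from 5 to 0 and topicB from 7 to 3, which matches the two failure cases described above: RocksDB-only topics are zeroed, and shared topics lose their RocksDB portion.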
Timeline
Steps to Reproduce
Fix
Add a storeConfig.isTimerStopEnqueue() guard in the checkAndReviseMetrics scheduled task. When the
file-based engine has stopped enqueuing (indicating a switch to RocksDB), skip checkAndReviseMetrics
to prevent overwriting RocksDB-side metrics.
Why timerStopEnqueue?
- switchTimerEngine always sets timerStopEnqueue=true when switching to RocksDB
- When switching back to file-based, it sets timerStopEnqueue=false, so checkAndReviseMetrics resumes
- The semantics are precise: "file-based engine has stopped, should not revise file-based metrics"
- Minimal change, no new config flags needed
Changes
store/src/main/java/org/apache/rocketmq/store/timer/TimerMessageStore.java
Added timerStopEnqueue check in the scheduler task before calling checkAndReviseMetrics():
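The original snippet is not reproduced in this report, so the following is only a hedged sketch of what such a guard could look like. StoreConfig, scheduledReviseTask, and the reviseRuns counter are hypothetical stand-ins; only isTimerStopEnqueue() and checkAndReviseMetrics() are names taken from the description above, and the real task body in TimerMessageStore may differ.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReviseGuardSketch {

    // Hypothetical stand-in for the store config; only the flag used here.
    static class StoreConfig {
        private volatile boolean timerStopEnqueue;

        boolean isTimerStopEnqueue() {
            return timerStopEnqueue;
        }

        void setTimerStopEnqueue(boolean value) {
            timerStopEnqueue = value;
        }
    }

    final StoreConfig storeConfig = new StoreConfig();

    // Counts how often the revise logic actually ran (for demonstration only).
    final AtomicInteger reviseRuns = new AtomicInteger();

    // Placeholder for the real revise logic, which rebuilds counts from timerLog.
    void checkAndReviseMetrics() {
        reviseRuns.incrementAndGet();
    }

    // Body of the periodic task: skip revising while the file-based engine
    // has stopped enqueuing, i.e. while the RocksDB engine is active.
    void scheduledReviseTask() {
        if (storeConfig.isTimerStopEnqueue()) {
            return;
        }
        checkAndReviseMetrics();
    }

    public static void main(String[] args) {
        ReviseGuardSketch store = new ReviseGuardSketch();

        store.scheduledReviseTask();                   // file-based engine active
        store.storeConfig.setTimerStopEnqueue(true);   // switch to RocksDB
        store.scheduledReviseTask();                   // skipped by the guard
        store.storeConfig.setTimerStopEnqueue(false);  // switch back to file-based
        store.scheduledReviseTask();                   // revising resumes

        System.out.println("revise runs: " + store.reviseRuns.get()); // prints revise runs: 2
    }
}
```

Checking the flag inside the task, rather than cancelling and rescheduling it, keeps the change small and lets revising resume automatically once the flag is cleared.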
What Did You Expect to See?
After switching the engine, the timer metrics should remain correct.
What Did You See Instead?
null
Additional Context
No response