Skip to content

Fix MinionSubTaskHighWaitTime alert to self-resolve using gauge#18517

Open
swaminathanmanish wants to merge 2 commits into
apache:masterfrom
swaminathanmanish:fix/minion-subtask-wait-time-gauge
Open

Fix MinionSubTaskHighWaitTime alert to self-resolve using gauge#18517
swaminathanmanish wants to merge 2 commits into
apache:masterfrom
swaminathanmanish:fix/minion-subtask-wait-time-gauge

Conversation

@swaminathanmanish
Copy link
Copy Markdown
Contributor

Summary

MinionSubTaskHighWaitTime alert does not self-resolve after the minion queue drains because it is based on ControllerTimer.SUBTASK_WAITING_TIME, a histogram. Histogram _Max retains its peak value across emission cycles and does not decay when there are no longer any waiting subtasks, causing the alert to stay firing indefinitely.

  • Add a new MAX_SUBTASK_WAIT_TIME_MS gauge in ControllerGauge (per-table, non-global)
  • In TaskMetricsEmitter, replace the timer emission with per-(table, taskType) gauge emission: the max wait time across all waiting subtasks, or 0 when none are waiting
  • Clean up the gauge in removeTableTaskTypeMetrics when a task type/table is retired
  • Update TaskMetricsEmitterTest to reflect the new metric and assert correct gauge values

The gauge is written every emit cycle and self-resolves when the queue clears. Alert rule update (separate config repo):

expr: max(pinot_controller_MaxSubtaskWaitTimeMs) by (exported_table, taskType) > 14400000

Test plan

  • TaskMetricsEmitterTest updated with assertions for maxSubtaskWaitTimeMs gauge values per table (3000ms for waiting table, 0ms for non-waiting table)
  • Run ./mvnw -pl pinot-controller -am -Dtest=TaskMetricsEmitterTest test

Replace ControllerTimer.SUBTASK_WAITING_TIME (histogram) with a new
MAX_SUBTASK_WAIT_TIME_MS gauge in TaskMetricsEmitter.

The timer's _Max stat retains its peak value across emission cycles and
does not decay when the queue drains, causing the alert to stay firing
after all subtasks have finished. The new gauge is written every cycle
with the current max wait time across waiting subtasks for each
(table, taskType) pair, defaulting to 0 when no subtasks are waiting,
so the alert self-resolves on the next emit cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.77%. Comparing base (3eff094) to head (2e72de7).
⚠️ Report is 230 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (3eff094) and HEAD (2e72de7). Click for more details.

HEAD has 16 uploads less than BASE
Flag BASE (3eff094) HEAD (2e72de7)
java-21 5 4
unittests1 2 1
unittests 4 1
temurin 8 4
java-11 3 0
unittests2 2 0
integration 4 3
integration2 2 1
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18517      +/-   ##
============================================
- Coverage     63.48%   56.77%   -6.72%     
+ Complexity     1627        7    -1620     
============================================
  Files          3244     2567     -677     
  Lines        197365   148958   -48407     
  Branches      30540    24099    -6441     
============================================
- Hits         125306    84564   -40742     
+ Misses        62019    57205    -4814     
+ Partials      10040     7189    -2851     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 ?
java-21 56.77% <100.00%> (-6.70%) ⬇️
temurin 56.77% <100.00%> (-6.72%) ⬇️
unittests 56.76% <100.00%> (-6.72%) ⬇️
unittests1 56.76% <100.00%> (+1.30%) ⬆️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@swaminathanmanish swaminathanmanish marked this pull request as ready for review May 26, 2026 11:27
@swaminathanmanish swaminathanmanish requested a review from Copilot May 26, 2026 11:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…IT_TIME_MS gauge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants