metrics: add handle ddl event duration metric #4320

Open
wk989898 wants to merge 5 commits into pingcap:master from wk989898:ddl-handle-metric

@wk989898
Collaborator

@wk989898 wk989898 commented Feb 28, 2026

What problem does this PR solve?

Issue Number: close #4295

What is changed and how it works?

  • New DDL Handling Duration Metric: Introduced a new Prometheus histogram metric, ticdc_ddl_handle_duration (queried through its ticdc_ddl_handle_duration_bucket series), to track the duration of DDL event handling within the system.
  • Metric Integration: Integrated the new DDL handling duration metric into the BasicDispatcher to record the time taken from event processing to the completion of the post-flush callback for DDL events.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
Screenshot 2026-02-28 18 01 11

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • New Features

    • Added DDL handling duration metric and timing instrumentation to measure and expose DDL processing latency.
  • Chores

    • Added Grafana heatmap panels to visualize DDL handle duration.
    • Ensured metric label cleanup when instances are closed.
  • Bug Fixes

    • Added debug logging for DDL timing to aid diagnostics.

Signed-off-by: wk989898 <nhsmwk@gmail.com>
@ti-chi-bot ti-chi-bot bot added the release-note and size/XL labels Feb 28, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bc3877 and 3611d61.

📒 Files selected for processing (5)
  • downstreamadapter/dispatcher/basic_dispatcher.go
  • downstreamadapter/dispatcher/basic_dispatcher_info.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
🚧 Files skipped from review as they are similar to previous changes (4)
  • downstreamadapter/dispatcher/basic_dispatcher.go
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • downstreamadapter/dispatcher/basic_dispatcher_info.go
  • metrics/grafana/ticdc_new_arch.json

📝 Walkthrough

Instrumentation and a new Prometheus histogram were added to measure DDL handling duration in the dispatcher; metric lifecycle cleanup was implemented; Grafana dashboard JSONs gained a heatmap panel for "Handle DDL Duration". No control-flow or return-value changes.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Metric Definition & Registration | pkg/metrics/ddl.go | Added exported HandleDDLHistogram (labels: keyspace, changefeed) and registered it in initDDLMetrics. |
| Metric Initialization & Lifecycle | downstreamadapter/dispatcher/basic_dispatcher_info.go | Added metricHandleDDLHis prometheus.Observer to SharedInfo; initialized with HandleDDLHistogram.WithLabelValues(...); delete label values on Close(). |
| Timing Instrumentation | downstreamadapter/dispatcher/basic_dispatcher.go | Captured timestamp before adding DDL post-flush callback; in callback observed elapsed time via metricHandleDDLHis and logged debug message with duration and DDL. |
| Grafana Dashboard Panels | metrics/grafana/ticdc_new_arch.json, metrics/nextgengrafana/ticdc_new_arch_next_gen.json, metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json | Added heatmap panel "Handle DDL Duration" (id 62001) to DDL dashboard rows; bumped dashboard version from 38→39; panel uses tsbuckets and Prometheus rate on ticdc_ddl_handle_duration_bucket. |

Sequence Diagram(s)

sequenceDiagram
    participant Dispatcher as BasicDispatcher
    participant Shared as SharedInfo
    participant Prom as Prometheus
    participant Graf as Grafana

    Dispatcher->>Shared: register DDL post-flush callback (record start time)
    note right of Shared: callback will run after flush completes
    Shared->>Dispatcher: (on DDL post-flush) invoke callback
    Dispatcher->>Prom: observe duration via HandleDDLHistogram.WithLabelValues(...)
    Prom->>Graf: histogram buckets exposed (scraped by Prometheus)
    Graf->>Graf: heatmap panel queries `ticdc_ddl_handle_duration_bucket` for visualization

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, size/L

Suggested reviewers

  • asddongmen
  • hongyunyan

Poem

🐰 A hop, a tick, a tiny trail,

I timed the DDL without fail.
Buckets fill and colors hum,
Grafana shows how long they run.
Hooray — metrics done, time to nibble a kale!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description includes the required issue reference, explains what changed and how it works, but leaves critical sections incomplete: testing details, performance/compatibility questions, and release notes are unfilled. | Complete the unanswered questions about performance regression/compatibility impact and provide a proper release note following the style guide; specify which test types were actually performed. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main change: adding a Prometheus histogram metric to measure DDL event handling duration. |
| Linked Issues check | ✅ Passed | The PR successfully implements the objective from issue #4295: it adds a new Prometheus histogram metric (HandleDDLHistogram) to measure DDL event handling duration and integrates it into BasicDispatcher to capture pre-sink processing time. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the DDL handling duration metric objective: instrumentation in dispatcher logic, metric definition and registration, and Grafana dashboard updates to visualize the new metric. |

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the monitoring capabilities of TiCDC by introducing a dedicated metric to measure the duration of Data Definition Language (DDL) event handling. This provides crucial insights into the performance and latency characteristics of DDL operations, which are vital for maintaining database schema consistency and overall system health. The addition of corresponding Grafana panels ensures that this new observability data is readily available for analysis and troubleshooting.

Highlights

  • New DDL Handling Duration Metric: Introduced a new Prometheus histogram metric, ticdc_ddl_handle_duration_bucket, to precisely track the duration of DDL event handling within the system.
  • Metric Integration: Integrated the new DDL handling duration metric into the BasicDispatcher to record the time taken from event processing to the completion of the post-flush callback for DDL events.
  • Grafana Dashboard Updates: Updated multiple Grafana dashboards (ticdc_new_arch.json, ticdc_new_arch_next_gen.json, ticdc_new_arch_with_keyspace_name.json) to include a new 'Handle DDL Duration' heatmap panel, providing visual representation of the new metric.

Changelog
  • downstreamadapter/dispatcher/basic_dispatcher.go
    • Added a timestamp capture (now) before DDL post-flush function execution.
    • Recorded the DDL handling duration using metricHandleDDLHis.Observe within the DDL post-flush function.
    • Included debug logging for DDL event finish, showing the cost.
  • downstreamadapter/dispatcher/basic_dispatcher_info.go
    • Imported metrics and prometheus packages.
    • Added metricHandleDDLHis (a prometheus.Observer) to the SharedInfo struct.
    • Initialized metricHandleDDLHis with metrics.HandleDDLHistogram during NewSharedInfo creation.
    • Implemented cleanup logic in the Close() method to delete the metric label values.
  • metrics/grafana/ticdc_new_arch.json
    • Added a new Grafana heatmap panel titled 'Handle DDL Duration' to the DDL section.
    • Configured the new panel to display the ticdc_ddl_handle_duration_bucket metric.
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
    • Added a new Grafana heatmap panel titled 'Handle DDL Duration' to the DDL section.
    • Configured the new panel to display the ticdc_ddl_handle_duration_bucket metric, including the keyspace_name label.
  • metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
    • Added a new Grafana heatmap panel titled 'Handle DDL Duration' to the DDL section.
    • Configured the new panel to display the ticdc_ddl_handle_duration_bucket metric, including the keyspace_name label.
  • pkg/metrics/ddl.go
    • Defined HandleDDLHistogram as a new Prometheus histogram vector for DDL handling duration.
    • Registered HandleDDLHistogram in the initDDLMetrics function.
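
The lifecycle described in the changelog (bind the label values once when the shared info is created, delete them on `Close()`) can be sketched with the standard library alone. Everything below is a hypothetical model: `histogramVec`, `series`, `sharedInfo`, and `seriesCount` are illustrative names, and only the method names `WithLabelValues` and `DeleteLabelValues` are chosen to mirror the Prometheus Go client. It is not the PR's implementation.

```go
package main

import "sync"

// observer mirrors the single-method shape of prometheus.Observer.
type observer interface {
	Observe(float64)
}

// series is one labeled child: a running count and sum of observations.
type series struct {
	mu    sync.Mutex
	count uint64
	sum   float64
}

func (s *series) Observe(v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.count++
	s.sum += v
}

// histogramVec is a hypothetical labeled vector keyed by (keyspace, changefeed).
type histogramVec struct {
	mu       sync.Mutex
	children map[[2]string]*series
}

func newHistogramVec() *histogramVec {
	return &histogramVec{children: make(map[[2]string]*series)}
}

// WithLabelValues returns the child for the label pair, creating it on first use.
func (v *histogramVec) WithLabelValues(keyspace, changefeed string) observer {
	v.mu.Lock()
	defer v.mu.Unlock()
	key := [2]string{keyspace, changefeed}
	child, ok := v.children[key]
	if !ok {
		child = &series{}
		v.children[key] = child
	}
	return child
}

// DeleteLabelValues drops the child and reports whether it existed.
func (v *histogramVec) DeleteLabelValues(keyspace, changefeed string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	key := [2]string{keyspace, changefeed}
	_, ok := v.children[key]
	delete(v.children, key)
	return ok
}

// seriesCount reports how many labeled children are currently tracked.
func (v *histogramVec) seriesCount() int {
	v.mu.Lock()
	defer v.mu.Unlock()
	return len(v.children)
}

// sharedInfo caches the bound observer so the hot path skips the label lookup.
type sharedInfo struct {
	keyspace, changefeed string
	vec                  *histogramVec
	metricHandleDDLHis   observer
}

func newSharedInfo(vec *histogramVec, keyspace, changefeed string) *sharedInfo {
	return &sharedInfo{
		keyspace:           keyspace,
		changefeed:         changefeed,
		vec:                vec,
		metricHandleDDLHis: vec.WithLabelValues(keyspace, changefeed),
	}
}

// Close removes the labeled series so a removed changefeed stops being reported.
func (s *sharedInfo) Close() {
	s.vec.DeleteLabelValues(s.keyspace, s.changefeed)
}
```

Caching the observer keeps the per-DDL cost to a single `Observe` call, and deleting the label values on close prevents series for closed changefeeds from accumulating in the registry.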

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new metric, ticdc_ddl_handle_duration, to monitor the duration of DDL event handling. The implementation involves adding the metric definition, observing its value within the dispatcher logic, and updating Grafana dashboards with a new visualization panel. The changes are generally well-implemented. I've provided a couple of suggestions to improve code consistency and performance.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
downstreamadapter/dispatcher/basic_dispatcher_info.go (1)

73-75: Fix the metric field comment to match the actual identifier.

Line 73 says metricExecDDLHis, but the field at Line 75 is metricHandleDDLHis. Please align the comment to avoid confusion during maintenance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatcher/basic_dispatcher_info.go` around lines 73 - 75,
The comment above the metric field is referring to the wrong identifier
(mentions metricExecDDLHis while the field is named metricHandleDDLHis); update
the comment to match the actual field name (metricHandleDDLHis) or rename the
field to metricExecDDLHis so they align—ensure the comment describes that
metricHandleDDLHis records each DDL handling duration (execution + wait for
resolution) and use the exact identifier metricHandleDDLHis in the comment.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8e6236 and 7bc3877.

📒 Files selected for processing (6)
  • downstreamadapter/dispatcher/basic_dispatcher.go
  • downstreamadapter/dispatcher/basic_dispatcher_info.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
  • pkg/metrics/ddl.go

Comment on lines 672 to 681
    now := time.Now()
    ddl.AddPostFlushFunc(func() {
        if d.tableSchemaStore != nil {
            d.tableSchemaStore.AddEvent(ddl)
        }
        wakeCallback()
        d.sharedInfo.metricHandleDDLHis.Observe(time.Since(now).Seconds())
        log.Debug("dispatcher handle ddl event finish",
            zap.Duration("cost", time.Since(now)),
            zap.Any("ddl", ddl))

⚠️ Potential issue | 🟠 Major

Metric boundary currently measures beyond pre-sink handling.

Timer starts at Line 672 but is observed in post-flush callback at Line 678 (after sink write/flush), and after wakeCallback() at Line 677. This captures end-to-end + callback overhead, not just dispatcher pre-sink DDL handling from issue #4295.

If pre-sink latency is the target, record the metric right before the first sink write path. At minimum, move observe before wakeCallback and reuse one computed elapsed duration to avoid callback skew.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatcher/basic_dispatcher.go` around lines 672 - 681, The
metric currently measures after sink flush and callback overhead because now is
set before AddPostFlushFunc but d.sharedInfo.metricHandleDDLHis.Observe is
called inside the post-flush callback after wakeCallback; change to compute
elapsed := time.Since(now) once inside the post-flush func and call
d.sharedInfo.metricHandleDDLHis.Observe(elapsed.Seconds()) before invoking
wakeCallback (and before any heavy post-sink work like
d.tableSchemaStore.AddEvent or wakeCallback) so the metric reflects pre-sink
dispatcher handling only.

wk989898 and others added 4 commits February 28, 2026 16:56
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: wk989898 <nhsmwk@gmail.com>
Signed-off-by: wk989898 <nhsmwk@gmail.com>
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm label Feb 28, 2026
@ti-chi-bot

ti-chi-bot bot commented Feb 28, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asddongmen

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

ti-chi-bot bot commented Feb 28, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-02-28 10:50:16.097813058 +0000 UTC m=+9660.675892242: ☑️ agreed by asddongmen.

@ti-chi-bot ti-chi-bot bot added the approved label Feb 28, 2026
Labels

approved, needs-1-more-lgtm, release-note, size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add more metrics for ddl event

2 participants