@auricom auricom commented Dec 22, 2025

Problem

Two critical issues were identified in the metrics system:

  1. time_since_last_block_seconds reported stale, ever-growing values: The metric was updated only by the verifier's WebSocket subscription. When that subscription failed, the value kept growing (e.g., 220,366 seconds, roughly 2.5 days) even though the drift exporter's HTTP polling was still seeing new blocks (height 3,709,309 at the time).

  2. block_time_seconds_bucket corruption after ~190 blocks: The initial fix attempt called RecordBlockTime() from the drift exporter, which corrupted the block time histogram by recording invalid inter-block times derived from polling data.

Solution

Implemented a proper separation of concerns between real-time and polling metrics:

  1. Added new UpdateLastBlockTime() method in pkg/metrics/metrics.go:

    • Updates last block arrival time and resets time_since_last_block
    • Does NOT record block time histogram (unlike RecordBlockTime())
    • Designed specifically for pollers that can't measure inter-block time

  2. Updated drift exporter in pkg/exporters/drift/drift.go:

    • Changed from RecordBlockTime() to UpdateLastBlockTime()
    • Added calls when recording both reference node and full node heights
    • Provides redundancy for time_since_last_block metric

Changes Made

  • pkg/metrics/metrics.go: Added UpdateLastBlockTime() method
  • pkg/exporters/drift/drift.go: Updated to use UpdateLastBlockTime() instead of RecordBlockTime()

Impact

  • time_since_last_block_seconds now stays accurate even if WebSocket fails
  • block_time_seconds_bucket continues to work correctly for SLO calculations
  • ✅ No more metric corruption after prolonged running
  • ✅ Proper redundancy between real-time and polling metrics
  • ✅ Maintains backward compatibility and all existing functionality

Testing

  • Build passes: go build -o ev-metrics .
  • Tests pass: go test ./...
  • No breaking changes to existing API

@gemini-code-assist

Summary of Changes

Hello @auricom, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability and accuracy of block-related metrics within the system. It addresses critical issues where "time_since_last_block_seconds" could become stale and "block_time_seconds_bucket" could be corrupted, by implementing a clear separation of concerns for metric updates. A new method, "UpdateLastBlockTime()", is introduced for polling mechanisms, ensuring metrics are consistently updated and accurate, even in the event of real-time data source failures, without introducing side effects to other metrics.

Highlights

  • time_since_last_block_seconds accuracy: Ensures the "time_since_last_block_seconds" metric remains accurate by updating it via polling mechanisms, even if the primary WebSocket subscription fails, preventing it from showing outdated large values.
  • block_time_seconds_bucket integrity: Prevents corruption of the "block_time_seconds_bucket" histogram by no longer using "RecordBlockTime()" with polling data, which previously recorded invalid inter-block times.
  • Dedicated polling metric update: Introduces a new "UpdateLastBlockTime()" method specifically for pollers, allowing them to update the last block arrival time and reset the "time_since_last_block" metric without affecting the block time histogram.



@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new UpdateLastBlockTime method in the metrics package, designed to update the last block arrival time and reset the 'time since last block' metric, particularly for pollers to ensure metric accuracy even if WebSocket subscriptions fail. This new method is then integrated into the drift exporter, where it's called after recording reference node and full node block heights. The reviewer suggests refactoring the UpdateLastBlockTime method by extracting its core logic into an unexported helper function, updateLastBlockTimeUnsafe, to reduce code duplication and improve maintainability, proposing that both UpdateLastBlockTime and the existing RecordBlockTime method could utilize this helper.

Comment on lines 571 to 581
// UpdateLastBlockTime updates the last block arrival time and resets time since last block metric
// without recording block time histogram. This is used by pollers that can't measure inter-block time.
func (m *Metrics) UpdateLastBlockTime(chainID string, arrivalTime time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()

	// update last seen arrival time
	m.lastBlockArrivalTime[chainID] = arrivalTime
	// reset time since last block to 0
	m.TimeSinceLastBlock.WithLabelValues(chainID).Set(0)
}

medium

To improve maintainability and reduce code duplication, you can extract the common logic from UpdateLastBlockTime and RecordBlockTime into an unexported helper function. This new function would handle updating lastBlockArrivalTime and resetting TimeSinceLastBlock. Both UpdateLastBlockTime and RecordBlockTime (which is outside the diff but can be refactored separately) can then call this helper while holding the lock. This makes the code cleaner and ensures consistency.

// UpdateLastBlockTime updates the last block arrival time and resets time since last block metric
// without recording block time histogram. This is used by pollers that can't measure inter-block time.
func (m *Metrics) UpdateLastBlockTime(chainID string, arrivalTime time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.updateLastBlockTimeUnsafe(chainID, arrivalTime)
}

// updateLastBlockTimeUnsafe is an unexported helper that updates the last block arrival time
// and resets the time since last block gauge.
// This function is not thread-safe and should be called with a lock held.
func (m *Metrics) updateLastBlockTimeUnsafe(chainID string, arrivalTime time.Time) {
	// update last seen arrival time
	m.lastBlockArrivalTime[chainID] = arrivalTime
	// reset time since last block to 0
	m.TimeSinceLastBlock.WithLabelValues(chainID).Set(0)
}

@auricom auricom merged commit e20a6b2 into main Dec 22, 2025
3 checks passed
@auricom auricom deleted the claude/connection_fix branch December 22, 2025 14:08