-
Notifications
You must be signed in to change notification settings - Fork 0
fix: ensure time_since_last_block accuracy with polling fallback #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @auricom, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the reliability and accuracy of block-related metrics within the system. It addresses critical issues where "time_since_last_block_seconds" could become stale and "block_time_seconds_bucket" could be corrupted, by implementing a clear separation of concerns for metric updates. A new method, "UpdateLastBlockTime()", is introduced for polling mechanisms, ensuring metrics are consistently updated and accurate, even in the event of real-time data source failures, without introducing side effects to other metrics. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new UpdateLastBlockTime method in the metrics package, designed to update the last block arrival time and reset the 'time since last block' metric, particularly for pollers to ensure metric accuracy even if WebSocket subscriptions fail. This new method is then integrated into the drift exporter, where it's called after recording reference node and full node block heights. The reviewer suggests refactoring the UpdateLastBlockTime method by extracting its core logic into an unexported helper function, updateLastBlockTimeUnsafe, to reduce code duplication and improve maintainability, proposing that both UpdateLastBlockTime and the existing RecordBlockTime method could utilize this helper.
| // UpdateLastBlockTime updates the last block arrival time and resets time since last block metric | ||
| // without recording block time histogram. This is used by pollers that can't measure inter-block time. | ||
| func (m *Metrics) UpdateLastBlockTime(chainID string, arrivalTime time.Time) { | ||
| m.mu.Lock() | ||
| defer m.mu.Unlock() | ||
|
|
||
| // update last seen arrival time | ||
| m.lastBlockArrivalTime[chainID] = arrivalTime | ||
| // reset time since last block to 0 | ||
| m.TimeSinceLastBlock.WithLabelValues(chainID).Set(0) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To improve maintainability and reduce code duplication, you can extract the common logic from UpdateLastBlockTime and RecordBlockTime into an unexported helper function. This new function would handle updating lastBlockArrivalTime and resetting TimeSinceLastBlock. Both UpdateLastBlockTime and RecordBlockTime (which is outside the diff but can be refactored separately) can then call this helper while holding the lock. This makes the code cleaner and ensures consistency.
// UpdateLastBlockTime updates the last block arrival time and resets time since last block metric
// without recording block time histogram. This is used by pollers that can't measure inter-block time.
func (m *Metrics) UpdateLastBlockTime(chainID string, arrivalTime time.Time) {
m.mu.Lock()
defer m.mu.Unlock()
m.updateLastBlockTimeUnsafe(chainID, arrivalTime)
}
// updateLastBlockTimeUnsafe is an unexported helper that updates the last block arrival time
// and resets the time since last block gauge.
// This function is not thread-safe and should be called with a lock held.
func (m *Metrics) updateLastBlockTimeUnsafe(chainID string, arrivalTime time.Time) {
// update last seen arrival time
m.lastBlockArrivalTime[chainID] = arrivalTime
// reset time since last block to 0
m.TimeSinceLastBlock.WithLabelValues(chainID).Set(0)
}
Problem
Two critical issues were identified in the metrics system:
time_since_last_block_seconds showing outdated large values: The metric was only updated by the verifier's WebSocket subscription, but when that failed, the metric would show increasingly large values (e.g., 220,366 seconds = ~2.5 days) even though the drift exporter's HTTP polling was correctly seeing new blocks at height 3,709,309.
block_time_seconds_bucket corruption after ~190 blocks: The initial fix attempt incorrectly used
RecordBlockTime()in the drift exporter, which corrupted the block time histogram by recording invalid inter-block times from polling data.Solution
Implemented a proper separation of concerns between real-time and polling metrics:
Added new
UpdateLastBlockTime()method inpkg/metrics/metrics.go:RecordBlockTime())Updated drift exporter in
pkg/exporters/drift/drift.go:RecordBlockTime()toUpdateLastBlockTime()Changes Made
pkg/metrics/metrics.go: AddedUpdateLastBlockTime()methodpkg/exporters/drift/drift.go: Updated to useUpdateLastBlockTime()instead ofRecordBlockTime()Impact
time_since_last_block_secondsnow stays accurate even if WebSocket failsblock_time_seconds_bucketcontinues to work correctly for SLO calculationsTesting
go build -o ev-metrics .go test ./...