Prune two workflow update related metrics in favor of logs w/ namespace#9269
Prune two workflow update related metrics in favor of logs w/ namespace#9269
Conversation
| tag.String("update-id", updateID), | ||
| tag.String("message", fmt.Sprintf("%T", msg)), | ||
| tag.Stringer("state", state), | ||
| tag.String("namespace", namespace), |
| i.oneOf(metrics.WorkflowExecutionUpdateRegistrySizeLimited.Name()) | ||
| // TODO: remove log once limit is enforced everywhere | ||
| func (i *instrumentation) countRegistrySizeLimited(updateCount, registrySize, payloadSize int, namespace string) { | ||
| i.log.Warn("update registry size limit reached", | ||
| tag.Int("registry-size", registrySize), | ||
| tag.Int("payload-size", payloadSize), | ||
| tag.Int("update-count", updateCount)) |
There was a problem hiding this comment.
Actually, looking back at the original PR, the reason for this log line was to get better data on how/when the registry size limit was hit (avoiding some of the issues with a metrics-based solution). Since this is a rate limit, I think metrics might be a better way to capture this actually, no?
There was a problem hiding this comment.
I think getting the namespace is important here and using logs instead of metrics removes the cardinality issue. I'd rather use a log here and add a metric if we hit this often. We also have workflow_update_registry_size to get similar information.
There was a problem hiding this comment.
I think getting the namespace is important here and using logs instead of metrics removes the cardinality issue. I'd rather use a log here and add a metric if we hit this often.
I might need more clarity on how the cost works out; if (1) we'll need namespace tags for other metrics anyway and (2) the volume is low; what's the issue?
Apart from that, rate limits are typically tracked as metrics across the codebase AFAIK; logs are not as useful as they are much more limiting to query. Our log queries puts a cap on how much data it can ingest. On a big cluster that limits how far back in time you can go (I've had it unable to process more than 1h, for example). Metrics don't have that issue.
We also have workflow_update_registry_size to get similar information.
It's not quite true; as you cannot make a leap from that to whether a limit was hit. If the size is at 99%, you cannot assume it hit the limit. Or if it's at 10%, it can still happen that an Update hits the limit.
There was a problem hiding this comment.
Thoughts on leaving the metric and adding namespace to the log for now? Once we eventually add the namespace to the metric we can remove the log entirely but keep the metric for now
What changed?
Remove
invalid_state_transition_workflow_update_messageandworkflow_update_registry_size_limitedmetrics. Add thenamespacetag to the logs/softasserts in theinstrumentationmethods the metrics are used for.Why?
The logs will give us the same information and the namespace without cardinality concerns. These metrics are not expected to fire frequently.
How did you test it?
Potential risks
Minimal, metrics changes only.