Prune two workflow update related metrics in favor of logs w/ namespace by spkane31 · Pull Request #9269 · temporalio/temporal

spkane31 · 2026-02-10T00:25:54Z

What changed?

Remove the invalid_state_transition_workflow_update_message and workflow_update_registry_size_limited metrics. Add the namespace tag to the logs and soft asserts in the instrumentation methods that emitted these metrics.

Why?

The logs give us the same information plus the namespace, without the cardinality concerns of tagging a metric by namespace. These metrics are not expected to fire frequently.

How did you test it?

Potential risks

Minimal, metrics changes only.

stephanos · 2026-02-10T00:44:51Z

service/history/workflow/update/util.go

 		tag.String("update-id", updateID),
 		tag.String("message", fmt.Sprintf("%T", msg)),
 		tag.Stringer("state", state),
+		tag.String("namespace", namespace),


stephanos · 2026-02-10T00:48:09Z

service/history/workflow/update/util.go

-	i.oneOf(metrics.WorkflowExecutionUpdateRegistrySizeLimited.Name())
-	// TODO: remove log once limit is enforced everywhere
+func (i *instrumentation) countRegistrySizeLimited(updateCount, registrySize, payloadSize int, namespace string) {
 	i.log.Warn("update registry size limit reached",
 		tag.Int("registry-size", registrySize),
 		tag.Int("payload-size", payloadSize),
-		tag.Int("update-count", updateCount))


Actually, looking back at the original PR, the reason for this log line was to get better data on how and when the registry size limit was hit (avoiding some of the issues with a metrics-based solution). Since this is a rate limit, though, I think a metric might be the better way to capture it, no?

I think getting the namespace is important here and using logs instead of metrics removes the cardinality issue. I'd rather use a log here and add a metric if we hit this often. We also have workflow_update_registry_size to get similar information.

> I think getting the namespace is important here and using logs instead of metrics removes the cardinality issue. I'd rather use a log here and add a metric if we hit this often.

I might need more clarity on how the cost works out: if (1) we'll need namespace tags for other metrics anyway and (2) the volume is low, what's the issue?

Apart from that, rate limits are typically tracked as metrics across the codebase, AFAIK. Logs are not as useful because they are much more limiting to query: our log queries cap how much data they can ingest, and on a big cluster that limits how far back in time you can go (I've had one unable to process more than 1h, for example). Metrics don't have that issue.

> We also have workflow_update_registry_size to get similar information.

That's not quite true, as you cannot make the leap from that metric to whether a limit was hit. If the size is at 99%, you cannot assume it hit the limit. And even at 10%, an Update can still hit the limit.
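A rough sketch of why a size gauge cannot stand in for a limit-hit counter. It assumes the limit is a byte budget and that an oversized update is rejected outright. The registry type and its methods are hypothetical, not the actual update registry.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for an update registry with a byte budget.
type registry struct {
	limitBytes int
	usedBytes  int
	limitHits  int // what a dedicated "size limited" counter would record
}

// admit tries to add an update with the given payload size and reports
// whether it fit. A rejection increments limitHits and leaves usedBytes alone.
func (r *registry) admit(payloadBytes int) bool {
	if r.usedBytes+payloadBytes > r.limitBytes {
		r.limitHits++
		return false
	}
	r.usedBytes += payloadBytes
	return true
}

// utilization returns the fill level in percent, i.e. what a size gauge shows.
func (r *registry) utilization() int {
	return 100 * r.usedBytes / r.limitBytes
}

func main() {
	// Nearly full, but no update was ever rejected.
	full := &registry{limitBytes: 1000}
	full.admit(990)
	fmt.Println(full.utilization(), full.limitHits) // prints "99 0"

	// Mostly empty, yet one oversized update hit the limit.
	sparse := &registry{limitBytes: 1000}
	sparse.admit(100)
	sparse.admit(950)
	fmt.Println(sparse.utilization(), sparse.limitHits) // prints "10 1"
}
```

In both cases the gauge alone gives the wrong impression; only the counter (or a log line) records that a limit was actually hit.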

Thoughts on keeping the metric and adding the namespace to the log for now? Once we eventually add the namespace to the metric, we can remove the log entirely.

👍 I'm good with that

…-metrics-prune

Prune two workflow update related metrics in favor of logs w/ namespace

cb94ce8

spkane31 requested review from a team as code owners February 10, 2026 00:25

spkane31 requested a review from stephanos February 10, 2026 00:26

unit test

48622d7

stephanos reviewed Feb 10, 2026


restoring metric

4fcaea9

spkane31 requested a review from stephanos February 11, 2026 17:37

spkane31 added 3 commits February 11, 2026 10:42

linters

e7559f1

Merge branch 'main' of github.com:temporalio/temporal into spk/update…

6828fb0

…-metrics-prune

fix unit test

5cf7c67


stephanos self-requested a review February 15, 2026 23:09

spkane31 wants to merge 6 commits into main from spk/update-metrics-prune
