Conversation
| // | ||
| // (-- api-linter: core::0140::prepositions=disabled | ||
| // aip.dev/not-precedent: "to" is used to indicate interval. --) | ||
| google.protobuf.Duration schedule_to_start_timeout = 4; |
There was a problem hiding this comment.
Can you clarify whether this has a specific recent need, or was this just something found on the backlog at temporalio/temporal#778? History fields are pretty scrutinized, so would want to confirm there is a pressing reason for this.
Also, the PR description is missing the accompanying server PR showing how this will be implemented.
There was a problem hiding this comment.
This change came from the backlog.
There isn’t a new production incident driving it.
The original issue was observability: the UI could show a user-configured schedule-to-start timeout (e.g. 10s) while the server used a different (e.g. hardcoded) value, so users couldn’t see what was actually in effect from history. The goal is to make history reflect the schedule-to-start timeout when the server enforces one, so there’s no pressing reason beyond that observability/consistency improvement. If you’d rather not add history fields for that alone, I’m happy to close or repurpose the PR.
Also thanks for pointing out the missing server PR I thought I added it in the description.
| // | ||
| // (-- api-linter: core::0140::prepositions=disabled | ||
| // aip.dev/not-precedent: "to" is used to indicate interval. --) | ||
| google.protobuf.Duration schedule_to_start_timeout = 4; |
There was a problem hiding this comment.
Would like to understand details of how this value is present and/or updated or not on successive task failures or timeouts. I wouldn't expect speculative task failure is ever recorded, or is it? And as for sticky, I suppose it can only occur once because rescheduled task does not have a schedule to start?
Any reason not to add this field to WorkflowTaskTimedOutEventAttributes instead?
There was a problem hiding this comment.
This is what I think is happening so far the server only sets schedule_to_start_timeout when it actually enforces a schedule-to-start timeout i.e. for sticky workflow tasks and for speculative workflow tasks on the normal queue. For a normal task on the normal queue the field stays unset, so it isn’t “updated” across successive timeouts in the usual retry sense; it’s only present on the scheduled events where that timeout applies.
For speculative tasks, the schedule-to-start timeout is recorded: when the timer fires we create the WORKFLOW_TASK_SCHEDULED event at that moment (in AddWorkflowTaskScheduleToStartTimeoutEvent) and then add the WORKFLOW_TASK_TIMED_OUT event, so the scheduled event and its timeout are written together.
For sticky, schedule-to-start only applies to the sticky task; after it times out we reschedule on the normal queue, and that new scheduled event doesn’t have a schedule-to-start timeout, so the field wouldn’t be set there.
I’m happy to put the duration on WorkflowTaskTimedOutEventAttributes instead it would make the timeout event self-contained (you’d see both the timeout type and the duration that fired), avoid ambiguity about when the field is set on the scheduled event, and better match “this is what timed out.” I can switch the change to add the field there (and only populate it when timeout_type is SCHEDULE_TO_START) if that’s the direction you prefer; otherwise I can keep it on the scheduled event and add a short comment documenting this lifecycle.
READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.
What changed?
Added optional schedule_to_start_timeout (google.protobuf.Duration, field number 4) to WorkflowTaskScheduledEventAttributes in temporal/api/history/v1/message.proto.
Regenerated Go and OpenAPI in this repo (via make proto / project tooling).
Why?
Fixes temporalio/temporal#778: the server applies a schedule-to-start timeout (e.g. 5s for speculative workflow tasks, or the sticky timeout) but the event did not record it, so UIs could show a different value (e.g. 10s) and users were confused when the task timed out earlier. Storing the timeout on the event makes the server’s behavior visible and consistent for clients and UIs.
Breaking changes
None. New field is optional; existing events and clients remain valid.
Server PR
Server PR (to temporalio/temporal) that populates and uses this field: temporalio/temporal#9257