
Include transient and speculative WFT events in GetWorkflowExecutionHistoryResponse#9325

Merged
spkane31 merged 60 commits into main from spk/update-premature-end-stream
Feb 19, 2026

Conversation

@spkane31

@spkane31 spkane31 commented Feb 13, 2026

What changed?

Re-does #9138 which was incidentally merged.

Include transient and speculative WFT events in GetWorkflowExecutionHistoryResponse, unless the request came from the UI or CLI.

  • Add transient_or_speculative_events back to GetMutableStateResponse
  • Reserve transient_workflow_task in the HistoryContinuation token
  • Add validation helpers
  • Add query-compare-query for transient events at request start and end
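
The "unless the request came from the UI or CLI" rule could be sketched roughly as below. This is an illustration only: the function name and the client-name strings are assumptions made for the example, not identifiers taken from this PR.

```go
package main

import "fmt"

// Hypothetical client-name values used only for this sketch; the real
// header constants in the server may be named or spelled differently.
const (
	clientNameCLI = "temporal-cli"
	clientNameUI  = "temporal-ui"
)

// includeTransientEvents reports whether transient and speculative WFT
// events should be appended to the history response for this caller.
// UI and CLI callers are excluded; every other caller gets them.
func includeTransientEvents(clientName string) bool {
	switch clientName {
	case clientNameCLI, clientNameUI:
		return false
	default:
		return true
	}
}

func main() {
	for _, name := range []string{"temporal-go", clientNameCLI, clientNameUI} {
		fmt.Printf("%s -> %v\n", name, includeTransientEvents(name))
	}
}
```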

Re-implements #7732

Why?

Fixes "premature end of stream" errors that occur when a worker requests history after a cache eviction while a transient or speculative workflow task is present. GetWorkflowExecutionHistory now includes transient and speculative WFT events, as PollWorkflowTask already does. Without them, a worker whose cache was evicted while a speculative workflow task was pending sees a different number of events than it expects. #7732 passed transient events through continuation tokens, which could become stale during pagination. This PR instead queries mutable state at both the start and the end of pagination and compares the transient event IDs; if the WFT state changed during pagination, it returns a retryable error.
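
A minimal sketch of the query-compare-query idea, under simplifying assumptions: the real pagination spans several requests, while this example collapses it into one function call. All type, function, and error names here are hypothetical and do not come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// transientTask is a hypothetical stand-in for the transient/speculative
// WFT information read from mutable state; only the event IDs matter here.
type transientTask struct {
	EventIDs []int64
}

// errStateChanged is a hypothetical retryable error for this sketch.
var errStateChanged = errors.New("workflow task state changed during pagination; retry")

// sameTransientEvents reports whether two snapshots carry identical event IDs.
func sameTransientEvents(a, b transientTask) bool {
	if len(a.EventIDs) != len(b.EventIDs) {
		return false
	}
	for i := range a.EventIDs {
		if a.EventIDs[i] != b.EventIDs[i] {
			return false
		}
	}
	return true
}

// fetchHistory reads all persisted pages, then appends the transient events
// only if the snapshot taken before paging matches the one taken after.
func fetchHistory(queryState func() transientTask, pages [][]int64) ([]int64, error) {
	before := queryState() // query at the start of pagination
	var events []int64
	for _, page := range pages {
		events = append(events, page...)
	}
	after := queryState() // query again at the end of pagination
	if !sameTransientEvents(before, after) {
		return nil, errStateChanged // caller is expected to retry
	}
	return append(events, after.EventIDs...), nil
}

func main() {
	stable := func() transientTask { return transientTask{EventIDs: []int64{6, 7}} }
	events, err := fetchHistory(stable, [][]int64{{1, 2, 3}, {4, 5}})
	fmt.Println(events, err)

	calls := 0
	changing := func() transientTask {
		calls++
		if calls == 1 {
			return transientTask{EventIDs: []int64{6, 7}}
		}
		return transientTask{} // the speculative WFT went away mid-pagination
	}
	_, err = fetchHistory(changing, [][]int64{{1, 2, 3}, {4, 5}})
	fmt.Println(err)
}
```

Comparing the two snapshots, rather than carrying the first one in the continuation token, means a change in WFT state surfaces as an explicit retry instead of as stale events on the last page.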

How did you test it?

  • [x] built
  • [x] run locally and tested manually
  • [x] covered by existing tests
  • [ ] added new unit test(s)
  • [ ] added new functional test(s)

Potential risks

Same risks from #7732

FirstEventId: firstEventID,
NextEventId: nextEventID,
PersistenceToken: persistenceToken,
TransientWorkflowTask: response.GetTransientWorkflowTask(),

The TransientWorkflowTask here is the main change from #9138


The same change is made in respondworkflowtaskcompleted. The reason is that the continuation token only fetches the transient tasks on the first page; if there are multiple pages and the workflow updates during pagination, the continuation token holds outdated information. GetWorkflowExecutionHistory now handles transient events by querying mutable state, so this is unnecessary.

@spkane31 spkane31 marked this pull request as ready for review February 19, 2026 03:32
@spkane31 spkane31 requested review from a team as code owners February 19, 2026 03:32
Comment thread tests/relay_task_test.go
  for range 3 {
      events := s.GetHistory(s.Namespace().String(), workflowExecution)
-     if len(events) == 8 {
+     if len(events) >= 8 {

nit: should this be == 9?

@spkane31 spkane31 merged commit 0fbc386 into main Feb 19, 2026
46 checks passed
@spkane31 spkane31 deleted the spk/update-premature-end-stream branch February 19, 2026 22:24
02strich added a commit that referenced this pull request Feb 23, 2026
* origin/main:
  CHASM: improve support for implementing Terminate method (#9351)
  Add testhooks package documentation (#9373)
  Improve re-usability of ringpop membership & PerNamespaceWorker (#9321)
  Fairness counter: fix heap bug in map counter (#9370)
  Avoid finalGC when ack level is zero (#9371)
  Fairness counter: persist top K keys (#9188)
  Flake Fix: In Reactivation Cache tests, wait for appropriate delays when confirming expected drainage status (#9352)
  Include transient and speculative WFT events in GetWorkflowExecutionHistoryResponse (#9325)
  Fix flaky test TestTransitionDuringTransientTask (#9356)
  Add per-workflow scheduler for history task processing (#9141)
  Populate currentAttemptScheduledTime on PollActivityTaskQueueResponse for standalone activities (#9333)
  Standalone activity heartbeating bug fix (#9354)
  Revert "Last part of making Nexus work OOTB" (#9343)
  Convert flake report from Python to Go (#9334)
  Do not enforce payload limits for system nexus endpoint (#9344)
stephanos pushed a commit to stephanos/temporal that referenced this pull request Feb 23, 2026
…istoryResponse (temporalio#9325)

## What changed?

Re-does temporalio#9138 which was incidentally merged.

Include transient and speculative WFT events in
`GetWorkflowExecutionHistoryResponse`, unless the request came from the
UI or CLI.

* Adds `transient_or_speculative_events` back to
`GetMutableStateResponse`
* Reserve `transient_workflow_task` in the `HistoryContinuation` token
* Add validation helpers
* Add query-compare-query for transient events at request start and end

Re-implements temporalio#7732

## Why?
Fix "premature end of stream" errors when workers request history after
cache eviction w/ transient/speculative workflow tasks present. This
adds transient & speculative WFT events in `GetWorkflowExecutionHistory`
(already in `PollWorkflowTask`). Worker cache eviction w/ speculative
workflow tasks causes the expected and actual event counts to be
different. temporalio#7732 passed transient events through continuation tokens,
which could become stale during pagination. This PR implements mutable
state querying at both start and end of pagination and compares
transient event IDs to detect if WFT state changed during pagination and
return a retryable error.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Same risks from temporalio#7732
dandavison added a commit to dandavison/temporalio-temporal that referenced this pull request Feb 25, 2026
Three test categories:

1. TestTransientWFTEventsInGetHistory: TaskPoller-based diagnostic that
   validates GetWorkflowExecutionHistory includes transient events for
   a pending transient WFT (confirms PR temporalio#9325 works for the simple case).

2. TestPrematureEndOfStreamStress: parameterized stress test using real
   SDK workers with speculative WFTs (Updates) + sticky cache miss +
   gRPC interceptor + concurrent mutations. Explores parameter space
   (cache eviction, concurrent signals, history fetch delays, WFT
   timeouts, workflow counts).

3. TestPrematureEndOfStreamShardClosure: reliably reproduces the bug
   by closing the shard between the SDK receiving the speculative WFT
   poll response and calling GetWorkflowExecutionHistory. The shard
   reopens with fresh mutable state that has lost the in-memory
   speculative events, causing the 2-event gap.

Key finding: transient WFTs (from failures) cannot trigger this bug
because the server clears sticky on WFT failure (failWorkflowTask in
workflow_task_state_machine.go), routing the retry to the normal queue
where full history is in the poll response. The bug requires speculative
WFTs (Updates) on the sticky queue with a cache miss.
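
The event gap described above can be illustrated with a small sketch of a worker-side completeness check. This is not the SDK's actual code: the function, the error value, and the event IDs are hypothetical and chosen only to show how a history that stops short of the expected event ID could surface as "premature end of stream".

```go
package main

import (
	"errors"
	"fmt"
)

// errPrematureEndOfStream is a hypothetical error value for this sketch.
var errPrematureEndOfStream = errors.New("premature end of stream")

// checkHistoryComplete verifies that the fetched history reaches the event ID
// the worker expects from the workflow task it polled. If the history stops
// short, the missing tail is reported as a premature end of stream.
func checkHistoryComplete(eventIDs []int64, expectedLastEventID int64) error {
	var last int64
	if n := len(eventIDs); n > 0 {
		last = eventIDs[n-1]
	}
	if last < expectedLastEventID {
		return fmt.Errorf("%w: expected last event ID %d, got %d",
			errPrematureEndOfStream, expectedLastEventID, last)
	}
	return nil
}

func main() {
	// Assumed scenario: the polled task refers to speculative events 6 and 7,
	// but only events 1 through 5 are persisted.
	persisted := []int64{1, 2, 3, 4, 5}
	fmt.Println(checkHistoryComplete(persisted, 7))

	// With the two speculative events included, the check passes.
	withSpeculative := []int64{1, 2, 3, 4, 5, 6, 7}
	fmt.Println(checkHistoryComplete(withSpeculative, 7))
}
```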
iw added a commit to iw/temporal that referenced this pull request Mar 19, 2026
Merges temporalio/temporal main branch up to df2e384.

Key upstream changes:
- Customizable serialization (temporalio#8426): EncodingTypeFromEnv(), EncodingType() on Encoder
- Per-workflow scheduler for history task processing (temporalio#9141)
- Ringpop membership & PerNamespaceWorker reusability (temporalio#9321)
- Transient/speculative WFT events in history response (temporalio#9325)
- Per-check diagnostics in DeepHealthCheck API (temporalio#9350)
- Fairness counter heap bug fix (temporalio#9370)
- System nexus endpoint (temporalio#9002)
- Mixed brain non-blocking (temporalio#9406)
- interface{} → any across persistence layer
- Various CI, test, and chasm improvements

DSQL fork code preserved:
- TxRetryPolicy/TxRetryMetrics/TxRetryPolicyProvider (common.go)
- ExecutionStoreCreator interface and factory wrapping (factory.go)
- OCC-aware lockShard bypass (shard.go)
- PoolSizeHint ephemeral pool sizing (version_checker.go)
- DSQL-safe InitializeSystemNamespaces (metadata_manager.go)
- Snowflake ID generator (idgenerator.go)
- RegisterPluginAlias (store.go)
- Full DSQL plugin (sqlplugin/dsql/)
birme pushed a commit to eyevinn-osaas/temporal that referenced this pull request Mar 23, 2026
…istoryResponse (temporalio#9325)
