Skip to content

💥 Fix possible panic due to LA completing after an eviction#1304

Merged
Sushisource merged 2 commits into
mainfrom
sj/fix-evict-activations-concurrent
May 29, 2026
Merged

💥 Fix possible panic due to LA completing after an eviction#1304
Sushisource merged 2 commits into
mainfrom
sj/fix-evict-activations-concurrent

Conversation

@Sushisource
Copy link
Copy Markdown
Member

@Sushisource Sushisource commented May 29, 2026

What was changed

This changes local activity eviction semantics so local activities are invalidated when their
owning workflow run instance is evicted, rather than allowing them to survive and potentially
be adopted by a later replay of the same RunID.

The old behavior was timing-dependent: an in-flight LA completion could arrive after eviction
while the workflow was replaying, but before replay had recreated the corresponding local
activity machine. In that state, the completion no longer had a valid owner, yet Core could try
to apply it to the new workflow-machine instance, causing nondeterminism or corrupting WFT/
activation bookkeeping. If the completion happened to arrive later, after replay recreated the
LA command, it could appear to work, but that relied on an accidental race rather than a sound
ownership model.

This also makes duplicate local activity queueing an invariant violation instead of silently
suppressing it, and warns when eviction is requested while local activities are still in
flight, which is especially problematic with max_cached_workflows = 0.

Warning

It's probably important to call this one out in release notes, as it's possible there are people out there semi-depending on the behavior that accidentally worked. This is probably only likely in situations where they have the cache off. They should not have the cache off. This is not actually a breaking change, though.

Why?

Bugfix

Checklist

  1. Closes [Bug] Flaky panic triggered by test_workflow_cancel_activity in sdk-python #1230

  2. How was this tested:
    Added test

  3. Any docs updates needed?

@Sushisource Sushisource requested a review from a team as a code owner May 29, 2026 02:40
@Sushisource Sushisource force-pushed the sj/fix-evict-activations-concurrent branch from 8677cc5 to 386b90c Compare May 29, 2026 02:41
run_id=%info.run_id,
reason=?info.reason,
outstanding_local_activities=outstanding_las,
"Eviction requested while local activities are still in flight; local activities when using max_cached_workflows=0 are likely to be dropped or retried"
Copy link
Copy Markdown
Contributor

@mjameswh mjameswh May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting observation. It seems like Go and Java do not have the bug you're fixing in this PR, but this specific call out (LA vs max_cached_workflows=0) may apply there too.

@yuandrew
Copy link
Copy Markdown
Contributor

Warning

It's probably important to call this one out in release notes, as it's possible there are people out there semi-depending on the behavior that accidentally worked. This is probably only likely in situations where they have the cache off. They should not have the cache off.

worth adding a 💥 in the title so lang remembers to call this out?

@Sushisource
Copy link
Copy Markdown
Member Author

Warning

It's probably important to call this one out in release notes, as it's possible there are people out there semi-depending on the behavior that accidentally worked. This is probably only likely in situations where they have the cache off. They should not have the cache off.

worth adding a 💥 in the title so lang remembers to call this out?

Done. It's not actually a breaking change, but, it'll call attention to this

@Sushisource Sushisource changed the title Fix possible panic due to LA completing after an eviction 💥 Fix possible panic due to LA completing after an eviction May 29, 2026
@Sushisource Sushisource force-pushed the sj/fix-evict-activations-concurrent branch from 386b90c to 9efb5fe Compare May 29, 2026 15:45
Comment on lines +470 to +471
return None;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure returning None is correct here.

According to a comment above:

Returns the next pending local-activity related action, or None if shutdown has initiated and there are no more remaining actions to take.

But here, we haven't confirmed there isn't more remaining actions in the queue, nor that shutdown has been initiated. Can't there be other pending LA actions after the one that has been invalidated?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, let me see

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this didn't actually matter since the caller is also double-checking the shutdown token, but, I've changed it to loop (there were other cases that should've been loops too) and everything is more consistent w/ the docstring now.

@Sushisource Sushisource force-pushed the sj/fix-evict-activations-concurrent branch from 9efb5fe to a86c65a Compare May 29, 2026 18:18
@Sushisource Sushisource merged commit 6a8355a into main May 29, 2026
35 of 39 checks passed
@Sushisource Sushisource deleted the sj/fix-evict-activations-concurrent branch May 29, 2026 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Flaky panic triggered by test_workflow_cancel_activity in sdk-python

3 participants