💥 Fix possible panic due to LA completing after an eviction#1304
Conversation
8677cc5 to
386b90c
Compare
| run_id=%info.run_id, | ||
| reason=?info.reason, | ||
| outstanding_local_activities=outstanding_las, | ||
| "Eviction requested while local activities are still in flight; local activities when using max_cached_workflows=0 are likely to be dropped or retried" |
There was a problem hiding this comment.
Interesting observation. It seems like Go and Java do not have the bug you're fixing in this PR, but this specific call out (LA vs max_cached_workflows=0) may apply there too.
worth adding a 💥 in the title so lang remembers to call this out? |
Done. It's not actually a breaking change, but, it'll call attention to this |
386b90c to
9efb5fe
Compare
| return None; | ||
| } |
There was a problem hiding this comment.
I'm not sure returning None is correct here.
According to a comment above:
Returns the next pending local-activity related action, or None if shutdown has initiated and there are no more remaining actions to take.
But here, we haven't confirmed there isn't more remaining actions in the queue, nor that shutdown has been initiated. Can't there be other pending LA actions after the one that has been invalidated?
There was a problem hiding this comment.
Good point, let me see
There was a problem hiding this comment.
Looks like this didn't actually matter since the caller is also double-checking the shutdown token, but, I've changed it to loop (there were other cases that should've been loops too) and everything is more consistent w/ the docstring now.
9efb5fe to
a86c65a
Compare
What was changed
This changes local activity eviction semantics so local activities are invalidated when their
owning workflow run instance is evicted, rather than allowing them to survive and potentially
be adopted by a later replay of the same RunID.
The old behavior was timing-dependent: an in-flight LA completion could arrive after eviction
while the workflow was replaying, but before replay had recreated the corresponding local
activity machine. In that state, the completion no longer had a valid owner, yet Core could try
to apply it to the new workflow-machine instance, causing nondeterminism or corrupting WFT/
activation bookkeeping. If the completion happened to arrive later, after replay recreated the
LA command, it could appear to work, but that relied on an accidental race rather than a sound
ownership model.
This also makes duplicate local activity queueing an invariant violation instead of silently
suppressing it, and warns when eviction is requested while local activities are still in
flight, which is especially problematic with max_cached_workflows = 0.
Warning
It's probably important to call this one out in release notes, as it's possible there are people out there semi-depending on the behavior that accidentally worked. This is probably only likely in situations where they have the cache off. They should not have the cache off. This is not actually a breaking change, though.
Why?
Bugfix
Checklist
Closes [Bug] Flaky panic triggered by test_workflow_cancel_activity in sdk-python #1230
How was this tested:
Added test
Any docs updates needed?