Don't fail orchestrations on missing blobs from previous executions #1189
Conversation
cgillum left a comment:
Thank you for this PR. If I understand the description correctly, the root cause of the problem is that we're attempting to load old, partially deleted history. If I understand the fix correctly, we're trying to catch a very specific exception (blob not found) for a very narrow use case in order to detect this condition.
I wonder if we can make a broader fix and still keep things relatively simple. Note that whenever we load history, we're already going to load the sentinel row, so there shouldn't be a need to fetch it explicitly like we are in this PR. What if we instead loaded the history as normal but validated it before attempting to download or deserialize any blobs? By "validating", I mean (at a minimum) confirming that each entity has an execution ID that matches the sentinel row. If validation fails, we don't attempt to load the blob at all and discard the message instead. This would be more efficient and resolve a broader range of potential issues.
Thoughts on this alternate approach?
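For illustration, a rough sketch of that validation idea might look like the following. The helper name, the "sentinel" row key, and the "ExecutionId" property are assumptions for this sketch, not the actual tracking-store API.

```csharp
// Illustrative only: a hypothetical ValidateHistory helper, not the actual
// AzureTableTrackingStore code. Assumes Azure.Data.Tables entities and that
// the sentinel row uses the row key "sentinel" with an "ExecutionId" property.
using System.Collections.Generic;
using System.Linq;
using Azure.Data.Tables;

static class HistoryValidation
{
    public static bool ValidateHistory(IReadOnlyList<TableEntity> historyRows, string expectedExecutionId)
    {
        // The sentinel row comes back with the same query that loads the rest of the history.
        TableEntity sentinel = historyRows.FirstOrDefault(row => row.RowKey == "sentinel");
        if (sentinel == null || sentinel.GetString("ExecutionId") != expectedExecutionId)
        {
            // Stale or missing sentinel: discard the message without touching any blobs.
            return false;
        }

        // Every remaining entity must belong to the same execution before any
        // large-payload blob is downloaded or deserialized.
        return historyRows
            .Where(row => row.RowKey != "sentinel")
            .All(row => row.GetString("ExecutionId") == expectedExecutionId);
    }
}
```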
if (!success)
{
    // Some properties were not retrieved because we are apparently trying to load
    // outdated history. No reason to raise an exception here, as this does not
I'm a bit confused about how/why we'd find outdated history at this point. Shouldn't we have filtered out all the outdated history events via the check on line 173 (where we check the ExecutionId value)?
Exactly, this is the source of our troubles. Line 173 doesn't perform proper filtering because executionId contains an old execution ID, which happens because we passed an old expectedExecutionId to GetHistoryEntitiesResponseInfoAsync, so results.Entities[0] also belongs to the outdated history, and line 173 filters out everything unrelated to the old execution. Eventually, if we don't encounter missing blobs, we hit this line:
message = runtimeState.Events.Count == 0 ? "No such instance" : "Invalid history (may have been overwritten by a newer instance)";
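To make the failure mode concrete, here is a toy, self-contained model of the filtering behavior described above. It uses simple records instead of real table entities, and FilterHistory is a made-up stand-in for the actual code around line 173.

```csharp
// Toy, self-contained model of the old filtering behavior; the real code
// works over table query results, not this record type.
using System;
using System.Collections.Generic;
using System.Linq;

record HistoryRow(string ExecutionId, string EventType);

static class StaleFilterDemo
{
    // Models the check discussed above ("line 173"): rows are kept when they
    // match the execution ID of the first returned row.
    static List<HistoryRow> FilterHistory(List<HistoryRow> queryResults)
    {
        string executionId = queryResults[0].ExecutionId;
        return queryResults.Where(r => r.ExecutionId == executionId).ToList();
    }

    static void Main()
    {
        // Because the query was issued with the stale expectedExecutionId,
        // every returned row belongs to the old execution whose blobs were
        // deleted by ContinueAsNew, so nothing gets filtered out here and the
        // caller later fails trying to download the missing blobs.
        var staleResults = new List<HistoryRow>
        {
            new("old-exec", "ExecutionStarted"),
            new("old-exec", "TaskScheduled"),
            new("old-exec", "TaskCompleted"),
        };

        var kept = FilterHistory(staleResults);
        Console.WriteLine(string.Join(", ", kept.Select(r => r.EventType)));
        // Prints: ExecutionStarted, TaskScheduled, TaskCompleted
    }
}
```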
@cgillum Yes, you understand the description and the approach correctly. I've been intentionally trying to make it a very targeted fix because I'm not super-familiar with this code, so I may not realize all the consequences.
Thank you for pointing out that we already have the sentinel row, so no need to fetch it separately or even change the query.

@cgillum I've updated the fix to rely on the execution ID in the sentinel row and skip loading history if it doesn't match the expected ID.
cgillum left a comment:
I like the simplicity of the new approach! Does the PR description need to be updated?
In any case, I'm good with this change. I left a comment for an optional change, but I'm fine if you think it's not worth risking further changes.
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
Co-authored-by: Chris Gillum <cgillum@microsoft.com>
Updated the PR description.
Fixes Azure/azure-functions-durable-extension#3022.
Fixes #802.
Sometimes a duplicate message for an execution is received late, after that execution has already completed. For example, this can happen when a message is in fact processed, but the worker fails to remove it from the queue for some reason, so the message automatically reappears in the queue later, potentially after the execution has finished. In most cases this is not a problem, and the message is eventually discarded without any negative consequences. However, here is a sequence of events that leads to a stuck orchestration:
1. A TaskScheduled message is delivered to a worker, and the worker starts executing an activity.
2. For some reason, the worker fails to remove the TaskScheduled message from the queue, but the orchestration continues.
3. The TaskScheduled message reappears in the queue and is eventually picked up by a worker. The worker starts executing the activity again (which is fine because Durable Tasks guarantee at-least-once execution, but not exactly-once).
4. Meanwhile, the orchestration restarts itself via ContinueAsNew. As a result, all the blobs for the previous execution are deleted (but the history table entries remain).
5. The repeated activity execution completes and enqueues a TaskCompleted message, which is eventually picked up.
6. The TaskCompleted message carries the previous execution id, so the worker tries to load the history for that execution id. However, all the blobs are missing. The worker repeatedly tries to retrieve the blobs, hits the same error every time, and makes no further progress on the orchestration. The orchestration is stuck indefinitely for no good reason.

This PR addresses the problem by making the history retrieval logic rely on the execution ID recorded in the sentinel row and skip history entries for different execution IDs, so there is no attempt to retrieve missing blobs or interpret those history entries in any other way.
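For illustration, a rough sketch of the fixed loading flow under stated assumptions: Azure.Data.Tables entities, a sentinel row keyed "sentinel" with an "ExecutionId" property, and a hypothetical loadBlobPayloadAsync delegate standing in for the real blob download step. This is not the actual AzureTableTrackingStore implementation.

```csharp
// Hypothetical sketch of the fixed history-loading flow; names and the blob
// download delegate are illustrative assumptions, not the real API.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

static class FixedHistoryLoading
{
    public static async Task<List<TableEntity>> LoadHistoryAsync(
        IReadOnlyList<TableEntity> historyRows,
        Func<TableEntity, Task> loadBlobPayloadAsync)
    {
        // The execution ID recorded in the sentinel row is treated as the source of truth.
        TableEntity sentinel = historyRows.FirstOrDefault(row => row.RowKey == "sentinel");
        string currentExecutionId = sentinel?.GetString("ExecutionId");

        var events = new List<TableEntity>();
        foreach (TableEntity row in historyRows)
        {
            // Skip the sentinel itself and any entry left over from a previous
            // execution: its blobs may already have been deleted by ContinueAsNew,
            // so we never attempt to download or deserialize them.
            if (row.RowKey == "sentinel" || row.GetString("ExecutionId") != currentExecutionId)
            {
                continue;
            }

            // Only current-execution rows reach the blob download step.
            await loadBlobPayloadAsync(row);
            events.Add(row);
        }

        return events;
    }
}
```

Presumably, a stale TaskCompleted message then finds no matching events and falls into the existing "No such instance" / "Invalid history (may have been overwritten by a newer instance)" handling instead of retrying missing blobs forever.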