Conversation

@AnatoliB (Collaborator) commented on Mar 6, 2025

Fixes Azure/azure-functions-durable-extension#3022.
Fixes #802.

Sometimes a duplicate message is delivered late, after the execution it belongs to has already completed. For example, this can happen when a message is in fact processed, but the worker fails to remove it from the queue for some reason, so the message automatically reappears in the queue later, potentially after the execution has finished. In most cases this is not a problem, and the message is eventually discarded without any negative consequences. However, here is a sequence of events that leads to a stuck orchestration:

  1. A TaskScheduled message is delivered to a worker, and the worker starts executing an activity.
  2. After successfully executing the activity, the worker fails to remove the TaskScheduled message from the queue, but the orchestration continues.
  3. The activity returns a large amount of data (>~45K). This data is stored in a blob, and the history table entry contains a reference to this blob.
  4. Eventually, the TaskScheduled message reappears in the queue and is picked up by a worker. The worker starts executing the activity again (which is acceptable because Durable Tasks guarantee at-least-once execution, not exactly-once).
  5. In the meantime, the orchestrator function invokes ContinueAsNew. As a result, all the blobs for the previous execution are deleted (but the history table entries remain).
  6. The worker finishes the second activity execution and produces a TaskCompleted message, which is eventually picked up.
  7. This TaskCompleted message carries the previous execution ID, so the worker tries to load the history for that execution ID. However, all the blobs are missing. The worker repeatedly tries to retrieve the blobs, hitting the same error each time and making no further progress on the orchestration. The orchestration is stuck indefinitely for no good reason.

This PR addresses the problem by making the history retrieval logic rely on the execution ID recorded in the sentinel row and skip history entries for different execution IDs, so there will be no attempt to retrieve missing blobs or interpret these history entries in any other way.
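
A minimal sketch of that idea (the method, parameter, and column names here are hypothetical stand-ins for illustration, not the actual DurableTask.AzureStorage code):

using System.Collections.Generic;
using Azure.Data.Tables; // assumes the Azure.Data.Tables SDK, for illustration only

static List<TableEntity> FilterToCurrentExecution(
    TableEntity sentinel,
    IEnumerable<TableEntity> historyEntities)
{
    // The sentinel row records the execution ID that the stored history
    // actually belongs to; treat it as the source of truth.
    string currentExecutionId = sentinel.GetString("ExecutionId");

    var current = new List<TableEntity>();
    foreach (TableEntity entity in historyEntities)
    {
        // Entries left over from an earlier execution (e.g. one replaced by
        // ContinueAsNew) are skipped outright, so no attempt is ever made to
        // download their possibly-deleted large-payload blobs.
        if (entity.GetString("ExecutionId") == currentExecutionId)
        {
            current.Add(entity);
        }
    }

    return current;
}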

@AnatoliB requested review from cgillum and davidmrdavid on March 6, 2025 21:41

@cgillum (Member) left a comment

Thank you for this PR. If I understand the description correctly, the root cause of the problem is that we're attempting to load old, partially deleted history. If I understand the fix correctly, we're trying to catch a very specific exception (blob not found) for a very narrow use case in order to detect this condition.

I wonder if we can make a broader fix and still keep things relatively simple. Note that whenever we load history, we're already going to load the sentinel row, so there shouldn't be a need to fetch it explicitly like we do in this PR. What if we instead loaded the history as normal but validated it first, before attempting to download or deserialize any blobs? By "validating", I mean (at a minimum) confirming that each entity has an execution ID that matches the sentinel row. If validation fails, we don't even attempt to load the blob at all and simply discard the message. This would be more efficient and resolve a broader range of potential issues.

Thoughts on this alternate approach?
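
To make the proposal concrete, a rough sketch (hypothetical names and signature; the real validation would live in the history-loading path). The history rows and the sentinel row come back from the same table query, so no extra round trip is needed:

using System.Collections.Generic;
using Azure.Data.Tables; // assumed SDK, for illustration only

static bool HistoryMatchesSentinel(
    TableEntity sentinel,
    IReadOnlyList<TableEntity> historyEntities)
{
    string expectedExecutionId = sentinel.GetString("ExecutionId");

    foreach (TableEntity entity in historyEntities)
    {
        // Any mismatch means this is leftover history from an older
        // execution; bail out before touching any blobs.
        if (entity.GetString("ExecutionId") != expectedExecutionId)
        {
            return false;
        }
    }

    return true;
}

// Caller sketch: on a failed validation, discard the stale message instead
// of deserializing blob-backed payloads that may no longer exist.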

if (!success)
{
    // Some properties were not retrieved because we are apparently trying to load
    // outdated history. No reason to raise an exception here, as this does not

A Member commented:

I'm a bit confused about how/why we'd find outdated history at this point. Shouldn't we have filtered out all the outdated history events via the check on line 173 (where we check the ExecutionId value)?

@AnatoliB (Collaborator, Author) replied:

Exactly, this is the source of our troubles. Line 173 doesn't perform proper filtering because executionId contains an old execution ID: we passed an old expectedExecutionId to GetHistoryEntitiesResponseInfoAsync, so results.Entities[0] also belongs to the outdated history, and line 173 filters out everything unrelated to the old execution. Eventually, if we don't encounter missing blobs, we hit this line:

message = runtimeState.Events.Count == 0 ? "No such instance" : "Invalid history (may have been overwritten by a newer instance)";
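
Reconstructed as a simplified walkthrough (identifiers such as GetHistoryEntitiesResponseInfoAsync and results.Entities[0] follow the discussion above; the exact argument list, surrounding types, and the assumption that this runs inside the tracking-store class are mine):

async Task LoadHistoryForStaleMessageAsync(
    TaskMessage staleMessage, string instanceId, CancellationToken cancellationToken)
{
    // The stale TaskCompleted message still carries the previous execution ID.
    string expectedExecutionId = staleMessage.OrchestrationInstance.ExecutionId;

    // The history query is keyed on that stale ID, so every entity it returns
    // belongs to the outdated execution. (Argument list approximate.)
    var results = await this.GetHistoryEntitiesResponseInfoAsync(
        instanceId, expectedExecutionId, cancellationToken);

    // Line 173 then derives its reference ID from the first returned entity,
    // which is itself stale, so all the stale rows pass the filter...
    string executionId = results.Entities[0].GetString("ExecutionId");

    // ...and the worker proceeds to materialize blob-backed payloads that
    // ContinueAsNew already deleted, retrying the download indefinitely.
}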

@AnatoliB (Collaborator, Author) commented

> Thank you for this PR. If I understand the description correctly, the root cause of the problem is that we're attempting to load old, partially deleted history. If I understand the fix correctly, we're trying to catch a very specific exception (blob not found) for a very narrow use case in order to detect this condition.

@cgillum Yes, you understand the description and the approach correctly. I've been intentionally trying to make it a very targeted fix because I'm not super-familiar with this code, so I may not realize all the consequences.

> I wonder if we can make a broader fix and still keep things relatively simple. Note that whenever we load history, we're already going to load the sentinel row, so there shouldn't be a need to fetch it explicitly like we do in this PR. What if we instead loaded the history as normal but validated it first, before attempting to download or deserialize any blobs? By "validating", I mean (at a minimum) confirming that each entity has an execution ID that matches the sentinel row. If validation fails, we don't even attempt to load the blob at all and simply discard the message. This would be more efficient and resolve a broader range of potential issues.

Thank you for pointing out that we already have the sentinel row, so there's no need to fetch it separately or even to change the query in GetHistoryEntitiesResponseInfoAsync; I missed this somehow! I wanted to avoid hurting performance just to mitigate a relatively rare corner case, but since we already have the sentinel anyway, this may be a good idea. I'll try it.

@AnatoliB (Collaborator, Author) commented

@cgillum I've updated the fix to rely on the execution ID in the sentinel row and skip loading history if it doesn't match the expected ID.

@cgillum (Member) left a comment

I like the simplicity of the new approach! Does the PR description need to be updated?

In any case, I'm good with this change. I left a comment for an optional change, but I'm fine if you think it's not worth risking further changes.

Co-authored-by: Chris Gillum <cgillum@microsoft.com>

@AnatoliB (Collaborator, Author) commented

Updated the PR description.
