Conversation

@AnatoliB (Collaborator) commented on Mar 6, 2025

Fixes Azure/azure-functions-durable-extension#3022.
Fixes #802.

Sometimes a duplicate message is delivered late, after the execution it belongs to has already completed. For example, this can happen when a message is in fact processed, but the worker fails to remove it from the queue for some reason, so the message automatically reappears in the queue later, potentially after the execution has finished. In most cases this is not a problem, and the message is eventually discarded without any negative consequences. However, here is a sequence of events that leads to a stuck orchestration:

  1. A TaskScheduled message is delivered to a worker, and the worker starts executing an activity.
  2. After successfully executing the activity, the worker fails to remove the TaskScheduled message from the queue, but the orchestration continues.
  3. The activity returns a large amount of data (>~45K). This data is stored in a blob, and the history table entry contains a reference to this blob.
  4. Eventually, the TaskScheduled message reappears in the queue and is picked up by a worker. The worker starts executing the activity again (which is acceptable because Durable Tasks guarantee at-least-once execution, not exactly-once).
  5. In the meantime, the orchestrator function invokes ContinueAsNew. As a result, all the blobs for the previous execution are deleted (but the history table entries remain).
  6. The worker finishes the second activity execution and produces a TaskCompleted message, which is eventually picked up.
  7. This TaskCompleted message carries the previous execution ID, so the worker tries to load the history for that execution ID. However, all the blobs are missing. The worker repeatedly tries to retrieve the blobs, hitting the same error each time and making no further progress on the orchestration. The orchestration is stuck indefinitely for no good reason.

This PR addresses the problem by making the history retrieval logic rely on the execution ID recorded in the sentinel row and skip history entries for different execution IDs, so there will be no attempt to retrieve missing blobs or interpret these history entries in any other way.
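
A minimal sketch of that idea (the method, parameter, and column names here are hypothetical stand-ins for illustration, not the actual DurableTask.AzureStorage code):

using System.Collections.Generic;
using Azure.Data.Tables; // assumes the Azure.Data.Tables SDK, for illustration only

static List<TableEntity> FilterToCurrentExecution(
    TableEntity sentinel,
    IEnumerable<TableEntity> historyEntities)
{
    // The sentinel row records the execution ID that the stored history
    // actually belongs to; treat it as the source of truth.
    string currentExecutionId = sentinel.GetString("ExecutionId");

    var current = new List<TableEntity>();
    foreach (TableEntity entity in historyEntities)
    {
        // Entries left over from an earlier execution (e.g. one replaced by
        // ContinueAsNew) are skipped outright, so no attempt is ever made to
        // download their possibly-deleted large-payload blobs.
        if (entity.GetString("ExecutionId") == currentExecutionId)
        {
            current.Add(entity);
        }
    }

    return current;
}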

@AnatoliB requested review from cgillum and davidmrdavid on March 6, 2025 21:41

@cgillum (Member) left a comment

Thank you for this PR. If I understand the description correctly, the root cause of the problem is that we're attempting to load old, partially deleted history. If I understand the fix correctly, we're trying to catch a very specific exception (blob not found) for a very narrow use case in order to detect this condition.

I wonder if we can make a broader fix and still keep things relatively simple. Note that whenever we load history, we're already going to load the sentinel row, so there shouldn't be a need to fetch it explicitly like we do in this PR. What if we instead loaded the history as normal but validated it first, before attempting to download or deserialize any blobs? By "validating", I mean (at a minimum) confirming that each entity has an execution ID that matches the sentinel row. If validation fails, we don't even attempt to load the blob at all and simply discard the message. This would be more efficient and resolve a broader range of potential issues.

Thoughts on this alternate approach?
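
To make the proposal concrete, a rough sketch (hypothetical names and signature; the real validation would live in the history-loading path). The history rows and the sentinel row come back from the same table query, so no extra round trip is needed:

using System.Collections.Generic;
using Azure.Data.Tables; // assumed SDK, for illustration only

static bool HistoryMatchesSentinel(
    TableEntity sentinel,
    IReadOnlyList<TableEntity> historyEntities)
{
    string expectedExecutionId = sentinel.GetString("ExecutionId");

    foreach (TableEntity entity in historyEntities)
    {
        // Any mismatch means this is leftover history from an older
        // execution; bail out before touching any blobs.
        if (entity.GetString("ExecutionId") != expectedExecutionId)
        {
            return false;
        }
    }

    return true;
}

// Caller sketch: on a failed validation, discard the stale message instead
// of deserializing blob-backed payloads that may no longer exist.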

if (!success)
{
    // Some properties were not retrieved because we are apparently trying to load
    // outdated history. No reason to raise an exception here, as this does not

A Member commented:

I'm a bit confused about how/why we'd find outdated history at this point. Shouldn't we have filtered out all the outdated history events via the check on line 173 (where we check the ExecutionId value)?

@AnatoliB (Collaborator, Author) replied:

Exactly, this is the source of our troubles. Line 173 doesn't perform proper filtering because executionId contains an old execution ID: we passed an old expectedExecutionId to GetHistoryEntitiesResponseInfoAsync, so results.Entities[0] also belongs to the outdated history, and line 173 filters out everything unrelated to the old execution. Eventually, if we don't encounter missing blobs, we hit this line:

message = runtimeState.Events.Count == 0 ? "No such instance" : "Invalid history (may have been overwritten by a newer instance)";
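
Reconstructed as a simplified walkthrough (identifiers such as GetHistoryEntitiesResponseInfoAsync and results.Entities[0] follow the discussion above; the exact argument list, surrounding types, and the assumption that this runs inside the tracking-store class are mine):

async Task LoadHistoryForStaleMessageAsync(
    TaskMessage staleMessage, string instanceId, CancellationToken cancellationToken)
{
    // The stale TaskCompleted message still carries the previous execution ID.
    string expectedExecutionId = staleMessage.OrchestrationInstance.ExecutionId;

    // The history query is keyed on that stale ID, so every entity it returns
    // belongs to the outdated execution. (Argument list approximate.)
    var results = await this.GetHistoryEntitiesResponseInfoAsync(
        instanceId, expectedExecutionId, cancellationToken);

    // Line 173 then derives its reference ID from the first returned entity,
    // which is itself stale, so all the stale rows pass the filter...
    string executionId = results.Entities[0].GetString("ExecutionId");

    // ...and the worker proceeds to materialize blob-backed payloads that
    // ContinueAsNew already deleted, retrying the download indefinitely.
}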

@AnatoliB (Collaborator, Author) commented

> Thank you for this PR. If I understand the description correctly, the root cause of the problem is that we're attempting to load old, partially deleted history. If I understand the fix correctly, we're trying to catch a very specific exception (blob not found) for a very narrow use case in order to detect this condition.

@cgillum Yes, you understand the description and the approach correctly. I've been intentionally trying to make it a very targeted fix because I'm not super-familiar with this code, so I may not realize all the consequences.

> I wonder if we can make a broader fix and still keep things relatively simple. Note that whenever we load history, we're already going to load the sentinel row, so there shouldn't be a need to fetch it explicitly like we do in this PR. What if we instead loaded the history as normal but validated it first, before attempting to download or deserialize any blobs? By "validating", I mean (at a minimum) confirming that each entity has an execution ID that matches the sentinel row. If validation fails, we don't even attempt to load the blob at all and simply discard the message. This would be more efficient and resolve a broader range of potential issues.

Thank you for pointing out that we already have the sentinel row, so there's no need to fetch it separately or even to change the query in GetHistoryEntitiesResponseInfoAsync; I missed this somehow! I wanted to avoid hurting performance just to mitigate a relatively rare corner case, but since we already have the sentinel anyway, this may be a good idea. I'll try it.

@AnatoliB (Collaborator, Author) commented

@cgillum I've updated the fix to rely on the execution ID in the sentinel row and skip loading history if it doesn't match the expected ID.

@cgillum (Member) left a comment

I like the simplicity of the new approach! Does the PR description need to be updated?

In any case, I'm good with this change. I left a comment for an optional change, but I'm fine if you think it's not worth risking further changes.

Co-authored-by: Chris Gillum <cgillum@microsoft.com>

@AnatoliB (Collaborator, Author) commented

Updated the PR description.
