ResourceNotFound exception during concurrent orchestrationService.PurgeInstanceHistoryAsync call #1210

@apogarg

Problem statement

We have implemented a background cleanup service that purges orchestrations in a terminal state by calling orchestrationService.PurgeInstanceHistoryAsync

public Task<PurgeHistoryResult> PurgeInstanceHistoryAsync(DateTime createdTimeFrom, DateTime? createdTimeTo, IEnumerable<OrchestrationStatus> runtimeStatus)
every 12 hours. We recently got a DurableTaskStorageException whose inner exception is a TableTransactionFailedException with error code ResourceNotFound.

Setup

  • The background cleanup service runs on multiple pods, so purge can be called on the same instance multiple times.
  • Each orchestration instance has 6-7 activities.

Investigation findings

  • This exception is usually thrown when there are a lot of instances to be cleaned up.
  • The underlying logic in the tracking store lists all instances within the time window and calls delete on each of them individually. Refer to:
    var options = new ParallelOptions { MaxDegreeOfParallelism = this.settings.MaxStorageOperationConcurrency };
    We believe this to be the cause of the exception. Because the background cleanup runs on multiple pods, delete can be called on an instance that another pod has already deleted, which causes the ResourceNotFound exception. This also fits the first observation: the chance of such a collision grows with the number of instances to be deleted.
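If the race described above is indeed the cause, one caller-side workaround might be to treat ResourceNotFound from a purge as "another pod got there first" and carry on. The following is only a minimal sketch under that assumption. PurgeRetry, IsResourceNotFound and TryPurgeAsync are hypothetical names, the purge call is passed in as a delegate instead of calling the library directly, and the check assumes the error code text appears in the message of some exception in the chain, which should be verified against the real exception.

```csharp
using System;
using System.Threading.Tasks;

public static class PurgeRetry
{
    // Walks the exception chain (outer exception first, then each InnerException)
    // looking for the "ResourceNotFound" error code.
    // Assumption: the code appears in the Message of one of the exceptions.
    public static bool IsResourceNotFound(Exception ex)
    {
        for (var e = ex; e != null; e = e.InnerException)
        {
            if (e.Message.Contains("ResourceNotFound", StringComparison.Ordinal))
            {
                return true;
            }
        }
        return false;
    }

    // Runs the supplied purge delegate.
    // Returns true if it completed, false if it failed with ResourceNotFound
    // (assumed to mean a concurrent purge already removed the rows).
    // Any other exception propagates to the caller.
    public static async Task<bool> TryPurgeAsync(Func<Task> purge)
    {
        try
        {
            await purge();
            return true;
        }
        catch (Exception ex) when (IsResourceNotFound(ex))
        {
            return false;
        }
    }
}
```

A caller would wrap the existing call, for example `await PurgeRetry.TryPurgeAsync(() => orchestrationService.PurgeInstanceHistoryAsync(from, to, statuses))`. Note that a false result only says this batch was interrupted; instances later in the same window may still be unpurged and would be picked up by the next run.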

Possible solutions

  • Is it advisable to purge an orchestration as soon as it completes? This would reduce the load on the background cleanup service and could even remove the need for it altogether.
  • We could move the background cleanup to a separate service (right now it runs in the same service that hosts the orchestration logic), but could that affect locks and add latency to already running orchestrations? I believe not, as there is no locking logic in the purge call.
  • Is there any general advice on purging instances?
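To make the first option concrete, here is a rough sketch of purging a single instance once it reaches a terminal state. Everything in it is hypothetical: PurgeOnCompletion and RunThenPurgeAsync are illustrative names, and both the wait and the per-instance purge are passed in as delegates, since which client and purge calls to use depends on the setup.

```csharp
using System;
using System.Threading.Tasks;

public static class PurgeOnCompletion
{
    // Waits for one instance to reach a terminal state, then purges it.
    // waitForTerminalState: returns true if the instance finished within the
    //   caller's timeout, false otherwise (hypothetical delegate).
    // purgeInstance: purges the history of a single instance (hypothetical delegate).
    // Returns true if the instance was purged here, false if it was left for
    // the periodic background cleanup.
    public static async Task<bool> RunThenPurgeAsync(
        string instanceId,
        Func<string, Task<bool>> waitForTerminalState,
        Func<string, Task> purgeInstance)
    {
        bool terminal = await waitForTerminalState(instanceId);
        if (!terminal)
        {
            // Still running (or timed out): do not purge a live instance.
            return false;
        }

        await purgeInstance(instanceId);
        return true;
    }
}
```

Since each instance would then be purged by the code path that started it, two pods should rarely target the same instance. The 12-hour sweep could remain as a fallback for instances that were missed.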
