ResourceNotFound exception during concurrent orchestrationService.PurgeInstanceHistoryAsync call #1210

## Problem statement
We have implemented a background cleanup service that purges orchestrations in a terminal state by calling [`orchestrationService.PurgeInstanceHistoryAsync`](https://github.com/Azure/durabletask/blob/6d09d0353383e25caba94bdee862532b6175a847/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs#L1986) every 12 hours. We recently got a **DurableTaskStorageException** with an inner exception of **TableTransactionFailedException** and error code **ResourceNotFound**.

### Setup
- The background cleanup service runs on multiple pods, so purge can be called on the same instance more than once.
- Each orchestration instance has 6-7 activities.

### Investigation findings
- The exception is usually thrown when there are many instances to clean up.
- The tracking store lists all instances within the time window and deletes them one by one (see https://github.com/Azure/durabletask/blob/1514129bac2cbdc67b664fca61f919ff1c225ee3/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs#L576). We believe this is the cause of the exception: because the cleanup runs on multiple pods, one pod can call delete on an instance that another pod has already deleted, which raises ResourceNotFound. This also fits the first observation, since the chance of such a collision grows with the number of instances to delete.
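
As a stopgap on our side, the race described above could be tolerated by treating ResourceNotFound as "another pod already purged this". The following is only a minimal sketch under assumptions, not something taken from the library: `TolerantPurge`, `IsResourceNotFound` and `TryPurgeAsync` are hypothetical names, and it assumes the storage failure surfaces somewhere in the inner exception chain as an `Azure.RequestFailedException` (the base class of `TableTransactionFailedException`) whose `ErrorCode` is `"ResourceNotFound"`.

```csharp
using System;
using System.Threading.Tasks;
using Azure; // RequestFailedException, from the Azure.Core package

public static class TolerantPurge
{
    // Walks the InnerException chain and returns true if any exception is a
    // storage failure with error code "ResourceNotFound".
    public static bool IsResourceNotFound(Exception ex)
    {
        for (var e = ex; e != null; e = e.InnerException)
        {
            if (e is RequestFailedException rfe &&
                string.Equals(rfe.ErrorCode, "ResourceNotFound", StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }

    // Runs the supplied purge delegate. Returns true if it completed, or false
    // if it failed only because the data was already gone (assumed to mean a
    // concurrent purge on another pod won the race). Other errors propagate.
    public static async Task<bool> TryPurgeAsync(Func<Task> purge)
    {
        try
        {
            await purge();
            return true;
        }
        catch (Exception ex) when (IsResourceNotFound(ex))
        {
            return false;
        }
    }
}
```

The caller would pass its existing purge call as the delegate, for example `await TolerantPurge.TryPurgeAsync(() => orchestrationService.PurgeInstanceHistoryAsync(from, to, statuses))`, where `from`, `to` and `statuses` are placeholders for whatever arguments the service already uses. One caveat: if the exception aborts a time-window purge partway through, the remaining instances are presumably left for the next run, so this hides the error rather than removing the race.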

## Possible solutions
- Is it advisable to purge an orchestration as soon as it completes? This would reduce the load on the background cleanup service and might remove the need for it altogether.
- We could move the background cleanup into a separate service (right now it runs in the same service that hosts the orchestration logic). Could that affect locks or add latency to orchestrations that are already running? I believe not, as there is no locking logic in the purge call.
- Is there any general advice on purging instances?
