Open
Description
Problem statement
We have implemented a background cleanup service that purges orchestrations in a terminal state by calling orchestrationService.PurgeInstanceHistoryAsync:
public Task<PurgeHistoryResult> PurgeInstanceHistoryAsync(DateTime createdTimeFrom, DateTime? createdTimeTo, IEnumerable<OrchestrationStatus> runtimeStatus)
Setup
- The background cleanup service runs on multiple pods, so the same instance can be purged more than once.
- Each orchestration instance has 6-7 activities.
Investigation findings
- The ResourceNotFound exception is usually thrown when there are many instances to clean up.
- The underlying logic in the tracking store lists all instances within the time window and calls delete on each one individually (see the excerpt after this list). We believe this is the cause of the exception: because the background cleanup runs on multiple pods, delete can be called on an instance that another pod has already deleted, which raises the ResourceNotFound exception. This also matches the first observation, since the chance of such a collision grows with the number of instances to delete.
var options = new ParallelOptions { MaxDegreeOfParallelism = this.settings.MaxStorageOperationConcurrency };
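The suspected race can be reproduced without a storage account. The following is a minimal sketch, not the library's code: `FakeInstanceStore`, `NotFoundException`, and `TolerantPurge.PurgeAsync` are hypothetical names, and the in-memory dictionary only stands in for the tracking store. It shows two purge passes working from the same stale listing, where the second pass treats "not found" as "already deleted" instead of failing.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the storage "resource not found" error.
public sealed class NotFoundException : Exception
{
    public NotFoundException(string id) : base($"Instance '{id}' not found") { }
}

// Hypothetical in-memory stand-in for the tracking store.
public sealed class FakeInstanceStore
{
    private readonly ConcurrentDictionary<string, bool> instances =
        new ConcurrentDictionary<string, bool>();

    public FakeInstanceStore(IEnumerable<string> ids)
    {
        foreach (var id in ids) instances[id] = true;
    }

    // Snapshot of the instance IDs that exist right now.
    public IReadOnlyList<string> List() => instances.Keys.ToList();

    // Strict delete: throws when the instance is already gone.
    public Task DeleteAsync(string id)
    {
        if (!instances.TryRemove(id, out _)) throw new NotFoundException(id);
        return Task.CompletedTask;
    }
}

public static class TolerantPurge
{
    // Deletes every listed instance and returns how many this caller removed.
    public static async Task<int> PurgeAsync(FakeInstanceStore store, IReadOnlyList<string> listed)
    {
        int deleted = 0;
        foreach (var id in listed)
        {
            try
            {
                await store.DeleteAsync(id);
                deleted++;
            }
            catch (NotFoundException)
            {
                // Another pod purged this instance first; the desired end state is reached.
            }
        }
        return deleted;
    }
}
```

If both passes take their listing before either deletes anything, the first pass removes every instance and the second removes none, yet neither throws. Whether swallowing the error at this level is appropriate for the real tracking store is a question for the maintainers.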
Possible solutions
- Is it advisable to purge an orchestration as soon as it completes? That would reduce the load on the background cleanup service and might even make it unnecessary.
- We could move the background cleanup into a separate service (right now it runs in the same service that hosts the orchestration logic). Could that affect locks or add latency to orchestrations that are already running? I believe not, since there is no locking logic in the purge call.
- Is there any general advice on purging instances?
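To make the first option concrete, here is a rough sketch of purging a single instance once it is observed in a terminal state. Everything in it is an assumption for illustration: `Status` is a local stand-in rather than the library's enum, `PurgeOnCompletion.TryPurgeAsync` is a hypothetical helper, and the `purgeAsync` delegate would wrap whatever single-instance purge call is available in your setup.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Local stand-in for an orchestration runtime status (hypothetical, not the library enum).
public enum Status { Pending, Running, Completed, Failed, Terminated }

public static class PurgeOnCompletion
{
    private static readonly HashSet<Status> Terminal =
        new HashSet<Status> { Status.Completed, Status.Failed, Status.Terminated };

    // Invokes purgeAsync only for terminal instances; returns true when a purge was issued.
    public static async Task<bool> TryPurgeAsync(
        string instanceId, Status status, Func<string, Task> purgeAsync)
    {
        if (!Terminal.Contains(status)) return false; // still in progress; leave history alone
        await purgeAsync(instanceId);
        return true;
    }
}
```

Because each instance is purged by the component that saw it finish, the periodic sweep would only need to pick up instances whose purge was missed, which should also shrink the window in which several pods compete for the same instance.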