Enhance purge with parallel batch deletes and partial purge timeout #1321

Open
YunchuWang wants to merge 21 commits into main from wangbill/enpurge

YunchuWang commented Mar 18, 2026

Summary

Enhance the Azure Storage purge implementation with parallel batch deletes, CancellationToken-based partial purge timeout, improved error handling, and comprehensive tests.

Motivation

Purging large numbers of orchestration instances (100K+) with the current implementation causes:

  1. Timeouts: Sequential batch deletes are too slow, causing gRPC deadline timeouts in the isolated worker
  2. Storage errors: DeleteBatchAsync fails with 404 when entities are already deleted (race condition)
  3. Silent data loss: gRPC cancellation kills the response but not the in-flight storage operations, so the caller has no visibility into progress
  4. No progress tracking: No way to know how many instances remain

Changes

Core (DurableTask.Core)

  • PurgeInstanceFilter.Timeout (TimeSpan?): Optional timeout for partial purge
  • PurgeResult.IsComplete (bool?): Already existed, now properly populated

Azure Storage (DurableTask.AzureStorage)

  • PurgeHistoryResult.IsComplete: New property + constructor overload, forwarded via ToCorePurgeHistoryResult()
  • AzureStorageOrchestrationService.PurgeInstanceHistoryAsync(..., TimeSpan timeout): New overload
  • AzureTableTrackingStore.DeleteHistoryAsync: CancellationToken-based timeout using linked CancellationTokenSource
  • Table.DeleteBatchParallelAsync: New parallel batch delete with concurrent transactions and 404 fallback
  • MessageManager.DeleteLargeMessageBlobs: Fixed 404 handling with try/catch instead of ExistsAsync + delete
  • Concurrency control: SemaphoreSlim(100) for instance-level parallelism
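
The parallel batch delete with 404 fallback listed above might look roughly like the following. This is a minimal sketch of the pattern, not the PR's code: BatchDeleteSketch, NotFoundException, the two delegate parameters, and the defaults of 100 keys per chunk and 10 concurrent chunks are hypothetical stand-ins for the real storage calls. It also assumes a failed batch is atomic, so nothing in the chunk was deleted when it throws.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a storage "404 Not Found" failure.
public sealed class NotFoundException : Exception { }

public static class BatchDeleteSketch
{
    // Deletes keys in chunks, running up to maxParallel chunks at once.
    // Returns the number of keys this call actually deleted.
    public static async Task<int> DeleteParallelAsync(
        IReadOnlyList<string> keys,
        Func<IReadOnlyList<string>, CancellationToken, Task> deleteBatch,
        Func<string, CancellationToken, Task> deleteOne,
        int chunkSize = 100,
        int maxParallel = 10,
        CancellationToken cancellationToken = default)
    {
        // Split the keys into fixed-size chunks (one transaction per chunk).
        var chunks = new List<List<string>>();
        for (int i = 0; i < keys.Count; i += chunkSize)
        {
            chunks.Add(keys.Skip(i).Take(chunkSize).ToList());
        }

        using var throttle = new SemaphoreSlim(maxParallel);
        int[] counts = await Task.WhenAll(chunks.Select(async chunk =>
        {
            await throttle.WaitAsync(cancellationToken);
            try
            {
                return await DeleteChunkAsync(chunk, deleteBatch, deleteOne, cancellationToken);
            }
            finally
            {
                throttle.Release();
            }
        }));

        return counts.Sum();
    }

    private static async Task<int> DeleteChunkAsync(
        List<string> chunk,
        Func<IReadOnlyList<string>, CancellationToken, Task> deleteBatch,
        Func<string, CancellationToken, Task> deleteOne,
        CancellationToken cancellationToken)
    {
        try
        {
            await deleteBatch(chunk, cancellationToken);
            return chunk.Count;
        }
        catch (NotFoundException)
        {
            // Fallback: the batch was rejected because some key is already gone.
            // Retry key by key and treat "not found" as success (idempotent delete).
            int deleted = 0;
            foreach (string key in chunk)
            {
                try
                {
                    await deleteOne(key, cancellationToken);
                    deleted++;
                }
                catch (NotFoundException)
                {
                    // Already deleted by someone else; nothing to do.
                }
            }

            return deleted;
        }
    }
}
```

The fallback trades one failed transaction for up to chunkSize individual requests, which only costs extra when a race has actually occurred.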

Behavior

When Timeout is set:

  • Creates a CancellationTokenSource(timeout) linked with the caller's CancellationToken
  • Passes the effective token to table queries, throttle waits, and ThrowIfCancellationRequested
  • On timeout: catches OperationCanceledException, waits for in-flight deletions to settle, returns IsComplete = false
  • Already-dispatched instance deletions use effectiveToken, so they can be cancelled in flight when the timeout fires

When Timeout is not set:

  • Existing behavior unchanged (IsComplete = null for backward compatibility)
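
The timeout behavior above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in DeleteHistoryAsync: PartialPurgeSketch and the deleteInstance delegate are hypothetical, and the loop is sequential to isolate the cancellation logic, whereas the PR dispatches deletions in parallel.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class PartialPurgeSketch
{
    // IsComplete is null when no timeout was requested, true when every
    // instance was processed, and false when the timeout fired first.
    public static async Task<(int Deleted, bool? IsComplete)> PurgeAsync(
        IEnumerable<string> instanceIds,
        Func<string, CancellationToken, Task> deleteInstance,
        TimeSpan? timeout,
        CancellationToken cancellationToken = default)
    {
        // Link to the caller's token so that either the caller or the
        // timeout cancels the effective token.
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        if (timeout.HasValue)
        {
            linkedCts.CancelAfter(timeout.Value);
        }

        CancellationToken effectiveToken = linkedCts.Token;
        int deleted = 0;
        try
        {
            foreach (string id in instanceIds)
            {
                effectiveToken.ThrowIfCancellationRequested();
                await deleteInstance(id, effectiveToken);
                deleted++;
            }
        }
        catch (OperationCanceledException)
            when (timeout.HasValue && !cancellationToken.IsCancellationRequested)
        {
            // The timeout fired (not the caller): report partial progress.
            return (deleted, false);
        }

        return (deleted, timeout.HasValue ? true : (bool?)null);
    }
}
```

The exception filter keeps caller-initiated cancellation propagating as an OperationCanceledException, while a timeout is converted into a partial result.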

Benchmark Results

100K Instances (EP1, separate ASPs/storage)

| Metric | Baseline (stock) | Optimized | Delta |
| --- | --- | --- | --- |
| Total Deleted | 28,702 | 99,949 | 3.5x |
| Purge Rate | 48.7 inst/s | 336.5 inst/s | 6.9x |
| Errors | 16 | 0 | Error-free |

500K Instances (EP1, isolated worker SDK path with 25s timeout)

| Metric | Baseline (no timeout) | Optimized (25s timeout) | Delta |
| --- | --- | --- | --- |
| Reported Deleted | 17,402 (3.5%) | 499,560 (99.9%) | 28.7x |
| Purge Rate | 12.3 inst/s | 318.1 inst/s | 25.9x |
| Errors | 41 (95%) | 0 | Error-free |
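
With a per-call timeout such as the 25s used here, a caller would presumably repeat the purge until IsComplete is no longer false. A minimal sketch of such a loop follows. PurgeLoopSketch, the purgeOnce delegate, and the maxAttempts guard are hypothetical, and purgeOnce stands in for whatever purge API the caller uses.

```csharp
using System;
using System.Threading.Tasks;

public static class PurgeLoopSketch
{
    // Calls purgeOnce repeatedly and returns the total number of deleted instances.
    public static async Task<int> PurgeUntilCompleteAsync(
        Func<TimeSpan, Task<(int Deleted, bool? IsComplete)>> purgeOnce,
        TimeSpan perCallTimeout,
        int maxAttempts = 1000)
    {
        int total = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            var (deleted, isComplete) = await purgeOnce(perCallTimeout);
            total += deleted;

            // Only an explicit false means more work remains. Both true and
            // null (no timeout semantics) end the loop.
            if (isComplete != false)
            {
                return total;
            }
        }

        throw new TimeoutException(
            $"Purge still incomplete after {maxAttempts} attempts ({total} deleted so far).");
    }
}
```

Each call stays under the caller's own deadline, and the running total gives the progress visibility that a single long-running call lacks.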

Breaking Changes

None. All changes are additive:

  • New optional Timeout property on PurgeInstanceFilter
  • New constructor overload on PurgeHistoryResult
  • New PurgeInstanceHistoryAsync overload (original method unchanged)
  • Internal interface/base class changes are non-public

Tests Added

  • PartialPurge_TimesOutThenCompletesOnRetry
  • PartialPurge_GenerousTimeout_CompletesAll
  • PartialPurge_WithoutTimeout_ReturnsNullIsComplete
  • PurgeMultipleInstancesHistoryByTimePeriod_ScalabilityValidation
  • PurgeSingleInstanceWithIdempotency
  • PurgeSingleInstance_WithLargeBlobs_CleansUpBlobs
  • PurgeInstance_WithManyHistoryRows_DeletesAll
  • 9 unit tests for DeleteBatchParallelAsync

Related PRs

YunchuWang and others added 7 commits March 13, 2026 15:33
- Add TimeSpan? Timeout to PurgeInstanceFilter for partial purge support
- Add bool? IsComplete to PurgeHistoryResult to indicate completion status
- Add new PurgeInstanceHistoryAsync overload with TimeSpan timeout parameter
- Use CancellationToken-based timeout (linked CTS) in DeleteHistoryAsync
- Already-dispatched deletions complete before returning partial results
- Backward compatible: no timeout = original behavior (IsComplete = null)
- Forward IsComplete through ToCorePurgeHistoryResult to PurgeResult
- Add scenario tests for partial purge timeout, generous timeout, and compat
- Always cap timeout to 30s max, even if not specified or exceeds 30s
- Pass effectiveToken into DeleteAllDataForOrchestrationInstance so in-flight deletes are also cancelled on timeout
- Catch OperationCanceledException from Task.WhenAll for timed-out in-flight deletes
- External cancellationToken cancellation still propagates normally
Copilot AI review requested due to automatic review settings March 18, 2026 19:24

Copilot AI left a comment

Pull request overview

Improves purge scalability and robustness for DurableTask’s Azure Storage backend by adding parallelized table batch deletes, optional timeout-based partial purging, better 404/idempotency handling, and expanded test coverage.

Changes:

  • Add optional purge timeout (PurgeInstanceFilter.Timeout) and propagate completion status via IsComplete into core PurgeResult.
  • Implement parallel table batch deletion with 404 fallback to per-entity deletes.
  • Add scenario + unit tests for partial purge behavior, blob cleanup, and parallel batch delete behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| test/DurableTask.AzureStorage.Tests/TestOrchestrationClient.cs | Adds helper API to invoke the new timed purge overload in tests. |
| test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds new purge/partial-purge scenario tests and validation for large-message blob cleanup. |
| src/DurableTask.Core/PurgeInstanceFilter.cs | Introduces optional Timeout for partial purge. |
| src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends purge-by-time signature to include an optional timeout. |
| src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Extends tracking store purge API contract to include optional timeout. |
| src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-aware, parallel purge-by-time behavior and uses parallel batch delete. |
| src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with transactional chunking and 404 fallback. |
| src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete and forwards it to core PurgeResult. |
| src/DurableTask.AzureStorage/MessageManager.cs | Improves 404 handling for large-message blob deletion by relying on list/delete with exception handling. |
| src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Adds timed purge overload and wires PurgeInstanceFilter.Timeout into the call path. |
| Test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests validating parallel batch delete chunking, fallback, and cancellation behavior. |

- Hard-code 30s CancellationToken-based timeout in DeleteHistoryAsync
- Remove configurable Timeout from PurgeInstanceFilter (not needed)
- Remove timeout overload from AzureStorageOrchestrationService
- IsComplete = true when all purged within 30s, false when timed out
- Callers loop until IsComplete = true for large-scale purge
- Add TimeSpan? Timeout property to PurgeInstanceFilter (opt-in, default null)
- When null: unbounded purge, IsComplete=null (backward compat, no behavior change)
- When set: CancellationToken-based timeout, IsComplete=true/false
- Thread Timeout through IOrchestrationServicePurgeClient path
- Zero breaking changes: existing callers unaffected

Copilot AI left a comment

Pull request overview

This PR enhances the Azure Storage purge pipeline to improve throughput and reliability for large purges by introducing parallelized batch deletes, a timeout-driven partial purge mechanism, and forwarding completion status back to the core purge result shape.

Changes:

  • Added PurgeInstanceFilter.Timeout and plumbed timeout support into Azure Storage tracking-store purging.
  • Implemented Table.DeleteBatchParallelAsync with 404/idempotency fallback and updated purge to use it.
  • Added/updated purge-related tests and extended purge result types to carry IsComplete.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds new purge scenario tests for scalability/idempotency/large-blob cleanup and a test intended to validate completion semantics. |
| src/DurableTask.Core/PurgeInstanceFilter.cs | Adds Timeout option to the core purge filter contract. |
| src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends time-range purge signature to accept optional timeout. |
| src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Extends tracking store purge API with an optional timeout parameter. |
| src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-aware, parallel instance purging and returns IsComplete based on timeout. |
| src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with parallel transactions and 404 fallback to individual deletes. |
| src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete to AzureStorage purge result and forwards it to DurableTask.Core.PurgeResult. |
| src/DurableTask.AzureStorage/MessageManager.cs | Improves 404 handling for large message blob cleanup by relying on try/catch rather than container existence checks. |
| src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Wires PurgeInstanceFilter.Timeout into the tracking-store purge path used by IOrchestrationServicePurgeClient. |
| Test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests for DeleteBatchParallelAsync (but currently placed outside the referenced test project directory). |

- Update PurgeInstanceFilter.Timeout docs: in-flight deletions are cancelled (intentional)
- Add using var for SemaphoreSlim disposal
- Fix DateTime.Now/UtcNow mixing in purge tests (use UtcNow consistently)
- Rename PurgeReturnsIsComplete test to match actual assertions
- Move TableDeleteBatchParallelTests.cs from Test/ to test/ (correct project path)
- Fix typos: grater->greater, status->statuses in XML docs
- Use LINQ Select for foreach loop per code quality suggestion
Copilot AI review requested due to automatic review settings March 20, 2026 00:39

Copilot AI left a comment

Pull request overview

This PR improves the Azure Storage purge pipeline to better handle large-scale instance purges by adding parallelized table batch deletes, introducing an optional timeout for partial purges, and improving idempotency around already-deleted storage artifacts. It also expands scenario/unit test coverage to validate the new purge behaviors and scalability characteristics.

Changes:

  • Add PurgeInstanceFilter.Timeout and propagate IsComplete via PurgeHistoryResult → PurgeResult.
  • Implement parallel table batch deletion with 404 fallback to per-entity deletes.
  • Update purge and blob cleanup implementations for better cancellation/timeout behavior and add comprehensive tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| test/DurableTask.AzureStorage.Tests/Storage/TableDeleteBatchParallelTests.cs | Adds unit tests validating new parallel batch delete behavior (including 404 fallback and cancellation). |
| test/DurableTask.AzureStorage.Tests/AzureStorageScenarioTests.cs | Adds end-to-end purge scenario tests and uses UTC timestamps for purge windows. |
| src/DurableTask.Core/PurgeInstanceFilter.cs | Introduces optional Timeout for partial purge semantics. |
| src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Extends tracking store purge API shape to accept optional timeout. |
| src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Updates tracking store interface to include optional timeout parameter. |
| src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implements timeout-linked cancellation + throttled parallel instance purges and uses parallel history row deletes. |
| src/DurableTask.AzureStorage/Storage/Table.cs | Adds DeleteBatchParallelAsync with concurrent chunk submission and 404 fallback behavior. |
| src/DurableTask.AzureStorage/PurgeHistoryResult.cs | Adds IsComplete and forwards completion to core PurgeResult. |
| src/DurableTask.AzureStorage/MessageManager.cs | Improves large-message blob deletion to handle missing containers via exception-based 404 handling. |
| src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Threads the new timeout value through purge calls and fixes doc typos. |

YunchuWang and others added 2 commits March 19, 2026 18:57
- PurgeHistoryResultTests: constructor IsComplete (true/false/null), ToCorePurgeHistoryResult propagation, backward compat
- PurgeInstanceFilterTests: Timeout default null, set/reset, PurgeResult IsComplete tri-state, old constructor compat
- Remove unused using System.Collections.Concurrent (#1)
- Pass original cancellationToken (not effectiveToken) to in-flight deletes (#3)
- Update ITrackingStore doc to include Canceled status (#4)
- Use wall-clock Stopwatch for DeleteBatchParallelAsync Elapsed (#5)
Copilot AI review requested due to automatic review settings March 20, 2026 04:04
@YunchuWang
Member Author

Regarding the pendingTasks memory concern: with the new opt-in timeout feature (default 30s when used), the number of pending tasks is naturally bounded by how many instances can be dispatched within the timeout window (roughly 100 concurrent deletions over 30s, so a few thousand tasks at most). For the no-timeout path (backward compat), the existing behavior is preserved, and SemaphoreSlim(100) already limits actual concurrency. Switching to Parallel.ForEachAsync would be a larger refactor that changes the async enumeration pattern, so it is better suited for a follow-up.


Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.


- Revert to effectiveToken so in-flight deletes are cancelled on timeout
- Update PurgeInstanceFilter.Timeout XML doc to match behavior
- Docs and comments now consistently say in-flight deletes are cancelled

Copilot AI commented Mar 20, 2026

@YunchuWang I've opened a new pull request, #1325, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 20, 2026 18:03

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/DurableTask.AzureStorage/MessageManager.cs:1

  • storageOperationCount no longer counts the list operation when blobs exist (it only counts deletes and only adds “1” when there are no blobs). If this value is used as “requests sent to storage,” it will undercount in the common case where blobs exist. Consider initializing to 1 before enumerating (to count the list) and then adding delete counts, or explicitly incrementing for the list call regardless of whether any blobs were found.

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 20, 2026 18:24

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.


Copilot AI review requested due to automatic review settings March 20, 2026 21:24

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

