Commit 2363926

Bernd Verst and Copilot committed
Add gRPC connection resiliency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 762b247 commit 2363926

4 files changed

Lines changed: 32 additions & 33 deletions

CHANGELOG.md

Lines changed: 12 additions & 18 deletions
@@ -12,16 +12,15 @@ ADDED
 - Added `GrpcChannelOptions` and `GrpcRetryPolicyOptions` for configuring
 gRPC transport behavior, including message-size limits, keepalive settings,
 and channel-level retry policy service configuration.
-- Added `GrpcWorkerResiliencyOptions` and `GrpcClientResiliencyOptions` for
-configuring public gRPC reconnect, hello timeout, and channel recreation
-thresholds.
 - Added optional `channel` and `channel_options` parameters to
 `TaskHubGrpcClient`, `AsyncTaskHubGrpcClient`, and `TaskHubGrpcWorker` to
 support pre-configured channel passthrough and low-level gRPC channel
 customization.
-- Added optional `resiliency_options` parameters to `TaskHubGrpcClient`,
-`AsyncTaskHubGrpcClient`, and `TaskHubGrpcWorker` so applications can pass
-gRPC resiliency settings through constructor APIs.
+- Added `GrpcWorkerResiliencyOptions` and `GrpcClientResiliencyOptions`, plus
+`resiliency_options` constructor parameters on `TaskHubGrpcClient`,
+`AsyncTaskHubGrpcClient`, and `TaskHubGrpcWorker`, to configure hello
+deadlines, silent-disconnect detection, reconnect backoff, and channel
+recreation thresholds for SDK-managed gRPC connections.
 - Added `get_orchestration_history()` and `list_instance_ids()` to the sync
 and async gRPC clients.
 - Added in-memory backend support for `StreamInstanceHistory` and
@@ -30,18 +29,13 @@ ADDED
 
 FIXED
 
-- Hardened `TaskHubGrpcWorker` reconnect handling so configured hello timeouts
-apply on fresh connections, received work items reset failure tracking,
-SDK-owned channels are cleaned up on shutdown and full resets, and
-caller-owned channels are never recreated or closed during worker reconnects.
-- Fixed sync `TaskHubGrpcClient` transport resiliency so SDK-owned channels are
-recreated after repeated transport failures while long-poll timeout
-deadlines, successful replies, and application-level RPC errors reset the
-failure tracker.
-- Fixed async `AsyncTaskHubGrpcClient` transport resiliency so SDK-owned
-channels are recreated after repeated transport failures while long-poll
-timeout deadlines, successful replies, and application-level RPC errors
-reset the failure tracker.
+- Improved `TaskHubGrpcWorker` recovery from stale or disconnected gRPC streams
+so configured hello timeouts apply on fresh connections, received work resets
+failure tracking, SDK-owned channels are refreshed and cleaned up safely, and
+caller-owned channels are never recreated or closed during reconnects.
+- Improved sync and async gRPC clients so repeated transport failures recreate
+SDK-owned channels, while long-poll deadlines, successful replies, and
+application-level RPC errors do not trigger unnecessary channel replacement.
 
 ## v1.4.0

docs/superpowers/specs/2026-04-23-grpc-resiliency-design.md

Lines changed: 16 additions & 12 deletions
@@ -135,35 +135,39 @@ The monitor reports one of these outcomes:
 The outer worker loop uses those outcomes as follows:
 
 - `message_received`: reset health counters
-- `graceful_close_before_first_message`: count as channel poison
-- `graceful_close_after_message`: reconnect immediately without poisoning the
-channel
+- `graceful_close_before_first_message`: immediately reset the current stream
+and force a fresh SDK-owned channel on the next connect attempt
+- `graceful_close_after_message`: immediately reset the current stream and
+reconnect without incrementing the transport-failure counter
 - `silent_disconnect`: count as channel poison
 - `shutdown`: exit cleanly
 
-This keeps rolling upgrades and normal peer-driven reconnects from being
-treated the same as a stale half-open stream.
+This keeps rolling upgrades and normal peer-driven reconnects from inflating
+the failure threshold while still forcing SDK-owned workers to establish a
+fresh channel after graceful stream closures.
 
 #### Failure counting and recreation
 
 The worker increments the consecutive-failure counter only for
 transport-shaped failures:
 
 - `UNAVAILABLE`
-- `Hello` `DEADLINE_EXCEEDED`
+- `DEADLINE_EXCEEDED`
 - explicit silent-disconnect timeout
-- graceful stream close before the first message
 
 It does not increment the counter for errors that channel recreation is
 unlikely to fix, such as:
 
 - `UNAUTHENTICATED`
 - `NOT_FOUND`
 - orchestration or activity execution failures
+- graceful stream closures before or after work items
 
 When the threshold is reached and the worker owns the channel, it recreates the
-channel and stub. When the worker does not own the channel, it keeps retrying
-the existing transport and logs that the channel could not be recreated.
+channel and stub. Graceful stream closures also force an immediate fresh
+SDK-owned channel even though they do not increment the threshold. When the
+worker does not own the channel, it keeps retrying the existing transport and
+logs that the channel could not be recreated.
 
 ### Client behavior
 
@@ -284,9 +288,9 @@ Add focused unit tests for the new behavior.
 
 - hello deadline failure counts toward recreation
 - silent-disconnect timeout is detected and classified
-- graceful close before the first message poisons the channel
-- graceful close after a message triggers reconnect without poisoning
-- user-supplied channels are not recreated
+- graceful stream closes force a fresh SDK-owned connection without increasing
+the failure counter
+- user-supplied channels are not recreated or closed
 
 ### Client tests

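Reviewer note: the revised worker rules in the spec diff above (outcome handling plus failure counting) can be read as a small state machine. The sketch below is only an illustration of those rules, not the SDK's implementation. `Outcome`, `Action`, `FailureTracker`, `threshold`, and `owns_channel` are hypothetical names, status codes are modeled as plain strings rather than `grpc.StatusCode`, and resetting the counter after a recreation is an assumption the spec does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Stream-monitor outcomes named in the spec."""
    MESSAGE_RECEIVED = "message_received"
    GRACEFUL_CLOSE_BEFORE_FIRST_MESSAGE = "graceful_close_before_first_message"
    GRACEFUL_CLOSE_AFTER_MESSAGE = "graceful_close_after_message"
    SILENT_DISCONNECT = "silent_disconnect"
    SHUTDOWN = "shutdown"


class Action(Enum):
    """Hypothetical decisions for the outer worker loop."""
    CONTINUE = "continue"
    RETRY_SAME_CHANNEL = "retry_same_channel"
    RECREATE_CHANNEL = "recreate_channel"
    EXIT = "exit"


# Status codes the spec lists as transport-shaped failures.
TRANSPORT_STATUS_CODES = frozenset({"UNAVAILABLE", "DEADLINE_EXCEEDED"})

GRACEFUL_CLOSES = frozenset({
    Outcome.GRACEFUL_CLOSE_BEFORE_FIRST_MESSAGE,
    Outcome.GRACEFUL_CLOSE_AFTER_MESSAGE,
})


@dataclass
class FailureTracker:
    threshold: int          # hypothetical: failures allowed before recreation
    owns_channel: bool      # False when the caller supplied the channel
    consecutive_failures: int = 0

    def _count_failure(self) -> Action:
        self.consecutive_failures += 1
        if self.consecutive_failures < self.threshold:
            return Action.RETRY_SAME_CHANNEL
        if self.owns_channel:
            # Assumption: the counter restarts once a fresh channel exists.
            self.consecutive_failures = 0
            return Action.RECREATE_CHANNEL
        # Caller-owned channel: keep retrying; a real worker would also log.
        return Action.RETRY_SAME_CHANNEL

    def on_outcome(self, outcome: Outcome) -> Action:
        if outcome is Outcome.SHUTDOWN:
            return Action.EXIT
        if outcome is Outcome.MESSAGE_RECEIVED:
            self.consecutive_failures = 0
            return Action.CONTINUE
        if outcome in GRACEFUL_CLOSES:
            # No counter increment, but SDK-owned channels are refreshed.
            if self.owns_channel:
                return Action.RECREATE_CHANNEL
            return Action.RETRY_SAME_CHANNEL
        # Remaining case: SILENT_DISCONNECT counts as a transport failure.
        return self._count_failure()

    def on_rpc_error(self, status_code: str) -> Action:
        if status_code in TRANSPORT_STATUS_CODES:
            return self._count_failure()
        # e.g. UNAUTHENTICATED or NOT_FOUND: recreation is unlikely to help.
        return Action.RETRY_SAME_CHANNEL
```

Keeping the ownership check inside the tracker means a caller-supplied channel can never produce a `RECREATE_CHANNEL` decision, which matches the "not recreated or closed" test bullet in the diff.
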
durabletask-azuremanaged/CHANGELOG.md

Lines changed: 3 additions & 3 deletions
@@ -11,10 +11,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 `DurableTaskSchedulerClient`, `AsyncDurableTaskSchedulerClient`, and
 `DurableTaskSchedulerWorker` to allow combining custom gRPC interceptors with
 DTS defaults and to support pre-configured/customized gRPC channels.
-- Added optional `resiliency_options` parameters to
+- Added pass-through `resiliency_options` support on
 `DurableTaskSchedulerClient`, `AsyncDurableTaskSchedulerClient`, and
-`DurableTaskSchedulerWorker` so applications can pass gRPC resiliency
-settings through their constructors.
+`DurableTaskSchedulerWorker` so Azure Managed applications can use the core
+SDK's gRPC resiliency option types through their constructors.
 - Added `workerid` gRPC metadata on Durable Task Scheduler worker calls for
 improved worker identity and observability.
 - Improved sync access token refresh concurrency handling to avoid duplicate

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ include = ["durabletask", "durabletask.*"]
 minversion = "6.0"
 testpaths = ["tests"]
 asyncio_mode = "auto"
+addopts = "--import-mode=importlib"
 markers = [
 "azurite: tests that require Azurite (local Azure Storage emulator)",
 ]
