fix: don't hang on peer disconnect during BLE connect#108
Merged
Conversation
When the peer disconnects mid-`start_notify` (e.g. failed pairing on a device that requires it, then the peer drops the link after 30s), the BlueZ D-Bus `StartNotify` call never returns and `SMPBLETransport.connect()` hangs forever. The user-supplied `timeout_s` was also only honored for the initial scan, not for the rest of the connect flow. Refs intercreate/smpmgr#97. - Bound the entire `connect()` to `timeout_s` via `asyncio.wait_for`. - Pass `timeout=timeout_s` to `BleakClient` so its own connect honors it. - Race `start_notify` against `_disconnected_event` so a peer disconnect surfaces as `SMPTransportDisconnected` instead of hanging. - Best-effort `BleakClient.disconnect()` cleanup on any connect failure. - Bump `bleak>=3.0.2,<4` — `bleak` 2.0.x has its own AcquireNotify hang regression (hbldh/bleak#1885); pinning above it is the simplest way to keep downstream consumers off the broken line. - Bump `async-timeout>=5.0.1` (the 3.10-only fallback) to current stable.
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a BLE connection hang in SMPBLETransport.connect() when the peer disconnects during notification setup, enforces a single user timeout across the entire connect flow, and updates BLE-related dependency constraints to avoid known bleak regressions.
Changes:
- Wraps the full BLE connect flow in an overall
asyncio.wait_for(...)timeout and adds best-effort disconnect cleanup on failure. - Adds
_await_or_disconnect()to race a potentially-hanging GATT operation (e.g.start_notify) against a disconnect event. - Pins
bleak>=3.0.2,<4and bumpsasync-timeoutfor the Python <3.11 fallback.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/smpclient/transport/ble.py |
Enforces end-to-end connect timeout, adds disconnect-aware awaiting for start_notify, and adds best-effort cleanup on connect failure. |
tests/test_smp_ble_transport.py |
Adds regression tests covering peer disconnect + timeout while start_notify hangs. |
pyproject.toml |
Updates optional BLE dependency constraint and async-timeout minimum version. |
uv.lock |
Locks updated dependency versions (bleak 3.0.2, async-timeout 5.0.1, etc.). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Addresses two PR #108 review comments: - `connect()` was catching `BaseException`, which also swept up `KeyboardInterrupt` and `SystemExit` — async cleanup on those would delay process termination. Narrow to `(Exception, asyncio.CancelledError)` so Ctrl+C terminates promptly while connect-flow failures and caller-driven cancellation still get the best-effort disconnect. - `_await_or_disconnect()` cancelled and drained pending sub-tasks only after `asyncio.wait()` returned, so an outer cancel (e.g. the `asyncio.wait_for` timeout in `connect()`) raised at the `wait()` itself would leave `op_task` (the hung `start_notify`) and `disconnected_task` running. Move the cancel + drain into a `finally` so both sub-tasks are guaranteed cancelled and gathered before `_await_or_disconnect()` returns or propagates. Adds `test_connect_does_not_leak_tasks_on_external_cancel` which asserts no sub-tasks remain after the caller cancels `connect()` mid `start_notify`. The test fails without the `finally` fix.
Collaborator
Author
|
Going to merge pretty quick since it seems low risk and we do need the bleak 3 bump. |
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SMPBLETransport.connect()hang when the peer disconnects mid-start_notify(e.g. failed pairing on a device that requires it, then peer drops the link).--timeouthonor the entire connect flow, not just the initial scan.bleak>=3.0.2,<4to keep downstream consumers off the brokenbleak2.0.x AcquireNotify path (hbldh/bleak#1885).Closes the smpclient side of intercreate/smpmgr#97. A follow-up PR on smpmgr will close that ticket once a release with this fix is published.
Root cause
From the debug log on smpmgr#97, the sequence is:
BleakClient.connect()succeeds, services discovered, MTU acquired.await self._client.start_notify(...)issuesStartNotifyover D-Bus.StartNotifyD-Bus call, so the await never returns.disconnected_callbackdoes fire but there's nothing watching for it during connect.The user-supplied
timeout_swas also only threaded intoBleakScanner.find_device_by_address(timeout=...)— not intoBleakClient.connect()and not enforced over the wholeconnect(), which is why--timeout 10was being ignored on smpmgr ≥ 0.13.1.Changes
SMPBLETransport.connect()now wraps the body inasyncio.wait_for(timeout=timeout_s)and adds a best-effortBleakClient.disconnect()on any failure._await_or_disconnect()helper races a coroutine against_disconnected_eventand raisesSMPTransportDisconnectedif the peer disconnect wins (same wait+cancel shape as the existing_notify_or_disconnect).BleakClient(..., timeout=timeout_s, ...)so bleak's own connect timeout honors the caller's value.bleak>=3.0.2,<4(was>=2.0.0).bleak2.0.x has the AcquireNotify regression; 2.1.0+ recovered, and 3.0.2 is current stable.async-timeout>=5.0.1(the Python 3.10-only fallback) — current stable, no API change.Test plan
task all(ruff format / lint, pydoclint, mypy, full pytest suite)tests/test_smp_ble_transport.py:test_connect_raises_on_peer_disconnect_during_start_notify— mocks aBleakClientwhosestart_notifyhangs forever, then firesdisconnected_callback; assertsSMPTransportDisconnectedis raised andBleakClient.disconnect()is called for cleanup.test_connect_raises_on_timeout_during_start_notify— same hanging mock, callsconnect(..., 0.05), assertsasyncio.TimeoutErroris raised (which is whatsmpmgr.connect_with_spinneralready catches).🤖 Generated with Claude Code