Skip to content

fix: don't hang on peer disconnect during BLE connect#108

Merged
JPHutchins merged 2 commits into
mainfrom
fix/ble-disconnect-hang
May 22, 2026
Merged

fix: don't hang on peer disconnect during BLE connect#108
JPHutchins merged 2 commits into
mainfrom
fix/ble-disconnect-hang

Conversation

@JPHutchins
Copy link
Copy Markdown
Collaborator

Summary

  • Fixes the SMPBLETransport.connect() hang when the peer disconnects mid-start_notify (e.g. failed pairing on a device that requires it, then peer drops the link).
  • Makes --timeout honor the entire connect flow, not just the initial scan.
  • Pins bleak>=3.0.2,<4 to keep downstream consumers off the broken bleak 2.0.x AcquireNotify path (hbldh/bleak#1885).

Closes the smpclient side of intercreate/smpmgr#97. A follow-up PR on smpmgr will close that ticket once a release with this fix is published.

Root cause

From the debug log on smpmgr#97, the sequence is:

  1. BleakClient.connect() succeeds, services discovered, MTU acquired.
  2. await self._client.start_notify(...) issues StartNotify over D-Bus.
  3. The peer demands pairing for the CCCD write; pairing never completes; the peer drops the link 30 s later.
  4. BlueZ never replies to the StartNotify D-Bus call, so the await never returns. disconnected_callback does fire but there's nothing watching for it during connect.

The user-supplied timeout_s was also only threaded into BleakScanner.find_device_by_address(timeout=...) — not into BleakClient.connect() and not enforced over the whole connect(), which is why --timeout 10 was being ignored on smpmgr ≥ 0.13.1.

Changes

  • SMPBLETransport.connect() now wraps the body in asyncio.wait_for(timeout=timeout_s) and adds a best-effort BleakClient.disconnect() on any failure.
  • New _await_or_disconnect() helper races a coroutine against _disconnected_event and raises SMPTransportDisconnected if the peer disconnect wins (same wait+cancel shape as the existing _notify_or_disconnect).
  • BleakClient(..., timeout=timeout_s, ...) so bleak's own connect timeout honors the caller's value.
  • Bumped bleak>=3.0.2,<4 (was >=2.0.0). bleak 2.0.x has the AcquireNotify regression; 2.1.0+ recovered, and 3.0.2 is current stable.
  • Bumped async-timeout>=5.0.1 (the Python 3.10-only fallback) — current stable, no API change.

Test plan

  • task all (ruff format / lint, pydoclint, mypy, full pytest suite)
  • Two new regression tests in tests/test_smp_ble_transport.py:
    • test_connect_raises_on_peer_disconnect_during_start_notify — mocks a BleakClient whose start_notify hangs forever, then fires disconnected_callback; asserts SMPTransportDisconnected is raised and BleakClient.disconnect() is called for cleanup.
    • test_connect_raises_on_timeout_during_start_notify — same hanging mock, calls connect(..., 0.05), asserts asyncio.TimeoutError is raised (which is what smpmgr.connect_with_spinner already catches).
  • Bench against the reporter's hardware on smpmgr#97 once a release is cut (cannot reproduce locally — needs the failed-pairing peer).

🤖 Generated with Claude Code

When the peer disconnects mid-`start_notify` (e.g. failed pairing on a
device that requires it, then the peer drops the link after 30s), the
BlueZ D-Bus `StartNotify` call never returns and `SMPBLETransport.connect()`
hangs forever. The user-supplied `timeout_s` was also only honored for the
initial scan, not for the rest of the connect flow.

Refs intercreate/smpmgr#97.

- Bound the entire `connect()` to `timeout_s` via `asyncio.wait_for`.
- Pass `timeout=timeout_s` to `BleakClient` so its own connect honors it.
- Race `start_notify` against `_disconnected_event` so a peer disconnect
  surfaces as `SMPTransportDisconnected` instead of hanging.
- Best-effort `BleakClient.disconnect()` cleanup on any connect failure.
- Bump `bleak>=3.0.2,<4` — `bleak` 2.0.x has its own AcquireNotify hang
  regression (hbldh/bleak#1885); pinning above it is the simplest way to
  keep downstream consumers off the broken line.
- Bump `async-timeout>=5.0.1` (the 3.10-only fallback) to current stable.
Copy link
Copy Markdown
Collaborator Author

@JPHutchins JPHutchins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a BLE connection hang in SMPBLETransport.connect() when the peer disconnects during notification setup, enforces a single user timeout across the entire connect flow, and updates BLE-related dependency constraints to avoid known bleak regressions.

Changes:

  • Wraps the full BLE connect flow in an overall asyncio.wait_for(...) timeout and adds best-effort disconnect cleanup on failure.
  • Adds _await_or_disconnect() to race a potentially-hanging GATT operation (e.g. start_notify) against a disconnect event.
  • Pins bleak>=3.0.2,<4 and bumps async-timeout for the Python <3.11 fallback.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/smpclient/transport/ble.py Enforces end-to-end connect timeout, adds disconnect-aware awaiting for start_notify, and adds best-effort cleanup on connect failure.
tests/test_smp_ble_transport.py Adds regression tests covering peer disconnect + timeout while start_notify hangs.
pyproject.toml Updates optional BLE dependency constraint and async-timeout minimum version.
uv.lock Locks updated dependency versions (bleak 3.0.2, async-timeout 5.0.1, etc.).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/smpclient/transport/ble.py Outdated
Comment thread src/smpclient/transport/ble.py
Addresses two PR #108 review comments:

- `connect()` was catching `BaseException`, which also swept up
  `KeyboardInterrupt` and `SystemExit` — async cleanup on those would
  delay process termination. Narrow to `(Exception, asyncio.CancelledError)`
  so Ctrl+C terminates promptly while connect-flow failures and
  caller-driven cancellation still get the best-effort disconnect.

- `_await_or_disconnect()` cancelled and drained pending sub-tasks only
  after `asyncio.wait()` returned, so an outer cancel (e.g. the
  `asyncio.wait_for` timeout in `connect()`) raised at the `wait()`
  itself would leave `op_task` (the hung `start_notify`) and
  `disconnected_task` running. Move the cancel + drain into a `finally`
  so both sub-tasks are guaranteed cancelled and gathered before
  `_await_or_disconnect()` returns or propagates.

Adds `test_connect_does_not_leak_tasks_on_external_cancel` which
asserts no sub-tasks remain after the caller cancels `connect()` mid
`start_notify`. The test fails without the `finally` fix.
@JPHutchins
Copy link
Copy Markdown
Collaborator Author

Going to merge pretty quick since it seems low risk and we do need the bleak 3 bump.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

Comment thread tests/test_smp_ble_transport.py
@JPHutchins JPHutchins merged commit 45c4610 into main May 22, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants