Skip to content

Replace cluster handlers with entity attributes#657

Merged
puddly merged 87 commits into
zigpy:devfrom
puddly:puddly/drop-cluster-handlers
May 27, 2026
Merged

Replace cluster handlers with entity attributes#657
puddly merged 87 commits into
zigpy:devfrom
puddly:puddly/drop-cluster-handlers

Conversation

@puddly
Copy link
Copy Markdown
Contributor

@puddly puddly commented Feb 17, 2026

See #653 for a different approach.

As an example:

    # This now matches clusters instead of cluster handlers
    _cluster_match = ClusterMatch(
        server_clusters=frozenset({PowerConfiguration.cluster_id}),
        profile_device_types=frozenset(
            {(zha.PROFILE_ID, SMARTTHINGS_ARRIVAL_SENSOR_DEVICE_TYPE)}
        ),
    )

    # There is a similar `_client_cluster_config`
    _server_cluster_config = {
        PowerConfiguration.cluster_id: ClusterConfig(
            bind=True,
            attributes={
                PowerConfiguration.AttributeDefs.battery_voltage: AttrConfig(
                    read_on_startup=False,
                    reporting=REPORT_CONFIG_BATTERY_SAVE,
                ),
                PowerConfiguration.AttributeDefs.battery_percentage_remaining: AttrConfig(
                    read_on_startup=False,
                    reporting=REPORT_CONFIG_BATTERY_SAVE,
                ),
                PowerConfiguration.AttributeDefs.battery_size: AttrConfig(
                    read_on_startup=True,
                ),
                PowerConfiguration.AttributeDefs.battery_quantity: AttrConfig(
                    read_on_startup=True,
                ),
            },
        ),
    }

For an individual cluster, all entities referencing that cluster will have their individual cluster configurations combined and then aggregated into binding and attribute reporting requests.

This PR introduces the concept of virtual entities: these are entities that participate in discovery, binding reporting, and so on, same as any other. The only difference is that they have no public API and should be ignored by Core. They effectively replace cluster handlers for the very few circumstances we require something like them and actually cleans up a few hacks.


We should maybe introduce version for entity configuration? Right now, changing a reporting interval or adding new attributes requires a user to manually click "reconfigure". An alternative would be to keep track of all binding and reporting config in the zigpy database and diff it with the ZHA defaults. We'd need to track whether configuration has been automatically configured or if it was manually pushed, in case someone overrides reporting.

This is part of / a requirement for:

@dmulcahey
Copy link
Copy Markdown
Contributor

Cluster handlers that do not create entities but are used for binding need to be explicitly handled. I think the simplest way to do this would be to merge #594 and live with having duplicate entities before device deprecating automation triggers.

I agree. I think adding event entity support first is the way to go

We should maybe introduce version for entity configuration? Right now, changing a reporting interval or adding new attributes requires a user to manually click "reconfigure". An alternative would be to keep track of all binding and reporting config in the zigpy database and diff it with the ZHA defaults. We'd need to track whether configuration has been automatically configured or if it was manually pushed, in case someone overrides reporting.

we need several things to do this correctly .... an additional thing that comes to mind is queuing of commands for sleepy devices for example (if we want to auto apply changes when a new version is created).

Copy link
Copy Markdown
Collaborator

@zigpy-review-bot zigpy-review-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated 2026-05-27 — re-reviewed at 6243adf2. All blockers and must-address items are resolved; what remains is two low-priority optionals. Per-item notes inline.

Comment-level review of a large, well-staged refactor. CI is green across 3.12–3.14, and unique_ids are preserved across the refactor — only one entity changes identity across all 810 regenerated snapshots — because the new ClusterMatch discovery reconstructs the same ieee-endpoint-clusterid-suffix ids the old cluster-handler discovery produced (the migrate_unique_ids escape hatch exists but is empty throughout this PR). The feature_priority mechanism is sound where it's used for genuine alternatives (e.g. EM_POLLING).

Merge blockers

  1. The zigpy work backing this PR isn't pinned in it. Resolved in 9b4b78bapyproject.toml pins zigpy==1.5.0, which contains both zigpy/zigpy#1823 (read/write/report byte-batching) and zigpy/zigpy#1824 (request retries) — verified: the 1.5.0 tag is #1824's merge commit and #1823 is its direct ancestor; CI is green against the real 1.5.0.
  2. Lixee ZLinky / TICMeter lose their primary energy sensor. Resolved in 036fea41feature_priority dropped from the base SmartEnergySummation and the six tier classes; summation_delivered is back alongside the tiers.

Must-address

  1. Timer-handle leak on entity removal (virtual.py). Resolved in 9e6580f9OnOffClientCacheSync.on_remove cancels and nulls _off_listener.
  2. Direct cluster access drops retry + ZHAException wrapping. Resolved. The retry half is handled by zigpy 1.5.0/#1824. The exception-type half is being moved to Home Assistant — the proposed addition to convert_zha_error_to_ha_error (which already wraps every affected service method: switch/lock/cover/number/select/button/…) catches TimeoutError and zigpy.exceptions.ZigbeeExceptionHomeAssistantError. This is a cleaner split than wrapping in the library; just needs to land as a paired HA Core PR in lockstep with the ZHA release.

Optional / already-tracked

  • discovery.py:89 (_pick_primary_cluster): next(iter(match.server_clusters)) over a frozenset is latent unique_id non-determinism. Resolved in 6243adf2 — now min(...) for required clusters and sorted(...) for optional ones, so the picked cluster (and derived unique_id) is deterministic.
  • discovery.py:428: a cluster_id present in both in_clusters and out_clusters is evaluated twice, so an entity registered to it can be instantiated twice (masked downstream by (platform, unique_id) keying, but on_add still runs twice). A per-endpoint seen set closes it cheaply.
  • device.py:997/1273: aggregate_cluster_configs runs twice per join (configure + initialize) on identical input.
  • cluster_config.py:77 docstring says the key is (endpoint_id, cluster_id); it's the 3-tuple incl. is_server.
  • Real (in-venv) mypy went 76→79, entirely unstubbed-zigpy-enum annotation noise in the refactored light/alarm_control_panel/siren/cover files — no substantive regression.

The skip_configuration thread on the initialize path was addressed by 1feebc66 (guard at device.py:998 & :1274 + test); maintainer concurred.

Comment thread zha/application/platforms/sensor/__init__.py Outdated
Comment thread zha/application/platforms/virtual.py
Comment thread tests/data/devices/lixee-zlinky-tic.json
@puddly
Copy link
Copy Markdown
Contributor Author

puddly commented May 27, 2026

The exception type re-wrapping can be moved into Home Assistant so I think this should be good to go:

diff --git a/homeassistant/components/zha/helpers.py b/homeassistant/components/zha/helpers.py
index 4625aa4d0cc..ddbb5ab7fc6 100644
--- a/homeassistant/components/zha/helpers.py
+++ b/homeassistant/components/zha/helpers.py
@@ -1418,6 +1418,17 @@ def convert_zha_error_to_ha_error[**_P, _EntityT: ZHAEntity](
     async def handler(self: _EntityT, *args: _P.args, **kwargs: _P.kwargs) -> None:
         try:
             return await func(self, *args, **kwargs)
+        except TimeoutError as exc:
+            raise HomeAssistantError(
+                "Failed to send request: device did not respond"
+            ) from exc
+        except zigpy.exceptions.ZigbeeException as exc:
+            message = "Failed to send request"
+
+            if str(exc):
+                message = f"{message}: {exc}"
+
+            raise HomeAssistantError(message) from exc
         except ZHAException as err:
             raise HomeAssistantError(err) from err

Comment thread zha/application/helpers.py Outdated
Comment thread zha/application/platforms/binary_sensor/__init__.py Outdated
Comment thread zha/application/platforms/switch.py Outdated
@TheJulianJES
Copy link
Copy Markdown
Contributor

I think this is good, but I'm doing one last (manual) pass with Claude and Codex.

@TheJulianJES
Copy link
Copy Markdown
Contributor

Oh well, Codex usage limits ran out, but one bot comment incoming. Not sure they need to be addressed though.

Copy link
Copy Markdown
Collaborator

@zigpy-review-bot zigpy-review-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full re-review at 9d05dfa5. CI green on 3.12–3.14, real-venv mypy steady at 79 (unstubbed-enum annotation noise only), and zigpy is now pinned to 1.5.0 (verified to contain #1823 + #1824).

Previously-raised items now resolved (for the record): zigpy bump (9b4b78ba), Lixee summation regression (036fea41), OnOffClientCacheSync timer leak (9e6580f9), exception-type wrapping (moving to HA's convert_zha_error_to_ha_error + retries restored via zigpy 1.5.0), _pick_primary_cluster non-determinism (6243adf2), the redundant BinarySensor.is_on guard (fdab3ef1), and unused pylint ignores (9d05dfa5).

New findings are left as inline comments. Note for completeness: external tooling suggested chunking aggregated reporting/startup reads — that's handled by the zigpy 1.5.0 bump (#1823 byte-chunks reporting and caps reads at 5), and bf9f24ce reverted the ZHA-side chunking deliberately. No action needed there.

Still-open items previously raised (tracked in their existing threads, not re-commented inline)

  • discovery.py:428 — a cluster_id present on both endpoint sides is evaluated twice via the itertools.chain(in_clusters, out_clusters) loop, so an entity registered to it is instantiated twice (masked downstream by (platform, unique_id) keying, but on_add still runs twice). A per-endpoint seen set closes it.
  • device.pyaggregate_cluster_configs(self._discovered_entities) runs twice per join (in async_configure and again in async_initialize) on identical input.
  • device.py skip_configuration now also skips the from_cache=True initialize. Discussed previously; likely safe since entities read through to the DB-backed cluster cache, but noting for completeness.
  • cluster_config.py aggregate_cluster_configs docstring says the key is (endpoint_id, cluster_id); it's actually the 3-tuple (endpoint_id, cluster_id, is_server).
  • virtual.py:152 except (ZHAException, Exception)Exception already covers ZHAException.

Test coverage

tests/test_cluster_handlers.py (1515 lines) was deleted with no replacement, and nothing under tests/ references cluster_config, aggregate_cluster_configs, AggregatedAttrConfig, or VirtualEntity. The risk-bearing new logic — AggregatedAttrConfig.merge (min/max/reportable_change), aggregate_cluster_configs, bind/report execution, and the virtual-entity lifecycle — is only exercised indirectly through platform/device integration tests. Worth adding targeted unit tests for merge and aggregate_cluster_configs.

Comment thread zha/zigbee/device.py
*(endpoint.async_configure() for endpoint in self._endpoints.values())
)
# Configure binding and reporting from entity-level cluster configs
aggregated = aggregate_cluster_configs(self._discovered_entities)
Copy link
Copy Markdown
Collaborator

@zigpy-review-bot zigpy-review-bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate listeners on rejoin (transient — confirmed). (Anchored here; the relevant call is self._discover_new_entities() just above at L994.) async_configure() calls _discover_new_entities(), which runs entity.on_add() (subscribing each entity to its cluster's events) and appends to _pending_entities, but does not drain them. Draining — _add_pending_entities(), which removes re-discovered duplicates via on_remove() — happens only in async_initialize().

The rejoin path _async_device_rejoined calls only async_configure() (initialize runs later, on availability). So between rejoin's configure and the next initialize, the freshly-on_add()'d instances and the live ones in _platform_entities are both subscribed to the same cluster, producing duplicate state / zha_event emissions. With this PR's entity-direct-subscription model on_add() subscribes the entity itself, so re-running discovery re-subscribes — unlike dev.

Not a permanent leak, to be precise: the next async_initialize()_add_pending_entities() drains all accumulated pending instances (both the configure batch and the initialize batch), since every one matches an existing _platform_entities key and is dropped via on_remove(). So the duplicate subscriptions are cleaned up at the next initialize — the harm is bounded to the configure→initialize window (and to the case where several rejoins land before an initialize, since _discover_new_entities() clears _discovered_entities but not _pending_entities). It's a transient double-emission window, not unbounded accumulation. Still worth closing: a just-rejoined device is actively reporting precisely during that window.

Consider draining (or clearing _pending_entities) in async_configure, or having _discover_new_entities() skip already-present unique_ids before calling on_add().

Comment thread zha/application/platforms/cover/__init__.py
)

for (endpoint_id, _cluster_id, _is_server), agg in configs.items():
if agg.bind:
Copy link
Copy Markdown
Collaborator

@zigpy-review-bot zigpy-review-bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reporting is configured even when the cluster isn't bound. Binding runs only under this if agg.bind:, but reporting is configured whenever any aggregated attr has a reporting config (the if reporting_attrs: block below). An entity author who sets reporting=... without bind=True gets Configure Reporting pushed to an unbound cluster — reports never arrive, with no warning.

On dev the cluster-handler model defaulted BIND = True unconditionally for every handler (cluster_handlers/__init__.py:195), so any handler carrying a REPORT_CONFIG always bound too. The new ClusterConfig.bind defaults to False, so reporting and binding are now decoupled.

To be clear, no current entity triggers this — every ClusterConfig under zha/application/platforms/ that sets reporting= also sets bind=True, so this is a latent footgun, not a live bug. Consider asserting bind is set when any aggregated attr has reporting (or auto-enabling it) to keep it that way.

for optional in match.optional_cluster_handlers:
if optional in endpoint.cluster_handlers_by_name:
server_handlers.add(optional)
selected_matches = matches_by_priority[highest_priority]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This selection is immediately overwritten — the first override block is dead. selected_matches set here is overwritten by the second override pass just below (L550-558), which re-derives the override matches from matches_by_priority.values() regardless. As a result the earlier override block (the override_by_priority rebuild around L509-528) has no effect on the discovered entities — it only influences highest_priority and the Ignored matches debug log. Removing that first block collapses the two override passes into one.

@dmulcahey
Copy link
Copy Markdown
Contributor

dmulcahey commented May 27, 2026

Oh well, Codex usage limits ran out, but one bot comment incoming. Not sure they need to be addressed though.

P1: Platform.VIRTUAL entities leak through the public entity surface.
VirtualEntity says wrappers should not register these entities, but _discover_new_entities() appends them at device.py, and _add_entity() stores all entities in Device.platform_entities at device.py. Core converts every exposed entity platform through HA’s platform enum at helpers.py; HA has no virtual platform. A subagent probe reproduced this with a TRADFRI remote producing a ScenesClientBind virtual entity. Keep virtual entities in an internal lifecycle/config collection or filter them before public entity storage and entity-added events.

P1/P2: Direct platform cluster calls lost retry and exception normalization.
New platform code calls raw zigpy methods, for example switch.py, lock/init.py, and cover/init.py. The old cluster-handler proxy retried transient Zigbee failures and wrapped failures as ZHAException; the new paths can fail after one attempt and leak raw TimeoutError / ZigbeeException. Add a shared request helper with the old retry/wrap behavior and route command/write helpers through it.

P2: Cluster configuration lost retry and reporting/read chunking.
Bind is one raw call at cluster_config.py, reporting is one unchunked configure_reporting_multiple() at cluster_config.py, and startup reads are unchunked one-shot calls at cluster_config.py and cluster_config.py. Thermostat already declares many reporting attrs under climate/init.py. Subagent probes confirmed transient bind/report/read failures are attempted once. Restore retry and chunk reporting/reads like the old handler path.

P2: async_configure() can leak duplicate pending entities and listeners.
async_configure() rediscovers entities at device.py, _discover_new_entities() calls on_add() and appends to _pending_entities, but configure-only paths do not drain that queue; _add_pending_entities() runs from initialize at device.py. Probe evidence: after repeated async_configure(), pending entities grew 0 -> 5 -> 10 and Scenes listeners 2 -> 3 -> 4. Split config discovery from staged public add/listener setup, or drain/cleanup configure-only rediscovery.

P2: Quirks-v2 generated entities are skipped by cluster_id metadata filters.
_entity_targets_cluster() returns false when _cluster_match is None at device.py. Generated quirks-v2 entities are created with a real backing entity.cluster but no class _cluster_match, so change_entity_metadata(cluster_id=...) and prevent_default_entity_creation(cluster_id=...) silently miss them. Two subagents reproduced this. Fall back to entity.cluster.cluster_id and direction, or assign an instance-level match during generated entity creation.

P2: IAS enrollment writes cie_addr with a raw write.
virtual.py calls cluster.write_attributes() directly before sending enroll response. The old path used safe/retried writes and checked per-attribute status. A non-success write can be logged as success and the device can remain unenrolled or enrolled to the wrong CIE. Route this through the safe write/retry helper and test non-success status plus transient timeout.

P2: Several zha_event payload/order contracts changed.
OnOff client events now emit before cache mutation at virtual.py versus updates at virtual.py and virtual.py. IAS ACE events use platform entity IDs at alarm_control_panel/init.py and alarm_control_panel/init.py. Identify trigger_effect uses self.unique_id at virtual.py. These break automation filters expecting the old cluster-event unique IDs and, for OnOff, can expose stale state to immediate event consumers.

# ZHA PR #657 Architecture Review

Date: 2026-05-27

PR: #657

PR title: Replace cluster handlers with entity attributes

PR state reviewed: open, not draft, not merged

Base: 9b4b78babe56102889796f18892d341ab7551ed6 (dev)

Head: f841e8ae067e80a520e6158688bca20386a4a2c7 (puddly/drop-cluster-handlers)

Local checkout reviewed: /Users/davidmulcahey/zha-ai-development/zha-pr-657-review

Patch size: 870 files changed, 9304 insertions, 115748 deletions

Scope And Method

I reviewed the PR as an architectural replacement of ZHA's cluster-handler layer, not as a narrow patch review. I used:

  • GitHub PR metadata for the reviewed head and base.
  • A clean local PR checkout at the exact head commit above.
  • GitNexus MCP over the PR checkout and the existing @ZHA group.
  • Local source reads and targeted grep to verify GitNexus impact paths.

GitNexus was used as the primary map for the new execution paths. On the PR checkout, the relevant contexts were:

  • Device.async_configure
  • Device.async_initialize
  • discover_entities_for_endpoint
  • discover_quirks_v2_entities
  • aggregate_cluster_configs
  • configure_cluster_configs
  • initialize_cluster_configs
  • VirtualEntity

Across the @ZHA group, GitNexus highlighted Home Assistant Core consumers that still participate in the old cluster-handler contract:

  • handle_zha_channel_configure_reporting
  • handle_zha_channel_bind
  • handle_zha_channel_cfg_done
  • handle_zha_device_entity_added_event
  • _create_entity_metadata
  • async_get_actions
  • _execute_cluster_handler_command_based_action
  • _get_ias_wd_cluster_handler

I did not run the ZHA test suite. This worktree does not have the test environment installed: python3 -m pytest --version fails with No module named pytest. The right validation path for implementation follow-up is still the repo setup script, ./script/setup, followed by targeted uv run pytest commands.

Executive Summary

The architecture direction is good. Removing the separate cluster-handler object graph simplifies ZHA's internal model: entities become the source of truth for cluster binding, reporting, cache initialization, runtime event listeners, and virtual behavior. That is a cleaner long-term shape than maintaining both entity discovery and a parallel handler ownership graph.

The PR is not merge-ready as a standalone ZHA library change. It removes a major public-ish integration surface that Home Assistant Core still imports and calls. It also introduces a few lifecycle and matching regressions inside ZHA that are easy to miss because they are not syntax-level failures.

The highest-risk issues are:

  1. Home Assistant Core still depends on deleted zha.zigbee.cluster_handlers modules, constants, endpoint attributes, and handler methods.
  2. New discovery can instantiate the same registered entity class twice when the same cluster ID exists on both the server and client side of an endpoint.
  3. New bind/reporting execution lost the old retry wrapper, making sleepy or flaky devices more likely to fail configuration.
  4. Quirks v2 cluster-targeted metadata no longer matches entities created directly from quirks metadata.
  5. Device configure and initialize can rediscover separate entity instances before registration, which makes configuration/initialization hooks depend on throwaway instances.

The right merge shape is either a lockstep Core compatibility PR or a temporary compatibility layer in ZHA that preserves enough of the old contract while Core migrates.

Architecture Direction

Before this PR, ZHA had two coupled runtime graphs:

  • Cluster handlers owned cluster configuration, attribute reporting, command routing, and some device-specific behavior.
  • Platform entities owned exposed HA-facing behavior and state.

This PR collapses much of that into entity declarations:

  • ClusterConfig
  • AttrConfig
  • ReportingConfig
  • PlatformEntity._server_cluster_config
  • PlatformEntity._client_cluster_config
  • PlatformEntity.async_configure_cluster
  • PlatformEntity.async_initialize_cluster

That direction has real benefits:

  • Discovery, binding, reporting, and cache reads are tied to the entity that needs them.
  • Hidden behavior can be represented as Platform.VIRTUAL entities instead of invisible cluster handlers.
  • Quirks v2 metadata can translate directly into per-entity cluster config.
  • Diagnostics can describe ZHA library entities without exposing a second cluster-handler model.

The main architectural obligation is that this new model must replace all old cluster-handler contracts, including cross-repo contracts in Core. The PR currently replaces most of the ZHA-internal implementation but does not provide a complete migration boundary for Core.

Finding 1: Core Still Depends On Deleted Cluster-Handler APIs

Materiality: Critical release blocker unless paired with a Core PR.

PR #657 deletes the cluster-handler package and replaces the old event names with new ZHA device/cluster events:

  • zha/application/const.py:160-162 defines ZHA_CLUSTER_BIND_EVENT, ZHA_CLUSTER_CONFIGURE_REPORTING_EVENT, and ZHA_DEVICE_CONFIGURED_EVENT.
  • zha/zigbee/device.py:255-269 defines the new ClusterBindEvent and ClusterConfigureReportingEvent.

Current Home Assistant Core still imports the removed surface:

  • core/homeassistant/components/zha/helpers.py:31-35 imports old cluster-handler event constants.
  • core/homeassistant/components/zha/helpers.py:74 imports ClusterBindEvent and ClusterConfigureReportingEvent from zha.zigbee.cluster_handlers.
  • core/homeassistant/components/zha/helpers.py:75-77 imports ClusterHandlerConfigurationComplete from zha.zigbee.device.
  • core/homeassistant/components/zha/device_action.py:7-13 imports cluster-handler constants and manufacturer-specific command enums.
  • core/homeassistant/components/zha/websocket_api.py:47 imports CLUSTER_HANDLER_IAS_WD.

Core also calls endpoint cluster-handler collections that do not exist in the new model:

  • core/homeassistant/components/zha/device_action.py:153-157 lists device actions from endpoint.claimed_cluster_handlers.
  • core/homeassistant/components/zha/device_action.py:211-233 searches endpoint.all_cluster_handlers and invokes methods on the selected handler.
  • core/homeassistant/components/zha/websocket_api.py:427-430 reaches through entity.cluster_handlers to map groupable entities to endpoints.
  • core/homeassistant/components/zha/websocket_api.py:1483-1490 finds the IAS warning device through endpoint.claimed_cluster_handlers.
  • core/homeassistant/components/zha/websocket_api.py:1500-1501 calls cluster_handler.issue_squawk.
  • core/homeassistant/components/zha/websocket_api.py:1543-1546 calls cluster_handler.issue_start_warning.

This is not only an import compatibility problem. The old Core API shape encoded behavior:

  • Device actions are keyed by cluster-handler names.
  • Warning device services invoke cluster-handler methods.
  • Groupable entity discovery assumes entities have cluster_handlers.
  • Configuration progress is bridged through old cluster-handler message constants.

Local grep found this coupling in homeassistant/components/zha; I did not find direct zha.zigbee.cluster_handlers imports in the local zigpy, zha-device-handlers, or bellows checkouts.

Why This Matters

If ZHA is upgraded in Core without a companion migration, Core will fail before runtime behavior can be evaluated. If imports are shimmed but endpoint/entity handler attributes are not, device actions, IAS warning services, groupable entity metadata, and configuration progress UI will still break.

Recommendation

Treat this as a lockstep multi-repo migration. The Core side needs a companion PR that:

  1. Imports the new ZHA event classes and constants.
  2. Stops using endpoint.claimed_cluster_handlers and endpoint.all_cluster_handlers.
  3. Replaces cluster-handler-name based action detection with direct entity, cluster, or capability APIs exposed by ZHA.
  4. Replaces IAS warning-device service implementation with a stable ZHA method, for example a device-level helper that locates the IAS WD cluster and issues the command.
  5. Replaces groupable entity endpoint lookup with a direct PlatformEntity.endpoint/cluster property path.
  6. Filters or explicitly handles Platform.VIRTUAL in Core metadata creation.
  7. Adds contract tests that install the PR ZHA package into Core and import/run homeassistant.components.zha.

If a lockstep Core PR is not ready, ZHA should keep a compatibility module for the deleted import paths and provide transitional endpoint/entity properties with deprecation warnings. A shim will not be as clean, but it is safer than removing the surface before Core no longer calls it.

Finding 2: Virtual Entities Can Leak Into Core Entity Metadata

Materiality: High when paired with Finding 1.

PR #657 introduces Platform.VIRTUAL entities to replace non-HA-facing cluster-handler behavior. ZHA diagnostics already knows to skip them:

  • zha/zigbee/device.py:1823-1827 skips Platform.VIRTUAL while building diagnostics JSON.

Core does not currently have the same guard:

  • core/homeassistant/components/zha/helpers.py:912-918 appends every device.platform_entities value to ha_zha_data.platforms[Platform(entity.PLATFORM)].
  • core/homeassistant/components/zha/helpers.py:499-510 does the same for dynamic entity-added events.

If virtual entities are placed in device.platform_entities, Core will try to store them under ha_zha_data.platforms[Platform.VIRTUAL]. Whether that becomes a hard failure depends on the Core enum and platform setup path at the time of integration, but architecturally the boundary is unclear: virtual entities are ZHA internal behavior carriers, not Home Assistant entities.

Recommendation

Make the virtual-entity boundary explicit in both packages:

  • ZHA should document whether virtual entities are allowed in device.platform_entities.
  • Core should skip Platform.VIRTUAL anywhere it creates HA entity metadata.
  • ZHA should have a test that virtual entities can configure clusters without appearing in external entity metadata.
  • Core should have a compatibility test that no HA entity wrapper is created for a virtual ZHA entity.

Finding 3: Entity Discovery Can Duplicate Matches For Clusters Present On Both Sides

Materiality: High.

discover_entities_for_endpoint() now looks up candidate entity classes by iterating all input clusters and all output clusters:

  • zha/application/discovery.py:422-425 chains in_clusters.values() and out_clusters.values().
  • zha/application/discovery.py:428 fetches registry entries by cluster.cluster_id.
  • zha/application/discovery.py:502-504 appends (match, entity_class) into matches_by_feature_and_priority.
  • zha/application/discovery.py:560-577 instantiates every selected match.

There is no dedupe for (match, entity_class) before appending. That means an entity registered under a cluster ID can be selected twice when the endpoint has that cluster ID in both directions, even when the match itself only requires the server side.

The PR already has a concrete fixture with this shape:

  • tests/data/devices/ikea-of-sweden-tradfri-bulb-gu10-ws-400lm.json:252-255 has input LightLink cluster 0x1000.
  • tests/data/devices/ikea-of-sweden-tradfri-bulb-gu10-ws-400lm.json:286-289 has output LightLink cluster 0x1000.

The virtual LightLinkGroupJoin entity is registered by cluster ID:

  • zha/application/platforms/virtual.py:183-184 registers LightLinkGroupJoin under LightLink.cluster_id.
  • zha/application/platforms/virtual.py:189-191 requires a server-side LightLink cluster.
  • zha/application/platforms/virtual.py:197-222 performs coordinator group-join work during async_configure_cluster.

Because the registry key is only the cluster ID, both the input and output LightLink clusters cause the same class to be considered. The match then checks the endpoint-level cluster sets, not the direction of the cluster that caused the lookup, so the server-side requirement passes both times.

Why This Matters

The duplicate may not show up as two final HA entities if their unique IDs collide and _add_pending_entities() removes duplicates. The side effects still happen earlier:

  • _discover_new_entities() calls entity.on_add() before pending dedupe (zha/zigbee/device.py:1156-1157).
  • aggregate_cluster_configs() appends every matching entity into agg.entities (zha/zigbee/cluster_config.py:94-97 and 115-117).
  • configure_cluster_configs() calls async_configure_cluster() for every entity in agg.entities (zha/zigbee/cluster_config.py:233-235).

For LightLinkGroupJoin, this can duplicate get_group_identifiers() and coordinator add_to_group() calls.

Recommendation

Deduplicate before entity instantiation. A minimal fix is to track selected pairs:

seen: set[tuple[type[PlatformEntity], ClusterMatch]] = set()

for entity_class in ENTITY_REGISTRY.get(cluster.cluster_id, []):
    match = entity_class._cluster_match
    if match is None:
        continue
    key = (entity_class, match)
    if key in seen:
        continue
    seen.add(key)
    ...

If ClusterMatch is not hashable enough for this, use id(match) or a normalized tuple of the match fields. A better long-term model is to build candidate classes from the union of cluster IDs first, then evaluate each class exactly once per endpoint.

Add a regression using the IKEA GU10 fixture and assert that LightLinkGroupJoin.async_configure_cluster() runs once.

Finding 4: Bind And Reporting Lost The Old Retry Semantics

Materiality: High for real networks, medium for pure code structure.

The old cluster-handler implementation wrapped bind and configure-reporting requests in Zigbee retry logic:

  • Base zha/zigbee/cluster_handlers/__init__.py:333 called RETRYABLE_REQUEST_DECORATOR(self.cluster.bind)().
  • Base zha/zigbee/cluster_handlers/__init__.py:399-401 called RETRYABLE_REQUEST_DECORATOR(self.cluster.configure_reporting_multiple)(reports).

The new aggregate configuration path calls both operations directly:

  • zha/zigbee/cluster_config.py:149 calls await agg.cluster.bind().
  • zha/zigbee/cluster_config.py:198 calls await agg.cluster.configure_reporting_multiple(reporting_attrs).

The old code also chunked reporting records via REPORT_CONFIG_ATTR_PER_REQ (base zha/zigbee/cluster_handlers/__init__.py:386-401). The new aggregator sends the merged attribute set for a cluster in a single request.

Why This Matters

Zigbee configuration traffic is often sent while devices are joining, waking, or recovering from interview. Removing retries makes configuration more sensitive to transient failures. A single timeout now marks bind/reporting as failed where the old path would have tried multiple times.

The chunking change may be fine for most clusters, but it should be intentional and tested. If a device rejects large configure-reporting batches, the PR has made the failure mode broader.

Recommendation

Preserve the old transport behavior in the new architecture:

RETRYABLE_REQUEST_DECORATOR = zigpy.util.retryable_request(tries=3)

res = await RETRYABLE_REQUEST_DECORATOR(agg.cluster.bind)()
res = await RETRYABLE_REQUEST_DECORATOR(
    agg.cluster.configure_reporting_multiple
)(reporting_attrs)

Also decide whether REPORT_CONFIG_ATTR_PER_REQ remains part of the ZHA contract. If it does, split reporting_attrs into chunks and preserve event status aggregation across chunks.

Add tests that inject a timeout on the first bind/reporting request and verify that the second attempt succeeds.

Finding 5: Reporting Attribute Lookup Can Raise Outside The Failure Event Path

Materiality: Medium.

In the new reporting loop:

  • zha/zigbee/cluster_config.py:177-182 builds reporting_attrs.
  • zha/zigbee/cluster_config.py:181 calls agg.cluster.find_attribute(attr_name) before the later try block.
  • zha/zigbee/cluster_config.py:197-221 catches exceptions only around configure_reporting_multiple().

If an entity config references a reporting attribute that the concrete cluster class cannot find, find_attribute() can raise before the code emits a ClusterConfigureReportingEvent with failure status.

The old code handled name resolution differently for event data:

  • Base zha/zigbee/cluster_handlers/__init__.py:372-375 caught KeyError while producing event names.
  • Base zha/zigbee/cluster_handlers/__init__.py:392-401 then built reports and attempted the request inside the chunk loop.

The new model should probably fail fast for invalid developer declarations, but then this should be a tested invariant. If the goal is runtime tolerance equivalent to cluster handlers, the exception should be converted into a failed per-attribute configuration event.

Recommendation

Pick the intended contract:

  • Strict contract: validate all entity cluster configs at import/test time and let invalid configs fail tests, not runtime joins.
  • Runtime-tolerant contract: catch KeyError around find_attribute(), mark that attribute as failed, and continue configuring valid attributes.

Given ZHA's device-support surface, I would prefer strict tests plus a graceful runtime failure event.

Finding 6: Quirks V2 Cluster-Targeted Metadata No Longer Matches Direct Quirks Entities

Materiality: Medium-high.

PR #657 routes default entity removal and metadata changes through _entity_targets_cluster():

  • zha/zigbee/device.py:106-114 returns False when entity._cluster_match is None.
  • zha/zigbee/device.py:1061-1063 uses _entity_targets_cluster() for disabled default entities.
  • zha/zigbee/device.py:1090-1092 uses _entity_targets_cluster() for changed entity metadata.

Quirks v2 entities are created directly from a concrete cluster:

  • zha/application/discovery.py:315-320 instantiates the entity with cluster=cluster and entity_metadata=entity_metadata.
  • zha/application/platforms/__init__.py:521-523 stores that cluster on the entity.
  • zha/application/platforms/__init__.py:595-598 exposes it through entity.cluster.

These quirks-created entities do not need a class-level _cluster_match; their backing cluster is already known. Under the new helper, cluster-targeted metadata will not match them if _cluster_match is None.

Why This Matters

Quirks v2 metadata commonly uses endpoint, cluster, attribute, and entity metadata to shape the exposed entity. A migration that stops cluster-targeted change_entity_metadata() or disabled-default rules from applying to quirks-created entities is subtle: entities still appear, but names, categories, primary flags, enabled-by-default state, or replacements may be wrong.

Recommendation

Make _entity_targets_cluster() understand direct cluster-backed entities:

if match is None:
    cluster = getattr(entity, "cluster", None)
    if cluster is None:
        return False
    if cluster.cluster_id != cluster_id:
        return False
    if cluster_type is None:
        return True
    return cluster.cluster_type == cluster_type

Then add regression tests for quirks v2 metadata that targets an exposed entity by cluster_id and cluster_type.

Finding 7: Configure And Initialize Can Run Against Different Entity Instances

Materiality: Medium architectural risk.

The new lifecycle discovers prospective entities during both configure and initialize:

  • zha/zigbee/device.py:994 calls _discover_new_entities() during async_configure().
  • zha/zigbee/device.py:997-999 aggregates those discovered entities and configures their cluster configs.
  • zha/zigbee/device.py:1269-1275 calls _discover_new_entities() again during async_initialize() and initializes attributes from that new set.
  • zha/zigbee/device.py:1278 then adds pending entities.

_discover_new_entities() clears _discovered_entities but does not clear _pending_entities:

  • zha/zigbee/device.py:1137 clears _discovered_entities.
  • zha/zigbee/device.py:1148-1157 appends each new entity to both _discovered_entities and _pending_entities.

_add_pending_entities() later dedupes by (platform, unique_id):

  • zha/zigbee/device.py:1217-1237 iterates pending entities, calls on_remove() on unsupported or duplicate entries, and then clears _pending_entities.

This means configuration hooks may run on one set of entity instances, initialization hooks may run on a second set, and the registered entities may be whichever duplicate was first accepted by pending dedupe.

Why This Matters

Today most async_configure_cluster() and async_initialize_cluster() hooks appear stateless enough that this may not break immediately. Architecturally, though, the new entity-owned cluster config model invites entity hooks to cache setup state, computed capability data, or listener handles. That becomes fragile if hooks run on instances that are later discarded.

Recommendation

Make discovery a single plan per lifecycle phase:

  • Discover prospective entities once before configure/initialize.
  • Run configure and initialize against the same prospective entity objects.
  • Register or discard those same objects.
  • Clear pending state before rediscovery if rediscovery is intentional.

If configure and initialize must stay separately callable, add an explicit invariant: hook methods cannot rely on instance state until after registration. That is weaker than reusing the same objects and will be harder to enforce.

Positive Architectural Changes Worth Keeping

Despite the findings above, the PR has several strong design moves.

Entity-Owned Cluster Config

aggregate_cluster_configs() gives the system a single place to merge binding, reporting, and initialization needs:

  • zha/zigbee/cluster_config.py:69-127 aggregates configs by (endpoint_id, cluster_id, is_server).
  • It merges attribute configs and keeps the entities that contributed to each cluster config.

That is a good replacement for cluster handlers because it makes "why are we binding this cluster?" answerable from the entity declarations.

Virtual Entities

Virtual entities are the right abstraction for behavior that is tied to endpoint discovery and lifecycle but should not become a Home Assistant entity. Examples include:

  • LightLink group join behavior.
  • Client command cache synchronization.
  • IAS enrollment behavior.

The only missing piece is an explicit external boundary so Core never treats virtual entities as normal HA platform entities.

Fewer Hidden Handler Contracts Inside ZHA

The patch removes a large amount of handler-specific scaffolding and pushes behavior into platform modules. That reduces the number of concepts a future entity author needs to understand. A sensor, binary sensor, light, or switch can declare its clusters and attributes directly.

Diagnostics Already Show The Intended Boundary

ZHA diagnostics skips virtual entities (zha/zigbee/device.py:1823-1827), which is the right external behavior. That same boundary should be applied to Core entity metadata and dynamic entity events.

Recommended Merge Gates

I would not merge PR #657 into a branch consumed by Core until these gates are covered.

ZHA Gates

  1. Duplicate-discovery regression:

    • Use a fixture with the same cluster ID in input and output clusters.
    • Assert a registered entity class is instantiated/configured once.
  2. Retry regression:

    • Simulate first-attempt timeout for bind().
    • Simulate first-attempt timeout for configure_reporting_multiple().
    • Assert both retry and eventual success.
  3. Quirks v2 metadata regression:

    • Create a quirks v2 entity from direct cluster metadata.
    • Apply change_entity_metadata() or disabled default metadata targeting cluster_id.
    • Assert the metadata change/removal applies.
  4. Virtual entity boundary:

    • Assert virtual entities can contribute cluster config.
    • Assert diagnostics and exported entity metadata skip them.
  5. Lifecycle consistency:

    • Assert configure and initialize hooks run on the entity instance that is eventually registered, or document and test the opposite contract.

Core Gates

  1. Core imports cleanly with this ZHA branch installed.
  2. ZHA setup creates HA entities from non-virtual ZHA entities only.
  3. Device actions no longer depend on cluster-handler names.
  4. IAS warning device services work through a new ZHA API.
  5. Groupable entity websocket output no longer reaches through entity.cluster_handlers.
  6. Cluster bind/configure-reporting progress still reaches the UI/reconfigure diagnostics path.

Suggested Core test entry points after a companion PR exists:

pytest tests/components/zha/test_init.py
pytest tests/components/zha/test_device_action.py
pytest tests/components/zha/test_websocket_api.py
pytest tests/components/zha/test_diagnostics.py

Suggested ZHA test entry points after ./script/setup:

uv run pytest tests/test_discover.py
uv run pytest tests/test_device.py
uv run pytest tests/test_sensor.py
uv run pytest tests/test_cluster_configs.py

The exact test names should be narrowed once the new regressions are added.

Overall Assessment

PR #657 is a serious and directionally valuable simplification of ZHA. The idea is worth pursuing: entity-owned cluster configuration and virtual entities are better long-term primitives than cluster handlers.

The current implementation needs another compatibility and lifecycle pass before it should land. The cross-repo Core dependency is the blocking issue. The duplicate discovery, retry regression, and quirks metadata regression are smaller but still material because they affect real device setup behavior.

My recommendation is to continue with this architecture, but treat the PR as a multi-repo migration rather than a ZHA-only refactor. The minimum acceptable landing package is:

  • ZHA fixes for duplicate discovery, retries, quirks metadata matching, and lifecycle consistency.
  • A Core companion PR that removes old cluster-handler imports and behavior assumptions.
  • Contract tests proving Core can consume the new ZHA branch.
  • Focused ZHA regression tests around the new entity-owned configuration model.

@puddly
Copy link
Copy Markdown
Contributor Author

puddly commented May 27, 2026

Thanks @dmulcahey, fixed 😄

@puddly
Copy link
Copy Markdown
Contributor Author

puddly commented May 27, 2026

I think the event unique_id bugs were the only things really blocking this PR so we're good to go.

For the other points:

  • The virtual entity filtering will be done in Core
  • Re-wrapping exceptions will also be done in Core
  • Zigpy now handlers retries

@puddly puddly merged commit e399b13 into zigpy:dev May 27, 2026
10 of 11 checks passed
@puddly puddly deleted the puddly/drop-cluster-handlers branch May 27, 2026 13:14
@puddly puddly mentioned this pull request May 27, 2026
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Further shape Zigbee support by implementing missing features in ZHA while retaining choice

5 participants