Replace cluster handlers with entity attributes#657
Conversation
I agree. I think adding event entity support first is the way to go
we need several things to do this correctly .... an additional thing that comes to mind is queuing of commands for sleepy devices for example (if we want to auto apply changes when a new version is created). |
There was a problem hiding this comment.
Updated 2026-05-27 — re-reviewed at
6243adf2. All blockers and must-address items are resolved; what remains is two low-priority optionals. Per-item notes inline.
Comment-level review of a large, well-staged refactor. CI is green across 3.12–3.14, and unique_ids are preserved across the refactor — only one entity changes identity across all 810 regenerated snapshots — because the new ClusterMatch discovery reconstructs the same ieee-endpoint-clusterid-suffix ids the old cluster-handler discovery produced (the migrate_unique_ids escape hatch exists but is empty throughout this PR). The feature_priority mechanism is sound where it's used for genuine alternatives (e.g. EM_POLLING).
Merge blockers
The zigpy work backing this PR isn't pinned in it.Resolved in9b4b78ba—pyproject.tomlpinszigpy==1.5.0, which contains both zigpy/zigpy#1823 (read/write/report byte-batching) and zigpy/zigpy#1824 (request retries) — verified: the1.5.0tag is #1824's merge commit and #1823 is its direct ancestor; CI is green against the real 1.5.0.Lixee ZLinky / TICMeter lose their primary energy sensor.Resolved in036fea41—feature_prioritydropped from the baseSmartEnergySummationand the six tier classes;summation_deliveredis back alongside the tiers.
Must-address
Timer-handle leak on entity removal (Resolved invirtual.py).9e6580f9—OnOffClientCacheSync.on_removecancels and nulls_off_listener.Direct cluster access drops retry +Resolved. The retry half is handled by zigpy 1.5.0/#1824. The exception-type half is being moved to Home Assistant — the proposed addition toZHAExceptionwrapping.convert_zha_error_to_ha_error(which already wraps every affected service method: switch/lock/cover/number/select/button/…) catchesTimeoutErrorandzigpy.exceptions.ZigbeeException→HomeAssistantError. This is a cleaner split than wrapping in the library; just needs to land as a paired HA Core PR in lockstep with the ZHA release.
Optional / already-tracked
Resolved indiscovery.py:89(_pick_primary_cluster):next(iter(match.server_clusters))over a frozenset is latent unique_id non-determinism.6243adf2— nowmin(...)for required clusters andsorted(...)for optional ones, so the picked cluster (and derived unique_id) is deterministic.discovery.py:428: acluster_idpresent in bothin_clustersandout_clustersis evaluated twice, so an entity registered to it can be instantiated twice (masked downstream by(platform, unique_id)keying, buton_addstill runs twice). A per-endpointseenset closes it cheaply.device.py:997/1273:aggregate_cluster_configsruns twice per join (configure + initialize) on identical input.cluster_config.py:77docstring says the key is(endpoint_id, cluster_id); it's the 3-tuple incl.is_server.- Real (in-venv) mypy went 76→79, entirely unstubbed-zigpy-enum annotation noise in the refactored
light/alarm_control_panel/siren/coverfiles — no substantive regression.
The skip_configuration thread on the initialize path was addressed by 1feebc66 (guard at device.py:998 & :1274 + test); maintainer concurred.
|
The exception type re-wrapping can be moved into Home Assistant so I think this should be good to go: diff --git a/homeassistant/components/zha/helpers.py b/homeassistant/components/zha/helpers.py
index 4625aa4d0cc..ddbb5ab7fc6 100644
--- a/homeassistant/components/zha/helpers.py
+++ b/homeassistant/components/zha/helpers.py
@@ -1418,6 +1418,17 @@ def convert_zha_error_to_ha_error[**_P, _EntityT: ZHAEntity](
async def handler(self: _EntityT, *args: _P.args, **kwargs: _P.kwargs) -> None:
try:
return await func(self, *args, **kwargs)
+ except TimeoutError as exc:
+ raise HomeAssistantError(
+ "Failed to send request: device did not respond"
+ ) from exc
+ except zigpy.exceptions.ZigbeeException as exc:
+ message = "Failed to send request"
+
+ if str(exc):
+ message = f"{message}: {exc}"
+
+ raise HomeAssistantError(message) from exc
except ZHAException as err:
raise HomeAssistantError(err) from err |
|
I think this is good, but I'm doing one last (manual) pass with Claude and Codex. |
|
Oh well, Codex usage limits ran out, but one bot comment incoming. Not sure they need to be addressed though. |
zigpy-review-bot
left a comment
There was a problem hiding this comment.
Full re-review at 9d05dfa5. CI green on 3.12–3.14, real-venv mypy steady at 79 (unstubbed-enum annotation noise only), and zigpy is now pinned to 1.5.0 (verified to contain #1823 + #1824).
Previously-raised items now resolved (for the record): zigpy bump (9b4b78ba), Lixee summation regression (036fea41), OnOffClientCacheSync timer leak (9e6580f9), exception-type wrapping (moving to HA's convert_zha_error_to_ha_error + retries restored via zigpy 1.5.0), _pick_primary_cluster non-determinism (6243adf2), the redundant BinarySensor.is_on guard (fdab3ef1), and unused pylint ignores (9d05dfa5).
New findings are left as inline comments. Note for completeness: external tooling suggested chunking aggregated reporting/startup reads — that's handled by the zigpy 1.5.0 bump (#1823 byte-chunks reporting and caps reads at 5), and bf9f24ce reverted the ZHA-side chunking deliberately. No action needed there.
Still-open items previously raised (tracked in their existing threads, not re-commented inline)
discovery.py:428— acluster_idpresent on both endpoint sides is evaluated twice via theitertools.chain(in_clusters, out_clusters)loop, so an entity registered to it is instantiated twice (masked downstream by(platform, unique_id)keying, buton_addstill runs twice). A per-endpointseenset closes it.device.py—aggregate_cluster_configs(self._discovered_entities)runs twice per join (inasync_configureand again inasync_initialize) on identical input.device.pyskip_configurationnow also skips thefrom_cache=Trueinitialize. Discussed previously; likely safe since entities read through to the DB-backed cluster cache, but noting for completeness.cluster_config.pyaggregate_cluster_configsdocstring says the key is(endpoint_id, cluster_id); it's actually the 3-tuple(endpoint_id, cluster_id, is_server).virtual.py:152except (ZHAException, Exception)—Exceptionalready coversZHAException.
Test coverage
tests/test_cluster_handlers.py (1515 lines) was deleted with no replacement, and nothing under tests/ references cluster_config, aggregate_cluster_configs, AggregatedAttrConfig, or VirtualEntity. The risk-bearing new logic — AggregatedAttrConfig.merge (min/max/reportable_change), aggregate_cluster_configs, bind/report execution, and the virtual-entity lifecycle — is only exercised indirectly through platform/device integration tests. Worth adding targeted unit tests for merge and aggregate_cluster_configs.
| *(endpoint.async_configure() for endpoint in self._endpoints.values()) | ||
| ) | ||
| # Configure binding and reporting from entity-level cluster configs | ||
| aggregated = aggregate_cluster_configs(self._discovered_entities) |
There was a problem hiding this comment.
Duplicate listeners on rejoin (transient — confirmed). (Anchored here; the relevant call is self._discover_new_entities() just above at L994.) async_configure() calls _discover_new_entities(), which runs entity.on_add() (subscribing each entity to its cluster's events) and appends to _pending_entities, but does not drain them. Draining — _add_pending_entities(), which removes re-discovered duplicates via on_remove() — happens only in async_initialize().
The rejoin path _async_device_rejoined calls only async_configure() (initialize runs later, on availability). So between rejoin's configure and the next initialize, the freshly-on_add()'d instances and the live ones in _platform_entities are both subscribed to the same cluster, producing duplicate state / zha_event emissions. With this PR's entity-direct-subscription model on_add() subscribes the entity itself, so re-running discovery re-subscribes — unlike dev.
Not a permanent leak, to be precise: the next async_initialize() → _add_pending_entities() drains all accumulated pending instances (both the configure batch and the initialize batch), since every one matches an existing _platform_entities key and is dropped via on_remove(). So the duplicate subscriptions are cleaned up at the next initialize — the harm is bounded to the configure→initialize window (and to the case where several rejoins land before an initialize, since _discover_new_entities() clears _discovered_entities but not _pending_entities). It's a transient double-emission window, not unbounded accumulation. Still worth closing: a just-rejoined device is actively reporting precisely during that window.
Consider draining (or clearing _pending_entities) in async_configure, or having _discover_new_entities() skip already-present unique_ids before calling on_add().
| ) | ||
|
|
||
| for (endpoint_id, _cluster_id, _is_server), agg in configs.items(): | ||
| if agg.bind: |
There was a problem hiding this comment.
Reporting is configured even when the cluster isn't bound. Binding runs only under this if agg.bind:, but reporting is configured whenever any aggregated attr has a reporting config (the if reporting_attrs: block below). An entity author who sets reporting=... without bind=True gets Configure Reporting pushed to an unbound cluster — reports never arrive, with no warning.
On dev the cluster-handler model defaulted BIND = True unconditionally for every handler (cluster_handlers/__init__.py:195), so any handler carrying a REPORT_CONFIG always bound too. The new ClusterConfig.bind defaults to False, so reporting and binding are now decoupled.
To be clear, no current entity triggers this — every ClusterConfig under zha/application/platforms/ that sets reporting= also sets bind=True, so this is a latent footgun, not a live bug. Consider asserting bind is set when any aggregated attr has reporting (or auto-enabling it) to keep it that way.
| for optional in match.optional_cluster_handlers: | ||
| if optional in endpoint.cluster_handlers_by_name: | ||
| server_handlers.add(optional) | ||
| selected_matches = matches_by_priority[highest_priority] |
There was a problem hiding this comment.
This selection is immediately overwritten — the first override block is dead. selected_matches set here is overwritten by the second override pass just below (L550-558), which re-derives the override matches from matches_by_priority.values() regardless. As a result the earlier override block (the override_by_priority rebuild around L509-528) has no effect on the discovered entities — it only influences highest_priority and the Ignored matches debug log. Removing that first block collapses the two override passes into one.
P1: Platform.VIRTUAL entities leak through the public entity surface. P1/P2: Direct platform cluster calls lost retry and exception normalization. P2: Cluster configuration lost retry and reporting/read chunking. P2: async_configure() can leak duplicate pending entities and listeners. P2: Quirks-v2 generated entities are skipped by cluster_id metadata filters. P2: IAS enrollment writes cie_addr with a raw write. P2: Several zha_event payload/order contracts changed. # ZHA PR #657 Architecture ReviewDate: 2026-05-27 PR: #657 PR title: Replace cluster handlers with entity attributes PR state reviewed: open, not draft, not merged Base: Head: Local checkout reviewed: Patch size: 870 files changed, 9304 insertions, 115748 deletions Scope And MethodI reviewed the PR as an architectural replacement of ZHA's cluster-handler layer, not as a narrow patch review. I used:
GitNexus was used as the primary map for the new execution paths. On the PR checkout, the relevant contexts were:
Across the
I did not run the ZHA test suite. This worktree does not have the test environment installed: Executive SummaryThe architecture direction is good. Removing the separate cluster-handler object graph simplifies ZHA's internal model: entities become the source of truth for cluster binding, reporting, cache initialization, runtime event listeners, and virtual behavior. That is a cleaner long-term shape than maintaining both entity discovery and a parallel handler ownership graph. The PR is not merge-ready as a standalone ZHA library change. It removes a major public-ish integration surface that Home Assistant Core still imports and calls. It also introduces a few lifecycle and matching regressions inside ZHA that are easy to miss because they are not syntax-level failures. The highest-risk issues are:
The right merge shape is either a lockstep Core compatibility PR or a temporary compatibility layer in ZHA that preserves enough of the old contract while Core migrates. Architecture DirectionBefore this PR, ZHA had two coupled runtime graphs:
This PR collapses much of that into entity declarations:
That direction has real benefits:
The main architectural obligation is that this new model must replace all old cluster-handler contracts, including cross-repo contracts in Core. The PR currently replaces most of the ZHA-internal implementation but does not provide a complete migration boundary for Core. Finding 1: Core Still Depends On Deleted Cluster-Handler APIsMateriality: Critical release blocker unless paired with a Core PR. PR #657 deletes the cluster-handler package and replaces the old event names with new ZHA device/cluster events:
Current Home Assistant Core still imports the removed surface:
Core also calls endpoint cluster-handler collections that do not exist in the new model:
This is not only an import compatibility problem. The old Core API shape encoded behavior:
Local grep found this coupling in Why This MattersIf ZHA is upgraded in Core without a companion migration, Core will fail before runtime behavior can be evaluated. If imports are shimmed but endpoint/entity handler attributes are not, device actions, IAS warning services, groupable entity metadata, and configuration progress UI will still break. RecommendationTreat this as a lockstep multi-repo migration. The Core side needs a companion PR that:
If a lockstep Core PR is not ready, ZHA should keep a compatibility module for the deleted import paths and provide transitional endpoint/entity properties with deprecation warnings. A shim will not be as clean, but it is safer than removing the surface before Core no longer calls it. Finding 2: Virtual Entities Can Leak Into Core Entity MetadataMateriality: High when paired with Finding 1. PR #657 introduces
Core does not currently have the same guard:
If virtual entities are placed in RecommendationMake the virtual-entity boundary explicit in both packages:
Finding 3: Entity Discovery Can Duplicate Matches For Clusters Present On Both SidesMateriality: High.
There is no dedupe for The PR already has a concrete fixture with this shape:
The virtual
Because the registry key is only the cluster ID, both the input and output Why This MattersThe duplicate may not show up as two final HA entities if their unique IDs collide and
For RecommendationDeduplicate before entity instantiation. A minimal fix is to track selected pairs: seen: set[tuple[type[PlatformEntity], ClusterMatch]] = set()
for entity_class in ENTITY_REGISTRY.get(cluster.cluster_id, []):
match = entity_class._cluster_match
if match is None:
continue
key = (entity_class, match)
if key in seen:
continue
seen.add(key)
...If Add a regression using the IKEA GU10 fixture and assert that Finding 4: Bind And Reporting Lost The Old Retry SemanticsMateriality: High for real networks, medium for pure code structure. The old cluster-handler implementation wrapped bind and configure-reporting requests in Zigbee retry logic:
The new aggregate configuration path calls both operations directly:
The old code also chunked reporting records via Why This MattersZigbee configuration traffic is often sent while devices are joining, waking, or recovering from interview. Removing retries makes configuration more sensitive to transient failures. A single timeout now marks bind/reporting as failed where the old path would have tried multiple times. The chunking change may be fine for most clusters, but it should be intentional and tested. If a device rejects large configure-reporting batches, the PR has made the failure mode broader. RecommendationPreserve the old transport behavior in the new architecture: RETRYABLE_REQUEST_DECORATOR = zigpy.util.retryable_request(tries=3)
res = await RETRYABLE_REQUEST_DECORATOR(agg.cluster.bind)()
res = await RETRYABLE_REQUEST_DECORATOR(
agg.cluster.configure_reporting_multiple
)(reporting_attrs)Also decide whether Add tests that inject a timeout on the first bind/reporting request and verify that the second attempt succeeds. Finding 5: Reporting Attribute Lookup Can Raise Outside The Failure Event PathMateriality: Medium. In the new reporting loop:
If an entity config references a reporting attribute that the concrete cluster class cannot find, The old code handled name resolution differently for event data:
The new model should probably fail fast for invalid developer declarations, but then this should be a tested invariant. If the goal is runtime tolerance equivalent to cluster handlers, the exception should be converted into a failed per-attribute configuration event. RecommendationPick the intended contract:
Given ZHA's device-support surface, I would prefer strict tests plus a graceful runtime failure event. Finding 6: Quirks V2 Cluster-Targeted Metadata No Longer Matches Direct Quirks EntitiesMateriality: Medium-high. PR #657 routes default entity removal and metadata changes through
Quirks v2 entities are created directly from a concrete cluster:
These quirks-created entities do not need a class-level Why This MattersQuirks v2 metadata commonly uses endpoint, cluster, attribute, and entity metadata to shape the exposed entity. A migration that stops cluster-targeted RecommendationMake if match is None:
cluster = getattr(entity, "cluster", None)
if cluster is None:
return False
if cluster.cluster_id != cluster_id:
return False
if cluster_type is None:
return True
return cluster.cluster_type == cluster_typeThen add regression tests for quirks v2 metadata that targets an exposed entity by Finding 7: Configure And Initialize Can Run Against Different Entity InstancesMateriality: Medium architectural risk. The new lifecycle discovers prospective entities during both configure and initialize:
This means configuration hooks may run on one set of entity instances, initialization hooks may run on a second set, and the registered entities may be whichever duplicate was first accepted by pending dedupe. Why This MattersToday most RecommendationMake discovery a single plan per lifecycle phase:
If configure and initialize must stay separately callable, add an explicit invariant: hook methods cannot rely on instance state until after registration. That is weaker than reusing the same objects and will be harder to enforce. Positive Architectural Changes Worth KeepingDespite the findings above, the PR has several strong design moves. Entity-Owned Cluster Config
That is a good replacement for cluster handlers because it makes "why are we binding this cluster?" answerable from the entity declarations. Virtual EntitiesVirtual entities are the right abstraction for behavior that is tied to endpoint discovery and lifecycle but should not become a Home Assistant entity. Examples include:
The only missing piece is an explicit external boundary so Core never treats virtual entities as normal HA platform entities. Fewer Hidden Handler Contracts Inside ZHAThe patch removes a large amount of handler-specific scaffolding and pushes behavior into platform modules. That reduces the number of concepts a future entity author needs to understand. A sensor, binary sensor, light, or switch can declare its clusters and attributes directly. Diagnostics Already Show The Intended BoundaryZHA diagnostics skips virtual entities ( Recommended Merge GatesI would not merge PR #657 into a branch consumed by Core until these gates are covered. ZHA Gates
Core Gates
Suggested Core test entry points after a companion PR exists: pytest tests/components/zha/test_init.py
pytest tests/components/zha/test_device_action.py
pytest tests/components/zha/test_websocket_api.py
pytest tests/components/zha/test_diagnostics.pySuggested ZHA test entry points after uv run pytest tests/test_discover.py
uv run pytest tests/test_device.py
uv run pytest tests/test_sensor.py
uv run pytest tests/test_cluster_configs.pyThe exact test names should be narrowed once the new regressions are added. Overall AssessmentPR #657 is a serious and directionally valuable simplification of ZHA. The idea is worth pursuing: entity-owned cluster configuration and virtual entities are better long-term primitives than cluster handlers. The current implementation needs another compatibility and lifecycle pass before it should land. The cross-repo Core dependency is the blocking issue. The duplicate discovery, retry regression, and quirks metadata regression are smaller but still material because they affect real device setup behavior. My recommendation is to continue with this architecture, but treat the PR as a multi-repo migration rather than a ZHA-only refactor. The minimum acceptable landing package is:
|
|
Thanks @dmulcahey, fixed 😄 |
|
I think the event For the other points:
|
See #653 for a different approach.
As an example:
For an individual cluster, all entities referencing that cluster will have their individual cluster configurations combined and then aggregated into binding and attribute reporting requests.
This PR introduces the concept of virtual entities: these are entities that participate in discovery, binding reporting, and so on, same as any other. The only difference is that they have no public API and should be ignored by Core. They effectively replace cluster handlers for the very few circumstances we require something like them and actually cleans up a few hacks.
We should maybe introduce
versionfor entity configuration? Right now, changing a reporting interval or adding new attributes requires a user to manually click "reconfigure". An alternative would be to keep track of all binding and reporting config in the zigpy database anddiffit with the ZHA defaults. We'd need to track whether configuration has been automatically configured or if it was manually pushed, in case someone overrides reporting.This is part of / a requirement for: