fix: site-level handler overrides were ignored, org-level disable always won#1590
fix: site-level handler overrides were ignored, org-level disable always won#1590tkotthakota-adobe wants to merge 4 commits into
Conversation
|
This PR will trigger no release when merged. |
| if (disabledSites.includes(siteId)) { | ||
| return false; | ||
| } | ||
| if (enabledSites.includes(siteId)) { |
There was a problem hiding this comment.
Comment by @solaris007
Important: This flips existing config rows where a site is in enabled.sites and the org is in disabled.orgs from disabled to enabled at deploy time. Run a query against prod to enumerate affected rows before merging.
There was a problem hiding this comment.
I do not have prod access @solaris007 What is the best way to do this query?
solaris007
left a comment
There was a problem hiding this comment.
Hey @tkotthakota-adobe,
Strengths
- The new precedence ladder in
isHandlerEnabledForSite(configuration.model.js:186-207) reads cleanly top-to-bottom: site-disable > site-enable > org-disable > default > org-enable. Much easier to reason about than the old nested-if approach, and the local destructuring makes each check self-documenting. #updatedHandlernow keepsenabledanddisabledlists mutually exclusive on every transition (configuration.model.js:249-265), closing a stale-state hazard where the same entity could end up in both lists.- The
(handler.disabled[entityKey] || []).filter(...)pattern (configuration.model.js:250-251) fixes a real correctness issue - the old.filter(...) || []was dead code since.filter()always returns an array, but the parent could still beundefined. - Previously requested test for headline behavior change (Thread 2) has been addressed - configuration.model.test.js:248-256 directly asserts site-override-org-disable.
- No new dependencies, CI passing, well-scoped diff.
Issues
Important (Should Fix)
1. Write path does not deliver the site-override for enabledByDefault: true handlers (configuration.model.js:253)
#updatedHandler only adds the entity to enabled[entityKey] when !handler.enabledByDefault. For enabledByDefault: true handlers, calling enableHandlerForSite when the org is in disabled.orgs results in: site not added to enabled.sites -> isHandlerEnabledForSite falls through to disabledOrgs check -> returns false.
The read path claims "site enable overrides org disable" universally, but the write path only delivers it for non-default handlers. The test at line 248 passes because it uses addHandler to pre-populate enabled.sites directly, bypassing the public API.
This may not affect the immediate use case (paid-tier audits are typically enabledByDefault: false), but it's the same shape of bug this PR exists to fix - a latent inconsistency between what the config reads and what the API writes.
Fix: Always add to enabled[entityKey] when enabled === true, regardless of enabledByDefault. Remove the !handler.enabledByDefault guard at line 253.
2. Pre-merge gate: enumerate prod config rows affected by the behavior flip
Building on the open thread at line 197 - @solaris007 correctly flagged that any handler where (site in enabled.sites AND org in disabled.orgs) will flip from "audit off" to "audit on" at deploy time. The author replied "I do not have prod access."
Path forward: someone with prod read access fetches the S3 singleton config and enumerates the affected (handler, siteId, orgId) triples. This is a read-only operation on a single JSON blob - should take under 30 minutes. Based on the count:
- Zero: merge with confidence.
- Small (<10): list them in the PR description so on-call knows what to expect.
- Large: stage the rollout or pre-clean stale
enabled.sitesentries.
This is not a code fix but a deploy-readiness gate. The PR should not merge until this is answered.
3. New cleanup branches in #updatedHandler lack direct tests (configuration.model.test.js)
The PR adds write-side logic that removes entities from the opposite list on every transition (lines 250-251, 261-262 of the model). The existing tests start with handlers where the entity is NOT in the opposite list, so the cleanup is a no-op and never exercised. No test fails if those .filter(...) lines are removed.
Fix: Add round-trip tests, plus a enabledByDefault: false override test (the current test at line 248 uses enabledByDefault: true only). See inline comment for examples.
Minor (Nice to Have)
4. isHandlerEnabledForSite and isHandlerEnabledForOrg now use different precedence models, undocumented
After this PR, isHandlerEnabledForSite applies site-then-org precedence while isHandlerEnabledForOrg still uses "any disabled wins." A config where org O is in disabled.orgs and site S (in O) is in enabled.sites will report isHandlerEnabledForOrg(O) = false while isHandlerEnabledForSite(S) = true. This is likely intentional but undocumented.
Fix: Add a brief JSDoc on both methods noting the divergence.
5. "Site in both enabled.sites and disabled.sites" conflict resolution is implicit
The new code makes disabled.sites win (checked first at line 194). #updatedHandler keeps these disjoint on writes, but legacy data or manual edits could violate this. A one-line comment ("assumes enabled.sites and disabled.sites are disjoint; see #updatedHandler") or a test pinning the conflict resolution would make the invariant explicit.
Recommendations
- Add a DEBUG-level log line when the site-override fires (
enabledSites.includes(siteId) && disabledOrgs.includes(orgId)) to give on-call a grep target for post-deploy validation. - Update the PR description to include a "Deployment impact" section noting the behavior change for existing config rows.
- Consider adding a JSDoc block on
isHandlerEnabledForSiteenumerating the full precedence rules.
Assessment
Ready to merge? No, with fixes.
Reasoning: The read-path fix is clean and well-tested for the headline case, but the write path leaves the enabledByDefault: true override unreachable via the public API (finding 1), the new cleanup branches lack direct test coverage (finding 3), and the prod-state enumeration from Thread 1 remains unanswered (finding 2). Fixing findings 1 and 3 are small code changes; finding 2 is a read-only prod query that someone with access can run.
Next Steps
- Fix the write path in
#updatedHandlerto always add toenabled[entityKey]when enabling (finding 1). - Add the missing tests:
enabledByDefault: falseoverride, round-trip cleanup assertions (finding 3). - Resolve Thread 1 by running the prod config enumeration before merging (finding 2).
- Minor items (findings 4-5) are optional but improve future readability.
|
Hi @solaris007 thank you for the review. I addressed review comments. |
solaris007
left a comment
There was a problem hiding this comment.
Hey @tkotthakota-adobe,
Thanks for the thorough iteration. All five prior findings + two of three recommendations cleanly addressed. Two Important test gaps remain and the prod-enumeration thread (Important issue 2 from the prior review) needs a reviewer-side action - that one is on me, not you.
Strengths
Previously flagged issues now addressed:
- Important issue 1 (write path doesn't deliver site-override for
enabledByDefault: truehandlers): cleanly fixed atconfiguration.model.js:280-286. Theif (!handler.enabledByDefault)guard is gone;enableHandlerForSitenow reliably populatesenabled.sitesfor both default and non-default handlers, making the override actually reachable via the public API. The new comment ("Always add to enabled so isHandlerEnabledForSite can find the site in enabled.sites and return true even when the org is in disabled.orgs (site-level override)") accurately describes the behavior. - Important issue 3 (cleanup branches lack direct tests): addressed via three new tests. Test 2 ("removes site from disabled.sites when re-enabling for a non-default handler") genuinely exercises the new write path - handler starts with
disabled.sites: [siteId], afterupdateHandlerSites(..., true)both the cleanup branch AND the unconditional-add fix from Important issue 1 are exercised. Test 3 covers the symmetric case forenabledByDefault: truegoing through theelse ifbranch. - Minor issue 4 (precedence divergence undocumented): JSDoc on both methods is accurate. I traced
isHandlerEnabledForSiteline-by-line against the 6-step ladder and it matches. JSDoc explicitly calls out the divergence between the two methods. - Minor issue 5 (implicit conflict resolution between enabled.sites/disabled.sites): JSDoc on
isHandlerEnabledForSitedocuments both the disjointness assumption AND the invariant maintained by#updatedHandler. The "lists disjoint" assumption is now actively enforced by the writer rather than just assumed. - Recommendation 1 (DEBUG log on site-override fire): implemented at
configuration.model.js:213-215. The log fires only when the override is actually triggered (site inenabled.sitesAND org indisabled.orgs). Includestype,siteId,orgIdfor correlation. - Recommendation 3 (JSDoc precedence enumeration): addressed via the Minor issue 4 fix.
This iteration:
- Write-path simplification reads cleanly.
#updatedHandlerno longer branches onenabledByDefaultfor the enable case; always-add-to-enabled / always-remove-from-disabled is easier to reason about and matches the new read path. - Asymmetric precedence model (site-overrides-org for
isHandlerEnabledForSite; org-disable-always-wins forisHandlerEnabledForOrg) is now intentional and self-documented. The JSDoc contrasts the two methods explicitly. The asymmetry maps cleanly to how each method is used at its respective fan-out boundary (per-site vs per-org consumer in jobs-dispatcher reports). - The 6-step precedence table is now the de-facto contract. Future readers can cite it instead of reverse-engineering branching code.
- Authorization model integrity is sound.
(enabled.sites, disabled.orgs)configuration is written only by trusted admin paths inspacecat-api-service(adminx-api-keycontrollers, internal onboarding flows, Slack admin commands). End users cannot self-promote a site intoenabled.sites. The "site overrides org" semantics are an explicit, policy-driven business rule (paid upgrade), not a privilege-escalation surface. - DEBUG log payload
{ type, siteId, orgId }contains only non-secret identifiers. No PII, tokens, or credentials. - Diff is tightly scoped to addressing prior findings; no drive-by changes.
Issues
Important (Should Fix)
-
Headline production scenario has no end-to-end test. The PR exists to fix the free-trial -> paid upgrade path, but no test exercises the full flow:
updateHandlerSites('paid-handler', siteId, true)on a state where the org is indisabled.orgs, then assertisHandlerEnabledForSite('paid-handler', site)is true. The new "paid-handler" test is read-path-only (usesaddHandlerto pre-populate, bypassing the write path); Test 2 exercises the write path but without org indisabled.orgs. Add one combining test - this is what tells future maintainers what the PR is for. See inline comment for the test shape. -
enabledByDefault: falsedisable branch is not directly tested.#updatedHandlernow has three write paths: enable (any handler), disable withenabledByDefault: true, and disable withenabledByDefault: false(theelsebranch atconfiguration.model.js:292-294). The first two are covered; the third is not. With this PR's "always add to enabled on enable" change, the asymmetry of the third branch is load-bearing: a paid handler that gets enabled (now inenabled.sites) and then disabled must NOT linger as enabled. See inline comment for the test shape.
Recommendations
- Prod-state enumeration (prior Important issue 2) - shifting to my side. Author correctly stated they lack prod access; this is a deployment-readiness gate, not a code fix. I will run the query against the prod Configuration singleton: pull the current S3 config, compute
intersect(enabled.sites, sitesOf(disabled.orgs))per handler, and report the count of (handler, siteId, orgId) triples that flip from disabled to enabled at deploy. If zero or small + expected, merge proceeds; if surprising, the team can decide whether to merge-as-is, pre-clean, or notify affected customers. Tracking this on the PR before approval. - DEBUG vs INFO for the override log. The author followed the prior review recommendation (DEBUG). Worth re-evaluating: DEBUG is typically filtered out in prod observability, so the post-deploy validation purpose may not work without changing log filters. For the first 30 days post-deploy, INFO would make the canary signal greppable by on-call without filter changes; afterwards demote to DEBUG. Reasonable counter-argument: keep DEBUG and document in the rollout runbook. Author's call - this reverses my prior recommendation, which I now think was suboptimal for the canary purpose.
- Add "Deployment impact" section to PR description (recommendation 2 from prior review, still not addressed). One paragraph noting: configs where site-in-enabled-sites + org-in-disabled-orgs flip from disabled to enabled at deploy time, the prod-enumeration result once it is run, and the rollback path. The merge commit body should be accurate.
- Document the rollback recovery path. If a customer reports unexpected audits-on after deploy: Path A (small blast radius) - clean up the prod config singleton (single S3 write, recovers immediately). Path B (large blast radius or unclear scope) - revert this PR in spacecat-shared, publish patch release, bump consumers (slow, multi-repo). Worth noting in the PR description or a follow-up runbook.
- Future: precedence test matrix derived from JSDoc. The 6-row precedence ladder is now the de-facto contract. A small data-driven test where each row of the table is one assertion would lock the spec and the implementation together and prevent silent drift. Cheap to add, exactly the shape of test that does not regress when refactoring the body. Out of scope for this PR; worth a follow-up issue.
Minor
-
Disjointness invariant lacks a pinning test. JSDoc says "disabled.sites and enabled.sites are assumed disjoint; disabled.sites wins if violated." A one-line test seeding both lists with the same
siteIdand assertingfalsewould prevent silent precedence drift on legacy data. -
Idempotency of enable not pinned.
Array.from(new Set([...]))deduplication is the only thing keepingenabled.sitesfrom growing unbounded ifupdateHandlerSites(..., true)is called repeatedly. A single test assertingenabled.sites.length === 1after two consecutive enables would lock this behavior. -
updateHandlerOrgscleanup not directly tested. Same#updatedHandlercode path as sites, but the read-side semantics for orgs are now formally asymmetric (no site-level override). One test verifyingupdateHandlerOrgs(type, orgId, true)followed byisHandlerEnabledForOrg(...)returns true after a prior disable would mirror the site-side assertion.
Assessment
Ready to merge? With fixes.
Reasoning: code-level work is clean. The write-path simplification reads well, the JSDoc captures the contract, and the cleanup branch tests address the prior review properly. Two Important test gaps remain - the headline end-to-end scenario and the enabledByDefault: false disable branch - both small additions. The prod-enumeration deployment-readiness gate (prior Important issue 2) is now reviewer-side; I will run the query and report results on the PR before approval.
Next Steps
- Add the two missing tests (Important issues 1 and 2). Both are small additions to the existing test file.
- Reviewer (me) runs the prod-state enumeration before approving. I will paste results into the PR.
- Decide on the DEBUG vs INFO log level question.
- Optional: update PR description with deployment impact and rollback path; add the Minor tests if you have appetite for completeness.
| enabledByDefault: false, | ||
| enabled: { sites: [site.getId()], orgs: [] }, | ||
| disabled: { sites: [], orgs: [site.getOrganizationId()] }, | ||
| }); |
There was a problem hiding this comment.
Important - the headline production scenario lacks an end-to-end test.
The PR exists to fix the free-trial -> paid upgrade path: org has handler X with enabledByDefault: false and is in disabled.orgs; we upgrade site S via updateHandlerSites('X', siteId, true); we expect isHandlerEnabledForSite('X', S) to return true afterwards. The new "paid-handler" test above is a READ-PATH-only assertion - it pre-populates state directly via addHandler and never calls updateHandlerSites. Test 2 below ("removes site from disabled.sites when re-enabling for a non-default handler") exercises the write path but with disabled.orgs: [], so the override-against-disabled-org case is not asserted post-write.
Without a test that combines write + read, a future regression that breaks updateHandlerSites for non-default handlers (e.g., someone re-introducing the if (!enabledByDefault) guard the prior review asked you to remove) would not be caught: the read-path "paid-handler" test would still pass because it bypasses the write path. Add one test:
it('upgrades non-default handler from disabled org via updateHandlerSites', () => {
instance.addHandler('paid-handler', {
enabledByDefault: false,
enabled: { sites: [], orgs: [] },
disabled: { sites: [], orgs: [site.getOrganizationId()] },
});
instance.updateHandlerSites('paid-handler', site.getId(), true);
expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.true;
});This is the test that tells future maintainers what the PR is for.
|
|
||
| instance.updateHandlerSites('default-cleanup-handler', site.getId(), false); | ||
|
|
||
| expect(instance.getHandler('default-cleanup-handler').enabled.sites).to.not.include(site.getId()); |
There was a problem hiding this comment.
Important - enabledByDefault: false disable branch is not directly tested.
#updatedHandler has three write paths after this PR's simplification:
- Enable: remove from disabled + add to enabled (any handler) - covered by Test 2 ("cleanup-handler")
- Disable +
enabledByDefault: true: add to disabled + remove from enabled - covered by Test 3 ("default-cleanup-handler") - Disable +
enabledByDefault: false(theelsebranch atconfiguration.model.js:292-294): only remove from enabled, do NOT add to disabled - NOT covered.
With the new "always add to enabled on enable" change in this PR, the asymmetry of branch 3 is now load-bearing: a paid handler that gets enabled (now in enabled.sites) and then disabled must NOT linger in enabled.sites. Add at least:
it('removes site from enabled.sites when disabling for a non-default handler', () => {
instance.addHandler('paid-disable-handler', {
enabledByDefault: false,
enabled: { sites: [site.getId()], orgs: [] },
disabled: { sites: [], orgs: [] },
});
instance.updateHandlerSites('paid-disable-handler', site.getId(), false);
expect(instance.getHandler('paid-disable-handler').enabled.sites).to.not.include(site.getId());
expect(instance.getHandler('paid-disable-handler').disabled.sites).to.not.include(site.getId());
});This pins the contract that disabling a non-default handler is "not in either list" (default-off applies), distinct from disabling a default handler ("in disabled list, default-on overridden").
|
Hi @solaris007 Let me what you find from production data and if anything else needed from me. |
solaris007
left a comment
There was a problem hiding this comment.
Hey @tkotthakota-adobe,
Returning to this after the 2026-05-08 review with no new commits to assess. The Round 1 fixes still hold cleanly, the two Round 2 test gaps remain, and tracing the consumer call sites surfaces a system-level consequence of the asymmetric precedence model that should be documented before merge.
Strengths
The prior round's findings stay addressed in commit 7462cfe3:
- Read-path rewrite at
configuration.model.js:192-226reads top-to-bottom with destructured array bindings; the 6-step precedence ladder in JSDoc (lines 177-191) matches the code 1:1 and is a usable on-call reference. - Write-path simplification at
configuration.model.js:279-295: enable is unconditional remove-from-disabled + add-to-enabled (the Round 1 Important 1 fix), disable is symmetric. Inline comments accurately describe intent. - JSDoc on
isHandlerEnabledForOrg(lines 228-239) explicitly contrasts itself with the site variant - the asymmetry is now named, not implicit. - Cleanup branches have direct tests: Test 4 (line 268) exercises re-enable from disabled state, Test 5 (line 281) exercises disable cleanup of
enabled.sitesfor default handlers. - Disjointness invariant documented in JSDoc (line 188) and actively enforced by
#updatedHandler. - DEBUG log on override fire (line 214) is well placed and the payload
{ type, siteId, orgId }is non-sensitive. - Authorization model verified independently: the public write paths (
updateHandlerSites,enableHandlerForSite) are reached only through admin-gated controllers (sites-audits-toggle.jsshort-circuits onaccessControlUtil.hasAdminAccess()), internal onboarding flows, and the Slack admin command. End users cannot self-promote a site intoenabled.sites. - Diff is tightly scoped to the bug fix and prior review feedback.
Issues
Important (Should Fix)
-
Headline production scenario has no end-to-end test (
test/unit/models/configuration/configuration.model.test.js).Re-raising from Round 2. The PR exists for the free-trial -> paid upgrade path, but no test composes write-path-then-read-path with the full org-disabled context. Test 3 (line 258) pre-populates
enabled.sitesviaaddHandler(read path only). Test 4 (line 268) exercises the write path but without org indisabled.orgs. Neither tests the exact scenario the PR is for. A future refactor of#updatedHandlercould silently regress the headline behavior with all tests still green.Fix: one chained test, approximately 10 lines:
it('enables a paid handler for a site whose org is in disabled.orgs (free-trial -> paid)', () => { instance.addHandler('paid-handler', { enabledByDefault: false, enabled: { sites: [], orgs: [] }, disabled: { sites: [], orgs: [site.getOrganizationId()] }, }); expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.false; instance.updateHandlerSites('paid-handler', site.getId(), true); expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.true; expect(instance.getHandler('paid-handler').enabled.sites).to.include(site.getId()); });
-
enabledByDefault: falsedisable branch is reached but not behaviorally asserted (configuration.model.js:292-294).Re-raising from Round 2. The existing test at line 349-353 does exercise this branch through
disableHandlerForSite('organic-keywords', site), but its assertion isexpect(...disabled.sites).to.not.include(siteId)- a tautology sinceorganic-keywords.disabled.siteswas never populated by the prior enable. The branch's actual job (removing siteId fromenabled.sitesso the round-trip leaves no stale override) is not asserted.With this PR's "always add to enabled on enable" change, a paid handler can land in
enabled.sitesfor sites whose org is indisabled.orgs. If the disable path'senabled.sitesfilter is broken or removed in a future refactor, a disabled handler would still read true via step 2 of the precedence ladder. Line-coverage tools would not catch it.Fix: add
expect(...enabled.sites).to.not.include(siteId)to the line 349-353 test, or add a dedicated round-trip test on a paid handler. 3-5 lines. -
Cross-system consequence of the asymmetric model is not documented (
configuration.model.js:228-239JSDoc onisHandlerEnabledForOrg, PR description).Tracing the consumer call sites:
isHandlerEnabledForOrgis used as a gate inspacecat-jobs-dispatcher/src/handlers/reports.js(filters orgs before sending org-targeted report payloads) andspacecat-reporting-worker/src/digest/handler-external.js(when a digest message arrives keyed onorgId, returnsnoContent()if the org is disabled AND enumerates all sites in the org without re-checking per site).Implication of the new asymmetric semantics: a paid site in
enabled.siteswhose org is indisabled.orgswill start receiving site-targeted reports (correct, this is what the PR enables) but will NOT receive org-level digests routed throughorgId- the digest handler short-circuits at the org gate. The PR's stated invariant "site-level handler overrides were ignored, org-level disable always won" still holds in the digest fan-out path.This is an in-repo doc fix, not a cross-repo code fix. Two ways forward, your call:
- (a) Declare it intentional: org-level reports follow org policy, site-level reports follow site policy. Document in the JSDoc on
isHandlerEnabledForOrg("intended for org-scoped report fan-out; site-level overrides do not propagate here") and in the PR description. - (b) Acknowledge the gap and file a Jira for the digest handler to re-check per-site. Link it from the PR description as a known limitation.
Either is defensible. The point is the asymmetric model is documented at the method level but the cross-system consequence is invisible today.
- (a) Declare it intentional: org-level reports follow org policy, site-level reports follow site policy. Document in the JSDoc on
-
DEBUG log level defeats the canary purpose (
configuration.model.js:214).The log fires only when the override path is taken - exactly the post-deploy validation signal. In SpaceCat Lambdas shipped to Coralogix, DEBUG is typically filtered before shipment. For the first 30 days post-deploy, on-call needs a greppable canary signal without per-subsystem filter changes. If the prod-state enumeration count is non-zero (Round 2 reviewer-side gate), the override fires on production traffic and the responder needs to confirm the behavior matches expectations.
Fix: change
this.log.debug(...)tothis.log.info(...)for the first 30 days, then demote in a follow-up commit. Track the demotion ticket so it does not become permanent INFO noise. Author's call: if you prefer keeping DEBUG, write the grep target into the rollout runbook explicitly. -
Rollback path is undocumented in the PR description.
The Configuration is a versioned S3 singleton, so two rollback levers exist with very different cost profiles:
- Path A (preferred for small blast radius): restore the prior S3 object version, or write a corrective config that removes the now-conflicting
(enabled.sites, disabled.orgs)rows. Recovers immediately. No redeploy. - Path B (large blast radius or root-cause unclear): revert this PR, publish a patch release, bump consumers in spacecat-api-service, spacecat-audit-worker, spacecat-autofix-worker, spacecat-jobs-dispatcher, spacecat-reporting-worker. Multi-repo, multi-hour.
Fix: add a "Rollback" section to the PR description naming the S3 versioned-object lever first, then the revert-and-bump fallback.
- Path A (preferred for small blast radius): restore the prior S3 object version, or write a corrective config that removes the now-conflicting
Minor (Nice to Have)
-
Disjointness invariant lacks a pinning test (test file). JSDoc says "disabled.sites and enabled.sites are assumed disjoint; disabled.sites wins if violated." A one-line test seeding both lists with the same
siteIdand assertingfalsewould pin the conflict resolution against silent precedence drift on legacy data. -
Idempotency of enable not pinned (test file).
Array.from(new Set([...]))deduplication is the only thing keepingenabled.sitesfrom growing unbounded ifupdateHandlerSites(..., true)is called repeatedly (relevant for at-least-once admin endpoints). A single test assertingenabled.sites.length === 1after two consecutive enables would lock this behavior. -
New write-shape side effect on default-enabled handlers is invisible to tests (test file). The existing tests for
enabledByDefault: truehandlers do not assert thatenabled[entityKey]now grows on enable (it stayed empty pre-PR). Downstream consumers ofgetEnabledSiteIdsForHandlerand direct readers ofhandler.enabled.*may see unexpected new entries. Tighten the existing two tests to also assert the entity now appears inenabled[entityKey]. -
Hot-path log-volume risk (
configuration.model.js:213-215).isHandlerEnabledForSiteis called per-site in the jobs-dispatcher fan-out (audits, reports handlers). If the prod-state enumeration shows N affected (handler, siteId, orgId) triples, the log fires N times per dispatch tick. For a 5-min cron with thousands of sites, this multiplies. If the enumeration is non-trivial, sample the log (e.g. once per minute per handler-org pair) or move to a counter. Deferred until prod enumeration result is known. -
Downstream consumer set not enumerated in PR description. The fix lives in spacecat-shared but the runtime behavior change touches spacecat-api-service, spacecat-audit-worker, spacecat-autofix-worker, spacecat-jobs-dispatcher, and spacecat-reporting-worker (all read the same S3 singleton at runtime). Listing the consumers makes the deploy posture explicit for reviewers and on-call.
Recommendations
- Codify the precedence ladder as a parametric test: the 6-row JSDoc table is now the de-facto contract for
isHandlerEnabledForSite. A small data-driven test where each row is one assertion would lock spec and implementation together; this is the kind of test that does not regress when the body is refactored. Out of scope for this PR; worth a follow-up issue. - Move the prod-state enumeration into a PR template question, not a reviewer-side comment thread: Round 1 / Round 2 raised it as Important; the author correctly noted they lack prod access; the reviewer agreed to run it. That pattern places deploy-readiness knowledge in a reviewer's head. For behavior-flip PRs of this shape, the template should require a "Deployment impact" section with the enumeration answered before review. Workspace-level follow-up.
- Forward-looking: align with the Entitlement model: as more handlers move to paid tiers, the
Configuration.handler.enabled/disabledpartition becomes the de-facto product entitlement system. The codebase already has an Entitlement model. Two sources of truth for "is this site entitled to this handler" is a brittle pattern. Worth a design conversation: should handler enable/disable per (site, org) be derived from Entitlements, or vice versa? Out of scope for this PR. - Post-deploy validation hooks: once Important 4 is resolved (DEBUG -> INFO), schedule explicit post-deploy checks at T+1h, T+24h, T+7d: count of override log fires, any unexpected handler types, consumer-side error spikes (
[isHandlerEnabledForSite]error logs). Pairs with the rollback path documentation. - Add a write-time audit log at the admin entry points (sites-audits-toggle controller, Slack command, onboarding flows): the current DEBUG read-path log captures override evaluation, but the security-relevant event - WHO promoted the site WHEN - is not logged at the boundary. INFO-level audit at the controller layer with admin principal, target siteId/orgId, handler type, and before/after override state. Out of scope for this PR.
Out of scope, worth tracking
- Reviewer-side prod-state enumeration (Round 1 Important 2, owned by solaris007): legitimate deploy-readiness gate, not a code change. Should be resolved before merge.
- Cross-system fan-out gap if treated as a follow-up (Important 3 option b): Jira-tracked, not just commented.
getEnabledSiteIdsForHandler(type)will now return different (non-empty) data for default-enabled handlers after any explicit re-enable. If any consumer treats "site in enabled.sites" as "explicitly opted in" vs "default", that semantic distinction is now lost. Worth a grep follow-up in dependent repos.- The S3 singleton write protection is unchanged by this PR; the admin-only argument depends on IAM/bucket policy enforcement, which is an infra review item separate from this code.
- Migration to
PlatformConfiguration(mysticat-data-service) is in flight; the new API will need to inherit these precedence semantics. - Periodic reconciliation job that flags entries in
enabled.siteswhose org is indisabled.orgsagainst an entitlement source of truth. Hygiene control. - Precedence-table parametric test as a follow-up issue.
events2metricsrule on the override log line for dashboarding.
Assessment
Ready to merge? No, with fixes.
Reasoning: code-level work is clean and the architectural moves hold up. Three items keep this from a clean approval - two unresolved Round 2 test gaps (Importants 1 and 2, both small additions to an existing file), one new system-level consequence that should be named in JSDoc or the PR description (Important 3), one operational signal that should be addressable in this PR (Important 4 - one-line log level change), and the rollback path that should be in the description before merge (Important 5). The reviewer-side prod enumeration remains a deploy-readiness gate, not a code fix.
The long-term direction (Configuration handler enable/disable becoming a de-facto entitlement engine, alignment with the Entitlement model) deserves its own design pass before more paid handlers land. Not this PR's job to fix.
Next Steps
- Add the headline end-to-end test (Important 1).
- Tighten the
enabledByDefault: falsedisable branch assertion (Important 2). - Pick (a) or (b) for the cross-system fan-out gap and write it down (Important 3).
- Change DEBUG to INFO for the override log for the first 30 days (Important 4).
- Add a Rollback section to the PR description (Important 5).
- Reviewer-side prod-state enumeration before approving (out of scope, owned by solaris007).
- Minor items are optional but several are small additions to the existing test file.
| disabled: { sites: [], orgs: [site.getOrganizationId()] }, | ||
| }); | ||
|
|
||
| expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.true; |
There was a problem hiding this comment.
Important: headline end-to-end test missing (Round 2 unresolved).
The PR fixes the free-trial -> paid upgrade path, but no test composes write-path-then-read-path with the full org-disabled context. Test 3 (this test) pre-populates enabled.sites via addHandler (read path only). Test 4 below exercises the write path but without org in disabled.orgs. Neither tests the exact scenario the PR is for.
Fix: one chained test, approximately 10 lines:
it('enables a paid handler for a site whose org is in disabled.orgs (free-trial -> paid)', () => {
instance.addHandler('paid-handler', {
enabledByDefault: false,
enabled: { sites: [], orgs: [] },
disabled: { sites: [], orgs: [site.getOrganizationId()] },
});
expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.false;
instance.updateHandlerSites('paid-handler', site.getId(), true);
expect(instance.isHandlerEnabledForSite('paid-handler', site)).to.be.true;
expect(instance.getHandler('paid-handler').enabled.sites).to.include(site.getId());
});| instance.updateHandlerSites('default-cleanup-handler', site.getId(), false); | ||
|
|
||
| expect(instance.getHandler('default-cleanup-handler').enabled.sites).to.not.include(site.getId()); | ||
| expect(instance.getHandler('default-cleanup-handler').disabled.sites).to.include(site.getId()); |
There was a problem hiding this comment.
Important: enabledByDefault: false disable branch is reached but not behaviorally asserted (Round 2 unresolved).
The existing test at line 349-353 exercises this branch through disableHandlerForSite('organic-keywords', site), but its assertion is expect(...disabled.sites).to.not.include(siteId) - a tautology since organic-keywords.disabled.sites was never populated by the prior enable. The branch's actual job (removing siteId from enabled.sites so the round-trip leaves no stale override) is not asserted.
With this PR's "always add to enabled on enable" change, a paid handler can land in enabled.sites for sites whose org is in disabled.orgs. If the disable path's enabled.sites filter is broken or removed in a future refactor, a disabled handler would still read true via step 2 of the precedence ladder. Line-coverage tools would not catch it.
Fix: add expect(...enabled.sites).to.not.include(siteId) to the line 349-353 test, or add a dedicated round-trip test on a paid handler. 3-5 lines.
| * Note: unlike isHandlerEnabledForSite, there is no site-level override here. | ||
| * An org in disabled.orgs is always disabled regardless of site-level entries. | ||
| */ | ||
| isHandlerEnabledForOrg(type, org) { |
There was a problem hiding this comment.
Important: cross-system consequence of the asymmetric model is not documented.
Tracing the consumer call sites: isHandlerEnabledForOrg is used as a gate in spacecat-jobs-dispatcher/src/handlers/reports.js and spacecat-reporting-worker/src/digest/handler-external.js (the latter enumerates all sites in the org without re-checking per site when the org is disabled).
Implication: a paid site in enabled.sites whose org is in disabled.orgs will receive site-targeted reports (correct, this is what the PR enables) but will NOT receive org-level digests routed through orgId - the digest handler short-circuits at the org gate. The PR's stated invariant "site-level handler overrides were ignored, org-level disable always won" still holds in the digest fan-out path.
This is an in-repo doc fix. Pick one:
- (a) Declare it intentional: extend this JSDoc to say "intended for org-scoped report fan-out; site-level overrides do not propagate here", document in PR description.
- (b) File a Jira for the digest handler to re-check per-site, link from PR description as a known limitation.
Either is defensible. The asymmetric model is documented at the method level but the cross-system consequence is invisible today.
| } | ||
| if (enabledSites.includes(siteId)) { | ||
| if (disabledOrgs.includes(orgId)) { | ||
| this.log.debug('[isHandlerEnabledForSite] site-override: site in enabled.sites overrides org in disabled.orgs', { type, siteId, orgId }); |
There was a problem hiding this comment.
Important: DEBUG log level defeats the canary purpose.
This log fires only when the override path is taken - exactly the post-deploy validation signal. In SpaceCat Lambdas shipped to Coralogix, DEBUG is typically filtered before shipment. For the first 30 days post-deploy, on-call needs a greppable canary signal without per-subsystem filter changes. If the prod-state enumeration count (reviewer-side gate) is non-zero, the override fires on production traffic and the responder needs to confirm the behavior matches expectations.
Fix: change this.log.debug(...) to this.log.info(...) for the first 30 days, then demote in a follow-up commit. Track the demotion ticket so it does not become permanent INFO noise. If you prefer keeping DEBUG, write the grep target into the rollout runbook explicitly.
|
Hi @solaris007 I must have missed it. Here is what addressed
Minor (all done):
|
solaris007
left a comment
There was a problem hiding this comment.
Hey @tkotthakota-adobe,
Thanks for the thorough iteration on Round 3 feedback. Most prior findings are cleanly addressed; two operational items remain and one Minor test gap from Round 3 was not picked up.
Strengths
Previously flagged issues now addressed (Round 3):
- Important 1 (headline e2e test missing): closed at configuration.model.test.js:268-279. The new test composes the full write-then-read round-trip starting from
disabled.orgs=[orgId], assertsfalseinitially, callsupdateHandlerSites(..., true), then verifies both the read predicate flips totrueANDenabled.siteswas populated. Exactly the production scenario. - Important 2 (enabledByDefault: false disable round-trip): closed at configuration.model.test.js:385-391. Both write-shape sides pinned with
enabled.sitesincludes before / excludes after.organic-keywordsisenabledByDefault: false, so the test exercises the correct branch of#updatedHandler. - Important 3 (cross-system consequence not documented): closed at configuration.model.js:240-246. JSDoc verified accurate against both consumer call sites - in
spacecat-jobs-dispatcher/src/handlers/reports.js,processInternalruns two independent fan-outs (sites viacreateSiteMapper, orgs viacreateOrgsMapper), so a paid site inenabled.siteswhose org is indisabled.orgsdoes receive a siteId payload but is excluded from the org fan-out exactly as the JSDoc claims. The reporting worker re-appliesisHandlerEnabledForOrgas a second gate on the orgId branch. The "asymmetry is intentional" sentence is a faithful summary. - Important 4 (DEBUG log defeats canary purpose): closed at configuration.model.js:214.
- Minor 6 (disjointness pinning test): closed at configuration.model.test.js:307-315.
- Minor 7 (idempotency test): closed at configuration.model.test.js:317-328.
This iteration:
- Log payload
{ type, siteId, orgId }confirmed non-sensitive; supply-chain surface unchanged (nopackage.json/lockfile delta). - Authorization model integrity holds; the new commit is auth-neutral.
- JSDoc + new behavioral tests together form a coherent contract.
Issues
Important (Should Fix)
1. Rollback section in PR description still missing; the procedural pushback should not stand.
You wrote in the latest top-level comment: "Rollback section in PR description - needs to be written by you." Respectfully, that inverts the standard. The person changing behavior is the person who knows what to undo and in what order. The reviewer's job is to require the section, not to write it.
This PR specifically needs two levers, both of which you are best positioned to specify:
- Path A (small blast radius, data fix): if one or a small number of (siteId, orgId) tuples re-enable incorrectly after deploy, restore the prior S3 Configuration object version (versioning is enabled) OR write a corrective config removing the siteId from
enabled.sitesfor the affected handler. No code revert; resolves in minutes. - Path B (large blast radius, code fix): if the read-path logic is broadly wrong, revert this PR, publish a patch release of
spacecat-shared, bump the dependency in 5 consumers (spacecat-api-service,spacecat-audit-worker,spacecat-autofix-worker,spacecat-jobs-dispatcher,spacecat-reporting-worker). Each is a separate deploy. Recovery time scales with the slowest consumer.
The PR description should call out which signal triggers which path (e.g., 1-2 spurious site re-enables in the INFO override log -> Path A; precedence test regressions or broad behavior change in audit fan-out -> Path B). The merge commit body inherits the PR description; this is the document on-call reads when the page fires.
2. INFO log demotion follow-up is not tracked.
Round 3 explicitly required: "change to INFO for the first 30 days, then demote in a follow-up commit. Track the demotion ticket so it does not become permanent INFO noise." The DEBUG -> INFO change at configuration.model.js:214 is done, but no Jira ticket or TODO marker exists for the demotion. Without it, INFO becomes permanent and the canary signal degrades into background log volume.
Fix: file a Jira ticket for the T+30 demotion and add a code comment near line 213 referencing it (e.g., // TODO(SITES-NNNNN): demote to debug after 30-day canary window). One line of code, one ticket.
Minor (Nice to Have)
3. Minor 8 from Round 3 (write-shape side effect on default-enabled handlers) remains unresolved.
The existing tests for enabledByDefault: true handlers around configuration.model.test.js:330 and configuration.model.test.js:431 still assert only the negative case (disabled.sites does not include siteId). With this PR's write-path simplification, updateHandlerSites('<default-handler>', siteId, true) now also grows enabled.sites to [siteId] (previously stayed empty for default-true handlers). A regression where #updatedHandler stopped growing enabled.sites would slip past tests today.
One-line fix in at least one of those tests: expect(instance.getHandler('404').enabled.sites).to.include(site.getId());
4. No paired test for the documented asymmetry.
The JSDoc at configuration.model.js:240-246 makes a substantive behavioral claim - site override does NOT propagate to org-level. No test pins this. A two-line addition asserting both isHandlerEnabledForSite (true) and isHandlerEnabledForOrg (false) on the same handler with site-in-enabled-sites + org-in-disabled-orgs would lock the asymmetry against future drift.
5. Tautology kept alongside new assertion in "disables a handler for a site" test.
expect(...disabled.sites).to.not.include(site.getId()) remains in the test at configuration.model.test.js:387. For organic-keywords (enabledByDefault: false), the disable path enters the else branch and never touches disabled.sites, so this assertion is structurally guaranteed to pass. The new enabled.sites assertion is the real check; the dead line can be removed for clarity. Pure cleanup.
Recommendations
- Optional: emit a counter alongside the INFO log keyed only on handler
type(cardinality-bounded).configuration.override.site_over_org{type=X}gives Coralogix-independent observability and supports alerting once the canary window closes. Cheaper long-term signal than INFO logs. - Rephrase the cross-system JSDoc note more abstractly to avoid naming specific downstream services. If either changes which method gates their fan-out, the comment goes stale silently. Phrase: "callers that fan out on org policy" or point to a cross-repo spec.
- Rename the new test handler at configuration.model.test.js:268-279 to
paid-handler-migrationor similar - the namepaid-handleris reused from the immediately preceding test, and whilebeforeEachmakes it safe, it's mildly confusing when diffing the file. - Assert
mockLogger.infowas called in the new e2e test if you want the canary log itself pinned by the test suite. Optional. - Forward-looking: handler enable/disable is becoming a de-facto entitlement engine. Configuration's
isHandlerEnabledForSite/isHandlerEnabledForOrgare now expressing policy with non-obvious override semantics. The codebase already has an Entitlement model. Over time this deserves a dedicated abstraction. Not for this PR; roadmap.
Out of scope, worth tracking
- Inverse asymmetry on the reporting-worker org-digest path:
spacecat-reporting-worker/src/digest/handler-external.jscallsSite.allByOrganizationId(orgId)and does NOT filter the returned sites throughisHandlerEnabledForSite. A site indisabled.siteswhose org is inenabled.orgsis still included in the org-level digest. In-consumer behavior, separate PR if it turns out to be wrong. - Prod-state enumeration (Round 1 Important 2) still owned by reviewer (me). Tracking.
Assessment
Ready to merge? With fixes.
Reasoning: code-level work is clean and the Round 3 Important findings 1-4 plus Minors 6-7 are all properly addressed. Two operational items remain - Rollback section in PR description (the author should write, not the reviewer), and a tracked demotion follow-up for the INFO log (one Jira + one code comment). Plus one Minor test gap from Round 3 (Minor 8) that wasn't picked up, and two new Minor cleanup items in the test file. None of these are code redesign; all are quick fixes to land.
Next Steps
- Write the Rollback section in the PR description (Path A + Path B + which signal triggers which).
- File a Jira ticket for INFO -> DEBUG demotion at T+30 days and add a TODO comment referencing it near configuration.model.js:213.
- Add a one-line assertion to one of the existing default-true handler tests to pin the new
enabled.sitesgrowth behavior (Minor 8). - Optional: add the asymmetry-pinning test, drop the tautology assertion, and apply other Minor cleanups.
- Reviewer-side (me): run the prod-state enumeration before approving.
https://jira.corp.adobe.com/browse/SITES-41916
Problem:
Site-level handler overrides were ignored — org-level disable always won, so upgrading a site from
free-trial to paid couldn't re-enable audits. Also, enabling a handler could crash with a TypeError if
the disabled.sites array didn't exist.
isHandlerEnabledForSite now checks site-level before org-level — site enable overrides org disable
#updatedHandler cleans up both enabled/disabled lists to keep them consistent
Safe array access with || [] fallback to prevent TypeError on missing keys
Rollback
Path A — small blast radius (data fix, no code revert)
Trigger: 1–2 spurious site re-enables visible in the [isHandlerEnabledForSite] site-override INFO log, or a small number of reported incorrect audit
invocations.
Path B — large blast radius (code revert)
Trigger: Precedence test regressions, broad unexpected behavior change in audit fan-out across multiple sites/orgs, or [isHandlerEnabledForSite] INFO log
firing at unexpected volume.