Skip to content

feat(allocation-policy): Per-org overrides for bytes-scanned policy#7975

Open
phacops wants to merge 2 commits into
masterfrom
feat/per-org-bytes-scanned-overrides
Open

feat(allocation-policy): Per-org overrides for bytes-scanned policy#7975
phacops wants to merge 2 commits into
masterfrom
feat/per-org-bytes-scanned-overrides

Conversation

@phacops
Copy link
Copy Markdown
Contributor

@phacops phacops commented May 28, 2026

Add per-organization controls to BytesScannedRejectingPolicy so individual noisy orgs can be tuned without affecting every other organization.

Per-org scan-limit overrides

Two new configs override the sliding-window scan limit on the organization branch:

  • organization_referrer_scan_limit_override, keyed by (organization_id, referrer)
  • organization_scan_limit_override, keyed by organization_id

Overrides are resolved in order of specificity, with the first one set winning:

(organization_id, referrer)
  > organization_id
  > (all orgs, referrer)      # existing referrer_all_organizations_scan_limit_override
  > default                    # existing organization_referrer_scan_limit

Per-org max_bytes_to_read cap

Two more configs forward a hard max_bytes_to_read value to ClickHouse and bypass the sliding-window check entirely:

  • organization_referrer_max_bytes_to_read, keyed by (organization_id, referrer)
  • organization_max_bytes_to_read, keyed by organization_id

When set, the query is allowed to run at full threads with the cap applied; ClickHouse aborts it if it would scan more than the cap. (org_id, referrer) wins over org_id. This complements the existing global limit_bytes_instead_of_rejecting flow, which only caps queries after a tenant exceeds its scan limit.

The project branch and cross-org behavior are unchanged.

Drive-by fixes

Touching the file caused the pre-commit hook to flag pre-existing lint/mypy issues in the same files (E712 truthiness comparisons, untyped tenant_ids dicts, ResourceIdentifier arg type, and a # type: ignore that drifted off the offending line after ruff reformatted it). These are fixed in the same PR since the hook blocks otherwise.

…ngPolicy

Add two new Configuration entries on BytesScannedRejectingPolicy:

- organization_referrer_scan_limit_override, keyed by
  (organization_id, referrer)
- organization_scan_limit_override, keyed by organization_id

Previously the only way to override the scan limit for the
organization branch was the per-referrer override that applied to
every organization, which made it impossible to tune the limit for a
single noisy org without affecting everyone else.

Overrides on the organization branch are now resolved in order of
specificity, with the first one set winning:

  (organization_id, referrer)
    > organization_id
    > (all orgs, referrer)
    > default

The project branch and cross-org behavior are unchanged.

Also fix pre-existing lint/mypy issues in the same files that the
pre-commit hook surfaces once the files are touched (E712 truthiness
asserts, untyped tenant_ids dicts, ResourceIdentifier arg type, and a
misplaced type: ignore exposed by ruff reformatting).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/xXAC6_vLoFmdoaI4mlEdxiC7CUhp2mohJV2MuJoaNPU
@phacops phacops requested a review from a team as a code owner May 28, 2026 00:17
Add two new Configuration entries on BytesScannedRejectingPolicy that
forward a hard max_bytes_to_read value to ClickHouse and bypass the
sliding-window scan limit for the configured organization:

- organization_referrer_max_bytes_to_read, keyed by
  (organization_id, referrer)
- organization_max_bytes_to_read, keyed by organization_id

When either is set the query runs at full threads with the configured
cap and the sliding window is not consulted; ClickHouse aborts the
query if it would scan more than the cap. (org_id, referrer) is more
specific and wins over org_id.

This complements the existing global limit_bytes_instead_of_rejecting
flow, which only caps queries after a tenant exceeds its scan limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/K7ElxY0inzZzY1icw8PnMWaO4jjRKEoV3AthvEUGg7E
@phacops phacops changed the title feat(allocation-policy): Add per-org overrides to BytesScannedRejectingPolicy feat(allocation-policy): Per-org overrides for bytes-scanned policy May 28, 2026
Comment on lines +282 to +292
max_bytes_to_read=org_cap,
explanation={
"reason": f"organization_id {tenant_ids.get('organization_id')} runs with a per-org max_bytes_to_read cap of {org_cap}"
},
is_throttled=False,
throttle_threshold=MAX_THRESHOLD,
rejection_threshold=MAX_THRESHOLD,
quota_used=0,
quota_unit=QUOTA_UNIT,
suggestion=NO_SUGGESTION,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The new organization-level max_bytes_to_read check runs before determining the query type, causing project-based queries to bypass their specific rate limits if an org-level cap is set.
Severity: HIGH

Suggested Fix

The organization cap check should be moved to after the call to _get_customer_tenant_key_and_value(). The check should only be applied when the customer_tenant_key is confirmed to be "organization_id", ensuring it only affects organization-based queries.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: snuba/query/allocation_policies/bytes_scanned_rejecting_policy.py#L277-L292

Potential issue: The `__get_organization_max_bytes_to_read()` check is applied to any
query with an `organization_id` in `tenant_ids`. This check occurs before the logic that
determines if a query is project-based or organization-based. Consequently, for a
project-based query that also includes an `organization_id`, if an organization-level
`max_bytes_to_read` cap is configured, the function will return early with the
organization's cap. This bypasses the intended project-level sliding-window rate limits,
potentially allowing the project to scan an unlimited number of bytes beyond its
configured limit.

Did we get this right? 👍 / 👎 to inform future reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants