Skip to content

feat(AGX1-275): per-RPC task permission rewire and 404/403 wrap#249

Open
asherfink wants to merge 1 commit into
mainfrom
asher.fink/agx1-275-task-route-migration
Open

feat(AGX1-275): per-RPC task permission rewire and 404/403 wrap#249
asherfink wants to merge 1 commit into
mainfrom
asher.fink/agx1-275-task-route-migration

Conversation

@asherfink
Copy link
Copy Markdown

@asherfink asherfink commented May 26, 2026

Related work

Parent epic: AGX1-264 — per-task FGAC. Follow-ups bundled in AGX1-291.

This change is part of a 5-PR stack across 3 repos. Merge order: scaleapi/scaleapi#144783 (sgp-authz 0.7.1) ✅ already merged → scaleapi/scaleapi#145000 + scaleapi/scaleapi#145044 + scaleapi/agentex#353 → #246this PR.

Repo PR Purpose
scaleapi/scaleapi scaleapi/scaleapi#144783 ✅ merged sgp-authz 0.7.1 — Action.CANCEL
scaleapi/scaleapi scaleapi/scaleapi#145000 register FGAC_AGENTEX_AUTH_SPARK flag
scaleapi/scaleapi scaleapi/scaleapi#145044 add cancel to SGP's AgentexOperation enum + role map
scaleapi/agentex scaleapi/agentex#353 agentex-auth per-account routing + cancel op
scaleapi/scale-agentex #246 task creator audit columns + FGAC dual-write + flag
scaleapi/scale-agentex this PR per-RPC operation rewire + 404/403 wrap

Last in the stack — this is the route-side change that actually exercises the permissions written by #246.

Summary

  • Routes each task-resource RPC to the correct AuthorizedOperationType instead of using execute everywhere: MESSAGE_SEND / EVENT_SENDupdate, TASK_CANCELcancel, TASK_CREATE stays create.
  • Broadens the execute → update swap across messages.py, checkpoints.py, and states.py so the editor role can send messages and update checkpoints/state without needing owner. The task SpiceDB schema defines permission update = (editor + owner) & internal_tenant_gate, so leaving these on execute (owner-only) would lock editors out of routine mutations.
  • Collapses every task-resource denial into 404 (via ItemDoesNotExist) across all surfaces — path id, query id, body id, and name routes — so callers can no longer distinguish "task present in another tenant" from "task absent" by comparing 403 vs 404.
  • Extracts the collapse helper to src/utils/task_authorization.py, reused from both the FastAPI dep factories and the RPC authorize hook.
  • Unknown AgentRPCMethod values now raise NotImplementedError rather than falling through authz-free — defense-in-depth so a new RPC must be explicitly wired before it can dispatch.

What changed

  • src/utils/task_authorization.py (new): check_task_or_collapse_to_404(authorization, task_id, operation) — the shared wrap.
  • src/utils/authorization_shortcuts.py: DAuthorizedId / DAuthorizedQuery / DAuthorizedBodyId route task checks through the wrap; their inner deps no longer take a task_repository (parameter was unused). DAuthorizedName now applies the wrap when resource_type == AgentexResourceType.task — previously the name surface leaked 403 vs 404 because tasks.name is globally unique, so a probe checked the whole system rather than a single tenant.
  • src/api/routes/agents.py _authorize_rpc_request: each task-resource branch routes through the wrap. The MESSAGE_SEND block with task_name is restructured to a try/else shape so a denied-update on an existing task surfaces as 404 (it must NOT fall through to the create-fallback except — that would silently promote a denied-update into a create check, which is a privilege escalation footgun). The unknown-method case (case _:) now raises NotImplementedError.
  • src/api/routes/messages.py / checkpoints.py / states.py: execute → update for routine mutations.
  • TASK_CREATE and the wildcard task("*") checks intentionally untouched.

AGX1-275 list deliverable — already covered

The AGX1-275 list-filtering deliverable (only return tasks the caller can read) is already met by existing infrastructure that this PR does not change:

  • DAuthorizedResourceIds(AgentexResourceType.task) at src/utils/authorization_shortcuts.py:130 resolves the per-principal accessible task id set.
  • It is wired into the list route at src/api/routes/tasks.py:82, which intersects user filters with the accessible set before hitting the repository.

No additional code is needed here — the list surface is already authorized end-to-end.

Pre-merge wire dependency

The execute → update swap changes the operation literal that hits agentex-auth's check endpoint. The SGP gateway forwards the literal verbatim, so the SGP permissions backend must resolve both update and cancel against the existing OWNER grant before this deploys:

  • update is already in SGP's AgentexOperation enum on master — no work needed.
  • cancel is added by scaleapi/scaleapi#145044. This PR is held until that PR ships everywhere.

For Spark-routed accounts the dependency chain is scaleapi/agentex#353 + sgp-authz 0.7.1 (already merged) → ResourceRole.OWNER → Action.CANCEL, all of which are in place upstream of this PR.

Tests

  • tests/unit/api/test_tasks_authz.py — 17 tests pass:
    • 8 TestPerRpcOperationRouting tests (incl. MESSAGE_SEND create-fallback preserved through the restructure).
    • 2 TestCheckTaskOrCollapseTo404 tests (allow + denied-collapses-to-404).
    • 3 TestDAuthorizedBodyIdTaskWrap tests.
    • 3 TestDAuthorizedNameTaskWrap tests (denied-task → 404, allow returns name, agent path unaffected).
    • Wire-contract test pins cancel → "cancel" (mirrors agentex-auth's).
  • Ruff + ruff-format clean (pre-commit hooks).

Out of scope / follow-ups (tracked in AGX1-291)

  • /agents/name/{agent_name} has the same leak shape — agent FGAC is outside AGX1-264.
  • Restoring the 403/404 split for same-tenant calls once tasks carry tenant/account_id scope at the data layer (AGX1-290).

Greptile Summary

This PR is the final route-side piece of the AGX1-264 per-task FGAC stack. It rewires _authorize_rpc_request to use operation-specific literals (update for MESSAGE_SEND/EVENT_SEND, cancel for TASK_CANCEL), collapses every task-resource authorization denial into a 404 (via a shared check_task_or_collapse_to_404 helper), and guards the unknown-method case with NotImplementedError so new RPCs cannot slip through authorization-free.

  • src/utils/task_authorization.py (new): check_task_or_collapse_to_404 wraps any AuthorizationError on a task resource into ItemDoesNotExist, eliminating the cross-tenant existence leak that tasks.name (globally unique) would otherwise expose via 403 vs 404 divergence.
  • src/utils/authorization_shortcuts.py: All four dep factories (DAuthorizedId, DAuthorizedQuery, DAuthorizedBodyId, DAuthorizedName) now route task-type resources through the collapse helper; the MESSAGE_SEND-with-task_name branch uses a try/else shape so a denied-update can never fall through to the create-fallback check (privilege escalation footgun addressed).
  • src/api/routes/{messages,checkpoints}.py: execute → update swap for routine mutations so the editor role can send messages and update checkpoints/state without needing owner.

Confidence Score: 4/5

Safe to merge once the upstream wire dependency (scaleapi/scaleapi#145044 shipping cancel into SGP's AgentexOperation enum) is confirmed live; the execute→update swap for MESSAGE_SEND/EVENT_SEND is already safe per SGP master.

The routing logic, collapse helper, and try/else privilege-escalation guard are all correct and well-tested. The one remaining gate — cancel reaching every SGP region before this deploys — is explicitly documented in the PR description as a hold condition; no in-repo verification of that live state exists. The change is otherwise clean and the score reflects that single external dependency rather than any defect in the code itself.

No files contain defects. The pre-merge wire dependency on scaleapi/scaleapi#145044 is the only open gate, documented in the PR description and in the existing review comment on agents.py.

Important Files Changed

Filename Overview
agentex/src/utils/task_authorization.py New helper that collapses AuthorizationError → ItemDoesNotExist (404) for task resources; logic is simple, well-documented with a TODO for the 403/404 restore, and fully unit-tested.
agentex/src/api/routes/agents.py Per-RPC operation rewire is correct; try/else shape for MESSAGE_SEND task-name path prevents the denied-update → create-fallback privilege escalation; unknown-method guard raises NotImplementedError defensively.
agentex/src/utils/authorization_shortcuts.py All dep factories now route task resources through the collapse helper; DAuthorizedName's lookup-before-authz pattern is safe because both absent-name and denied-name paths produce 404.
agentex/src/api/routes/messages.py execute → update swap on all four message mutation endpoints; straightforward mechanical change, correctly enables editor-role access per the SpiceDB schema.
agentex/src/api/routes/checkpoints.py execute → update swap on put_checkpoint and put_writes; import block reordered alphabetically as a side effect, no functional change.
agentex/src/api/schemas/authorization_types.py Adds cancel to AuthorizedOperationType StrEnum; wire-format value "cancel" is pinned by a dedicated unit test.
agentex/tests/unit/api/test_tasks_authz.py 17 tests covering per-RPC routing, 404-collapse semantics, body-id and name-path dep factories, wire-format contract pins, and the unwired-method fail-closed case; tests are well-structured and isolate the authz gateway via mocks.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    RPC["Incoming AgentRPC Request"] --> MATCH{"match request.method"}

    MATCH -->|TASK_CREATE| CREATE_CHECK["authorization.check\ntask(*) · create"]
    MATCH -->|MESSAGE_SEND| MSG_ID{"task_id present?"}
    MATCH -->|EVENT_SEND| EVT_ID{"task_id present?"}
    MATCH -->|TASK_CANCEL| CXL_ID{"task_id present?"}
    MATCH -->|unknown| NIE["raise NotImplementedError\n→ 500"]

    MSG_ID -->|yes| MSG_WRAP_ID["check_task_or_collapse_to_404\ntask_id · update"]
    MSG_ID -->|no, task_name| MSG_LOOKUP["task_service.get_task(name)"]
    MSG_ID -->|no task_id/name| MSG_CREATE["authorization.check\ntask(*) · create"]
    MSG_LOOKUP -->|ItemDoesNotExist| MSG_CREATE
    MSG_LOOKUP -->|found| MSG_WRAP_NAME["check_task_or_collapse_to_404\nexisting_task.id · update"]

    EVT_ID -->|yes| EVT_WRAP_ID["check_task_or_collapse_to_404\ntask_id · update"]
    EVT_ID -->|no, task_name| EVT_LOOKUP["task_service.get_task(name)"]
    EVT_ID -->|no task_id/name| EVT_ERR["raise ValueError → 400"]
    EVT_LOOKUP --> EVT_WRAP_NAME["check_task_or_collapse_to_404\nexisting_task.id · update"]

    CXL_ID -->|yes| CXL_WRAP_ID["check_task_or_collapse_to_404\ntask_id · cancel"]
    CXL_ID -->|no, task_name| CXL_LOOKUP["task_service.get_task(name)"]
    CXL_ID -->|no task_id/name| CXL_ERR["raise ValueError → 400"]
    CXL_LOOKUP --> CXL_WRAP_NAME["check_task_or_collapse_to_404\nexisting_task.id · cancel"]

    subgraph WRAP["check_task_or_collapse_to_404"]
        direction TB
        AZ["authorization.check(task, op)"] -->|AuthorizationError| FOLD["raise ItemDoesNotExist → 404"]
        AZ -->|ok| PASS["return (allow)"]
    end

    MSG_WRAP_ID & MSG_WRAP_NAME & EVT_WRAP_ID & EVT_WRAP_NAME & CXL_WRAP_ID & CXL_WRAP_NAME --> WRAP
Loading

Reviews (4): Last reviewed commit: "feat(AGX1-275): per-RPC task permission ..." | Re-trigger Greptile

dm36 added a commit that referenced this pull request May 27, 2026
…se and two-factor mutations

Mirrors AGX1-275 (PR #249) for agent_api_keys. Wires Spark AuthZ checks
into every api_key route, collapses denials to 404 (so name/id probes
can't distinguish "present in another tenant" from "absent"), and relies
on SpiceDB's transitive expansion of api_key.{update,delete} (= editor &
parent_agent->update & tenant_gate) for two-factor mutations rather than
issuing two explicit checks at the route layer.

- src/utils/agent_api_key_authorization.py (new):
  _check_api_key_or_collapse_to_404 — catches AuthorizationError, raises
  ItemDoesNotExist. Same shape as Asher's task helper.
- src/utils/authorization_shortcuts.py: DAuthorizedId routes
  AgentexResourceType.api_key through the wrap. (DAuthorizedName isn't
  used for api_keys; the name lookup is (agent_id, name, api_key_type),
  not a single globally-unique path param — the route handlers call the
  collapse helper inline instead.)
- src/api/routes/agent_api_keys.py:
  * POST: explicit agent.update on parent (no api_key resource yet).
  * GET list: DAuthorizedResourceIds + filter; None passes through.
  * GET /name/{name}: inline collapse helper.
  * GET /{id}: DAuthorizedId(api_key, read).
  * DELETE /{id}: DAuthorizedId(api_key, delete). Two-factor via SpiceDB
    schema (api_key.delete expands to parent_agent.update); no second
    route-layer check.
  * DELETE /name/{api_key_name}: inline collapse helper.
- tests/unit/api/test_agent_api_keys_authz.py (new): 12 tests, all pass.

Stacked on dhruv/agx1-272-agent-api-keys-dual-write (PR A). Does NOT
touch dual-write logic. Does NOT modify agentex-auth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@asherfink asherfink force-pushed the asher.fink/agx1-275-task-route-migration branch from b936a08 to 7841696 Compare May 27, 2026 21:15
@asherfink asherfink marked this pull request as ready for review May 27, 2026 21:59
@asherfink asherfink requested a review from a team as a code owner May 27, 2026 21:59
dm36 added a commit that referenced this pull request May 28, 2026
…se and two-factor mutations

Mirrors AGX1-275 (PR #249) for agent_api_keys. Wires Spark AuthZ checks
into every api_key route, collapses denials to 404 (so name/id probes
can't distinguish "present in another tenant" from "absent"), and relies
on SpiceDB's transitive expansion of api_key.{update,delete} (= editor &
parent_agent->update & tenant_gate) for two-factor mutations rather than
issuing two explicit checks at the route layer.

- src/utils/agent_api_key_authorization.py (new):
  _check_api_key_or_collapse_to_404 — catches AuthorizationError, raises
  ItemDoesNotExist. Same shape as Asher's task helper.
- src/utils/authorization_shortcuts.py: DAuthorizedId routes
  AgentexResourceType.api_key through the wrap. (DAuthorizedName isn't
  used for api_keys; the name lookup is (agent_id, name, api_key_type),
  not a single globally-unique path param — the route handlers call the
  collapse helper inline instead.)
- src/api/routes/agent_api_keys.py:
  * POST: explicit agent.update on parent (no api_key resource yet).
  * GET list: DAuthorizedResourceIds + filter; None passes through.
  * GET /name/{name}: inline collapse helper.
  * GET /{id}: DAuthorizedId(api_key, read).
  * DELETE /{id}: DAuthorizedId(api_key, delete). Two-factor via SpiceDB
    schema (api_key.delete expands to parent_agent.update); no second
    route-layer check.
  * DELETE /name/{api_key_name}: inline collapse helper.
- tests/unit/api/test_agent_api_keys_authz.py (new): 12 tests, all pass.

Stacked on dhruv/agx1-272-agent-api-keys-dual-write (PR A). Does NOT
touch dual-write logic. Does NOT modify agentex-auth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dm36 added a commit that referenced this pull request May 28, 2026
…se and two-factor mutations

Mirrors AGX1-275 (PR #249) for agent_api_keys. Wires Spark AuthZ checks
into every api_key route, collapses denials to 404 (so name/id probes
can't distinguish "present in another tenant" from "absent"), and relies
on SpiceDB's transitive expansion of api_key.{update,delete} (= editor &
parent_agent->update & tenant_gate) for two-factor mutations rather than
issuing two explicit checks at the route layer.

- src/utils/agent_api_key_authorization.py (new):
  _check_api_key_or_collapse_to_404 — catches AuthorizationError, raises
  ItemDoesNotExist. Same shape as Asher's task helper.
- src/utils/authorization_shortcuts.py: DAuthorizedId routes
  AgentexResourceType.api_key through the wrap. (DAuthorizedName isn't
  used for api_keys; the name lookup is (agent_id, name, api_key_type),
  not a single globally-unique path param — the route handlers call the
  collapse helper inline instead.)
- src/api/routes/agent_api_keys.py:
  * POST: explicit agent.update on parent (no api_key resource yet).
  * GET list: DAuthorizedResourceIds + filter; None passes through.
  * GET /name/{name}: inline collapse helper.
  * GET /{id}: DAuthorizedId(api_key, read).
  * DELETE /{id}: DAuthorizedId(api_key, delete). Two-factor via SpiceDB
    schema (api_key.delete expands to parent_agent.update); no second
    route-layer check.
  * DELETE /name/{api_key_name}: inline collapse helper.
- tests/unit/api/test_agent_api_keys_authz.py (new): 12 tests, all pass.

Stacked on dhruv/agx1-272-agent-api-keys-dual-write (PR A). Does NOT
touch dual-write logic. Does NOT modify agentex-auth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@asherfink asherfink marked this pull request as draft May 28, 2026 18:00
asherfink added a commit that referenced this pull request May 28, 2026
…n creation

Adds two nullable creator-audit columns to the `tasks` table —
`creator_user_id` and `creator_service_account_id` — populated from
the request principal in `AgentTaskService.create_task`. A CHECK
constraint `ck_tasks_at_most_one_creator` enforces that at most one
of the two is set; partial indexes back future "tasks created by X"
lookups.

Online migration: the CHECK is added `NOT VALID` then
`VALIDATE`d separately so the brief ACCESS EXCLUSIVE lock doesn't
have to wait on an existence scan. `tasks` is a high-write table;
a vanilla CHECK addition would queue behind in-flight transactions
and block readers until released. Indexes use
`CREATE INDEX CONCURRENTLY` inside `autocommit_block`.

Best-effort attribution: tasks created outside an HTTP request
context (Temporal activities, background workers, any path that
constructs `AgentTaskService` without `request.state.principal_context`)
leave both columns NULL. The CHECK constraint allows both-NULL,
and an integration test exercises the no-resolvable-creator path.

These columns are how the AGX1-291 operator runbook identifies
orphan rows for backfill when the dual-write call sites added in
the next commit fail under load.

Part of the AGX1-264 stack: scaleapi/scaleapi NEW2
(per-account FF endpoint) → scaleapi/agentex#353 (agentex-auth
routing + cancel) → this PR → #249 (per-RPC
route migration). Two commits land together in #246; this one is
the schema/audit change and is independent of the dual-write call
sites.
asherfink added a commit that referenced this pull request May 28, 2026
…TE flag

Wires `register_resource` / `deregister_resource` into
`AgentTaskService.create_task` / `delete_task`, gated by a new
`FGAC_TASKS_DUAL_WRITE` env-var (off by default; resolved at
DI-resolve time so mid-process flips are intentionally invisible —
rollout assumes a redeploy cycles pods).

- `create_task`: after the Postgres row persists, register the
  task in the authorization graph with `parent=agent` so the
  tenant + owner + parent_agent tuples are written atomically.
- `delete_task`: pre-resolves the task id by name before the
  Postgres delete (lookup-after-delete would race), then
  deregisters once the row is gone. The name-lookup
  `ItemDoesNotExist` is swallowed so the subsequent `delete()`
  surfaces its own native error — flipping the flag must not
  change the error contract for missing tasks.
- Both call sites share `_dual_write_with_retry(op_name, do_call,
  task_id)`, which retries transient
  `AuthenticationServiceUnavailableError` /
  `AuthenticationGatewayError` with exponential backoff + jitter
  (3 retries → 4 attempts max). Mirrors
  `agents_acp_use_case.grant_with_retry`, but with no `fail_task`
  fallback: the Postgres row is the durable record and orphan auth
  tuples are preferable to losing the task. The AGX1-291 operator
  runbook covers backfill using the creator-audit columns added in
  the parent commit.
- Emits `task_fgac_dual_write.{attempt,success,retry,failure}`
  statsd counters (tagged `op:register|deregister` and
  `exception_class` on failure) — the rollout signal for the
  FGAC_TASKS_DUAL_WRITE flip dashboard.

The `Port` interface gains `register_resource` /
`deregister_resource`, and the agentex-auth proxy adapter calls
`POST /v1/authz/register` and `POST /v1/authz/deregister`. The
endpoints themselves already live on agentex-auth `main` via #354;
per-account routing across them is set by scaleapi/agentex#353.

Part of the AGX1-264 stack: scaleapi/scaleapi NEW2
(per-account FF endpoint) → scaleapi/agentex#353 (agentex-auth
routing + cancel) → this PR → #249 (per-RPC
route migration).
@asherfink asherfink force-pushed the asher.fink/agx1-275-task-route-migration branch from 7841696 to fcce135 Compare May 28, 2026 20:08
@asherfink asherfink marked this pull request as ready for review May 28, 2026 20:32
asherfink added a commit that referenced this pull request May 28, 2026
…n creation

Adds two nullable creator-audit columns to the `tasks` table —
`creator_user_id` and `creator_service_account_id` — populated from
the request principal in `AgentTaskService.create_task`. A CHECK
constraint `ck_tasks_at_most_one_creator` enforces that at most one
of the two is set; partial indexes back future "tasks created by X"
lookups.

Online migration: the CHECK is added `NOT VALID` then
`VALIDATE`d separately so the brief ACCESS EXCLUSIVE lock doesn't
have to wait on an existence scan. `tasks` is a high-write table;
a vanilla CHECK addition would queue behind in-flight transactions
and block readers until released. Indexes use
`CREATE INDEX CONCURRENTLY` inside `autocommit_block`.

Best-effort attribution: tasks created outside an HTTP request
context (Temporal activities, background workers, any path that
constructs `AgentTaskService` without `request.state.principal_context`)
leave both columns NULL. The CHECK constraint allows both-NULL,
and an integration test exercises the no-resolvable-creator path.

These columns are how the AGX1-291 operator runbook identifies
orphan rows for backfill when the dual-write call sites added in
the next commit fail under load.

Part of the AGX1-264 stack: scaleapi/scaleapi#145000
(FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi
sgp-agentex-cancel-enum (cancel enum on SGP backend) ->
scaleapi/agentex#353 (agentex-auth per-account routing + cancel) ->
this PR -> #249 (per-RPC route migration).
Two commits land together in #246; this one is the schema/audit
change and is independent of the dual-write call sites in the
next commit.
asherfink added a commit that referenced this pull request May 28, 2026
…TE flag

Wires `register_resource` / `deregister_resource` into
`AgentTaskService.create_task` / `delete_task`, gated by a new
`FGAC_TASKS_DUAL_WRITE` env-var (off by default; resolved at
DI-resolve time so mid-process flips are intentionally invisible —
rollout assumes a redeploy cycles pods).

- `create_task`: after the Postgres row persists, register the
  task in the authorization graph with `parent=agent` so the
  tenant + owner + parent_agent tuples are written atomically.
- `delete_task`: pre-resolves the task id by name before the
  Postgres delete (lookup-after-delete would race), then
  deregisters once the row is gone. The name-lookup
  `ItemDoesNotExist` is swallowed so the subsequent `delete()`
  surfaces its own native error — flipping the flag must not
  change the error contract for missing tasks.
- Both call sites share `_dual_write_with_retry(op_name, do_call,
  task_id)`, which retries transient
  `AuthenticationServiceUnavailableError` /
  `AuthenticationGatewayError` with exponential backoff + jitter
  (3 retries -> 4 attempts max). Mirrors
  `agents_acp_use_case.grant_with_retry`, but with no `fail_task`
  fallback: the Postgres row is the durable record and orphan auth
  tuples are preferable to losing the task. The AGX1-291 operator
  runbook covers backfill using the creator-audit columns added in
  the parent commit.
- Emits `task_fgac_dual_write.{attempt,success,retry,failure}`
  statsd counters (tagged `op:register|deregister` and
  `exception_class` on failure) — the rollout signal for the
  FGAC_TASKS_DUAL_WRITE flip dashboard.

The `Port` interface gains `register_resource` /
`deregister_resource`, and the agentex-auth proxy adapter calls
`POST /v1/authz/register` and `POST /v1/authz/deregister`. The
endpoints themselves already live on agentex-auth `main` via #354;
per-account routing across them is set by scaleapi/agentex#353.

Part of the AGX1-264 stack: scaleapi/scaleapi#145000
(FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi
sgp-agentex-cancel-enum (cancel enum on SGP backend) ->
scaleapi/agentex#353 (agentex-auth per-account routing + cancel) ->
this PR -> #249 (per-RPC route migration).
@asherfink asherfink force-pushed the asher.fink/agx1-275-task-route-migration branch from fcce135 to 3dfd5cb Compare May 28, 2026 22:26
dm36 added a commit that referenced this pull request May 29, 2026
…se and two-factor mutations

Mirrors AGX1-275 (PR #249) for agent_api_keys. Wires Spark AuthZ checks
into every api_key route, collapses denials to 404 (so name/id probes
can't distinguish "present in another tenant" from "absent"), and relies
on SpiceDB's transitive expansion of api_key.{update,delete} (= editor &
parent_agent->update & tenant_gate) for two-factor mutations rather than
issuing two explicit checks at the route layer.

- src/utils/agent_api_key_authorization.py (new):
  _check_api_key_or_collapse_to_404 — catches AuthorizationError, raises
  ItemDoesNotExist. Same shape as Asher's task helper.
- src/utils/authorization_shortcuts.py: DAuthorizedId routes
  AgentexResourceType.api_key through the wrap. (DAuthorizedName isn't
  used for api_keys; the name lookup is (agent_id, name, api_key_type),
  not a single globally-unique path param — the route handlers call the
  collapse helper inline instead.)
- src/api/routes/agent_api_keys.py:
  * POST: explicit agent.update on parent (no api_key resource yet).
  * GET list: DAuthorizedResourceIds + filter; None passes through.
  * GET /name/{name}: inline collapse helper.
  * GET /{id}: DAuthorizedId(api_key, read).
  * DELETE /{id}: DAuthorizedId(api_key, delete). Two-factor via SpiceDB
    schema (api_key.delete expands to parent_agent.update); no second
    route-layer check.
  * DELETE /name/{api_key_name}: inline collapse helper.
- tests/unit/api/test_agent_api_keys_authz.py (new): 12 tests, all pass.

Stacked on dhruv/agx1-272-agent-api-keys-dual-write (PR A). Does NOT
touch dual-write logic. Does NOT modify agentex-auth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
asherfink added a commit that referenced this pull request May 29, 2026
…TE flag

Wires `register_resource` / `deregister_resource` into
`AgentTaskService.create_task` / `delete_task`, gated by a new
`FGAC_TASKS_DUAL_WRITE` env-var (off by default; resolved at
DI-resolve time so mid-process flips are intentionally invisible —
rollout assumes a redeploy cycles pods).

- `create_task`: after the Postgres row persists, register the
  task in the authorization graph with `parent=agent` so the
  tenant + owner + parent_agent tuples are written atomically.
- `delete_task`: pre-resolves the task id by name before the
  Postgres delete (lookup-after-delete would race), then
  deregisters once the row is gone. The name-lookup
  `ItemDoesNotExist` is swallowed so the subsequent `delete()`
  surfaces its own native error — flipping the flag must not
  change the error contract for missing tasks.
- Both call sites share `_dual_write_with_retry(op_name, do_call,
  task_id)`, which retries transient
  `AuthenticationServiceUnavailableError` /
  `AuthenticationGatewayError` with exponential backoff + jitter
  (3 retries -> 4 attempts max). Mirrors
  `agents_acp_use_case.grant_with_retry`, but with no `fail_task`
  fallback: the Postgres row is the durable record and orphan auth
  tuples are preferable to losing the task. The AGX1-291 operator
  runbook covers backfill using the creator-audit columns added in
  the parent commit.
- Emits `task_fgac_dual_write.{attempt,success,retry,failure}`
  statsd counters (tagged `op:register|deregister` and
  `exception_class` on failure) — the rollout signal for the
  FGAC_TASKS_DUAL_WRITE flip dashboard.

The `Port` interface gains `register_resource` /
`deregister_resource`, and the agentex-auth proxy adapter calls
`POST /v1/authz/register` and `POST /v1/authz/deregister`. The
endpoints themselves already live on agentex-auth `main` via #354;
per-account routing across them is set by scaleapi/agentex#353.

Part of the AGX1-264 stack: scaleapi/scaleapi#145000
(FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi
sgp-agentex-cancel-enum (cancel enum on SGP backend) ->
scaleapi/agentex#353 (agentex-auth per-account routing + cancel) ->
this PR -> #249 (per-RPC route migration).
dm36 added a commit that referenced this pull request May 29, 2026
Per Asher's review: the previous commit (5deda18) wrapped every direct-resource
DAuthorizedId/DAuthorizedQuery check in AuthorizationError -> ItemDoesNotExist,
which flipped 403 to 404 for every direct-resource route service-wide. That
diverges from #249 (tasks) and #255 (task-children) which keep direct-resource
denials as 403.

Restore the else branch to a bare authorization.check() in both helpers.
Only child-resource branches (TaskChildResourceType, AgentChildResourceType)
collapse to 404, which is the intended scope: the parent-resolution path
already loads the child to find its parent, so 404 is the natural shape.

Update the events list test to assert 403 for an unauthorized agent_id
(consistent with the direct-resource convention).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dm36 added a commit that referenced this pull request May 29, 2026
…ceType

Per Asher's review: the AgentChildResourceType branch in DAuthorizedId/
DAuthorizedQuery added an extra parent-resolution hop and a new resource-
type family compared to #249 (tasks) and #255 (task-children), both of
which only special-case task/task-children.

Mirror #249's shape exactly:

- Add agent_authorization.check_agent_or_collapse_to_404, parallel to
  task_authorization.check_task_or_collapse_to_404 from #249. Same
  docstring rationale (no tenant column on AgentORM, denial reason not
  discriminable, so collapse to 404 to prevent cross-tenant existence
  leak; TODO references AGX1-290).
- Delete AgentChildResourceType enum.
- Delete _get_parent_agent_id and the elif branch in both shortcuts.
- Drop DEventRepository injection from the shortcuts (was only used by
  the deleted branch).
- Rewrite get_event to do event_use_case.get(...) -> agent check inline,
  using the helper. Identical wire behavior (single /v1/authz/check on
  the parent agent, denial -> 404) but no new abstraction.

Net: shortcuts file is now structurally identical to main; events.py
follows the same pattern Harvey/Asher established for direct/child task
routes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dm36 added a commit that referenced this pull request May 29, 2026
Per Asher's review on the task-child branch: same point as the direct-
resource else — this PR shouldn't be modifying the task-child collapse
behavior either, since #249 owns that change (via
check_task_or_collapse_to_404).

Revert the task-child branch in DAuthorizedId/DAuthorizedQuery back to
main's shape: bare authorization.check(), no inline try/except.

Now this PR's diff against main is purely subtractive on
authorization_shortcuts.py: remove event from TaskChildResourceType (it
no longer fits — events delegate to the parent agent, not the parent
task) and the cascading cleanup of DEventRepository injection. No
behavioral changes to the task-child path. #249 will apply on top
cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dm36 added a commit that referenced this pull request May 29, 2026
Mirror Harvey's #255 inline pattern exactly in the task-child branch of
DAuthorizedId/DAuthorizedQuery — same imports, same comment, same
try/except shape. Direct-resource else branch stays as plain check (403),
matching both #249 and #255.

Net diff vs main is consistent with #255's shape: collapse for
task-children, plain check for direct resources. Only delta-from-#255
is this PR's own scope (removing event from TaskChildResourceType and
DEventRepository injection, since events delegate to the parent agent
instead).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread agentex/src/utils/task_authorization.py Outdated


async def check_task_or_collapse_to_404(
authorization,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make this typed

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, typed it as AuthorizationService

Rewires the operation literal sent to agentex-auth on task RPC
routes so each RPC checks the permission that actually matches its
side effect, instead of using `execute` everywhere:

- `MESSAGE_SEND` / `EVENT_SEND` -> `update`
- `TASK_CANCEL` -> `cancel`
- `TASK_CREATE` stays `create`
- Unknown `AgentRPCMethod` values now raise `NotImplementedError`
  rather than falling through authz-free (defense-in-depth: a new
  RPC must be explicitly wired before it can dispatch).

The same `execute -> update` swap is applied across `messages.py`,
`checkpoints.py`, and `states.py` so the editor role can perform
routine mutations without needing owner. The task SpiceDB schema
defines `permission update = (editor + owner) & internal_tenant_gate`,
so leaving these on `execute` (owner-only) would lock editors out
of normal flows.

Adds `check_task_or_collapse_to_404` in
`src/utils/task_authorization.py` and routes every task-resource
denial path through it: path id, query id, body id, and the name
surface in `authorization_shortcuts.py`. `tasks.name` is globally
unique, so a 403/404 split on the name route would let any
authenticated caller probe the whole system for task existence —
collapsing both denial cases into 404 closes that leak at the cost
of an in-tenant UX regression on permission-gap updates (tracked
under AGX1-290).

The `MESSAGE_SEND` task-name branch is restructured to
`try/else`: a denied update on an existing task must surface as 404
and NOT fall through to the create check, which would promote
"denied update" into create access.

Cross-repo wire dependency: the `update` and `cancel` literals
must resolve against the existing OWNER grant in SGP's task
permission schema before this deploys. `update` is already in
SGP's `AgentexOperation` enum; `cancel` is added by scaleapi/scaleapi
sgp-agentex-cancel-enum. Held behind that PR shipping everywhere.

Part of the AGX1-264 stack: scaleapi/scaleapi#145000
(FGAC_AGENTEX_AUTH_SPARK flag) -> scaleapi/scaleapi
sgp-agentex-cancel-enum (cancel enum on SGP backend) ->
scaleapi/agentex#353 (agentex-auth per-account routing + cancel) ->
#246 (task FGAC dual-write + audit columns)
-> this PR.
@asherfink asherfink force-pushed the asher.fink/agx1-275-task-route-migration branch from 3dfd5cb to 3009275 Compare June 1, 2026 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants