Skip to content

feat(AGX1-272): dual-write agent_api_keys to spark-authz behind FGAC flag#248

Open
dm36 wants to merge 8 commits into
mainfrom
dhruv/agx1-272-agent-api-keys-dual-write
Open

feat(AGX1-272): dual-write agent_api_keys to spark-authz behind FGAC flag#248
dm36 wants to merge 8 commits into
mainfrom
dhruv/agx1-272-agent-api-keys-dual-write

Conversation

@dm36
Copy link
Copy Markdown

@dm36 dm36 commented May 26, 2026

Summary

Wires register_resource / deregister_resource into AgentAPIKeysUseCase.create / delete so api_keys get a SpiceDB tuple (with parent_agent edge) on create and have that tuple removed on delete. Companion to the route-migration PR (#252).

scale-agentex is intentionally decoupled from egp-api-backend. The dual-write call is unconditional from this repo. agentex-auth's per-account routing (scaleapi/agentex#353, FGAC_AGENTEX_AUTH_SPARK) decides whether the call lands on Spark AuthZ or falls back to the legacy SGP backend for the caller's account.

Linear ticket: https://linear.app/scale-epd/issue/AGX1-272/fgac-agent-api-key-dual-write-registerderegister-on-createdelete

Cross-repo stack

# Repo PR Purpose State
1 scaleapi/scaleapi #144657 sgp-authz 0.6.0 → 0.7.0; parent_resource kwarg ✅ Merged
2 scaleapi/agentex #354 agentex-auth Port + Spark adapter + HTTP routes (/v1/authz/register, /v1/authz/deregister) + api_key mapping ✅ Merged
2b scaleapi/agentex #353 agentex-auth per-account routing via FGAC_AGENTEX_AUTH_SPARK flag Open (this PR is decoupled from it — works whether #353 lands first or not)
3 (this PR) scaleapi/scale-agentex this scale-agentex use case calling register_resource(parent=agent) Ready
4 scaleapi/scale-agentex #252 route-layer FGAC + 404 collapse + two-factor mutations Stacked on this

Changes

Port + adapter

  • src/adapters/authorization/port.py — abstract register_resource(resource, parent=None) and deregister_resource(resource) on AuthorizationGateway.
  • src/adapters/authorization/adapter_agentex_authz_proxy.py — POST /v1/authz/register and /v1/authz/deregister to agentex-auth (endpoints exposed by #354).

Service layer

  • src/domain/services/authorization_service.pyregister_resource / deregister_resource service methods mirror the grant/revoke pattern (principal_context override, _bypass support, parent identity in log line).

Use case

  • src/domain/use_cases/agent_api_keys_use_case.py:
    • create calls _register_api_key_in_auth unconditionally; parent=AgentexResource.agent(agent_id) is load-bearing for the SpiceDB cascade.
    • delete (and delete_by_* variants) call _deregister_api_key_from_auth after the Postgres row is gone.
    • Register fails closed (re-raises); deregister is best-effort (logs).
    • The "no creator resolvable" guard (internal-key path with no principal user/service identity) reads from principal_context at call time — no persisted creator columns.

Tests

  • tests/integration/services/test_agent_api_key_service_dual_write.py — 6 cases. Mocks the authorization service; asserts the call sequencing inside the use case.
  • tests/integration/fixtures/integration_client.py + tests/unit/use_cases/test_agents_api_keys_use_case.py updated for the new constructor signature.

Test plan

  • 6 / 6 dual-write integration tests pass locally.
  • Parent edge contract pinned: test_create_api_key_calls_register_resource_with_parent asserts parent=AgentexResource.agent(agent.id) is passed on every register call.
  • Fail-closed on register failure preserved.
  • Best-effort deregister preserved.
  • No-creator-resolvable skip path preserved (runtime check, no persistence).
  • 4 pre-existing docker-fixture errors in test_agents_api_keys_use_case.py predate this PR (verified via git stash).
  • CI: ruff, ruff-format, alembic migration lint.
  • End-to-end against a real SpiceDB once exercised in dev via the agentex-auth routing flip.

Diff scope

10 files changed, 55 insertions(+), 368 deletions(-) — net deletion. Earlier iterations had ~370 lines of OSS-coupling that Harvey's review pruned.

Out of scope / follow-ups

  • ZedToken plumbing. If a SpiceDB consistency token use case emerges, expose it under a vendor-neutral column name (not spark_authz_zedtoken).
  • Audit / backfill for legacy api_keys created before this PR. Tracked separately.
  • Other resource types' grant → register_resource swap (task, build, deployment, schedule). Each owns its own follow-up.

Linked: AGX1-272.

Greptile Summary

This PR wires register_resource / deregister_resource into AgentAPIKeysUseCase.create and all three delete variants, so each agent_api_key gets a SpiceDB tuple with a parent_agent edge on create and has that tuple removed on delete. The dual-write is unconditional from scale-agentex; per-account routing (Spark vs legacy SGP) is handled by agentex-auth's FGAC_AGENTEX_AUTH_SPARK flag in the companion PR.

  • Port + adapter: register_resource / deregister_resource added to AuthorizationGateway and implemented in AgentexAuthorizationProxy via POST to /v1/authz/register and /v1/authz/deregister, following the same payload pattern as the existing grant/revoke methods.
  • Use-case semantics: Create is fail-closed (register before DB write; re-raises on failure) and skips when no creator identity is resolvable on principal_context. Delete (all three variants) is best-effort (deregister after DB delete; logs but does not re-raise on failure).
  • Tests: Six new dual-write integration tests cover parent-edge contract, fail-closed register, best-effort deregister, no-creator skip, and delete_by_name; unit fixture updated to supply AsyncMock for both new methods.

Confidence Score: 5/5

Safe to merge; the dual-write is unconditional but correctly layered so that agentex-auth per-account routing controls whether Spark or the legacy backend is actually hit.

The create path is fail-closed (register before DB write, re-raises on failure), the delete paths are best-effort (swallow errors after the Postgres row is gone), and the no-creator-identity guard prevents writing ownerless SpiceDB tuples. Six dedicated integration tests exercise the key contracts and no logic errors were found.

No files require special attention.

Important Files Changed

Filename Overview
agentex/src/domain/use_cases/agent_api_keys_use_case.py Wires dual-write into create (fail-closed, before DB write, with parent-agent edge) and all three delete variants (best-effort, after DB delete); creator-identity guard correctly short-circuits when no user/service principal is resolvable.
agentex/src/domain/services/authorization_service.py Adds register_resource and deregister_resource service methods with _bypass() guard, principal_context override pattern, and structured logging — mirrors the existing grant/revoke service methods faithfully.
agentex/src/adapters/authorization/adapter_agentex_authz_proxy.py Implements register_resource and deregister_resource via POST to /v1/authz/register and /v1/authz/deregister; consistent with existing grant/revoke payload patterns including model_dump() for resource/parent.
agentex/src/api/schemas/authorization_types.py Adds api_key to AgentexResourceType enum and corresponding factory methods; follows the exact same pattern as the existing task and agent types.
agentex/tests/integration/services/test_agent_api_key_service_dual_write.py Six integration tests covering the core dual-write contracts: parent-edge on create, deregister on delete, fail-closed register, best-effort deregister, no-creator skip, and delete-by-name path.

Sequence Diagram

sequenceDiagram
    participant Route as API Route
    participant UC as AgentAPIKeysUseCase
    participant Auth as AuthorizationService
    participant GW as AgentexAuthorizationProxy
    participant DB as AgentAPIKeyRepository
    participant AX as agentex-auth

    Note over Route,AX: create path (fail-closed)
    Route->>UC: create(name, agent_id, api_key)
    UC->>Auth: check principal_context
    alt no creator identity
        UC-->>DB: skip auth, write row anyway
    else creator resolvable
        Auth->>GW: "register_resource(principal, api_key, parent=agent)"
        GW->>AX: POST /v1/authz/register
        alt register fails
            AX-->>UC: raises - DB row NOT written
        else register succeeds
            UC->>DB: create(AgentAPIKeyEntity)
        end
    end

    Note over Route,AX: delete path (best-effort)
    Route->>UC: delete(id)
    UC->>DB: get(id)
    alt ItemDoesNotExist
        UC-->>Route: no-op
    else row exists
        UC->>DB: delete(id)
        UC->>Auth: deregister_resource(api_key)
        Auth->>GW: deregister_resource(principal, api_key)
        GW->>AX: POST /v1/authz/deregister
        alt deregister fails
            UC->>UC: log warning, swallow
        end
    end
Loading

Comments Outside Diff (1)

  1. agentex/tests/integration/services/test_agent_api_key_service_dual_write.py, line 703-727 (link)

    P2 Missing test for grant-succeeds-but-DB-create-fails (orphan Spark tuple)

    The suite covers grant failure preventing the DB write, but not the reverse: grant succeeding followed by repo.create raising (e.g., a duplicate key or transient DB error). In that case a Spark tuple exists for an api_key_id that never lands in Postgres. Adding a test for this case would make the documented known-limitation explicit and guard against accidental regression if compensating logic is added later.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: agentex/tests/integration/services/test_agent_api_key_service_dual_write.py
    Line: 703-727
    
    Comment:
    **Missing test for grant-succeeds-but-DB-create-fails (orphan Spark tuple)**
    
    The suite covers `grant` failure preventing the DB write, but not the reverse: `grant` succeeding followed by `repo.create` raising (e.g., a duplicate key or transient DB error). In that case a Spark tuple exists for an `api_key_id` that never lands in Postgres. Adding a test for this case would make the documented known-limitation explicit and guard against accidental regression if compensating logic is added later.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

Reviews (8): Last reviewed commit: "refactor(AGX1-272): early-return on dele..." | Re-trigger Greptile

@dm36 dm36 marked this pull request as ready for review May 26, 2026 21:31
@dm36 dm36 requested a review from a team as a code owner May 26, 2026 21:31
Comment thread agentex/src/domain/use_cases/agent_api_keys_use_case.py Outdated
Comment thread agentex/src/domain/use_cases/agent_api_keys_use_case.py Outdated
dm36 added a commit that referenced this pull request May 26, 2026
Closes the parent_agent cascade gap surfaced on scaleapi/agentex#354.
The api_key dual-write (AGX1-272, PR #248) currently calls grant() which
writes the owner edge in SpiceDB but NOT the parent_agent edge. The
agent_api_key schema requires `read = ... & parent_agent->read & ...`,
so every downstream read/update fails closed without that edge.

This PR adds register_resource/deregister_resource (Port + adapter + service)
and swaps the api_keys use case from grant→register_resource with
parent=AgentexResource.agent(agent_id). Now the owner edge and parent_agent
edge are written atomically.

Stack:
- scaleapi/scaleapi#144657 — sgp-authz 0.7.0 (parent_resource kwarg).
- scaleapi/agentex#355 — agentex-auth Port + adapter + HTTP routes.
- #248 — original AGX1-272 dual-write (this stacks on it).
- THIS PR — extends #248 to use the parent-aware path.

Changes:
- Port: abstract register_resource(resource, parent=None) and
  deregister_resource(resource).
- Adapter proxy: POST /v1/authz/register and /v1/authz/deregister.
- Service: mirror existing grant/revoke pattern (principal_context override,
  _bypass support, parent in log line for cascade debugging).
- Use case: swap grant→register_resource passing parent=agent;
  swap revoke→deregister_resource. except Exception wrappers preserved
  (fail-closed on register, best-effort on deregister).
- Tests: rename mocks to register_resource/deregister_resource; assert the
  parent edge is passed correctly.

Test plan:
- pytest agentex/tests/integration/services/test_agent_api_key_service_dual_write.py
  → 8 / 8 pass.
- New test ``test_create_api_key_calls_grant_when_flag_on`` asserts
  parent.type == AgentexResourceType.agent and parent.selector == agent.id.

Other resource types' grant→register_resource swap is out of scope.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dm36 dm36 marked this pull request as draft May 27, 2026 17:27
dm36 and others added 2 commits May 28, 2026 10:28
…AGENT_API_KEYS_DUAL_WRITE flag

Mirrors the AGX1-274 task dual-write pattern (PR #246) for agent_api_keys.

- Adds creator_user_id / creator_service_account_id / spark_authz_zedtoken
  columns to agent_api_keys, with CHECK constraint and concurrent indexes.
- On create, when FGAC_AGENT_API_KEYS_DUAL_WRITE is enabled for the caller's
  account, calls authorization_service.grant(AgentexResource.api_key(id))
  BEFORE the Postgres write. Grant failure aborts the create.
- On delete, best-effort revoke after the Postgres delete. Failures are
  logged but do not block the delete.
- Adds AgentexResourceType.api_key and AgentexResource.api_key(...) factory.
- Creates src/utils/feature_flags.py with both FGAC_TASKS_DUAL_WRITE and
  FGAC_AGENT_API_KEYS_DUAL_WRITE (file does not exist on main yet; if PR #246
  lands first this becomes a rebase concern).

Structural divergence from tasks: agent_api_keys have no service layer, so
the dual-write logic lives in AgentAPIKeysUseCase rather than a separate
service. This keeps the call site simple and avoids inventing a new layer.

Route layer (read-side auth checks) is out of scope; that's PR B (AGX1-273).
agentex-auth spark_mapping.py update is a sibling-repo concern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the parent_agent cascade gap surfaced on scaleapi/agentex#354.
The api_key dual-write (AGX1-272, PR #248) currently calls grant() which
writes the owner edge in SpiceDB but NOT the parent_agent edge. The
agent_api_key schema requires `read = ... & parent_agent->read & ...`,
so every downstream read/update fails closed without that edge.

This PR adds register_resource/deregister_resource (Port + adapter + service)
and swaps the api_keys use case from grant→register_resource with
parent=AgentexResource.agent(agent_id). Now the owner edge and parent_agent
edge are written atomically.

Stack:
- scaleapi/scaleapi#144657 — sgp-authz 0.7.0 (parent_resource kwarg).
- scaleapi/agentex#355 — agentex-auth Port + adapter + HTTP routes.
- #248 — original AGX1-272 dual-write (this stacks on it).
- THIS PR — extends #248 to use the parent-aware path.

Changes:
- Port: abstract register_resource(resource, parent=None) and
  deregister_resource(resource).
- Adapter proxy: POST /v1/authz/register and /v1/authz/deregister.
- Service: mirror existing grant/revoke pattern (principal_context override,
  _bypass support, parent in log line for cascade debugging).
- Use case: swap grant→register_resource passing parent=agent;
  swap revoke→deregister_resource. except Exception wrappers preserved
  (fail-closed on register, best-effort on deregister).
- Tests: rename mocks to register_resource/deregister_resource; assert the
  parent edge is passed correctly.

Test plan:
- pytest agentex/tests/integration/services/test_agent_api_key_service_dual_write.py
  → 8 / 8 pass.
- New test ``test_create_api_key_calls_grant_when_flag_on`` asserts
  parent.type == AgentexResourceType.agent and parent.selector == agent.id.

Other resource types' grant→register_resource swap is out of scope.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dm36 dm36 marked this pull request as ready for review May 28, 2026 17:29
@dm36 dm36 force-pushed the dhruv/agx1-272-agent-api-keys-dual-write branch from 9668f1a to e72df68 Compare May 28, 2026 17:29
…L_WRITE

Per team discussion: rather than maintain a parallel env-var flag system
in scale-agentex, route api_key dual-write flag checks through
egp-api-backend's existing flag service. One source of truth across services,
single flip surface for ops, fewer per-env env-var allowlists to keep in sync.

Changes:

- EnvVarKeys.EGP_API_BACKEND_URL — new env var for the egp-api-backend
  base URL. Used by the new HTTP-backed flag provider.

- FeatureFlagProvider rewritten as an HTTP client of egp-api-backend's
  GET /feature-flag/{id} endpoint:
    * Forwards x-api-key / x-user-id / x-service-account-id /
      x-selected-account-id from the caller's principal_context so the
      endpoint's REQUIRE_IDENTITY_AND_OPTIONAL_ACCOUNT policy admits the
      request.
    * Coerces the response's `value` field to bool.
    * Fails closed to False on any error (config missing, no identity,
      non-2xx, transport failure, JSON parse failure) — the legacy
      no-Spark code path is the safe default.
    * `is_enabled` is now async (HTTP call). Signature is
      `is_enabled(name, *, principal_context, account_id)`.

- AgentAPIKeysUseCase: both call sites now await is_enabled and pass
  principal_context. _deregister grabs principal_context from
  self.authorization_service.

- Test fixtures: mock FeatureFlagProvider directly (Mock with
  is_enabled = AsyncMock(return_value=flag_on)) so dual-write tests stay
  hermetic. The pre-existing FeatureFlagProvider() no-arg constructions
  in test_agents_api_keys_use_case.py and integration_client.py now pass
  egp_api_backend_url=None (provider returns False without it, matching
  the prior "flag never enabled in unit tests" behavior).

Out of scope:

- Migrating Asher's FGAC_TASKS_DUAL_WRITE flag check off env vars.
  That's task-team-owned and we leave their existing pattern alone per
  the team discussion (new-work-only).
- Caching the flag response. Each is_enabled is a fresh HTTP call.
  Egp-api-backend's flag endpoint is fast and the caller paths are
  already crossing the network for the actual register/deregister, so
  one extra round-trip is acceptable for now. Add caching later if
  load profiling shows it matters.

Test plan:

- uv run pytest agentex/tests/integration/services/test_agent_api_key_service_dual_write.py — 8/8 pass.
- Existing 4 unrelated test_agents_api_keys_use_case.py docker-fixture
  errors predate this commit (verified via `git stash`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread agentex/src/utils/feature_flags.py Outdated
Comment thread agentex/src/domain/entities/agent_api_keys.py Outdated
Comment thread agentex/src/domain/entities/agent_api_keys.py Outdated
Three rounds of Harvey's review feedback together pointed at the same
concern: scale-agentex (OSS) should not be coupled to egp-api-backend
(proprietary) feature-flag service or Spark-specific schema concepts.

Changes:

- Drop FeatureFlagProvider and the egp-api-backend HTTP query entirely.
  scale-agentex now calls register_resource/deregister_resource
  unconditionally. agentex-auth's per-account routing
  (FGAC_AGENTEX_AUTH_SPARK, scaleapi/agentex#353) decides whether the
  call lands on Spark AuthZ or falls back to legacy SGP for the caller's
  account.

- Drop the creator_user_id / creator_service_account_id columns from the
  agent_api_keys table. They were only used as a runtime guard inside
  the auth-registration helper. Moved that check inline.

- Drop the spark_authz_zedtoken column. The name leaked SpiceDB-specific
  terminology into the OSS schema, and the column was always None.

- Delete the Alembic migration entirely. migration_history.txt
  re-pointed to head=6c942325c828.

- Rename helper methods: _register_api_key_in_spark_authz ->
  _register_api_key_in_auth, same for deregister.

Test plan: 6/6 dual-write integration tests pass. Pre-existing
docker-fixture errors in test_agents_api_keys_use_case.py unrelated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread agentex/src/domain/use_cases/agent_api_keys_use_case.py
Per @jenniechung review: delete(id) used to call deregister unconditionally,
which wastes an auth round-trip when the id doesn't exist. The two
delete_by_* methods already use pre-fetch + `if existing is not None`;
delete() now matches that pattern via try/except ItemDoesNotExist.

Also tightened the unconditional-register comment from 5 lines to 2.

Tests:
- Fixture now defaults repo.get() to a stub entity so existing delete tests
  flow through the deregister path; new helper `_stub_api_key`.
- New test_delete_api_key_skips_deregister_when_row_does_not_exist asserts
  deregister.assert_not_called() when repo.get raises ItemDoesNotExist.
- 7/7 dual-write tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread agentex/tests/unit/use_cases/test_agents_api_keys_use_case.py
…t fixture

Per reviewer: the agent_api_keys_use_case fixture had `grant` and `revoke`
set up as AsyncMocks but not `register_resource` / `deregister_resource`.
The deregister path silently swallowed the resulting TypeError via its
`except Exception` handler, so delete tests passed without exercising the
auth-side call. Matches the integration_client.py fixture shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread agentex/src/api/routes/agent_api_keys.py Outdated
api_key_type=api_key_type,
api_key=api_key,
)
return await self.agent_api_key_repo.create(item=agent_api_key)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we follow the egp-api-backend's pattern and add compensating deregister if repo.create fails?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

followed asher's approach here with tasks with not compensating deregister if repo.create fails for consistency- i can update tho

Comment thread agentex/src/domain/use_cases/agent_api_keys_use_case.py
dm36 and others added 2 commits June 1, 2026 11:36
agentex-auth derives account_id from the authenticated principal on its
side, so threading it through routes and use-case calls in OSS
scale-agentex is dead weight — and leaks the SGP-specific field name.

Drop:
- account_id extraction (getattr on principal_context) in all three
  api_keys route handlers; drop the now-unused DAuthorizationService
  injection from create_api_key/delete_agent_api_key.
- account_id parameter on create, delete, delete_by_agent_id_and_key_name,
  delete_by_agent_name_and_key_name, _register_api_key_in_auth, and
  _deregister_api_key_from_auth.
- account_id entries from log extras (it was only ever log context — no
  auth call ever consumed it).
- account_id kwargs at use-case call sites in the dual-write test.

Keep the _principal(account_id=...) test helper — it models the
principal shape and Harvey's feedback is about production OSS surface,
not test mocks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per Harvey's nit: when the pre-fetch raises ItemDoesNotExist, skip both
the repo.delete (was issuing a wasted DB round-trip for a no-op) and
the deregister. Same HTTP response from the route either way.

Test renamed and updated: now asserts neither repo.delete nor deregister
is called when the row doesn't exist (previously asserted repo.delete
was called).

Note: the delete_by_* variants have the symmetric pattern (unconditional
repo.delete_by_* even when the pre-fetch returns None), but they
predate this PR and Harvey's comment was scoped to delete. Leaving
those for now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants