feat: versioned deployments with promotion #181

Merged
smoreinis merged 11 commits into main from stas/deployments
Apr 3, 2026

@smoreinis smoreinis commented Mar 30, 2026

Summary

Introduces a first-class Deployment entity with lifecycle states (PENDING, READY, FAILED), production promotion with atomic cutover, preview RPC routing, and soft-delete.

Key design decisions

  • Deploy ≠ Promote. Every deployment comes up as a non-production preview. Promotion is a separate, explicit step. First deployment per agent is auto-promoted.
  • Backward compatible. Agents not passing deployment_id in registration metadata continue to work via the legacy agent.acp_url path — no SDK changes required to keep existing agents running.
  • Preview-safe registration. When an agent pod registers with a deployment_id, only the deployment record is updated. The agent row (acp_url, registration_metadata) is untouched — it only changes via promotion.
  • Registration-driven lifecycle. The PENDING deployment record is optional. SGP can just generate a UUID, inject it as AGENTEX_DEPLOYMENT_ID env var, and the deployment record is created directly as READY when the agent pod registers.

New API endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /agents/{id}/deployments | Create deployment (PENDING), optional pre-creation |
| GET | /agents/{id}/deployments | List deployments (paginated, excludes soft-deleted) |
| GET | /agents/{id}/deployments/{id} | Get specific deployment |
| POST | /agents/{id}/deployments/{id}/promote | Promote to production (atomic cutover) |
| DELETE | /agents/{id}/deployments/{id} | Soft-delete (sets expires_at) |
| POST | /agents/{id}/deployments/{id}/rpc | Preview RPC to specific deployment |

Schema

  • New deployments table: docker_image (typed column) + registration_metadata (JSONB for git/build metadata, matching Agent's pattern) + lifecycle columns (status, acp_url, is_production, sgp_deploy_id, helm_release_name, timestamps)
  • New production_deployment_id FK on agents
  • Partial unique index (agent_id) WHERE is_production = true enforces at most one production deployment per agent at the DB level
  • Data migration from deployment_history is intentionally deferred — existing agents use the legacy agent.acp_url fallback until they re-deploy through the new flow
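The partial unique index can be tried out in isolation. The sketch below uses SQLite as a stand-in for the real database, and the index name uq_one_production_per_agent, the reduced column set, and the insert_deployment helper are assumptions made for illustration, not the migration from this PR.

```python
import sqlite3

# SQLite stands in for the real database; names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE deployments (
        id TEXT PRIMARY KEY,
        agent_id TEXT NOT NULL,
        is_production INTEGER NOT NULL DEFAULT 0
    )
    """
)
# Uniqueness applies only to rows where is_production is true, so an agent
# can have many preview rows but at most one production row.
conn.execute(
    """
    CREATE UNIQUE INDEX uq_one_production_per_agent
    ON deployments (agent_id) WHERE is_production = 1
    """
)


def insert_deployment(dep_id: str, agent_id: str, is_production: bool) -> bool:
    """Insert a row; return False if the partial unique index rejects it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO deployments (id, agent_id, is_production) VALUES (?, ?, ?)",
                (dep_id, agent_id, int(is_production)),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A second production row for the same agent is rejected by the database itself, while additional preview rows and production rows for other agents are accepted.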

SGP integration

SGP's only required change: generate a UUID and inject AGENTEX_DEPLOYMENT_ID env var into Helm values. No new API calls to Agentex are required — the deployment record is created on agent registration.
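A rough sketch of both sides of that handoff, assuming the pod reads the variable from its environment and forwards it as deployment_id in its registration metadata. The helper names new_deployment_env and registration_metadata_from_env are hypothetical and are not part of SGP or the SDK.

```python
import uuid
from typing import Mapping


def new_deployment_env() -> dict[str, str]:
    """Deploy side: generate a fresh ID to inject into the pod environment."""
    return {"AGENTEX_DEPLOYMENT_ID": str(uuid.uuid4())}


def registration_metadata_from_env(env: Mapping[str, str]) -> dict[str, str]:
    """Pod side: forward the ID as deployment_id if it was injected.

    An empty dict means the agent registers through the legacy path.
    """
    deployment_id = env.get("AGENTEX_DEPLOYMENT_ID")
    return {"deployment_id": deployment_id} if deployment_id else {}
```

In a running pod the second helper would be called with os.environ.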

Test plan

  • All 214 unit tests pass
  • Verify migration applies cleanly (make apply-migrations)
  • Create deployment → register agent with deployment_id → verify READY + auto-promote
  • Second deployment registers → verify agent.acp_url unchanged (preview-safe)
  • Promote second deployment → verify atomic cutover
  • Preview RPC to non-production deployment
  • Soft-delete deployment → verify excluded from list and cannot be promoted
  • Legacy registration (no deployment_id) still works unchanged

Introduce a first-class Deployment entity with lifecycle states
(PENDING → READY → FAILED), production promotion with atomic cutover,
preview RPC routing, and instant rollback.

Key changes:
- New deployments table with Alembic migration + data migration from
  deployment_history
- DeploymentRepository with atomic promote() transaction
- DeploymentUseCase for create/list/promote/rollback/delete
- API endpoints under /agents/{id}/deployments (CRUD + promote +
  rollback + preview RPC)
- Deployment-aware agent registration (auto-promotes first deployment)
- RPC routing resolves acp_url through production deployment with
  legacy fallback
- Agent.production_deployment_id denormalized pointer for fast lookups

Instead of hard-deleting deployment rows, set expires_at timestamp
to mark them for cleanup. This preserves deployment history and
gives a future GC process a field to key off of.

The early return for existing agents (no agent_id, found by name)
skipped maybe_update_deployment entirely, so deployment records
would never get created/updated during re-registration.

When deployment_id is present in registration_metadata, treat it as
a deployment-scoped registration: only update the deployment record
and leave the agent row (acp_url, metadata) unchanged. The agent's
acp_url should only change via promotion, not by a preview pod
registering.

1. get_deployment now verifies deployment belongs to authorized agent,
   preventing cross-agent enumeration
2. Auto-promote uses single-transaction auto_promote_if_first() to
   eliminate TOCTOU race between concurrent pods
3. maybe_update_deployment no longer swallows exceptions — errors
   propagate to the caller
4. Preview RPC route raises ClientError instead of ValueError for
   proper 4xx responses
5. list_for_agent filters out soft-deleted deployments (expires_at
   IS NOT NULL)
6. DuplicateItemError path re-fetches agent by name to avoid using
   a stale local agent_id for deployment FK
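One way to read item 2: the "is this the first production deployment?" check and the write happen in a single statement, so there is no gap between check and set. The sketch below illustrates that idea with SQLite as a stand-in. The SQL, the reduced schema, and the function signature are assumptions, not the repository code from this PR.

```python
import sqlite3

# SQLite stands in for the real database; the schema is reduced to what the sketch needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deployments ("
    " id TEXT PRIMARY KEY, agent_id TEXT NOT NULL,"
    " is_production INTEGER NOT NULL DEFAULT 0, expires_at TEXT)"
)


def auto_promote_if_first(agent_id: str, deployment_id: str) -> bool:
    """Promote only if the agent has no production deployment yet.

    The existence check and the write are one UPDATE statement, so a second
    registration cannot slip in between "check" and "set".
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            """
            UPDATE deployments
            SET is_production = 1
            WHERE id = ? AND agent_id = ? AND expires_at IS NULL
              AND NOT EXISTS (
                  SELECT 1 FROM deployments
                  WHERE agent_id = ? AND is_production = 1
              )
            """,
            (deployment_id, agent_id, agent_id),
        )
    # rowcount is 1 when this call performed the promotion, 0 otherwise.
    return cur.rowcount == 1
```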
@smoreinis smoreinis marked this pull request as ready for review April 1, 2026 01:24
@smoreinis smoreinis requested a review from a team as a code owner April 1, 2026 01:24
@smoreinis smoreinis requested a review from xsfa April 1, 2026 01:24
Comment on lines +120 to +121
deployment.expires_at = datetime.now(UTC)
return await self.deployment_repo.update(deployment)
Collaborator

How would the deletion be reflected in the frontend? I see list_for_agent filters out everything that has an expiry so would this be what is used?

Collaborator Author

Yes exactly — list_for_agent filters out deployments where expires_at IS NOT NULL, so soft-deleted deployments won't appear in the list endpoint or the frontend. The record stays in the DB for audit/debugging, and a future GC process can clean them up based on expires_at.
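A small in-memory sketch of that behavior, assuming a simplified Deployment record. The free functions soft_delete and list_for_agent mirror the names used in the PR, but their signatures here are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Deployment:
    id: str
    agent_id: str
    expires_at: Optional[datetime] = None  # None means "not deleted"


def soft_delete(deployment: Deployment) -> Deployment:
    """Mark the row for cleanup instead of removing it."""
    deployment.expires_at = datetime.now(timezone.utc)
    return deployment


def list_for_agent(deployments: list[Deployment], agent_id: str) -> list[Deployment]:
    """Return only live deployments; soft-deleted rows stay stored but hidden."""
    return [d for d in deployments if d.agent_id == agent_id and d.expires_at is None]
```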

deployment_id: str,
) -> DeploymentEntity:
logger.info(f"Rolling back to deployment {deployment_id} for agent {agent_id}")
return await self.deployment_repo.promote(
Collaborator

Promote and rollback being functionally the same doesn't sit right with me, but I guess it makes sense? I'm maybe not super familiar with how other systems implement these semantics.

Collaborator Author

This follows the same pattern as Kubernetes rollbacks — a rollback is just a promotion of an older revision. The operation is identical (atomic cutover of is_production + acp_url), only the intent differs. Keeping them as separate endpoints gives us distinct audit trail entries and makes the CLI/UI semantics clearer (agentex deploy promote vs agentex deploy rollback), even though the underlying mechanism is the same.
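To make the "same operation, different intent" point concrete, here is an in-memory sketch of the cutover, assuming simplified Agent and Deployment records. In a real repository these writes would share one transaction; the signature and the error types below are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deployment:
    id: str
    agent_id: str
    acp_url: str
    is_production: bool = False
    expires_at: Optional[datetime] = None


@dataclass
class Agent:
    id: str
    acp_url: Optional[str] = None
    production_deployment_id: Optional[str] = None


def promote(agent: Agent, deployments: dict[str, Deployment], deployment_id: str) -> Deployment:
    """Make deployment_id the agent's production deployment."""
    target = deployments.get(deployment_id)
    if target is None or target.agent_id != agent.id:
        raise LookupError(f"deployment {deployment_id} not found for agent {agent.id}")
    if target.expires_at is not None:
        raise ValueError("soft-deleted deployments cannot be promoted")
    # Flip the flag so exactly one of the agent's deployments is production.
    for dep in deployments.values():
        if dep.agent_id == agent.id:
            dep.is_production = dep.id == target.id
    # Point the agent row at the new production deployment.
    agent.production_deployment_id = target.id
    agent.acp_url = target.acp_url
    return target
```

Promoting an older deployment runs exactly the same code path, which is why a rollback needs no separate mechanism in this model.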

Collaborator Author

open to getting rid of rollback and only having promotions though - @danielmillerp do you have any thoughts here one way or the other?

registration_metadata: dict[str, Any] | None = None,
agent_input_type: AgentInputType | None = None,
) -> AgentEntity:
deployment_id = (registration_metadata or {}).get("deployment_id")
Collaborator

Is this solely for backwards compatibility?

Collaborator Author

It's the branching point between the new deployment-aware flow and the legacy flow. When deployment_id is present in registration metadata, we only update the deployment record and leave the agent row untouched (acp_url changes only via promotion). When it's absent, we fall through to the existing behavior that updates agent.acp_url directly. This lets non-migrated agents (those not yet passing deployment_id) keep working unchanged.
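A stripped-down sketch of that branch, assuming an in-memory Agent and a plain dict standing in for deployment records. The register_agent signature below is simplified for illustration and is not the one in the diff.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Agent:
    name: str
    acp_url: Optional[str] = None
    registration_metadata: dict[str, Any] = field(default_factory=dict)


def register_agent(
    agent: Agent,
    acp_url: str,
    registration_metadata: Optional[dict[str, Any]],
    deployment_urls: dict[str, str],
) -> str:
    """Return which path handled the registration: 'deployment' or 'legacy'."""
    deployment_id = (registration_metadata or {}).get("deployment_id")
    if deployment_id:
        # Deployment-scoped: record the URL on the deployment only and
        # leave the agent row untouched.
        deployment_urls[deployment_id] = acp_url
        return "deployment"
    # Legacy: no deployment_id, so the agent row is updated directly.
    agent.acp_url = acp_url
    agent.registration_metadata = dict(registration_metadata or {})
    return "legacy"
```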

acp_url: str,
registration_metadata: dict[str, Any] | None,
) -> None:
"""Handle deployment-aware registration when deployment_id is present."""
Collaborator

The purpose of this function is a bit unclear to me

Collaborator Author

This is the bridge between agent registration and the deployment system. When an agent pod boots and calls POST /agents/register with a deployment_id in its metadata, this function: (1) finds or creates the matching deployment record, (2) sets its acp_url and status to READY, and (3) auto-promotes it if it's the agent's first deployment. It's called from register_agent so that the existing registration endpoint gains deployment awareness without changing the SDK contract.
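The three steps can be sketched against an in-memory store. The function name follows the later rename to complete_deployment_registration; the store, the signature, and the reading of "first deployment" as "the agent has no production deployment yet" are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(str, Enum):
    PENDING = "PENDING"
    READY = "READY"
    FAILED = "FAILED"


@dataclass
class Deployment:
    id: str
    agent_id: str
    status: Status = Status.PENDING
    acp_url: Optional[str] = None
    is_production: bool = False


def complete_deployment_registration(
    store: dict[str, Deployment], agent_id: str, deployment_id: str, acp_url: str
) -> Deployment:
    # (1) Find the pre-created record, or create it on first registration.
    dep = store.get(deployment_id)
    if dep is None:
        dep = Deployment(id=deployment_id, agent_id=agent_id)
        store[deployment_id] = dep
    # (2) The pod is up and reachable: record its URL and mark it READY.
    dep.acp_url = acp_url
    dep.status = Status.READY
    # (3) Auto-promote only when the agent has no production deployment yet.
    if not any(d.is_production for d in store.values() if d.agent_id == agent_id):
        dep.is_production = True
    return dep
```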

1. Add partial unique index (agent_id) WHERE is_production = true to
   enforce at most one production deployment per agent at the DB level,
   eliminating the concurrent-promote race condition
2. Fall back to agent.docker_image instead of "unknown" sentinel when
   registration metadata lacks docker_image

is_production = Column(Boolean, nullable=False, default=False)

# Infra references
sgp_deploy_id = Column(String, nullable=True)
Collaborator

is this from cloud deploy?

Collaborator Author

yes

Collaborator

@danielmillerp danielmillerp left a comment

this is great Stas, super clear and simple!!

@danielmillerp
Collaborator

One question I have: do we have to do any work on the cloud deploy side to enable this? Cloud deploy uses FluxCD to actually deploy a new version, right? Do we now have to call promote from there too?

…SONB

Replace commit_hash, branch_name, author_name, author_email, and
build_timestamp typed columns with a single registration_metadata
JSONB column on the deployments table. docker_image remains a typed
column as it's the primary identifier for what's running.

This matches the existing pattern on the agents table and avoids
schema migrations when new metadata fields are added.

Remove the rollback route and use case method so rollback can be
implemented with its own semantics in a future PR rather than being
a thin alias for promote.

- Add expires_at IS NULL filter to promote() and auto_promote_if_first()
  so soft-deleted deployments cannot be promoted
- Rename maybe_update_deployment → complete_deployment_registration
  for clarity
@smoreinis
Collaborator Author

One thing that will need to change is that AGENTEX_DEPLOYMENT_ID will need to get injected into the helm values in cloud deploy for this new code path to be exercised.

Once an agent pod boots and calls register with a deployment ID (this will also need a small SDK change to pass it up), this will trigger the new flow and the very first deployment will be auto promoted.

For subsequent deploys the promotion is decoupled from cloud deploy in case of agents that need to be decoupled from evals, but we could have a way to auto promote beyond the very first deployment as well if that's desired.

@smoreinis smoreinis changed the title from "feat: versioned deployments with promotion and rollback" to "feat: versioned deployments with promotion" Apr 3, 2026
@smoreinis smoreinis merged commit 461affa into main Apr 3, 2026
27 checks passed
@smoreinis smoreinis deleted the stas/deployments branch April 3, 2026 16:18

3 participants