feat: versioned deployments with promotion #181
Introduce a first-class Deployment entity with lifecycle states
(PENDING → READY → FAILED), production promotion with atomic cutover,
preview RPC routing, and instant rollback.
Key changes:
- New deployments table with Alembic migration + data migration from
deployment_history
- DeploymentRepository with atomic promote() transaction
- DeploymentUseCase for create/list/promote/rollback/delete
- API endpoints under /agents/{id}/deployments (CRUD + promote +
rollback + preview RPC)
- Deployment-aware agent registration (auto-promotes first deployment)
- RPC routing resolves acp_url through production deployment with
legacy fallback
- Agent.production_deployment_id denormalized pointer for fast lookups
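The routing rule in the last two bullets can be illustrated with a minimal sketch. It assumes simplified in-memory `Agent` and `Deployment` records and a hypothetical helper named `resolve_acp_url`; none of these are the PR's actual classes or function names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    id: str
    acp_url: Optional[str]
    is_production: bool = False


@dataclass
class Agent:
    id: str
    acp_url: Optional[str]  # legacy URL, used when no production deployment exists
    production_deployment_id: Optional[str] = None  # denormalized pointer


def resolve_acp_url(agent: Agent, deployments: dict[str, Deployment]) -> Optional[str]:
    """Prefer the production deployment's URL; fall back to the legacy agent URL."""
    dep = None
    if agent.production_deployment_id is not None:
        dep = deployments.get(agent.production_deployment_id)
    if dep is not None and dep.acp_url:
        return dep.acp_url
    # Legacy fallback: agents that never went through the deployment flow.
    return agent.acp_url
```

With this shape, a non-migrated agent keeps resolving to its own `acp_url`, while a promoted agent resolves through the pointer without scanning the deployments table.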
Instead of hard-deleting deployment rows, set expires_at timestamp to mark them for cleanup. This preserves deployment history and gives a future GC process a field to key off of.
The early return for existing agents (no agent_id, found by name) skipped maybe_update_deployment entirely, so deployment records would never get created/updated during re-registration.
When deployment_id is present in registration_metadata, treat it as a deployment-scoped registration: only update the deployment record and leave the agent row (acp_url, metadata) unchanged. The agent's acp_url should only change via promotion, not by a preview pod registering.
1. get_deployment now verifies that the deployment belongs to the authorized agent, preventing cross-agent enumeration
2. Auto-promote uses single-transaction auto_promote_if_first() to eliminate the TOCTOU race between concurrent pods
3. maybe_update_deployment no longer swallows exceptions; errors propagate to the caller
4. Preview RPC route raises ClientError instead of ValueError for proper 4xx responses
5. list_for_agent filters out soft-deleted deployments (expires_at IS NOT NULL)
6. DuplicateItemError path re-fetches the agent by name to avoid using a stale local agent_id for the deployment FK
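A rough sketch of the ownership check in item 1, combined with the 4xx-style error in item 4. The `ClientError` class below is a stand-in defined only for this example (the real one lives in the codebase), and the dict-based storage and message text are assumptions.

```python
class ClientError(Exception):
    """Stand-in for the project's client error type (assumed to map to a 4xx)."""


def get_deployment(deployments: dict[str, dict], agent_id: str, deployment_id: str) -> dict:
    """Return the deployment only if it belongs to the authorized agent."""
    dep = deployments.get(deployment_id)
    # Use the same error for "missing" and "owned by another agent" so a caller
    # cannot probe which deployment IDs exist under other agents.
    if dep is None or dep["agent_id"] != agent_id:
        raise ClientError(f"Deployment {deployment_id} not found for agent {agent_id}")
    return dep
```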
agentex/database/migrations/alembic/versions/2026_03_30_1900_deployments_4a9b7787ccd7.py
deployment.expires_at = datetime.now(UTC)
return await self.deployment_repo.update(deployment)
How would the deletion be reflected in the frontend? I see list_for_agent filters out everything that has an expiry so would this be what is used?
Yes exactly — list_for_agent filters out deployments where expires_at IS NOT NULL, so soft-deleted deployments won't appear in the list endpoint or the frontend. The record stays in the DB for audit/debugging, and a future GC process can clean them up based on expires_at.
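For reference, the soft-delete and list-filter behavior described in this thread could look roughly like the following. This is an in-memory sketch: the `Deployment` dataclass and the free functions are simplifications, not the repository methods from the PR.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Deployment:
    id: str
    agent_id: str
    expires_at: Optional[datetime] = None  # None means "live"


def soft_delete(deployment: Deployment) -> Deployment:
    """Mark the row for cleanup instead of removing it."""
    deployment.expires_at = datetime.now(timezone.utc)
    return deployment


def list_for_agent(deployments: list[Deployment], agent_id: str) -> list[Deployment]:
    """Return only live deployments; soft-deleted rows stay stored but hidden."""
    return [d for d in deployments if d.agent_id == agent_id and d.expires_at is None]
```

The deleted row remains available for audit or debugging, and a later GC pass can select on `expires_at`.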
deployment_id: str,
) -> DeploymentEntity:
logger.info(f"Rolling back to deployment {deployment_id} for agent {agent_id}")
return await self.deployment_repo.promote(
Promote and rollback being functionally the same doesn't sit right with me, but I guess it makes sense? I'm maybe not super familiar with how other systems implement these semantics.
This follows the same pattern as Kubernetes rollbacks — a rollback is just a promotion of an older revision. The operation is identical (atomic cutover of is_production + acp_url), only the intent differs. Keeping them as separate endpoints gives us distinct audit trail entries and makes the CLI/UI semantics clearer (agentex deploy promote vs agentex deploy rollback), even though the underlying mechanism is the same.
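To make the "rollback is a promotion of an older revision" point concrete, here is a small sketch using SQLite from the standard library as a stand-in for the real database. The table layout is reduced to the columns discussed in this PR, and `make_db` and `promote` are hypothetical helpers, not the repository's promote() implementation.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE agents (
            id TEXT PRIMARY KEY, acp_url TEXT, production_deployment_id TEXT
        );
        CREATE TABLE deployments (
            id TEXT PRIMARY KEY, agent_id TEXT, acp_url TEXT,
            is_production INTEGER NOT NULL DEFAULT 0, expires_at TEXT
        );
    """)
    return conn


def promote(conn: sqlite3.Connection, agent_id: str, deployment_id: str) -> None:
    """Cut production over to deployment_id in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        row = conn.execute(
            "SELECT acp_url FROM deployments "
            "WHERE id = ? AND agent_id = ? AND expires_at IS NULL",
            (deployment_id, agent_id),
        ).fetchone()
        if row is None:
            raise LookupError(f"no live deployment {deployment_id} for agent {agent_id}")
        # Demote the current production row first, then promote the target.
        conn.execute(
            "UPDATE deployments SET is_production = 0 "
            "WHERE agent_id = ? AND is_production = 1",
            (agent_id,),
        )
        conn.execute("UPDATE deployments SET is_production = 1 WHERE id = ?", (deployment_id,))
        conn.execute(
            "UPDATE agents SET acp_url = ?, production_deployment_id = ? WHERE id = ?",
            (row[0], deployment_id, agent_id),
        )
```

Calling `promote(conn, "a1", "d2")` and later `promote(conn, "a1", "d1")` runs the same code path both times; the second call is what a user would think of as a rollback.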
open to getting rid of rollback and only having promotions though - @danielmillerp do you have any thoughts here one way or the other?
registration_metadata: dict[str, Any] | None = None,
agent_input_type: AgentInputType | None = None,
) -> AgentEntity:
deployment_id = (registration_metadata or {}).get("deployment_id")
Is this solely for backwards compatibility?
It's the branching point between the new deployment-aware flow and the legacy flow. When deployment_id is present in registration metadata, we only update the deployment record and leave the agent row untouched (acp_url changes only via promotion). When it's absent, we fall through to the existing behavior that updates agent.acp_url directly. This lets non-migrated agents (those not yet passing deployment_id) keep working unchanged.
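A stripped-down sketch of that branch, assuming plain dicts for the agent row and for a deployment-ID-to-URL map. The function name `apply_registration` and its return values are made up for illustration.

```python
from typing import Any, Optional


def apply_registration(
    agent: dict[str, Any],
    deployment_urls: dict[str, str],
    acp_url: str,
    registration_metadata: Optional[dict[str, Any]] = None,
) -> str:
    """Route a registration to the deployment-aware or the legacy path."""
    deployment_id = (registration_metadata or {}).get("deployment_id")
    if deployment_id is not None:
        # Deployment-scoped: touch only the deployment record, never the agent row.
        deployment_urls[deployment_id] = acp_url
        return "deployment"
    # Legacy: non-migrated agents still update agent.acp_url directly.
    agent["acp_url"] = acp_url
    return "legacy"
```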
acp_url: str,
registration_metadata: dict[str, Any] | None,
) -> None:
"""Handle deployment-aware registration when deployment_id is present."""
The purpose of this function is a bit unclear to me
This is the bridge between agent registration and the deployment system. When an agent pod boots and calls POST /agents/register with a deployment_id in its metadata, this function: (1) finds or creates the matching deployment record, (2) sets its acp_url and status to READY, and (3) auto-promotes it if it's the agent's first deployment. It's called from register_agent so that the existing registration endpoint gains deployment awareness without changing the SDK contract.
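The three steps can be sketched in memory as below. Assumptions to note: the `Deployment` dataclass is simplified, `complete_registration` is a hypothetical name, and "first deployment" is interpreted here as "the agent has no production deployment yet". This version is not transactional, so it does not show how the real code avoids races between concurrent pods.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    id: str
    agent_id: str
    docker_image: str
    acp_url: Optional[str] = None
    status: str = "PENDING"
    is_production: bool = False


def complete_registration(
    deployments: dict[str, Deployment],
    agent_id: str,
    deployment_id: str,
    acp_url: str,
    docker_image: str,
) -> Deployment:
    # (1) Find or create the matching deployment record.
    dep = deployments.get(deployment_id)
    if dep is None:
        dep = Deployment(deployment_id, agent_id, docker_image)
        deployments[deployment_id] = dep
    # (2) Record where the pod is reachable and mark it READY.
    dep.acp_url = acp_url
    dep.status = "READY"
    # (3) Auto-promote only when the agent has no production deployment yet.
    has_production = any(
        d.agent_id == agent_id and d.is_production for d in deployments.values()
    )
    if not has_production:
        dep.is_production = True
    return dep
```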
1. Add partial unique index (agent_id) WHERE is_production = true to enforce at most one production deployment per agent at the DB level, eliminating the concurrent-promote race condition
2. Fall back to agent.docker_image instead of the "unknown" sentinel when registration metadata lacks docker_image
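The effect of the partial unique index in item 1 can be reproduced with SQLite from the standard library, which also supports partial indexes. This is only a stand-in for the PR's Alembic migration: the index name `uq_one_production_per_agent` and the `try_insert` helper are invented for the example, and SQLite stores the boolean as 0/1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deployments (
        id TEXT PRIMARY KEY,
        agent_id TEXT NOT NULL,
        is_production INTEGER NOT NULL DEFAULT 0
    );
    -- Uniqueness applies only to rows where is_production = 1.
    CREATE UNIQUE INDEX uq_one_production_per_agent
        ON deployments (agent_id) WHERE is_production = 1;
""")


def try_insert(dep_id: str, agent_id: str, is_production: bool) -> bool:
    """Insert a row; return False if the database rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO deployments VALUES (?, ?, ?)",
                (dep_id, agent_id, int(is_production)),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Any number of non-production rows per agent is allowed, but a second production row for the same agent fails at the database, so two concurrent promotions cannot both succeed.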
is_production = Column(Boolean, nullable=False, default=False)

# Infra references
sgp_deploy_id = Column(String, nullable=True)
is this from cloud deploy?
danielmillerp
left a comment
this is great Stas, super clear and simple!!
One question I have: do we have to do any work on the cloud deploy side to enable this? Cloud deploy uses FluxCD to actually deploy a new version, right? Do we then have to call promote from there too?
…SONB Replace commit_hash, branch_name, author_name, author_email, and build_timestamp typed columns with a single registration_metadata JSONB column on the deployments table. docker_image remains a typed column as it's the primary identifier for what's running. This matches the existing pattern on the agents table and avoids schema migrations when new metadata fields are added.
Remove the rollback route and use case method so rollback can be implemented with its own semantics in a future PR rather than being a thin alias for promote.
- Add expires_at IS NULL filter to promote() and auto_promote_if_first() so soft-deleted deployments cannot be promoted
- Rename maybe_update_deployment → complete_deployment_registration for clarity
One thing that will need to change is that AGENTEX_DEPLOYMENT_ID will need to be injected into the Helm values in cloud deploy for this new code path to be exercised. Once an agent pod boots and calls register with a deployment ID (this will also need a small SDK change to pass it up), the new flow is triggered and the very first deployment is auto-promoted. For subsequent deploys, promotion is decoupled from cloud deploy, for agents that need to be decoupled from evals, but we could have a way to auto-promote beyond the very first deployment as well if that's desired.
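As a rough idea of what the small SDK change might look like, the sketch below copies the env var into the registration metadata. The helper name `build_registration_metadata` and its signature are hypothetical; only the `AGENTEX_DEPLOYMENT_ID` variable and the `deployment_id` metadata key come from this PR.

```python
import os
from typing import Any, Mapping, Optional


def build_registration_metadata(
    base: Optional[dict[str, Any]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> dict[str, Any]:
    """Attach deployment_id to the metadata when the pod was given one."""
    environ = os.environ if environ is None else environ
    metadata = dict(base or {})
    deployment_id = environ.get("AGENTEX_DEPLOYMENT_ID")
    if deployment_id:
        # Presence of this key is what switches the server to the new flow.
        metadata["deployment_id"] = deployment_id
    return metadata
```

Pods started without the variable send metadata without `deployment_id` and therefore stay on the legacy path.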
Summary
Introduces a first-class Deployment entity with lifecycle states (PENDING → READY → FAILED), production promotion with atomic cutover, preview RPC routing, and soft-delete.

Key design decisions
- Agents that do not pass deployment_id in registration metadata continue to work via the legacy agent.acp_url path; no SDK changes are required to keep existing agents running.
- When registration metadata includes deployment_id, only the deployment record is updated. The agent row (acp_url, registration_metadata) is untouched; it only changes via promotion.
- SGP injects the AGENTEX_DEPLOYMENT_ID env var, and the deployment record is created directly as READY when the agent pod registers.

New API endpoints
- POST /agents/{id}/deployments
- GET /agents/{id}/deployments
- GET /agents/{id}/deployments/{id}
- POST /agents/{id}/deployments/{id}/promote
- DELETE /agents/{id}/deployments/{id} (soft-delete; sets expires_at)
- POST /agents/{id}/deployments/{id}/rpc

Schema
- deployments table: docker_image (typed column) + registration_metadata (JSONB for git/build metadata, matching Agent's pattern) + lifecycle columns (status, acp_url, is_production, sgp_deploy_id, helm_release_name, timestamps)
- production_deployment_id FK on agents
- Partial unique index (agent_id) WHERE is_production = true enforces at most one production deployment per agent at the DB level
- Data migration from deployment_history is intentionally deferred; existing agents use the legacy agent.acp_url fallback until they re-deploy through the new flow

SGP integration
SGP's only required change: generate a UUID and inject it as the AGENTEX_DEPLOYMENT_ID env var into Helm values. No new API calls to Agentex are required; the deployment record is created on agent registration.

Test plan
- Apply migrations (make apply-migrations)
- Register with deployment_id → verify READY + auto-promote
- Legacy registration (no deployment_id) still works unchanged