RFC: Flexible & Non-Ossifying Control Plane Persistence Layer
Authors: MushuEE / Agent Substrate Maintainers
Status: Draft / Proposed
What's the Goal?
As Agent Substrate matures, we want to make sure our Control Plane persistence layer (store.Interface) doesn't paint us into a corner.
Specifically, we want to achieve three things:
- Pluggable Databases: Allow users/deployers to bring their own database of choice (PostgreSQL, Spanner, Aerospike, DynamoDB) without rewriting the entire API server.
- Locality-Aware Scheduling (Hot Snapshot Cache): Enable the scheduler to query for workers that already have a target snapshot hot-cached, with a clean fallback to the general pool if none are warm.
- Non-Ossifying API Surface: Add these advanced query capabilities without bloating our Go interfaces or hardcoding specific filters into the persistence method signatures.
The Problem Today
Right now, our store.Interface listing methods take no filter arguments, so every call returns the full set of records:
ListWorkers(ctx context.Context) ([]*ateapipb.Worker, error)
ListActors(ctx context.Context) ([]*ateapipb.Actor, error)
This introduces severe issues as the project scales:
- No indexing/filtering in the DB: To find a free worker in a specific pool, the scheduler is forced to retrieve every single worker from the database and filter them in-memory in Go.
- Zero flexibility: If we want to query workers by Node name, size, or status, we either have to add new custom query methods to the interface (bloating the API surface) or continue doing extremely expensive in-memory scans in the control plane.
- Database lock-in: Alternative databases (such as SQL databases or DynamoDB) cannot use their indexes (e.g. B-Trees) to optimize queries because the interface passes them no filtering context.
The Proposal: Kubernetes-style ListOptions
Rather than hardcoding specific query parameters, we introduce a generic, extensible ListOptions block to list operations:
// ListOptions holds parameters for filtering List results.
type ListOptions struct {
	// FieldSelector filters results by exact match on each listed field
	// (e.g., {"worker_pool": "cpu-pool"}). All entries must match.
	FieldSelector map[string]string
}
This allows our listing methods to support generic query filtering:
ListWorkers(ctx context.Context, opts ListOptions) ([]*ateapipb.Worker, error)
ListActors(ctx context.Context, opts ListOptions) ([]*ateapipb.Actor, error)
Technical Arguments
1. Non-Ossifying API Surface
Using a map-based FieldSelector completely decouples the query signature from the resource schemas:
- If a new field (e.g., node_name, size, gpu_type) is added to the Worker protobuf tomorrow, the Go interface does not change.
- New clients can start passing opts.FieldSelector["node_name"] = "k8s-node-1" immediately.
- The database implementation simply adds a new case statement to its internal matcher to support it.
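The matcher mentioned in the last bullet could look roughly like the sketch below, for a driver that filters client-side. This is an illustration only: the Worker struct is a hypothetical stand-in for ateapipb.Worker, and the selector keys (worker_pool, actor_id, node_name) are assumed names, not a confirmed schema. Selector keys are validated up front so a misspelled key fails loudly instead of silently matching nothing.

```go
package main

import "fmt"

// Worker is a hypothetical stand-in for ateapipb.Worker; the real
// message may name or shape these fields differently.
type Worker struct {
	Name       string
	WorkerPool string
	ActorID    string
	NodeName   string
}

// ListOptions mirrors the struct proposed in this RFC.
type ListOptions struct {
	FieldSelector map[string]string
}

// workerField returns the value of a selectable field, or false if this
// driver does not support the field.
func workerField(w *Worker, field string) (string, bool) {
	switch field {
	case "worker_pool":
		return w.WorkerPool, true
	case "actor_id":
		return w.ActorID, true
	case "node_name": // supporting a new field is one more case
		return w.NodeName, true
	default:
		return "", false
	}
}

// filterWorkers returns the workers that match every selector entry,
// preserving input order. Unknown selector keys are rejected before
// any worker is examined.
func filterWorkers(all []*Worker, opts ListOptions) ([]*Worker, error) {
	for field := range opts.FieldSelector {
		if _, ok := workerField(&Worker{}, field); !ok {
			return nil, fmt.Errorf("unsupported field selector %q", field)
		}
	}
	var out []*Worker
	for _, w := range all {
		match := true
		for field, want := range opts.FieldSelector {
			if got, _ := workerField(w, field); got != want {
				match = false
				break
			}
		}
		if match {
			out = append(out, w)
		}
	}
	return out, nil
}

func main() {
	workers := []*Worker{
		{Name: "w1", WorkerPool: "cpu-pool", ActorID: "a-1", NodeName: "k8s-node-1"},
		{Name: "w2", WorkerPool: "cpu-pool", NodeName: "k8s-node-1"},
		{Name: "w3", WorkerPool: "gpu-pool", NodeName: "k8s-node-2"},
	}
	free, err := filterWorkers(workers, ListOptions{
		FieldSelector: map[string]string{"worker_pool": "cpu-pool", "actor_id": ""},
	})
	if err != nil {
		panic(err)
	}
	for _, w := range free {
		fmt.Println(w.Name)
	}
}
```

In this sketch an empty selector matches every worker, which would keep existing "list everything" callers working if they pass a zero-value ListOptions.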
2. Enabling Production-Grade Pluggable DBs
This design allows pluggable database drivers to translate these selectors directly into native indexing queries instead of scanning the entire database:
- PostgreSQL / Spanner: Can map the FieldSelector map directly to parameterized SQL WHERE clauses (leveraging B-Tree indexes):
  SELECT * FROM workers WHERE worker_pool = $1 AND actor_id = $2;
- DynamoDB: Can map the selector directly to Key Condition Expressions or Filter Expressions.
- Redis (Default): Maintains simple, low-overhead client-side filtering on top of Master key scans for development/testing, but can easily be optimized to perform set lookups (e.g., SPOP or SRANDMEMBER on set:idle:cpu-pool) for specific common selector combinations.
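As a rough illustration of the SQL case, a driver might build the WHERE clause as sketched below. The allowlist of columns is an assumption made for this example (the RFC does not define the table schema); only the workers table name and the $N placeholder style come from the query shown above. Selector keys are checked against the allowlist so that only values, never keys, are supplied by the caller at query time.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// workerColumns is a hypothetical allowlist mapping selector keys to
// column names. A key reaches the SQL text only if it appears here.
var workerColumns = map[string]string{
	"worker_pool": "worker_pool",
	"actor_id":    "actor_id",
	"node_name":   "node_name",
}

// buildWorkerQuery turns a field selector into a parameterized
// PostgreSQL-style query and its positional arguments.
func buildWorkerQuery(selector map[string]string) (string, []any, error) {
	keys := make([]string, 0, len(selector))
	for k := range selector {
		if _, ok := workerColumns[k]; !ok {
			return "", nil, fmt.Errorf("unsupported field selector %q", k)
		}
		keys = append(keys, k)
	}
	sort.Strings(keys) // map order is random; sort for deterministic query text

	conds := make([]string, 0, len(keys))
	args := make([]any, 0, len(keys))
	for i, k := range keys {
		conds = append(conds, fmt.Sprintf("%s = $%d", workerColumns[k], i+1))
		args = append(args, selector[k])
	}

	query := "SELECT * FROM workers"
	if len(conds) > 0 {
		query += " WHERE " + strings.Join(conds, " AND ")
	}
	return query, args, nil
}

func main() {
	query, args, err := buildWorkerQuery(map[string]string{
		"worker_pool": "cpu-pool",
		"actor_id":    "",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(query)
	fmt.Println(len(args), "args")
}
```

The returned query and arguments can be passed to database/sql's QueryContext. Note that this sketch assumes free workers store an empty string in actor_id; if the schema used NULL instead, the driver would need to emit IS NULL for that selector.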
3. High-Performance Locality & Snapshot Scheduling
A major performance booster for Agent Substrate is scheduling actors onto workers that already have the snapshot hot-cached in memory or locally on disk.
With ListOptions, the scheduler can try the warm set first and fall back to the general pool:
// 1. Try to find a free worker with the snapshot hot-cached.
warmWorkers, err := s.store.ListWorkers(ctx, store.ListOptions{
	FieldSelector: map[string]string{
		"worker_pool":     "cpu-pool",
		"actor_id":        "", // Free workers only
		"cached_snapshot": targetSnapshotURI,
	},
})
if err == nil && len(warmWorkers) > 0 {
	return warmWorkers[0], nil // Hot-cache hit!
}
// 2. Fallback: find any idle worker in the pool. A failed or empty warm
// lookup is treated as a cache miss rather than a scheduling error.
idleWorkers, err := s.store.ListWorkers(ctx, store.ListOptions{
	FieldSelector: map[string]string{
		"worker_pool": "cpu-pool",
		"actor_id":    "",
	},
})
if err != nil {
	return nil, err
}
if len(idleWorkers) == 0 {
	return nil, fmt.Errorf("no idle worker in pool %q", "cpu-pool")
}
return idleWorkers[0], nil
Under the hood, a custom SQL or advanced Redis driver can recognize the cached_snapshot selector in the first query and answer it from a dedicated snapshot-to-worker index, replacing a scan over all workers with an O(1) index lookup.
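To make the index idea concrete, here is a minimal in-memory sketch of a snapshot-to-worker index with the same warm-then-fallback behavior. Everything in it is hypothetical: the snapshotIndex type, its method names, and the example snapshot URI are invented for illustration, and a real driver would keep this state in its database (and would also scope it by worker pool, which is omitted here).

```go
package main

import (
	"errors"
	"fmt"
)

// snapshotIndex is a hypothetical in-memory stand-in for a
// snapshot-to-worker index table.
type snapshotIndex struct {
	bySnapshot map[string]map[string]struct{} // snapshot URI -> worker names
	idle       map[string]struct{}            // workers with no actor assigned
}

func newSnapshotIndex() *snapshotIndex {
	return &snapshotIndex{
		bySnapshot: make(map[string]map[string]struct{}),
		idle:       make(map[string]struct{}),
	}
}

// markCached records that worker has snapshot in its local cache.
func (x *snapshotIndex) markCached(worker, snapshot string) {
	set, ok := x.bySnapshot[snapshot]
	if !ok {
		set = make(map[string]struct{})
		x.bySnapshot[snapshot] = set
	}
	set[worker] = struct{}{}
}

// setIdle adds or removes worker from the idle set.
func (x *snapshotIndex) setIdle(worker string, idle bool) {
	if idle {
		x.idle[worker] = struct{}{}
	} else {
		delete(x.idle, worker)
	}
}

// pick returns an idle worker that has snapshot cached if one exists,
// otherwise any idle worker. warm reports whether it was a cache hit.
func (x *snapshotIndex) pick(snapshot string) (worker string, warm bool, err error) {
	// One map lookup finds the candidates; only those are examined.
	for w := range x.bySnapshot[snapshot] {
		if _, ok := x.idle[w]; ok {
			return w, true, nil
		}
	}
	for w := range x.idle {
		return w, false, nil
	}
	return "", false, errors.New("no idle worker available")
}

func main() {
	idx := newSnapshotIndex()
	idx.setIdle("w1", true)
	idx.setIdle("w2", true)
	idx.markCached("w2", "snap://example") // made-up URI

	w, warm, err := idx.pick("snap://example")
	if err != nil {
		panic(err)
	}
	fmt.Println(w, warm)
}
```

In this sketch the warm lookup touches only the workers known to hold the snapshot, so its cost tracks the size of that set rather than the size of the fleet.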
Conclusion
Adding ListOptions balances minimal invasiveness today with architectural readiness for high-performance custom databases later. It addresses the scaling cost of in-memory filtering without forcing large, premature changes to the codebase. Worker and actor schemas are likely to drift over time, for example as sizing fields (size/cpu) are added or as individual deployments introduce their own attributes, and a map-based selector lets queries follow that drift without another interface change.