RFC: Flexible & Non-Ossifying Control Plane Persistence Layer #135

Authors: MushuEE / Agent Substrate Maintainers
Status: Draft / Proposed


What's the Goal?

As Agent Substrate matures, we want to make sure our Control Plane persistence layer (store.Interface) doesn't paint us into a corner.

Specifically, we want to achieve three things:

  1. Pluggable Databases: Allow users/deployers to bring their own database of choice (PostgreSQL, Spanner, Aerospike, DynamoDB) without rewriting the entire API server.
  2. Locality-Aware Scheduling (Hot Snapshot Cache): Enable the scheduler to query for workers that already have a target snapshot hot-cached, with a clean fallback to the general pool if none are warm.
  3. Non-Ossifying API Surface: Add these advanced query capabilities without bloating our Go interfaces or hardcoding specific filters into the persistence method signatures.

The Problem Today

Right now, our store.Interface listing methods are completely rigid and all-or-nothing:

ListWorkers(ctx context.Context) ([]*ateapipb.Worker, error)
ListActors(ctx context.Context) ([]*ateapipb.Actor, error)

This introduces severe issues as the project scales:

  • No indexing/filtering in the DB: To find a free worker in a specific pool, the scheduler is forced to retrieve every single worker from the database and filter them in-memory in Go.
  • Zero flexibility: If we want to query workers by Node name, size, or status, we either have to add new custom query methods to the interface (bloating the API surface) or continue doing extremely expensive in-memory scans in the control plane.
  • Database lock-in: Alternative backends (such as SQL databases or DynamoDB) cannot use their indexing engines (e.g. B-Trees) to optimize queries, because the interface passes them no filtering context at all.

The Proposal: Kubernetes-style ListOptions

Rather than hardcoding specific query parameters, we introduce a generic, extensible ListOptions block to list operations:

// ListOptions holds parameters for filtering List results.
type ListOptions struct {
	// FieldSelector filters results by exact match on field values,
	// e.g. {"worker_pool": "cpu-pool"}. A result must match every entry.
	FieldSelector map[string]string
}

This allows our listing methods to support generic query filtering:

ListWorkers(ctx context.Context, opts ListOptions) ([]*ateapipb.Worker, error)
ListActors(ctx context.Context, opts ListOptions) ([]*ateapipb.Actor, error)

Technical Arguments

1. Non-Ossifying API Surface

Using a map-based FieldSelector completely decouples the query signature from the resource schemas:

  • If a new field (e.g., node_name, size, gpu_type) is added to the Worker protobuf tomorrow, the Go interface does not change.
  • New clients can start passing opts.FieldSelector["node_name"] = "k8s-node-1" immediately.
  • The database implementation simply adds a new case statement to its internal matcher to support it.
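The "internal matcher" mentioned above could look roughly like the following. This is a minimal sketch, not the project's implementation: the `Worker` struct is a hypothetical stand-in for `ateapipb.Worker` (whose real fields may differ), and `fieldValue` and `filterWorkers` are names invented for illustration. Rejecting unknown selector keys instead of ignoring them is also an assumption of the sketch.

```go
package main

import "fmt"

// Worker is a hypothetical stand-in for ateapipb.Worker; the real
// protobuf fields may differ.
type Worker struct {
	Name       string
	WorkerPool string
	ActorID    string
	NodeName   string
}

// ListOptions mirrors the struct proposed in this RFC.
type ListOptions struct {
	FieldSelector map[string]string
}

// fieldValue resolves a selector key to the worker's value. Supporting a
// new field means adding one case here; the store interface is untouched.
func fieldValue(w *Worker, key string) (string, bool) {
	switch key {
	case "worker_pool":
		return w.WorkerPool, true
	case "actor_id":
		return w.ActorID, true
	case "node_name":
		return w.NodeName, true
	default:
		return "", false
	}
}

// filterWorkers returns the workers that match every selector entry exactly.
// Unknown keys are rejected rather than silently ignored (sketch assumption).
func filterWorkers(all []*Worker, opts ListOptions) ([]*Worker, error) {
	for key := range opts.FieldSelector {
		if _, ok := fieldValue(&Worker{}, key); !ok {
			return nil, fmt.Errorf("unsupported field selector %q", key)
		}
	}
	var out []*Worker
	for _, w := range all {
		match := true
		for key, want := range opts.FieldSelector {
			if got, _ := fieldValue(w, key); got != want {
				match = false
				break
			}
		}
		if match {
			out = append(out, w)
		}
	}
	return out, nil
}

func main() {
	workers := []*Worker{
		{Name: "w1", WorkerPool: "cpu-pool", ActorID: "actor-7", NodeName: "k8s-node-1"},
		{Name: "w2", WorkerPool: "cpu-pool", NodeName: "k8s-node-1"},
		{Name: "w3", WorkerPool: "gpu-pool", NodeName: "k8s-node-2"},
	}
	idle, err := filterWorkers(workers, ListOptions{
		FieldSelector: map[string]string{"worker_pool": "cpu-pool", "actor_id": ""},
	})
	if err != nil {
		panic(err)
	}
	for _, w := range idle {
		fmt.Println(w.Name)
	}
}
```

An empty or nil `FieldSelector` matches every worker here, which keeps the existing list-everything behaviour for callers that pass a zero-value `ListOptions`.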

2. Enabling Production-Grade Pluggable DBs

This design allows pluggable database drivers to translate these selectors directly into native indexing queries instead of scanning the entire database:

  • PostgreSQL / Spanner: Can map the FieldSelector map directly to parameterized SQL WHERE clauses (leveraging B-Tree indexes):
    SELECT * FROM workers WHERE worker_pool = $1 AND actor_id = $2;
  • DynamoDB: Can map the selector directly to Key Condition Expressions or Filter Expressions.
  • Redis (Default): Keeps simple, low-overhead client-side filtering on top of key scans for development and testing, and can later be optimized to use set lookups (e.g., SPOP or SRANDMEMBER on set:idle:cpu-pool) for common selector combinations.
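For the SQL case, the translation from selector to WHERE clause might be sketched as below. The column names, the `allowedColumns` map and the `buildWorkerQuery` helper are all hypothetical; a real driver would derive them from its own schema. The one point the sketch is meant to illustrate is that selector keys need an allowlist, since column identifiers cannot be bound as query parameters, while values are always passed as arguments.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// allowedColumns maps selector keys to column names. The names are
// hypothetical; a real driver would take them from its schema.
var allowedColumns = map[string]string{
	"worker_pool": "worker_pool",
	"actor_id":    "actor_id",
	"node_name":   "node_name",
}

// buildWorkerQuery turns a field selector into a parameterized query using
// PostgreSQL-style placeholders ($1, $2, ...). Keys are checked against the
// allowlist; values are never interpolated into the SQL text.
func buildWorkerQuery(selector map[string]string) (string, []any, error) {
	keys := make([]string, 0, len(selector))
	for k := range selector {
		if _, ok := allowedColumns[k]; !ok {
			return "", nil, fmt.Errorf("unsupported field selector %q", k)
		}
		keys = append(keys, k)
	}
	// Map iteration order is random; sorting keeps the SQL text stable.
	sort.Strings(keys)

	conds := make([]string, 0, len(keys))
	args := make([]any, 0, len(keys))
	for i, k := range keys {
		conds = append(conds, fmt.Sprintf("%s = $%d", allowedColumns[k], i+1))
		args = append(args, selector[k])
	}

	query := "SELECT * FROM workers"
	if len(conds) > 0 {
		query += " WHERE " + strings.Join(conds, " AND ")
	}
	return query, args, nil
}

func main() {
	query, args, err := buildWorkerQuery(map[string]string{
		"worker_pool": "cpu-pool",
		"actor_id":    "",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(query)
	fmt.Println(args...)
}
```

One caveat a SQL driver would have to settle: if a free worker is stored with a NULL `actor_id` rather than an empty string, `actor_id = $1` with an empty argument will not match it, so the driver would need to map the empty selector value to `IS NULL` or store empty strings consistently.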

3. High-Performance Locality & Snapshot Scheduling

A major performance booster for Agent Substrate is scheduling actors onto workers that already have the snapshot hot-cached in memory or locally on disk.

With ListOptions, the scheduler can execute a clean fallback loop:

// 1. Try to find a free worker with the snapshot hot-cached.
warmWorkers, err := s.store.ListWorkers(ctx, store.ListOptions{
	FieldSelector: map[string]string{
		"worker_pool":     "cpu-pool",
		"actor_id":        "", // Free workers only
		"cached_snapshot": targetSnapshotURI,
	},
})
if err == nil && len(warmWorkers) > 0 {
	return warmWorkers[0], nil // Hot-cache hit!
}
// A failed or empty warm lookup is not fatal; fall through to the general pool.

// 2. Fallback: find any idle worker in the pool.
idleWorkers, err := s.store.ListWorkers(ctx, store.ListOptions{
	FieldSelector: map[string]string{
		"worker_pool": "cpu-pool",
		"actor_id":    "",
	},
})
if err != nil {
	return nil, err
}
if len(idleWorkers) == 0 {
	return nil, errors.New("no idle workers in cpu-pool")
}
return idleWorkers[0], nil

Under the hood, a custom SQL database or advanced Redis driver can intercept the first query and consult a dedicated snapshot-to-worker index, returning matches without scanning every worker: roughly $O(1)$ per lookup for a hash-based index, or $O(\log n)$ for a B-Tree.
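To make the index idea concrete, here is a small in-memory sketch of a snapshot-to-worker index. It is illustrative only: `snapshotIndex`, its methods and the example snapshot URIs are hypothetical names, and a real driver would keep the equivalent structure in its own storage (a Redis set per snapshot, or an indexed SQL table) and guard it against concurrent access, which this sketch does not do.

```go
package main

import (
	"fmt"
	"sort"
)

// snapshotIndex is a hypothetical secondary index from snapshot URI to the
// set of worker names that currently have that snapshot cached.
type snapshotIndex struct {
	bySnapshot map[string]map[string]struct{}
}

func newSnapshotIndex() *snapshotIndex {
	return &snapshotIndex{bySnapshot: make(map[string]map[string]struct{})}
}

// Add records that workerName has snapshotURI cached.
func (ix *snapshotIndex) Add(snapshotURI, workerName string) {
	set, ok := ix.bySnapshot[snapshotURI]
	if !ok {
		set = make(map[string]struct{})
		ix.bySnapshot[snapshotURI] = set
	}
	set[workerName] = struct{}{}
}

// Remove drops the entry, for example when a worker evicts the snapshot.
// Removing an entry that does not exist is a no-op.
func (ix *snapshotIndex) Remove(snapshotURI, workerName string) {
	set := ix.bySnapshot[snapshotURI]
	delete(set, workerName)
	if len(set) == 0 {
		delete(ix.bySnapshot, snapshotURI)
	}
}

// Workers returns the workers caching snapshotURI. Finding the set is a
// single map access; the sort only makes the output deterministic.
func (ix *snapshotIndex) Workers(snapshotURI string) []string {
	set := ix.bySnapshot[snapshotURI]
	names := make([]string, 0, len(set))
	for name := range set {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	ix := newSnapshotIndex()
	ix.Add("snap://base-image", "worker-2")
	ix.Add("snap://base-image", "worker-1")
	ix.Remove("snap://base-image", "worker-2")
	fmt.Println(ix.Workers("snap://base-image"))
	fmt.Println(len(ix.Workers("snap://unknown")))
}
```

A driver using an index like this would still need to intersect the result with the other selector entries (pool, idle state) before returning, so the cost of the warm query depends on how many workers cache the snapshot rather than on the total worker count.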

Conclusion

Adding ListOptions strikes a balance between minimal invasiveness today and architectural readiness for high-performance custom databases tomorrow. It addresses the scaling cost of in-memory filtering without forcing large, premature changes to the codebase. Worker and actor schemas are likely to drift over time (for example, size or CPU fields added for scaling) and to vary from one deployment to another, and a generic selector lets the persistence layer absorb that drift without further interface changes.
