# RFC: Flexible & Non-Ossifying Control Plane Persistence Layer

**Authors**: MushuEE / Agent Substrate Maintainers
**Status**: Draft / Proposed

---

## What's the Goal?

As Agent Substrate matures, we want to make sure our Control Plane persistence layer (`store.Interface`) doesn't paint us into a corner. 

Specifically, we want to achieve three things:
1. **Pluggable Databases**: Allow users/deployers to bring their own database of choice (PostgreSQL, Spanner, Aerospike, DynamoDB) without rewriting the entire API server.
2. **Locality-Aware Scheduling (Hot Snapshot Cache)**: Enable the scheduler to query for workers that already have a target snapshot hot-cached, with a clean fallback to the general pool if none are warm.
3. **Non-Ossifying API Surface**: Add these advanced query capabilities without bloating our Go interfaces or hardcoding specific filters into the persistence method signatures.

---

## The Problem Today

Right now, our `store.Interface` listing methods are all-or-nothing: they accept no filter and always return every record:
```go
ListWorkers(ctx context.Context) ([]*ateapipb.Worker, error)
ListActors(ctx context.Context) ([]*ateapipb.Actor, error)
```

This introduces severe issues as the project scales:
* **No indexing/filtering in the DB**: To find a free worker in a specific pool, the scheduler is forced to retrieve *every single worker* from the database and filter them in-memory in Go.
* **Zero flexibility**: If we want to query workers by Node name, size, or status, we either have to add new custom query methods to the interface (bloating the API surface) or continue doing extremely expensive in-memory scans in the control plane.
* **Database lock-in**: Alternative backends (such as SQL databases or DynamoDB) cannot use their native indexes (e.g. B-trees) to optimize queries, because the interface passes them no filtering context.

---

## The Proposal: Kubernetes-style `ListOptions`

Rather than hardcoding specific query parameters, we introduce a generic, extensible `ListOptions` block to list operations:

```go
// ListOptions holds parameters for filtering List results.
type ListOptions struct {
	// FieldSelector filters results by matching specific fields exactly (e.g., "worker_pool=cpu-pool").
	FieldSelector map[string]string
}
```

This allows our listing methods to support generic query filtering:
```go
ListWorkers(ctx context.Context, opts ListOptions) ([]*ateapipb.Worker, error)
ListActors(ctx context.Context, opts ListOptions) ([]*ateapipb.Actor, error)
```

---

## Technical Arguments

<details>
 <summary>1. Non-Ossifying API Surface</summary>

Using a map-based `FieldSelector` completely decouples the query signature from the resource schemas:
* If a new field (e.g., `node_name`, `size`, `gpu_type`) is added to the `Worker` protobuf tomorrow, **the Go interface does not change.** 
* New clients can start passing `opts.FieldSelector["node_name"] = "k8s-node-1"` immediately.
* The database implementation simply adds a new `case` statement to its internal matcher to support it.
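
The matcher described in the last bullet might look roughly like the following. This is a minimal sketch, not the project's implementation: the `Worker` struct is a hypothetical stand-in for `ateapipb.Worker`, and `matchWorker` and `filterWorkers` are illustrative names. Rejecting unknown keys (rather than ignoring them) is an assumption made here so that a misspelled selector does not silently match every worker.

```go
package main

import "fmt"

// Worker is a hypothetical stand-in for ateapipb.Worker.
type Worker struct {
	Name       string
	WorkerPool string
	ActorID    string
	NodeName   string
}

// ListOptions mirrors the struct proposed in this RFC.
type ListOptions struct {
	FieldSelector map[string]string
}

// matchWorker reports whether w satisfies every key/value pair in the selector.
// An unsupported key is always an error, regardless of map iteration order.
func matchWorker(w *Worker, opts ListOptions) (bool, error) {
	matched := true
	for key, want := range opts.FieldSelector {
		var got string
		switch key {
		case "worker_pool":
			got = w.WorkerPool
		case "actor_id":
			got = w.ActorID
		case "node_name": // supporting a new field is one more case
			got = w.NodeName
		default:
			return false, fmt.Errorf("unsupported field selector %q", key)
		}
		if got != want {
			matched = false
		}
	}
	return matched, nil
}

// filterWorkers applies matchWorker to an in-memory slice of workers.
func filterWorkers(all []*Worker, opts ListOptions) ([]*Worker, error) {
	var out []*Worker
	for _, w := range all {
		ok, err := matchWorker(w, opts)
		if err != nil {
			return nil, err
		}
		if ok {
			out = append(out, w)
		}
	}
	return out, nil
}

func main() {
	workers := []*Worker{
		{Name: "w1", WorkerPool: "cpu-pool", ActorID: "", NodeName: "k8s-node-1"},
		{Name: "w2", WorkerPool: "cpu-pool", ActorID: "actor-7", NodeName: "k8s-node-1"},
		{Name: "w3", WorkerPool: "gpu-pool", ActorID: "", NodeName: "k8s-node-2"},
	}
	idle, err := filterWorkers(workers, ListOptions{
		FieldSelector: map[string]string{"worker_pool": "cpu-pool", "actor_id": ""},
	})
	if err != nil {
		panic(err)
	}
	for _, w := range idle {
		fmt.Println(w.Name) // prints "w1"
	}
}
```

An empty or nil `FieldSelector` matches everything, which keeps the behavior of today's unfiltered `ListWorkers` calls.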

</details>

<details>
 <summary>2. Enabling Production-Grade Pluggable DBs</summary>

This design allows pluggable database drivers to translate these selectors directly into native indexing queries instead of scanning the entire database:

* **PostgreSQL / Spanner**: Can map the `FieldSelector` map directly to parameterized SQL `WHERE` clauses (leveraging B-Tree indexes):
 ```sql
 SELECT * FROM workers WHERE worker_pool = $1 AND actor_id = $2;
 ```
* **DynamoDB**: Can map the selector directly to Key Condition Expressions or Filter Expressions.
* **Redis (Default)**: Keeps simple, low-overhead client-side filtering on top of full key scans for development/testing, but can be optimized to use set lookups on a key such as `set:idle:cpu-pool` (e.g., `SRANDMEMBER` to pick a member, or `SPOP` to remove and claim one atomically) for common selector combinations.
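
As an illustration of the SQL translation above, a driver could build the `WHERE` clause along these lines. This is a sketch under stated assumptions: `buildWorkerQuery` and the `workerColumns` allowlist are hypothetical names, the `workers` table and its columns are taken from the example query, and the `$n` placeholders follow PostgreSQL conventions (other drivers use different placeholder syntax). Only values are bound as parameters; column names cannot be parameterized, so they come from the allowlist rather than from the caller.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// workerColumns is a hypothetical allowlist mapping selector keys to column names.
var workerColumns = map[string]string{
	"worker_pool": "worker_pool",
	"actor_id":    "actor_id",
	"node_name":   "node_name",
}

// buildWorkerQuery turns a field selector into a parameterized query plus its arguments.
func buildWorkerQuery(selector map[string]string) (string, []any, error) {
	keys := make([]string, 0, len(selector))
	for k := range selector {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map order is random; sort for a stable query string

	var conds []string
	var args []any
	for _, k := range keys {
		col, ok := workerColumns[k]
		if !ok {
			return "", nil, fmt.Errorf("unsupported field selector %q", k)
		}
		args = append(args, selector[k])
		conds = append(conds, fmt.Sprintf("%s = $%d", col, len(args)))
	}

	query := "SELECT * FROM workers"
	if len(conds) > 0 {
		query += " WHERE " + strings.Join(conds, " AND ")
	}
	return query, args, nil
}

func main() {
	query, args, err := buildWorkerQuery(map[string]string{
		"worker_pool": "cpu-pool",
		"actor_id":    "actor-7",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(query) // SELECT * FROM workers WHERE actor_id = $1 AND worker_pool = $2
	fmt.Println(args)  // [actor-7 cpu-pool]
}
```

The resulting string and argument slice can be passed to `database/sql`'s `QueryContext`. How a "free" worker is encoded (an empty `actor_id` versus `NULL`) depends on the schema and would need its own handling.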

</details>

<details>
 <summary>3. High-Performance Locality & Snapshot Scheduling</summary>

A major performance booster for Agent Substrate is scheduling actors onto workers that already have the snapshot hot-cached in memory or locally on disk. 

With `ListOptions`, the scheduler can run a clean two-step fallback:

```go
// 1. Try to find a free worker with the snapshot hot-cached
warmWorkers, err := s.store.ListWorkers(ctx, store.ListOptions{
	FieldSelector: map[string]string{
		"worker_pool": "cpu-pool",
		"actor_id": "", // Free workers only
		"cached_snapshot": targetSnapshotURI,
	},
})
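if err == nil && len(warmWorkers) > 0 {
	return warmWorkers[0], nil // Hot-cache hit!
}

// 2. Fallback: find any idle worker in the pool.
// An error from the warm lookup is treated as a cache miss and also lands here.
idleWorkers, err := s.store.ListWorkers(ctx, store.ListOptions{
	FieldSelector: map[string]string{
		"worker_pool": "cpu-pool",
		"actor_id":    "",
	},
})
if err != nil {
	return nil, err
}
if len(idleWorkers) == 0 {
	return nil, errors.New("no idle worker in pool") // requires the "errors" import
}
return idleWorkers[0], nil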
```

Under the hood, a custom SQL database or advanced Redis driver can intercept the first query and consult a dedicated snapshot-to-worker index, returning matches without scanning every worker (roughly $O(1)$ for a hash-based index, $O(\log n)$ for a B-tree).
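
A snapshot-to-worker index of this kind could be as simple as the in-memory structure below. It is a sketch only: `snapshotIndex`, its methods, and the `snap://` URIs are hypothetical, and a real driver would keep the equivalent data in a table or a Redis set and add its own locking.

```go
package main

import (
	"fmt"
	"sort"
)

// snapshotIndex is a hypothetical secondary index: snapshot URI -> set of worker names.
type snapshotIndex struct {
	bySnapshot map[string]map[string]struct{}
}

func newSnapshotIndex() *snapshotIndex {
	return &snapshotIndex{bySnapshot: make(map[string]map[string]struct{})}
}

// Add records that the worker holds snapshotURI in its cache.
func (idx *snapshotIndex) Add(snapshotURI, worker string) {
	set, ok := idx.bySnapshot[snapshotURI]
	if !ok {
		set = make(map[string]struct{})
		idx.bySnapshot[snapshotURI] = set
	}
	set[worker] = struct{}{}
}

// Remove drops the entry, for example when the worker evicts the snapshot.
func (idx *snapshotIndex) Remove(snapshotURI, worker string) {
	set := idx.bySnapshot[snapshotURI]
	delete(set, worker) // deleting from a nil map is a no-op
	if len(set) == 0 {
		delete(idx.bySnapshot, snapshotURI)
	}
}

// Workers returns the workers caching snapshotURI, found by a single map lookup
// instead of a scan over all workers. The result is sorted for stable output.
func (idx *snapshotIndex) Workers(snapshotURI string) []string {
	set := idx.bySnapshot[snapshotURI]
	out := make([]string, 0, len(set))
	for w := range set {
		out = append(out, w)
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := newSnapshotIndex()
	idx.Add("snap://base", "w1")
	idx.Add("snap://base", "w3")
	idx.Add("snap://other", "w2")
	idx.Remove("snap://base", "w3")

	fmt.Println(idx.Workers("snap://base"))    // [w1]
	fmt.Println(idx.Workers("snap://missing")) // []
}
```

The driver would still need to intersect these candidates with the other selector terms (`worker_pool`, `actor_id`) before returning them.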

</details>

## Conclusion

Adding `ListOptions` strikes a balance between **minimal invasiveness today** and **complete architectural readiness for high-performance custom databases tomorrow**. It solves the scaling issues of in-memory filtering without forcing massive, premature changes to the codebase. Workers are very likely to see structural drift (for example, scaling by size or CPU), and other worker/actor specifics will evolve from deployment to deployment, so a selector that can grow without interface changes gives us room to accommodate that.
