Replace actor eth0 move with worker-owned veth networking #122

@EItanya

Description

Today ateom-gvisor gives an active actor network connectivity by moving the worker pod's real Kubernetes eth0 into the actor/gVisor network namespace. That works for basic actor execution, but it has an important side effect: while the actor is running, the worker pod no longer owns its pod network interface.

This makes the worker pod partially disappear from the normal Kubernetes networking model during actor activation. It also prevents us from cleanly adding pod-local networking components, such as an egress proxy or policy enforcement sidecar, because those components remain in the worker pod network namespace while actor traffic leaves through an interface that has been moved elsewhere.

We should replace the eth0 move approach with an explicit veth pair between the worker pod namespace and the actor/gVisor namespace.

Rationale

Keeping the pod's real eth0 in the worker namespace gives us a stable networking boundary:

  • The worker pod keeps normal Kubernetes network connectivity while an actor is running.
  • Sidecars or worker-local helper processes can remain reachable and can participate in actor networking.
  • Actor networking becomes explicit and owned by Substrate instead of depending on moving the CNI-provided interface.
  • This creates the foundation for transparent egress capture and policy enforcement.
  • It avoids coupling actor lifecycle to destructive mutation of the pod's primary network interface.

This is especially important for planned transparent egress policy work. We want actor traffic to be captured without requiring application opt-in via HTTP_PROXY or HTTPS_PROXY. To do that cleanly, actor traffic needs to cross a worker-owned interface where Substrate can install forwarding, NAT, and later proxy-capture rules.

Proposed approach

For each active actor, ateom-gvisor should create a point-to-point veth pair:

  • Worker pod namespace side: ateom0, for example 10.200.0.1/30.
  • Actor/gVisor namespace side: renamed to eth0, for example 10.200.0.2/30.
  • Actor default route points to the worker-side veth IP.
  • The Kubernetes-provided pod eth0 remains in the worker pod namespace.
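
The addressing above can be sketched in Go. This is a minimal illustration, not the ateom-gvisor implementation: the pool `10.200.0.0/16`, the per-actor index, and the names `VethPlan` and `planVeth` are assumptions made for the example. It carves one /30 per actor out of a pool and derives the worker-side and actor-side addresses, with the worker address doubling as the actor's default gateway.

```go
package main

import (
	"fmt"
	"net/netip"
)

// VethPlan holds both ends of one actor's point-to-point /30.
// The type and its field names are hypothetical.
type VethPlan struct {
	Worker netip.Prefix // for ateom0 in the worker pod namespace
	Actor  netip.Prefix // for eth0 in the actor/gVisor namespace
}

// planVeth returns the index-th /30 block inside pool.
// Host .1 of the block goes to the worker side, host .2 to the actor side.
func planVeth(pool string, index int) (VethPlan, error) {
	p, err := netip.ParsePrefix(pool)
	if err != nil {
		return VethPlan{}, fmt.Errorf("parse pool: %w", err)
	}
	if !p.Addr().Is4() || p.Bits() > 30 {
		return VethPlan{}, fmt.Errorf("pool %s must be an IPv4 prefix of /30 or shorter", pool)
	}
	slots := 1 << (30 - p.Bits())
	if index < 0 || index >= slots {
		return VethPlan{}, fmt.Errorf("index %d out of range: pool %s holds %d /30 blocks", index, pool, slots)
	}

	// Convert the pool's network address to an integer and step by 4 per block.
	b := p.Masked().Addr().As4()
	base := uint32(b[0])<<24 | uint32(b[1])<<16 | uint32(b[2])<<8 | uint32(b[3])
	base += uint32(index) * 4

	host := func(offset uint32) netip.Prefix {
		v := base + offset
		a := netip.AddrFrom4([4]byte{byte(v >> 24), byte(v >> 16), byte(v >> 8), byte(v)})
		return netip.PrefixFrom(a, 30)
	}
	return VethPlan{Worker: host(1), Actor: host(2)}, nil
}

func main() {
	plan, err := planVeth("10.200.0.0/16", 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("worker ateom0:", plan.Worker)
	fmt.Println("actor eth0:", plan.Actor)
	fmt.Println("actor default route via", plan.Worker.Addr())
}
```

With index 0 this yields the example addresses from the list above. Whether a worker ever hosts more than one actor at a time, and so needs more than one block, is not stated here; the index parameter only shows how the scheme would extend.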

For compatibility with the current routing model, the initial implementation can install temporary nftables rules:

  • Masquerade actor egress behind the worker pod IP.
  • DNAT inbound traffic addressed to the worker pod IP on the actor service port to the actor veth IP.
  • Enable forwarding between the actor veth and pod eth0.

These rules are intended as a compatibility bridge, not the final egress policy implementation. The later egress work should replace broad actor egress NAT with transparent capture into AgentGateway and default-deny policy rules.
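
As a rough sketch of what the temporary rules could look like, the Go snippet below renders an nftables ruleset as text that could be loaded with `nft -f`. Everything specific in it is an assumption for illustration: the table name `ateom_compat`, the pod IP `10.1.2.3`, port `8080`, the TCP-only DNAT match, and the `BridgeConfig` type. Kernel IPv4 forwarding (`net.ipv4.ip_forward`) would also need to be enabled in the worker pod namespace, which the snippet does not do.

```go
package main

import (
	"fmt"
	"strings"
)

// BridgeConfig names the inputs the compatibility rules depend on.
// The type and all example values are hypothetical.
type BridgeConfig struct {
	PodIface  string // Kubernetes-provided interface, stays in the worker namespace
	VethIface string // worker side of the veth pair
	PodIP     string // worker pod IP
	ActorIP   string // actor-side veth IP
	ActorPort int    // actor service port (TCP assumed)
}

// dnatRule steers inbound traffic for the pod IP and service port to the actor.
func (c BridgeConfig) dnatRule() string {
	return fmt.Sprintf("iifname %q ip daddr %s tcp dport %d dnat to %s",
		c.PodIface, c.PodIP, c.ActorPort, c.ActorIP)
}

// masqueradeRule hides actor egress behind the worker pod IP.
func (c BridgeConfig) masqueradeRule() string {
	return fmt.Sprintf("ip saddr %s oifname %q masquerade", c.ActorIP, c.PodIface)
}

// forwardRules allows traffic in both directions between the veth and pod eth0.
// With "policy accept" they are redundant; they are spelled out so the chain
// policy could later be switched to drop.
func (c BridgeConfig) forwardRules() []string {
	return []string{
		fmt.Sprintf("iifname %q oifname %q accept", c.VethIface, c.PodIface),
		fmt.Sprintf("iifname %q oifname %q accept", c.PodIface, c.VethIface),
	}
}

func writeChain(b *strings.Builder, name, header string, rules ...string) {
	fmt.Fprintf(b, "\tchain %s {\n\t\t%s\n", name, header)
	for _, r := range rules {
		fmt.Fprintf(b, "\t\t%s\n", r)
	}
	b.WriteString("\t}\n")
}

// Ruleset renders one self-contained table so cleanup is a single table delete.
func (c BridgeConfig) Ruleset() string {
	var b strings.Builder
	b.WriteString("table ip ateom_compat {\n")
	writeChain(&b, "prerouting", "type nat hook prerouting priority dstnat; policy accept;", c.dnatRule())
	writeChain(&b, "postrouting", "type nat hook postrouting priority srcnat; policy accept;", c.masqueradeRule())
	writeChain(&b, "forward", "type filter hook forward priority filter; policy accept;", c.forwardRules()...)
	b.WriteString("}\n")
	return b.String()
}

func main() {
	cfg := BridgeConfig{
		PodIface: "eth0", VethIface: "ateom0",
		PodIP: "10.1.2.3", ActorIP: "10.200.0.2", ActorPort: 8080,
	}
	fmt.Print(cfg.Ruleset())
}
```

Keeping all three chains in a dedicated table is one way to make the rules easy to remove when the actor stops or when the later egress work replaces them.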

Expected outcome

After this change:

  • Actors still start, checkpoint, restore, and receive inbound traffic as before.
  • The worker pod retains network connectivity during actor execution.
  • ateom-gvisor no longer needs to scrape, move, or restore the pod's real eth0.
  • Substrate has a cleaner networking foundation for transparent egress capture, policy enforcement, and future worker-local AgentGateway integration.

Validation

The implementation should be validated with:

  • Existing focused Go tests for ateom-gvisor and related networking/server boot packages.
  • The demo e2e flow that exercises actor startup, golden snapshot creation, restore, and inbound routing.
  • A runtime check that the worker pod still has its Kubernetes eth0 while an actor is active.
  • A runtime check that actor traffic routes through the veth pair and inbound traffic still reaches the actor.
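
The interface-presence check could look something like the following, run inside the worker pod network namespace while an actor is active. It is a sketch under assumptions: the helper names `missingInterfaces` and `localInterfaceNames` are made up for this example, and it only confirms that the interfaces exist, not that traffic actually flows across the veth pair.

```go
package main

import (
	"fmt"
	"net"
)

// missingInterfaces returns every name in want that is absent from present.
func missingInterfaces(present []string, want ...string) []string {
	have := make(map[string]bool, len(present))
	for _, name := range present {
		have[name] = true
	}
	var missing []string
	for _, name := range want {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

// localInterfaceNames lists the interfaces visible in the calling process's
// network namespace.
func localInterfaceNames() ([]string, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(ifaces))
	for _, ifc := range ifaces {
		names = append(names, ifc.Name)
	}
	return names, nil
}

func main() {
	names, err := localInterfaceNames()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// eth0 is the Kubernetes-provided interface; ateom0 is the worker-side veth.
	if missing := missingInterfaces(names, "eth0", "ateom0"); len(missing) > 0 {
		fmt.Println("missing interfaces:", missing)
		return
	}
	fmt.Println("worker namespace has eth0 and ateom0")
}
```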

Metadata

Labels

  • area/network
  • area/node
  • kind/feature (An enhancement / feature request or implementation)
  • prio/P0 (Highest priority / required for next milestone)
