Replace actor eth0 move with veth networking #110

Open
Eitan Yarmush (EItanya) wants to merge 5 commits into agent-substrate:main from kagent-dev:transparent-egress-veth-networking

@EItanya Eitan Yarmush (EItanya) commented May 28, 2026

Fixes #122

Summary

This replaces the ateom-gvisor networking path that moved the worker pod's Kubernetes-provided eth0 into the actor/gVisor network namespace.

Instead, the worker pod keeps its real eth0, and ateom-gvisor creates a point-to-point veth pair between the worker pod namespace and the actor namespace. The actor-side peer is renamed to eth0, receives the actor-side address, and uses the worker-side veth as its default gateway.
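The wiring described above can be sketched as a sequence of iproute2 commands. This is only an illustration, not the PR's implementation (which configures the links from Go inside ateom-gvisor). The interface names `veth-worker` and `veth-actor`, the named network namespace `actor`, and the worker-side address `10.200.0.1/30` are assumptions made for the example; the actor-side address and gateway match the constants quoted later in this review.

```go
package main

import (
	"fmt"
	"strings"
)

// vethPlan returns the iproute2 commands that would create a veth pair, move
// one end into the actor namespace, rename it to eth0, address both ends, and
// point the actor's default route at the worker side.
// All names are hypothetical; actorNS is assumed to be a named netns.
func vethPlan(workerIf, peerIf, actorNS, workerCIDR, actorCIDR, gateway string) [][]string {
	return [][]string{
		// Create the point-to-point pair in the worker pod namespace.
		{"ip", "link", "add", workerIf, "type", "veth", "peer", "name", peerIf},
		// Move the peer into the actor namespace; the pod's real eth0 stays put.
		{"ip", "link", "set", peerIf, "netns", actorNS},
		// Address and bring up the worker side (the actor's gateway).
		{"ip", "addr", "add", workerCIDR, "dev", workerIf},
		{"ip", "link", "set", workerIf, "up"},
		// Inside the actor namespace: rename, address, bring up, add default route.
		{"ip", "-n", actorNS, "link", "set", peerIf, "name", "eth0"},
		{"ip", "-n", actorNS, "addr", "add", actorCIDR, "dev", "eth0"},
		{"ip", "-n", actorNS, "link", "set", "eth0", "up"},
		{"ip", "-n", actorNS, "route", "add", "default", "via", gateway},
	}
}

func main() {
	plan := vethPlan("veth-worker", "veth-actor", "actor",
		"10.200.0.1/30", "10.200.0.2/30", "10.200.0.1")
	for _, cmd := range plan {
		fmt.Println(strings.Join(cmd, " "))
	}
}
```

The sketch only prints the commands, so it can be run without root privileges.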

The PR also adds temporary nftables compatibility rules so existing inbound and outbound behavior continues to work while preserving the worker pod's own network connectivity.
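As a rough picture of what such compatibility rules look like, the following sketch renders an equivalent ruleset in nft(8) syntax: masquerade traffic sourced from the actor veth IP, and DNAT inbound TCP to the pod IP on the HTTP port over to the actor. The PR itself programs the rules through a Go nftables library rather than rendering text, so treat this as an approximation. The table name `ateom_compat` and the pod IP `10.244.0.7` are hypothetical.

```go
package main

import "fmt"

// natRules returns the two NAT rule bodies in nft syntax:
// masq hides actor egress behind the worker pod IP, and
// dnat forwards inbound TCP on the given port to the actor.
func natRules(podIP, actorIP string, port int) (masq, dnat string) {
	masq = fmt.Sprintf("ip saddr %s masquerade", actorIP)
	dnat = fmt.Sprintf("ip daddr %s tcp dport %d dnat to %s", podIP, port, actorIP)
	return masq, dnat
}

// ruleset wraps the rules in a table with postrouting, prerouting, and
// forward chains. The table name is an assumption for this example.
func ruleset(table, podIP, actorIP string, port int) string {
	masq, dnat := natRules(podIP, actorIP, port)
	return fmt.Sprintf(`table ip %s {
	chain postrouting {
		type nat hook postrouting priority srcnat; policy accept;
		%s
	}
	chain prerouting {
		type nat hook prerouting priority dstnat; policy accept;
		%s
	}
	chain forward {
		type filter hook forward priority filter; policy accept;
	}
}
`, table, masq, dnat)
}

func main() {
	// 10.244.0.7 stands in for the worker pod IP.
	fmt.Print(ruleset("ateom_compat", "10.244.0.7", "10.200.0.2", 80))
}
```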

Why

Moving the pod's real eth0 makes the worker pod lose normal Kubernetes network connectivity while an actor is active. That blocks pod-local networking components, including the planned transparent egress capture and AgentGateway integration, because those components remain in the worker pod namespace while actor traffic leaves through an interface that was moved elsewhere.

Keeping eth0 in the worker pod namespace gives Substrate a stable worker-owned networking boundary for future transparent egress policy enforcement.

Validation

  • go test ./cmd/ateom-gvisor ./cmd/atelet ./internal/ateompath ./internal/controllers
  • go test ./cmd/ateom-gvisor ./internal/serverboot
  • NO_DEV_ENV=true BUCKET_NAME=ate-snapshots KO_DOCKER_REPO=localhost:5001 KUBECTL_CONTEXT=kind-kind ./hack/run-e2e.sh ./internal/e2e/suites/demo -run TestDemo3 -count=1

Checklist

  • Tests pass
  • Appropriate changes to documentation are included in the PR

@EItanya Eitan Yarmush (EItanya) force-pushed the transparent-egress-veth-networking branch from b66b5d1 to 2e8c8fd on May 29, 2026 13:59
Comment thread cmd/ateom-gvisor/main.go Outdated
Policy: &acceptPolicy,
})

c.AddRule(&nftables.Rule{
Collaborator

Nit: can you put the programs for each chain right below the corresponding chain definition?

Comment thread cmd/ateom-gvisor/main.go Outdated
Type: nftables.ChainTypeFilter,
Hooknum: nftables.ChainHookForward,
Priority: nftables.ChainPriorityFilter,
Policy: &acceptPolicy,
Collaborator

Nit: You can use ptr.To: https://pkg.go.dev/k8s.io/utils/ptr#To
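For context, `ptr.To` is a small generic helper that returns a pointer to its argument, which removes the need for a named variable such as `acceptPolicy` just to take its address. The sketch below defines a local helper with the same shape so that it runs without external modules; `chainPolicy`, `policyAccept`, and `chain` are stand-ins invented for the example, not the nftables library's types.

```go
package main

import "fmt"

// to has the same shape as ptr.To in k8s.io/utils/ptr:
// it returns a pointer to a copy of v.
func to[T any](v T) *T { return &v }

// Stand-in types for illustration only.
type chainPolicy uint32

const policyAccept chainPolicy = 1

type chain struct {
	Policy *chainPolicy
}

func main() {
	// Without the helper, a constant needs a temporary variable:
	//   accept := policyAccept
	//   c := chain{Policy: &accept}
	c := chain{Policy: to(policyAccept)}
	fmt.Println(*c.Policy == policyAccept)
}
```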

Comment thread cmd/ateom-gvisor/main.go
Chain: postrouting,
Exprs: append(ipSourceEqual(actorVethIP), &expr.Masq{}),
})
preroutingExprs := append(ipDestinationEqual(podIP.String()), tcpDestinationPortEqual(80)...)
Collaborator

Can you leave a few TODOs here:

  • We need to handle inbound UDP as well (actors may run QUIC servers).
  • We need to handle multiple, configurable inbound ports. (I think maybe that will require multiple rules on the prerouting chain? Or maybe a more complicated rule that tests against multiple ports).
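One possible shape for the second TODO, sketched in nft(8) syntax rather than the Go nftables API used in the PR: a single rule per protocol can match several ports with an anonymous set, so configurable ports would not necessarily need one rule each. The helper name `dnatRule`, the pod IP `10.244.0.7`, and the port lists are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// dnatRule renders one prerouting rule that matches any of the given
// destination ports for proto ("tcp" or "udp") and rewrites the
// destination address to the actor, keeping the original port.
func dnatRule(podIP, actorIP, proto string, ports []int) string {
	parts := make([]string, len(ports))
	for i, p := range ports {
		parts[i] = strconv.Itoa(p)
	}
	return fmt.Sprintf("ip daddr %s %s dport { %s } dnat to %s",
		podIP, proto, strings.Join(parts, ", "), actorIP)
}

func main() {
	// TCP for HTTP-style servers, UDP for something like QUIC.
	fmt.Println(dnatRule("10.244.0.7", "10.200.0.2", "tcp", []int{80, 8080}))
	fmt.Println(dnatRule("10.244.0.7", "10.200.0.2", "udp", []int{443}))
}
```

Whether the Go library in use expresses this most naturally as a set lookup or as several rules is left open here, as it is in the comment above.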

Collaborator

@BenTheElder Benjamin Elder (BenTheElder) May 29, 2026

We could instead DNAT everything sent to the pod IP, without filtering on ports, and avoid the port config, since currently nothing except the actor needs network traffic.

EDIT: Spoke with Taahir, let's not :-)

Comment thread cmd/ateom-gvisor/main.go
return fmt.Errorf("while moving actor veth peer into interior netns: %w", err)
}

if err := netNSDo(ctx, s.interiorNetNS, configureActorVeth); err != nil {
Collaborator

This is really the only part that needs to be done on run / restore, I think, since gVisor wipes the routes from the links in the interior namespace. Everything else could be done on startup, at the same time we create the namespace.

(Except, eventually, I guess we will have per-actor egress rules we also program into nftables, so maybe wiping everything on each run makes more sense)

Author

Agreed, but this also depends on the outcome of #123 and how the different ateom-* conversation plays out. Either way I think it's ok for now

Comment thread cmd/ateom-gvisor/main.go Outdated
actorVethCIDR = "10.200.0.2/30"
actorVethGateway = "10.200.0.1"
actorVethIP = "10.200.0.2"
defaultActorPort = "80"
Collaborator

This is unused.
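As a side note on the remaining constants in this hunk: a /30 leaves exactly two usable host addresses, which is why it suits a point-to-point veth pair. The sketch below checks that arithmetic with the standard library; `usableHosts` is a hypothetical helper, not code from the PR.

```go
package main

import (
	"fmt"
	"net/netip"
)

// usableHosts returns the two usable host addresses of the IPv4 /30 that
// contains cidr (the network and broadcast addresses are excluded).
func usableHosts(cidr string) (netip.Addr, netip.Addr, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return netip.Addr{}, netip.Addr{}, err
	}
	if !p.Addr().Is4() || p.Bits() != 30 {
		return netip.Addr{}, netip.Addr{}, fmt.Errorf("want an IPv4 /30, got %s", cidr)
	}
	// Masked() zeroes the host bits, giving the network address.
	first := p.Masked().Addr().Next()
	return first, first.Next(), nil
}

func main() {
	gw, actor, err := usableHosts("10.200.0.2/30")
	if err != nil {
		panic(err)
	}
	// The first host matches actorVethGateway, the second actorVethIP.
	fmt.Println(gw, actor)
}
```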

Comment thread cmd/ateom-gvisor/main.go Outdated
// current router assumptions: actor egress is masqueraded behind the worker
// pod IP, and inbound traffic to the worker pod's HTTP port is DNAT'd to the
// actor veth IP. Later transparent egress capture will replace the broad
// egress NAT with AgentGateway-bound capture rules.
Collaborator

... Have we decided to use AgentGateway? cc Bowei Du (@bowei)

AFAICT this is the first reference in this project.

Author

@EItanya Eitan Yarmush (EItanya) May 29, 2026

This was a mistaken comment; I will fix it. My agent decided to hallucinate a bit based on previous stuff I was doing.

Comment thread internal/serverboot/serverboot.go
Collaborator

This needs a make update or a run of the LICENSES script.

thanks for working on this :-)

Development

Successfully merging this pull request may close these issues.

Replace actor eth0 move with worker-owned veth networking

3 participants