141 changes: 141 additions & 0 deletions tickets/devop-579-network-policy-rollout.md
@@ -0,0 +1,141 @@
# DEVOP-579 — NetworkPolicy egress rollout plan

**Status:** plan only. Execution is staged across 3 engineer-weeks. Do NOT deploy any NetworkPolicies based on this plan without the rollout owner signing off on the scope of each phase.

## Goal

Add `default-deny-egress` NetworkPolicies to every Kubernetes namespace across our 13 clusters, then layer explicit egress allowlists per workload. Closes the "compromised pod can call out to attacker-controlled C2" Shai-Hulud propagation path.

## Why this is hard (the 3-engineer-week estimate)

NetworkPolicies are additive and allow-only: there are no deny rules, and once any egress policy selects a pod, everything not explicitly allowed is dropped. A `default-deny` policy will therefore silently break every workload with a legitimate outbound dependency that hasn't been enumerated yet. The blast radius is production-impacting if rushed. The bulk of the work is **discovery**, not deployment.

## Phase 0 — Pre-flight (week 1, days 1–2)

- [ ] Confirm the CNI on every cluster enforces NetworkPolicy (Calico, Cilium, Antrea: yes; flannel on its own, without a policy-enforcing add-on: no).
- [ ] On any cluster that's still on flannel, stand up a policy enforcement layer (`network-policy-engine`, Calico) or move to Cilium's native NetworkPolicy enforcement.
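- [ ] Enable flow logs on at least one staging cluster: `cilium hubble enable` on Cilium, or Calico's flow-log feature on Calico (configured through Calico's own settings for the installed version; `calicoctl` has no `flow logs enable` subcommand). We need ~7 days of baseline traffic to enumerate legitimate egress.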

## Phase 1 — Discovery (week 1, days 3–5; week 2, days 1–2)

For each namespace, in priority order (highest-value first):
1. `allora-chain-validators`
2. `allora-chain-rpc`
3. `harbor`
4. `flux-system`
5. ingress-nginx / traefik
6. cert-manager
7. application namespaces (`robonet`, `eliza-allora`, etc.)
8. system namespaces last (`kube-system`, `gke-system`)

For each:
- [ ] Capture 7 days of egress flow logs from baseline.
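- [ ] Enumerate destination CIDRs, ports, and protocols, plus DNS names where available (L3/L4 flow records carry only IP, port, and protocol, so names typically have to be recovered from DNS query logs or DNS-aware flow visibility).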
- [ ] Group by category: `internal` (other Allora namespaces), `infra` (cloud-provider metadata, DNS, NTP, GKE), `vendor-saas` (Datadog, Slack, etc.), `package-registries` (npm, pypi, docker.io, ghcr.io, quay.io — these become Harbor proxy-cache after DEVOP-589), `customer-traffic` (per-namespace).
- [ ] **Explicitly flag suspect egress destinations** for incident review before they get added to any allowlist. Treat the following as suspect by default:

Review comment: Almost 100% certain we will not see domains/URIs in the egress logging. As this is Layer 3/4, it would only be IP/port + protocol 🤔 We may need to map these against cluster DNS logs, but I also don't think that we have verbose DNS logs turned on.

Author reply: So, in that case, should I just add "turn on verbose DNS logs"?

- Generic webhook receivers (`*.webhook.site`, `discord.com/api/webhooks/*`, `hooks.slack.com/services/*` that aren't ours, `*.pipedream.net`, `*.requestbin.com`, etc.).
- Pastebin-family services (`pastebin.com`, `paste.ee`, `hastebin.com`, `gist.githubusercontent.com` raw fetches from non-org accounts, `transfer.sh`, `0x0.st`).
- Tunnel / reverse-proxy services (`*.ngrok.io`, `*.ngrok-free.app`, `*.loca.lt`, `*.trycloudflare.com`, `*.serveo.net`).
- Cloud-instance metadata endpoints from inside a pod (`169.254.169.254`, `metadata.google.internal`, `100.100.100.200`) — these should be blocked outright unless a specific workload demonstrably needs them, and even then via an IRSA / Workload Identity allowlist, not raw IP.
- Anything resolving to a residential/dynamic-DNS provider (`*.duckdns.org`, `*.no-ip.com`, `*.dyndns.org`).
Each flagged destination needs an incident-response review: confirm a legitimate owner, document the use case, and either allowlist with a tight CIDR / FQDN or open a remediation ticket. Do NOT roll suspect destinations into the allowlist by default just because they appear in the 7-day baseline.
- [ ] Document in this repo as `network-policies/discovery/<namespace>.md` for future audit, including the suspect-destination review notes.
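
The suspect-destination check above could be partly automated. The following is a minimal sketch, not a finished tool: it assumes each flow record has already been joined with a hostname recovered from DNS logs (the `hostname` argument is that hypothetical joined field), and it only covers the domain-level patterns from the list, since URL paths such as `discord.com/api/webhooks/*` are not visible in L3/L4 flow data.

```python
from fnmatch import fnmatchcase
from ipaddress import ip_address

# Domain-level globs copied from the suspect list above.
# Path-based entries (Discord/Slack webhooks, raw gists) need L7 data and are omitted.
SUSPECT_HOST_PATTERNS = {
    "webhook-receiver": ["*.webhook.site", "*.pipedream.net", "*.requestbin.com"],
    "pastebin": ["pastebin.com", "paste.ee", "hastebin.com", "transfer.sh", "0x0.st"],
    "tunnel": ["*.ngrok.io", "*.ngrok-free.app", "*.loca.lt",
               "*.trycloudflare.com", "*.serveo.net"],
    "dynamic-dns": ["*.duckdns.org", "*.no-ip.com", "*.dyndns.org"],
    "cloud-metadata": ["metadata.google.internal"],
}
METADATA_IPS = {ip_address("169.254.169.254"), ip_address("100.100.100.200")}


def flag_destination(dest_ip, hostname=None):
    """Return the suspect category for one egress destination, or None."""
    if ip_address(dest_ip) in METADATA_IPS:
        return "cloud-metadata"
    if hostname:
        # Normalize case and strip the trailing dot of a fully qualified name.
        host = hostname.lower().rstrip(".")
        for category, patterns in SUSPECT_HOST_PATTERNS.items():
            if any(fnmatchcase(host, pattern) for pattern in patterns):
                return category
    return None
```

A destination that returns a category would go to incident review; `None` only means "not on this list", not "safe to allowlist".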

## Phase 2 — Allowlist authoring (week 2, days 3–5)

Per namespace, write two files:
- `network-policies/<cluster>/<namespace>/default-deny.yaml` — applies to all pods in the namespace, blocks all egress except DNS.
- `network-policies/<cluster>/<namespace>/allowlist.yaml` — explicit egress rules derived from Phase 1.

Patterns to standardize:
- DNS always allowed to kube-dns / coredns (53/udp, 53/tcp).
- NTP always allowed (123/udp).
- Cluster-internal pod-to-pod within same namespace: allow by default.
- Outbound to other Allora namespaces: explicit per-namespace allow (no blanket).
- Outbound to public internet: only through a designated egress proxy (or allowlisted by CIDR).
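
As an illustration of the first and third patterns, here is a sketch that builds the two baseline manifests as plain dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. The policy names, the `kube-system` namespace, and the `k8s-app: kube-dns` label are assumptions; check them against each cluster's actual DNS deployment before use.

```python
import json


def default_deny_egress(namespace, name="default-deny-egress"):
    """Select every pod in the namespace and allow only DNS egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # `name` is an assumption; it must match whatever the rollback runbook deletes.
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector = all pods in the namespace
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{
                    # Assumption: cluster DNS runs in kube-system labelled k8s-app=kube-dns.
                    "namespaceSelector": {
                        "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}
                    },
                    "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
                }],
                "ports": [
                    {"protocol": "UDP", "port": 53},
                    {"protocol": "TCP", "port": 53},
                ],
            }],
        },
    }


def allow_same_namespace_egress(namespace, name="allow-same-namespace"):
    """Allow pod-to-pod egress within the namespace (hypothetical policy name)."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            # A bare podSelector in `to` matches pods in the policy's own namespace.
            "egress": [{"to": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_egress("harbor"), indent=2))
```

Generating the manifests from one function per pattern keeps the 13 clusters consistent; the per-workload allowlist rules from Phase 1 would still be written by hand.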

## Phase 3 — Staged rollout (week 3)

DEVOP-579 specifies **48-hour soak windows** between rollout stages
(not 24h) so a full business-day cycle plus a quieter overnight cycle
both elapse before the next stage advances. This catches workloads
whose egress only fires on cron/batch schedules.

- [ ] Days 1–2: apply policies to **1 staging namespace** in **1 staging cluster**. Soak 48h.
- [ ] Days 3–4: apply to all staging namespaces in 1 cluster. Soak 48h.
- [ ] Days 5–6: apply to 1 production namespace (lowest-risk: docs site). Soak 48h.
- [ ] Days 7+: roll forward through remaining production namespaces in priority-inverse order (lowest-blast-radius first), keeping a 48h soak between each cluster cohort.

A stage may only advance if the prior soak completed with zero
NetworkPolicy-attributable incidents. If anything broke, hold the
window open until the root cause is fixed (or the policy is amended)
and restart the 48-hour clock for that stage.
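
The advance rule can be written down as a small gate. This is a sketch under assumptions: incident timestamps come from wherever NetworkPolicy-attributable incidents are recorded (a hypothetical input here), and "restarting the clock" is modelled by passing the time the fix landed as the new soak start.

```python
from datetime import datetime, timedelta, timezone

SOAK = timedelta(hours=48)


def may_advance(soak_started, now, incident_times):
    """True only if 48h have elapsed since soak_started with no incident in that window."""
    incidents_in_window = [t for t in incident_times if soak_started <= t <= now]
    return not incidents_in_window and (now - soak_started) >= SOAK


if __name__ == "__main__":
    start = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)
    incident = start + timedelta(hours=10)
    fix_landed = start + timedelta(hours=12)
    # The incident blocks the original window; the clock restarts at fix_landed.
    print(may_advance(start, start + timedelta(hours=50), [incident]))
    print(may_advance(fix_landed, fix_landed + timedelta(hours=48), [incident]))
```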

**Rollback procedure** (must be documented before Day 1):
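- `kubectl delete networkpolicy default-deny -n <ns>` removes the deny baseline. Because policies are additive, pods stay restricted to the allowlist while any other egress policy still selects them, so delete the namespace's allowlist policy as well if egress must be fully reopened.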
- Have this command ready as a runbook step in the on-call channel.
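
One thing worth checking before relying on a single delete: a pod remains egress-isolated while any NetworkPolicy with the `Egress` type still selects it. The sketch below lists every policy in a namespace that affects egress, given the `items` array from `kubectl get networkpolicy -n <ns> -o json`. It is a simplification that ignores `podSelector` and treats every policy in the namespace as relevant.

```python
def affects_egress(policy):
    """True if this NetworkPolicy isolates the pods it selects for egress."""
    spec = policy.get("spec", {})
    types = spec.get("policyTypes")
    if types is None:
        # API default: Egress is implied only when egress rules are present.
        return bool(spec.get("egress"))
    return "Egress" in types


def egress_rollback_targets(policies):
    """Names of policies that must all be removed to fully reopen egress."""
    return sorted(p["metadata"]["name"] for p in policies if affects_egress(p))
```

If this returns more than the default-deny policy, the runbook step needs one delete per name, or a decision that falling back to "allowlist only" is an acceptable rollback state.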

## Phase 4 — Steady state

- [ ] Add a Kyverno policy (after DEVOP-588 lands) so any new namespace without a `default-deny` NetworkPolicy is auto-flagged.
- [ ] Monthly review of `discovery/<namespace>.md` for changes in legitimate egress (new vendor SaaS, etc.).
- [ ] **Document the rollout and steady-state policies in `SECURITY-RUNBOOK.md`** (DEVOP-571): add a NetworkPolicy section covering (a) the default-deny model, (b) where the per-namespace allowlists live in this repo, (c) the rollback command (`kubectl delete networkpolicy default-deny -n <ns>`), and (d) the on-call escalation path when a workload reports egress failures. Without this hook into the runbook, on-call has no reference for diagnosing "my pod can't reach X" pages once default-deny is org-wide.
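
For the monthly review, or on clusters where the Kyverno check is not yet in place, a coverage audit could look like the sketch below. It assumes two inputs gathered separately: the list of namespace names, and the JSON printed by `kubectl get networkpolicy -A -o json`. The function name and the default policy name are illustrative.

```python
import json


def namespaces_missing_policy(namespaces, policy_list_json, required="default-deny"):
    """Return the namespaces that have no NetworkPolicy named `required`."""
    items = json.loads(policy_list_json).get("items", [])
    covered = {
        item["metadata"]["namespace"]
        for item in items
        if item["metadata"]["name"] == required
    }
    return sorted(set(namespaces) - covered)
```

The namespace list is passed in explicitly because a namespace with no policies at all never appears in the NetworkPolicy listing.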

## Dependencies

- Harbor proxy-cache projects (DEVOP-589) must land **before** Phase 2, or the allowlists will churn: they'd first need to allow direct `ghcr.io` etc., then be rewritten to allow only `harbor.allora-network.io`.
- Kyverno on all clusters (DEVOP-588) is a soft dependency: Phase 4 needs it but Phases 0–3 can proceed.

## Ingress default-deny — same model, separate rollout cohort

DEVOP-579 requires default-deny for both egress **and** ingress. The
two share a rollout shape but have different blast-radius and
different discovery inputs, so they run as parallel cohorts rather
than as one combined sweep.

For ingress, mirror Phases 0–4 above with these substitutions:

- **Phase 1 (discovery)**: capture the *inbound* flow logs per
namespace for 7 days. Categorize sources by `internal` (other
Allora namespaces), `infra` (ingress controllers, load balancers,
health-check probes), `vendor-saas` (webhook callbacks, etc.), and
`public-traffic` (customer-facing routes). Apply the same suspect-
destination flagging in reverse: any inbound source that resolves
to a residential-DNS / tunnel service / cloud-metadata range is
reviewed before allowlisting.
- **Phase 2 (allowlist authoring)**: per namespace, write
`network-policies/<cluster>/<namespace>/default-deny-ingress.yaml`
plus `ingress-allowlist.yaml`. Pattern: deny all inbound by default,
allow from the ingress controller's pod selector, allow from
same-namespace pods, then explicit allow rules per legitimate
upstream.
- **Phase 3 (staged rollout)**: same 48-hour soak windows. Ingress
blast radius is generally *higher* than egress (a misconfigured
ingress policy can take a service offline for real users, not just
internal callouts), so the production cohort starts later and
proceeds slower than egress.
- **Phase 4 (steady state)**: Kyverno rule asserting both
`default-deny-egress` AND `default-deny-ingress` exist per namespace.
Runbook section covers both.
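
A sketch of the Phase 2 ingress pattern, again as dictionaries that can be dumped to JSON for `kubectl apply -f`. The controller namespace `ingress-nginx` and the `app.kubernetes.io/name: ingress-nginx` pod label are assumptions about how the controller is deployed; traefik clusters would need different selectors, and the per-upstream allow rules are left out.

```python
import json


def default_deny_ingress(namespace, name="default-deny-ingress"):
    """Select every pod in the namespace; no ingress rules means all inbound is denied."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def ingress_allowlist(namespace, controller_namespace="ingress-nginx",
                      controller_labels=None, name="ingress-allowlist"):
    """Allow inbound from same-namespace pods and from the ingress controller pods."""
    if controller_labels is None:
        # Assumption: label applied by the controller's own deployment manifests.
        controller_labels = {"app.kubernetes.io/name": "ingress-nginx"}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [
                # Pods in the same namespace.
                {"from": [{"podSelector": {}}]},
                # Controller pods: both selectors in one entry are ANDed together.
                {"from": [{
                    "namespaceSelector": {
                        "matchLabels": {"kubernetes.io/metadata.name": controller_namespace}
                    },
                    "podSelector": {"matchLabels": controller_labels},
                }]},
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(ingress_allowlist("robonet"), indent=2))
```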

Run egress first (it's lower-risk because the failure mode is
"workload can't reach Datadog" rather than "customers can't reach our
API"). Start ingress discovery in parallel during Phases 0–1 of the
egress rollout so the two cohorts can converge on Phase 4 around the
same time.

## Out of scope for this ticket

- IDS/anomaly detection on egress flow logs — separate ticket (Falco rules, DEVOP-570).

## Who runs this

- Owner: cluster-admin / platform team.
- Reviewer: security team (sign-off on each phase before proceeding to the next).
- Estimated total engineer-time: ~3 engineer-weeks calendar, ~50% utilization (lots of waiting for flow-log baselines to accumulate).

## Links

- Linear: https://linear.app/alloralabs/issue/DEVOP-579
- Cilium NetworkPolicy reference: https://docs.cilium.io/en/stable/security/policy/
- Calico NetworkPolicy reference: https://docs.tigera.io/calico/latest/network-policy/