100 changes: 100 additions & 0 deletions docs/contributing/01-Proposals/MEP19/README.md
@@ -0,0 +1,100 @@
---
slug: /MEP-19-zone-awareness
title: MEP-19
sidebar_position: 19
---

# Zone Awareness

In metal-stack, the concepts of regions and zones are currently represented implicitly through partition names rather than as dedicated API entities. This design uses naming conventions to encode both region and zone information within a partition identifier. For example, the partition name `fra_eqx_01` translates to Frankfurt (region), Equinix (zone), and 01 (partition).
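As a minimal sketch of the naming convention described above, the following Python snippet splits a partition name into its three parts. The `PartitionName` type and the `parse_partition_name` function are hypothetical and not part of any metal-stack API. The sketch assumes exactly three underscore-separated fields.

```python
from typing import NamedTuple


class PartitionName(NamedTuple):
    """Parts encoded in a partition name by convention (hypothetical type)."""
    region: str
    zone: str
    partition: str


def parse_partition_name(name: str) -> PartitionName:
    """Split '<region>_<zone>_<partition>' into its parts."""
    parts = name.split("_")
    # Reject names with a wrong number of fields or with empty fields.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <region>_<zone>_<partition>, got {name!r}")
    return PartitionName(*parts)


print(parse_partition_name("fra_eqx_01"))
# prints: PartitionName(region='fra', zone='eqx', partition='01')
```

The sketch also shows why the convention is fragile: nothing but the name itself carries the region and zone information, so a name that deviates from the pattern cannot be interpreted.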

From a networking perspective, traffic between private node networks is not routed between partitions. To prevent misconfiguration, private networks are derived from partition-scoped `supernetworks`, which prevents private node networks from being used across different partitions. Only external networks such as the Internet or Datacenter Interconnect (DCI) connections can be used to route traffic between partitions.

Additionally, all networks currently have disjunct IP prefixes. With the introduction of [MEP-4](../MEP4/README.md), this behavior will change: network prefixes may overlap across partitions but must remain disjunct within a single project. This has been possible since go-ipam release `v1.12.0`, which introduced the concept of network namespaces.
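The rule that prefixes may overlap across projects but not within one project can be illustrated with Python's standard `ipaddress` module. This is only a sketch of the constraint, not how go-ipam implements namespaces. The function name `find_overlaps` and the project names are assumptions made for this example.

```python
import ipaddress
from collections import defaultdict


def find_overlaps(networks):
    """Return overlapping prefix pairs that share a project.

    networks: iterable of (project, prefix) tuples, e.g. ("project-a", "10.0.0.0/22").
    """
    by_project = defaultdict(list)
    for project, prefix in networks:
        by_project[project].append(ipaddress.ip_network(prefix))

    conflicts = []
    for project, nets in by_project.items():
        # Compare every pair of prefixes inside the same project exactly once.
        for i, first in enumerate(nets):
            for second in nets[i + 1:]:
                if first.overlaps(second):
                    conflicts.append((project, str(first), str(second)))
    return conflicts


# The same prefix in two different projects is fine.
print(find_overlaps([("project-a", "10.0.0.0/22"), ("project-b", "10.0.0.0/22")]))
# Overlapping prefixes inside one project are reported.
print(find_overlaps([("project-a", "10.0.0.0/22"), ("project-a", "10.0.2.0/24")]))
```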

## Motivation

Current metal-stack installations can already spread a single partition across data centers. This can be achieved through the rack spreading feature (introduced by [MEP-12](../MEP12/README.md)).

This feature has limitations: it is not possible to decide explicitly in which racks nodes are placed, and spreading follows a best-effort strategy. If no machine is available in one rack, a node might be placed in a rack where a machine is already present.
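The best-effort behavior can be illustrated with a small sketch. It is not the actual rack spreading implementation from MEP-12. The function `pick_rack`, its inputs, and the tie-breaking rule are assumptions chosen only to show why a node can end up in an already occupied rack.

```python
def pick_rack(free_machines, placed):
    """Pick a rack for the next node with a best-effort spreading strategy.

    free_machines: {rack: number of free machines in that rack}
    placed: {rack: number of machines already placed in that rack}
    """
    candidates = [rack for rack, free in free_machines.items() if free > 0]
    if not candidates:
        raise RuntimeError("no machine available in any rack")
    # Prefer the rack with the fewest machines placed so far; break ties by name.
    return min(candidates, key=lambda rack: (placed.get(rack, 0), rack))


# Both racks have capacity: the empty rack is preferred.
print(pick_rack({"rack-1": 1, "rack-2": 1}, {"rack-1": 1}))
# rack-2 has no free machine: the node lands in rack-1 again.
print(pick_rack({"rack-1": 2, "rack-2": 0}, {"rack-1": 1}))
```

In the second call the caller has no way to insist on `rack-2`, which is the limitation that explicit zones are meant to remove.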

Another issue with this approach is that the single partition is still one failure domain, e.g. a single BGP failure could bring down the whole partition. As known from major cloud providers, zonal distribution of workload enhances availability and fault tolerance.

## Requirements to Achieve this Goal

To support explicit region and zone concepts in metal-stack, several functional and architectural requirements must be met. The following considerations focus primarily on the Kubernetes integration and cluster topology aspects:

- Proper spreading of worker nodes and control plane components across [multiple zones](https://kubernetes.io/docs/setup/best-practices/multiple-zones/) and regions must be possible.
- Nodes that belong to the same Kubernetes cluster must have the capability to communicate directly with each other, even if they are located in different partitions, provided that network configurations allow this communication using their respective Node CIDRs.
- It must be possible for nodes within a single Kubernetes cluster to use different Node CIDR ranges, depending on their partition or zone assignment. Major cloud providers use node groups to configure Node CIDRs differently.
Contributor

Why is this required? In GCP this is not the case, node IPs are not in different CIDR ranges.

Suggested change
- It must be possible for nodes within a single Kubernetes cluster to use different Node CIDR ranges, depending on their partition or zone assignment. Major cloud providers use node groups to configure Node CIRDs differently.
- It must be possible for nodes within a single Kubernetes cluster to use different Node CIDR ranges, depending on their partition or zone assignment. Major cloud providers use node groups to configure Node CIDRs differently.

- Zones must remain separate failure domains. For example, a failure in the EVPN control plane of one zone must not affect the other zones (no EVPN fate-sharing).

## Criteria

- Number of hops for communication between worker nodes, to the Internet, and to the storage.
Contributor

Introduction sentence is necessary. Which criteria do we talk about?


Storage resources must either be strictly located in a single partition or replicated across all partitions. This can be enforced using [`allowedTopologies`](https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies) within a `StorageClass`.
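To make the `allowedTopologies` restriction concrete, the following sketch builds a `StorageClass` manifest as a Python dictionary and prints it as JSON. The field names and the `topology.kubernetes.io/zone` label come from the Kubernetes API. The helper `zonal_storage_class`, the class name `zonal-storage`, the provisioner `csi.example.com`, and the zone value `fra-eqx` are placeholders assumed for this example.

```python
import json


def zonal_storage_class(name, provisioner, zones):
    """Build a StorageClass manifest that only provisions volumes in the given zones."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        # Delay volume binding until a Pod is scheduled, so the volume
        # is created in the zone where the Pod actually runs.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowedTopologies": [
            {
                "matchLabelExpressions": [
                    {"key": "topology.kubernetes.io/zone", "values": sorted(zones)}
                ]
            }
        ],
    }


manifest = zonal_storage_class("zonal-storage", "csi.example.com", ["fra-eqx"])
print(json.dumps(manifest, indent=2))
```

Storage that is replicated across all partitions would list every zone in `values`, while strictly local storage would list exactly one.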

An open design question remains regarding Pod and Service CIDRs, which we usually configure for native routing (using FRR peering with the CNI and with MetalLB for service exposure). In the case of zonal routing, this would imply that traffic inside the FRR peering range also needs to be routable across zonal partitions. Should overlay networks be allowed, or is it possible to depend on IPv6 to solve this issue? Further evaluation is needed to determine the optimal approach.

## Proposals

**Proposal 1: Disjunct VNIs Across Partitions**

![proposal 1](proposal-1.svg)

In this approach, each partition uses a distinct set of VNIs. An additional controller, most likely running on the exit switch, would be required to build and manage the corresponding route maps.

Each partition would maintain its own VRF. On the exit switch, routes from all VRFs associated with the same project would be imported to enable project-wide routing between partitions while maintaining isolation from other projects.

The firewall would need to participate in all VRFs of the cluster, ensuring consistent traffic filtering and policy enforcement across partitions. Additionally, a default route must be present within each VRF.
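The import logic that such a controller would need can be sketched as follows. This is an illustration of the grouping step only and does not generate switch or FRR configuration. The function `plan_vrf_imports` and the project, partition, and VRF names are hypothetical.

```python
from collections import defaultdict


def plan_vrf_imports(vrfs):
    """Compute which VRFs each VRF has to import routes from.

    vrfs: iterable of (project, partition, vrf_name) tuples.
    Returns {vrf_name: sorted list of VRFs of the same project to import}.
    """
    by_project = defaultdict(list)
    for project, _partition, vrf in vrfs:
        by_project[project].append(vrf)

    imports = {}
    for members in by_project.values():
        for vrf in members:
            # Import from every other VRF of the same project, never from other projects.
            imports[vrf] = sorted(peer for peer in members if peer != vrf)
    return imports


plan = plan_vrf_imports([
    ("project-a", "partition-1", "vrf1"),
    ("project-a", "partition-2", "vrf2"),
    ("project-b", "partition-1", "vrf3"),
])
print(plan)
# prints: {'vrf1': ['vrf2'], 'vrf2': ['vrf1'], 'vrf3': []}
```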

**Proposal 2: Multi-Site DCI**

![proposal 2](proposal-2.svg)

In the second approach, the same VNIs are used across multiple partitions. This capability can be realized by leveraging features provided by the Enterprise Switch OS.

From a metal-stack perspective, each partition would still define separate node networks, but the same VRFs would be available in each partition.

To support this, the `metal-api` would need to be extended to allow identical VNIs across different networks and partitions, as long as they belong to the same project.
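A relaxed uniqueness check of this kind could look like the sketch below. It does not reflect the actual `metal-api` data model or validation code. The function `vni_allowed` and the dictionary keys `vni`, `project`, and `partition` are assumptions made for illustration.

```python
def vni_allowed(existing_networks, vni, project):
    """Return True if a new network of `project` may use `vni`.

    A VNI may be reused only when every network that already
    uses it belongs to the same project.
    """
    return all(
        net["project"] == project
        for net in existing_networks
        if net["vni"] == vni
    )


existing = [
    {"vni": 100, "project": "project-a", "partition": "partition-1"},
    {"vni": 200, "project": "project-b", "partition": "partition-1"},
]

print(vni_allowed(existing, 100, "project-a"))  # same project, reuse permitted
print(vni_allowed(existing, 200, "project-a"))  # VNI belongs to another project
print(vni_allowed(existing, 300, "project-a"))  # VNI is unused
```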

**Storage**

Storage aspects will likely be addressed in a dedicated MEP. However, some initial considerations are outlined here.

![current storage situation](storage-current.svg)

In the current architecture as illustrated above, a node accesses storage through the firewall.

![storage proposal](storage-proposal.svg)

One possible improvement would be to remove the dependency on the firewall for storage access. This could be achieved by configuring a route map on the leaf switch to establish a direct mapping between the tenant VRF and the storage VRF on a per-project basis.

**Proposal 3: Project-Wide Route-Leaking and Open DCI**

This is a mixture of proposals 1 and 2 with disjunct VNIs across partitions.

In this approach, each partition uses a distinct set of VNIs. The `metal-core`, running on the leaf switches, would be required to build and manage route leaks:

- from certain private networks (e.g. all project networks, storage network) to the local VRF (only locally held at the leaf switches)
- from the local VRF to a DCI VRF (only propagated zone-wide)

The open DCI is a ring of exit switches speaking plain BGP (no EVPN routes, no VXLAN) to exchange the private supernetworks of the zones (note: prefix length is longer). The exit switches operate as VTEPs for the DCI VRF, and this setup does not depend on the Multi-Site DCI feature of Enterprise SONiC.

Notes:

- cross-zone traffic is transported efficiently, as the firewall is not in the path (fewer hops)
- this can also be used to provide worker nodes with a more efficient way to access storage systems (also not going through the firewall)

## Operational Recommendations and Documentation Notes

Include a recommendation on the maximum practical distance between partitions within a single zone, particularly with regard to latency-sensitive components such as `etcd`.

## Roadmap

The following tasks can be considered as next steps:

- Verify proposals in containerlab
- Research: Can FRR do the Multi-Site DCI Feature out-of-the-box?
- Create samples for a Gardener shoot spec and the Cluster API manifests
51 changes: 51 additions & 0 deletions docs/contributing/01-Proposals/MEP19/proposal-1.drawio
@@ -0,0 +1,51 @@
<mxfile host="65bd71144e">
<diagram id="8gMl2hTIlcoxMkYUvRWJ" name="Page-1">
<mxGraphModel dx="621" dy="454" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
<root>
<mxCell id="0"/>
<mxCell id="1" parent="0"/>
<mxCell id="6" value="Partition 1" style="swimlane;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="120" y="40" width="240" height="240" as="geometry"/>
</mxCell>
<mxCell id="2" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;VRF1&lt;/font&gt;" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/switch/Switch_48_port_L3.svg;" parent="6" vertex="1">
<mxGeometry x="81" y="48" width="78" height="52.8" as="geometry"/>
</mxCell>
<mxCell id="4" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;10.0.0.1/32&lt;/font&gt;" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/computer_and_terminals/Server_Desktop.svg;" parent="6" vertex="1">
<mxGeometry x="98.69999999999999" y="160" width="42.599999999999994" height="54" as="geometry"/>
</mxCell>
<mxCell id="10" style="edgeStyle=none;html=1;endArrow=none;endFill=0;" parent="6" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="120" y="120" as="sourcePoint"/>
<mxPoint x="120" y="164" as="targetPoint"/>
</mxGeometry>
</mxCell>
<mxCell id="7" value="Partition 2" style="swimlane;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="480" y="40" width="240" height="240" as="geometry"/>
</mxCell>
<mxCell id="11" style="edgeStyle=none;html=1;endArrow=none;endFill=0;" parent="7" target="5" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="130" y="120" as="sourcePoint"/>
</mxGeometry>
</mxCell>
<mxCell id="3" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;VRF2&lt;/font&gt;" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/switch/Switch_48_port_L3.svg;" parent="7" vertex="1">
<mxGeometry x="90" y="48" width="78" height="52.8" as="geometry"/>
</mxCell>
<mxCell id="5" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;10.0.1.1/32&lt;/font&gt;" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/computer_and_terminals/Server_Desktop.svg;" parent="7" vertex="1">
<mxGeometry x="107.70000000000005" y="160" width="42.599999999999994" height="54" as="geometry"/>
</mxCell>
<mxCell id="9" style="edgeStyle=none;html=1;endArrow=none;endFill=0;rounded=0;curved=1;" parent="1" source="2" target="3" edge="1">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="420" y="60"/>
</Array>
</mxGeometry>
</mxCell>
<mxCell id="13" value="Route Maps&lt;div&gt;without NAT&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" parent="9" vertex="1" connectable="0">
<mxGeometry x="-0.0681" y="-19" relative="1" as="geometry">
<mxPoint as="offset"/>
</mxGeometry>
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>
1 change: 1 addition & 0 deletions docs/contributing/01-Proposals/MEP19/proposal-1.svg
47 changes: 47 additions & 0 deletions docs/contributing/01-Proposals/MEP19/proposal-2.drawio
@@ -0,0 +1,47 @@
<mxfile host="65bd71144e">
<diagram id="8gMl2hTIlcoxMkYUvRWJ" name="Page-1">
<mxGraphModel dx="434" dy="318" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
<root>
<mxCell id="0"/>
<mxCell id="1" parent="0"/>
<mxCell id="6" value="Partition 1" style="swimlane;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="120" y="40" width="240" height="240" as="geometry"/>
</mxCell>
<mxCell id="2" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;VRF1&lt;/font&gt;" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/switch/Switch_48_port_L3.svg;" parent="6" vertex="1">
<mxGeometry x="81" y="48" width="78" height="52.8" as="geometry"/>
</mxCell>
<mxCell id="4" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;10.0.0.1/32&lt;/font&gt;" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/computer_and_terminals/Server_Desktop.svg;" parent="6" vertex="1">
<mxGeometry x="98.69999999999999" y="160" width="42.599999999999994" height="54" as="geometry"/>
</mxCell>
<mxCell id="10" style="edgeStyle=none;html=1;endArrow=none;endFill=0;" parent="6" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="120" y="120" as="sourcePoint"/>
<mxPoint x="120" y="164" as="targetPoint"/>
</mxGeometry>
</mxCell>
<mxCell id="7" value="Partition 2" style="swimlane;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="480" y="40" width="240" height="240" as="geometry"/>
</mxCell>
<mxCell id="11" style="edgeStyle=none;html=1;endArrow=none;endFill=0;" parent="7" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="131" y="123" as="sourcePoint"/>
<mxPoint x="130.40298507462694" y="163" as="targetPoint"/>
</mxGeometry>
</mxCell>
<mxCell id="3" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;VRF1&lt;/font&gt;" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/switch/Switch_48_port_L3.svg;" parent="7" vertex="1">
<mxGeometry x="90" y="48" width="78" height="52.8" as="geometry"/>
</mxCell>
<mxCell id="5" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;10.0.1.1/32&lt;/font&gt;" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/computer_and_terminals/Server_Desktop.svg;" parent="7" vertex="1">
<mxGeometry x="107.70000000000005" y="160" width="42.599999999999994" height="54" as="geometry"/>
</mxCell>
<mxCell id="9" style="edgeStyle=none;html=1;endArrow=none;endFill=0;rounded=0;curved=1;" parent="1" source="2" target="3" edge="1">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="420" y="60"/>
</Array>
</mxGeometry>
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>
1 change: 1 addition & 0 deletions docs/contributing/01-Proposals/MEP19/proposal-2.svg
61 changes: 61 additions & 0 deletions docs/contributing/01-Proposals/MEP19/storage-current.drawio
@@ -0,0 +1,61 @@
<mxfile host="65bd71144e">
<diagram id="bnkaKnrv1tXkZOrYpwZu" name="Page-1">
<mxGraphModel dx="1086" dy="795" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
<root>
<mxCell id="0"/>
<mxCell id="1" parent="0"/>
<mxCell id="10" style="edgeStyle=none;html=1;endArrow=none;endFill=0;" parent="1" source="2" target="5" edge="1">
<mxGeometry relative="1" as="geometry"/>
</mxCell>
<mxCell id="2" value="" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/switch/Switch_48_port_L3.svg;" parent="1" vertex="1">
<mxGeometry x="200" y="80" width="78" height="52.8" as="geometry"/>
</mxCell>
<mxCell id="8" style="edgeStyle=none;html=1;endArrow=none;endFill=0;" parent="1" source="3" target="4" edge="1">
<mxGeometry relative="1" as="geometry"/>
</mxCell>
<mxCell id="9" value="Tenant VRF" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];fontSize=10;" parent="8" vertex="1" connectable="0">
<mxGeometry x="-0.4018" y="-2" relative="1" as="geometry">
<mxPoint x="2" y="9" as="offset"/>
</mxGeometry>
</mxCell>
<mxCell id="3" value="" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/computer_and_terminals/Server_Desktop.svg;" parent="1" vertex="1">
<mxGeometry x="360" y="80" width="42.599999999999994" height="54" as="geometry"/>
</mxCell>
<mxCell id="4" value="" style="image;points=[];aspect=fixed;html=1;align=center;shadow=0;dashed=0;image=img/lib/allied_telesis/computer_and_terminals/Server_Desktop.svg;" parent="1" vertex="1">
<mxGeometry x="360" y="190" width="42.599999999999994" height="54" as="geometry"/>
</mxCell>
<mxCell id="5" value="" style="strokeWidth=2;html=1;shape=mxgraph.flowchart.database;whiteSpace=wrap;" parent="1" vertex="1">
<mxGeometry x="224" y="160" width="30" height="40" as="geometry"/>
</mxCell>
<mxCell id="6" style="edgeStyle=none;html=1;entryX=0.012;entryY=0.515;entryDx=0;entryDy=0;entryPerimeter=0;endArrow=none;endFill=0;" parent="1" source="2" target="3" edge="1">
<mxGeometry relative="1" as="geometry"/>
</mxCell>
<mxCell id="7" value="Storage VRF" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];fontSize=10;" parent="6" vertex="1" connectable="0">
<mxGeometry x="0.2721" y="2" relative="1" as="geometry">
<mxPoint x="-11" y="1" as="offset"/>
</mxGeometry>
</mxCell>
<mxCell id="20" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;Firewall&lt;/font&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="395" y="90" width="60" height="30" as="geometry"/>
</mxCell>
<mxCell id="21" value="&lt;font style=&quot;font-size: 10px;&quot;&gt;Worker&lt;/font&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="395" y="200" width="60" height="30" as="geometry"/>
</mxCell>
<mxCell id="24" value="" style="endArrow=classic;html=1;strokeColor=light-dark(#FF9933,#EDEDED);" parent="1" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="462.6" y="215" as="sourcePoint"/>
<mxPoint x="280" y="60" as="targetPoint"/>
<Array as="points">
<mxPoint x="463" y="60"/>
</Array>
</mxGeometry>
</mxCell>
<mxCell id="31" value="Storage access" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];fontSize=8;" parent="24" vertex="1" connectable="0">
<mxGeometry x="0.6001" y="-1" relative="1" as="geometry">
<mxPoint x="13" as="offset"/>
</mxGeometry>
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>