3 changes: 2 additions & 1 deletion .gitignore
@@ -20,4 +20,5 @@ openshift-install*
node_modules
.envrc
.ansible/
__pycache__/
__pycache__/
LAB.md
40 changes: 26 additions & 14 deletions README.md
@@ -16,29 +16,28 @@ The pattern provides three deployment topologies:

3. **Bare metal** (`baremetal` clusterGroup) — deploys all components on bare metal hardware with Intel TDX or AMD SEV-SNP support. NFD (Node Feature Discovery) auto-detects the CPU architecture and configures the appropriate runtime. Supports SNO (Single Node OpenShift) and multi-node clusters.

4. **Bare metal with GPU** (`baremetal-gpu` clusterGroup) — extends the bare metal topology with NVIDIA H100 confidential GPU support. Adds the NVIDIA GPU Operator, IOMMU kernel configuration, and a sample CUDA workload for CC GPU verification. Requires NVIDIA H100 GPUs with confidential computing firmware.

The topology is controlled by the `main.clusterGroupName` field in `values-global.yaml`.

Azure deployments use peer-pods, which provision confidential VMs (`Standard_DCas_v5` family) directly on the Azure hypervisor. Bare metal deployments use layered images and hardware TEE features directly.
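Switching topologies is a one-line edit to `values-global.yaml`. A minimal sketch using `sed` (the field name comes from above; the stub file is only created so the snippet is runnable outside a pattern checkout):

```shell
# Point the pattern at a different clusterGroup by rewriting the
# clusterGroupName value in place (back up the file first in a real repo).
[ -f values-global.yaml ] || printf 'main:\n  clusterGroupName: simple\n' > values-global.yaml
sed -i 's/^\([[:space:]]*clusterGroupName:[[:space:]]*\).*/\1baremetal-gpu/' values-global.yaml
grep 'clusterGroupName' values-global.yaml
```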

## Current version (4.*)
## Current version (5.*)

Breaking change from v3. This is the first version using GA (Generally Available) releases of the CoCo stack:
Breaking change from v4. Uses GA releases of the CoCo stack with Kyverno-based initdata injection.

- **OpenShift Sandboxed Containers 1.12+** (requires OCP 4.19.28+)
- **Red Hat Build of Trustee 1.1** (GA release; all versions prior to 1.0 were Technology Preview)
- External chart repositories for [Trustee](https://github.com/validatedpatterns/trustee-chart), [sandboxed-containers](https://github.com/validatedpatterns/sandboxed-containers-chart), and [sandboxed-policies](https://github.com/validatedpatterns/sandboxed-policies-chart)
- Self-signed certificates via cert-manager (Let's Encrypt no longer required)
- Multi-cluster support via ACM
- **5.0** — Kyverno-based `cc_init_data` injection (replaces MutatingAdmissionPolicy), OSC 1.12 / Trustee 1.1 GA, external chart repositories, self-signed certificates via cert-manager, multi-cluster support via ACM. Requires OCP 4.19.28+.
- **5.1** — Bare metal support for Intel TDX and AMD SEV-SNP via NFD auto-detection. Currently tested on SNO (Single Node OpenShift) configurations only.
- **5.2** — NVIDIA H100 confidential GPU support for bare metal (`baremetal-gpu` clusterGroup). Adds GPU Operator, IOMMU configuration, CC Manager, and sample CUDA workload.

### Previous versions

All previous versions used pre-GA (Technology Preview) releases of Trustee:

| Version | Trustee | OSC | Min OCP |
|---------|---------|-----|---------|
| **3.*** | 0.4.* (Tech Preview) | 1.10.* | 4.16+ |
| **2.*** | 0.3.* (Tech Preview) | 1.9.* | 4.16+ |
| **1.0.0** | 0.2.0 (Tech Preview) | 1.8.1 | 4.16+ |
| Version | Trustee | OSC | Min OCP | Notes |
|---------|---------|-----|---------|-------|
| **4.*** | 1.1 (GA) | 1.12 | 4.19.28+ | First GA release; MutatingAdmissionPolicy-based initdata |
| **3.*** | 0.4.* (Tech Preview) | 1.10.* | 4.16+ | |
| **2.*** | 0.3.* (Tech Preview) | 1.9.* | 4.16+ | |
| **1.0.0** | 0.2.0 (Tech Preview) | 1.8.1 | 4.16+ | |

## Setup

@@ -98,6 +97,8 @@ These scripts generate the cryptographic material and attestation measurements n
4. `./pattern.sh make install`
5. Wait for the cluster to reboot nodes (MachineConfig updates for TDX kernel parameters and vsock)

> **Note:** Bare metal support is currently tested on SNO (Single Node OpenShift) configurations. Multi-node bare metal clusters are expected to work but have not been validated yet.

The system auto-detects your hardware:

- **NFD** discovers Intel TDX or AMD SEV-SNP capabilities and labels nodes
@@ -109,6 +110,17 @@ The system auto-detects your hardware:

Optional: pin PCCS to a specific node with `bash scripts/get-pccs-node.sh` and set `baremetal.pccs.nodeSelector` in the baremetal chart values.
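You can inspect what NFD detected before installing workloads. A sketch, assuming NFD's upstream `cpu-security` feature label keys (confirm the exact keys against your NFD version):

```shell
# List nodes that NFD labeled as TDX- or SEV-SNP-capable (sketch; the
# label keys below are assumptions based on upstream NFD cpu-security labels).
TDX_LABEL='feature.node.kubernetes.io/cpu-security.tdx.enabled=true'
SNP_LABEL='feature.node.kubernetes.io/cpu-security.sev.snp.enabled=true'
if command -v oc >/dev/null 2>&1; then
  oc get nodes -l "$TDX_LABEL"
  oc get nodes -l "$SNP_LABEL"
else
  echo "oc not found; run from a workstation logged in to the cluster"
fi
```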

### Bare metal GPU deployment

1. Set `main.clusterGroupName: baremetal-gpu` in `values-global.yaml`
2. Run `bash scripts/gen-secrets.sh` to generate KBS keys and PCCS secrets
3. For Intel TDX: uncomment the PCCS secrets in `~/values-secret-coco-pattern.yaml` and provide your Intel PCS API key
4. `./pattern.sh make install`
5. Wait for the cluster to reboot nodes (MachineConfig updates for TDX/SEV-SNP kernel parameters, vsock, and IOMMU)
6. Approve the GPU Operator install plan when it appears (uses `installPlanApproval: Manual`)

> **Note:** The `baremetal-gpu` topology deploys IOMMU MachineConfig on all nodes and will trigger reboots. For clusters without GPUs, use the `baremetal` topology instead. The GPU workload deployment will remain Pending on non-GPU systems but is otherwise harmless.
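Since the Subscription uses `installPlanApproval: Manual`, the InstallPlan must be approved by hand. A sketch (the namespace is an assumption; use the one holding the GPU Operator's Subscription):

```shell
# Find and approve the pending GPU Operator InstallPlan (sketch).
NS=nvidia-gpu-operator            # assumption: adjust to your Subscription namespace
PATCH='{"spec":{"approved":true}}'
if command -v oc >/dev/null 2>&1; then
  PLAN=$(oc -n "$NS" get installplan \
    -o jsonpath='{.items[?(@.spec.approved==false)].metadata.name}')
  oc -n "$NS" patch installplan "$PLAN" --type merge -p "$PATCH"
else
  echo "oc not found; run from a workstation logged in to the cluster"
fi
```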

## Sample applications

Two sample applications are deployed on the cluster running confidential workloads (the single cluster in `simple` mode, or the spoke in multi-cluster mode):
42 changes: 42 additions & 0 deletions ansible/reconcile-kataconfig-gpu.yaml
@@ -0,0 +1,42 @@
---
- name: Reconcile KataConfig for GPU RuntimeClass
hosts: localhost
connection: local
become: false
gather_facts: true
tasks:
- name: Check for nodes with NVIDIA GPU labels
kubernetes.core.k8s_info:
api_version: v1
kind: Node
label_selectors:
- "nvidia.com/gpu.present=true"
register: gpu_nodes

- name: Check if kata-cc-nvidia-gpu RuntimeClass exists
kubernetes.core.k8s_info:
api_version: node.k8s.io/v1
kind: RuntimeClass
name: kata-cc-nvidia-gpu
register: gpu_runtimeclass

- name: Trigger KataConfig re-reconciliation
kubernetes.core.k8s:
state: patched
api_version: kataconfiguration.openshift.io/v1
kind: KataConfig
name: default-kata-config
definition:
metadata:
annotations:
kata-reconcile: "{{ ansible_date_time.epoch }}"
when:
- gpu_nodes.resources | length > 0
- gpu_runtimeclass.resources | length == 0

- name: Report status
ansible.builtin.debug:
msg: >-
GPU nodes: {{ gpu_nodes.resources | length }},
RuntimeClass exists: {{ gpu_runtimeclass.resources | length > 0 }},
Action: {{ 'triggered re-reconciliation' if (gpu_nodes.resources | length > 0 and gpu_runtimeclass.resources | length == 0) else 'no action needed' }}
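The playbook above only patches the KataConfig when GPU nodes exist but the `kata-cc-nvidia-gpu` RuntimeClass does not, so it is safe to re-run ad hoc after adding GPU hardware. A sketch of a local run (requires `ansible-core` with the `kubernetes.core` collection used by the tasks):

```shell
# Re-run the KataConfig reconciliation check by hand (sketch).
PLAYBOOK=ansible/reconcile-kataconfig-gpu.yaml
if command -v ansible-playbook >/dev/null 2>&1 && [ -f "$PLAYBOOK" ]; then
  ansible-playbook "$PLAYBOOK"
else
  echo "run from a pattern checkout with ansible-core installed"
fi
```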
@@ -28,7 +28,7 @@ spec:
all:
- key: "{{ "{{" }}request.object.spec.runtimeClassName || '' {{ "}}" }}"
operator: AnyIn
value: ["kata", "kata-cc", "kata-remote"]
value: ["kata", "kata-cc", "kata-remote", "kata-cc-nvidia-gpu"]
- key: "{{ "{{" }}request.object.metadata.annotations.\"coco.io/initdata-configmap\" || '' {{ "}}" }}"
operator: NotEquals
value: ""
1 change: 1 addition & 0 deletions charts/all/coco-kyverno-policies/values.yaml
@@ -1,5 +1,6 @@
workloadNamespaces:
- hello-openshift
- kbs-access
- gpu-workload

initdataSourceNamespace: imperative
9 changes: 9 additions & 0 deletions charts/all/nvidia-gpu/Chart.yaml
@@ -0,0 +1,9 @@
apiVersion: v2
description: NVIDIA GPU Operator configuration for confidential containers (ClusterPolicy, IOMMU MachineConfig).
keywords:
- pattern
- nvidia
- gpu
- confidential
name: nvidia-gpu
version: 0.0.1
113 changes: 113 additions & 0 deletions charts/all/nvidia-gpu/templates/cluster-policy.yaml
@@ -0,0 +1,113 @@
{{- if .Values.enabled }}
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
name: gpu-cluster-policy
annotations:
argocd.argoproj.io/sync-wave: "110"
spec:
ccManager:
defaultMode: {{ .Values.ccManager.defaultMode | quote }}
enabled: {{ .Values.ccManager.enabled }}
cdi:
default: false
enabled: true
nriPluginEnabled: false
daemonsets:
rollingUpdate:
maxUnavailable: '1'
updateStrategy: RollingUpdate
dcgm:
enabled: false
dcgmExporter:
config:
name: ''
enabled: false
serviceMonitor:
enabled: true
devicePlugin:
config:
default: ''
name: ''
enabled: false
mps:
root: /run/nvidia/mps
driver:
certConfig:
name: ''
enabled: false
kernelModuleConfig:
name: ''
kernelModuleType: auto
licensingConfig:
configMapName: ''
nlsEnabled: true
repoConfig:
configMapName: ''
upgradePolicy:
autoUpgrade: true
drain:
deleteEmptyDir: false
enable: false
force: false
timeoutSeconds: 300
maxParallelUpgrades: 1
maxUnavailable: 25%
podDeletion:
deleteEmptyDir: false
force: false
timeoutSeconds: 300
waitForCompletion:
timeoutSeconds: 0
useNvidiaDriverCRD: false
useOpenKernelModules: false
virtualTopology:
config: ''
gdrcopy:
enabled: false
gds:
enabled: false
gfd:
enabled: true
kataManager:
enabled: false
mig:
strategy: single
migManager:
enabled: false
nodeStatusExporter:
enabled: true
operator:
defaultRuntime: crio
initContainer: {}
runtimeClass: nvidia
use_ocp_driver_toolkit: true
kataSandboxDevicePlugin:
enabled: {{ .Values.kataSandboxDevicePlugin.enabled }}
env:
- name: P_GPU_ALIAS
value: pgpu
- name: NVSWITCH_ALIAS
value: nvswitch
sandboxWorkloads:
defaultWorkload: vm-passthrough
enabled: true
mode: kata
toolkit:
enabled: false
installDir: /usr/local/nvidia
validator:
plugin:
env:
- name: WITH_WORKLOAD
value: 'false'
vfioManager:
enabled: true
env:
- name: BIND_NVSWITCHES
value: 'true'
vgpuDeviceManager:
enabled: false
vgpuManager:
enabled: false
{{- end }}
15 changes: 15 additions & 0 deletions charts/all/nvidia-gpu/templates/iommu-mco.yaml
@@ -0,0 +1,15 @@
{{- if .Values.iommu.enabled }}
{{- range list "master" "worker" }}
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: {{ . }}
name: 100-iommu-{{ . }}
spec:
kernelArguments:
- amd_iommu=on
- intel_iommu=on
{{- end }}
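After the MachineConfig above rolls out (and the nodes reboot), you can confirm the IOMMU kernel arguments landed. A sketch (`NODE` is a placeholder you must set to a real node name):

```shell
# Verify the IOMMU kernel arguments on a node's live command line (sketch).
NODE="${NODE:-worker-0}"   # placeholder: substitute a real node name
if command -v oc >/dev/null 2>&1; then
  oc debug "node/${NODE}" -- chroot /host cat /proc/cmdline \
    | grep -Eo 'amd_iommu=on|intel_iommu=on'
else
  echo "oc not found; run from a workstation logged in to the cluster"
fi
```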
{{- end }}
11 changes: 11 additions & 0 deletions charts/all/nvidia-gpu/values.yaml
@@ -0,0 +1,11 @@
enabled: true

ccManager:
enabled: true
defaultMode: "on"

kataSandboxDevicePlugin:
enabled: true

iommu:
enabled: true
10 changes: 10 additions & 0 deletions charts/coco-supported/gpu-workload/Chart.yaml
@@ -0,0 +1,10 @@
apiVersion: v2
description: Sample CUDA workload for NVIDIA confidential GPU verification.
keywords:
- pattern
- nvidia
- gpu
- workload
- confidential
name: gpu-workload
version: 0.0.1
@@ -0,0 +1,37 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-vectoradd
labels:
app: gpu-vectoradd
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app: gpu-vectoradd
template:
metadata:
labels:
app: gpu-vectoradd
annotations:
coco.io/initdata-configmap: debug-initdata
{{- if .Values.defaultMemory }}
io.katacontainers.config.hypervisor.default_memory: {{ .Values.defaultMemory | quote }}
{{- end }}
spec:
runtimeClassName: {{ .Values.runtimeClassName }}
containers:
- name: gpu-cc-verifier
image: quay.io/openshift_sandboxed_containers/gpu-verifier:ubi9
imagePullPolicy: Always
command: ["/bin/bash"]
args:
- -c
- |
/opt/cuda-samples/Samples/0_Introduction/vectorAdd/build/vectorAdd
sleep 36000
resources:
limits:
nvidia.com/pgpu: 1
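To verify the sample ran on the confidential GPU, check the pod logs. A sketch, assuming the CUDA `vectorAdd` sample's usual `Test PASSED` success line and the `gpu-workload` namespace from the Kyverno `workloadNamespaces` list:

```shell
# Check the CUDA sample workload completed successfully (sketch).
NS=gpu-workload
if command -v oc >/dev/null 2>&1; then
  oc -n "$NS" logs deployment/gpu-vectoradd | grep -m1 'Test PASSED'
else
  echo "oc not found; run from a workstation logged in to the cluster"
fi
```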
6 changes: 6 additions & 0 deletions charts/coco-supported/gpu-workload/values.yaml
@@ -0,0 +1,6 @@
runtimeClassName: "kata-cc-nvidia-gpu"

defaultMemory: "32768"

global:
clusterPlatform: ""
4 changes: 2 additions & 2 deletions scripts/gen-secrets.sh
@@ -46,13 +46,13 @@ if [ ! -f "${PCCS_USER_TOKEN_FILE}" ]; then
echo "Creating PCCS user token"
echo "usertoken" > "${PCCS_USER_TOKEN_FILE}"
fi
echo -n "usertoken" | sha512sum | tr -d '[:space:]-' > "${COCO_SECRETS_DIR}/pccs_user_token_hash"
tr -d '\n' < "${PCCS_USER_TOKEN_FILE}" | sha512sum | tr -d '[:space:]-' > "${COCO_SECRETS_DIR}/pccs_user_token_hash"

if [ ! -f "${PCCS_ADMIN_TOKEN_FILE}" ]; then
echo "Creating PCCS admin token"
echo "admintoken" > "${PCCS_ADMIN_TOKEN_FILE}"
fi
echo -n "admintoken" | sha512sum | tr -d '[:space:]-' > "${COCO_SECRETS_DIR}/pccs_admin_token_hash"
tr -d '\n' < "${PCCS_ADMIN_TOKEN_FILE}" | sha512sum | tr -d '[:space:]-' > "${COCO_SECRETS_DIR}/pccs_admin_token_hash"
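The fix above matters whenever the token files hold anything other than the defaults: the old code hashed a hardcoded literal, so a customized token produced a stale hash. A self-contained illustration of the divergence (the token value and temp file are arbitrary):

```shell
# Old behaviour hashed the literal string "usertoken"; new behaviour
# hashes the actual file contents (minus the trailing newline).
tmp=$(mktemp)
printf 'my-real-token\n' > "$tmp"
old_hash=$(echo -n "usertoken" | sha512sum | tr -d '[:space:]-')
new_hash=$(tr -d '\n' < "$tmp" | sha512sum | tr -d '[:space:]-')
[ "$old_hash" != "$new_hash" ] && echo "old hash ignored the custom token"
rm -f "$tmp"
```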

## Copy a sample values file if this stuff doesn't exist
