3 changes: 2 additions & 1 deletion .gitignore
@@ -20,4 +20,5 @@ openshift-install*
node_modules
.envrc
.ansible/
__pycache__/
__pycache__/
LAB.md
40 changes: 26 additions & 14 deletions README.md
@@ -16,29 +16,28 @@ The pattern provides three deployment topologies:

3. **Bare metal** (`baremetal` clusterGroup) — deploys all components on bare metal hardware with Intel TDX or AMD SEV-SNP support. NFD (Node Feature Discovery) auto-detects the CPU architecture and configures the appropriate runtime. Supports SNO (Single Node OpenShift) and multi-node clusters.

4. **Bare metal with GPU** (`baremetal-gpu` clusterGroup) — extends the bare metal topology with NVIDIA H100 confidential GPU support. Adds the NVIDIA GPU Operator, IOMMU kernel configuration, and a sample CUDA workload for CC GPU verification. Requires NVIDIA H100 GPUs with confidential computing firmware.

The topology is controlled by the `main.clusterGroupName` field in `values-global.yaml`.

Azure deployments use peer-pods, which provision confidential VMs (`Standard_DCas_v5` family) directly on the Azure hypervisor. Bare metal deployments use layered images and hardware TEE features directly.
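Switching topologies is a one-line edit to `values-global.yaml`. A minimal sketch using `sed` (the field name comes from above; the stub file is only created so the snippet is runnable outside a pattern checkout):

```shell
# Point the pattern at a different clusterGroup by rewriting the
# clusterGroupName value in place (back up the file first in a real repo).
[ -f values-global.yaml ] || printf 'main:\n  clusterGroupName: simple\n' > values-global.yaml
sed -i 's/^\([[:space:]]*clusterGroupName:[[:space:]]*\).*/\1baremetal-gpu/' values-global.yaml
grep 'clusterGroupName' values-global.yaml
```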

## Current version (4.*)
## Current version (5.*)

Breaking change from v3. This is the first version using GA (Generally Available) releases of the CoCo stack:
Breaking change from v4. Uses GA releases of the CoCo stack with Kyverno-based initdata injection.

- **OpenShift Sandboxed Containers 1.12+** (requires OCP 4.19.28+)
- **Red Hat Build of Trustee 1.1** (GA release; all versions prior to 1.0 were Technology Preview)
- External chart repositories for [Trustee](https://github.com/validatedpatterns/trustee-chart), [sandboxed-containers](https://github.com/validatedpatterns/sandboxed-containers-chart), and [sandboxed-policies](https://github.com/validatedpatterns/sandboxed-policies-chart)
- Self-signed certificates via cert-manager (Let's Encrypt no longer required)
- Multi-cluster support via ACM
- **5.0** — Kyverno-based `cc_init_data` injection (replaces MutatingAdmissionPolicy), OSC 1.12 / Trustee 1.1 GA, external chart repositories, self-signed certificates via cert-manager, multi-cluster support via ACM. Requires OCP 4.19.28+.
- **5.1** — Bare metal support for Intel TDX and AMD SEV-SNP via NFD auto-detection. Currently tested on SNO (Single Node OpenShift) configurations only.
- **5.2** — NVIDIA H100 confidential GPU support for bare metal (`baremetal-gpu` clusterGroup). Adds GPU Operator, IOMMU configuration, CC Manager, and sample CUDA workload.

### Previous versions

All previous versions used pre-GA (Technology Preview) releases of Trustee:

| Version | Trustee | OSC | Min OCP |
|---------|---------|-----|---------|
| **3.*** | 0.4.* (Tech Preview) | 1.10.* | 4.16+ |
| **2.*** | 0.3.* (Tech Preview) | 1.9.* | 4.16+ |
| **1.0.0** | 0.2.0 (Tech Preview) | 1.8.1 | 4.16+ |
| Version | Trustee | OSC | Min OCP | Notes |
|---------|---------|-----|---------|-------|
| **4.*** | 1.1 (GA) | 1.12 | 4.19.28+ | First GA release; MutatingAdmissionPolicy-based initdata |
| **3.*** | 0.4.* (Tech Preview) | 1.10.* | 4.16+ | |
| **2.*** | 0.3.* (Tech Preview) | 1.9.* | 4.16+ | |
| **1.0.0** | 0.2.0 (Tech Preview) | 1.8.1 | 4.16+ | |

## Setup

@@ -98,6 +97,8 @@ These scripts generate the cryptographic material and attestation measurements n
4. `./pattern.sh make install`
5. Wait for the cluster to reboot nodes (MachineConfig updates for TDX kernel parameters and vsock)

> **Note:** Bare metal support is currently tested on SNO (Single Node OpenShift) configurations. Multi-node bare metal clusters are expected to work but have not been validated yet.

The system auto-detects your hardware:

- **NFD** discovers Intel TDX or AMD SEV-SNP capabilities and labels nodes
@@ -109,6 +110,17 @@ The system auto-detects your hardware:

Optional: pin PCCS to a specific node with `bash scripts/get-pccs-node.sh` and set `baremetal.pccs.nodeSelector` in the baremetal chart values.
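You can inspect what NFD detected before installing workloads. A sketch, assuming NFD's upstream `cpu-security` feature label keys (confirm the exact keys against your NFD version):

```shell
# List nodes that NFD labeled as TDX- or SEV-SNP-capable (sketch; the
# label keys below are assumptions based on upstream NFD cpu-security labels).
TDX_LABEL='feature.node.kubernetes.io/cpu-security.tdx.enabled=true'
SNP_LABEL='feature.node.kubernetes.io/cpu-security.sev.snp.enabled=true'
if command -v oc >/dev/null 2>&1; then
  oc get nodes -l "$TDX_LABEL"
  oc get nodes -l "$SNP_LABEL"
else
  echo "oc not found; run from a workstation logged in to the cluster"
fi
```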

### Bare metal GPU deployment

1. Set `main.clusterGroupName: baremetal-gpu` in `values-global.yaml`
2. Run `bash scripts/gen-secrets.sh` to generate KBS keys and PCCS secrets
3. For Intel TDX: uncomment the PCCS secrets in `~/values-secret-coco-pattern.yaml` and provide your Intel PCS API key
4. `./pattern.sh make install`
5. Wait for the cluster to reboot nodes (MachineConfig updates for TDX/SEV-SNP kernel parameters, vsock, and IOMMU)
6. Approve the GPU Operator install plan when it appears (uses `installPlanApproval: Manual`)

> **Note:** The `baremetal-gpu` topology deploys IOMMU MachineConfig on all nodes and will trigger reboots. For clusters without GPUs, use the `baremetal` topology instead. The GPU workload deployment will remain Pending on non-GPU systems but is otherwise harmless.
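Since the Subscription uses `installPlanApproval: Manual`, the InstallPlan must be approved by hand. A sketch (the namespace is an assumption; use the one holding the GPU Operator's Subscription):

```shell
# Find and approve the pending GPU Operator InstallPlan (sketch).
NS=nvidia-gpu-operator            # assumption: adjust to your Subscription namespace
PATCH='{"spec":{"approved":true}}'
if command -v oc >/dev/null 2>&1; then
  PLAN=$(oc -n "$NS" get installplan \
    -o jsonpath='{.items[?(@.spec.approved==false)].metadata.name}')
  oc -n "$NS" patch installplan "$PLAN" --type merge -p "$PATCH"
else
  echo "oc not found; run from a workstation logged in to the cluster"
fi
```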

## Sample applications

Two sample applications are deployed on the cluster running confidential workloads (the single cluster in `simple` mode, or the spoke in multi-cluster mode):
42 changes: 42 additions & 0 deletions ansible/reconcile-kataconfig-gpu.yaml
@@ -0,0 +1,42 @@
---
- name: Reconcile KataConfig for GPU RuntimeClass
hosts: localhost
connection: local
become: false
gather_facts: true
tasks:
- name: Check for nodes with NVIDIA GPU labels
kubernetes.core.k8s_info:
api_version: v1
kind: Node
label_selectors:
- "nvidia.com/gpu.present=true"
register: gpu_nodes

- name: Check if kata-cc-nvidia-gpu RuntimeClass exists
kubernetes.core.k8s_info:
api_version: node.k8s.io/v1
kind: RuntimeClass
name: kata-cc-nvidia-gpu
register: gpu_runtimeclass

- name: Trigger KataConfig re-reconciliation
kubernetes.core.k8s:
state: patched
api_version: kataconfiguration.openshift.io/v1
kind: KataConfig
name: default-kata-config
definition:
metadata:
annotations:
kata-reconcile: "{{ ansible_date_time.epoch }}"
when:
- gpu_nodes.resources | length > 0
- gpu_runtimeclass.resources | length == 0

- name: Report status
ansible.builtin.debug:
msg: >-
GPU nodes: {{ gpu_nodes.resources | length }},
RuntimeClass exists: {{ gpu_runtimeclass.resources | length > 0 }},
Action: {{ 'triggered re-reconciliation' if (gpu_nodes.resources | length > 0 and gpu_runtimeclass.resources | length == 0) else 'no action needed' }}
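The playbook above only patches the KataConfig when GPU nodes exist but the `kata-cc-nvidia-gpu` RuntimeClass does not, so it is safe to re-run ad hoc after adding GPU hardware. A sketch of a local run (requires `ansible-core` with the `kubernetes.core` collection used by the tasks):

```shell
# Re-run the KataConfig reconciliation check by hand (sketch).
PLAYBOOK=ansible/reconcile-kataconfig-gpu.yaml
if command -v ansible-playbook >/dev/null 2>&1 && [ -f "$PLAYBOOK" ]; then
  ansible-playbook "$PLAYBOOK"
else
  echo "run from a pattern checkout with ansible-core installed"
fi
```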
@@ -28,7 +28,7 @@ spec:
all:
- key: "{{ "{{" }}request.object.spec.runtimeClassName || '' {{ "}}" }}"
operator: AnyIn
value: ["kata", "kata-cc", "kata-remote"]
value: ["kata", "kata-cc", "kata-remote", "kata-cc-nvidia-gpu"]
- key: "{{ "{{" }}request.object.metadata.annotations.\"coco.io/initdata-configmap\" || '' {{ "}}" }}"
operator: NotEquals
value: ""
1 change: 1 addition & 0 deletions charts/all/coco-kyverno-policies/values.yaml
@@ -1,5 +1,6 @@
workloadNamespaces:
- hello-openshift
- kbs-access
- gpu-workload

initdataSourceNamespace: imperative
9 changes: 9 additions & 0 deletions charts/all/nvidia-gpu/Chart.yaml
@@ -0,0 +1,9 @@
apiVersion: v2
description: NVIDIA GPU Operator configuration for confidential containers (ClusterPolicy, IOMMU MachineConfig).
keywords:
- pattern
- nvidia
- gpu
- confidential
name: nvidia-gpu
version: 0.0.1
113 changes: 113 additions & 0 deletions charts/all/nvidia-gpu/templates/cluster-policy.yaml
@@ -0,0 +1,113 @@
{{- if .Values.enabled }}
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
name: gpu-cluster-policy
annotations:
argocd.argoproj.io/sync-wave: "110"
spec:
ccManager:
defaultMode: {{ .Values.ccManager.defaultMode | quote }}
enabled: {{ .Values.ccManager.enabled }}
cdi:
default: false
enabled: true
nriPluginEnabled: false
daemonsets:
rollingUpdate:
maxUnavailable: '1'
updateStrategy: RollingUpdate
dcgm:
enabled: false
dcgmExporter:
config:
name: ''
enabled: false
serviceMonitor:
enabled: true
devicePlugin:
config:
default: ''
name: ''
enabled: false
mps:
root: /run/nvidia/mps
driver:
certConfig:
name: ''
enabled: false
kernelModuleConfig:
name: ''
kernelModuleType: auto
licensingConfig:
configMapName: ''
nlsEnabled: true
repoConfig:
configMapName: ''
upgradePolicy:
autoUpgrade: true
drain:
deleteEmptyDir: false
enable: false
force: false
timeoutSeconds: 300
maxParallelUpgrades: 1
maxUnavailable: 25%
podDeletion:
deleteEmptyDir: false
force: false
timeoutSeconds: 300
waitForCompletion:
timeoutSeconds: 0
useNvidiaDriverCRD: false
useOpenKernelModules: false
virtualTopology:
config: ''
gdrcopy:
enabled: false
gds:
enabled: false
gfd:
enabled: true
kataManager:
enabled: false
mig:
strategy: single
migManager:
enabled: false
nodeStatusExporter:
enabled: true
operator:
defaultRuntime: crio
initContainer: {}
runtimeClass: nvidia
use_ocp_driver_toolkit: true
kataSandboxDevicePlugin:
enabled: {{ .Values.kataSandboxDevicePlugin.enabled }}
env:
- name: P_GPU_ALIAS
value: pgpu
- name: NVSWITCH_ALIAS
value: nvswitch
sandboxWorkloads:
defaultWorkload: vm-passthrough
enabled: true
mode: kata
toolkit:
enabled: false
installDir: /usr/local/nvidia
validator:
plugin:
env:
- name: WITH_WORKLOAD
value: 'false'
vfioManager:
enabled: true
env:
- name: BIND_NVSWITCHES
value: 'true'
vgpuDeviceManager:
enabled: false
vgpuManager:
enabled: false
{{- end }}
15 changes: 15 additions & 0 deletions charts/all/nvidia-gpu/templates/iommu-mco.yaml
@@ -0,0 +1,15 @@
{{- if .Values.iommu.enabled }}
{{- range list "master" "worker" }}
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: {{ . }}
name: 100-iommu-{{ . }}
spec:
kernelArguments:
- amd_iommu=on
- intel_iommu=on
{{- end }}
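After the MachineConfig above rolls out (and the nodes reboot), you can confirm the IOMMU kernel arguments landed. A sketch (`NODE` is a placeholder you must set to a real node name):

```shell
# Verify the IOMMU kernel arguments on a node's live command line (sketch).
NODE="${NODE:-worker-0}"   # placeholder: substitute a real node name
if command -v oc >/dev/null 2>&1; then
  oc debug "node/${NODE}" -- chroot /host cat /proc/cmdline \
    | grep -Eo 'amd_iommu=on|intel_iommu=on'
else
  echo "oc not found; run from a workstation logged in to the cluster"
fi
```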
{{- end }}
11 changes: 11 additions & 0 deletions charts/all/nvidia-gpu/values.yaml
@@ -0,0 +1,11 @@
enabled: true

ccManager:
enabled: true
defaultMode: "on"

kataSandboxDevicePlugin:
enabled: true

iommu:
enabled: true
10 changes: 10 additions & 0 deletions charts/coco-supported/gpu-workload/Chart.yaml
@@ -0,0 +1,10 @@
apiVersion: v2
description: Sample CUDA workload for NVIDIA confidential GPU verification.
keywords:
- pattern
- nvidia
- gpu
- workload
- confidential
name: gpu-workload
version: 0.0.1
@@ -0,0 +1,37 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-vectoradd
labels:
app: gpu-vectoradd
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app: gpu-vectoradd
template:
metadata:
labels:
app: gpu-vectoradd
annotations:
coco.io/initdata-configmap: debug-initdata
{{- if .Values.defaultMemory }}
io.katacontainers.config.hypervisor.default_memory: {{ .Values.defaultMemory | quote }}
{{- end }}
spec:
runtimeClassName: {{ .Values.runtimeClassName }}
containers:
- name: gpu-cc-verifier
image: quay.io/openshift_sandboxed_containers/gpu-verifier:ubi9
imagePullPolicy: Always
command: ["/bin/bash"]
args:
- -c
- |
/opt/cuda-samples/Samples/0_Introduction/vectorAdd/build/vectorAdd
sleep 36000
resources:
limits:
nvidia.com/pgpu: 1
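To verify the sample ran on the confidential GPU, check the pod logs. A sketch, assuming the CUDA `vectorAdd` sample's usual `Test PASSED` success line and the `gpu-workload` namespace from the Kyverno `workloadNamespaces` list:

```shell
# Check the CUDA sample workload completed successfully (sketch).
NS=gpu-workload
if command -v oc >/dev/null 2>&1; then
  oc -n "$NS" logs deployment/gpu-vectoradd | grep -m1 'Test PASSED'
else
  echo "oc not found; run from a workstation logged in to the cluster"
fi
```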
6 changes: 6 additions & 0 deletions charts/coco-supported/gpu-workload/values.yaml
@@ -0,0 +1,6 @@
runtimeClassName: "kata-cc-nvidia-gpu"

defaultMemory: "32768"

global:
clusterPlatform: ""
4 changes: 2 additions & 2 deletions scripts/gen-secrets.sh
@@ -46,13 +46,13 @@ if [ ! -f "${PCCS_USER_TOKEN_FILE}" ]; then
echo "Creating PCCS user token"
echo "usertoken" > "${PCCS_USER_TOKEN_FILE}"
fi
echo -n "usertoken" | sha512sum | tr -d '[:space:]-' > "${COCO_SECRETS_DIR}/pccs_user_token_hash"
tr -d '\n' < "${PCCS_USER_TOKEN_FILE}" | sha512sum | tr -d '[:space:]-' > "${COCO_SECRETS_DIR}/pccs_user_token_hash"

if [ ! -f "${PCCS_ADMIN_TOKEN_FILE}" ]; then
echo "Creating PCCS admin token"
echo "admintoken" > "${PCCS_ADMIN_TOKEN_FILE}"
fi
echo -n "admintoken" | sha512sum | tr -d '[:space:]-' > "${COCO_SECRETS_DIR}/pccs_admin_token_hash"
tr -d '\n' < "${PCCS_ADMIN_TOKEN_FILE}" | sha512sum | tr -d '[:space:]-' > "${COCO_SECRETS_DIR}/pccs_admin_token_hash"
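The fix above matters whenever the token files hold anything other than the defaults: the old code hashed a hardcoded literal, so a customized token produced a stale hash. A self-contained illustration of the divergence (the token value and temp file are arbitrary):

```shell
# Old behaviour hashed the literal string "usertoken"; new behaviour
# hashes the actual file contents (minus the trailing newline).
tmp=$(mktemp)
printf 'my-real-token\n' > "$tmp"
old_hash=$(echo -n "usertoken" | sha512sum | tr -d '[:space:]-')
new_hash=$(tr -d '\n' < "$tmp" | sha512sum | tr -d '[:space:]-')
[ "$old_hash" != "$new_hash" ] && echo "old hash ignored the custom token"
rm -f "$tmp"
```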

## Copy a sample values file if this stuff doesn't exist
