
Commit c43268d

Author: Mark Saroufim
Switch MI355X AMD workflow to ARC runner scale set
- Default runner changed from mia1-p02-g29 to arc-runner-set
- Remove container: block from amd_workflow.yml since the ARC runner pod already uses ghcr.io/gpu-mode/amd-runner:mi355
- Update github launcher to dispatch MI355X jobs to arc-runner-set
- Add ARC GPU runner setup skill documenting the full setup process
1 parent 17a3073 commit c43268d

3 files changed

Lines changed: 166 additions & 5 deletions

File tree

.claude/skills/arc-gpu-runners.md

Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
# ARC GPU Runner Setup

How to set up Actions Runner Controller (ARC) on a bare-metal k3s node with AMD GPUs so that multiple GitHub Actions jobs run concurrently, each isolated to its own GPU, CPU, and RAM slice.

## Prerequisites

- k3s cluster running (check with `sudo k3s kubectl get nodes`)
- AMD GPU device plugin daemonset deployed (`amdgpu-device-plugin-daemonset` in `kube-system`)
- Docker installed on the node
- A GitHub PAT with `repo` scope (classic) or "Administration" read/write (fine-grained) for the target repo

## Setup Steps

### 1. Install Helm

```bash
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | sudo bash
```

### 2. Add the ARC Helm repo

```bash
sudo helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
sudo helm repo update
```

### 3. Install the ARC controller

```bash
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.10.1
```

Verify the controller is running:

```bash
sudo k3s kubectl get pods -n arc-systems
```

### 4. Deploy the runner scale set

Create a values file (`arc-runner-values.yaml`):

```yaml
githubConfigUrl: "https://github.com/gpu-mode/kernelbot"
githubConfigSecret:
  github_token: "<YOUR_GITHUB_PAT>"

maxRunners: 8
minRunners: 0

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/gpu-mode/amd-runner:mi355
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "14"
            memory: "340Gi"
            amd.com/gpu: "1"
          limits:
            cpu: "14"
            memory: "340Gi"
            amd.com/gpu: "1"
        volumeMounts:
          - name: kfd
            mountPath: /dev/kfd
    volumes:
      - name: kfd
        hostPath:
          path: /dev/kfd
          type: CharDevice
    nodeSelector:
      kubernetes.io/os: linux
```

Install:

```bash
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  -f arc-runner-values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.10.1
```

### 5. Verify

```bash
# Controller + listener running
sudo k3s kubectl get pods -n arc-systems

# Scale set registered
sudo k3s kubectl get autoscalingrunnerset -n arc-runners

# Listener connected to GitHub
sudo k3s kubectl logs -n arc-systems -l actions.github.com/scale-set-name=arc-runner-set --tail=10
```

## How It Works

- **GPU isolation**: The AMD device plugin exposes `amd.com/gpu` as a k8s resource. Each runner pod requests exactly 1 GPU. Kubernetes guarantees no two pods share a GPU; each gets a unique `/dev/dri/renderD*` device.
- **CPU isolation**: Each pod gets 14 dedicated cores via cgroup limits (`nproc` reports 14 inside the container).
- **RAM isolation**: Each pod gets a 340Gi memory limit enforced by cgroups. Exceeding it triggers an OOM kill.
- **Autoscaling**: With `minRunners: 0` and `maxRunners: 8`, runners spin up on demand when GitHub queues jobs and are destroyed after completion (ephemeral runners).

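The CPU and RAM guarantees above come from setting `requests` equal to `limits`, which places each runner pod in Kubernetes' "Guaranteed" QoS class. A minimal sketch of that classification rule (simplified: the real kubelet rule also requires that every container in the pod sets CPU and memory limits):

```python
# Sketch: classify a container's resources into a k8s QoS class.
# Simplified illustration, not the kubelet's actual implementation.
def qos_class(resources: dict) -> str:
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if requests and requests == limits:
        return "Guaranteed"   # fully pinned: cgroup request == cgroup limit
    if requests or limits:
        return "Burstable"    # partially specified
    return "BestEffort"       # no resource constraints at all

# The runner container from arc-runner-values.yaml above:
runner_resources = {
    "requests": {"cpu": "14", "memory": "340Gi", "amd.com/gpu": "1"},
    "limits":   {"cpu": "14", "memory": "340Gi", "amd.com/gpu": "1"},
}
print(qos_class(runner_resources))  # Guaranteed
```

Because requests and limits match exactly, the scheduler reserves the full 14 cores and 340Gi for each runner rather than allowing bursting.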
## Resource Budget (per MI355X node)

The MI355X node has 126 allocatable CPUs, ~3 TB RAM, and 8 GPUs.

| Per runner | Value |
|------------|-------|
| CPU | 14 cores |
| RAM | 340 Gi |
| GPU | 1x MI355X |

At max capacity (8 runners): 112 cores, 2720 Gi, 8 GPUs. Remaining resources go to system pods.

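The budget arithmetic above can be checked directly (numbers taken from the table and the node figures):

```python
# Back-of-the-envelope check of the per-node resource budget.
runners = 8            # maxRunners
cpu_per_runner = 14    # cores
ram_per_runner = 340   # Gi
node_cpu = 126         # allocatable cores on the MI355X node

used_cpu = runners * cpu_per_runner   # cores consumed at max capacity
used_ram = runners * ram_per_runner   # Gi consumed at max capacity
spare_cpu = node_cpu - used_cpu       # what's left for system pods

print(used_cpu, used_ram, spare_cpu)  # 112 2720 14
```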
## Using in Workflows

Workflows target ARC runners with `runs-on: arc-runner-set`. Since the runner pod already uses the `ghcr.io/gpu-mode/amd-runner:mi355` image (with ROCm, Python, etc.), there is no need for a separate `container:` block.

```yaml
jobs:
  my-job:
    runs-on: arc-runner-set
    steps:
      - uses: actions/checkout@v4
      - run: rocm-smi  # GPU is available
```

## Updating the Configuration

To change resource limits, max runners, or the runner image:

```bash
# Edit values, then:
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade arc-runner-set \
  --namespace arc-runners \
  -f arc-runner-values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.10.1
```

## Troubleshooting

- **403 on token**: The PAT needs `repo` scope or the fine-grained "Administration" read/write permission
- **Pods stuck Pending**: Check GPU availability with `kubectl describe node <name> | grep amd.com/gpu`
- **Listener not starting**: Check controller logs: `kubectl logs -n arc-systems -l app.kubernetes.io/name=gha-rs-controller`
- **Runner image issues**: The image must contain `/home/runner/run.sh` (the GitHub Actions runner entrypoint)

## Current Cluster Info

- **Node**: mia1-p02-g29 (+ 4 more nodes in the k3s cluster)
- **GPUs**: 8x AMD Instinct MI355X per node
- **CPU**: AMD EPYC 9575F 64-Core (128 threads, 2 sockets)
- **RAM**: ~3 TB per node
- **SSH**: `ssh -J marksaroufim@meta.com@64.139.223.122 marksaroufim@meta.com@mia1-p02-g29`

.github/workflows/amd_workflow.yml

Lines changed: 1 addition & 4 deletions
```diff
@@ -13,7 +13,7 @@ on:
       runner:
         description: 'AMD runner to run workflow on'
         required: true
-        default: "mia1-p02-g29"
+        default: "arc-runner-set"
         type: string
       requirements:
         description: 'Contents for a requirements.txt file'
```
```diff
@@ -25,9 +25,6 @@ run-name: 'AMD Job - ${{ github.event.inputs.run_id }}'
 jobs:
   run:
     runs-on: ${{ github.event.inputs.runner }}
-    container:
-      image: ghcr.io/gpu-mode/amd-runner:mi355
-      options: --user root --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 64G
     strategy:
       fail-fast: false
     timeout-minutes: 20
```

src/libkernelbot/launchers/github.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -100,7 +100,7 @@ async def run_submission( # noqa: C901
         "MI300": "amdgpu-mi300-x86-64",
         "MI250": "amdgpu-mi250-x86-64",
         "MI300x8": "amdgpu-mi300-8-x86-64",
-        "MI355X": "mia1-p02-g29",
+        "MI355X": "arc-runner-set",
     }[gpu_type.value]
     gpu_vendor = "AMD"
     requirements = AMD_REQUIREMENTS
```
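The launcher change above amounts to a dict lookup from GPU type to runner label. A minimal sketch of that lookup (simplified from `run_submission`, which uses the resulting label to dispatch the workflow):

```python
# GPU type -> GitHub Actions runner label, as in the diff above.
# After this commit, MI355X jobs land on the ARC scale set instead of
# the single bare-metal runner mia1-p02-g29.
RUNNER_BY_GPU = {
    "MI300": "amdgpu-mi300-x86-64",
    "MI250": "amdgpu-mi250-x86-64",
    "MI300x8": "amdgpu-mi300-8-x86-64",
    "MI355X": "arc-runner-set",
}

print(RUNNER_BY_GPU["MI355X"])  # arc-runner-set
```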
