# ARC GPU Runner Setup

This guide covers setting up Actions Runner Controller (ARC) on a bare-metal k3s node with AMD GPUs so that multiple GitHub Actions jobs run concurrently, each isolated to its own GPU, CPU, and RAM slice.

## Prerequisites

- k3s cluster running (check with `sudo k3s kubectl get nodes`)
- AMD GPU device plugin daemonset deployed (`amdgpu-device-plugin-daemonset` in `kube-system`)
- Docker installed on the node
- A GitHub PAT with `repo` scope (classic) or "Administration" read/write (fine-grained) for the target repo
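
A quick way to confirm the GPU-related prerequisites before installing anything (a sketch; the daemonset name matches the one listed above, and `amd.com/gpu` is the resource the AMD device plugin registers):

```bash
# Node is Ready
sudo k3s kubectl get nodes

# AMD device plugin daemonset is running in kube-system
sudo k3s kubectl get daemonset amdgpu-device-plugin-daemonset -n kube-system

# Node advertises amd.com/gpu as an allocatable resource
sudo k3s kubectl describe node <node-name> | grep amd.com/gpu
```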

## Setup Steps

### 1. Install Helm

```bash
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | sudo bash
```
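
If the install succeeded, `helm` should now be on the PATH (a quick sanity check, nothing ARC-specific):

```bash
helm version --short
```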

### 2. Add the ARC Helm repo

```bash
sudo helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
sudo helm repo update
```
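
Optionally confirm the repo registered (the charts installed below are pulled from ghcr.io via OCI, so this is only a sanity check):

```bash
sudo helm repo list
sudo helm search repo actions-runner-controller
```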

### 3. Install the ARC controller

```bash
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.10.1
```

Verify the controller is running:

```bash
sudo k3s kubectl get pods -n arc-systems
```

### 4. Deploy the runner scale set

Create a values file (`arc-runner-values.yaml`):

```yaml
githubConfigUrl: "https://github.com/gpu-mode/kernelbot"
githubConfigSecret:
  github_token: "<YOUR_GITHUB_PAT>"

maxRunners: 8
minRunners: 0

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/gpu-mode/amd-runner:mi355
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "14"
            memory: "340Gi"
            amd.com/gpu: "1"
          limits:
            cpu: "14"
            memory: "340Gi"
            amd.com/gpu: "1"
        volumeMounts:
          - name: kfd
            mountPath: /dev/kfd
    volumes:
      - name: kfd
        hostPath:
          path: /dev/kfd
          type: CharDevice
    nodeSelector:
      kubernetes.io/os: linux
```
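
To avoid keeping the PAT inside the values file, the scale-set chart also accepts the name of a pre-created Kubernetes secret holding a `github_token` key (a sketch; `github-pat-secret` is an arbitrary name chosen here):

```bash
# The namespace must exist before the secret can be created in it
sudo k3s kubectl create namespace arc-runners
sudo k3s kubectl create secret generic github-pat-secret \
  --namespace arc-runners \
  --from-literal=github_token='<YOUR_GITHUB_PAT>'
```

Then reference it in `arc-runner-values.yaml` instead of the inline token:

```yaml
githubConfigSecret: github-pat-secret
```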

Install:

```bash
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  -f arc-runner-values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.10.1
```

### 5. Verify

```bash
# Controller + listener running
sudo k3s kubectl get pods -n arc-systems

# Scale set registered
sudo k3s kubectl get autoscalingrunnerset -n arc-runners

# Listener connected to GitHub
sudo k3s kubectl logs -n arc-systems -l actions.github.com/scale-set-name=arc-runner-set --tail=10
```
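
With `minRunners: 0` there are no runner pods until a job is queued. To watch ephemeral runner pods appear and disappear as workflows run (plain `kubectl`, nothing chart-specific):

```bash
sudo k3s kubectl get pods -n arc-runners -w
```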

## How It Works

- **GPU isolation**: The AMD device plugin exposes `amd.com/gpu` as a Kubernetes resource. Each runner pod requests exactly 1 GPU, and Kubernetes guarantees no two pods share a GPU; each pod gets a unique `/dev/dri/renderD*` device (spot-checked in the sketch below).
- **CPU isolation**: Each pod gets 14 dedicated cores via cgroup limits (`nproc` reports 14 inside the container).
- **RAM isolation**: Each pod gets a 340Gi memory limit enforced by cgroups. Exceeding it triggers an OOM kill.
- **Autoscaling**: With `minRunners: 0` and `maxRunners: 8`, runners spin up on demand when GitHub queues jobs and are destroyed after completion (ephemeral runners).
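
A quick way to spot-check these guarantees from inside a running job (a sketch; it assumes cgroup v2 for the memory limit path and relies on the ROCm tools in the runner image):

```bash
# CPU slice: should report 14
nproc

# Memory slice: cgroup v2 limit, should correspond to the 340Gi limit (in bytes)
cat /sys/fs/cgroup/memory.max

# GPU slice: only the assigned render device should be visible
ls /dev/dri
rocm-smi
```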

## Resource Budget (per MI355X node)

The MI355X node has 126 allocatable CPUs, ~3 TB RAM, and 8 GPUs.

| Per runner | Value |
|------------|-------|
| CPU | 14 cores |
| RAM | 340 Gi |
| GPU | 1x MI355X |

At max capacity (8 runners): 112 cores, 2720 Gi RAM, 8 GPUs. Remaining resources go to system pods.
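
To compare this budget against what the node actually advertises (substitute the node name as needed; `Allocatable` is the standard section of `kubectl describe node` output):

```bash
sudo k3s kubectl describe node mia1-p02-g29 | grep -A 8 Allocatable
```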

## Using in Workflows

Workflows target ARC runners with `runs-on: arc-runner-set`. Since the runner pod already uses the `ghcr.io/gpu-mode/amd-runner:mi355` image (with ROCm, Python, etc.), there is no need for a separate `container:` block.

```yaml
jobs:
  my-job:
    runs-on: arc-runner-set
    steps:
      - uses: actions/checkout@v4
      - run: rocm-smi  # GPU is available
```
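
Because each queued job gets its own runner pod and GPU, a matrix fans work out across up to `maxRunners` concurrent runners (a sketch; `shard` is a hypothetical matrix variable, not something the setup requires):

```yaml
jobs:
  gpu-matrix:
    runs-on: arc-runner-set
    strategy:
      matrix:
        shard: [0, 1, 2, 3]  # hypothetical split; each entry becomes its own job
    steps:
      - uses: actions/checkout@v4
      - run: rocm-smi  # each matrix job runs on its own runner pod / GPU
```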

## Updating the Configuration

To change resource limits, max runners, or the runner image:

```bash
# Edit values, then:
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade arc-runner-set \
  --namespace arc-runners \
  -f arc-runner-values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.10.1
```
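
To confirm the upgrade took effect (the release revision bumps and the scale set should reflect the new values; a sketch, assuming the field names match the values above):

```bash
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -n arc-runners
sudo k3s kubectl get autoscalingrunnerset -n arc-runners -o yaml | grep -E 'maxRunners|minRunners'
```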

## Troubleshooting

- **403 on token**: PAT needs `repo` scope or fine-grained "Administration" read/write permission
- **Pods stuck Pending**: Check GPU availability with `kubectl describe node <name> | grep amd.com/gpu`
- **Listener not starting**: Check controller logs: `kubectl logs -n arc-systems -l app.kubernetes.io/name=gha-rs-controller`
- **Runner image issues**: The image must have `/home/runner/run.sh` (GitHub Actions runner binary)
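
For anything not covered above, the usual Kubernetes debugging loop applies (a generic sketch, not ARC-specific):

```bash
# Why is a runner pod Pending or crashing?
sudo k3s kubectl describe pod -n arc-runners <pod-name>

# Recent events across the runner namespace
sudo k3s kubectl get events -n arc-runners --sort-by=.lastTimestamp

# Listener logs for the scale set
sudo k3s kubectl logs -n arc-systems -l actions.github.com/scale-set-name=arc-runner-set
```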

## Current Cluster Info

- **Node**: mia1-p02-g29 (+ 4 more nodes in the k3s cluster)
- **GPUs**: 8x AMD Instinct MI355X per node
- **CPU**: AMD EPYC 9575F 64-Core (128 threads, 2 sockets)
- **RAM**: ~3 TB per node
- **SSH**: `ssh -J marksaroufim@meta.com@64.139.223.122 marksaroufim@meta.com@mia1-p02-g29`