
Commit c43268d

Author: Mark Saroufim
Switch MI355X AMD workflow to ARC runner scale set
- Default runner changed from mia1-p02-g29 to arc-runner-set
- Remove container: block from amd_workflow.yml since the ARC runner pod already uses ghcr.io/gpu-mode/amd-runner:mi355
- Update github launcher to dispatch MI355X jobs to arc-runner-set
- Add ARC GPU runner setup skill documenting the full setup process
1 parent 17a3073 commit c43268d

3 files changed

Lines changed: 166 additions & 5 deletions

File tree

.claude/skills/arc-gpu-runners.md

Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
# ARC GPU Runner Setup

How to set up Actions Runner Controller (ARC) on a bare-metal k3s node with AMD GPUs so that multiple GitHub Actions jobs run concurrently, each isolated to its own GPU, CPU, and RAM slice.

## Prerequisites

- k3s cluster running (check with `sudo k3s kubectl get nodes`)
- AMD GPU device plugin daemonset deployed (`amdgpu-device-plugin-daemonset` in `kube-system`)
- Docker installed on the node
- A GitHub PAT with `repo` scope (classic) or "Administration" read/write (fine-grained) for the target repo

## Setup Steps

### 1. Install Helm

```bash
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | sudo bash
```

### 2. Add the ARC Helm repo

```bash
sudo helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
sudo helm repo update
```

### 3. Install the ARC controller

```bash
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.10.1
```

Verify the controller is running:

```bash
sudo k3s kubectl get pods -n arc-systems
```

### 4. Deploy the runner scale set

Create a values file (`arc-runner-values.yaml`):

```yaml
githubConfigUrl: "https://github.com/gpu-mode/kernelbot"
githubConfigSecret:
  github_token: "<YOUR_GITHUB_PAT>"

maxRunners: 8
minRunners: 0

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/gpu-mode/amd-runner:mi355
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "14"
            memory: "340Gi"
            amd.com/gpu: "1"
          limits:
            cpu: "14"
            memory: "340Gi"
            amd.com/gpu: "1"
        volumeMounts:
          - name: kfd
            mountPath: /dev/kfd
    volumes:
      - name: kfd
        hostPath:
          path: /dev/kfd
          type: CharDevice
    nodeSelector:
      kubernetes.io/os: linux
```

Install:

```bash
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  -f arc-runner-values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.10.1
```

### 5. Verify

```bash
# Controller + listener running
sudo k3s kubectl get pods -n arc-systems

# Scale set registered
sudo k3s kubectl get autoscalingrunnerset -n arc-runners

# Listener connected to GitHub
sudo k3s kubectl logs -n arc-systems -l actions.github.com/scale-set-name=arc-runner-set --tail=10
```

## How It Works

- **GPU isolation**: The AMD device plugin exposes `amd.com/gpu` as a k8s resource. Each runner pod requests exactly 1 GPU. Kubernetes guarantees no two pods share a GPU; each gets a unique `/dev/dri/renderD*` device.
- **CPU isolation**: Each pod gets 14 dedicated cores via cgroup limits (`nproc` reports 14 inside the container).
- **RAM isolation**: Each pod gets a 340Gi memory limit enforced by cgroups. Exceeding it triggers an OOM kill.
- **Autoscaling**: With `minRunners: 0` and `maxRunners: 8`, runners spin up on demand when GitHub queues jobs and are destroyed after completion (ephemeral runners).

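The CPU and RAM guarantees above come from setting `requests` equal to `limits`, which places each runner pod in Kubernetes' "Guaranteed" QoS class. A minimal sketch of that classification rule (simplified: the real kubelet rule also requires that every container in the pod sets CPU and memory limits):

```python
# Sketch: classify a container's resources into a k8s QoS class.
# Simplified illustration, not the kubelet's actual implementation.
def qos_class(resources: dict) -> str:
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if requests and requests == limits:
        return "Guaranteed"   # fully pinned: cgroup request == cgroup limit
    if requests or limits:
        return "Burstable"    # partially specified
    return "BestEffort"       # no resource constraints at all

# The runner container from arc-runner-values.yaml above:
runner_resources = {
    "requests": {"cpu": "14", "memory": "340Gi", "amd.com/gpu": "1"},
    "limits":   {"cpu": "14", "memory": "340Gi", "amd.com/gpu": "1"},
}
print(qos_class(runner_resources))  # Guaranteed
```

Because requests and limits match exactly, the scheduler reserves the full 14 cores and 340Gi for each runner rather than allowing bursting.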
## Resource Budget (per MI355X node)

The MI355X node has 126 allocatable CPUs, ~3 TB RAM, and 8 GPUs.

| Per runner | Value |
|------------|-------|
| CPU | 14 cores |
| RAM | 340 Gi |
| GPU | 1x MI355X |

At max capacity (8 runners): 112 cores, 2720 Gi, 8 GPUs. Remaining resources go to system pods.

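The budget arithmetic above can be checked directly (numbers taken from the table and the node figures):

```python
# Back-of-the-envelope check of the per-node resource budget.
runners = 8            # maxRunners
cpu_per_runner = 14    # cores
ram_per_runner = 340   # Gi
node_cpu = 126         # allocatable cores on the MI355X node

used_cpu = runners * cpu_per_runner   # cores consumed at max capacity
used_ram = runners * ram_per_runner   # Gi consumed at max capacity
spare_cpu = node_cpu - used_cpu       # what's left for system pods

print(used_cpu, used_ram, spare_cpu)  # 112 2720 14
```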
## Using in Workflows

Workflows target ARC runners with `runs-on: arc-runner-set`. Since the runner pod already uses the `ghcr.io/gpu-mode/amd-runner:mi355` image (with ROCm, Python, etc.), there is no need for a separate `container:` block.

```yaml
jobs:
  my-job:
    runs-on: arc-runner-set
    steps:
      - uses: actions/checkout@v4
      - run: rocm-smi  # GPU is available
```

## Updating the Configuration

To change resource limits, max runners, or the runner image:

```bash
# Edit values, then:
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade arc-runner-set \
  --namespace arc-runners \
  -f arc-runner-values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.10.1
```

## Troubleshooting

- **403 on token**: The PAT needs `repo` scope or the fine-grained "Administration" read/write permission
- **Pods stuck Pending**: Check GPU availability with `kubectl describe node <name> | grep amd.com/gpu`
- **Listener not starting**: Check controller logs: `kubectl logs -n arc-systems -l app.kubernetes.io/name=gha-rs-controller`
- **Runner image issues**: The image must contain `/home/runner/run.sh` (the GitHub Actions runner entrypoint)

## Current Cluster Info

- **Node**: mia1-p02-g29 (+ 4 more nodes in the k3s cluster)
- **GPUs**: 8x AMD Instinct MI355X per node
- **CPU**: AMD EPYC 9575F 64-Core (128 threads, 2 sockets)
- **RAM**: ~3 TB per node
- **SSH**: `ssh -J marksaroufim@meta.com@64.139.223.122 marksaroufim@meta.com@mia1-p02-g29`

.github/workflows/amd_workflow.yml

Lines changed: 1 addition & 4 deletions
```diff
@@ -13,7 +13,7 @@ on:
       runner:
         description: 'AMD runner to run workflow on'
         required: true
-        default: "mia1-p02-g29"
+        default: "arc-runner-set"
         type: string
       requirements:
         description: 'Contents for a requirements.txt file'
```
```diff
@@ -25,9 +25,6 @@ run-name: 'AMD Job - ${{ github.event.inputs.run_id }}'
 jobs:
   run:
     runs-on: ${{ github.event.inputs.runner }}
-    container:
-      image: ghcr.io/gpu-mode/amd-runner:mi355
-      options: --user root --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 64G
     strategy:
       fail-fast: false
     timeout-minutes: 20
```

src/libkernelbot/launchers/github.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -100,7 +100,7 @@ async def run_submission( # noqa: C901
         "MI300": "amdgpu-mi300-x86-64",
         "MI250": "amdgpu-mi250-x86-64",
         "MI300x8": "amdgpu-mi300-8-x86-64",
-        "MI355X": "mia1-p02-g29",
+        "MI355X": "arc-runner-set",
     }[gpu_type.value]
     gpu_vendor = "AMD"
     requirements = AMD_REQUIREMENTS
```
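The launcher change above amounts to a dict lookup from GPU type to runner label. A minimal sketch of that lookup (simplified from `run_submission`, which uses the resulting label to dispatch the workflow):

```python
# GPU type -> GitHub Actions runner label, as in the diff above.
# After this commit, MI355X jobs land on the ARC scale set instead of
# the single bare-metal runner mia1-p02-g29.
RUNNER_BY_GPU = {
    "MI300": "amdgpu-mi300-x86-64",
    "MI250": "amdgpu-mi250-x86-64",
    "MI300x8": "amdgpu-mi300-8-x86-64",
    "MI355X": "arc-runner-set",
}

print(RUNNER_BY_GPU["MI355X"])  # arc-runner-set
```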
