Commit 95ceea8

Mark Saroufim authored

Update ARC skill doc: 5-node cluster, 40 max runners (#457)
- Document all 5 nodes in the cluster
- Update maxRunners from 8 to 40 (5 nodes × 8 GPUs)
- Add "Adding New Nodes" section
1 parent c43268d commit 95ceea8

1 file changed: .claude/skills/arc-gpu-runners.md (29 additions & 3 deletions)
@@ -108,7 +108,7 @@ sudo k3s kubectl logs -n arc-systems -l actions.github.com/scale-set-name=arc-ru
 - **GPU isolation**: The AMD device plugin exposes `amd.com/gpu` as a k8s resource. Each runner pod requests exactly 1 GPU. Kubernetes guarantees no two pods share a GPU — each gets a unique `/dev/dri/renderD*` device.
 - **CPU isolation**: Each pod gets 14 dedicated cores via cgroup limits (`nproc` reports 14 inside the container).
 - **RAM isolation**: Each pod gets a 340Gi memory limit enforced by cgroups. Exceeding it triggers OOM kill.
-- **Autoscaling**: With `minRunners: 0` and `maxRunners: 8`, runners spin up on demand when GitHub queues jobs and are destroyed after completion (ephemeral runners).
+- **Autoscaling**: With `minRunners: 0` and `maxRunners: 40`, runners spin up on demand when GitHub queues jobs and are destroyed after completion (ephemeral runners). The scheduler spreads pods across all 5 nodes.
 
 ## Resource Budget (per MI355X node)
 
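The isolation claims in this hunk can be spot-checked on the cluster. A minimal sketch, assuming runner pods live in the `arc-runners` namespace used by the helm commands below (the jsonpath prints the first pod's resource block):

```bash
# Expect amd.com/gpu: 1, cpu: 14, and memory: 340Gi per the doc text above.
sudo k3s kubectl get pods -n arc-runners \
  -o jsonpath='{.items[0].spec.containers[0].resources}'
```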

@@ -120,7 +120,7 @@ The MI355X node has 126 allocatable CPUs, ~3TB RAM, and 8 GPUs.
 | RAM | 340 Gi |
 | GPU | 1x MI355X |
 
-At max capacity (8 runners): 112 cores, 2720 Gi, 8 GPUs. Remaining resources go to system pods.
+At max capacity (40 runners across 5 nodes): 8 runners per node, each using 14 cores / 340 Gi / 1 GPU.
 
 ## Using in Workflows
 
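A quick sanity check of the per-node arithmetic (all figures come from the table and surrounding text, not from new measurements):

```bash
# 8 runners/node × 14 cores = 112 of 126 allocatable cores (14 left for system pods);
# 8 × 340 Gi = 2720 Gi RAM; 8 of 8 GPUs.
echo "cores: $((8 * 14)) / 126 used, $((126 - 8 * 14)) free"
echo "ram:   $((8 * 340)) Gi used"
```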

@@ -148,6 +148,32 @@ sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade arc-runner-set \
   --version 0.10.1
 ```
 
+## Adding New Nodes
+
+ARC is cluster-wide — no per-node setup is needed. When a new node joins the k3s cluster:
+
+1. The AMD GPU device plugin (DaemonSet) auto-deploys to the new node
+2. The k8s scheduler can immediately place runner pods on it
+3. No changes needed to workflows or the GitHub launcher
+
+The only thing to update is `maxRunners` to reflect the new total GPU count:
+
+```bash
+# Example: 3 nodes × 8 GPUs = 24 max runners
+sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade arc-runner-set \
+  --namespace arc-runners \
+  --set maxRunners=24 \
+  --reuse-values \
+  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
+  --version 0.10.1
+```
+
+Verify the new node has GPUs registered:
+
+```bash
+sudo k3s kubectl describe node <new-node-name> | grep amd.com/gpu
+```
+
 ## Troubleshooting
 
 - **403 on token**: PAT needs `repo` scope or fine-grained "Administration" read/write permission
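After an upgrade like the one this hunk adds, the applied value can be confirmed with standard helm tooling (a sketch; the release and namespace names follow the diff):

```bash
# Show the user-supplied values for the scale-set release, including maxRunners.
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm get values arc-runner-set \
  --namespace arc-runners
```
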
@@ -157,7 +183,7 @@ sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm upgrade arc-runner-set \
 
 ## Current Cluster Info
 
-- **Node**: mia1-p02-g29 (+ 4 more nodes in the k3s cluster)
+- **Nodes**: mia1-p02-g29, mia1-p02-g52, mia1-p02-g53, mia1-p02-g55, mia1-p02-g56 (5-node k3s cluster)
 - **GPUs**: 8x AMD Instinct MI355X per node
 - **CPU**: AMD EPYC 9575F 64-Core (128 threads, 2 sockets)
 - **RAM**: ~3 TB per node
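For the 5-node layout listed here, one kubectl call can confirm every node registers its GPUs (a sketch; the backslash escapes the dots in the extended-resource name):

```bash
# One row per node with its amd.com/gpu capacity; expect 8 on each of the 5 nodes.
sudo k3s kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.amd\.com/gpu'
```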
