- **GPU isolation**: The AMD device plugin exposes `amd.com/gpu` as a Kubernetes resource. Each runner pod requests exactly 1 GPU. Kubernetes guarantees no two pods share a GPU; each gets a unique `/dev/dri/renderD*` device.
- **GPU device permissions**: On this cluster, the host GPU device nodes are group-owned by GID `110`. The runner pod must include `spec.securityContext.supplementalGroups: [110]`, or ROCm inside the container will fail with `Unable to open /dev/kfd read-write: Permission denied` / `No HIP GPUs are available`.
- **CPU isolation**: Each pod gets 14 dedicated cores via cgroup limits (`nproc` reports 14 inside the container).
- **RAM isolation**: Each pod gets a 340Gi memory limit enforced by cgroups; exceeding it triggers an OOM kill.
- **Autoscaling**: With `minRunners: 40` and `maxRunners: 40`, all 40 runners stay online and idle on the GitHub runners tab, ready to pick up jobs instantly. The scheduler spreads pods across all 5 nodes (8 per node). Note: with `minRunners: 0`, runners only exist while jobs are queued and won't appear on the GitHub runners tab when idle.
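As a sketch, the limits above might be expressed in the `gha-runner-scale-set` Helm values like this (key names follow that chart; the container name `runner`, the image, and the exact layout are illustrative assumptions — the figures come from the bullets above):

```yaml
# Sketch of ARC gha-runner-scale-set Helm values matching the limits above.
# Adjust key names, image, and namespace to your actual install.
minRunners: 40
maxRunners: 40
template:
  spec:
    securityContext:
      supplementalGroups: [110]   # host GPU device group (see the GPU device permissions bullet)
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest   # illustrative image
        resources:
          limits:
            amd.com/gpu: 1   # one dedicated GPU per runner pod
            cpu: "14"        # 14 dedicated cores (cgroup-enforced)
            memory: 340Gi    # exceeding this triggers an OOM kill
```

For the extended resource `amd.com/gpu`, specifying it under `limits` is sufficient — Kubernetes requires requests and limits to match for extended resources.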
If ARC runners start but benchmarks fail inside the container with `RuntimeError: No HIP GPUs are available`, verify the issue from inside a runner pod:

```bash
rocminfo
python3 - <<'PY'
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
PY
```

**Symptoms of the broken state:**
- `rocminfo` prints `Unable to open /dev/kfd read-write: Permission denied`
- `torch.cuda.device_count()` may show `1`, but `torch.cuda.is_available()` is `False`

**Cause:**
- The pod has `/dev/kfd` and `/dev/dri/renderD*`, but the container user is missing the host GPU device group (GID `110`) unless `supplementalGroups: [110]` is set on the pod.

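The group check that fails here can be sketched in Python: opening `/dev/kfd` read-write succeeds only if the process's primary or supplementary groups include the device node's owning group. This is a minimal illustration (the GID `110` is the value from this cluster; the function name is ours, not a ROCm API):

```python
import os

# GID of the host GPU device group on this cluster (from the section above).
GPU_DEVICE_GID = 110

def in_device_group(gid: int) -> bool:
    """True if this process could pass the group-ownership check on a
    device node like /dev/kfd: the GID must be the process's primary
    group or appear in its supplementary groups."""
    return gid == os.getgid() or gid in os.getgroups()

if __name__ == "__main__":
    # Inside a runner pod without supplementalGroups: [110], this is
    # False and ROCm cannot open /dev/kfd read-write.
    print(in_device_group(GPU_DEVICE_GID))
```

`supplementalGroups` works precisely by appending the listed GIDs to the container process's supplementary group list, which is why no image change is needed.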
**Fix:**
- Add `spec.template.spec.securityContext.supplementalGroups: [110]` to the ARC runner scale set
- Recreate the runner pods so the new group membership takes effect

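In `gha-runner-scale-set` Helm values, the fix is a small fragment (chart key names assumed; the `supplementalGroups` entry is the essential part):

```yaml
template:
  spec:
    securityContext:
      supplementalGroups: [110]   # host GPU device group GID
```

After a `helm upgrade` with this change, delete the existing runner pods; the controller recreates them with the supplementary group applied.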
### Full 40-runner capacity
`maxRunners: 40` only works if the cluster is dedicated to ARC. Any non-ARC workload that consumes `amd.com/gpu` or large pinned CPU reservations on these nodes will reduce the actual runner capacity.
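One way to spot competing consumers is to list every pod that requests `amd.com/gpu` — anything outside the ARC runner namespace is eating into runner capacity (a sketch; assumes `kubectl` access and `jq` installed):

```bash
# List namespace/name of all pods requesting AMD GPUs.
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(any(.spec.containers[]; .resources.requests["amd.com/gpu"] != null))
      | "\(.metadata.namespace)/\(.metadata.name)"'
```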
### Jobs queuing forever / failed ephemeral runners
ARC does **not** garbage-collect failed ephemeral runners. If pods fail to start (transient image pull errors, node issues, resource contention), ARC retries 5 times then marks the ephemeral runner as `Failed` with `TooManyPodFailures`. These zombie runners still count against `maxRunners`, so the autoscaler thinks the cluster is full even though the GPUs are idle.
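The workaround is to delete the failed `EphemeralRunner` objects by hand so they stop counting against `maxRunners`. A sketch (the namespace `arc-runners` and the `.status.phase` field are assumptions — check `kubectl get ephemeralrunners -n <ns> -o yaml` for your install before bulk-deleting):

```bash
# Inspect ephemeral runners and their state.
kubectl get ephemeralrunners -n arc-runners

# Delete the ones marked Failed so capacity is freed up.
kubectl get ephemeralrunners -n arc-runners -o json \
  | jq -r '.items[] | select(.status.phase == "Failed") | .metadata.name' \
  | xargs -r kubectl delete ephemeralrunner -n arc-runners
```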