Commit 8a9e329

more debug info for arc runners

1 parent 8d8e426

1 file changed: .claude/skills/arc-gpu-runners.md
60 additions & 0 deletions
@@ -54,6 +54,9 @@ minRunners: 40
 template:
   spec:
+    securityContext:
+      supplementalGroups:
+        - 110
     containers:
     - name: runner
       image: ghcr.io/gpu-mode/amd-runner:mi355
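Before hardcoding `110` into the scale set, it is worth confirming the GID on each host, since the GPU device group can differ between distributions (a quick check using GNU coreutils `stat`):

```bash
# Print the owning group (numeric GID) of the GPU device nodes on the host.
# On this cluster both are expected to be GID 110; adjust supplementalGroups
# in the scale set values if your hosts report a different group.
stat -c '%g %n' /dev/kfd /dev/dri/renderD*
```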
@@ -106,6 +109,7 @@ sudo k3s kubectl logs -n arc-systems -l actions.github.com/scale-set-name=arc-ru
 ## How It Works
 
 - **GPU isolation**: The AMD device plugin exposes `amd.com/gpu` as a k8s resource. Each runner pod requests exactly 1 GPU. Kubernetes guarantees no two pods share a GPU — each gets a unique `/dev/dri/renderD*` device.
+- **GPU device permissions**: On this cluster, the host GPU device nodes are group-owned by GID `110`. The runner pod must include `spec.securityContext.supplementalGroups: [110]` or ROCm inside the container will fail with `Unable to open /dev/kfd read-write: Permission denied` / `No HIP GPUs are available`.
 - **CPU isolation**: Each pod gets 14 dedicated cores via cgroup limits (`nproc` reports 14 inside the container).
 - **RAM isolation**: Each pod gets a 340Gi memory limit enforced by cgroups. Exceeding it triggers OOM kill.
 - **Autoscaling**: With `minRunners: 40` and `maxRunners: 40`, all 40 runners stay online and idle on the GitHub runners tab, ready to pick up jobs instantly. The scheduler spreads pods across all 5 nodes (8 per node). Note: `minRunners: 0` means runners only exist when there are queued jobs and won't appear on the GitHub runners tab when idle.
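The isolation claims above can be spot-checked from inside any runner pod (a sketch; the cgroup v2 path is an assumption about the runner image):

```bash
# Run inside a runner pod to confirm the per-pod limits.
nproc                          # CPU isolation: expect 14
cat /sys/fs/cgroup/memory.max  # RAM isolation: expect the 340Gi limit in bytes (cgroup v2)
ls /dev/dri/renderD*           # GPU isolation: expect exactly one render node
```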
@@ -181,6 +185,62 @@ sudo k3s kubectl describe node <new-node-name> | grep amd.com/gpu
 - **Listener not starting**: Check controller logs: `kubectl logs -n arc-systems -l app.kubernetes.io/name=gha-rs-controller`
 - **Runner image issues**: The image must have `/home/runner/run.sh` (GitHub Actions runner binary)
 
+### Host rebooted, k3s won't come back cleanly
+
+If a node reboots, kubelet's static memory manager state can become invalid and `k3s` will fail with:
+
+```text
+Invalid state, please drain node and remove policy state file
+start memory manager error: [memorymanager] the expected machine state is different from the real one
+```
+
+**Fix immediately:**
+
+```bash
+sudo systemctl stop k3s
+sudo rm -f /var/lib/kubelet/memory_manager_state
+sudo systemctl start k3s
+```
+
+**Make it persistent on every node:**
+
+```bash
+sudo mkdir -p /etc/systemd/system/k3s.service.d
+cat <<'EOF' | sudo tee /etc/systemd/system/k3s.service.d/fix-memory-manager.conf
+[Service]
+ExecStartPre=/bin/sh -c 'rm -f /var/lib/kubelet/memory_manager_state'
+EOF
+sudo systemctl daemon-reload
+```
+
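After installing the drop-in, you can confirm systemd actually picked it up before the next reboot tests it for you (a sketch; `systemctl cat` prints merged drop-ins with their file paths as comments):

```bash
# The drop-in path and its ExecStartPre line should appear in the merged unit.
systemctl cat k3s | grep -A2 'fix-memory-manager'
# Once k3s restarts cleanly this should print "active".
sudo systemctl is-active k3s
```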
+### Jobs fail with `No HIP GPUs are available`
+
+If ARC runners start but benchmarks fail inside the container with `RuntimeError: No HIP GPUs are available`, verify the issue from inside a runner pod:
+
+```bash
+rocminfo
+python3 - <<'PY'
+import torch
+print(torch.cuda.is_available())
+print(torch.cuda.device_count())
+PY
+```
+
+**Symptoms of the broken state:**
+- `rocminfo` prints `Unable to open /dev/kfd read-write: Permission denied`
+- `torch.cuda.device_count()` may show `1`, but `torch.cuda.is_available()` is `False`
+
+**Cause:**
+- The pod has `/dev/kfd` and `/dev/dri/renderD*`, but the container user is missing the host GPU device group (GID `110`) unless `supplementalGroups: [110]` is set on the pod.
+
+**Fix:**
+- Add `spec.template.spec.securityContext.supplementalGroups: [110]` to the ARC runner scale set
+- Recreate the runner pods so the new group takes effect
+
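A quick way to confirm the fix landed is to check the runner process's supplemental groups directly inside a pod (a minimal sketch; GID `110` is this cluster's GPU device group):

```bash
# Inside a runner pod: list the supplemental groups of the runner user.
# After the supplementalGroups fix, 110 should appear in the output.
id -G
id -G | tr ' ' '\n' | grep -qx 110 && echo "GPU device group present"
```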
+### Full 40-runner capacity
+
+`maxRunners: 40` only works if the cluster is dedicated to ARC. Any non-ARC workload that consumes `amd.com/gpu` or large pinned CPU reservations on these nodes will reduce the actual runner capacity.
+
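To see whether something other than the runners is holding GPUs, compare each node's allocatable vs. allocated `amd.com/gpu` (a sketch, following the `describe node` pattern used earlier in this doc):

```bash
# Per-node view of amd.com/gpu capacity and allocation. On this dedicated
# 5-node cluster, each node should show 8 allocatable and 8 allocated once
# all 40 runners are up; anything else is stealing capacity.
sudo k3s kubectl describe nodes | grep -E 'Name:|amd.com/gpu'
```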
 ### Jobs queuing forever / failed ephemeral runners
 
 ARC does **not** garbage-collect failed ephemeral runners. If pods fail to start (transient image pull errors, node issues, resource contention), ARC retries 5 times then marks the ephemeral runner as `Failed` with `TooManyPodFailures`. These zombie runners still count against `maxRunners`, so the autoscaler thinks the cluster is full even though the GPUs are idle.
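Since ARC won't clean these up itself, a manual cleanup sketch (the `arc-runners` namespace is an assumption about this install; adjust it to the scale set's namespace):

```bash
# List ephemeral runners and spot any stuck in Failed.
sudo k3s kubectl get ephemeralrunners -n arc-runners
# Delete the failed ones so they stop counting against maxRunners;
# ARC creates fresh replacements as jobs queue.
sudo k3s kubectl get ephemeralrunners -n arc-runners --no-headers \
  | grep Failed | awk '{print $1}' \
  | xargs -r -n1 sudo k3s kubectl delete ephemeralrunner -n arc-runners
```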
