feat(vllm): add Strix Halo vLLM image with gfx1150/1151 CI #101

Open

KerwinTsaiii wants to merge 6 commits into develop from feat/add-vllm-base-image

Conversation

@KerwinTsaiii
Collaborator

Build vLLM + AITER + ROCm flash-attention from source on top of auplc-base, targeted at gfx1151 (Strix Halo) and gfx1150.

  • dockerfiles/VLLM/: Dockerfile and helper scripts (build / server / bench / chat) plus the Welcome-vLLM-on-Strix-Halo notebook.
  • patch_aiter_headers.py + optCompilerConfig.gfx1151.json: source-level RDNA 3.5 fallbacks for AITER's CDNA-only ISA paths (vec_convert.h packed fp8/bf8, hip_reduce.h DPP row_bcast -> ds_swizzle).
  • patch_flash_attn_setup.py + patch_strix.py: build-system fixups for the AMD flash-attention fork on gfx1151.
  • dockerfiles/Makefile: 'make vllm' target wiring GPU_TARGET / VLLM_REF / MAX_JOBS / FLASH_ATTN_REF through to VLLM/build.sh.
  • .github/workflows/docker-build-vllm.yml: matrix CI for gfx1150 + gfx1151, publishing ghcr.io/amdresearch/auplc-vllm:{tag}-gfx115x (and unsuffixed aliases for the default gfx1151 target). Runs sequentially (max-parallel: 1) to fit ubuntu-latest's 7 GB RAM / 14 GB disk envelope; cache scoped per GPU (sketched below).
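
For orientation, the matrix/serialization shape that last bullet describes reads roughly like the sketch below. The step list, action versions, and tag pattern are illustrative assumptions, not the actual contents of docker-build-vllm.yml:

```yaml
jobs:
  build-vllm:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 1                  # build gfx1150, then gfx1151, sequentially
      matrix:
        gpu_target: [gfx1150, gfx1151]
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v6
        with:
          file: dockerfiles/VLLM/Dockerfile
          build-args: |
            GPU_TARGET=${{ matrix.gpu_target }}
          tags: ghcr.io/amdresearch/auplc-vllm:latest-${{ matrix.gpu_target }}
          # cache scoped per GPU target so the two builds don't evict each other
          cache-from: type=gha,scope=vllm-${{ matrix.gpu_target }}
          cache-to: type=gha,mode=max,scope=vllm-${{ matrix.gpu_target }}
```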

Checklist

  • Code follows project style guidelines
  • Changes are backward compatible
  • Tested on local Kubernetes cluster
  • Documentation links updated

KerwinTsaiii and others added 6 commits May 12, 2026 22:29
Build vLLM + AITER + ROCm flash-attention from source on top of
auplc-base, targeted at gfx1151 (Strix Halo) and gfx1150.

* dockerfiles/VLLM/: Dockerfile and helper scripts (build / server /
  bench / chat) plus the Welcome-vLLM-on-Strix-Halo notebook.
* patch_aiter_headers.py + optCompilerConfig.gfx1151.json: source-level
  RDNA 3.5 fallbacks for AITER's CDNA-only ISA paths (vec_convert.h
  packed fp8/bf8, hip_reduce.h DPP row_bcast -> ds_swizzle).
* patch_flash_attn_setup.py + patch_strix.py: build-system fixups for
  the AMD flash-attention fork on gfx1151.
* dockerfiles/Makefile: 'make vllm' target wiring GPU_TARGET /
  VLLM_REF / MAX_JOBS / FLASH_ATTN_REF through to VLLM/build.sh.
* .github/workflows/docker-build-vllm.yml: matrix CI for
  gfx1150 + gfx1151, publishing
  ghcr.io/amdresearch/auplc-vllm:{tag}-gfx115x (and unsuffixed aliases
  for the default gfx1151 target). Runs sequentially (max-parallel: 1)
  to fit ubuntu-latest's 7 GB RAM / 14 GB disk envelope; cache scoped
  per GPU.

Co-authored-by: Cursor <cursoragent@cursor.com>
GHA evaluates `matrix:` before other contexts, so filtering matrix
entries from a job-level `if: ... matrix.gpu_target` raised
"Unrecognized named-value: 'matrix'" and the workflow shipped as
invalid (Actions UI fell back to the filename).

Replace the post-hoc filter with a tiny `resolve-matrix` job that
emits a JSON array based on workflow_dispatch input, and feed it
back to `build-vllm` via `fromJSON(needs.resolve-matrix.outputs.gpu_targets)`.
Push / pull_request keep building both gfx1150 + gfx1151; manual
runs with `gpu_target=gfx1151` (or 1150) build only that one.

Co-authored-by: Cursor <cursoragent@cursor.com>
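
Assembled from the description above, the pattern looks roughly like this. Only the job/output names and the fromJSON hand-off come from the commit message; the workflow_dispatch input wiring and step bodies are assumptions:

```yaml
on:
  push:
  pull_request:
  workflow_dispatch:
    inputs:
      gpu_target:
        description: "Optional: build a single GPU target (gfx1150 or gfx1151)"
        required: false

jobs:
  resolve-matrix:
    runs-on: ubuntu-latest
    outputs:
      gpu_targets: ${{ steps.pick.outputs.gpu_targets }}
    steps:
      - id: pick
        run: |
          # Manual runs with an input build only that target; push/PR build both.
          if [ -n "${{ github.event.inputs.gpu_target }}" ]; then
            echo 'gpu_targets=["${{ github.event.inputs.gpu_target }}"]' >> "$GITHUB_OUTPUT"
          else
            echo 'gpu_targets=["gfx1150","gfx1151"]' >> "$GITHUB_OUTPUT"
          fi

  build-vllm:
    needs: resolve-matrix
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 1
      matrix:
        # matrix values can be taken from a prior job's output via fromJSON
        gpu_target: ${{ fromJSON(needs.resolve-matrix.outputs.gpu_targets) }}
    steps:
      - run: echo "building for ${{ matrix.gpu_target }}"   # placeholder for the real build steps
```

Because `matrix:` is evaluated before other contexts within the same job, moving the choice into an upstream job sidesteps the "Unrecognized named-value" error entirely.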
Round out the vLLM image with the surrounding project glue that lets
users actually launch and measure it.

* runtime/values.yaml: register auplc-vllm as a hub-spawnable profile
  (vllm image + GPU resources + "vLLM Inference Server" card) and add
  it to the official / native-users / github-users access lists so it
  shows up next to the Course images in JupyterHub (see the sketch
  after this commit message).
* pyproject.toml: exclude dockerfiles/VLLM/patch_aiter_headers.py from
  ruff. The file is ~95 % C++ source held inside a Python string —
  ruff would chase indentation / trailing-space inside the embedded
  C++ forever; the wrapper Python is trivial enough to skip linting.
* benchmarks/run_qwen3_4b_throughput.sh: host-side wrapper that boots
  the auplc-vllm container, waits for /v1/models to settle, then
  docker-execs the in-image bench against loopback so client and
  server share the exact same vLLM build.
* benchmarks/.gitignore: keep run logs (server / bench JSON + .log)
  out of VCS; results live outside the repo.

Co-authored-by: Cursor <cursoragent@cursor.com>
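
A hub-spawnable profile entry of this kind usually takes the shape below. The key names follow the Zero-to-JupyterHub profileList convention; the image tag, description, and resource names are placeholders, not the repo's actual values.yaml:

```yaml
singleuser:
  profileList:
    - display_name: "vLLM Inference Server"
      description: "vLLM + AITER + flash-attention for Strix Halo"   # card text: assumed
      kubespawner_override:
        image: ghcr.io/amdresearch/auplc-vllm:latest                 # tag: assumed
        extra_resource_limits:
          amd.com/gpu: "1"   # request one AMD GPU via the device plugin
```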
* cell 3 (sanity-check): hoist `import torch` above the `print(...)`
  block so all imports sit at the top, then let ruff's isort group
  it under the third-party block. (E402 + I001)
* cell 14 (cleanup): drop the duplicate `import os` / `import signal`;
  cell 6 already pulled them into the notebook's global namespace,
  and the cleanup cell can't run standalone anyway (it depends on
  `server` from cell 6). (F811)

Co-authored-by: Cursor <cursoragent@cursor.com>
Mostly mechanical: collapse a one-line raise, expand the long
chat-completion user message into a multi-line dict, and a few
trivial whitespace touches across cells. No semantic changes.

Co-authored-by: Cursor <cursoragent@cursor.com>