Skip to content

fix: AE-2197: stable gpu sorting#285

Merged
jhcipar merged 2 commits intomainfrom
jhcipar/AE-2197/gpuids-order-normalization
Mar 26, 2026
Merged

fix: AE-2197: stable gpu sorting#285
jhcipar merged 2 commits intomainfrom
jhcipar/AE-2197/gpuids-order-normalization

Conversation

@jhcipar
Copy link
Contributor

@jhcipar jhcipar commented Mar 25, 2026

Summary

  • Canonicalizes gpuIds strings so comparisons and hashing are stable even when input order/spacing differs
  • Adds GpuGroup.normalize_gpu_ids_str(...) and uses it in both GPU ID generation (to_gpu_ids_str) and ServerlessResource input sync, so raw gpuIds inputs are normalized consistently.

Why

  • Previously, semantically equivalent gpuIds values could produce different strings depending on token order, which could cause avoidable config diffs and unstable hash/equality behavior, and trigger lots of releases
  • This change makes gpuIds representation deterministic across code paths

Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #285 — fix: AE-2197: stable gpu sorting

Repo: runpod/flash
Severity: LOW
Key findings: No test covers the actual stated goal — that the same GPU set in different orders produces the same config hash and avoids a spurious redeploy.


1. Missing test for the user scenario this fix is meant to prevent

The purpose of this fix is: a user who deploys with gpuIds="AMPERE_24,ADA_24" and then redeploys the same config with gpuIds="ADA_24,AMPERE_24" should not trigger an unnecessary redeploy.

No test validates this end-to-end. The new tests cover the sort function in isolation, but nothing constructs two resource configs with the same GPU set in different orders and asserts they produce the same config_hash. If the normalization were ever disconnected from the hash path, this scenario would silently regress — the sort tests would still pass.

Nits

  • from_gpu_ids_str parses self.gpuIds but does not write back the normalized form. No bug today, but it creates an implicit ordering dependency that could cause confusion for anyone reading that code path later.

Verdict

PASS WITH NITS — the fix is correct and the spurious-redeploy scenario is resolved. Add a hash-level round-trip test (same GPU set, different input order, same config_hash) to lock in the stated goal.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@jhcipar jhcipar merged commit 759e0f6 into main Mar 26, 2026
4 checks passed
@jhcipar jhcipar deleted the jhcipar/AE-2197/gpuids-order-normalization branch March 26, 2026 23:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants