Fix missing ml. prefix and wrong MIG profiles for p6-b200.48xlarge #399
Open
The `p6-b200.48xlarge` key was missing the `ml.` prefix in both `INSTANCE_TYPE_MIG_PROFILES` (training) and `INSTANCE_MIG_PROFILES` (inference), causing MIG validation to always reject B200 instances. The instance type flowing through the system from the Kubernetes node label (`node.kubernetes.io/instance-type`) is always `ml.p6-b200.48xlarge`, so the dict lookup never matched. Additionally, the inference constant had the wrong MIG profiles for B200: it used GB200 values (47gb, 93gb, 186gb) instead of the correct B200 values (45gb, 90gb, 180gb), likely a copy-paste from the `ml.p6e-gb200.36xlarge` entry.

Fixes:
- `training/constants.py`: `'p6-b200.48xlarge'` -> `'ml.p6-b200.48xlarge'`
- `inference/constant.py`: key prefix + correct B200 profiles
- test: update to use the `ml.`-prefixed instance type
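The key mismatch can be shown with a minimal sketch. The dict and helper below are illustrative stand-ins, not the actual sagemaker-hyperpod-cli code; the profile lists are truncated for brevity:

```python
# Before the fix: the dict key lacks the "ml." prefix.
INSTANCE_TYPE_MIG_PROFILES_BUGGY = {
    "p6-b200.48xlarge": ["mig-1g.23gb", "mig-7g.180gb"],  # truncated
}

# After the fix: the key matches the node.kubernetes.io/instance-type label.
INSTANCE_TYPE_MIG_PROFILES_FIXED = {
    "ml.p6-b200.48xlarge": ["mig-1g.23gb", "mig-7g.180gb"],  # truncated
}

def is_valid_profile(profiles: dict, instance_type: str, profile: str) -> bool:
    # A missing key silently yields an empty list, so every profile
    # is rejected rather than raising a visible error.
    return profile in profiles.get(instance_type, [])

# The instance type arriving from the Kubernetes node label always
# carries the "ml." prefix, so the buggy key never matches.
label = "ml.p6-b200.48xlarge"
assert not is_valid_profile(INSTANCE_TYPE_MIG_PROFILES_BUGGY, label, "mig-1g.23gb")
assert is_valid_profile(INSTANCE_TYPE_MIG_PROFILES_FIXED, label, "mig-1g.23gb")
```

Because the failure mode is an empty `dict.get` default rather than a `KeyError`, the bug rejects every B200 profile without any crash, which is why it survived until now.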
KeitaW added a commit to KeitaW/sagemaker-hyperpod-cli that referenced this pull request (Mar 28, 2026):
Add B200 (Blackwell) test coverage alongside B300:
- 2 validation cases: valid profile accepted, cross-arch rejected
- 6 defaults cases with exact CPU/memory values

B200 validation tests will fail until aws#399 merges (fixes the `p6-b200.48xlarge` → `ml.p6-b200.48xlarge` key). B200 defaults tests pass immediately since `INSTANCE_RESOURCES` already uses the `ml.` key.
Summary
Fix three bugs in the `p6-b200.48xlarge` MIG profile configuration that cause MIG validation to always reject B200 instances for both training and inference workloads.

Bugs Fixed

1. Training: missing `ml.` prefix in `INSTANCE_TYPE_MIG_PROFILES`

File: `src/sagemaker/hyperpod/training/constants.py:134`

The dict key is `'p6-b200.48xlarge'`, but the instance type flowing through the system from `node.kubernetes.io/instance-type` is always `ml.p6-b200.48xlarge`, so the lookup at `accelerator_partition_util.py:26` fails. This blocks all B200 MIG usage: `HyperPodPyTorchJob` submissions with `acceleratorPartitionType`, and `hyp list-accelerator-partition-type --instance-type ml.p6-b200.48xlarge`. Every other entry in the dict uses the `ml.` prefix, and the enum in `hyperpod_instance_types.py:110` also defines `ML_P6_B200_48XLARGE = "ml.p6-b200.48xlarge"`.

2. Inference: missing `ml.` prefix in `INSTANCE_MIG_PROFILES`

File: `src/sagemaker/hyperpod/inference/constant.py:42`

Same key mismatch as above. `hp_jumpstart_endpoint.py:287` checks `instance_type not in INSTANCE_MIG_PROFILES`, which always fails for B200 when the caller passes the correct `ml.p6-b200.48xlarge`.

3. Inference: wrong MIG profiles for B200 (copy-pasted from GB200)

File: `src/sagemaker/hyperpod/inference/constant.py:43-48`

The B200 entry had GB200 profiles instead of B200 profiles:

| Before (GB200) | After (B200) |
| --- | --- |
| `mig-1g.47gb` | `mig-1g.45gb` |
| `mig-2g.47gb` | `mig-2g.45gb` |
| `mig-3g.93gb` | `mig-3g.90gb` |
| `mig-4g.93gb` | `mig-4g.90gb` |
| `mig-7g.186gb` | `mig-7g.180gb` |

The correct B200 profiles are confirmed by the NVIDIA MIG User Guide (r580) and the NVIDIA GPU Operator upstream ConfigMap (device-filter `0x290110DE`).

Changes

- `src/sagemaker/hyperpod/training/constants.py`: `'p6-b200.48xlarge'` → `'ml.p6-b200.48xlarge'`
- `src/sagemaker/hyperpod/inference/constant.py`: `ml.` key prefix plus the correct B200 profiles
- `test/unit_tests/inference/test_hp_jumpstart_endpoint.py`: updated to use `ml.p6-b200.48xlarge`

Test plan

- `INSTANCE_TYPE_MIG_PROFILES['ml.p6-b200.48xlarge']` returns the correct 6 B200 profiles
- `INSTANCE_MIG_PROFILES['ml.p6-b200.48xlarge']` returns the correct 7 B200 profiles
- `_validate_accelerator_partition("mig-1g.23gb", ..., "ml.p6-b200.48xlarge")` passes validation
- `validate_mig_profile("mig-1g.45gb", "ml.p6-b200.48xlarge")` passes in inference
- `test_hp_jumpstart_endpoint.py` passes with the updated instance type
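Beyond the point fix, one way to prevent this class of regression would be to normalize the instance type before any dict lookup. This helper is not part of the PR; it is a hypothetical sketch of a defensive guard against a missing `ml.` prefix on either side of the lookup:

```python
def normalize_instance_type(instance_type: str) -> str:
    """Ensure the SageMaker 'ml.' prefix is present exactly once
    before using the value as a dict key."""
    if instance_type.startswith("ml."):
        return instance_type
    return "ml." + instance_type

# Both spellings resolve to the same canonical key.
assert normalize_instance_type("p6-b200.48xlarge") == "ml.p6-b200.48xlarge"
assert normalize_instance_type("ml.p6-b200.48xlarge") == "ml.p6-b200.48xlarge"
```

A cheap companion safeguard is a unit test asserting that every key in the MIG-profile dicts starts with `ml.`, which would have caught the original typo at review time.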