
Fix missing ml. prefix and wrong MIG profiles for p6-b200.48xlarge#399

Open
KeitaW wants to merge 1 commit into aws:main from KeitaW:fix/b200-mig-profile-key-prefix

Conversation

@KeitaW
Contributor

@KeitaW KeitaW commented Mar 27, 2026

Summary

Fix three bugs in the p6-b200.48xlarge MIG profile configuration that cause MIG validation to always reject B200 instances for both training and inference workloads.

Bugs Fixed

1. Training: missing ml. prefix in INSTANCE_TYPE_MIG_PROFILES

File: src/sagemaker/hyperpod/training/constants.py:134

The dict key is 'p6-b200.48xlarge' but the instance type flowing through the system from node.kubernetes.io/instance-type is always ml.p6-b200.48xlarge. The lookup at accelerator_partition_util.py:26 fails, returning:

"Instance type 'ml.p6-b200.48xlarge' does not support accelerator partitions."

This blocks all B200 MIG usage: both HyperPodPyTorchJob submissions that set acceleratorPartitionType and hyp list-accelerator-partition-type --instance-type ml.p6-b200.48xlarge fail with this error.

Every other entry in the dict uses the ml. prefix. The enum in hyperpod_instance_types.py:110 also defines ML_P6_B200_48XLARGE = "ml.p6-b200.48xlarge".
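The failure mode can be reproduced with a minimal sketch — the dict contents and helper below are illustrative stand-ins, not the real SDK code:

```python
# Minimal sketch of the key-mismatch bug; the dict contents here are
# illustrative stand-ins, not the real SDK constants.
INSTANCE_TYPE_MIG_PROFILES = {
    # every other entry uses the "ml." prefix...
    "ml.p5.48xlarge": ["mig-1g.10gb", "mig-7g.80gb"],
    # ...but the buggy B200 key did not:
    "p6-b200.48xlarge": ["mig-1g.23gb", "mig-7g.180gb"],
}

def lookup_profiles(instance_type: str) -> list[str]:
    # The instance type flowing in from node.kubernetes.io/instance-type
    # is always "ml."-prefixed, so the un-prefixed key never matches.
    profiles = INSTANCE_TYPE_MIG_PROFILES.get(instance_type)
    if profiles is None:
        raise ValueError(
            f"Instance type '{instance_type}' does not support accelerator partitions."
        )
    return profiles
```

With the buggy key, lookup_profiles("ml.p6-b200.48xlarge") raises the error quoted above; renaming the key to "ml.p6-b200.48xlarge" makes the lookup succeed.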

2. Inference: missing ml. prefix in INSTANCE_MIG_PROFILES

File: src/sagemaker/hyperpod/inference/constant.py:42

Same key mismatch as above. hp_jumpstart_endpoint.py:287 checks instance_type not in INSTANCE_MIG_PROFILES; because the dict key lacks the ml. prefix, this check always trips for B200, even when the caller passes the correct ml.p6-b200.48xlarge.
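The guard can be sketched like this — a simplified stand-in for the check at hp_jumpstart_endpoint.py:287, with an assumed profile list and error wording:

```python
# Sketch of the inference-side guard; the profile list and messages are
# assumed for illustration, not copied from the SDK.
INSTANCE_MIG_PROFILES = {
    # Fixed key: with the old un-prefixed "p6-b200.48xlarge" key, the
    # membership test below always failed for B200 callers.
    "ml.p6-b200.48xlarge": [
        "mig-1g.23gb", "mig-1g.45gb", "mig-2g.45gb",
        "mig-3g.90gb", "mig-4g.90gb", "mig-7g.180gb",
    ],
}

def validate_mig_profile(mig_profile: str, instance_type: str) -> None:
    # This is the branch that always fired while the key was un-prefixed.
    if instance_type not in INSTANCE_MIG_PROFILES:
        raise ValueError(f"MIG is not supported for instance type '{instance_type}'")
    if mig_profile not in INSTANCE_MIG_PROFILES[instance_type]:
        raise ValueError(f"Invalid MIG profile '{mig_profile}' for '{instance_type}'")
```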

3. Inference: wrong MIG profiles for B200 (copy-pasted from GB200)

File: src/sagemaker/hyperpod/inference/constant.py:43-48

The B200 entry had GB200 profiles instead of B200 profiles:

| Profile slot | Was (GB200 values) | Now (correct B200 values) |
| --- | --- | --- |
| 1g double-mem | mig-1g.47gb | mig-1g.45gb |
| 2g | mig-2g.47gb | mig-2g.45gb |
| 3g | mig-3g.93gb | mig-3g.90gb |
| 4g | mig-4g.93gb | mig-4g.90gb |
| 7g | mig-7g.186gb | mig-7g.180gb |

The correct B200 profiles are confirmed by the NVIDIA MIG User Guide (r580) and the NVIDIA GPU Operator upstream ConfigMap (device-filter 0x290110DE).
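A regression like this copy-paste is easy to catch mechanically: the GB200 memory sizes form a distinct family from the B200 ones. The sketch below (helper name and profile list are mine, using the values from the table above) flags any GB200 memory size appearing in the B200 entry:

```python
import re

# Corrected B200 profiles from the table above; helper name is illustrative.
B200_PROFILES = [
    "mig-1g.45gb", "mig-2g.45gb", "mig-3g.90gb",
    "mig-4g.90gb", "mig-7g.180gb",
]
# Memory sizes from the GB200 column of the table above.
GB200_MEM_SIZES = {47, 93, 186}
PROFILE_RE = re.compile(r"^mig-(\d+)g\.(\d+)gb$")

def uses_gb200_memory(profile: str) -> bool:
    # Parse "<slices>g.<mem>gb" and check the memory size against the
    # GB200 family that was copy-pasted into the B200 entry.
    m = PROFILE_RE.match(profile)
    return m is not None and int(m.group(2)) in GB200_MEM_SIZES
```

After the fix, uses_gb200_memory should return False for every entry in the B200 list.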

Changes

| File | Change |
| --- | --- |
| src/sagemaker/hyperpod/training/constants.py | 'p6-b200.48xlarge' → 'ml.p6-b200.48xlarge' |
| src/sagemaker/hyperpod/inference/constant.py | Key prefix fix + correct B200 profiles |
| test/unit_tests/inference/test_hp_jumpstart_endpoint.py | Update test to use ml.p6-b200.48xlarge |

Test plan

  • Verify INSTANCE_TYPE_MIG_PROFILES['ml.p6-b200.48xlarge'] returns the correct 6 B200 profiles
  • Verify INSTANCE_MIG_PROFILES['ml.p6-b200.48xlarge'] returns the correct 7 B200 profiles
  • Verify _validate_accelerator_partition("mig-1g.23gb", ..., "ml.p6-b200.48xlarge") passes validation
  • Verify validate_mig_profile("mig-1g.45gb", "ml.p6-b200.48xlarge") passes in inference
  • Unit test: test_hp_jumpstart_endpoint.py passes with updated instance type
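The first training-side test-plan item can be sketched as unit tests. The dict here is a local stand-in for the SDK constant so the snippet runs on its own; the real tests would import INSTANCE_TYPE_MIG_PROFILES from the training module:

```python
import unittest

# Local stand-in for the training constant so this sketch is self-contained;
# the six profiles follow the test plan above.
INSTANCE_TYPE_MIG_PROFILES = {
    "ml.p6-b200.48xlarge": [
        "mig-1g.23gb", "mig-1g.45gb", "mig-2g.45gb",
        "mig-3g.90gb", "mig-4g.90gb", "mig-7g.180gb",
    ],
}

class TestB200MigKey(unittest.TestCase):
    def test_ml_prefixed_key_resolves(self):
        # The Kubernetes node label always carries the "ml." prefix.
        profiles = INSTANCE_TYPE_MIG_PROFILES["ml.p6-b200.48xlarge"]
        self.assertEqual(len(profiles), 6)
        self.assertIn("mig-1g.23gb", profiles)
        self.assertIn("mig-7g.180gb", profiles)

    def test_unprefixed_key_removed(self):
        # The old buggy key must be gone, not merely duplicated.
        self.assertNotIn("p6-b200.48xlarge", INSTANCE_TYPE_MIG_PROFILES)

# Run with: python -m unittest <this_file>
```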

The p6-b200.48xlarge key was missing the ml. prefix in both
INSTANCE_TYPE_MIG_PROFILES (training) and INSTANCE_MIG_PROFILES
(inference), causing MIG validation to always reject B200 instances.
The instance type flowing through the system from the Kubernetes
node label (node.kubernetes.io/instance-type) is always
ml.p6-b200.48xlarge, so the dict lookup never matched.

Additionally, the inference constant had the wrong MIG profiles
for B200 — it used GB200 values (47gb, 93gb, 186gb) instead of
the correct B200 values (45gb, 90gb, 180gb), likely a copy-paste
from the ml.p6e-gb200.36xlarge entry.

Fixes:
- training/constants.py: 'p6-b200.48xlarge' -> 'ml.p6-b200.48xlarge'
- inference/constant.py: key prefix + correct B200 profiles
- test: update to use ml. prefixed instance type
@KeitaW KeitaW requested a review from a team as a code owner March 27, 2026 23:17
KeitaW added a commit to KeitaW/sagemaker-hyperpod-cli that referenced this pull request Mar 28, 2026
Add B200 (Blackwell) test coverage alongside B300:
- 2 validation cases: valid profile accepted, cross-arch rejected
- 6 defaults cases with exact CPU/memory values

B200 validation tests will fail until aws#399 merges (fixes the
p6-b200.48xlarge → ml.p6-b200.48xlarge key). B200 defaults tests
pass immediately since INSTANCE_RESOURCES already uses the ml. key.