
Fix missing ml. prefix and wrong MIG profiles for p6-b200.48xlarge#399

Open
KeitaW wants to merge 1 commit into aws:main from KeitaW:fix/b200-mig-profile-key-prefix

Conversation

@KeitaW
Contributor

@KeitaW KeitaW commented Mar 27, 2026

Summary

Fix three bugs in the p6-b200.48xlarge MIG profile configuration that cause MIG validation to always reject B200 instances for both training and inference workloads.

Bugs Fixed

1. Training: missing ml. prefix in INSTANCE_TYPE_MIG_PROFILES

File: src/sagemaker/hyperpod/training/constants.py:134

The dict key is 'p6-b200.48xlarge' but the instance type flowing through the system from node.kubernetes.io/instance-type is always ml.p6-b200.48xlarge. The lookup at accelerator_partition_util.py:26 fails, returning:

"Instance type 'ml.p6-b200.48xlarge' does not support accelerator partitions."

This blocks all B200 MIG usage: both HyperPodPyTorchJob submissions that set acceleratorPartitionType and hyp list-accelerator-partition-type --instance-type ml.p6-b200.48xlarge fail with this error.

Every other entry in the dict uses the ml. prefix. The enum in hyperpod_instance_types.py:110 also defines ML_P6_B200_48XLARGE = "ml.p6-b200.48xlarge".
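The failure mode can be reproduced with a minimal sketch — the dict contents and helper below are illustrative stand-ins, not the real SDK code:

```python
# Minimal sketch of the key-mismatch bug; the dict contents here are
# illustrative stand-ins, not the real SDK constants.
INSTANCE_TYPE_MIG_PROFILES = {
    # every other entry uses the "ml." prefix...
    "ml.p5.48xlarge": ["mig-1g.10gb", "mig-7g.80gb"],
    # ...but the buggy B200 key did not:
    "p6-b200.48xlarge": ["mig-1g.23gb", "mig-7g.180gb"],
}

def lookup_profiles(instance_type: str) -> list[str]:
    # The instance type flowing in from node.kubernetes.io/instance-type
    # is always "ml."-prefixed, so the un-prefixed key never matches.
    profiles = INSTANCE_TYPE_MIG_PROFILES.get(instance_type)
    if profiles is None:
        raise ValueError(
            f"Instance type '{instance_type}' does not support accelerator partitions."
        )
    return profiles
```

With the buggy key, lookup_profiles("ml.p6-b200.48xlarge") raises the error quoted above; renaming the key to "ml.p6-b200.48xlarge" makes the lookup succeed.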

2. Inference: missing ml. prefix in INSTANCE_MIG_PROFILES

File: src/sagemaker/hyperpod/inference/constant.py:42

Same key mismatch as above. hp_jumpstart_endpoint.py:287 checks instance_type not in INSTANCE_MIG_PROFILES; because the dict key lacks the ml. prefix, this check always trips for B200, even when the caller passes the correct ml.p6-b200.48xlarge.
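The guard can be sketched like this — a simplified stand-in for the check at hp_jumpstart_endpoint.py:287, with an assumed profile list and error wording:

```python
# Sketch of the inference-side guard; the profile list and messages are
# assumed for illustration, not copied from the SDK.
INSTANCE_MIG_PROFILES = {
    # Fixed key: with the old un-prefixed "p6-b200.48xlarge" key, the
    # membership test below always failed for B200 callers.
    "ml.p6-b200.48xlarge": [
        "mig-1g.23gb", "mig-1g.45gb", "mig-2g.45gb",
        "mig-3g.90gb", "mig-4g.90gb", "mig-7g.180gb",
    ],
}

def validate_mig_profile(mig_profile: str, instance_type: str) -> None:
    # This is the branch that always fired while the key was un-prefixed.
    if instance_type not in INSTANCE_MIG_PROFILES:
        raise ValueError(f"MIG is not supported for instance type '{instance_type}'")
    if mig_profile not in INSTANCE_MIG_PROFILES[instance_type]:
        raise ValueError(f"Invalid MIG profile '{mig_profile}' for '{instance_type}'")
```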

3. Inference: wrong MIG profiles for B200 (copy-pasted from GB200)

File: src/sagemaker/hyperpod/inference/constant.py:43-48

The B200 entry had GB200 profiles instead of B200 profiles:

| Profile slot | Was (GB200 values) | Now (correct B200 values) |
| --- | --- | --- |
| 1g double-mem | mig-1g.47gb | mig-1g.45gb |
| 2g | mig-2g.47gb | mig-2g.45gb |
| 3g | mig-3g.93gb | mig-3g.90gb |
| 4g | mig-4g.93gb | mig-4g.90gb |
| 7g | mig-7g.186gb | mig-7g.180gb |

The correct B200 profiles are confirmed by the NVIDIA MIG User Guide (r580) and the NVIDIA GPU Operator upstream ConfigMap (device-filter 0x290110DE).
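A regression like this copy-paste is easy to catch mechanically: the GB200 memory sizes form a distinct family from the B200 ones. The sketch below (helper name and profile list are mine, using the values from the table above) flags any GB200 memory size appearing in the B200 entry:

```python
import re

# Corrected B200 profiles from the table above; helper name is illustrative.
B200_PROFILES = [
    "mig-1g.45gb", "mig-2g.45gb", "mig-3g.90gb",
    "mig-4g.90gb", "mig-7g.180gb",
]
# Memory sizes from the GB200 column of the table above.
GB200_MEM_SIZES = {47, 93, 186}
PROFILE_RE = re.compile(r"^mig-(\d+)g\.(\d+)gb$")

def uses_gb200_memory(profile: str) -> bool:
    # Parse "<slices>g.<mem>gb" and check the memory size against the
    # GB200 family that was copy-pasted into the B200 entry.
    m = PROFILE_RE.match(profile)
    return m is not None and int(m.group(2)) in GB200_MEM_SIZES
```

After the fix, uses_gb200_memory should return False for every entry in the B200 list.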

Changes

| File | Change |
| --- | --- |
| src/sagemaker/hyperpod/training/constants.py | 'p6-b200.48xlarge' → 'ml.p6-b200.48xlarge' |
| src/sagemaker/hyperpod/inference/constant.py | Key prefix fix + correct B200 profiles |
| test/unit_tests/inference/test_hp_jumpstart_endpoint.py | Update test to use ml.p6-b200.48xlarge |

Test plan

  • Verify INSTANCE_TYPE_MIG_PROFILES['ml.p6-b200.48xlarge'] returns the correct 6 B200 profiles
  • Verify INSTANCE_MIG_PROFILES['ml.p6-b200.48xlarge'] returns the correct 7 B200 profiles
  • Verify _validate_accelerator_partition("mig-1g.23gb", ..., "ml.p6-b200.48xlarge") passes validation
  • Verify validate_mig_profile("mig-1g.45gb", "ml.p6-b200.48xlarge") passes in inference
  • Unit test: test_hp_jumpstart_endpoint.py passes with updated instance type
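The first training-side test-plan item can be sketched as unit tests. The dict here is a local stand-in for the SDK constant so the snippet runs on its own; the real tests would import INSTANCE_TYPE_MIG_PROFILES from the training module:

```python
import unittest

# Local stand-in for the training constant so this sketch is self-contained;
# the six profiles follow the test plan above.
INSTANCE_TYPE_MIG_PROFILES = {
    "ml.p6-b200.48xlarge": [
        "mig-1g.23gb", "mig-1g.45gb", "mig-2g.45gb",
        "mig-3g.90gb", "mig-4g.90gb", "mig-7g.180gb",
    ],
}

class TestB200MigKey(unittest.TestCase):
    def test_ml_prefixed_key_resolves(self):
        # The Kubernetes node label always carries the "ml." prefix.
        profiles = INSTANCE_TYPE_MIG_PROFILES["ml.p6-b200.48xlarge"]
        self.assertEqual(len(profiles), 6)
        self.assertIn("mig-1g.23gb", profiles)
        self.assertIn("mig-7g.180gb", profiles)

    def test_unprefixed_key_removed(self):
        # The old buggy key must be gone, not merely duplicated.
        self.assertNotIn("p6-b200.48xlarge", INSTANCE_TYPE_MIG_PROFILES)

# Run with: python -m unittest <this_file>
```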

The p6-b200.48xlarge key was missing the ml. prefix in both
INSTANCE_TYPE_MIG_PROFILES (training) and INSTANCE_MIG_PROFILES
(inference), causing MIG validation to always reject B200 instances.
The instance type flowing through the system from the Kubernetes
node label (node.kubernetes.io/instance-type) is always
ml.p6-b200.48xlarge, so the dict lookup never matched.

Additionally, the inference constant had the wrong MIG profiles
for B200 — it used GB200 values (47gb, 93gb, 186gb) instead of
the correct B200 values (45gb, 90gb, 180gb), likely a copy-paste
from the ml.p6e-gb200.36xlarge entry.

Fixes:
- training/constants.py: 'p6-b200.48xlarge' -> 'ml.p6-b200.48xlarge'
- inference/constant.py: key prefix + correct B200 profiles
- test: update to use ml. prefixed instance type
@KeitaW KeitaW requested a review from a team as a code owner March 27, 2026 23:17
KeitaW added a commit to KeitaW/sagemaker-hyperpod-cli that referenced this pull request Mar 28, 2026
Add B200 (Blackwell) test coverage alongside B300:
- 2 validation cases: valid profile accepted, cross-arch rejected
- 6 defaults cases with exact CPU/memory values

B200 validation tests will fail until aws#399 merges (fixes the
p6-b200.48xlarge → ml.p6-b200.48xlarge key). B200 defaults tests
pass immediately since INSTANCE_RESOURCES already uses the ml. key.