Add MIG partition validation and defaults tests for all instance types #400

Open
KeitaW wants to merge 6 commits into aws:main from KeitaW:test/b300-mig-profile-tests


KeitaW commented on Mar 28, 2026

Summary

Adds regression tests for MIG accelerator partition validation and CPU/memory default calculation across all MIG-capable instance types. All tests extend the existing TestAcceleratorPartitionUtil parametrized cases.

Motivation

Without these tests, the following regressions would only surface as customer-reported runtime failures:

  • Missing or misnamed dict key in INSTANCE_TYPE_MIG_PROFILES: the CLI rejects valid MIG requests with "Instance type does not support accelerator partitions". The B200 validation tests in this PR demonstrate this: they fail on the current main branch because the ml. prefix is missing from the B200 key (fixed by #399, "Fix missing ml. prefix and wrong MIG profiles for p6-b200.48xlarge").
  • Wrong INSTANCE_RESOURCES values: CPU/memory auto-calculation depends on correct instance specs (cpu, gpu, memory). A typo would silently mis-provision pod resources. The defaults test covers all 9 MIG-capable instance types with exact expected values.
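As a rough illustration of the first failure mode, the sketch below shows how a key without the ml. prefix makes a plain dict lookup miss. It is not the CLI's actual code: `check_partition_request`, the placeholder B200 profile name, and the second error message are hypothetical; only the dict name, the instance type keys, the B300 profile names, and the first error string come from this PR.

```python
# Hypothetical stand-in for the mapping in constants.py (values are illustrative).
INSTANCE_TYPE_MIG_PROFILES = {
    # Misnamed key: the "ml." prefix is missing, so lookups by the real
    # instance type name never find this entry.
    "p6-b200.48xlarge": ["mig-placeholder-profile"],
    "ml.p6-b300.48xlarge": ["mig-1g.34gb", "mig-7g.269gb"],
}

UNSUPPORTED = "Instance type does not support accelerator partitions"


def check_partition_request(instance_type, profile):
    """Return an error message, or None when the request is acceptable.

    Hypothetical helper: the real validation function may differ.
    """
    profiles = INSTANCE_TYPE_MIG_PROFILES.get(instance_type)
    if profiles is None:
        return UNSUPPORTED
    if profile not in profiles:
        # Hypothetical wording, not taken from the CLI.
        return f"{profile} is not a valid profile for {instance_type}"
    return None


# A valid B200 request is rejected only because of the key typo.
b200_result = check_partition_request("ml.p6-b200.48xlarge", "mig-placeholder-profile")
b300_result = check_partition_request("ml.p6-b300.48xlarge", "mig-1g.34gb")
```

Here `b200_result` is the "does not support accelerator partitions" message even though the profile itself is listed, which is the kind of regression a validation test row per instance type is meant to catch.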

Depends on

Merge order: #399, #398 → this PR

Test coverage

| Test method | What it verifies | Cases |
|---|---|---|
| test_validate_accelerator_partition_fields | B200/B300: valid profiles accepted, cross-architecture profiles rejected | +4 rows (2 per instance type) |
| test_accelerator_partition_defaults | CPU/memory defaults correct for every MIG-capable instance type | 9 rows: P4d, P4de, P5, P5e, P5en, B200, B300, GB200, G7e |
| test_instance_type_profiles_not_empty | Every key in INSTANCE_TYPE_MIG_PROFILES has ≥1 profile | Data-driven over all keys |
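The data-driven guard in the last row could look roughly like the sketch below. This is an assumption about its shape, not the test in this PR: the mapping here is a one-entry stand-in using the B300 profiles listed in the commit message, and `keys_without_profiles` is a hypothetical helper. The real suite would import the mapping from constants.py.

```python
# Stand-in for the real mapping; the actual test would import it instead.
INSTANCE_TYPE_MIG_PROFILES = {
    "ml.p6-b300.48xlarge": [
        "mig-1g.34gb", "mig-1g.67gb", "mig-2g.67gb",
        "mig-3g.135gb", "mig-4g.135gb", "mig-7g.269gb",
    ],
}


def keys_without_profiles(mapping):
    """Return the instance types whose profile list is empty or missing."""
    return sorted(key for key, profiles in mapping.items() if not profiles)


def test_instance_type_profiles_not_empty():
    # Iterating over the mapping itself means a newly added instance type
    # is covered without editing the test.
    assert keys_without_profiles(INSTANCE_TYPE_MIG_PROFILES) == []
```

Reporting the offending keys, rather than asserting inside a loop, makes the failure message name every empty entry at once.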

Test plan

KeitaW added 2 commits March 28, 2026 00:00
Add ml.p6-b300.48xlarge to INSTANCE_TYPE_MIG_PROFILES in constants.py
with the correct B300 MIG profiles derived from the NVIDIA GPU Operator
v25.3.0 upstream ConfigMap (device-filter 0x318210DE):

- mig-1g.34gb, mig-1g.67gb, mig-2g.67gb
- mig-3g.135gb, mig-4g.135gb, mig-7g.269gb

Also add the corresponding uniform and mixed MIG partition profiles
to the Helm chart default-mig-config.yaml ConfigMap, following the
same pattern used for existing GPU types (H100, H200, B200).

The B300 GPU (288GB HBM3e, ~269GB usable) was already registered in
INSTANCE_RESOURCES but had no MIG profile mapping, causing HyperPod
MIG validation to reject accelerator partition requests on this
instance type.

Covers ml.p6-b300.48xlarge MIG profile support added in PR aws#398:
- Profile presence in INSTANCE_TYPE_MIG_PROFILES
- Complete profile list verification (6 profiles)
- All profiles in ALLOWED_ACCELERATOR_PARTITION_TYPES
- GPU slice extraction for all B300 profiles (1g→1, 2g→2, ..., 7g→7)
- CPU/memory default calculation for each profile at max instances
- Validation acceptance for valid B300 profiles
- Validation rejection for invalid profiles on B300 instance type
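For reference, the slice extraction mentioned above (1g→1, 2g→2, ..., 7g→7) can be sketched with a small regex. The pattern and the function name `extract_gpu_slices` are assumptions for illustration; the project's own helper may be named and implemented differently.

```python
import re

# Assumed profile shape: "mig-<slices>g.<memory>gb", e.g. "mig-3g.135gb".
_PROFILE_RE = re.compile(r"^mig-(\d+)g\.\d+gb$")


def extract_gpu_slices(profile):
    """Return the compute-slice count encoded in a MIG profile name."""
    match = _PROFILE_RE.match(profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {profile!r}")
    return int(match.group(1))


# B300 profiles as listed in the commit message above.
B300_PROFILES = [
    "mig-1g.34gb", "mig-1g.67gb", "mig-2g.67gb",
    "mig-3g.135gb", "mig-4g.135gb", "mig-7g.269gb",
]
slices = {profile: extract_gpu_slices(profile) for profile in B300_PROFILES}
# slices maps the six profiles to 1, 1, 2, 3, 4, 7 respectively.
```

Because the pattern depends only on the profile name, the same check applies to any instance type.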
KeitaW requested a review from a team as a code owner on March 28, 2026 00:07

- Delete test_b300_in_instance_type_mig_profiles (subsumed by
  test_b300_profiles_complete which KeyErrors on missing key)
- Delete test_b300_profiles_in_allowed_set (tautological: the
  allowed set is computed as union of all profile values)
- Delete test_extract_gpu_slices_b300 (instance-type-agnostic
  regex already covered by existing parametrized tests)
- Replace > 0 assertions with exact expected values in
  test_accelerator_partition_defaults_b300
- Fix misleading mock in test_validate_b300_partition: use empty
  allocatable for the invalid-profile case since validation fails
  at static parameter check before cluster check
- Remove unused ALLOWED_ACCELERATOR_PARTITION_TYPES import
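To see why the allowed-set test was tautological, consider the sketch below. It assumes, as the commit message states, that the allowed set is the union of all profile values; `build_allowed_types` is a hypothetical name for that computation, and the misspelled profile is deliberate.

```python
def build_allowed_types(mapping):
    """Union of every profile listed for any instance type."""
    allowed = set()
    for profiles in mapping.values():
        allowed.update(profiles)
    return allowed


# Deliberately wrong data: "mig-7g.296gb" is a typo for "mig-7g.269gb".
profiles_with_typo = {
    "ml.p6-b300.48xlarge": ["mig-1g.34gb", "mig-7g.296gb"],
}
allowed = build_allowed_types(profiles_with_typo)

# The membership check still passes, because the typo flows into the
# allowed set through the same mapping it is being checked against.
tautology_holds = all(
    profile in allowed
    for profile in profiles_with_typo["ml.p6-b300.48xlarge"]
)
```

`tautology_holds` is True even with the bad entry, so only a comparison against an independently written expected list can catch such a typo.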
Eliminate the separate TestB300MigProfiles class. B300 tests now
extend the existing parametrized cases in TestAcceleratorPartitionUtil:

- B300 valid/invalid profile cases added to
  test_validate_accelerator_partition_fields
- B300 defaults with exact values added to
  test_accelerator_partition_defaults (instance-type-parametrized)
- test_instance_type_profiles_not_empty iterates all instance types
  in INSTANCE_TYPE_MIG_PROFILES as a data-driven guard

This pattern scales to future instance types without adding new
test classes.

Add B200 (Blackwell) test coverage alongside B300:
- 2 validation cases: valid profile accepted, cross-arch rejected
- 6 defaults cases with exact CPU/memory values

B200 validation tests will fail until aws#399 merges (fixes the
p6-b200.48xlarge → ml.p6-b200.48xlarge key). B200 defaults tests
pass immediately since INSTANCE_RESOURCES already uses the ml. key.

KeitaW changed the title from "Add unit tests for B300 MIG profile validation" to "Add unit tests for B200 and B300 MIG profile validation" on Mar 28, 2026

Replace 12 B200/B300-only rows with 1 representative row per
MIG-capable instance type (P4d, P4de, P5, P5e, P5en, B200, B300,
GB200, G7e). Each row uses the smallest profile at max instance
count, verifying that INSTANCE_RESOURCES has correct cpu/gpu/memory
values for the ratio calculation.
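The sketch below shows one way such a ratio calculation might work, to make clear why a single row per instance type exercises all three spec fields. The formula is an assumption, not the project's implementation: it gives each partition a share of CPU and memory proportional to its compute slices, with 7 slices per GPU as an assumed default. The instance name and spec values are hypothetical.

```python
# Hypothetical specs; the real INSTANCE_RESOURCES entries may differ.
INSTANCE_RESOURCES = {
    "ml.example.48xlarge": {"cpu": 192, "gpu": 8, "memory": 2048},
}


def partition_defaults(instance_type, slices, count, slices_per_gpu=7):
    """Return (cpu, memory) defaults for `count` partitions of `slices` each.

    Assumed formula: the share of the instance equals requested slices
    divided by total slices (gpu * slices_per_gpu).
    """
    spec = INSTANCE_RESOURCES[instance_type]
    total_slices = spec["gpu"] * slices_per_gpu
    requested = slices * count
    if requested > total_slices:
        raise ValueError("request exceeds the instance's MIG capacity")
    cpu = spec["cpu"] * requested // total_slices
    memory = spec["memory"] * requested // total_slices
    return cpu, memory


# Smallest profile (1 slice) at the maximum count (8 GPUs * 7 slices = 56).
full = partition_defaults("ml.example.48xlarge", slices=1, count=56)
half = partition_defaults("ml.example.48xlarge", slices=1, count=28)
```

Under this assumed formula a wrong cpu, gpu, or memory value changes the result, so asserting exact expected values for one row per instance type is enough to flag a typo in the specs.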
KeitaW changed the title from "Add unit tests for B200 and B300 MIG profile validation" to "Add MIG partition validation and defaults tests for all instance types" on Mar 28, 2026