
Require --instance-type when specifying accelerator resources (#317)#393

Merged
mollyheamazon merged 2 commits into aws:main from FarhanTejani:fix/accelerators-require-instance-type on Mar 25, 2026

Conversation

@FarhanTejani
Member

What's changing and why?

Closes #317 (the --node-count mutual exclusivity part was fixed in #383)

When --accelerators or --accelerators-limit is specified without --instance-type, the CLI silently emits nvidia.com/gpu: 0 in the k8s resource spec instead of the user's requested value. This happens because _set_default_accelerators_val needs the instance type to determine the accelerator key (nvidia.com/gpu vs aws.amazon.com/neuroncore); when it cannot, it silently returns None for both values.
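The failure mode above can be pictured with a minimal sketch. The function name and instance-type prefixes here are illustrative assumptions, not the CLI's actual implementation:

```python
# Hypothetical sketch of the key-selection behavior described above; the real
# _set_default_accelerators_val in the CLI may differ in detail.
def accelerator_key(instance_type):
    """Map an instance type to its k8s accelerator resource key, or None."""
    if instance_type is None:
        return None  # cannot decide -> downstream code fell back to '0'
    if instance_type.startswith(("ml.trn", "ml.inf")):
        return "aws.amazon.com/neuroncore"  # Trainium / Inferentia
    if instance_type.startswith(("ml.p", "ml.g")):
        return "nvidia.com/gpu"  # NVIDIA GPU instance families
    return None
```

With no instance type there is no key to attach the requested count to, which is why the old code path degraded to a hardcoded zero.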

This change adds an early validation that raises a clear error when accelerators are specified without --instance-type, and removes the dead code path in _get_limits that hardcoded nvidia.com/gpu: 0.
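The added check amounts to a fail-fast guard run before any resource spec is built. This is a sketch under assumed names, not the exact code merged here:

```python
def validate_accelerator_args(accelerators, accelerators_limit, instance_type):
    """Fail fast instead of silently emitting nvidia.com/gpu: '0'.

    Illustrative only; the merged change places an equivalent check early
    in the CLI's config validation, before the early-return path.
    """
    accelerators_requested = accelerators is not None or accelerators_limit is not None
    if accelerators_requested and not instance_type:
        raise ValueError(
            "--instance-type is required when specifying accelerator resources"
        )
```

A guard like this turns the silent-zero case into the explicit error shown in the Before/After section below.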

Before/After UX

Before:

$ hyp create hyp-pytorch-job --accelerators 4
# Silently generates: nvidia.com/gpu: '0'

After:

$ hyp create hyp-pytorch-job --accelerators 4
# Error: --instance-type is required when specifying accelerator resources

How was this change tested?

  • Unit tests pass

Are unit tests added?

Updated 3 tests that asserted the old behavior (nvidia.com/gpu: 0 on CPU-only / invalid instances)

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@FarhanTejani FarhanTejani requested a review from a team as a code owner March 19, 2026 00:20
@mollyheamazon mollyheamazon mentioned this pull request Mar 19, 2026
@mollyheamazon
Collaborator

Can you re-enable these unit tests as part of your fix:

class TestPyTorchJobConfigEFA(unittest.TestCase):
    """Test EFA resource allocation in PyTorchJobConfig"""

    # def test_single_node_no_efa(self):
    #     """Test that single-node jobs don't get EFA resources"""
    #     config = PyTorchJobConfig(
    #         job_name="test-single-node",
    #         image="pytorch:latest",
    #         node_count=1,
    #         accelerators=2,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should not have EFA resources
    #     self.assertNotIn("vpc.amazonaws.com/efa", container.resources.requests)
    #     self.assertNotIn("vpc.amazonaws.com/efa", container.resources.limits)
    #     # Should have GPU resources
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "2")

    # def test_multi_node_with_efa(self):
    #     """Test that multi-node jobs automatically get EFA resources"""
    #     config = PyTorchJobConfig(
    #         job_name="test-multi-node",
    #         image="pytorch:latest",
    #         node_count=4,
    #         accelerators=8,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should have EFA resources
    #     self.assertEqual(container.resources.requests["vpc.amazonaws.com/efa"], "1")
    #     self.assertEqual(container.resources.limits["vpc.amazonaws.com/efa"], "1")
    #     # Should also have GPU resources
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "8")

    # def test_multi_node_with_memory_and_cpu(self):
    #     """Test EFA with other resource types"""
    #     config = PyTorchJobConfig(
    #         job_name="test-multi-resources",
    #         image="pytorch:latest",
    #         node_count=2,
    #         accelerators=4,
    #         vcpu=16.0,
    #         memory=64.0,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should have all resources including EFA
    #     self.assertEqual(container.resources.requests["vpc.amazonaws.com/efa"], "1")
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "4")
    #     self.assertEqual(container.resources.requests["cpu"], "16.0")
    #     self.assertEqual(container.resources.requests["memory"], "64.0Gi")

    # def test_accelerators_without_instance_type(self):
    #     """Test that accelerators work without instance_type (fixes the main issue)"""
    #     config = PyTorchJobConfig(
    #         job_name="test-no-instance-type",
    #         image="pytorch:latest",
    #         accelerators=4
    #         # No instance_type specified
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should respect accelerators value even without instance_type
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "4")
    #     # Limits should default to "0" since accelerators_limit not specified
    #     self.assertEqual(container.resources.limits["nvidia.com/gpu"], "0")
They were commented out previously for a merge conflict resolution.

@FarhanTejani FarhanTejani force-pushed the fix/accelerators-require-instance-type branch from 21b5e32 to d03913f on March 24, 2026 01:42
@FarhanTejani
Member Author

Re-enabled all 4 tests with corrected assertions and refactored to use the full pipeline. Also moved the validation before the early return per the inline comment.

@mollyheamazon mollyheamazon merged commit 539ef69 into aws:main Mar 25, 2026
6 checks passed
@FarhanTejani FarhanTejani deleted the fix/accelerators-require-instance-type branch March 26, 2026 00:31


Development

Successfully merging this pull request may close these issues.

hyp-pytorch-job does not correctly use --accelerators/--accelerators-limits to specify resource requests and limits
