
Require --instance-type when specifying accelerator resources (#317)#393

Merged
mollyheamazon merged 2 commits into aws:main from FarhanTejani:fix/accelerators-require-instance-type on Mar 25, 2026

Conversation

@FarhanTejani
Member

What's changing and why?

Closes #317 (the --node-count mutual exclusivity part was fixed in #383)

When --accelerators or --accelerators-limit is specified without --instance-type, the CLI silently emits nvidia.com/gpu: 0 in the k8s resource spec instead of the user's requested value. This happens because _set_default_accelerators_val needs the instance type to determine the accelerator key (nvidia.com/gpu vs aws.amazon.com/neuroncore); when it cannot, it silently returns None for both values.
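The failure mode above can be pictured with a minimal sketch. The function name and instance-type prefixes here are illustrative assumptions, not the CLI's actual implementation:

```python
# Hypothetical sketch of the key-selection behavior described above; the real
# _set_default_accelerators_val in the CLI may differ in detail.
def accelerator_key(instance_type):
    """Map an instance type to its k8s accelerator resource key, or None."""
    if instance_type is None:
        return None  # cannot decide -> downstream code fell back to '0'
    if instance_type.startswith(("ml.trn", "ml.inf")):
        return "aws.amazon.com/neuroncore"  # Trainium / Inferentia
    if instance_type.startswith(("ml.p", "ml.g")):
        return "nvidia.com/gpu"  # NVIDIA GPU instance families
    return None
```

With no instance type there is no key to attach the requested count to, which is why the old code path degraded to a hardcoded zero.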

This change adds an early validation that raises a clear error when accelerators are specified without --instance-type, and removes the dead code path in _get_limits that hardcoded nvidia.com/gpu: 0.
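The added check amounts to a fail-fast guard run before any resource spec is built. This is a sketch under assumed names, not the exact code merged here:

```python
def validate_accelerator_args(accelerators, accelerators_limit, instance_type):
    """Fail fast instead of silently emitting nvidia.com/gpu: '0'.

    Illustrative only; the merged change places an equivalent check early
    in the CLI's config validation, before the early-return path.
    """
    accelerators_requested = accelerators is not None or accelerators_limit is not None
    if accelerators_requested and not instance_type:
        raise ValueError(
            "--instance-type is required when specifying accelerator resources"
        )
```

A guard like this turns the silent-zero case into the explicit error shown in the Before/After section below.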

Before/After UX

Before:

$ hyp create hyp-pytorch-job --accelerators 4
# Silently generates: nvidia.com/gpu: '0'

After:

$ hyp create hyp-pytorch-job --accelerators 4
# Error: --instance-type is required when specifying accelerator resources

How was this change tested?

  • Unit tests pass

Are unit tests added?

Updated 3 tests that asserted the old behavior (nvidia.com/gpu: 0 on CPU-only / invalid instances)

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@FarhanTejani FarhanTejani requested a review from a team as a code owner March 19, 2026 00:20
@mollyheamazon mollyheamazon mentioned this pull request Mar 19, 2026
@mollyheamazon
Collaborator

Can you re-enable these unit tests as part of your fix:

class TestPyTorchJobConfigEFA(unittest.TestCase):
    """Test EFA resource allocation in PyTorchJobConfig"""

    # def test_single_node_no_efa(self):
    #     """Test that single-node jobs don't get EFA resources"""
    #     config = PyTorchJobConfig(
    #         job_name="test-single-node",
    #         image="pytorch:latest",
    #         node_count=1,
    #         accelerators=2,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should not have EFA resources
    #     self.assertNotIn("vpc.amazonaws.com/efa", container.resources.requests)
    #     self.assertNotIn("vpc.amazonaws.com/efa", container.resources.limits)
    #     # Should have GPU resources
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "2")

    # def test_multi_node_with_efa(self):
    #     """Test that multi-node jobs automatically get EFA resources"""
    #     config = PyTorchJobConfig(
    #         job_name="test-multi-node",
    #         image="pytorch:latest",
    #         node_count=4,
    #         accelerators=8,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should have EFA resources
    #     self.assertEqual(container.resources.requests["vpc.amazonaws.com/efa"], "1")
    #     self.assertEqual(container.resources.limits["vpc.amazonaws.com/efa"], "1")
    #     # Should also have GPU resources
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "8")

    # def test_multi_node_with_memory_and_cpu(self):
    #     """Test EFA with other resource types"""
    #     config = PyTorchJobConfig(
    #         job_name="test-multi-resources",
    #         image="pytorch:latest",
    #         node_count=2,
    #         accelerators=4,
    #         vcpu=16.0,
    #         memory=64.0,
    #         instance_type="ml.p4d.24xlarge"
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should have all resources including EFA
    #     self.assertEqual(container.resources.requests["vpc.amazonaws.com/efa"], "1")
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "4")
    #     self.assertEqual(container.resources.requests["cpu"], "16.0")
    #     self.assertEqual(container.resources.requests["memory"], "64.0Gi")

    # def test_accelerators_without_instance_type(self):
    #     """Test that accelerators work without instance_type (fixes the main issue)"""
    #     config = PyTorchJobConfig(
    #         job_name="test-no-instance-type",
    #         image="pytorch:latest",
    #         accelerators=4
    #         # No instance_type specified
    #     )
    #     job = config.to_domain()
    #     container = job.replicaSpecs[0].template.spec.containers[0]
    #     # Should respect accelerators value even without instance_type
    #     self.assertEqual(container.resources.requests["nvidia.com/gpu"], "4")
    #     # Limits should default to "0" since accelerators_limit not specified
    #     self.assertEqual(container.resources.limits["nvidia.com/gpu"], "0")
They were commented out previously for a merge conflict resolution.

@FarhanTejani FarhanTejani force-pushed the fix/accelerators-require-instance-type branch from 21b5e32 to d03913f on March 24, 2026 01:42
@FarhanTejani
Member Author

Re-enabled all 4 tests with corrected assertions and refactored to use the full pipeline. Also moved the validation before the early return per the inline comment.

@mollyheamazon mollyheamazon merged commit 539ef69 into aws:main Mar 25, 2026
6 checks passed
@FarhanTejani FarhanTejani deleted the fix/accelerators-require-instance-type branch March 26, 2026 00:31


Development

Successfully merging this pull request may close these issues.

hyp-pytorch-job does not correctly use --accelerators/--accelerators-limits to specify resource requests and limits
