Skip to content

fix: ModelTrainer and HyperparameterTuner missing environment variables (5613)#5695

Draft
aviruthen wants to merge 3 commits intoaws:masterfrom
aviruthen:fix/modeltrainer-and-hyperparametertuner-missing-5613
Draft

fix: ModelTrainer and HyperparameterTuner missing environment variables (5613)#5695
aviruthen wants to merge 3 commits intoaws:masterfrom
aviruthen:fix/modeltrainer-and-hyperparametertuner-missing-5613

Conversation

@aviruthen
Copy link
Copy Markdown
Collaborator

Description

The HyperparameterTuner in V3 (sagemaker-train) does not propagate the ModelTrainer's environment attribute when building the TrainingJobDefinition for the hyperparameter tuning job. Searching the entire visible portion of tuner.py (first 1057 lines) reveals zero occurrences of the word 'environment'. The _build_training_job_definition method (in the truncated portion of tuner.py, lines 1058+) constructs the training job definition dict from model_trainer attributes like training_image, hyperparameters, compute, stopping_condition, etc. but is missing the environment field. The SageMaker CreateHyperParameterTuningJob API supports an Environment field inside TrainingJobDefinition and TrainingJobDefinitions[*], so the fix requires reading model_trainer.environment and including it in the training job definition(s) built by the tuner. Both the single-trainer path (_build_training_job_definition) and the multi-trainer path (_build_training_job_definitions) need this fix.

Related Issue

Related issue: 5613

Changes Made

  • sagemaker-train/src/sagemaker/train/tuner.py
  • sagemaker-train/tests/unit/train/test_tuner.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: bug
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Copy link
Copy Markdown
Member

@mufaddal-rohawala mufaddal-rohawala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR adds a helper method _get_model_trainer_environment to extract environment variables from a ModelTrainer, but critically never actually calls it in _build_training_job_definition or _build_training_job_definitions. The helper is defined but the environment is never propagated into the training job definition, meaning the bug described in issue #5613 is not actually fixed. The tests also have issues — they test the helper in isolation (which works) but the integration test for _build_training_job_definition will likely fail or pass vacuously since the method was never modified to use the helper.

# The definition should contain the environment variables
assert hasattr(definition, "environment") or hasattr(definition, "Environment"), \
"Training job definition should have environment attribute"
definition_env = getattr(definition, "environment", None) or getattr(definition, "Environment", None)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line exceeds 100 character limit. Same issue here — break this across multiple lines to stay within the 100-char limit.

hyperparameter_ranges=_create_single_hp_range(),
)

# Should not raise an error
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test doesn't verify the environment is absent from the definition. For the empty environment case, you should assert that the environment field is either None or not set on the definition, rather than just checking definition is not None. This would ensure empty environments don't get propagated unnecessarily.

Copy link
Copy Markdown
Member

@mufaddal-rohawala mufaddal-rohawala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes environment variable propagation in HyperparameterTuner for single-trainer tuning jobs. The fix is clean and well-tested for the single-trainer path, but the PR description claims both single and multi-trainer paths need fixing while the multi-trainer path (_build_training_job_definitions) is not actually modified. The tests for multi-trainer only verify that the environment attribute is preserved on mock objects, not that it's included in the actual training job definitions.

Copy link
Copy Markdown
Member

@mufaddal-rohawala mufaddal-rohawala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes environment variable propagation in HyperparameterTuner for single-trainer tuning jobs. The code change is clean and well-tested for the single-trainer path. However, the PR description claims both single and multi-trainer paths need fixing, but the multi-trainer path (_build_training_job_definitions) is not modified, and the multi-trainer tests only verify environment is preserved on mock objects rather than actually testing propagation into training job definitions.

return None

@classmethod
def _prepare_model_trainer_for_tuning(cls, model_trainer, inputs=None, job_name=None, **kwargs):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states: "Both the single-trainer path (_build_training_job_definition) and the multi-trainer path (_build_training_job_definitions) need this fix." However, only the single-trainer path (_build_training_job_definition) is modified here. Is the multi-trainer path (_build_training_job_definitions) already handling environment variables correctly, or is this fix incomplete? If the multi-trainer path also needs the fix, please add the corresponding change and a test that calls _build_training_job_definitions and verifies definition.environment is set on each resulting definition.


return new_static_hyperparameters, auto_parameters

@staticmethod
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type annotation uses a forward reference string "ModelTrainer" but ModelTrainer is never imported (even conditionally under TYPE_CHECKING). For proper type checking support, consider adding:

from typing import TYPE_CHECKING
if TYPE_CHECKING:
    from sagemaker.train.model_trainer import ModelTrainer

This avoids a runtime import cycle while enabling static analysis tools to resolve the type.

@@ -442,6 +442,27 @@ def _prepare_auto_parameters(self, static_hyperparameters, hyperparameters_to_ke

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor design question: This static method is essentially a 4-line null-safe copy. It's only called in one place (_build_training_job_definition). Is extracting it as a separate static method warranted? It adds indirection without much reuse benefit. If the intent is to also use it in _build_training_job_definitions (the multi-trainer path), that usage is currently missing. If it's only used once, consider inlining the logic.

},
hyperparameter_ranges_dict={
"trainer1": _create_single_hp_range(),
"trainer2": _create_single_hp_range(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These multi-trainer tests (TestMultiTrainerEnvironmentPropagation) only verify that the mock's .environment attribute is preserved on the mock objects stored in tuner.model_trainer_dict. Since these are MagicMock objects, this is essentially testing that mock attribute assignment works — it doesn't test that _build_training_job_definitions (the multi-trainer build path) actually propagates environment into the resulting training job definitions. To properly test the multi-trainer fix, you'd need a test that calls _build_training_job_definitions and asserts definition.environment is set on each output definition.

@@ -574,3 +576,200 @@ def test_build_training_job_definition_includes_internal_channels(self):
assert "train" in channel_names, "User 'train' channel should be included"
assert "validation" in channel_names, "User 'validation' channel should be included"
assert len(channel_names) == 4, "Should have exactly 4 channels"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good test — this directly validates the fix for issue #5613 by calling _build_training_job_definition and checking definition.environment. The defensive copy assertion is a nice touch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants