
HyperParameterTuner not fetching channels from ModelTrainer #5508

@CoolFish88

Description

PySDK Version

  • PySDK V3 (3.x)

Describe the bug
When submitting training jobs with ModelTrainer, the jobs terminate successfully. These jobs ingest custom training code supplied through a SourceCode object, as well as training, validation, and config data supplied through designated channels (using InputData objects). However, when the ModelTrainer is wrapped in a HyperparameterTuner, the tuning jobs fail.

Looking at the logs, I see:

  • Training jobs: SM_CHANNELS=['code', 'config', 'sm_drivers', 'train', 'validation']
  • Tuning jobs: SM_CHANNELS=["config","train","validation"]

The tuning job is using the default framework container behavior instead of the ModelTrainer's custom entrypoint that runs sm_train.sh. This is evidenced by the following run commands in the logs:

  • Training job: torchrun --nnodes=1 --nproc_per_node=4 train.py ....
  • Tuning job: /usr/local/bin/python train.py --learning_rate ....

The HyperparameterTuner._build_training_job_definition() method does not properly include the source code channels and container configuration from the ModelTrainer (illustrated below).
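
For reference, the two channel sets from the logs can be compared directly; the channels missing from the tuning jobs are exactly the ones that carry the SourceCode payload and the sm_train.sh driver. The snippet below is purely illustrative, with the channel names copied from the SM_CHANNELS log lines above and nothing else assumed:

# Illustration only: channel names taken from the SM_CHANNELS log lines above
model_trainer_channels = {"code", "config", "sm_drivers", "train", "validation"}
tuner_channels = {"config", "train", "validation"}

missing = model_trainer_channels - tuner_channels
print(missing)  # -> {'code', 'sm_drivers'}: the source code and driver channels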

To reproduce
Below is a minimal example:

import os
from pathlib import Path

from sagemaker.core.training.configs import SourceCode, Compute, InputData, OutputDataConfig
from sagemaker.core.shapes import RetryStrategy, MetricDefinition, StoppingCondition, CheckpointConfig
from sagemaker.train.distributed import Torchrun
from sagemaker.train.model_trainer import ModelTrainer, Mode

root = str(Path.cwd().parent)
source_dir = os.path.join(root, "sagemaker")
requirements = 'requirements.txt'
entry_script = "train.py"
source_code = SourceCode(source_dir=source_dir,
                         requirements=requirements,
                         entry_script=entry_script)

instance_type = "ml.g6e.12xlarge"
instance_count = 1
volume_size_in_gb = 200
compute = Compute(instance_type=instance_type,
                  instance_count=instance_count,
                  volume_size_in_gb=volume_size_in_gb)

distributed_strategy = Torchrun()

s3_input_path = "s3 path to training data"
training_data = InputData(channel_name='train',
                          data_source=s3_input_path)

s3_input_path = "s3 path to validation data"
validation_data = InputData(channel_name='validation',
                            data_source=s3_input_path)

# Path to S3 yaml file containing training hyperparameters
config_path = "s3 path to yaml config file"
config_data = InputData(channel_name='config', data_source=config_path)

s3_output_path = "s3 output path"
output = OutputDataConfig(s3_output_path=s3_output_path,
                          compression_type='NONE')

# Define checkpoint config
checkpoint_config = CheckpointConfig(s3_uri=s3_output_path, local_path="/opt/ml/checkpoints")

# Define retry strategy
retry_strategy = RetryStrategy(maximum_retry_attempts=3)

# Define tracking metrics
metric_definitions = [
    MetricDefinition(
        name="eval_acro_f1",
        regex="eval_macro_f1: (.*?)",
    )]

# Define stopping condition
num_hours = 3
stopping = StoppingCondition(max_runtime_in_seconds=3600 * num_hours)

job_name = "my_training_job"
training_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0"
training_mode = Mode.SAGEMAKER_TRAINING_JOB

# Assumes `role` (an execution role ARN) and `sagemaker_session` are already defined
model_trainer = ModelTrainer(
    training_mode=training_mode,
    sagemaker_session=sagemaker_session,
    role=role,
    training_image=training_image,
    base_job_name=job_name,
    source_code=source_code,
    compute=compute,
    distributed=distributed_strategy,
    output_data_config=output,
    checkpoint_config=checkpoint_config,
    stopping_condition=stopping,
    environment={"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"},
    hyperparameters={"learning_rate": 1e-5}
)
model_trainer.train(wait=False, logs=True, input_data_config=[training_data, validation_data, config_data])

from sagemaker.train.tuner import HyperparameterTuner
from sagemaker.core.parameter import ContinuousParameter

metric_definitions = [{
    "Name": "eval_loss",
    "Regex": "eval_loss: (.*?)"}]

learning_rate = ContinuousParameter(
    min_value=1e-5,
    max_value=5e-4,
    scaling_type='Logarithmic')

hyperparameter_ranges = {"learning_rate": learning_rate}

tuner = HyperparameterTuner(model_trainer=model_trainer,
                            objective_metric_name="eval_loss",
                            metric_definitions=metric_definitions,
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=3,
                            max_parallel_jobs=3)
tuner.tune(wait=False, inputs=[training_data, validation_data, config_data])
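
As a sanity check, the channel difference can also be confirmed from the job descriptions themselves. The sketch below uses boto3's describe_training_job; the job names are placeholders, not values from this report:

# Sketch: compare the InputDataConfig of a standalone ModelTrainer job with a tuning child job
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

plain = sm.describe_training_job(TrainingJobName="<model-trainer-job-name>")
print(sorted(c["ChannelName"] for c in plain["InputDataConfig"]))
# observed: ['code', 'config', 'sm_drivers', 'train', 'validation']

tuned = sm.describe_training_job(TrainingJobName="<tuning-child-job-name>")
print(sorted(c["ChannelName"] for c in tuned["InputDataConfig"]))
# observed: ['config', 'train', 'validation']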

Expected behavior
Tuning jobs complete without errors, using the same source code channels and custom entrypoint as the standalone ModelTrainer jobs.

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 3.3.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version:
  • Python version: 3.13
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0
