Add aws batch #5409

aviruthen · 2025-12-11T21:12:59Z

Issue #, if available:

Description of changes:

Introducing aws_batch into PySDK V3, placed in sagemaker.train.
Supports queueing in front of training jobs with interfaces TrainingQueue and TrainingQueuedJob.
Includes unit tests, integration tests, and an example notebook.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


aviruthen · 2025-12-12T19:17:19Z

Unit and integ tests are passing, have to rerun for recent merge

davlind-amzn · 2025-12-16T02:30:36Z

...ples/training-examples/aws_batch/sm-training-queues_getting_started_with_model_trainer.ipynb

+    "## Create Sample Resources\n",
+    "The diagram belows shows the Batch resources we'll create for this example.\n",
+    "\n",
+    "![The Resources to Create](batch_getting_started_resources.png \"Example Job Queue and Service Environment Resources\")\n",


I think we're missing the png here from this PR

Good catch, added the png back

davlind-amzn · 2025-12-16T02:42:34Z

...ples/training-examples/aws_batch/sm-training-queues_getting_started_with_model_trainer.ipynb

+    "        time.sleep(5)\n",
+    "\n",
+    "    # Print training job logs\n",
+    "    # job.get_estimator().logs()\n",


We can remove this Estimator comment reference

Oops good catch!

davlind-amzn · 2025-12-16T02:48:56Z

sagemaker-train/src/sagemaker/train/aws_batch/training_queued_job.py

+    # Step 3: Create ModelTrainer
+    model_trainer = ModelTrainer(**init_params)
+
+    # Step 4: Set _latest_training_job (key insight!)


nit: key insight?

Sorry, this was a note for myself (the ModelTrainer parameter that I could attach the training job to)!


davlind-amzn

Ran some testing of these changes on my side, everything looks good to me!

mufaddal-rohawala · 2025-12-17T20:46:57Z

sagemaker-core/src/sagemaker/core/helper/session_helper.py

        return self.boto_session.resource("iam").Role(role).arn
+
+
+    def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):


Wanted to check on what this utility is used for?

This is to maintain parity with the Estimator::logs method, which tails the logs being emitted from an active training job. Example of usage here, down under the "Monitor Job Status" section: https://github.com/aws/amazon-sagemaker-examples/blob/default/%20%20%20%20%20%20build_and_train_models/sm-training-queues/sm-training-queues_getting_started_with_estimator.ipynb

Check out this line in the example notebook in this PR: model_trainer.sagemaker_session.logs_for_job(model_trainer._latest_training_job.training_job_name, wait=True)

It seems like we are replacing:
v2 logic: job.get_estimator().logs()

v3 logic: model_trainer.sagemaker_session.logs_for_job(model_trainer._latest_training_job.training_job_name, wait=True)

The V3 experience looks pretty bad to me, and since this is not an existing V2 parity issue, can we think about tying the get-logs functionality to the training queue job or model trainer directly? We can also think this through and not make a decision for the first release. Let's use a utility method within the example notebooks (replicate the _logs_for_job functionality within the notebook as a standalone utility)? This would not break customers using the notebook, and we can design the right offering for log exposure later and then update the notebook.

Synced with Mufi. Resolution: let's make logs_for_job a notebook method so that we are not introducing a new external method (and we can take more time on this in the future). We are okay with allowing the user to get the training job name from _latest_training_job since this is an internal parameter (not an internal class or method). Will be implementing this.
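A minimal sketch of what such a notebook-level log utility could look like, assuming the standard CloudWatch log group for SageMaker training jobs and a CloudWatch Logs client passed in by the caller (for example `boto3.client("logs")`). The function name `tail_training_job_logs` and the `max_polls` and `sleep` parameters are hypothetical, chosen for illustration; this is not the helper that was added to the notebook in this PR.

```python
import time

# Log group that SageMaker training jobs write to in CloudWatch Logs.
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def tail_training_job_logs(logs_client, job_name, poll=10, max_polls=1, sleep=time.sleep):
    """Print log lines for a training job and return them as a list.

    logs_client: a CloudWatch Logs client, e.g. boto3.client("logs").
    max_polls:   hypothetical stop condition; a real notebook helper would
                 more likely stop once the job reaches a terminal status.
    """
    next_tokens = {}  # log stream name -> forward token from the last read
    lines = []
    polls = 0
    while True:
        # Training job streams are named "<job-name>/<container-suffix>".
        streams = logs_client.describe_log_streams(
            logGroupName=LOG_GROUP, logStreamNamePrefix=job_name + "/"
        )["logStreams"]
        for stream in streams:
            name = stream["logStreamName"]
            kwargs = {
                "logGroupName": LOG_GROUP,
                "logStreamName": name,
                "startFromHead": True,
            }
            if name in next_tokens:
                # Resume where the previous poll stopped.
                kwargs["nextToken"] = next_tokens[name]
            response = logs_client.get_log_events(**kwargs)
            for event in response["events"]:
                print(event["message"])
                lines.append(event["message"])
            next_tokens[name] = response["nextForwardToken"]
        polls += 1
        if polls >= max_polls:
            return lines
        sleep(poll)
```

In a notebook this might be called as `tail_training_job_logs(boto3.client("logs"), model_trainer._latest_training_job.training_job_name, max_polls=30)`, which keeps the log tailing out of the SDK's public surface while the longer-term interface is decided.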

mufaddal-rohawala · 2025-12-17T20:48:54Z

sagemaker-train/src/sagemaker/train/aws_batch/batch_api_helper.py

+from sagemaker.train.aws_batch.boto_client import get_batch_boto_client
+
+
+def submit_service_job(


If these are helper functions, can they be internal? That helps in making breaking changes later, since they would not be user facing.

Regarding comments about moving methods to internal: I can definitely see your point here! My take is that these methods are not marked as internal in V2 and may already be in use by customers. Changing the method signature makes migration from V2 to V3 more difficult. Also, the benefit of marking these functions as internal is minimal, as python doesn't prevent these methods from being directly called. So overall I would say it's more worthwhile to leave these signatures as-is.

My take is that these methods are not marked as internal in V2 and may already be in use by customers. Changing the method signature makes migration from V2 to V3 more difficult.

We are also locking in the experience for a new major version and hampering our ability to make breaking changes to these offerings; this is our only opportunity to make these internal. Have we marketed these utilities within notebooks? If customers encounter errors while migrating from v2 to v3, the change needed on their side would be minimal.
I would leave this decision to you, David, if you feel these signatures will not need any breaking changes, since these are AWS Batch specific.

Alright I'm good if we want to make these utility methods internal. Effort to migrate is minimal, and it does seem unlikely that customers are currently using these. Thanks for your input Mufi!

Made the batch api helper methods internal

mufaddal-rohawala · 2025-12-17T20:49:43Z

sagemaker-train/src/sagemaker/train/aws_batch/boto_client.py

+import boto3
+
+
+def get_batch_boto_client(


make it internal?

This is used in the notebook, should stay external

mufaddal-rohawala · 2025-12-17T20:52:40Z

sagemaker-train/src/sagemaker/train/utils.py

    ]
+
+
+def get_training_job_name_from_training_job_arn(training_job_arn: str) -> str:


Made internal!
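For context, a helper like this only needs to pull the resource name out of a SageMaker training job ARN, which has the form `arn:aws:sagemaker:<region>:<account-id>:training-job/<name>`. The sketch below is an illustration under that assumption; the function name `training_job_name_from_arn` is hypothetical and the body is not the implementation from this PR.

```python
def training_job_name_from_arn(training_job_arn: str) -> str:
    """Return the training job name embedded in a SageMaker training job ARN."""
    # Split once on the resource-type marker; everything after it is the name.
    prefix, marker, name = training_job_arn.partition(":training-job/")
    if not prefix.startswith("arn:") or not marker or not name:
        raise ValueError(f"Not a training job ARN: {training_job_arn!r}")
    return name
```

Raising `ValueError` on malformed input is a design choice made for this sketch; an internal helper could equally return `None` and let the caller decide.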

mufaddal-rohawala · 2025-12-17T20:54:14Z

sagemaker-train/pyproject.toml

    "jinja2>=3.0,<4.0",
    "sagemaker-mlflow>=0.0.1,<1.0.0",
    "mlflow>=3.0.0,<4.0.0",
+    "nest_asyncio>=1.5.0",


is this a required dependency? how much of additional size implications are added to the sagemaker-train package?

This is used in the result() method for TrainingQueuedJob. It's a really small package; however, I'm aligned with removing it from pyproject.toml and having it be a dependency users can pip install (we have something similar where there are many different ML frameworks for inference but we don't require all of them in the pyproject.toml file since users can pick and choose which ML frameworks they want to use)

Removed nest_asyncio as a dependency in pyproject.toml
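The optional-dependency approach described in this thread is commonly implemented with a lazy import that fails with an actionable message. The sketch below shows that general pattern only; the helper name `require_optional` is hypothetical and is not taken from this PR.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency on first use, or explain how to install it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from exc
```

A method such as `result()` could then call `require_optional("nest_asyncio", "TrainingQueuedJob.result()")` at the point of use, so users who never call it do not need the package installed.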


aviruthen added 3 commits December 11, 2025 11:48

Add aws batch implementation (works with example notebook)

1d6c559

fixing unit tests and adding integration test

2cdb2d4

Merge branch 'master' into add-aws-batch

3438a87

aviruthen temporarily deployed to auto-approve December 11, 2025 21:13 — with GitHub Actions Inactive

aviruthen added 2 commits December 11, 2025 13:30

add example notebook

a60459c

Merge branch 'add-aws-batch' of https://github.com/aviruthen/sagemaker-python-sdk into add-aws-batch

149f02c

aviruthen temporarily deployed to auto-approve December 11, 2025 21:32 — with GitHub Actions Inactive

Adding missing dependencies for aws_batch

41f71d6

aviruthen temporarily deployed to auto-approve December 11, 2025 22:11 — with GitHub Actions Inactive

Fixing indentation bug in source code

9326821

aviruthen temporarily deployed to auto-approve December 12, 2025 17:43 — with GitHub Actions Inactive

comment out delete resources in example notebook

711b5a3

aviruthen temporarily deployed to auto-approve December 12, 2025 17:49 — with GitHub Actions Inactive

Merge branch 'master' into add-aws-batch

1e5a534

aviruthen temporarily deployed to auto-approve December 12, 2025 19:17 — with GitHub Actions Inactive

davlind-amzn reviewed Dec 16, 2025


aviruthen added 3 commits December 16, 2025 12:44

Add notebook png and remove extraneous comments

3cd0846

Merge branch 'add-aws-batch' of https://github.com/aviruthen/sagemaker-python-sdk into add-aws-batch

d41bd5a

Merge branch 'master' into add-aws-batch

486377b

aviruthen temporarily deployed to auto-approve December 16, 2025 20:45 — with GitHub Actions Inactive

aviruthen added 2 commits December 16, 2025 12:46

Add in png correctly

5e5d0cc

Merge branch 'add-aws-batch' of https://github.com/aviruthen/sagemaker-python-sdk into add-aws-batch

5030010

aviruthen temporarily deployed to auto-approve December 16, 2025 20:47 — with GitHub Actions Inactive

davlind-amzn approved these changes Dec 17, 2025


mufaddal-rohawala requested changes Dec 17, 2025


aviruthen added 2 commits December 18, 2025 08:54

Removing logs_from_job from session_helper

8194f1b

Merge branch 'master' into add-aws-batch

f96e58a

aviruthen temporarily deployed to auto-approve December 18, 2025 16:54 — with GitHub Actions Inactive

aviruthen added 2 commits December 18, 2025 08:56

Adding helpers for logging

8121734

Merge branch 'add-aws-batch' of https://github.com/aviruthen/sagemaker-python-sdk into add-aws-batch

3fc1eff

aviruthen temporarily deployed to auto-approve December 18, 2025 16:57 — with GitHub Actions Inactive

Make helper methods internal

9fd23a9

aviruthen temporarily deployed to auto-approve December 18, 2025 19:56 — with GitHub Actions Inactive

Merge branch 'master' into add-aws-batch

1dffe75

aviruthen temporarily deployed to auto-approve December 18, 2025 19:57 — with GitHub Actions Inactive

aviruthen added 2 commits December 18, 2025 13:47

Adding back nest asyncio dependency

b7f4cfb

Merge branch 'add-aws-batch' of https://github.com/aviruthen/sagemaker-python-sdk into add-aws-batch

848dff9

aviruthen temporarily deployed to auto-approve December 18, 2025 21:47 — with GitHub Actions Inactive

davlind-amzn approved these changes Dec 18, 2025


Merge branch 'master' into add-aws-batch

c3dab73

aviruthen temporarily deployed to auto-approve December 18, 2025 22:52 — with GitHub Actions Inactive

