-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add aws batch #5409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add aws batch #5409
Conversation
| "## Create Sample Resources\n", | ||
| "The diagram belows shows the Batch resources we'll create for this example.\n", | ||
| "\n", | ||
| "\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're missing the png here from this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, added the png back
| " time.sleep(5)\n", | ||
| "\n", | ||
| " # Print training job logs\n", | ||
| " # job.get_estimator().logs()\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this Estimator comment reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops good catch!
| # Step 3: Create ModelTrainer | ||
| model_trainer = ModelTrainer(**init_params) | ||
|
|
||
| # Step 4: Set _latest_training_job (key insight!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: key insight?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this was a note for myself (the ModelTrainer parameter that I could attach the training job to)!
davlind-amzn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran some testing of these changes on my side, everything looks good to me!
| return self.boto_session.resource("iam").Role(role).arn | ||
|
|
||
|
|
||
| def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to check on what this utility is used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to maintain parity with the Estimator::logs method, which tails the logs being emitted from an active training job. Example of usage here, down under the "Monitor Job Status" section: https://github.com/aws/amazon-sagemaker-examples/blob/default/%20%20%20%20%20%20build_and_train_models/sm-training-queues/sm-training-queues_getting_started_with_estimator.ipynb
Check out this line in the example notebook in this PR: model_trainer.sagemaker_session.logs_for_job(model_trainer._latest_training_job.training_job_name, wait=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seem like we are replacing:
v2 logic: job.get_estimator().logs()
v3 logic: model_trainer.sagemaker_session.logs_for_job(model_trainer._latest_training_job.training_job_name, wait=True)
The V3 experience looks pretty bad to me and since this is not an existing v2 parity issue - Can we think about tie-ing the get logs functionality to training queue job or model trainer directly? We can also think through on this and not make a decision on this for 1st release. Lets use utility method within example notebooks (replicate _logs_for_job funtionality within notebook as a standalone utility)? This would not break customers using the notebook and we can make a right offering for logs exposure and update the notebook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with Mufi. Resolution: let's make logs_for_job a notebook method so that we are not introducing a new external method (and we can take more time on this for the future). We are okay with allowing the user to get the training job name from _lastest_training_job since this is an internal parameter (not an internal class or method). Will be implementing this
| from sagemaker.train.aws_batch.boto_client import get_batch_boto_client | ||
|
|
||
|
|
||
| def submit_service_job( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if these are helper functions can this be internal? this helps in maing breaking changes if not user facing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding comments about moving methods to internal: I can definitely see your point here! My take is that these methods are not marked as internal in V2 and may already be in use by customers. Changing the method signature makes migration from V2 to V3 more difficult. Also, the benefit of marking these functions as internal is minimal, as python doesn't prevent these methods from being directly called. So overall I would say it's more worthwhile to leave these signatures as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My take is that these methods are not marked as internal in V2 and may already be in use by customers. Changing the method signature makes migration from V2 to V3 more difficult.
We are also locking in the experience for a new major version and hampering our ability to make breaking changes to these offerings, this is our only opportunity to make these internal. Have we marketed these utilities within notebooks? If customer encounters errors while migrating from v2 -> v3, the migration shift would be minimal.
I would leave this decision to you David, if you feel these signatures will not be needing any breaking changes, since these are AWS Batch specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I'm good if we want to make these utility methods internal. Effort to migrate is minimal, and it does seem unlikely that customers are currently using these. Thanks for your input Mufi!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the batch api helper methods internal
| import boto3 | ||
|
|
||
|
|
||
| def get_batch_boto_client( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it internal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used in the notebook, should stay external
| ] | ||
|
|
||
|
|
||
| def get_training_job_name_from_training_job_arn(training_job_arn: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made internal!
| "jinja2>=3.0,<4.0", | ||
| "sagemaker-mlflow>=0.0.1,<1.0.0", | ||
| "mlflow>=3.0.0,<4.0.0", | ||
| "nest_asyncio>=1.5.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a required dependency? how much of additional size implications are added to the sagemaker-train package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used in the result() method for TrainingQueuedJob. It's a really small package; however, I'm aligned with removing it from pyproject.toml and having it be a dependency users can pip install (we have something similar where there are many different ML frameworks for inference but we don't require all of them in the pyproject.toml file since users can pick and choose which ML frameworks they want to use)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed nest_asyncio as a dependency in pyproject.toml

Issue #, if available:
Description of changes:
Introducing aws_batch into PySDK V3, placed in sagemaker.train.
Supports queueing in front of training jobs with interfaces TrainingQueue and TrainingQueuedJob.
Includes unit tests, integration tests, and an example notebook
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.