Commit 2a7513a

Documented alternative way to attach SSH Helper through requirements.txt
1 parent 14df77d commit 2a7513a

File tree: 1 file changed (+35, −18 lines)


README.md

Lines changed: 35 additions & 18 deletions
@@ -73,7 +73,7 @@ pip install sagemaker-ssh-helper
### Step 2: Modify your start training job code

1. Add an import for `SSHEstimatorWrapper`.
2. Add a `dependencies` parameter to the Estimator object. Alternatively, add `sagemaker_ssh_helper` into `requirements.txt`.
3. Add an `SSHEstimatorWrapper.create(estimator, ...)` call before calling `fit()` and add SageMaker SSH Helper as `dependencies`.
4. Add a call to `ssh_wrapper.get_instance_ids()` to get the SSM instance id(s). We'll use this as the target to connect to later on.

@@ -82,26 +82,33 @@
For example:

```python
import logging

from sagemaker.pytorch import PyTorch
from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper  # <--NEW--

role = ...

estimator = PyTorch(
    entry_point='train.py',
    source_dir='source_dir/training/',
    dependencies=[SSHEstimatorWrapper.dependency_dir()],  # <--NEW
    # (alternatively, add sagemaker_ssh_helper into requirements.txt
    # inside source dir) --
    role=role,
    framework_version='1.9.1',
    py_version='py38',
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

ssh_wrapper = SSHEstimatorWrapper.create(estimator, connection_wait_time_seconds=600)  # <--NEW--

estimator.fit(wait=False)

instance_ids = ssh_wrapper.get_instance_ids()  # <--NEW--

logging.info(f"To connect over SSM run: aws ssm start-session --target {instance_ids[0]}")
logging.info(f"To connect over SSH run: sm-local-ssh-training connect {ssh_wrapper.latest_training_job_name()}")
```

*Note:* `connection_wait_time_seconds` is the amount of time the SSH helper will wait inside SageMaker before it continues normal execution. It's useful for training jobs, when you want to connect before training starts.
@@ -138,7 +145,7 @@ and will appear in the job's CloudWatch log like this:
Successfully registered the instance with AWS SSM using Managed instance-id: mi-1234567890abcdef0

To fetch the instance IDs in an automated way, call the Python method of `ssh_wrapper`, as mentioned in the previous step:
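If you only have the raw CloudWatch log text at hand, the id can also be recovered from the log line shown above (the `ssh_wrapper` method is the preferred route). A minimal sketch, assuming the `mi-` managed-instance id format and a hypothetical helper name:

```python
import re

# Hypothetical helper: extracts SSM managed-instance ids (mi-...) from raw
# CloudWatch log text. Prefer ssh_wrapper.get_instance_ids() when available.
def parse_instance_ids(log_text: str) -> list[str]:
    return re.findall(r"Managed instance-id: (mi-[0-9a-f]{17})", log_text)

log = "Successfully registered the instance with AWS SSM using Managed instance-id: mi-1234567890abcdef0"
print(parse_instance_ids(log))  # ['mi-1234567890abcdef0']
```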
@@ -218,21 +225,29 @@ Adding SageMaker SSH Helper to inference endpoint is similar to training with th

1. Wrap your model into `SSHModelWrapper` before calling `deploy()` and add SSH Helper to `dependencies`:

```python
from sagemaker import Predictor
from sagemaker_ssh_helper.wrapper import SSHModelWrapper  # <--NEW--

estimator = ...
...
endpoint_name = ...

model = estimator.create_model(
    entry_point='inference_ssh.py',
    source_dir='source_dir/inference/',
    dependencies=[SSHModelWrapper.dependency_dir()]  # <--NEW
    # (alternatively, add sagemaker_ssh_helper into requirements.txt
    # inside source dir) --
)

ssh_wrapper = SSHModelWrapper.create(model, connection_wait_time_seconds=0)  # <--NEW--

predictor: Predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    endpoint_name=endpoint_name,
    wait=True
)

predicted_value = predictor.predict(data=...)
```
@@ -561,7 +576,7 @@ Note, that if you stop the waiting loop, SageMaker will run your training script
But there's a useful trick: submit a dummy script `train_placeholder.py` with an infinite loop, and while this loop is running, you can run your real training script again and again with the remote interpreter.
Setting the `max_run` parameter of the estimator is highly recommended in this case.

The dummy script may look like this:
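One possible shape for such a placeholder is sketched below. This is a minimal sketch, assuming a simple time-based cutoff (so the loop cannot outlive a deadline, much as `max_run` bounds the job); the function name and parameters are illustrative, not part of the library:

```python
import time

# Hypothetical train_placeholder.py sketch: keeps the job alive so a remote
# interpreter can attach, but stops itself after a deadline (mirroring max_run).
def wait_loop(max_seconds: float, poll_interval: float = 10.0) -> int:
    """Sleep in short intervals until the deadline passes; return the poll count."""
    deadline = time.monotonic() + max_seconds
    polls = 0
    while time.monotonic() < deadline:
        time.sleep(poll_interval)
        polls += 1
    return polls

# In a real placeholder job you might call, e.g., wait_loop(max_seconds=60 * 60).
```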
@@ -581,6 +596,8 @@ The method `is_last_session_timeout()` will help to prevent unused resources and
Keep in mind that SSM sessions will [terminate automatically due to user inactivity](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-timeout.html), but SSH sessions will keep running until either a user terminates them or a network timeout occurs (e.g., when the local machine hibernates).

Consider also sending e-mail notifications to users of long-running jobs, so they don't forget to shut down unused resources.

Also make sure you're aware of the [SageMaker Managed Warm Pools](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html) feature, which is also helpful when you need to rerun your code multiple times.