How to Work With PyCharm and AWS SageMaker Using AWS SageMaker Python SDK
In this post, we discuss how to use AWS SageMaker services locally in PyCharm through the AWS SageMaker Python SDK.
Amazon SageMaker is a fully managed ML service that helps organizations put their ML ideas into production faster and improves the productivity of data science teams. Teams can quickly train models, tune them for better results, and deploy them to production-ready environments.
On the other hand, many developers and data scientists want the full advantage of SageMaker Studio's services while still using their preferred local IDE, such as PyCharm or Visual Studio Code, for Python development. They prefer to combine the capabilities of the two to optimize their productivity and the results of their projects.
In our last post on how to use PyCharm and AWS SageMaker, we discussed how to work with the two by making an SSH connection using the Remote Development Gateway plugin.
This post shows how you can use AWS SageMaker to manage your training jobs and experiments on AWS using the AWS SageMaker Python SDK. We use PyCharm as the local IDE, but you can use your preferred IDE with no code changes.
AWS SageMaker Python SDK
AWS SageMaker Python SDK provides several high-level abstractions for working with AWS SageMaker.
These are:
Estimators: They help in encapsulating training on AWS SageMaker. There is also an Estimator that runs SageMaker-compatible custom Docker containers, enabling you to run your own ML algorithms by using the SageMaker Python SDK.
Models: They encapsulate built ML models. AWS SageMaker Python SDK provides built-in algorithms with pre-trained models from popular open-source model hubs, such as TensorFlow Hub, PyTorch Hub, and Hugging Face. You can deploy these pre-trained models as-is or first fine-tune them on a custom dataset and then deploy them to a SageMaker endpoint for inference.
Predictors: They provide real-time inference and transformation using Python data types against a SageMaker endpoint.
Session: Provides a collection of methods for working with SageMaker resources.
Transformers: They encapsulate batch transform jobs for inference on AWS SageMaker.
Processors: They encapsulate running processing jobs for data processing on AWS SageMaker.
AWS SageMaker Python SDK supports local mode, which allows you to create estimators and deploy them to your local environment. This is a great way to test your scripts before running them in SageMaker-managed training or hosting environments.
With SageMaker local mode, the managed frameworks (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) and images you supply yourself are downloaded to your local computer and show up in Docker. This Docker image is the same as in the SageMaker-managed training or hosting environments, so you can debug your code locally and faster.
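As a rough illustration of local mode, the sketch below builds a scikit-learn estimator whose instance type is "local", so training runs in a Docker container on your own machine. The script name train.py, the framework version "1.2-1", and the helper names are assumptions made for this example, not values taken from this post; adjust them to your project.

```python
def local_estimator_kwargs(role_arn, entry_point="train.py"):
    """Build keyword arguments for a scikit-learn estimator in local mode."""
    return {
        "entry_point": entry_point,    # hypothetical training script name
        "role": role_arn,              # execution role created during setup
        "instance_type": "local",      # run in a Docker container on this machine
        "instance_count": 1,
        "framework_version": "1.2-1",  # assumed scikit-learn container version
    }


def train_locally(role_arn, data_dir):
    """Run one training job in local mode (needs Docker and the sagemaker package)."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.sklearn import SKLearn

    estimator = SKLearn(**local_estimator_kwargs(role_arn))
    # The "training" channel is exposed to the script as SM_CHANNEL_TRAINING.
    estimator.fit({"training": f"file://{data_dir}"})
    return estimator
```

Changing instance_type to a managed instance type such as "ml.m5.large" and pointing the channel at an S3 URI should be enough to move the same job to SageMaker-managed training.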
Setup
To get started, complete the following steps:
In your AWS account, create a new user with programmatic access that enables access key ID and secret access key for the AWS CLI.
Then attach the AmazonSageMakerFullAccess and AmazonS3FullAccess permissions to the newly created user and, if possible, limit them to specific AWS S3 buckets.
After attaching the permissions, create an execution role with the SageMaker permissions above. SageMaker will use this role to perform operations on your behalf on the AWS hardware that is managed by SageMaker.
Next, install the AWS CLI on your local computer and run a quick configuration with aws configure:
$ aws configure
AWS Access Key ID [None]: AKIAI*********EXAMPLE
AWS Secret Access Key [None]: wJal********EXAMPLEKEY
Default region name [None]: eu-west-1
Default output format [None]: json
For more information on the configuration, see Configuring the AWS CLI.
After completing the steps above:
Install Docker if you have not yet installed it on your local computer.
Make sure that you have all the required Python libraries to run your code locally.
Then add the SageMaker Python SDK to your local libraries. You can run pip install sagemaker directly, or create a virtual environment with venv for your project and then install SageMaker within the virtual environment.
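Before moving on, it can help to confirm that the tools from the setup steps are actually visible to your Python interpreter. The following is a minimal sketch using only the standard library; the function name check_local_setup is made up for this example, and it only checks that the tools can be found, not that credentials or permissions are valid.

```python
import importlib.util
import shutil


def check_local_setup():
    """Report which local prerequisites appear to be available."""
    return {
        # AWS CLI and Docker are looked up on the PATH.
        "aws_cli": shutil.which("aws") is not None,
        "docker": shutil.which("docker") is not None,
        # The SDK is looked up in the active (virtual) environment.
        "sagemaker_sdk": importlib.util.find_spec("sagemaker") is not None,
    }


for name, found in check_local_setup().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If sagemaker_sdk is reported as missing inside PyCharm, the project interpreter is probably not the virtual environment in which you installed the SDK.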
Now that your environment is ready for developing and training ML algorithms with AWS SageMaker from your local IDE, we discuss a few important things to keep in mind while developing.
Making your code SageMaker compatible
You must follow certain rules to make your code compatible with SageMaker, e.g., for reading input data and for writing output models and other artifacts.
The script will be very similar to the one you might run outside SageMaker, but you can access useful properties about the training environment through various environment variables.
The following are some important environment variables that SageMaker uses to manage the infrastructure.
For the input data location, SM_CHANNEL_{channel_name}:
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_CHANNEL_VALIDATION=/opt/ml/input/data/validation
SM_CHANNEL_TESTING=/opt/ml/input/data/testing
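The channel naming pattern above can be wrapped in a small helper, sketched here under the assumption that you keep local data in a data/ folder next to your script. The function name channel_dir and the local folder layout are illustrative choices, not part of the SageMaker SDK.

```python
import os


def channel_dir(channel_name, local_fallback="data"):
    """Return the input directory for a channel, inside or outside SageMaker."""
    # SageMaker exposes each channel as SM_CHANNEL_<NAME IN UPPER CASE>.
    env_name = f"SM_CHANNEL_{channel_name.upper()}"
    # Outside SageMaker the variable is unset, so fall back to a local folder.
    local_dir = os.path.join(os.getcwd(), local_fallback, channel_name)
    return os.environ.get(env_name, local_dir)
```

Inside a training container, channel_dir("training") resolves to the value of SM_CHANNEL_TRAINING; on your laptop it resolves to a path under the current working directory.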
The model output location, where the model artifact is saved:
SM_MODEL_DIR=/opt/ml/model
The output location for non-model training artifacts:
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
The code below falls back to local directories when the SageMaker environment variables are not set and exposes the locations as command-line arguments, so you can still run the script outside SageMaker:
import argparse
import os

# SageMaker default SM_MODEL_DIR=/opt/ml/model
if os.getenv("SM_MODEL_DIR") is None:
    os.environ["SM_MODEL_DIR"] = os.getcwd() + '/model'

# SageMaker default SM_OUTPUT_DATA_DIR=/opt/ml/output/data
if os.getenv("SM_OUTPUT_DATA_DIR") is None:
    os.environ["SM_OUTPUT_DATA_DIR"] = os.getcwd() + '/output'

# SageMaker default SM_CHANNEL_TRAINING=/opt/ml/input/data/training
if os.getenv("SM_CHANNEL_TRAINING") is None:
    os.environ["SM_CHANNEL_TRAINING"] = os.getcwd() + '/data'

if __name__ == '__main__':
    # Each location defaults to the matching environment variable set above.
    parser = argparse.ArgumentParser()
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--output_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    args = parser.parse_args()
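To show how the three locations are typically used once they are parsed, here is a deliberately tiny training function. It is a sketch under stated assumptions: the training folder holds headerless, single-column numeric CSV files, and the "model" is just the mean of those values stored as JSON. The function name train and the file names model.json and summary.json are invented for this example.

```python
import csv
import json
import os


def train(train_dir, model_dir, output_dir):
    """Read CSV files from train_dir, then write a model and a summary."""
    values = []
    for file_name in sorted(os.listdir(train_dir)):
        if file_name.endswith(".csv"):
            with open(os.path.join(train_dir, file_name), newline="") as handle:
                # Assumes one numeric value in the first column and no header row.
                values.extend(float(row[0]) for row in csv.reader(handle) if row)
    if not values:
        raise ValueError(f"no training rows found in {train_dir}")

    os.makedirs(model_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)

    # Model artifact: goes to the model directory (SM_MODEL_DIR on SageMaker).
    model_path = os.path.join(model_dir, "model.json")
    with open(model_path, "w") as handle:
        json.dump({"mean": sum(values) / len(values)}, handle)

    # Non-model artifact: goes to the output directory (SM_OUTPUT_DATA_DIR).
    with open(os.path.join(output_dir, "summary.json"), "w") as handle:
        json.dump({"rows": len(values)}, handle)
    return model_path
```

In a script like the one above, you would call train(args.train, args.model_dir, args.output_dir) after parsing the arguments, and the same call works both locally and inside a SageMaker training container.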
AWS SageMaker Experiments for Organizing, Tracking, and Comparing ML Training Runs
Amazon SageMaker Experiments helps you group, organize, and track your ML iterations when you have many experiments to test with different preprocessing configurations, different hyperparameters, or even different ML algorithms.
AWS SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. It is integrated with Studio, providing a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best-performing models.
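As an illustration of logging one iteration from a local script, the sketch below assumes a recent version of the SageMaker Python SDK in which the sagemaker.experiments.run.Run class is available; older versions used a separate experiments package with a different API. The helper names log_iteration and track_with_sagemaker are made up for this example.

```python
def log_iteration(run, params, metrics):
    """Record the parameters and metrics of one iteration on a run-like object."""
    run.log_parameters(params)
    for name, value in metrics.items():
        run.log_metric(name=name, value=value)


def track_with_sagemaker(experiment_name, run_name, params, metrics):
    """Log one iteration to SageMaker Experiments (needs AWS credentials)."""
    # Imported lazily: requires `pip install sagemaker` in a recent version.
    from sagemaker.experiments.run import Run

    with Run(experiment_name=experiment_name, run_name=run_name) as run:
        log_iteration(run, params, metrics)
```

Keeping the logging logic in log_iteration means it can be exercised locally with a stand-in object, while track_with_sagemaker is the only part that talks to AWS.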
Conclusion
This blog post shows you how to use the AWS SageMaker Python SDK with your preferred local IDE (in this case, PyCharm) to take full advantage of AWS SageMaker to develop, train, and test ML algorithms.
We also introduce AWS SageMaker Experiments which helps in the organization and tracking of different experiments in SageMaker Studio.