How to Work with Custom S3 Buckets and AWS SageMaker
Introduction
AWS SageMaker is a managed service in the AWS public cloud. It’s used to create, train, and deploy machine learning models, but it’s also great for exploratory data analysis and prototyping. One of the advantages of working with AWS SageMaker is that it integrates with Amazon S3, giving you a convenient way to store your data privately in an S3 bucket that can hold any type of file: CSV, pickle, ZIP, photos, videos, and more.
If you’re working with private data, special care must be taken when accessing it for model training. Downloading the entire dataset to your laptop may be against your company’s policy, or simply imprudent. Imagine having your laptop lost or stolen, knowing that it contains sensitive data. As a side note, this is another reason why you should always use disk encryption.
Data hosted in the cloud may also be too large to fit on your personal computer’s disk, so storing it in S3 buckets is a good solution: you can access the data directly while working in AWS SageMaker.
In this blog, we discuss how you can work with custom S3 buckets and AWS SageMaker.
Working with custom S3 buckets and AWS SageMaker can be a breeze with Saturn Cloud. Join for free and seamlessly manage your AWS resources and accelerate your data science projects.
Creating an S3 Bucket
AWS SageMaker typically uses S3 buckets to store your data, model artifacts, and other files that need persistent storage.
To begin, we will create an S3 bucket:
1. Navigate to your AWS Management Console, choose **Services**, then select **S3** under Storage.
2. Choose **Create bucket**.
3. You will be prompted to provide a globally unique **name** for your bucket.
4. Select your **Region** from the dropdown. The bucket must be in the same AWS Region as your SageMaker notebook instance.
5. Choose **Create** in the lower left of the dialog, without selecting a bucket to copy settings from.
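If you prefer to create the bucket programmatically rather than through the console, a minimal sketch with boto3 looks like this (the bucket name and region below are placeholders; substitute your own):

```python
import boto3

bucket_name = 'my-sagemaker-data-bucket'  # placeholder; must be globally unique
region = 'eu-central-1'                   # placeholder; match your notebook's region

s3 = boto3.client('s3', region_name=region)

# Outside us-east-1, the region must be passed as a LocationConstraint;
# in us-east-1, omit CreateBucketConfiguration entirely
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={'LocationConstraint': region},
)
```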
Reading data from the S3 bucket to AWS SageMaker
AWS SageMaker and S3 are two separate AWS services, so for one service to perform actions on the other, the appropriate permissions must be in place. To work with S3 buckets from SageMaker, the IAM role you use needs a policy that grants S3 access, such as the AmazonS3FullAccess managed policy.
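From inside a SageMaker notebook, you can check which IAM role is in use with the SageMaker Python SDK, which comes preinstalled on notebook instances; a minimal sketch:

```python
import sagemaker

# Returns the ARN of the IAM role attached to the current notebook instance;
# this is the role that needs the S3 permissions described above
role = sagemaker.get_execution_role()
print(role)
```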
There are multiple ways that you can work with the data from your custom S3 bucket while developing in AWS SageMaker.
1. Using Boto3 to create a connection
Boto3 is a Python library that eases the integration of your Python application, library, or script with AWS services such as S3, EC2, SageMaker, and many more.
Using the Boto3 library, we can connect the AWS SageMaker notebook to the custom S3 bucket.
In the example below, we show how you are able to create a connection and then list all of the files contained within a specific subfolder on an S3 bucket.
```python
import boto3

bucket = 's3-bucket-name'        # replace with your bucket name
subfolder = 'path/to/subfolder'  # replace with your prefix

# Create an S3 client and list every object under the prefix
conn = boto3.client('s3')
contents = conn.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
for f in contents:
    print(f['Key'])
```
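Note that list_objects returns at most 1,000 keys per call. For larger buckets, a paginator iterates through all result pages for you; a short sketch reusing conn, bucket, and subfolder from above:

```python
# A paginator handles the continuation tokens transparently,
# so listings larger than 1,000 keys come back complete
paginator = conn.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=subfolder):
    for f in page.get('Contents', []):
        print(f['Key'])
```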
2. Loading pickled data directly from custom S3 buckets
The Python pickle library is essential for serialization: converting Python data structures into a byte stream that can be stored or transmitted over a network, and then loaded later.
Using the pickle and boto3 libraries together, you can work with data stored in your custom S3 buckets from within the AWS SageMaker instance.
In the example below, we use boto3 to get the specific file object (the pickle) on S3 that we want to load.
Notice that the boto3 client returns a response containing a data stream, which we must read and then deserialize with pickle into the data object.
```python
import pickle

import boto3

my_bucket = 's3-bucket-name'          # replace with your bucket name
my_file = 'full/path/to/file.pickle'  # replace with your object key

# Fetch the object; the response body is a streaming object
s3client = boto3.client('s3')
response = s3client.get_object(Bucket=my_bucket, Key=my_file)
body = response['Body']

# Read the stream and deserialize the bytes back into a Python object
data = pickle.loads(body.read())
```
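Writing a Python object back to the bucket works the same way in reverse: serialize it with pickle.dumps and upload the bytes with put_object. A sketch reusing the client from above (the result object and key are placeholders):

```python
# Serialize any picklable Python object and upload it to the bucket;
# the object and key below are placeholders for illustration
results = {'accuracy': 0.97}
s3client.put_object(
    Bucket=my_bucket,
    Key='full/path/to/results.pickle',
    Body=pickle.dumps(results),
)
```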
3. Mounting an S3 bucket to SageMaker using s3fs
It is possible to mount an S3 bucket onto an AWS SageMaker instance as a file system using s3fs.
s3fs is a FUSE filesystem application backed by Amazon S3 that allows us to mount an S3 bucket as a local file system on AWS SageMaker.
After mounting the bucket with s3fs, we can use basic Unix commands such as cp, mv, and ls, just as we would on a locally attached disk.
Follow these steps to mount your S3 bucket to the SageMaker instance.
Step 1: While in your AWS SageMaker instance, run the following command to update your system.
```bash
sudo yum update
```
Step 2: Install the required packages.
First, we will install the dependencies required by FUSE and s3fs using the following command.
```bash
sudo yum install automake fuse fuse-devel gcc-c++ git libcurl-devel libxml2-devel make openssl-devel
```
Step 3: We will now install s3fs-fuse using the following command.
```bash
sudo amazon-linux-extras install epel
sudo yum install s3fs-fuse
```
Step 4: Using the command below, check where the s3fs binary is installed.
```bash
which s3fs
```
Step 5: Now, to be able to mount, we need to create an IAM role with access to the S3 bucket.
Using the policy document below, we grant read-and-write access to the bucket:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets"
            ],
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>/*"
            ]
        }
    ]
}
```
Remember to replace <bucket_name> with the name of your custom S3 bucket.
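If you prefer to script this step, the same policy can be created with boto3; a minimal sketch, assuming the JSON above has been saved locally with <bucket_name> filled in (the file and policy names here are hypothetical):

```python
import boto3

iam = boto3.client('iam')

# 's3fs-policy.json' is a hypothetical local copy of the policy document above
with open('s3fs-policy.json') as f:
    policy_document = f.read()

# 's3fs-access' is a hypothetical policy name; choose your own
iam.create_policy(
    PolicyName='s3fs-access',
    PolicyDocument=policy_document,
)
```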
After creating the IAM role, we will attach it to our AWS SageMaker instance.
Step 6: Now, we need to create a directory in our AWS SageMaker instance to mount our S3 bucket into.
```bash
sudo mkdir -p /s3fs-data
```
Step 7: Now, we can safely mount our S3 bucket into AWS SageMaker using the following command.
```bash
s3fs -o iam_role="<IAM_role_name>" -o url="https://s3-eu-central-1.amazonaws.com" -o endpoint=eu-central-1 -o dbglevel=info -o curldbg -o allow_other -o use_cache=/tmp <bucket_name> /s3fs-data
```
Note that the url and endpoint options above point to eu-central-1; adjust them to the region where your bucket lives.
We have now successfully mounted our custom S3 bucket to AWS SageMaker; you can verify the mount with the command below.
```bash
df -h
```
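Once the bucket is mounted, its objects behave like ordinary files, so standard Python I/O works against the mount point. A minimal sketch, assuming the bucket contains a CSV file (the path below is a placeholder):

```python
import pandas as pd

# The mounted bucket behaves like a local directory, so pandas
# can read objects as regular files; the path is a placeholder
df = pd.read_csv('/s3fs-data/path/to/data.csv')
print(df.head())
```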
Working with custom S3 buckets and AWS SageMaker can be a breeze with Saturn Cloud. Join for free and seamlessly manage your AWS resources and accelerate your data science projects.
Conclusion
When working with AWS SageMaker, there are many situations where you will need an S3 bucket to store data and other files, e.g., model artifacts.
In this article, we have discussed how to work with custom S3 buckets with AWS SageMaker.
You may also be interested in:
How to Work With Pycharm and AWS SageMaker Using AWS SageMaker Python SDK
How to securely connect to AWS SageMaker using SSH through a Bastion Host
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Join today and get 150 hours of free compute per month.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without having to switch tools.