Using AWS SageMaker Input Modes: Amazon S3, EFS, or FSx

In this blog post, we discuss how to use AWS SageMaker Input modes for Amazon S3 and file systems in Amazon EFS and Amazon FSx for Lustre.

Introduction

One persistent challenge in ML training is reading training data easily, flexibly, and with high performance. AWS SageMaker simplifies data ingestion through a set of effective, high-throughput mechanisms known as data sources and their corresponding input modes. These enable you to quickly shard data among GPUs and instances for data parallelism, read with high performance, decouple training code from the actual data source, mount file systems automatically, and shuffle data at the start of each epoch.

When you train in AWS SageMaker, your training script can access datasets stored on Amazon S3, FSx for Lustre, or Amazon EFS as if they were available on a local file system, through a POSIX-compliant file system interface.

If you are using Amazon S3 as a data source, you can choose between File mode, FastFile mode, and Pipe mode. With Amazon EFS or FSx for Lustre as a data source, SageMaker mounts the file system before your training script starts.

Data sources and input modes

Pipe mode is one of the main features offered by the AWS SageMaker training environment. It streams data directly from Amazon S3 storage.

One advantage of streaming data directly is that you can reduce the size of the AWS EBS volumes used by the training instance.

In this mode, data is pre-fetched from an Amazon S3 data source with high concurrency and throughput, then streamed into Unix named pipes, also called First-In-First-Out (FIFO) pipes.

Pipe mode enables data to be fed directly to the algorithm as needed, which means training can start as soon as the pipe is opened and no local storage is required. This in turn removes any limitation on the size of your data: you can store all your data on Amazon S3 and not worry about local storage constraints.
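
To make this concrete, here is a minimal sketch of how a training script might consume a Pipe mode channel. It assumes a channel named train and relies on the convention that SageMaker exposes each Pipe mode channel as a FIFO at /opt/ml/input/data/<channel>_<epoch>, with the epoch suffix incrementing each time the pipe is reopened:

channel_name = "train"  # assumes a channel named "train" was configured on the estimator

for epoch in range(3):  # one pass over the stream per epoch
    fifo_path = f"/opt/ml/input/data/{channel_name}_{epoch}"
    with open(fifo_path, "rb") as fifo:
        while True:
            chunk = fifo.read(1024 * 1024)  # stream the data in 1 MB chunks
            if not chunk:
                break  # pipe closed: this epoch's data is fully consumed
            # ... deserialize records from `chunk` and feed them to the algorithm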

File mode is the default input mode when you don't specify an input mode option. In this mode, data is presented as a file system view inside the training container.

When using file mode, AWS SageMaker downloads the data from where it is stored into a local directory in the Docker container; training starts only after the whole dataset has been downloaded. The download speed depends on the size of the data, the average size of the files, and the number of files.

Since file mode downloads all the data first, the training instance must have enough storage space to fit the entire dataset.

To configure data for file mode, you provide either an Amazon S3 prefix (use it when all your data is located under a common S3 prefix), a manifest file, or an augmented manifest file.
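
As a rough sketch with the SageMaker Python SDK (the bucket, prefixes, and attribute names below are hypothetical), each of the three options maps to the s3_data and s3_data_type arguments of TrainingInput:

from sagemaker.inputs import TrainingInput

# S3 prefix: every object under the prefix becomes part of the channel.
prefix_input = TrainingInput(s3_data="s3://my_bucket/my_data/train")  # s3_data_type defaults to 'S3Prefix'

# Manifest file: s3_data points at a manifest that lists the individual objects.
manifest_input = TrainingInput(
    s3_data="s3://my_bucket/manifests/train.manifest",  # hypothetical manifest location
    s3_data_type="ManifestFile",
)

# Augmented manifest: each JSON line carries both the data reference and its labels
# (typically used together with Pipe mode).
augmented_input = TrainingInput(
    s3_data="s3://my_bucket/manifests/train.augmented.manifest",
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source-ref", "class"],  # hypothetical attribute names
)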

Fast file mode is another input mode that provides file system access to data stored in Amazon S3 while leveraging the performance advantage of pipe mode.

Fast file mode makes S3 objects accessible through a POSIX-compliant file system interface, just as if the files were present on your training instance's local drive. As your training script consumes data, it streams S3 content on demand. As a result, your dataset no longer needs to fit entirely within the storage space of the training instance, and training can begin right away without waiting for the data to download. Fast file mode presently supports only S3 prefixes (it does not support manifest or augmented manifest files), and it is compatible with SageMaker local mode.
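
Switching a channel to fast file mode is a one-line change in the SDK. A minimal sketch, assuming a hypothetical bucket and prefix:

from sagemaker.inputs import TrainingInput

# Same channel definition as file mode; only the input_mode changes.
train_input = TrainingInput(
    s3_data="s3://my_bucket/my_data/train",  # hypothetical S3 prefix
    input_mode="FastFile",
)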

Amazon Elastic File System (EFS) is fully elastic, serverless file storage. To use Amazon EFS as your data source, the data must already be present in Amazon EFS before training. AWS SageMaker mounts the specified Amazon EFS file system to the training instance and then launches your training script. To use Amazon EFS, your training job must be connected to a VPC.
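
As a minimal sketch with the SageMaker Python SDK (the file system ID and directory path below are placeholders), an Amazon EFS channel is configured with a FileSystemInput object:

from sagemaker.inputs import FileSystemInput

train_efs = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",  # placeholder: your EFS file system ID
    file_system_type="EFS",
    directory_path="/train",                # placeholder: dataset path within the file system
    file_system_access_mode="ro",           # read-only access is sufficient for training data
)

The resulting train_efs object is then passed to estimator.fit, and the estimator itself must be configured with the VPC subnets and security groups that can reach the file system.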

Amazon FSx for Lustre - With low-latency file retrieval, FSx for Lustre can scale to hundreds of gigabytes per second of throughput and millions of IOPS. SageMaker mounts the FSx for Lustre file system to the training instance file system and then launches your training script. The mounting time is mostly independent of the size of the dataset stored in FSx for Lustre.

Choosing Data Source and Input Mode

When choosing the best data source for your SageMaker training job, consider a number of workload characteristics, such as the size of your dataset, the file format, the average file size, the training duration, the data loader's read pattern, and how fast your algorithm can consume the training data.

Below, we discuss when to use each data source and input mode.

  • Fast file mode - When you have large datasets made up of larger files (more than 50 MB per file), fast file mode is the best option to try first. It is more straightforward to use than FSx for Lustre because it doesn't require creating a file system or connecting to a VPC. Fast file mode is ideal for large file containers (more than 150 MB per file).

  • When you have a dataset that is small (less than 50-100 GB) but made up of many small files (less than 50 MB per file), file mode's download overhead grows, since each file has to be downloaded individually from AWS S3 storage to the training instance volume. The solution is to serialize the small files into fewer larger file containers (such as 150 MB per file); a rough packing sketch follows this list.

  • Amazon FSx for Lustre - When your dataset is too large for file mode, has many small files that you can't easily serialize, or uses a random read access pattern, FSx for Lustre is a good option to consider. Its file system scales to hundreds of GB/s of throughput and millions of IOPS, which is great when you have many small files. However, the overhead of configuring and initializing the FSx for Lustre file system can introduce a cold start problem.

  • File mode is best for small datasets. If your dataset is stored in AWS S3 storage and its overall volume is relatively small (for example, less than 50-100 GB), try file mode.

The download overhead of a 50 GB dataset varies with the number of files it is split into; for instance, downloading a dataset chunked into 100 MB shards takes roughly 5 minutes. Whether this startup overhead is acceptable depends mainly on the duration of your training job, because a longer training phase makes the download phase proportionally shorter.
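
As mentioned above, here is a rough sketch of serializing many small files into ~150 MB tar shards before uploading them to S3. It is not official tooling; the directory paths and shard size are assumptions to adapt:

import os
import tarfile

SHARD_SIZE = 150 * 1024 * 1024  # target shard size in bytes (~150 MB)

def pack_shards(src_dir, dst_dir):
    """Walk src_dir and pack its files into sequentially numbered tar shards."""
    os.makedirs(dst_dir, exist_ok=True)
    shard_idx, current_size, tar = 0, 0, None
    for root, _, files in os.walk(src_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            # Start a new shard when the current one would exceed the target size.
            if tar is None or current_size + size > SHARD_SIZE:
                if tar is not None:
                    tar.close()
                tar = tarfile.open(
                    os.path.join(dst_dir, f"shard-{shard_idx:05d}.tar"), "w"
                )
                shard_idx += 1
                current_size = 0
            tar.add(path, arcname=os.path.relpath(path, src_dir))
            current_size += size
    if tar is not None:
        tar.close()

pack_shards("local_small_files/", "local_shards/")  # then upload the shards to S3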

Choosing Data Input Mode Using the AWS SageMaker Python SDK

The AWS SageMaker Python SDK provides several high-level abstractions for working with Amazon SageMaker. One of these is the Estimator class, which helps launch training jobs in AWS SageMaker.

You can specify one of the data input modes while configuring the SageMaker Estimator class or the Estimator.fit method.

Using the Estimator.fit method to specify the input mode.


from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri='your-training-image-uri',   # placeholder: your training container image
    role='your-sagemaker-execution-role',  # placeholder: IAM role for the training job
    instance_count=1,
    instance_type='ml.m5.xlarge',          # placeholder: choose an instance type
    checkpoint_s3_uri='s3://my_bucket/checkpoint_destination/',
    output_path='s3://my_bucket/output_path/',
    base_job_name='job_name',
)

# Run the training job
estimator.fit(
    inputs=TrainingInput(
        s3_data="s3://my_bucket/my_data/train",
        input_mode='File'  # Available options: File | Pipe | FastFile
    )
)

Using the Estimator class to specify the input mode.


from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri='your-training-image-uri',   # placeholder: your training container image
    role='your-sagemaker-execution-role',  # placeholder: IAM role for the training job
    instance_count=1,
    instance_type='ml.m5.xlarge',          # placeholder: choose an instance type
    checkpoint_s3_uri='s3://my_bucket/checkpoint_destination/',
    output_path='s3://my_bucket/output_path/',
    base_job_name='job_name',
    input_mode='File',  # Available options: File | Pipe | FastFile
)

# Run the training job
estimator.fit(
    inputs=TrainingInput(s3_data="s3://my_bucket/my_data/train")
)

Configuring Data Input Channel to Use Amazon FSx for Lustre

Amazon FSx for Lustre is usually linked to an existing S3 bucket. When you use Amazon FSx for Lustre as your data source, you get higher throughput and faster training by reducing the time for data loading.

To configure our data channel to use Amazon FSx for Lustre, we first need to link it with an existing AWS S3 bucket and upload our training datasets, as outlined in the following steps:

  1. First, prepare your dataset and upload it to Amazon S3 storage.

  2. Next, create an FSx for Lustre file system linked with the Amazon S3 bucket containing the training data. You can do this through the AWS Management Console or the AWS Command Line Interface (AWS CLI).

  • Using the Amazon FSx CLI command create-data-repository-association, you can create a data repository association as follows:
aws fsx create-data-repository-association \
      --file-system-id fs-0123456789abcdef0 \
      --file-system-path /ns1/path1/ \
      --data-repository-path s3://mybucket/myprefix/ \
      --s3   "AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}"

  3. After the above steps are completed, you can check the configuration information by running the following commands:
aws fsx describe-file-systems && \
aws fsx describe-data-repository-associations

Now, after syncing AWS S3 storage with Amazon FSx for Lustre and uploading the training data to AWS S3, it is time to set the Amazon FSx file system path as the data input channel for AWS SageMaker training.

Before doing the setup, make sure that the IAM role used to launch the SageMaker training job has access to Amazon FSx.

The following steps set up the Amazon FSx file system as the data source for AWS SageMaker training jobs using the AWS SageMaker Python SDK:

  1. Using the following code, configure a FileSystemInput class object:
from sagemaker.inputs import FileSystemInput

train_fs = FileSystemInput(
    file_system_id="",             # your FSx for Lustre file system ID, e.g. "fs-0123456789abcdef0"
    file_system_type="FSxLustre",
    directory_path="",             # the FSx path, starting with the MountName (see note below)
    file_system_access_mode="ro",  # read-only access is sufficient for training data
)
  • When you specify directory_path, make sure that you provide the Amazon FSx file system path starting with MountName.
  2. The next step is to configure the AWS SageMaker estimator with the VPC configuration used for the Amazon FSx file system:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="your-training-image-uri",       # placeholder: your training container image
    instance_count=1,
    instance_type="ml.m5.xlarge",              # placeholder: choose an instance type
    role="your-iam-role-with-access-to-your-fsx",
    subnets=["subnet-id"],                     # should be the same subnet used for Amazon FSx
    security_group_ids=["security-group-id"],  # must be a list of security group IDs
)

  3. The last step is to launch the training job by running the estimator.fit method with the Amazon FSx file system input:
estimator.fit(train_fs)

Conclusion

The data input mode integrations with Amazon S3, Amazon EFS, and FSx for Lustre are the recommended ways to configure data sources for training. You can strategically improve data loading performance using the SageMaker managed storage options and input modes, but you are not strictly limited to them: you can write your own data-reading logic directly in your training container, as long as you specify paths that SageMaker can recognize.

In this blog post, we have discussed the different data sources and input modes and how to use them when training with AWS SageMaker.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Join today and get 150 hours of free compute per month.