Managing Cloud Cost for ML Teams
Managing cloud costs is critical because ML is expensive. An AWS instance with H100 GPUs costs about $98 per hour. If you forget to turn that machine off over the weekend, you've just wasted roughly $5,000 ($98/hour for ~50 idle hours) - that's more than rent for most people. And that doesn't even include the data storage and networking costs associated with these workloads. Your cloud bill will increase every single day as your company collects more data and as you hire more data scientists and machine learning engineers.
How do data scientists provision cloud resources?
Most companies adopt one of two patterns.
Cloud is owned by IT
In many companies, the cloud is owned by IT. Data scientists who need resources submit tickets, and IT resolves the tickets in a timely manner (where "timely" might mean two weeks from now). IT is responsible for provisioning and allocating resources, IT is responsible for monitoring and cost management, and IT is responsible for security and compliance. This is good for the things IT cares about - cost management and security - but it's bad for everything else. Here is a hypothetical story of how this plays out. It's made up, but it reflects patterns we've seen across our customer base.
- Day 1: Data Scientist Sarah, armed with too much coffee and a dream of world-changing algorithms, files a ticket for an EC2 instance. She’s hopeful it’ll be resolved soon because “how hard can it be?”. She’s spun up EC2 instances before.
- Day 5: It’s finally Friday, and IT, which has been busy patching servers for the latest critical CVEs, finally gets around to the ticket. They deliver the EC2 instance just in time for the weekend. Sarah celebrates, already planning the groundbreaking machine learning model… on Monday.
- Day 8: Sarah logs into her new EC2 instance. Uh oh! `could not connect to server: Connection refused`. It looks like Sarah can't connect to the internal corporate database. She files another ticket.
- Day 9: IT investigates Sarah's new ticket.
- Day 11: IT realizes the transit gateway in the VPC hasn't been configured properly. They fix it. Sarah begins training her model!
- Day 13: `OSError: [Errno 28] No space left on device`. Sarah files a ticket with IT to expand her EBS volume.
- Day 17: The volume is expanded, and Sarah can finally finish training her model.
This pattern is ok for managing spend, but terrible for ML team productivity.
Data science provisions their own cloud resources
In smaller companies, or in companies where IT has gotten tired of dealing with Sarah, the power to manage cloud resources is turned over to individual data scientists. This is good because it eliminates the IT bottleneck on provisioning data science resources, but it pushes security, cost management, and any extra DevOps work onto the data scientist. Data scientists are capable of Stack Overflowing their way to success, but you should expect less attention to detail on the things they don't need to get their job done - such as security and cost management.
Managing cloud cost for ML teams is hard because the responsibility for saving money is pushed to the individual. If you’ve ever shared a not-so-clean house with a bunch of students where everyone is supposed to be responsible for cleaning up after themselves, you know exactly what I mean.
General principles around cloud cost management
The following are table stakes in the world of cost management, but worth reviewing.
- Resource Optimization
  - Right-Sizing: Regularly assess and adjust the size of your instances to match your workloads. Avoid over-provisioning.
  - Spot Instances: Use spot instances for non-critical workloads to benefit from significant cost savings.
- Instance Lifecycle Management
  - Auto-scaling: Implement auto-scaling policies to adjust resources based on demand automatically.
  - Start/Stop Schedules: Automate the starting and stopping of instances based on working hours or usage patterns to avoid paying for idle resources (see the sketch after this list).
- Storage Optimization
  - Tiered Storage: Use appropriate storage tiers based on access frequency. For example, use cheaper storage for infrequently accessed data.
  - Data Retention Policies: Implement data retention policies to regularly clean up outdated or unused data.
- Monitoring and Alerts
  - Cost Monitoring: Use cloud provider tools (like AWS Cost Explorer or Google Cloud Cost Management) to monitor spending and identify cost drivers.
  - Set Budgets and Alerts: Set cost budgets and configure alerts to notify you of unexpected spikes in usage.
- Efficient Data Processing
  - Data Pipelines: Optimize data pipelines to reduce unnecessary data transfer and processing.
  - Serverless Architectures: Use serverless computing for event-driven workloads to pay only for what you use.
- Cost Governance
  - Tagging and Cost Allocation: Implement a tagging strategy to allocate costs accurately to different projects, teams, or departments.
  - Regular Reviews: Conduct regular cost reviews and audits to ensure adherence to cost management practices.
- Use Reserved Instances
  - Reserved Instances: For steady-state or predictable workloads, consider reserved instances or savings plans to save on long-term costs.
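As a concrete example of start/stop scheduling, here is a minimal sketch of a script - run from, say, a nightly cron job or Lambda - that stops any running EC2 instance tagged `auto-stop=true`. The tag key/value and region are illustrative assumptions; adapt them to your own tagging scheme.

```python
import boto3

# Assumed convention: instances opted in to scheduling carry the tag
# auto-stop=true. Adjust the tag key/value and region to your setup.
ec2 = boto3.client("ec2", region_name="us-east-1")

def stop_tagged_instances():
    # Find running instances that opted in to auto-stop
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:auto-stop", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        # Stopping (unlike terminating) is reversible; EBS volumes persist
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids

if __name__ == "__main__":
    stopped = stop_tagged_instances()
    print(f"Stopped instances: {stopped}")
```

Schedule something like this for the end of the workday, and the $5,000 idle-weekend scenario from the introduction becomes much less likely.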
These are the guidelines any consultant you hire to lower your cloud bill will recommend. They're sound advice, but they don't always deliver actionable results for ML teams.
Resource Optimization in DS/ML
Spot Instances
Spot instances are great, but they tend to be harder to leverage in DS/ML for two reasons:
- DS/ML workloads often use GPUs. Spot instances deliver discounted prices when there is excess compute capacity. These days, we have GPU shortages and as a result getting spot instances for GPU workloads is much harder.
- Much of DS/ML is interactive, long running, and stateful. This means an instance that can disappear on you at any moment can be pretty inconvenient.
We still recommend Spot instances, especially if you have CPU focused workloads where you won’t be too annoyed if the machine is shut down. Spot interruptions are also not that common for CPU workloads.
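If you do run longer jobs on spot, it's worth handling interruptions gracefully: AWS posts a two-minute warning to the instance metadata service before reclaiming a spot instance. Here is a minimal sketch that polls for that notice and checkpoints before shutdown. The `save_checkpoint` function is a hypothetical stand-in for whatever state-saving your job does, and IMDSv1 is shown for brevity.

```python
import time
import requests

# EC2 instance metadata service; this path returns 404 until AWS
# schedules the instance for interruption (IMDSv1 shown for brevity).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        return requests.get(SPOT_ACTION_URL, timeout=1).status_code == 200
    except requests.exceptions.RequestException:
        return False

def save_checkpoint():
    # Stand-in: persist model weights / optimizer state to S3 or EBS here
    print("Checkpointing before spot interruption...")

while True:
    if interruption_pending():
        save_checkpoint()
        break
    time.sleep(5)  # AWS gives ~2 minutes of warning; poll well within that
```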
Hardware utilization
How do you know if your DS/ML team is right-sizing their instances? You have to look at hardware metrics to understand how well the hardware is actually being utilized.
- Most cloud platforms have ways to monitor CPU/RAM utilization; for example, AWS CloudWatch can collect these metrics. Most DS/ML workloads are limited by the memory of the instance, not the compute, and Python workloads are notoriously single threaded. It is very common to see DS/ML hardware sit close to the memory limit of the machine while leveraging only a single CPU. This is hard to address without asking data scientists to figure out how to parallelize their workloads. If this were my team, I would focus on making sure they generally consume at least 70% of the RAM on the machine; if not, I would ask them to see if they could scale down to smaller machines (see the spot-check sketch after this list).
- Other than CPU/RAM, the other resource that goes under-utilized is GPUs. This is particularly painful because GPUs are so expensive. A common pattern is data scientists spinning up multi-GPU machines (with 4-8 GPUs) and then spending significant time developing code on those machines before they are ready to run their training or inference workloads. If this were my team, I would look for users who are not leveraging all of their GPUs and ask them to develop on single-GPU machines, which they can swap for multi-GPU machines when they are ready to train.
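Here is a minimal sketch of the kind of spot check described above, using `psutil` for CPU/RAM and `nvidia-smi` for per-GPU utilization. The thresholds are illustrative assumptions, and the GPU query requires NVIDIA drivers on the machine.

```python
import subprocess
import psutil

# CPU and RAM: psutil samples utilization over a 1-second window
per_cpu = psutil.cpu_percent(interval=1, percpu=True)
ram_pct = psutil.virtual_memory().percent

print(f"RAM used: {ram_pct:.0f}%")
print(f"Busy CPUs (>50%): {sum(1 for c in per_cpu if c > 50)} of {len(per_cpu)}")

# GPU: ask nvidia-smi for per-device utilization (requires NVIDIA drivers)
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True, text=True,
)
gpu_utils = [int(line) for line in result.stdout.splitlines() if line.strip()]
idle_gpus = sum(1 for u in gpu_utils if u < 5)
print(f"GPUs: {len(gpu_utils)}, idle (<5% util): {idle_gpus}")
```

Near-100% RAM with a single busy CPU is the classic single-threaded Python profile; a multi-GPU box reporting idle GPUs is a candidate for downsizing.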
Cost optimizations for DS/ML Teams
Research is a mix of development and production
One thing that makes DS/ML infrastructure difficult to manage is that it is a mix of development and production. By comparison, development infrastructure is easy to manage because you can generally destroy it and rebuild it without much consequence. Contrast that with research infrastructure: even if you could destroy and re-create all cloud objects using something like Terraform, all un-pushed code would suddenly be deleted from everybody's Jupyter notebook servers. DS/ML infrastructure needs the permanence of production infrastructure, but it also has the dynamism of development infrastructure, since your data science team is constantly spinning up new resources.
Compute
Between compute and storage, compute expenses tend to be much larger, but they also tend to be much easier to manage. Compute is easier than storage because stopping an EC2 instance is a reversible change, whereas deleting an S3 bucket is not. It is really important to tag instances, or have some other way of breaking costs out by user or team; otherwise, you have no way to figure out who is responsible for the $5k of AWS spend that showed up over the weekend.
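If you do tag consistently, breaking spend out by team becomes a one-call query. Here is a minimal sketch using the AWS Cost Explorer API; the `team` tag key and the date range are illustrative assumptions, and the tag must be activated as a cost allocation tag in the billing console before it shows up here.

```python
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

# One week of unblended cost, grouped by the (assumed) "team" tag.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-06-08"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for day in response["ResultsByTime"]:
    print(day["TimePeriod"]["Start"])
    for group in day["Groups"]:
        tag_value = group["Keys"][0]  # e.g. "team$ml-research"
        cost = group["Metrics"]["UnblendedCost"]["Amount"]
        print(f"  {tag_value}: ${float(cost):.2f}")
```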
Storage
Storage is hard to deal with because deleting data is scary. The same principles - tagging resources and making sure you can figure out who is responsible for the spend - are even more important here.
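One way to make deletion less scary is to never delete by hand: encode retention in an S3 lifecycle rule, so old data ages into cheaper tiers and eventually expires on a schedule everyone has agreed to. A minimal sketch follows; the bucket name, prefix, and timings are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative policy for a scratch area: move objects to infrequent
# access after 30 days, Glacier after 90, and delete after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-scratch-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-scratch-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "scratch/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```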
Conclusion
Managing cloud costs for ML teams requires a delicate balance between optimizing resource usage and maintaining productivity. By understanding the provisioning patterns and challenges faced by data scientists, companies can implement strategies that streamline resource allocation while minimizing unnecessary expenses. Key principles like resource optimization, instance lifecycle management, and storage optimization are essential, but their practical application must be tailored to the unique needs of ML workflows.
Regular monitoring, setting budgets, and leveraging tools for cost governance ensure that cloud spending remains under control. Additionally, encouraging efficient data processing and utilizing reserved instances can lead to significant savings. While spot instances and hardware utilization offer opportunities for cost reduction, their implementation must be carefully managed to avoid disrupting ongoing projects.
Ultimately, the goal is to foster an environment where ML teams can innovate without being hindered by cloud infrastructure bottlenecks or runaway costs. By adopting these best practices and continuously refining them based on real-world usage, organizations can achieve a sustainable and cost-effective approach to managing cloud resources for their ML initiatives.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without having to switch tools.