Most Data Science Platforms are a Bad Idea
A data science platform is an integrated set of tools that deliver the capabilities that most data science teams need. These capabilities are:
- The ability to do exploratory data analysis and create machine learning models.
- The ability to deploy models as APIs for other teams to use.
- The ability to schedule jobs and data pipelines to keep the business running.
- The ability to deploy dashboards for executives and stakeholders to view at any time.
- The ability for team members to collaborate easily on their work.
Most data science teams have at least these needs, and thus most products in this category focus on this set of uses.
Your team will grow to the point where you’ll need a data science platform. You’ll need to provide infrastructure for research and production in a standard, secure, and reproducible way. Data scientists will run out of memory on their laptops and need scalable hardware. Data scientists will “try out” a few packages on their laptops and suddenly no one else can reproduce their work. We’ve seen data scientists stop pushing their code to git, quit their jobs, and leave that work lost for good.
Teams should leverage automated tooling whenever they can. Five to ten years ago in data science, Git was something only the “nerds” used. Now it’s standard practice and the benefits are obvious. Similarly, we are at an inflection point where data science platforms are going from a niche tool to a common practice. Data scientists need standard, secure, and reproducible ways to provision infrastructure and set up new projects, and having each data scientist manage their own infrastructure won’t cut it.
Data scientists are in exceptionally high demand. They already have to learn and keep up with the latest trends in data visualization, data analysis, and machine learning, in addition to becoming competent software engineers. Adding an entire new category of skills to their plate makes them impossible to hire. Letting each data scientist own their own infrastructure creates dozens of bespoke stacks and makes it impossible for one data scientist to pick up another team member’s work. Offloading infrastructure to a separate software engineering team is expensive and creates significant delays, since the data science team ends up blocked by another team.
While you don’t necessarily need to buy a platform, you do need a set of standards and practices for your data scientists to follow: standards like what kind of hardware the work is done on, what libraries and languages are used, and so on. If you build your own tools to achieve and enforce these standards, you’ve probably just built (and now have to maintain) your own data science platform. This is almost always more expensive and time-consuming than buying an off-the-shelf solution.
However, not all data science platforms are created equal. There are a few core problems with most platforms out there.
Problem: data science platforms force you to use a fixed workflow
The biggest flaw with most data science platforms is that they force data scientists to change their workflows to fit the platform’s paradigms.
Products like Knime and Dataiku produce workflows that are represented in GUIs instead of code. By relying on GUIs, data scientists are severely limited in what they can do in the platform and in how their work can be reproduced by other people. For a detailed explanation of why GUIs aren’t great tools for data scientists, see this talk by Hadley Wickham for the ACM.
Other products like Amazon SageMaker and Databricks require you to write code with their specific APIs and packages. This makes it harder to migrate to the platform and locks you into it. It also becomes harder to write code, since any guides or articles you read about relevant packages or methods first need to be altered to run on the platform before you can try them. For instance, the examples that AWS publishes on how to use SageMaker require extensive boilerplate code and reasoning about other AWS services just to get them to work. This creates heavy overhead for data scientists compared to just coding on a laptop.
Example [AWS SageMaker](https://saturncloud.io/glossary/aws-sagemaker) boilerplate code
sm_boto3 = boto3.client("sagemaker")
training_job_name = "example-training-job-{}".format(current_time())
data_path = "s3://" + bucket + "/" + input_prefix
output_prefix = "example/output/"
output_path = "s3://" + bucket + "/" + output_prefix
region = boto3.Session().region_name
account = account_id()
image_uri = "{}.dkr.ecr.{}.amazonaws.com/example-image:latest".format(account, region)
algorithm_specification = {
"TrainingImage": image_uri,
"TrainingInputMode": "File",
}
input_data_config = [
{
"ChannelName": "train",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": data_path,
"S3DataDistributionType": "FullyReplicated",
}
},
},
{
"ChannelName": "test",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": data_path,
"S3DataDistributionType": "FullyReplicated",
}
},
},
]
output_data_config = {"S3OutputPath": output_path}
resource_config = {"InstanceType": "ml.m5.large", "InstanceCount": 1, "VolumeSizeInGB": 10}
stopping_condition = {
"MaxRuntimeInSeconds": 120,
}
enable_network_isolation = False
ct_res = sm_boto3.create_training_job(
TrainingJobName=training_job_name,
AlgorithmSpecification=algorithm_specification,
RoleArn=role_arn,
InputDataConfig=input_data_config,
OutputDataConfig=output_data_config,
ResourceConfig=resource_config,
StoppingCondition=stopping_condition,
EnableNetworkIsolation=enable_network_isolation,
EnableManagedSpotTraining=False,
)
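For contrast, here is a rough sketch of what “just coding on a laptop” looks like: plain scikit-learn with no cloud-specific configuration. The dataset and model choice are illustrative and not taken from the AWS example above.

```python
# A minimal local training run: no buckets, IAM roles, or container images needed.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```

None of this requires reasoning about S3 locations, ECR images, or instance types before a model can be trained, which is exactly the overhead the SageMaker boilerplate adds.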
Problem: data science platforms require you to use single notebooks
Still other data science platforms, such as Google Colab and Paperspace, are centered around having data scientists work entirely from notebooks, abstracting away the decisions data scientists make outside of the notebook itself. So while it’s quick to start up a notebook and write code, it becomes harder to install libraries, manage the files required on the machine, and handle other infrastructure components. The limitations of hosted notebooks become very apparent when you consider the work data scientists have to do:
- They do not offer scalable compute. In Google Colab, you do not get access to guaranteed RAM or GPUs; resources are allocated based on availability, even for paid tiers. This makes them impractical for enterprise use cases.
- They don’t let you write code outside of notebooks. Notebooks are useful for exploration, but data science teams need to build things, and putting everything into a single notebook makes code hard to test and hard to reuse (see the sketch after this list).
- They are hard to deploy and maintain. Data scientists need to deploy code in order to deliver business value, and deployments are difficult to make and maintain with just a notebook. Without the ability to deploy models, APIs, dashboards, and jobs, data science remains research only.
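As an example of the kind of refactoring that single-notebook platforms make awkward, here is a minimal sketch of pulling logic out of a notebook into an importable module with a plain unit test. The module, function, and column names are all hypothetical.

```python
# features.py - logic extracted from a notebook so it can be imported,
# reused, and tested outside the notebook itself.
import pandas as pd


def add_revenue_per_user(df: pd.DataFrame) -> pd.DataFrame:
    """Add a revenue-per-user column (column names are illustrative)."""
    out = df.copy()
    out["revenue_per_user"] = out["revenue"] / out["users"].clip(lower=1)
    return out


# test_features.py - a plain pytest-style test, something a single hosted
# notebook makes hard to run as part of CI.
def test_add_revenue_per_user():
    df = pd.DataFrame({"revenue": [100.0], "users": [4]})
    assert add_revenue_per_user(df)["revenue_per_user"].iloc[0] == 25.0
```

With the logic in a module, both a notebook and a scheduled job can import the same function, and a test runner can exercise it automatically.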
Problem: MLOps tools don’t provide data science platforms
MLOps is another huge category of tools that focuses primarily on tracking machine learning work (experiment tracking and data/model versioning) and on working with models in production (model monitoring and governance). Some data science platforms also include MLOps capabilities; most work with standalone MLOps products (like Weights & Biases, Comet, and Verta). MLOps can be enormously valuable for data science teams by providing these sorts of backends. That said, MLOps tools on their own do not provide the infrastructure required for data scientists to do analyses and train models.
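For context, here is a minimal sketch of the experiment-tracking side of MLOps, using Weights & Biases as an example; the project name, config, and metric values are hypothetical.

```python
# Record a training run's parameters and metrics with Weights & Biases.
import wandb

run = wandb.init(project="churn-model", config={"learning_rate": 0.01})

for epoch in range(3):
    # ... train for one epoch, then log metrics for this run ...
    wandb.log({"epoch": epoch, "train_loss": 1.0 / (epoch + 1)})

run.finish()
```

Tracking like this is valuable, but it still leaves open where the code actually runs, how the environment is managed, and how the results get deployed.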
Not only do MLOps tools miss the important component of having sandbox environments for data scientists to work in, but migrating teams to these platforms carries significant cost. Migration can involve rewriting large sections of your code base to fit the patterns required by the tool and moving existing data and models to new locations. Developing in these platforms also carries significant risk: if in the future you have to do work that isn’t supported by the tool, you’ll either have to throw away years of work or split your MLOps across multiple platforms. Vendor lock-in and limited support for other tools and libraries can create havoc for machine learning teams. So while MLOps tools can be enormously helpful for teams, they are not a simple plug-and-play solution for all of a data science or machine learning team’s infrastructure.
What makes Saturn Cloud a great data science platform
At Saturn Cloud, we provide a fully flexible platform for data scientists that can connect to whatever you need. The fundamental units that data scientists use on Saturn Cloud are resources. Each resource consists of:
- A Docker image
- Some initialization scripts for package installations
- Your git repositories
- Attached secrets and storage
Defining a resource this way gives you full flexibility in how you use it. A resource can use any data science programming language: while we natively support Python, R, and Julia, you can run other languages too. Our resources don’t force you to use a single notebook; you can use any number of files and manage them in one or multiple git repositories. We also provide storage attached to the resource, so users don’t have to worry about files being lost if the resource is restarted. Finally, the initialization scripts allow users to install any packages or software they want on the resource. Because of this, our Saturn Cloud resources are as flexible as if a data scientist had admin rights on a laptop!
We have three types of resources, depending on the work you need to do:
- A workspace runs JupyterLab or R for exploratory analysis, model training, and ad hoc work. These resources also support SSH so you can connect via PyCharm, VSCode, a terminal or any other IDE you want. This is what data scientists spend most of their time using.
- A deployment runs constantly and can serve traffic for hosting dashboards and APIs. This is great for providing dashboards and reports to stakeholders, or models and APIs to engineering teams (a minimal serving sketch follows this list).
- A job runs once or on a schedule. This lets data scientists create data pipelines or scheduled tasks to help run the business. Saturn Cloud also integrates with Prefect Cloud for more complex workflows.
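As a rough illustration of the kind of code a deployment resource might serve, here is a minimal, generic Flask API that wraps a trained model. It is not tied to Saturn Cloud’s APIs; the model file and feature names are hypothetical.

```python
# A small model-serving API: load a trained model and expose a /predict endpoint.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained model from disk (hypothetical artifact)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = [[payload["feature_a"], payload["feature_b"]]]
    prediction = model.predict(features)[0]
    return jsonify({"prediction": float(prediction)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

An engineering team would then POST JSON such as {"feature_a": 1.2, "feature_b": 3} to /predict and get a prediction back, without needing to know anything about how the model was trained.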
Further, all types of resources can be connected to distributed clusters to use frameworks like Dask for big data processing. Saturn Cloud is incredibly flexible, so data scientists can work how they want to. Saturn Cloud also integrates with MLOps providers such as Weights & Biases, Comet, and Verta to deliver advanced functionality for teams that need it.
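To show what the Dask-based big data processing mentioned above looks like in practice, here is a generic sketch using ordinary open-source Dask (the file paths and column names are hypothetical, and cluster setup details would come from your own environment).

```python
# Parallelize a dataframe workload across a Dask cluster.
from dask.distributed import Client
import dask.dataframe as dd

# Connect to a cluster; with no arguments this spins up a local cluster.
client = Client()

# Read many CSV files in parallel as one logical dataframe (paths are hypothetical)
df = dd.read_csv("data/events-*.csv")

# Lazily define an aggregation, then compute it across the workers
daily_totals = df.groupby("date")["amount"].sum().compute()
print(daily_totals.head())
```

Because this is plain Dask code rather than a platform-specific API, the same script runs on a laptop or on a large cluster without rewriting.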
As you consider data science platforms for your team, think about the trade-offs between flexibility, easy-to-use infrastructure, and the ability to avoid lock-in. If you’re interested in Saturn Cloud, feel free to reach out to us with questions or for a demo.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.