Automating Deployment with CI/CD for Data Science
In the world of data science, the ability to quickly and reliably deploy models and applications is crucial. This article explores how Continuous Integration and Continuous Deployment (CI/CD) pipelines can streamline that process on Saturn Cloud, ensuring that data-driven insights and innovations reach production environments with speed and consistency. It walks through a real-world CI/CD setup that we use in production at Saturn Cloud.
Overview
In Saturn Cloud, we run an API service that provides Saturn Cloud installations with historical usage data, such as:
- who is running workloads
- what size machines they are running
- when they are running
This API is consumed by every Saturn Cloud instance and is used to enforce usage limits (Saturn Cloud installations can be configured so that different groups of users have access to different hardware types and can be capped after hitting spend limits). The API also powers the usage reports we make available in Saturn Cloud. We serve this API as a Saturn Cloud Deployment. Previously, we updated the deployment manually; we recently automated these updates via Saturn Cloud recipes and GitHub Actions.
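To make the shape of this data concrete, here is a minimal sketch of what a single usage record might contain. The field names and types are our own illustration, not the API's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class UsageRecord:
    """Hypothetical shape of one usage record served by the API."""
    user: str                       # who is running the workload
    instance_type: str              # what size machine it runs on, e.g. "large"
    started_at: datetime            # when the workload started
    stopped_at: Optional[datetime]  # when it stopped; None if still running
```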
Recipes
A Saturn Cloud recipe is a YAML file that encodes all relevant information about a Saturn Cloud resource. The following is the real recipe we use for our usage statistics API deployed on Saturn Cloud.
```yaml
schema_version: 2024.04.01
type: deployment
spec:
  name: usage-stats-api
  owner: internal/production
  description: ''
  image: internal/production/usage-limits-api:2024.07.19
  instance_type: large
  environment_variables:
    BASE_URL: https://usage-statistics-api-deploy.internal.saturnenterprise.io/
  working_directory: /home/jovyan/workspace/usage-statistics
  start_script: export PYTHONPATH=/home/jovyan/workspace/usage-statistics:${PYTHONPATH}
  git_repositories:
    - url: git@github.com:saturncloud/usage-statistics.git
      path: /home/jovyan/workspace/usage-statistics
      public: false
      on_restart: reclone
      reference: 2024.07.20
      reference_type: tag
  secrets:
    - location: ANALYTICS_RDS_URL
      type: environment_variable
      description: ''
      owner: internal/production
      name: analytics-rds-url
    - location: SNOWFLAKE_ACCOUNT
      type: environment_variable
      description: ''
      owner: internal/production
      name: snowflake-account
    - location: SNOWFLAKE_PASSWORD
      type: environment_variable
      description: ''
      owner: internal/production
      name: snowflake-password
    - location: SNOWFLAKE_USERNAME
      type: environment_variable
      description: ''
      owner: internal/production
      name: snowflake-username
  shared_folders: []
  start_dind: false
  command: make run-backend
  scale: 1
  start_ssh: false
  use_spot_instance: false
  routes:
    - subdomain: usage-statistics-api-deploy
      container_port: 8000
      visibility: unauthenticated
  viewers: []
state:
  id: 3c8558f9a0044987b5f6edfd77b1cf37
  status: running
```
You don’t need to understand all of it. We will highlight the interesting parts below.
```yaml
image: internal/production/usage-limits-api:2024.07.19
```
This is the Docker image the deployment runs.
```yaml
instance_type: large
```
This defines which instance size the deployment runs on.
```yaml
git_repositories:
  - url: git@github.com:saturncloud/usage-statistics.git
    path: /home/jovyan/workspace/usage-statistics
    public: false
    on_restart: reclone
    reference: 2024.07.20
    reference_type: tag
```
This defines which Git repository holds the source code we are running in this deployment. It also tells Saturn Cloud what to check out (in this case, the `2024.07.20` tag; pinning to a tag rather than a moving branch is what makes the update procedure below work).
```yaml
command: make run-backend
working_directory: /home/jovyan/workspace/usage-statistics
```
This means the deployment will execute `make run-backend` from `/home/jovyan/workspace/usage-statistics`. This works because our Git repository is checked out to that directory, and the Makefile there defines the `run-backend` entrypoint. From our Makefile:
```makefile
.PHONY: run-backend
run-backend:
	python -m usage_statistics.scripts.run
```
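The repository is private (`public: false` in the recipe), so `usage_statistics.scripts.run` itself is not shown here. Since the recipe routes traffic to container port 8000, the entrypoint presumably starts an HTTP server on that port. The following is a minimal, hypothetical sketch assuming a FastAPI app; the framework choice and the endpoint are assumptions, not the actual Saturn Cloud code:

```python
# Hypothetical sketch of usage_statistics/scripts/run.py
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/health")
def health() -> dict:
    # Placeholder endpoint so the sketch is self-contained; the real API
    # serves the historical usage data described above.
    return {"status": "ok"}

if __name__ == "__main__":
    # Port 8000 matches container_port in the recipe's routes section.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```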
Updating Deployments
Our procedure for updating the deployment is as follows (a scripted version of the recipe edits is sketched below):
- (Optional, only if project dependencies have changed) Build a new Docker image with the correct dependencies.
- Modify the recipe to point at the new image.
- Modify the recipe so that the Git repository reference points to a new tag.
- Commit the updated recipe.
- Push the tag to GitHub, which triggers a GitHub Actions workflow.
- GitHub Actions applies the recipe and then restarts the deployment.
Currently we are building images in the Saturn Cloud image builder. We do this manually; it has not yet been incorporated into our GitHub Actions flow (but we plan to do that in the future).
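The recipe edits in the procedure above are mechanical, so they lend themselves to scripting. Below is one possible sketch using PyYAML; the script itself, and the convention of using the same date-stamped tag for both the image and the Git reference, are our illustration rather than part of Saturn Cloud's tooling:

```python
# bump_recipe.py (hypothetical helper, not part of Saturn Cloud's tooling)
# Usage: python bump_recipe.py 2024.07.20
import sys
import yaml  # PyYAML

def bump(recipe_path: str, tag: str) -> None:
    with open(recipe_path) as f:
        recipe = yaml.safe_load(f)

    spec = recipe["spec"]
    # Point the deployment at the freshly built image...
    image_name = spec["image"].rsplit(":", 1)[0]
    spec["image"] = f"{image_name}:{tag}"
    # ...and check out the matching Git tag.
    repo = spec["git_repositories"][0]
    repo["reference"] = tag
    repo["reference_type"] = "tag"

    with open(recipe_path, "w") as f:
        yaml.safe_dump(recipe, f, sort_keys=False)

if __name__ == "__main__":
    bump("recipe.yaml", sys.argv[1])
```

After running it, committing recipe.yaml and pushing a `release-*` tag kicks off the GitHub Actions workflow described next.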
GitHub Actions
The following is our GitHub Actions YAML that actually updates Saturn Cloud.
```yaml
name: deploy
on:
  push:
    tags:
      - 'release-*'
jobs:
  deploy:
    name: deploy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: mamba-org/setup-micromamba@v1
        name: Set up micromamba
        with:
          environment-file: environment.yaml
          init-shell: >-
            bash
          cache-environment: true
      - name: pythonpath
        run: echo "PYTHONPATH=${GITHUB_WORKSPACE}" >> $GITHUB_ENV
      - name: path
        run: echo "PATH=/home/runner/micromamba/envs/usage-statistics/bin:/home/runner/micromamba-bin/:${PATH}" >> $GITHUB_ENV
      - name: deploy
        run: make deploy
```
Let’s go through the relevant snippets.
```yaml
on:
  push:
    tags:
      - 'release-*'
```
The above tells GitHub to run this workflow only when a tag starting with `release-` is pushed (for example, pushing the tag `release-2024.07.20` triggers a deploy, while an ordinary branch push does not).
```yaml
- uses: actions/checkout@v2
- uses: mamba-org/setup-micromamba@v1
  name: Set up micromamba
  with:
    environment-file: environment.yaml
    init-shell: >-
      bash
    cache-environment: true
```
The above checks out the project repository, and then uses micromamba to create a conda environment from the environment.yaml defined in our GitHub repository.
```yaml
- name: pythonpath
  run: echo "PYTHONPATH=${GITHUB_WORKSPACE}" >> $GITHUB_ENV
- name: path
  run: echo "PATH=/home/runner/micromamba/envs/usage-statistics/bin:/home/runner/micromamba-bin/:${PATH}" >> $GITHUB_ENV
```
The above sets the `PATH` and `PYTHONPATH` environment variables for subsequent steps: `PYTHONPATH` makes the repository's code importable, and `PATH` puts the micromamba environment's binaries first so later commands use them.
```yaml
- name: deploy
  run: make deploy
```
Finally, we can run `make deploy`, which again runs a command from our Makefile:
```makefile
.PHONY: deploy
deploy:
	sc apply recipe.yaml
	sc restart deployment usage-stats-api --owner production
```
This applies the recipe, which updates the deployment according to our infrastructure-as-code (IaC) conventions, and then restarts it so that it picks up the new image and Git tag.
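Once the workflow finishes, the deployment can be sanity-checked over its public route (the recipe marks it `visibility: unauthenticated`). A minimal check, reusing the hypothetical `/health` endpoint from the earlier sketch:

```python
import requests

# The URL comes from BASE_URL in the recipe's environment_variables;
# /health is the hypothetical endpoint sketched earlier, not a documented route.
resp = requests.get(
    "https://usage-statistics-api-deploy.internal.saturnenterprise.io/health",
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```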
Conclusion
In the field of data science, maintaining efficient and reliable deployment processes is important. Integrating Continuous Integration and Continuous Deployment (CI/CD) pipelines with Saturn Cloud significantly improves our ability to manage and update data-driven services.
This automated approach streamlines our workflow and lets us focus more on developing solutions and less on the mechanics of deployment. By leveraging these tools, we can deliver high-quality, reliable data services faster, driving better outcomes for our clients.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.