Automating Deployment with CI/CD for Data Science
In the world of data science, the ability to quickly and reliably deploy models and applications is crucial. This article explores how Continuous Integration and Continuous Deployment (CI/CD) pipelines can streamline that process on Saturn Cloud, ensuring that data-driven insights and innovations reach production environments with speed and consistency. It walks through a real-world CI/CD setup that we use in production at Saturn Cloud.
Overview
In Saturn Cloud, we run an API service that provides Saturn Cloud installations with historical usage data, such as:
- who is running workloads
- what size machines they are running
- when they are running
This API is consumed by every Saturn Cloud instance and is used to enforce usage limits (Saturn Cloud installations can be configured so that different groups of users have access to different hardware types and can be capped after hitting spend limits). The API also powers the usage reports we make available in Saturn Cloud. We serve this API as a Saturn Cloud Deployment. Previously, we updated the deployment manually; we recently automated these updates via Saturn Cloud recipes and GitHub Actions.
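To make the shape of this data concrete, here is a minimal sketch of what a single usage record might contain. The field names and types are our own illustration, not the API's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class UsageRecord:
    """Hypothetical shape of one usage record served by the API."""
    user: str                       # who is running the workload
    instance_type: str              # what size machine it runs on, e.g. "large"
    started_at: datetime            # when the workload started
    stopped_at: Optional[datetime]  # when it stopped; None if still running
```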
Recipes
A Saturn Cloud recipe is a YAML file that encodes all relevant information about a Saturn Cloud resource. The following is the real recipe we use for our usage statistics API deployed on Saturn Cloud.
```yaml
schema_version: 2024.04.01
type: deployment
spec:
  name: usage-stats-api
  owner: internal/production
  description: ''
  image: internal/production/usage-limits-api:2024.07.19
  instance_type: large
  environment_variables:
    BASE_URL: https://usage-statistics-api-deploy.internal.saturnenterprise.io/
  working_directory: /home/jovyan/workspace/usage-statistics
  start_script: export PYTHONPATH=/home/jovyan/workspace/usage-statistics:${PYTHONPATH}
  git_repositories:
    - url: git@github.com:saturncloud/usage-statistics.git
      path: /home/jovyan/workspace/usage-statistics
      public: false
      on_restart: reclone
      reference: 2024.07.20
      reference_type: tag
  secrets:
    - location: ANALYTICS_RDS_URL
      type: environment_variable
      description: ''
      owner: internal/production
      name: analytics-rds-url
    - location: SNOWFLAKE_ACCOUNT
      type: environment_variable
      description: ''
      owner: internal/production
      name: snowflake-account
    - location: SNOWFLAKE_PASSWORD
      type: environment_variable
      description: ''
      owner: internal/production
      name: snowflake-password
    - location: SNOWFLAKE_USERNAME
      type: environment_variable
      description: ''
      owner: internal/production
      name: snowflake-username
  shared_folders: []
  start_dind: false
  command: make run-backend
  scale: 1
  start_ssh: false
  use_spot_instance: false
  routes:
    - subdomain: usage-statistics-api-deploy
      container_port: 8000
      visibility: unauthenticated
  viewers: []
state:
  id: 3c8558f9a0044987b5f6edfd77b1cf37
  status: running
```
You don’t need to understand all of it. We will highlight the interesting parts below.
```yaml
image: internal/production/usage-limits-api:2024.07.19
```
This is the Docker image the deployment runs.
```yaml
instance_type: large
```
This defines which instance size the deployment runs on.
```yaml
git_repositories:
  - url: git@github.com:saturncloud/usage-statistics.git
    path: /home/jovyan/workspace/usage-statistics
    public: false
    on_restart: reclone
    reference: 2024.07.20
    reference_type: tag
```
This defines which Git repository holds the source code we are running in this deployment. It also tells Saturn Cloud what to check out (in this case, the `2024.07.20` tag; pinning to a tag rather than a moving branch is what makes the update procedure below work).
```yaml
command: make run-backend
working_directory: /home/jovyan/workspace/usage-statistics
```
This means the deployment will execute `make run-backend` from `/home/jovyan/workspace/usage-statistics`. This works because our Git repository is checked out to that directory, and the Makefile there defines the `run-backend` entrypoint. From our Makefile:
```makefile
.PHONY: run-backend
run-backend:
	python -m usage_statistics.scripts.run
```
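The repository is private (`public: false` in the recipe), so `usage_statistics.scripts.run` itself is not shown here. Since the recipe routes traffic to container port 8000, the entrypoint presumably starts an HTTP server on that port. The following is a minimal, hypothetical sketch assuming a FastAPI app; the framework choice and the endpoint are assumptions, not the actual Saturn Cloud code:

```python
# Hypothetical sketch of usage_statistics/scripts/run.py
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/health")
def health() -> dict:
    # Placeholder endpoint so the sketch is self-contained; the real API
    # serves the historical usage data described above.
    return {"status": "ok"}

if __name__ == "__main__":
    # Port 8000 matches container_port in the recipe's routes section.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```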
Updating Deployments
Our procedure for updating the deployment is as follows (a scripted version of the recipe edits is sketched below):
- (Optional, only if project dependencies have changed) Build a new Docker image with the correct dependencies.
- Modify the recipe to point at the new image.
- Modify the recipe so that the Git repository reference points to a new tag.
- Commit the updated recipe.
- Push the tag to GitHub, which triggers a GitHub Actions workflow.
- GitHub Actions applies the recipe and then restarts the deployment.
Currently we are building images in the Saturn Cloud image builder. We do this manually; it has not yet been incorporated into our GitHub Actions flow (but we plan to do that in the future).
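The recipe edits in the procedure above are mechanical, so they lend themselves to scripting. Below is one possible sketch using PyYAML; the script itself, and the convention of using the same date-stamped tag for both the image and the Git reference, are our illustration rather than part of Saturn Cloud's tooling:

```python
# bump_recipe.py (hypothetical helper, not part of Saturn Cloud's tooling)
# Usage: python bump_recipe.py 2024.07.20
import sys
import yaml  # PyYAML

def bump(recipe_path: str, tag: str) -> None:
    with open(recipe_path) as f:
        recipe = yaml.safe_load(f)

    spec = recipe["spec"]
    # Point the deployment at the freshly built image...
    image_name = spec["image"].rsplit(":", 1)[0]
    spec["image"] = f"{image_name}:{tag}"
    # ...and check out the matching Git tag.
    repo = spec["git_repositories"][0]
    repo["reference"] = tag
    repo["reference_type"] = "tag"

    with open(recipe_path, "w") as f:
        yaml.safe_dump(recipe, f, sort_keys=False)

if __name__ == "__main__":
    bump("recipe.yaml", sys.argv[1])
```

After running it, committing recipe.yaml and pushing a `release-*` tag kicks off the GitHub Actions workflow described next.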
GitHub Actions
The following is our GitHub Actions YAML that actually updates Saturn Cloud.
```yaml
name: deploy
on:
  push:
    tags:
      - 'release-*'
jobs:
  deploy:
    name: deploy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: mamba-org/setup-micromamba@v1
        name: Set up micromamba
        with:
          environment-file: environment.yaml
          init-shell: >-
            bash
          cache-environment: true
      - name: pythonpath
        run: echo "PYTHONPATH=${GITHUB_WORKSPACE}" >> $GITHUB_ENV
      - name: path
        run: echo "PATH=/home/runner/micromamba/envs/usage-statistics/bin:/home/runner/micromamba-bin/:${PATH}" >> $GITHUB_ENV
      - name: deploy
        run: make deploy
```
Let’s go through the relevant snippets.
```yaml
on:
  push:
    tags:
      - 'release-*'
```
The above tells GitHub to run this workflow only when a tag starting with `release-` is pushed (for example, pushing the tag `release-2024.07.20` triggers a deploy, while an ordinary branch push does not).
```yaml
- uses: actions/checkout@v2
- uses: mamba-org/setup-micromamba@v1
  name: Set up micromamba
  with:
    environment-file: environment.yaml
    init-shell: >-
      bash
    cache-environment: true
```
The above checks out the project repository, and then uses micromamba to create a conda environment from the environment.yaml defined in our GitHub repository.
```yaml
- name: pythonpath
  run: echo "PYTHONPATH=${GITHUB_WORKSPACE}" >> $GITHUB_ENV
- name: path
  run: echo "PATH=/home/runner/micromamba/envs/usage-statistics/bin:/home/runner/micromamba-bin/:${PATH}" >> $GITHUB_ENV
```
The above sets the `PATH` and `PYTHONPATH` environment variables for subsequent steps: `PYTHONPATH` makes the repository's code importable, and `PATH` puts the micromamba environment's binaries first so later commands use them.
```yaml
- name: deploy
  run: make deploy
```
Finally, we can run `make deploy`, which again runs a command from our Makefile:
```makefile
.PHONY: deploy
deploy:
	sc apply recipe.yaml
	sc restart deployment usage-stats-api --owner production
```
This applies the recipe, which updates the deployment according to our infrastructure-as-code (IaC) conventions, and then restarts it so that it picks up the new image and Git tag.
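Once the workflow finishes, the deployment can be sanity-checked over its public route (the recipe marks it `visibility: unauthenticated`). A minimal check, reusing the hypothetical `/health` endpoint from the earlier sketch:

```python
import requests

# The URL comes from BASE_URL in the recipe's environment_variables;
# /health is the hypothetical endpoint sketched earlier, not a documented route.
resp = requests.get(
    "https://usage-statistics-api-deploy.internal.saturnenterprise.io/health",
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```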
Conclusion
In the field of data science, maintaining efficient and reliable deployment processes is important. Integrating Continuous Integration and Continuous Deployment (CI/CD) pipelines with Saturn Cloud significantly improves our ability to manage and update data-driven services.
This automated approach streamlines our workflow and lets us focus more on developing solutions and less on the mechanics of deployment. By leveraging these tools, we can deliver high-quality, reliable data services faster, driving better outcomes for our clients.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.