Saturn Cloud on Nebius: Platform Architecture

How to deploy Saturn Cloud on Nebius for teams that need H100 and H200 GPUs without hyperscaler quota constraints.

The Platform Engineer’s Problem

If you’re running the platform for an AI team, you likely have a backlog of projects: integrating the training pipeline with your feature store, building compliance automation for your industry’s data retention requirements, setting up model drift monitoring specific to your domain, optimizing the data loader for your particular workloads.

Before you can work on that backlog, you’re frontline support for issues that eat up your week:

  • IAM/RBAC: explaining permissions, debugging access issues
  • Cost tracking: manual usage reports, idle instance cleanup
  • Certificate management: TLS renewal, cert-manager debugging
  • Image building: Docker troubleshooting, registry management, CVE patching
  • Resource quotas: constant adjustment requests
  • Network debugging: service-to-service communication, VPN issues
  • Kubernetes upgrades: compatibility testing, deprecated API fixes
  • Logging: pipeline maintenance, retention management
  • User support: access setup, answering the same questions

Building this infrastructure takes a few weeks. The problem is maintaining it: two years of edge cases, user requests, security patches, and daily operational interrupts.

The combination of Nebius and Saturn Cloud addresses both the GPU access problem and the platform maintenance burden. Nebius provides H100 GPUs at $2.95/hour per GPU with no quota approvals: an 8-GPU H100 instance runs $23.60/hour versus $88-98/hour on Azure or GCP, with no multi-month waitlists. Saturn Cloud handles the operational baseline described above.

Why Nebius

Nebius is a cloud provider focused on AI workloads, operating in US and EU regions.

GPU Availability: On-demand access to H100 and H200 instances without sales calls or capacity reservations. You can provision multi-node clusters with 16-32 GPUs immediately. Larger clusters are available through commitment contracts.

Pricing: $2.95/hour per H100 GPU. An 8-GPU instance runs $23.60/hour versus $88-98/hour on Azure or GCP.

Networking: NVLink and InfiniBand standard on all GPU instances. NFS storage at 12 GB/s read throughput per 8-GPU VM (compare to AWS EFS at 1.5 GB/s max).

Managed Services: Managed Kubernetes, PostgreSQL, and MLflow included.

Saturn Cloud Architecture

Saturn Cloud adds the platform layer on top of Nebius’s managed Kubernetes.

Resource Types

Type | Purpose | Examples
Workspaces | Development environments with persistent home directories | JupyterLab, RStudio, SSH (VS Code/PyCharm/Cursor)
Jobs | Scheduled or on-demand training runs from Git | Single-node fine-tuning, multi-node distributed training
Deployments | Long-running services behind authenticated endpoints | Model inference APIs, dashboards

Core Components

Component | Function
Atlas | API server and PostgreSQL-backed database
Auth-server | RS256 JWT tokens for sessions and API access
Traefik | Ingress routing to workspaces, jobs, deployments, and UI
SSH-proxy | Gateway for IDE connections (VS Code, PyCharm, Cursor)
Cluster-autoscaler | Scales Nebius node groups based on pending pods

What Gets Automated

IAM/RBAC: User management, groups, project-based access, SSO integration.

Cost tracking: Per-user and per-project usage reports, not just cluster-level costs. Integrates with your internal cost allocation.

Multi-node distributed training: Jobs with multiple GPU nodes get automatic environment setup (SATURN_JOB_LEADER, SATURN_JOB_RANK, worker DNS). All workers land on the same InfiniBand fabric for RDMA. NCCL configured for InfiniBand. Logs from all nodes accessible for debugging. You handle checkpointing strategy and recovery.

Certificate management: Automated TLS provisioning and renewal via cert-manager.
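
If you want to spot-check renewals yourself, cert-manager’s standard resources are queryable from kubectl; a minimal sketch (these are cert-manager’s own resource types, nothing Saturn-specific):

# List managed certificates and their readiness across all namespaces
kubectl get certificates -A
# Inspect renewal status for a specific certificate (name and namespace are placeholders)
kubectl describe certificate <cert-name> -n <namespace>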

Image management: Pre-built images with NVIDIA libraries (CUDA, NeMo, RAPIDS), or bring your own from any registry.

Platform upgrades: Kubernetes compatibility, operator updates, security patches. Typically every 6 months, causing 1-2 minutes of UI/API downtime. User workloads continue running during upgrades.

AI/ML engineers contact Saturn Cloud support directly for workspace and job questions. Platform engineers work on their actual project backlog.

GPU Options

Saturn Cloud on Nebius provides access to:

GPU | Memory | Configurations
H100 | 80GB HBM3 | 1-GPU, 8-GPU
H200 | 141GB HBM3e | 1-GPU, 8-GPU
GB200 | Blackwell | Via Nebius

All instances include NVLink (intra-node) and InfiniBand (inter-node).

Architecture

[Architecture diagram: Saturn Cloud on Nebius]

Infrastructure Layer

The reference Terraform provisions a Nebius MK8S cluster (Kubernetes 1.30) with public control plane endpoints and a single etcd instance (configurable for HA). A service account is created and added to your Nebius IAM viewers group to pull images from Nebius Container Registry.

For GPU workloads, Terraform creates separate InfiniBand GPU clusters. In EU (eu-north1), this means fabric-6 for H100 (configurable) and fabric-7 for H200. In US (us-central1), the fabrics are us-central1-a and us-central1-b. These clusters provide the low-latency RDMA networking required for multi-node distributed training.

Node groups are provisioned for different workload types. The system pool runs 2-100 cpu-d3 nodes (4vcpu-16gb) for the Saturn control plane. Three CPU pools handle general workloads: 4vcpu-16gb, 16vcpu-64gb, and 64vcpu-256gb, all scaling from 0-100 nodes. GPU pools include H100 and H200 configurations, available in 1-GPU and 8-GPU variants, also scaling 0-100. All node groups carry node.saturncloud.io/role labels for scheduling, and GPU nodes use Nebius-managed CUDA 12 drivers via the gpu_settings.drivers_preset parameter. The 8-GPU nodes attach to their respective InfiniBand GPU clusters for distributed training.
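
To see how this looks on a live cluster, a quick check might be the following (the role label comes from the node groups above; the GPU resource name assumes the standard NVIDIA device plugin):

# Show each node's Saturn role label
kubectl get nodes -L node.saturncloud.io/role
# Confirm GPUs are advertised on a GPU node (assumes the standard nvidia.com/gpu resource)
kubectl describe node <gpu-node-name> | grep 'nvidia.com/gpu'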

Platform Layer

Saturn Cloud installs via a Kubernetes operator that manages platform components as custom resources. The saturn-helm-operator follows the standard operator pattern: it watches CRDs and reconciles Helm releases every 2 minutes. It ships as a Helm chart from oci://ghcr.io/saturncloud/charts/saturn-helm-operator-nebius.
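
The reference Terraform installs this chart for you; purely to make the moving parts concrete, a hand-rolled install would look roughly like this (the release name, namespace, and values key for the bootstrap token are assumptions, not the documented interface):

# Sketch only: normally performed by the reference Terraform
helm install saturn-helm-operator \
  oci://ghcr.io/saturncloud/charts/saturn-helm-operator-nebius \
  --namespace saturn-system --create-namespace \
  --set bootstrapToken="$SATURN_BOOTSTRAP_TOKEN"  # hypothetical values key and env var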

The core services handle user-facing functionality. Atlas is the API server and PostgreSQL-backed database that manages resources (workspaces, jobs, deployments). The auth-server issues RS256 JWT tokens for user sessions and API access. Traefik acts as the ingress controller, routing traffic to workspaces, jobs, deployments, and the Saturn UI. The ssh-proxy provides an SSH gateway that proxies IDE connections (VS Code, PyCharm, Cursor) to running workspace pods.

Infrastructure services provide cluster functionality. The cluster-autoscaler scales Nebius node groups based on pending pods. Cert-manager handles TLS certificate provisioning. Logging runs Fluent Bit for log aggregation. Monitoring deploys Prometheus for metrics collection. Network policy enforcement (Cilium) and DNS (CoreDNS) are managed by Nebius MK8S. Shared folders use Nebius’s native shared filesystem infrastructure rather than a separate NFS provisioner.

The bootstrap process works as follows: Terraform provisions the MK8S cluster and node groups, then installs the saturn-helm-operator via Helm with a short-lived bootstrap token. The operator exchanges this bootstrap token for a long-lived token and stores it in cluster secrets. It then creates custom resources for each Saturn component and reconciles those CRs into Helm releases, installing all services.
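
A couple of generic commands can confirm the bootstrap finished, without assuming any Saturn-specific resource names:

# Every reconciled component should appear as a Helm release
helm list --all-namespaces
# Discover the operator's custom resource definitions
kubectl get crds | grep -i saturn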

All compute, storage, and networking stays in your Nebius account under your IAM and VPC policies. Data never leaves your account. Saturn Cloud only accesses the Kubernetes API to manage the operator and platform components.

Operations and Customization

Failure Modes

The saturn-helm-operator reconciles every 2 minutes. If it stops:

Timeframe | Impact
0-12 hours | Existing workloads unaffected. New resources cannot pull Saturn Cloud images once registry credentials expire.
12 hours - 3 months | New workloads fail. Existing workloads continue running.
3 months+ | SSL certificate expiration makes platform inaccessible.

Saturn Cloud support receives automatic alerts when the operator stops reconciling.
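
If you want to check the operator yourself before support reaches out, something like the following works (the deployment and namespace names are assumptions based on the component name above):

# Confirm the operator deployment is running
kubectl get deployments -A | grep -i saturn-helm-operator
# Tail recent operator logs to see whether reconciliation is still happening
kubectl logs -n <operator-namespace> deploy/saturn-helm-operator --since=10m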

If Atlas (API server) goes down, the UI and API become unavailable, but running workspaces, jobs, and deployments continue executing. Production deployments can be configured with authentication disabled when deployed behind your VPN or auth layer.

Debugging

Logs are accessible via the Saturn Cloud UI (Status tab) or kubectl logs <pod-name>. Most issues are escalated directly to Saturn Cloud support, who resolve them with users. Platform engineers typically do not debug individual workspace failures.
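
For the cases where you do drop down to kubectl, the usual commands apply (pod and namespace names are placeholders):

# Locate the pod backing a workspace, job, or deployment
kubectl get pods -A | grep <resource-name>
# Current logs, and logs from the previous container if it crashed
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous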

Customization

Saturn Cloud runs your Docker containers with code from your Git repositories. You can use Saturn’s pre-built images with NVIDIA libraries or build your own from any registry. Node pool configurations are customizable via Terraform. You can deploy additional services into the same Kubernetes cluster (Prefect, Flyte, Dagster, ClickHouse, Datadog, Crowdstrike) and Saturn Cloud workloads can connect to them.
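
Bringing your own image is a standard build-and-push; a minimal sketch, with the registry hostname and repository path as placeholders for whatever registry you use:

# Build from your own Dockerfile (or on top of a Saturn base image) and push to any registry
docker build -t <registry>/<project>/custom-training:latest .
docker push <registry>/<project>/custom-training:latest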

Portability

Your workloads are standard Kubernetes pods. Resource configurations export as YAML recipes via CLI or API. All data stays in your Nebius account. If you stop using Saturn Cloud, redeploying on standard Kubernetes is straightforward: containers run without modification, Git repositories remain your source of truth, no proprietary formats to convert.

See the Operations and Customization documentation for additional details.

Installation

Both installation options use the reference Terraform and saturn-helm-operator described in the Architecture section. The Terraform is customizable (different node pool sizes, additional GPU types, network configurations), but the Saturn Cloud operator configuration must match your node pool setup. Saturn Cloud can only provision workloads on node pools that exist in your Terraform.

Option 1: Managed Installation

Saturn Cloud support runs the reference Terraform and operator deployment.

  1. Have a Nebius project with VPC and subnet configured. Note your subnet ID and project ID.

  2. Email support@saturncloud.io with:

    • Organization name
    • Nebius project ID
    • Subnet ID
    • Requirements (GPU types, region, network configuration)
  3. Provide Saturn Cloud a service account with permissions to create resources.

  4. Saturn Cloud runs the Terraform (MK8S cluster, node groups, InfiniBand GPU clusters) and deploys the operator. Takes 15-30 minutes.

  5. Receive your Saturn Cloud URL and admin credentials.

Option 2: Self-Service Installation

Run the reference Terraform and operator deployment yourself.

  1. Register:
curl -X POST https://manager.saturnenterprise.io/api/v2/customers/register \
    -H "Content-Type: application/json" \
    -d '{
      "name": "your-organization-name",
      "email": "your-email@example.com",
      "cloud": "nebius"
    }'
  2. Activate via email. You’ll receive a terraform.tfvars with a 4-hour bootstrap token.

  3. Clone and deploy the reference Terraform:

git clone https://github.com/saturncloud/saturncloud-reference-terraform.git
cd saturncloud-reference-terraform/nebius/eu-north1  # or us-central1
terraform init && terraform plan && terraform apply

The Terraform provisions the MK8S cluster, node groups, InfiniBand GPU clusters, and installs the saturn-helm-operator with your bootstrap token. The operator then deploys all Saturn Cloud components as described in the Architecture section.

  4. Verify:
export KUBECONFIG=./kubeconfig
kubectl get nodes
kubectl get pods -A

GPU nodes scale from zero and appear when users create GPU workloads.

Why Multi-Node Training is Hard

Single-node GPU training is straightforward. Multi-node distributed training is where platform complexity explodes.

The core challenge: coordinating N GPU nodes to act as a single training job. Each worker needs to know its rank, the total world size, the address of the leader node, and which other workers it can communicate with. This information must be injected before the training script starts. The workers need to land on the same InfiniBand fabric for RDMA, or your inter-node communication bottlenecks at 25 Gbps instead of 400 Gbps. NCCL needs the right environment variables to use InfiniBand instead of falling back to TCP.

When something fails (and something always fails), you need logs from all nodes, not just the one that raised the exception. The worker that OOM’d might not be the worker that logged the error. You need to correlate logs across nodes to debug.

Saturn Cloud handles this by:

  • Scheduling all workers onto the same InfiniBand fabric (Nebius GPU clusters are fabric-specific)
  • Injecting SATURN_JOB_LEADER, SATURN_JOB_RANK, and worker DNS entries before containers start
  • Configuring NCCL environment variables for InfiniBand
  • Aggregating logs from all nodes into a single view

You still handle checkpointing strategy, failure recovery semantics, and training script logic. Saturn Cloud handles the coordination infrastructure.
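
As a rough sketch of what a job entrypoint might look like on top of that coordination layer, assuming SATURN_JOB_RANK indexes the node within the job and adding a hypothetical SATURN_JOB_NUM_NODES for illustration, a torchrun launch could be wired up like this:

#!/bin/bash
# Hypothetical multi-node entrypoint: maps Saturn-injected variables onto torchrun flags.
# SATURN_JOB_LEADER and SATURN_JOB_RANK come from the platform (see above);
# SATURN_JOB_NUM_NODES and the rendezvous port are assumptions for illustration.
torchrun \
  --nnodes="${SATURN_JOB_NUM_NODES:-2}" \
  --node_rank="${SATURN_JOB_RANK}" \
  --nproc_per_node=8 \
  --master_addr="${SATURN_JOB_LEADER}" \
  --master_port=29500 \
  train.py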

Tradeoffs

You can build all of this yourself. Many teams do.

Saturn Cloud is opinionated about how workspaces, jobs, and deployments work. You don’t control the pod specs directly, can’t use custom scheduler plugins, and can’t modify the Saturn Cloud operator internals. If you need that level of control, a custom Kubernetes stack is the right choice.

Saturn Cloud doesn’t have to be your entire platform. It runs alongside whatever else you deploy. We have customers running Prefect, Dagster, and custom services in the same cluster. Saturn Cloud handles GPU workspaces and training jobs; your existing orchestration, databases, and internal tools run next to it. You can also run Saturn Cloud in one cluster and everything else in a secondary cluster if you prefer isolation.

Saturn Cloud removes the mechanical Kubernetes work required to support ML teams, but it doesn’t eliminate the need for infrastructure engineering around data, networking, compliance, and observability.

Conclusion

Nebius provides H100 and H200 GPU access without hyperscaler quota constraints. Saturn Cloud provides the platform layer so AI/ML engineers can use that infrastructure without platform engineers building and maintaining workspaces, job scheduling, and deployment infrastructure.

The installation uses standard Terraform and a Kubernetes operator. The full reference implementation is open source.

Resources