GPU Fractionalization

How to fractionalize GPUs in Saturn Cloud

Introduction

GPU fractionalization is a powerful technique that allows multiple workloads to share a single GPU, maximizing hardware utilization and reducing costs. In many scenarios, a single workload may not fully utilize a GPU’s compute capacity, leading to expensive idle resources. By fractionalizing GPUs, organizations can run multiple smaller workloads on the same physical GPU, improving efficiency and enabling more developers and data scientists to access GPU resources simultaneously. This is particularly valuable for inference workloads, development environments, and training smaller models that don’t require an entire GPU’s resources.

How Saturn Cloud Manages GPU Fractionalization

Saturn Cloud implements GPU fractionalization through NOS, an open-source Kubernetes module that enables efficient GPU sharing through dynamic GPU partitioning. NOS works by intelligently partitioning GPUs based on workload demands, similar to how Kubernetes Cluster Autoscaler manages nodes, but at the GPU resource level.

Key Features of NOS in Saturn Cloud

Dynamic Partitioning: NOS automatically adjusts GPU partitioning based on pending workloads, maximizing the number of pods that can be scheduled
Multi-Instance GPU (MIG): Saturn Cloud uses NOS with MIG configuration, providing hardware-level isolation between GPU partitions for maximum security and performance isolation
Safety First: Never disrupts running workloads - only creates new GPU partitions when resources are available
Intelligent Scheduling: Uses an internal scheduler to simulate different partitioning configurations and selects the optimal one

This architecture ensures that Saturn Cloud users can efficiently share GPU resources across multiple workloads while maintaining proper isolation and resource guarantees through MIG’s hardware-level partitioning, dramatically improving GPU utilization and reducing costs.

How GPU Fractionalization Works with NOS

Understanding Multi-Instance GPU (MIG)

Multi-Instance GPU (MIG) is NVIDIA’s hardware-level GPU partitioning technology that allows a single physical GPU to be divided into multiple smaller, isolated GPU instances. Each MIG instance has dedicated memory, cache, and streaming multiprocessors, providing true hardware isolation between workloads.

Benefits of MIG

Hardware Isolation: Each MIG instance is fully isolated at the hardware level, preventing interference between workloads
Predictable Performance: Dedicated resources ensure consistent performance without noisy neighbor effects
Quality of Service: Guaranteed memory bandwidth and compute resources for each partition
Error Isolation: Failures in one MIG instance don’t affect others running on the same GPU
Security: Hardware-level separation provides strong security boundaries between different workloads

Supported GPU Models

MIG is available on select NVIDIA data center GPUs designed specifically for multi-tenant environments:

A100: Up to 7 MIG instances per GPU (Ampere architecture)
A30: Up to 4 MIG instances per GPU (Ampere architecture)
H100: Up to 7 MIG instances per GPU with enhanced capabilities (Hopper architecture)
H200: Up to 7 MIG instances with 141GB HBM3e memory (Hopper architecture)
B200: Up to 7 MIG instances with 180GB HBM3e memory (Blackwell architecture)

MIG Profiles and Configurations

MIG instances are created using predefined profiles that specify the compute and memory resources. Common profiles include:

1g.10gb: 1/7 of GPU compute, 10GB memory - ideal for inference workloads
2g.20gb: 2/7 of GPU compute, 20GB memory - suitable for medium-sized models
3g.40gb: 3/7 of GPU compute, 40GB memory - for larger training jobs
4g.40gb: 4/7 of GPU compute, 40GB memory - balanced compute/memory ratio
7g.80gb: Full GPU (A100 80GB) - when you need maximum resources

The exact profiles available depend on your GPU model and memory configuration. Saturn Cloud and NOS work together to automatically select the optimal MIG configuration based on your workload requirements.

MIG Mode and GPU Reset Requirements

Understanding when GPU resets are required is important for planning maintenance windows:

Enabling MIG Mode: Requires a GPU reset on Ampere GPUs (A100, A30). Hopper and newer GPUs (H100, H200, B200) enable MIG mode without requiring a reset.
Disabling MIG Mode: Always requires a system reboot regardless of GPU generation
Reconfiguring MIG Profiles: Once MIG mode is enabled, creating, destroying, or reconfiguring MIG instances is dynamic and does not require any GPU reset or system reboot

This dynamic reconfiguration capability allows Saturn Cloud to adjust GPU partitioning throughout the day based on workload demands - for example, using many small instances for inference during business hours and consolidating to larger instances for training overnight.

The NOS Architecture

NOS operates as a Kubernetes-native solution with two main components that work together to manage GPU fractionalization dynamically.

GPU Partitioner - The Brain

The GPU Partitioner is the central controller that makes intelligent decisions about how to partition GPUs across the cluster. It runs as a Kubernetes controller and performs several key functions:

Monitors Pending Workloads: Continuously watches for pods that are pending due to insufficient GPU resources
Batch Processing: Groups pending pods together for efficient processing, configurable via batch window timeouts
Simulation Engine: Uses the Kubernetes scheduler framework to simulate different MIG configurations and predict their outcomes
Optimization: Selects the partitioning plan that maximizes the number of pods that can be scheduled
State Management: Maintains the desired state of GPU partitioning across the cluster

The GPU Partitioner never directly modifies GPUs - instead, it determines the optimal configuration and communicates this to the MIG Agents.

MIG Agent - The Executor

The MIG Agent runs as a DaemonSet on every GPU node in the cluster. Each agent is responsible for:

MIG Profile Management: Creates and deletes MIG profiles on the physical GPUs based on instructions from the GPU Partitioner
Safety Enforcement: Ensures that MIG instances currently in use by running pods are never deleted
Status Reporting: Updates node annotations to reflect the current GPU configuration and availability
Health Monitoring: Tracks the health and status of MIG instances on the node

The MIG Agent only runs on nodes labeled with nos.nebuly.com/gpu-partitioning: mig, allowing fine-grained control over which nodes participate in GPU fractionalization.

Communication Through Kubernetes

NOS leverages Kubernetes' native mechanisms for coordination:

Node Annotations: Used to communicate GPU status and availability
- Example: nos.nebuly.com/status-gpu-0-1g.10gb-free: 3 indicates 3 free 1g.10gb instances on GPU 0
- Example: nos.nebuly.com/status-gpu-0-1g.10gb-used: 2 indicates 2 used 1g.10gb instances
Custom Resources: MIG partitioning plans are stored as Kubernetes custom resources
Label Selectors: Nodes opt into GPU fractionalization through specific labels
Event-Driven: Changes in pod scheduling or GPU status trigger immediate re-evaluation

This architecture ensures that NOS integrates seamlessly with Kubernetes while maintaining reliability and scalability across large GPU clusters.

NOS and Kubernetes Scheduling

NOS integrates with the standard Kubernetes scheduler rather than replacing it:

Simulation Mode: When the GPU Partitioner needs to test different MIG configurations, it uses the Kubernetes scheduler framework to simulate pod placement
Standard Scheduling: Once MIG instances are created, pods are scheduled using the standard Kubernetes scheduler with the newly available MIG resources
No Custom Scheduler Required: Workloads continue to use the default scheduler - NOS simply ensures the right GPU resources are available

This design keeps the system simple and compatible with existing Kubernetes deployments.

How GPU Fractionalization Works in Practice

To understand how NOS manages GPU fractionalization, let’s follow the journey of a typical workload from submission to execution.

Detecting GPU Demand

When you submit a workload requesting a fraction of a GPU (such as nvidia.com/mig-1g.10gb), the GPU Partitioner detects this request. Rather than processing each request individually, NOS batches pending workloads over a configurable time window. This batching approach allows for better GPU partitioning decisions by considering multiple workloads together.

The Simulation Phase

Once NOS has collected a batch of pending workloads, it enters the planning phase. The GPU Partitioner takes a snapshot of your cluster’s current state - which nodes have GPUs, how they’re currently partitioned, and what workloads are already running.

Using the Kubernetes scheduler framework, NOS runs multiple simulations to determine what MIG configuration would allow the most pending pods to be scheduled. It tests various scenarios - splitting one A100 into seven small instances for inference workloads, or keeping another as a single large instance for a training job. The optimization algorithm evaluates each potential configuration based on scheduling success, resource utilization, and minimal disruption to existing workloads.

Dynamic Reconfiguration

With a plan selected, NOS executes the GPU reconfiguration. The GPU Partitioner communicates the plan to the MIG Agents running on each affected node. These agents then create new MIG profiles as specified. On modern GPUs like the H100 and H200, this reconfiguration happens without any GPU reset - workloads continue running on unaffected MIG instances while new ones are being created.

Once the MIG Agents update the node annotations to reflect the new GPU capacity, the Kubernetes scheduler sees these resources and begins placing the pending pods. The process from detection to execution typically completes in seconds.

Working with Fractionalized GPUs

Now that we understand how NOS creates GPU fractions, let’s explore how to actually use them in your workloads.

Requesting the Right GPU Fraction

When deploying a workload on Saturn Cloud, you specify your GPU requirements using standard Kubernetes resource requests. For example, if you’re running an inference service that needs just a small GPU slice:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # Requests one 1g.10gb MIG instance

The naming convention follows a predictable pattern: nvidia.com/mig-<compute>g.<memory>gb. This makes it easy to understand exactly what resources you’re getting - 1g.10gb means 1/7 of the GPU’s compute power and 10GB of memory.

What Happens Behind the Scenes

When your pod requests a specific MIG instance, the Kubernetes scheduler becomes very particular about placement. It will only consider nodes that have exactly the MIG profile you requested available. This exact matching ensures predictable performance - a pod requesting 1g.10gb won’t accidentally end up on a larger 2g.20gb instance where it might not utilize all the resources efficiently.

This strict matching might seem limiting, but it’s actually a feature. It ensures that your workload gets exactly the resources it needs for optimal performance, and helps Saturn Cloud maintain efficient cluster utilization by preventing resource waste.

Ensuring Reliability and Safety

NOS manages GPU fractionalization without disrupting running workloads through several reliability mechanisms.

Workload Protection

NOS follows a simple rule: never disrupt a running workload. When the MIG Agent receives instructions to reconfigure a GPU, it first checks if any MIG instances are currently in use. If they are, those instances remain untouched. Your production inference service continues running while NOS creates new MIG instances on the same GPU for other workloads.

This protection extends to failure scenarios. If creating a new MIG profile fails on one GPU, NOS continues with other GPUs and retries the failed operation later. This partial failure recovery means temporary issues don’t prevent the rest of your cluster from adapting to workload demands.

Handling Failures

GPUs can fail, nodes can restart, or configuration can drift from the desired state. NOS handles these scenarios through continuous reconciliation. The system monitors the actual state of GPU partitioning and compares it to the desired state. When discrepancies are detected - such as a MIG instance disappearing after a node restart - NOS automatically recreates it.

The system includes cleanup mechanisms with grace periods. When a MIG instance is no longer needed, NOS waits before deletion, giving any pods that might be scheduling time to land. This prevents resources from being deleted just as a pod is about to use them.

Monitoring Your Fractionalized GPUs

NOS provides several ways to observe and troubleshoot GPU fractionalization.

Checking GPU Status

The most direct way to see your GPU partitioning is through node annotations. NOS continuously updates these to reflect the current state. For instance, you might see:

nos.nebuly.com/status-gpu-0-1g.10gb-free: "3"
nos.nebuly.com/status-gpu-0-1g.10gb-used: "2"

This tells you that GPU 0 has been partitioned into 1g.10gb instances, with 3 available and 2 in use. It’s a real-time view that helps you understand resource availability at a glance.

For deeper investigation, you can use standard NVIDIA tools. Running nvidia-smi mig -lgip on a node shows the exact MIG configuration, including which instances are occupied and their specific resource allocations.

Understanding Scheduling Decisions

When a pod isn’t scheduling as expected, Kubernetes events provide insights. The command kubectl describe pod <pod-name> shows the scheduler’s decision-making process, including why a pod might be pending. You might see messages indicating that no nodes have the requested MIG profile available, or that all matching instances are already in use.

NOS logs its decision-making process in detail. The GPU Partitioner logs show why it chose specific partitioning configurations, what simulations it ran, and how it optimized for pod scheduling. These logs help you understand how your cluster’s GPU partitioning evolved.

Installing NOS with Helm

For self-managed Saturn Cloud installations, you can install NOS using Helm. Here are the basic steps:

Prerequisites

NVIDIA GPU Operator installed
GPUs that support MIG (A100, A30, H100, H200, B200)

Installation Steps

Install NOS directly from the OCI registry:

helm install oci://ghcr.io/nebuly-ai/helm-charts/nos \
  --version 0.1.2 \
  --namespace nebuly-nos \
  --generate-name \
  --create-namespace
  --set gpuPartitioner.devicePlugin.config.namespace=nvidia-gpu-operator

Label GPU nodes to enable MIG partitioning:

kubectl label node <gpu-node-name> nos.nebuly.com/gpu-partitioning=mig

Ensure that your Terraform for your k8s cluster applies that node label for newly create nodes.

Managed Saturn Cloud Installations

Most Saturn Cloud deployments are fully managed. If you have a managed installation and would like to enable GPU fractionalization, contact support@saturncloud.io and we can configure it for your installation. Our team will handle the setup and optimization based on your specific workload requirements.