Why Do ML Workloads Demand Different Infrastructure?

The infrastructure that runs web services and the infrastructure that trains models have fundamentally different requirements. Web services are about processing many small requests quickly. Model training, on the other hand, involves executing massive matrix operations over extended periods, and a CPU's sequential processing approach often cannot complete these within a realistic timeframe.

The reason GPUs are essential for ML workloads comes down to architectural differences. CPUs are optimized for handling complex logical operations with a small number of high-performance cores. GPUs specialize in parallel processing, applying the same operation across large volumes of data simultaneously using thousands of simpler cores. Since matrix multiplication, the core operation in deep learning, is inherently parallelizable, GPUs can deliver speedups of tens to hundreds of times over CPUs.
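To see why matrix multiplication parallelizes so well, consider that each output element depends only on one row of the first matrix and one column of the second. The toy sketch below (plain Python, no GPU) makes that independence explicit; a GPU exploits exactly this structure by assigning elements or tiles to thousands of cores at once.

```python
# Toy illustration: every element of C = A @ B is an independent dot
# product of one row of A with one column of B. The comprehension below
# computes them sequentially, but nothing forces that order -- on a GPU
# all m*n entries would be computed in parallel.

def matmul_elementwise(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_elementwise(A, B))  # [[19, 22], [43, 50]]
```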

The Complexity of GPU Scheduling

Allocating GPUs to teams is far more challenging than managing CPU resources. GPUs are expensive, and leaving them idle represents significant cost waste. Yet sharing a single GPU across multiple workloads simultaneously is difficult due to memory isolation and performance interference issues.

Training and inference workloads also have very different characteristics. Training occupies large amounts of GPU memory for extended periods, while inference uses relatively little memory for short durations. Efficiently scheduling these two types of workloads on the same cluster requires strategies that go beyond simple resource allocation.

Technologies like NVIDIA's MIG (Multi-Instance GPU) and time-slicing emerged to address these challenges. They partition a single physical GPU into multiple logical GPUs or share it across time intervals to improve utilization.

GPU Management on Kubernetes

There are good reasons why Kubernetes has become the standard orchestrator for ML infrastructure. Container-based workload isolation, declarative resource management, and automatic scheduling align well with the complex requirements of ML workloads.

Using GPUs in Kubernetes requires some additional configuration. The NVIDIA GPU Operator automatically installs and manages GPU drivers, container runtimes, and device plugins. The device plugin registers GPUs as Kubernetes resources, allowing pods to request GPUs with a simple declaration like nvidia.com/gpu: 1.

apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    image: training:latest
    resources:
      limits:
        nvidia.com/gpu: 2
  nodeSelector:
    gpu-type: a100

Node selectors and affinity rules can place workloads on nodes with specific GPU types. This makes it possible to schedule large-scale training jobs that require A100s separately from inference tasks where T4s are sufficient.

Distributed Training Strategies

Distributed training becomes necessary when a model exceeds the memory capacity of a single GPU or when training time needs to be shortened. There are two broad approaches.

Data parallelism replicates the same model across multiple GPUs, splits the data so that each GPU processes a different batch, then aggregates the gradients. It is relatively straightforward to implement and applicable to most models.
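The pattern can be sketched without any framework. In the toy example below, each "replica" holds the same single weight, computes the gradient on its own shard of the batch, and the gradients are then averaged, which stands in for the all-reduce step that real data-parallel training performs. The one-parameter model and learning rate are illustrative assumptions, not DDP's API.

```python
# Framework-free sketch of data parallelism. Model: one weight w with
# per-example loss 0.5 * (w*x - y)**2, so the gradient w.r.t. w is
# (w*x - y) * x.

def shard_gradient(w, shard):
    """Mean gradient over one shard, as one replica would compute it."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.1):
    # Each replica computes its local gradient independently...
    grads = [shard_gradient(w, s) for s in shards]
    # ...then the gradients are averaged (the all-reduce step in DDP)
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[:2], batch[2:]]        # split the batch across two "GPUs"
w = data_parallel_step(0.0, shards)    # identical to a full-batch step
```

Because the shards are equal-sized, the averaged gradient matches what a single GPU would compute on the whole batch, which is why data parallelism preserves the training result while dividing the work.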

Model parallelism distributes the model itself across multiple GPUs. It is used when training very large models that cannot fit in a single GPU's memory. This approach is further divided into pipeline parallelism and tensor parallelism. The implementation complexity is high, but it has become an essential technique for LLM training.
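A minimal sketch of the pipeline flavor, with no GPU framework involved: the layers are partitioned across hypothetical "devices", each of which materializes only its own parameters, and activations are handed from one device to the next, which is the inter-GPU transfer in a real pipeline-parallel setup.

```python
# Pipeline-parallel sketch: four toy layers split across two "devices".
# Each stage runs its layers locally; between stages the activation
# would cross a device boundary in a real deployment.

def make_layer(scale, bias):
    return lambda x: scale * x + bias

device0 = [make_layer(2, 1), make_layer(3, 0)]   # layers 0-1
device1 = [make_layer(1, 5), make_layer(2, -1)]  # layers 2-3

def forward(x, stages):
    for stage in stages:        # each stage lives on one device
        for layer in stage:     # run that device's layers locally
            x = layer(x)
        # (here the activation would be sent to the next device)
    return x

print(forward(1.0, [device0, device1]))  # 27.0
```

Real pipeline parallelism additionally splits each batch into micro-batches so that the stages work concurrently instead of waiting on each other; the sketch shows only the partitioning of layers.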

Strategy                     | When to Apply                              | Complexity
Data Parallelism             | Scale batch size or reduce training time   | Low
Model Parallelism (Pipeline) | Model exceeds single GPU memory            | Medium
Model Parallelism (Tensor)   | Layer-level splitting of very large models | High
Hybrid                       | Large-scale LLM training                   | Very High

Libraries such as PyTorch's DistributedDataParallel (DDP) and FSDP, along with Microsoft's DeepSpeed, abstract the implementation of distributed training, so engineers rarely need to manage communication primitives like all-reduce directly.

Cost Optimization

GPU infrastructure is expensive. The hourly cloud cost of a single A100 can be tens of times higher than a standard CPU instance, making cost optimization not merely a matter of savings but a question of whether a project is financially viable at all.

Spot instances (or preemptible instances) use idle GPU capacity at discounted prices. They can reduce costs by 60-90%, but come with the constraint that the instance can be reclaimed at any time. Fault-tolerant design is essential: periodically saving training checkpoints and automatically resuming on a new instance when the current one is reclaimed.
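The checkpoint-and-resume loop can be sketched in a few lines. The file name, JSON layout, and checkpoint interval below are illustrative assumptions rather than any framework's convention; the essential points are writing state atomically and restarting from the latest saved step instead of step zero.

```python
# Minimal checkpoint/resume sketch for preemptible instances.
import json
import os

CKPT = "checkpoint.json"

def save_checkpoint(step, state, path=CKPT):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)            # atomic rename: no torn checkpoints

def load_checkpoint(path=CKPT):
    if not os.path.exists(path):
        return 0, {"loss": None}     # fresh start
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps, ckpt_every=5):
    step, state = load_checkpoint()  # resume if a checkpoint exists
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step   # placeholder for a real train step
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return step

final_step = train(12)               # checkpoints written at steps 5 and 10
```

If the instance is reclaimed mid-run, the replacement simply calls train() again and continues from the last checkpoint, losing at most ckpt_every steps of work.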

Auto-scaling dynamically expands and contracts GPU nodes based on workload demand. Using Kubernetes Cluster Autoscaler or Karpenter, GPU nodes are automatically added when training jobs queue up and removed when jobs complete, minimizing costs. Because GPU nodes take longer to provision, combining predictive scaling with reactive scaling is effective at reducing delays.
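The reactive half of that loop boils down to a simple rule, which tools like Cluster Autoscaler and Karpenter implement in far more sophisticated form. The thresholds and the one-node-per-decision pace in this sketch are illustrative assumptions.

```python
# Toy reactive-scaling rule: add GPU nodes while jobs are queued,
# remove nodes once capacity sits idle, otherwise hold steady.

def desired_nodes(current, queued_jobs, idle_nodes,
                  min_nodes=0, max_nodes=8):
    if queued_jobs > 0:              # work is waiting: scale out
        return min(current + 1, max_nodes)
    if idle_nodes > 0:               # paying for idle GPUs: scale in
        return max(current - 1, min_nodes)
    return current                   # fully utilized, no queue: hold

print(desired_nodes(current=3, queued_jobs=2, idle_nodes=0))  # 4
print(desired_nodes(current=3, queued_jobs=0, idle_nodes=1))  # 2
```

A predictive layer would pre-warm nodes ahead of anticipated demand, so the slow GPU provisioning happens before jobs queue up rather than after.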

Choosing Between Cloud and On-Premise

Whether to run GPU infrastructure in the cloud or build it on-premise depends on the organization's scale and workload characteristics.

Cloud offers the advantage of using GPUs on demand without upfront investment and provides immediate access to the latest hardware. For early-stage teams with intermittent workloads or unpredictable scale, the cloud is a rational choice. Conversely, organizations with sustained GPU usage above a certain threshold may find on-premise more cost-effective in the long run. However, this comes with the additional engineering burden of hardware maintenance, driver management, and physical infrastructure operations.

This is why many organizations adopt a hybrid approach: securing baseline GPU capacity on-premise while handling peak demand in the cloud. Kubernetes becomes the key technology that enables this hybrid configuration.

In the next post, we'll look at building an MLOps platform.