
Multi-Node vLLM on K8s: Every Mistake I Made So You Don't Have To

I’ve now deployed vLLM in multi-node configurations enough times to have a mental checklist of things that go wrong. This is that checklist.

StatefulSet, not Deployment

You need stable pod identities for multi-node inference. Use a StatefulSet with a headless service. Each pod needs to discover the others by a stable DNS name, and Deployments don’t give you that — Deployment pods get random hashed names that change on every restart.
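A minimal sketch of the pairing (the names `vllm` and `vllm/vllm-openai` are illustrative; the key lines are `clusterIP: None` and a `serviceName` that matches the headless Service):

```yaml
# Headless Service: gives each StatefulSet pod a stable DNS name like
# vllm-0.vllm.<namespace>.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: vllm
spec:
  clusterIP: None          # headless: DNS resolves to individual pod IPs
  selector:
    app: vllm
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm
spec:
  serviceName: vllm        # must match the headless Service name
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 8
```

With this in place, pod 0 is always reachable at `vllm-0.vllm`, which is what you'll point the other ranks at.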

NVIDIA driver parameters matter

Two that will waste your time if you forget them:

  • NVreg_EnableStreamMemOPs=1
  • PeerMappingOverride=1 (a registry dword, set via NVreg_RegistryDwords rather than as a standalone parameter)

These need to be set as module parameters on the host, not in the container. If you’re on a managed Kubernetes cluster, this means applying them with a privileged DaemonSet or via node configuration at provisioning time — and either way, the driver has to be reloaded (or the node rebooted) for them to take effect.
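On the host, that usually looks like a modprobe options file. A sketch (the filename is a convention, not a requirement):

```
# /etc/modprobe.d/nvidia.conf on each GPU node -- host-level, not in the container.
# Note PeerMappingOverride goes inside NVreg_RegistryDwords.
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"
```

Verify after reload with `cat /proc/driver/nvidia/params` — if the values aren’t there, the container will happily run anyway and you’ll only find out when peer mappings fail at startup.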

EP+DP configurations

For large MoE models like GLM-5-FP8, you probably want expert parallelism (EP) combined with data parallelism (DP). The vLLM configuration for this is not super intuitive — you’re essentially running multiple API server instances behind a load balancer, each handling a subset of the DP ranks.
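Roughly, it looks like the sketch below, using vLLM’s data-parallel flags (flag names match recent vLLM releases, but check your version; `$MODEL`, the rank counts, and the port are illustrative, and `vllm-0.vllm` is the headless-service DNS name of pod 0):

```shell
# Node 0: runs the API server plus its local DP ranks.
vllm serve $MODEL \
  --enable-expert-parallel \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address vllm-0.vllm \
  --data-parallel-rpc-port 13345

# Node 1: headless engine ranks that join the same DP group.
vllm serve $MODEL \
  --headless \
  --enable-expert-parallel \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address vllm-0.vllm \
  --data-parallel-rpc-port 13345
```

The StatefulSet ordinal maps cleanly onto `--data-parallel-start-rank`, which is exactly why the stable pod identities above matter.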

The preemption trap

vLLM’s preemption policy when KV cache fills up can cause surprising latency spikes. For serving, I now default to recompute over swap unless the prompt lengths are very long (>8k tokens). The recompute cost is more predictable than the I/O cost of swapping to CPU memory.
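Concretely, that choice is one flag (a sketch; `$MODEL` is a placeholder, and the swap-space value is illustrative):

```shell
# Prefer recomputation over CPU swap when the KV cache fills up.
vllm serve $MODEL --preemption-mode recompute

# For very long prompts, swapping can win -- but give it pinned CPU room:
#   vllm serve $MODEL --preemption-mode swap --swap-space 16
```

If you do go the swap route, watch inter-token latency under load before and after: the swap I/O cost shows up as tail-latency spikes rather than a shift in the median.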

Debugging CUDA graph capture

If you’re using a naive MoE path (not the fused kernel), CUDA graph capture can fail silently or produce wrong results. The symptom is usually garbage output at high batch sizes. Check that your MoE layer is compatible with graph capture before enabling it.
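The cheapest A/B test is to disable graph capture entirely and see whether the symptom disappears (`$MODEL` is a placeholder):

```shell
# Eager mode: skips CUDA graph capture altogether.
vllm serve $MODEL --enforce-eager

# If garbage output at high batch sizes goes away in eager mode,
# suspect the MoE path's interaction with graph capture.
```

Eager mode costs some throughput, so treat it as a diagnostic, not a fix — but it turns a silent correctness bug into a binary signal in one run.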