KV Cache Recompute vs. Offload — When Does Each Win?
When your KV cache fills up during inference serving, you have two options: recompute the evicted KV entries from the prompt, or offload them to CPU memory and swap them back in when needed.
The conventional wisdom is “offload is always better because recompute wastes FLOPs.” In practice it’s more nuanced.
When recompute wins
Short to medium prompts (under ~4k tokens), large batch sizes, and decode phases that already saturate GPU memory bandwidth. In that regime a swap-in competes with decode for the same HBM bandwidth, while a recompute spends compute that would otherwise sit idle: prefill is compute-bound and highly parallel, so although recompute cost scales with prompt length, it stays cheap on modern GPUs. The key advantage is predictable latency. You know exactly how long a recompute will take.
Offload latency depends on PCIe bandwidth and CPU memory pressure, both of which can be noisy in a multi-tenant cluster.
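A back-of-envelope model makes "the compute is fast" concrete. The sketch below estimates prefill time as roughly 2 × parameter-count FLOPs per token; the model size, FLOP rate, and utilization figure are illustrative assumptions, not measurements of any particular deployment.

```python
# Back-of-envelope recompute latency: a prefill pass over the evicted prompt
# costs roughly 2 * n_params FLOPs per token (one multiply-add per weight).
# All hardware numbers below are assumptions for illustration.

N_PARAMS = 7e9        # assumed: a 7B-parameter model
GPU_FLOPS = 1.0e15    # assumed: ~1 PFLOP/s dense BF16 (H100-class peak)
MFU = 0.5             # assumed: 50% model FLOPs utilization during prefill

def recompute_seconds(prompt_tokens: int) -> float:
    """Estimated wall time to recompute evicted KV entries via a prefill pass."""
    flops = 2 * N_PARAMS * prompt_tokens
    return flops / (GPU_FLOPS * MFU)

for n in (2_000, 4_000, 8_000):
    print(f"{n} tokens: ~{recompute_seconds(n) * 1e3:.0f} ms")
```

Under these assumptions a 4k-token recompute lands around 100 ms, and the estimate has no dependence on PCIe or CPU memory state, which is the predictability argument above.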
When offload wins
Long prompts (>8k tokens), where recompute is genuinely expensive: beyond the linear prefill cost, attention FLOPs grow quadratically with prompt length, while transfer time stays linear in the number of tokens. Offload also wins when you have dedicated PCIe bandwidth and aren't sharing the bus with other I/O. On H200s, the larger HBM capacity and higher memory bandwidth make offload more attractive: swapped-in KV writes contend less with decode traffic, so you can overlap the transfer with compute more effectively.
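The offload side of the ledger is the KV footprint divided by bus bandwidth. The sketch below sizes the cache and the swap-in time for a Llama-7B-shaped model; the layer geometry and the effective PCIe rate are assumptions for illustration, not measured figures.

```python
# KV cache footprint and PCIe swap-in time for a 7B-shaped model.
# Model geometry and bandwidth numbers are assumptions for illustration.

N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128  # assumed model geometry
DTYPE_BYTES = 2                               # fp16/bf16
PCIE_BPS = 25e9                               # assumed: ~25 GB/s effective, Gen4 x16

# Factor of 2 covers both the K and the V tensor per token.
KV_BYTES_PER_TOKEN = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * DTYPE_BYTES

def swap_in_seconds(prompt_tokens: int) -> float:
    """Estimated time to copy evicted KV entries back over PCIe."""
    return prompt_tokens * KV_BYTES_PER_TOKEN / PCIE_BPS

print(f"KV per token: {KV_BYTES_PER_TOKEN / 2**20:.2f} MiB")
for n in (8_000, 32_000):
    gib = n * KV_BYTES_PER_TOKEN / 2**30
    print(f"{n} tokens: {gib:.1f} GiB, ~{swap_in_seconds(n) * 1e3:.0f} ms transfer")
```

At half a mebibyte of KV per token, an 8k prompt is already a multi-gigabyte transfer, which is why dedicated, uncontended PCIe bandwidth matters so much for this path.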
The practical answer
For serving workloads, I default to recompute and only switch to offload after benchmarking the specific model and prompt distribution. Recompute is simpler operationally — fewer failure modes, easier to reason about tail latency.
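The default-to-recompute policy can be written down as a small decision rule over benchmark data. Everything here is a sketch: the percentile choice and the safety margin are illustrative policy knobs, not values from any serving framework.

```python
# Benchmark-then-decide: stay on recompute unless offload beats it at the
# tail by a clear margin. Margin and percentile are illustrative choices.

def p99(samples_ms):
    """Crude empirical 99th percentile of latency samples (milliseconds)."""
    ordered = sorted(samples_ms)
    return ordered[min(len(ordered) - 1, int(round(0.99 * (len(ordered) - 1))))]

def pick_strategy(recompute_ms, offload_ms, margin=1.2):
    """Default to recompute; switch only if offload's tail latency wins
    even after a safety margin for PCIe noise in multi-tenant clusters."""
    if p99(offload_ms) * margin < p99(recompute_ms):
        return "offload"
    return "recompute"
```

Feeding this per-request latencies from a load test against the real model and prompt distribution keeps the decision tied to measurements rather than to an analytic model.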