Weight Caching + GPU Snapshot Recipe for Sub-Second Cold Starts with vLLM + Modal Volume
Anyone who has operated an LLM inference service in a serverless GPU environment knows how painful cold start problems can be. I was genuinely caught off guard when I first deployed vLLM in a serverless environment. On that first deployment, cold starts easily exceeded 4 minutes, and it took considerable trial and error to get them down to 0.8 seconds. Downloading 14GB of weights from HuggingFace, running torch.compile, waiting for CUDA graphs to initialize — from the user's perspective, it's just a "slow service."
In 2025, the pieces needed to solve this problem properly fell into place. With NVIDIA officially providing a GPU memory serialization API in the driver 570/575 branch, serverless platforms like Modal have been able to build memory snapshot features on top of it. By caching weights and compiled artifacts in a Modal Volume and applying GPU memory snapshots, reducing cold starts to under one second is genuinely achievable. This post goes beyond a simple "save to Volume" approach: it combines three caching layers, with real code and the tradeoffs at each stage.
By the end of this post, you'll have a structural understanding of why cold starts are slow, plus hands-on code to apply each caching strategy in a Modal + vLLM environment. To run the example code, you'll need a Modal account and the CLI: install with `pip install modal`, authenticate with `modal setup`, and you're ready to follow along.
Core Concepts
Cold Starts Are Actually Slow Three Times Over
If you lump cold starts together as "just model loading time," it's hard to find the right solution. In reality, three stages pile up sequentially.
| Stage | Time Before Caching | Time After Caching |
|---|---|---|
| Model weight download (HuggingFace) | Several minutes | ~Tens of seconds (Volume load) |
| torch.compile / CUDA graph compilation | 1–5 minutes | ~10 seconds (artifact cache) |
| GPU memory load | ~Tens of seconds | < 1 second (GPU snapshot) |
The key is tackling each stage one by one. Applying all three stages produces a cumulative effect that brings multi-minute cold starts down to under one second.
Why Does vLLM Take So Long to Initialize?
vLLM manages the KV cache like OS virtual memory using the PagedAttention algorithm. To squeeze out maximum throughput, the initialization phase profiles GPU memory to size the KV cache, pre-captures CUDA graphs, and JIT-compiles Triton kernels with torch.compile.
PagedAttention: A technique that manages the KV (Key-Value) cache for attention operations in page-sized units rather than contiguous memory blocks. Inspired by OS virtual memory paging, it reduces GPU memory fragmentation and enables more concurrent requests to be processed simultaneously.
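To make the page-table analogy concrete, here's a toy allocator (not vLLM's actual implementation) showing the core idea: each sequence's KV cache lives in fixed-size blocks scattered anywhere in a shared pool, and a per-sequence block table maps logical token positions to physical blocks.

```python
class PagedKVAllocator:
    """Toy sketch of PagedAttention's block-table idea.

    Fixed-size blocks are drawn from a shared free pool; a per-sequence
    table records which physical blocks hold that sequence's KV cache."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.free = list(range(num_blocks))   # shared pool of physical blocks
        self.block_size = block_size
        self.tables: dict[int, list[int]] = {}  # seq_id -> block table

    def append_token(self, seq_id: int, pos: int):
        """Place the KV entry for token `pos`; returns (physical block, slot)."""
        table = self.tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:        # current block full -> grab a new one
            table.append(self.free.pop())
        return table[-1], pos % self.block_size

    def release(self, seq_id: int):
        """Free a finished sequence's blocks back to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
```

Because blocks are freed back to one shared pool at sequence end, there is no per-sequence contiguous reservation to fragment, which is what lets more concurrent requests fit in the same GPU memory.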
The reason Triton kernels and CUDA graphs must be compiled every time is that the optimized binary differs per GPU architecture and driver version. In other words, as long as the environment stays the same, these outputs can be cached and reused. This repeated work on every cold start is the root of the problem — and conversely, if you store the outputs, subsequent boots can skip it entirely.
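As a sketch of what "cache keyed by environment" can look like, the helper below (hypothetical names, not part of vLLM or Modal) derives a cache namespace from the GPU name, driver, and CUDA version, so artifacts built in one environment never get silently reused in another. In a real deployment you'd read these values from `torch.cuda.get_device_name()`, the NVIDIA driver, and `torch.version.cuda`.

```python
import hashlib

def compile_cache_key(gpu_name: str, driver: str, cuda: str) -> str:
    """Hash the environment facts that make compiled artifacts non-portable."""
    raw = f"{gpu_name}|{driver}|{cuda}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]

def cache_dir_for(base: str, gpu_name: str, driver: str, cuda: str) -> str:
    """Namespace the compile cache per environment; a driver bump lands in a
    fresh directory instead of reusing stale kernels."""
    return f"{base}/{compile_cache_key(gpu_name, driver, cuda)}"
```

With this layout, "invalidation" on an environment change is automatic: the new environment simply hashes to an empty directory and recompiles once.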
Modal Volume — Simply Put
Simply put, it's a shared disk that survives container shutdown. That's all. Read speeds reach 1–2 GB/s — roughly half the 3–7 GB/s you'd see from a local NVMe SSD. You might ask "isn't that slow?" — but it's far faster than pulling from HuggingFace over the network (hundreds of MB/s or less), and multiple containers can read the same Volume simultaneously, making it efficient even in autoscaling scenarios.
GPU Memory Snapshots — The Game Changer
Modal's GPU Memory Snapshots, which appeared in 2025, leverage NVIDIA's CUDA Checkpoint/Restore API (driver 570/575 or later). Immediately after initialization is complete, the container's entire CPU + GPU memory is serialized to disk, and subsequent boots skip initialization and perform only deserialization.
CUDA Checkpoint/Restore API: A GPU memory serialization/deserialization API officially provided by NVIDIA starting with the driver 570/575 branch. It preserves all loaded kernels, CUDA graphs, and weight state.
vLLM Sleep Mode — A Lifesaver for Multi-Model Serving
vLLM Sleep Mode, released in October 2025, is a feature that frees GPU memory while keeping the server process alive. Honestly, when I first saw this feature, my first thought was "why did it take so long to build this?"
- Level 1: Offload weights to CPU RAM → fast wake-up (0.1–0.8 seconds)
- Level 2: Fully discard weights → minimal RAM usage, slightly slower wake-up
How much faster the Sleep → Wake transition is compared to a cold start varies greatly by conditions. At Level 1 (CPU offload) it's 18–30x faster, and even at Level 2 with weights already cached in a Volume, it's many times faster. The "200x" figure is a comparison against a cold start that includes the initial HuggingFace download. This is especially useful in multi-model architectures where you're switching between and serving multiple models.
Practical Application
Example 1: Setting Up a Three-Layer Caching Pipeline with Modal Volume
This is the most fundamental configuration. The key is separating the HuggingFace weight cache and torch.compile artifact cache into distinct Volumes. I once had them combined and ran into tangled artifact invalidation management, so I strongly recommend keeping them separate.
```python
import modal

app = modal.App("vllm-cached-inference")

# Separate Volume for weights and Volume for compiled artifacts
hf_volume = modal.Volume.from_name("hf-model-cache", create_if_missing=True)
vllm_volume = modal.Volume.from_name("vllm-compile-cache", create_if_missing=True)

image = (
    modal.Image.debian_slim()
    .pip_install("vllm", "hf_transfer")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})  # Enable high-speed downloads
)

@app.cls(
    gpu="A100",
    image=image,
    volumes={
        "/root/.cache/huggingface": hf_volume,  # HF weight cache
        "/root/.cache/vllm": vllm_volume,       # torch.compile artifact cache
    },
)
class VLLMServer:
    @modal.enter()
    def load_model(self):
        from vllm import LLM

        self.llm = LLM(
            model="meta-llama/Llama-3.1-8B-Instruct",
            enable_prefix_caching=True,  # Enable Automatic Prefix Caching
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        from vllm import SamplingParams

        outputs = self.llm.generate(prompt, SamplingParams(max_tokens=256))
        return outputs[0].outputs[0].text
```

| Code Point | Description |
|---|---|
| `hf_transfer` + env var | Pushes HuggingFace download speed up to the network bandwidth limit |
| `volumes` dictionary | Mounts container paths to Volumes. Skips downloads on restart |
| `enable_prefix_caching=True` | Reuses KV cache for identical prompt prefixes (Automatic Prefix Caching) |
This configuration alone shortens cold starts from minutes to tens of seconds after the first boot. As a bonus, you're also shielded from HuggingFace outages.
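You can also take the first boot out of the critical path entirely by pre-seeding the weight Volume before deploying. The sketch below is a hypothetical helper: `download_fn` would be something like `huggingface_hub.snapshot_download`, and inside a Modal function you'd call `hf_volume.commit()` after it returns. The point of the sketch is idempotence, so re-running a seed job never re-downloads.

```python
import os

def ensure_weights(cache_dir: str, repo_id: str, download_fn) -> bool:
    """Download weights into the Volume-backed cache only when absent.

    Returns True if a download was triggered, False on a cache hit.
    `download_fn(repo_id, target_dir)` is an injected downloader, e.g.
    huggingface_hub.snapshot_download with local_dir set."""
    target = os.path.join(cache_dir, repo_id.replace("/", "--"))
    marker = os.path.join(target, ".complete")
    if os.path.exists(marker):
        return False                      # weights already seeded in the Volume
    os.makedirs(target, exist_ok=True)
    download_fn(repo_id, target)          # fetch the full snapshot
    open(marker, "w").close()             # mark the seed as complete
    return True
```

The `.complete` marker guards against a half-finished download being mistaken for a cache hit after a crashed seed job.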
If you need torch.compile batch size optimization, you can extend Example 1 with the following config file:
```yaml
# compile.yaml — pre-compiles kernels optimized for static batch sizes
compilation:
  compile_sizes: [1, 2, 4, 8]
```

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --compilation-config compile.yaml
```

Without caching, compilation takes 1–5 minutes; with Volume caching, it drops to ~10 seconds. In an autoscaling environment where multiple instances share the same Volume, the compilation cost for each instance after the first effectively reaches zero.
Key limitation of this approach: torch.compile artifacts can only be reused in environments with identical GPU architecture, driver, and CUDA version. Changing the instance type or driver version requires invalidating the entire cache.
Example 2: Achieving Sub-Second Cold Starts with GPU Memory Snapshots
This is the pattern best exemplified by Modal's LFM2 deployment case. The two key lines are enable_memory_snapshot=True and @modal.enter(snap=True).
```python
@app.cls(
    gpu="H100",
    image=image,
    volumes={"/root/.cache/huggingface": hf_volume},
    enable_memory_snapshot=True,  # Enable CPU + GPU memory snapshots
)
class SnapshotVLLMServer:
    @modal.enter(snap=True)  # Designates this method's result as the snapshot target
    def load_model(self):
        from vllm import LLM

        self.llm = LLM(
            model="liquid-ai/LFM2-1B",
            # Modal layer option — Modal integration setting, not an official vLLM parameter
            experimental_options={"enable_gpu_snapshot": True},
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        from vllm import SamplingParams

        outputs = self.llm.generate(prompt, SamplingParams(max_tokens=256))
        return outputs[0].outputs[0].text
```

The boot flow can be summarized as follows:
| Boot Order | Work Performed | Time Required |
|---|---|---|
| First boot | Model initialization → save CPU + GPU memory snapshot | Tens of seconds (one-time) |
| Subsequent boots | Snapshot deserialization only | < 1 second |
The load_model marked with snap=True runs only once. When the container comes up afterward, load_model itself is not called — instead, the CPU + GPU memory image from that point is restored directly. No torch.compile recompilation, no CUDA graph re-initialization. During debugging, you might be confused by "I clearly reloaded the model, so why isn't the update showing?" — it's because load_model doesn't execute as long as a snapshot exists.
Caution: GPU snapshots are an alpha feature as of 2025. NVIDIA driver 570/575 or later is required, and support is not guaranteed for all models and environments. Thorough testing is recommended before production use.
Key limitation of this approach: If you upgrade the model version or vLLM, the snapshot must be regenerated. Deploying without regenerating leads to a subtle bug where inference runs on the old weight state.
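A cheap guard against stale snapshots is to stamp each snapshot with the versions it was built from and compare at deploy time. Below is a minimal sketch with illustrative names; the stamp file would live somewhere durable, such as a Volume, and a mismatch should fail the deploy or trigger regeneration.

```python
import json
import os

def write_stamp(stamp_path: str, model_rev: str, vllm_version: str) -> None:
    """Record which model revision and vLLM version the snapshot was built from."""
    with open(stamp_path, "w") as f:
        json.dump({"model_rev": model_rev, "vllm": vllm_version}, f)

def snapshot_is_current(stamp_path: str, model_rev: str, vllm_version: str) -> bool:
    """True only if a stamp exists and matches the versions being deployed."""
    if not os.path.exists(stamp_path):
        return False
    with open(stamp_path) as f:
        stamp = json.load(f)
    return stamp == {"model_rev": model_rev, "vllm": vllm_version}
```

Wiring `snapshot_is_current` into the deployment pipeline turns the "inference silently runs on old weights" bug into a loud, early failure.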
Example 3: Configuring Multi-Model Serving with vLLM Sleep Mode
Sleep Mode is useful when you need to switch between and serve multiple models on a single GPU instance. This is a situation you encounter frequently in practice — spinning up a separate container per model makes costs balloon as the model count grows.
The code below is a sketch aligned with vLLM's Sleep Mode dev endpoints. `/sleep` and `/wake_up` do not take a model parameter; model-switching logic must be handled in a separate orchestration layer.

```python
import requests

# Start the server with Sleep Mode enabled and the dev endpoints exposed:
#   VLLM_SERVER_DEV_MODE=1 vllm serve ... --enable-sleep-mode

def sleep_current_model(server_url: str):
    # Offload the current model to CPU (Level 1 keeps wake-up fast);
    # the sleep level is passed as a query parameter
    requests.post(f"{server_url}/sleep", params={"level": 1})

def wake_model(server_url: str):
    # Restore GPU memory (0.1–0.8 seconds for Level 1)
    requests.post(f"{server_url}/wake_up")

# Sleep → Wake transition characteristics:
# - Level 1 (CPU offload): 0.1–0.8 seconds, requires sufficient CPU RAM
# - Level 2 (full discard): minimizes RAM, relatively slower wake-up
```

Level 1 offloads weights to CPU RAM, so you need enough CPU RAM to hold that model's weights. For a 70B model, the RAM requirement is substantial, so check your instance specs ahead of time.
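The orchestration layer that handles model switching can be quite small. Below is a minimal sketch, assuming one vLLM server per model and vLLM's `/sleep` and `/wake_up` dev endpoints; the class name and structure are illustrative, not a real library API.

```python
import requests

class SleepModeRouter:
    """Hypothetical orchestrator: tracks which server is awake and swaps on demand."""

    def __init__(self, servers: dict[str, str]):
        self.servers = servers          # model name -> server base URL
        self.active: str | None = None  # model currently holding GPU memory

    def activate(self, model: str) -> str:
        """Wake the requested model, sleeping the current one first.

        Returns the base URL to route inference requests to."""
        if model == self.active:
            return self.servers[model]  # already awake, nothing to do
        if self.active is not None:
            # Level 1 sleep keeps the outgoing model's wake-up fast
            requests.post(f"{self.servers[self.active]}/sleep", params={"level": 1})
        requests.post(f"{self.servers[model]}/wake_up")
        self.active = model
        return self.servers[model]
```

Because only one model is awake at a time, total GPU memory stays bounded by the largest model, while CPU RAM must hold the sleeping models' weights at Level 1.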
Pros and Cons Analysis
Boiled down to what actually burned me in production, the tables below cover the key points.
Advantages
| Item | Details |
|---|---|
| Network independence | Completely eliminates HuggingFace downloads. No impact from external service outages |
| High-speed Volume reads | Large models load quickly with 1–2 GB/s read speed |
| Shared compilation cost | Zero additional compilation cost by sharing torch.compile artifacts across autoscaled instances |
| Complete GPU snapshot restoration | CUDA graphs, loaded kernels, and compiled state are all restored |
| Sleep Mode transition speed | Wake-up in 0.1–0.8 seconds at Level 1. No CUDA context re-initialization required |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| GPU snapshot alpha status | Support not guaranteed for all models/environments | Maintain a fallback path that works without snapshots |
| Driver version requirement | NVIDIA driver 570/575 or later required | Verify driver version when selecting instances |
| Snapshot invalidation | Must regenerate when upgrading model version or vLLM | Include snapshot regeneration in deployment pipeline for version changes |
| Volume storage costs | 70B+ models require hundreds of GB of storage | Consider on-demand download strategy for infrequently used models |
| Level 1 Sleep RAM requirement | CPU offload mode requires sufficient CPU RAM | Pre-plan instance RAM capacity based on model size |
| Compilation cache environment dependency | Reuse requires identical GPU architecture, driver, and CUDA version | Recommend automating cache invalidation handling on environment changes |
| APC tradeoffs | Consumes KV cache space. May be inefficient for short-prompt-heavy workloads | Selectively enable enable_prefix_caching based on workload characteristics |
Automatic Prefix Caching (APC): A technique that reuses KV cache for identical prompt prefixes using hash-based block mapping. Effective for chatbot or RAG scenarios with repeated long system prompts, but in workloads that process completely different short prompts every time, cache hit rates are low and it only occupies KV cache space.
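A back-of-the-envelope way to decide whether APC will pay off is to count how many whole KV blocks your prompts actually share. The toy function below models APC's whole-block granularity (only complete blocks with identical content and identical preceding blocks can be reused); it is an illustration, not vLLM's actual hashing logic.

```python
def shared_prefix_blocks(prompts: list[list[int]], block_size: int = 16) -> int:
    """Count full KV blocks reusable across all token sequences under APC.

    Toy model: find the longest common token prefix, then keep only the
    whole blocks it spans, since partial blocks cannot be shared."""
    if not prompts:
        return 0
    common = 0
    for column in zip(*prompts):          # walk token positions in lockstep
        if all(t == column[0] for t in column):
            common += 1
        else:
            break
    return common // block_size           # APC shares only complete blocks
```

A long shared system prompt yields many shared blocks (APC wins); short, fully distinct prompts yield zero (APC only eats KV cache space).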
The Most Common Mistakes in Practice
- Managing Volumes as a single unit: Putting the weight cache and compiled artifacts in the same Volume means that when you need to invalidate artifacts after a vLLM version upgrade, you'll accidentally wipe the weights too. Separating them from the start is recommended.
- Missing the snapshot invalidation window: If you upgrade your model version or vLLM without regenerating the snapshot, you'll get a subtle bug where inference runs on the old weight state. Including snapshot regeneration in the deployment pipeline for version changes is recommended.
- Always enabling APC: `enable_prefix_caching=True` isn't always beneficial. For workloads processing novel prompts every time, it just wastes KV cache space. Understanding your workload characteristics before deciding whether to enable it is recommended.
Closing Thoughts
For models 7B or smaller, Volume weight caching alone is enough to get down to the tens-of-seconds range. For larger models, you need to layer on torch.compile artifact caching and GPU memory snapshots in order to achieve sub-second cold starts. Only by applying all three layers do you reach cold start times approaching the physical limit.
Three steps you can take right now:
- Start by applying Volume mounting. Create a Volume with `modal.Volume.from_name("hf-model-cache", create_if_missing=True)` and mount it to the `/root/.cache/huggingface` path in the `volumes` parameter of `@app.cls`. You'll immediately see HuggingFace downloads disappear after the first boot.
- Add a separate Volume for torch.compile artifact caching. Mount the `/root/.cache/vllm` path to a second Volume and set `compile_sizes` to match your workload — this eliminates compilation costs during autoscaling.
- Experiment with GPU snapshots on a small model first. Add `enable_memory_snapshot=True` and `@modal.enter(snap=True)`, verify the snapshot creation → deserialization flow on a small model like LFM2-1B, then expand to larger models for a stable rollout.
Next post: How to scale KV cache beyond a single instance to the cluster level — a distributed Prefix Caching architecture implemented with LMCache and llm-d
References
- Run OpenAI-compatible LLM inference with Gemma and vLLM | Modal Docs
- GPU Memory Snapshots: Supercharging Sub-second Startup | Modal Blog
- Modal + Mistral 3: 10x faster cold starts with GPU snapshotting
- Low Latency, Serverless LFM2 with vLLM and Modal | Modal Docs
- Snapshot GPU memory to speed up cold starts | Modal Docs
- High-performance LLM inference | Modal Docs
- Volumes | Modal Docs
- Memory Snapshots | Modal Docs
- Sleep Mode - vLLM Official Docs
- Zero-Reload Model Switching with vLLM Sleep Mode | vLLM Blog
- torch.compile integration - vLLM
- Automatic Prefix Caching - vLLM
- Reducing GPU Cold Start Time when using vLLM - Tensorfuse
- vLLM Roadmap Q1 2026 · GitHub
- GPU Memory Snapshotting to reduce cold starts · vLLM Issue #33930
- KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d
- Product updates: GPU memory snapshots, notebooks, service tokens | Modal Blog