SGLang RadixAttention KV Cache Hit Rate: How We Raised Hit Rate from 3% to 78% with Prometheus Monitoring and Operational Tuning — Advanced Guide
When running LLM serving infrastructure, GPU costs can quickly spiral out of control. Back when I was operating a multi-turn chatbot service, I eventually realized that every request was recomputing the entire system prompt from scratch. It turned out SGLang's RadixAttention had already solved that problem — I just missed one configuration option and wasn't benefiting from it at all. When I enabled --enable-metrics and saw the actual hit rate for the first time, I was stunned. It was 3%. After fixing the prompt structure and applying cache-aware routing, it climbed to 78%.
This guide is aimed at people who already have some experience with SGLang — particularly those who think "I know what a KV cache is, but I can't figure out why my hit rate is low," or who have heard terms like HiCache or Cache-Aware Load Balancer but aren't sure when to use them. This post covers how to monitor RadixAttention prefix reuse rates with Prometheus and how to actually improve KV cache hit rates in production.
Core Concepts
How RadixAttention Reuses the KV Cache
Traditional LLM serving frameworks prefill the entire prompt from scratch on every request. If you send the same system prompt 100 times, the GPU computes it 100 times in full. RadixAttention solves this inefficiency using a Radix Tree data structure.
Each node in the tree holds a token sequence as an edge label and manages a pointer (reference) to a KV cache tensor in GPU memory as its value. The tensor itself is not embedded in the node — the node simply knows where in GPU memory that tensor lives. When a new request arrives, the tree is traversed from the root to find the longest common prefix. Any matching portion already exists in GPU memory, so the prefill computation is skipped. When the cache is full, the least recently used leaf node is evicted by default (LRU policy).
[root]
└── "You are a helpful AI assistant. [1024 tokens]" ← system prompt node (cached)
    ├── "Hello, what's the weather today..." ← conversation A
    └── "Can you review this Python code..." ← conversation B
All requests that start with the same system prompt get a cache hit from root → system prompt node, and only the branching conversation afterward needs to be computed fresh.
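The lookup described above can be sketched in a few lines. This is a toy illustration, not SGLang's implementation: it stores one token per edge instead of compressed radix edges, tracks no KV tensors, and never evicts. The class names and token ids are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    # Children keyed by the next token id. A real radix tree compresses
    # single-child chains into one edge; this sketch keeps one token per edge.
    children: Dict[int, "Node"] = field(default_factory=dict)


class PrefixCache:
    """Toy stand-in for a prefix cache: tracks which token prefixes were seen."""

    def __init__(self) -> None:
        self.root = Node()

    def match_len(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record the full token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


system = [1, 2, 3, 4]           # hypothetical token ids for a shared system prompt
conv_a = system + [10, 11]      # conversation A
conv_b = system + [20, 21, 22]  # conversation B

cache = PrefixCache()
print(cache.match_len(conv_a))  # 0: nothing cached yet
cache.insert(conv_a)
print(cache.match_len(conv_b))  # 4: only the shared system prompt matches
```

The second lookup skips exactly the four system prompt tokens, which is the part of the prefill a real server would not have to recompute.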
KV cache hit rate: The proportion of total prefill tokens that were reused via RadixAttention. A higher value means fewer prefill computations, which shortens TTFT (Time To First Token) — the time from when a request is sent until the model outputs the first token.
Workload Patterns Where Prefix Reuse Is Effective
To be honest, RadixAttention doesn't help every workload. For a search-style service where completely novel questions arrive every time, hit rate can be close to 0. On the other hand, the following patterns see dramatic results.
| Workload | Prefix Duplication Form | Expected Hit Rate |
|---|---|---|
| Multi-turn conversation | Accumulating conversation history | 75–95% |
| Agents sharing a common system prompt | Long system prompt | 60–90% |
| RAG pipelines | Repeating the same context documents | 50–80% |
| Few-shot prompts | Shared example blocks | 40–70% |
| Fully random one-off questions | None | ~0% |
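To get a feel for where a number in the multi-turn range comes from, here is a back-of-the-envelope estimate. It assumes an idealized single session: nothing is evicted, every turn adds the same number of prompt tokens, and the first turn starts from a cold cache. The function name and the token counts are illustrative, not measurements.

```python
def multi_turn_hit_rate(system_tokens: int, tokens_per_turn: int, turns: int) -> float:
    """Ideal token-level hit rate for one session, assuming no eviction.

    Turn t (1-based) sends system_tokens + t * tokens_per_turn prompt tokens.
    Everything except the newest turn was already sent on turn t - 1, so it
    counts as cached. The very first turn is a full miss.
    """
    total = cached = 0
    for t in range(1, turns + 1):
        prompt_len = system_tokens + t * tokens_per_turn
        total += prompt_len
        if t > 1:
            cached += prompt_len - tokens_per_turn
    return cached / total


# 1024-token system prompt, 200 new tokens per turn, 8 turns
print(f"{multi_turn_hit_rate(1024, 200, 8):.2f}")  # 0.83
print(f"{multi_turn_hit_rate(1024, 200, 1):.2f}")  # 0.00: a one-off request never hits
```

Longer sessions push the ratio up because the fixed cost of the cold first turn is spread over more reused tokens, which is consistent with multi-turn traffic sitting at the top of the table.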
Reading Hit Rates via Prometheus Metrics
SGLang exposes metrics in Prometheus format when the --enable-metrics flag is enabled. The KV cache hit rate gauge name differs by version: SGLang v0.3.x and below uses sglang:cache_hit_rate (colon separator), while v0.4.x and above uses sglang_cache_hit_rate (underscore). If your grep returns nothing, try checking both formats.
# Enable the metrics endpoint when starting the server
python -m sglang.launch_server \
--model-path meta-llama/Llama-3-70B-Instruct \
--enable-metrics \
--port 30000
# Check the current hit rate immediately (search both formats)
curl http://localhost:30000/metrics | grep -E "cache_hit_rate"
If you're using Grafana, you can import examples/monitoring/grafana.json from the SGLang repo directly. The default dashboard is already well-structured, so there's no need to build one from scratch.
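If you would rather read the value from a script than from grep, a small parser is enough. The sketch below accepts both naming styles mentioned above; the sample payload, the label name, and the default URL are assumptions for illustration, so compare them against what your server actually prints.

```python
import re
import urllib.request
from typing import Optional

# Matches both naming styles, with or without a {label="..."} set.
_HIT_RATE = re.compile(
    r'^sglang[:_]cache_hit_rate(?:\{[^}]*\})?\s+([0-9.eE+-]+)\s*$', re.MULTILINE
)


def parse_hit_rate(metrics_text: str) -> Optional[float]:
    """Return the first cache hit rate sample found, or None if absent."""
    m = _HIT_RATE.search(metrics_text)
    return float(m.group(1)) if m else None


def fetch_hit_rate(url: str = "http://localhost:30000/metrics") -> Optional[float]:
    """Fetch the metrics page and extract the hit rate (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_hit_rate(resp.read().decode("utf-8"))


# Made-up sample in Prometheus text exposition format
sample = """# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="demo"} 0.03
"""
print(parse_hit_rate(sample))  # 0.03
```

Returning None instead of 0.0 when the metric is missing keeps "metrics are not enabled" distinguishable from "the hit rate really is zero".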
Practical Application
Setting Up a Prometheus + Grafana Monitoring Pipeline
The very first thing to do is make the hit rate visible. You can't drive in the dark.
# prometheus.yml — SGLang scrape configuration
scrape_configs:
  - job_name: 'sglang'
    static_configs:
      - targets: ['localhost:30000']
    metrics_path: '/metrics'
    scrape_interval: 15s
Below is a Prometheus Alertmanager rule for alerting on hit rate drops. Note that Grafana Unified Alerting (v8+) does not directly support this YAML format; to configure alerts in Grafana, you need to use the Grafana Provisioning API or UI separately.
# prometheus/alerting_rules.yml — Prometheus Alertmanager rules
groups:
  - name: sglang_cache
    rules:
      - alert: LowCacheHitRate
        expr: sglang_cache_hit_rate < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SGLang KV cache hit rate below 10% (current: {{ $value | humanizePercentage }})"
| Config | Value | Description |
|---|---|---|
| scrape_interval | 15s | Collect metrics every 15 seconds |
| for: 15m | 15 minutes | Distinguish temporary spikes from sustained degradation |
| Alert threshold | 0.1 (10%) | Minimum acceptable level for prefix-duplication workloads |
If the hit rate stays below 0.3 for more than 15 minutes, something is wrong. Diagnostic steps are covered below.
Multi-Turn Chatbot — Server Configuration to Maximize Hit Rate
The most important factor in hit rate optimization is prompt structure. I was confused at first too — RadixAttention does automatically find common prefixes, but how you arrange that structure makes a significant difference.
# Recommended prompt structure
# 1. System prompt (immutable, always at the front)
# 2. Conversation history (accumulates as prefix)
# 3. New user message (always at the end)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},  # never change this
    {"role": "user", "content": turn_1_user},
    {"role": "assistant", "content": turn_1_assistant},
    {"role": "user", "content": turn_2_user},
    # ...
    {"role": "user", "content": new_message},  # new input
]
# Optimized server startup parameters
python -m sglang.launch_server \
--model-path meta-llama/Llama-3-70B-Instruct \
--enable-metrics \
--chunked-prefill-size 4096 \
--mem-fraction-static 0.85 \
--port 30000
| Parameter | Value | Role |
|---|---|---|
| --enable-metrics | (none) | Expose Prometheus metrics |
| --chunked-prefill-size | 4096 | Split long prefills into chunks for better memory efficiency |
| --mem-fraction-static | 0.85 | Reserve 85% of GPU VRAM for static allocation (model weights plus the KV cache pool) |
Even a single space or newline difference in the system prompt breaks the match at that point, so everything after the difference is recomputed. The moment a value that changes per request (a timestamp, user ID, or date) appears at the beginning of the system prompt, nothing after it gets cached. Place any dynamic values in the last user message of the conversation.
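One way to enforce that ordering is to funnel every request through a single builder. The sketch below is a hypothetical helper, not part of SGLang: the function name, the bracketed context format, and the placeholder system prompt are all assumptions.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Placeholder text; whatever you use must stay byte-identical across requests.
SYSTEM_PROMPT = "You are a helpful AI assistant."


def build_messages(history: List[Dict[str, str]], user_text: str,
                   user_id: str) -> List[Dict[str, str]]:
    """Keep the stable prefix first and push per-request values to the end."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Dynamic values ride along with the newest user message, so they never
    # sit in front of the cached system prompt or the earlier turns.
    final_user = f"{user_text}\n\n[context: user_id={user_id}, time={now}]"
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + list(history)
        + [{"role": "user", "content": final_user}]
    )


history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi, how can I help?"},
]
a = build_messages(history, "What's the weather?", "user-1")
b = build_messages(history, "Review my code", "user-2")
print(a[:-1] == b[:-1])  # True: everything before the last message is identical
```

One caveat with this design: when the turn is appended to the history, store the user message exactly as it was sent, context line included. Otherwise the next request's prefix no longer matches what was cached.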
Hit Rate Degradation Diagnosis Flow
When hit rate suddenly drops, it can be unclear where to start. Here's a sequence I've found useful in practice.
Hit rate < 0.3 detected
│
▼
[Step 1] Check for system prompt drift
→ Was the prompt changed in a recent deployment?
→ Any whitespace, newline, or encoding differences?
│
▼ If no issue found
[Step 2] Check routing (multi-node deployment)
→ Is Cache-Aware Load Balancer configured?
→ Is the same prefix being distributed to different nodes?
│
▼ If no issue found
[Step 3] Analyze actual prefix duplication in the workload
→ Does the traffic actually have duplicate prefixes?
→ Has the proportion of fully random one-off requests increased?
# Check additional indicators alongside metrics
curl -s http://localhost:30000/metrics | grep -E "(cache_hit|num_cached|num_running)"
# Key metrics
# sglang_cache_hit_rate: current hit rate
# sglang_num_cached_tokens: number of tokens currently in cache
# sglang_num_running_reqs: number of requests currently being processed
Applying Cache-Aware Load Balancer in Multi-Node Deployments
I once had great hit rates on a single node, only to see them cut in half after horizontal scale-out. Because each node maintains an independent KV cache, round-robin traffic distribution means the same prefix lands on a different node each time, preventing cache hits.
Since SGLang v0.4, a Cache-Aware Load Balancer is built in. It routes each request to the worker most likely to already hold the matching prefix: the router keeps an approximate radix tree per worker and picks the one with the longest prefix match.
The example below is for a single host with 2 GPUs using Data Parallelism. --dp-size 2 creates two data-parallel workers on the same host, and --load-balance-method cache_aware distributes requests to one of the two workers based on that prefix match.
# Single host, 2 GPUs — Cache-Aware DP configuration
python -m sglang.launch_server \
--model-path meta-llama/Llama-3-70B-Instruct \
--enable-metrics \
--dp-size 2 \
--load-balance-method cache_aware \
--port 30000
If you're combining two separate physical nodes, start an SGLang server on each node individually and place an SGLang Router in front, configured to distribute traffic using the cache_aware policy. The router ships as the separate sglang-router package and is typically launched with python -m sglang_router.launch_router; check the module path against your installed version. External clients only need to point to the router endpoint, and the addresses of each node are registered in the router's configuration.
According to official benchmarks (LMSYS blog), this single configuration change has achieved 1.9× throughput improvement and 3.8× cache hit rate improvement in documented cases. The results are quite impressive relative to the configuration effort.
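The effect of routing on hit rate is easy to reproduce in a toy simulation. The sketch below is not SGLang's router: it models each worker's cache as "longest prompt seen per session", ignores eviction and load, and uses a plain session hash as a stand-in for cache-aware routing. All constants and function names are made up for the illustration.

```python
import zlib
from typing import Callable, Dict, List

SYSTEM_TOKENS = 1024  # shared system prompt length (illustrative)
TURN_TOKENS = 200     # prompt tokens added per turn (illustrative)
WORKERS = 2


def simulate(route: Callable[[int, str], int], sessions: List[str], turns: int) -> float:
    """Replay interleaved multi-turn traffic and return the token-level hit rate."""
    # Per worker: longest prompt length already processed for each session.
    seen: List[Dict[str, int]] = [dict() for _ in range(WORKERS)]
    total = cached = req_no = 0
    for turn in range(1, turns + 1):
        for sid in sessions:
            w = route(req_no, sid)
            prompt_len = SYSTEM_TOKENS + turn * TURN_TOKENS
            # The shared system prompt hits once the worker has served anyone.
            shared = SYSTEM_TOKENS if seen[w] else 0
            cached += max(seen[w].get(sid, 0), shared)
            total += prompt_len
            seen[w][sid] = prompt_len
            req_no += 1
    return cached / total


def round_robin(req_no: int, sid: str) -> int:
    return req_no % WORKERS


def prefix_affinity(req_no: int, sid: str) -> int:
    # Stable hash, so a session always lands on the same worker.
    return zlib.crc32(sid.encode()) % WORKERS


sessions = ["a", "b", "c"]
print(f"round robin:     {simulate(round_robin, sessions, 8):.2f}")
print(f"prefix affinity: {simulate(prefix_affinity, sessions, 8):.2f}")
```

With three sessions alternating across two workers, round robin keeps landing each turn on the worker that missed the previous one, so it can only reuse history that is two turns old. Affinity routing always reuses the previous turn in full.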
Extending Cache Capacity with HiCache
If you've grown to a scale that GPU VRAM alone can't handle, HiCache is worth considering. It extends the cache across three tiers: GPU VRAM (L1) → Host RAM (L2) → distributed storage (L3).
# Enable HiCache (add Host RAM tier)
python -m sglang.launch_server \
--model-path meta-llama/Llama-3-70B-Instruct \
--enable-metrics \
--enable-hierarchical-cache \
--hicache-ratio 4.0 \
--mem-fraction-static 0.80 \
--port 30000
| Parameter | Value | Meaning |
|---|---|---|
| --enable-hierarchical-cache | (none) | Enable HiCache tiered caching |
| --hicache-ratio | 4.0 | Host RAM KV cache size = GPU VRAM KV cache × 4 |
| --mem-fraction-static | 0.80 | Reduce GPU allocation slightly when using HiCache |
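To put --hicache-ratio 4.0 in concrete terms: on a server with 80GB of GPU VRAM, --mem-fraction-static 0.80 sets aside approximately 64GB for static allocation. That budget holds the model weights as well as the KV cache pool, so the pool itself is smaller than 64GB; treating the whole 64GB as KV cache gives an upper bound. With --hicache-ratio 4.0, the Host RAM L2 cache is sized at 4× the GPU pool, which is at most 256GB (64GB × 4) in this example. The effect is most pronounced when the server's Host RAM is sufficiently larger than GPU VRAM, typically 4× or more.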
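The arithmetic is simple enough to script when planning host RAM. This is a rough sketch under stated assumptions: the static budget is split only between weights and the KV pool, other allocations are ignored, and the 16GB weight figure is a hypothetical example rather than a measured value.

```python
def hicache_sizing(gpu_vram_gb: float, mem_fraction_static: float,
                   hicache_ratio: float, weights_gb: float = 0.0) -> dict:
    """Rough memory budget; all inputs and outputs are in GB."""
    static_gb = gpu_vram_gb * mem_fraction_static
    # Weights and the KV pool share the static budget; weights_gb=0 reproduces
    # the simplified upper-bound arithmetic.
    gpu_kv_gb = max(static_gb - weights_gb, 0.0)
    host_kv_gb = gpu_kv_gb * hicache_ratio
    return {"static_gb": static_gb, "gpu_kv_gb": gpu_kv_gb, "host_kv_gb": host_kv_gb}


# Upper bound: treat the whole static budget as KV cache
print(hicache_sizing(80, 0.80, 4.0))
# Hypothetical model with 16 GB of weights on the same GPU
print(hicache_sizing(80, 0.80, 4.0, weights_gb=16))
```

The first call gives the 64GB and 256GB figures; the second shows how much the host tier shrinks once weights are subtracted, which is the number to compare against the RAM you actually have free.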
A collaboration case study with Alibaba Cloud Tair (per the official blog) applied HiCache to a coding agent workload (average 8 turns per session, 25K+ tokens) and recorded a 56% reduction in TTFT, 2× throughput, and hit rate improvement from 40% to 80%.
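Before adding tiers, it can help to estimate how many tokens the GPU pool holds and what halving the element size buys, which is the idea behind the FP8 setting that follows. The sketch uses the standard per-token KV size (keys and values for every layer) and ignores quantization scales and allocator overhead. The 20GB pool is a hypothetical figure; the layer and head counts are from the published Llama-3-70B configuration and should be checked against your own model.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(pool_gb: float, per_token_bytes: int) -> int:
    """How many tokens a KV pool of pool_gb (GiB) can hold."""
    return int(pool_gb * 1024**3 // per_token_bytes)


# 80 layers, 8 KV heads, head_dim 128
bf16 = kv_bytes_per_token(80, 8, 128, 2)
fp8 = kv_bytes_per_token(80, 8, 128, 1)
print(bf16, fp8)                                            # 327680 163840
print(tokens_that_fit(20, bf16), tokens_that_fit(20, fp8))  # 65536 131072
```

Doubling the number of cacheable tokens does not double the hit rate by itself, but it delays eviction of long shared prefixes, which is usually what a capacity-bound workload needs.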
# FP8 quantization for KV cache to fit more entries in the same VRAM
# (the flag takes a concrete FP8 format such as fp8_e5m2 or fp8_e4m3)
python -m sglang.launch_server \
--model-path meta-llama/Llama-3-70B-Instruct \
--kv-cache-dtype fp8_e5m2 \
--enable-metrics \
--port 30000
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Automatic optimization | Detects cache opportunities at runtime without manual prompt caching code |
| Supports diverse patterns | Covers few-shot, branching reasoning trees, multi-turn conversation, and RAG |
| Hierarchical scalability | HiCache extends beyond GPU VRAM to Host RAM and distributed storage |
| Measurable performance gains | Up to 80% reduction in TTFT and up to 6× throughput for workloads with 60%+ prefix duplication |
| Easy integration with existing infrastructure | Connects directly to standard Prometheus/Grafana stack |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Cache invalidation sensitivity | A single whitespace change in the system prompt causes a full cache miss | Version-control prompts like code; verify diffs at deployment |
| Ineffective for random workloads | No benefit for requests with low prefix duplication | Analyze workload characteristics before applying |
| Multi-node routing is mandatory | Hit rate drops sharply with round-robin distribution | Configure Cache-Aware Load Balancer |
| Memory tuning required | Incorrect combination of --mem-fraction-static and --hicache-ratio can cause OOM | Adjust incrementally to match workload |
| Speculative decoding bug | cached_tokens misreported as 0 when using Speculative Decoding | GitHub Issue #20451 (open as of May 2025); temporarily avoid this combination |
Speculative Decoding: An inference acceleration technique that uses a small draft model to quickly predict multiple tokens, then verifies them with the original model. A known bug causes the cached_tokens metric to be incorrectly reported as 0 when used together with RadixAttention in SGLang (GitHub Issue #20451). I checked the issue directly and it was still open at the time of writing (May 2025), so I recommend temporarily avoiding this combination in environments where hit rate monitoring is critical.
The Most Common Mistakes in Practice
- Placing dynamic values at the beginning of the prompt template: When changing values like dates or user IDs appear at the front of the system prompt, nothing after them gets cached. It is recommended to place dynamic values in the last user message of the conversation.
- Guessing at hit rates without --enable-metrics: More often than you'd think, people assume "it's probably working fine" and end up running at 0% for a long time. Simply looking at the numbers makes the next steps obvious.
- Not applying cache-aware routing after multi-node deployment: If hit rates actually got worse after horizontal scale-out, this is almost always the reason. Cache-Aware Load Balancer is not optional in multi-node setups; it's mandatory.
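The first mistake is cheap to catch automatically at deploy time. Below is a minimal sketch of such a check, assuming you pin a hash of the approved system prompt somewhere in your release process. The regular expression is a hypothetical heuristic for template placeholders and ISO dates, not an exhaustive detector.

```python
import hashlib
import re

# Hypothetical patterns for values that tend to change per request:
# {{ template }} slots, {format_fields}, and YYYY-MM-DD dates.
_DYNAMIC = re.compile(r"\{\{.*?\}\}|\{[a-z_]+\}|\d{4}-\d{2}-\d{2}")


def fingerprint(prompt: str) -> str:
    """Hash the exact bytes, so whitespace or encoding drift changes the result."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def check_system_prompt(prompt: str, expected_fingerprint: str) -> list:
    """Return human-readable problems; an empty list means the prompt looks stable."""
    problems = []
    if fingerprint(prompt) != expected_fingerprint:
        problems.append("system prompt differs from the pinned version")
    m = _DYNAMIC.search(prompt)
    if m:
        problems.append(f"possible per-request value in system prompt: {m.group(0)!r}")
    return problems


pinned = fingerprint("You are a helpful AI assistant.\n")
print(check_system_prompt("You are a helpful AI assistant.\n", pinned))   # []
print(check_system_prompt("You are a helpful AI assistant. \n", pinned))  # one problem: trailing space
```

Running a check like this in CI turns a silent hit rate regression into a failed build, which is much easier to trace back to the change that caused it.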
Closing Thoughts
The value of RadixAttention comes not from "configuration" but from the cycle of "measure → diagnose → tune." Making the hit rate visible must come first before any other optimization becomes meaningful. If you can only do one thing, turn on the metrics. That single number determines your next action.
Three steps you can start right now:
- Enable --enable-metrics right away. Run curl http://localhost:30000/metrics | grep cache_hit_rate to check your current hit rate. Simply seeing the number will make your next action clear.
- If the hit rate is lower than expected, inspect your prompt structure. Check whether the system prompt is truly fixed and whether timestamps or random values are mixed into the front of it.
- For multi-node environments, it is recommended to apply the --load-balance-method cache_aware option. A single line of configuration can produce a noticeable difference in hit rate. From there, consider exploring HiCache or FP8 quantization as next steps.
References
- Fast and Expressive LLM Inference with RadixAttention and SGLang | LMSYS Blog
- RadixAttention | SGLang Official Docs
- SGLang HiCache: Fast Hierarchical KV Caching | LMSYS Blog
- Production Metrics | SGLang Official Docs
- HiCache System Design and Optimization | SGLang Official Docs
- SGLang v0.4: Cache-Aware Load Balancer | LMSYS Blog
- SGLang Prometheus Metrics: A Guide for Production Monitoring | kuncoro.io
- SGLang Production Deployment Guide: RadixAttention | Spheron Blog
- SGLang Prometheus Metrics | NVIDIA Dynamo Docs
- Alibaba Cloud Tair Partners with SGLang to Build HiCache | Alibaba Cloud
- SGLang: Efficient Execution of Structured Language Model Programs | arXiv