SGLang RadixAttention KV Cache Hit Rate: How We Raised Hit Rate from 3% to 78% with Prometheus Monitoring and Operational Tuning — Advanced Guide

When running LLM serving infrastructure, GPU costs can quickly spiral out of control. Back when I was operating a multi-turn chatbot service, I eventually realized that every request was recomputing the entire system prompt from scratch. It turned out SGLang's RadixAttention had already solved that problem — I just missed one configuration option and wasn't benefiting from it at all. When I enabled --enable-metrics and saw the actual hit rate for the first time, I was stunned. It was 3%. After fixing the prompt structure and applying cache-aware routing, it climbed to 78%.

This guide is aimed at people who already have some experience with SGLang — particularly those who think "I know what a KV cache is, but I can't figure out why my hit rate is low," or who have heard terms like HiCache or Cache-Aware Load Balancer but aren't sure when to use them. This post covers how to monitor RadixAttention prefix reuse rates with Prometheus and how to actually improve KV cache hit rates in production.

Core Concepts

How RadixAttention Reuses the KV Cache

Traditional LLM serving frameworks prefill the entire prompt from scratch on every request. If you send the same system prompt 100 times, the GPU computes it 100 times in full. RadixAttention solves this inefficiency using a Radix Tree data structure.

Each node in the tree holds a token sequence as an edge label and manages a pointer (reference) to a KV cache tensor in GPU memory as its value. The tensor itself is not embedded in the node — the node simply knows where in GPU memory that tensor lives. When a new request arrives, the tree is traversed from the root to find the longest common prefix. Any matching portion already exists in GPU memory, so the prefill computation is skipped. When the cache is full, the least recently used leaf node is evicted by default (LRU policy).

[root]
  └── "You are a helpful AI assistant. [1024 tokens]"  ← system prompt node (cached)
        ├── "Hello, what's the weather today..."   ← conversation A
        └── "Can you review this Python code..."  ← conversation B

All requests that start with the same system prompt get a cache hit from root → system prompt node, and only the branching conversation afterward needs to be computed fresh.
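
The lookup described above can be sketched in a few lines of Python. This is a deliberately simplified toy, not SGLang's implementation: it stores one token per edge instead of compressing runs of tokens, it keeps no KV tensors, and it never evicts. The `PrefixCache` and `Node` names and the word-level "tokens" are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One token per edge; a real radix tree compresses token runs into one edge."""
    children: dict = field(default_factory=dict)


class PrefixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record a token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


cache = PrefixCache()
system = ["You", "are", "a", "helpful", "AI", "assistant", "."]
conv_a = system + ["Hello", ",", "what's", "the", "weather"]
conv_b = system + ["Can", "you", "review", "this", "code"]

hit_a = cache.match_prefix(conv_a)  # cold cache, nothing matches
cache.insert(conv_a)
hit_b = cache.match_prefix(conv_b)  # shares only the system prompt with A
print(hit_a, hit_b)  # 0 7
```

Conversation B reuses exactly the seven system-prompt tokens and pays prefill only for what comes after the branch point.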

KV cache hit rate: The proportion of total prefill tokens that were reused via RadixAttention. A higher value means fewer prefill computations, which shortens TTFT (Time To First Token) — the time from when a request is sent until the model outputs the first token.
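
As a quick worked example of that definition, the sketch below computes the hit rate for a made-up traffic pattern. The request count and token lengths are hypothetical, and the model assumes only the shared system prompt is ever reused.

```python
def hit_rate(cached_tokens: int, prompt_tokens: int) -> float:
    """Fraction of prefill tokens served from cache (0.0 when there is no prompt)."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Hypothetical traffic: 100 requests, each a 1024-token system prompt
# plus a 256-token user message. The first request is a cold miss.
system_len, user_len, n_requests = 1024, 256, 100
total_prefill = n_requests * (system_len + user_len)
reused = (n_requests - 1) * system_len

rate = hit_rate(reused, total_prefill)
print(f"{rate:.1%}")  # 79.2%
```

Even with a perfectly stable system prompt, the unique user message puts a ceiling on the achievable rate, which is why longer shared prefixes pay off more.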

Workload Patterns Where Prefix Reuse Is Effective

To be honest, RadixAttention doesn't help every workload. For a search-style service where completely novel questions arrive every time, hit rate can be close to 0. On the other hand, the following patterns see dramatic results.

Workload	Prefix Duplication Form	Expected Hit Rate
Multi-turn conversation	Accumulating conversation history	75–95%
Agents sharing a common system prompt	Long system prompt	60–90%
RAG pipelines	Repeating the same context documents	50–80%
Few-shot prompts	Shared example blocks	40–70%
Fully random one-off questions	None	~0%

Reading Hit Rates via Prometheus Metrics

SGLang exposes metrics in Prometheus format when the --enable-metrics flag is enabled. The KV cache hit rate gauge appears as either sglang:cache_hit_rate (colon separator) or sglang_cache_hit_rate (underscore), depending on the SGLang version and on any relabeling in your pipeline. Rather than relying on a version table, check the /metrics output of your own server. If your grep returns nothing, search for the shared cache_hit_rate suffix, which matches both forms.

bash

# Enable the metrics endpoint when starting the server
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --enable-metrics \
  --port 30000

bash

# Check the current hit rate (the pattern matches both name formats)
curl -s http://localhost:30000/metrics | grep -E "cache_hit_rate"
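
If you would rather read the value from a script than from grep, a minimal sketch follows. It assumes only the two gauge names mentioned above and the standard Prometheus text exposition format; the `parse_hit_rates` and `fetch_hit_rates` helpers and the sample payload are hand-written for illustration, not taken from a real server.

```python
import re
import urllib.request

# Matches both `sglang:cache_hit_rate` and `sglang_cache_hit_rate`,
# with or without a {label="..."} block, followed by the sample value.
_HIT_RATE = re.compile(
    r"^sglang[:_]cache_hit_rate(?:\{[^}]*\})?\s+(\S+)", re.MULTILINE
)


def parse_hit_rates(metrics_text: str) -> list[float]:
    """Extract every cache hit rate sample from Prometheus text exposition."""
    return [float(v) for v in _HIT_RATE.findall(metrics_text)]


def fetch_hit_rates(url: str = "http://localhost:30000/metrics") -> list[float]:
    """Fetch the metrics page and parse it (needs a running server)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_hit_rates(resp.read().decode("utf-8"))


# Hand-written sample covering both name formats.
sample = """\
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="demo"} 0.03
sglang_cache_hit_rate 0.78
"""
print(parse_hit_rates(sample))  # [0.03, 0.78]
```

Against a live server, `fetch_hit_rates()` returns one value per exposed series, so a data-parallel deployment may report several.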

If you're using Grafana, you can import examples/monitoring/grafana.json from the SGLang repo directly. The default dashboard is already well-structured, so there's no need to build one from scratch.

Practical Application

Setting Up a Prometheus + Grafana Monitoring Pipeline

The very first thing to do is make the hit rate visible. You can't drive in the dark.

yaml

# prometheus.yml — SGLang scrape configuration
scrape_configs:
  - job_name: 'sglang'
    static_configs:
      - targets: ['localhost:30000']
    metrics_path: '/metrics'
    scrape_interval: 15s

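Below is a Prometheus alerting rule for hit rate drops. Prometheus evaluates the rule and hands firing alerts to Alertmanager for routing. Note that Grafana Unified Alerting (v8+) does not directly support this YAML format; to configure alerts in Grafana, use the Grafana provisioning API or UI separately. If your server exposes the colon form of the metric name, adjust expr to match.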

yaml

# prometheus/alerting_rules.yml: Prometheus alerting rules (loaded via rule_files)
groups:
  - name: sglang_cache
    rules:
      - alert: LowCacheHitRate
        expr: sglang_cache_hit_rate < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SGLang KV cache hit rate below 10% (current: {{ $value | humanizePercentage }})"

Config	Value	Description
`scrape_interval`	15s	Collect metrics every 15 seconds
`for: 15m`	15 minutes	Distinguish temporary spikes from sustained degradation
Alert threshold	0.1 (10%)	Minimum acceptable level for prefix-duplication workloads

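The alert fires at 0.1, which I treat as a hard floor. For workloads with heavy prefix duplication, a hit rate that stays below 0.3 for more than 15 minutes already means something is wrong, even though it won't trigger the alert. Diagnostic steps are covered below.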

Multi-Turn Chatbot — Server Configuration to Maximize Hit Rate

The most important factor in hit rate optimization is prompt structure. I was confused at first too — RadixAttention does automatically find common prefixes, but how you arrange that structure makes a significant difference.

python

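# Recommended prompt structure
# 1. System prompt (immutable, always at the front)
# 2. Conversation history (accumulates as prefix)
# 3. New user message (always at the end)

# Placeholder values standing in for a real conversation
SYSTEM_PROMPT = "You are a helpful AI assistant."
turn_1_user = "Hello, what's the weather today?"
turn_1_assistant = "I can't check live weather, but I can help you find a forecast."
new_message = "Can you review this Python code?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},        # never change this
    {"role": "user", "content": turn_1_user},
    {"role": "assistant", "content": turn_1_assistant},
    # ... earlier turns, appended in order and never edited
    {"role": "user", "content": new_message},            # new input
]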

bash

# Optimized server startup parameters
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --enable-metrics \
  --chunked-prefill-size 4096 \
  --mem-fraction-static 0.85 \
  --port 30000

Parameter	Value	Role
`--enable-metrics`	—	Expose Prometheus metrics
`--chunked-prefill-size`	4096	Split long prefills into chunks for better memory efficiency
`--mem-fraction-static`	0.85	Use 85% of GPU memory for static allocation (model weights plus the KV cache pool)

Even a single space or newline difference in the system prompt invalidates the cache from the first differing token onward, which for a change near the top means a near-total miss. The moment a value that changes per request (a timestamp, user ID, or date) appears at the beginning of the system prompt, nothing after it gets cached. Place any dynamic values in the last user message of the conversation instead.
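
To make the effect concrete, the sketch below builds the same request two ways and measures how much of the prompt two requests share. It compares characters as a stand-in for tokens, and the `bad_prompt` and `good_prompt` templates are hypothetical, not the format any particular chat template produces.

```python
from datetime import datetime, timezone
from itertools import takewhile

SYSTEM_PROMPT = "You are a helpful AI assistant."  # placeholder text


def bad_prompt(user_msg: str, now: datetime) -> str:
    # Timestamp first: every request diverges within the first few characters.
    return f"[{now.isoformat()}] {SYSTEM_PROMPT}\nUser: {user_msg}"


def good_prompt(user_msg: str, now: datetime) -> str:
    # Static prefix first, dynamic values pushed into the final user message.
    return f"{SYSTEM_PROMPT}\nUser: (sent at {now.isoformat()}) {user_msg}"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return sum(1 for _ in takewhile(lambda p: p[0] == p[1], zip(a, b)))


t1 = datetime(2025, 5, 1, 9, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 1, 2, 10, 30, tzinfo=timezone.utc)

bad = shared_prefix_len(bad_prompt("hi", t1), bad_prompt("hi", t2))
good = shared_prefix_len(good_prompt("hi", t1), good_prompt("hi", t2))
print(bad, good)
```

With the timestamp in front, the two requests stop matching before the system prompt even begins. With it moved to the end, the entire system prompt stays reusable.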

Hit Rate Degradation Diagnosis Flow

When hit rate suddenly drops, it can be unclear where to start. Here's a sequence I've found useful in practice.

text

Hit rate < 0.3 detected
       │
       ▼
[Step 1] Check for system prompt drift
       → Was the prompt changed in a recent deployment?
       → Any whitespace, newline, or encoding differences?
       │
       ▼ If no issue found
[Step 2] Check routing (multi-node deployment)
       → Is Cache-Aware Load Balancer configured?
       → Is the same prefix being distributed to different nodes?
       │
       ▼ If no issue found
[Step 3] Analyze actual prefix duplication in the workload
       → Does the traffic actually have duplicate prefixes?
       → Has the proportion of fully random one-off requests increased?
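
Step 1 is easy to automate. The sketch below compares the prompt that was deployed with the one currently being sent and classifies the difference. The helper names and the two example prompts are hypothetical; in practice you would load both strings from your prompt repository and from a captured request.

```python
import hashlib
import unicodedata


def fingerprint(prompt: str) -> str:
    """Short hash of the exact UTF-8 bytes; any whitespace change alters it."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def first_difference(old: str, new: str):
    """Index of the first differing character, or None if identical."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return None if len(old) == len(new) else min(len(old), len(new))


def diagnose(old: str, new: str) -> str:
    if old == new:
        return "identical"
    if unicodedata.normalize("NFC", old) == unicodedata.normalize("NFC", new):
        return "encoding-only difference (Unicode normalization)"
    if old.split() == new.split():
        return "whitespace-only difference"
    return "content difference"


deployed = "You are a helpful AI assistant.\nAnswer concisely."
current = "You are a helpful AI assistant.\n\nAnswer concisely."

print(fingerprint(deployed) == fingerprint(current))  # False
print(first_difference(deployed, current), diagnose(deployed, current))
```

Logging the fingerprint at deploy time and again at request time gives a cheap way to notice drift before the hit rate alert does.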

bash

# Check additional indicators alongside the hit rate
curl -s http://localhost:30000/metrics | grep -E "(cache_hit|num_cached|num_running)"

# Key metrics (the prefix may be sglang: or sglang_, and exact names vary by version)
# sglang_cache_hit_rate        current hit rate
# sglang_num_cached_tokens     number of tokens currently in cache
# sglang_num_running_reqs      number of requests currently being processed

Applying Cache-Aware Load Balancer in Multi-Node Deployments

I once had great hit rates on a single node, only to see them cut in half after horizontal scale-out. Because each node maintains an independent KV cache, round-robin traffic distribution means the same prefix lands on a different node each time, preventing cache hits.
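
A small simulation shows why this happens. It is a toy model under stated assumptions: three interleaved sessions, two workers, every turn adding a fixed 100 prompt tokens, no eviction, and generated output ignored. The `prefix_affinity` function is a stand-in for any routing that keeps a session on one worker, not a reproduction of SGLang's router.

```python
import zlib

TOKENS_PER_TURN = 100  # hypothetical: every turn adds 100 prompt tokens


def simulate(route, n_workers, sessions, turns):
    """Return the token-level hit rate for interleaved multi-turn sessions."""
    # cached[w][s] = longest prefix (in tokens) of session s held by worker w
    cached = [dict() for _ in range(n_workers)]
    reused = total = request_no = 0
    for turn in range(1, turns + 1):
        for session in sessions:
            prompt_len = turn * TOKENS_PER_TURN  # history so far + new message
            worker = route(request_no, session, n_workers)
            reused += cached[worker].get(session, 0)
            total += prompt_len
            cached[worker][session] = prompt_len
            request_no += 1
    return reused / total


def round_robin(request_no, session, n_workers):
    return request_no % n_workers


def prefix_affinity(request_no, session, n_workers):
    # Stable stand-in for cache-aware routing: same session, same worker.
    return zlib.crc32(session.encode()) % n_workers


sessions = ["A", "B", "C"]
rr = simulate(round_robin, 2, sessions, turns=4)
aff = simulate(prefix_affinity, 2, sessions, turns=4)
print(rr, aff)  # 0.3 0.6
```

Under round-robin each session keeps bouncing between workers, so every request finds a prefix that is one turn staler than it could be, and the hit rate falls to half of the sticky case.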

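Since SGLang v0.4, a cache-aware load balancer ships as the SGLang Router. Rather than hashing the prefix, it keeps an approximate radix tree of what each worker has served and routes each request to the worker with the longest matching prefix, falling back to load-based balancing when the workers become too imbalanced.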

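The example below is for a single host with 2 GPUs using data parallelism. Launching through the sglang_router package (installed separately with pip install sglang-router) starts two data-parallel workers with a router in front of them, and the router defaults to its cache-aware policy. Note that the server's own --load-balance-method flag does not offer a cache-aware option in the releases I'm aware of, so the routing has to come from the router. Each worker holds a full replica of the model, and a 70B model in 16-bit precision will not fit on a single 80GB GPU, so treat the model path as a placeholder. Flag names have shifted between releases, so confirm them with --help on your installed version.

bash

# Single host, 2 GPUs: router + 2 data-parallel workers (cache-aware routing)
python -m sglang_router.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --enable-metrics \
  --dp-size 2 \
  --port 30000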

If you're combining two separate physical nodes, start an SGLang server on each node individually and place an SGLang Router in front of them, for example python -m sglang_router.launch_router with --worker-urls listing each node's address and --policy cache_aware. External clients only need to point to the router endpoint.

According to official benchmarks (LMSYS blog), this single configuration change has achieved 1.9× throughput improvement and 3.8× cache hit rate improvement in documented cases. The results are quite impressive relative to the configuration effort.

Extending Cache Capacity with HiCache

If you've grown to a scale that GPU VRAM alone can't handle, HiCache is worth considering. It extends the cache across three tiers: GPU VRAM (L1) → Host RAM (L2) → distributed storage (L3).

bash

# Enable HiCache (add Host RAM tier)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --enable-metrics \
  --enable-hierarchical-cache \
  --hicache-ratio 4.0 \
  --mem-fraction-static 0.80 \
  --port 30000

Parameter	Value	Meaning
`--enable-hierarchical-cache`	—	Enable HiCache tiered caching
`--hicache-ratio`	4.0	Host RAM KV cache size = GPU VRAM KV cache × 4
`--mem-fraction-static`	0.80	Reduce GPU allocation slightly when using HiCache

To put --hicache-ratio 4.0 in concrete terms: on an 80GB GPU, --mem-fraction-static 0.80 reserves about 64GB for model weights plus the KV cache pool, so the pool itself is whatever remains after the weights are loaded. If that leaves, say, 20GB for the KV cache pool, --hicache-ratio 4.0 reserves roughly 80GB (20GB × 4) of Host RAM as the L2 cache for that GPU. The effect is most pronounced when the server's Host RAM is comfortably larger than the combined KV cache pools of all its GPUs.
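
The sizing arithmetic can be written down as a rough estimator. It assumes the static budget covers model weights plus the KV cache pool and that the host tier is sized as a multiple of the device pool. It ignores activation memory, CUDA graphs, and other overhead, and the 44GB weight footprint is a hypothetical per-GPU figure.

```python
def kv_pool_gb(gpu_gb: float, mem_fraction_static: float, weights_gb: float) -> float:
    """Rough device KV pool size: static budget minus model weights."""
    return max(gpu_gb * mem_fraction_static - weights_gb, 0.0)


def host_cache_gb(device_pool_gb: float, hicache_ratio: float) -> float:
    """Host RAM tier size when it is sized as a multiple of the device pool."""
    return device_pool_gb * hicache_ratio


# Hypothetical per-GPU numbers: 80 GB card, 44 GB of weights resident on it.
pool = kv_pool_gb(gpu_gb=80, mem_fraction_static=0.80, weights_gb=44)
host = host_cache_gb(pool, hicache_ratio=4.0)
print(f"device pool ~ {pool:.0f} GB, host tier ~ {host:.0f} GB")
```

Multiply the host figure by the number of GPUs on the machine before comparing it with installed RAM, since each GPU's pool gets its own host-side extension.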

A collaboration case study with Alibaba Cloud Tair (per the official blog) applied HiCache to a coding agent workload (average 8 turns per session, 25K+ tokens) and recorded a 56% reduction in TTFT, 2× throughput, and hit rate improvement from 40% to 80%.

bash

# FP8 KV cache to fit more entries in the same VRAM
# (valid values include fp8_e5m2 and fp8_e4m3; validate output quality before rollout)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --kv-cache-dtype fp8_e5m2 \
  --enable-metrics \
  --port 30000
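
To see what FP8 buys, the per-token KV footprint can be computed from the model shape. The sketch uses the published Llama-3-70B configuration (80 layers, 8 KV heads under GQA, head dimension 128). The 20 GiB pool size is a hypothetical figure, and with tensor parallelism the per-GPU footprint is divided across ranks.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Keys and values for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head dimension 128.
bf16 = kv_bytes_per_token(80, 8, 128, dtype_bytes=2)
fp8 = kv_bytes_per_token(80, 8, 128, dtype_bytes=1)

pool_bytes = 20 * 1024**3  # hypothetical 20 GiB KV pool
print(bf16 // 1024, fp8 // 1024)              # KiB per token: 320 160
print(pool_bytes // bf16, pool_bytes // fp8)  # tokens that fit: 65536 131072
```

Halving the bytes per token doubles the number of cached tokens in the same pool, which directly raises the chance that a returning prefix is still resident.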

Pros and Cons Analysis

Advantages

Item	Details
Automatic optimization	Detects cache opportunities at runtime without manual prompt caching code
Supports diverse patterns	Covers few-shot, branching reasoning trees, multi-turn conversation, and RAG
Hierarchical scalability	HiCache extends beyond GPU VRAM to Host RAM and distributed storage
Measurable performance gains	Up to 80% reduction in TTFT and up to 6× throughput for workloads with 60%+ prefix duplication
Easy integration with existing infrastructure	Connects directly to standard Prometheus/Grafana stack

Disadvantages and Caveats

Item	Details	Mitigation
Cache invalidation sensitivity	A single whitespace change in the system prompt causes a full cache miss	Version-control prompts like code; verify diffs at deployment
Ineffective for random workloads	No benefit for requests with low prefix duplication	Analyze workload characteristics before applying
Multi-node routing is mandatory	Hit rate drops sharply with round-robin distribution	Configure Cache-Aware Load Balancer
Memory tuning required	Incorrect combination of `--mem-fraction-static` and `--hicache-ratio` can cause OOM	Adjust incrementally to match workload
Speculative decoding bug	`cached_tokens` misreported as 0 when using Speculative Decoding	GitHub Issue #20451 (open as of May 2025); temporarily avoid this combination

Speculative Decoding: An inference acceleration technique that uses a small draft model to quickly predict multiple tokens, then verifies them with the original model. A known bug causes the cached_tokens metric to be incorrectly reported as 0 when used together with RadixAttention in SGLang (GitHub Issue #20451). I checked the issue directly and it was still open at the time of writing (May 2025), so I recommend temporarily avoiding this combination in environments where hit rate monitoring is critical.

The Most Common Mistakes in Practice

Placing dynamic values at the beginning of the prompt template — When changing values like dates or user IDs appear at the front of the system prompt, nothing after them gets cached. It is recommended to place dynamic values in the last user message of the conversation.
Guessing at hit rates without --enable-metrics — More often than you'd think, people assume "it's probably working fine" and end up running at 0% for a long time. Simply looking at the numbers makes the next steps obvious.
Not applying cache-aware routing after multi-node deployment — If hit rates actually got worse after horizontal scale-out, this is almost always the reason. Cache-Aware Load Balancer is not optional in multi-node setups — it's mandatory.

Closing Thoughts

The value of RadixAttention comes not from "configuration" but from the cycle of "measure → diagnose → tune." Making the hit rate visible must come first before any other optimization becomes meaningful. If you can only do one thing, turn on the metrics. That single number determines your next action.

Three steps you can start right now:

Enable --enable-metrics right away. Run curl http://localhost:30000/metrics | grep cache_hit_rate to check your current hit rate. Simply seeing the number will make your next action clear.
If the hit rate is lower than expected, inspect your prompt structure. Check whether the system prompt is truly fixed and whether timestamps or random values are mixed into the front of it.
For multi-node environments, put the SGLang Router in front of your workers with its cache_aware policy rather than round-robin. This small routing change can produce a noticeable difference in hit rate. From there, consider exploring HiCache or FP8 quantization as next steps.

