XGrammar-2: The Design Principles Behind 80x Faster Structured Output
When an LLM calls a tool or returns JSON, there's actually quite a heavy operation running behind the scenes. Every time the model emits a token, it must determine in real time whether that token is valid in the current grammar state — and as the number of tools grows into the hundreds, the cost of recompiling the grammar on every request quietly starts to pile up. Whether or not you've personally served AI agents, this bottleneck is hard to avoid once you're running an LLM-based service.
I ran into this myself while attaching function calling to DeepSeek-series models — the JSON parsing would go sideways after the <think> block closed, and at the time I just wrote a parser by hand and stuffed it in. It wasn't until I saw XGrammar-2 that I realized just how barbaric that approach was.
In May 2026, the CMU research team and the MLC project released XGrammar-2, achieving up to 80x efficiency improvement over XGrammar generation 1, 6–10x per-token throughput, and over 100x reduction in compilation time. This post analyzes the paper and official blog, and includes verification of select code examples. By the end, you'll be able to judge under what conditions switching to XGrammar-2 delivers real-world impact — and what single line of code to change in vLLM or SGLang.
Core Concepts
XGrammar-2 is not simply a performance upgrade. It identifies two fundamentally flawed assumptions that existing engines made in agentic environments, and addresses each with a distinct mechanism. Let's start by examining what exactly went wrong.
Why Structured Output Is Especially Tricky in Agentic Environments
Structured Generation: A technique that constrains token generation so an LLM produces text that exactly conforms to a predefined format — JSON, a function-call schema, a specific protocol. On every token generation, a token mask — a bit array that restricts which tokens can appear next given the current grammar state — is computed and applied.
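To make the token mask concrete, here is a minimal sketch in plain Python. The five-token vocabulary, the logit values, and the helper names `apply_token_mask` and `greedy_pick` are all made up for illustration; real engines pack the mask into a bitset over a vocabulary of 100k+ tokens and apply it on the GPU.

```python
import math

def apply_token_mask(logits, allowed):
    """Set the logit of every disallowed token to -inf so it can never be sampled."""
    return [x if ok else -math.inf for x, ok in zip(logits, allowed)]

def greedy_pick(logits):
    """Index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and raw model scores (illustrative values only).
vocab = ["{", "}", '"name"', ":", "hello"]
logits = [0.1, 0.3, 1.2, 0.2, 2.5]

# Toy grammar state "start of a JSON object": only "{" is a valid next token.
mask = [True, False, False, False, False]

unconstrained = vocab[greedy_pick(logits)]
constrained = vocab[greedy_pick(apply_token_mask(logits, mask))]
print(unconstrained, constrained)
```

Without the mask the model would pick its favorite token regardless of the format; with the mask, the only candidates left are the ones the grammar allows in the current state.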
Existing structured generation engines were designed on two main assumptions. First, schemas can be pre-compiled before a request arrives. Second, the output structure does not change within a single request. Agentic environments break both assumptions.
Inter-request dynamism: Each agent uses a different set of tools, so every request selects a different subset from a registry of, say, 500 tools, and the schema must be recompiled for that subset. A 500-tool registry has 2^500 possible subsets, far too many to compile and cache in advance.
Intra-request dynamism: Chain-of-thought models like DeepSeek R1 or QwQ switch structures within a single request: free-form <think>...</think> reasoning → tool call JSON → response text. No single grammar can cover this entire flow.
Even without agentic development experience, you can intuitively understand why this is a problem. A traditional web API has one fixed schema per request, but an LLM agent's set of allowed tokens changes in real time depending on "what the model is outputting at this very moment." A statically designed parser cannot handle this dynamism efficiently.
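A quick back-of-the-envelope calculation shows why pre-caching every tool combination is off the table. The registry size of 500 comes from the example above; the choice of 10 tools per request is an assumption for illustration.

```python
import math

REGISTRY_SIZE = 500  # tools in the registry, as in the example above

# Every tool is either in or out of a request's subset.
all_subsets = 2 ** REGISTRY_SIZE

# Even if each request used exactly 10 tools (an assumed figure):
ten_tool_subsets = math.comb(REGISTRY_SIZE, 10)

print(f"all subsets has {len(str(all_subsets))} decimal digits")
print(f"10-tool subsets has {len(str(ten_tool_subsets))} decimal digits")
```

Restricting requests to a fixed subset size still leaves more than 10^20 combinations, so any caching strategy has to work below the level of whole tool sets.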
XGrammar-2 addresses these two forms of dynamism with separate mechanisms.
TagDispatch: Handling Intra-Request Structure Transitions Automatically
TagDispatch is a dispatch layer that dynamically swaps the active grammar when it detects specific tags in the output stream. When a <tool_call> tag appears, it automatically switches to the JSON schema grammar for that tool; when a <think> tag appears, it switches to an unconstrained free-form grammar.

    # TagDispatch conceptual structure (pseudocode)
    grammar_map = {
        "<think>": FreeFormGrammar(),
        "<tool_call>": JSONSchemaGrammar(tool_schema),
        "<response>": ResponseFormatGrammar(),
    }
    dispatcher = TagDispatch(grammar_map)
    for token in token_stream:
        active_grammar = dispatcher.dispatch(token)
        # Compute token mask based on the current grammar's state machine
        mask = active_grammar.get_token_mask(current_state)

At first I thought, "Can't you just split the stream into segments at the parser level?" The issue is that token mask computation depends on the current state of the grammar's state machine. Changing the grammar means the state machine itself must change, and performing that transition at runtime without overhead is the core challenge. TagDispatch handles this state machine transition atomically at the moment a tag is detected.
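The dispatch idea can be tried out with a small runnable toy. This is only a sketch of the concept, not XGrammar-2's implementation: `ToyTagDispatcher` is a hypothetical class, grammars are represented by plain strings, and it assumes every tag ends exactly at the end of a decoded piece.

```python
class ToyTagDispatcher:
    """Tracks which grammar is active by watching for tags in the output text."""

    def __init__(self, tag_to_grammar, default="free_form"):
        self.tag_to_grammar = tag_to_grammar
        self.default = default
        self.active = default
        self.open_tag = None  # name of the tag we are currently inside, if any
        self.text = ""

    def feed(self, piece):
        """Consume one decoded piece and return the grammar active after it."""
        self.text += piece
        if self.open_tag is None:
            # Outside any tag: look for an opening tag that just completed.
            for tag, grammar in self.tag_to_grammar.items():
                if self.text.endswith(f"<{tag}>"):
                    self.open_tag, self.active = tag, grammar
        elif self.text.endswith(f"</{self.open_tag}>"):
            # The matching closing tag just completed: fall back to the default.
            self.open_tag, self.active = None, self.default
        return self.active


dispatcher = ToyTagDispatcher({"think": "free_form", "tool_call": "json_schema"})
pieces = ["<think>", "plan", "</think>", "<tool_call>", '{"name"', "</tool_call>", "done"]
trace = [dispatcher.feed(p) for p in pieces]
print(trace)
```

A real engine has to do more than track a label: at each switch it must also swap in the new grammar's state machine so that the very next token mask comes from the right grammar.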
Cross-Grammar Cache: Reusing Shared Structure Across Different Schemas
Cross-Grammar Cache: A caching layer that computes the token mask for substructures that appear in common across different grammars (tool schemas) only once and reuses them. The key point is that this cache operates at the level of states in the grammar state machine, not at the schema level.
Even with 500 tools, every tool schema ultimately follows JSON object structure. Patterns like {, "key":, numeric parsing, and closing arrays repeat across most schemas. The old approach performed this computation independently per schema, but Cross-Grammar Cache identifies shared substructures at the grammar state level and reuses once-computed token masks across multiple grammars.
To put it intuitively: even though tool A's JSON schema and tool B's JSON schema differ in content, both grammars share identical state machine structure when in the state "JSON object opened, awaiting first key." The token mask for that common state only needs to be computed once and can be reused by both grammars.
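The "compute once, reuse across grammars" idea can be sketched as a cache keyed by a structural state rather than by a schema. How XGrammar-2 actually identifies equivalent states is not covered here, so the `MaskCache` class, the state key, and the prefix-based mask rule below are all hypothetical simplifications.

```python
class MaskCache:
    """Caches token masks under a structural state key shared by many grammars."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.masks = {}
        self.computations = 0  # how many times the vocabulary was actually scanned

    def mask_for(self, state_key, allowed_prefixes):
        if state_key not in self.masks:
            self.computations += 1
            # Toy rule: a token is valid if it starts with an allowed prefix.
            self.masks[state_key] = tuple(
                tok.startswith(allowed_prefixes) for tok in self.vocab
            )
        return self.masks[state_key]


vocab = ['"', "}", "{", "1", "true", '"na']
cache = MaskCache(vocab)

# Two different tool schemas reach the same structural state:
# "JSON object opened, awaiting first key".
state = ("json_object", "await_first_key")
mask_tool_a = cache.mask_for(state, ('"', "}"))
mask_tool_b = cache.mask_for(state, ('"', "}"))

print(cache.computations, mask_tool_a)
```

The second lookup returns the stored mask without touching the vocabulary again, which is the effect the paragraph above describes for tool A and tool B.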
Other Core Mechanisms
Earley Parser: An algorithm for parsing context-free grammars (CFGs) that handles recursive structures such as JSON schemas without restricting the form of the grammar. It runs in O(N³) time in the worst case (highly ambiguous grammars), O(N²) for unambiguous grammars, and close to linear time for most deterministic grammars, where N is the input length.
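For readers who have not met the algorithm before, here is a minimal textbook-style Earley recognizer. It is a teaching sketch, not XGrammar-2's parser: it works on a list of pre-split tokens, assumes the grammar has no empty right-hand sides, and the nested-list grammar is made up for the example.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples).
    Any symbol that is not a key of `grammar` is treated as a terminal.
    Assumes no empty right-hand sides.
    """
    # A chart item is (lhs, rhs, dot position, origin index).
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}

    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # Predict: expand the nonterminal right after the dot.
                    new_items, target = [(sym, r, 0, i) for r in grammar[sym]], i
                elif i < len(tokens) and tokens[i] == sym:
                    # Scan: the terminal matches the next input token.
                    new_items, target = [(lhs, rhs, dot + 1, origin)], i + 1
                else:
                    continue
            else:
                # Complete: advance every item that was waiting for `lhs`.
                new_items = [
                    (l, r, d + 1, o)
                    for (l, r, d, o) in chart[origin]
                    if d < len(r) and r[d] == lhs
                ]
                target = i
            for item in new_items:
                if item not in chart[target]:
                    chart[target].add(item)
                    if target == i:
                        worklist.append(item)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for (lhs, rhs, dot, origin) in chart[len(tokens)]
    )


# Toy grammar for nested lists of numbers, e.g. [num, [num]].
grammar = {
    "value": [("num",), ("[", "items", "]"), ("[", "]")],
    "items": [("value",), ("value", ",", "items")],
}
nested_ok = earley_recognize(grammar, "value", ["[", "num", ",", "[", "num", "]", "]"])
dangling_comma = earley_recognize(grammar, "value", ["[", "num", ","])
print(nested_ok, dangling_comma)
```

The set of items alive at position i is exactly the information a structured-generation engine needs to decide which tokens may come next, which is why the parser's per-position cost matters so much.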
| Technique | Problem Solved | Core Principle |
|---|---|---|
| JIT Compilation | Unnecessary upfront compilation cost | Compile a grammar only when an actual request comes in. Even with 500 tools, only process the subset used by real requests |
| Repetition State Compression | Memory waste from array/list patterns | Compress repetitive structures like [item, item, ...] in state space to reduce memory and computation |
| Earley Adaptive Mask | Handling complex recursive grammars | Earley-parser-based efficient token mask computation even for context-dependent grammars |
| TagDispatch | Intra-request structure transitions | Atomically swap the grammar state machine at tag detection time (detailed above) |
| Cross-Grammar Cache | Inter-request duplicate schema computation | Reuse token masks for common state substructures at the state level (detailed above) |
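The JIT Compilation row in the table can be illustrated with a compile-on-first-use wrapper. This sketch works at whole-schema granularity and uses a stand-in for the expensive compile step; `_compile_canonical`, `compile_on_demand`, and the generated registry are hypothetical names and data, not part of any library.

```python
import json
from functools import lru_cache

stats = {"compiles": 0}

@lru_cache(maxsize=None)
def _compile_canonical(schema_json: str):
    """Stand-in for an expensive grammar compilation (hypothetical)."""
    stats["compiles"] += 1
    return ("compiled-grammar", schema_json)

def compile_on_demand(schema: dict):
    """Compile a schema the first time a request needs it, then reuse the result."""
    # Canonical JSON so that equal schemas share one cache entry.
    return _compile_canonical(json.dumps(schema, sort_keys=True))

# A registry of 500 tool schemas; nothing is compiled at startup.
registry = {
    f"tool_{i}": {"type": "object", "title": f"tool_{i}", "additionalProperties": False}
    for i in range(500)
}

# Three requests arrive, touching only two distinct tools.
for tool_name in ["tool_3", "tool_7", "tool_3"]:
    compile_on_demand(registry[tool_name])

print(stats["compiles"], "of", len(registry), "schemas compiled")
```

The startup cost no longer scales with the size of the registry, only with the tools that real traffic actually uses.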
Structural Tag: Abstraction for Agent Output Formats
Agent output formats vary from model to model and framework to framework. With OpenAI format, Anthropic format, and custom formats all mixed together, XGrammar-2 introduces an abstraction layer called Structural Tag.
Honestly, this is the part I have the weakest feel for: there is not yet enough publicly available production code to see how it is used in practice, and the paper alone does not make it concrete. The basic idea is as follows:

    {
      "type": "tool_call",
      "name": "search",
      "arguments": {"query": "Seoul weather"},
      "reasoning": "<think> ... </think>"
    }

If TagDispatch is the tag-detection layer, Structural Tag is the protocol sitting on top of it that represents agent output as a consistent JSON structure. Amid ongoing discussions around standardizing the OpenAI response format, this is an attempt to unify diverse formats under a single abstraction.
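To get a feel for what "a consistent JSON structure" buys you, here is a small sketch that normalizes tagged model output into a record with the same four fields as the example above. This is post-hoc parsing for illustration only, not the Structural Tag API: `to_structural_record` is a hypothetical helper, and unlike the example it stores the reasoning text without the surrounding tags.

```python
import json
import re

def to_structural_record(raw: str) -> dict:
    """Normalize tagged model output into one record shape (illustrative)."""
    think = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
    call = re.search(r"<tool_call>(.*?)</tool_call>", raw, re.DOTALL)
    if call is None:
        raise ValueError("no <tool_call> block found")
    payload = json.loads(call.group(1))
    return {
        "type": "tool_call",
        "name": payload["name"],
        "arguments": payload["arguments"],
        "reasoning": think.group(1).strip() if think else "",
    }

raw_output = (
    "<think>The user wants weather data.</think>"
    '<tool_call>{"name": "search", "arguments": {"query": "Seoul weather"}}</tool_call>'
)
record = to_structural_record(raw_output)
print(record)
```

Downstream code can then depend on one record shape regardless of which model or tag convention produced the raw text.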
Practical Application
Example 1: Serving a Large-Scale Tool-Calling Agent in SGLang
Environment: Latest version of SGLang (check per-version release notes for XGrammar-2 integration status), Python 3.10+
SGLang has integrated XGrammar-2 as its default structured output backend. Passing a schema to sgl.gen (through the json_schema parameter in recent releases) causes Cross-Grammar Cache to operate automatically.

    import json
    from typing import Literal

    import sglang as sgl
    from pydantic import BaseModel, ConfigDict

    class ToolCallResponse(BaseModel):
        model_config = ConfigDict(extra="forbid")  # Block additional fields
        tool_name: Literal["search", "calculate", "fetch_data"]
        arguments: dict
        confidence: float

    @sgl.function
    def tool_call_agent(s, user_query: str):
        s += sgl.system("You are a helpful assistant with access to tools.")
        s += sgl.user(user_query)
        s += sgl.assistant(
            sgl.gen(
                "response",
                # sgl.gen takes the schema as a JSON string; parameter names
                # follow recent SGLang releases, so verify against your version
                json_schema=json.dumps(ToolCallResponse.model_json_schema()),
                max_tokens=512,
            )
        )

    # Start a runtime with the xgrammar grammar backend and make it the default
    runtime = sgl.Runtime(model_path="deepseek-ai/DeepSeek-V3", grammar_backend="xgrammar")
    sgl.set_default_backend(runtime)
    result = tool_call_agent.run(
        user_query="Search for the current weather in Seoul and tell me in Celsius",
    )
    print(result["response"])
    runtime.shutdown()

| Code Point | Description |
|---|---|
| `json_schema=json.dumps(ToolCallResponse.model_json_schema())` | Passes the Pydantic model's JSON schema as a JSON string. XGrammar-2 JIT-compiles this schema to generate the token mask |
| `grammar_backend="xgrammar"` | Explicitly selects the xgrammar backend in the SGLang runtime |
| `ConfigDict(extra="forbid")` | Emits `additionalProperties: false` in the generated schema, blocking fields that are not in the model. Without this, the model may insert unexpected keys |
Example 2: Handling Mixed Output from Models with a Reasoning Channel
Models like DeepSeek R1 and QwQ that first emit a <think> block and then produce a structured response are honestly the trickiest case. The way TagDispatch handles this transition automatically is quite elegant.
Note: The code below is pseudocode based on XGrammar-2 official documentation. Actual class names and method names may differ depending on the library version — check the official GitHub API docs before running.

    # pip install xgrammar
    # The following is conceptual pseudocode based on XGrammar-2 (actual API names may differ)
    from xgrammar import GrammarCompiler, TagDispatchGrammar

    # The GrammarCompiler instance must be reused to benefit from Cross-Grammar Cache.
    # Note: in released xgrammar versions GrammarCompiler is constructed from a
    # TokenizerInfo object; that argument is omitted in this pseudocode.
    compiler = GrammarCompiler()

    tag_dispatch_grammar = TagDispatchGrammar(
        tag_grammar_map={
            # <think> block: free-form (no token mask constraints)
            "think": compiler.compile_free_form(),
            # <tool_call> block: strict JSON schema
            "tool_call": compiler.compile_json_schema({
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "arguments": {"type": "object"},
                },
                "required": ["name", "arguments"],
                "additionalProperties": False,
            }),
            # <response> block: free-form response
            "response": compiler.compile_free_form(),
        },
        default_grammar=compiler.compile_free_form(),
    )

| Processing Stage | Active Grammar | Token Mask |
|---|---|---|
| `<think>` start ~ `</think>` | Free-form | Full vocabulary allowed |
| After `<tool_call>` start | JSON schema grammar | Only tokens matching the schema allowed |
| After `</tool_call>` | Default grammar | Awaiting tag detection |
Reusing the GrammarCompiler instance is the key. The Cross-Grammar Cache lives inside this compiler, accumulating and reusing token masks for common substructures across multiple requests. Creating a new compiler per request resets the cache and eliminates the performance benefit.
Example 3: Integrating XGrammar-2 into Existing vLLM Code
Environment: a vLLM release that ships the xgrammar guided-decoding backend (the 0.4 series predates it, so check per-version release notes for XGrammar-2 support status), Python 3.10+
If you're already using vLLM, you can switch with a single guided_decoding_backend setting. The approach leaves existing code untouched and only changes the backend, so migration risk is minimal.

    from vllm import LLM, SamplingParams
    from vllm.sampling_params import GuidedDecodingParams

    llm = LLM(
        model="Qwen/Qwen2.5-72B-Instruct",
        guided_decoding_backend="xgrammar",  # Add only this one line
    )

    response_schema = {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["success", "error", "partial"]},
            "results": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {"type": "integer"},
                        "content": {"type": "string"},
                        "score": {"type": "number"},
                    },
                    "additionalProperties": False,
                },
            },
            "metadata": {"type": "object"},
        },
        "additionalProperties": False,
    }

    params = SamplingParams(
        temperature=0.0,
        max_tokens=1024,
        # Offline inference passes the schema through GuidedDecodingParams;
        # guided_json is the request field of the OpenAI-compatible server.
        # Newer vLLM releases may rename these options, so check your version's docs.
        guided_decoding=GuidedDecodingParams(json=response_schema),
    )
    outputs = llm.generate(["Organize the search results as JSON"], params)
    print(outputs[0].outputs[0].text)

In environments with a fixed set of 10 or fewer tools, the existing engine may be sufficient. On the other hand, if you're dynamically selecting a different tool subset per request or serving models with a reasoning channel, there's a good chance the switch will be noticeable.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Dramatic performance gains | Up to 80× efficiency over XGrammar generation 1, 6–10× per-token throughput, 100×+ reduction in compilation time |
| Optimized for dynamic agents | The only engine that efficiently handles both inter-request and intra-request structural changes |
| Universal framework compatibility | Fully integrated with vLLM, SGLang, TensorRT-LLM, and MLC-LLM |
| Native support for reasoning models | Handles <think> patterns from DeepSeek and Qwen series naturally via TagDispatch |
| Open source (Apache-2.0) | Available on GitHub, customizable |
Drawbacks and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Increased complexity | New abstraction layers like TagDispatch and Cross-Grammar Cache raise debugging difficulty | Recommend increasing log level to trace grammar transition points |
| Limited benefit for simple fixed schemas | If the schema is always the same, the practical advantage over XGrammar generation 1 may not be significant | If you have 10 or fewer fixed-schema tools, the existing engine is sufficient |
| Earley parser complexity | Worst case O(N³) — can bottleneck on very complex recursive grammars | Recommend benchmarking with your actual schemas before adopting |
| Emerging technology | Announced May 2026, limited production feedback accumulated | Recommend thorough validation in a staging environment before switching |
Most Common Mistakes in Practice
- Omitting `additionalProperties: false` from the schema: XGrammar-2 ends up allowing extra fields, and the model freely inserts unexpected keys. If using Pydantic v2 models, it's recommended to also set `model_config = ConfigDict(extra="forbid")`.
- Mismatching TagDispatch tag names with the model's prompt: If the system prompt instructs `<tool_call>` but TagDispatch is registered with `<function_call>`, dispatch will not work at all. Tag names must match exactly between the system prompt and the TagDispatch configuration.
- Creating a new `GrammarCompiler` per request within a single process: To benefit from Cross-Grammar Cache, you must reuse the `GrammarCompiler` instance. Creating a new compiler per request resets the cache and eliminates the performance benefit.
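The first mistake is easy to catch mechanically before a schema ever reaches the engine. Below is a small sketch of such a check; `find_open_objects` is a hypothetical helper that only walks `properties` and `items`, so schemas using `$ref`, `anyOf`, and similar keywords would need more handling.

```python
def find_open_objects(schema, path="$"):
    """Return locations of object schemas that do not set additionalProperties to false."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    if schema.get("type") == "object" and schema.get("additionalProperties") is not False:
        problems.append(path)
    # Recurse into nested property schemas.
    for key, sub in schema.get("properties", {}).items():
        problems += find_open_objects(sub, f"{path}.properties.{key}")
    # Recurse into array item schemas.
    if "items" in schema:
        problems += find_open_objects(schema["items"], f"{path}.items")
    return problems


tool_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},  # nested object left open
    },
    "required": ["name", "arguments"],
    "additionalProperties": False,
}
open_paths = find_open_objects(tool_schema)
print(open_paths)
```

Note that closing the top-level object is not enough: a nested `{"type": "object"}` such as a generic `arguments` or `metadata` field still accepts arbitrary keys unless it is closed as well or given its own properties.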
Closing Thoughts
XGrammar-2 is an engine that demonstrates structured output no longer has to be a performance bottleneck in agentic LLM serving. In particular, if you're running a dynamic tool registry or serving models with a reasoning channel like DeepSeek R1 or QwQ, there's a good chance you'll feel the difference after switching. Conversely, if you're operating with a single fixed schema, replacing it right now is not a high priority.
Three steps you can take right now:
- Install the package and run a quick benchmark: Install via `pip install xgrammar` (recent vLLM releases already pull in xgrammar as a dependency on supported platforms), then try compiling your current schemas with `GrammarCompiler`. If your environment has more than 50 tools, you'll immediately feel the difference in compilation time.
- Add a single backend option to your existing vLLM or SGLang serving code: In vLLM, adding just the `guided_decoding_backend="xgrammar"` parameter switches the backend without changing any other code, and SGLang exposes an equivalent grammar backend setting. You can first check structured output latency metrics in a staging environment. If structured output overhead accounts for more than 5% of total response time, that's a sufficient signal to consider switching.
- If using reasoning models, consider adopting TagDispatch configuration: If you've experienced unstable JSON parsing with models like DeepSeek R1, QwQ, or the Qwen series that output `<think>` blocks, refer to the TagDispatch examples in the official documentation to set it up. In this case, you can expect greater stability improvements than a simple backend switch.
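The 5% rule of thumb from the second step is simple to turn into a check once you have per-request timings. The sketch below assumes you can already measure both numbers for each request; the latencies are made-up placeholder values, not benchmark results, and `overhead_share` is a hypothetical helper.

```python
def overhead_share(total_ms, structured_ms):
    """Share of total response time spent on structured-output work."""
    return sum(structured_ms) / sum(total_ms)

# Placeholder timings in milliseconds for three requests (not real measurements).
total_ms = [820.0, 790.0, 905.0]
structured_ms = [61.0, 48.0, 70.0]

share = overhead_share(total_ms, structured_ms)
worth_evaluating = share > 0.05  # the 5% threshold suggested above

print(f"{share:.1%} of response time, consider switching: {worth_evaluating}")
```

Aggregating over many requests rather than looking at a single one keeps an outlier with an unusually large schema from driving the decision.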
References
- XGrammar-2: Fast and Customizable Structured Generation for Tool Calling and Agents | MLC Blog
- XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs | arXiv
- XGrammar: Achieving Efficient, Flexible, Portable Structured Generation | MLC Blog
- XGrammar (original paper) | arXiv
- mlc-ai/xgrammar | GitHub
- xgrammar | PyPI
- XGrammar-2 Review | TheMoonlight