DEV BAK - TECH BLOG
AI

XGrammar-2: The Design Principles Behind 80x Faster Structured Output

When an LLM calls a tool or returns JSON, a surprisingly heavy operation runs behind the scenes. Every time the model emits a token, the serving engine must determine in real time whether that token is valid in the current grammar state. As the number of tools grows into the hundreds, the cost of recompiling the grammar on every request quietly starts to pile up. Whether or not you've personally served AI agents, this bottleneck is hard to avoid once you're running an LLM-based service.

I ran into this myself while attaching function calling to DeepSeek-series models — the JSON parsing would go sideways after the <think> block closed, and at the time I just wrote a parser by hand and stuffed it in. It wasn't until I saw XGrammar-2 that I realized just how barbaric that approach was.

In May 2026, the CMU research team and the MLC project released XGrammar-2, reporting up to 80x higher efficiency than first-generation XGrammar, 6–10x per-token throughput, and a more than 100x reduction in compilation time. This post analyzes the paper and the official blog, and verifies a selection of the code examples. By the end, you'll be able to judge under which conditions switching to XGrammar-2 pays off in practice, and which single line of code to change in vLLM or SGLang.


Core Concepts

XGrammar-2 is not simply a performance upgrade. It identifies two fundamentally flawed assumptions that existing engines made in agentic environments, and addresses each with a distinct mechanism. Let's start by examining what exactly went wrong.

Why Structured Output Is Especially Tricky in Agentic Environments

Structured Generation: A technique that constrains token generation so an LLM produces text that exactly conforms to a predefined format — JSON, a function-call schema, a specific protocol. On every token generation, a token mask — a bit array that restricts which tokens can appear next given the current grammar state — is computed and applied.
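To make the token mask idea concrete, here is a minimal sketch in plain Python. The five-token vocabulary, the logits, and the mask are all made up for illustration; a real engine works on the full tokenizer vocabulary and derives the mask from the grammar state.

```python
import math

def apply_token_mask(logits, mask):
    """Replace the logit of every disallowed token with -inf."""
    return [x if allowed else -math.inf for x, allowed in zip(logits, mask)]

def greedy_pick(logits):
    """Return the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and logits (hypothetical values)
vocab = ["{", "}", '"name"', ":", "hello"]
logits = [0.1, 0.3, 1.2, 0.2, 2.5]

# Hypothetical grammar state: at the start of a JSON object only "{" is valid
mask = [True, False, False, False, False]

masked = apply_token_mask(logits, mask)
chosen = vocab[greedy_pick(masked)]
print(chosen)
```

Without the mask, greedy decoding would pick "hello" because it has the highest logit. With the mask applied, the only surviving candidate is "{", which is exactly what constrained decoding is meant to guarantee.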

Existing structured generation engines were designed on two main assumptions. First, schemas can be pre-compiled before a request arrives. Second, the output structure does not change within a single request. Agentic environments break both assumptions.

Inter-request dynamism: Each agent uses different tools, and for every request a different subset must be selected from a registry of 500 tools and the schema recompiled. The number of combinations that could be cached in advance is effectively infinite.

Intra-request dynamism: Chain-of-thought models like DeepSeek R1 or QwQ switch structures within a single request: free-form <think>...</think> reasoning → tool call JSON → response text. No single grammar can cover this entire flow.
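To get a feel for why pre-caching every tool combination is hopeless, a quick back-of-the-envelope calculation helps. The sketch below only counts subsets of a fixed size from a 500-tool registry; the subset sizes are my own illustrative choices, not figures from the paper.

```python
import math

def subset_count(registry_size, tools_per_request):
    """Number of distinct tool subsets of a given size."""
    return math.comb(registry_size, tools_per_request)

for k in (3, 5, 10):
    print(f"{k} tools out of 500: {subset_count(500, k):,} possible subsets")
```

Even with only three tools per request there are already about 20.7 million distinct subsets, so any scheme that caches one compiled grammar per combination runs out of road almost immediately.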

Even without agentic development experience, you can intuitively understand why this is a problem. A traditional web API has one fixed schema per request, but an LLM agent's set of allowed tokens changes in real time depending on "what the model is outputting at this very moment." A statically designed parser cannot handle this dynamism efficiently.

XGrammar-2 addresses these two forms of dynamism with separate mechanisms.


TagDispatch: Handling Intra-Request Structure Transitions Automatically

TagDispatch is a dispatch layer that dynamically swaps the active grammar when it detects specific tags in the output stream. When a <tool_call> tag appears, it automatically switches to the JSON schema grammar for that tool; when a <think> tag appears, it switches to an unconstrained free-form grammar.

python
# TagDispatch conceptual structure (pseudocode)
grammar_map = {
    "<think>":     FreeFormGrammar(),
    "<tool_call>": JSONSchemaGrammar(tool_schema),
    "<response>":  ResponseFormatGrammar(),
}
 
dispatcher = TagDispatch(grammar_map)
for token in token_stream:
    active_grammar = dispatcher.dispatch(token)
    # Compute token mask based on the current grammar's state machine
    mask = active_grammar.get_token_mask(current_state)

At first I thought, "Can't you just split the stream into segments at the parser level?" — but the issue is that token mask computation depends on the current state of the grammar's state machine. Changing the grammar means the state machine itself must change, and performing that transition at runtime without overhead is the core challenge. TagDispatch handles this state machine transition atomically at the moment a tag is detected.
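The switching behavior can be simulated in a few lines of plain Python. This is a toy model under my own assumptions, not XGrammar-2's implementation: the class name TagDispatcher, the mode names, and the string-suffix matching are all hypothetical, and a real engine swaps state machines rather than string labels.

```python
class TagDispatcher:
    """Toy dispatcher: tracks which grammar mode governs the next piece of output."""

    def __init__(self, tag_modes, default_mode="dispatch"):
        # tag_modes maps an opening tag to (mode_name, closing_tag)
        self.tag_modes = tag_modes
        self.default_mode = default_mode
        self.mode = default_mode
        self.closing_tag = None
        self.buffer = ""

    def feed(self, piece):
        """Consume one decoded piece and return the mode for the next piece."""
        self.buffer += piece
        if self.closing_tag is None:
            # Outside any tag: look for an opening tag at the end of the buffer
            for open_tag, (mode, close_tag) in self.tag_modes.items():
                if self.buffer.endswith(open_tag):
                    self.mode, self.closing_tag = mode, close_tag
                    self.buffer = ""
                    break
        elif self.buffer.endswith(self.closing_tag):
            # Inside a tag: the closing tag returns control to the default mode
            self.mode, self.closing_tag = self.default_mode, None
            self.buffer = ""
        return self.mode


dispatcher = TagDispatcher({
    "<think>": ("free_form", "</think>"),
    "<tool_call>": ("json_schema", "</tool_call>"),
})
stream = ["<think>", "check weather", "</think>",
          "<tool_call>", '{"name": "search"}', "</tool_call>", "done"]
modes = [dispatcher.feed(piece) for piece in stream]
print(modes)
```

Because the buffer accumulates across pieces, a tag that the tokenizer splits into several tokens (for example "<tool" followed by "_call>") is still detected. The part this toy leaves out is the hard part the paragraph above describes: replacing the underlying state machine without paying for it at runtime.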


Cross-Grammar Cache: Reusing Shared Structure Across Different Schemas

Cross-Grammar Cache: A caching layer that computes the token masks for substructures shared across different grammars (tool schemas) only once and reuses them. The key point is that this cache operates at the level of states in the grammar state machine, not at the schema level.

Even with 500 tools, every tool schema ultimately follows JSON object structure. Patterns like {, "key":, numeric parsing, and closing arrays repeat across most schemas. The old approach performed this computation independently per schema, but Cross-Grammar Cache identifies shared substructures at the grammar state level and reuses once-computed token masks across multiple grammars.

To put it intuitively: even though tool A's JSON schema and tool B's JSON schema differ in content, both grammars share identical state machine structure when in the state "JSON object opened, awaiting first key." The token mask for that common state only needs to be computed once and can be reused by both grammars.
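Here is a small sketch of that state-level sharing. It is a simplification under my own assumptions: each grammar state is reduced to the set of tokens it allows next, the two "grammars" are hand-written dictionaries, and the class name CrossGrammarMaskCache is hypothetical rather than anything from the XGrammar-2 codebase.

```python
VOCAB = ["{", "}", '"', ":", ",", "0", "1", "query", "city"]

# Hypothetical states for two tool grammars: state name -> tokens allowed next
grammar_a = {
    "start": frozenset({"{"}),
    "await_key": frozenset({'"'}),
    "in_key": frozenset({"query"}),
}
grammar_b = {
    "start": frozenset({"{"}),
    "await_key": frozenset({'"'}),
    "in_key": frozenset({"city"}),
}

class CrossGrammarMaskCache:
    """Caches masks by state signature, so identical states share one mask."""

    def __init__(self, vocab):
        self.vocab = vocab
        self._masks = {}
        self.scans = 0  # number of full-vocabulary scans actually performed

    def get_mask(self, allowed):
        if allowed not in self._masks:
            self.scans += 1
            self._masks[allowed] = tuple(tok in allowed for tok in self.vocab)
        return self._masks[allowed]

cache = CrossGrammarMaskCache(VOCAB)
for grammar in (grammar_a, grammar_b):
    for state_name, allowed in grammar.items():
        cache.get_mask(allowed)

print(cache.scans)
```

The two grammars have six states in total, but only four vocabulary scans happen, because "start" and "await_key" have the same signature in both. With a real vocabulary of 100k+ tokens and hundreds of tools, the scans saved this way are where the benefit would come from.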


Other Core Mechanisms

Earley Parser: An algorithm for parsing context-free grammars (CFGs), capable of accurately handling grammars with recursive structures like JSON schemas. For an input of length N, it runs in O(N³) in the worst case (ambiguous grammars), in O(N²) for unambiguous grammars, and in close to linear time for most deterministic grammars.
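For readers who have not seen the algorithm before, the following is a minimal textbook Earley recognizer, not XGrammar-2's implementation. It assumes a grammar without empty right-hand sides, and the nested-array grammar is a toy I made up to show recursion.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    `grammar` maps a nonterminal to a list of right-hand sides (tuples of symbols).
    A symbol is a nonterminal if it is a key of `grammar`, otherwise a terminal.
    Assumes no empty right-hand sides.
    """
    # An item is (lhs, rhs, dot position, origin position)
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predict: expand the nonterminal after the dot
                    for production in grammar[symbol]:
                        item = (symbol, production, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scan: the terminal matches the next input token
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance every item that was waiting for `lhs`
                for p_lhs, p_rhs, p_dot, p_origin in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        item = (p_lhs, p_rhs, p_dot + 1, p_origin)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(tokens)]
    )


# Toy grammar for nested arrays of the literal "n"
GRAMMAR = {
    "value": [("[", "]"), ("[", "items", "]"), ("n",)],
    "items": [("value",), ("value", ",", "items")],
}

print(earley_recognize(GRAMMAR, "value", list("[n,[n]]")))
print(earley_recognize(GRAMMAR, "value", list("[n,]")))
```

The first input is accepted and the second is rejected because of the trailing comma. For token masking, the interesting part is the chart at the current position: the terminals that appear right after a dot there are the only ones allowed next.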

| Technique | Problem Solved | Core Principle |
| --- | --- | --- |
| JIT Compilation | Unnecessary upfront compilation cost | Compile a grammar only when an actual request comes in. Even with 500 tools, only process the subset used by real requests |
| Repetition State Compression | Memory waste from array/list patterns | Compress repetitive structures like [item, item, ...] in state space to reduce memory and computation |
| Earley Adaptive Mask | Handling complex recursive grammars | Earley-parser-based efficient token mask computation even for context-dependent grammars |
| TagDispatch | Intra-request structure transitions | Atomically swap the grammar state machine at tag detection time (detailed above) |
| Cross-Grammar Cache | Inter-request duplicate schema computation | Reuse token masks for common state substructures at the state level (detailed above) |
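The JIT row is the easiest one to picture in code. The sketch below is a plain lazy-compilation table under my own assumptions: fake_compile stands in for a real grammar compiler, and the 500-entry registry is generated on the spot purely for illustration.

```python
import json

def fake_compile(schema):
    """Stand-in for an expensive grammar compilation step (hypothetical)."""
    return json.dumps(schema, sort_keys=True)

class LazyGrammarTable:
    """Compiles a tool's grammar the first time a request actually needs it."""

    def __init__(self, registry, compile_fn):
        self.registry = registry      # tool name -> JSON schema
        self.compile_fn = compile_fn
        self.compiled = {}            # tool name -> compiled grammar

    def get(self, tool_name):
        if tool_name not in self.compiled:
            self.compiled[tool_name] = self.compile_fn(self.registry[tool_name])
        return self.compiled[tool_name]

# Made-up registry of 500 tools
registry = {
    f"tool_{i}": {"type": "object", "properties": {f"arg_{i}": {"type": "string"}}}
    for i in range(500)
}
table = LazyGrammarTable(registry, fake_compile)

# Two requests, each selecting its own subset of tools
for request_tools in (["tool_1", "tool_7"], ["tool_7", "tool_42"]):
    grammars = [table.get(name) for name in request_tools]

print(len(table.compiled), "of", len(registry), "schemas compiled")
```

Only the three tools that real requests touched are ever compiled; the other 497 cost nothing. The trade-off is that the first request to use a tool pays the compilation latency, which is why the compile-time reduction matters so much in this design.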

Structural Tag: Abstraction for Agent Output Formats

Agent output formats vary from model to model and framework to framework. With OpenAI format, Anthropic format, and custom formats all mixed together, XGrammar-2 introduces an abstraction layer called Structural Tag.

Honestly, not enough production code using this feature is public yet for me to have fully internalized it, and the paper alone makes it hard to get a feel for. The basic idea is as follows:

json
{
  "type": "tool_call",
  "name": "search",
  "arguments": {"query": "Seoul weather"},
  "reasoning": "<think> ... </think>"
}

If TagDispatch is the tag-detection layer, Structural Tag is the protocol sitting on top of it that represents agent output as a consistent JSON structure. Amid ongoing discussions around standardizing the OpenAI response format, this is an attempt to unify diverse formats under a single abstraction.


Practical Application

Example 1: Serving a Large-Scale Tool-Calling Agent in SGLang

Environment: Latest version of SGLang (check per-version release notes for XGrammar-2 integration status), Python 3.10+

SGLang has integrated XGrammar-2 as its default structured output backend. Passing a schema via the json_schema parameter of sgl.gen causes Cross-Grammar Cache to operate automatically.

python
import json
from typing import Literal

import sglang as sgl
from pydantic import BaseModel, ConfigDict

class ToolCallResponse(BaseModel):
    model_config = ConfigDict(extra="forbid")  # Emits "additionalProperties": false in the schema
    tool_name: Literal["search", "calculate", "fetch_data"]
    arguments: dict
    confidence: float

@sgl.function
def tool_call_agent(s, user_query: str):
    s += sgl.system("You are a helpful assistant with access to tools.")
    s += sgl.user(user_query)
    s += sgl.assistant(
        sgl.gen(
            "response",
            # sgl.gen takes the schema as a JSON string via json_schema
            # (guided_json is the vLLM parameter name, not SGLang's)
            json_schema=json.dumps(ToolCallResponse.model_json_schema()),
            max_tokens=512,
        )
    )

# Start a runtime with the xgrammar grammar backend and register it as the default.
# Assumes a release where sgl.Runtime forwards grammar_backend to the server
# arguments; check the release notes for your version.
runtime = sgl.Runtime(model_path="deepseek-ai/DeepSeek-V3", grammar_backend="xgrammar")
sgl.set_default_backend(runtime)

state = tool_call_agent.run(
    user_query="Search for the current weather in Seoul and tell me in Celsius",
)
print(state["response"])
runtime.shutdown()

| Code Point | Description |
| --- | --- |
| json_schema=json.dumps(ToolCallResponse.model_json_schema()) | Passes the Pydantic model's JSON schema as a JSON string. XGrammar-2 JIT-compiles this schema to generate the token mask |
| grammar_backend="xgrammar" | Explicitly selects the XGrammar backend in the SGLang runtime |
| ConfigDict(extra="forbid") | Blocks additional fields not in the schema. Without this, the model may insert unexpected keys |
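In the dynamic-registry scenario, the schema itself has to be assembled per request. The following sketch builds one schema that accepts a call to any tool in the selected subset. The registry contents and the helper name tool_call_schema are hypothetical, and whether a given backend supports anyOf and const is something to confirm in its documentation.

```python
import json

# Hypothetical registry: tool name -> JSON schema of its arguments
REGISTRY = {
    "search": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
        "additionalProperties": False,
    },
    "calculate": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
        "additionalProperties": False,
    },
    "fetch_data": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
        "additionalProperties": False,
    },
}

def tool_call_schema(registry, selected):
    """Build one JSON schema that accepts a call to any tool in `selected`."""
    return {
        "anyOf": [
            {
                "type": "object",
                "properties": {
                    "tool_name": {"const": name},
                    "arguments": registry[name],
                },
                "required": ["tool_name", "arguments"],
                "additionalProperties": False,
            }
            for name in selected
        ]
    }

schema = tool_call_schema(REGISTRY, ["search", "calculate"])
print(json.dumps(schema, indent=2))
```

Tying each tool name to its own argument schema is stricter than the ToolCallResponse model above, where arguments is an unconstrained dict. The resulting schema changes with every subset, which is precisely the inter-request dynamism that makes compile time and cross-grammar reuse matter.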

Example 2: Handling Mixed Output from Models with a Reasoning Channel

Models like DeepSeek R1 and QwQ that first emit a <think> block and then produce a structured response are honestly the trickiest case. Seeing how TagDispatch handles this transition automatically is quite elegant.

Note: The code below is pseudocode based on XGrammar-2 official documentation. Actual class names and method names may differ depending on the library version — check the official GitHub API docs before running.

python
# pip install xgrammar
# Conceptual pseudocode based on XGrammar-2. TagDispatchGrammar and
# compile_free_form() are illustrative names, not confirmed xgrammar APIs.
from xgrammar import GrammarCompiler, TagDispatchGrammar

# The GrammarCompiler instance must be reused to benefit from Cross-Grammar Cache.
# (The released xgrammar GrammarCompiler is constructed from a TokenizerInfo;
# that argument is omitted here for brevity.)
compiler = GrammarCompiler()

tag_dispatch_grammar = TagDispatchGrammar(
    tag_grammar_map={
        # Keys are the literal tags the model emits, matching the system prompt

        # <think> block: free-form (no token mask constraints)
        "<think>": compiler.compile_free_form(),

        # <tool_call> block: strict JSON schema
        "<tool_call>": compiler.compile_json_schema({
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "arguments": {"type": "object"},
            },
            "required": ["name", "arguments"],
            "additionalProperties": False,
        }),

        # <response> block: free-form response
        "<response>": compiler.compile_free_form(),
    },
    # Grammar used while no tag is active
    default_grammar=compiler.compile_free_form(),
)

| Processing Stage | Active Grammar | Token Mask |
| --- | --- | --- |
| <think> start ~ </think> | Free-form | Full vocabulary allowed |
| After <tool_call> start | JSON schema grammar | Only tokens matching the schema allowed |
| After </tool_call> | Default grammar | Awaiting tag detection |

Reusing the GrammarCompiler instance is the key. The Cross-Grammar Cache lives inside this compiler, accumulating and reusing token masks for common substructures across multiple requests. Creating a new compiler per request resets the cache and eliminates the performance benefit.


Example 3: Integrating XGrammar-2 into Existing vLLM Code

Environment: a vLLM release that ships the xgrammar guided decoding backend (check per-version release notes for XGrammar-2 support status), Python 3.10+

If you're already using vLLM, you can switch with a single guided_decoding_backend setting. The approach leaves existing code untouched and only changes the backend, so migration risk is minimal.

python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    guided_decoding_backend="xgrammar",  # Add only this one line
)

response_schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["success", "error", "partial"]},
        "results": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "content": {"type": "string"},
                    "score": {"type": "number"},
                },
                "additionalProperties": False,
            },
        },
        "metadata": {"type": "object"},
    },
    "additionalProperties": False,
}

# Offline inference passes the schema through GuidedDecodingParams;
# guided_json is the request field of the OpenAI-compatible server.
# Newer vLLM releases rename these options, so check the docs for your version.
params = SamplingParams(
    temperature=0.0,
    guided_decoding=GuidedDecodingParams(json=response_schema),
    max_tokens=1024,
)

outputs = llm.generate(["Organize the search results as JSON"], params)
print(outputs[0].outputs[0].text)

In environments with a fixed set of 10 or fewer tools, the existing engine may be sufficient. On the other hand, if you're dynamically selecting a different tool subset per request or serving models with a reasoning channel, there's a good chance the switch will be noticeable.


Pros and Cons

Advantages

| Item | Details |
| --- | --- |
| Dramatic performance gains | Up to 80× efficiency over XGrammar generation 1, 6–10× per-token throughput, 100×+ reduction in compilation time |
| Optimized for dynamic agents | Designed to handle both inter-request and intra-request structural changes efficiently |
| Universal framework compatibility | Fully integrated with vLLM, SGLang, TensorRT-LLM, and MLC-LLM |
| Native support for reasoning models | Handles <think> patterns from DeepSeek and Qwen series naturally via TagDispatch |
| Open source | Apache-2.0 licensed, available on GitHub, customizable |

Drawbacks and Caveats

| Item | Details | Mitigation |
| --- | --- | --- |
| Increased complexity | New abstraction layers like TagDispatch and Cross-Grammar Cache raise debugging difficulty | Recommend increasing log level to trace grammar transition points |
| Limited benefit for simple fixed schemas | If the schema is always the same, the practical advantage over XGrammar generation 1 may not be significant | If you have 10 or fewer fixed-schema tools, the existing engine is sufficient |
| Earley parser complexity | Worst case O(N³), which can bottleneck on very complex recursive grammars | Recommend benchmarking with your actual schemas before adopting |
| Emerging technology | Announced May 2026, limited production feedback accumulated | Recommend thorough validation in a staging environment before switching |

Most Common Mistakes in Practice

  1. Omitting additionalProperties: false from the schema: XGrammar-2 ends up allowing extra fields, and the model freely inserts unexpected keys. If using Pydantic v2 models, it's recommended to also set model_config = ConfigDict(extra="forbid").

  2. Mismatching TagDispatch tag names with the model's prompt: If the system prompt instructs <tool_call> but TagDispatch is registered with <function_call>, dispatch will not work at all. Tag names must match exactly between the system prompt and the TagDispatch configuration.

  3. Creating a new GrammarCompiler per request within a single process: To benefit from Cross-Grammar Cache, you must reuse the GrammarCompiler instance. Creating a new compiler per request resets the cache and eliminates the performance benefit.
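The first two mistakes are easy to catch with a pre-flight check before deployment. The sketch below uses only the standard library; the helper names find_open_objects and missing_tags are my own, and the schema and prompt are made-up examples.

```python
def find_open_objects(schema, path="$"):
    """Yield paths of object schemas that do not set additionalProperties to False."""
    if isinstance(schema, dict):
        if schema.get("type") == "object" and schema.get("additionalProperties") is not False:
            yield path
        for key, value in schema.items():
            yield from find_open_objects(value, f"{path}.{key}")
    elif isinstance(schema, list):
        for index, value in enumerate(schema):
            yield from find_open_objects(value, f"{path}[{index}]")

def missing_tags(system_prompt, registered_tags):
    """Return registered tags that never appear in the system prompt."""
    return [tag for tag in registered_tags if tag not in system_prompt]

schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string"},
        "metadata": {"type": "object"},  # nested object left open by accident
    },
    "additionalProperties": False,
}
open_paths = list(find_open_objects(schema))
print(open_paths)

prompt = "Call tools inside <tool_call>...</tool_call> blocks."
unmatched = missing_tags(prompt, ["<tool_call>", "<function_call>"])
print(unmatched)
```

The schema check flags the nested metadata object even though the top level is closed, which is the variant of mistake 1 that tends to slip through review. The tag check flags <function_call> because the prompt never mentions it, which is mistake 2 in miniature.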


Closing Thoughts

XGrammar-2 is an engine that demonstrates structured output no longer has to be a performance bottleneck in agentic LLM serving. In particular, if you're running a dynamic tool registry or serving models with a reasoning channel like DeepSeek R1 or QwQ, there's a good chance you'll feel the difference after switching. Conversely, if you're operating with a single fixed schema, replacing it right now is not a high priority.

Three steps you can take right now:

  1. Install the package and run a quick benchmark: Install via pip install xgrammar (recent vLLM releases already pull in xgrammar as a dependency), then try compiling your current schemas with GrammarCompiler. If your environment has more than 50 tools, the difference in compilation time should be noticeable right away.

  2. Add a single backend option to your existing vLLM or SGLang serving code: In vLLM, adding just the guided_decoding_backend="xgrammar" parameter switches the backend without changing any existing code; SGLang selects its backend through its own grammar backend setting. You can first check structured output latency metrics in a staging environment. If structured output overhead accounts for more than 5% of total response time, that's a sufficient signal to consider switching.

  3. If using reasoning models, consider adopting TagDispatch configuration: If you've experienced unstable JSON parsing with models like DeepSeek R1, QwQ, or the Qwen series that output <think> blocks, refer to the TagDispatch examples in the official documentation to set it up. In this case, you can expect greater stability improvements than a simple backend switch.
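The 5% rule of thumb from step 2 is simple arithmetic once you have timings. Here is a sketch of the calculation; the function name and the millisecond figures are hypothetical, and how you collect mask and compile time depends on what your serving stack exposes.

```python
THRESHOLD = 0.05  # the 5% rule of thumb from step 2

def structured_overhead_share(total_ms, mask_ms, compile_ms=0.0):
    """Fraction of total response time spent on structured output work."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    return (mask_ms + compile_ms) / total_ms

# Hypothetical staging measurements for one request
share = structured_overhead_share(total_ms=1200.0, mask_ms=48.0, compile_ms=30.0)
worth_testing = share > THRESHOLD

print(f"structured output overhead: {share:.1%}")
print("consider switching" if worth_testing else "low priority")
```

With these made-up numbers the overhead is 6.5% of the response time, which clears the threshold. Averaging over a representative sample of requests gives a more trustworthy signal than a single measurement, since compile cost shows up only on cache misses.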


References

  • XGrammar-2: Fast and Customizable Structured Generation for Tool Calling and Agents | MLC Blog
  • XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs | arXiv
  • XGrammar: Achieving Efficient, Flexible, Portable Structured Generation | MLC Blog
  • XGrammar (original paper) | arXiv
  • mlc-ai/xgrammar | GitHub
  • xgrammar | PyPI
  • XGrammar-2 Review | TheMoonlight