XGrammar-2: The Design Principles Behind 80x Faster Structured Output

When an LLM calls a tool or returns JSON, there's actually quite a heavy operation running behind the scenes. Every time the model emits a token, it must determine in real time whether that token is valid in the current grammar state — and as the number of tools grows into the hundreds, the cost of recompiling the grammar on every request quietly starts to pile up. Whether or not you've personally served AI agents, this bottleneck is hard to avoid once you're running an LLM-based service.

I ran into this myself while attaching function calling to DeepSeek-series models: JSON parsing kept breaking once the <think> block closed, and at the time I papered over it with a hand-written parser. It wasn't until I saw XGrammar-2 that I realized just how crude that approach was.

In May 2026, a CMU research team and the MLC project released XGrammar-2, reporting up to an 80x efficiency improvement over first-generation XGrammar, 6–10x higher per-token throughput, and a more than 100x reduction in compilation time. This post analyzes the paper and official blog and walks through code examples, some of which are explicitly marked as pseudocode. By the end, you'll be able to judge under what conditions switching to XGrammar-2 delivers real-world impact, and which single setting to change in vLLM or SGLang.

Core Concepts

XGrammar-2 is not simply a performance upgrade. It identifies two fundamentally flawed assumptions that existing engines made in agentic environments, and addresses each with a distinct mechanism. Let's start by examining what exactly went wrong.

Why Structured Output Is Especially Tricky in Agentic Environments

Structured Generation: A technique that constrains token generation so an LLM produces text that exactly conforms to a predefined format — JSON, a function-call schema, a specific protocol. On every token generation, a token mask — a bit array that restricts which tokens can appear next given the current grammar state — is computed and applied.
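To make the token mask concrete, here is a minimal sketch of how a mask is applied to logits before picking the next token. The five-token vocabulary, the logit values, and the helper names are all made up for illustration, and a list of booleans stands in for the packed bit array a real engine would use.

```python
import math


def apply_token_mask(logits, allowed):
    """Set the logits of disallowed tokens to -inf so they can never be picked."""
    return [x if ok else -math.inf for x, ok in zip(logits, allowed)]


def greedy_pick(logits):
    """Return the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary and logits (illustrative values only).
vocab = ['{', '}', '"name"', 'hello', ':']
logits = [0.1, 0.3, 1.2, 2.5, 0.2]

# Grammar state "JSON object just opened": only a key or a closing brace is valid.
allowed = [False, True, True, False, False]

masked = apply_token_mask(logits, allowed)
unconstrained_choice = vocab[greedy_pick(logits)]
constrained_choice = vocab[greedy_pick(masked)]
print(unconstrained_choice, constrained_choice)
```

Without the mask the model would have picked the free-text token with the highest logit; with it, the best token that is still valid in the current grammar state wins. The cost that matters is computing `allowed` for the whole vocabulary at every step.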

Existing structured generation engines were designed on two main assumptions. First, schemas can be pre-compiled before a request arrives. Second, the output structure does not change within a single request. Agentic environments break both assumptions.

Inter-request dynamism: Each agent uses different tools, so every request selects a different subset from the tool registry (say, 500 tools) and the schema for that subset has to be compiled again. The number of possible subsets is far too large to enumerate and cache in advance.

Intra-request dynamism: Chain-of-thought models like DeepSeek R1 or QwQ switch structures within a single request: free-form <think>...</think> reasoning → tool call JSON → response text. No single grammar can cover this entire flow.
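To put a number on the inter-request case, the short calculation below counts how many distinct tool subsets a 500-tool registry allows. The registry size comes from the example above; the choice of 10 tools per request is my own assumption for illustration.

```python
import math

REGISTRY_SIZE = 500      # registry size used in the example above
TOOLS_PER_REQUEST = 10   # assumed subset size, for illustration only

# Every tool is either in or out of a request's subset.
all_subsets = 2 ** REGISTRY_SIZE

# Subsets of exactly TOOLS_PER_REQUEST tools.
fixed_size_subsets = math.comb(REGISTRY_SIZE, TOOLS_PER_REQUEST)

print(f"all subsets have {len(str(all_subsets))} decimal digits")
print(f"{TOOLS_PER_REQUEST}-tool subsets have {len(str(fixed_size_subsets))} decimal digits")
```

Even when each request uses only 10 tools, there are more than 10^20 possible combinations, so caching one compiled grammar per combination is not a realistic plan.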

Even without agentic development experience, you can intuitively understand why this is a problem. A traditional web API has one fixed schema per request, but an LLM agent's set of allowed tokens changes in real time depending on "what the model is outputting at this very moment." A statically designed parser cannot handle this dynamism efficiently.

XGrammar-2 addresses these two forms of dynamism with separate mechanisms.

TagDispatch: Handling Intra-Request Structure Transitions Automatically

TagDispatch is a dispatch layer that dynamically swaps the active grammar when it detects specific tags in the output stream. When a <tool_call> tag appears, it automatically switches to the JSON schema grammar for that tool; when a <think> tag appears, it switches to an unconstrained free-form grammar.

python

# TagDispatch conceptual structure (pseudocode)
grammar_map = {
    "<think>":     FreeFormGrammar(),
    "<tool_call>": JSONSchemaGrammar(tool_schema),
    "<response>":  ResponseFormatGrammar(),
}
 
dispatcher = TagDispatch(grammar_map)
for token in token_stream:
    active_grammar = dispatcher.dispatch(token)
    # Compute token mask based on the current grammar's state machine
    mask = active_grammar.get_token_mask(current_state)

At first I thought, "Can't you just split the stream into segments at the parser level?" — but the issue is that token mask computation depends on the current state of the grammar's state machine. Changing the grammar means the state machine itself must change, and performing that transition at runtime without overhead is the core challenge. TagDispatch handles this state machine transition atomically at the moment a tag is detected.
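The switching behavior can be sketched as a small runnable toy. This is not XGrammar-2's implementation: `ToyTagDispatcher` and the grammar labels are hypothetical names, it works on decoded text rather than token IDs, and it only tracks which grammar is active instead of swapping real state machines.

```python
class ToyTagDispatcher:
    """Tracks which grammar is active by watching for begin/end tags in the stream."""

    def __init__(self, tag_to_grammar, default="default"):
        self.tag_to_grammar = tag_to_grammar  # e.g. {"think": "free_form"}
        self.default = default
        self.active = default
        self.open_tag = None
        self.buffer = ""

    def feed(self, chunk):
        """Consume one decoded chunk and return the grammar active after it."""
        for ch in chunk:
            self.buffer += ch
            if self.open_tag is None:
                # Outside any tagged region: look for an opening tag.
                for tag, grammar in self.tag_to_grammar.items():
                    if self.buffer.endswith(f"<{tag}>"):
                        self.open_tag, self.active = tag, grammar
                        break
            elif self.buffer.endswith(f"</{self.open_tag}>"):
                # Closing tag of the current region: fall back to the default grammar.
                self.open_tag, self.active = None, self.default
        return self.active


dispatcher = ToyTagDispatcher({"think": "free_form", "tool_call": "json_schema"})
chunks = [
    "<think>", "check the weather", "</think>",
    "<tool_call>", '{"name": "search"}', "</tool_call>",
]
trace = [dispatcher.feed(chunk) for chunk in chunks]
print(trace)
```

The toy ignores edge cases such as a closing tag appearing inside a JSON string. It is only meant to show that the active grammar is a function of the tags seen so far, which is why the switch has to happen inside the decoding loop rather than in a post-hoc parser.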

Cross-Grammar Cache: Reusing Shared Structure Across Different Schemas

Cross-Grammar Cache: A caching layer that computes the token mask for substructures that appear in common across different grammars (tool schemas) only once and reuses them. The key point is that this cache operates at the level of states in the grammar state machine, not at the schema level.

Even with 500 tools, every tool schema ultimately follows JSON object structure. Patterns like {, "key":, numeric parsing, and closing arrays repeat across most schemas. The old approach performed this computation independently per schema, but Cross-Grammar Cache identifies shared substructures at the grammar state level and reuses once-computed token masks across multiple grammars.

To put it intuitively: even though tool A's JSON schema and tool B's JSON schema differ in content, both grammars share identical state machine structure when in the state "JSON object opened, awaiting first key." The token mask for that common state only needs to be computed once and can be reused by both grammars.
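A minimal sketch of the idea, assuming the cache is keyed by a structural description of the state rather than by the schema it came from. `StateMaskCache`, the state key, and the prefix-based mask computation are all hypothetical stand-ins, not the library's data structures.

```python
class StateMaskCache:
    """Memoizes token masks by a structural state key shared across grammars."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.masks = {}
        self.computed = 0  # how many times the full vocabulary was scanned

    def mask_for(self, state_key, allowed_prefixes):
        if state_key not in self.masks:
            self.computed += 1  # the expensive vocabulary scan happens only here
            self.masks[state_key] = tuple(
                token.startswith(allowed_prefixes) for token in self.vocab
            )
        return self.masks[state_key]


vocab = ['{', '}', '"', '"city"', '42', ' ']
cache = StateMaskCache(vocab)

# Two different tool schemas reach the same structural state.
state_key = ("json_object", "awaiting_first_key")
mask_for_tool_a = cache.mask_for(state_key, ('"', '}'))
mask_for_tool_b = cache.mask_for(state_key, ('"', '}'))

print(cache.computed, mask_for_tool_a)
```

The second lookup returns the stored mask without touching the vocabulary again. In a real engine the hard part is deciding when two states from different grammars are truly equivalent, and this sketch sidesteps that by using a hand-written key.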

Other Core Mechanisms

Earley Parser: An algorithm for parsing context-free grammars (CFGs), capable of accurately handling grammars with recursive structures like JSON schemas. For an input of length N it runs in O(N³) in the worst case (highly ambiguous grammars), O(N²) for unambiguous grammars, and close to linear time for many of the deterministic grammars seen in practice.
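For readers who have not seen the algorithm, here is a compact textbook-style Earley recognizer. It is a simplified sketch (no empty productions, no mask computation) and not the parser XGrammar-2 ships; the nested-list grammar is my own example.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    `grammar` maps each nonterminal to a list of productions (tuples of symbols).
    Any symbol that is not a key of `grammar` is treated as a terminal.
    Empty productions are not supported in this simplified version.
    """
    n = len(tokens)
    # chart[i] holds items (head, body, dot, origin) valid after reading i tokens.
    chart = [set() for _ in range(n + 1)]
    for body in grammar[start]:
        chart[0].add((start, body, 0, 0))

    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            head, body, dot, origin = worklist.pop()
            if dot == len(body):
                # Complete: advance every item that was waiting for `head`.
                new_items = [
                    (h, b, d + 1, o)
                    for (h, b, d, o) in chart[origin]
                    if d < len(b) and b[d] == head
                ]
            elif body[dot] in grammar:
                # Predict: expand the nonterminal after the dot.
                new_items = [(body[dot], b, 0, i) for b in grammar[body[dot]]]
            else:
                # Scan: a matching terminal moves the item into the next chart set.
                if i < n and tokens[i] == body[dot]:
                    chart[i + 1].add((head, body, dot + 1, origin))
                new_items = []
            for item in new_items:
                if item not in chart[i]:
                    chart[i].add(item)
                    worklist.append(item)

    return any(h == start and d == len(b) and o == 0 for (h, b, d, o) in chart[n])


# Recursive grammar for nested lists of "n", similar in shape to JSON arrays.
NESTED_LIST = {
    "value": [("n",), ("[", "]"), ("[", "items", "]")],
    "items": [("value",), ("value", ",", "items")],
}

print(earley_recognize(NESTED_LIST, "value", list("[n,[n,n]]")))
print(earley_recognize(NESTED_LIST, "value", list("[n,]")))
```

The chart sets are what make recursion manageable: each set records every partial parse that is still alive at that position, which is the information a token mask computation needs when it asks which terminals could come next.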

Technique	Problem Solved	Core Principle
JIT Compilation	Unnecessary upfront compilation cost	Compile a grammar only when an actual request comes in. Even with 500 tools, only process the subset used by real requests
Repetition State Compression	Memory waste from array/list patterns	Compress repetitive structures like `[item, item, ...]` in state space to reduce memory and computation
Earley Adaptive Mask	Handling complex recursive grammars	Earley-parser-based token mask computation that stays efficient even for deeply recursive context-free grammars
TagDispatch	Intra-request structure transitions	Atomically swap the grammar state machine at tag detection time (detailed above)
Cross-Grammar Cache	Inter-request duplicate schema computation	Reuse token masks for common state substructures at the state level (detailed above)
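The JIT row can be illustrated with a lazy registry that compiles a tool's grammar only the first time a request needs it. `LazyGrammarRegistry` and `fake_compile` are hypothetical stand-ins; the real compilation step is far more involved, and XGrammar-2 may defer work at a finer granularity than whole schemas.

```python
import json


class LazyGrammarRegistry:
    """Compiles a tool's grammar on first use and keeps the result."""

    def __init__(self, tool_schemas, compile_fn):
        self.tool_schemas = tool_schemas
        self.compile_fn = compile_fn
        self.compiled = {}

    def get(self, tool_name):
        if tool_name not in self.compiled:
            schema = self.tool_schemas[tool_name]
            self.compiled[tool_name] = self.compile_fn(schema)
        return self.compiled[tool_name]


def fake_compile(schema):
    # Stand-in for a real grammar compiler: just serializes the schema.
    return json.dumps(schema, sort_keys=True)


# 500 registered tools, but the incoming requests only ever touch two of them.
schemas = {
    f"tool_{i}": {"type": "object", "properties": {"arg": {"type": "string"}}}
    for i in range(500)
}
registry = LazyGrammarRegistry(schemas, fake_compile)
for requested in ["tool_3", "tool_42", "tool_3"]:
    registry.get(requested)

print(len(registry.compiled), "of", len(schemas), "grammars compiled")
```

Startup cost no longer scales with the size of the registry, only with the tools that real traffic touches.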

Structural Tag: Abstraction for Agent Output Formats

Agent output formats vary from model to model and framework to framework. With OpenAI format, Anthropic format, and custom formats all mixed together, XGrammar-2 introduces an abstraction layer called Structural Tag.

Honestly, I haven't fully internalized this part yet: there is little publicly available production code that uses it, and the paper alone makes it hard to get a feel for. The basic idea is as follows:

json

{
  "type": "tool_call",
  "name": "search",
  "arguments": {"query": "Seoul weather"},
  "reasoning": "<think> ... </think>"
}

If TagDispatch is the tag-detection layer, Structural Tag is the protocol sitting on top of it that represents agent output as a consistent JSON structure. Amid ongoing discussions around standardizing the OpenAI response format, this is an attempt to unify diverse formats under a single abstraction.
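Taking the JSON shape above at face value, the following sketch shows what normalizing raw tagged output into that structure could look like. The `to_structured` helper and the tag names are my own assumptions for illustration; they are not part of the XGrammar-2 API, and the real Structural Tag mechanism may work quite differently.

```python
import json
import re


def to_structured(raw_output):
    """Convert tagged model output into the record shape shown above."""
    think = re.search(r"<think>.*?</think>", raw_output, re.DOTALL)
    call = re.search(r"<tool_call>(.*?)</tool_call>", raw_output, re.DOTALL)
    if call is None:
        return None  # no tool call in this output
    payload = json.loads(call.group(1))
    return {
        "type": "tool_call",
        "name": payload["name"],
        "arguments": payload["arguments"],
        "reasoning": think.group(0) if think else "",
    }


raw = (
    "<think>User wants the weather.</think>"
    '<tool_call>{"name": "search", "arguments": {"query": "Seoul weather"}}</tool_call>'
)
record = to_structured(raw)
print(json.dumps(record, indent=2))
```

Because the tool call was generated under a JSON schema grammar, the `json.loads` call here should not fail on well-formed constrained output, which is exactly the guarantee my old hand-written parser could not give.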

Practical Application

Example 1: Serving a Large-Scale Tool-Calling Agent in SGLang

Environment: Latest version of SGLang (check per-version release notes for XGrammar-2 integration status), Python 3.10+

SGLang has integrated XGrammar-2 as its default structured output backend. Passing a schema to sgl.gen causes Cross-Grammar Cache to operate automatically. In the SGLang frontend, the schema goes in as a JSON string through the json_schema argument, and the grammar backend is chosen with the grammar_backend engine option; confirm both names against your installed version.

python

import json
from typing import Literal

import sglang as sgl
from pydantic import BaseModel, ConfigDict


class ToolCallResponse(BaseModel):
    model_config = ConfigDict(extra="forbid")  # Block additional fields
    tool_name: Literal["search", "calculate", "fetch_data"]
    arguments: dict
    confidence: float


@sgl.function
def tool_call_agent(s, user_query: str):
    s += sgl.system("You are a helpful assistant with access to tools.")
    s += sgl.user(user_query)
    s += sgl.assistant(
        sgl.gen(
            "response",
            # sgl.gen takes the schema as a JSON string
            json_schema=json.dumps(ToolCallResponse.model_json_schema()),
            max_tokens=512,
        )
    )


# Start a runtime that uses the xgrammar grammar backend and register it as the
# default backend so that tool_call_agent.run() knows where to send the request
runtime = sgl.Runtime(model_path="deepseek-ai/DeepSeek-V3", grammar_backend="xgrammar")
sgl.set_default_backend(runtime)

result = tool_call_agent.run(
    user_query="Search for the current weather in Seoul and tell me in Celsius",
)
print(result["response"])
runtime.shutdown()

Code Point	Description
`json_schema=json.dumps(ToolCallResponse.model_json_schema())`	Passes the Pydantic model's JSON schema as a JSON string. XGrammar-2 JIT-compiles this schema to generate the token mask
`grammar_backend="xgrammar"`	Explicitly selects the xgrammar backend in the SGLang runtime
`sgl.set_default_backend(runtime)`	Registers the runtime so that `.run()` can reach it
`ConfigDict(extra="forbid")`	Emits `additionalProperties: false` in the generated schema. Without this, the model may insert unexpected keys

Example 2: Handling Mixed Output from Models with a Reasoning Channel

Models like DeepSeek R1 and QwQ, which first emit a <think> block and then produce a structured response, are honestly the trickiest case. The way TagDispatch handles this transition automatically is quite elegant.

Note: The code below is pseudocode based on XGrammar-2 official documentation. Actual class names and method names may differ depending on the library version — check the official GitHub API docs before running.

python

# pip install xgrammar
# The following is conceptual pseudocode based on XGrammar-2 (actual API names may differ)
from xgrammar import GrammarCompiler, TagDispatchGrammar
 
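# The GrammarCompiler instance must be reused to benefit from Cross-Grammar Cache
# (the real constructor takes tokenizer information for the target model; omitted in this pseudocode)
compiler = GrammarCompiler()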
 
tag_dispatch_grammar = TagDispatchGrammar(
    tag_grammar_map={
        # <think> block: free-form (no token mask constraints)
        "think": compiler.compile_free_form(),
 
        # <tool_call> block: strict JSON schema
        "tool_call": compiler.compile_json_schema({
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "arguments": {"type": "object"},
            },
            "required": ["name", "arguments"],
            "additionalProperties": False,
        }),
 
        # <response> block: free-form response
        "response": compiler.compile_free_form(),
    },
    default_grammar=compiler.compile_free_form(),
)

Processing Stage	Active Grammar	Token Mask
`<think>` start ~ `</think>`	Free-form	Full vocabulary allowed
After `<tool_call>` start	JSON schema grammar	Only tokens matching the schema allowed
After `</tool_call>`	Default grammar	Awaiting tag detection

Reusing the GrammarCompiler instance is the key. The Cross-Grammar Cache lives inside this compiler, accumulating and reusing token masks for common substructures across multiple requests. Creating a new compiler per request resets the cache and eliminates the performance benefit.

Example 3: Integrating XGrammar-2 into Existing vLLM Code

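Environment: A vLLM release that ships the xgrammar guided-decoding backend (as far as I can tell this arrived in the v0.6 series, so v0.4 is too old; check per-version release notes for XGrammar-2 support status), Python 3.10+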

If you're already using vLLM, you can switch with a single guided_decoding_backend setting. The approach leaves existing code untouched and only changes the backend, so migration risk is minimal.

python

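from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    guided_decoding_backend="xgrammar",  # Add only this one line
)

response_schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["success", "error", "partial"]},
        "results": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "content": {"type": "string"},
                    "score": {"type": "number"},
                },
                "additionalProperties": False,
            },
        },
        "metadata": {"type": "object"},
    },
    "additionalProperties": False,
}

# In the offline LLM API the schema is wrapped in GuidedDecodingParams;
# guided_json is the name of the equivalent field in the OpenAI-compatible server.
# These names have changed between vLLM releases, so check the docs for your version.
params = SamplingParams(
    temperature=0.0,
    guided_decoding=GuidedDecodingParams(json=response_schema),
    max_tokens=1024,
)

outputs = llm.generate(["Organize the search results as JSON"], params)
print(outputs[0].outputs[0].text)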

In environments with a fixed set of 10 or fewer tools, the existing engine may be sufficient. On the other hand, if you're dynamically selecting a different tool subset per request or serving models with a reasoning channel, there's a good chance the switch will be noticeable.
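For the dynamic-subset case, the per-request schema is typically assembled from the registry right before the call. The sketch below shows one way to do that with plain dictionaries. The three-tool registry and the `schema_for_tools` helper are hypothetical, and whether your backend supports `anyOf` and `const` should be checked against its documentation.

```python
import json

# Hypothetical registry: tool name -> JSON schema of its arguments.
TOOL_REGISTRY = {
    "search": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
        "additionalProperties": False,
    },
    "calculate": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
        "additionalProperties": False,
    },
    "fetch_data": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
        "additionalProperties": False,
    },
}


def schema_for_tools(registry, selected):
    """Build one JSON schema that accepts a call to any of the selected tools."""
    return {
        "anyOf": [
            {
                "type": "object",
                "properties": {
                    "name": {"const": name},
                    "arguments": registry[name],
                },
                "required": ["name", "arguments"],
                "additionalProperties": False,
            }
            for name in sorted(selected)  # sorted so equal subsets give equal schemas
        ]
    }


def cache_key(schema):
    """Stable string form of a schema, usable as a dictionary key."""
    return json.dumps(schema, sort_keys=True)


schema_a = schema_for_tools(TOOL_REGISTRY, ["search", "calculate"])
schema_b = schema_for_tools(TOOL_REGISTRY, ["calculate", "search"])
print(cache_key(schema_a) == cache_key(schema_b))
```

Sorting the selected names means the same subset always produces an identical schema string, so any schema-level cache in front of the grammar compiler gets a fair chance to hit.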

Pros and Cons

Advantages

Item	Details
Dramatic performance gains	Up to 80× efficiency over XGrammar generation 1, 6–10× per-token throughput, 100×+ reduction in compilation time
Optimized for dynamic agents	Designed to handle both inter-request and intra-request structural changes efficiently, which earlier engines did not target
Universal framework compatibility	Fully integrated with vLLM, SGLang, TensorRT-LLM, and MLC-LLM
Native support for reasoning models	Handles `<think>` patterns from DeepSeek and Qwen series naturally via TagDispatch
Open source (Apache-2.0)	Available on GitHub, customizable

Drawbacks and Caveats

Item	Details	Mitigation
Increased complexity	New abstraction layers like TagDispatch and Cross-Grammar Cache raise debugging difficulty	Recommend increasing log level to trace grammar transition points
Limited benefit for simple fixed schemas	If the schema is always the same, the practical advantage over XGrammar generation 1 may not be significant	If you have 10 or fewer fixed-schema tools, the existing engine is sufficient
Earley parser complexity	Worst case O(N³) — can bottleneck on very complex recursive grammars	Recommend benchmarking with your actual schemas before adopting
Emerging technology	Announced May 2026, limited production feedback accumulated	Recommend thorough validation in a staging environment before switching

Most Common Mistakes in Practice

Omitting additionalProperties: false from the schema: Unless the backend compiles schemas in a strict mode that rejects unspecified properties, extra fields are allowed and the model can freely insert unexpected keys. If using Pydantic v2 models, it's recommended to also set model_config = ConfigDict(extra="forbid").
Mismatching TagDispatch tag names with the model's prompt: If the system prompt instructs <tool_call> but TagDispatch is registered with <function_call>, dispatch will not work at all. Tag names must match exactly between the system prompt and the TagDispatch configuration.
Creating a new GrammarCompiler per request within a single process: To benefit from Cross-Grammar Cache, you must reuse the GrammarCompiler instance. Creating a new compiler per request resets the cache and eliminates the performance benefit.
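The first mistake is easy to catch mechanically. Below is a small stdlib-only checker, written for this post as an illustration, that walks a schema and reports every object node that does not set additionalProperties to false.

```python
def find_open_objects(schema, path="$"):
    """Return paths of object schemas that do not set additionalProperties to False."""
    problems = []
    if isinstance(schema, dict):
        is_object = schema.get("type") == "object"
        if is_object and schema.get("additionalProperties") is not False:
            problems.append(path)
        # Recurse into every nested value (properties, items, anyOf, ...).
        for key, value in schema.items():
            problems.extend(find_open_objects(value, f"{path}.{key}"))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            problems.extend(find_open_objects(item, f"{path}[{index}]"))
    return problems


schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string"},
        "metadata": {"type": "object"},  # no additionalProperties here
    },
    "additionalProperties": False,
}

print(find_open_objects(schema))
```

The `metadata` field in the vLLM example has the same shape, so a check like this would flag it. Whether an open object is a bug or a deliberate escape hatch depends on the tool, so treat the output as a review list rather than a hard failure.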

Closing Thoughts

XGrammar-2 is an engine that demonstrates structured output no longer has to be a performance bottleneck in agentic LLM serving. In particular, if you're running a dynamic tool registry or serving models with a reasoning channel like DeepSeek R1 or QwQ, there's a good chance you'll feel the difference after switching. Conversely, if you're operating with a single fixed schema, replacing it right now is not a high priority.

Three steps you can take right now:

Install the package and check a quick benchmark: Install via pip install xgrammar, then try compiling your current schemas with GrammarCompiler. As far as I can tell, recent vLLM releases already install xgrammar as a dependency, so a plain pip install vllm may be all you need on that side. If your environment has more than 50 tools, the difference in compilation time should be easy to see.
Add a single backend option to your existing vLLM or SGLang serving code: In vLLM, adding guided_decoding_backend="xgrammar" switches the backend without touching the rest of your code, and SGLang exposes an equivalent grammar backend setting. Check structured output latency metrics in a staging environment first. If structured output overhead accounts for more than 5% of total response time, that's a sufficient signal to consider switching.
If using reasoning models, consider adopting TagDispatch configuration: If you've experienced unstable JSON parsing with models like DeepSeek R1, QwQ, or the Qwen series that output <think> blocks, refer to the TagDispatch examples in the official documentation to set it up. In this case, you can expect greater stability improvements than a simple backend switch.
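The 5% rule of thumb from the second step is simple to encode once you have both timings. The function names and the sample numbers below are made up; how you measure grammar time depends on what your serving stack exposes.

```python
def structured_output_share(total_ms, grammar_ms):
    """Fraction of total response time spent on grammar compilation and masking."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    return grammar_ms / total_ms


def worth_evaluating(total_ms, grammar_ms, threshold=0.05):
    """True if structured output overhead exceeds the threshold share."""
    return structured_output_share(total_ms, grammar_ms) > threshold


# Illustrative measurements from a hypothetical staging run.
heavy = worth_evaluating(total_ms=1200.0, grammar_ms=90.0)   # 7.5% of the response time
light = worth_evaluating(total_ms=1200.0, grammar_ms=40.0)   # about 3.3%
print(heavy, light)
```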


