AI Writes It, AI Reviews It: Building a `/code-review ultra` Multi-Agent Pipeline

Honestly, when I first heard about this concept, my reaction was "does that actually work?" It's already remarkable that an agent can write code on its own, but having another agent review that code and leave feedback seemed like a stretch. Yet in 2026, this is already running in production environments.

By connecting the Agentic Coding Loop with /code-review ultra, you can build a pipeline that runs from code generation through review with no human involvement until the final merge approval. Cloudflare used this approach to run over 131,000 automated reviews across more than 48,000 MRs, and teams using a multi-agent structure have real-world evidence showing false positive rates dropping from 40% to 12%. The routine of opening a PR, waiting, receiving review comments, and making fixes is handled by agents overnight. Developers wake up in the morning to find validated code waiting to be merged.

This post covers how to set up a pipeline where /code-review ultra's multi-agent system automatically reviews code produced by an agentic loop, and what pitfalls to watch out for in practice. We'll start with the core concepts, then walk through increasing complexity: hook configuration → GitHub Actions → orchestrator patterns.

Core Concepts

What Is an Agentic Coding Loop?

An agentic coding loop is an autonomous development cycle where an AI agent repeats write code → execute → check errors → fix without human intervention. Claude Code's loop feature is the canonical implementation — given a goal, it repeats autonomously until the completion condition is satisfied.

Agentic Loop: A pattern where AI receives only a goal and plans, executes, and validates the intermediate steps itself. Unlike traditional assistive tools where a human approves every step, the loop repeats autonomously until the completion condition is met.

Below is the conceptual flow for starting a loop agent. Since Claude Code's actual interface works by passing goals within an interactive session, treat this as an illustration of intent rather than a command you'd run directly in a terminal.

bash

# Conceptual example — check the official Claude Code docs for actual CLI flags
# claude --goal "Implement user authentication module with JWT + refresh token" --max-iterations 20

What makes this different from simply "automatically writing code" is that the agent observes execution results and decides its next action. If a test fails, it analyzes the cause, makes a fix, and runs it again. It feels like watching a junior developer work through a task on their own.
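The cycle described above can be sketched in a few lines of TypeScript. This is a minimal, synchronous simulation under stated assumptions, not Claude Code's implementation: `runLoop`, `StepResult`, `LoopOutcome`, and the `attempt` callback are hypothetical names, and a real agent would call a model and a test runner asynchronously.

```typescript
// Hypothetical types for illustration; not part of any Claude Code API.
interface StepResult {
  passed: boolean;  // did the checks (tests, build) succeed?
  errors: string[]; // error messages observed on this attempt
}

interface LoopOutcome {
  done: boolean;      // true if the completion condition was met
  iterations: number; // how many attempts were made
}

// Repeat "write -> execute -> check errors -> fix" until the checks pass
// or the iteration cap is reached. `attempt` receives the previous errors
// so that the next attempt can react to them.
function runLoop(
  attempt: (previousErrors: string[]) => StepResult,
  maxIterations: number
): LoopOutcome {
  let errors: string[] = [];
  for (let i = 1; i <= maxIterations; i++) {
    const result = attempt(errors);
    if (result.passed) {
      return { done: true, iterations: i };
    }
    errors = result.errors; // observe the failure and feed it back
  }
  return { done: false, iterations: maxIterations };
}

// Usage: a fake attempt that fails twice, then succeeds.
let calls = 0;
const outcome = runLoop(() => {
  calls++;
  return calls < 3
    ? { passed: false, errors: [`test failure #${calls}`] }
    : { passed: true, errors: [] };
}, 20);
console.log(outcome); // { done: true, iterations: 3 }
```

The iteration cap matters as much as the success check: without it, a task the agent cannot solve would keep consuming tokens indefinitely.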

What Makes `/code-review ultra` Different?

Released by Anthropic in April 2026, /code-review ultra (or /ultrareview) is not a simple linter or static analysis tool. It runs multiple sub-agents in parallel against a diff in Anthropic's remote sandbox. Each sub-agent verifies the change independently from its own perspective, and only confirmed bugs are reported.

Key differentiator: It doesn't say "there might be a problem here" — it only reports "this bug is actually reproducible." Sub-agents independently verify logic, security, performance, error handling, and test coverage, and only issues confirmed through cross-validation are included in the final output.

This cross-validation structure is what drives the false positive rate down from 40% to 12% — a fairly meaningful difference in practice. When review noise is high, developers end up ignoring it, and this structure breaks that pattern at a systemic level.
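The filtering idea can be illustrated as a simple vote. Anthropic's actual verification logic is not described in this post, so treat the following as a sketch under assumptions: `Finding`, `confirmFindings`, and `minConfirmations` are hypothetical names, and "confirmed" here just means that enough distinct sub-agents reported the same issue.

```typescript
// Hypothetical shape of a finding reported by one sub-agent.
interface Finding {
  id: string;         // stable identifier, e.g. "file:line:rule"
  reportedBy: string; // which sub-agent raised or confirmed it
}

// Keep only findings that at least `minConfirmations` distinct
// sub-agents reported. Everything else is treated as unconfirmed noise.
function confirmFindings(findings: Finding[], minConfirmations: number): string[] {
  const reporters = new Map<string, Set<string>>();
  for (const f of findings) {
    const set = reporters.get(f.id) ?? new Set<string>();
    set.add(f.reportedBy);
    reporters.set(f.id, set);
  }
  return Array.from(reporters.entries())
    .filter(([, set]) => set.size >= minConfirmations)
    .map(([id]) => id);
}

const confirmed = confirmFindings(
  [
    { id: "auth.ts:42:null-deref", reportedBy: "logic" },
    { id: "auth.ts:42:null-deref", reportedBy: "error-handling" },
    { id: "db.ts:10:slow-query", reportedBy: "performance" },
  ],
  2
);
console.log(confirmed); // ["auth.ts:42:null-deref"]
```

If the quoted rates are measured as a share of reported findings, 100 reported findings would include roughly 40 false alarms before and roughly 12 after.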

bash

# Run review against the current branch diff
/code-review ultra
 
# Run against a specific GitHub PR number
/code-review ultra 1234

Average review time is around 20 minutes, at a cost of roughly $5–$20 per run.

When the Two Meet — A Multi-Agent Pipeline

Now that we understand the concepts, let's look at what kind of flow emerges when these two are actually connected.

text

[Loop Agent] → generates and commits code
      ↓
[Hook or Actions] → automatic trigger
      ↓
[/code-review ultra] → analyzes and verifies diff
      ↓
[PR comment or issue] → automatically posts feedback
      ↓
[Loop Agent] → incorporates feedback and iterates

The only point where a human intervenes is the final merge approval. That gate can also be automated depending on your configuration, but for now the recommended pattern is to keep a human on the last step. We'll revisit why later.
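One way to reason about this flow is as a small state machine. The sketch below is an illustration only: the state names, the event names, and the `maxRounds` cap are assumptions made for this example, not part of Claude Code or /code-review ultra.

```typescript
// Hypothetical pipeline states and events, named for this sketch only.
type PipelineState = "coding" | "reviewing" | "awaiting_human" | "merged" | "stopped";
type PipelineEvent =
  | { kind: "committed" }                       // loop agent pushed code
  | { kind: "reviewed"; confirmedBugs: number } // review finished
  | { kind: "approved" }                        // human approved the merge
  | { kind: "rejected" };                       // human sent it back

interface Pipeline {
  state: PipelineState;
  round: number; // completed review rounds
}

// Pure transition function: returns the next pipeline state for an event.
function transition(p: Pipeline, e: PipelineEvent, maxRounds: number): Pipeline {
  switch (p.state) {
    case "coding":
      return e.kind === "committed" ? { ...p, state: "reviewing" } : p;
    case "reviewing": {
      if (e.kind !== "reviewed") return p;
      const round = p.round + 1;
      if (e.confirmedBugs === 0) return { state: "awaiting_human", round };
      // Feedback goes back to the loop agent unless the round cap is hit.
      return { state: round >= maxRounds ? "stopped" : "coding", round };
    }
    case "awaiting_human":
      if (e.kind === "approved") return { ...p, state: "merged" };
      if (e.kind === "rejected") return { ...p, state: "coding" };
      return p;
    default:
      return p; // "merged" and "stopped" are terminal
  }
}

let pipeline: Pipeline = { state: "coding", round: 0 };
const events: PipelineEvent[] = [
  { kind: "committed" },
  { kind: "reviewed", confirmedBugs: 2 },
  { kind: "committed" },
  { kind: "reviewed", confirmedBugs: 0 },
  { kind: "approved" },
];
for (const e of events) pipeline = transition(pipeline, e, 3);
console.log(pipeline); // { state: "merged", round: 2 }
```

Note that "merged" is reachable only through "awaiting_human", which is the human gate the flow above keeps in place.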

Practical Application

Example 1: Auto-Triggering Review with the Stop Hook

The simplest starting point is Claude Code's Stop hook. It automatically runs a review the moment the agent declares "task complete." I still remember the first time I added this hook and watched the review results appear in the Actions log — it was genuinely surprising. One config file gets you halfway through the pipeline.

json

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "claude /code-review ultra"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "echo '[hook] Command execution complete, checking status...' >> .claude/loop.log"
          }
        ]
      }
    ]
  }
}

Config Item	Role
`Stop` hook	Automatically runs review when agent declares task complete
`PostToolUse` hook	Logs each tool execution, used for debugging
`matcher` field	Can apply hook only to specific tools (Bash, Write, etc.)

With just this config, whenever the agent pushes a commit and declares "done," the review kicks off immediately.
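Because the hook launches another agent run, it is worth guarding against the review triggering itself. The sketch below shows one possible guard, written as a pure function so it is easy to test. It rests on assumptions: the payload field names (`hook_event_name`, `stop_hook_active`) reflect the hook input as I understand it and should be checked against the current Claude Code hooks documentation, and `AGENT_REVIEW_RUNNING` is a hypothetical environment variable that your own hook script would have to set.

```typescript
// Field names follow the hook input payload as I understand it; verify them
// against the current Claude Code hooks documentation before relying on them.
interface StopHookInput {
  hook_event_name?: string;
  stop_hook_active?: boolean;
}

// AGENT_REVIEW_RUNNING is a hypothetical variable that the hook script would
// set before launching the review, so a nested session can detect it.
function shouldRunReview(
  rawStdin: string,
  env: Record<string, string | undefined>
): boolean {
  if (env["AGENT_REVIEW_RUNNING"] === "1") return false; // already inside a review

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawStdin);
  } catch {
    return false; // unreadable payload: skip rather than start a paid run
  }
  if (typeof parsed !== "object" || parsed === null) return false;

  const input = parsed as StopHookInput;
  if (input.hook_event_name !== "Stop") return false;
  // Skip when the agent is already continuing because of an earlier Stop hook.
  return input.stop_hook_active !== true;
}

console.log(shouldRunReview('{"hook_event_name":"Stop","stop_hook_active":false}', {})); // true
console.log(shouldRunReview('{"hook_event_name":"Stop","stop_hook_active":true}', {}));  // false
```

A wrapper script would read stdin, call this function, and only then start the review, which keeps a misconfigured hook from chaining paid runs.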

Example 2: PR-Based Pipeline with GitHub Actions

When the agent opens a PR, the review runs automatically and results are posted as a comment. Cloudflare used this pattern to auto-run over 131,000 reviews across more than 48,000 MRs — looking at those numbers, you get a real sense of how well this structure scales. Imagining how many human reviewers it would take to handle that volume makes the point even more vivid.

yaml

# .github/workflows/agent-review.yml
name: Agent Code Review
 
on:
  pull_request:
    types: [opened, synchronize]
 
jobs:
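  review:
    runs-on: ubuntu-latest
    # Only review PRs opened by the agent (prevents unnecessary billing)
    if: contains(github.event.pull_request.labels.*.name, 'agent-generated')
    timeout-minutes: 30
    # The default token may be read-only; posting a PR comment needs write access
    permissions:
      contents: read
      pull-requests: write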
 
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
 
      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code
 
      - name: Run multi-agent review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Output review results to review-output.json
          # (actual output format may vary by Claude Code version)
          claude /code-review ultra ${{ github.event.pull_request.number }} \
            --output review-output.json
 
      - name: Post review results as PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs')
            // review-output.json is generated in the previous step via the --output flag
            const review = JSON.parse(fs.readFileSync('./review-output.json', 'utf8'))
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.payload.pull_request.number,
              body: review.summary
            })

Step	Description
`if` condition	Only runs review on PRs with the `agent-generated` label — key to cost control
`fetch-depth: 0`	Fetches full git history, required for diff analysis
`timeout-minutes`	Prevents infinite waiting if the loop misbehaves
`--output` flag	Saves review results as JSON, parseable in subsequent steps
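The comment step above trusts that review-output.json exists, parses, and has a `summary` string. Since the output format may vary by version, a more defensive version of that logic might look like the sketch below. `buildCommentBody` and the fallback text are hypothetical, the `summary` field is the same assumption the workflow already makes, and the default length cap is a conservative guess meant to stay under GitHub's comment size limit.

```typescript
// Turn the raw contents of review-output.json into a safe PR comment body.
// Falls back to a fixed message whenever the expected `summary` is missing.
function buildCommentBody(rawJson: string, maxLength: number = 60000): string {
  const fallback = "Automated review finished, but its output could not be parsed.";

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return fallback; // not valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return fallback;

  const summary = (parsed as { summary?: unknown }).summary;
  if (typeof summary !== "string" || summary.trim() === "") return fallback;

  // Very long bodies are cut so the comment request is not rejected.
  return summary.length > maxLength
    ? summary.slice(0, maxLength) + "\n\n[truncated]"
    : summary;
}

console.log(buildCommentBody('{"summary":"1 confirmed bug in auth.ts"}'));
// prints: 1 confirmed bug in auth.ts
```

Posting a fallback message instead of failing the step keeps the PR from looking unreviewed when the review tool changes its output.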

The if: contains(...) condition looks minor at first, but without it, every PR triggers a review and costs accumulate faster than you'd expect. It's a situation that comes up frequently in practice, so it's worth emphasizing.
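A quick estimate shows why the label condition matters. The per-run range of $5 to $20 comes from the figure quoted earlier in this post; the monthly run counts are made-up numbers for a hypothetical repository, and `monthlyReviewCost` is a helper invented for this example.

```typescript
interface CostRange {
  low: number;  // USD
  high: number; // USD
}

// Multiply the number of review runs by the per-run cost range.
function monthlyReviewCost(
  runsPerMonth: number,
  perRun: CostRange = { low: 5, high: 20 }
): CostRange {
  return { low: runsPerMonth * perRun.low, high: runsPerMonth * perRun.high };
}

// Hypothetical repository: 400 workflow runs per month without the label
// filter, 60 runs per month when only agent-labeled PRs are reviewed.
const ungated = monthlyReviewCost(400);
const gated = monthlyReviewCost(60);
console.log(ungated); // { low: 2000, high: 8000 }
console.log(gated);   // { low: 300, high: 1200 }
```

Keep in mind that the workflow triggers on both `opened` and `synchronize`, so every push to a labeled PR counts as another run.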

Example 3: Three-Tier Orchestrator + Coder + Reviewer Structure

For more complex tasks, a layered structure with separated roles is effective. Let me explain the concept before showing the code.

Orchestrator Pattern: A structure where a higher-level agent decomposes a task and delegates to lower-level agents. Parallel execution increases speed, and each agent operating in an independent context reduces interference.

Bun's Rust rewrite being completed in 6 days was also thanks to this structure. If a single agent had processed everything sequentially, finishing within that timeframe would have been difficult — but splitting subtasks to run in parallel changes the equation entirely.

The example below is a conceptual implementation using the Claude Agent SDK. The @anthropic-ai/agent-sdk package name and runAgentLoop API may differ from the actual public interface, so it's recommended to check the official documentation before building a concrete implementation.

typescript

// orchestrator.ts — conceptual example (see official docs for actual SDK API)
import { Agent, runAgentLoop } from "@anthropic-ai/agent-sdk";
 
async function runPipeline(task: string) {
  // Step 1: Orchestrator decomposes the task
  // The subtasks structure returned by orchestrator.run()
  // requires the system prompt to specify JSON output format
  const orchestrator = new Agent({
    model: "claude-opus-4-8",
    systemPrompt: `
      Decompose the given development task into an array of independent subtasks.
      Must return in JSON format: { "subtasks": ["task1", "task2", ...] }
    `,
  });
 
  const result = await orchestrator.run(task);
  const { subtasks } = JSON.parse(result); // parse structured output
 
  // Step 2: Coder agents implement in parallel
  const coderResults = await Promise.all(
    subtasks.map((subtask: string) =>
      runAgentLoop({
        model: "claude-sonnet-4-6",
        task: subtask,
        maxIterations: 15,
      })
    )
  );
 
  // Step 3: Trigger review after all subtasks complete
  // If the Stop hook is configured, /code-review ultra runs automatically
  console.log(`${coderResults.length} subtasks complete. Waiting for review trigger...`);
}

The key part is running the coder agents in parallel with Promise.all — the more independent the subtasks, the greater the parallelization benefit. However, if subtasks have dependencies between them, ordering must be respected, so it's a good idea to design the orchestrator's system prompt to output dependency information alongside the task decomposition.
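If the orchestrator does return dependency information, the ordering can be handled by grouping subtasks into waves. The sketch below assumes a hypothetical `Subtask` shape with a `dependsOn` list; `planWaves` and `runWaves` are names made up for this example, and `runTask` stands in for whatever starts a coder agent.

```typescript
interface Subtask {
  id: string;
  dependsOn: string[]; // ids that must finish first
}

// Group subtasks into waves: every task in a wave depends only on tasks
// from earlier waves, so the tasks inside one wave can run in parallel.
function planWaves(subtasks: Subtask[]): string[][] {
  const done = new Set<string>();
  let remaining = [...subtasks];
  const waves: string[][] = [];
  while (remaining.length > 0) {
    const ready = remaining.filter((t) => t.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error("Cyclic or unknown dependency among subtasks");
    }
    waves.push(ready.map((t) => t.id));
    ready.forEach((t) => done.add(t.id));
    remaining = remaining.filter((t) => !done.has(t.id));
  }
  return waves;
}

// Run each wave with Promise.all, waiting for it before starting the next.
async function runWaves(
  waves: string[][],
  runTask: (id: string) => Promise<void>
): Promise<void> {
  for (const wave of waves) {
    await Promise.all(wave.map((id) => runTask(id)));
  }
}

const waves = planWaves([
  { id: "schema", dependsOn: [] },
  { id: "api", dependsOn: ["schema"] },
  { id: "docs", dependsOn: [] },
  { id: "e2e-tests", dependsOn: ["api"] },
]);
console.log(waves); // [["schema", "docs"], ["api"], ["e2e-tests"]]
```

Failing fast on a cycle is a deliberate choice here: a decomposition with circular dependencies is an orchestrator error that should be surfaced instead of silently run in an arbitrary order.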

Pros and Cons

Advantages

Item	Details
Speed	Eliminates human reviewer wait time; review starts as soon as the PR opens and finishes in about 20 minutes on average
Consistency	Agents don't get tired — the same standard applies to every PR
Noise reduction	Only reports reproduced and verified bugs; false positives down from 40% to 12%
Async operation	Generation, review, and fix cycles complete while you sleep
Security gate	Structure where AI reviewers automatically filter out vulnerabilities in AI-generated code

Drawbacks and Caveats

Item	Details	Mitigation
Cost explosion	Parallel agent execution causes rapid token consumption; $5–$20 per review run	Explicitly set label conditions, `timeout-minutes`, and token budget limits
Security vulnerabilities	Reports indicate 40–62% of AI-generated code contains security flaws	Mandatory security scan as CI prerequisite
Undetected structural changes	Unintended architectural changes may be missed	Maintain a final human review gate
Loop malfunction	Risk of infinite loops if hooks and workflows fail to coordinate	Must set maximum iteration count and `timeout-minutes`
Insufficient security tooling	83% of companies reportedly unprepared for autonomous code execution	Experiment in a sandbox environment first

Sandbox: An isolated test environment separate from the actual production environment. It's good practice to first validate your pipeline in an isolated environment where agent-executed unexpected commands won't affect production.

The Most Common Mistakes in Practice

1. Deploying agents before establishing CI foundations: Linters, unit tests, and security scans need to be in place first for agents to operate safely. Without that foundation, automation can become an accelerator for deploying broken code quickly.
2. Enabling auto-merge from the start: If you automate the final approval before the pipeline is sufficiently validated, unexpected structural changes can land directly in main. It's strongly recommended to keep the human final approval gate in place during the early stages.
3. Not setting cost limits: When an agent receives a complex task, it can run far more iterations than expected. Explicitly setting maxIterations, token budget limits, and Actions' timeout-minutes will protect you from billing surprises.
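The cost limits mentioned above can be enforced in code as well as in configuration. The following is a minimal sketch of a budget guard that a custom loop could consult after every iteration. `createBudgetGuard` and its limits are hypothetical, and how you obtain per-iteration token counts depends on the SDK you actually use.

```typescript
interface BudgetGuard {
  record(tokensUsed: number): void; // call once per finished iteration
  stopReason(): string | null;      // null means the loop may continue
}

// Track iterations and tokens, and report when either limit is reached.
function createBudgetGuard(maxIterations: number, maxTokens: number): BudgetGuard {
  let iterations = 0;
  let tokens = 0;
  return {
    record(tokensUsed: number): void {
      iterations += 1;
      tokens += tokensUsed;
    },
    stopReason(): string | null {
      if (iterations >= maxIterations) return "iteration limit reached";
      if (tokens >= maxTokens) return "token budget exhausted";
      return null;
    },
  };
}

const guard = createBudgetGuard(15, 200000);
guard.record(120000);
console.log(guard.stopReason()); // null
guard.record(90000);
console.log(guard.stopReason()); // "token budget exhausted"
```

Returning a reason string instead of a boolean makes it easy to log why a run stopped, which helps when tuning the limits later.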

Closing Thoughts

The era of automating code generation has already passed — we're now in the era of automating both generation and verification together. Cloudflare's 130,000+ review executions and Bun's 6-day rewrite are no longer exceptional cases; they're becoming standard workflows.

There are three steps you can start with right now:

1. Start with the simplest hook: Add a Stop hook to .claude/settings.json and confirm that /code-review ultra runs automatically when the agent completes a task. One config file gets you halfway through the pipeline.
2. Attach a label-based Actions workflow: By conditioning the review to run only on PRs with the agent-generated label, you can build a structure that selectively auto-reviews agent PRs while keeping costs under control.
3. Consider auto-merge last: After observing what bugs the pipeline catches and what it misses over 1–2 weeks, decide whether to automate the final step once you're confident in its reliability. That sequence is the safe approach.




