AI Writes It, AI Reviews It: Building a `/code-review ultra` Multi-Agent Pipeline
Honestly, when I first heard about this concept, my reaction was "does that actually work?" It's already remarkable that an agent can write code on its own — but having another agent review that code and leave feedback seemed like a stretch. Yet here in 2026, this is already a story being told in production environments.
By connecting the Agentic Coding Loop with /code-review ultra, you can build a pipeline that runs from code generation through review, with a human needed only for the final merge approval. Cloudflare reports running automated review of this kind across more than 48,000 MRs, for over 131,000 reviews in total, and teams using a multi-agent structure have reported false positive rates dropping from 40% to 12%. The routine of opening a PR, waiting, receiving review comments, and making fixes is something agents can work through overnight, so developers wake up to validated code waiting for merge approval.
This post covers how to set up a pipeline where /code-review ultra's multi-agent system automatically reviews code produced by an agentic loop, and what pitfalls to watch out for in practice. We'll start with the core concepts, then walk through increasing complexity: hook configuration → GitHub Actions → orchestrator patterns.
Core Concepts
What Is an Agentic Coding Loop?
An agentic coding loop is an autonomous development cycle where an AI agent repeats write code → execute → check errors → fix without human intervention. Claude Code's loop feature is the canonical implementation — given a goal, it repeats autonomously until the completion condition is satisfied.
Agentic Loop: A pattern where AI receives only a goal and plans, executes, and validates the intermediate steps itself. Unlike traditional assistive tools where a human approves every step, the loop repeats autonomously until the completion condition is met.
Below is the conceptual flow for starting a loop agent. Since Claude Code's actual interface works by passing goals within an interactive session, treat this as an illustration of intent rather than a command you'd run directly in a terminal.
# Conceptual example — check the official Claude Code docs for actual CLI flags
# claude --goal "Implement user authentication module with JWT + refresh token" --max-iterations 20

What makes this different from simply "automatically writing code" is that the agent observes execution results and decides its next action. If a test fails, it analyzes the cause, makes a fix, and runs it again. It feels like watching a junior developer work through a task on their own.
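To make the write → execute → check → fix cycle concrete, here is a minimal sketch of the control flow in TypeScript. It is not Claude Code's implementation: `Step`, `runLoop`, and the fake step are hypothetical names, and a real step would call a model and a test runner instead of a counter.

```typescript
// Hypothetical types: a real agent would generate code and run tests inside `Step`.
type StepResult = { passed: boolean; log: string };
type Step = (goal: string, previousLog: string | null) => StepResult;

interface LoopOutcome {
  completed: boolean;
  iterations: number;
  lastLog: string | null;
}

// Repeat the step until it passes or the iteration cap is reached.
function runLoop(goal: string, step: Step, maxIterations: number): LoopOutcome {
  let lastLog: string | null = null;
  for (let i = 1; i <= maxIterations; i++) {
    const result = step(goal, lastLog); // the step sees the previous failure log
    lastLog = result.log;
    if (result.passed) {
      return { completed: true, iterations: i, lastLog };
    }
  }
  return { completed: false, iterations: maxIterations, lastLog };
}

// Usage: a fake step that fails twice, then passes on the third attempt.
let attempts = 0;
const fakeStep: Step = () => {
  attempts += 1;
  return attempts < 3
    ? { passed: false, log: `attempt ${attempts}: tests failed` }
    : { passed: true, log: `attempt ${attempts}: tests passed` };
};
const outcome = runLoop("implement auth module", fakeStep, 20);
```

The cap matters as much as the completion condition: without `maxIterations`, a step that never passes would keep the loop running indefinitely.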
What Makes /code-review ultra Different?
Released by Anthropic in April 2026, /code-review ultra (or /ultrareview) is not a simple linter or static analysis tool. It runs multiple sub-agents in parallel against a diff in Anthropic's remote sandbox. Each sub-agent checks the change from its own perspective, and only confirmed bugs are reported.
Key differentiator: It doesn't say "there might be a problem here" — it only reports "this bug is actually reproducible." Sub-agents independently verify logic, security, performance, error handling, and test coverage, and only issues confirmed through cross-validation are included in the final output.
This cross-validation structure is what drives the false positive rate down from 40% to 12% — a fairly meaningful difference in practice. When review noise is high, developers end up ignoring it, and this structure breaks that pattern at a systemic level.
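The filtering idea can be sketched in a few lines. This is an illustration of cross-validation in general, not how /code-review ultra works internally: the `AgentReport` shape, the `crossValidate` function, and the "at least two agents" threshold are all assumptions made for the example.

```typescript
// Hypothetical shape: each sub-agent lists the finding IDs it managed to reproduce.
interface AgentReport {
  agent: string;        // e.g. "logic", "security"
  reproduced: string[]; // IDs of findings this agent confirmed
}

// Keep only findings confirmed by at least `minConfirmations` distinct agents.
function crossValidate(reports: AgentReport[], minConfirmations: number): string[] {
  const agentsByFinding = new Map<string, Set<string>>();
  for (const report of reports) {
    for (const id of report.reproduced) {
      const agents = agentsByFinding.get(id) ?? new Set<string>();
      agents.add(report.agent);
      agentsByFinding.set(id, agents);
    }
  }
  const confirmed: string[] = [];
  agentsByFinding.forEach((agents, id) => {
    if (agents.size >= minConfirmations) confirmed.push(id);
  });
  return confirmed.sort();
}

// Usage: only "null-deref" is confirmed by two different agents.
const sampleReports: AgentReport[] = [
  { agent: "logic", reproduced: ["null-deref", "off-by-one"] },
  { agent: "security", reproduced: ["null-deref"] },
  { agent: "tests", reproduced: ["flaky-guess"] },
];
const confirmedFindings = crossValidate(sampleReports, 2); // ["null-deref"]
```

Findings that only one agent raised are dropped, which is the kind of mechanism that trades a little recall for much less noise.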
# Run review against the current branch diff
/code-review ultra
# Run against a specific GitHub PR number
/code-review ultra 1234

Average review time is around 20 minutes, at a cost of roughly $5–$20 per run.
When the Two Meet — A Multi-Agent Pipeline
Now that we understand the concepts, let's look at what kind of flow emerges when these two are actually connected.
[Loop Agent] → generates and commits code
↓
[Hook or Actions] → automatic trigger
↓
[/code-review ultra] → analyzes and verifies diff
↓
[PR comment or issue] → automatically posts feedback
↓
[Loop Agent] → incorporates feedback and iterates

The only point where a human intervenes is the final merge approval. That gate can also be automated depending on your configuration, but for now, a pattern where a human sees the last step is recommended. We'll revisit why later.
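The flow above reads naturally as a small state machine. The sketch below is one way to model it, under assumptions of my own: the stage names are hypothetical, and I assume a review with zero confirmed findings goes straight to the human gate while any finding sends the work back to the loop agent.

```typescript
// Hypothetical stage names mirroring the diagram, plus a terminal "merged" state.
type Stage = "generate" | "trigger" | "review" | "feedback" | "human-approval" | "merged";

// Decide the next stage. `findings` is the number of confirmed issues from the review.
function nextStage(stage: Stage, findings: number, humanApproved: boolean): Stage {
  switch (stage) {
    case "generate":
      return "trigger";
    case "trigger":
      return "review";
    case "review":
      return findings > 0 ? "feedback" : "human-approval";
    case "feedback":
      return "generate"; // the loop agent picks up the comments and iterates
    case "human-approval":
      return humanApproved ? "merged" : "human-approval"; // waits for a person
    case "merged":
      return "merged";
  }
}

// Usage: a review with two confirmed findings loops back through feedback.
const afterReview = nextStage("review", 2, false); // "feedback"
```

Keeping `human-approval` as an explicit state makes it easy to see that nothing reaches `merged` without the `humanApproved` flag.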
Practical Application
Example 1: Auto-Triggering Review with the Stop Hook
The simplest starting point is Claude Code's Stop hook. It automatically runs a review the moment the agent declares "task complete." I still remember the first time I added this hook and watched the review results appear in the Actions log — it was genuinely surprising. One config file gets you halfway through the pipeline.
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "claude /code-review ultra"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "echo '[hook] Command execution complete, checking status...' >> .claude/loop.log"
          }
        ]
      }
    ]
  }
}

| Config Item | Role |
|---|---|
| Stop hook | Automatically runs review when agent declares task complete |
| PostToolUse hook | Logs each tool execution, used for debugging |
| matcher field | Can apply hook only to specific tools (Bash, Write, etc.) |
With just this config, whenever the agent pushes a commit and declares "done," the review kicks off immediately.
Example 2: PR-Based Pipeline with GitHub Actions
When the agent opens a PR, the review runs automatically and results are posted as a comment. Cloudflare reports over 131,000 automated reviews across more than 48,000 MRs with a comparable request-triggered setup, which gives a real sense of how well this structure scales. Picture how many human reviewers it would take to handle that volume.
# .github/workflows/agent-review.yml
name: Agent Code Review
on:
  pull_request:
    types: [opened, synchronize]
# Lets the last step post a PR comment with GITHUB_TOKEN
permissions:
  contents: read
  pull-requests: write
jobs:
  review:
    runs-on: ubuntu-latest
    # Only review PRs opened by the agent (prevents unnecessary billing)
    if: contains(github.event.pull_request.labels.*.name, 'agent-generated')
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code
      - name: Run multi-agent review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Output review results to review-output.json
          # (actual output format may vary by Claude Code version)
          claude /code-review ultra ${{ github.event.pull_request.number }} \
            --output review-output.json
      - name: Post review results as PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs')
            // review-output.json is generated in the previous step via the --output flag
            const review = JSON.parse(fs.readFileSync('./review-output.json', 'utf8'))
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.payload.pull_request.number,
              body: review.summary
            })

| Step | Description |
|---|---|
| if condition | Only runs review on PRs with the agent-generated label; key to cost control |
| fetch-depth: 0 | Fetches full git history, required for diff analysis |
| timeout-minutes | Prevents infinite waiting if the loop misbehaves |
| --output flag | Saves review results as JSON, parseable in subsequent steps |
The if: contains(...) condition looks minor at first, but without it, every PR triggers a review and costs accumulate faster than you'd expect. It's a situation that comes up frequently in practice, so it's worth emphasizing.
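To see what the label gate saves, here is a rough sketch that applies the same check to a list of PRs and prices the result with the $5–$20 per-run range quoted earlier. The `PullRequest` shape, the function names, and the sample PRs are made up for illustration; only the label name and the price range come from this post.

```typescript
// Hypothetical, trimmed-down view of a pull request.
interface PullRequest {
  number: number;
  labels: string[];
}

const AGENT_LABEL = "agent-generated"; // label name used in the workflow's if condition

// Mirrors the workflow's contains(...) check.
function shouldReview(pr: PullRequest): boolean {
  return pr.labels.indexOf(AGENT_LABEL) !== -1;
}

// Count gated reviews and price them at $5 to $20 per run.
function reviewCostRange(prs: PullRequest[]) {
  const reviewed = prs.filter(shouldReview).length;
  return {
    reviewed,
    skipped: prs.length - reviewed,
    lowUsd: reviewed * 5,
    highUsd: reviewed * 20,
  };
}

// Usage: two of four PRs carry the label, so only two reviews are billed.
const gated = reviewCostRange([
  { number: 101, labels: ["agent-generated"] },
  { number: 102, labels: ["docs"] },
  { number: 103, labels: ["agent-generated", "backend"] },
  { number: 104, labels: [] },
]);
// gated: { reviewed: 2, skipped: 2, lowUsd: 10, highUsd: 40 }
```

Without the gate, all four PRs would be reviewed and the same range would double to $20–$80.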
Example 3: Three-Tier Orchestrator + Coder + Reviewer Structure
For more complex tasks, a layered structure with separated roles is effective. Let me explain the concept before showing the code.
Orchestrator Pattern: A structure where a higher-level agent decomposes a task and delegates to lower-level agents. Parallel execution increases speed, and each agent operating in an independent context reduces interference.
Bun's Rust rewrite being completed in 6 days was also thanks to this structure. If a single agent had processed everything sequentially, finishing within that timeframe would have been difficult — but splitting subtasks to run in parallel changes the equation entirely.
The example below is a conceptual implementation using the Claude Agent SDK. The @anthropic-ai/agent-sdk package name and runAgentLoop API may differ from the actual public interface, so it's recommended to check the official documentation before building a concrete implementation.
// orchestrator.ts: conceptual example (see official docs for actual SDK API)
import { Agent, runAgentLoop } from "@anthropic-ai/agent-sdk";

async function runPipeline(task: string) {
  // Step 1: Orchestrator decomposes the task
  // The subtasks structure returned by orchestrator.run()
  // requires the system prompt to specify JSON output format
  const orchestrator = new Agent({
    model: "claude-opus-4-8",
    systemPrompt: `
      Decompose the given development task into an array of independent subtasks.
      Must return in JSON format: { "subtasks": ["task1", "task2", ...] }
    `,
  });
  const result = await orchestrator.run(task);
  const { subtasks } = JSON.parse(result); // parse structured output

  // Step 2: Coder agents implement in parallel
  const coderResults = await Promise.all(
    subtasks.map((subtask: string) =>
      runAgentLoop({
        model: "claude-sonnet-4-6",
        task: subtask,
        maxIterations: 15,
      })
    )
  );

  // Step 3: Trigger review after all subtasks complete
  // If the Stop hook is configured, /code-review ultra runs automatically
  console.log(`${coderResults.length} subtasks complete. Waiting for review trigger...`);
}

The key part is running the coder agents in parallel with Promise.all: the more independent the subtasks, the greater the parallelization benefit. However, if subtasks have dependencies between them, ordering must be respected, so it's a good idea to design the orchestrator's system prompt to output dependency information alongside the task decomposition.
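If the orchestrator does emit dependency information, one simple way to respect it is to group subtasks into waves, where each wave only depends on earlier waves. The sketch below assumes a hypothetical `dependsOn` field on each subtask; it is not part of any SDK, just a plain function you could run before handing each wave to Promise.all.

```typescript
// Hypothetical subtask shape: the orchestrator would emit `dependsOn` with each task.
interface Subtask {
  id: string;
  dependsOn: string[];
}

// Group subtasks into waves. Every task in a wave depends only on earlier waves,
// so tasks inside one wave can run in parallel while waves run in order.
function planWaves(subtasks: Subtask[]): string[][] {
  const done: string[] = [];
  let remaining = subtasks.slice();
  const waves: string[][] = [];
  while (remaining.length > 0) {
    const ready = remaining.filter((t) =>
      t.dependsOn.every((dep) => done.indexOf(dep) !== -1)
    );
    if (ready.length === 0) {
      // Nothing can start: a cycle, or a dependency that is not in the list.
      throw new Error("Cyclic or unknown dependency among subtasks");
    }
    const wave = ready.map((t) => t.id);
    waves.push(wave);
    done.push(...wave);
    remaining = remaining.filter((t) => wave.indexOf(t.id) === -1);
  }
  return waves;
}

// Usage: "schema" and "docs" can start together; the rest follow in order.
const plannedWaves = planWaves([
  { id: "schema", dependsOn: [] },
  { id: "token-service", dependsOn: ["schema"] },
  { id: "login-route", dependsOn: ["token-service"] },
  { id: "docs", dependsOn: [] },
]);
// plannedWaves: [["schema", "docs"], ["token-service"], ["login-route"]]
```

Failing loudly on a cycle is deliberate: a decomposition the model got wrong is better caught before any coder agent starts spending tokens.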
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Speed | Eliminates human reviewer wait time; a review typically finishes about 20 minutes after the PR opens |
| Consistency | Agents don't get tired — the same standard applies to every PR |
| Noise reduction | Only reports reproduced and verified bugs; false positives down from 40% to 12% |
| Async operation | Generation, review, and fix cycles complete while you sleep |
| Security gate | Structure where AI reviewers automatically filter out vulnerabilities in AI-generated code |
Drawbacks and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Cost explosion | Parallel agent execution causes rapid token consumption; $5–$20 per review run | Explicitly set label conditions, timeout-minutes, and token budget limits |
| Security vulnerabilities | Reports indicate 40–62% of AI-generated code contains security flaws | Mandatory security scan as CI prerequisite |
| Undetected structural changes | Unintended architectural changes may be missed | Maintain a final human review gate |
| Loop malfunction | Risk of infinite loops if hooks and workflows fail to coordinate | Must set maximum iteration count and timeout-minutes |
| Insufficient security tooling | 83% of companies reportedly unprepared for autonomous code execution | Experiment in a sandbox environment first |
Sandbox: An isolated test environment separate from the actual production environment. It's good practice to first validate your pipeline in an isolated environment where agent-executed unexpected commands won't affect production.
The Most Common Mistakes in Practice
- Deploying agents before establishing CI foundations: Linters, unit tests, and security scans need to be in place first for agents to operate safely. Without that foundation, automation can become an accelerator for deploying broken code quickly.
- Enabling auto-merge from the start: If you automate the final approval before the pipeline is sufficiently validated, unexpected structural changes can land directly in main. It's strongly recommended to keep the human final approval gate in place during the early stages.
- Not setting cost limits: When an agent receives a complex task, it can run far more iterations than expected. Explicitly setting maxIterations, token budget limits, and Actions' timeout-minutes will protect you from billing surprises.
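The three limits from the last point can be checked in one place. This is a sketch under assumptions: `RunLimits`, `RunUsage`, and `exceededLimit` are hypothetical names, and the token budget value is invented for the example. The iteration cap and the 30-minute timeout reuse the numbers from the earlier examples.

```typescript
interface RunLimits {
  maxIterations: number;
  maxTokens: number;      // hypothetical token budget for one run
  timeoutMinutes: number;
}

interface RunUsage {
  iterations: number;
  tokens: number;
  elapsedMinutes: number;
}

// Return the name of the first limit that has been reached, or null to continue.
function exceededLimit(usage: RunUsage, limits: RunLimits): string | null {
  if (usage.iterations >= limits.maxIterations) return "maxIterations";
  if (usage.tokens >= limits.maxTokens) return "maxTokens";
  if (usage.elapsedMinutes >= limits.timeoutMinutes) return "timeoutMinutes";
  return null;
}

// Usage: the run is within its iteration and time limits but has spent its tokens.
const runLimits: RunLimits = { maxIterations: 15, maxTokens: 2000000, timeoutMinutes: 30 };
const stopReason = exceededLimit(
  { iterations: 9, tokens: 2100000, elapsedMinutes: 12 },
  runLimits
); // "maxTokens"
```

Calling a check like this between iterations gives the loop a single, auditable place to stop, instead of relying on the CI timeout alone.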
Closing Thoughts
The era of automating code generation has already passed — we're now in the era of automating both generation and verification together. Cloudflare's 130,000+ review executions and Bun's 6-day rewrite are no longer exceptional cases; they're becoming standard workflows.
There are three steps you can start with right now:
- Start with the simplest hook: Add a Stop hook to .claude/settings.json and confirm that /code-review ultra runs automatically when the agent completes a task. One config file gets you halfway through the pipeline.
- Attach a label-based Actions workflow: By conditioning the review to run only on PRs with the agent-generated label, you can build a structure that selectively auto-reviews agent PRs while keeping costs under control.
- Consider auto-merge last: After observing what bugs the pipeline catches and what it misses over 1–2 weeks, decide whether to automate the final step once you're confident in its reliability. That sequence is the safe approach.
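For the 1–2 week observation period, it helps to write the numbers down rather than go by feel. A minimal sketch, assuming you tally three counts by hand: the `TrialCounts` fields and `summarizeTrial` are hypothetical, the sample figures are invented, and "false positive rate" here means the share of reported findings a human rejected.

```typescript
interface TrialCounts {
  confirmedReal: number; // reported findings a human agreed were real bugs
  falseAlarms: number;   // reported findings a human rejected
  missedBugs: number;    // bugs found later that the review did not report
}

// Summarize how noisy the review was and how much it missed.
function summarizeTrial(c: TrialCounts) {
  const reported = c.confirmedReal + c.falseAlarms;
  const realBugs = c.confirmedReal + c.missedBugs;
  return {
    falsePositiveRate: reported === 0 ? 0 : c.falseAlarms / reported,
    missRate: realBugs === 0 ? 0 : c.missedBugs / realBugs,
  };
}

// Usage with invented counts: 3 of 25 reported findings were false alarms (0.12),
// and 2 of 24 real bugs were missed (about 0.083).
const trialSummary = summarizeTrial({ confirmedReal: 22, falseAlarms: 3, missedBugs: 2 });
```

The miss rate is the number to watch before enabling auto-merge, since missed bugs are exactly what the human gate currently catches.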
References
- Code Review - Claude Code Official Docs
- How the agent loop works - Claude Code Docs
- Automate actions with hooks - Claude Code Official Docs
- Anthropic Introduces Agent-Based Code Review for Claude Code - InfoQ
- Anthropic launches a multi-agent code review tool - The New Stack
- claude-code/plugins/code-review/README.md - GitHub
- 9 Parallel AI Agents That Review My Code (Claude Code Setup) - HAMY
- Claude Code PR Review: /ultrareview, Code Review, and Subagents Compared (2026)
- Orchestrating AI Code Review at Scale - Cloudflare Blog
- Agent pull requests are everywhere. Here's how to review them. - GitHub Blog
- Automating the Claude Code × Codex Review Loop - SmartScope
- Multi-Agent Development Workflows with Claude Code - DEV Community
- Agentic CI Pipelines: Autonomous Code Review & Testing Tutorial - Nandann
- Optimizing AI Code Reviews: A Multi-Agent Pipeline Approach - earezki.com
- Claude Code /loop: The Autonomous Agent Feature - Context Studios