AI Code Review That Reasons Over the Entire Repository Beyond PR Diffs — How Codebase Semantic Graphs Catch Cross-File Bugs
If you've ever used AI for code review, you've probably run into this situation at least once: the AI finds nothing wrong inside the modified function, but after deployment, a type error blows up in a completely different file. This is a structural limitation of diff-based tools. Because they only look at the changed lines, they can't see when the callers of that function break.
Greptile tackles this problem head-on. The approach itself is different. Instead of handing the diff to an LLM when a PR is opened, it builds the entire repository into a knowledge graph of functions, classes, modules, and dependencies — all connected — and then reasons about the ripple effects of changes on top of that graph. This article examines how Greptile models a codebase as a graph, and how that approach concretely differs from traditional diff analysis, with specific code examples. This is the first installment in a series on AI code review tools.
Let's dig into what Greptile actually does differently.
Core Concepts
Viewing the Codebase as a Knowledge Graph, Not a Collection of Files
Traditional diff analysis tools pass the changed lines in a PR as LLM context. It's fast and simple, but it tells you nothing about what role the modified function plays in the system as a whole. It's like cutting out a single paragraph from a book and asking, "What does this sentence mean in the context of the whole story?"
The Codebase Semantic Graph that Greptile builds transforms the entire repository into a node-edge structure connected by function call relationships, import dependencies, and pattern similarities. When a change occurs, the AI can traverse the graph to reason about "how far does this change reach."
The Indexing Pipeline: How Code Becomes a Graph
When a repository is first connected, Greptile goes through four stages to build the graph. Each stage exists to address the limitations of the previous one.
Stage 1 — AST Parsing
The entire codebase is transformed into Abstract Syntax Trees (ASTs) and decomposed into functions, variables, and classes. This is the first step toward treating code as structure rather than a blob of text.
# Conceptual example (pseudocode) — may differ from the actual tree-sitter API
def parse_to_ast(source_code: str, language: str):
    # Load language-specific parser
    parser = get_parser(language)
    tree = parser.parse(source_code)
    # Result: a structured tree of function names, parameters, return types, and call relationships
    return extract_nodes(tree.root_node)
Stage 2 — Natural Language Conversion
ASTs alone aren't enough. Because syntax differs between languages (Python's def versus TypeScript's function), the same logic can end up far apart in embedding space. Greptile absorbs this noise by recursively generating natural language descriptions (docstrings) for each AST node. According to internal measurements published by the Greptile team on their blog, embedding these natural language descriptions instead of raw code improves vector embedding similarity by approximately 12%.
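As a rough sketch of the recursive description step, the TypeScript below walks a toy tree bottom-up so that each parent's description can draw on its children's. The AstNode shape, the Summarizer type, and templateSummarizer are assumptions made for illustration. In a real pipeline the summarizer would be an LLM call, and nothing here reflects Greptile's actual implementation.

```typescript
// Hypothetical node shape; a real AST carries far more detail.
interface AstNode {
  kind: "module" | "class" | "function";
  name: string;
  source: string;
  children: AstNode[];
}

// Stand-in for an LLM call: turns a node plus its children's summaries into text.
type Summarizer = (node: AstNode, childSummaries: string[]) => string;

interface Description {
  name: string;
  summary: string;
}

// Post-order walk: children are described first, then the parent reuses their summaries.
function describeTree(node: AstNode, summarize: Summarizer, out: Description[]): string {
  const childSummaries = node.children.map((child) => describeTree(child, summarize, out));
  const summary = summarize(node, childSummaries);
  out.push({ name: node.name, summary });
  return summary;
}

// Trivial template summarizer, used here only so the sketch runs without a model.
const templateSummarizer: Summarizer = (node, childSummaries) =>
  childSummaries.length === 0
    ? `${node.kind} ${node.name}`
    : `${node.kind} ${node.name} containing: ${childSummaries.join("; ")}`;

const taxModule: AstNode = {
  kind: "module",
  name: "tax.service",
  source: "/* full file text */",
  children: [
    { kind: "function", name: "calculateTax", source: "/* ... */", children: [] },
    { kind: "function", name: "getTaxRate", source: "/* ... */", children: [] },
  ],
};

const descriptions: Description[] = [];
const rootSummary = describeTree(taxModule, templateSummarizer, descriptions);
```

Because leaves are summarized before their parents, the module-level description ends up mentioning both functions it contains, which is the property the recursive approach relies on.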
Stage 3 — Dense Vector Embedding
The generated natural language summaries are chunked at the function level and converted into embedding vectors. These vectors are stored in a vector-specialized database (Chroma, Pinecone, etc. — databases optimized for fast similarity search over high-dimensional vectors).
Stage 4 — Graph Construction
Function call relationships and import dependencies are extracted directly from the code structure, while pattern similarity is determined by connecting nodes whose cosine similarity between the embedding vectors from Stage 3 exceeds a certain threshold. These three types of edges are combined to form the final graph structure.
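A minimal sketch of how the three edge types could be combined, assuming each function node already carries its call list, import list, and embedding. The FunctionNode shape, the sample vectors, and the 0.9 threshold are illustrative assumptions, not values Greptile documents.

```typescript
type EdgeKind = "calls" | "imports" | "similar";

interface Edge {
  from: string;
  to: string;
  kind: EdgeKind;
}

// Hypothetical per-function record produced by the earlier stages.
interface FunctionNode {
  id: string;
  calls: string[];
  imports: string[];
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function buildEdges(nodes: FunctionNode[], threshold: number): Edge[] {
  const edges: Edge[] = [];
  // Structural edges come straight from the parsed code.
  for (const node of nodes) {
    for (const callee of node.calls) edges.push({ from: node.id, to: callee, kind: "calls" });
    for (const dep of node.imports) edges.push({ from: node.id, to: dep, kind: "imports" });
  }
  // Similarity edges connect each pair whose embeddings exceed the threshold.
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      if (cosineSimilarity(nodes[i].embedding, nodes[j].embedding) > threshold) {
        edges.push({ from: nodes[i].id, to: nodes[j].id, kind: "similar" });
      }
    }
  }
  return edges;
}

const nodes: FunctionNode[] = [
  { id: "createInvoice", calls: ["calculateTax"], imports: ["tax.service"], embedding: [0.1, 0.9] },
  { id: "calculateTax", calls: ["getTaxRate"], imports: [], embedding: [0.9, 0.1] },
  { id: "calculateDiscount", calls: [], imports: [], embedding: [0.8, 0.2] },
];

const edges = buildEdges(nodes, 0.9);
```

With these toy vectors, only calculateTax and calculateDiscount are close enough to receive a "similar" edge. The pairwise loop is quadratic, so a real index would presumably use an approximate nearest-neighbor search instead.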
Graph RAG (Graph Retrieval-Augmented Generation): While standard RAG retrieves similar text chunks via simple vector search, Graph RAG enables multi-hop traversal by following connections between nodes. The key difference is that it can find not just "code similar to this function," but "all code that depends on this function."
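To make the multi-hop idea concrete, here is a small sketch that walks call edges in reverse to collect everything that depends on a function, directly or transitively. The edge list and function names are hypothetical.

```typescript
interface CallEdge {
  caller: string;
  callee: string;
}

// Hypothetical call edges extracted from a codebase.
const callEdges: CallEdge[] = [
  { caller: "createInvoice", callee: "calculateTax" },
  { caller: "previewOrder", callee: "calculateTax" },
  { caller: "checkoutHandler", callee: "createInvoice" },
];

// Breadth-first search over reversed call edges, limited to maxHops levels.
function findDependents(target: string, edges: CallEdge[], maxHops: number): string[] {
  const seen: string[] = [target];
  const dependents: string[] = [];
  let frontier: string[] = [target];
  for (let hop = 0; hop < maxHops && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const fn of frontier) {
      for (const edge of edges) {
        if (edge.callee === fn && seen.indexOf(edge.caller) === -1) {
          seen.push(edge.caller);
          dependents.push(edge.caller);
          next.push(edge.caller);
        }
      }
    }
    frontier = next;
  }
  return dependents;
}

const oneHop = findDependents("calculateTax", callEdges, 1); // direct callers only
const twoHops = findDependents("calculateTax", callEdges, 2); // callers of callers as well
```

A plain vector search could surface code that resembles calculateTax, but checkoutHandler only shows up here because the traversal follows a second hop through createInvoice.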
Practical Applications
Example 1: A Payment Logic Change Hiding a Cross-File Contract Violation
This is a situation frequently encountered in practice. A PR modifies the tax calculation function in a payment service, and the logic inside the function looks clean. Judging by the diff alone, it's "LGTM." When I first saw this example, I completely missed the invoice.service.ts side.
// tax.service.ts — modified function
// Before: calculateTax(amount: number): number
// After: calculateTax(amount: number, region: string): TaxResult
interface TaxResult {
  amount: number;
  rate: number;
  breakdown: Record<string, number>;
}
export function calculateTax(amount: number, region: string): TaxResult {
  const rate = getTaxRate(region);
  return {
    amount: amount * rate,
    rate,
    breakdown: { base: amount * rate },
  };
}
A diff analysis tool only checks whether the logic inside tax.service.ts is correct. But Greptile traverses the graph to trace every node that calls this function.
// invoice.service.ts — the caller (not included in the diff)
// ❌ Still written against the old signature: one argument, number return value
const tax = calculateTax(invoice.amount); // missing region argument
const total = invoice.amount + tax; // tax is now a TaxResult object; without type checking, + yields a string such as "100[object Object]", not a number
| Analysis Tool | Error inside tax.service.ts | Contract violation in invoice.service.ts | Missing parameter in order.controller.ts |
|---|---|---|---|
| Diff-based | Detectable | Not detectable | Not detectable |
| Greptile | Detectable | Detectable | Detectable |
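One way to picture the cross-file check in the table: look up every recorded call site of the changed function, skip those already inside the diff, and flag the ones that no longer match the new signature. CallSite, SignatureChange, ContractWarning, and the sample data are assumptions for illustration, not Greptile's data model.

```typescript
// Hypothetical record of a call site, as a graph lookup might return it.
interface CallSite {
  file: string;
  callee: string;
  argCount: number;
  usesReturnValue: boolean;
}

// Hypothetical summary of what changed in the function's contract.
interface SignatureChange {
  name: string;
  requiredParams: number;
  returnTypeChanged: boolean;
}

interface ContractWarning {
  file: string;
  reasons: string[];
}

function findBrokenCallSites(
  change: SignatureChange,
  callSites: CallSite[],
  filesInDiff: string[],
): ContractWarning[] {
  const warnings: ContractWarning[] = [];
  for (const site of callSites) {
    // Only callers of the changed function that the diff does not already cover.
    if (site.callee !== change.name || filesInDiff.indexOf(site.file) !== -1) continue;
    const reasons: string[] = [];
    if (site.argCount < change.requiredParams) {
      reasons.push(`passes ${site.argCount} of ${change.requiredParams} required arguments`);
    }
    if (change.returnTypeChanged && site.usesReturnValue) {
      reasons.push("uses a return value whose type changed");
    }
    if (reasons.length > 0) warnings.push({ file: site.file, reasons });
  }
  return warnings;
}

const contractWarnings = findBrokenCallSites(
  { name: "calculateTax", requiredParams: 2, returnTypeChanged: true },
  [
    { file: "invoice.service.ts", callee: "calculateTax", argCount: 1, usesReturnValue: true },
    { file: "order.controller.ts", callee: "calculateTax", argCount: 1, usesReturnValue: true },
    { file: "tax.service.ts", callee: "getTaxRate", argCount: 1, usesReturnValue: true },
  ],
  ["tax.service.ts"],
);
```

Both callers outside the diff are flagged, while the getTaxRate call inside tax.service.ts is ignored because it targets a different function.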
Example 2: Implicit Interface Change in a Shared Library During a Refactoring PR
A PR described as a "simple refactor" that quietly changes the public interface of a shared library.
// shared/validators.ts — before refactoring
export function validateEmail(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}
// After refactoring: error handling changed to throw
export function validateEmail(email: string): void {
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) {
    throw new ValidationError("Invalid email format");
  }
}
The return type changed from boolean to void, and the failure behavior shifted from return false to throw. A diff analysis tool reads the changes within this file just fine. But it can't see the other services consuming this function.
// auth.service.ts — the caller (not included in the diff)
// ❌ Code that assumes a boolean return. After the refactor the call returns undefined, which is falsy
if (validateEmail(input)) {
  await createUser(input); // never runs, even for valid emails; invalid emails now throw instead
}
Greptile finds all nodes in the graph that consume this function and leaves a comment warning that "the contract has been broken."
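For completeness, here is a sketch of how a caller could adapt to the throwing contract. ValidationError is not defined in the snippets above, so a minimal version is assumed here, and isValidEmail and register are hypothetical helpers (register stands in for the auth.service.ts call to createUser).

```typescript
// Assumed minimal error type; the article's snippets do not show its definition.
class ValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ValidationError";
  }
}

// The refactored validator from shared/validators.ts, reproduced for a runnable sketch.
function validateEmail(email: string): void {
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) {
    throw new ValidationError("Invalid email format");
  }
}

// Hypothetical adapter for callers that still want a yes/no answer.
function isValidEmail(email: string): boolean {
  try {
    validateEmail(email);
    return true;
  } catch (err) {
    // Checking the name avoids relying on instanceof for Error subclasses.
    if (err instanceof Error && err.name === "ValidationError") return false;
    throw err; // unrelated failures should still surface
  }
}

const created: string[] = [];

function register(input: string): void {
  if (isValidEmail(input)) {
    created.push(input); // stand-in for createUser(input)
  }
}

register("dev@example.com");
register("not-an-email");
```

After both calls, only the valid address has been recorded. Whether to wrap the validator like this or let the exception propagate to a shared error handler is a design choice for the consuming service.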
Example 3: Multi-Hop Investigation with the v3 Agent
Starting from Greptile v3, the agent autonomously goes through multiple steps of investigation beyond simple graph searches. The following is a conceptual diagram of how the agent performs this process.
Detected an abnormal discount rate calculation in the PR
↓
Trace back through git history → found relevant commit
↓
Read original PR description for that commit → "hotfix: per specific client request"
↓
Search codebase for similar discount calculation patterns
↓
"3 other discount logic instances use a different approach. Possible consistency issue."
Multi-hop Investigation: A traversal approach that starts from a single question and continues the next search based on intermediate results. To answer "why does this function look like this," it autonomously explores git history → PR description → similar patterns in sequence.
The Decisive Difference Between Diff Analysis and Semantic Graphs
Through the examples above, you've seen directly what each tool does differently. To summarize:
| Item | Traditional Diff Analysis | Semantic Graph (Greptile) |
|---|---|---|
| Scope of analysis | Changed lines (line diff) | Entire repository |
| Cross-file context | Within changed files | Traces entire call chain |
| Git history utilization | Limited | Multi-hop traceability |
| Review speed | Fast (seconds) | Relatively slower |
| False positives | Low | Relatively higher |
| Initial setup cost | None | Indexing time required |
Pros and Cons Analysis
Advantages
The figures below are based on benchmarks self-published by Greptile, measured in internal testing environments.
| Item | Details |
|---|---|
| Bug detection rate | 82% vs. competitors' 44–54%. High-risk bugs detected at 100% (competitors: 36–57%) |
| Cross-file context | Traces entire call chains, import dependencies, and pattern similarities |
| Architectural regression detection | Catches interface contract violations hidden in "clean" diffs |
| Git history utilization | Enables judgments that reference the historical context of changes |
| Ecosystem integration | Provided as an MCP server, directly callable by AI agents such as Claude and Cursor |
Disadvantages and Caveats
The most painful issue on this list in practice was the false positive problem. In the first two weeks, there were so many warnings that team members started muting the review notifications. Similar experiences were shared in the Greptile community, and independent benchmarks also confirmed the gap in false positive rates numerically — Greptile logged 11 cases versus CodeRabbit's 2.
| Item | Details | Mitigation |
|---|---|---|
| High false positive rate | 11 cases in independent benchmarks (CodeRabbit: 2) | Configure team rule-based filters, gradually adjust thresholds |
| Initial indexing cost | Minutes to hours depending on repository size | Run initial indexing overnight, outside the CI pipeline |
| Semantic vs. structural dependencies | May miss call relationships where function signatures differ | Use alongside static analysis tools like TypeScript strict mode |
| Codebase exposure | Entire repository is sent to a cloud service | Use on-premises option, review security policy in advance |
| Operational cost | Agent-based analysis incurs many LLM calls, raising costs | Set analysis depth by PR size, apply only to critical branches |
False Positive: When AI incorrectly identifies code that has no actual problem as a bug or risk. Too many false positives cause developers to start ignoring review notifications, and ultimately even genuinely critical warnings get buried. This is exactly why you need to check the false positive rate alongside the detection rate.
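The relationship between detection rate and false positives can be expressed as recall and precision. The sketch below uses made-up counts purely to show the arithmetic. They are not taken from any benchmark cited in this article.

```typescript
interface ReviewCounts {
  truePositives: number; // real bugs the tool flagged
  falsePositives: number; // flagged code that was actually fine
  falseNegatives: number; // real bugs the tool missed
}

// Precision: of everything flagged, what share was a real problem?
function precision(c: ReviewCounts): number {
  const flagged = c.truePositives + c.falsePositives;
  return flagged === 0 ? 0 : c.truePositives / flagged;
}

// Recall (detection rate): of all real problems, what share was flagged?
function recall(c: ReviewCounts): number {
  const actual = c.truePositives + c.falseNegatives;
  return actual === 0 ? 0 : c.truePositives / actual;
}

// Hypothetical tools reviewing the same 50 real bugs.
const quietTool: ReviewCounts = { truePositives: 40, falsePositives: 10, falseNegatives: 10 };
const noisyTool: ReviewCounts = { truePositives: 45, falsePositives: 45, falseNegatives: 5 };

const quietScores = { precision: precision(quietTool), recall: recall(quietTool) };
const noisyScores = { precision: precision(noisyTool), recall: recall(noisyTool) };
```

In this invented scenario the noisy tool catches more bugs (recall 0.9 versus 0.8), but only half of its comments point at real problems (precision 0.5 versus 0.8), which is the pattern that leads teams to mute notifications.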
The Most Common Mistakes in Practice
- Abandoning the tool immediately when there are too many false positives. In the early stages, warnings unrelated to team conventions can flood in because the index hasn't yet learned enough of the codebase's implicit rules. After feeding back team rules for about 2–4 weeks, signal quality noticeably improves.
- Assuming the semantic graph replaces static analysis tools. Structural errors caught by type checkers and linters, and architectural context caught by semantic graphs, are complementary. It's best to use TypeScript strict mode + ESLint + Greptile as a combination where each covers a different layer.
- Applying it to the entire repository all at once. In monorepos (a pattern where multiple services or packages are managed together in a single repository) or large-scale repositories, the initial indexing cost and false positive volume both grow simultaneously. A practical approach is to first apply it to core domain modules with high change impact, such as payments and authentication, and then gradually expand the scope.
Closing Thoughts
AI code review is shifting from a problem of "reading changed lines well" to one of "understanding the entire system and reasoning about the ripple effects of changes," and the semantic graph approach is currently one of the most concrete implementations in this direction. Competing tools like CodeRabbit and GitHub Copilot Code Review are also rapidly advancing in the same direction.
There are real-world tradeoffs in false positive rates and indexing costs, but the value of catching cross-file contract violations or architectural regressions before deployment becomes increasingly clear as team size grows.
Which part of your codebase would you connect first? Here are three steps you can take right now.
- You can check the current plans on the official website and try connecting one side project or staging repository. After installing the GitHub app and completing the initial indexing, it's worth seeing firsthand what kind of cross-file comments appear on your next PR.
- You can collect past cross-file bug cases and use them as retrospective tests. Having your team jointly verify "would this bug have been caught in a Greptile comment?" gives you the practical evidence you need for an adoption decision.
- Integrating the MCP server to run codebase queries directly from Claude or Cursor is also a great option. Getting a feel for the practical value of semantic graphs through everyday development tasks — like "find everywhere this function is called" — before expanding into review automation is a natural progression.
References
- Graph-based Codebase Context | Greptile Official Docs
- Greptile v3, an agentic approach to code review | Greptile Blog
- Series A and Greptile v3 | Greptile Blog
- Codebases are uniquely hard to search semantically | Greptile Blog
- AI Code Review Benchmarks 2025 | Greptile
- Greptile: Smarter Code Reviews Through Codebase-Aware AI | DEV Community
- Greptile bags $25M in funding | SiliconANGLE
- Benchmark in talks to lead Series A for Greptile | TechCrunch
- Customer story: Greptile | Anthropic Claude
- CodeRabbit vs Greptile: Which AI Reviewer Catches More Bugs? | DEV Community
- Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs | arXiv
- State of AI Code Review Tools in 2025 | DevTools Academy