Local LLM Agent Setup Guide — Implementing Code Review Agents and RAG Pipelines with Ollama, LangGraph, and MCP (Model Context Protocol)
When introducing GPT-4o or Claude APIs into a project, two barriers frequently arise. The first is privacy: you must consult the legal team about whether internal codebases or confidential documents may be sent to external servers. The second is cost: token-based billing becomes hard to predict as usage grows. Local AI agents solve both problems at once. Because the LLM runs directly on your own machine, data never leaves the network and API bills disappear.
With Ollama surpassing 50 million monthly downloads and LangGraph shipping its official 1.0 release, the barrier to entry for building local agents has dropped considerably. The ecosystem is also maturing rapidly as the Model Context Protocol (MCP), designed by Anthropic, establishes itself as the standard for agent-tool communication. In this article, you will learn the three-tier architecture of local AI agents and the key knowledge needed to build your own code review agent that runs without API keys, plus a fully local RAG pipeline.
What we will build in this post
- Example 1 — Offline code review agent running `codellama:7b` + LangGraph
- Example 2 — Fully local RAG pipeline with `mistral:7b` + ChromaDB
- Example 3 — Pattern for connecting an MCP (Model Context Protocol) tool server to a LangGraph agent
Prerequisites: You can follow along if you have basic Python syntax and experience using pip. Hardware: A minimum of 8GB VRAM is recommended for 7B parameter models. Although it can run on the CPU without a GPU, the speed will be significantly slower. Apple Silicon Macs (M1 or higher) utilize integrated memory and run smoothly even in a 16GB environment.
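As a rough sanity check on these VRAM numbers: weight memory scales with parameter count times bytes per weight. The fixed 1.5 GB overhead below is an assumed placeholder, and real usage also grows with context length (KV cache), so treat this as a back-of-envelope sketch only.

```python
# Rough VRAM estimate for a quantized model: params * bytes_per_weight + overhead.
# The 1.5 GB overhead is an assumption; KV cache growth with context is ignored.
def vram_gb(params_b: float, bits: int, overhead_gb: float = 1.5) -> float:
    """Back-of-envelope VRAM estimate in GB for a params_b-billion-parameter model."""
    return params_b * bits / 8 + overhead_gb

print(round(vram_gb(7, 4), 1))   # 7B at Q4 → ~5.0 GB
print(round(vram_gb(7, 8), 1))   # 7B at Q8 → ~8.5 GB
print(round(vram_gb(13, 4), 1))  # 13B at Q4 → ~8.0 GB
```

This is consistent with the rule of thumb later in the post: with 8GB VRAM, 7B at Q8 or 13B at Q4 are realistic choices.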
Key Concepts
What is a Local AI Agent
Local AI agents are different from simple chatbots. While a chatbot ends with a one-shot question → answer round trip, an agent plans and executes multi-stage tasks on its own by repeating the ReAct loop of Reasoning → Acting → Observing.
ReAct (Reasoning + Acting): an agent pattern in which an LLM repeats a cycle of reasoning about which tool to use and in what order, invoking the actual tool, and observing the result. It was proposed in the paper "ReAct: Synergizing Reasoning and Acting in Language Models" (ICLR 2023), and LangGraph implements this pattern as a graph.
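The loop can be sketched in plain Python with a stubbed model and a toy tool table. Everything here is invented for illustration; in a real agent, the LLM itself produces the reasoning step instead of the hard-coded stub.

```python
# Toy ReAct loop: reason -> act -> observe, repeated until a final answer.
# The "model" is a hard-coded stub standing in for an LLM's decisions.
TOOLS = {
    "lint": lambda code: "W291 trailing whitespace" if code.endswith(" ") else "clean",
    "count_lines": lambda code: str(code.count("\n") + 1),
}

def stub_model(observations: list[str]) -> tuple[str, str]:
    """Decide the next action from what has been observed so far."""
    if not observations:
        return ("act", "lint")            # reasoning: check style first
    if len(observations) == 1:
        return ("act", "count_lines")     # reasoning: then measure size
    return ("finish", f"review done: {observations}")

def react_loop(code: str) -> str:
    observations: list[str] = []
    while True:
        kind, value = stub_model(observations)       # Reasoning
        if kind == "finish":
            return value
        result = TOOLS[value](code)                  # Acting
        observations.append(f"{value} -> {result}")  # Observing

print(react_loop("x = 1 "))
```

Example 1 below implements exactly this shape, with LangGraph supplying the loop machinery and Ollama supplying the reasoning.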
The most fundamental difference from cloud agents is that all reasoning is completed within the local machine. Data does not leave the network.
3-Tier Architecture
The local AI agent consists of a combination of three layers.
| Layer | Role | Representative Implementations |
|---|---|---|
| Inference Engine | Local LLM Serving | Ollama, LM Studio, vLLM |
| Agent Frameworks | Tool Invocation · Memory · Workflow | LangGraph, CrewAI, AutoGen |
| Tools/Integration Layer | External System Connection | MCP, FastAPI, Docker |
Each layer is connected via an adapter. To use Ollama in LangGraph, the langchain-ollama package is required, and to connect the MCP tool server to LangGraph, langchain-mcp-adapters is required. Even if the inference engine is replaced from Ollama to vLLM, most of the LangGraph workflow code remains intact.
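A minimal sketch of why that layering swaps cleanly, with stand-in engine classes (all names below are invented for illustration): the workflow layer depends only on a chat interface, so replacing the engine leaves it untouched.

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only thing the workflow layer knows about the engine."""
    def invoke(self, prompt: str) -> str: ...

class OllamaEngine:
    def invoke(self, prompt: str) -> str:
        return f"[ollama] reply to: {prompt}"   # stand-in for an HTTP call to :11434

class VLLMEngine:
    def invoke(self, prompt: str) -> str:
        return f"[vllm] reply to: {prompt}"     # stand-in for an OpenAI-compatible call

def run_workflow(llm: ChatModel, code: str) -> str:
    # The workflow layer is identical regardless of the engine underneath
    return llm.invoke(f"Review this code: {code}")

print(run_workflow(OllamaEngine(), "def f(): pass"))
print(run_workflow(VLLMEngine(), "def f(): pass"))
```

In the real stack, adapter packages such as langchain-ollama play the role of these engine classes.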
MCP — New Standard for Agent Tool Communication
Model Context Protocol (MCP), designed by Anthropic in 2024 and donated to the Linux Foundation in December 2025, standardizes how agents communicate with external tools (file systems, databases, browsers, etc.). OpenAI and Google have also announced official support, and over 75 MCP connectors are already available in the Claude environment.
MCP (Model Context Protocol): A standard protocol defined to enable agents to dynamically discover and execute "what tools are available and how to call them." Just as USB-C connects various devices in a unified manner, MCP connects various LLMs and external tools through a consistent interface.
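The discovery idea can be sketched without any protocol machinery: a server publishes a manifest of tool names and input schemas, and a generic client can invoke tools it has never seen at compile time. The manifest shape below mirrors MCP's tool listing, but the names, schema check, and implementations are toy stand-ins.

```python
import json

# A manifest like the one an MCP server publishes at discovery time
MANIFEST = json.loads("""
{"tools": [{"name": "extract_text",
            "description": "Extract text from a URL",
            "inputSchema": {"type": "object",
                            "properties": {"url": {"type": "string"}},
                            "required": ["url"]}}]}
""")

# Server-side implementations, looked up by name
IMPLEMENTATIONS = {"extract_text": lambda url: f"text of {url}"}

def call_tool(name: str, args: dict) -> str:
    """Generic client: validate against the advertised schema, then invoke by name."""
    spec = next(t for t in MANIFEST["tools"] if t["name"] == name)
    for field in spec["inputSchema"]["required"]:   # minimal schema check
        if field not in args:
            raise ValueError(f"missing required argument: {field}")
    return IMPLEMENTATIONS[name](**args)

print(call_tool("extract_text", {"url": "https://example.com"}))
```

Example 3 below does the same thing with the real protocol: the server advertises its tools, and the agent's MCP client discovers and calls them.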
Inference Engine Selection Guide
| Tool | Features | Suitable for |
|---|---|---|
| Ollama | CLI First, OpenAI Compatible REST API, Fully Open Source | Developers, API Integration |
| LM Studio | GUI-based, Apple Silicon MLX optimized | Non-developers, Mac users |
| Jan | Open source, privacy-first ChatGPT alternative | Individual users |
| vLLM | High-performance batch serving, production deployment optimization | Enterprise |
If you are a developer, I recommend starting with Ollama. It provides a built-in REST API compatible with the OpenAI SDK, making it easy to integrate into your existing codebase.
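To see what that OpenAI-compatible surface looks like, the sketch below builds (but does not send) a chat completion request against Ollama's documented `/v1` route using only the standard library, so nothing here needs a running server. The "Bearer ollama" token is a placeholder; Ollama ignores authentication.

```python
import json
from urllib.request import Request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def chat_request(model: str, prompt: str) -> Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # any token works; Ollama ignores it
    )

req = chat_request("mistral:7b", "Hello")
# Send with urllib.request.urlopen(req) once `ollama serve` is running
print(req.full_url)
```

The same shape is why you can point the official OpenAI SDK at a local model by changing only its `base_url`.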
Practical Application
Example 1: Configuring an Offline Code Review Agent with Ollama + LangGraph
Run the codellama:7b model locally without an API key and configure a lint → review workflow with LangGraph. codellama was chosen because it is a model specifically trained for code understanding and generation. Even with the same 7B size, it provides more relevant feedback in a code review context than the general-purpose model mistral.
```bash
# 1. Install Ollama, then download the model
brew install ollama   # macOS
# Linux: curl -fsSL https://ollama.com/install.sh | sh
ollama pull codellama:7b   # ~3.8 GB, code-specialized model

# 2. Install dependencies (versions pinned)
pip install "langgraph>=0.2" "langchain-ollama>=0.1" "langchain-core>=0.3"
```

```python
# code_review_agent.py
from typing import TypedDict, Annotated
import operator
import subprocess

from langchain_ollama import ChatOllama
from langgraph.graph import StateGraph, END

# State definition
# TypedDict: declares the structure and types of the data passed between nodes.
# LangGraph serializes state based on this structure and passes it safely between nodes.
# Annotated[list[str], operator.add]: when multiple nodes append to `issues`,
# this reducer merges the lists automatically instead of overwriting them.
class ReviewState(TypedDict):
    file_path: str
    code: str
    lint_result: str
    review_result: str
    issues: Annotated[list[str], operator.add]
    iteration: int

# Initialize the local LLM (no API key needed)
llm = ChatOllama(
    model="codellama:7b",
    base_url="http://localhost:11434",  # Ollama's default endpoint
    temperature=0.1,  # lower = more consistent output
)

# Lint node — run ruff static analysis and store the result in state.
# Nodes return only the keys they changed; LangGraph merges them into the state.
def lint_node(state: ReviewState) -> dict:
    result = subprocess.run(
        ["ruff", "check", state["file_path"]],
        capture_output=True, text=True,
    )
    return {"lint_result": result.stdout or "No lint errors"}

# Review node — pass the code and lint result to the LLM to generate a review
def review_node(state: ReviewState) -> dict:
    prompt = f"""Review the following code and return the points that need improvement as JSON.

Code:
{state['code']}

Lint result:
{state['lint_result']}

Response format: {{"issues": ["problem 1", "problem 2"]}}"""
    response = llm.invoke(prompt)
    return {
        "review_result": response.content,
        "iteration": state["iteration"] + 1,
    }

# Build the graph: lint → review → END
graph = StateGraph(ReviewState)
graph.add_node("lint", lint_node)
graph.add_node("review", review_node)
graph.set_entry_point("lint")
graph.add_edge("lint", "review")
graph.add_edge("review", END)
agent = graph.compile()

# Run — the with block guarantees the file handle is closed
with open("my_module.py", "r") as f:
    code = f.read()

result = agent.invoke({
    "file_path": "my_module.py",
    "code": code,
    "lint_result": "",
    "review_result": "",
    "issues": [],
    "iteration": 0,
})
print(result["review_result"])
```

```bash
# Run
python code_review_agent.py
```

| Code Point | Description |
|---|---|
| `ChatOllama(base_url=...)` | Connects to the local Ollama server; there is no API key parameter |
| `TypedDict` + `Annotated` | Declares the state type shared between nodes; the reducer on the list field guarantees safe merging when nodes run in parallel |
| `lint_node` → `review_node` | Linter execution and LLM review are separate nodes, so each step can be tested independently |
| `with open(...)` | The context manager ensures the file handle is closed |
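The reducer semantics on `issues` can be sketched without LangGraph: a reducer is just a binary function applied to (current value, node update) during the state merge. The `merge` helper below is a hypothetical miniature for illustration, not LangGraph's actual implementation.

```python
import operator

# Hypothetical mini-merge mimicking how a reducer-annotated field combines updates
def merge(state: dict, update: dict, reducers: dict) -> dict:
    out = dict(state)
    for key, value in update.items():
        if key in reducers:
            out[key] = reducers[key](out.get(key, []), value)  # combine, don't overwrite
        else:
            out[key] = value  # default: last write wins
    return out

reducers = {"issues": operator.add}
state = {"issues": ["unused import"], "iteration": 1}
update = {"issues": ["missing docstring"], "iteration": 2}
print(merge(state, update, reducers))
# → {'issues': ['unused import', 'missing docstring'], 'iteration': 2}
```

This is why nodes return only the keys they changed: fields with a reducer accumulate, and everything else is simply overwritten.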
Example 2: In-house Document RAG Pipeline (Completely Local)
This pattern builds Retrieval-Augmented Generation (RAG) without sending sensitive internal documents anywhere. Unlike the code review agent, it uses mistral:7b because Mistral is trained for general-purpose text understanding and summarization: Codellama gives better results on code-specific tasks, while Mistral gives better results on natural-language, document-based QA.
```bash
# Download the Ollama models
ollama pull mistral:7b        # ~4.1 GB, general-purpose text
ollama pull nomic-embed-text  # 274 MB, local embeddings only

# Install dependencies — langchain-ollama includes OllamaEmbeddings
pip install "langchain-ollama>=0.1" "langchain-chroma>=0.1" chromadb \
    "langchain-community>=0.3" "langchain-core>=0.3" "langchain-text-splitters>=0.3"
```

```python
# local_rag_pipeline.py
# Written against LangChain 0.3+
from operator import itemgetter

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_core.prompts import ChatPromptTemplate

# 1. Local embedding model (no external API calls)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# 2. Load documents and split into chunks
# loader_cls=TextLoader avoids the extra `unstructured` dependency for plain Markdown
loader = DirectoryLoader("./company_docs", glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(docs)

# 3. Store in a local vector DB
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",  # persisted to local disk
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 4. Build the RAG chain (LCEL syntax, recommended for LangChain 0.3+)
# itemgetter extracts dictionary keys safely
llm = ChatOllama(model="mistral:7b", temperature=0)
prompt = ChatPromptTemplate.from_template("""
Answer the question using only the following context.
If the answer is not in the context, reply "Not found in the documents."

Context:
{context}

Question: {question}
""")

def format_docs(docs) -> str:
    """Join the retrieved documents into a single context string."""
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {
        "context": itemgetter("question") | retriever | format_docs,
        "question": itemgetter("question"),
    }
    | prompt
    | llm
)

# 5. Query — the chain ends with ChatOllama, so the return value is an AIMessage
response = rag_chain.invoke({"question": "What is the onboarding process?"})
print(response.content)
```

```bash
# Run
python local_rag_pipeline.py
```

Example 3: Connecting the MCP Tool Server to the LangGraph Agent
The core value of MCP (Model Context Protocol) is the separation of the tool server and the agent. The tool server is deployed independently, and the agent dynamically discovers and invokes available tools through the MCP client. This example illustrates the entire flow of MCP tool server configuration → client connection → agent execution.
Docker Compose-based production deployment and Playwright browser automation integration will be covered in depth in a separate post.
```bash
# Install the MCP adapter and server dependencies
pip install "langchain-mcp-adapters>=0.1" "langchain-ollama>=0.1" \
    "mcp>=1.0" fastapi uvicorn
```

Step 1: Configure the MCP Tool Server
```python
# tool_server/main.py
# A plain FastAPI route cannot speak the MCP streamable HTTP transport, so this
# sketch uses FastMCP from the official MCP Python SDK (pip install "mcp>=1.0")
# and exposes it as an ASGI app that uvicorn can serve.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("browser-tools")

@mcp.tool()
def extract_text(url: str) -> str:
    """Extract the text of the given URL."""
    # A real implementation would use httpx or playwright
    return f"Text extracted from {url} (example)"

# Tool discovery (the protocol's list-tools request) is handled by FastMCP
# automatically from the registered @mcp.tool() functions.

# ASGI app serving the MCP streamable HTTP endpoint at /mcp —
# run it with: uvicorn tool_server.main:app --port 8000
app = mcp.streamable_http_app()
```

Step 2: Register the tools with the agent as an MCP client
```python
# mcp_agent.py
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent
from langchain_ollama import ChatOllama

llm = ChatOllama(model="mistral:7b", temperature=0.1)

async def main():
    # The MCP client automatically discovers the tools available on each server.
    # Since langchain-mcp-adapters 0.1, the client is constructed directly
    # (no async-with block) and get_tools() is awaited.
    mcp_client = MultiServerMCPClient(
        {
            "browser-tools": {
                "url": "http://localhost:8000/mcp",
                "transport": "streamable_http",
            }
        }
    )
    # Convert the discovered tools into LangChain-compatible tools, then register them
    tools = await mcp_client.get_tools()
    agent = create_react_agent(llm, tools)
    response = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "Extract the text of the page at https://example.com"}]}
    )
    print(response["messages"][-1].content)

asyncio.run(main())
```

```bash
# Terminal 1: run the tool server
uvicorn tool_server.main:app --port 8000

# Terminal 2: run the agent
python mcp_agent.py
```

| Code Point | Description |
|---|---|
| `MultiServerMCPClient` | Client that connects to multiple MCP tool servers at once |
| `mcp_client.get_tools()` | Dynamically discovers the tools available on each server and converts them into LangChain-compatible tools |
| `create_react_agent` | Helper that builds a LangGraph ReAct agent from just an LLM and a tool list |
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Complete Data Privacy | Data never leaves the local network, enabling compliance with regulations such as HIPAA and GDPR |
| No API Costs | Unlimited execution without token charges |
| Low Latency | Improved response speed with local inference without network round trips |
| Complete Customization | Model Fine-tuning, System Prompts, Free Parameter Adjustment |
| Offline Operation | Works without an internet connection |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Hardware Requirements | Minimum 8GB VRAM for 7B models, 48GB or more for 70B models | Use CPU inference (slow but workable) or quantized models (Q4_K_M, Q8_0) |
| Model Performance Gap | Inference quality may be lower compared to GPT-4o and Claude Opus | Select a specialized model for the task (Code: Codellama, General: Mistral) |
| Configuration Complexity | Technical knowledge required for environment configuration and model management | Standardize environment with Docker Compose, utilize LM Studio GUI |
| Prompt Injection Risk | Security vulnerability where agent cannot distinguish between external data and instructions | Input data sandboxing, whitelist-based tool access restriction |
| Model Update Management | The latest models must be downloaded and swapped in manually | Automate periodic `ollama pull` runs |
Prompt Injection: This is an attack that unintentionally manipulates agent behavior by including malicious instructions in external data (web pages, files, etc.) processed by the agent. This risk increases as the agent accesses file systems or the web.
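The whitelist mitigation from the table above can be sketched as a guard that runs before every tool call. The host names and tool names below are invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical allow-lists: only these tools and hosts may be touched by the agent
ALLOWED_HOSTS = {"docs.internal.example", "wiki.internal.example"}
ALLOWED_TOOLS = {"extract_text"}

def guard_tool_call(tool_name: str, args: dict) -> None:
    """Raise PermissionError unless the tool and its target host are allow-listed."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allow-listed: {tool_name}")
    url = args.get("url", "")
    if urlparse(url).hostname not in ALLOWED_HOSTS:
        raise PermissionError(f"host not allow-listed: {url}")

guard_tool_call("extract_text", {"url": "https://docs.internal.example/onboarding"})  # passes
# guard_tool_call("extract_text", {"url": "https://evil.example/x"})  # would raise PermissionError
```

The key property is that the guard sits outside the LLM: even a successfully injected instruction cannot widen the set of reachable tools or hosts.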
Top 3 Common Mistakes
- Selecting a large model without checking VRAM: choosing a model above 13B without checking GPU memory, which relegates you to CPU inference. First check the current load with `ollama ps`, then select a quantized build (Q4_K_M, Q8_0) that matches your hardware. With 8GB VRAM, 7B at Q8 or 13B at Q4 are realistic choices.
- Granting excessive privileges to the agent: allowing full file system and network access for testing convenience. Apply the principle of least privilege and restrict the agent to only the directories and URLs it actually needs.
- Trusting the LLM response format: assuming the LLM always replies in the specified JSON format. Local models often comply with format instructions less reliably than cloud APIs, so include Pydantic parsing and retry logic. If your model supports a structured-output option such as `response_format`, use it as well.
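Mistake 3 can be mitigated with a small parse-and-retry wrapper. `parse_issues` and `review_with_retry` below are hypothetical helpers using only the standard library, exercised here with a stubbed model reply rather than a real LLM call.

```python
import json
import re

def parse_issues(raw: str) -> list[str]:
    """Extract {"issues": [...]} from a possibly chatty LLM reply."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # grab the JSON-looking span
    if not match:
        raise ValueError("no JSON object found")
    data = json.loads(match.group())
    return list(data["issues"])

def review_with_retry(call_llm, prompt: str, retries: int = 2) -> list[str]:
    """Re-ask the model when its reply does not parse as the expected JSON."""
    last_err = None
    for _ in range(retries + 1):
        try:
            return parse_issues(call_llm(prompt))
        except (ValueError, KeyError, json.JSONDecodeError) as err:
            last_err = err  # local models often drift from the format; try again
    raise RuntimeError(f"LLM never produced valid JSON: {last_err}")

# Stubbed model reply with chatter around the JSON
reply = 'Sure! Here is my review: {"issues": ["unused import", "broad except"]}'
print(parse_issues(reply))
# → ['unused import', 'broad except']
```

In the review agent from Example 1, `review_node` is the natural place to hang this wrapper, since it is where the JSON contract is stated in the prompt.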
In Conclusion
You can now run code review agents without API keys and operate semantic search systems without transmitting internal documents externally. The combination of Ollama, LangGraph, and MCP (Model Context Protocol) has reached a level ready for immediate deployment in production.
3 Steps to Start Right Now:
- Install Ollama and verify the local switch by changing only `base_url` in your existing OpenAI SDK code. A single line, `brew install ollama && ollama run mistral:7b`, starts a conversation right in the terminal, and the OpenAI-compatible API is immediately available at `http://localhost:11434/v1/chat/completions`.
- Apply the code review agent from Example 1 to a file in your GitHub repository. Install the dependencies with `pip install "langgraph>=0.2" "langchain-ollama>=0.1"`, then simply change the `file_path` variable. You can start with a single-node graph and expand gradually by adding tools and nodes.
- Configure a RAG pipeline with 10 internal documents first. Start small to check the relevance of search results, then improve quality by gradually adjusting the document count and chunk size. The `nomic-embed-text` + ChromaDB combination can search tens of thousands of documents offline with a single 274MB embedding model.
Next Post: Deepening Multi-Agent Collaboration Patterns — How to Build Role-Based Teams and Automate Complex Research and Coding Pipelines with CrewAI and LangGraph
Reference Materials
- Building Local AI Agents with LangGraph and Ollama | DigitalOcean
- Building a Local AI Agent with Ollama + MCP + LangChain + Docker | DEV Community
- Running Local LLMs in 2026: Ollama, LM Studio, and Jan Compared | DEV Community
- Top 9 AI Agent Frameworks as of March 2026 | Shakudo
- Agentic AI and Security | Martin Fowler
- Best practices for AI agent security in 2025 | Glean
- Introducing Strands Agents, an Open Source AI Agents SDK | AWS Tech Blog