Local LLM Agent Setup Guide — Implementing Code Review Agents and RAG Pipelines with Ollama, LangGraph, and MCP (Model Context Protocol)
When introducing GPT-4o or Claude APIs into a project, two barriers frequently arise. The first is privacy: you must consult the legal team about whether internal codebases or confidential documents may be sent to external servers. The second is cost: token-based billing becomes hard to predict as usage grows. Local AI agents solve both problems at once. Because the LLM runs directly on your own machine, data never leaves the network and API bills disappear.
With Ollama surpassing 50 million monthly downloads and LangGraph shipping its official 1.0 release, the barrier to entry for building local agents has dropped considerably. The ecosystem is also maturing rapidly as the Model Context Protocol (MCP), designed by Anthropic, establishes itself as the standard for agent-tool communication. In this article, you will learn the three-tier architecture of local AI agents and the key knowledge needed to build your own code review agent that runs without API keys, plus a fully local RAG pipeline.
What we will build in this post
- Example 1 — Offline code review agent running `codellama:7b` + LangGraph
- Example 2 — Fully local RAG pipeline with `mistral:7b` + ChromaDB
- Example 3 — Pattern for connecting an MCP (Model Context Protocol) tool server to a LangGraph agent
Prerequisites: You can follow along if you have basic Python syntax and experience using pip. Hardware: A minimum of 8GB VRAM is recommended for 7B parameter models. Although it can run on the CPU without a GPU, the speed will be significantly slower. Apple Silicon Macs (M1 or higher) utilize integrated memory and run smoothly even in a 16GB environment.
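As a rough sanity check on these VRAM numbers: weight memory scales with parameter count times bytes per weight. The fixed 1.5 GB overhead below is an assumed placeholder, and real usage also grows with context length (KV cache), so treat this as a back-of-envelope sketch only.

```python
# Rough VRAM estimate for a quantized model: params * bytes_per_weight + overhead.
# The 1.5 GB overhead is an assumption; KV cache growth with context is ignored.
def vram_gb(params_b: float, bits: int, overhead_gb: float = 1.5) -> float:
    """Back-of-envelope VRAM estimate in GB for a params_b-billion-parameter model."""
    return params_b * bits / 8 + overhead_gb

print(round(vram_gb(7, 4), 1))   # 7B at Q4 → ~5.0 GB
print(round(vram_gb(7, 8), 1))   # 7B at Q8 → ~8.5 GB
print(round(vram_gb(13, 4), 1))  # 13B at Q4 → ~8.0 GB
```

This is consistent with the rule of thumb later in the post: with 8GB VRAM, 7B at Q8 or 13B at Q4 are realistic choices.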
Key Concepts
What is a Local AI Agent
Local AI agents are different from simple chatbots. While a chatbot ends with a one-shot question → answer round trip, an agent plans and executes multi-stage tasks on its own by repeating the ReAct loop of Reasoning → Acting → Observing.
ReAct (Reasoning + Acting): an agent pattern in which an LLM repeats a cycle of reasoning about which tool to use and in what order, invoking the actual tool, and observing the result. It was proposed in the paper "ReAct: Synergizing Reasoning and Acting in Language Models" (ICLR 2023), and LangGraph implements this pattern as a graph.
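The loop can be sketched in plain Python with a stubbed model and a toy tool table. Everything here is invented for illustration; in a real agent, the LLM itself produces the reasoning step instead of the hard-coded stub.

```python
# Toy ReAct loop: reason -> act -> observe, repeated until a final answer.
# The "model" is a hard-coded stub standing in for an LLM's decisions.
TOOLS = {
    "lint": lambda code: "W291 trailing whitespace" if code.endswith(" ") else "clean",
    "count_lines": lambda code: str(code.count("\n") + 1),
}

def stub_model(observations: list[str]) -> tuple[str, str]:
    """Decide the next action from what has been observed so far."""
    if not observations:
        return ("act", "lint")            # reasoning: check style first
    if len(observations) == 1:
        return ("act", "count_lines")     # reasoning: then measure size
    return ("finish", f"review done: {observations}")

def react_loop(code: str) -> str:
    observations: list[str] = []
    while True:
        kind, value = stub_model(observations)       # Reasoning
        if kind == "finish":
            return value
        result = TOOLS[value](code)                  # Acting
        observations.append(f"{value} -> {result}")  # Observing

print(react_loop("x = 1 "))
```

Example 1 below implements exactly this shape, with LangGraph supplying the loop machinery and Ollama supplying the reasoning.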
The most fundamental difference from cloud agents is that all reasoning is completed within the local machine. Data does not leave the network.
3-Tier Architecture
The local AI agent consists of a combination of three layers.
| Layer | Role | Representative Implementations |
|---|---|---|
| Inference Engine | Local LLM Serving | Ollama, LM Studio, vLLM |
| Agent Frameworks | Tool Invocation · Memory · Workflow | LangGraph, CrewAI, AutoGen |
| Tools/Integration Layer | External System Connection | MCP, FastAPI, Docker |
Each layer is connected via an adapter. To use Ollama in LangGraph, the langchain-ollama package is required, and to connect the MCP tool server to LangGraph, langchain-mcp-adapters is required. Even if the inference engine is replaced from Ollama to vLLM, most of the LangGraph workflow code remains intact.
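A minimal sketch of why that layering swaps cleanly, with stand-in engine classes (all names below are invented for illustration): the workflow layer depends only on a chat interface, so replacing the engine leaves it untouched.

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only thing the workflow layer knows about the engine."""
    def invoke(self, prompt: str) -> str: ...

class OllamaEngine:
    def invoke(self, prompt: str) -> str:
        return f"[ollama] reply to: {prompt}"   # stand-in for an HTTP call to :11434

class VLLMEngine:
    def invoke(self, prompt: str) -> str:
        return f"[vllm] reply to: {prompt}"     # stand-in for an OpenAI-compatible call

def run_workflow(llm: ChatModel, code: str) -> str:
    # The workflow layer is identical regardless of the engine underneath
    return llm.invoke(f"Review this code: {code}")

print(run_workflow(OllamaEngine(), "def f(): pass"))
print(run_workflow(VLLMEngine(), "def f(): pass"))
```

In the real stack, adapter packages such as langchain-ollama play the role of these engine classes.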
MCP — New Standard for Agent Tool Communication
Model Context Protocol (MCP), designed by Anthropic in 2024 and donated to the Linux Foundation in December 2025, standardizes how agents communicate with external tools (file systems, databases, browsers, etc.). OpenAI and Google have also announced official support, and over 75 MCP connectors are already available in the Claude environment.
MCP (Model Context Protocol): A standard protocol defined to enable agents to dynamically discover and execute "what tools are available and how to call them." Just as USB-C connects various devices in a unified manner, MCP connects various LLMs and external tools through a consistent interface.
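The discovery idea can be sketched without any protocol machinery: a server publishes a manifest of tool names and input schemas, and a generic client can invoke tools it has never seen at compile time. The manifest shape below mirrors MCP's tool listing, but the names, schema check, and implementations are toy stand-ins.

```python
import json

# A manifest like the one an MCP server publishes at discovery time
MANIFEST = json.loads("""
{"tools": [{"name": "extract_text",
            "description": "Extract text from a URL",
            "inputSchema": {"type": "object",
                            "properties": {"url": {"type": "string"}},
                            "required": ["url"]}}]}
""")

# Server-side implementations, looked up by name
IMPLEMENTATIONS = {"extract_text": lambda url: f"text of {url}"}

def call_tool(name: str, args: dict) -> str:
    """Generic client: validate against the advertised schema, then invoke by name."""
    spec = next(t for t in MANIFEST["tools"] if t["name"] == name)
    for field in spec["inputSchema"]["required"]:   # minimal schema check
        if field not in args:
            raise ValueError(f"missing required argument: {field}")
    return IMPLEMENTATIONS[name](**args)

print(call_tool("extract_text", {"url": "https://example.com"}))
```

Example 3 below does the same thing with the real protocol: the server advertises its tools, and the agent's MCP client discovers and calls them.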
Inference Engine Selection Guide
| Tool | Features | Suitable for |
|---|---|---|
| Ollama | CLI First, OpenAI Compatible REST API, Fully Open Source | Developers, API Integration |
| LM Studio | GUI-based, Apple Silicon MLX optimized | Non-developers, Mac users |
| Jan | Open source, privacy-first ChatGPT alternative | Individual users |
| vLLM | High-performance batch serving, production deployment optimization | Enterprise |
If you are a developer, I recommend starting with Ollama. It provides a built-in REST API compatible with the OpenAI SDK, making it easy to integrate into your existing codebase.
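To see what that OpenAI-compatible surface looks like, the sketch below builds (but does not send) a chat completion request against Ollama's documented `/v1` route using only the standard library, so nothing here needs a running server. The "Bearer ollama" token is a placeholder; Ollama ignores authentication.

```python
import json
from urllib.request import Request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def chat_request(model: str, prompt: str) -> Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # any token works; Ollama ignores it
    )

req = chat_request("mistral:7b", "Hello")
# Send with urllib.request.urlopen(req) once `ollama serve` is running
print(req.full_url)
```

The same shape is why you can point the official OpenAI SDK at a local model by changing only its `base_url`.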
Practical Application
Example 1: Configuring an Offline Code Review Agent with Ollama + LangGraph
Run the codellama:7b model locally without an API key and configure a lint → review workflow with LangGraph. codellama was chosen because it is a model specifically trained for code understanding and generation. Even with the same 7B size, it provides more relevant feedback in a code review context than the general-purpose model mistral.
```bash
# 1. Install Ollama, then download the model
brew install ollama   # macOS
# Linux: curl -fsSL https://ollama.com/install.sh | sh
ollama pull codellama:7b   # ~3.8 GB, code-specialized model

# 2. Install dependencies (versions pinned)
pip install "langgraph>=0.2" "langchain-ollama>=0.1" "langchain-core>=0.3"
```

```python
# code_review_agent.py
from typing import TypedDict, Annotated
import operator
import subprocess

from langchain_ollama import ChatOllama
from langgraph.graph import StateGraph, END

# State definition
# TypedDict: declares the structure and types of the data passed between nodes.
# LangGraph serializes state based on this structure and passes it safely between nodes.
# Annotated[list[str], operator.add]: when multiple nodes append to `issues`,
# this reducer merges the lists automatically instead of overwriting them.
class ReviewState(TypedDict):
    file_path: str
    code: str
    lint_result: str
    review_result: str
    issues: Annotated[list[str], operator.add]
    iteration: int

# Initialize the local LLM (no API key needed)
llm = ChatOllama(
    model="codellama:7b",
    base_url="http://localhost:11434",  # Ollama's default endpoint
    temperature=0.1,  # lower = more consistent output
)

# Lint node — run ruff static analysis and store the result in state.
# Nodes return only the keys they changed; LangGraph merges them into the state.
def lint_node(state: ReviewState) -> dict:
    result = subprocess.run(
        ["ruff", "check", state["file_path"]],
        capture_output=True, text=True,
    )
    return {"lint_result": result.stdout or "No lint errors"}

# Review node — pass the code and lint result to the LLM to generate a review
def review_node(state: ReviewState) -> dict:
    prompt = f"""Review the following code and return the points that need improvement as JSON.

Code:
{state['code']}

Lint result:
{state['lint_result']}

Response format: {{"issues": ["problem 1", "problem 2"]}}"""
    response = llm.invoke(prompt)
    return {
        "review_result": response.content,
        "iteration": state["iteration"] + 1,
    }

# Build the graph: lint → review → END
graph = StateGraph(ReviewState)
graph.add_node("lint", lint_node)
graph.add_node("review", review_node)
graph.set_entry_point("lint")
graph.add_edge("lint", "review")
graph.add_edge("review", END)
agent = graph.compile()

# Run — the with block guarantees the file handle is closed
with open("my_module.py", "r") as f:
    code = f.read()

result = agent.invoke({
    "file_path": "my_module.py",
    "code": code,
    "lint_result": "",
    "review_result": "",
    "issues": [],
    "iteration": 0,
})
print(result["review_result"])
```

```bash
# Run
python code_review_agent.py
```

| Code Point | Description |
|---|---|
| `ChatOllama(base_url=...)` | Connects to the local Ollama server; there is no API key parameter |
| `TypedDict` + `Annotated` | Declares the state type shared between nodes; the reducer on the list field guarantees safe merging when nodes run in parallel |
| `lint_node` → `review_node` | Linter execution and LLM review are separate nodes, so each step can be tested independently |
| `with open(...)` | The context manager ensures the file handle is closed |
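The reducer semantics on `issues` can be sketched without LangGraph: a reducer is just a binary function applied to (current value, node update) during the state merge. The `merge` helper below is a hypothetical miniature for illustration, not LangGraph's actual implementation.

```python
import operator

# Hypothetical mini-merge mimicking how a reducer-annotated field combines updates
def merge(state: dict, update: dict, reducers: dict) -> dict:
    out = dict(state)
    for key, value in update.items():
        if key in reducers:
            out[key] = reducers[key](out.get(key, []), value)  # combine, don't overwrite
        else:
            out[key] = value  # default: last write wins
    return out

reducers = {"issues": operator.add}
state = {"issues": ["unused import"], "iteration": 1}
update = {"issues": ["missing docstring"], "iteration": 2}
print(merge(state, update, reducers))
# → {'issues': ['unused import', 'missing docstring'], 'iteration': 2}
```

This is why nodes return only the keys they changed: fields with a reducer accumulate, and everything else is simply overwritten.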
Example 2: In-house Document RAG Pipeline (Completely Local)
This pattern builds Retrieval-Augmented Generation (RAG) without sending sensitive internal documents anywhere. Unlike the code review agent, it uses mistral:7b because Mistral is trained for general-purpose text understanding and summarization: Codellama gives better results on code-specific tasks, while Mistral gives better results on natural-language, document-based QA.
```bash
# Download the Ollama models
ollama pull mistral:7b        # ~4.1 GB, general-purpose text
ollama pull nomic-embed-text  # 274 MB, local embeddings only

# Install dependencies — langchain-ollama includes OllamaEmbeddings
pip install "langchain-ollama>=0.1" "langchain-chroma>=0.1" chromadb \
    "langchain-community>=0.3" "langchain-core>=0.3" "langchain-text-splitters>=0.3"
```

```python
# local_rag_pipeline.py
# Written against LangChain 0.3+
from operator import itemgetter

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_core.prompts import ChatPromptTemplate

# 1. Local embedding model (no external API calls)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# 2. Load documents and split into chunks
# loader_cls=TextLoader avoids the extra `unstructured` dependency for plain Markdown
loader = DirectoryLoader("./company_docs", glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(docs)

# 3. Store in a local vector DB
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",  # persisted to local disk
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 4. Build the RAG chain (LCEL syntax, recommended for LangChain 0.3+)
# itemgetter extracts dictionary keys safely
llm = ChatOllama(model="mistral:7b", temperature=0)
prompt = ChatPromptTemplate.from_template("""
Answer the question using only the following context.
If the answer is not in the context, reply "Not found in the documents."

Context:
{context}

Question: {question}
""")

def format_docs(docs) -> str:
    """Join the retrieved documents into a single context string."""
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {
        "context": itemgetter("question") | retriever | format_docs,
        "question": itemgetter("question"),
    }
    | prompt
    | llm
)

# 5. Query — the chain ends with ChatOllama, so the return value is an AIMessage
response = rag_chain.invoke({"question": "What is the onboarding process?"})
print(response.content)
```

```bash
# Run
python local_rag_pipeline.py
```

Example 3: Connecting the MCP Tool Server to the LangGraph Agent
The core value of MCP (Model Context Protocol) is the separation of the tool server and the agent. The tool server is deployed independently, and the agent dynamically discovers and invokes available tools through the MCP client. This example illustrates the entire flow of MCP tool server configuration → client connection → agent execution.
Docker Compose-based production deployment and Playwright browser automation integration will be covered in depth in a separate post.
```bash
# Install the MCP adapter and server dependencies
pip install "langchain-mcp-adapters>=0.1" "langchain-ollama>=0.1" \
    "mcp>=1.0" fastapi uvicorn
```

Step 1: Configure the MCP Tool Server
```python
# tool_server/main.py
# A plain FastAPI route cannot speak the MCP streamable HTTP transport, so this
# sketch uses FastMCP from the official MCP Python SDK (pip install "mcp>=1.0")
# and exposes it as an ASGI app that uvicorn can serve.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("browser-tools")

@mcp.tool()
def extract_text(url: str) -> str:
    """Extract the text of the given URL."""
    # A real implementation would use httpx or playwright
    return f"Text extracted from {url} (example)"

# Tool discovery (the protocol's list-tools request) is handled by FastMCP
# automatically from the registered @mcp.tool() functions.

# ASGI app serving the MCP streamable HTTP endpoint at /mcp —
# run it with: uvicorn tool_server.main:app --port 8000
app = mcp.streamable_http_app()
```

Step 2: Register the tools with the agent as an MCP client
```python
# mcp_agent.py
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent
from langchain_ollama import ChatOllama

llm = ChatOllama(model="mistral:7b", temperature=0.1)

async def main():
    # The MCP client automatically discovers the tools available on each server.
    # Since langchain-mcp-adapters 0.1, the client is constructed directly
    # (no async-with block) and get_tools() is awaited.
    mcp_client = MultiServerMCPClient(
        {
            "browser-tools": {
                "url": "http://localhost:8000/mcp",
                "transport": "streamable_http",
            }
        }
    )
    # Convert the discovered tools into LangChain-compatible tools, then register them
    tools = await mcp_client.get_tools()
    agent = create_react_agent(llm, tools)
    response = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "Extract the text of the page at https://example.com"}]}
    )
    print(response["messages"][-1].content)

asyncio.run(main())
```

```bash
# Terminal 1: run the tool server
uvicorn tool_server.main:app --port 8000

# Terminal 2: run the agent
python mcp_agent.py
```

| Code Point | Description |
|---|---|
| `MultiServerMCPClient` | Client that connects to multiple MCP tool servers at once |
| `mcp_client.get_tools()` | Dynamically discovers the tools available on each server and converts them into LangChain-compatible tools |
| `create_react_agent` | Helper that builds a LangGraph ReAct agent from just an LLM and a tool list |
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Complete Data Privacy | Data never leaves the local network, enabling compliance with regulations such as HIPAA and GDPR |
| No API Costs | Unlimited execution without token charges |
| Low Latency | Improved response speed with local inference without network round trips |
| Complete Customization | Model Fine-tuning, System Prompts, Free Parameter Adjustment |
| Offline Operation | Works without an internet connection |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Hardware Requirements | Minimum 8GB VRAM for 7B models, 48GB or more for 70B models | Use CPU inference (slow but workable) or quantized models (Q4_K_M, Q8_0) |
| Model Performance Gap | Inference quality may be lower compared to GPT-4o and Claude Opus | Select a specialized model for the task (Code: Codellama, General: Mistral) |
| Configuration Complexity | Technical knowledge required for environment configuration and model management | Standardize environment with Docker Compose, utilize LM Studio GUI |
| Prompt Injection Risk | Security vulnerability where agent cannot distinguish between external data and instructions | Input data sandboxing, whitelist-based tool access restriction |
| Model Update Management | The latest models must be downloaded and swapped in manually | Automate periodic `ollama pull` runs |
Prompt Injection: This is an attack that unintentionally manipulates agent behavior by including malicious instructions in external data (web pages, files, etc.) processed by the agent. This risk increases as the agent accesses file systems or the web.
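The whitelist mitigation from the table above can be sketched as a guard that runs before every tool call. The host names and tool names below are invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical allow-lists: only these tools and hosts may be touched by the agent
ALLOWED_HOSTS = {"docs.internal.example", "wiki.internal.example"}
ALLOWED_TOOLS = {"extract_text"}

def guard_tool_call(tool_name: str, args: dict) -> None:
    """Raise PermissionError unless the tool and its target host are allow-listed."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allow-listed: {tool_name}")
    url = args.get("url", "")
    if urlparse(url).hostname not in ALLOWED_HOSTS:
        raise PermissionError(f"host not allow-listed: {url}")

guard_tool_call("extract_text", {"url": "https://docs.internal.example/onboarding"})  # passes
# guard_tool_call("extract_text", {"url": "https://evil.example/x"})  # would raise PermissionError
```

The key property is that the guard sits outside the LLM: even a successfully injected instruction cannot widen the set of reachable tools or hosts.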
Top 3 Common Mistakes
- Selecting a large model without checking VRAM: choosing a model above 13B without checking GPU memory, which relegates you to CPU inference. First check the current load with `ollama ps`, then select a quantized build (Q4_K_M, Q8_0) that matches your hardware. With 8GB VRAM, 7B at Q8 or 13B at Q4 are realistic choices.
- Granting excessive privileges to the agent: allowing full file system and network access for testing convenience. Apply the principle of least privilege and restrict the agent to only the directories and URLs it actually needs.
- Trusting the LLM response format: assuming the LLM always replies in the specified JSON format. Local models often comply with format instructions less reliably than cloud APIs, so include Pydantic parsing and retry logic. If your model supports a structured-output option such as `response_format`, use it as well.
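Mistake 3 can be mitigated with a small parse-and-retry wrapper. `parse_issues` and `review_with_retry` below are hypothetical helpers using only the standard library, exercised here with a stubbed model reply rather than a real LLM call.

```python
import json
import re

def parse_issues(raw: str) -> list[str]:
    """Extract {"issues": [...]} from a possibly chatty LLM reply."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # grab the JSON-looking span
    if not match:
        raise ValueError("no JSON object found")
    data = json.loads(match.group())
    return list(data["issues"])

def review_with_retry(call_llm, prompt: str, retries: int = 2) -> list[str]:
    """Re-ask the model when its reply does not parse as the expected JSON."""
    last_err = None
    for _ in range(retries + 1):
        try:
            return parse_issues(call_llm(prompt))
        except (ValueError, KeyError, json.JSONDecodeError) as err:
            last_err = err  # local models often drift from the format; try again
    raise RuntimeError(f"LLM never produced valid JSON: {last_err}")

# Stubbed model reply with chatter around the JSON
reply = 'Sure! Here is my review: {"issues": ["unused import", "broad except"]}'
print(parse_issues(reply))
# → ['unused import', 'broad except']
```

In the review agent from Example 1, `review_node` is the natural place to hang this wrapper, since it is where the JSON contract is stated in the prompt.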
In Conclusion
You can now run code review agents without API keys and operate semantic search systems without transmitting internal documents externally. The combination of Ollama, LangGraph, and MCP (Model Context Protocol) has reached a level ready for immediate deployment in production.
3 Steps to Start Right Now:
- Install Ollama and verify the local switch by changing only `base_url` in your existing OpenAI SDK code. A single line, `brew install ollama && ollama run mistral:7b`, starts a conversation right in the terminal, and the OpenAI-compatible API is immediately available at `http://localhost:11434/v1/chat/completions`.
- Apply the code review agent from Example 1 to a file in your GitHub repository. Install the dependencies with `pip install "langgraph>=0.2" "langchain-ollama>=0.1"`, then simply change the `file_path` variable. You can start with a single-node graph and expand gradually by adding tools and nodes.
- Configure a RAG pipeline with 10 internal documents first. Start small to check the relevance of search results, then improve quality by gradually adjusting the document count and chunk size. The `nomic-embed-text` + ChromaDB combination can search tens of thousands of documents offline with a single 274MB embedding model.
Next Post: Deepening Multi-Agent Collaboration Patterns — How to Build Role-Based Teams and Automate Complex Research and Coding Pipelines with CrewAI and LangGraph
Reference Materials
- Building Local AI Agents with LangGraph and Ollama | DigitalOcean
- Building a Local AI Agent with Ollama + MCP + LangChain + Docker | DEV Community
- Running Local LLMs in 2026: Ollama, LM Studio, and Jan Compared | DEV Community
- Top 9 AI Agent Frameworks as of March 2026 | Shakudo
- Agentic AI and Security | Martin Fowler
- Best practices for AI agent security in 2025 | Glean
- Introducing Strands Agents, an Open Source AI Agents SDK | AWS Tech Blog