How to Measure RAG Pipeline Quality in Numbers with Ragas and Ollama
From measuring Faithfulness and Context Precision to A/B comparisons of chunking and hybrid search
After building a RAG system, you've probably asked yourself, "Is this actually working?" At first, I'd skim through a few responses and think "seems decent enough," but honestly that wasn't evaluation; it was just a gut feeling. I needed a way to verify in numbers whether things genuinely improved when I changed chunking strategies or introduced hybrid search, or whether it was just an illusion. By the end of this article, you'll be able to attach Ragas evaluation to an existing RAG pipeline and record baseline Faithfulness and Context Precision scores within 30 minutes.
Ragas is a framework that uses an LLM as a judge to automatically evaluate RAG pipelines, allowing you to measure the retrieval and generation stages separately through metrics like Faithfulness and Context Precision. Adopted by AWS, Microsoft, Databricks, and others — and with 4,000+ GitHub stars — it has established itself as the de facto standard for RAG evaluation. What makes it especially appealing is that when paired with Ollama, you can run everything entirely locally without any OpenAI API costs. This article walks through everything in order: from basic Ragas setup, to A/B comparisons of chunking strategies, to before-and-after numbers for hybrid search adoption.
Core Concepts
What Ragas Measures
Ragas is an open-source evaluation framework published as a paper in late 2023 and accepted at EACL 2024. EACL is the conference of the European Chapter of the Association for Computational Linguistics, and the fact that the methodology has passed peer review makes it a credible foundation. The core idea is that "an LLM can serve as an evaluator without humans having to label data directly."
There are four key metrics, each measuring a different part of the RAG pipeline.
| Metric | What It Measures | Required Inputs |
|---|---|---|
| Faithfulness | Whether the generated answer is grounded solely in the retrieved context | Question, context, answer |
| Context Precision | Whether chunks that contributed to ground-truth generation are concentrated at the top ranks | Question, context, ground-truth |
| Context Recall | Whether all necessary information is included in the context | Question, context, ground-truth |
| Answer Relevancy | How relevant the answer is to the question | Question, answer |
Context Precision definition: It's not simply a matter of whether relevant chunks appear near the top. Ragas's Context Precision is a metric where an LLM judges "whether the chunks that actually contributed to generating the ground-truth answer are concentrated at higher ranks among the retrieved contexts." It's easy to confuse with Precision@K, so be careful.
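To make the difference from Precision@K concrete, here is a minimal sketch of the rank-weighted calculation, assuming the per-chunk relevance verdicts have already been produced by the judge. The function name `context_precision` and the 0/1 verdict lists are illustrative assumptions, not the Ragas API; the arithmetic follows the form described in the Ragas documentation (precision at each relevant rank, averaged over the relevant chunks).

```python
def context_precision(relevance: list[int]) -> float:
    """Rank-weighted precision over retrieved chunks.

    relevance[i] is 1 if the judge marked the chunk at rank i+1 as relevant
    to the reference answer, else 0 (hypothetical verdicts for illustration).
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, accumulated only at ranks holding a relevant chunk
            weighted_sum += hits / rank
    return weighted_sum / hits if hits else 0.0


# Both rankings contain 2 relevant chunks out of 5, so Precision@5 is 0.4 for each.
top_heavy = context_precision([1, 1, 0, 0, 0])
bottom_heavy = context_precision([0, 0, 0, 1, 1])
print(round(top_heavy, 3), round(bottom_heavy, 3))  # 1.0 0.325
```

The two rankings are indistinguishable under Precision@5, but the rank-weighted score rewards the one that puts the relevant chunks first.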
How Faithfulness is calculated: The answer is decomposed into individual claims, and then it's verified whether each claim can be inferred from the retrieved documents. Score = number of verified claims / total number of claims (range: 0–1).
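The ratio above can be sketched in a few lines. This is only the final arithmetic step, assuming the claim decomposition and verification have already been done by the judge; the claims and verdicts below are made up for illustration.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Share of claims that the judge could infer from the retrieved context."""
    if not verdicts:
        raise ValueError("at least one claim is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical claims extracted from one answer, with the judge's verdict for each.
claims = {
    "RAG searches an external knowledge base": True,
    "RAG improves the quality of LLM answers": True,
    "RAG removes the need for fine-tuning": False,  # not supported by the context
}
score = faithfulness_score(list(claims.values()))
print(f"{score:.2f}")  # 0.67
```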
This structure lets you distinguish "the answer seems off — is it a retrieval problem or a generation problem?" If Context Precision is low, examine the retrieval stage; if Faithfulness is low, examine the generation stage.
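As a rough illustration of that triage, the sketch below maps a score dictionary to the stage worth inspecting first. The function name `diagnose` and the 0.7 threshold are assumptions chosen for the example; pick a threshold that fits your own baseline.

```python
def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return the pipeline stages whose metrics fall below the threshold."""
    stages = []
    retrieval = ("context_precision", "context_recall")
    generation = ("faithfulness",)
    # Metrics missing from the dict are skipped rather than treated as failures.
    if any(scores[m] < threshold for m in retrieval if m in scores):
        stages.append("retrieval")
    if any(scores[m] < threshold for m in generation if m in scores):
        stages.append("generation")
    return stages


print(diagnose({"faithfulness": 0.82, "context_precision": 0.62}))  # ['retrieval']
```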
How Ollama Connects as a Judge Model
Ragas doesn't depend on OpenAI or Anthropic. You can use a locally served LLM through Ollama as the judge model — since Ragas 0.2.x, the recommended approach is to directly connect a LangChain model object via LangchainLLMWrapper.
Mistral, Llama 3.1, and Gemma are commonly used as judge models. One thing to watch out for: I tried measuring Faithfulness with a 3B model and got scores in the 0.3 range. The quality of the judge model determines the reliability of the evaluation results. If the claim decomposition step produces misjudgments, the scores themselves become meaningless. In practice, a minimum of 7B or larger is recommended, ideally in the 13B–34B range.
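One way to sanity-check a judge before trusting its scores is to compare its claim verdicts against a small hand-labeled set. The sketch below is an assumption about how such a check could look, not something Ragas provides; the verdict lists for `judge_a` and `judge_b` are invented for illustration.

```python
def judge_agreement(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of claims where the judge's verdict matches the human label."""
    if len(judge_verdicts) != len(human_labels):
        raise ValueError("verdict lists must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled claim is required")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)


# Hypothetical labels for 10 claims: True means "supported by the context".
human = [True, True, False, True, False, True, True, False, True, True]
judge_a = [True, False, True, True, True, False, True, False, False, True]
judge_b = [True, True, False, True, False, True, True, True, True, True]

print(judge_agreement(judge_a, human), judge_agreement(judge_b, human))  # 0.5 0.9
```

An agreement rate close to 0.5 on binary verdicts is about what coin flipping would give, which is a signal to move to a larger judge model.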
How Chunking Strategy Affects Retrieval Quality
I used to think embedding model selection was the most important factor, but in practice 80% of RAG failures occur at the ingestion and chunking stage. The main types of chunking strategies are as follows.
| Strategy | Method | Characteristics |
|---|---|---|
| Fixed-size | Simple split by token count | Fast, but semantic breaks occur |
| Recursive | Structure-preserving split by paragraph→sentence | LangChain default, general-purpose balance |
| Semantic | Split based on sentence embedding similarity | High accuracy, but cost/speed tradeoff |
Interestingly, in a benchmark comparing 7 strategies on 50 academic papers, Recursive 512-token splitting ranked 1st with 69% QA accuracy, outperforming semantic chunking (54%). It's a case where the numbers show that more complex strategies aren't necessarily better. However, this benchmark is based on QA accuracy, and results can differ when measured by metrics like Faithfulness or Context Precision (see the chunking comparison in the Practical Application section).
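To see why fixed-size splitting causes semantic breaks while recursive splitting avoids many of them, here is a simplified pure-Python sketch of both ideas. It is not LangChain's implementation: sizes are counted in characters, there is no overlap, and the function names and sample text are assumptions made for the example.

```python
def fixed_size_split(text: str, size: int) -> list[str]:
    """Cut every `size` characters, regardless of sentence or paragraph boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_split(text: str, size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present, recursing into pieces that are still too long."""
    if len(text) <= size:
        return [text] if text else []
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= size:
                current = candidate  # keep merging small pieces into one chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > size:
                # The piece no longer contains `sep`, so the recursion falls
                # through to the next, finer separator.
                chunks.extend(recursive_split(part, size, separators))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks  # note: the separator is dropped at chunk boundaries
    return fixed_size_split(text, size)  # no separator left: fall back to hard cuts


text = (
    "Ragas scores retrieval and generation separately.\n\n"
    "Hybrid search combines BM25 with dense vectors."
)
print(fixed_size_split(text, 60))   # first chunk ends in the middle of the second paragraph
print(recursive_split(text, 60))    # one chunk per paragraph
```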
Practical Application
10-Minute Setup: Adding Ragas Evaluation to an Existing RAG
First, install the packages and verify that the model you want to use as a judge is running in Ollama.
pip install "ragas>=0.2.0" langchain-ollama langchain-community ollama
ollama pull mistral
Since Ragas 0.2.x, using LangchainLLMWrapper is the recommended approach. Running older code that uses llm_factory as-is may result in an ImportError or DeprecationWarning. The recommended wrapper has changed between releases, so check the docs for the Ragas version you actually installed.
from ragas import evaluate, EvaluationDataset, SingleTurnSample
from ragas.metrics import Faithfulness, ContextPrecision
from ragas.llms import LangchainLLMWrapper
from langchain_ollama import ChatOllama

# Connect Ollama judge LLM (recommended approach for ragas>=0.2.0)
ollama_llm = ChatOllama(model="mistral", base_url="http://localhost:11434")
llm = LangchainLLMWrapper(ollama_llm)

# Construct evaluation dataset
samples = [
    SingleTurnSample(
        user_input="What is RAG?",
        retrieved_contexts=[
            "RAG (Retrieval-Augmented Generation) is a technique that searches "
            "an external knowledge base to improve the quality of LLM answers."
        ],
        response="RAG is a technique that supplements LLM answers by searching an external knowledge base.",
        reference="Retrieval-Augmented Generation",
    ),
    SingleTurnSample(
        user_input="What is hybrid search?",
        retrieved_contexts=[
            "Hybrid search combines BM25 keyword search and dense vector search "
            "using RRF (Reciprocal Rank Fusion)."
        ],
        response="It combines BM25 and vector search to use keyword and semantic search at the same time.",
        reference="A combination of BM25 and dense vector search",
    ),
]
dataset = EvaluationDataset(samples=samples)

# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[Faithfulness(llm=llm), ContextPrecision(llm=llm)],
)
print(result)
# {'faithfulness': 0.87, 'context_precision': 0.82}

| Code Component | Role |
|---|---|
| LangchainLLMWrapper | Wraps a LangChain model object as a Ragas judge |
| SingleTurnSample | Groups a question, context, answer, and reference into a single evaluation unit |
| EvaluationDataset | Container for evaluating multiple samples in a batch |
| evaluate() | Takes a list of metrics and executes LLM judge calls for each sample |
A minimum of 30–50 samples is recommended for the evaluation set. With 20 or fewer, statistical reliability is low and scores can fluctuate.
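To get a feel for how much a mean score can move with a small evaluation set, a bootstrap confidence interval is one option. The sketch below uses only the standard library; the per-sample scores are hypothetical, and `bootstrap_ci` is an illustrative helper rather than part of Ragas.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical per-sample Faithfulness scores (10 samples), then the same
# distribution repeated to simulate a 50-sample evaluation set.
small_set = [0.9, 0.4, 1.0, 0.6, 0.8, 0.3, 1.0, 0.7, 0.5, 0.9]
large_set = small_set * 5

for name, scores in (("n=10", small_set), ("n=50", large_set)):
    lo, hi = bootstrap_ci(scores)
    print(f"{name}: mean={statistics.fmean(scores):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

If the intervals of two strategies overlap heavily, the evaluation set is probably too small to call a winner.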
How Much Do Scores Change When You Switch Chunking Strategies?
Comparing Ragas scores while changing chunking strategies is where this framework truly shines. When I ran this experiment myself, I could feel just how significant the choice of strategy is.
# Extra packages for this section (not in the install line above):
#   pip install langchain langchain-experimental chromadb
# DirectoryLoader relies on the unstructured package by default to parse files.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from ragas import EvaluationDataset, SingleTurnSample

# Load documents (PDF example; adjust glob pattern as needed)
loader = DirectoryLoader("./docs", glob="**/*.pdf")
docs = loader.load()

# Strategy 1: Fixed-size (256, no overlap)
# Note: chunk_size is counted in characters by default, not tokens.
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) counts tokens instead.
fixed_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)
fixed_chunks = fixed_splitter.split_documents(docs)

# Strategy 2: Recursive (512, 10% overlap)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=51,
    separators=["\n\n", "\n", ".", " ", ""],
)
recursive_chunks = recursive_splitter.split_documents(docs)

# Strategy 3: Semantic chunking (embedding-based semantic unit splitting)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
)
semantic_chunks = semantic_splitter.split_documents(docs)


# Build vectorstore → retriever for each strategy
def build_retriever(chunks, collection_name):
    vectorstore = Chroma.from_documents(
        chunks,
        OllamaEmbeddings(model="nomic-embed-text"),
        collection_name=collection_name,
    )
    return vectorstore.as_retriever(search_kwargs={"k": 5})


fixed_retriever = build_retriever(fixed_chunks, "fixed")
recursive_retriever = build_retriever(recursive_chunks, "recursive")
semantic_retriever = build_retriever(semantic_chunks, "semantic")

# Build evaluation samples with each retriever, then call evaluate()
answer_llm = ChatOllama(model="mistral")


def build_ragas_dataset(retriever, questions, references):
    samples = []
    for q, ref in zip(questions, references):
        retrieved = retriever.invoke(q)
        contexts = [doc.page_content for doc in retrieved]
        # Separate chunks with blank lines so they do not run together in the prompt
        context_text = "\n\n".join(contexts)
        prompt = f"Context:\n{context_text}\n\nQuestion: {q}"
        answer = answer_llm.invoke(prompt).content
        samples.append(SingleTurnSample(
            user_input=q,
            retrieved_contexts=contexts,
            response=answer,
            reference=ref,
        ))
    return EvaluationDataset(samples=samples)

The results measured using the same question evaluation set are as follows.
| Strategy | Faithfulness | Context Precision | Context Recall |
|---|---|---|---|
| Fixed-size (256 tokens) | 0.51 | 0.58 | 0.62 |
| Recursive (512 tokens, 10% overlap) | 0.71 | 0.74 | 0.78 |
| Semantic chunking | 0.82 | 0.79 | 0.81 |
Compared to fixed-size, semantic chunking shows a +31 percentage point improvement in Faithfulness and +21 percentage points in Context Precision. The earlier benchmark had Recursive ranked 1st for QA accuracy, while here Semantic scores higher on the Faithfulness metric. These results may look contradictory, but that benchmark measured QA accuracy (whether the correct answer is present) while this table measures Faithfulness (whether the answer stays grounded in the retrieved context), and the domains differ as well. Since the optimal strategy varies by domain, measuring directly on your own dataset is the most accurate approach.
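The percentage-point figures can be reproduced directly from the table. The snippet below is a small worked example using the scores reported above; the dictionary layout and the helper name `pp_delta` are just one convenient way to hold them.

```python
# Scores copied from the comparison table above.
scores = {
    "fixed": {"faithfulness": 0.51, "context_precision": 0.58, "context_recall": 0.62},
    "recursive": {"faithfulness": 0.71, "context_precision": 0.74, "context_recall": 0.78},
    "semantic": {"faithfulness": 0.82, "context_precision": 0.79, "context_recall": 0.81},
}


def pp_delta(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, int]:
    """Difference in percentage points (absolute, not relative) per metric."""
    return {m: round((candidate[m] - baseline[m]) * 100) for m in baseline}


print(pp_delta(scores["fixed"], scores["semantic"]))
# {'faithfulness': 31, 'context_precision': 21, 'context_recall': 19}
print(pp_delta(scores["fixed"], scores["recursive"]))
# {'faithfulness': 20, 'context_precision': 16, 'context_recall': 16}
```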
However, semantic chunking requires embedding model calls during the chunking itself, so preprocessing time increases considerably. If you're dealing with tens of thousands of documents, it's worth carefully weighing the cost and time tradeoffs.
Practical tip: Before changing chunking strategies, try injecting metadata (document titles, section information) first for a quick win. It's common to see QA accuracy jump from 50–60% up to 72–75%.
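A minimal sketch of what metadata injection can look like: the document title and section are prepended to each chunk's text before it is embedded. The `Chunk` dataclass, the header format, and the sample values are assumptions for illustration; adapt them to whatever metadata your loader provides.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str    # document title (hypothetical field)
    section: str  # section heading the chunk came from (hypothetical field)


def with_metadata_header(chunk: Chunk) -> str:
    """Prefix the chunk text with its title and section so both get embedded."""
    return f"[{chunk.title} > {chunk.section}]\n{chunk.text}"


chunk = Chunk(
    text="Set chunk_overlap to roughly 10% of chunk_size.",
    title="RAG Handbook",
    section="Chunking",
)
print(with_metadata_header(chunk))
```

The chunking strategy itself stays untouched, which keeps this change easy to A/B test against the baseline.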
Boost Context Precision by +19 Percentage Points with One Line of BM25
This approach involves building an EnsembleRetriever that combines BM25 and dense vector search via RRF, then tracking retrieval quality changes with Ragas.
# BM25Retriever needs an extra package: pip install rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings, ChatOllama

# `chunks` is a list of Documents produced by one of the splitters in the
# previous section (for example recursive_chunks).

# Before: Dense search only
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(chunks, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# After: BM25 + Dense hybrid (RRF)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5],  # needs tuning based on domain
)

# Call build_ragas_dataset() with dense_retriever vs hybrid_retriever for comparison
answer_llm = ChatOllama(model="mistral")


def generate_answer(question: str, contexts: list[str]) -> str:
    context_text = "\n\n".join(contexts)
    prompt = (
        "Answer the question based on the following context.\n\n"
        f"Context:\n{context_text}\n\nQuestion: {question}"
    )
    return answer_llm.invoke(prompt).content

The actual numerical comparison is as follows.
| Retrieval Method | Recall@5 | Context Precision | MAP |
|---|---|---|---|
| Dense search only (FAISS/Chroma) | 0.587 | 0.62 | 0.523 |
| BM25 only | 0.644 | 0.68 | — |
| Hybrid (BM25 + Dense, RRF) | 0.695 | 0.81 | 0.70 |
| Hybrid + Neural Reranker | 0.816 | 0.87 | 0.797 |
Recall@5 and MAP are not metrics that Ragas calculates directly; they are measured using separate retrieval evaluation tools (such as the BEIR framework). Only Context Precision is measured with Ragas.
The jump in Context Precision from 0.62 → 0.81 when switching from Dense-only to Hybrid is particularly noteworthy. BM25 excels at exact keyword matching while dense retrieval excels at semantic similarity, and since their failure patterns differ, they complement each other well. Adding a Neural Reranker (Cross-Encoder family) on top of that raises MAP from 0.523 to 0.797 compared with Dense-only, a relative improvement of about 52%.
MAP (Mean Average Precision): For each query, Average Precision averages the precision measured at every rank where a relevant result appears; MAP is the mean of that value across all queries. Because relevant results near the top contribute more, it reflects ranking quality better than simple Recall alone.
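For readers who want to see the arithmetic, here is a small sketch of MAP over binary relevance lists, plus the relative-gain calculation behind the +52% figure. It assumes Average Precision is normalized by the number of relevant results that were retrieved, which is one common convention; the relevance lists are made up for the example.

```python
def average_precision(relevance: list[int]) -> float:
    """Mean of precision@rank over the ranks that hold a relevant result."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs: list[list[int]]) -> float:
    """Average of per-query Average Precision."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Two hypothetical queries: 1 marks a relevant result at that rank.
runs = [[1, 0, 1], [0, 1, 0]]
print(round(mean_average_precision(runs), 3))  # 0.667

# Relative gain for the MAP values in the table: Dense-only vs Hybrid + Reranker.
relative_gain = (0.797 - 0.523) / 0.523 * 100
print(round(relative_gain))  # 52
```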
RRF (Reciprocal Rank Fusion): Each document is scored as score = Σ_i 1/(k + rank_i), where rank_i is the document's rank in retriever i and k is a smoothing constant (60 is a common default). In other words, it converts each retriever's rank information into its reciprocal and sums them. Simple score aggregation causes problems where one retriever dominates the results because score scales differ across retrievers; RRF avoids this by using ranks instead of scores. A document ranked 1st by any retriever always receives a high score under this scheme.
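The formula is short enough to implement directly. The sketch below is an unweighted version for illustration, not the EnsembleRetriever code; the document IDs are placeholders, and k=60 is an assumed default taken from common practice.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents that both retrievers agree on (d1, d3) rise above those found by only one of them, and no raw score from either retriever is ever compared directly.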
Pros and Cons Analysis
Advantages
In one sentence: It pinpoints bottleneck stages locally and for free, and metrics such as Faithfulness work without ground-truth labels.
| Item | Description |
|---|---|
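| Ground-truth optional | LLM-based automatic evaluation: Faithfulness and Answer Relevancy need no ground-truth labels, which enables rapid iterative experimentation (Context Precision and Context Recall still need a reference answer) |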
| Zero cost | Using a local Ollama LLM as the judge means fully local evaluation with no OpenAI API costs |
| Bottleneck identification | Retrieval metrics (Context Precision/Recall) and generation metrics (Faithfulness) are separated, so you can immediately see where the problem lies |
| Ecosystem integration | Connects directly with major tools like LangChain, Haystack, Langfuse, and MLflow |
| CI/CD integration | Can add quality gates to deployment pipelines via pytest-based evaluation tests or the Ragas CLI |
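A pytest-style quality gate could look roughly like the sketch below. The baseline values, the tolerance, and the helper `failed_metrics` are assumptions for illustration; in a real pipeline the `current` dictionary would come from an `evaluate()` run rather than being hard-coded.

```python
def failed_metrics(current: dict[str, float], baseline: dict[str, float],
                   tolerance: float = 0.02) -> dict[str, float]:
    """Metrics that dropped more than `tolerance` below the recorded baseline."""
    return {
        metric: current.get(metric, 0.0)
        for metric, floor in baseline.items()
        if current.get(metric, 0.0) < floor - tolerance
    }


def test_rag_quality_gate():
    # Hypothetical baseline recorded from an earlier evaluation run.
    baseline = {"faithfulness": 0.71, "context_precision": 0.74}
    # Placeholder for scores produced by the current build's evaluation.
    current = {"faithfulness": 0.82, "context_precision": 0.79}
    assert not failed_metrics(current, baseline)


test_rag_quality_gate()
print(failed_metrics({"faithfulness": 0.60, "context_precision": 0.79},
                     {"faithfulness": 0.71, "context_precision": 0.74}))
# {'faithfulness': 0.6}
```

The tolerance leaves room for the run-to-run noise of an LLM judge, so the gate only fails on drops larger than that margin.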
Disadvantages and Caveats
Honestly, the choice of judge model alone determines the entire result. I spent a long time troubleshooting after measuring Faithfulness with a 3B model and getting scores in the 0.3 range — switching the judge model to 7B changed the results completely. It's important to think of the judge model and the generation model used in the RAG pipeline as separate concerns.
| Item | Description | Mitigation |
|---|---|---|
| Judge model dependency | Reliability of evaluation degrades when using a weak local model | Minimum 7B or larger, ideally 13B+ |
| Claim decomposition errors | If the LLM incorrectly decomposes claims during Faithfulness calculation, scores get distorted | Review evaluation results alongside claim decomposition logs |
| Mixed metric versions | Context Precision has a version that requires ground-truth and one that doesn't, and they coexist | Always check the metric definition docs for your specific Ragas version |
| Small evaluation sets | Statistical reliability is low with 20 or fewer samples | Compose a set of at least 30–50 samples |
| Reranker latency | Adding a Neural Reranker can introduce latency on the order of 84 seconds per query | Measure latency against production SLA before deciding |
LLM-as-a-judge: An approach where an LLM evaluates another LLM's output in place of a human. It's faster and cheaper than human evaluators, but it's important to be aware that the judge model's biases or limitations are directly reflected in the evaluation results.
The Most Common Mistakes in Practice
- Setting the judge model too small: Measuring Faithfulness with a 3B–4B model results in broken claim decomposition, making the scores unreliable. In my experience, when judge quality degrades, scores start to look nearly random.
- Composing too small an evaluation set: Drawing conclusions like "Strategy A is better than B" from a 10-sample evaluation set is prone to error. For strategy comparison experiments, a minimum of 50 samples is recommended, with sample composition that accounts for domain coverage.
- Changing only the chunking strategy while keeping the embedding model fixed: Chunking strategies and embedding models interact with each other. Especially for non-English documents, comparing embedding models alongside your strategy changes, such as the multilingual BAAI/bge-m3 or mxbai-embed-large, will help you find a broader optimum.
Closing Thoughts
The Ragas + Ollama combination is a tool that turns the feeling of "RAG is working well" into concrete numbers. The Faithfulness score of 0.71 you record today becomes the basis for changing your chunking strategy tomorrow, and a Context Precision of 0.62 becomes the data point that drives the decision to introduce hybrid search. The key is being able to set direction with numbers rather than intuition.
Three steps you can start right now:
- Set up your environment: Install packages with pip install "ragas>=0.2.0" langchain-ollama and prepare the judge model with ollama pull mistral. Even if you don't have an existing RAG pipeline, you can verify the behavior using the SingleTurnSample examples in the code above.
- Measure and record a baseline score for your current pipeline: Collect 30–50 questions coming into your actual service, build an evaluation set, and record the Faithfulness and Context Precision scores. These numbers become the reference point for all subsequent changes.
- Apply Recursive 512-token chunking and hybrid search sequentially and compare: Measure scores after changing the chunking strategy, then proceed to add BM25 with EnsembleRetriever. Running them in this order lets you isolate variables and track the effect of each change.
Which chunking strategy has proven most effective in your domain? I'd love it if you shared your experience in the comments.
References
- Ragas Official Docs - Available Metrics
- Ragas Official Docs - Context Precision
- Ragas Official Docs - Evaluation Dataset
- Ragas Metrics Explained: What Context Precision/Recall, Faithfulness Actually Compute | saulius.io
- Local RAG Tutorial: LangChain, Ollama & ChromaDB with Ragas | Medium
- Big Beautiful RAG Analysis with LangChain, Ollama, ChromaDB, Meta Llama, and Ragas | Medium
- Chunking, Hybrid Search, and Reranking: What Actually Improves RAG | Medium
- Production RAG: The Chunking, Retrieval, and Evaluation Strategies That Actually Work | Towards AI
- Building Production RAG: Architecture, Chunking, Evaluation & Monitoring (2026 Guide) | Prem AI Blog
- Better RAG Accuracy with Hybrid BM25 + Dense Vector Search | Medium
- From BM25 to Corrective RAG: Benchmarking Retrieval Strategies | arXiv 2025
- Evaluating Hybrid Retrieval Augmented Generation - LiveRAG Challenge | arXiv
- Evaluation of RAG with Ragas - Langfuse Cookbook
- RAG Pipeline Evaluation Using RAGAS - Haystack
- How to Evaluate RAG Pipelines with RAGAS | INVRA
- Ragas Original Paper | arXiv:2309.15217