How to Measure RAG Pipeline Quality in Numbers with Ragas and Ollama

From measuring Faithfulness and Context Precision to A/B comparisons of chunking and hybrid search

After building a RAG system, you've probably asked yourself, "Is this actually working?" At first, I'd skim through a few responses and think "seems decent enough," but honestly that wasn't evaluation; it was just a gut feeling. I needed a way to verify in numbers whether things genuinely improved when I changed chunking strategies or introduced hybrid search, or whether it was just an illusion. By the end of this article, you'll be able to attach Ragas evaluation to an existing RAG pipeline and record baseline Faithfulness and Context Precision scores within 30 minutes.

Ragas is a framework that uses an LLM as a judge to automatically evaluate RAG pipelines, allowing you to measure the retrieval and generation stages separately through metrics like Faithfulness and Context Precision. Adopted by AWS, Microsoft, Databricks, and others, and with 4,000+ GitHub stars, it has become one of the most widely used tools for RAG evaluation. What makes it especially appealing is that when paired with Ollama, you can run everything locally without any OpenAI API costs. This article walks through everything in order: basic Ragas setup, A/B comparisons of chunking strategies, and before-and-after numbers for hybrid search adoption.

Core Concepts

What Ragas Measures

Ragas is an open-source evaluation framework published as a paper in late 2023 and accepted at EACL 2024. EACL is Europe's largest natural language processing conference, and the fact that the methodology has passed peer review makes it a credible foundation. The core idea is that "an LLM can serve as an evaluator without humans having to label data directly."

There are four key metrics, each measuring a different part of the RAG pipeline.

Metric	What It Measures	Required Inputs
Faithfulness	Whether the generated answer is grounded solely in the retrieved context	Question, context, answer
Context Precision	Whether chunks that contributed to ground-truth generation are concentrated at the top ranks	Question, context, ground-truth
Context Recall	Whether all necessary information is included in the context	Question, context, ground-truth
Answer Relevancy	How relevant the answer is to the question	Question, answer

Context Precision definition: It's not simply a matter of whether relevant chunks appear near the top. Ragas's Context Precision is a metric where an LLM judges "whether the chunks that actually contributed to generating the ground-truth answer are concentrated at higher ranks among the retrieved contexts." It's easy to confuse with Precision@K, so be careful.
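
To make the difference from Precision@K concrete, here is a minimal sketch of the rank-weighted formula as I understand it from the Ragas documentation: precision is computed at each rank, counted only where the chunk was judged relevant, and then averaged over the relevant chunks. The 0/1 relevance lists are hand-written for illustration (in Ragas they come from the judge LLM), and the function names are my own.

```python
def context_precision(relevance: list[int]) -> float:
    """Rank-weighted precision over the retrieved chunks.

    relevance[i] is 1 if the chunk at rank i+1 was judged useful for
    producing the reference answer, else 0.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # Precision@rank, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0


def precision_at_k(relevance: list[int], k: int) -> float:
    """Plain Precision@K: share of relevant chunks in the top k, ignoring order."""
    return sum(relevance[:k]) / k if k else 0.0


good_order = [1, 1, 0, 0, 0]  # relevant chunks ranked first
bad_order = [0, 0, 0, 1, 1]   # same chunks, ranked last

for name, rel in [("good order", good_order), ("bad order", bad_order)]:
    print(f"{name}: context_precision={context_precision(rel):.3f}, "
          f"precision@5={precision_at_k(rel, 5):.3f}")
# good order: context_precision=1.000, precision@5=0.400
# bad order: context_precision=0.325, precision@5=0.400
```

Precision@5 is identical for both orderings, while the rank-weighted score drops sharply when the useful chunks sit at the bottom.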

How Faithfulness is calculated: The answer is decomposed into individual claims, and then it's verified whether each claim can be inferred from the retrieved documents. Score = number of verified claims / total number of claims (range: 0–1).
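
The ratio itself is simple enough to sketch in a few lines. The claims and verdicts below are hypothetical and hand-assigned; in Ragas both the claim decomposition and the verification are done by the judge LLM, and `faithfulness_score` is just an illustrative helper name.

```python
import math


def faithfulness_score(verdicts: list[bool]) -> float:
    """Share of claims that were judged to be supported by the retrieved context."""
    if not verdicts:
        return math.nan  # no claims were extracted, so the score is undefined
    return sum(verdicts) / len(verdicts)


# Hypothetical claims decomposed from one answer, with hand-assigned verdicts.
claims = {
    "RAG retrieves documents from an external knowledge base": True,
    "RAG passes the retrieved text to the LLM as context": True,
    "RAG removes the need for any prompt engineering": False,  # not in the context
    "RAG can improve answer quality": True,
}

score = faithfulness_score(list(claims.values()))
print(f"faithfulness = {score:.2f}")  # 3 of 4 claims supported -> 0.75
```

One unsupported claim out of four already pulls the score down to 0.75, which is why a judge that mis-decomposes claims distorts the result so quickly.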

This structure lets you distinguish "the answer seems off — is it a retrieval problem or a generation problem?" If Context Precision is low, examine the retrieval stage; if Faithfulness is low, examine the generation stage.

How Ollama Connects as a Judge Model

Ragas doesn't depend on OpenAI or Anthropic. You can use a locally served LLM through Ollama as the judge model — since Ragas 0.2.x, the recommended approach is to directly connect a LangChain model object via LangchainLLMWrapper.

Mistral, Llama 3.1, and Gemma are commonly used as judge models. One thing to watch out for: I tried measuring Faithfulness with a 3B model and got scores in the 0.3 range. The quality of the judge model determines the reliability of the evaluation results. If the claim decomposition step produces misjudgments, the scores themselves become meaningless. In practice, a minimum of 7B or larger is recommended, ideally in the 13B–34B range.

How Chunking Strategy Affects Retrieval Quality

I used to think embedding model selection was the most important factor, but in practice most RAG failures (an often-quoted estimate puts it at 80%) originate at the ingestion and chunking stage. The main chunking strategies are as follows.

Strategy	Method	Characteristics
Fixed-size	Simple split by token count	Fast, but semantic breaks occur
Recursive	Structure-preserving split by paragraph→sentence	LangChain default, general-purpose balance
Semantic	Split based on sentence embedding similarity	High accuracy, but cost/speed tradeoff

Interestingly, in a benchmark¹ comparing 7 strategies on 50 academic papers, Recursive 512-token splitting ranked 1st with 69% QA accuracy, outperforming semantic chunking (54%). The numbers show that more complex strategies aren't necessarily better. However, this benchmark is based on QA accuracy, and results can differ when measured by metrics like Faithfulness or Context Precision, as the chunking comparison later in this article shows.
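
To show what "semantic breaks" versus "structure-preserving" means in practice, here is a small pure-Python sketch of the first two strategies. It is not LangChain's implementation: tokens are approximated by whitespace-separated words, separator text is dropped at split points, and the names `fixed_size_split` and `recursive_split` are my own.

```python
def fixed_size_split(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut every `size` words, regardless of sentence or paragraph boundaries."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def recursive_split(text: str, size: int,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Try the coarsest separator first; use finer ones only for pieces
    that are still longer than `size` words."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= size:
        return [text]
    if not separators:
        return fixed_size_split(text, size)
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in (p.strip() for p in text.split(sep) if p.strip()):
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate.split()) <= size:
            buffer = candidate  # keep merging small neighbours into one chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece.split()) <= size:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, size, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


sample = ("Ragas scores answers. It uses a judge model.\n\n"
          "Chunking splits documents. Small chunks lose context.")

print(fixed_size_split(sample, 6))
print(recursive_split(sample, 6))
```

With a 6-word budget, the fixed-size splitter produces a chunk that starts in the middle of one sentence and ends in the middle of another, while the recursive splitter returns four whole sentences.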

Practical Application

10-Minute Setup: Adding Ragas Evaluation to an Existing RAG

First, install the packages and verify that the model you want to use as a judge is running in Ollama.

bash

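pip install "ragas>=0.2.0" langchain-ollama langchain-community ollama
pip install langchain-text-splitters langchain-experimental chromadb rank_bm25 tiktoken  # for the later examples
ollama pull mistral
ollama pull nomic-embed-text  # embedding model used in the chunking examples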

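In the Ragas 0.2.x line, using LangchainLLMWrapper is the recommended approach. Running older code that uses llm_factory as-is may result in an ImportError or DeprecationWarning. The recommended wrapper has shifted between Ragas releases, so if you install a newer version, check its documentation for the currently preferred way to plug in a judge model.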

python

from ragas import evaluate, EvaluationDataset, SingleTurnSample
from ragas.metrics import Faithfulness, ContextPrecision
from ragas.llms import LangchainLLMWrapper
from langchain_ollama import ChatOllama

# Connect the Ollama judge LLM (recommended approach for ragas 0.2.x)
ollama_llm = ChatOllama(model="mistral", base_url="http://localhost:11434")
llm = LangchainLLMWrapper(ollama_llm)

# Construct the evaluation dataset.
# `reference` is the ground-truth answer that ContextPrecision judges the contexts against.
samples = [
    SingleTurnSample(
        user_input="What is RAG?",
        retrieved_contexts=[
            "RAG stands for Retrieval-Augmented Generation, a technique that "
            "searches an external knowledge base to improve the quality of LLM answers."
        ],
        response="RAG is a technique that supplements LLM answers by searching an external knowledge base.",
        reference="Retrieval-Augmented Generation"
    ),
    SingleTurnSample(
        user_input="What is hybrid search?",
        retrieved_contexts=[
            "Hybrid search combines BM25 keyword search and dense vector search "
            "using RRF (Reciprocal Rank Fusion)."
        ],
        response="It combines BM25 and vector search to use keyword and semantic search at the same time.",
        reference="A combination of BM25 and dense vector search"
    ),
]
dataset = EvaluationDataset(samples=samples)

# Run the evaluation
result = evaluate(
    dataset=dataset,
    metrics=[Faithfulness(llm=llm), ContextPrecision(llm=llm)]
)
print(result)
# Example output format (values are illustrative and vary with the judge model):
# {'faithfulness': 0.87, 'context_precision': 0.82}

Code Component	Role
`LangchainLLMWrapper`	Wraps a LangChain model object as a Ragas judge
`SingleTurnSample`	Groups a question, context, answer, and reference into a single evaluation unit
`EvaluationDataset`	Container for evaluating multiple samples in a batch
`evaluate()`	Takes a list of metrics and executes LLM judge calls for each sample

A minimum of 30–50 samples is recommended for the evaluation set. With 20 or fewer, statistical reliability is low and scores can fluctuate.
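
One way to see why small sets fluctuate is to put a confidence interval around the mean score. The sketch below uses a percentile bootstrap from the standard library; the per-sample scores are synthetic values made up for illustration, not measured results, and `bootstrap_ci` is a helper name of my own.

```python
import random
import statistics


def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic per-sample scores, for illustration only.
small_set = [0.2, 0.9, 0.5, 0.7, 0.4, 1.0, 0.3, 0.8, 0.6, 0.1]  # 10 samples
large_set = small_set * 5                                        # 50 samples, same spread

for name, scores in [("10 samples", small_set), ("50 samples", large_set)]:
    low, high = bootstrap_ci(scores)
    print(f"{name}: mean={statistics.fmean(scores):.2f}, "
          f"95% CI=({low:.2f}, {high:.2f}), width={high - low:.2f}")
```

Both sets have the same mean, but the interval around the 10-sample mean is much wider, so a difference of a few points between two strategies can easily disappear inside it.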

How Much Do Scores Change When You Switch Chunking Strategies?

Comparing Ragas scores while changing chunking strategies is where this framework truly shines. When I ran this experiment myself, I could feel just how significant the choice of strategy is.

python

from langchain_text_splitters import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from ragas import EvaluationDataset, SingleTurnSample

# Load documents (PDF example; adjust the glob pattern as needed)
loader = DirectoryLoader("./docs", glob="**/*.pdf")
docs = loader.load()

# Strategy 1: Fixed-size (256 tokens, no overlap)
# TokenTextSplitter cuts purely by token count and needs the tiktoken package.
fixed_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=0)
fixed_chunks = fixed_splitter.split_documents(docs)

# Strategy 2: Recursive (512 tokens, 10% overlap)
# from_tiktoken_encoder() measures chunk_size in tokens; the plain constructor counts characters.
recursive_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,
    chunk_overlap=51,
    separators=["\n\n", "\n", ".", " ", ""]
)
recursive_chunks = recursive_splitter.split_documents(docs)

# Strategy 3: Semantic chunking (embedding-based semantic unit splitting)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)
semantic_chunks = semantic_splitter.split_documents(docs)

# Build vectorstore → retriever for each strategy (one Chroma collection per strategy)
def build_retriever(chunks, collection_name):
    vectorstore = Chroma.from_documents(
        chunks,
        embeddings,
        collection_name=collection_name
    )
    return vectorstore.as_retriever(search_kwargs={"k": 5})

fixed_retriever = build_retriever(fixed_chunks, "fixed")
recursive_retriever = build_retriever(recursive_chunks, "recursive")
semantic_retriever = build_retriever(semantic_chunks, "semantic")

# Build evaluation samples with each retriever, then call evaluate()
answer_llm = ChatOllama(model="mistral")

def build_ragas_dataset(retriever, questions, references):
    samples = []
    for q, ref in zip(questions, references):
        retrieved = retriever.invoke(q)
        contexts = [doc.page_content for doc in retrieved]
        # Separate chunks with blank lines so they do not run together in the prompt
        context_text = "\n\n".join(contexts)
        prompt = f"Context:\n{context_text}\n\nQuestion: {q}"
        answer = answer_llm.invoke(prompt).content
        samples.append(SingleTurnSample(
            user_input=q,
            retrieved_contexts=contexts,
            response=answer,
            reference=ref
        ))
    return EvaluationDataset(samples=samples)

# Usage: build one dataset per retriever from the same questions and references,
# then score each with evaluate() and the judge `llm` from the setup example:
# dataset = build_ragas_dataset(recursive_retriever, questions, references)
# result = evaluate(dataset=dataset, metrics=[Faithfulness(llm=llm), ContextPrecision(llm=llm)])

The results measured using the same question evaluation set are as follows.

Strategy	Faithfulness	Context Precision	Context Recall
Fixed-size (256 tokens)	0.51	0.58	0.62
Recursive (512 tokens, 10% overlap)	0.71	0.74	0.78
Semantic chunking	0.82	0.79	0.81

Compared to fixed-size, semantic chunking shows a +31 percentage point improvement in Faithfulness and +21 percentage points in Context Precision. The earlier benchmark ranked Recursive 1st for QA accuracy, while here Semantic scores higher on Faithfulness. These results may look contradictory, but that benchmark used QA accuracy (whether the correct answer is present) while this table uses Faithfulness (whether the answer is grounded in the retrieved context), so both the metrics and the domains differ. Since the optimal strategy varies by domain, measuring directly on your own dataset is the most reliable approach.

However, semantic chunking requires embedding model calls during the chunking itself, so preprocessing time increases considerably. If you're dealing with tens of thousands of documents, it's worth carefully weighing the cost and time tradeoffs.

Practical tip: Before changing chunking strategies, try injecting metadata (document titles, section information) first for a quick win. It's common to see QA accuracy jump from 50–60% up to 72–75%.

Boost Context Precision by +19 Percentage Points with One Line of BM25

This approach involves building an EnsembleRetriever that combines BM25 and dense vector search via RRF, then tracking retrieval quality changes with Ragas.

python

from langchain_community.retrievers import BM25Retriever
# Import path for langchain < 1.0; newer releases may ship EnsembleRetriever
# in the separate langchain-classic package instead.
from langchain.retrievers import EnsembleRetriever
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings, ChatOllama

# Reuse the chunks produced in the chunking example
chunks = recursive_chunks

# Before: Dense search only
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(chunks, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# After: BM25 + Dense hybrid (RRF). BM25Retriever needs the rank_bm25 package.
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5]  # needs tuning based on domain
)

# Call build_ragas_dataset() with dense_retriever vs hybrid_retriever for comparison.
# generate_answer() is a standalone version of the answer step with an explicit instruction.
answer_llm = ChatOllama(model="mistral")

def generate_answer(question: str, contexts: list[str]) -> str:
    context_text = "\n\n".join(contexts)
    prompt = f"Answer the question based on the following context.\n\nContext:\n{context_text}\n\nQuestion: {question}"
    return answer_llm.invoke(prompt).content

The actual numerical comparison is as follows.

Retrieval Method	Recall@5	Context Precision	MAP
Dense search only (FAISS/Chroma)	0.587	0.62	0.523
BM25 only	0.644	0.68	—
Hybrid (BM25 + Dense, RRF)	0.695	0.81	0.70
Hybrid + Neural Reranker	0.816	0.87	0.797

Recall@5 and MAP are not metrics that Ragas calculates directly; they are measured using separate retrieval evaluation tools (such as the BEIR framework). Only Context Precision is measured with Ragas.

The jump in Context Precision from 0.62 → 0.81 (+19 percentage points) when switching from Dense-only to Hybrid is particularly noteworthy. BM25 excels at exact keyword matching while dense retrieval excels at semantic similarity, and since their failure patterns differ, they complement each other well. Adding a Neural Reranker (Cross-Encoder family) on top of that lifts MAP to 0.797, a relative improvement of about 52% over the Dense-only baseline of 0.523.
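
Because this article reports some gains in percentage points and others as relative percentages, it helps to compute both side by side. The two helpers below are illustrative names of my own; the input values come from the table above.

```python
def percentage_point_change(before: float, after: float) -> float:
    """Absolute difference between two 0-1 scores, in percentage points."""
    return (after - before) * 100


def relative_change_pct(before: float, after: float) -> float:
    """Change relative to the starting value, in percent."""
    return (after - before) / before * 100


comparisons = {
    "Context Precision (Dense -> Hybrid)": (0.62, 0.81),
    "MAP (Dense -> Hybrid + Reranker)": (0.523, 0.797),
}

for label, (before, after) in comparisons.items():
    print(f"{label}: {percentage_point_change(before, after):+.1f} pp, "
          f"{relative_change_pct(before, after):+.1f}% relative")
# Context Precision (Dense -> Hybrid): +19.0 pp, +30.6% relative
# MAP (Dense -> Hybrid + Reranker): +27.4 pp, +52.4% relative
```

The same change looks quite different depending on which convention is used, so it is worth stating the unit whenever you record a before-and-after number.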

MAP (Mean Average Precision): For each query, Average Precision takes the precision at every rank where a relevant document appears and averages those values; MAP is the mean of that score across all queries. Because relevant documents near the top contribute more, it reflects ranking quality better than Recall alone.
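
A short sketch makes the contrast with Recall visible. The document IDs and relevance sets below are made up for illustration, and the function names are my own; this follows the textbook definition of Average Precision rather than any particular evaluation tool.

```python
def average_precision(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Mean of Precision@rank over the ranks where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of relevant documents that appear anywhere in the top k."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average the per-query Average Precision over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


relevant = {"a", "b"}
top_heavy = ["a", "b", "x", "y", "z"]     # relevant documents ranked first
bottom_heavy = ["x", "y", "z", "a", "b"]  # same documents, ranked last

for name, ranking in [("top-heavy", top_heavy), ("bottom-heavy", bottom_heavy)]:
    print(f"{name}: recall@5={recall_at_k(ranking, relevant, 5):.3f}, "
          f"AP={average_precision(ranking, relevant):.3f}")
# top-heavy: recall@5=1.000, AP=1.000
# bottom-heavy: recall@5=1.000, AP=0.325

print(f"MAP over both queries: "
      f"{mean_average_precision([(top_heavy, relevant), (bottom_heavy, relevant)]):.4f}")
```

Recall@5 cannot tell the two rankings apart, while Average Precision penalizes the one that buries the relevant documents.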

RRF (Reciprocal Rank Fusion): Each document's fused score is score(d) = Σ_i 1/(k + rank_i(d)), where rank_i(d) is the document's rank in retriever i and k is a smoothing constant (60 is a common default). Summing raw scores lets one retriever dominate the results because score scales differ across retrievers; RRF avoids this by using ranks instead of scores. A document ranked 1st by any retriever always receives a relatively high score under this scheme.

Pros and Cons Analysis

Advantages

In one sentence: it lets you pinpoint the bottleneck stage locally, for free, and with very little hand labeling.

Item	Description
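Minimal labeling	LLM-based automatic evaluation enables rapid iterative experimentation; Faithfulness and Answer Relevancy need no ground-truth at all, while Context Precision and Context Recall need only a reference answer per question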
Zero cost	Using a local Ollama LLM as the judge means fully local evaluation with no OpenAI API costs
Bottleneck identification	Retrieval metrics (Context Precision/Recall) and generation metrics (Faithfulness) are separated, so you can immediately see where the problem lies
Ecosystem integration	Connects directly with major tools like LangChain, Haystack, Langfuse, and MLflow
CI/CD integration	Can add quality gates to deployment pipelines via pytest-based evaluation tests or the Ragas CLI
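
A quality gate can be as small as a threshold check wrapped in a test function. The sketch below is one possible shape, not a Ragas feature: the threshold values are hypothetical, `check_quality_gate` is a helper name of my own, and the hard-coded scores stand in for the output of a real evaluation run.

```python
# Hypothetical minimum scores; pick values based on your own recorded baseline.
THRESHOLDS = {"faithfulness": 0.70, "context_precision": 0.70}


def check_quality_gate(scores: dict[str, float],
                       thresholds: dict[str, float]) -> list[str]:
    """Return a human-readable message for every metric below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


def test_rag_quality_gate():
    # In a real pipeline these numbers would come from an evaluation run.
    scores = {"faithfulness": 0.82, "context_precision": 0.79}
    assert check_quality_gate(scores, THRESHOLDS) == []


regressed = {"faithfulness": 0.82, "context_precision": 0.62}
print(check_quality_gate(regressed, THRESHOLDS))
# ['context_precision: 0.62 < 0.70']
```

Because the test function follows the usual `test_` naming, pytest would pick it up as-is and fail the build whenever a metric drops below its threshold.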

Disadvantages and Caveats

Honestly, the choice of judge model alone determines the entire result. I spent a long time troubleshooting after measuring Faithfulness with a 3B model and getting scores in the 0.3 range — switching the judge model to 7B changed the results completely. It's important to think of the judge model and the generation model used in the RAG pipeline as separate concerns.

Item	Description	Mitigation
Judge model dependency	Reliability of evaluation degrades when using a weak local model	Minimum 7B or larger, ideally 13B+
Claim decomposition errors	If the LLM incorrectly decomposes claims during Faithfulness calculation, scores get distorted	Review evaluation results alongside claim decomposition logs
Mixed metric versions	Context Precision has a version that requires ground-truth and one that doesn't, and they coexist	Always check the metric definition docs for your specific Ragas version
Small evaluation sets	Statistical reliability is low with 20 or fewer samples	Compose a set of at least 30–50 samples
Reranker latency	Adding a Neural Reranker can introduce latency on the order of 84 seconds per query	Measure latency against production SLA before deciding
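
For the latency check in the last row, a simple timing harness is usually enough to get a first number. The sketch below assumes nothing about your retriever beyond it being callable with a query string; `fake_retriever`, its sleep time, and the SLA value are placeholders of my own.

```python
import math
import statistics
import time
from typing import Callable


def measure_latency(retrieve: Callable[[str], object],
                    queries: list[str]) -> dict[str, float]:
    """Time each call and report median and 95th percentile latency in seconds."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        retrieve(query)
        timings.append(time.perf_counter() - start)
    timings.sort()
    p95_index = max(0, math.ceil(0.95 * len(timings)) - 1)
    return {"p50": statistics.median(timings), "p95": timings[p95_index]}


def fake_retriever(query: str) -> list[str]:
    """Placeholder for a real retrieval + reranking call."""
    time.sleep(0.01)
    return [query]


SLA_SECONDS = 2.0  # hypothetical production budget per query

stats = measure_latency(fake_retriever, [f"question {i}" for i in range(20)])
verdict = "within SLA" if stats["p95"] <= SLA_SECONDS else "exceeds SLA"
print(f"p50={stats['p50']:.3f}s, p95={stats['p95']:.3f}s ({verdict})")
```

Swapping `fake_retriever` for the hybrid retriever with and without the reranker gives a like-for-like comparison on the same questions used for the quality scores.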

LLM-as-a-judge: An approach where an LLM evaluates another LLM's output in place of a human. It's faster and cheaper than human evaluators, but it's important to be aware that the judge model's biases or limitations are directly reflected in the evaluation results.

The Most Common Mistakes in Practice

1. Setting the judge model too small: Measuring Faithfulness with a 3B–4B model results in broken claim decomposition, making the scores unreliable. In my experience, when judge quality degrades, scores start to look nearly random.
2. Composing too small an evaluation set: Drawing conclusions like "Strategy A is better than B" from a 10-sample evaluation set is prone to error. For strategy comparison experiments, a minimum of 50 samples is recommended, with sample composition that accounts for domain coverage.
3. Changing only the chunking strategy while keeping the embedding model fixed: Chunking strategies and embedding models interact with each other. Especially for non-English documents, comparing embedding models alongside your strategy changes will help you find a broader optimum. Check each candidate's language coverage first: BAAI/bge-m3 is trained on multilingual data, while models such as mxbai-embed-large should be verified against your language before you rely on them.

Closing Thoughts

The Ragas + Ollama combination is a tool that turns the feeling of "RAG is working well" into concrete numbers. The Faithfulness score of 0.71 you record today becomes the basis for changing your chunking strategy tomorrow, and a Context Precision of 0.62 becomes the data point that drives the decision to introduce hybrid search. The key is being able to set direction with numbers rather than intuition.

Three steps you can start right now:

1. Set up your environment: Install packages with pip install "ragas>=0.2.0" langchain-ollama and prepare the judge model with ollama pull mistral. Even if you don't have an existing RAG pipeline, you can verify the behavior using the SingleTurnSample examples in the code above.
2. Measure and record a baseline score for your current pipeline: Collect 30–50 questions coming into your actual service, build an evaluation set, and record the Faithfulness and Context Precision scores. These numbers become the reference point for all subsequent changes.
3. Apply Recursive 512-token chunking and hybrid search sequentially and compare: Measure scores after changing the chunking strategy, then proceed to add BM25 with EnsembleRetriever. Running them in this order lets you isolate variables and track the effect of each change.

Which chunking strategy has proven most effective in your domain? I'd love it if you shared your experience in the comments.

References

¹ Building Production RAG, Prem AI Blog, 2026

#RAG #Ragas #Ollama #LangChain #LLM-as-a-judge #HybridSearch #BM25 #ChunkingStrategy #VectorSearch #Faithfulness


