DEV BAK - TECH BLOG
AI

Building Your Own LLM Evaluation Framework vs. Off-the-Shelf Tools: Team Decision Criteria for 2026

If your team is shipping RAG, chatbots, or agents to production, this decision is waiting for you

If you've ever shipped an AI feature to your product and thought, "I know this works — but how do I know how well it works?", you've arrived at exactly the right moment. I used to think, "Can't we just manually check a few prompts?" — but a regression popped up in an unexpected place after a model version change, and that thinking shifted. Once you actually try to set up evaluation infrastructure, the very first question you face is this: should you use an off-the-shelf tool like RAGAS, DeepEval, or Promptfoo, or build your own? By the end of this post, you'll understand the core differences between the two approaches and be able to define the right criteria for your team's current situation.

As of 2025–2026, LLM evaluation is no longer just a research team's homework. It has become a quality gate that practically blocks deployment pipelines, and how you build this infrastructure often determines a team's AI capabilities. The one-line summary: off-the-shelf tools let you start fast but have customization limits, while building your own enables domain specialization at a higher upfront cost.


Core Concepts

What Is an LLM Evaluation Harness?

Evaluation Harness: Testing infrastructure for systematically measuring the output quality of language models. It bundles an input dataset, evaluation metrics, scoring logic, and result reporting into a single pipeline.

Think of it as the equivalent of pytest or Jest in software testing, applied to LLM outputs. The key difference is that LLM outputs are non-deterministic. "What is the capital of France?" has a right answer, but "Is this response sufficiently empathetic?" is much harder to reduce to a single number. That's why scoring becomes more complex than ordinary unit tests.
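Because the same prompt can yield different outputs, one common way to make a check meaningful is to run it several times and report a pass rate rather than a single pass/fail. The following is a minimal sketch of that idea; `fake_model` is a hypothetical stand-in for a real model call, and the default of 10 trials is an arbitrary assumption.

```python
import random
from typing import Callable

def pass_rate(generate: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              trials: int = 10) -> float:
    """Run the same prompt several times and return the fraction of outputs that pass."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(check(generate(prompt)) for _ in range(trials))
    return passed / trials

# Hypothetical stand-in for a non-deterministic model call.
def fake_model(prompt: str, _rng=random.Random(0)) -> str:
    return _rng.choice(["Paris", "Paris", "Paris", "Lyon"])

rate = pass_rate(fake_model, lambda out: out == "Paris", "What is the capital of France?")
print(f"pass rate: {rate:.0%}")
```

A threshold on the pass rate (say, "at least 9 of 10 runs must pass") then behaves much more like an ordinary test assertion.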

When Does Evaluation Run?

Evaluation operates at three distinct points in time.

| Timing | Description | Practical Use |
| --- | --- | --- |
| Offline | Batch evaluation against a curated dataset | Before/after comparison for model swaps or fine-tuning |
| CI/CD | Regression testing before prompt or model changes | Automated run before PR merges |
| Online (Production) | Real-time traffic sampling and monitoring | Drift detection, anomaly detection |

Regression Test: A test that verifies previously working functionality hasn't broken after a code, prompt, or model change. It plays the same role in LLM systems that a regression suite plays in conventional software.

Teams that cover all three timings are rarer than you'd think. Teams that have shipped agents to production consistently say that starting with CI/CD integration offers the highest leverage.
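A CI/CD gate can be as small as a script that compares the current run against a stored baseline and returns a non-zero exit code on regression. The sketch below assumes per-case results are plain dicts mapping a case name to pass/fail; the function names and the 0.9 minimum pass rate are illustrative and not taken from any particular tool.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return cases that passed in the baseline but fail (or are missing) now."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )

def gate(baseline: dict[str, bool], current: dict[str, bool],
         min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 lets the PR merge, 1 blocks it."""
    regressions = find_regressions(baseline, current)
    rate = sum(current.values()) / len(current) if current else 0.0
    for name in regressions:
        print(f"REGRESSION: {name}")
    print(f"pass rate: {rate:.2f} (minimum {min_pass_rate:.2f})")
    return 1 if regressions or rate < min_pass_rate else 0

# Hypothetical results from the last merged run and from the current PR.
baseline = {"refund_policy": True, "shipping_time": True, "cancel_shipped": False}
current = {"refund_policy": True, "shipping_time": False, "cancel_shipped": True}

exit_code = gate(baseline, current)
print("exit code:", exit_code)  # in CI, pass this value to sys.exit()
```

Tracking regressions separately from the overall pass rate matters: a PR can fix one case and break another, leaving the aggregate unchanged while still shipping a regression.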

Three Types of Evaluation Metrics

```python
# Type 1: Deterministic metrics (regex / code matching)
import json

def evaluate_json_output(response: str) -> bool:
    """Check whether the model returned valid JSON"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False
```

```python
# Type 2: LLM-as-Judge metrics
# judge_model is a placeholder: any object exposing complete(prompt) -> str.
# In a real implementation, wrap client.chat.completions.create(...) (OpenAI)
# or client.messages.create(...) (Anthropic) behind that interface.
import re

def llm_judge_relevance(question: str, answer: str, judge_model) -> float:
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Return only a score between 0 and 1 indicating how relevant "
        "the answer is to the question."
    )
    raw = judge_model.complete(prompt)  # Replace with actual API interface
    # Judges sometimes add extra words, so extract the first number
    # instead of calling float() on the whole reply.
    match = re.search(r"\d*\.?\d+", raw)
    if match is None:
        raise ValueError(f"Judge returned no numeric score: {raw!r}")
    return min(1.0, max(0.0, float(match.group())))
```

```python
# Type 3: Statistical / reference-based metrics (ROUGE, BERTScore, etc.)
from rouge_score import rouge_scorer

def evaluate_with_rouge(reference: str, generated: str) -> dict:
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, generated)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
```

LLM-as-Judge: A pattern that uses a powerful model like GPT-4o or Claude as the scorer. Useful for evaluating nuance or consistency that deterministic metrics alone struggle to catch. Note that each framework has a different implementation philosophy — DeepEval is Python metric object-based, while Promptfoo is YAML assertion-based, so it's worth keeping these design differences in mind to avoid confusion.

In practice, a composite approach mixing all three types is the most realistic.
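One way to combine the three types is to treat the deterministic check as a hard gate and blend the graded scores behind it. The sketch below is an assumption about how such a composite could look, not a scheme from any framework: `token_f1` is a simplified unigram-overlap stand-in for ROUGE-1 so the example runs without extra packages, the judge score is hard-coded where a real judge call would go, and the 0.6 weight is arbitrary.

```python
import json
from collections import Counter
from dataclasses import dataclass

def is_valid_json(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def token_f1(reference: str, generated: str) -> float:
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    ref, gen = reference.lower().split(), generated.lower().split()
    overlap = sum((Counter(ref) & Counter(gen)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

@dataclass
class CompositeScore:
    valid_format: bool    # Type 1: deterministic check
    judge_score: float    # Type 2: LLM-as-Judge score in [0, 1]
    overlap_score: float  # Type 3: reference-based score in [0, 1]

    def total(self, judge_weight: float = 0.6) -> float:
        """Hard-fail on format; otherwise blend the two graded scores."""
        if not self.valid_format:
            return 0.0
        return judge_weight * self.judge_score + (1 - judge_weight) * self.overlap_score

response = '{"answer": "Refunds are available within 30 days"}'
reference = "Refund available within 30 days"

valid = is_valid_json(response)
answer = json.loads(response)["answer"] if valid else ""
score = CompositeScore(
    valid_format=valid,
    judge_score=0.9,  # placeholder for a real judge call
    overlap_score=token_f1(reference, answer),
)
print(round(score.total(), 3))
```

Gating on the deterministic check first also saves money: there is no reason to pay for a judge call on a response that already failed format validation.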


Practical Application

Example 1: Measuring RAG Pipeline Quality with an Off-the-Shelf Tool — RAGAS

RAG (Retrieval-Augmented Generation): An approach where the model first retrieves relevant information from external documents or databases before generating a response. Widely used in internal document-based Q&A and customer support chatbots.

If you're building a RAG-based service, RAGAS is the fastest starting point. The key point up front: faithfulness and answer_relevancy can be computed without ground-truth labels. However, context_recall requires ground_truth (reference labels). I missed this distinction early on and discovered it through a runtime error — knowing it in advance will save you some headaches.

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from datasets import Dataset

# Explicitly set the judge LLM that RAGAS uses for scoring.
# Assumes OPENAI_API_KEY is set in the environment. answer_relevancy also
# relies on an embedding model; with no embeddings argument, RAGAS falls
# back to its default.
judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Column names below follow the schema this example was written against.
# RAGAS has changed its dataset schema between releases, so check the docs
# for the version you have installed.
eval_data = {
    "question": [
        "What is your refund policy?",
        "How long does shipping take?"
    ],
    "answer": [
        "You can request a refund within 30 days of purchase.",
        "It takes an average of 3–5 business days."
    ],
    "contexts": [
        ["Refunds are available within 30 days of the purchase date...", "Exchanges within 7 days..."],
        ["Standard shipping takes 3–5 business days, express shipping takes 1–2 business days..."]
    ],
    # ground_truth is required for context_recall.
    # This field can be omitted if you only use faithfulness and answer_relevancy.
    "ground_truth": [
        "Refund available within 30 days",
        "Takes 3–5 business days"
    ]
}

dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
    llm=judge_llm
)

print(result)
# Example output (values are illustrative):
# {'faithfulness': 0.92, 'answer_relevancy': 0.87, 'context_recall': 0.95}
```

| Metric | Meaning | What to Suspect When Low | ground_truth Required |
| --- | --- | --- | --- |
| faithfulness | Is the answer grounded in the context? | Model is generating hallucinations | Not required |
| answer_relevancy | How relevant is the answer to the question? | Prompt or retrieval design issues | Not required |
| context_recall | Was the necessary information retrieved? | Embedding model or chunking strategy issues | Required |

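Knowing how faithfulness works internally makes it easier to trust the score. RAGAS decomposes the answer into individual claims, judges whether each claim is supported by the retrieved context, and reports the fraction of claims that are supported. As a rule of thumb, a score below 0.8 is a signal to investigate: check whether the model is adding details the context doesn't support, and whether retrieval is returning context that actually contains the answer.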
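The claim-level scoring reduces to a ratio: supported claims divided by total claims. The sketch below illustrates that arithmetic only and is not RAGAS's implementation. In RAGAS both the claim extraction and the support verdicts come from the judge LLM; here the claims are written by hand and a naive substring check (`is_supported`, a hypothetical helper) stands in for the verdict step.

```python
def is_supported(claim: str, contexts: list[str]) -> bool:
    """Toy verdict: a claim counts as supported if it appears verbatim in a context."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)

def faithfulness_ratio(claims: list[str], contexts: list[str]) -> float:
    """Fraction of claims supported by the retrieved contexts."""
    if not claims:
        return 0.0
    supported = sum(is_supported(claim, contexts) for claim in claims)
    return supported / len(claims)

contexts = ["Refunds are available within 30 days of the purchase date."]
claims = [
    "refunds are available within 30 days",  # present in the context
    "refunds are issued as store credit",    # not present in the context
]

print(faithfulness_ratio(claims, contexts))  # 1 of 2 claims supported -> 0.5
```

Seen this way, a faithfulness of 0.8 on a five-claim answer means exactly one claim had no support in the context, which is usually specific enough to go and find.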

In practice, the LangchainLLMWrapper setup requires one more look at the docs, but things move faster than expected after that.


Example 2: Integrating Prompt Regression Testing into CI/CD with Promptfoo

Manually checking every prompt change hits a wall quickly. Promptfoo lets you compare multiple models and prompt variants simultaneously with a single YAML config, solving this problem quite cleanly. Where DeepEval defines metric objects in Python code, Promptfoo operates through YAML assertions. The philosophy differs, so pick whichever fits your taste and team situation.

```yaml
# promptfooconfig.yaml
description: "Customer support chatbot regression tests"

prompts:
  - "You are a friendly customer support representative. Answer the following question: {{question}}"
  - "Customer question: {{question}}\n\nPlease respond concisely and accurately."

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      question: "I'd like a refund"
    assert:
      # icontains is the case-insensitive variant, so "Refund" also matches
      - type: icontains
        value: "refund"
      - type: llm-rubric
        value: "Is the response empathetic and does it guide the user on next steps?"
      - type: not-contains
        value: "I don't know"

  - vars:
      question: "My delivery is taking too long"
    assert:
      - type: llm-rubric
        value: "Does the response include an apology?"
      - type: javascript
        # output is a Promptfoo built-in variable referring to the model's response text
        value: "output.length < 500"
```

```bash
# Run in GitHub Actions: evaluate and write the results to a file
npx promptfoo eval --config promptfooconfig.yaml --output results.json

# Optional, usually run locally rather than in CI: synthesize additional
# test cases from the prompts and tests already in the config
npx promptfoo generate dataset
```

In practice, managing YAML starts to feel a bit burdensome as the number of cases grows. Promptfoo shines when comparing many prompt variants across multiple models, but once you need domain-specific metrics, the amount of custom assertion writing starts to increase.
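To give a sense of what a custom assertion involves, here is a sketch of a domain-specific check written as a Python assertion file. It assumes Promptfoo's Python assertion convention, in which a test references a file (for example `type: python` with `value: file://refund_policy_assert.py`) and Promptfoo calls a `get_assert(output, context)` function that returns a dict with `pass`, `score`, and `reason`; verify that contract against the Promptfoo docs for your version. The file name, the 30-day rule, and the regex are all hypothetical.

```python
import re
from typing import Any

MAX_REFUND_DAYS = 30  # hypothetical business rule for this example

def get_assert(output: str, context: dict[str, Any]) -> dict[str, Any]:
    """Fail if the response mentions a period longer than the refund policy allows."""
    # Matches "30 days", "45 day", "60-day"; deliberately naive.
    mentioned = [int(n) for n in re.findall(r"(\d+)[\s-]*days?", output, flags=re.IGNORECASE)]
    too_long = [n for n in mentioned if n > MAX_REFUND_DAYS]
    if too_long:
        return {
            "pass": False,
            "score": 0.0,
            "reason": f"Response mentions {max(too_long)} days; policy allows {MAX_REFUND_DAYS}.",
        }
    return {"pass": True, "score": 1.0, "reason": "No period exceeds the refund policy."}

print(get_assert("You can request a refund within 30 days.", {}))
print(get_assert("Refunds are accepted for 60 days.", {}))
```

Each business rule like this is a small file to write and maintain, which is where the custom assertion workload comes from as domain requirements accumulate.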


Example 3: Building Your Own for Agent Systems

Multi-step, tool-calling agent systems are an area that general-purpose tools struggle to cover. For example, checking "when a cancellation request comes in, did the agent call tools in the order get_order → cancel_order → notify_user?" requires awkward workarounds with off-the-shelf tools. In cases like this, building your own is the practical choice.

A simplified version of a pattern used by teams that have shipped agents to production looks something like this:

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Any
from enum import Enum

class TestCategory(Enum):
    HAPPY_PATH = "happy_path"
    # Cases where the model self-corrects after a failed attempt
    # e.g., "Entered the wrong order number, retried successfully"
    RECOVERABLE = "recoverable"
    # Cases that should produce an error; cancel_order should NOT be called
    # e.g., "Attempts to cancel an already-shipped order → notified that cancellation is unavailable"
    UNRECOVERABLE = "unrecoverable"
    ADVERSARIAL = "adversarial"  # Adversarial inputs

@dataclass
class EvalCase:
    name: str
    category: TestCategory
    input: dict
    expected_tool_calls: list[str]
    success_criteria: Callable[[Any], bool]

class AgentEvalHarness:
    def __init__(self, agent, cases: list[EvalCase]):
        self.agent = agent
        self.cases = cases
        self.results = []

    async def run(self) -> dict:
        self.results = []  # Reset so repeated runs don't double-count
        for case in self.cases:
            result = await self._evaluate_case(case)
            self.results.append(result)
        return self._summarize()

    async def _evaluate_case(self, case: EvalCase) -> dict:
        # Assumes the agent returns a (output, tool_calls) tuple.
        # An adapter layer may be needed depending on your actual agent interface.
        actual_output, actual_tool_calls = await self.agent.run(case.input)

        output_ok = case.success_criteria(actual_output)
        tool_order_ok = actual_tool_calls == case.expected_tool_calls
        return {
            "name": case.name,
            "category": case.category.value,
            "output_ok": output_ok,
            "tool_call_order_correct": tool_order_ok,
            # A case passes only if both the output and the tool-call sequence match;
            # otherwise a forbidden cancel_order call would go unnoticed.
            "passed": output_ok and tool_order_ok,
        }

    def _summarize(self) -> dict:
        # Aggregate pass counts per category
        by_category = {}
        for r in self.results:
            cat = r["category"]
            by_category.setdefault(cat, {"passed": 0, "total": 0})
            by_category[cat]["total"] += 1
            if r["passed"]:
                by_category[cat]["passed"] += 1
        return by_category

# my_agent is a placeholder for your own agent object. It must expose
# `async def run(input: dict)` returning (output_text, tool_call_names).
harness = AgentEvalHarness(agent=my_agent, cases=[
    EvalCase(
        name="Normal order cancellation handling",
        category=TestCategory.HAPPY_PATH,
        input={"user_message": "Please cancel order 12345"},
        expected_tool_calls=["get_order", "cancel_order", "notify_user"],
        # Lowercase the output so "Cancelled" / "Confirmed" also match
        success_criteria=lambda out: "cancel" in out.lower() and "confirmed" in out.lower()
    ),
    # A case added from an actual production bug
    EvalCase(
        name="Attempt to cancel an already-shipped order",
        category=TestCategory.UNRECOVERABLE,
        input={"user_message": "Please cancel the order that was shipped yesterday"},
        expected_tool_calls=["get_order"],  # cancel_order must NOT be called
        success_criteria=lambda out: "cannot cancel" in out.lower() or "already shipped" in out.lower()
    ),
])

summary = asyncio.run(harness.run())
print(summary)  # per-category counts, e.g. {"happy_path": {"passed": ..., "total": ...}}
```

The core of this pattern is the workflow: every time a production bug occurs, add a case for it. Trying to build a perfect case set up front means you'll never get started, and cases derived from bugs that actually occurred in production are far more valuable than hypothetical ones.
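The "one bug, one case" workflow is easier to sustain when adding a case is a single function call. The sketch below stores cases as JSON Lines so they can be reviewed in a PR and loaded into a harness later. The file name, the field names, and the `add_case_from_incident` helper are assumptions for illustration; the demo writes to a temporary directory.

```python
import json
import tempfile
from pathlib import Path

def add_case_from_incident(path: Path, name: str, user_message: str,
                           expected_tool_calls: list[str],
                           must_contain: list[str]) -> bool:
    """Append a regression case to a JSONL file; return False if the name already exists."""
    existing = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["name"] for line in f if line.strip()}
    if name in existing:
        return False
    case = {
        "name": name,
        "input": {"user_message": user_message},
        "expected_tool_calls": expected_tool_calls,
        "must_contain": must_contain,  # substrings the final answer should include
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
    return True

cases_file = Path(tempfile.mkdtemp()) / "regression_cases.jsonl"

added = add_case_from_incident(
    cases_file,
    name="Attempt to cancel an already-shipped order",
    user_message="Please cancel the order that was shipped yesterday",
    expected_tool_calls=["get_order"],
    must_contain=["already shipped"],
)
duplicate = add_case_from_incident(
    cases_file,
    name="Attempt to cancel an already-shipped order",
    user_message="Cancel my shipped order",
    expected_tool_calls=["get_order"],
    must_contain=["already shipped"],
)
print(added, duplicate)
```

Keeping the cases in a plain text file next to the code means each new case arrives with the bug fix in the same PR.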


Pros and Cons Analysis

Here's how the two compare directly.

| Item | Build Your Own | Off-the-Shelf Tools |
| --- | --- | --- |
| Time to Start | Slow (must design from scratch) | Fast (first evaluation running within hours) |
| Customization | Completely free | Limited (need to check extension points) |
| Data Security | No external transmission | External transmission possible depending on platform |
| Maintenance Cost | High (team maintains directly) | Low (vendor updates) |
| Vendor Lock-in | None | Commercial platforms carry pricing change risk |
| Standard Benchmarks | Must implement yourself | Can compare directly against MMLU, etc. |
| Agent-Specific | Strong suit | Requires workaround implementations |

Advantages

Build Your Own

| Item | Details |
| --- | --- |
| Domain Specialization | Business logic and metrics can be fully customized |
| Infrastructure Integration | Connects freely with existing CI/CD and data pipelines |
| Data Security | Proprietary data never leaves to an external service |
| Strategic Asset | The accumulated evaluation dataset itself becomes a competitive advantage |

Off-the-Shelf Tools

| Item | Details |
| --- | --- |
| Fast Start | First evaluation can run within hours of installation |
| Validated Metrics | Community- and paper-validated metrics available immediately |
| Vendor Updates | New LLM support is added automatically |
| Standard Comparison | Can compare directly against benchmarks like MMLU and GSM8K |

Drawbacks and Caveats

Build Your Own

| Item | Details | Mitigation |
| --- | --- | --- |
| High Upfront Cost | Design, implementation, and maintenance all consume team resources | Start MVP small, expand cases incrementally |
| Technical Debt | Notebook-level implementations pile up and become hard to touch later | Maintain modular structure from the start |
| Expertise Required | Designing good evaluation metrics is harder than it looks | Reference and borrow metric designs from off-the-shelf tools |

Off-the-Shelf Tools

| Item | Details | Mitigation |
| --- | --- | --- |
| Customization Limits | Adding domain-specific metrics can be difficult | Check custom evaluator extension points first |
| Vendor Lock-in | Commercial platforms are vulnerable to pricing policy changes | Prioritize open-source, verify data export capability |
| External Data Transmission | Prompts and responses may be sent to external servers | Check for self-hosting options or on-premises support |

The Most Common Mistakes in Practice

I've made each of these mistakes myself, so I'm sharing them for anyone who might run into the same thing.

  1. Trying to build a perfect set of evaluation cases from the start — Starting with 20–30 cases and adding more every time a production failure occurs is far more effective. It's surprisingly common to wait for the perfect dataset and end up doing nothing at all.

  2. Blindly trusting LLM-as-Judge — Believing "an AI is scoring it, so it must be objective" will catch up with you. The judge model makes mistakes too, and error rates climb especially with long contexts or specialized domains. It's strongly recommended to always pair it with deterministic metrics.

  3. Doing only offline evaluation and skipping production monitoring — Honestly, this is the part most often left out. Patterns that appear in real user traffic are hard to anticipate from a test set, and adding monitoring later is expensive in effort. It's recommended to include it in the initial design.
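For the second mistake, a practical guard is to score a small sample by hand and measure how often the judge agrees with you before trusting it at scale. The sketch below computes the raw agreement rate and Cohen's kappa (agreement corrected for chance) for binary pass/fail labels; the label lists are made-up example data, not results from any real judge.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of items where the judge's verdict matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for chance, for binary labels."""
    observed = agreement_rate(human, judge)
    p_human = sum(human) / len(human)
    p_judge = sum(judge) / len(judge)
    # Probability that both say "pass" or both say "fail" purely by chance
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return 1.0  # both raters used a single label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for eight responses: True = acceptable, False = not acceptable.
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, False, True, True, True]

print(f"agreement: {agreement_rate(human_labels, judge_labels):.2f}")
print(f"kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")
```

Raw agreement can look healthy simply because most responses pass. Kappa discounts that, so a low kappa alongside high agreement suggests the judge is mostly echoing the majority label.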


Closing Thoughts

Building evaluation infrastructure is just as important as choosing which model to use. Talking with teams that have shipped agents to production, it's not uncommon to hear that improvements to the harness outperformed the gains from switching to a more powerful model. More often than not, it's how you evaluate and improve that makes the difference, not the model itself.

If you're unsure where to start, here's one way to think about it:

  • Off-the-shelf tools are the right fit → AI is a supplementary rather than core feature, the use case is standard (RAG, chatbot), and fast shipping is the top priority
  • Building your own is the right fit → AI is the core business value, domain-specific metrics are required, or regulations prevent sending data externally
  • Hybrid is the most practical → Start with off-the-shelf tools and build custom implementations only where customization is needed. The two approaches aren't mutually exclusive, and this is actually the most commonly used pattern in practice.

Three steps you can take right now:

  1. You can start building a small test set — If you're running a RAG pipeline or chatbot, install with pip install ragas, collect 20 real questions and model responses from your current service, and measure just two metrics: faithfulness and answer_relevancy. Once numbers appear, direction starts to become clear.

  2. You can try CI/CD integration — With Promptfoo, run npx promptfoo@latest init to generate an initial config file, then add the promptfoo eval command to a PR trigger in GitHub Actions. The moment you catch your first regression, you'll feel the value of evaluation infrastructure intuitively.

  3. A hybrid approach is perfectly practical — Starting with a structure that covers standard metrics with off-the-shelf tools and implements domain-specific logic directly (as in Example 3) lets you strike a balance between upfront cost and customization.


References

  • Top 10 LLM Evaluation Harnesses: Features, Pros, Cons & Comparison | DevOpsSchool
  • Top 7 LLM Evaluation Tools in 2026 | Confident AI
  • EleutherAI lm-evaluation-harness | GitHub
  • Promptfoo vs DeepEval vs RAGAS: 2026 LLM Evaluation Tools Comparison | genai.qa
  • LLM Evaluation Frameworks: Head-to-Head Comparison | Comet ML
  • DeepEval vs. RAGAS vs. LangSmith: Choosing the Right Evaluation Framework | Descope
  • Best AI Eval Tools for CI/CD Pipelines 2026 | Braintrust
  • How to Build an LLM Evaluation Framework in 2025 | Deepchecks
  • The Model vs. the Harness: Which matters more? | Medium