Building an evaluation system for AI agents means creating a repeatable testing framework that measures whether your agent actually does what it claims. Production LLMOps teams use tools like Langfuse, Weights & Biases, and custom scoring functions to track performance metrics, compare model versions, and catch failures before they reach users. You'll need three core components: a dataset of test cases with expected outputs, scoring methods that evaluate agent responses (exact-match or LLM-as-judge), and a tracking system for cost and latency metrics. This guide walks through building each piece using real tools and code.
What Is an AI Agent Evaluation Framework
An evaluation framework is a structured system that runs your AI agent against predefined test cases and scores the results. Unlike manual testing where you type prompts and eyeball responses, this approach automates the entire process and generates quantifiable metrics.
The framework typically includes a dataset (questions or tasks your agent should handle), an execution layer (code that runs your agent), scoring functions (code that grades outputs), and a dashboard (where you compare results). Think of it like unit testing for software, but adapted for the probabilistic nature of LLMs.
Most production teams run evaluations on every major change to prompts, models, or agent logic. A typical setup might test 50-200 examples per run, measuring accuracy, cost per task, and response time. Teams that skip this step often discover bugs only after users complain, which is expensive to fix and damages trust.
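The pieces described above (dataset, execution layer, scoring function, summary) fit in a few dozen lines. What follows is a minimal sketch, not a production harness: `EvalResult`, `run_eval`, `summarize`, and `toy_agent` are hypothetical names, and the toy agent is a keyword router standing in for a real LLM call.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    input: str
    output: str
    score: float
    latency_s: float


def run_eval(
    dataset: list[dict],
    agent: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> list[EvalResult]:
    """Run the agent on every test case and score each output."""
    results = []
    for case in dataset:
        start = time.perf_counter()
        output = agent(case["input"])  # execution layer
        latency = time.perf_counter() - start
        score = scorer(output, case["expected"])  # scoring function
        results.append(EvalResult(case["input"], output, score, latency))
    return results


def summarize(results: list[EvalResult]) -> dict:
    """Aggregate per-case results into the numbers a dashboard would show."""
    n = len(results)
    return {
        "cases": n,
        "accuracy": sum(r.score for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
    }


# Hypothetical stand-in agent: routes on a keyword instead of calling an LLM.
def toy_agent(text: str) -> str:
    return "billing" if "plan" in text.lower() else "policy_question"


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


dataset = [
    {"input": "Can I upgrade my plan mid-month?", "expected": "billing"},
    {"input": "What's our refund policy for damaged items?", "expected": "policy_question"},
    {"input": "Why is the sky blue during sunset?", "expected": "out_of_scope"},
]

summary = summarize(run_eval(dataset, toy_agent, exact_match))
print(summary)
```

The toy agent gets the off-topic question wrong, which is the point: the harness surfaces the miss as a number instead of relying on someone noticing it.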
Why Testing AI Agents Matters for Production and Interviews
AI agents fail in ways traditional software doesn't. They hallucinate facts, skip steps in multi-step tasks, or produce correct answers with terrible reasoning. Without systematic testing, you won't catch these issues until production.
Production teams need proof that switching from GPT-4 to Claude 3.5 Sonnet won't tank performance. They need data showing that a new prompt template improves accuracy by 12% while cutting costs by 30%. This requires running the same test suite against multiple configurations and comparing metrics side-by-side.
In technical interviews for AI engineering roles, you'll face questions like "how do you validate your agent works?" Saying "I tested it manually" signals inexperience. The right answer involves specific tools (Langfuse, LangSmith, custom eval scripts), metrics you track (task success rate, cost per completion, P95 latency), and how you handle edge cases. Roughly 60% of senior AI engineer interviews now include questions about evaluation methodology.
Understanding how ReAct loops work in AI agents helps you design better test cases, since you'll know which intermediate reasoning steps to validate.
How to Benchmark AI Agents in Production
Start by defining what "working" means for your specific agent. If you're building a customer support agent, success might mean correctly categorizing tickets 95% of the time. For a code generation agent, it's producing syntactically valid code that passes unit tests.
Create a golden dataset of 50-100 examples with known correct answers. These should cover common cases (70%), edge cases (20%), and known failure modes (10%). Store them in a simple format like JSONL where each line contains an input and expected output.
{"input": "What's our refund policy for damaged items?", "expected": "full_refund", "category": "policy_question"}
{"input": "Can I upgrade my plan mid-month?", "expected": "prorated_upgrade", "category": "billing"}
{"input": "Why is the sky blue during sunset?", "expected": "out_of_scope", "category": "off_topic"}
Run your agent against this dataset and collect outputs. Track three categories of metrics: quality (did it get the right answer), cost (tokens used per task), and speed (time to complete). Store results in a structured format so you can compare runs over time.
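Here's one way to load the JSONL records and break quality down by category, so a weak spot in one slice of the dataset doesn't hide behind a good overall average. Treat it as a sketch: `load_jsonl` and `accuracy_by_category` are hypothetical helpers, and the agent outputs are hard-coded so the example runs without an LLM.

```python
import json
from collections import defaultdict

# The three example records from the golden dataset, as JSONL text.
JSONL_TEXT = """\
{"input": "What's our refund policy for damaged items?", "expected": "full_refund", "category": "policy_question"}
{"input": "Can I upgrade my plan mid-month?", "expected": "prorated_upgrade", "category": "billing"}
{"input": "Why is the sky blue during sunset?", "expected": "out_of_scope", "category": "off_topic"}
"""


def load_jsonl(text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def accuracy_by_category(cases: list[dict], outputs: list[str]) -> dict[str, float]:
    """Group exact-match results by the 'category' field of each case."""
    hits: dict[str, list[float]] = defaultdict(list)
    for case, output in zip(cases, outputs):
        hits[case["category"]].append(1.0 if output == case["expected"] else 0.0)
    return {cat: sum(v) / len(v) for cat, v in hits.items()}


cases = load_jsonl(JSONL_TEXT)
# Hypothetical agent outputs, hard-coded so the sketch runs without an LLM.
outputs = ["full_refund", "prorated_upgrade", "full_refund"]
breakdown = accuracy_by_category(cases, outputs)
print(breakdown)
```

In a real run you would read the text from a file and store each run's breakdown alongside its cost and latency numbers.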
Setting Up Langfuse for Agent Benchmarking
Langfuse provides a run_experiment() function that automates the evaluation loop. You define your agent logic, your test dataset, and your scoring functions, then Langfuse runs everything and creates a leaderboard comparing different configurations.
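from langfuse import Evaluation, get_client, observe

# Argument names follow the Langfuse Python SDK v3 experiment runner
# (data / task / evaluators); verify them against your installed version.
langfuse = get_client()

@observe()
def my_agent(input_text):
    # Your agent logic here. call_llm_with_tools is a placeholder for your own code.
    response = call_llm_with_tools(input_text)
    return response

def exact_match_scorer(output, expected):
    # Case- and whitespace-insensitive comparison against the expected label.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_agent_task(*, item, **kwargs):
    # Adapter: the experiment runner passes each dataset item as `item`.
    return my_agent(item["input"])

def exact_match_evaluator(*, output, expected_output, **kwargs):
    # Adapter: wrap the plain scorer in the Evaluation object the runner expects.
    return Evaluation(name="exact_match", value=exact_match_scorer(output, expected_output))

langfuse.run_experiment(
    name="customer_support_agent_v2",
    data=[
        {"input": item["input"], "expected_output": item["expected"]}
        for item in test_dataset
    ],
    task=run_agent_task,
    evaluators=[exact_match_evaluator],
)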
This code runs your agent on every test case, applies the scoring function, and logs results to the Langfuse dashboard. You'll see aggregate metrics like mean score, P50/P95 latency, and total cost. The dashboard lets you filter by test case category to spot where your agent struggles.
Comparing Multiple Model Configurations
The real power comes from running experiments with different models or prompts. Create variants of your agent function that use GPT-4, Claude, or different temperature settings, then run each through the same test suite.
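def agent_gpt4(*, item, **kwargs):
    # call_openai is a placeholder for your own OpenAI wrapper.
    return call_openai(model="gpt-4", prompt=item["input"])

def agent_claude(*, item, **kwargs):
    # call_anthropic is a placeholder for your own Anthropic wrapper.
    return call_anthropic(model="claude-3-5-sonnet-20241022", prompt=item["input"])

# Run both and compare.
# '...' stands for the dataset and scoring arguments, which must be identical across runs.
# Keyword names follow the Langfuse Python SDK v3; verify against your installed version.
langfuse.run_experiment(name="gpt4_baseline", task=agent_gpt4, ...)
langfuse.run_experiment(name="claude_variant", task=agent_claude, ...)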
Langfuse generates a leaderboard showing which configuration performs best on your specific tasks. You might discover that Claude costs 40% less while maintaining 98% of GPT-4's accuracy, a trade-off worth making for high-volume production use.
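The trade-off itself is simple arithmetic once you have a summary per run. A rough sketch, with hypothetical names (`RunSummary`, `compare`, `worth_switching`) and illustrative numbers rather than measurements:

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    name: str
    accuracy: float       # mean score on the shared test suite, 0-1
    cost_per_task: float  # USD


def compare(baseline: RunSummary, candidate: RunSummary) -> dict:
    """Express the candidate relative to the baseline."""
    return {
        "accuracy_retained": candidate.accuracy / baseline.accuracy,
        "cost_change": (candidate.cost_per_task - baseline.cost_per_task)
        / baseline.cost_per_task,
    }


def worth_switching(delta: dict, min_accuracy_retained: float = 0.97) -> bool:
    """Accept a cheaper candidate only if it keeps enough of the accuracy."""
    return delta["cost_change"] < 0 and delta["accuracy_retained"] >= min_accuracy_retained


# Illustrative numbers only, not measurements.
baseline = RunSummary("gpt4_baseline", accuracy=0.90, cost_per_task=0.10)
candidate = RunSummary("claude_variant", accuracy=0.882, cost_per_task=0.06)

delta = compare(baseline, candidate)
print(delta, worth_switching(delta))
```

The 0.97 threshold is an assumption; pick the floor your product can actually tolerate and write it down before you look at the results.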
LLM as Judge Scoring for Agent Evaluation
Exact-match scoring works when outputs are predictable (categories, JSON structures, yes/no answers). But many agent tasks produce free-form text where correctness isn't binary. That's where LLM-as-judge comes in.
You use a separate LLM to evaluate your agent's output. The judge receives the original question, your agent's answer, and a rubric, then assigns a score. This catches cases where the answer is correct but phrased differently than your golden dataset.
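def llm_judge_scorer(output, expected, input_text):
    # Ask a separate model to grade the answer against a fixed rubric.
    judge_prompt = f"""
    Question: {input_text}
    Expected answer: {expected}
    Agent's answer: {output}
    Score the agent's answer from 0-10 based on:
    - Factual correctness (0-5 points)
    - Completeness (0-3 points)
    - Clarity (0-2 points)
    Return only the numeric score.
    """
    # call_openai is a placeholder for your own OpenAI wrapper; it returns the reply text.
    score = call_openai(model="gpt-4o-mini", prompt=judge_prompt)
    # float() raises ValueError if the judge returns anything but a bare number.
    value = float(score.strip())
    return min(max(value, 0.0), 10.0) / 10.0  # Clamp to 0-10, then normalize to 0-1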
LLM-as-judge costs more per evaluation (you're calling an LLM for every test case), but it handles nuanced tasks better. Production teams often use exact-match for structured outputs and LLM-as-judge for conversational responses. Using GPT-4o-mini as the judge instead of GPT-4 cuts evaluation costs sharply, since its per-token list price is a small fraction of GPT-4's, but confirm on your own data that the cheaper judge still scores reliably.
One gotcha: judge LLMs have biases. They tend to prefer longer answers and their own model family's outputs. Always validate your judge's scores against a small human-labeled subset before trusting it fully. And honestly, most teams skip this part.
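Validating the judge doesn't need much machinery. A small sketch, assuming both the judge and the human labelers score the same items on a 0-1 scale; `judge_agreement` and the sample scores are hypothetical:

```python
from statistics import mean


def judge_agreement(judge: list[float], human: list[float], tolerance: float = 0.2) -> dict:
    """Compare judge scores with human labels on the same items (both on a 0-1 scale)."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge, human)]
    return {
        "mean_abs_error": mean(abs(d) for d in diffs),
        # Positive bias means the judge is more generous than the humans.
        "bias": mean(diffs),
        "within_tolerance": mean(1.0 if abs(d) <= tolerance else 0.0 for d in diffs),
    }


# Illustrative scores for five items, not real measurements.
judge_scores = [0.9, 0.8, 0.7, 1.0, 0.6]
human_scores = [0.8, 0.8, 0.4, 1.0, 0.5]

report = judge_agreement(judge_scores, human_scores)
print(report)
```

A consistently positive bias is the signal to watch for, since it suggests the judge is rewarding something (length, style) that your human graders are not.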
Multi-Criteria Scoring Systems
Complex agents need multiple scoring dimensions. A code generation agent should be judged on syntax validity, test passage, efficiency, and readability. Create separate scoring functions for each criterion.
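def syntax_scorer(output):
    # 1.0 if the generated code parses as Python, else 0.0. Nothing is executed.
    try:
        compile(output, "<string>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

def test_passage_scorer(output, test_cases):
    # run_unit_tests is a placeholder that returns the number of passing tests.
    if not test_cases:
        return 0.0
    passed = run_unit_tests(output, test_cases)
    return passed / len(test_cases)

# llm_judge_readability is assumed to be an LLM-as-judge scorer defined elsewhere.
scoring_fns = [syntax_scorer, test_passage_scorer, llm_judge_readability]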
Langfuse records each criterion as its own score and shows breakdowns per criterion; if you also want a single composite number, compute it yourself as a weighted average. You might find your agent has 95% syntax correctness but only 60% test passage, pointing you toward specific improvements.
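If you want one headline number across criteria, a weighted average is the usual starting point. A minimal sketch, where `composite_score` and the weights are assumptions to tune for your own agent:

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, each on a 0-1 scale."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical weights: passing tests matters most, readability least.
weights = {"syntax": 0.3, "test_passage": 0.5, "readability": 0.2}
scores = {"syntax": 1.0, "test_passage": 0.6, "readability": 0.8}

composite = composite_score(scores, weights)
print(round(composite, 3))
```

Keep the per-criterion numbers next to the composite. A single score is handy for gating a release, but the breakdown is what tells you what to fix.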
Tracking Cost and Latency Metrics Per Model and Task
Performance isn't just accuracy. An agent that's 99% accurate but takes 30 seconds per query won't work for customer support. One that's fast but costs $2 per completion will bankrupt your startup.
Track tokens consumed per task (input plus output) and multiply by your model's pricing. At GPT-4's original list pricing, that's $0.03 per 1K input tokens and $0.06 per 1K output tokens. A task using 2K input and 500 output tokens costs (2 × $0.03) + (0.5 × $0.06) = $0.09.
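import time

from langfuse import get_client, observe

langfuse = get_client()

@observe()
def tracked_agent(input_text):
    start_time = time.perf_counter()
    response = call_llm(input_text)  # placeholder for your own LLM call
    latency = time.perf_counter() - start_time
    tokens_in = count_tokens(input_text)  # placeholder tokenizer helper
    tokens_out = count_tokens(response)
    # GPT-4 list prices: $0.03 per 1K input tokens, $0.06 per 1K output tokens.
    cost = (tokens_in * 0.03 + tokens_out * 0.06) / 1000
    # Attach the numbers to the current span as metadata (Langfuse SDK v3 call;
    # verify against your installed version).
    langfuse.update_current_span(metadata={
        "latency_seconds": latency,
        "tokens_total": tokens_in + tokens_out,
        "cost_usd": cost,
    })
    return response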
Run this across your test suite and you'll see aggregate metrics: average cost per task, P95 latency, total tokens for 100 runs. This data drives decisions like "switching to Claude Haiku saves $450/month with acceptable accuracy loss."
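If you're computing these aggregates yourself rather than reading them off a dashboard, the standard library is enough. The sketch below assumes each task produced a record with the three fields logged above; `percentile` uses the nearest-rank definition, which is one of several conventions, and the records are synthetic.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(runs: list[dict]) -> dict:
    """Summarize per-task records that carry latency_seconds, tokens_total and cost_usd."""
    latencies = [r["latency_seconds"] for r in runs]
    return {
        "avg_cost_usd": mean(r["cost_usd"] for r in runs),
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "total_tokens": sum(r["tokens_total"] for r in runs),
    }


# Twenty synthetic records: latencies 0.1s .. 2.0s, constant cost and token count.
runs = [
    {"latency_seconds": i / 10, "tokens_total": 2500, "cost_usd": 0.09}
    for i in range(1, 21)
]
stats = aggregate(runs)
print(stats)
```

With only 50 to 100 test cases, P95 rests on a handful of data points, so treat small movements in it as noise.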
For agents that make multiple LLM calls per task, track metrics at both the individual call level and the full task level. You might discover that 80% of cost comes from one specific tool-calling step that you can optimize or cache. Check out how to reduce AI costs without sacrificing quality for specific optimization techniques.
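Finding the expensive step is a group-by over your call log. A sketch under the assumption that every LLM call is logged with a step name and a cost; `cost_by_step`, the step names, and the costs are all made up for illustration:

```python
from collections import defaultdict


def cost_by_step(calls: list[dict]) -> dict[str, float]:
    """Share of total cost attributed to each named step, largest first."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call["step"]] += call["cost_usd"]
    grand_total = sum(totals.values())
    shares = {step: cost / grand_total for step, cost in totals.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical call log for two tasks; step names and costs are made up.
calls = [
    {"task_id": 1, "step": "plan", "cost_usd": 0.01},
    {"task_id": 1, "step": "search_tool", "cost_usd": 0.08},
    {"task_id": 1, "step": "answer", "cost_usd": 0.01},
    {"task_id": 2, "step": "plan", "cost_usd": 0.01},
    {"task_id": 2, "step": "search_tool", "cost_usd": 0.08},
    {"task_id": 2, "step": "answer", "cost_usd": 0.01},
]

shares = cost_by_step(calls)
print(shares)
```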
How to Prove Your AI Agent Works in Interviews
When an interviewer asks "how do you validate your agent?", structure your answer around three layers: unit-level validation, integration testing, and production monitoring.
Start with unit-level: "I test individual components like the prompt template, tool-calling logic, and output parsing separately. For the prompt, I run it against 50 examples covering common cases and edge cases, using both exact-match scoring for structured outputs and LLM-as-judge for free-form responses."
Move to integration: "I test the full agent workflow end-to-end using tools like Langfuse's run_experiment function. This runs the complete agent against a golden dataset and tracks accuracy, cost, and latency. I maintain a leaderboard comparing different model configurations, so I can prove that Claude 3.5 Sonnet performs 8% better than GPT-4 on our specific tasks while costing 35% less."
Finish with production: "In production, I log every agent interaction with full traces showing reasoning steps, tool calls, and final outputs. I track success metrics like task completion rate and user satisfaction scores, and I set up alerts for anomalies like sudden accuracy drops or cost spikes."
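If the interviewer pushes on what "alerts for anomalies" means, it helps to have a concrete mechanism in mind. One simple option, sketched here with a hypothetical `RollingAlert` class, compares a rolling average against a fixed baseline; real monitoring stacks offer more sophisticated detectors.

```python
from collections import deque
from statistics import mean


class RollingAlert:
    """Flag a metric when the recent window drifts too far from a fixed baseline."""

    def __init__(self, baseline: float, max_relative_change: float, window: int = 50):
        self.baseline = baseline
        self.max_relative_change = max_relative_change
        self.values = deque(maxlen=window)

    def add(self, value: float) -> bool:
        """Record one observation; return True if the window average is out of bounds."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet
        change = abs(mean(self.values) - self.baseline) / self.baseline
        return change > self.max_relative_change


# Illustrative cost-per-completion stream: steady at first, then a spike.
cost_alert = RollingAlert(baseline=0.12, max_relative_change=0.25, window=5)
normal = [cost_alert.add(c) for c in [0.12, 0.11, 0.13, 0.12, 0.12]]
spike = [cost_alert.add(c) for c in [0.30, 0.30, 0.30]]
print(normal, spike)
```

The same class works for accuracy drops: feed it per-request success (1.0 or 0.0) with the test-suite accuracy as the baseline.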
Mention specific numbers. "Our current agent has 94% accuracy on the test suite, costs $0.12 per completion, and has a P95 latency under 3 seconds." This shows you think quantitatively about AI systems.
If you've worked with human-in-the-loop AI agents, mention how you test the handoff logic between agent and human, since that's a common failure point.
Common Agent Failure Modes to Test For
Include test cases that specifically target known failure modes. Agents often fail on: inputs outside their training distribution, multi-step tasks where they skip steps, questions requiring recent information they don't have, and adversarial inputs designed to confuse them.
Your test suite should include examples of each. For out-of-distribution inputs, add questions about topics your agent shouldn't handle and verify it correctly declines or escalates. For multi-step tasks, check that all intermediate steps appear in the reasoning trace.
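Both checks are easy to automate. The sketch below assumes your agent exposes its reasoning trace as a list of step names and its final decision as a label; `steps_in_order`, `declined`, and the step names are hypothetical.

```python
def steps_in_order(trace: list[str], required: list[str]) -> bool:
    """True if every required step appears in the trace, in the given order.

    Other steps may be interleaved; this is a subsequence check.
    """
    remaining = iter(trace)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in required)


def declined(
    output: str,
    decline_labels: frozenset[str] = frozenset({"out_of_scope", "escalate"}),
) -> bool:
    """True if the agent's final label is one of the accepted 'not my job' outcomes."""
    return output.strip().lower() in decline_labels


required = ["lookup_order", "check_policy", "issue_refund"]
good_trace = ["greet", "lookup_order", "check_policy", "issue_refund"]
skipped_trace = ["greet", "lookup_order", "issue_refund"]

print(steps_in_order(good_trace, required))     # all three steps present, in order
print(steps_in_order(skipped_trace, required))  # check_policy was skipped
print(declined("out_of_scope"))
```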
Tool-calling agents have specific failure modes like calling the wrong tool, passing malformed arguments, or ignoring tool outputs. Create test cases where the correct solution requires specific tool sequences and verify your agent executes them. Roughly 30% of agent failures in production stem from tool-calling issues that could've been caught with better test coverage.
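Wrong-tool and malformed-argument failures can be caught with a plain validator before any LLM judge gets involved. A sketch, assuming tool calls are captured as dicts with `name` and `arguments` keys; `validate_tool_call` and the `TOOL_SPECS` registry are hypothetical, and a real setup might validate against JSON Schema instead.

```python
def validate_tool_call(call: dict, tool_specs: dict[str, dict[str, type]]) -> list[str]:
    """Return a list of problems with one tool call; an empty list means it looks valid."""
    name, args = call.get("name"), call.get("arguments", {})
    if name not in tool_specs:
        return [f"unknown tool: {name}"]
    spec = tool_specs[name]
    problems = [f"missing argument: {key}" for key in spec if key not in args]
    problems += [f"unexpected argument: {key}" for key in args if key not in spec]
    problems += [
        f"wrong type for {key}: expected {spec[key].__name__}"
        for key in args
        if key in spec and not isinstance(args[key], spec[key])
    ]
    return problems


# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SPECS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount": float},
}

ok = validate_tool_call({"name": "lookup_order", "arguments": {"order_id": "A17"}}, TOOL_SPECS)
bad = validate_tool_call({"name": "issue_refund", "arguments": {"order_id": 17}}, TOOL_SPECS)
print(ok, bad)
```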
AI Agent Testing Tools and Frameworks Beyond Langfuse
Langfuse isn't the only option. LangSmith (from LangChain) provides similar experiment tracking with tighter integration if you're already using LangChain. Weights & Biases added LLM evaluation features that work well if you're tracking ML experiments there already.
For custom setups, you can build evaluation scripts using Pandas for data handling, Pytest for test structure, and simple CSV or SQLite for storing results. This gives you full control but requires more setup time.
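import pandas as pd

# Run with pytest, which discovers functions named test_*.
def test_agent_accuracy():
    # my_agent (text in, text out) and exact_match_scorer (output, expected -> float)
    # are the plain functions defined earlier; test_dataset is your golden dataset.
    results = []
    for case in test_dataset:
        output = my_agent(case['input'])
        score = exact_match_scorer(output, case['expected'])
        results.append({'input': case['input'], 'score': score})
    df = pd.DataFrame(results)
    # Save before asserting so the per-case results survive a failing run.
    df.to_csv('eval_results.csv', index=False)
    accuracy = df['score'].mean()
    assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} below threshold"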
This approach integrates with CI/CD pipelines easily. You can run it on every pull request and block merges if accuracy drops below your threshold.
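A fixed threshold catches big failures; comparing against the previous run catches slow erosion. Here's a rough sketch of the SQLite option using only the standard library. The table layout and the `record_run` and `regressed` helpers are assumptions, and the accuracy figures are illustrative.

```python
import sqlite3


def record_run(conn: sqlite3.Connection, run_name: str, accuracy: float, cost_usd: float) -> None:
    """Append one evaluation run to the history table, creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_runs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, run_name TEXT, accuracy REAL, cost_usd REAL)"
    )
    conn.execute(
        "INSERT INTO eval_runs (run_name, accuracy, cost_usd) VALUES (?, ?, ?)",
        (run_name, accuracy, cost_usd),
    )
    conn.commit()


def regressed(conn: sqlite3.Connection, tolerance: float = 0.02) -> bool:
    """True if the latest run's accuracy fell more than `tolerance` below the previous run's."""
    rows = conn.execute(
        "SELECT accuracy FROM eval_runs ORDER BY id DESC LIMIT 2"
    ).fetchall()
    if len(rows) < 2:
        return False  # nothing to compare against yet
    latest, previous = rows[0][0], rows[1][0]
    return previous - latest > tolerance


conn = sqlite3.connect(":memory:")  # use a file path to keep history between CI runs
record_run(conn, "prompt_v1", accuracy=0.94, cost_usd=0.12)
record_run(conn, "prompt_v2", accuracy=0.89, cost_usd=0.10)
print(regressed(conn))
```

In CI you would point the connection at a file stored as a build artifact and fail the job when `regressed` returns True.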
OpenAI's Evals framework is open-source and well-documented, though it's optimized for OpenAI models. Anthropic provides similar tooling for Claude. Both work fine but lock you into their ecosystems somewhat.
Look, the best choice depends on your stack. If you're model-agnostic and need detailed traces, Langfuse wins. If you're all-in on LangChain, use LangSmith. If you want minimal dependencies and full control, build custom scripts.
Testing AI agents isn't optional anymore. Production teams expect quantifiable proof that agents work reliably, cost-effectively, and safely. Build your evaluation framework early, run it on every change, and track metrics over time. You'll catch bugs faster, make better model decisions, and have concrete answers when someone asks "how do you know this works?" The difference between a demo and a production system is exactly this testing infrastructure, and it's what separates junior developers from engineers who ship reliable AI products.