Testing AI agents before deployment requires a 12-metric evaluation framework split into four categories: retrieval (context relevance, recall, precision, latency), generation (faithfulness, relevance, hallucination rate), agent behavior (tool selection, execution success, multi-step coherence), and production readiness (cost per query, P99 latency). The key is phased sequencing: weeks 0-2 focus on retrieval and faithfulness to catch launch blockers, weeks 3-6 target hallucination and tool selection under real traffic, weeks 7+ optimize cost and latency. This approach catches expensive failures early while deferring optimization until you've got production data to guide it.
What Is an AI Agent Evaluation Framework
An AI agent evaluation framework is a structured testing system that measures how well your agent performs across retrieval, generation, reasoning, and production metrics before you expose it to users. Unlike traditional software testing, AI agents fail in subtle ways: they retrieve irrelevant context, hallucinate facts, select wrong tools, rack up costs silently.
The framework breaks down into 12 specific metrics across four categories. Each metric answers a question about agent reliability. Does it find the right information? Does it generate accurate responses? Does it use tools correctly? Can you afford to run it at scale?
Most AI agent failures stem from skipping this evaluation step, not from choosing the wrong model. A GPT-4 agent with poor retrieval will underperform a GPT-3.5 agent with clean context every time.
Why Evaluation Matters More Than Model Selection
Roughly 60% of AI agent projects that reach production fail within the first three months, and the cause is usually accuracy or cost, not the capabilities of the underlying LLM. Teams deploy agents without systematic testing, then discover hallucinations, tool errors, or $10,000 monthly bills after users start complaining.
Evaluation catches these issues in controlled conditions. If your agent hallucinates 15% of the time in testing with 100 queries, expect roughly the same rate in production with 10,000 queries. Better to find out early with a $50 test run than after users lose trust.
The phased approach matters because not all metrics are equally urgent. A 2-second latency issue won't kill your launch, but a 30% hallucination rate will. You need to sequence evaluation to catch blockers first, then optimize performance once you've got real usage patterns. If you're building governance around these systems, check out how to implement governance for AI agents in workflows for policy frameworks that complement technical evaluation.
The 12-Metric Evaluation Framework
Retrieval Metrics for RAG-Based Agents
If your agent uses retrieval-augmented generation (RAG), these four metrics measure whether it's finding the right information to answer queries. Poor retrieval is the most common cause of agent failure, so test this first.
Context relevance measures what percentage of retrieved chunks actually relate to the query. Calculate it by having an LLM judge score each chunk's relevance on a 0-1 scale, then average across your test set. You want 0.85 or higher. Anything below 0.7 means your retrieval system is polluting context with noise.
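As a minimal sketch, assuming an llm_judge callable that returns a 0-1 relevance score for a query-chunk pair (any LLM client works behind it), the averaging looks like this:

def context_relevance(query, chunks, llm_judge):
    # Score each retrieved chunk independently, then average across chunks.
    scores = [llm_judge(query, chunk) for chunk in chunks]
    return sum(scores) / len(scores) if scores else 0.0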
Recall measures whether your agent finds all the relevant information available. Build a test set where you know which documents contain the answer, then check if your retrieval system surfaces them. Target 90%+ recall for critical queries.
Precision measures what percentage of retrieved chunks are actually useful. If you retrieve 10 chunks and only 3 are relevant, your precision is 30%. High precision (80%+) keeps context windows clean and reduces costs.
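Both metrics fall out of the same labeled test set. A sketch, assuming each test case pairs a query with the set of document IDs known to be relevant (ground truth you build by hand):

def retrieval_recall(relevant_ids, retrieved_ids):
    # Fraction of known-relevant documents the system actually surfaced.
    hits = len(set(relevant_ids) & set(retrieved_ids))
    return hits / len(relevant_ids) if relevant_ids else 1.0

def retrieval_precision(relevant_ids, retrieved_ids):
    # Fraction of retrieved documents that are actually relevant.
    hits = len(set(relevant_ids) & set(retrieved_ids))
    return hits / len(retrieved_ids) if retrieved_ids else 0.0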
Retrieval latency tracks how long it takes to fetch and rank chunks. Measure P50, P95, and P99 latency separately. For most applications, P99 retrieval latency should stay under 500ms to keep total response time acceptable.
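If you're collecting per-query timings, Python's standard library computes the percentiles directly. A sketch over a list of retrieval latencies in milliseconds (the same helper works for the end-to-end latencies discussed later):

import statistics

def latency_percentiles(latencies_ms):
    # quantiles(n=100) returns the 1st through 99th percentile cut points.
    q = statistics.quantiles(latencies_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}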
Generation Metrics for Output Quality
Once your agent retrieves information, these metrics measure whether it generates accurate, relevant responses without making things up.
Faithfulness measures whether the agent's response is supported by the retrieved context. Use an LLM-as-judge approach: break the response into individual claims, then check if each claim has evidence in the context. Calculate faithfulness as the percentage of supported claims. You need 95%+ faithfulness for production use, especially in domains like healthcare, legal, or finance where accuracy is critical.
Answer relevance measures whether the response actually addresses the user's query. An agent can be perfectly faithful to context but still answer the wrong question. Use semantic similarity between query and response, or have an LLM judge score relevance on a 1-5 scale. Target an average score of 4.0 or higher.
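For the semantic-similarity variant, a sketch assuming an embed function that maps a string to a vector (any sentence-embedding model works):

import numpy as np

def answer_relevance(query, response, embed):
    # Cosine similarity between query and response embeddings, used as a
    # rough proxy for whether the answer is on-topic.
    q, r = np.asarray(embed(query)), np.asarray(embed(response))
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))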
Hallucination rate is the percentage of responses containing fabricated information not present in the context or training data. Measure this by spot-checking claims against source documents or using automated fact-checking. A hallucination rate above 5% makes your agent unreliable for most business applications. For more on measurement, see the dedicated hallucination section below.
Agent Behavior Metrics for Multi-Step Workflows
If your agent uses tools or executes multi-step reasoning, these metrics measure whether it's making good decisions about what to do and when.
Tool selection accuracy measures whether the agent picks the right tool for each task. Build a test set of queries where you know which tool(s) should be called, then compare the agent's choices. You want 90%+ accuracy here. If your agent calls a database query tool when it should search documentation, users get wrong answers even if the tools themselves work perfectly.
Tool execution success rate tracks how often tool calls complete without errors. This catches issues like malformed API calls, missing parameters, permission problems. Target 95%+ success rate. Track failures by tool type to identify which integrations need work.
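Both tool metrics come from the same labeled call log. A sketch, assuming each logged call records the hand-labeled expected tool, the tool actually called, and any error:

from collections import defaultdict

def tool_metrics(calls):
    # calls: [{"expected": ..., "actual": ..., "error": None or str}, ...]
    correct = sum(1 for c in calls if c["actual"] == c["expected"])
    failures = defaultdict(int)
    for c in calls:
        if c["error"] is not None:
            failures[c["actual"]] += 1  # group failures by tool type
    return {
        "selection_accuracy": correct / len(calls),
        "execution_success_rate": 1 - sum(failures.values()) / len(calls),
        "failures_by_tool": dict(failures),
    }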
Multi-step coherence measures whether the agent maintains logical flow across multiple reasoning steps. For agents that chain tool calls or break problems into subtasks, manually review 50-100 traces to check if the sequence makes sense. This is harder to automate but critical for complex workflows. If you're building multi-agent systems, using AI agents as a team has additional coordination metrics to track.
Production Readiness Metrics
These two metrics determine whether you can actually afford to run your agent at scale and whether performance meets user expectations.
Cost per query tracks the total expense of processing one user request, including LLM API calls, retrieval compute, tool execution, infrastructure. Break this down by component to identify optimization opportunities. For most business applications, you need cost per query under $0.10 to be economically viable at scale. If you're trying to understand overall AI costs, what AI costs a 50-person company provides broader context.
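A sketch of the per-component breakdown, with placeholder per-1K-token rates you should replace with your provider's actual pricing:

PRICE_PER_1K = {"embedding": 0.0001, "llm_input": 0.01, "llm_output": 0.03}

def cost_per_query(embed_tokens, input_tokens, output_tokens, tool_cost=0.0):
    # Returns a component-level breakdown so you can see where money goes.
    return {
        "retrieval": embed_tokens / 1000 * PRICE_PER_1K["embedding"],
        "generation": input_tokens / 1000 * PRICE_PER_1K["llm_input"]
                      + output_tokens / 1000 * PRICE_PER_1K["llm_output"],
        "tools": tool_cost,
    }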
P99 latency is the time within which 99% of queries complete. It matters more than average latency because users remember the slow experiences. Target P99 under 5 seconds for interactive applications, under 10 seconds for background tasks. Measure this under realistic load, not just single-query tests.
How to Test AI Agents Before Launch: Phased Evaluation Timeline
Weeks 0-2: Focus on Launch Blockers
Start with the metrics that'll kill your launch if they're broken: retrieval quality and faithfulness. Build a test set of 100-200 representative queries covering your expected use cases. Make sure to include edge cases and questions you know are tricky.
Run your agent against this test set and measure context relevance, recall, precision, faithfulness. If context relevance is below 0.7 or faithfulness is below 0.9, don't move forward. Fix your retrieval system or prompt engineering first. These problems only get worse with real users.
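Encoding those thresholds as a hard gate keeps the decision honest. A sketch you could wire into CI so a regression blocks the release:

def launch_gate(metrics):
    # metrics: dict of averaged scores from the weeks 0-2 test set.
    blockers = []
    if metrics["context_relevance"] < 0.7:
        blockers.append("context relevance below 0.7")
    if metrics["faithfulness"] < 0.9:
        blockers.append("faithfulness below 0.9")
    return blockers  # an empty list means clear to proceed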
Use tools like Ragas, TruLens, or Phoenix for automated evaluation. They provide LLM-as-judge scoring for most retrieval and generation metrics. You can also build custom evaluators with simple prompts:
def evaluate_faithfulness(query, context, response):
    # Ask an LLM judge to break the response into claims and verify each
    # one against the retrieved context.
    prompt = f"""
    Query: {query}
    Context: {context}
    Response: {response}

    Break the response into individual claims. For each claim,
    check if it's supported by the context. Return only the
    percentage of supported claims as a bare number between 0 and 100.
    """
    score = llm.generate(prompt)  # `llm` stands in for your LLM client
    return float(score.strip().rstrip("%"))  # tolerate "95%" style output
This phase should take two weeks maximum. If you're still fixing retrieval after that, you probably need to rethink your architecture.
Weeks 3-6: Real Traffic Issues
Once retrieval and faithfulness pass, deploy to a small group of internal users or a limited beta. Now measure hallucination rate, tool selection accuracy, tool execution success under real usage patterns.
Real queries will surprise you. Users ask questions you didn't anticipate, phrase things differently, combine requests in unexpected ways. Your test set from weeks 0-2 won't cover everything.
Set up logging to capture every query, retrieval result, tool call, response. Sample 10% of interactions for manual review. Look for patterns in failures: are certain query types triggering hallucinations? Do specific tools fail more often? Is the agent misunderstanding user intent?
Track tool selection accuracy by having humans label what the agent should've done, then comparing to what it actually did. If accuracy drops below 85%, you need better tool descriptions or few-shot examples in your prompt.
This phase takes 3-4 weeks because you need volume to see patterns. Don't rush it. The goal is to catch the issues that only appear with diverse real-world use, and honestly, most teams skip this part.
Weeks 7+: Cost and Latency Optimization
Once your agent works reliably, optimize for efficiency. Now you've got real usage data showing which queries are expensive, which tools get called most often, where latency bottlenecks appear.
Measure cost per query by component. Often 80% of costs come from 20% of operations. Maybe you're retrieving too many chunks, using expensive models for simple tasks, making redundant tool calls. Profile your top 100 most expensive queries to find optimization targets.
For latency, measure P50, P95, P99 separately. P50 tells you typical performance. P99 tells you worst-case experience. If P99 is more than 3x your P50, you've got a consistency problem. Look for caching opportunities, parallel execution of independent steps, or switching to faster models for latency-sensitive paths.
Common optimizations that reduce cost by 40-60%: caching retrieval results for common queries, using GPT-3.5 for simple tasks and GPT-4 only for complex reasoning, reducing chunk count from 10 to 5 without hurting relevance, batching tool calls when possible.
AI Agent Hallucination Rate Measurement
Hallucination rate deserves special attention because it's the metric users care about most. A single confidently-stated falsehood destroys trust faster than ten slow responses.
Measure hallucination rate by sampling 200-300 responses and fact-checking claims against your source documents. Break each response into atomic statements, then verify each one. Calculate hallucination rate as (false statements / total statements) × 100.
For automated detection, use an LLM-as-judge approach with a specific prompt:
def detect_hallucination(context, response):
    # Ask an LLM judge to flag statements with no support in the context.
    prompt = f"""
    Context: {context}
    Response: {response}

    List any statements in the response that are not supported by
    the context. For each unsupported statement, explain why it's
    not supported. Return 'NONE' if all statements are supported.
    """
    result = llm.generate(prompt)  # `llm` stands in for your LLM client
    # Normalize before comparing; models often add whitespace or punctuation.
    return result.strip().upper() != "NONE"
Target a hallucination rate under 3% for production. A rate between 3% and 5% is acceptable for low-stakes applications. Above 5% requires immediate fixes: better prompts, stricter faithfulness requirements, adding a verification step before returning responses.
Track hallucination rate by query type. Often agents hallucinate more on certain topics or when they lack relevant context. This tells you where to improve retrieval or add explicit "I don't know" handling.
Cost Per Query Optimization for AI Agents
Production AI agents can easily cost $0.50-$1.00 per query without optimization. At 10,000 queries per month, that's $5,000-$10,000. Most businesses can't sustain that for long.
Start by instrumenting every component: retrieval embedding calls, vector database queries, LLM API calls (by model and token count), tool execution costs. Break down a typical query to see where money goes.
Common cost breakdown for a RAG agent: 20% retrieval (embeddings + vector search), 70% LLM generation, 10% tool calls. That 70% LLM spend is your optimization target.
Reduce LLM costs by using smaller models where possible. GPT-4 costs roughly 10x more than GPT-3.5 per token. If 60% of your queries are simple lookups that don't need complex reasoning, route them to GPT-3.5. Use a classifier or simple rules to decide which model to use.
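A sketch of the rules-based version; the keyword list and model names are placeholders, and a small trained classifier does better at scale:

COMPLEX_HINTS = ("compare", "analyze", "explain why", "step by step")

def pick_model(query):
    # Reserve the expensive model for queries that look like reasoning tasks.
    if len(query) > 300 or any(h in query.lower() for h in COMPLEX_HINTS):
        return "gpt-4"
    return "gpt-3.5-turbo"  # simple lookups go to the cheaper model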
Reduce token usage by tightening prompts and limiting retrieved context. If you're passing 10 chunks at 200 tokens each, that's 2,000 input tokens per query. Cutting to 5 chunks saves 1,000 tokens. At $0.01 per 1K tokens, that's $0.01 per query, or $100 per 10K queries.
Cache aggressively. If 30% of queries are variations of the same question, cache responses and retrieval results. A cache hit costs nearly nothing compared to a full agent execution.
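A sketch of an exact-match response cache keyed on the normalized query; a production version would add a TTL and embedding-based matching to catch near-duplicate phrasings:

import hashlib

cache = {}

def cached_answer(query, run_agent):
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = run_agent(query)  # full agent execution only on a miss
    return cache[key]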
Set a cost budget per query and monitor it in production. Alert when queries exceed the budget so you can investigate expensive outliers. Often a single poorly-formed query or infinite loop in tool calls will blow your budget.
Look, testing AI agents systematically before deployment isn't optional anymore. The 12-metric framework gives you a concrete checklist to catch failures early when they're cheap to fix, then optimize performance once you understand real usage patterns. Start with retrieval and faithfulness in weeks 0-2 to prevent launch disasters, expand to agent behavior in weeks 3-6 to handle real-world complexity, optimize cost and latency in weeks 7+ once you've got data to guide decisions. The agents that succeed in production aren't necessarily the ones with the fanciest models. They're the ones that were tested rigorously against specific, measurable criteria before users ever saw them.