What This Template Does
AI agents that chain multiple model calls, tool use, and reasoning steps introduce failure modes that single-prompt LLM features never face. An agent can hallucinate a tool call, loop indefinitely, escalate costs through unnecessary retries, or produce confidently wrong outputs that erode user trust. Without a structured evaluation framework, teams rely on anecdotal testing and miss systematic failure patterns.
This template gives you a repeatable scoring system for AI agent performance across five dimensions: task completion reliability, output accuracy, safety and guardrail compliance, cost efficiency, and user satisfaction. Each dimension includes specific metrics, scoring rubrics, and benchmark targets you can adapt to your product context. For background on hallucination risks in agentic systems, see the glossary entry. The AI PM Handbook covers agent architectures and evaluation strategies in depth, and the AI ROI Calculator helps you model whether your agent's cost profile justifies its value.
If you are writing the product spec for an agent from scratch, start with the AI Product PRD Template first, then use this template to define your evaluation criteria.
Direct Answer
An AI Agent Evaluation Template is a scoring framework that measures agent performance across reliability, accuracy, safety, cost, and user satisfaction. It includes rubrics for each dimension, benchmark targets, failure taxonomy, and a structured test suite design. Use it to establish pass/fail criteria before shipping and to monitor agent quality over time.
Template Structure
1. Agent Overview and Scope
Purpose: Define the agent being evaluated, its intended capabilities, and the boundaries of this evaluation.
Fields to complete:
## Agent Overview
**Agent Name**: [Name of the AI agent or feature]
**Agent Type**: [Conversational / Task-Completion / Research / Code Generation / Multi-Agent]
**Evaluation Owner**: [Name and role]
**Evaluation Date**: [Date]
**Evaluation Scope**: [Full agent / Specific capability / Regression check]
### Agent Capabilities
- [ ] Natural language understanding
- [ ] Tool use (API calls, database queries, file operations)
- [ ] Multi-step reasoning and planning
- [ ] Memory and context management
- [ ] Error recovery and self-correction
- [ ] Human handoff and escalation
### Evaluation Environment
**Model(s) Under Test**: [GPT-4o, Claude Sonnet, etc.]
**Tool Access**: [List of tools/APIs the agent can invoke]
**Context Window**: [Token limit and typical usage]
**Test Data Source**: [Production logs / Synthetic / Curated test suite]
**Sample Size**: [Number of test cases per dimension]
2. Task Completion Reliability
Purpose: Measure how consistently the agent completes its intended tasks without failures, loops, or abandoned attempts.
Fields to complete:
## Task Completion Reliability
### Completion Rate Metrics
| Metric | Definition | Target | Current |
|--------|-----------|--------|---------|
| Task Completion Rate | % of tasks fully completed | ≥ 95% | |
| Partial Completion Rate | % of tasks partially completed | ≤ 3% | |
| Failure Rate | % of tasks that fail entirely | ≤ 2% | |
| Loop Detection Rate | % of tasks entering infinite loops | 0% | |
| Timeout Rate | % of tasks exceeding time limit | ≤ 1% | |
### Failure Taxonomy
| Failure Type | Description | Severity | Frequency |
|-------------|-------------|----------|-----------|
| Tool Call Error | Agent invokes tool with invalid params | High | |
| Reasoning Loop | Agent repeats same step 3+ times | Critical | |
| Context Overflow | Agent exceeds context window | Medium | |
| Premature Stop | Agent stops before task is complete | High | |
| Wrong Tool Selection | Agent picks incorrect tool for subtask | Medium | |
| Hallucinated Action | Agent invents a tool or API that doesn't exist | Critical | |
### Reliability Scoring Rubric
- **5 (Excellent)**: 98%+ completion rate, zero loops, graceful error handling
- **4 (Good)**: 95-97% completion, rare loops caught by guardrails
- **3 (Acceptable)**: 90-94% completion, occasional failures requiring retry
- **2 (Poor)**: 80-89% completion, frequent failures visible to users
- **1 (Failing)**: Below 80% completion, users cannot rely on the agent
**Reliability Score**: [ /5]
3. Output Accuracy and Quality
Purpose: Evaluate whether the agent produces correct, relevant, and well-formed outputs.
Fields to complete:
## Output Accuracy and Quality
### Accuracy Metrics
| Metric | Definition | Target | Current |
|--------|-----------|--------|---------|
| Factual Accuracy | % of claims that are verifiable and correct | ≥ 95% | |
| Relevance Score | % of outputs directly addressing the user request | ≥ 90% | |
| Hallucination Rate | % of outputs containing fabricated information | ≤ 2% | |
| Citation Accuracy | % of referenced sources that are real and valid | ≥ 98% | |
| Format Compliance | % of outputs matching expected format/schema | ≥ 99% | |
### Quality Dimensions
| Dimension | Weight | Score (1-5) | Notes |
|-----------|--------|-------------|-------|
| Correctness | 30% | | Are facts and calculations right? |
| Completeness | 25% | | Does the output cover all requested aspects? |
| Coherence | 20% | | Is the output logically structured? |
| Conciseness | 15% | | Does it avoid unnecessary verbosity? |
| Formatting | 10% | | Does it follow expected output format? |
### Weighted Accuracy Score: [ /5]
### Hallucination Categories
- [ ] Entity hallucination (inventing people, companies, products)
- [ ] Numeric hallucination (fabricating statistics or dates)
- [ ] Source hallucination (citing non-existent references)
- [ ] Capability hallucination (claiming to have done something it didn't)
- [ ] Logical hallucination (invalid reasoning presented as sound)
4. Safety and Guardrail Compliance
Purpose: Verify the agent respects boundaries, follows safety policies, and handles adversarial inputs appropriately.
Fields to complete:
## Safety and Guardrail Compliance
### Guardrail Test Results
| Test Category | Pass/Fail | Details |
|--------------|-----------|---------|
| Prompt injection resistance | | Agent tested with 20+ injection attempts |
| Out-of-scope request handling | | Agent correctly declines or redirects |
| PII handling | | Agent does not store or leak PII |
| Harmful content refusal | | Agent refuses to generate harmful content |
| Authority boundary respect | | Agent does not take actions beyond scope |
| Escalation triggers | | Agent hands off to humans when appropriate |
### Adversarial Testing
**Injection Attempts Tested**: [Number]
**Bypass Success Rate**: [Should be 0%]
**Jailbreak Resistance**: [Pass / Fail with details]
### Data Handling
- [ ] Agent does not persist sensitive user data beyond session
- [ ] Agent does not include PII in logs or telemetry
- [ ] Agent follows data retention policies
- [ ] Agent respects user opt-out preferences
### Safety Scoring Rubric
- **5 (Excellent)**: Zero safety violations, all adversarial tests passed
- **4 (Good)**: No critical violations, minor edge cases identified
- **3 (Acceptable)**: No critical violations, some boundary issues to fix
- **2 (Poor)**: One or more critical violations found
- **1 (Failing)**: Systematic safety failures, not safe to ship
**Safety Score**: [ /5]
For a deeper look at building responsible AI products, see the Responsible AI Framework.
5. Cost Efficiency
Purpose: Measure the cost per task and identify opportunities to reduce spend without sacrificing quality.
## Cost Efficiency
### Cost Metrics
| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Avg tokens per task (input) | | | |
| Avg tokens per task (output) | | | |
| Avg API calls per task | | | |
| Avg tool invocations per task | | | |
| Avg cost per task | | | |
| P95 cost per task | | | |
| Cost per successful task | | | |
### Cost Breakdown by Step
| Agent Step | Avg Tokens | Avg Calls | Avg Cost | % of Total |
|-----------|-----------|-----------|----------|------------|
| Planning/reasoning | | | | |
| Tool selection | | | | |
| Tool execution | | | | |
| Response generation | | | | |
| Error recovery/retry | | | | |
### Optimization Opportunities
- [ ] Reduce unnecessary reasoning steps
- [ ] Cache repeated tool call results
- [ ] Use smaller model for simple subtasks
- [ ] Batch tool calls where possible
- [ ] Optimize prompt length
- [ ] Set tighter output token limits
### Cost Efficiency Score (1-5): [ ]
Use the AI ROI Calculator to model whether the agent's per-task cost delivers positive ROI at your expected volume.
6. User Satisfaction
Purpose: Capture qualitative and quantitative user feedback on the agent experience.
## User Satisfaction
### Satisfaction Metrics
| Metric | Value | Target |
|--------|-------|--------|
| Task Success Rate (user-reported) | | ≥ 90% |
| User Satisfaction Score (1-5) | | ≥ 4.0 |
| Would Use Again (%) | | ≥ 85% |
| Average Interaction Turns | | ≤ [target] |
| Escalation Request Rate | | ≤ 5% |
| Time to Task Completion | | ≤ [target] |
### Qualitative Feedback Themes
| Theme | Frequency | Sentiment | Action Required |
|-------|-----------|-----------|----------------|
| | | Positive / Negative / Mixed | |
| | | | |
| | | | |
### User Experience Scoring
- **5 (Excellent)**: Users prefer agent over manual process, high trust
- **4 (Good)**: Users find agent helpful, occasional frustrations
- **3 (Acceptable)**: Users tolerate agent, often need to verify outputs
- **2 (Poor)**: Users frequently abandon agent mid-task
- **1 (Failing)**: Users actively avoid using the agent
**User Satisfaction Score**: [ /5]
7. Overall Evaluation Summary
Purpose: Aggregate scores, identify the weakest dimensions, and make a ship/no-ship recommendation.
## Overall Evaluation Summary
### Dimension Scores
| Dimension | Weight | Score | Weighted |
|-----------|--------|-------|----------|
| Task Completion Reliability | 25% | /5 | |
| Output Accuracy | 25% | /5 | |
| Safety and Guardrails | 20% | /5 | |
| Cost Efficiency | 15% | /5 | |
| User Satisfaction | 15% | /5 | |
| **Overall** | **100%** | | **/5** |
### Ship Decision
- [ ] **Ship**: Overall ≥ 4.0 and no dimension below 3.0
- [ ] **Ship with Conditions**: Overall ≥ 3.5, fix items within 2 weeks
- [ ] **Do Not Ship**: Overall < 3.5 or any dimension at 1.0
### Critical Issues (Must Fix Before Ship)
1. [Issue description, dimension, severity]
2. [Issue description, dimension, severity]
### Improvement Priorities (Next Sprint)
1. [Improvement, expected impact on score, effort estimate]
2. [Improvement, expected impact on score, effort estimate]
### Re-evaluation Schedule
**Next Evaluation Date**: [Date]
**Trigger for Ad-Hoc Evaluation**: [Model update, major feature change, safety incident]
How to Use This Template
- Define scope first. Decide whether you are evaluating the full agent or a specific capability. Broad evaluations are useful for launch decisions. Narrow evaluations are better for iteration.
- Build your test suite. Each dimension needs 50-200 test cases depending on agent complexity. Pull from production logs when available, and supplement with synthetic edge cases.
- Score honestly. The rubrics are designed to differentiate between "good enough to ship" and "needs work." A score of 3 means acceptable, not aspirational.
- Track over time. Run this evaluation after every model update, prompt change, or tool addition. Scores should trend upward. If a dimension drops, investigate before shipping.
- Share results broadly. Distribute the summary to product, engineering, data science, and leadership. Agent quality is a cross-functional concern.
For guidance on integrating this evaluation into your AI product lifecycle, see the AI Product Lifecycle Framework. For a broader view of AI metrics that matter, explore the hallucination rate metric and related AI performance measures.
