What This Template Is For
Choosing the right AI model is one of the highest-impact decisions a PM makes on an AI product. Get it wrong and you ship a feature that is too slow, too expensive, or too unreliable for production. Most teams default to picking the "best" model on public benchmarks without evaluating what matters for their specific use case.
This template gives you a structured scorecard to evaluate and compare AI models across the dimensions that actually matter in production: task-specific accuracy, latency under real load, cost per request, safety behavior, and user-perceived quality. It is designed for product managers who need to make model decisions alongside their ML engineers, not replace them.
For a deeper framework on running model evaluations, see the guide to LLM evals for product managers. The AI PM Handbook covers the full lifecycle of AI product management including model selection strategy. You can also use the AI Eval Scorecard tool to run an interactive scoring session.
How to Use This Template
- Define your evaluation criteria by filling in the scorecard dimensions. Weight each dimension based on your product's priorities. A customer-facing chatbot weights safety and latency higher. An internal data extraction tool weights accuracy and cost.
- Create your test dataset before evaluating any model. Include representative inputs, edge cases, and adversarial examples. The test set should reflect real production traffic, not cherry-picked demos.
- Score each candidate model against every dimension using the 1-5 rubric. Have at least two evaluators score independently to reduce bias.
- Calculate weighted scores and document the rationale for your final selection. The model with the highest weighted score is your recommendation, but document any close calls or tradeoffs.
- Re-run the scorecard quarterly or whenever a new model version ships. Use the LLM Cost Estimator to project cost implications of model changes.
The Template
Evaluation Setup
- ☐ Define the specific task the model must perform
- ☐ Document the input format and expected output format
- ☐ Identify 3-5 candidate models to evaluate
- ☐ Create test dataset with 50+ representative examples
- ☐ Include 10+ adversarial or edge case inputs
- ☐ Assign evaluation team (PM + ML engineer + domain expert)
- ☐ Set evaluation timeline and decision deadline
Scorecard Dimensions
## Model Evaluation Scorecard
**Evaluation Date**: [YYYY-MM-DD]
**Evaluated By**: [Names and roles]
**Task**: [Describe the specific AI task]
### Candidate Models
| # | Model Name | Provider | Version | Context Window |
|---|-----------|----------|---------|----------------|
| 1 | [Model A] | [Provider] | [Version] | [Tokens] |
| 2 | [Model B] | [Provider] | [Version] | [Tokens] |
| 3 | [Model C] | [Provider] | [Version] | [Tokens] |
### Dimension Weights
| Dimension | Weight (%) | Rationale |
|-----------|-----------|-----------|
| Task Accuracy | [e.g., 30%] | [Why this weight] |
| Latency | [e.g., 20%] | [Why this weight] |
| Cost | [e.g., 15%] | [Why this weight] |
| Safety & Guardrails | [e.g., 20%] | [Why this weight] |
| User Satisfaction | [e.g., 15%] | [Why this weight] |
| **Total** | **100%** | |
Scoring Rubric (1-5 Scale)
- ☐ Define what 1, 3, and 5 mean for Task Accuracy
- ☐ Define what 1, 3, and 5 mean for Latency
- ☐ Define what 1, 3, and 5 mean for Cost
- ☐ Define what 1, 3, and 5 mean for Safety
- ☐ Define what 1, 3, and 5 mean for User Satisfaction
### Detailed Scores
#### Task Accuracy (Weight: __%)
| Criteria | Model A | Model B | Model C |
|----------|---------|---------|---------|
| Correctness on standard inputs | [1-5] | [1-5] | [1-5] |
| Handling of edge cases | [1-5] | [1-5] | [1-5] |
| Output format compliance | [1-5] | [1-5] | [1-5] |
| Consistency across runs | [1-5] | [1-5] | [1-5] |
| **Subtotal** | [Avg] | [Avg] | [Avg] |
#### Latency (Weight: __%)
| Criteria | Model A | Model B | Model C |
|----------|---------|---------|---------|
| p50 response time | [1-5] | [1-5] | [1-5] |
| p99 response time | [1-5] | [1-5] | [1-5] |
| Time to first token | [1-5] | [1-5] | [1-5] |
| Performance under load | [1-5] | [1-5] | [1-5] |
| **Subtotal** | [Avg] | [Avg] | [Avg] |
#### Cost (Weight: __%)
| Criteria | Model A | Model B | Model C |
|----------|---------|---------|---------|
| Cost per 1K input tokens | [1-5] | [1-5] | [1-5] |
| Cost per 1K output tokens | [1-5] | [1-5] | [1-5] |
| Projected monthly cost at scale | [1-5] | [1-5] | [1-5] |
| Cost optimization potential | [1-5] | [1-5] | [1-5] |
| **Subtotal** | [Avg] | [Avg] | [Avg] |
#### Safety & Guardrails (Weight: __%)
| Criteria | Model A | Model B | Model C |
|----------|---------|---------|---------|
| Refusal of harmful prompts | [1-5] | [1-5] | [1-5] |
| Hallucination rate | [1-5] | [1-5] | [1-5] |
| PII handling | [1-5] | [1-5] | [1-5] |
| Bias in outputs | [1-5] | [1-5] | [1-5] |
| **Subtotal** | [Avg] | [Avg] | [Avg] |
#### User Satisfaction (Weight: __%)
| Criteria | Model A | Model B | Model C |
|----------|---------|---------|---------|
| Output readability | [1-5] | [1-5] | [1-5] |
| Tone and style match | [1-5] | [1-5] | [1-5] |
| Helpfulness rating | [1-5] | [1-5] | [1-5] |
| User preference (blind test) | [1-5] | [1-5] | [1-5] |
| **Subtotal** | [Avg] | [Avg] | [Avg] |
Final Decision
- ☐ Calculate weighted total for each model
- ☐ Document the recommended model with rationale
- ☐ Note any risks or tradeoffs with the selection
- ☐ Define re-evaluation triggers (cost spike, new model release, accuracy drift)
- ☐ Get sign-off from ML lead and product lead
Filled Example
Task: Summarize customer support tickets into 2-3 sentence summaries for the support dashboard.
Candidate Models: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash
Dimension Weights: Accuracy 35%, Latency 20%, Cost 20%, Safety 10%, Satisfaction 15%
| Dimension | GPT-4o | Sonnet | Flash |
|---|---|---|---|
| Accuracy (35%) | 4.5 | 4.3 | 3.8 |
| Latency (20%) | 3.2 | 3.8 | 4.7 |
| Cost (20%) | 2.5 | 3.0 | 4.8 |
| Safety (10%) | 4.2 | 4.5 | 4.0 |
| Satisfaction (15%) | 4.0 | 4.2 | 3.5 |
| Weighted Total | 3.68 | 3.88 | 4.08 |
Recommendation: Gemini 1.5 Flash. While GPT-4o and Sonnet scored higher on accuracy and satisfaction, Flash's cost and latency advantages outweigh the moderate accuracy gap for this summarization task. The 0.5-point accuracy difference is acceptable because summaries are reviewed by support agents before action.
Re-evaluation trigger: Re-score if hallucination rate exceeds 5% in production monitoring or if a new model version changes pricing by more than 20%.
