What This Template Is For
Choosing an AI vendor is not the same as choosing a SaaS tool. AI vendors introduce unique risks: model quality varies by use case, pricing models are complex (per-token, per-request, per-seat), data handling policies affect compliance, and switching costs can be extreme if you build on proprietary APIs. A wrong decision here is expensive to reverse.
This template provides a structured evaluation framework that covers the dimensions most critical to AI vendor selection: model quality for your specific use case, pricing analysis at scale, data privacy and compliance, reliability and SLAs, integration complexity, and vendor lock-in risk. The AI PM Handbook covers build vs buy strategy in detail. For a quick model comparison, the AI Build vs Buy tool can help you assess whether third-party APIs are the right approach for your use case. The AI ROI Calculator helps model the financial impact of different vendor pricing structures. See the LLM glossary entry for background on the underlying technology you are evaluating.
When to Use This Template
- You are selecting an LLM provider (OpenAI, Anthropic, Google, Cohere, Mistral, etc.)
- You are evaluating AI/ML platform vendors (AWS SageMaker, Vertex AI, Azure ML, etc.)
- You are comparing managed AI services for a specific use case (search, summarization, classification)
- You need to present a vendor recommendation to leadership with supporting analysis
- You are reviewing an existing vendor relationship for renewal or replacement
How to Use This Template
- Define your evaluation criteria and weight each dimension based on your priorities
- Create a shortlist of 2-4 vendors to evaluate (more than 4 creates analysis paralysis)
- Run a proof-of-concept test with each vendor using your actual data and use cases
- Complete the evaluation scorecard for each vendor based on test results
- Calculate the total weighted score and present the recommendation with supporting data
The Template
# AI Vendor Evaluation
**Evaluator**: [Name and role]
**Evaluation Date**: [Date]
**Use Case**: [What you are building with this AI vendor]
**Decision Deadline**: [Date]
**Budget**: [$X per month / $X per year]
---
## 1. Evaluation Criteria and Weights
| Dimension | Weight | Description |
|-----------|--------|-------------|
| Model Quality | [X]% | Accuracy, relevance, and reliability for your use case |
| Pricing | [X]% | Total cost at projected scale, pricing predictability |
| Data Privacy | [X]% | Data handling, retention, training policies, compliance |
| Reliability | [X]% | Uptime SLA, latency, throughput guarantees |
| Integration | [X]% | API quality, SDK support, time to integrate |
| Lock-in Risk | [X]% | Switching costs, standard interfaces, data portability |
| Support | [X]% | Technical support quality, documentation, community |
| **Total** | **100%** | |
---
## 2. Vendor Shortlist
| Vendor | Product | Why Considered |
|--------|---------|---------------|
| [Vendor A] | [Product name] | [Brief rationale for including in evaluation] |
| [Vendor B] | [Product name] | [Brief rationale] |
| [Vendor C] | [Product name] | [Brief rationale] |
---
## 3. Model Quality Assessment
### Test Methodology
- **Test dataset**: [Description, size, and source of evaluation data]
- **Test cases**: [Number of test cases per category]
- **Evaluation method**: [Human evaluation / Automated metrics / Both]
- **Evaluators**: [Who reviewed the outputs]
### Results
| Metric | Vendor A | Vendor B | Vendor C |
|--------|----------|----------|----------|
| Overall accuracy | [X]% | [X]% | [X]% |
| Task-specific quality | [Score] | [Score] | [Score] |
| Latency (p50) | [X]ms | [X]ms | [X]ms |
| Latency (p99) | [X]ms | [X]ms | [X]ms |
| Hallucination rate | [X]% | [X]% | [X]% |
| Edge case handling | [Score] | [Score] | [Score] |
| Output consistency | [Score] | [Score] | [Score] |
### Quality Notes
- Vendor A: [Observations from testing]
- Vendor B: [Observations]
- Vendor C: [Observations]
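To populate the results table above, it helps to run the same test cases through each vendor with a small harness and compute accuracy and latency percentiles mechanically. The sketch below is illustrative, not any vendor's API: `call_vendor` stands in for whatever SDK wrapper you write per vendor, and the grading callable can be an automated check or a human judgment recorded after the fact.

```python
import statistics
import time

def evaluate_vendor(vendor_name, call_vendor, test_cases):
    """Run the same test cases against one vendor and summarize the results.

    call_vendor: callable(prompt) -> response text (wraps that vendor's SDK).
    test_cases: list of {"prompt": str, "grade": callable(response) -> bool}.
    """
    latencies_ms, correct = [], 0
    for case in test_cases:
        start = time.perf_counter()
        response = call_vendor(case["prompt"])
        latencies_ms.append((time.perf_counter() - start) * 1000)
        if case["grade"](response):  # automated check or recorded human judgment
            correct += 1

    latencies_ms.sort()
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    return {
        "vendor": vendor_name,
        "accuracy": correct / len(test_cases),
        "latency_p50_ms": statistics.median(latencies_ms),
        "latency_p99_ms": latencies_ms[p99_index],
    }
```

Hallucination rate and edge-case handling usually still need human review; score those separately rather than trying to automate them fully.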
---
## 4. Pricing Analysis
### Pricing Model Comparison
| Component | Vendor A | Vendor B | Vendor C |
|-----------|----------|----------|----------|
| Pricing model | [Per token / Per request / Per seat / Flat] | | |
| Input cost | [$X per 1M tokens] | | |
| Output cost | [$X per 1M tokens] | | |
| Fine-tuning cost | [$X per training hour] | | |
| Storage cost | [$X per GB/month] | | |
| Minimum commitment | [$X / None] | | |
| Volume discounts | [Available at X volume] | | |
### Projected Monthly Cost at Scale
| Scale | Vendor A | Vendor B | Vendor C |
|-------|----------|----------|----------|
| Current volume | $[X] | $[X] | $[X] |
| 3x volume | $[X] | $[X] | $[X] |
| 10x volume | $[X] | $[X] | $[X] |
### Hidden Cost Factors
- [ ] Egress fees
- [ ] Rate limit overage charges
- [ ] Premium support costs
- [ ] Model version migration costs
- [ ] Dedicated capacity costs
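To fill in the projected-cost table, it is usually enough to model monthly spend from token volume and each vendor's per-million-token rates, then multiply the volume. The rates and volumes below are placeholders, not real vendor prices; substitute the numbers from each vendor's current rate card.

```python
# Placeholder per-1M-token rates in USD; replace with each vendor's rate card.
VENDORS = {
    "Vendor A": {"input_per_1m": 3.00, "output_per_1m": 15.00},
    "Vendor B": {"input_per_1m": 2.50, "output_per_1m": 10.00},
}

def monthly_cost(rates, input_tokens, output_tokens):
    """Monthly USD spend at the given token volumes."""
    return (input_tokens / 1e6) * rates["input_per_1m"] + \
           (output_tokens / 1e6) * rates["output_per_1m"]

# Example current volume: 200M input tokens and 50M output tokens per month.
base_input, base_output = 200e6, 50e6
for vendor, rates in VENDORS.items():
    for multiplier in (1, 3, 10):
        cost = monthly_cost(rates, base_input * multiplier, base_output * multiplier)
        print(f"{vendor} at {multiplier}x volume: ${cost:,.0f}/month")
```

Volume discounts and committed-use pricing can bend these curves, so confirm the applicable rate at each projected tier rather than extrapolating linearly.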
---
## 5. Data Privacy and Compliance
| Requirement | Vendor A | Vendor B | Vendor C |
|-------------|----------|----------|----------|
| Data retention policy | [Days / None] | | |
| Uses data for training | [Yes / No / Opt-out] | | |
| Data processing location | [Regions] | | |
| SOC 2 Type II certified | [Yes / No] | | |
| HIPAA BAA available | [Yes / No] | | |
| GDPR compliant | [Yes / No] | | |
| Data encryption at rest | [Yes / No] | | |
| Data encryption in transit | [Yes / No] | | |
| DPA available | [Yes / No] | | |
| Audit logs available | [Yes / No] | | |
### Data Flow Diagram
For each vendor, document:
- Where user data is sent
- How long it is retained
- Who has access
- How it is deleted
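If it helps keep these answers comparable across vendors, the same fields can be captured in a small structured record per vendor; the field names and values below are illustrative only.

```python
# Illustrative data-flow record for one vendor; fill in from the DPA and docs.
vendor_a_data_flow = {
    "data_sent": ["user prompts", "uploaded documents"],
    "processing_location": "US regions only",
    "retention": "30 days, then deleted",
    "access": ["vendor support engineers, access logged"],
    "deletion": "automatic after retention window; manual purge on written request",
}
```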
---
## 6. Reliability and SLAs
| Metric | Vendor A | Vendor B | Vendor C |
|--------|----------|----------|----------|
| Uptime SLA | [X]% | | |
| Historical uptime (12 months) | [X]% | | |
| Latency SLA (p99) | [X]ms | | |
| Rate limits | [X] RPM | | |
| Burst capacity | [X] RPM | | |
| Status page transparency | [Good / Fair / Poor] | | |
| Incident response time | [X] hours | | |
| SLA credits | [% credit per % downtime] | | |
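Uptime SLAs are easier to compare once converted into allowed downtime. A minimal sketch of that conversion, assuming a 30-day month:

```python
def allowed_downtime_minutes(uptime_sla_pct, days=30):
    """Minutes of downtime per period permitted under a given uptime SLA."""
    return (1 - uptime_sla_pct / 100) * days * 24 * 60

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% uptime allows ~{allowed_downtime_minutes(sla):.0f} minutes of downtime per month")
```

A 99% SLA allows more than seven hours of downtime a month, while 99.9% allows roughly 43 minutes, which is why the historical-uptime row matters at least as much as the contractual one.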
---
## 7. Integration Assessment
| Factor | Vendor A | Vendor B | Vendor C |
|--------|----------|----------|----------|
| API design quality | [1-5] | | |
| SDK languages supported | [List] | | |
| Documentation quality | [1-5] | | |
| Estimated integration time | [Days/Weeks] | | |
| Streaming support | [Yes / No] | | |
| Batch processing support | [Yes / No] | | |
| Webhook support | [Yes / No] | | |
| OpenAPI spec available | [Yes / No] | | |
| Sandbox/test environment | [Yes / No] | | |
---
## 8. Lock-in Risk Assessment
| Factor | Vendor A | Vendor B | Vendor C |
|--------|----------|----------|----------|
| Proprietary API format | [Yes / No] | | |
| OpenAI-compatible API | [Yes / No] | | |
| Data export capability | [Easy / Difficult] | | |
| Fine-tuned model portability | [Portable / Locked] | | |
| Estimated switching cost | [$X + Y weeks] | | |
| Open-source alternative exists | [Yes / No] | | |
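One practical way to cap switching cost is to put a thin adapter between your product code and any single vendor's SDK, so a migration touches one class instead of every call site. This is a sketch of the pattern, not any specific library; the class and method names are illustrative.

```python
from typing import Protocol

class TextModel(Protocol):
    """The only interface the rest of the codebase depends on."""
    def complete(self, prompt: str, max_tokens: int = 512) -> str: ...

class VendorAAdapter:
    """Translates our interface to Vendor A's (hypothetical) SDK in one place."""
    def __init__(self, client):
        self._client = client  # whatever client object Vendor A's SDK provides

    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        # The vendor-specific call lives only here, so swapping vendors
        # means writing one new adapter rather than rewriting product code.
        return self._client.generate(prompt=prompt, max_tokens=max_tokens)

def summarize(model: TextModel, document: str) -> str:
    """Product code accepts any TextModel, never a vendor SDK directly."""
    return model.complete(f"Summarize the following document:\n\n{document}")
```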
---
## 9. Scorecard Summary
| Dimension | Weight | Vendor A | Vendor B | Vendor C |
|-----------|--------|----------|----------|----------|
| Model Quality | [X]% | [1-5] | [1-5] | [1-5] |
| Pricing | [X]% | [1-5] | [1-5] | [1-5] |
| Data Privacy | [X]% | [1-5] | [1-5] | [1-5] |
| Reliability | [X]% | [1-5] | [1-5] | [1-5] |
| Integration | [X]% | [1-5] | [1-5] | [1-5] |
| Lock-in Risk | [X]% | [1-5] | [1-5] | [1-5] |
| Support | [X]% | [1-5] | [1-5] | [1-5] |
| **Weighted Total** | **100%** | **[Score]** | **[Score]** | **[Score]** |
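The weighted total is the sum of each dimension score multiplied by its weight, after checking that the weights add up to 100%. A minimal sketch of the calculation, using the Claude column from the filled example below:

```python
WEIGHTS = {
    "model_quality": 0.30, "pricing": 0.20, "data_privacy": 0.20,
    "reliability": 0.10, "integration": 0.10, "lockin_risk": 0.05, "support": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 100%

def weighted_total(scores):
    """scores: dict mapping each dimension to a 1-5 rating."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

claude_scores = {"model_quality": 4.5, "pricing": 3.5, "data_privacy": 4.5,
                 "reliability": 4.0, "integration": 4.0, "lockin_risk": 3.5,
                 "support": 3.5}
print(round(weighted_total(claude_scores), 2))  # 4.1
```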
---
## 10. Recommendation
**Recommended vendor**: [Name]
**Rationale**: [2-3 sentences explaining why this vendor won]
**Key risks**: [1-2 sentences on what could go wrong]
**Next steps**: [Contract negotiation, POC expansion, integration planning]
Filled Example
## 9. Scorecard Summary (Partial)
| Dimension | Weight | Anthropic Claude | OpenAI GPT-4 | Google Gemini |
|-----------|--------|-----------------|-------------|---------------|
| Model Quality | 30% | 4.5 | 4.3 | 3.8 |
| Pricing | 20% | 3.5 | 3.0 | 4.0 |
| Data Privacy | 20% | 4.5 | 3.5 | 3.5 |
| Reliability | 10% | 4.0 | 4.0 | 3.5 |
| Integration | 10% | 4.0 | 4.5 | 3.5 |
| Lock-in Risk | 5% | 3.5 | 3.0 | 3.0 |
| Support | 5% | 3.5 | 4.0 | 3.5 |
| **Weighted Total** | **100%** | **4.10** | **3.79** | **3.67** |
## 10. Recommendation
**Recommended vendor**: Anthropic Claude
**Rationale**: Highest model quality scores on our specific use case (long-form
content generation), strongest data privacy posture (no training on inputs by default),
and competitive pricing at our projected 10x volume.
**Key risks**: Smaller ecosystem than OpenAI, fewer third-party integrations.
**Next steps**: Negotiate enterprise agreement, expand POC to 3 additional use cases.
Key Takeaways
- Always test vendors with your actual data and use cases, not generic benchmarks
- Project costs at 3x and 10x your current volume to avoid pricing surprises
- Data privacy policies vary significantly between vendors; read the DPA, not just the marketing page
- Lock-in risk is highest when you fine-tune models or build on proprietary APIs
- Weight your evaluation criteria before testing to avoid anchoring on whichever vendor you tested first
- Include hidden costs (egress, overage, migration) in your pricing analysis
