What This Template Is For
Prompt engineering in production is not the same as tinkering in a playground. When your product depends on LLM prompts, every prompt change is a code change that affects user experience, accuracy, cost, and safety. Yet most teams treat prompts as informal artifacts: edited in chat windows, stored in Slack threads, and deployed without version history.
This template brings engineering discipline to prompt management. It provides a structured log for tracking prompt versions, test results, performance metrics, and the reasoning behind each change. When something breaks in production, you can trace the failure back to the prompt change that caused it and the reasoning behind that change.
For foundational prompt engineering techniques, see the guide to prompt engineering for PMs. The AI PM Handbook covers prompt strategy in the context of AI product development. Track the cost impact of prompt changes with the LLM Cost Estimator.
How to Use This Template
- Create one log per prompt (per system prompt, per feature). A single AI product may have 5-10 separate prompts, each with its own log.
- Document the baseline version before making any changes. Record current performance metrics so you have a comparison point.
- Log every version change with the date, author, rationale, and the full prompt text. Include what you changed and why (a minimal logging sketch follows this list).
- Run your test suite after every change and record results in the log. Never deploy a prompt change without testing.
- Track production metrics after deployment. A prompt that passes testing can still degrade in production with real user inputs.
- Review the log monthly to identify patterns: which types of changes improve performance, and which cause regressions.
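Many teams keep this log as a machine-readable file next to the prompt in source control, so a version entry and its prompt change land in the same commit. A minimal sketch, assuming a JSON Lines log whose fields mirror the Version Entry Template below (the file path and author name are illustrative, not part of the template):

```python
import json
from datetime import date
from pathlib import Path

def log_version(log_path: str, version: str, author: str, change_type: str,
                what_changed: str, why: str, test_results: dict) -> None:
    """Append one version entry to a JSON Lines prompt log."""
    entry = {
        "version": version,
        "date": date.today().isoformat(),
        "author": author,
        "change_type": change_type,  # New / Optimization / Bug Fix / Safety / Cost Reduction
        "what_changed": what_changed,
        "why": why,
        "test_results": test_results,
    }
    with Path(log_path).open("a") as f:
        f.write(json.dumps(entry) + "\n")

# Example, using numbers from the Filled Example at the end of this page:
log_version(
    "prompts/PROMPT-SUPPORT-001.jsonl",  # hypothetical path
    version="2.0", author="Jane Doe", change_type="Optimization",
    what_changed="Structured output format plus five few-shot examples.",
    why="Raise accuracy on standard inputs.",
    test_results={"accuracy": 0.91, "hallucination_rate": 0.03,
                  "cost_per_request": 0.006},
)
```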
The Template
Prompt Metadata
- ☐ Assign a unique identifier to this prompt
- ☐ Document the feature and model this prompt serves
- ☐ Name the prompt owner (who approves changes)
- ☐ Define the test suite location and how to run it
- ☐ Set the review cadence for this prompt
## Prompt Log
**Prompt ID**: [e.g., PROMPT-CHAT-001]
**Feature**: [e.g., Customer support chatbot]
**Model**: [e.g., Claude 3.5 Sonnet]
**Owner**: [Name]
**Test Suite**: [Location / How to run]
**Created**: [YYYY-MM-DD]
**Current Version**: [v1.0 / v2.3 / etc.]
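If tooling needs to read this metadata (for example, to find the right test suite in CI), the same fields can live in a structured file alongside the prompt. A sketch as a Python dataclass; the field set simply mirrors the template above:

```python
from dataclasses import dataclass

@dataclass
class PromptMetadata:
    prompt_id: str        # e.g. "PROMPT-CHAT-001"
    feature: str          # e.g. "Customer support chatbot"
    model: str            # e.g. "Claude 3.5 Sonnet"
    owner: str            # who approves changes
    test_suite: str       # location or command, e.g. "pytest tests/chat/"
    created: str          # YYYY-MM-DD
    current_version: str  # e.g. "v2.3"
```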
Version Entry Template
- ☐ Record the version number, date, and author
- ☐ Document what changed and the reasoning
- ☐ Include the full prompt text (or link to source control)
- ☐ Record test results before and after
- ☐ Document deployment status and production impact
### Version [X.Y] - [YYYY-MM-DD]
**Author**: [Name]
**Change Type**: [New / Optimization / Bug Fix / Safety / Cost Reduction]
**What Changed**:
[1-3 sentences describing the change]
**Why**:
[Reasoning for the change. What problem are we solving?]
**Prompt Text**:
```
[Full system prompt text, or link to file in source control]
```
**Test Results**:
| Test Case | Previous | This Version | Delta |
|-----------|----------|-------------|-------|
| Accuracy (standard inputs) | [X%] | [Y%] | [+/-Z%] |
| Accuracy (edge cases) | [X%] | [Y%] | [+/-Z%] |
| Hallucination rate | [X%] | [Y%] | [+/-Z%] |
| Average output tokens | [N] | [M] | [+/-Delta] |
| Cost per request | [$X] | [$Y] | [+/-$Z] |
| Latency (p50) | [Xs] | [Ys] | [+/-Zs] |
**Safety Checks**:
- [ ] Jailbreak resistance: [Pass/Fail]
- [ ] PII handling: [Pass/Fail]
- [ ] Refusal behavior: [Pass/Fail]
- [ ] Bias check: [Pass/Fail]
**Deployment**:
- [ ] Deployed to staging: [Date]
- [ ] Deployed to production: [Date]
- [ ] Production metrics verified after [24h / 48h / 1 week]
**Production Impact**:
[Observed changes in production metrics after deployment. E.g., "User satisfaction increased from 78% to 83%. Hallucination reports decreased by 40%."]
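The Delta column in the test results table is easy to get wrong when copied by hand. A small helper can compute it from two recorded test runs; a sketch, assuming each run is stored as a metric-name-to-value dict (the example values come from the v1.0 to v1.1 change in the Filled Example below):

```python
def metric_deltas(previous: dict, current: dict) -> dict:
    """Signed change for every metric present in both test runs."""
    return {
        name: round(current[name] - previous[name], 4)
        for name in previous
        if name in current
    }

# Values from the v1.0 -> v1.1 change in the Filled Example:
prev = {"accuracy": 0.82, "hallucination_rate": 0.08, "cost_per_request": 0.003}
curr = {"accuracy": 0.82, "hallucination_rate": 0.04, "cost_per_request": 0.004}
print(metric_deltas(prev, curr))
# {'accuracy': 0.0, 'hallucination_rate': -0.04, 'cost_per_request': 0.001}
```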
Test Case Registry
- ☐ Define 10+ standard test cases that cover common inputs
- ☐ Define 5+ edge case tests (unusual inputs, long inputs, empty inputs)
- ☐ Define 5+ adversarial tests (injection attempts, jailbreaks)
- ☐ Define 3+ safety tests (harmful content, PII, bias)
- ☐ Record expected output or acceptance criteria for each test case
## Test Case Registry
| ID | Category | Input | Expected Output / Criteria | Priority |
|----|----------|-------|---------------------------|----------|
| TC-01 | Standard | [Representative input 1] | [Expected behavior or output pattern] | Critical |
| TC-02 | Standard | [Representative input 2] | [Expected behavior or output pattern] | Critical |
| TC-03 | Edge Case | [Very long input] | [Handles gracefully, no truncation errors] | High |
| TC-04 | Edge Case | [Empty or minimal input] | [Prompts for more information] | High |
| TC-05 | Edge Case | [Non-English input] | [Responds in same language or politely declines] | Medium |
| TC-06 | Adversarial | [Prompt injection attempt] | [Ignores injection, responds normally] | Critical |
| TC-07 | Adversarial | [Jailbreak attempt] | [Maintains safety boundaries] | Critical |
| TC-08 | Safety | [Request for harmful content] | [Refuses clearly and helpfully] | Critical |
| TC-09 | Safety | [Input containing PII] | [Does not echo or store PII] | Critical |
| TC-10 | Safety | [Demographically sensitive topic] | [Responds without bias] | High |
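If the registry is stored as a file, it can drive the test suite directly instead of living only in this document. A sketch using pytest, where the registry filename, prompt path, and helper functions are all assumptions for illustration:

```python
import json
import pytest

def call_model(system_prompt: str, user_input: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your LLM client

def meets_criteria(output: str, criteria: str) -> bool:
    # Simplest possible check: criteria as a required substring. Adversarial
    # and safety cases usually need stronger, often model-graded, checks.
    return criteria.lower() in output.lower()

with open("test_case_registry.json") as f:  # one object per TC-XX row
    CASES = json.load(f)

SYSTEM_PROMPT = open("prompts/support_prompt.txt").read()  # illustrative path

@pytest.mark.parametrize("case", CASES, ids=[c["id"] for c in CASES])
def test_prompt_case(case):
    output = call_model(SYSTEM_PROMPT, case["input"])
    assert meets_criteria(output, case["criteria"]), (
        f"{case['id']} ({case['category']}, priority {case['priority']}) failed"
    )
```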
Performance Tracking
- ☐ Set up automated metrics collection for production prompts
- ☐ Define alert thresholds for metric degradation (see the sketch below the dashboard)
- ☐ Track metrics over time to identify drift
- ☐ Compare performance across model versions
## Production Metrics Dashboard
### Weekly Metrics (auto-populated)
| Week | Accuracy | Hallucination Rate | Avg Tokens | Cost/Req | User Satisfaction | Alerts |
|------|----------|-------------------|------------|----------|-------------------|--------|
| [W1] | [X%] | [Y%] | [N] | [$Z] | [W%] | [None / Details] |
| [W2] | [X%] | [Y%] | [N] | [$Z] | [W%] | [None / Details] |
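The alert thresholds from the checklist above are straightforward to encode as rules over the weekly metrics. A minimal sketch, with threshold values chosen purely for illustration:

```python
# Illustrative thresholds; tune per prompt. Keys mirror the dashboard columns.
THRESHOLDS = {
    "accuracy": ("min", 0.85),            # alert if accuracy falls below 85%
    "hallucination_rate": ("max", 0.05),  # alert if hallucinations exceed 5%
    "cost_per_request": ("max", 0.01),    # alert if cost exceeds $0.01/request
}

def weekly_alerts(metrics: dict) -> list[str]:
    """Return an alert string for each metric outside its threshold."""
    alerts = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            alerts.append(f"{name}={value} breaches {kind} threshold {limit}")
    return alerts

print(weekly_alerts({"accuracy": 0.81, "hallucination_rate": 0.04}))
# ['accuracy=0.81 breaches min threshold 0.85']
```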
Filled Example
Prompt ID: PROMPT-SUPPORT-001
Feature: Customer support ticket auto-responder
Model: Claude 3.5 Sonnet
Version 1.0 (2026-01-15): Initial prompt. Instruction to respond to support tickets with helpful, concise answers using the knowledge base. Accuracy on test suite: 82%. Hallucination rate: 8%. Cost per request: $0.003.
Version 1.1 (2026-01-28): Added "If you are not confident in your answer, say so and suggest the user contact support" instruction. Accuracy unchanged at 82%, but hallucination rate dropped to 4% because the model now hedges on uncertain topics instead of fabricating. Cost per request increased to $0.004 (slightly longer outputs).
Version 2.0 (2026-02-10): Major rewrite. Added structured output format (greeting, answer, next steps, closing). Added 5 few-shot examples. Accuracy jumped to 91%. Hallucination rate: 3%. Cost per request: $0.006. User satisfaction increased from 71% to 84%. Worth the cost increase.
Version 2.1 (2026-02-20): Cost optimization. Shortened few-shot examples and tightened output length instruction. Accuracy held at 90%. Cost per request dropped to $0.004. Use the LLM Cost Estimator to model the impact of prompt length on monthly costs.
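The v2.0 to v2.1 change looks small per request, but per-request costs compound with volume, which is exactly what the cost estimator models. A back-of-envelope check, assuming a hypothetical 500,000 requests per month (the volume is not part of the example above):

```python
requests_per_month = 500_000  # hypothetical volume, not from the example above

for version, cost_per_request in [("v2.0", 0.006), ("v2.1", 0.004)]:
    print(f"{version}: ${cost_per_request * requests_per_month:,.0f}/month")
# v2.0: $3,000/month
# v2.1: $2,000/month -- the shorter few-shot examples save $1,000/month
```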
