Skip to main content
New: Deck Doctor. Upload your deck, get CPO-level feedback. 7-day free trial.
TemplateFREE⏱️ 25 min

AI Prompt Engineering Log Template

A structured log for tracking prompt versions, test cases, performance metrics, and optimization decisions across LLM-powered product features.

Last updated 2026-03-04
AI Prompt Engineering Log Template preview

AI Prompt Engineering Log Template

Free AI Prompt Engineering Log Template — open and start using immediately

or use email

Instant access. No spam.

Get Template Pro — all templates, no gates, premium files

888+ templates without email gates, plus 30 premium Excel spreadsheets with formulas and professional slide decks. One payment, lifetime access.

Need a custom version?

Forge AI generates PM documents customized to your product, team, and goals. Get a draft in seconds, then refine with AI chat.

Generate with Forge AI

What This Template Is For

Prompt engineering in production is not the same as tinkering in a playground. When your product depends on LLM prompts, every prompt change is a code change that affects user experience, accuracy, cost, and safety. Yet most teams treat prompts as informal artifacts: edited in chat windows, stored in Slack threads, and deployed without version history.

This template brings engineering discipline to prompt management. It provides a structured log for tracking prompt versions, test results, performance metrics, and the reasoning behind each change. When something breaks in production, you can trace back to which prompt change caused it and why it was made.

For foundational prompt engineering techniques, see the guide to prompt engineering for PMs. The AI PM Handbook covers prompt strategy in the context of AI product development. Track the cost impact of prompt changes with the LLM Cost Estimator.

How to Use This Template

  1. Create one log per prompt (per system prompt, per feature). A single AI product may have 5-10 separate prompts, each with its own log.
  1. Document the baseline version before making any changes. Record current performance metrics so you have a comparison point.
  1. Log every version change with the date, author, rationale, and the full prompt text. Include what you changed and why.
  1. Run your test suite after every change and record results in the log. Never deploy a prompt change without testing.
  1. Track production metrics after deployment. A prompt that passes testing can still degrade in production with real user inputs.
  1. Review the log monthly to identify patterns: which types of changes improve performance, and which cause regressions.

The Template

Prompt Metadata

  • Assign a unique identifier to this prompt
  • Document the feature and model this prompt serves
  • Name the prompt owner (who approves changes)
  • Define the test suite location and how to run it
  • Set the review cadence for this prompt
## Prompt Log

**Prompt ID**: [e.g., PROMPT-CHAT-001]
**Feature**: [e.g., Customer support chatbot]
**Model**: [e.g., Claude 3.5 Sonnet]
**Owner**: [Name]
**Test Suite**: [Location / How to run]
**Created**: [YYYY-MM-DD]
**Current Version**: [v1.0 / v2.3 / etc.]

Version Entry Template

  • Record the version number, date, and author
  • Document what changed and the reasoning
  • Include the full prompt text (or link to source control)
  • Record test results before and after
  • Document deployment status and production impact
### Version [X.Y] - [YYYY-MM-DD]

**Author**: [Name]
**Change Type**: [New / Optimization / Bug Fix / Safety / Cost Reduction]

**What Changed**:
[1-3 sentences describing the change]

**Why**:
[Reasoning for the change. What problem are we solving?]

**Prompt Text**:
\`\`\`
[Full system prompt text, or link to file in source control]
\`\`\`

**Test Results**:
| Test Case | Previous | This Version | Delta |
|-----------|----------|-------------|-------|
| Accuracy (standard inputs) | [X%] | [Y%] | [+/-Z%] |
| Accuracy (edge cases) | [X%] | [Y%] | [+/-Z%] |
| Hallucination rate | [X%] | [Y%] | [+/-Z%] |
| Average output tokens | [N] | [M] | [+/-Delta] |
| Cost per request | [$X] | [$Y] | [+/-$Z] |
| Latency (p50) | [Xs] | [Ys] | [+/-Zs] |

**Safety Checks**:
- [ ] Jailbreak resistance: [Pass/Fail]
- [ ] PII handling: [Pass/Fail]
- [ ] Refusal behavior: [Pass/Fail]
- [ ] Bias check: [Pass/Fail]

**Deployment**:
- [ ] Deployed to staging: [Date]
- [ ] Deployed to production: [Date]
- [ ] Production metrics verified after [24h / 48h / 1 week]

**Production Impact**:
[Observed changes in production metrics after deployment.
E.g., "User satisfaction increased from 78% to 83%.
Hallucination reports decreased by 40%."]

Test Case Registry

  • Define 10+ standard test cases that cover common inputs
  • Define 5+ edge case tests (unusual inputs, long inputs, empty inputs)
  • Define 5+ adversarial tests (injection attempts, jailbreaks)
  • Define 3+ safety tests (harmful content, PII, bias)
  • Record expected output or acceptance criteria for each test case
## Test Case Registry

| ID | Category | Input | Expected Output / Criteria | Priority |
|----|----------|-------|---------------------------|----------|
| TC-01 | Standard | [Representative input 1] | [Expected behavior or output pattern] | Critical |
| TC-02 | Standard | [Representative input 2] | [Expected behavior or output pattern] | Critical |
| TC-03 | Edge Case | [Very long input] | [Handles gracefully, no truncation errors] | High |
| TC-04 | Edge Case | [Empty or minimal input] | [Prompts for more information] | High |
| TC-05 | Edge Case | [Non-English input] | [Responds in same language or politely declines] | Medium |
| TC-06 | Adversarial | [Prompt injection attempt] | [Ignores injection, responds normally] | Critical |
| TC-07 | Adversarial | [Jailbreak attempt] | [Maintains safety boundaries] | Critical |
| TC-08 | Safety | [Request for harmful content] | [Refuses clearly and helpfully] | Critical |
| TC-09 | Safety | [Input containing PII] | [Does not echo or store PII] | Critical |
| TC-10 | Safety | [Demographically sensitive topic] | [Responds without bias] | High |

Performance Tracking

  • Set up automated metrics collection for production prompts
  • Define alert thresholds for metric degradation
  • Track metrics over time to identify drift
  • Compare performance across model versions
## Production Metrics Dashboard

### Weekly Metrics (auto-populated)
| Week | Accuracy | Hallucination Rate | Avg Tokens | Cost/Req | User Satisfaction | Alerts |
|------|----------|-------------------|------------|----------|-------------------|--------|
| [W1] | [X%] | [Y%] | [N] | [$Z] | [W%] | [None / Details] |
| [W2] | [X%] | [Y%] | [N] | [$Z] | [W%] | [None / Details] |

Filled Example

Prompt ID: PROMPT-SUPPORT-001

Feature: Customer support ticket auto-responder

Model: Claude 3.5 Sonnet

Version 1.0 (2026-01-15): Initial prompt. Instruction to respond to support tickets with helpful, concise answers using the knowledge base. Accuracy on test suite: 82%. Hallucination rate: 8%. Cost per request: $0.003.

Version 1.1 (2026-01-28): Added "If you are not confident in your answer, say so and suggest the user contact support" instruction. Accuracy unchanged at 82%, but hallucination rate dropped to 4% because the model now hedges on uncertain topics instead of fabricating. Cost per request increased to $0.004 (slightly longer outputs).

Version 2.0 (2026-02-10): Major rewrite. Added structured output format (greeting, answer, next steps, closing). Added 5 few-shot examples. Accuracy jumped to 91%. Hallucination rate: 3%. Cost per request: $0.006. User satisfaction increased from 71% to 84%. Worth the cost increase.

Version 2.1 (2026-02-20): Cost optimization. Shortened few-shot examples and tightened output length instruction. Accuracy held at 90%. Cost per request dropped to $0.004. Use the LLM Cost Estimator to model the impact of prompt length on monthly costs.

Frequently Asked Questions

Why version prompts when they are just text?+
Prompts are code. A prompt change can break user experience, introduce [hallucinations](/glossary/hallucination), or triple your API costs. Version control gives you rollback capability, audit trails, and the ability to correlate production incidents with specific prompt changes. Teams that do not version prompts spend hours debugging issues that a version diff would explain in seconds.
Should I store prompts in source control or in this log?+
Both. Store the canonical prompt text in source control (Git) for deployment and rollback. Use this log to document the reasoning, test results, and production impact that source control diffs cannot capture. Link between them using version numbers.
How do I A/B test prompt changes?+
Route a percentage of traffic (start with 5-10%) to the new prompt version while the rest uses the current version. Measure the same metrics tracked in your test suite (accuracy, hallucination rate, user satisfaction). Run the test for at least one week to account for input variation. Only promote the new version if it improves target metrics without degrading others.
What if a model provider updates their model and my prompts break?+
This is why you have a test suite. Run your full test suite whenever your model provider ships an update. If tests fail, the prompt needs adjustment for the new model version. Log it as a new prompt version with "Model Update Compatibility" as the change type. The [AI PM Handbook](/ai-guide) covers strategies for managing model provider dependencies.
How many prompt versions is too many?+
There is no fixed limit, but if you are shipping more than 2-3 prompt versions per week, you may be optimizing prematurely or reacting to individual complaints instead of systematic issues. Focus prompt changes on measurable improvements to your core metrics. Archive old versions but keep the log for reference.

Explore More Templates

Browse our full library of PM templates, or generate a custom version with AI.

Free PDF

Like This Template?

Subscribe to get new templates, frameworks, and PM strategies delivered to your inbox.

or use email

Join 10,000+ product leaders. Instant PDF download.

Want full SaaS idea playbooks with market research?

Explore Ideas Pro →