Why Standard PRDs Fall Apart for AI Features
Traditional PRDs are built around deterministic logic. You define inputs, expected outputs, and edge cases. Engineering builds it, QA verifies it, and the feature either works or it does not. AI features break this model completely.
When you ship an LLM-powered feature, the same input can produce different outputs every time. "Correct" is not binary; it is a spectrum. The feature might work brilliantly for 90% of queries and hallucinate dangerously for the remaining 10%. A standard PRD has no framework for expressing this reality, which means engineering builds to the wrong spec, QA tests the wrong things, and the feature launches with risks nobody documented.
This guide covers the sections you need to add to your PRD when the feature involves AI, with concrete examples you can adapt for your own products.
Start with the Problem, Not the Model
Before you write a single line about any particular model, your PRD needs to answer a fundamental question: what user problem are you solving, and why does AI solve it better than a deterministic approach?
Most AI feature PRDs skip this. They start with "we will use an LLM to..." instead of "users struggle with X because Y, and AI enables a solution that was previously impossible because Z."
What to include in the problem statement
Quantify the current pain, show why existing deterministic approaches fall short, and state the measurable improvement AI makes possible. The example below does all three.
Example
Users spend an average of 8 minutes manually categorizing each support ticket against our 247-category taxonomy. The miscategorization rate is 23%, causing tickets to be routed to the wrong team and increasing resolution time by 2.3x. Rules-based routing covers only 34% of ticket types accurately. An LLM-based classifier can handle the full natural-language variability of ticket descriptions and reduce categorization time to under 2 seconds with a target accuracy above 92%.
Define Your Eval Criteria Before Anything Else
The single biggest difference between a good AI PRD and a bad one is whether it defines evaluation criteria upfront. Without eval criteria, you have no way to know if the feature is working, no way to compare model options, and no way to make rational decisions about tradeoffs.
Types of eval criteria
Accuracy metrics vary by feature type: classification features need precision and recall per category, extraction features need exact-match rates, and generative features need rubric-based quality scores from human or model graders.
Quality metrics capture what raw accuracy misses: tone, coherence, formatting, and whether the output is genuinely useful rather than merely correct.
Operational metrics keep the feature viable: latency, cost per query, and failure or timeout rates.
Setting thresholds
For each metric, define three levels: the launch threshold (the minimum to ship), the target (where you expect to land after iteration), and the floor below which the feature gets pulled.
Be honest about launch thresholds. An AI feature that launches at 85% accuracy and improves to 95% over three months is far better than one that stays in development for six months trying to hit 95% before any user sees it.
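To make the thresholds actionable, encode them in a repeatable eval run. Here is a minimal sketch in Python; the classify() function, the segment labels, and the 85%/95% figures are placeholders for your own model call and thresholds, and the per-segment breakdown matters because aggregate accuracy hides weak segments.

```python
# Minimal eval harness sketch. classify() wraps your model call;
# thresholds and segment labels are illustrative placeholders.
from collections import defaultdict

LAUNCH_THRESHOLD = 0.85    # minimum accuracy to ship
TARGET = 0.95              # where we expect to land after iteration

def run_eval(eval_set: list[dict], classify) -> float:
    """eval_set rows look like {'input': ..., 'expected': ..., 'segment': ...}."""
    correct, total = defaultdict(int), defaultdict(int)
    for case in eval_set:
        total[case["segment"]] += 1
        if classify(case["input"]) == case["expected"]:
            correct[case["segment"]] += 1
    for segment in sorted(total):   # report per segment, not just in aggregate
        accuracy = correct[segment] / total[segment]
        status = "ok" if accuracy >= LAUNCH_THRESHOLD else "BELOW LAUNCH BAR"
        print(f"{segment}: {accuracy:.1%} ({status})")
    overall = sum(correct.values()) / sum(total.values())
    print(f"overall: {overall:.1%} (launch {LAUNCH_THRESHOLD:.0%}, target {TARGET:.0%})")
    return overall
```

Run this on every prompt change and model upgrade, and attach the output to the launch decision.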
Hallucination Tolerance and Guardrails
Every LLM hallucinates. Your PRD needs to define how much hallucination is acceptable and what happens when the model gets it wrong.
Categorize your hallucination risk
Not all hallucinations carry equal risk. Map your feature's outputs to risk tiers: low when a human reviews every output before it is used, medium when a human reviews before publishing but could plausibly miss an error, and high when outputs reach users or trigger actions with no review at all.
Guardrail specifications
For each risk tier, define the guardrails: input guardrails (what gets sanitized or rejected before the model sees it), output guardrails (what gets flagged or blocked before the user sees it), and fallback behavior (what replaces the AI output when a guardrail trips).
Example guardrail spec
> Feature: AI-generated release notes from commit history
>
> Hallucination risk tier: Medium (user reviews before publishing)
>
> Input guardrails: Strip internal ticket references and employee names before sending to model. Reject if commit history exceeds 50,000 tokens (summarize first).
>
> Output guardrails: Flag any output containing URLs not found in the input commits. Flag any customer names or specific revenue figures. Reject outputs shorter than 100 words (likely model failure).
>
> Fallback: Display a bulleted list of commit messages grouped by date as the non-AI alternative. Show the message: "AI summary unavailable. Showing raw commit history instead."
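Specs like this translate almost directly into code. The sketch below implements the spec above under some stated assumptions: the regexes, the rough four-characters-per-token estimate, and the generate_summary() function are all illustrative, and stripping employee names would realistically need a proper entity-recognition pass rather than a regex.

```python
# Sketch of the release-notes guardrail spec. Regexes, the token
# estimate, and generate_summary() are assumptions for illustration.
import re

MAX_INPUT_TOKENS = 50_000
MIN_OUTPUT_WORDS = 100
TICKET_REF = re.compile(r"\b[A-Z]{2,}-\d+\b")   # e.g. ENG-1234 style references
URL = re.compile(r"https?://\S+")
MONEY = re.compile(r"\$\s?\d[\d,.]*")           # crude check for revenue figures

def apply_input_guardrails(commit_history: str) -> str:
    cleaned = TICKET_REF.sub("[ticket]", commit_history)
    if len(cleaned) / 4 > MAX_INPUT_TOKENS:     # rough token estimate
        raise ValueError("commit history too large; summarize first")
    return cleaned

def output_violations(draft: str, source: str) -> list[str]:
    problems = []
    if len(draft.split()) < MIN_OUTPUT_WORDS:
        problems.append("reject: too short, likely model failure")
    problems += [f"flag: URL not in input commits: {u}"
                 for u in URL.findall(draft) if u not in source]
    problems += [f"flag: possible revenue figure: {m}" for m in MONEY.findall(draft)]
    return problems

def release_notes(commit_history: str, generate_summary) -> str:
    try:
        cleaned = apply_input_guardrails(commit_history)
        draft = generate_summary(cleaned)        # your model call goes here
        violations = output_violations(draft, cleaned)
        if not any(v.startswith("reject") for v in violations):
            return draft                         # surface "flag:" items to the reviewer
    except Exception:
        pass                                     # fall through to the non-AI path
    return ("AI summary unavailable. Showing raw commit history instead.\n"
            + commit_history)
```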
Model Requirements and Selection Criteria
Your PRD should specify model requirements in terms of capabilities needed, not specific model names. Models evolve too fast to pin your spec to a particular version. Instead, define the capability profile your feature requires.
Capability requirements
Describe what the model must be able to do: minimum context window, structured output support, reasoning depth, language coverage, and acceptable latency at your expected prompt size.
Hosting and data constraints
Spell out where your data is allowed to go: residency requirements, whether inputs may be used for vendor training, retention limits, and whether compliance obligations force a self-hosted model.
Cost modeling
Include a cost projection table in your PRD:
| Scenario | Queries per day | Avg tokens per query | Monthly API cost |
|---|---|---|---|
| Launch (Month 1) | 500 | 2,000 | $450 |
| Growth (Month 6) | 5,000 | 2,200 | $4,950 |
| Scale (Month 12) | 25,000 | 2,500 | $28,125 |
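When the table's assumptions are explicit, anyone can audit the math. The sketch below reproduces it assuming a blended rate of $15 per million tokens, which is the figure implied by the table rather than a quote from any particular provider:

```python
# Reproduces the cost projection table. The blended $15 per million
# tokens is the rate implied by the table; substitute your provider's
# actual input/output pricing.
RATE_PER_MILLION_TOKENS = 15.00

scenarios = [
    ("Launch (Month 1)", 500, 2_000),
    ("Growth (Month 6)", 5_000, 2_200),
    ("Scale (Month 12)", 25_000, 2_500),
]

for name, queries_per_day, tokens_per_query in scenarios:
    monthly_tokens = queries_per_day * tokens_per_query * 30
    monthly_cost = monthly_tokens / 1_000_000 * RATE_PER_MILLION_TOKENS
    print(f"{name}: {monthly_tokens / 1e6:,.0f}M tokens -> ${monthly_cost:,.0f}/month")

# At one query per user per day, the Scale row works out to
# $28,125 / 25,000 users, or about $1.12 per user per month.
```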
These numbers force a real conversation about unit economics. If your AI feature costs $1.12 per user per month and your average revenue per user is $15, the margins work. If the feature costs $8 per user per month, you need a different architecture or a different pricing model.
Data Requirements and Training Considerations
AI features are only as good as the data behind them. Your PRD needs to specify what data the feature needs, where it comes from, and how it stays fresh.
Prompt context data
Most LLM features use retrieval-augmented generation (RAG) or structured prompt context rather than fine-tuning. Define which sources feed the context, how retrieval is scoped and ranked, what token budget the context gets, how often the underlying data refreshes, and what happens when retrieval finds nothing relevant.
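As an illustration, here is a minimal sketch of the context-assembly step under a token budget. The Chunk shape, retrieve(), and estimate_tokens() stand in for your retrieval layer and tokenizer; the budget figure is illustrative.

```python
# Minimal RAG context assembly under a token budget. Chunk, retrieve(),
# and estimate_tokens() are placeholder stand-ins, not a specific API.
from typing import Callable, NamedTuple

class Chunk(NamedTuple):
    text: str
    source: str     # attribution, so outputs can cite where facts came from
    updated: str    # freshness: stale context produces stale answers

CONTEXT_BUDGET_TOKENS = 6_000

def build_context(query: str,
                  retrieve: Callable[[str], list[Chunk]],
                  estimate_tokens: Callable[[str], int]) -> str:
    selected, used = [], 0
    for chunk in retrieve(query):                 # assumed ranked, best first
        cost = estimate_tokens(chunk.text)
        if used + cost > CONTEXT_BUDGET_TOKENS:
            break
        selected.append(f"[{chunk.source}, updated {chunk.updated}]\n{chunk.text}")
        used += cost
    if not selected:
        # The PRD should define this case: decline, or answer without context.
        raise LookupError("no relevant context retrieved")
    return "\n\n".join(selected)
```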
Training and fine-tuning data (if applicable)
If you fine-tune, specify where training examples come from, how many you need, how they are labeled, and who reviews label quality.
Data privacy and compliance
Document what user data enters prompts, how it is anonymized or redacted, what your vendor agreements permit, and which regulations apply.
Feedback Loops and Continuous Improvement
AI features are not "build and forget." Your PRD needs to define how the feature gets better over time.
Implicit feedback signals
Track what users do with the output: whether they accept it, edit it, regenerate it, or abandon it.
Explicit feedback mechanisms
Give users a lightweight way to rate or flag outputs, such as a thumbs up/down control with an optional reason.
Improvement workflow
Define who owns the feedback loop and what the review cadence looks like: for example, the feature owner triages flagged outputs weekly, recurring failures become new eval cases, and the full eval suite is re-run after every prompt or model change.
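Feedback is only useful if it is captured in a form that review can consume. A small sketch, assuming a log_event() sink you already have; the field and signal names are illustrative:

```python
# Sketch of recording feedback signals for later review. log_event()
# and all field/signal names are assumptions for illustration.
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import Callable, Optional

@dataclass
class AIFeedbackEvent:
    feature: str                        # e.g. "release-notes-generator"
    model_output: str
    signal: str                         # "thumbs_up", "thumbs_down", "edited", "regenerated"
    final_output: Optional[str] = None  # what the user kept, if they edited
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def record_feedback(event: AIFeedbackEvent, log_event: Callable[[dict], None]) -> None:
    # Edited and flagged outputs are prime candidates for the next eval-set revision.
    log_event(asdict(event))
```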
The PRD Template Section Checklist
When writing your next AI feature PRD, make sure you cover each of these sections. Not every section applies to every feature, but you should consciously decide to skip a section rather than forget it exists.
Standard PRD sections (still needed)
Problem statement, user stories, scope and non-goals, success metrics, rollout plan, and dependencies all still apply. AI does not exempt you from the basics.
AI-specific sections to add
- Problem statement that justifies AI over a deterministic approach
- Eval criteria with launch and target thresholds
- Hallucination risk tiers and guardrails
- Model capability requirements, hosting constraints, and cost projections
- Data requirements, privacy, and compliance
- Feedback loops and the improvement workflow
Common Mistakes to Avoid
After reviewing dozens of AI feature PRDs across different companies, these are the mistakes that show up most frequently:
Specifying a model instead of capabilities. "We will use GPT-4" is not a requirement. It is a premature implementation decision. Define what you need the model to do, and let engineering evaluate options against those requirements.
Ignoring cost at scale. A prototype that costs $0.03 per query seems cheap until you multiply it by 50,000 daily active users. Always model costs at your 12-month usage projection.
Treating accuracy as a single number. Aggregate accuracy hides critical failures. A model with 95% average accuracy might have 60% accuracy on your most important category. Break eval criteria down by segment, use case, and risk tier.
No fallback behavior. If the model goes down or produces garbage, what does the user see? If your PRD does not answer this question, your users will find out the answer the hard way during an outage.
Skipping the ethics review. AI features can discriminate, manipulate, or mislead at scale. Define your ethical boundaries in the PRD, not after launch when the press coverage forces you to.
Putting It into Practice
The best way to adopt this framework is incrementally. You do not need to rewrite every PRD template overnight. Start with your next AI feature and add the eval criteria and hallucination tolerance sections. Those two alone will sharpen your engineering conversations and launch decisions.
As your team ships more AI features, build a shared eval library that standardizes how you measure quality across the product. Over time, this library becomes one of your most valuable assets because it encodes what "good" means for your specific users and use cases, something no off-the-shelf framework can provide.
The PRD is not just a document. It is a forcing function that makes your team think through the hard questions before code gets written. For AI features, those hard questions are different from traditional software, and your PRD needs to reflect that.