Gartner projects that more than 40% of agentic AI projects will be canceled by the end of 2027. An MIT report found 95% of generative AI pilots fail to deliver measurable impact. The most common root cause isn't bad models or weak engineering. It's bad requirements.
Product managers writing PRDs for AI features face a problem that no one trained them for: traditional requirements assume deterministic behavior. You specify inputs, define expected outputs, and write acceptance criteria that either pass or fail. AI features don't work that way. A recommendation engine might return good results 87% of the time. A text summarizer might hallucinate facts that sound plausible. A classification model might degrade silently over months as the data distribution shifts.
The PRD you wrote for your last CRUD feature won't survive contact with a model that generates different outputs for identical inputs. You need new sections, new metrics, and a fundamentally different approach to defining "done."
This post walks through how to write product requirements that account for probabilistic behavior, evaluation thresholds, guardrails, and the failure modes that kill AI features in production.
Why Traditional PRDs Break for AI Features
Standard PRDs work by specifying behavior as rules. "When the user clicks Submit, the form validates all required fields and displays inline errors." This is deterministic. The same input always produces the same output. QA can write a test, verify it passes, and move on.
AI features introduce three problems that this model can't handle.
Problem 1: Non-deterministic outputs. Ask an LLM-powered feature to summarize the same document twice and you'll get two different summaries. Both might be good. One might hallucinate a fact. There is no single "correct" output to test against.
Problem 2: Gradient quality. Traditional features are binary: they work or they don't. AI features exist on a spectrum. A search result can be slightly relevant, mostly relevant, or exactly right. Your requirements need to define where "good enough" lives on that spectrum.
Problem 3: Silent degradation. A traditional feature breaks visibly. An API returns a 500 error, a button stops responding, a page crashes. AI features degrade invisibly. A recommendation engine slowly gets worse as user behavior shifts. A classification model drifts as the training data ages. By the time anyone notices, the damage is already done.
If you're wrestling with the broader question of whether AI is the right approach for your feature, the AI build vs. buy framework is a good starting point. It helps you evaluate whether you need a custom model, a fine-tuned API, or a third-party solution before you start writing requirements.
The BEAM Framework for AI Feature Requirements
After reviewing dozens of failed AI launches and studying how teams at Shopify, Notion, and Intercom structure their AI PRDs, I noticed a pattern. Successful AI feature requirements cover four areas that traditional PRDs skip entirely. I call this BEAM: Behavior Boundaries, Evaluation Criteria, Action Guardrails, and Monitoring Thresholds.
Behavior Boundaries
This is where you define what the AI feature should and should not do. Traditional PRDs handle scope with a "what's included / what's not included" table. For AI features, you need something more granular.
A behavior boundary document specifies three tiers, adapted from a framework that Addy Osmani published on O'Reilly Radar:
- Always do: Actions the model should take without human review. Example: "Always include the original source link when summarizing a document."
- Ask first: Actions that require human confirmation before executing. Example: "When the confidence score is below 0.7, present the suggestion as a draft for user review rather than auto-applying."
- Never do: Hard stops the model must not cross, regardless of how the prompt is phrased. Example: "Never generate medical, legal, or financial advice. If the user's query touches these domains, surface a disclaimer and suggest consulting a professional."
This three-tier structure gives your engineering team clear guardrail requirements and gives your QA team specific scenarios to test against.
Here's what this looks like in practice for a customer support AI:
- ☐ Always do: Greet the user, acknowledge their issue, search the knowledge base
- ☐ Ask first: Offer a refund over $50, escalate to a human agent, access account billing data
- ☐ Never do: Promise delivery dates, make commitments about product roadmap, share other customers' data
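To make the three tiers testable, it can help to write them down as data rather than prose. The sketch below is one possible encoding, not a prescribed implementation: the action names, the `decide` function, and the choice to block unknown actions by default are all assumptions made for illustration. The 0.7 confidence floor comes from the "Ask first" example above.

```python
from enum import Enum


class Tier(Enum):
    ALWAYS = "always_do"
    ASK_FIRST = "ask_first"
    NEVER = "never_do"


# Hypothetical action names for the customer support example.
POLICY = {
    "acknowledge_issue": Tier.ALWAYS,
    "search_knowledge_base": Tier.ALWAYS,
    "offer_refund_over_50": Tier.ASK_FIRST,
    "escalate_to_human": Tier.ASK_FIRST,
    "access_billing_data": Tier.ASK_FIRST,
    "promise_delivery_date": Tier.NEVER,
    "share_other_customer_data": Tier.NEVER,
}

# Below this confidence, even "always do" actions are shown as drafts.
CONFIDENCE_FLOOR = 0.7


def decide(action: str, confidence: float) -> str:
    """Return 'execute', 'needs_approval', or 'block' for a proposed action."""
    # Assumption: actions missing from the policy get the most restrictive tier.
    tier = POLICY.get(action, Tier.NEVER)
    if tier is Tier.NEVER:
        return "block"
    if tier is Tier.ASK_FIRST or confidence < CONFIDENCE_FLOOR:
        return "needs_approval"
    return "execute"
```

A table like `POLICY` doubles as a QA checklist: every row is a scenario someone has to test.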
Evaluation Criteria
This is the section that separates AI PRDs from traditional ones. You need to answer a question that most PMs never encounter with deterministic features: "How good is good enough?"
Miqdad Jaffer, a Product Lead at OpenAI, recommends defining evaluation in two phases:
Phase 1: Pre-launch eval. Before deployment, build a ground truth dataset of 50-100 examples that represent the range of inputs your feature will see. For each example, define the ideal output. Then measure model performance against this dataset using metrics appropriate to your feature type:
| Feature type | Key metric | Minimum threshold |
|---|---|---|
| Text generation | Human preference rating | 4.2/5.0 average across 100 samples |
| Classification | Precision and recall | 92% precision, 88% recall |
| Search/retrieval | nDCG@10 | 0.75 |
| Recommendations | Click-through rate | 2x baseline |
| Summarization | Factual accuracy (LLM-as-judge) | 95% of facts verifiable against source |
Phase 2: Post-launch monitoring. Define the metrics that tell you the feature is degrading after deployment. This goes in your PRD alongside the launch criteria. More on this in the Monitoring Thresholds section below.
The LLM evaluation framework provides a detailed walkthrough of how to structure evals across accuracy, latency, cost, and safety dimensions. It's worth reading before you finalize this section of your PRD.
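For the classification row in the table above, the Phase 1 launch gate can be expressed in a few lines. This is a minimal sketch for a binary classifier: `precision_recall` and `passes_launch_gate` are hypothetical helper names, and the defaults simply mirror the 92% precision and 88% recall thresholds from the table.

```python
def precision_recall(labels: list[bool], predictions: list[bool]) -> tuple[float, float]:
    """Compute precision and recall for a binary classifier."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def passes_launch_gate(
    labels: list[bool],
    predictions: list[bool],
    min_precision: float = 0.92,
    min_recall: float = 0.88,
) -> bool:
    """True only if both thresholds from the PRD are met on the ground truth set."""
    precision, recall = precision_recall(labels, predictions)
    return precision >= min_precision and recall >= min_recall
```

With only 50-100 examples, one or two misclassifications can move these numbers by several points, so treat a narrow pass as a reason to grow the dataset rather than a green light.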
Action Guardrails
Guardrails aren't a nice-to-have. Productboard's AI team calls them "core product requirements, not an afterthought." If you're not specifying where the model should stop, you're not really specifying what it should do.
Your PRD should define guardrails across four layers:
Input filtering. What happens when a user sends malicious, nonsensical, or out-of-scope input? Specify the validation rules. "Reject inputs over 10,000 tokens. Strip HTML tags. Detect and block prompt injection patterns."
Output validation. What checks run before the model's output reaches the user? "All generated content is checked against a fact-verification step before display. Outputs containing PII patterns (SSN, credit card, email) are redacted automatically."
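As a rough illustration of the PII redaction requirement, here is a sketch using Python's standard `re` module. The patterns are deliberately simple and are assumptions for the example, not production-grade detectors: a real system would likely add stricter validation (for instance a Luhn check on card numbers) or use a dedicated PII detection service.

```python
import re

# Simplified illustrative patterns; expect false positives and misses.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by spaces or hyphens.
    "CARD": re.compile(r"\b(?:\d[ -]?){12,15}\d\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact_pii(text: str) -> str:
    """Replace each matched PII pattern with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text
```

Writing the requirement at this level of detail in the PRD (which patterns, what the placeholder looks like) gives QA something concrete to test.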
Action boundaries. If your AI feature can take actions (send emails, update records, make API calls), define which actions require human approval and which can execute autonomously. This is where the human-in-the-loop vs. fully automated decision becomes a concrete product requirement.
Escalation triggers. Define the specific conditions under which the AI stops and hands off to a human. "If the user expresses frustration (sentiment score below -0.5 for 2 consecutive messages), transfer to a human agent within 30 seconds."
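The escalation rule in that example is precise enough to code directly. The sketch below assumes sentiment scores arrive from some upstream sentiment model on a -1 to 1 scale, one score per user message; `should_escalate` is a hypothetical name, and the 30-second transfer deadline would be enforced elsewhere.

```python
def should_escalate(
    sentiment_scores: list[float],
    threshold: float = -0.5,
    consecutive: int = 2,
) -> bool:
    """True when the most recent `consecutive` messages all score below `threshold`."""
    if len(sentiment_scores) < consecutive:
        return False
    recent = sentiment_scores[-consecutive:]
    return all(score < threshold for score in recent)
```

Keeping `threshold` and `consecutive` as parameters lets the team tune them after launch without rewriting the requirement.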
Monitoring Thresholds
Traditional PRDs don't include monitoring requirements because traditional features either work or throw errors. AI features need monitoring baked into the requirements document because they degrade silently.
Specify these in your PRD:
- Accuracy drift threshold: "If weekly eval accuracy drops below 88%, trigger an alert to the ML team and flag for model retraining."
- Latency SLA: "P95 response time must stay below 3 seconds. If P95 exceeds 5 seconds for more than 15 minutes, fall back to a cached/static response."
- Cost ceiling: "Monthly inference cost for this feature must not exceed $2,500. If projected spend exceeds 80% of budget by mid-month, reduce batch sizes and alert the PM."
- User feedback loop: "Surface a thumbs-up/thumbs-down on every AI-generated response. If the negative feedback rate exceeds 15% over a 7-day rolling window, pause the feature for review."
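The four thresholds above could be checked together in a single scheduled job. This is a sketch under stated assumptions: `MetricsSnapshot`, its field names, and the action strings are hypothetical, and "projected spend by mid-month" is interpreted here as month-to-date spend on or before day 15.

```python
from dataclasses import dataclass


@dataclass
class MetricsSnapshot:
    eval_accuracy: float           # fraction correct on the weekly eval set
    minutes_p95_over_5s: int       # consecutive minutes with P95 latency above 5 s
    month_to_date_spend: float     # inference cost in USD
    day_of_month: int
    negative_feedback_rate: float  # thumbs-down share over a 7-day rolling window


def check_thresholds(s: MetricsSnapshot, monthly_budget: float = 2500.0) -> list[str]:
    """Return the actions the PRD requires for the current metrics."""
    actions = []
    if s.eval_accuracy < 0.88:
        actions.append("alert_ml_team")
    if s.minutes_p95_over_5s > 15:
        actions.append("serve_cached_fallback")
    if s.day_of_month <= 15 and s.month_to_date_spend > 0.8 * monthly_budget:
        actions.append("reduce_batch_size_and_alert_pm")
    if s.negative_feedback_rate > 0.15:
        actions.append("pause_for_review")
    return actions
```

Each branch maps one-to-one to a bullet in the PRD, which makes it easy to audit whether the monitoring that shipped matches the monitoring that was specified.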
What to Add to Your Existing PRD Template
You don't need to throw out your PRD template. You need to add five AI-specific sections to it. Here's the minimum viable addition:
1. Model Strategy
State which model you plan to use and why. Include the fallback plan.
- Primary model: Claude 3.5 Sonnet via API
- Fallback: Claude 3.5 Haiku (lower cost, acceptable quality for degraded mode)
- Why not fine-tuned: Volume doesn't justify fine-tuning cost ($15k+). Revisit at 100k monthly requests.
This section prevents the "just use GPT-4" default that leads to cost overruns. It forces the team to evaluate whether a smaller, cheaper model achieves the quality threshold defined in your eval criteria. The AI product lifecycle framework covers how model selection decisions evolve as your product matures.
2. Data Requirements
AI features are only as good as the data they consume. Specify:
- Training/few-shot data: Where does it come from? How is it labeled? How often is it refreshed?
- Runtime context: What data does the model receive at inference time? User history, retrieved documents, structured metadata?
- Data freshness: How stale can the context data be before the feature degrades? Hours? Days? Weeks?
- Privacy constraints: What data categories are off-limits? PII handling? GDPR/CCPA compliance?
3. Failure States
Traditional PRDs define the happy path and maybe one error state. AI PRDs need a failure taxonomy:
| Failure mode | User experience | Recovery action |
|---|---|---|
| Model returns low confidence | Show "I'm not sure about this" label + human review option | Log for retraining |
| Model hallucinates facts | Fact-check layer catches it, returns "Unable to verify" | Escalate to human |
| Model is down / times out | Show cached or template-based fallback | Auto-retry with exponential backoff |
| Model produces harmful output | Content filter blocks it, shows generic safe response | Alert safety team |
| Model cost spikes unexpectedly | Rate limit per user, degrade to cheaper model | Alert PM + finance |
This table does something critical: it forces the team to design the degraded experience before launch, not after the first production incident.
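The "model is down / times out" row might translate into something like the sketch below. It retries the primary model with exponential backoff and then serves the fallback. The function names and the retry-then-fallback ordering are assumptions for illustration: a latency-sensitive feature might serve the cached response immediately and retry in the background instead. `primary` and `fallback` stand in for whatever client calls your stack uses.

```python
import time
from typing import Callable


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Return (source, response), where source is 'primary' or 'fallback'."""
    for attempt in range(max_retries):
        try:
            return "primary", primary(prompt)
        except (TimeoutError, ConnectionError):
            # Exponential backoff: 0.5 s, 1 s, 2 s, ... with no wait after the last try.
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    return "fallback", fallback(prompt)
```

Returning the source alongside the response lets the UI label degraded answers and lets monitoring count how often the fallback fires.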
4. Acceptance Criteria (Rewritten for AI)
Traditional acceptance criteria: "Given [input], when [action], then [exact output]."
That format doesn't work when outputs are probabilistic. Instead, rewrite acceptance criteria as statistical assertions:
- ☐ "Given 100 customer support queries from the test set, the model routes 90%+ to the correct department"
- ☐ "Given 50 product descriptions, the AI-generated summary is rated 4+/5 by 3 human reviewers in 80%+ of cases"
- ☐ "Given adversarial inputs (prompt injection attempts), the model refuses 100% of attempts to override system instructions"
- ☐ "The feature maintains <3 second P95 latency under 500 concurrent requests"
Notice the pattern: each criterion specifies a test set or load condition, a metric, and a threshold. This gives QA a concrete testing protocol rather than a subjective judgment call.
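A statistical assertion like the routing criterion maps naturally onto an automated test. The sketch below assumes a labeled test set of (query, expected department) pairs and a `route` callable that wraps the model. Both are hypothetical stand-ins for your own data and client.

```python
from typing import Callable


def routing_accuracy(
    test_set: list[tuple[str, str]],
    route: Callable[[str], str],
) -> float:
    """Share of (query, expected_department) pairs the router gets right."""
    correct = sum(1 for query, expected in test_set if route(query) == expected)
    return correct / len(test_set)


def meets_criterion(
    test_set: list[tuple[str, str]],
    route: Callable[[str], str],
    threshold: float = 0.90,
) -> bool:
    """Acceptance check: pass when accuracy reaches the PRD threshold."""
    return routing_accuracy(test_set, route) >= threshold
```

In a test suite this becomes a single `assert meets_criterion(test_set, route)`. Because model outputs vary between runs, it may be worth running the check several times and requiring every run to pass.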
5. Responsible AI Checklist
Before your AI feature ships, your PRD should require sign-off on these items. The responsible AI framework provides the full methodology, but here's the minimum checklist:
- ☐ Bias testing across demographic segments completed
- ☐ Content safety filters tested with adversarial inputs
- ☐ User disclosure that AI generated the content (where applicable)
- ☐ Data retention and deletion policy documented
- ☐ Human override mechanism available for all AI decisions
- ☐ Feedback mechanism for users to flag incorrect outputs
Real-World Examples
Notion AI (Document Summarization)
When Notion shipped AI-powered document summaries, they faced the classic "good enough" problem. A summary could be technically accurate but miss the most important points. Their requirements included a custom eval metric they called "salience coverage": the percentage of key points (as identified by human annotators) that appeared in the AI summary. Their launch threshold was 85% salience coverage across a 200-document test set.
They also specified a fallback: if the document exceeded the context window, the feature would summarize the first and last sections and display a "Partial summary" label rather than silently truncating.
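A metric like salience coverage is simple to compute once annotators have marked which key points made it into each summary. The sketch below is an illustrative reading of the definition above, not Notion's implementation: the function name, the input format, and the choice to pool key points across all documents (rather than averaging per-document scores) are assumptions.

```python
def salience_coverage(key_points_covered: list[list[bool]]) -> float:
    """Share of annotated key points that appear in the AI summaries.

    Each inner list holds one document's annotator judgments: True if that
    key point appears in the summary, False if it was missed.
    """
    total = sum(len(doc) for doc in key_points_covered)
    covered = sum(sum(doc) for doc in key_points_covered)
    return covered / total if total else 0.0
```

Comparing the result against a launch threshold such as 0.85 turns "the summary captures what matters" into a number the team can track release over release.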
Intercom's Fin (Customer Support Agent)
Intercom's AI support agent Fin launched with requirements that most traditional PRDs would never include: a "confidence gating" system. When Fin's confidence score dropped below a configurable threshold, it would transparently hand off to a human agent rather than risk a bad answer. The PRD specified the default threshold (0.8), the UI for the handoff ("Let me connect you with someone who can help"), and the monitoring dashboard that support managers would use to adjust the threshold per topic category.
The result: Fin resolved 50% of support queries autonomously while maintaining a CSAT score within 2 points of human agents, according to Intercom's 2025 case study.
Shopify's AI Product Descriptions
Miqdad Jaffer described how Shopify's "Auto Write" feature used a PRD structure with explicit guardrails: the model could never generate health claims, legal warranties, or pricing guarantees. The eval criteria required 95% factual accuracy verified against the merchant's actual product data. And the PRD defined a cost ceiling per generation that influenced the choice of model and prompt optimization strategy.
Common Mistakes to Avoid
Mistake 1: Treating AI acceptance criteria like traditional QA. If your acceptance criteria say "the summary is accurate," you've written something that can't be tested. Replace it with "the summary contains 85%+ of key facts from the source document, as measured by human annotation on a 100-document test set."
Mistake 2: Skipping the failure taxonomy. Teams that don't define failure states in the PRD end up building them reactively after production incidents. This is more expensive and produces worse user experiences.
Mistake 3: No cost constraints. AI inference costs are variable. A single user can trigger hundreds of API calls in a session. Without cost ceilings in the PRD, your team ships a feature that works in testing (low volume) and breaks the budget in production (high volume).
Mistake 4: Ignoring model drift. A model that performs well at launch can degrade over weeks as user behavior shifts. If your PRD doesn't include monitoring thresholds and a retraining trigger, you'll discover the degradation through angry customer support tickets instead of dashboards.
Mistake 5: Writing guardrails as aspirational goals. "The model should avoid harmful content" is not a guardrail. "All outputs pass through the content moderation API before display, and any output flagged as category 2+ harmful is replaced with a standard safe response" is a guardrail.
If you need to compare different approaches to building these guardrails into your product, the comparison of fine-tuning vs. RAG vs. prompt engineering covers the trade-offs between baking safety into the model versus applying it at the application layer.
The AI PRD Template (Sections to Add)
Copy these sections into your existing PRD template. They sit after your standard problem statement, user stories, and scope sections.
Section A: Model Strategy
- Primary model + version
- Fallback model
- Rationale (cost, latency, quality trade-offs)
- Fine-tuning vs. API decision + revisit criteria
Section B: Behavior Boundaries
- Always do (autonomous actions)
- Ask first (human-in-the-loop actions)
- Never do (hard stops)
Section C: Evaluation Criteria
- Ground truth dataset description (size, source, coverage)
- Pre-launch metrics + thresholds
- Post-launch monitoring metrics + alert thresholds
Section D: Data Requirements
- Training/few-shot data source and refresh cadence
- Runtime context architecture
- Privacy and compliance constraints
Section E: Failure Taxonomy
- Table of failure modes, user experiences, and recovery actions
- Fallback behavior for each degraded state
- Escalation paths
Section F: Responsible AI Checklist
- Bias testing sign-off
- Safety testing sign-off
- User disclosure requirements
- Data retention policy
If you want a structured starting point, Forge can generate an AI-specific PRD from a feature brief in 30 seconds. It includes the standard sections plus evaluation criteria and guardrail prompts that you can customize. For a broader view of how PRDs compare to other spec formats, see our PRD vs. product brief vs. spec comparison.
What to Do Next
Start with one section. If you're writing a PRD for an AI feature this week, add the Behavior Boundaries section (always do / ask first / never do) and the Evaluation Criteria table. These two sections alone will force the conversations that prevent most AI feature failures.
Then build the failure taxonomy before your team starts implementation. The cost of designing fallback states in a PRD is a few hours of thought. The cost of designing them after a production incident is weeks of reactive engineering and lost user trust.
For the complete picture of how AI products move from concept to production, read the AI product lifecycle framework. It covers the full journey from model selection through monitoring and iteration.