How to Write Product Requirements for AI Features

A practical framework for writing PRDs that account for probabilistic behavior, evaluation criteria, guardrails, and failure modes in AI products.

Published 2026-05-27

Gartner projects that more than 40% of agentic AI projects will be canceled by the end of 2027. An MIT report found 95% of generative AI pilots fail to deliver measurable impact. The most common root cause isn't bad models or weak engineering. It's bad requirements.

Product managers writing PRDs for AI features face a problem that no one trained them for: traditional requirements assume deterministic behavior. You specify inputs, define expected outputs, and write acceptance criteria that either pass or fail. AI features don't work that way. A recommendation engine might return good results 87% of the time. A text summarizer might hallucinate facts that sound plausible. A classification model might degrade silently over months as the data distribution shifts.

The PRD you wrote for your last CRUD feature won't survive contact with a model that generates different outputs for identical inputs. You need new sections, new metrics, and a fundamentally different approach to defining "done."

This post walks through how to write product requirements that account for probabilistic behavior, evaluation thresholds, guardrails, and the failure modes that kill AI features in production.

Why Traditional PRDs Break for AI Features

Standard PRDs work by specifying behavior as rules. "When the user clicks Submit, the form validates all required fields and displays inline errors." This is deterministic. The same input always produces the same output. QA can write a test, verify it passes, and move on.

AI features introduce three problems that this model can't handle.

Problem 1: Non-deterministic outputs. Ask an LLM-powered feature to summarize the same document twice and you'll get two different summaries. Both might be good. One might hallucinate a fact. There is no single "correct" output to test against.

Problem 2: Gradient quality. Traditional features are binary: they work or they don't. AI features exist on a spectrum. A search result can be slightly relevant, mostly relevant, or exactly right. Your requirements need to define where "good enough" lives on that spectrum.

Problem 3: Silent degradation. A traditional feature breaks visibly. An API returns a 500 error, a button stops responding, a page crashes. AI features degrade invisibly. A recommendation engine slowly gets worse as user behavior shifts. A classification model loses accuracy as live data diverges from the data it was trained on. By the time anyone notices, the damage is already done.

If you're wrestling with the broader question of whether AI is the right approach for your feature, the AI build vs. buy framework is a good starting point. It helps you evaluate whether you need a custom model, a fine-tuned API, or a third-party solution before you start writing requirements.

The BEAM Framework for AI Feature Requirements

After reviewing dozens of failed AI launches and studying how teams at Shopify, Notion, and Intercom structure their AI PRDs, a pattern emerges. Successful AI feature requirements cover four areas that traditional PRDs skip entirely. I call this BEAM: Behavior Boundaries, Evaluation Criteria, Action Guardrails, and Monitoring Thresholds.

Behavior Boundaries

This is where you define what the AI feature should and should not do. Traditional PRDs handle scope with a "what's included / what's not included" table. For AI features, you need something more granular.

A behavior boundary document specifies three tiers, adapted from a framework that Addy Osmani published on O'Reilly Radar:

  • Always do: Actions the model should take without human review. Example: "Always include the original source link when summarizing a document."
  • Ask first: Actions that require human confirmation before executing. Example: "When the confidence score is below 0.7, present the suggestion as a draft for user review rather than auto-applying."
  • Never do: Hard stops the model must not cross, regardless of how the prompt is phrased. Example: "Never generate medical, legal, or financial advice. If the user's query touches these domains, surface a disclaimer and suggest consulting a professional."

This three-tier structure gives your engineering team clear guardrail requirements and gives your QA team specific scenarios to test against.

Here's what this looks like in practice for a customer support AI:

  • Always do: Greet the user, acknowledge their issue, search the knowledge base
  • Ask first: Offer a refund over $50, escalate to a human agent, access account billing data
  • Never do: Promise delivery dates, make commitments about product roadmap, share other customers' data
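
One way to make the three tiers testable is to encode them as a lookup that the application consults before the model acts. The sketch below is illustrative only: the action names, the `tier_for` helper, and the choice to treat refunds of $50 or less as autonomous are assumptions, not part of the framework described above.

```python
from enum import Enum


class Tier(Enum):
    ALWAYS = "always_do"
    ASK_FIRST = "ask_first"
    NEVER = "never_do"


# Hypothetical action names based on the customer support example.
POLICY = {
    "search_knowledge_base": Tier.ALWAYS,
    "escalate_to_human": Tier.ASK_FIRST,
    "access_billing_data": Tier.ASK_FIRST,
    "promise_delivery_date": Tier.NEVER,
    "share_other_customer_data": Tier.NEVER,
}

REFUND_APPROVAL_LIMIT = 50.00  # dollars; refunds over this need confirmation


def tier_for(action: str, amount: float = 0.0) -> Tier:
    """Return the tier for an action. Unknown actions default to NEVER."""
    if action == "offer_refund":
        # Assumption: refunds at or below the limit may run autonomously.
        # The example above only states that refunds over $50 need approval.
        return Tier.ASK_FIRST if amount > REFUND_APPROVAL_LIMIT else Tier.ALWAYS
    return POLICY.get(action, Tier.NEVER)
```

Defaulting unknown actions to the most restrictive tier means a new capability has to be classified explicitly before it can run.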

Evaluation Criteria

This is the section that separates AI PRDs from traditional ones. You need to answer a question that most PMs never encounter with deterministic features: "How good is good enough?"

Miqdad Jaffer, a Product Lead at OpenAI, recommends defining evaluation in two phases:

Phase 1: Pre-launch eval. Before deployment, build a ground truth dataset of 50-100 examples that represent the range of inputs your feature will see. For each example, define the ideal output. Then measure model performance against this dataset using metrics appropriate to your feature type:

| Feature type | Key metric | Minimum threshold |
| --- | --- | --- |
| Text generation | Human preference rating | 4.2/5.0 average across 100 samples |
| Classification | Precision and recall | 92% precision, 88% recall |
| Search/retrieval | nDCG@10 | 0.75 |
| Recommendations | Click-through rate | 2x baseline |
| Summarization | Factual accuracy (LLM-as-judge) | 95% of facts verifiable against source |
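
To show how a threshold like the classification row becomes a launch gate, here is a minimal sketch in plain Python. The function names and the binary-label setup are assumptions for illustration. A real eval would run against your own ground truth dataset and probably use an established metrics library.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def passes_launch_gate(y_true, y_pred, min_precision=0.92, min_recall=0.88):
    """True only when both metrics meet the thresholds from the PRD."""
    precision, recall = precision_recall(y_true, y_pred)
    return precision >= min_precision and recall >= min_recall
```

Both conditions must hold: a model with perfect precision but 80% recall still fails the gate.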

Phase 2: Post-launch monitoring. Define the metrics that tell you the feature is degrading after deployment. This goes in your PRD alongside the launch criteria. More on this in the Monitoring Thresholds section below.

The LLM evaluation framework provides a detailed walkthrough of how to structure evals across accuracy, latency, cost, and safety dimensions. It's worth reading before you finalize this section of your PRD.

Action Guardrails

Guardrails aren't a nice-to-have. Productboard's AI team calls them "core product requirements, not an afterthought." If you're not specifying where the model should stop, you're not really specifying what it should do.

Your PRD should define guardrails across four layers:

Input filtering. What happens when a user sends malicious, nonsensical, or out-of-scope input? Specify the validation rules. "Reject inputs over 10,000 tokens. Strip HTML tags. Detect and block prompt injection patterns."

Output validation. What checks run before the model's output reaches the user? "All generated content is checked against a fact-verification step before display. Outputs containing PII patterns (SSN, credit card, email) are redacted automatically."

Action boundaries. If your AI feature can take actions (send emails, update records, make API calls), define which actions require human approval and which can execute autonomously. This is where the human-in-the-loop vs. fully automated decision becomes a concrete product requirement.

Escalation triggers. Define the specific conditions under which the AI stops and hands off to a human. "If the user expresses frustration (sentiment score below -0.5 for 2 consecutive messages), transfer to a human agent within 30 seconds."
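
The escalation example above is specific enough to express as a small check. This sketch assumes an upstream component already produces a per-message sentiment score (for instance in the range -1 to 1). The `should_escalate` name and its parameters are hypothetical.

```python
def should_escalate(sentiment_scores, threshold=-0.5, consecutive=2):
    """Return True when the most recent `consecutive` scores are all below `threshold`.

    sentiment_scores: per-message scores in chronological order.
    """
    if len(sentiment_scores) < consecutive:
        return False
    recent = sentiment_scores[-consecutive:]
    return all(score < threshold for score in recent)
```

The check is meant to run after each incoming message, so it only looks at the tail of the conversation. Note that "below -0.5" is treated as a strict comparison here. Whether a score of exactly -0.5 counts is the kind of detail the PRD should state.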

Monitoring Thresholds

Traditional PRDs don't include monitoring requirements because traditional features either work or throw errors. AI features need monitoring baked into the requirements document because they degrade silently.

Specify these in your PRD:

  • Accuracy drift threshold: "If weekly eval accuracy drops below 88%, trigger an alert to the ML team and flag for model retraining."
  • Latency SLA: "P95 response time must stay below 3 seconds. If P95 exceeds 5 seconds for more than 15 minutes, fall back to a cached/static response."
  • Cost ceiling: "Monthly inference cost for this feature must not exceed $2,500. If projected spend exceeds 80% of budget by mid-month, reduce batch sizes and alert the PM."
  • User feedback loop: "Surface a thumbs-up/thumbs-down on every AI-generated response. If the negative feedback rate exceeds 15% over a 7-day rolling window, pause the feature for review."
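
As an illustration of the user feedback threshold, the sketch below computes the negative feedback rate over a 7-day rolling window. The event format (a timestamp paired with an `is_negative` flag) and the function names are assumptions. A production version would more likely query an analytics store than filter an in-memory list.

```python
from datetime import datetime, timedelta


def negative_feedback_rate(events, now, window_days=7):
    """events: iterable of (timestamp, is_negative) pairs.

    Returns the share of negative votes inside the window, or None if
    there were no votes at all.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [is_negative for ts, is_negative in events if cutoff < ts <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)


def should_pause(events, now, max_negative_rate=0.15):
    """True when the rolling negative rate exceeds the PRD threshold."""
    rate = negative_feedback_rate(events, now)
    return rate is not None and rate > max_negative_rate
```

Returning `None` for an empty window keeps "no feedback yet" distinct from "0% negative", which matters for low-traffic features.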

What to Add to Your Existing PRD Template

You don't need to throw out your PRD template. You need to add five AI-specific sections to it. Here's the minimum viable addition:

1. Model Strategy

State which model you plan to use and why. Include the fallback plan.

Primary model: Claude 3.5 Sonnet via API
Fallback: Claude 3.5 Haiku (lower cost, acceptable quality for degraded mode)
Why not fine-tuned: Volume doesn't justify fine-tuning cost ($15k+).
  Revisit at 100k monthly requests.

This section prevents the "just use GPT-4" default that leads to cost overruns. It forces the team to evaluate whether a smaller, cheaper model achieves the quality threshold defined in your eval criteria. The AI product lifecycle framework covers how model selection decisions evolve as your product matures.

2. Data Requirements

AI features are only as good as the data they consume. Specify:

  • Training/few-shot data: Where does it come from? How is it labeled? How often is it refreshed?
  • Runtime context: What data does the model receive at inference time? User history, retrieved documents, structured metadata?
  • Data freshness: How stale can the context data be before the feature degrades? Hours? Days? Weeks?
  • Privacy constraints: What data categories are off-limits? PII handling? GDPR/CCPA compliance?

3. Failure States

Traditional PRDs define the happy path and maybe one error state. AI PRDs need a failure taxonomy:

| Failure mode | User experience | Recovery action |
| --- | --- | --- |
| Model returns low confidence | Show "I'm not sure about this" label + human review option | Log for retraining |
| Model hallucinates facts | Fact-check layer catches it, returns "Unable to verify" | Escalate to human |
| Model is down / times out | Show cached or template-based fallback | Auto-retry with exponential backoff |
| Model produces harmful output | Content filter blocks it, shows generic safe response | Alert safety team |
| Model cost spikes unexpectedly | Rate limit per user, degrade to cheaper model | Alert PM + finance |

This table does something critical: it forces the team to design the degraded experience before launch, not after the first production incident.
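
To make the "model is down / times out" failure mode concrete, here is a simplified sketch that retries with exponential backoff and then returns a fallback. The `call_model` callable, the retry counts, and the response shape are all hypothetical. A real system might instead show the fallback immediately and retry in the background.

```python
import time


def call_with_fallback(call_model, fallback_response, max_retries=3,
                       base_delay=0.5, sleep=time.sleep):
    """Try the model, retrying on timeout with exponential backoff.

    Returns a dict that records whether the text came from the model or
    from the fallback, so the UI can label degraded responses.
    """
    for attempt in range(max_retries):
        try:
            return {"source": "model", "text": call_model()}
        except TimeoutError:
            # Wait 0.5s, then 1s, and so on. Skip the wait after the last attempt.
            if attempt < max_retries - 1:
                sleep(base_delay * (2 ** attempt))
    return {"source": "fallback", "text": fallback_response}
```

Passing `sleep` as a parameter keeps the function testable without real delays.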

4. Acceptance Criteria (Rewritten for AI)

Traditional acceptance criteria: "Given [input], when [action], then [exact output]."

That format doesn't work when outputs are probabilistic. Instead, rewrite acceptance criteria as statistical assertions:

  • "Given 100 customer support queries from the test set, the model routes 90%+ to the correct department"
  • "Given 50 product descriptions, the AI-generated summary is rated 4+/5 by 3 human reviewers in 80%+ of cases"
  • "Given adversarial inputs (prompt injection attempts), the model refuses 100% of attempts to override system instructions"
  • "The feature maintains <3 second P95 latency under 500 concurrent requests"

Notice the pattern: each criterion names a test population or load condition, a metric, and a threshold. Where you can, state the sample size explicitly, as the first two criteria do. This gives QA a concrete testing protocol rather than a subjective judgment call.
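
A statistical acceptance criterion such as the routing example translates directly into an automated check. The sketch below is one possible shape. The `route` function under test, the test set format, and the helper names are assumptions for illustration.

```python
def routing_results(test_set, route):
    """test_set: list of (query, expected_department) pairs.

    route: the function under test, mapping a query to a department.
    Returns one boolean per test case.
    """
    return [route(query) == expected for query, expected in test_set]


def meets_threshold(results, threshold):
    """True when the share of passing cases is at least `threshold`."""
    if not results:
        raise ValueError("test set is empty")
    return sum(results) / len(results) >= threshold
```

In a test suite this would typically appear as a single assertion, for example `assert meets_threshold(routing_results(test_set, route), 0.90)`, run against the fixed 100-query test set.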

5. Responsible AI Checklist

Before your AI feature ships, your PRD should require sign-off on these items. The responsible AI framework provides the full methodology, but here's the minimum checklist:

  • Bias testing across demographic segments completed
  • Content safety filters tested with adversarial inputs
  • User disclosure that AI generated the content (where applicable)
  • Data retention and deletion policy documented
  • Human override mechanism available for all AI decisions
  • Feedback mechanism for users to flag incorrect outputs

Real-World Examples

Notion AI (Document Summarization)

When Notion shipped AI-powered document summaries, they faced the classic "good enough" problem. A summary could be technically accurate but miss the most important points. Their requirements included a custom eval metric they called "salience coverage": the percentage of key points (as identified by human annotators) that appeared in the AI summary. Their launch threshold was 85% salience coverage across a 200-document test set.

They also specified a fallback: if the document exceeded the context window, the feature would summarize the first and last sections and display a "Partial summary" label rather than silently truncating.

Intercom's Fin (Customer Support Agent)

Intercom's AI support agent Fin launched with requirements that most traditional PRDs would never include: a "confidence gating" system. When Fin's confidence score dropped below a configurable threshold, it would transparently hand off to a human agent rather than risk a bad answer. The PRD specified the default threshold (0.8), the UI for the handoff ("Let me connect you with someone who can help"), and the monitoring dashboard that support managers would use to adjust the threshold per topic category.

The result: Fin resolved 50% of support queries autonomously while maintaining a CSAT score within 2 points of human agents, according to Intercom's 2025 case study.
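
The confidence gating idea described above can be sketched generically. This is not Intercom's implementation. The `respond` function, the per-topic override dictionary, and the response shape are hypothetical. Only the 0.8 default and the handoff message come from the description above.

```python
DEFAULT_THRESHOLD = 0.8
HANDOFF_MESSAGE = "Let me connect you with someone who can help"


def respond(answer, confidence, topic, topic_thresholds=None):
    """Return the model's answer, or a handoff when confidence is too low.

    topic_thresholds: optional dict of per-topic overrides, standing in
    for the thresholds a support manager would adjust in a dashboard.
    """
    threshold = (topic_thresholds or {}).get(topic, DEFAULT_THRESHOLD)
    if confidence < threshold:
        return {"handoff": True, "text": HANDOFF_MESSAGE}
    return {"handoff": False, "text": answer}
```

For example, a stricter threshold for billing topics would hand off an answer that general topics would let through at the same confidence.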

Shopify's AI Product Descriptions

Miqdad Jaffer described how Shopify's "Auto Write" feature used a PRD structure with explicit guardrails: the model could never generate health claims, legal warranties, or pricing guarantees. The eval criteria required 95% factual accuracy verified against the merchant's actual product data. And the PRD defined a cost ceiling per generation that influenced the choice of model and prompt optimization strategy.

Common Mistakes to Avoid

Mistake 1: Treating AI acceptance criteria like traditional QA. If your acceptance criteria say "the summary is accurate," you've written something that can't be tested. Replace it with "the summary contains 85%+ of key facts from the source document, as measured by human annotation on a 100-document test set."

Mistake 2: Skipping the failure taxonomy. Teams that don't define failure states in the PRD end up building them reactively after production incidents. This is more expensive and produces worse user experiences.

Mistake 3: No cost constraints. AI inference costs are variable. A single user can trigger hundreds of API calls in a session. Without cost ceilings in the PRD, your team ships a feature that works in testing (low volume) and breaks the budget in production (high volume).

Mistake 4: Ignoring model drift. A model that performs well at launch can degrade over weeks as user behavior shifts. If your PRD doesn't include monitoring thresholds and a retraining trigger, you'll discover the degradation through angry customer support tickets instead of dashboards.

Mistake 5: Writing guardrails as aspirational goals. "The model should avoid harmful content" is not a guardrail. "All outputs pass through the content moderation API before display, and any output flagged as category 2+ harmful is replaced with a standard safe response" is a guardrail.

If you need to compare different approaches to building these guardrails into your product, the comparison of fine-tuning vs. RAG vs. prompt engineering covers the trade-offs between baking safety into the model versus applying it at the application layer.

The AI PRD Template (Sections to Add)

Copy these sections into your existing PRD template. They sit after your standard problem statement, user stories, and scope sections.

Section A: Model Strategy

  • Primary model + version
  • Fallback model
  • Rationale (cost, latency, quality trade-offs)
  • Fine-tuning vs. API decision + revisit criteria

Section B: Behavior Boundaries

  • Always do (autonomous actions)
  • Ask first (human-in-the-loop actions)
  • Never do (hard stops)

Section C: Evaluation Criteria

  • Ground truth dataset description (size, source, coverage)
  • Pre-launch metrics + thresholds
  • Post-launch monitoring metrics + alert thresholds

Section D: Data Requirements

  • Training/few-shot data source and refresh cadence
  • Runtime context architecture
  • Privacy and compliance constraints

Section E: Failure Taxonomy

  • Table of failure modes, user experiences, and recovery actions
  • Fallback behavior for each degraded state
  • Escalation paths

Section F: Responsible AI Checklist

  • Bias testing sign-off
  • Safety testing sign-off
  • User disclosure requirements
  • Data retention policy

If you want a structured starting point, Forge can generate an AI-specific PRD from a feature brief in 30 seconds. It includes the standard sections plus evaluation criteria and guardrail prompts that you can customize. For a broader view of how PRDs compare to other spec formats, see our PRD vs. product brief vs. spec comparison.

What to Do Next

Start with one section. If you're writing a PRD for an AI feature this week, add the Behavior Boundaries section (always do / ask first / never do) and the Evaluation Criteria table. These two sections alone will force the conversations that prevent most AI feature failures.

Then build the failure taxonomy before your team starts implementation. The cost of designing fallback states in a PRD is a few hours of thought. The cost of designing them after a production incident is weeks of reactive engineering and lost user trust.

For the complete picture of how AI products move from concept to production, read the AI product lifecycle framework. It covers the full journey from model selection through monitoring and iteration.

Frequently Asked Questions

How are AI feature requirements different from traditional PRDs?

Traditional PRDs assume deterministic behavior: same input, same output, binary pass/fail testing. AI feature requirements must account for probabilistic outputs, gradient quality (not just works/doesn't work), silent degradation over time, and the need for statistical acceptance criteria. You also need sections that traditional PRDs don't include: model strategy, evaluation datasets, guardrail definitions, monitoring thresholds, and a failure taxonomy.

When should I use the BEAM framework vs. a standard PRD?

Use BEAM sections any time your feature involves a machine learning model, LLM, or AI-powered decision system. If the feature's behavior is deterministic (rules-based, no model inference), a standard PRD is fine. The dividing line: if you can't write a single "expected output" for a given input because the output is generated or predicted, you need BEAM.

What tools help with writing AI product requirements?

[Forge](/tools/forge) generates structured PRDs with AI-specific sections. [ChatPRD](https://www.chatprd.ai) specializes in AI-assisted PRD writing with templates for LLM-powered features. For evaluation criteria specifically, the [LLM evaluation framework](/frameworks/llm-evaluation-framework) provides metrics and threshold guidance. Beyond dedicated tools, teams are increasingly using their own AI products (Claude, GPT-4) to draft the initial PRD, then refining the AI-specific sections manually.

What are the biggest mistakes PMs make when speccing AI features?

The top three: writing acceptance criteria that can't be objectively tested ("the output should be helpful"), skipping the failure taxonomy (which leads to panicked fixes after the first production incident), and omitting cost constraints (which leads to budget blowouts when real usage patterns differ from testing). A fourth common mistake is treating guardrails as optional or aspirational rather than concrete, testable requirements with specific implementation details.

How do I define "good enough" for an AI feature?

Build a ground truth dataset of 50-100 representative inputs with human-rated ideal outputs. Run your model against this dataset and measure the relevant metric (accuracy, salience, preference score). Set the launch threshold based on the user impact of errors: customer-facing features with financial implications need higher thresholds (95%+) than internal productivity features (80%+ may suffice). Then define a post-launch monitoring threshold that's slightly below launch quality. If quality drops below that line, trigger a review.