"Should we add AI?" is the wrong question. The right question is: what user problem does AI solve that you can't solve otherwise, and what happens when it's wrong?
That reframe matters because it moves the conversation from technology to outcomes. Most AI feature failures don't happen because the model is bad. They happen because the PM never specified what "good" looks like, didn't design for failure cases, and shipped before the trust mechanics were in place.
Here's a practical framework for making these decisions well.
The 4 Decision Filters
Before you write a spec, run the feature idea through all four of these. A "no" on any one of them doesn't automatically kill the idea, but it does require a deliberate answer.
Filter 1: Does AI produce a qualitatively better outcome than rule-based logic?
AI is not always the right tool. If a decision tree or a simple algorithm can do the job reliably and deterministically, it's usually the better choice. AI adds value when the answer space is too large or too varied for rules, when you need to handle natural language, or when the quality of output genuinely scales with model capability.
Ask: "Could I build this with if-then logic and get 90% of the value?" If yes, start there. Deterministic systems are easier to debug, audit, and explain to users.
Filter 2: Do users have enough trust in AI output to act on it?
This depends on two things: the stakes of being wrong, and whether users can easily verify the output. A writing suggestion in a low-stakes context gets acted on with almost no friction. A prioritization recommendation in a strategy review needs provenance and the ability to override.
Trust is not static. It builds through track record and collapses the first time the model gives a confident, wrong answer that costs the user something. Design trust calibration into the feature from the start, not as an afterthought.
Filter 3: Do you have (or can you get) the data to make the AI useful?
A generic LLM will give you generic outputs. Useful AI features are usually grounded in something specific: a user's history, a company's documents, a curated knowledge base. Before you commit to building, map out what data the model needs and whether you have it.
Also consider quality. Training on bad data produces bad outputs. Retrieval-augmented generation on a disorganized knowledge base produces disorganized answers. The AI is only as good as what you feed it.
Filter 4: What's the failure mode, and is it acceptable?
Every AI feature fails sometimes. The question is: what does failure look like for the user, and can they recover from it?
A recommendation engine that occasionally surfaces irrelevant content is annoying but recoverable. An AI that produces a confident incorrect medical dosage is catastrophic. Map the failure mode before you build. This connects directly to the red teaming process, which the red teaming guide covers in depth.
The AI Feature Risk Matrix
Use this 2x2 to decide how much human oversight to build in:
| | Low confidence in model | High confidence in model |
|---|---|---|
| High stakes | Human-in-the-loop mandatory | Human review recommended |
| Low stakes | Show with uncertainty indicator | Automate, monitor override rate |
"Stakes" means: what's the cost to the user if the AI is wrong? "Confidence" means: how well can you measure and predict model accuracy on this task?
If you're in the top-left quadrant (high stakes, low confidence), you should not be shipping AI automation. You're shipping a suggestion with human sign-off required. If you're in the bottom-right, full automation is defensible. Monitor the override rate closely.
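The 2x2 above can be expressed as a small decision function. The quadrant labels come straight from the matrix; the function name and boolean inputs are illustrative.

```python
# Sketch of the risk matrix as code. Return strings mirror the table;
# the function name and inputs are assumptions for illustration.
def oversight_level(high_stakes: bool, high_confidence: bool) -> str:
    if high_stakes and not high_confidence:
        return "human-in-the-loop mandatory"
    if high_stakes and high_confidence:
        return "human review recommended"
    if not high_stakes and not high_confidence:
        return "show with uncertainty indicator"
    return "automate, monitor override rate"

print(oversight_level(high_stakes=True, high_confidence=False))
# human-in-the-loop mandatory
```

Encoding the policy this explicitly also makes it auditable: anyone can read off which quadrant a feature sits in and why.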
Common AI Feature Mistakes
Building AI for AI's sake. Adding AI because your competitor announced it or because leadership asked for "AI features" in the roadmap. The feature needs to solve a real user problem. If the primary driver is optics, the feature will underperform and erode trust in the AI systems you actually need.
Not designing for wrong answers. Every AI feature should have a graceful failure state. What does the user see when the model returns low-confidence output? What happens when it produces something nonsensical? These cases are not edge cases. They're regular events that should be part of your design spec.
Over-promising in marketing copy. "AI-powered" as a selling point sets a high expectation bar. When the feature underperforms, the disappointment scales with the promise. Be specific about what the AI does and what it doesn't do.
Ignoring latency UX. AI inference takes time. Streaming helps. Progress indicators help. But if your AI feature takes 8 seconds to respond in a context where users expect instant feedback, the technical accuracy of the output won't matter. Users will lose patience and stop using it.
How to Write an AI Feature Brief
A good AI feature brief includes everything a standard feature brief has, plus:
- Failure cases: List the three to five most likely failure modes. For each, describe the user impact and your mitigation.
- Confidence threshold: At what confidence level does the feature show output versus withhold it? Who owns this decision?
- Human fallback: When AI fails or is bypassed, what does the user do instead? This should never be a dead end.
- Eval plan: How will you measure model quality before ship and after? See the LLM evals guide for a step-by-step approach.
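Two of these additions, the confidence threshold and the human fallback, can be sketched together. The 0.8 cutoff, function name, and fallback string below are assumptions for illustration, not recommended values.

```python
# Illustrative confidence gate with a human fallback. The threshold
# value and names are assumptions, not recommendations.
CONFIDENCE_THRESHOLD = 0.8

def present(suggestion: str, confidence: float) -> dict:
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"mode": "show", "text": suggestion}
    # Withhold low-confidence output and route the user to a non-AI
    # path instead of a dead end.
    return {"mode": "fallback", "text": "Try a manual search instead."}

print(present("Suggested reply...", 0.92)["mode"])  # show
print(present("Suggested reply...", 0.41)["mode"])  # fallback
```

The point of writing it down in the brief is that someone owns the threshold and the fallback path before launch, not after the first bad output.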
Without these four additions, you're shipping an AI feature without the scaffolding that makes it safe to ship.
Evaluation Metrics for AI Features
Accuracy is not enough. A model can be 95% accurate in testing and still produce a feature that users don't trust or don't complete tasks with. You need three metric layers.
Model quality: Accuracy, precision/recall (for classification tasks), hallucination rate, latency at p50 and p99. These tell you whether the model is working. They don't tell you whether the feature is working.
User experience: Task completion rate with AI versus without. Time saved per task. Override rate (the percentage of AI suggestions users change or ignore). These are the metrics that reveal whether users are getting value.
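As a concrete illustration of the override rate, here's how it might be computed from event logs. The event names (`suggested`, `accepted`, `edited`, `dismissed`) and log shape are hypothetical, not a real analytics schema.

```python
# Hypothetical override-rate computation. Event names and log format
# are illustrative assumptions.
def override_rate(events: list[dict]) -> float:
    suggested = sum(1 for e in events if e["type"] == "suggested")
    overridden = sum(1 for e in events if e["type"] in ("edited", "dismissed"))
    return overridden / suggested if suggested else 0.0

events = [
    {"type": "suggested"}, {"type": "accepted"},
    {"type": "suggested"}, {"type": "edited"},
    {"type": "suggested"}, {"type": "dismissed"},
    {"type": "suggested"}, {"type": "accepted"},
]
print(override_rate(events))  # 0.5
```

A rising override rate is often the earliest signal that model quality has drifted, well before adoption or retention numbers move.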
Business impact: Feature adoption rate, retention delta for users who engage versus those who don't, NPS lift, revenue attribution. These connect AI performance to business outcomes.
The product metrics guide covers instrumentation strategy. For AI features specifically, instrument from day one. Retrofitting analytics into an AI feature is much harder than building it in at the start.
What to Prioritize
If you're evaluating a new AI feature idea, the order of operations is:
- Define the specific user problem. Not "users want AI." What task are they trying to complete, and how does AI make it better?
- Run the four filters. Identify which filters raise concerns and address them explicitly.
- Map the failure mode. Spend more time here than feels comfortable.
- Write the feature brief with the AI-specific additions.
- Define your eval plan before writing a line of code.
The when to add AI guide covers the strategic layer of this decision if you're evaluating AI at the product or portfolio level rather than the individual feature level. For the build layer, the prompt engineering guide is the right next step once you've decided to ship.