
How to Prioritize AI Features When RICE Breaks

Standard RICE scoring underestimates AI feature complexity. Learn RICE-A and three other frameworks for prioritizing AI features with real examples.

Published 2026-04-01

The 80% Failure Rate Problem

According to the RAND Corporation, over 80% of AI projects fail. That is twice the failure rate of non-AI IT projects. Gartner predicts 60% of AI projects will be abandoned by the end of 2026 due to lack of AI-ready data. McKinsey found fewer than 20% of AI pilots scale to production within 18 months.

These failures share a common root cause: teams prioritize AI features the same way they prioritize traditional ones. They run a RICE score, get a number, and slot it into the backlog. Six months later, the feature is over budget, underperforming, or dead.

The problem is not RICE itself. RICE works for deterministic software where inputs produce predictable outputs, effort scales linearly, and confidence can be estimated from comparable past work. AI features violate all three assumptions. The same input can produce different outputs. Going from 80% to 90% accuracy can take 10x the effort of reaching 80% in the first place. And confidence is nearly impossible to benchmark against historical data when you are shipping something your team has never built before.

This post covers four approaches to fixing this gap. You will walk away with a concrete scoring method you can use this week.

Why Standard RICE Fails for AI Features

Before jumping to solutions, it is worth understanding exactly where each RICE dimension breaks down for AI work.

Reach overestimates adoption. Traditional reach estimates assume a feature is either available or not. AI features have a gradient. A code completion tool might be technically available to every developer on your platform, but only useful to those writing in supported languages, using supported IDEs, and working on codebases where the model performs well. GitHub Copilot is available to millions, but its code acceptance rate varies from 61% for Java down to roughly 30% for less common languages. Your reach is not your user base. It is your user base multiplied by the model's coverage of their use cases.

Impact is probabilistic, not binary. When Duolingo launched Roleplay and Explain My Answer with GPT-4, the impact was not "users can now practice conversation" or "they cannot." The impact depends on how natural the conversation feels, how accurate the corrections are, and whether the model hallucinates vocabulary that does not exist. Duolingo's daily active users surged to 47 million (51% year-over-year growth), but that outcome was not guaranteed from the spec. Impact for AI features requires thinking in probability distributions, not point estimates.

Effort is nonlinear. Traditional effort estimation assumes a roughly linear relationship between scope and work. Add two more fields to a form, add two days of work. AI features break this. The first prototype might take a week. Getting it from a demo to 80% accuracy might take a month. Getting from 80% to 95% might take six months. Lenny Rachitsky's research across 50+ AI implementations found this nonlinear pattern is nearly universal.

Confidence is circular. In standard RICE, confidence reflects how sure you are about your other estimates. For traditional features, you calibrate against similar past work. For AI features, especially ones involving new models or novel use cases, there often is no comparable past work. Teams either default confidence to "medium" (meaningless) or inflate it based on demo performance (dangerous).

RICE-A: Adding the AI Complexity Dimension

Dr. Marily Nika, former AI product lead at Google and Meta, proposed RICE-A as a direct adaptation of RICE for AI features. The framework adds a fifth factor: AI Complexity.

The formula:

RICE-A Score = (Reach x Impact x Confidence) / (Effort x AI Complexity x 0.5)

The AI Complexity score (1-10 scale) captures three dimensions:

1. Data readiness (weight: 40%)

  • Do you have the training data you need?
  • Is it labeled, clean, and representative of production traffic?
  • What are the privacy and compliance constraints on using it?

Score 1-3: Data exists, is clean, no special compliance needs. Score 4-6: Data exists but requires significant cleaning or augmentation. Score 7-10: Data must be collected from scratch, requires labeling pipelines, or involves regulated data.

2. Model maturity (weight: 35%)

  • Can you use an off-the-shelf model, or do you need to fine-tune or train from scratch?
  • How mature is the model ecosystem for this task?
  • What is the compute cost profile?

Score 1-3: Off-the-shelf API call with well-documented capabilities. Score 4-6: Requires prompt engineering, RAG pipeline, or fine-tuning. Score 7-10: Requires custom model training, novel architecture, or novel research.

3. Operational overhead (weight: 25%)

  • What monitoring and evaluation infrastructure do you need?
  • How will you detect and handle model drift?
  • What is the retraining cadence?

Score 1-3: Standard logging plus manual spot checks. Score 4-6: Requires custom eval suite and automated monitoring. Score 7-10: Requires continuous retraining pipeline, A/B testing infrastructure, and human-in-the-loop review.

The 0.5 multiplier ensures AI Complexity is weighted proportionately. Without it, a high complexity score would dominate the formula and make nearly every AI feature score lower than every traditional feature. The goal is not to penalize AI. It is to surface the hidden costs that standard RICE ignores.
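In code, the framework reduces to a few lines. The weights (40/35/25) and the 0.5 multiplier come from the formula above; the function names are illustrative, not a standard API:

```python
# Sketch of the RICE-A math described above. Function names are illustrative.

def ai_complexity(data_readiness: float, model_maturity: float, ops_overhead: float) -> float:
    """Weighted AI Complexity score (1-10): data 40%, model 35%, ops 25%."""
    return 0.40 * data_readiness + 0.35 * model_maturity + 0.25 * ops_overhead

def rice_a(reach: float, impact: float, confidence: float,
           effort: float, complexity: float) -> float:
    """RICE-A = (Reach x Impact x Confidence) / (Effort x AI Complexity x 0.5)."""
    return (reach * impact * confidence) / (effort * complexity * 0.5)

# Sample values: complexity scores of 2, 3, 4 give a weighted average of 2.85
c = ai_complexity(2, 3, 4)        # 2.85
score = rice_a(10, 3, 0.8, 8, c)  # ~2.11
```

A traditional feature scores through the same function with a complexity of 1.0, which keeps AI and non-AI candidates on one comparable scale.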

Try scoring your own features with the RICE Calculator first, then layer on the AI Complexity adjustment manually. Compare the rankings before and after. You will find that features which looked equally prioritized in standard RICE often separate clearly once you account for data, model, and ops complexity.

Scoring AI Features in Practice: Three Examples

Abstract frameworks are useful. Concrete examples are better. Here is how RICE-A changes prioritization decisions for three real product scenarios.

Example 1: GitHub Copilot's Code Completion

Standard RICE:

  • Reach: 10 (every developer on the platform)
  • Impact: 3 (high value, frequent use)
  • Confidence: 80%
  • Effort: 8 (significant engineering)
  • Score: (10 x 3 x 0.8) / 8 = 3.0

RICE-A adjustment:

  • Data readiness: 2 (GitHub has billions of lines of open-source code)
  • Model maturity: 3 (Codex/GPT well-suited for code completion)
  • Operational overhead: 4 (needs latency monitoring, acceptance rate tracking)
  • AI Complexity: 2.85 (weighted average)
  • Score: (10 x 3 x 0.8) / (8 x 2.85 x 0.5) = 2.11

The RICE-A score is lower, but still high. More importantly, the breakdown tells you why this was a strong bet: GitHub's unique data advantage (readiness = 2) de-risked the hardest part of most AI features. This insight is invisible in standard RICE.

Example 2: Notion AI Writing Assistant

Standard RICE:

  • Reach: 8 (most Notion users write documents)
  • Impact: 2 (moderate value per use)
  • Confidence: 70%
  • Effort: 6
  • Score: (8 x 2 x 0.7) / 6 = 1.87

RICE-A adjustment:

  • Data readiness: 5 (user data available but privacy constraints limit training)
  • Model maturity: 3 (LLMs handle writing well, but RAG over workspace needed)
  • Operational overhead: 5 (quality monitoring across diverse use cases)
  • AI Complexity: 4.3
  • Score: (8 x 2 x 0.7) / (6 x 4.3 x 0.5) = 0.87

Significantly lower. The gap between standard RICE and RICE-A reflects the real challenge Notion faced: privacy constraints on data usage, the need for workspace-specific context (RAG), and quality monitoring across millions of different document types. Notion addressed this by bundling AI into existing plans rather than charging separately, driving adoption from 10-20% to over 50% of customers. The prioritization insight was correct: this feature needed a distribution strategy, not just a shipping date.

Example 3: A Hypothetical Fraud Detection Agent

Standard RICE:

  • Reach: 3 (only relevant to payments team)
  • Impact: 3 (high value when it works)
  • Confidence: 50%
  • Effort: 7
  • Score: (3 x 3 x 0.5) / 7 = 0.64

RICE-A adjustment:

  • Data readiness: 8 (fraud data is sparse, imbalanced, and regulated)
  • Model maturity: 6 (requires custom model, not off-the-shelf LLM)
  • Operational overhead: 9 (continuous retraining, compliance audits, human review loop)
  • AI Complexity: 7.55
  • Score: (3 x 3 x 0.5) / (7 x 7.55 x 0.5) = 0.17

Standard RICE ranked this at 0.64. RICE-A drops it to 0.17. For a payments team deciding between this and a traditional rule-based fraud system upgrade (which might score 0.8 in standard RICE), the gap is now obvious. The AI approach has a much higher complexity cost that standard RICE hides entirely.

Beyond RICE-A: Three Alternative Approaches

RICE-A works well for teams already comfortable with RICE scoring. But it is not the only option. Here are three other frameworks teams use to prioritize AI features.

Approach 1: The Confidence-First Method

Instead of treating confidence as one of four equal inputs, make it the primary filter. Score every AI feature candidate on three confidence dimensions before you score anything else:

  • Technical confidence: Can the model actually do this? Run a quick spike (2-3 days max) with real data, not demo data. What accuracy do you get out of the box?
  • Data confidence: Do you have the data pipeline to sustain this feature in production? Not just training data, but the data infrastructure for monitoring, retraining, and drift detection.
  • Measurement confidence: Can you define success? If you cannot articulate what "good" looks like before building, you will not be able to evaluate after shipping.

Any feature scoring below 5/10 on any dimension gets shelved until the gap is addressed. This simple gate eliminates the majority of AI features that would otherwise waste cycles. Only features passing all three gates proceed to standard prioritization.
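The gate itself is trivial to encode. A minimal sketch, assuming the 1-10 scale and the 5/10 threshold described above (names are my own):

```python
# Sketch of the Confidence-First gate; threshold and names are assumptions
# based on the 5/10 cutoff described in the text.

GATE_THRESHOLD = 5  # shelve anything below 5/10 on any dimension

def passes_confidence_gate(technical: int, data: int, measurement: int) -> bool:
    """All three confidence dimensions (1-10) must clear the threshold."""
    return min(technical, data, measurement) >= GATE_THRESHOLD

passes_confidence_gate(7, 8, 6)  # True: proceed to standard prioritization
passes_confidence_gate(9, 4, 8)  # False: shelve until the data gap is closed
```

The point of using `min` rather than an average is that one weak dimension sinks the feature: a brilliant model with no sustainable data pipeline still fails in production.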

This approach aligns with IBM's finding that only 25% of AI initiatives deliver expected ROI. The biggest predictor of that 25% is having clear success criteria before development starts.

Approach 2: Phased Scoring

Lenny Rachitsky's research on AI product development cycles suggests a phased approach. Instead of scoring the full AI feature, break it into three incremental versions and score each independently:

v1 (Assist): Human does the work, AI suggests. Low autonomy, high control.

Example: Grammarly started here with grammar suggestions. Score this version with standard RICE.

v2 (Augment): AI does the work, human reviews. Medium autonomy, medium control.

Example: Grammarly's rewrite suggestions. Score with light RICE-A (add AI Complexity at 0.3 multiplier).

v3 (Automate): AI does the work, human handles exceptions. High autonomy, low control.

Example: Grammarly's autonomous agents. Score with full RICE-A (0.5 multiplier).
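The three phase scores can be sketched as one function. The 0.3 and 0.5 multipliers come from the text; the function shape and phase names are illustrative assumptions:

```python
# Sketch of phased scoring: v1 uses standard RICE, v2 and v3 add an
# AI Complexity divisor with the multipliers from the text.

COMPLEXITY_MULTIPLIER = {"v1_assist": 0.0, "v2_augment": 0.3, "v3_automate": 0.5}

def phased_score(reach, impact, confidence, effort, complexity, phase):
    """v1 scores as plain RICE; v2/v3 divide by effort x complexity x multiplier."""
    multiplier = COMPLEXITY_MULTIPLIER[phase]
    divisor = effort if multiplier == 0 else effort * complexity * multiplier
    return (reach * impact * confidence) / divisor

# Same feature, scored at each phase: the score drops as autonomy rises,
# which is exactly the risk the phased approach makes visible
phased_score(8, 2, 0.7, 6, 4.3, "v1_assist")
phased_score(8, 2, 0.7, 6, 4.3, "v3_automate")
```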

This approach does three things. It turns one high-risk bet into three smaller bets. It lets you learn from v1 before committing to v2. And it creates natural checkpoints where you can redirect resources if the AI approach is not working.

Figma learned this lesson the hard way. Their initial "Make Designs" AI feature launched with too much autonomy (generating full designs from prompts) and was pulled after users found it copying existing work. When they relaunched the feature set at Config 2025 (as Figma Make, Figma Sites, and Vectorize), each tool had a clearer scope and better guardrails. Starting at v2 or v3 without the learning from v1 is a common and expensive mistake.

Approach 3: Cost-Adjusted Scoring

LLM API prices dropped roughly 80% between early 2025 and early 2026. The gap between a premium model (Claude Opus at $5/$25 per million tokens) and a budget model (DeepSeek V3 at $0.14/$0.28) is nearly 36x. This means the "Effort" dimension in RICE is not just about engineering time. It is about ongoing operational cost.

Cost-Adjusted Score = Standard RICE Score / Monthly API Cost Factor

Where the API Cost Factor is:

  • 1.0: Feature uses a budget model or local model (< $100/month at projected scale)
  • 1.5: Feature uses a mid-tier model ($100-$1,000/month)
  • 2.0: Feature uses a premium model ($1,000-$10,000/month)
  • 3.0: Feature uses a premium model at high volume (> $10,000/month)
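The tiers above map directly to a lookup. A minimal sketch, assuming the dollar breakpoints listed (names are illustrative):

```python
# Sketch of the cost tiers above; breakpoints are USD/month at projected scale.

def api_cost_factor(monthly_usd: float) -> float:
    """Map projected monthly API spend to a cost factor."""
    if monthly_usd < 100:
        return 1.0   # budget or local model
    if monthly_usd < 1_000:
        return 1.5   # mid-tier model
    if monthly_usd < 10_000:
        return 2.0   # premium model
    return 3.0       # premium model at high volume

def cost_adjusted(rice_score: float, monthly_usd: float) -> float:
    """Cost-Adjusted Score = Standard RICE Score / Monthly API Cost Factor."""
    return rice_score / api_cost_factor(monthly_usd)

cost_adjusted(2.0, 8_000)  # 1.0: the $8,000/month example discussed below
```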

Duolingo's GPT-4 integration maintained a 73% gross margin despite high API costs because the Reach and Impact scores were enormous. But for most products, the margin math is tighter. A feature that scores 2.0 in standard RICE but requires $8,000/month in API costs at scale (cost factor 2.0) drops to an effective score of 1.0. That might push it below a simpler feature that scores 1.5 in standard RICE but costs $50/month.

The cost-adjusted approach is especially useful early in planning when you are choosing between building a feature with a premium model versus a smaller, cheaper model that might be "good enough." For an estimated 70-80% of production workloads, mid-tier models perform comparably to premium ones. Run a quick benchmark on your own task before committing to the expensive option.

Use the AI ROI Calculator to model the cost scenarios before scoring. The breakeven timeline is often the clearest signal for whether an AI feature is worth the investment.

Building Your AI Prioritization Workflow

Here is a step-by-step process for integrating these frameworks into your existing planning workflow.

Step 1: Triage with the Confidence-First gate (30 minutes)

Before your next planning session, score every AI feature candidate on technical, data, and measurement confidence. Kill anything below 5/10 on any dimension. This typically eliminates 40-60% of candidates and saves days of detailed scoring work.

Step 2: Break survivors into phases (1 hour)

For each remaining candidate, define v1 (assist), v2 (augment), and v3 (automate) versions. Write a one-sentence description of each. Many teams discover their "AI feature" is actually a v1 that requires no AI at all, just better UX.

Step 3: Score v1 with standard RICE (30 minutes)

Score only the v1 version using your normal RICE process. If v1 does not score well enough to build, v2 and v3 are irrelevant.

  • Estimate reach based on the specific user segment with the specific use case (not your total user base)
  • Define impact as a measurable outcome, not a vague "high/medium/low"
  • Set confidence based on your spike results, not your intuition
  • Estimate effort including data pipeline work, not just feature code

Step 4: Score v2 and v3 with RICE-A (30 minutes each)

For features where v1 passes, score v2 and v3 using the full RICE-A formula. Pay special attention to the AI Complexity breakdown. This is where most teams find their priorities shift.

Step 5: Apply cost adjustment (15 minutes)

Run the cost-adjusted scoring on your top candidates. If you are comparing two features with similar RICE-A scores, the cost-adjusted score will often break the tie.

Step 6: Sequence by learning dependency

Order your final backlog so that features building on shared data infrastructure come first. If both Feature A and Feature B need the same embedding pipeline, build the pipeline for Feature A's v1 and reuse it for Feature B.
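The six steps above can be strung together into one pass over your candidate list. This is a hypothetical end-to-end sketch: the field names, thresholds, and dictionary shape are my own, not a standard API.

```python
# Illustrative pipeline: confidence gate -> RICE-A -> cost adjustment -> rank.
# All field names and thresholds are assumptions for the sketch.

def cost_factor(monthly_usd):
    """Map projected monthly API spend to the cost tiers from the previous section."""
    if monthly_usd < 100:
        return 1.0
    if monthly_usd < 1_000:
        return 1.5
    if monthly_usd < 10_000:
        return 2.0
    return 3.0

def prioritize(candidates):
    """Gate on confidence, score with RICE-A, cost-adjust, sort high to low."""
    ranked = []
    for c in candidates:
        # Step 1: shelve anything below 5/10 on any confidence dimension
        if min(c["tech_conf"], c["data_conf"], c["meas_conf"]) < 5:
            continue
        # Steps 3-4: RICE-A with the weighted AI Complexity score
        complexity = 0.40 * c["data"] + 0.35 * c["model"] + 0.25 * c["ops"]
        score = (c["reach"] * c["impact"] * c["confidence"]) / (c["effort"] * complexity * 0.5)
        # Step 5: divide by the API cost factor at projected scale
        ranked.append((c["name"], round(score / cost_factor(c["monthly_usd"]), 2)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Step 6 (sequencing by shared infrastructure) stays a human judgment call: the scores rank candidates, but only you know which features reuse the same pipelines.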

Common Mistakes to Avoid

Scoring the demo, not the product. An AI feature demo with curated inputs and cherry-picked outputs is not evidence of production readiness. The gap between demo and production is where most of the 80% failure rate hides. Score based on expected production performance, not demo performance.

Ignoring the eval tax. Every AI feature needs an evaluation framework. Building and maintaining evals is real, ongoing work that teams consistently undercount. Figma's AI Report found that 52% of AI builders say design is more important for AI products than traditional ones. The same is true for evaluation. Budget 20-30% of your effort estimate for eval infrastructure alone.

Treating model choice as an engineering decision. Which model to use is a product decision that directly affects reach, impact, and cost. A PM who delegates model selection entirely to engineering is delegating a third of the prioritization math.

Comparing AI features against each other in isolation. The real question is rarely "which AI feature should we build?" It is "should we build an AI feature or a traditional feature?" Use the same RICE-A framework to score traditional features (they simply get an AI Complexity of 1.0) so you are comparing apples to apples.


Frequently Asked Questions

How does RICE-A differ from standard RICE scoring?
RICE-A adds a fifth factor, AI Complexity, that captures data readiness, model maturity, and operational overhead. These three dimensions represent the hidden costs that standard RICE misses for AI features. The formula uses a 0.5 multiplier on AI Complexity so it does not dominate the score but does surface meaningful differences between AI candidates.
When should PMs use RICE-A versus standard RICE?
Use RICE-A any time the feature involves a machine learning model, an LLM API call, or a data pipeline that needs ongoing maintenance. If the feature is purely deterministic software (no model inference, no training data, no drift monitoring), standard RICE works fine. When in doubt, score with both and compare. If the scores are meaningfully different, the AI Complexity dimension is flagging real risk.
What tools help with AI feature prioritization?
The [RICE Calculator](/tools/rice-calculator) handles standard scoring and gives you a baseline. The [AI ROI Calculator](/tools/ai-roi-calculator) models cost scenarios and breakeven timelines. For comparing multiple features side by side, [weighted scoring](/tools/weighted-scoring) lets you add custom dimensions like data readiness and model maturity. For the PRD side of AI features, [Forge](/tools/forge) generates structured requirements with AI-specific sections.
What are the biggest mistakes PMs make when prioritizing AI features?
Four patterns stand out. First, scoring based on demo performance rather than expected production accuracy. Second, underestimating the ongoing operational cost (model hosting, monitoring, retraining) relative to the one-time build cost. Third, failing to break AI features into v1/v2/v3 phases and trying to ship the autonomous version first. Fourth, treating model selection as a purely technical decision rather than a product decision that directly affects cost, latency, and quality.
How do you estimate confidence for AI features with no precedent?
Run a spike. Spend 2-3 days building a minimal version with real (not synthetic) data. Measure the baseline accuracy. If you cannot get above 60% accuracy in a spike, your confidence score should be below 40% regardless of how promising the idea feels. If you can get to 75%+, your confidence is moderate (60-70%). Only rate confidence above 80% if your spike uses production-representative data and produces results your target users would accept.
