What This Template Is For
Running an A/B test is straightforward. Interpreting the results and making a defensible decision is where most teams struggle. Common mistakes include calling tests too early, ignoring segment-level effects, conflating statistical significance with practical significance, and failing to document the decision rationale for future reference.
This template structures the full analysis workflow: restating the hypothesis, validating the experiment setup, presenting results with confidence intervals, breaking down effects by segment, assessing practical significance, and arriving at a clear recommendation. It is designed for product managers, data analysts, and growth teams who need to turn experiment data into shipping decisions. The Product Analytics Handbook covers experimentation methodology in depth. For calculating whether a metric lift is worth the engineering investment, the RICE Calculator helps prioritize based on reach, impact, confidence, and effort. Guardrail metrics such as error rate often determine whether a winning test is actually safe to ship.
How to Use This Template
- Complete the Experiment Summary before looking at results. Restate your hypothesis and success criteria to prevent post-hoc rationalization.
- Validate the experiment setup by checking sample sizes, assignment balance, and data quality. A result from a broken experiment is not a result.
- Present primary and guardrail metrics together. A test that lifts conversion but tanks retention is not a win.
- Break results down by segment. An average treatment effect can mask segments where the change hurts users. Check by device, plan tier, geography, and new vs. returning users.
- Separate statistical significance from practical significance. A statistically significant 0.1% lift on a low-traffic feature may not justify the complexity of shipping it.
- End with a single recommendation and the reasoning behind it. Future teams should be able to read this document and understand why you shipped, iterated, or killed the experiment.
The Template
Experiment Summary
- ☐ Experiment name and ID
- ☐ Hypothesis stated before reviewing results
- ☐ Primary metric and minimum detectable effect (MDE)
- ☐ Guardrail metrics (metrics that must not degrade)
- ☐ Test start date and end date
- ☐ Traffic allocation (percentage per variant)
- ☐ Target sample size and actual sample size
- ☐ Experiment owner
Setup Validation
- ☐ Sample ratio mismatch check passed (chi-squared test, p > 0.01)
- ☐ No data collection issues during test period
- ☐ No confounding events during test period (launches, outages, holidays)
- ☐ Assignment randomization verified
- ☐ Pre-experiment metric balance confirmed (A/A check)
- ☐ Minimum test duration met (at least one full business cycle)
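The sample ratio mismatch check from the list above can be sketched as a one-degree-of-freedom chi-squared goodness-of-fit test of the observed assignment counts against the planned 50/50 split. The function name and the counts below are illustrative, not from any particular experiment platform:

```python
import math

def srm_check(n_control: int, n_treatment: int, alpha: float = 0.01) -> tuple[float, bool]:
    """Sample ratio mismatch check against a planned 50/50 split.

    Returns (p_value, passed). Fails when p <= alpha, i.e. the observed
    split is too unlikely under the intended allocation.
    """
    total = n_control + n_treatment
    expected = total / 2
    chi2 = ((n_control - expected) ** 2 / expected
            + (n_treatment - expected) ** 2 / expected)
    # For df = 1, the chi-squared survival function reduces to erfc(sqrt(chi2 / 2)),
    # so no stats library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_value, p_value > alpha

p, ok = srm_check(24_900, 25_100)  # hypothetical counts
print(f"SRM p-value: {p:.3f}, pass: {ok}")
```

Note the deliberately strict alpha of 0.01: an SRM failure means the randomization itself is suspect, so you want to flag it only when the evidence is strong, but treat any flag as disqualifying.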
Primary Results
- ☐ Control and treatment values for primary metric
- ☐ Absolute and relative lift
- ☐ P-value and confidence interval (95%)
- ☐ Statistical significance determination
- ☐ Practical significance assessment (is the lift worth shipping?)
- ☐ Power analysis confirmation (was the test adequately powered?)
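The lift, p-value, and confidence interval in the checklist above come from a standard two-proportion z-test. A minimal sketch, using hypothetical conversion counts (the pooled standard error is used for the test statistic, the unpooled one for the interval, per the usual textbook treatment):

```python
from statistics import NormalDist

def two_proportion_test(x_c: int, n_c: int, x_t: int, n_t: int, conf: float = 0.95):
    """Two-sided z-test for a difference in proportions, plus a CI on the lift."""
    p_c, p_t = x_c / n_c, x_t / n_t
    lift = p_t - p_c
    # Pooled SE under H0 (p_c == p_t): used for the p-value.
    p_pool = (x_c + x_t) / (n_c + n_t)
    se_pooled = (p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t)) ** 0.5
    z = lift / se_pooled
    p_value = 2 * NormalDist().cdf(-abs(z))
    # Unpooled SE: used for the confidence interval on the absolute lift.
    se = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    ci = (lift - z_crit * se, lift + z_crit * se)
    return lift, p_value, ci

# Hypothetical: 34.2% vs 36.8% conversion on 10,000 users per arm.
lift, p, (lo, hi) = two_proportion_test(3_420, 10_000, 3_680, 10_000)
print(f"lift: {lift:+.1%}, p = {p:.2e}, 95% CI: [{lo:+.1%}, {hi:+.1%}]")
```

Report the confidence interval alongside the p-value: a significant result whose interval barely clears zero tells a very different story from one whose entire interval sits above your MDE.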
Guardrail Metrics
- ☐ Each guardrail metric with control vs. treatment values
- ☐ Any guardrail degradations flagged
- ☐ Severity assessment for any guardrail violations
- ☐ Acceptable tradeoff rationale if guardrails moved
Segment Analysis
- ☐ Results by device type (mobile, desktop, tablet)
- ☐ Results by user segment (new vs. returning, plan tier, geography)
- ☐ Interaction effects identified (segments where treatment effect differs)
- ☐ Sample size per segment noted (flag underpowered segments)
- ☐ Novelty or primacy effects assessed
Decision and Recommendation
- ☐ Clear recommendation: Ship / Iterate / Kill
- ☐ Reasoning documented in 2-3 sentences
- ☐ Conditions for recommendation (e.g., "Ship if engineering cost < 2 weeks")
- ☐ Follow-up experiments or monitoring identified
- ☐ Learnings documented for future experiments
Filled Example
Experiment Summary
Experiment: checkout-progress-bar-v2
Hypothesis: Adding a progress bar to the 3-step checkout flow will increase checkout completion rate by at least 3% by reducing abandonment due to uncertainty about remaining steps.
Primary Metric: Checkout completion rate (orders / checkout starts)
Guardrail Metrics: Average order value (AOV), page load time, error rate
Duration: Feb 3-17, 2026 (14 days, 2 full business cycles)
Traffic: 50/50 split
Target Sample: 48,000 checkout starts (24,000 per variant)
Actual Sample: 51,200 checkout starts (25,400 control, 25,800 treatment)
Setup Validation
Sample ratio: 25,400 / 25,800 = 0.984 (chi-squared p ≈ 0.08, pass against the 0.01 threshold).
No deployments, outages, or holidays during test window.
Pre-experiment checkout rate (Jan 27 - Feb 2): Control 34.1%, Treatment 34.3% (no significant difference).
Primary Results
| Metric | Control | Treatment | Lift | 95% CI | P-value |
|---|---|---|---|---|---|
| Checkout completion | 34.2% | 36.8% | +7.6% relative (+2.6pp) | [1.8pp, 3.4pp] | < 0.001 |
Statistically significant: Yes (p < 0.05).
Practically significant: Yes. 2.6 percentage point lift on 180K monthly checkouts = ~4,680 additional orders per month at $67 AOV = ~$313K incremental monthly revenue.
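The revenue arithmetic above is worth making explicit so it can be sanity-checked and reused with updated inputs:

```python
# Practical significance: incremental monthly revenue
#   = absolute lift x monthly checkout starts x average order value
lift_pp = 0.026            # +2.6 percentage points
monthly_checkouts = 180_000
aov = 67.0                 # average order value, $

extra_orders = lift_pp * monthly_checkouts
extra_revenue = extra_orders * aov
print(f"{extra_orders:,.0f} extra orders -> ${extra_revenue:,.0f}/month")
# prints "4,680 extra orders -> $313,560/month"
```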
Guardrail Metrics
| Metric | Control | Treatment | Change | Status |
|---|---|---|---|---|
| AOV | $67.20 | $66.80 | -0.6% | Pass (within noise) |
| Page load time (p50) | 1.2s | 1.25s | +4.2% | Pass (< 10% threshold) |
| Error rate | 0.8% | 0.7% | -12.5% | Pass (improved) |
Segment Analysis
| Segment | Control | Treatment | Lift | Significant? |
|---|---|---|---|---|
| Mobile | 28.1% | 31.9% | +13.5% | Yes (p = 0.001) |
| Desktop | 42.3% | 43.1% | +1.9% | No (p = 0.18) |
| New users | 22.4% | 26.8% | +19.6% | Yes (p < 0.001) |
| Returning users | 41.7% | 42.5% | +1.9% | No (p = 0.22) |
The progress bar disproportionately helps mobile users and new users. These segments have the most uncertainty about checkout flow length.
Decision
Recommendation: Ship.
The progress bar lifts checkout completion by 2.6 percentage points overall, driven by mobile and new users. No guardrail degradations. Engineering cost to productionize is < 1 week. Our conversion prediction model may need recalibration after launch, since checkout behavior will shift.
Follow-up: Run a separate test on progress bar design variants (numbered steps vs. percentage bar) targeting mobile users specifically.
Key Takeaways
- Restate your hypothesis and success criteria before looking at results to prevent post-hoc rationalization
- Validate experiment setup (sample balance, data quality, duration) before interpreting results
- Always assess both statistical and practical significance when making ship decisions
- Segment analysis reveals where average treatment effects hide important differences
- Document the decision rationale so future teams can learn from every experiment
About This Template
Created by: Tim Adair
Last Updated: 3/4/2026
Version: 1.0.0
License: Free for personal and commercial use
