What This Template Is For
Running A/B tests without a written plan is how teams ship inconclusive experiments and waste engineering cycles. The test runs for "a few weeks," someone checks the dashboard, the results are ambiguous, and the team debates what happened for another week before moving on. The problem is rarely the testing tool. The problem is that nobody agreed upfront on what success looks like, how long the test needs to run, or what they would do with the results.
This template forces clarity before a single line of test code is written. It captures the hypothesis, the primary and secondary metrics, the required sample size, the expected duration, a description of each variant, success criteria, and a structured results table. Every field exists to prevent a specific failure mode: unclear hypotheses that cannot be falsified, tests that run too short, metrics that conflict, and results that sit in a dashboard without driving a decision.
If you are building a culture of experimentation on your team, the Product Analytics Handbook covers how to design an experimentation program from the ground up. For calculating the sample size and statistical significance of a specific test, use the A/B Test Calculator. And if you are new to the concept, the A/B testing glossary entry covers the fundamentals.
When to Use This Template
- Before any A/B test. Every test should have a written plan reviewed by the PM, engineer, and data analyst before implementation begins. This is not bureaucracy. It is the difference between a conclusive experiment and an expensive guess.
- When the team disagrees on a design direction. Instead of debating which CTA color or pricing layout is better, test it. The plan template ensures you test the right thing for long enough with the right metric.
- Before changing a high-traffic flow. Checkout, onboarding, signup, and pricing pages carry significant revenue risk. A structured test plan with guardrail metrics prevents you from shipping a change that improves one metric while silently degrading another.
- When you need stakeholder buy-in for an experiment. A filled-out test plan makes it easy for a VP or CFO to say yes. It shows you have thought through the risks, the duration, and the decision criteria.
- During post-experiment reviews. Use the completed plan (with results filled in) as the single artifact for the experiment retrospective. It answers "what did we test, what did we learn, and what did we decide?" in one document.
How to Use This Template
- Start with the hypothesis. Write it in the format: "If we [change], then [metric] will [increase/decrease] by [amount] because [reason]." A good hypothesis is specific and falsifiable. "If we change the CTA, conversions will go up" is too vague. "If we change the CTA from 'Start Free Trial' to 'See It In Action', trial starts will increase by 10% because the new copy reduces perceived commitment" is testable.
- Define success criteria before you see any data. Write down the minimum detectable effect (MDE), the confidence level (typically 95%), and the decision rule ("If the variant beats control by X%, we ship it. If it loses, we revert. If it is inconclusive, we [extend / kill / iterate]."). Deciding this after seeing partial results is confirmation bias dressed up as analysis.
- Calculate the required sample size and duration. Use the A/B Test Calculator or a power analysis tool. Input your baseline conversion rate, your MDE, and your daily traffic to the test surface. Do not guess: underpowered tests produce false negatives and waste everyone's time. A worked power calculation appears after this list.
- Document both variants clearly. Describe the control and the variant in enough detail that an engineer can implement them without ambiguity and a designer can review them without confusion. Include screenshots or mockups if possible.
- Run the test for the full planned duration. Do not peek at results daily and call the test early when the dashboard looks good. Early stopping inflates false positive rates; the simulation after the sample-size sketch below shows by how much. If you calculated a 14-day test, run it for 14 days.
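If you would rather script the calculation than use a web calculator, the sketch below shows the standard two-proportion power analysis. This is a minimal sketch assuming statsmodels is installed; every input number is an illustrative placeholder, not a figure from this template.

```python
# Minimal power-analysis sketch. Assumes statsmodels; all inputs are
# illustrative placeholders -- replace them with your own numbers.
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05        # current conversion rate on the test surface
mde_relative = 0.10    # minimum detectable effect, as a relative lift
target = baseline * (1 + mde_relative)

# Cohen's h for two proportions, then solve for visitors per variant
effect_size = proportion_effectsize(target, baseline)
n_per_variant = math.ceil(NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # 95% confidence, two-sided
    power=0.80,              # 80% statistical power
    ratio=1.0,               # 50/50 traffic allocation
    alternative="two-sided",
))

daily_traffic = 4_500      # visitors/day reaching the test surface
total_needed = 2 * n_per_variant
duration_days = math.ceil(total_needed / daily_traffic)

print(f"Per variant: {n_per_variant:,} visitors")
print(f"Total needed: {total_needed:,} visitors")
print(f"Estimated duration: ~{duration_days} days at {daily_traffic:,}/day")
```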
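The no-peeking rule is not superstition. The Monte Carlo sketch below (assuming NumPy; all numbers are illustrative) runs A/A tests where both arms share the same true rate, then counts how often a naive z-test declares significance on at least one daily peek versus at a single planned end-of-test look.

```python
# Monte Carlo sketch of why peeking inflates false positives: simulate
# A/A tests (no real difference) and compare the error rate of checking
# once at the end vs. every "day". Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(42)
p = 0.05                 # same true rate in both arms (A/A test)
daily_n = 1_000          # visitors per arm per day
days = 14
sims = 2_000
z_crit = 1.96            # two-sided 95% threshold

peek_hits = 0
final_hits = 0
for _ in range(sims):
    # Cumulative conversions per arm at the end of each day
    a = rng.binomial(1, p, size=(days, daily_n)).sum(axis=1).cumsum()
    b = rng.binomial(1, p, size=(days, daily_n)).sum(axis=1).cumsum()
    n = daily_n * np.arange(1, days + 1)   # cumulative visitors per arm
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = np.abs(a / n - b / n) / se
    if (z > z_crit).any():
        peek_hits += 1    # "significant" on at least one daily peek
    if z[-1] > z_crit:
        final_hits += 1   # significant only at the planned end

print(f"False positive rate, daily peeking: {peek_hits / sims:.1%}")
print(f"False positive rate, single look:   {final_hits / sims:.1%}")
```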
The Template
Test Overview
| Field | Details |
|---|---|
| Test Name | [Short, descriptive name, e.g. "Pricing Page CTA Copy Test"] |
| Test ID | [Internal tracking ID, e.g. EXP-042] |
| Owner | [PM name] |
| Engineer | [Engineer implementing the test] |
| Analyst | [Person responsible for results analysis] |
| Start Date | [YYYY-MM-DD] |
| Planned End Date | [YYYY-MM-DD] |
| Status | [Draft / In Review / Running / Complete / Killed] |
Hypothesis
Format. If we [specific change], then [primary metric] will [direction] by [magnitude] because [rationale based on user behavior or prior data].
[Write your hypothesis here]
Confidence level. [Low / Medium / High] based on [prior data, qualitative signal, or intuition]
Metrics
| Metric Type | Metric Name | Current Baseline | Target | Measurement Method |
|---|---|---|---|---|
| Primary | [The single metric this test is designed to move] | [Current value, e.g. 3.2%] | [Target value, e.g. 3.5%] | [Tool and event, e.g. "Mixpanel: button_clicked where page=pricing"] |
| Secondary | [Supporting metric that should also improve or stay flat] | [Value] | [Direction] | [Measurement method] |
| Secondary | [Another supporting metric] | [Value] | [Direction] | [Measurement method] |
| Guardrail | [Metric that must NOT degrade, e.g. bounce rate, support tickets, revenue per user] | [Value] | [Must stay within X%] | [Measurement method] |
| Guardrail | [Another guardrail] | [Value] | [Must stay within X%] | [Measurement method] |
Sample Size and Duration
| Field | Details |
|---|---|
| Baseline conversion rate | [e.g. 3.2%] |
| Minimum detectable effect (MDE) | [e.g. 10% relative lift, taking the baseline from 3.2% to 3.52%] |
| Statistical significance threshold | [e.g. 95% confidence / p < 0.05] |
| Statistical power | [e.g. 80%] |
| Required sample size per variant | [e.g. 12,400 visitors] |
| Total sample needed | [e.g. 24,800 visitors] |
| Daily traffic to test surface | [e.g. 1,800 visitors/day] |
| Estimated test duration | [e.g. 14 days] |
| Traffic allocation | [e.g. 50/50] |
Variants
Control (A). [Describe the current experience. What does the user see and interact with today?]
Variant (B). [Describe the change. Be specific about what is different: copy, layout, color, flow, or logic. Include a screenshot or mockup link if available.]
Variant (C), if applicable. [Describe the additional variant. Note: each additional variant splits traffic further and increases the total required sample size, which lengthens the test.]
Audience and Segmentation
| Field | Details |
|---|---|
| Target audience | [e.g. All visitors to /pricing, or Signed-in users on Free plan] |
| Exclusions | [e.g. Internal users, existing Enterprise customers, users on mobile] |
| Segment for sub-analysis | [e.g. New vs. returning visitors, US vs. international] |
Success Criteria and Decision Rules
Write these before the test starts. Do not change them while the test is running.
| Outcome | Criteria | Action |
|---|---|---|
| Clear win | Variant beats control on primary metric by >= MDE with 95% confidence AND guardrail metrics are flat or improved | Ship the variant to 100% |
| Clear loss | Control beats variant on primary metric with 95% confidence OR guardrail metric degrades by > [X%] | Revert to control. Document learnings. |
| Inconclusive | No statistically significant difference on the primary metric after the full planned duration | [Kill the test and move on / Extend by [N] days / Redesign with a bolder change] |
| Guardrail breach | Any guardrail metric degrades by more than [X%] at any point during the test | Stop the test immediately. Investigate root cause before deciding next steps. |
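One way to keep these rules binding is to write them down as code before launch. The sketch below is hypothetical: the constant names and thresholds are placeholders, and guardrail deltas are assumed to be signed relative changes versus control (negative means degraded).

```python
# Hypothetical encoding of the decision table above. All thresholds are
# placeholders; "lift" and "guardrail_deltas" are relative changes vs.
# control, and "confidence" is 1 - p_value from the primary-metric test.
MDE = 0.10                   # minimum detectable effect, relative
CONFIDENCE_REQUIRED = 0.95
GUARDRAIL_TOLERANCE = -0.05  # guardrails may not degrade more than 5%

def decide(lift: float, confidence: float,
           guardrail_deltas: list[float]) -> str:
    """Map observed results to one of the pre-registered outcomes."""
    if any(d < GUARDRAIL_TOLERANCE for d in guardrail_deltas):
        return "Guardrail breach: stop the test and investigate"
    if confidence >= CONFIDENCE_REQUIRED and lift >= MDE:
        return "Clear win: ship the variant to 100%"
    if confidence >= CONFIDENCE_REQUIRED and lift < 0:
        return "Clear loss: revert to control and document learnings"
    return "Inconclusive: apply the pre-agreed kill/extend/iterate rule"

# e.g. decide(0.12, 0.972, [-0.02, 0.01]) -> "Clear win: ship ..."
```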
Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| [e.g. Variant causes confusion, increasing support tickets] | [Low/Med/High] | [Low/Med/High] | [e.g. Monitor support queue daily. Kill switch if ticket volume spikes >20%] |
| [e.g. Test takes longer than expected due to lower traffic] | [Low/Med/High] | [Low/Med/High] | [e.g. Extend the test window per the pre-agreed decision rule] |
| [e.g. Novelty effect inflates early results] | [Low/Med/High] | [Low/Med/High] | [e.g. Run the full planned duration; compare early vs. late results] |
Results Table
Complete this section after the test ends. Do not fill it in while the test is running.
| Metric | Control (A) | Variant (B) | Relative Difference | Confidence | Significant? |
|---|---|---|---|---|---|
| [Primary metric] | [Value] | [Value] | [+/- X%] | [X%] | [Yes / No] |
| [Secondary metric 1] | [Value] | [Value] | [+/- X%] | [X%] | [Yes / No] |
| [Secondary metric 2] | [Value] | [Value] | [+/- X%] | [X%] | [Yes / No] |
| [Guardrail metric 1] | [Value] | [Value] | [+/- X%] | [X%] | [Flat / Degraded] |
| [Guardrail metric 2] | [Value] | [Value] | [+/- X%] | [X%] | [Flat / Degraded] |
Decision. [Ship variant / Revert to control / Iterate and retest]
Key learning. [One paragraph summarizing what you learned, even if the test was inconclusive]
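To fill in the Relative Difference, Confidence, and Significant? columns, a standard two-proportion z-test on the raw counts is sufficient. A sketch assuming SciPy; the counts in the final line are made up for illustration.

```python
# Two-proportion z-test sketch for the results table. Assumes SciPy;
# the example counts at the bottom are made up.
from math import sqrt
from scipy.stats import norm

def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> dict:
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))       # two-sided p-value
    return {
        "relative_diff": (p_b - p_a) / p_a,    # e.g. +0.15 = +15%
        "confidence": 1 - p_value,             # the table's Confidence column
        "significant": p_value < 0.05,         # at the 95% threshold
    }

# Made-up counts: 640 of 20,000 control conversions vs. 736 of 20,000
print(two_proportion_test(640, 20_000, 736, 20_000))
```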
Filled Example: SaaS Pricing Page CTA Test
Test Overview (Example)
| Field | Details |
|---|---|
| Test Name | Pricing Page CTA Copy Test |
| Test ID | EXP-042 |
| Owner | Jordan (PM, Growth) |
| Engineer | Sam (Frontend) |
| Analyst | Priya (Data) |
| Start Date | 2026-02-17 |
| Planned End Date | 2026-03-03 |
| Status | Complete |
Hypothesis (Example)
If we change the primary CTA on the pricing page from "Start Free Trial" to "See PlanForge in Action" with a 14-day trial badge, then trial signups will increase by 12% because user research showed that prospects feel "Start Free Trial" implies they need to commit before they understand the product. The new copy emphasizes exploration over commitment.
Confidence level. Medium, based on 6 customer interviews where 4 of 6 prospects mentioned hesitation around the word "trial."
Metrics (Example)
| Metric Type | Metric Name | Current Baseline | Target | Measurement Method |
|---|---|---|---|---|
| Primary | Trial signup rate (pricing page visitors who start a trial) | 4.1% | 4.6% (+12% relative) | Mixpanel: trial_started where source=pricing |
| Secondary | Pricing page engagement time | 48 seconds | Increase | Mixpanel: avg session duration on /pricing |
| Secondary | Pricing FAQ click rate | 11% | Stay flat or increase | Mixpanel: faq_clicked where page=pricing |
| Guardrail | Trial-to-paid conversion rate (7-day) | 18% | Must stay above 16% | Stripe events: subscription_created / trial_started |
| Guardrail | Support tickets from pricing page | 12/week | Must stay below 18/week | Intercom: tickets tagged "pricing" |
Sample Size and Duration (Example)
| Field | Details |
|---|---|
| Baseline conversion rate | 4.1% |
| Minimum detectable effect | 12% relative lift (from 4.1% to 4.59%) |
| Statistical significance threshold | 95% confidence |
| Statistical power | 80% |
| Required sample size per variant | 11,200 visitors |
| Total sample needed | 22,400 visitors |
| Daily traffic to pricing page | 1,650 visitors/day |
| Estimated test duration | 14 days |
| Traffic allocation | 50/50 |
Variants (Example)
Control (A). Current pricing page. Green CTA button reads "Start Free Trial" in white text. No badge. Button links to the signup form with email and password fields.
Variant (B). Same pricing page layout. CTA button reads "See PlanForge in Action" in white text. A small badge below the button reads "14-day free trial. No credit card." Button links to the same signup form.
Results (Example)
| Metric | Control (A) | Variant (B) | Relative Difference | Confidence | Significant? |
|---|---|---|---|---|---|
| Trial signup rate | 4.1% | 4.7% | +14.6% | 97.2% | Yes |
| Pricing page engagement time | 47s | 52s | +10.6% | 89% | No |
| Pricing FAQ click rate | 11.3% | 10.8% | -4.4% | 62% | No |
| Trial-to-paid conversion (7-day) | 18.1% | 17.4% | -3.9% | 71% | Flat (within guardrail) |
| Support tickets from pricing | 11/week | 13/week | +18% | N/A | Flat (within guardrail) |
Decision. Ship Variant B. Primary metric exceeded MDE with 97.2% confidence. Guardrail metrics stayed within acceptable range. Trial-to-paid conversion dipped slightly but remained above the 16% floor.
Key learning. Reducing perceived commitment in CTA copy increased trial starts by ~15%. The "14-day free trial. No credit card." badge likely contributed. Engagement time and FAQ clicks were not significantly different, suggesting the change did not alter how users evaluated pricing. It primarily reduced friction at the moment of the click. Next test: apply the same "action-first" copy pattern to the homepage hero CTA.
Key Takeaways
- Write the hypothesis, success criteria, and decision rules before the test starts. Changing them mid-test is not iteration. It is p-hacking.
- Every test needs exactly one primary metric. Secondary metrics add context. Guardrail metrics prevent collateral damage. Conflating these roles leads to ambiguous results.
- Calculate the required sample size and commit to the full test duration. Early stopping based on promising results inflates false positive rates. The A/B Test Calculator handles the math.
- Document variants in enough detail that someone who was not in the planning meeting could understand exactly what changed and why.
- Fill in the results table and key learning even when the test is inconclusive. Negative and null results are data. They prevent the team from rerunning the same failed experiment six months later.
- For a structured approach to building an experimentation practice, see the product experimentation guide.
About This Template
Created by: Tim Adair
Last Updated: 2026-02-19
Version: 1.0.0
License: Free for personal and commercial use
