Quick Answer (TL;DR)
A/B testing is a controlled experiment where you split users into two groups, show each group a different version of your product, and measure which version performs better on a specific metric. It is the most reliable way to measure the causal impact of a product change. Without it, you are guessing whether your changes actually improved anything or whether the numbers moved for unrelated reasons.
What Is A/B Testing?
A/B testing (also called split testing) is a method of comparing two versions of a product experience to determine which one produces a better outcome. You randomly assign users to either the control group (the current experience) or the variant group (the modified experience), then measure the difference in a pre-defined metric.
The concept is borrowed from clinical trials in medicine. Just as a drug trial compares a treatment group to a placebo group, an A/B test compares a product change to the existing baseline. The randomization is what makes it powerful. Because users are assigned randomly, any difference in outcomes between the two groups can be attributed to the change itself rather than to differences in the users.
Here is a concrete example. Your SaaS product has a free trial signup page converting at 12%. You hypothesize that simplifying the form from six fields to three will increase signups. You build the shorter form (variant B), randomly show it to 50% of visitors while the other 50% see the original (control A), and measure the conversion rate for each group over two weeks. If variant B converts at 14.5% and the difference is statistically significant, you have evidence that the shorter form causes higher conversions. You ship it.
That "statistically significant" qualifier matters. Small differences between groups can appear by random chance. Statistical significance tells you how unlikely the observed difference would be if the change actually had no effect. Most teams use a 95% confidence threshold, meaning they accept at most a 5% risk of declaring a difference where none exists.
For a deeper overview of testing terminology and related concepts, see the A/B testing glossary entry.
Why A/B Testing Matters for Product Teams
Product teams make hundreds of decisions per quarter about what to build, how to design it, and where to place it. Most of those decisions are based on intuition, stakeholder opinions, or best practices borrowed from other companies. A/B testing replaces guesswork with measurement.
It measures causal impact, not just correlation. Product analytics can show you that users who complete onboarding retain better. But did onboarding cause the retention, or were those users already more motivated? An A/B test isolates the variable and tells you whether the change itself caused the outcome. This is the foundation of being data-informed rather than data-driven.
It reduces the risk of shipping bad changes. Not every "improvement" actually improves things. Redesigns can confuse existing users. New features can distract from core workflows. A/B testing catches regressions before they reach 100% of users. If variant B performs worse than control A, you simply stop the test. No harm done.
It settles debates with evidence. Product teams waste hours arguing over button colors, copy variations, and feature placements. An A/B test moves the conversation from "I think" to "we measured." The PM who can say "we tested both approaches and version B increased activation rate by 8%" wins the argument cleanly.
It compounds over time. Each experiment teaches you something about your users. A team that runs 50 tests per year accumulates a body of knowledge about what works for their specific product and audience. That knowledge informs better hypotheses for future tests, creating a flywheel of learning.
How A/B Tests Work
The anatomy of a test
Every valid A/B test has six components:
- Hypothesis. A specific, falsifiable prediction. "Reducing the signup form from six fields to three will increase free trial conversions by at least 5% because our analytics show a 30% drop-off between fields 4 and 6."
- Control. The current experience. This is your baseline. It does not change during the test.
- Variant. The modified experience. Change one variable at a time so you can attribute any difference to that specific change.
- Primary metric. The single number you are trying to move. Pick one. If you track five metrics, one of them will show a "significant" result by chance alone (see the multiple comparisons problem below).
- Sample size. How many users you need in each group to detect a meaningful difference. Calculate this before launching. It depends on your baseline conversion rate, the minimum effect size you care about, and your chosen significance level.
- Duration. How long the test will run. This is driven by your sample size calculation and your traffic volume. Include at least one full business cycle (typically one to two weeks) to account for weekday/weekend variation.
Statistical significance explained
Statistical significance answers one question: "How confident are we that this result is not due to random chance?"
The p-value is the probability that you would see a result at least as extreme as what you observed, assuming there is actually no difference between control and variant. A p-value of 0.03 means that, if there were truly no difference, you would see a gap at least this large only 3% of the time. Most teams set their significance threshold at 0.05 (95% confidence), meaning they accept a 5% risk of a false positive.
Confidence intervals are the practical complement to p-values. Instead of just telling you "the result is significant," a confidence interval tells you "we are 95% confident the true effect is between +2% and +7%." If the confidence interval includes zero, the result is not significant. If it does not include zero, you have evidence of a real effect. The width of the interval tells you how precisely you have estimated the effect.
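To make these numbers concrete, here is a minimal two-proportion z-test in Python, using the signup example from earlier. The 2,000-users-per-group sample sizes are hypothetical (the example gives only the rates), and this is a sketch of the standard textbook test, not any particular tool's implementation:

```python
from math import sqrt, erf

def two_proportion_test(conv_a, n_a, conv_b, n_b, z=1.96):
    """Two-sided z-test and 95% CI for the difference of two proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error, used for the hypothesis test
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z_stat = diff / se_pooled
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z_stat) / sqrt(2))))
    # Unpooled standard error, used for the confidence interval
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_value, (diff - z * se, diff + z * se)

# Hypothetical sample sizes; the article's example gives only the rates
# (12% control vs. 14.5% variant).
p, ci = two_proportion_test(conv_a=240, n_a=2000, conv_b=290, n_b=2000)
print(f"p-value: {p:.4f}, 95% CI: [{ci[0]:+.3f}, {ci[1]:+.3f}]")
```

With these inputs the p-value comes in under 0.05 and the interval excludes zero, which is exactly the "significant and the CI does not include zero" check described above.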
You do not need a statistics degree to run A/B tests. But you do need to understand these three things: set your sample size before you start, do not peek at results early, and do not declare victory on a p-value alone without checking the confidence interval.
Types of tests
A/B test. Two variants: control (A) and one modification (B). The simplest and most common type. Use this by default.
A/B/n test. Multiple variants tested against a single control. Useful when you have three or four candidate designs. Requires more traffic because each variant needs its own sample. Be cautious: more variants mean a higher risk of false positives if you do not adjust your significance threshold.
Multivariate test (MVT). Tests multiple variables simultaneously and measures all combinations. For example, three headlines crossed with two button colors yields six variants. MVT reveals interaction effects (does headline A work better with a green button than a blue one?). It requires significantly more traffic. Most teams do not have enough.
Holdout group. A small percentage of users (typically 5-10%) who never receive any changes. This lets you measure the cumulative impact of all experiments over a quarter or year. Without a holdout, you cannot tell whether your experimentation program is producing net positive results.
How to Run an A/B Test: Step by Step
Step 1: Start with a hypothesis
A test without a hypothesis is a fishing expedition. "Let's see what happens if we change the button color" teaches you nothing because you have no prediction to validate or invalidate.
A good hypothesis has three parts: what you are changing, what metric you expect to move, and why you believe it will move. "We believe that adding social proof to the pricing page (change) will increase plan upgrades by 10% (metric) because support tickets show users want reassurance from other customers before committing (why)."
Write the hypothesis down before you build anything. This prevents post-hoc rationalization, where you look at the results and invent a story for why they turned out the way they did.
Step 2: Choose your primary metric
Pick one metric. One. This is the number that determines whether the test passes or fails.
Adding secondary metrics is fine for learning (you might track time on page, scroll depth, or support ticket volume alongside your primary conversion metric). But the decision to ship or not ship should be based on the primary metric alone.
Why only one? If you track five metrics and set a 95% confidence threshold, there is a ~23% chance that at least one metric will show a "significant" result purely by chance. This is the multiple comparisons problem. One metric, one decision.
For most product experiments, the primary metric should be tied to user behavior that matters for retention or revenue. Activation rate, conversion rate, feature adoption, or retention rate are all strong choices. Avoid vanity metrics like page views or time on page unless they directly connect to a business outcome.
Step 3: Calculate sample size
Before you launch, calculate how many users you need per variant. You need three inputs:
- Baseline conversion rate. Your current metric value (e.g., 12% signup rate).
- Minimum detectable effect (MDE). The smallest improvement worth detecting. If you only care about changes of 2 percentage points or more, set MDE to 2%.
- Significance level and power. Typically 95% confidence (alpha = 0.05) and 80% power (beta = 0.20).
Most A/B testing tools include a built-in calculator. If yours does not, free online calculators work fine. Plug in the numbers, get the required sample size, and estimate how long it will take to reach that number given your daily traffic.
If the required duration is longer than six weeks, reconsider the test. Either increase your MDE (accept that you can only detect larger effects), narrow the audience (test on a specific segment with higher traffic), or choose a different approach like qualitative testing.
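If you want to sanity-check what a calculator gives you, the standard closed-form approximation for a two-proportion test is short enough to write yourself. This is a sketch using the usual normal-approximation formula; real tools may use slightly different variants and give somewhat different numbers:

```python
from math import sqrt, ceil

def sample_size_per_variant(baseline, mde, alpha_z=1.96, power_z=0.84):
    """Required users per variant for a two-proportion test.

    baseline: current conversion rate (e.g. 0.12 for 12%)
    mde:      minimum detectable effect in absolute terms (e.g. 0.02)
    alpha_z:  z-score for two-sided 95% confidence (1.96)
    power_z:  z-score for 80% power (0.84)
    """
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    numerator = (alpha_z * sqrt(2 * p_bar * (1 - p_bar))
                 + power_z * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / mde ** 2)

# The article's example: 12% baseline, 2-point MDE, 95% confidence, 80% power.
print(sample_size_per_variant(baseline=0.12, mde=0.02), "users per variant")
```

For the 12% baseline and a 2-point MDE, this lands in the mid-four-thousands per variant. Note how the sample size scales with 1/MDE²: halving the effect you want to detect roughly quadruples the traffic you need, which is why low-traffic surfaces are poor test candidates.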
Step 4: Build and QA the variants
Build the variant experience. QA it across devices, browsers, and user states (logged in, logged out, new user, returning user). A broken variant does not just invalidate the test. It damages the user experience for everyone assigned to it.
Ensure the randomization is working correctly. Each user should see only one variant for the duration of the test (this is called "sticky bucketing"). A user who sees variant B on Monday and control A on Tuesday is contaminating both groups.
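Sticky bucketing is usually implemented by hashing a stable user ID rather than storing assignments, so the bucket is deterministic and stateless. A minimal sketch (the function and experiment names here are illustrative, not any specific tool's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, traffic_split: float = 0.5) -> str:
    """Deterministic assignment: the same user always gets the same variant."""
    # Hash user + experiment together so the same user can land in
    # different buckets across different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "control" if bucket < traffic_split else "variant"

# The assignment is stable across calls -- no server-side state needed.
assert assign_variant("user_123", "signup_form_v2") == assign_variant("user_123", "signup_form_v2")
```

Because the hash is a pure function of the IDs, a user who sees variant B on Monday will still see variant B on Tuesday, and the split stays close to 50/50 over a large population.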
Step 5: Run the test (full duration, no peeking)
Launch the test and step away. The hardest part of A/B testing is not analyzing results. It is resisting the urge to check every day and react to intermediate numbers.
Peeking at results and stopping early when they look good is the single most common reason A/B tests produce invalid results. The math behind significance testing assumes you look at the data once, at the end. Every additional peek inflates your false positive rate. A team that checks daily and stops when they see p < 0.05 is running a test with an effective false positive rate of 20-30%, not the 5% they intended.
If you must monitor for safety (to catch severe regressions), use sequential testing methods that adjust the significance threshold for each look.
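You can see the peeking problem directly by simulating A/A tests, where both groups share the same true conversion rate, so every "significant" result is by definition a false positive. The traffic numbers below are illustrative, and the daily binomial draw uses a normal approximation for speed:

```python
import random
from math import sqrt, erf

random.seed(7)

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for a difference of proportions."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

def daily_conversions(daily_users, rate):
    # Normal approximation to the binomial daily draw, for simulation speed.
    mean, sd = daily_users * rate, sqrt(daily_users * rate * (1 - rate))
    return max(0, min(daily_users, round(random.gauss(mean, sd))))

def false_positive_rate(peek, trials=2000, days=14, daily_users=500, rate=0.12):
    """A/A simulation: no real difference exists, so any 'win' is noise."""
    hits = 0
    for _ in range(trials):
        ca = cb = n = 0
        significant = False
        for _ in range(days):
            ca += daily_conversions(daily_users, rate)
            cb += daily_conversions(daily_users, rate)
            n += daily_users
            if peek and p_value(ca, n, cb, n) < 0.05:
                significant = True  # team stops early and "ships"
                break
        if not peek:  # disciplined team: one look, at the end
            significant = p_value(ca, n, cb, n) < 0.05
        hits += significant
    return hits / trials

print(f"one look at the end: {false_positive_rate(peek=False):.1%}")
print(f"peeking every day:   {false_positive_rate(peek=True):.1%}")
```

In this simulation, the one-look-at-the-end rate hovers near the intended 5%, while checking daily for two weeks pushes it into the high teens or low twenties, consistent with the 20-30% range cited above.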
Step 6: Analyze and decide
When the test reaches its pre-planned sample size, analyze the results. Check:
- Is the primary metric statistically significant (p < 0.05)?
- What is the confidence interval? Is the effect size meaningful, or is it significant but tiny?
- Are there segment effects? Did the variant help new users but hurt power users?
- Did any guardrail metrics (revenue, support tickets, page load time) move in a concerning direction?
If the result is significant and the effect is meaningful, ship the variant. If the result is not significant, the test is inconclusive. That does not mean the variant is worse. It means you did not detect a difference. Document the result and move on.
Step 7: Document and share
Every test, whether it wins, loses, or is inconclusive, generates institutional knowledge. Document the hypothesis, the result, and what you learned. Share it with the team.
Teams that maintain a shared experiment log make better hypotheses over time because they build on prior results instead of repeating failed experiments. A simple spreadsheet with columns for date, hypothesis, metric, result, and takeaway is enough.
What to A/B Test (and What Not To)
Good candidates for A/B testing
- Onboarding flows. Small changes in the first-run experience have outsized effects on activation and long-term retention.
- Pricing and packaging. A 10% change in pricing page conversion can be worth millions in annual revenue. Test layout, plan names, and default selections.
- Calls to action. Button text, placement, size, and color. These are low-effort tests with fast results.
- Feature placement and discoverability. Where you surface a feature in the UI affects adoption. Test prominent placements against subtle ones.
- Email and notification copy. Subject lines, send times, and content variations are easy to test and produce clear outcomes.
- Signup and checkout flows. Reducing friction in conversion funnels is the highest-ROI use of A/B testing for most products.
Poor candidates for A/B testing
- Bug fixes. If something is broken, fix it. You do not need a control group of users experiencing the bug.
- Legal and compliance changes. Privacy policy updates, GDPR requirements, and accessibility fixes are not optional.
- Low-traffic pages. A feature used by 50 people per month will never reach statistical significance. Use qualitative research instead.
- Clearly validated features. If five customer interviews and your support ticket data all point to the same need, shipping it directly is faster and cheaper than running a two-week test.
- Infrastructure changes. Database migrations, API refactors, and performance optimizations should be measured with monitoring tools, not A/B tests.
A/B Testing Tools for Product Teams
The tooling market for experimentation has matured. Most tools handle the same core functions: variant assignment, event tracking, and statistical analysis. The differences are in integration, collaboration features, and pricing.
PostHog combines product analytics, feature flags, and A/B testing in one open-source platform. Best for engineering-heavy teams that want experimentation tightly integrated with their analytics. See best product analytics tools for 2026 for a detailed comparison.
LaunchDarkly started as a feature flag tool and added experimentation. Strong for teams already using feature flags for gradual rollouts. Enterprise pricing.
Statsig offers free experimentation for up to 1 million events per month. Strong statistical engine with automatic sequential testing (which addresses the peeking problem). Good for growth-stage teams.
Optimizely is the legacy leader in web experimentation. Best for marketing teams running front-end tests on landing pages and marketing sites. Less suited for deep product experimentation.
Google Optimize was sunset in 2023. Its replacement, A/B testing within GA4, is limited. Most teams that used Optimize have migrated to PostHog, Statsig, or VWO.
For teams just starting with experimentation, the product experimentation guide covers how to set up your first experiment program from scratch.
Common A/B Testing Mistakes
Peeking at results
The most damaging mistake. Checking results daily and stopping as soon as p < 0.05 inflates false positive rates from 5% to 20-30%. The fix is straightforward: calculate your sample size in advance, commit to the full duration, and analyze once at the end. If you need ongoing monitoring, use a tool that supports sequential analysis with adjusted significance thresholds.
Testing too many things at once
Changing the headline, the button color, the layout, and the pricing in a single variant makes it impossible to attribute the result to any specific change. If variant B wins, which change caused it? If it loses, which change dragged it down? Change one thing at a time. If you need to test a full redesign, treat it as a single holistic test and accept that you will not know which individual element drove the result.
No hypothesis
Running a test "to see what happens" produces noise, not insight. Without a hypothesis, you have no prediction to validate. You will end up cherry-picking whichever metric moved in a favorable direction and calling it a win. Start every test with a written, specific, falsifiable prediction.
Wrong success metric
Optimizing for clicks when you should optimize for conversions. Optimizing for signups when you should optimize for activation. The metric you test against should be as close to the business outcome as possible. A change that increases click-through rate by 20% but does not move revenue has not actually improved anything. Choose metrics that matter for retention and revenue, not vanity engagement numbers.
Ignoring segment effects
An overall flat result can hide meaningful segment differences. Variant B might increase conversions for new users by 15% while decreasing them for returning users by 10%. The overall average looks like no effect, but you are missing an opportunity. Always segment your results by key dimensions: new vs. returning, mobile vs. desktop, plan type, and geography. Segment analysis is exploratory (not confirmatory), but it generates hypotheses for follow-up tests.
Not documenting results
A test that is not documented might as well not have been run. Six months from now, when someone proposes the same change, nobody will remember that it was already tested and failed. Maintain a shared experiment log. Include the hypothesis, sample size, result, confidence interval, and what the team learned. This log becomes one of the most valuable assets an experimentation program produces.
Key Takeaways
- A/B testing is a controlled experiment that measures the causal impact of a product change by comparing a variant to a control group.
- Every test needs a written hypothesis, a single primary metric, a pre-calculated sample size, and a commitment to run for the full planned duration.
- The most common mistake is peeking at results and stopping early. This inflates false positive rates and leads teams to ship changes that did not actually work.
- Not everything should be A/B tested. Reserve experimentation for reversible decisions with enough traffic and genuine uncertainty about the outcome.
- Document every result. Wins, losses, and inconclusive tests all build institutional knowledge.
- Start with high-impact surfaces: onboarding, pricing, signup flows, and core feature experiences.
- For a structured approach to building an ongoing experimentation practice, work through the Product Analytics Handbook, which covers metric selection, instrumentation, and experiment design across 12 chapters.
- Combine A/B test results with qualitative research. Numbers tell you what happened. Customer interviews tell you why.