Calculate the required sample size for statistically significant A/B tests. Input baseline rate, minimum detectable effect, significance, and power.
Running an A/B test without enough samples leads to unreliable results — you might declare a winner that isn't actually better, or miss a real improvement because you stopped too early. Sample size calculation is the critical first step in experiment design, determining how many users you need in each variant to detect a meaningful difference with statistical confidence.
The required sample size depends on four key parameters: your baseline conversion rate (what the control currently achieves), the minimum detectable effect (the smallest improvement worth detecting), the significance level (typically 5%, controlling false positive risk), and statistical power (typically 80%, controlling false negative risk). Together, these determine whether your experiment can reliably detect the effect you care about.
This calculator uses the standard normal approximation for two-proportion tests to compute the required sample size per variant. It also estimates test duration based on your daily traffic and shows how different MDE levels affect the required sample size, helping you find the right balance between sensitivity and practical test duration.
Underpowered experiments are one of the biggest wastes in growth optimization. They lead to inconclusive results, false positives, and wasted development time on changes that weren't validated. This calculator ensures your experiments are properly sized before you start, gives you realistic test duration estimates, and helps you negotiate between statistical rigor and business timelines.
n = (Zα/2 + Zβ)² × 2p̄(1 − p̄) ÷ δ²

Where:
• Zα/2 = Z-score for the significance level (1.96 for 95%)
• Zβ = Z-score for power (0.84 for 80%)
• p̄ = pooled proportion, the average of the baseline and target rates
• δ = absolute difference to detect (baseline × relative MDE)

Total Sample = n × 2 (for two variants)
Test Duration = Total Sample ÷ Daily Traffic
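The formula maps directly onto a few lines of Python. This is a sketch, not any library's API: `sample_size_per_variant` is a hypothetical helper name, and `statistics.NormalDist` (Python 3.8+) supplies the z-scores.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_relative, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test,
    using the pooled-variance normal approximation above."""
    target = baseline * (1 + mde_relative)         # e.g. 0.05 -> 0.055
    delta = target - baseline                      # absolute lift to detect
    p_bar = (baseline + target) / 2                # pooled proportion
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    n = (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / delta ** 2
    return math.ceil(n)

# 5% baseline, 10% relative MDE: roughly 31,200 users per variant
n = sample_size_per_variant(0.05, 0.10)
total = 2 * n
days = math.ceil(total / 5_000)  # assuming 5,000 daily visitors split across variants
```

Exact results differ by a user or two from rounded z-tables; the shape of the calculation is what matters.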
Result: n ≈ 31,234 per variant (62,468 total)
With a 5% baseline conversion rate, a 10% relative MDE (a lift from 5.0% to 5.5%), 95% significance, and 80% power, you need approximately 31,234 users per variant. With 5,000 daily visitors split evenly, the test would run for about 13 days. Tightening the MDE to 5% would require ~124,000 users per variant.
Sample size is a tradeoff between sensitivity, speed, and confidence. Larger samples detect smaller effects but take longer. The relationship is quadratic: detecting a 5% relative MDE requires roughly 4× the sample of a 10% MDE. This is why choosing the right MDE is crucial — don't over-specify sensitivity you don't need.
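The quadratic relationship is easy to check numerically. This sketch reuses the pooled-variance approximation from the formula above; the function name is illustrative.

```python
import math
from statistics import NormalDist

def n_per_variant(baseline, mde, alpha=0.05, power=0.80):
    # Same pooled-variance approximation as the calculator's formula.
    target = baseline * (1 + mde)
    p_bar = (baseline + target) / 2
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * 2 * p_bar * (1 - p_bar) / (target - baseline) ** 2)

# Halving the relative MDE roughly quadruples the required sample:
for mde in (0.20, 0.10, 0.05):
    print(f"{mde:.0%} relative MDE -> {n_per_variant(0.05, mde):,} users per variant")
```

The ratio is not exactly 4× because the pooled proportion shifts slightly with the target rate, but it stays close.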
Beyond pure sample size, tests should run for complete weeks to capture day-of-week effects. A test that reaches sample size on a Thursday should still run through Sunday. Also account for novelty effects (early users react differently to changes) and external events (holidays, promotions) that can bias results.
For metrics with high variance (like revenue per user), you'll need much larger samples than for binary metrics (like conversion rate). Consider variance reduction techniques like CUPED or stratified sampling to reduce required samples by 30–50%. For high-traffic sites, use multi-armed bandit methods to balance learning and earning during the experiment.
MDE is the smallest improvement your test is designed to detect. For example, 10% MDE on a 5% baseline means you'd detect a lift to 5.5% or higher. Smaller MDEs require larger samples. Choose an MDE based on the smallest improvement that would justify implementing the change.
An underpowered test has a high risk of missing real effects (false negatives) or producing unreliable p-values. You might conclude "no difference" when there actually is one, or worse, declare a winner based on statistical noise. This leads to implementing ineffective changes or abandoning effective ones.
Statistical power is the probability of correctly detecting a real effect of the specified size. At 80% power, you have an 80% chance of detecting a true difference equal to or larger than your MDE. Higher power requires more samples. 80% is the standard default; some teams use 90% for high-stakes decisions.
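Since n scales with (Zα/2 + Zβ)², the extra cost of higher power can be read straight off the z-scores; a minimal sketch using Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

# Sample size scales with (z_alpha/2 + z_beta)^2, so compare that factor directly.
z_alpha = NormalDist().inv_cdf(0.975)               # two-sided 95% significance
factor_80 = (z_alpha + NormalDist().inv_cdf(0.80)) ** 2
factor_90 = (z_alpha + NormalDist().inv_cdf(0.90)) ** 2
extra = factor_90 / factor_80 - 1                    # relative increase in n
print(f"90% power needs {extra:.0%} more samples than 80% power")
```

Moving from 80% to 90% power raises the required sample by about a third, which is the price of the lower false-negative risk.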
Two-sided tests are the standard because they detect both improvements and degradations. One-sided tests need fewer samples but only detect effects in one direction. Use two-sided unless you have a strong prior belief that the change can only improve (or only hurt) the metric. This calculator uses two-sided tests.
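The sample savings from a one-sided test can be quantified the same way; a small sketch:

```python
from statistics import NormalDist

alpha, power = 0.05, 0.80
z_two = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96: two-sided critical value
z_one = NormalDist().inv_cdf(1 - alpha)      # ~1.64: one-sided critical value
z_beta = NormalDist().inv_cdf(power)
saving = 1 - ((z_one + z_beta) / (z_two + z_beta)) ** 2
print(f"one-sided test needs {saving:.0%} fewer samples")
```

At these defaults the saving is roughly 20%, which is why the direction-of-effect assumption must be firm before taking it.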
If you're testing multiple metrics simultaneously, adjust for multiple comparisons using Bonferroni correction (divide significance by number of metrics) or False Discovery Rate control. Without correction, testing 20 metrics at 5% significance means you'll likely get at least one false positive.
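The arithmetic behind both claims is short (the uncorrected figure assumes the metric tests are independent):

```python
n_metrics, alpha = 20, 0.05

# Chance of at least one false positive across 20 independent metric tests:
p_any = 1 - (1 - alpha) ** n_metrics             # about 64%

# Bonferroni: test each metric at alpha / n_metrics instead.
bonferroni = alpha / n_metrics                    # 0.0025 per metric
p_any_corrected = 1 - (1 - bonferroni) ** n_metrics
print(f"{p_any:.0%} uncorrected vs {p_any_corrected:.1%} with Bonferroni")
```

Bonferroni restores the family-wide error rate to (just under) the nominal 5%, at the cost of larger samples per metric.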
Standard fixed-horizon tests should not be stopped early because p-values are unreliable until the planned sample is reached. If you need to monitor results continuously, use sequential testing methods (like group sequential designs or always-valid confidence intervals) that account for repeated analysis.
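A small simulation makes the peeking problem concrete (parameters are illustrative): run an A/A test with no real difference and check the two-proportion z-statistic after every batch. The "any peek crosses 1.96" rate lands far above the nominal 5%, while checking only once at the end stays near it.

```python
import math
import random

random.seed(1)
P, PEEKS, STEP, SIMS = 0.05, 10, 500, 400  # A/A test: both arms convert at 5%

def z_stat(ca, cb, n):
    """Two-proportion z-statistic with pooled variance, n users per arm."""
    p_pool = (ca + cb) / (2 * n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    return 0.0 if se == 0 else (ca - cb) / (n * se)

peek_hits = final_hits = 0
for _ in range(SIMS):
    ca = cb = 0
    crossed = False
    for k in range(1, PEEKS + 1):
        ca += sum(random.random() < P for _ in range(STEP))
        cb += sum(random.random() < P for _ in range(STEP))
        if abs(z_stat(ca, cb, k * STEP)) > 1.96:
            crossed = True       # would have stopped early at this peek
    peek_hits += crossed
    final_hits += abs(z_stat(ca, cb, PEEKS * STEP)) > 1.96

print(f"check once at the end: {final_hits / SIMS:.1%} false positives")
print(f"peek after every batch: {peek_hits / SIMS:.1%} false positives")
```

Sequential methods work by widening the per-peek threshold so the overall false-positive rate stays at the planned level.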