Calculate the required sample size for statistically significant A/B ad tests. Determine how many impressions or clicks you need for reliable test results.
Running A/B tests on ad creative, landing pages, or bidding strategies without adequate sample sizes leads to false conclusions. A test declared a "winner" with too few impressions may just be random noise. This calculator tells you exactly how many impressions, clicks, or conversions you need for statistically valid results.
The required sample size depends on three key parameters: your baseline conversion rate, the minimum detectable effect (MDE) you want to identify, and the confidence level you require. Smaller effects need larger samples. Higher confidence needs larger samples. Lower baseline rates need larger samples.
Properly sized A/B tests prevent two costly errors: (1) switching to a "better" ad that's actually no different (false positive), and (2) keeping a weaker ad because the test was too small to detect the improvement (false negative).
Underpowered A/B tests waste budget and lead to wrong decisions. This calculator ensures your ad tests have enough data for valid conclusions, preventing both false wins and missed improvements, and grounding optimization decisions in statistical evidence rather than anecdotal observation.
n = (Z_α/2 + Z_β)² × 2 × p̄(1 − p̄) ÷ (p₁ − p₂)²

Where:
n = sample size per variant
Z_α/2 = Z-score for the confidence level (1.96 for 95%)
Z_β = Z-score for statistical power (0.84 for 80%)
p̄ = average of the baseline and expected variant rates
p₁, p₂ = baseline and expected variant conversion rates
Result: ~13,900 per variant (~27,800 total)
With a 3% baseline conversion rate, detecting a 20% relative improvement (3% → 3.6%) at 95% confidence and 80% power requires approximately 13,900 samples per variant, or about 27,800 total. At 1,000 clicks/day, the test would take roughly 28 days.
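To make that arithmetic reproducible, here is a minimal Python sketch of the formula above, using only the standard library. The function and parameter names (samples_per_variant, relative_mde) are illustrative rather than part of any particular tool, and different calculators use slightly different variance terms, so results can vary by a few percent.

```python
from math import ceil
from statistics import NormalDist


def samples_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Approximate sample size per variant for a two-proportion A/B test:
    n = (Z_alpha/2 + Z_beta)^2 * 2 * p_bar * (1 - p_bar) / (p1 - p2)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)                        # expected variant rate
    p_bar = (p1 + p2) / 2                                     # average of the two rates
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)                      # 0.84 for 80% power
    n = (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2
    return ceil(n)


# Worked example: 3% baseline, 20% relative MDE, 95% confidence, 80% power
n = samples_per_variant(0.03, 0.20)
print(n, 2 * n)  # roughly 13,900 per variant, roughly 27,800 total
```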
Premature test conclusions are one of the most expensive mistakes in paid advertising. Switching to a "winning" ad variant based on insufficient data can actually decrease performance. Properly calculating sample size before running a test ensures valid, actionable results.
Baseline rate: lower rates need more data (at the same relative MDE, testing a 1% conversion rate needs roughly 4x the data of a 4% rate). MDE: sample size scales with the inverse square of the effect, so halving the improvement you want to detect roughly quadruples the data required. Confidence/Power: stricter statistical requirements need more data. Adjust these three to balance precision with practical test duration; the sketch below illustrates the sensitivities.
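A rough illustration of those sensitivities, reusing the same two-proportion approximation (the helper name samples_per_variant is again illustrative):

```python
from math import ceil
from statistics import NormalDist


def samples_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    # Same two-proportion approximation as in the worked example above
    p1, p2 = baseline, baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2)


# Baseline rate: ~4x more data at a 1% baseline than at 4% (same 20% relative MDE)
print(samples_per_variant(0.01, 0.20), samples_per_variant(0.04, 0.20))
# MDE: ~15x more data for a 5% lift than a 20% lift (~16x by pure 1/MDE^2 scaling)
print(samples_per_variant(0.03, 0.05), samples_per_variant(0.03, 0.20))
# Confidence: ~1.5x more data at 99% confidence than at 95%
print(samples_per_variant(0.03, 0.20, confidence=0.99), samples_per_variant(0.03, 0.20))
```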
Ending tests early ("it's already significant at day 3"): early peeking inflates false positive rates. Running too many variants (splits traffic and extends duration). Using clicks instead of conversions (click-through rate is a noisy proxy for the outcome you actually care about). Not accounting for seasonality (weekend traffic differs from weekday). These mistakes make test results unreliable.
For most ad A/B tests: use 95% confidence, 80% power, and 15–20% relative MDE. This balances statistical rigor with realistic test durations. If you need to test faster, increase MDE (only test bold creative differences) rather than reducing confidence.
MDE is the smallest improvement you want to be able to detect reliably. A 20% relative MDE means that if the true improvement is 20% or larger, your test will catch it at the chosen power (for example, 80% of the time). Sample size scales with the inverse square of the MDE, so smaller MDEs require far more data.
95% is the standard for most business decisions. Use 90% for initial screening tests where speed matters more than precision. Use 99% for critical decisions (pricing, major creative changes) where false positives are very costly.
Power (typically 80%) is the probability of detecting a real effect when it exists. 80% power means a 20% chance of missing a real improvement. Increasing power from 80% to 90% requires roughly a third (about 34%) more samples.
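That scaling follows directly from the Z-scores in the formula; a quick standard-library check:

```python
from statistics import NormalDist

z_alpha = NormalDist().inv_cdf(0.975)  # 95% confidence, two-sided: ~1.96
z_80 = NormalDist().inv_cdf(0.80)      # 80% power: ~0.84
z_90 = NormalDist().inv_cdf(0.90)      # 90% power: ~1.28

# Sample size scales with (Z_alpha + Z_beta)^2, all else being equal
print((z_alpha + z_90) ** 2 / (z_alpha + z_80) ** 2)  # ~1.34, i.e. ~34% more samples
```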
Duration (in days) = total required sample size ÷ daily traffic entering the test. But also run for at least 7 days to capture day-of-week variation. Never end early, even if results look significant: early peeking inflates false positive rates.
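A small sketch of that planning step (the helper name test_duration_days is illustrative, and the 7-day floor is the rule of thumb stated above, not a statistical requirement):

```python
from math import ceil


def test_duration_days(total_sample_size, daily_traffic, min_days=7):
    # Round up to whole days, then enforce a one-week floor for day-of-week effects
    return max(ceil(total_sample_size / daily_traffic), min_days)


print(test_duration_days(27_800, 1_000))  # ~28 days for the worked example above
```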
Yes, but each arm needs its own full per-variant sample. A four-arm test therefore needs roughly twice the total traffic of a two-arm A/B test, and more still if you adjust the significance threshold for multiple comparisons. For many variants, use a multi-armed bandit approach or sequential testing frameworks.
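If several challengers are compared against one control, one common and conservative adjustment is a Bonferroni correction of the significance level. This goes beyond the text above and is sketched here only for illustration:

```python
from math import ceil
from statistics import NormalDist


def samples_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    # Two-proportion approximation from the formula section above
    p1, p2 = baseline, baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2)


num_challengers = 3                      # e.g. three new ads tested against one control
alpha_adjusted = 0.05 / num_challengers  # Bonferroni: split alpha across the comparisons
n = samples_per_variant(0.03, 0.20, alpha=alpha_adjusted)
print(n, n * (num_challengers + 1))      # per-arm sample and total traffic required
```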
Small effects are hard to distinguish from random noise. Detecting a 5% relative improvement requires about 16x more data than detecting a 20% improvement, because sample size scales with 1/MDE². Focus ad tests on changes you expect to produce 15%+ improvements.