Last updated: May 2026
Quick Answer
An A/B test is conventionally called statistically significant when confidence is 95% or higher, meaning that if the two variants truly performed the same, a difference this large would appear less than 5% of the time. This calculator uses a two-proportion z-test (pooled). Never stop a test early based on promising results: "peeking" pushes the false positive rate well above the nominal 5%.
Key Takeaways
- ✓ 95% confidence = 5% false positive risk: The industry standard, not a universal truth — high-stakes decisions may warrant 99%.
- ✓ Lift is relative, not absolute: 2% → 2.5% is a 25% relative lift, which is 0.5 percentage points in absolute terms.
- ✓ Don't peek: Checking results and stopping early inflates false positives — always pre-determine sample size.
- ✓ Run for full business cycles: At least 1–2 weeks to capture weekday/weekend variation.
How to Use This Calculator (With Example)
Enter the total number of visitors and conversions for both your control (A) and variant (B). The calculator performs a two-proportion z-test and shows confidence level, relative lift, z-score, and a verdict.
Scenario: "ShopFast" — CTA Button Colour Test
- Control A: 8,200 visitors, 164 conversions (2.00% rate)
- Variant B: 8,200 visitors, 213 conversions (2.60% rate)
- Change tested: Blue "Add to Cart" → Orange "Add to Cart"
The Results
Relative Lift: (2.60% − 2.00%) ÷ 2.00% × 100 = +30% lift
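Z-score: 2.55
Confidence: 98.9% (significant at the 95% threshold) ✅
ShopFast has solid evidence in favour of the orange button. At 98.9% confidence, a difference this large would arise only about 1% of the time if the two buttons truly performed the same, although the result falls just short of the stricter 99% threshold. The 30% relative lift is an estimate from this sample: if it holds, it means roughly 30% more conversions from this single change (and a similar revenue gain if average order value stays the same), with no additional ad spend.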
The Statistics Behind A/B Testing
This calculator uses the two-proportion z-test (pooled) — the standard method for comparing conversion rates between two independent groups.
The formula:
- Pooled proportion: p̂ = (Conversions A + Conversions B) ÷ (Visitors A + Visitors B)
- Standard error: SE = √[p̂(1−p̂) × (1/n_A + 1/n_B)]
- Z-score: z = (Rate B − Rate A) ÷ SE
- Confidence: (1 − p-value) × 100%, using two-tailed normal CDF
A z-score above 1.96 corresponds to 95% confidence (two-tailed). Above 2.58 → 99% confidence. Above 3.29 → 99.9% confidence.
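The formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the calculator's actual source; the function name `two_proportion_z_test` and its argument names are assumptions made for this example.

```python
import math

def two_proportion_z_test(visitors_a, conversions_a, visitors_b, conversions_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value, confidence %)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled proportion across both groups
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    # Standard error of the difference under the pooled estimate
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    confidence = (1 - p_value) * 100
    return z, p_value, confidence

# ShopFast counts from the scenario above
z, p_value, confidence = two_proportion_z_test(8200, 164, 8200, 213)
print(f"z = {z:.2f}, confidence = {confidence:.1f}%")  # z = 2.55, confidence = 98.9%
```

Feeding in raw counts rather than rounded rates avoids small rounding differences in the z-score.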
What Is Statistical Significance — And Why It Matters
When you run an A/B test, you're measuring a sample — not the entire future. Random variation in small samples can make an inferior variant look like a winner. Statistical significance quantifies this risk.
At 95% confidence, you accept a 5% chance of a false positive (Type I error) — declaring B the winner when the difference is actually random. At 99%, that drops to 1%. The tradeoff: higher confidence thresholds require larger sample sizes to achieve.
Practical implication: If you run 20 A/B tests at 95% confidence with no real changes, you'd expect 1 false winner just by chance. This is why CRO practitioners log and track every test — so false positives from one test aren't baked into future decisions.
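The arithmetic behind that expectation can be checked directly. The sketch below assumes the 20 tests are independent and that none of the variants has any real effect; the function name is made up for this example.

```python
def chance_of_any_false_winner(tests, alpha=0.05):
    """Probability that at least one of `tests` independent no-effect tests looks significant."""
    return 1 - (1 - alpha) ** tests

tests = 20
expected_false_winners = tests * 0.05  # 1 false winner on average
print(expected_false_winners, round(chance_of_any_false_winner(tests), 2))  # 1.0 0.64
```

So across 20 such tests the average is one false winner, and there is roughly a 64% chance of seeing at least one.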
Common A/B Testing Mistakes to Avoid
- Peeking and stopping early: If you check results daily and stop when you first hit 95%, your actual false positive rate can climb to roughly 20% or more instead of 5%; the exact figure depends on how many times you look. Decide sample size before the test starts. Run to completion.
- Testing too many variables: Changing headline + button colour + hero image simultaneously makes it impossible to know what caused the change. One variable per test.
- Ignoring seasonal effects: A test run only on weekends, only during a sale, or during a platform outage captures non-representative traffic. Run for at least one full business cycle.
- Under-powered tests: Too little traffic means you need a huge lift to reach significance. At a 2% baseline, detecting a 5% relative lift at 95% confidence and 80% power needs roughly 630,000 visitors in total (about 315,000 per variant). Many businesses run tests that cannot reach significance for months.
- Celebrating small samples: 50 visitors per variant is not a test. Even if confidence shows 95%, the confidence interval around a 50-visitor result is enormous. Require at least 200 conversions per variant before trusting results.
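To see how wide the interval around a small sample is, here is a minimal sketch using the Wilson score interval. The choice of Wilson rather than another interval, the function name, and the example of 1 conversion from 50 visitors are all assumptions made for illustration.

```python
import math

def wilson_interval(conversions, visitors, z=1.96):
    """Approximate 95% Wilson score interval for a conversion rate."""
    p = conversions / visitors
    denom = 1 + z**2 / visitors
    centre = (p + z**2 / (2 * visitors)) / denom
    half_width = z * math.sqrt(p * (1 - p) / visitors + z**2 / (4 * visitors**2)) / denom
    return centre - half_width, centre + half_width

small = wilson_interval(1, 50)      # hypothetical: 1 conversion from 50 visitors (2%)
large = wilson_interval(164, 8200)  # same 2% rate measured on 8,200 visitors
print([round(x, 4) for x in small], [round(x, 4) for x in large])
```

Both samples show a 2% rate, but the 50-visitor interval runs from roughly 0.4% to 10.5%, while the 8,200-visitor interval is roughly 1.7% to 2.3%.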
What to Test — A/B Testing Priority Order
- Value proposition / Headline (highest impact): What you offer and why it matters. Changing the core message can produce 50–200% lifts. Test radically different value props, not minor wording tweaks.
- Call to action (CTA): Text ("Get Started Free" vs "Try for Free"), colour, size, placement, and the number of CTAs on the page. Most impactful element after the headline.
- Form length: Reducing a form from 8 fields to 3 is often reported to increase completion by 30–50%, though results vary by audience and form. Test removing every non-essential field.
- Social proof: Customer count, review snippets, logos, case study stats. Test placement near the CTA vs. at the top of the page.
- Pricing display: Annual vs. monthly billing toggle, per-user vs. flat pricing, free tier prominence. The pricing page is typically the highest-value page to test.
- Page structure: Long-form vs. short-form, video vs. static hero, feature-first vs. benefit-first layout.
Frequently Asked Questions
What is statistical significance in A/B testing?
Statistical significance tells you how unlikely the observed difference between two variants would be if it were caused by random variation alone. A 95% confidence level means that, if the variants truly performed the same, a difference this large would show up only 5% of the time. Most CRO practitioners use 95% as the minimum threshold before declaring a winner.
How many visitors do I need for an A/B test?
It depends on your baseline conversion rate and the minimum effect you want to detect. As a rule of thumb: to detect a 20% relative lift on a 2% baseline conversion rate at 95% confidence and 80% power, you need roughly 21,000 visitors per variation. Smaller expected lifts require far more traffic, because the required sample grows with the inverse square of the lift: detecting a 5% relative lift needs roughly 315,000 visitors per variant.
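A rough sample-size estimate can be sketched with the standard normal-approximation formula for comparing two proportions. The sketch assumes a two-tailed test, equal traffic per variant, and 80% power by default; the function name is hypothetical, and other calculators may use slightly different formulas and give somewhat different figures.

```python
import math
from statistics import NormalDist

def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(visitors_per_variant(0.02, 0.20))  # roughly 21,000
print(visitors_per_variant(0.02, 0.05))  # roughly 315,000
```

Because the lift appears squared in the denominator, cutting the detectable lift to a quarter of its size multiplies the required traffic by about 15.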
What is 'lift' in an A/B test?
Lift is the relative improvement of variant B over the control (A). Formula: Lift = (Rate B − Rate A) ÷ Rate A × 100. A control rate of 2% and variant rate of 2.5% gives a 25% lift, meaning B converts 25% better than A; the absolute difference is 0.5 percentage points. Relative lift is usually more meaningful than absolute difference for evaluating test impact.
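As a quick illustration of the two ways to express the same change, the following sketch returns both figures. The function name `lift` and its return order are assumptions for this example.

```python
def lift(rate_a, rate_b):
    """Return (absolute difference in percentage points, relative lift in %)."""
    absolute_points = (rate_b - rate_a) * 100
    relative_percent = (rate_b - rate_a) / rate_a * 100
    return absolute_points, relative_percent

absolute_points, relative_percent = lift(0.02, 0.025)
print(f"{absolute_points:.1f} points, {relative_percent:.0f}% relative")  # 0.5 points, 25% relative
```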
Should I stop the test early if I see significant results?
No — 'peeking' is one of the most common A/B testing mistakes. If you check results multiple times during a test and stop when you first see p<0.05, you dramatically inflate your false-positive rate. Always pre-determine your sample size and run the test to completion. Only run interim analyses if you've pre-planned them with corrections (Bonferroni or sequential testing).
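The effect of peeking can be demonstrated with a small simulation. This is a simplified sketch, not a model of any particular testing tool: it assumes an A/A test (no real difference), equal-sized batches of data between looks, and a z-score that is approximately normal. The function name, the 14 looks, and the seed are all arbitrary choices for illustration.

```python
import math
import random

def peeking_false_positive_rate(looks=14, simulations=4000, z_crit=1.96, seed=0):
    """Share of no-effect tests that cross z_crit at any of `looks` interim checks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Under the null, each new batch adds a standard normal increment
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(k)  # z-score using all data collected so far
            if abs(z) > z_crit:
                false_positives += 1
                break  # the tester stops and declares a winner
    return false_positives / simulations

print(peeking_false_positive_rate(looks=1))   # close to the nominal 0.05
print(peeking_false_positive_rate(looks=14))  # several times higher
```

Under these assumptions, 14 daily looks typically push the false positive rate to around one in five, even though every individual check uses the 95% threshold.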
What is the difference between one-tailed and two-tailed significance tests?
A one-tailed test asks: 'Is B better than A?' A two-tailed test asks: 'Is B different from A (better or worse)?' One-tailed tests reach significance faster but only test one direction. Two-tailed tests are more conservative and are preferred when you want to detect any difference. This calculator uses a two-tailed approach, which is standard for most CRO testing.
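The difference is easy to see by computing both p-values from the same z-score. The sketch below uses the standard normal CDF from the standard library; the function names and the example z-score of 1.7 are assumptions chosen to show a borderline case.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_values(z):
    """Return (one-tailed p for 'B is better than A', two-tailed p for 'B differs from A')."""
    one_tailed = 1 - normal_cdf(z)
    two_tailed = 2 * (1 - normal_cdf(abs(z)))
    return one_tailed, two_tailed

one_tailed, two_tailed = p_values(1.7)
print(f"{one_tailed:.3f} {two_tailed:.3f}")  # 0.045 0.089
```

With z = 1.7, the one-tailed test clears the 0.05 threshold while the two-tailed test does not, which is why the two approaches can disagree on the same data.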
What is a Type I vs Type II error?
A Type I error (false positive) declares a winner when there is no real difference; its rate is set by your significance threshold (5% at 95% confidence). A Type II error (false negative) misses a real effect; its rate is one minus your statistical power, so the typical 80% power target accepts a 20% chance of missing an effect of the planned size. Running underpowered tests (too little traffic) sharply increases Type II errors, making you miss real improvements.
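To make the power side concrete, here is a minimal sketch that approximates the power of a two-tailed test with equal traffic per variant, using the normal approximation and ignoring the negligible chance of a significant result in the wrong direction. The function name and the example traffic figures are assumptions for illustration.

```python
import math
from statistics import NormalDist

def approximate_power(baseline, relative_lift, visitors_per_variant, alpha=0.05):
    """Approximate chance of detecting a true relative lift at the given traffic."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / visitors_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

low_traffic = approximate_power(0.02, 0.20, 5000)
print(f"power = {low_traffic:.2f}, Type II risk = {1 - low_traffic:.2f}")
```

In this example, a real 20% lift on a 2% baseline tested with 5,000 visitors per variant would be detected less than a third of the time, so most real wins of that size would be missed.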