A/B Test Significance Calculator

Determine if your A/B test results are statistically significant before declaring a winner — using a two-proportion z-test.

Last updated: May 2026

Quick Answer

An A/B test is conventionally called statistically significant when confidence ≥ 95%, meaning that if A and B truly performed the same, a difference at least this large would appear less than 5% of the time. This calculator uses a two-proportion z-test (pooled). Never stop a test early based on promising results: "peeking" dramatically inflates false positive rates.

Key Takeaways

  • 95% confidence = 5% false positive risk: The industry standard, not a universal truth — high-stakes decisions may warrant 99%.
  • Lift is relative, not absolute: 2% → 2.5% is a 25% relative lift; the absolute gain is 0.5 percentage points.
  • Don't peek: Checking results and stopping early inflates false positives — always pre-determine sample size.
  • Run for full business cycles: At least 1–2 weeks to capture weekday/weekend variation.

How to Use This Calculator (With Example)

Enter the total number of visitors and conversions for both your control (A) and variant (B). The calculator performs a two-proportion z-test and shows confidence level, relative lift, z-score, and a verdict.

Scenario: "ShopFast" — CTA Button Colour Test

  • Control A: 8,200 visitors, 164 conversions (2.00% rate)
  • Variant B: 8,200 visitors, 213 conversions (2.60% rate)
  • Change tested: Blue "Add to Cart" → Orange "Add to Cart"

The Results

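Relative Lift: (2.60% − 2.00%) ÷ 2.00% × 100 = +30% lift
Z-score: 2.55 (pooled p̂ ≈ 2.30%, SE ≈ 0.234 percentage points)
Confidence: 98.9% (Significant ✅)

ShopFast can implement the orange button with reasonable confidence. The result clears the 95% threshold: if the two buttons truly converted equally, a gap this large would show up only about 1% of the time. The 30% relative lift applies to conversion rate; if average order value holds steady, that means roughly 30% more orders from the same traffic, with no additional ad spend.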

The Statistics Behind A/B Testing

This calculator uses the two-proportion z-test (pooled) — the standard method for comparing conversion rates between two independent groups.

The formula:

  • Pooled proportion: p̂ = (Conversions A + Conversions B) ÷ (Visitors A + Visitors B)
  • Standard error: SE = √[p̂(1−p̂) × (1/n_A + 1/n_B)]
  • Z-score: z = (Rate B − Rate A) ÷ SE
  • Confidence: (1 − p-value) × 100%, where the two-tailed p-value is 2 × (1 − Φ(|z|)) and Φ is the standard normal CDF

A z-score above 1.96 corresponds to 95% confidence (two-tailed). Above 2.58 → 99% confidence. Above 3.29 → 99.9% confidence.
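
As a rough illustration, the four steps above can be sketched in Python using only the standard library. The names `two_proportion_z_test`, `ZTestResult`, and `normal_cdf` are illustrative assumptions, not this calculator's actual source code.

```python
import math
from dataclasses import dataclass


@dataclass
class ZTestResult:
    rate_a: float
    rate_b: float
    z: float
    p_value: float     # two-tailed
    confidence: float  # (1 - p_value) * 100


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_proportion_z_test(visitors_a: int, conv_a: int,
                          visitors_b: int, conv_b: int) -> ZTestResult:
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Pooled proportion: the shared rate assumed under "no real difference".
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if se == 0:
        # Happens when nobody (or everybody) converted in both groups.
        raise ValueError("standard error is zero; the z-score is undefined")
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return ZTestResult(rate_a, rate_b, z, p_value, (1 - p_value) * 100)


# ShopFast inputs from the example above.
result = two_proportion_z_test(8200, 164, 8200, 213)
print(f"z = {result.z:.2f}, confidence = {result.confidence:.1f}%")
```

The sketch uses `math.erf` for the normal CDF so it needs no third-party packages; a statistics library would work equally well.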

What Is Statistical Significance — And Why It Matters

When you run an A/B test, you're measuring a sample — not the entire future. Random variation in small samples can make an inferior variant look like a winner. Statistical significance quantifies this risk.

At 95% confidence, you accept a 5% chance of a false positive (Type I error) — declaring B the winner when the difference is actually random. At 99%, that drops to 1%. The tradeoff: higher confidence thresholds require larger sample sizes to achieve.

Practical implication: If you run 20 A/B tests at 95% confidence with no real changes, you'd expect 1 false winner just by chance. This is why CRO practitioners log and track every test — so false positives from one test aren't baked into future decisions.
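
The arithmetic behind that "1 in 20" figure can be sketched as follows. It assumes the tests are independent and that none of the variants has a real effect; the function names are illustrative.

```python
def expected_false_winners(num_tests: int, alpha: float = 0.05) -> float:
    """Average number of falsely significant results across the tests."""
    return num_tests * alpha


def prob_at_least_one_false_winner(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one test crosses the threshold by luck alone."""
    return 1 - (1 - alpha) ** num_tests


print(expected_false_winners(20))                     # 1.0
print(round(prob_at_least_one_false_winner(20), 3))   # 0.642
```

So across 20 no-effect tests there is roughly a 64% chance of seeing at least one "winner" that is not real.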

Common A/B Testing Mistakes to Avoid

  • Peeking and stopping early: If you check results daily and stop when you first hit 95%, your actual false positive rate is far higher than 5%. With about 14 looks it is roughly 22%, and it keeps rising with every extra look. Decide sample size before the test starts. Run to completion.
  • Testing too many variables: Changing headline + button colour + hero image simultaneously makes it impossible to know what caused the change. One variable per test.
  • Ignoring seasonal effects: A test run only on weekends, only during a sale, or during a platform outage captures non-representative traffic. Run for at least one full business cycle.
  • Under-powered tests: Too little traffic means you need a huge lift to reach significance. At a 2% baseline, detecting a 5% relative lift at 95% confidence and 80% power needs ~315,000 visitors per variant (~630,000 total). Many businesses run tests that could not reach significance for months.
  • Celebrating small samples: 50 visitors per variant is not a test. Even if confidence shows 95%, the confidence interval around a 50-visitor result is enormous. Require at least 200 conversions per variant before trusting results.
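
To see why peeking is so costly, here is a small simulation sketch of A/A tests, where there is no real difference. It is a simplification under stated assumptions: each look adds an equal-sized batch of traffic, and the z-score is modelled with a normal approximation rather than by simulating individual visitors. The function name and defaults are illustrative.

```python
import math
import random


def simulate_peeking(num_looks: int = 14, num_sims: int = 20000,
                     z_crit: float = 1.96, seed: int = 42):
    """Return (false positive rate if you stop at the first significant look,
    false positive rate if you only test once at the end)."""
    rng = random.Random(seed)
    stopped_early = 0
    final_only = 0
    for _ in range(num_sims):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for look in range(1, num_looks + 1):
            # Each equal-sized batch contributes an independent N(0, 1) term;
            # the cumulative z-score after `look` batches is sum / sqrt(look).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and declared a winner
        stopped_early += crossed
        final_only += abs(z) > z_crit
    return stopped_early / num_sims, final_only / num_sims


peeking_rate, single_look_rate = simulate_peeking()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"single look at the end:         {single_look_rate:.1%}")
```

With the default 14 looks, the first rate lands well above the nominal 5% while the second stays close to it. Increasing `num_looks` pushes the peeking rate higher still.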

What to Test — A/B Testing Priority Order

  • Value proposition / Headline (highest impact): What you offer and why it matters. Changing the core message can produce 50–200% lifts. Test radically different value props, not minor wording tweaks.
  • Call to action (CTA): Text ("Get Started Free" vs "Try for Free"), colour, size, placement, and the number of CTAs on the page. Most impactful element after the headline.
  • Form length: Reducing from 8 fields to 3 fields typically increases form completion 30–50%. Test removing every non-essential field.
  • Social proof: Customer count, review snippets, logos, case study stats. Test placement near the CTA vs. at the top of the page.
  • Pricing display: Annual vs. monthly billing toggle, per-user vs. flat pricing, free tier prominence. Pricing page is typically the highest-value page to test on.
  • Page structure: Long-form vs. short-form, video vs. static hero, feature-first vs. benefit-first layout.

Frequently Asked Questions

What is statistical significance in A/B testing?

Statistical significance tells you how unlikely the observed difference would be if the two variants truly performed the same. A 95% confidence level means a difference this large would appear by chance less than 5% of the time when there is no real effect. Most CRO practitioners use 95% as the minimum threshold before declaring a winner.

How many visitors do I need for an A/B test?

It depends on your baseline conversion rate and the minimum effect you want to detect. As a rule of thumb: to detect a 20% relative lift on a 2% baseline conversion rate at 95% confidence and 80% power, you need roughly 21,000 visitors per variation. Smaller expected lifts require far more traffic, because the required sample grows with the inverse square of the lift: detecting a 5% relative lift needs ~315,000 visitors per variant.
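
A common way to estimate this is the normal-approximation sample size formula for two proportions. The sketch below assumes a two-tailed test and 80% power by default; other calculators use slightly different formulas or power settings and will give somewhat different numbers. The function name is illustrative.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline_rate: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed in EACH variant to detect the given lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed threshold
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


for lift in (0.20, 0.10, 0.05):
    print(f"{lift:.0%} relative lift on a 2% baseline: "
          f"{visitors_per_variant(0.02, lift):,} visitors per variant")
```

Because the lift appears squared in the denominator, halving the lift you want to detect roughly quadruples the traffic you need.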

What is 'lift' in an A/B test?

Lift is the relative improvement of variant B over the control (A). Formula: Lift = (Rate B − Rate A) ÷ Rate A × 100. A control rate of 2% and variant rate of 2.5% gives a 25% lift, meaning B converts 25% better than A; the absolute difference is 0.5 percentage points. Relative lift is more meaningful than absolute difference for evaluating test impact.
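
The two measures are easy to confuse, so here is a minimal sketch of both using the rates from the answer above. The function names are illustrative.

```python
def relative_lift(rate_a: float, rate_b: float) -> float:
    """Percentage improvement of B over A."""
    return (rate_b - rate_a) / rate_a * 100


def absolute_difference_points(rate_a: float, rate_b: float) -> float:
    """Gap between the two rates, in percentage points."""
    return (rate_b - rate_a) * 100


print(round(relative_lift(0.02, 0.025), 1))               # 25.0
print(round(absolute_difference_points(0.02, 0.025), 2))  # 0.5
```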

Should I stop the test early if I see significant results?

No — 'peeking' is one of the most common A/B testing mistakes. If you check results multiple times during a test and stop when you first see p<0.05, you dramatically inflate your false-positive rate. Always pre-determine your sample size and run the test to completion. Only run interim analyses if you've pre-planned them with corrections (Bonferroni or sequential testing).

What is the difference between one-tailed and two-tailed significance tests?

A one-tailed test asks: 'Is B better than A?' A two-tailed test asks: 'Is B different from A (better or worse)?' One-tailed tests reach significance faster but only test one direction. Two-tailed tests are more conservative and are preferred when you want to detect any difference. This calculator uses a two-tailed approach, which is standard for most CRO testing.
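
A short sketch shows how the same z-score can pass a one-tailed test and fail a two-tailed one. The z-score of 1.8 is a made-up illustration, and the function name is an assumption.

```python
from statistics import NormalDist


def p_values(z: float):
    """Return (one-tailed p for 'B is better than A', two-tailed p)."""
    std_normal = NormalDist()
    one_tailed = 1 - std_normal.cdf(z)
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed


one, two = p_values(1.8)
print(f"one-tailed p = {one:.3f}, two-tailed p = {two:.3f}")
# one-tailed p = 0.036, two-tailed p = 0.072
```

At a 0.05 threshold, z = 1.8 counts as significant one-tailed but not two-tailed, which is why the two-tailed approach is the more conservative choice.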

What is a Type I vs Type II error?

A Type I error (false positive) declares a winner when there is no real difference — controlled by your significance threshold (5% at 95% confidence). A Type II error (false negative) misses a real effect — controlled by your statistical power (typically 80%). Running underpowered tests (too little traffic) dramatically increases Type II errors, making you miss real improvements.
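
To make the Type II risk concrete, the sketch below estimates power with a normal approximation for a two-tailed test. It ignores the tiny chance of a significant result in the wrong direction, and the function name and example inputs are illustrative assumptions.

```python
import math
from statistics import NormalDist


def approximate_power(rate_a: float, rate_b: float,
                      n_per_variant: int, alpha: float = 0.05) -> float:
    """Approximate chance of detecting a true difference between the rates."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = math.sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n_per_variant)
    return std_normal.cdf(abs(rate_b - rate_a) / se - z_crit)


# A true lift from 2.0% to 2.4% with 5,000 visitors per variant.
power = approximate_power(0.02, 0.024, 5000)
print(f"power = {power:.0%}, Type II error rate = {1 - power:.0%}")
```

With these inputs, power comes out well under the usual 80% target, so a real 20% relative lift would go undetected more often than not.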