Compare two conversion rates: p-value, lift, statistical significance.
Two-proportion z-test on a pooled standard error. The p-value comes from the standard normal CDF (Abramowitz & Stegun erf approximation, accurate to about 10⁻⁷). The whiskers show ±2 SE around each variant's rate; overlap suggests the difference may be noise. Lift is reported relative to A, and the confidence interval for the difference uses the unpooled SE.
When you run an A/B test on a website, an email subject line, a pricing page, or any other binary outcome (clicked / didn't click, signed up / didn't sign up, paid / didn't pay), the question you really want to answer is uncomfortably simple: did the change actually move the needle, or am I looking at random noise? Conversion rates jiggle from day to day on every product, even when nothing has changed. If variant B converted 6.4 % of visitors and the control A converted 5.0 %, you cannot declare victory just because B's number is bigger. You have to ask how often a 1.4-point gap could appear by pure chance, given the size of the audiences you measured. That is the entire job of statistical significance testing: separating signal from sampling noise. Skip it and you will ship random changes, claim wins that evaporate the next month, and lose the trust of your team. Use it correctly and you build a culture of evidence: only ship what you can defend with numbers.
The standard test for two-proportion A/B experiments is the two-proportion z-test on a pooled standard error. Given variant A with n_A visitors and c_A conversions, and variant B with n_B visitors and c_B conversions, the calculator computes:
p_A = c_A / n_A,  p_B = c_B / n_B
p̂ = (c_A + c_B) / (n_A + n_B)
SE = √( p̂·(1 − p̂) · (1/n_A + 1/n_B) )
z = (p_B − p_A) / SE
p = 2 · (1 − Φ(|z|)), or one-sided p = 1 − Φ(z) for the alternative B > A

Φ is the standard normal cumulative distribution function, computed from the Abramowitz & Stegun erf approximation (accurate to about 10⁻⁷). The relative lift is (p_B − p_A) / p_A. The 95 % confidence interval for the absolute difference uses the unpooled standard error SE_u = √( p_A·(1−p_A)/n_A + p_B·(1−p_B)/n_B ), since the pooled form assumes the null hypothesis. The visualization shows ±2 SE whiskers around each rate so you can see whether the two confidence bands overlap.
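For readers who want to reproduce the arithmetic outside the calculator, here is a minimal Python sketch of the same computation, substituting Python's built-in math.erf for the Abramowitz & Stegun series; the function names are ours, not the calculator's:

```python
from math import sqrt, erf

def phi(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_proportion_ztest(n_a, c_a, n_b, c_b, two_sided=True):
    """Pooled-SE z-test for two conversion rates, plus relative lift
    and a 95 % CI on the absolute difference (unpooled SE)."""
    p_a, p_b = c_a / n_a, c_b / n_b
    p_pool = (c_a + c_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - phi(abs(z))) if two_sided else 1 - phi(z)
    lift = (p_b - p_a) / p_a
    # The CI drops the null-hypothesis assumption, so it uses the unpooled SE.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return {
        "p_a": p_a, "p_b": p_b, "z": z, "p_value": p_value, "lift": lift,
        "ci_95": (diff - 1.96 * se_unpooled, diff + 1.96 * se_unpooled),
    }
```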
Type in the four raw counts: visitors and conversions for each variant. "Visitors" should be unique users assigned to the variant, not page views — and the same person should always see the same variant. "Conversions" is the count of those visitors who completed the goal at least once. Pick a confidence level: 95 % is the default in industry; use 99 % when the change is risky and reversing it costs real money or trust; 90 % is acceptable for low-stakes UI tweaks where you mostly care about iteration speed. Pick a tail: two-sided is the safe default — it tests whether the variants differ in either direction, which protects you from cherry-picking. Use one-sided only when you have a hard prior that B can only be ≥ A (a near-impossible claim in practice). Read the verdict: if Significant = Yes, you have evidence to declare the variant shown as the winner; if No, the data are consistent with no real difference at the chosen α. The lift KPI tells you the magnitude of the change in relative terms — a 2-point lift on a 5 % base is a 40 % relative lift, which is huge.
You ran a landing-page redesign test for two weeks. Variant A (the existing page) was shown to 5 000 visitors and produced 250 sign-ups. Variant B (the redesign) was shown to 5 100 visitors and produced 320 sign-ups. Plug those numbers in. The calculator computes p_A = 5.00 %, p_B = 6.27 %, an absolute gap of 1.27 points, a relative lift of +25.5 % — promising at first glance. The pooled rate is (250 + 320) / (5 000 + 5 100) ≈ 5.64 %, the pooled SE is about 0.00459, and z ≈ +2.78. The two-sided p-value is roughly 0.0055, well below α = 0.05. The verdict: Significant = Yes, winner = B, and you can ship the redesign with reasonable confidence. The 95 % CI for the difference is approximately [+0.36 %, +2.18 %] — note that even the lower bound is comfortably above zero, which is the visual analog of "the whiskers do not cross". Now redo the same example with only 1 500 visitors per variant and a similar absolute lift: the z-statistic shrinks below 1.96 and the result becomes inconclusive. Same effect, less data, no shipping decision. That is the calculator earning its keep.
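Both runs of the example are easy to verify in a few self-contained lines. In the sketch below, the 1 500-visitor re-run uses 75 and 94 conversions, counts we chose so the rates stay roughly the same as in the full example:

```python
from math import sqrt, erf

def p_two_sided(z):
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

for n_a, c_a, n_b, c_b in [(5000, 250, 5100, 320),   # the worked example
                           (1500, 75, 1500, 94)]:     # similar rates, less data
    p_a, p_b = c_a / n_a, c_b / n_b
    pool = (c_a + c_b) / (n_a + n_b)
    z = (p_b - p_a) / sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    print(f"p_A={p_a:.4f}  p_B={p_b:.4f}  z={z:+.2f}  p={p_two_sided(z):.4f}")

# First line:  z ≈ +2.78, p ≈ 0.0055 (significant at 95 %).
# Second line: z ≈ +1.50, p ≈ 0.13  (inconclusive).
```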
The single biggest mistake in A/B testing is peeking: checking the p-value daily and stopping the test the first time it drops below 0.05. Doing that inflates your false-positive rate from the nominal 5 % to roughly 25–30 % over a few weeks, because every additional check is another chance for noise to look like signal. Pick a sample size in advance and commit to it. Second, multiple variants without correction: if you run A/B/C/D/E with 4 simultaneous comparisons against control, the chance of at least one false positive rises to roughly 1 − 0.95⁴ ≈ 18.5 %. Apply a Bonferroni correction (α_per_test = α / k) or run a single omnibus test first (for proportions, a chi-square across all arms). Third, early stopping on lift size: a 40 % lift over the first three days is almost always regression to the mean — early adopters of a new variant skew enthusiastic. Fourth, novelty effect: any change looks better in week 1 because users react to anything new; let the test run at least one full weekly cycle. Fifth, weekly seasonality: starting on Monday and ending on Saturday breaks the symmetry between variants if traffic mix differs by weekday. Sixth, sample-ratio mismatch (SRM): if your A/B split was set to 50/50 but you measured 5 000 vs 5 800, something is wrong with the assignment plumbing — the test is invalid until you fix it. Seventh, confounding launches: never run two tests on overlapping audiences without proper isolation; results bleed into each other.
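The SRM check in particular is worth automating. A minimal sketch, using a normal approximation to the binomial; the function name and the convention of flagging very small p-values (say below 0.001) are our choices, not something the calculator prescribes:

```python
from math import sqrt, erfc

def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Two-sided p-value for the observed split vs. the intended allocation
    (normal approximation to the binomial)."""
    n = n_a + n_b
    se = sqrt(expected_share_a * (1 - expected_share_a) / n)
    z = (n_a / n - expected_share_a) / se
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|)), cancellation-safe

# The 5 000 vs 5 800 split from the text: the p-value comes out around 1e-14,
# essentially impossible under a true 50/50 assignment, so the test is invalid.
print(srm_p_value(5000, 5800))
```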
Several alternative frameworks address weaknesses of the classic frequentist test. Bayesian A/B testing reports the posterior probability that B is best given priors, which avoids the binary "significant / not" verdict and lets you stop early on probability thresholds — but the answer depends on your prior, which you must defend. Sequential testing with always-valid p-values (mSPRT, group-sequential designs) lets you peek as often as you like without inflating type-I error, at the cost of needing a slightly larger sample to reach the same confidence. CUPED (Controlled-experiment Using Pre-Experiment Data) uses pre-period covariates to subtract baseline noise, often shrinking required sample sizes by 30–50 % on metrics with high pre-period correlation. Multi-armed bandits (Thompson sampling, UCB) allocate more traffic to winning arms in real time — great for short-window decisions or when one arm is dramatically worse, but unsuitable when you want a clean post-test reading. For non-binary metrics like revenue per visitor or pages per session, swap the proportion test for a Welch's t-test, ideally on log-transformed values to tame heavy tails. Survival or funnel-step tests are appropriate when the outcome unfolds over time (time-to-purchase, retention by day-30): use Kaplan-Meier curves with a log-rank test rather than collapsing to a single proportion. Finally, do a power analysis before you start: it tells you the minimum sample size required to detect the smallest effect size your business cares about. Without it, "we need more data" is the only honest answer to almost every inconclusive test.
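To make the Bayesian alternative concrete, here is a minimal sketch of the Beta-posterior approach; the uniform Beta(1, 1) priors and the Monte Carlo draw count are our assumptions, not part of the classic calculator:

```python
import random

def prob_b_beats_a(n_a, c_a, n_b, c_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1 + conversions, 1 + non-conversions) posteriors (uniform priors)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + c_a, 1 + n_a - c_a)
        rate_b = rng.betavariate(1 + c_b, 1 + n_b - c_b)
        wins += rate_b > rate_a
    return wins / draws

# The landing-page example: P(B > A) comes out around 0.997,
# in line with the frequentist verdict above.
print(prob_b_beats_a(5000, 250, 5100, 320))
```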