Shopify Conversion Testing: Sample Size & Stats (2026)

Key takeaways

Conversion testing is CRO with statistics attached - you are not just changing a page, you are measuring whether the change caused a real difference.

Every valid test is defined before it starts by four inputs: significance, power, minimum detectable effect, and your baseline conversion rate.

The maths is unforgiving for small stores: at a 2% baseline, detecting a 10% relative lift (2% to 2.2%) at 95% confidence and 80% power needs roughly 80,000 visitors per variation. Halving the effect you chase roughly quadruples the sample you need.

The most common mistake is peeking - stopping the moment a test looks significant. It can push your real false-positive rate above 25%.

Shopify’s native testing (Rollouts) can test layouts but not prices, reports no statistical significance, and gates the experiment features to higher plans.

This is a pillar guide to conversion testing on Shopify. It goes deeper than a general CRO overview into the mechanics that decide whether a test result is real: sample size, statistical significance, test duration, and the special problem faced by stores without much traffic.

If you want the broader picture of what to change on your store first, start with our Shopify CRO guide and come back here when you are ready to test those changes properly.

Why you can trust us

We have spent more than four years in the Shopify ecosystem and built Fudge, an AI page builder used by hundreds of merchants to ship and iterate on store pages. We have watched a lot of stores run a lot of tests, and the failure mode is almost always the same: calling a winner from data that never supported it. This guide is written to prevent that.

What conversion testing actually is

Conversion testing is the practice of changing one thing on your store, showing the change to a random half of visitors, and using statistics to decide whether it caused a real difference in conversion rate. The statistics are the point. Without them, you are just looking at two numbers and guessing.

That distinction separates conversion testing from general conversion rate optimisation. CRO is the whole discipline of improving your store. Conversion testing is the measurement method that tells you which improvements actually worked.

An A/B test splits traffic between a control (A) and a variant (B). A/B/n tests add more variants. Multivariate tests change several elements at once. For most Shopify stores, a clean two-way A/B test is the right tool, because more variants split your traffic thinner and make significance harder to reach.

The anatomy of a valid test

A test worth trusting is defined before it starts. Decide these five things up front:

The hypothesis - a specific, falsifiable statement. “Moving reviews above the fold will increase add-to-cart rate,” not “let’s try some changes.”
The primary metric - one metric that decides the test. Usually conversion rate or revenue per visitor.
The minimum detectable effect - the smallest improvement worth caring about.
The sample size - how many visitors per variation you need, calculated from the inputs below.
The duration - how long that sample will take to collect, rounded up to full weeks.

Deciding sample size and duration in advance is what stops you from fooling yourself later.

The four inputs to every sample-size calculation

Every sample-size number comes from four values.¹

Statistical significance (α). The risk you will accept of declaring a winner that is actually noise. The convention is 5%, which is what “95% confidence” means. A false positive here is a change you roll out that does nothing.

Statistical power (1−β). The chance your test detects a real effect that exists. The convention is 80%, meaning a real winner of your target size gets caught 80% of the time.² Lower power means real wins slip through undetected.

Minimum detectable effect (MDE). The smallest lift you want to be able to detect. This is the input merchants get wrong most often. A smaller MDE sounds better but explodes your sample requirement.

Baseline conversion rate. Your current conversion rate for the metric. For the same relative lift, lower baselines need larger samples, because each conversion is a rarer, noisier event.

The maths: why small effects need enormous samples

Here is the relationship that governs everything: the required sample size is inversely proportional to the square of the effect you want to detect. Halve your MDE and you roughly quadruple the visitors you need.³
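For reference, one common textbook form of the two-proportion sample-size formula is sketched below. The symbols are labels chosen here for illustration, not notation taken from the cited sources: p_1 is the baseline conversion rate, p_2 is the rate after the lift, n is visitors per variation, and the z terms are standard normal quantiles for the chosen significance level and power. Calculators that pool variances or use a one-sided test will return slightly different numbers.

```latex
n \;\approx\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{(p_2 - p_1)^{2}}
\qquad\Longrightarrow\qquad
n \;\propto\; \frac{1}{(p_2 - p_1)^{2}} \quad \text{(approximately)}
```

The numerator changes very little when the lift is small, so halving the gap between p_1 and p_2 multiplies n by roughly four.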

A worked example makes it concrete. Suppose your store converts at 2% and you want to detect a 10% relative lift - moving from 2% to 2.2% - at 95% confidence and 80% power. Using the standard two-proportion formula,¹ you need roughly 80,000 visitors per variation, about 160,000 total, to call that test.

Now loosen the target. If you are willing to detect only a larger 20% lift (2% to 2.4%), the requirement drops about fourfold to ~20,000 per variation. Chase a smaller 5% lift (2% to 2.1%) and it climbs to roughly 320,000 per variation.³

Different calculators return somewhat different numbers depending on their assumptions, so the honest way to use this is to run your own inputs through a sample-size calculator like Evan Miller’s and show your working. The lesson holds regardless of the exact figure: small stores cannot detect small effects in reasonable time.
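If you would rather check the figures than take them on trust, the calculation can be sketched in a few lines of Python. This is a minimal sketch, assuming a two-sided test and the unpooled two-proportion formula; the function name `visitors_per_variation` and its parameters are hypothetical, chosen for this example, and an online calculator built on different assumptions will not match it exactly.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variation (two-sided test).

    Assumption: unpooled two-proportion formula. Other calculators may
    pool variances or apply corrections and differ by a few percent.
    """
    p1 = baseline                          # control conversion rate
    p2 = baseline * (1 + relative_lift)    # variant rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at 95%
    z_power = NormalDist().inv_cdf(power)          # about 0.84 at 80%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# The three scenarios from the worked example, all at a 2% baseline.
for lift in (0.20, 0.10, 0.05):
    n = visitors_per_variation(0.02, lift)
    print(f"{lift:.0%} relative lift: {n:,} visitors per variation")
```

Under these assumptions the results land close to the rounded figures above: about 21,000, 81,000 and 315,000 visitors per variation.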

The low-traffic problem every Shopify store hits

Put the maths together with real traffic and the problem is obvious. A store at 2% conversion and 30,000 monthly visitors would need more than five months to collect the roughly 160,000 visitors that single 10%-lift test requires. Most Shopify stores do not have the traffic to test small changes.

Practical guidance for smaller stores:

Aim for around 1,000 conversions per variation for a trustworthy result. Some practitioners will not trust a test with fewer than 250 to 400 conversions per variation.⁴
Run for at least one full week, ideally two or more, so the test captures a complete business cycle. Weekday and weekend shoppers behave differently.⁵
Test only your highest-traffic pages - homepage, top collection and product pages - so a test finishes this quarter rather than next year.
Chase bigger swings. Low-traffic stores should test bold changes with large expected effects, not button colours.
Below roughly 5,000 monthly visitors, prefer qualitative research - session recordings, surveys, customer interviews - over split testing you cannot power. This pairs well with the tactics in our Shopify CRO tactics guide.
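To turn a sample size into a calendar estimate, divide it by your weekly traffic and round up to full weeks. The sketch below is illustrative only: `weeks_to_run` is a hypothetical helper, it assumes every visitor to the tested page enters the test, and it assumes traffic is spread evenly across the year.

```python
import math


def weeks_to_run(total_sample, monthly_visitors, min_weeks=1):
    """Full weeks needed to collect `total_sample` visitors across all variations.

    Assumptions: all visitors enter the test and traffic is steady,
    so weekly traffic is monthly traffic * 12 / 52.
    """
    weekly_visitors = monthly_visitors * 12 / 52
    weeks = math.ceil(total_sample / weekly_visitors)
    # Never run for less than the minimum, even if the sample fills sooner.
    return max(weeks, min_weeks)


# The example store: about 160,000 visitors needed, 30,000 visitors a month.
print(weeks_to_run(160_000, 30_000))
```

For the example store this comes to 24 weeks, which is why bolder changes and higher-traffic pages matter so much at this scale.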

The peeking problem

The single most damaging testing mistake is stopping a test the moment it looks significant. It feels rational and it quietly destroys your results.

The reason is statistical. If you check a running test repeatedly and stop as soon as it crosses 95%, you get many chances to cross that line by luck. Evan Miller’s analysis showed that continuous monitoring can push the true false-positive rate to around 26% - more than five times the 5% you thought you were accepting.⁶

The fix is the discipline from earlier: decide your sample size and duration in advance, and do not call the test until you reach them. No peeking, no early stopping when a variant “looks like it’s winning.”
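You can see the effect of peeking in a small simulation of A/A tests, where both arms are identical and every "winner" is a false positive. This is a simplified sketch rather than a reproduction of Evan Miller's analysis: it assumes equal-sized batches of traffic between looks and models the test statistic with a normal approximation, and the function name `false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks, simulations=2000, alpha=0.05, seed=0):
    """Simulate A/A tests and return (peeking_rate, fixed_horizon_rate).

    Assumption: each look adds one equal-sized batch of visitors, so the
    standardised difference between arms behaves like a random walk
    divided by sqrt(number of batches so far).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(simulations):
        running = 0.0
        crossed = False
        for k in range(1, looks + 1):
            running += rng.gauss(0.0, 1.0)  # noise only: there is no real effect
            z = running / k ** 0.5          # z-score at this look
            if abs(z) > z_crit:
                crossed = True              # a peeker would have stopped here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit       # judged once, at the planned end
    return peeking_hits / simulations, fixed_hits / simulations


peeking, fixed = false_positive_rate(looks=20)
print(f"stop at first significant look: {peeking:.1%}")
print(f"judge once at the end:          {fixed:.1%}")
```

The exact rates depend on how many times you look, but the pattern is consistent: judging once at the planned end stays near the 5% you signed up for, while stopping at the first significant reading lands several times higher.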

Common conversion-testing mistakes

Peeking and early stopping, which inflates false positives as above.⁶
Testing trivial changes. Micro-tweaks usually produce lifts under 7%, which small stores cannot detect anyway.⁷
Running too many variations, splitting traffic thin and multiplying the chances of a false positive.
Sample ratio mismatch. If your intended 50/50 split arrives materially skewed at large volume, your tracking or randomisation is broken and the result is invalid.⁷
Ignoring seasonality. A test run across a promotion or holiday can be distorted by traffic that behaves nothing like your baseline.
Confusing significance with validity. Hitting 95% in a tool means nothing if the sample was too small or the test ran three days.⁴
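Sample ratio mismatch is the one mistake on this list you can check with a single calculation. A minimal sketch, assuming an intended 50/50 split and a chi-square goodness-of-fit test with one degree of freedom: the helper name `srm_p_value` is hypothetical, and the 0.001 alarm threshold in the example is a common convention rather than a figure from the sources cited here.

```python
import math


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """P-value for the observed split against the intended split.

    Uses a chi-square goodness-of-fit test with 1 degree of freedom,
    whose tail probability equals erfc(sqrt(chi2 / 2)).
    """
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    return math.erfc(math.sqrt(chi2 / 2))


# Assumed threshold: treat p < 0.001 as a sign the split is broken.
p = srm_p_value(50_000, 51_500)
print("investigate tracking" if p < 0.001 else "split looks fine")
```

A very small p-value does not say which variant won. It says the traffic split itself is suspect, so the conversion numbers should not be trusted until the cause is found.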

What you can and cannot test on Shopify in 2026

The Shopify testing options changed recently, so know the current state before you pick a method.

Shopify Rollouts (native). Shopify introduced native, server-side A/B testing that expanded through 2026 to cover themes, sections, navigation, and - on higher plans - checkout and customer-account configurations.⁸ Two limits matter: it cannot test pricing or discount logic, because those are not theme changes, and it reports performance metrics without a statistical-significance test, so it will not call a winner for you. The split-test experiment features are also gated to the Grow plan and above.

Checkout is effectively Plus-only. Meaningful checkout testing and customisation remain practical only on Shopify Plus under checkout extensibility, and injected scripts are blocked in the checkout sandbox.⁹

Google Optimize is gone. Google shut it down on September 30, 2023. If a tutorial recommends it, the tutorial is out of date.¹⁰

Third-party testing tools

Tool	What it tests	Reported pricing
Shopify Rollouts	Themes, layout, sections; checkout config on higher plans. Not prices.	Included; split tests gated to Grow+
Intelligems	Prices, shipping, discounts, offers, content	Content from ~$74/mo; price testing from ~$499/mo
Shoplift	Themes, templates, product and landing pages, prices	From ~$74/mo, scaling with visitors
VWO / Optimizely	Full client-side A/B and multivariate	Platform pricing, verify current tiers

Confirm current pricing on each vendor’s own page before committing - these tiers move. For a fuller breakdown, see our roundup of the best Shopify A/B testing tools and the best Shopify apps for CRO.

Realistic expectations: benchmarks

Two numbers help you set the MDE and read your results honestly.

Baseline conversion rate. Shopify and ecommerce benchmarks put the typical store somewhere in the 1.4% to 3% range, varying widely by category - jewellery and furniture sit under 1.5%, while beauty and food and beverage run higher.¹¹ Know your own number before calculating sample size, and check it against our Shopify conversion rate benchmarks.

Realistic uplift. Meta-analyses of real tests put the average lift around 4 to 5%, and most tested changes produce small effects.¹² Plan for modest wins. A store expecting every test to deliver a 30% jump will chase noise and stop tests early.

How to run a test end to end

Putting it together, a trustworthy test looks like this:

Write a specific hypothesis tied to one primary metric.
Look up your baseline conversion rate for that metric.
Pick the smallest lift worth detecting, biased toward larger effects if traffic is limited.
Calculate the sample size, then the duration, and round up to full weeks.
Build the variant. You can draft and ship the change quickly in the Shopify store editor, then run it through your testing tool.
Run to your predetermined sample and duration. Do not peek.
Read the result on your one primary metric, check for sample ratio mismatch, and decide.
Whether it wins or loses, feed the learning into the next hypothesis. For product-page tests specifically, our guide on A/B testing Shopify product pages walks through the mechanics.
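If your testing tool reports raw conversions and sessions but no significance figure, the final read can be done by hand. The sketch below uses a pooled two-proportion z-test, which is one standard choice and not necessarily what any particular tool runs; `two_proportion_test` is a hypothetical helper, and the counts in the example are made up for illustration.

```python
from statistics import NormalDist


def two_proportion_test(conv_a, visitors_a, conv_b, visitors_b):
    """Return (z, two_sided_p) for variant B versus control A.

    Assumption: pooled two-proportion z-test, valid only when the test
    ran to its pre-planned sample size (no peeking).
    """
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    std_err = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts only: 2.0% vs 2.2% with 80,000 visitors per arm.
z, p = two_proportion_test(1_600, 80_000, 1_760, 80_000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 5%: {p < 0.05}")
```

Run this once, at the sample size you committed to in advance. Running it daily and stopping on the first p-value under 0.05 is exactly the peeking problem described earlier.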

Conversion testing is a loop, not a one-off. The stores that compound gains are the ones that run disciplined tests continuously and actually trust the results.

FAQ

How much traffic do I need to A/B test on Shopify?

It depends on your baseline conversion rate and the effect you want to detect. At a 2% baseline, detecting a 10% relative lift at 95% confidence and 80% power needs roughly 80,000 visitors per variation. A common rule of thumb is around 1,000 conversions per variation. Below roughly 5,000 monthly visitors, qualitative research is usually more useful than split testing.

What is minimum detectable effect (MDE)?

MDE is the smallest improvement you want your test to be able to detect. It is a key input to sample size, and the relationship is quadratic: halving your MDE roughly quadruples the visitors you need. Small stores should set a larger MDE and test bold changes, because small effects require enormous samples.

How long should a Shopify A/B test run?

At least one full week, and ideally two or more, so the test captures a complete business cycle including weekday and weekend behaviour. More importantly, run until you reach the sample size you calculated in advance. Do not stop early just because a variant looks like it is winning.

Why shouldn't I stop a test as soon as it hits significance?

Because repeatedly checking and stopping at the first significant reading inflates your false-positive rate dramatically. Analysis has shown continuous peeking can push the true false-positive rate to around 26%, more than five times the 5% you intended. Decide sample size and duration up front and wait.

Can I A/B test prices on Shopify?

Not with Shopify's native Rollouts feature, which tests theme and layout changes but not pricing or discount logic. To test prices you need a third-party tool such as Intelligems or Shoplift. Price testing tiers are typically more expensive and often aimed at Shopify Plus stores.

Does Shopify's native A/B testing tell me if a result is significant?

No. Shopify Rollouts reports performance metrics like conversion rate, average order value and sessions, but it does not run a statistical-significance test or declare a winner. You need to judge significance yourself with a calculator or use a third-party tool that does the statistics for you.

Footnotes

1. VWO, “How to Calculate A/B Test Sample Size,” on the four inputs and the two-proportion formula. https://vwo.com/blog/how-to-calculate-ab-test-sample-size/
2. CXL, “Statistical Power”: 80% power is the conventional default, balancing false-positive and false-negative risk. https://cxl.com/blog/statistical-power/
3. On the quadratic relationship between minimum detectable effect and sample size. Worked figures (roughly 80,000 per variation for a 2% baseline and 10% relative lift at 95%/80%) calculated with the standard two-proportion formula; calculators vary by assumptions. https://splitmetrics.com/resources/minimum-detectable-effect-mde/
4. CXL, “Stopping A/B Tests: How Many Conversions Do I Need?”: guidance around 1,000 conversions, and that statistical significance does not equal validity. https://cxl.com/blog/stopping-ab-tests-how-many-conversions-do-i-need/
5. VWO, “Understanding Minimum Test Duration”: a 7-day minimum to capture a full weekly cycle, often longer. https://help.vwo.com/hc/en-us/articles/37026733636121-Understanding-Minimum-Test-Duration
6. Evan Miller, “How Not to Run an A/B Test”: continuous monitoring and stopping at significance can raise the true false-positive rate to around 26%. https://www.evanmiller.org/how-not-to-run-an-ab-test.html
7. On common A/B testing mistakes including trivial changes, too many variations, and sample ratio mismatch. https://posthog.com/product-engineers/ab-testing-mistakes
8. On Shopify’s native Rollouts A/B testing: what it can test, that it does not test prices, that it reports metrics without a significance test, and that experiments are gated to higher plans. https://www.usestorepilot.com/blog/shopify-rollouts-ab-testing/
9. On checkout customisation and testing being effectively limited to Shopify Plus under checkout extensibility. https://www.intelligems.io/resources/blog/the-evolution-of-checkout-customization-is-here
10. Google Optimize and Optimize 360 were shut down on September 30, 2023. https://www.optimizely.com/optimize/
11. Shopify, “How to Improve Ecommerce Conversion Rates”: typical store conversion sits around 1.4% to 3%, varying widely by category. https://www.shopify.com/blog/ecommerce-conversion-rate
12. Analytics-Toolkit analysis of 115 A/B tests found an average lift around 4%, with most tests underpowered; GoodUI meta-analysis reports a median lift near 5%. https://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-tests-average-lift-statistical-power/
