The Complete Guide to Conversion Testing on Shopify

Simeon Mantel
CEO at Fudge.
Simeon is CEO at Fudge with 12 years of experience in product and ecommerce, including heading product at a YC-backed startup. He's spoken with thousands of Shopify founders, agencies, and operators about how they build and launch storefronts — research that directly shapes Fudge, which now powers 22,000+ pages across 400+ merchants. He writes about applied AI for ecommerce, the changing role of page builders, and what it takes to launch revenue-driving pages without templates or developers.

Key takeaways

  • Conversion testing is CRO with statistics attached - you are not just changing a page, you are measuring whether the change caused a real difference.
  • Every valid test is defined before it starts by four inputs: significance, power, minimum detectable effect, and your baseline conversion rate.
  • The maths is unforgiving for small stores: at a 2% baseline, detecting a 10% lift needs roughly 80,000 visitors per variation. Halving the effect you chase roughly quadruples the sample you need.
  • The most common mistake is peeking - stopping the moment a test looks significant. It can push your real false-positive rate above 25%.
  • Shopify’s native testing (Rollouts) can test layouts but not prices, reports no statistical significance, and gates the experiment features to higher plans.

This is a pillar guide to conversion testing on Shopify. It goes deeper than a general CRO overview into the mechanics that decide whether a test result is real: sample size, statistical significance, test duration, and the special problem faced by stores without much traffic.

If you want the broader picture of what to change on your store first, start with our Shopify CRO guide and come back here when you are ready to test those changes properly.

Why you can trust us

We have spent more than four years in the Shopify ecosystem and built Fudge, an AI page builder used by hundreds of merchants to ship and iterate on store pages. We have watched a lot of stores run a lot of tests, and the failure mode is almost always the same: calling a winner from data that never supported it. This guide is written to prevent that.

What conversion testing actually is

Conversion testing is the practice of changing one thing on your store, showing the change to a random half of visitors, and using statistics to decide whether it caused a real difference in conversion rate. The statistics are the point. Without them, you are just looking at two numbers and guessing.

That distinction separates conversion testing from general conversion rate optimisation. CRO is the whole discipline of improving your store. Conversion testing is the measurement method that tells you which improvements actually worked.

An A/B test splits traffic between a control (A) and a variant (B). A/B/n tests add more variants. Multivariate tests change several elements at once. For most Shopify stores, a clean two-way A/B test is the right tool, because more variants split your traffic thinner and make significance harder to reach.

The anatomy of a valid test

A test worth trusting is defined before it starts. Decide these five things up front:

  1. The hypothesis - a specific, falsifiable statement. “Moving reviews above the fold will increase add-to-cart rate,” not “let’s try some changes.”
  2. The primary metric - one metric that decides the test. Usually conversion rate or revenue per visitor.
  3. The minimum detectable effect - the smallest improvement worth caring about.
  4. The sample size - how many visitors per variation you need, calculated from the inputs below.
  5. The duration - how long that sample will take to collect, rounded up to full weeks.

Deciding sample size and duration in advance is what stops you from fooling yourself later.

The four inputs to every sample-size calculation

Every sample-size number comes from four values.[1]

Statistical significance (α). The risk you will accept of declaring a winner that is actually noise. The convention is 5%, which is what “95% confidence” means. A false positive here is a change you roll out that does nothing.

Statistical power (1−β). The chance your test detects a real effect that exists. The convention is 80%, meaning a real winner of your target size gets caught 80% of the time.[2] Lower power means real wins slip through undetected.

Minimum detectable effect (MDE). The smallest lift you want to be able to detect. This is the input merchants get wrong most often. A smaller MDE sounds better but explodes your sample requirement.

Baseline conversion rate. Your current conversion rate for the metric. Lower baselines need larger samples, because each conversion is a rarer, noisier event.

The maths: why small effects need enormous samples

Here is the relationship that governs everything: the required sample size scales with the inverse square of the effect you want to detect. Halve your MDE and you roughly quadruple the visitors you need.[3]

A worked example makes it concrete. Suppose your store converts at 2% and you want to detect a 10% relative lift - moving from 2% to 2.2% - at 95% confidence and 80% power. Using the standard two-proportion formula, you need roughly 80,000 visitors per variation, about 160,000 total, to call that test.[1]
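For reference, one common textbook form of the two-proportion calculation is sketched below. The symbols (p_1 for the baseline rate, p_2 for the target rate, z for standard normal quantiles) are notation chosen for this sketch, and individual calculators may use slightly different variants, so treat the result as approximate.

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{(p_2 - p_1)^2}

% Plugging in p_1 = 0.02, p_2 = 0.022, z_{0.975} \approx 1.96, z_{0.80} \approx 0.84:
n \approx \frac{(1.96 + 0.84)^2 \,(0.0196 + 0.021516)}{(0.002)^2}
  = \frac{7.84 \times 0.041116}{0.000004}
  \approx 80{,}600 \text{ visitors per variation}
```

The squared difference in the denominator is what drives the quadrupling: halve the gap between p_1 and p_2 and the denominator shrinks by a factor of four.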

Now loosen the target. If you are willing to detect only a larger 20% lift (2% to 2.4%), the requirement drops about fourfold to ~20,000 per variation. Chase a smaller 5% lift and it climbs to roughly 320,000 per variation.[3]
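The figures above can be reproduced with a short script. This is a minimal sketch that assumes the two-sided two-proportion z-test approximation; the function name and defaults are illustrative, and a given calculator may use a different variant and return somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variation (two-sided two-proportion z-test approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # 2% baseline, at the three relative lifts discussed in the text
    for lift in (0.05, 0.10, 0.20):
        n = sample_size_per_variation(0.02, lift)
        print(f"{lift:.0%} relative lift: ~{n:,} visitors per variation")
```

Swapping in your own baseline and lift is the quickest way to see how fast the requirement grows as the target effect shrinks.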

Different calculators return somewhat different numbers depending on their assumptions, so the honest way to use this is to run your own inputs through a sample-size calculator like Evan Miller’s and show your working. The lesson holds regardless of the exact figure: small stores cannot detect small effects in reasonable time.


The low-traffic problem every Shopify store hits

Put the maths together with real traffic and the problem is obvious. A store at 2% conversion and 30,000 monthly visitors would need more than five months (160,000 visitors at 30,000 a month) to run that single 10%-lift test to completion, even with every visitor in the test. Most Shopify stores do not have the traffic to test small changes.
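The duration arithmetic can be sketched in a few lines. This assumes every visitor enters the test and that traffic is spread evenly across the year; the function name is illustrative.

```python
from math import ceil


def weeks_to_run(visitors_per_variation, monthly_visitors, variations=2):
    """Full weeks needed to collect the sample, assuming all traffic enters the test."""
    weekly_visitors = monthly_visitors * 12 / 52  # even spread across the year
    total_needed = visitors_per_variation * variations
    return ceil(total_needed / weekly_visitors)   # round up to whole weeks


if __name__ == "__main__":
    # 80,000 per variation at 30,000 monthly visitors
    print(weeks_to_run(80_000, 30_000), "weeks")
```

Rounding up to whole weeks keeps weekday and weekend behaviour evenly represented in the sample.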

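Practical guidance for smaller stores:

  • Set a larger MDE and test bold changes, because small effects require enormous samples.
  • Use around 1,000 conversions per variation as a rule of thumb for whether a test is feasible.
  • Run for at least one full week, and ideally two or more, so the test covers weekday and weekend behaviour.
  • Below roughly 5,000 monthly visitors, lean on qualitative research rather than split testing.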

The peeking problem

The single most damaging testing mistake is stopping a test the moment it looks significant. It feels rational and it quietly destroys your results.

The reason is statistical. If you check a running test repeatedly and stop as soon as it crosses 95%, you get many chances to cross that line by luck. Evan Miller’s analysis showed that continuous monitoring can push the true false-positive rate to around 26% - more than five times the 5% you thought you were accepting.[6]
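A small simulation makes the inflation visible. This is a simplified sketch, not a reproduction of Evan Miller's setup: it assumes an A/A test (no real difference), equally spaced looks, and a test statistic that behaves like a normalised random walk. The function name and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks, simulations=5000, alpha=0.05, seed=1):
    """Share of A/A tests called 'significant' when stopping at the first significant look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            # With no real difference, each equal-sized batch of traffic adds one
            # standard-normal increment to the cumulative test statistic.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(look)
            if abs(z) > z_crit:
                false_positives += 1  # stopped early on pure noise
                break
    return false_positives / simulations


if __name__ == "__main__":
    for looks in (1, 5, 20):
        print(f"{looks:>2} looks: false-positive rate ~{false_positive_rate(looks):.1%}")
```

With a single look the rate stays near the nominal 5%. Under these assumptions it climbs steadily as the number of looks grows, which is the effect described above.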

The fix is the discipline from earlier: decide your sample size and duration in advance, and do not call the test until you reach them. No peeking, no early stopping when a variant “looks like it’s winning.”

Common conversion-testing mistakes

What you can and cannot test on Shopify in 2026

The Shopify testing options changed recently, so know the current state before you pick a method.

Shopify Rollouts (native). Shopify introduced native, server-side A/B testing that expanded through 2026 to cover themes, sections, navigation, and - on higher plans - checkout and customer-account configurations.[8] Two limits matter: it cannot test pricing or discount logic, because those are not theme changes, and it reports performance metrics without a statistical-significance test, so it will not call a winner for you. The split-test experiment features are also gated to the Grow plan and above.

Checkout is effectively Plus-only. Meaningful checkout testing and customisation remain practical only on Shopify Plus under checkout extensibility, and injected scripts are blocked in the checkout sandbox.[9]

Google Optimize is gone. Google shut it down on September 30, 2023. If a tutorial recommends it, the tutorial is out of date.[10]

Third-party testing tools

| Tool | What it tests | Reported pricing |
|---|---|---|
| Shopify Rollouts | Themes, layout, sections; checkout config on higher plans. Not prices. | Included; split tests gated to Grow+ |
| Intelligems | Prices, shipping, discounts, offers, content | Content from ~$74/mo; price testing from ~$499/mo |
| Shoplift | Themes, templates, product and landing pages, prices | From ~$74/mo, scaling with visitors |
| VWO / Optimizely | Full client-side A/B and multivariate | Platform pricing, verify current tiers |

Confirm current pricing on each vendor’s own page before committing - these tiers move. For a fuller breakdown, see our roundup of the best Shopify A/B testing tools and the best Shopify apps for CRO.

Realistic expectations: benchmarks

Two numbers help you set the MDE and read your results honestly.

Baseline conversion rate. Shopify and ecommerce benchmarks put the typical store somewhere in the 1.4% to 3% range, varying widely by category - jewellery and furniture sit under 1.5%, while beauty and food and beverage run higher.[11] Know your own number before calculating sample size, and check it against our Shopify conversion rate benchmarks.

Realistic uplift. Meta-analyses of real tests put the average lift around 4 to 5%, and most tested changes produce small effects.[12] Plan for modest wins. A store expecting every test to deliver a 30% jump will chase noise and stop tests early.
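To connect these benchmarks to your own traffic, it can help to turn the sample-size calculation around and ask what lift a given sample can detect. The sketch below assumes the same two-proportion approximation at 95% confidence and 80% power, and finds the answer by bisection. The function names and the 8-week, 30,000-visitors-a-month scenario are illustrative assumptions.

```python
from statistics import NormalDist


def required_sample(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors per variation needed to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def detectable_lift(baseline, visitors_per_variation, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the given sample, found by bisection."""
    low = 0.001
    high = min(5.0, 0.99 / baseline - 1)  # keep the target rate below 100%
    for _ in range(60):
        mid = (low + high) / 2
        if required_sample(baseline, mid, alpha, power) > visitors_per_variation:
            low = mid   # this lift needs more traffic than we have; try a larger lift
        else:
            high = mid  # this lift is detectable; try a smaller one
    return high


if __name__ == "__main__":
    # Hypothetical store: 30,000 monthly visitors, 8-week test, two variations
    per_variation = 30_000 * 12 / 52 * 8 / 2
    lift = detectable_lift(0.02, per_variation)
    print(f"Detectable relative lift: ~{lift:.1%}")
```

For that hypothetical store the detectable lift comes out in the region of 17%, several times the 4 to 5% average above, which is why smaller stores are better served by bold changes than by fine tuning.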

How to run a test end to end

Putting it together, a trustworthy test looks like this:

  1. Write a specific hypothesis tied to one primary metric.
  2. Look up your baseline conversion rate for that metric.
  3. Pick the smallest lift worth detecting, biased toward larger effects if traffic is limited.
  4. Calculate the sample size, then the duration, and round up to full weeks.
  5. Build the variant. You can draft and ship the change quickly in the Shopify store editor, then run it through your testing tool.
  6. Run to your predetermined sample and duration. Do not peek.
  7. Read the result on your one primary metric, check for sample ratio mismatch, and decide.
  8. Whether it wins or loses, feed the learning into the next hypothesis. For product-page tests specifically, our guide on A/B testing Shopify product pages walks through the mechanics.
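Step 7 can be sketched with two small checks, which is useful when your testing tool reports raw counts but no significance. This assumes a pooled two-proportion z-test for the primary metric and a one-degree-of-freedom chi-square test for sample ratio mismatch. The function names and the example counts are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rate (pooled z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square (1 degree of freedom) p-value for sample ratio mismatch."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df


if __name__ == "__main__":
    # Hypothetical counts: control 1,600 / 80,000, variant 1,780 / 80,200
    print(f"SRM p-value:  {srm_p_value(80_000, 80_200):.3f}")
    print(f"Test p-value: {two_proportion_p_value(1_600, 80_000, 1_780, 80_200):.4f}")
```

Read the SRM check first: a very small SRM p-value suggests the traffic split itself is broken, in which case the conversion comparison should not be trusted. The cut-off you use for that check is a judgment call.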

Conversion testing is a loop, not a one-off. The stores that compound gains are the ones that run disciplined tests continuously and actually trust the results.

FAQ

How much traffic do I need to A/B test on Shopify?

It depends on your baseline conversion rate and the effect you want to detect. At a 2% baseline, detecting a 10% lift needs roughly 80,000 visitors per variation. A common rule of thumb is around 1,000 conversions per variation. Below roughly 5,000 monthly visitors, qualitative research is usually more useful than split testing.

What is minimum detectable effect (MDE)?

MDE is the smallest improvement you want your test to be able to detect. It is a key input to sample size, and the relationship is quadratic: halving your MDE roughly quadruples the visitors you need. Small stores should set a larger MDE and test bold changes, because small effects require enormous samples.

How long should a Shopify A/B test run?

At least one full week, and ideally two or more, so the test captures a complete business cycle including weekday and weekend behaviour. More importantly, run until you reach the sample size you calculated in advance. Do not stop early just because a variant looks like it is winning.

Why shouldn't I stop a test as soon as it hits significance?

Because repeatedly checking and stopping at the first significant reading inflates your false-positive rate dramatically. Analysis has shown continuous peeking can push the true false-positive rate to around 26%, more than five times the 5% you intended. Decide sample size and duration up front and wait.

Can I A/B test prices on Shopify?

Not with Shopify's native Rollouts feature, which tests theme and layout changes but not pricing or discount logic. To test prices you need a third-party tool such as Intelligems or Shoplift. Price testing tiers are typically more expensive and often aimed at Shopify Plus stores.

Does Shopify's native A/B testing tell me if a result is significant?

No. Shopify Rollouts reports performance metrics like conversion rate, average order value and sessions, but it does not run a statistical-significance test or declare a winner. You need to judge significance yourself with a calculator or use a third-party tool that does the statistics for you.


Footnotes

  1. VWO, “How to Calculate A/B Test Sample Size,” on the four inputs and the two-proportion formula. https://vwo.com/blog/how-to-calculate-ab-test-sample-size/

  2. CXL, “Statistical Power”: 80% power is the conventional default, balancing false-positive and false-negative risk. https://cxl.com/blog/statistical-power/

  3. On the quadratic relationship between minimum detectable effect and sample size. Worked figures (roughly 80,000 per variation for a 2% baseline and 10% relative lift at 95%/80%) calculated with the standard two-proportion formula; calculators vary by assumptions. https://splitmetrics.com/resources/minimum-detectable-effect-mde/

  4. CXL, “Stopping A/B Tests: How Many Conversions Do I Need?”: guidance around 1,000 conversions, and that statistical significance does not equal validity. https://cxl.com/blog/stopping-ab-tests-how-many-conversions-do-i-need/

  5. VWO, “Understanding Minimum Test Duration”: a 7-day minimum to capture a full weekly cycle, often longer. https://help.vwo.com/hc/en-us/articles/37026733636121-Understanding-Minimum-Test-Duration

  6. Evan Miller, “How Not to Run an A/B Test”: continuous monitoring and stopping at significance can raise the true false-positive rate to around 26%. https://www.evanmiller.org/how-not-to-run-an-ab-test.html

  7. On common A/B testing mistakes including trivial changes, too many variations, and sample ratio mismatch. https://posthog.com/product-engineers/ab-testing-mistakes

  8. On Shopify’s native Rollouts A/B testing: what it can test, that it does not test prices, that it reports metrics without a significance test, and that experiments are gated to higher plans. https://www.usestorepilot.com/blog/shopify-rollouts-ab-testing/

  9. On checkout customisation and testing being effectively limited to Shopify Plus under checkout extensibility. https://www.intelligems.io/resources/blog/the-evolution-of-checkout-customization-is-here

  10. Google Optimize and Optimize 360 were shut down on September 30, 2023. https://www.optimizely.com/optimize/

  11. Shopify, “How to Improve Ecommerce Conversion Rates”: typical store conversion sits around 1.4% to 3%, varying widely by category. https://www.shopify.com/blog/ecommerce-conversion-rate

  12. Analytics-Toolkit analysis of 115 A/B tests found an average lift around 4%, with most tests underpowered; GoodUI meta-analysis reports a median lift near 5%. https://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-tests-average-lift-statistical-power/
