One of the questions we often get asked on customer calls is: how much business volume do I need for Luca’s Pricing Engine to successfully perform price optimization? Or, put in more science-y terms: how much data do you need to estimate statistically significant effects of price recommendations?
Unfortunately, the answer is not as simple as “100 units sold per day” or “$5 million in annual revenue”. In fact, the answer is a pretty unsatisfactory “It depends”. More specifically, it depends on the amount of historical price variation in your data.
The long answer is more technical. So, first, we need to talk about Power.
When we run an experiment or perform any intervention, we want to have a high probability of rejecting the null hypothesis when the alternative hypothesis is true. This is the probability of avoiding a type II error and is known as the power of the experiment. It’s important to do a power analysis before an experiment to sanity check your setup. Checking after you've run your experiment isn't helpful; the horse has left the barn at that point!
It’s best understood visually with this example of a two-sided test:
Distribution 1 represents the distribution of data if the null hypothesis were true. Distribution 2 represents the distribution of data if the alternative hypothesis were true. The chance of a Type I error is the significance level, i.e., the probability of rejecting the null hypothesis when it is in fact true. If the alternative hypothesis were true, the probability of being able to reject the null given the data (that is, the power) is the red shaded area in distribution 2.
Typically, we target 80% power and a 5% significance level (i.e., 95% confidence), though some practitioners target 70% power or a 10% significance level (90% confidence).
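To make these targets concrete, here is a minimal sketch of how power can be computed for a two-sided z-test. It assumes the estimate is approximately normal with a known standard error; the function name `two_sided_power` and its parameters are illustrative, not part of any Luca tooling.

```python
from statistics import NormalDist


def two_sided_power(effect, std_error, alpha=0.05):
    """Power of a two-sided z-test when the true effect is `effect`.

    Assumes the estimate is normally distributed with standard
    error `std_error` (an illustrative simplification).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = effect / std_error            # true effect in standard-error units
    # Probability that the test statistic lands in either rejection region
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# With no true effect, the rejection probability is just the Type I error rate.
power_under_null = two_sided_power(effect=0.0, std_error=1.0)

# An effect of about 2.8 standard errors gives roughly the usual 80% target.
power_at_target = two_sided_power(effect=2.8, std_error=1.0)
```

The second call shows a handy rule of thumb: at a 5% significance level, the true effect needs to be about 2.8 standard errors away from zero to reach 80% power.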
If you’ve done user A/Bs before, this section will look familiar.
Suppose we have 𝑁 total users in the experiment, with 50% of users assigned to control and 50% to treatment. The null hypothesis is that the value of the metric of interest in our control group, μ𝑐, is equal to the value of the metric in the treatment group, μ𝑡 (i.e., there’s no treatment effect).
Let’s start with a one-sided test:
𝐻0: μ𝑐 = μ𝑡
𝐻𝑎: μ𝑡 − μ𝑐 = δ
where δ = 𝐵𝑎𝑠𝑒𝑙𝑖𝑛𝑒 · 𝑀𝐷𝐸 and the MDE represents the minimum detectable effect. That is, any effect smaller than this value is unlikely to be detected as statistically significant at the chosen power and significance level.
For instance, we might be testing an email ad campaign’s impact on sales, in which case the baseline would be historical sales volume, δ would be the incremental sales we would want to observe, and the MDE would represent the lift in percentage terms.
It takes a little legwork, but you can solve for the number of users needed to detect a given effect at a set power and statistical significance level:
where 𝑠 represents the standard deviation of the baseline metric, 𝑧α represents the critical value for significance level α, β represents power (note that many texts instead use β for the Type II error rate, so that power is 1 − β), and Φ represents the cumulative distribution function of the standard normal distribution.
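For reference, one standard form of the sample-size result is shown below. It assumes a normal approximation, the same standard deviation 𝑠 in both groups, the 50/50 split described earlier, and 𝑧α = Φ⁻¹(1 − α) for the one-sided test. Treat it as a textbook expression consistent with the symbols defined here rather than a definitive statement of the formula used in the Pricing Engine.

```latex
N = \frac{4\, s^{2} \left( z_{\alpha} + \Phi^{-1}(\beta) \right)^{2}}{\delta^{2}},
\qquad \delta = \mathit{Baseline} \cdot \mathit{MDE}
```

The factor of 4 comes from the even split: each group has 𝑁/2 users, so the variance of the difference in means is 2𝑠²/(𝑁/2) = 4𝑠²/𝑁.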
If, instead, the number of users is fixed at 𝑁, you solve for the MDEs:
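Under a normal approximation, with the same standard deviation 𝑠 in both groups and a 50/50 split, a standard expression for the MDE of the one-sided test at fixed 𝑁 is shown below. As before, 𝑧α is the one-sided critical value and β denotes power; this is a textbook form consistent with those definitions, offered as a sketch.

```latex
\mathit{MDE} = \frac{2\, s \left( z_{\alpha} + \Phi^{-1}(\beta) \right)}{\mathit{Baseline} \cdot \sqrt{N}}
```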
As an aside, the approach needs to be modified if you don’t use even splits, or if you use ratio metrics like conversion; for those metrics, use the delta method.
There are some nuggets of wisdom when you peel apart the math.
First, when the MDE you want to detect is small, the sample size you need is high. It’s harder to detect a statistically significant small difference than a large one; you need a larger sample size to confidently say a small difference is statistically significant.
Second, you can fix either the sample size or the MDE, but not both. I.e., if you set a value for the total sample size, that defines the effect size you can observe; if you define the MDE you want to observe, that defines the sample size you need.
In practice, we likely have to work with a fixed sample size, in which case we need to sanity check that the corresponding MDE is attainable. Let’s continue with the example of the email ad campaign. Suppose historical data shows that average weekly order volume can vary up to 10%. If we see that the MDE we can detect is 500% lift in average weekly orders, we need to revisit our experimental design – we wouldn’t be able to detect effects below this value, but, based on historical data, we’re pretty sure 500% is unattainable. However, if the MDE we calculate is 0.05%, we can be comfortable with our experimental design; we’ll be able to detect effect sizes larger than this value.
Finally, the required sample size also depends on the variability of the data relative to the baseline value, 𝑠 / 𝐵𝑎𝑠𝑒𝑙𝑖𝑛𝑒, i.e., the coefficient of variation (the inverse of the signal-to-noise ratio). The higher the variability relative to the baseline, the larger the sample size you need.
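The three observations above can be checked numerically. The sketch below assumes the standard normal-approximation result for a one-sided test with a 50/50 split, N = 4𝑠²(𝑧α + Φ⁻¹(β))² / δ², where β is power; the function names and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def _z_sum(alpha, power):
    """z_alpha + inverse-CDF(power) for a one-sided test."""
    norm = NormalDist()
    return norm.inv_cdf(1 - alpha) + norm.inv_cdf(power)


def required_total_n(baseline, mde, s, alpha=0.05, power=0.80):
    """Total users (both groups combined) needed to detect a relative lift `mde`."""
    delta = baseline * mde  # absolute effect size
    return 4 * s ** 2 * _z_sum(alpha, power) ** 2 / delta ** 2


def detectable_mde(baseline, n_total, s, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with `n_total` users split 50/50."""
    return 2 * _z_sum(alpha, power) * s / (baseline * sqrt(n_total))


# Hypothetical inputs: baseline metric of 100 with standard deviation 40 (CV = 0.4)
n_for_2pct = required_total_n(baseline=100, mde=0.02, s=40)
n_for_1pct = required_total_n(baseline=100, mde=0.01, s=40)     # smaller MDE
n_noisier = required_total_n(baseline=100, mde=0.02, s=80)      # higher CV

# Fixing N instead pins down the MDE: this recovers the 2% we started from.
mde_back = detectable_mde(baseline=100, n_total=n_for_2pct, s=40)
```

Halving the MDE or doubling the coefficient of variation each quadruples the required sample size, which is why small lifts on noisy metrics are so expensive to detect.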
You’ll recall from my first post that user A/Bs are impractical for most Pricing Operators. The intuition we talked about still holds, but you need to tweak the approach. What you actually need to do is apply your intended model to the historical data in a way that mimics the intended price launch.
Here’s an example. Suppose we’ve decided to launch prices on a group of SKUs (= the treatment group) and keep prices unchanged for another unrelated group (= control group) and we want to measure the impact of price on each product category’s demand.
Let’s suppose the math works out that the price recommendation for treatment SKUs is +5%. The price “recommendation” for control SKUs is 0%.
Let’s say the final model we’ve decided to run is:
The MDEs for a given product category 𝑛 would then be:
where σ̂ are the standard errors for ε obtained by running the model on placebo data, and 𝑁 represents the size of the total catalog.
As an example of how you might use this formula, suppose you intend to launch prices for 5 weeks: you can take 5 weeks of data, run this regression, calculate the MDEs, and sanity check that the effect sizes make sense.
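As a rough illustration of the placebo idea, the sketch below substitutes a deliberately simple stand-in model: a regression of each SKU's 5-week change in log demand on a treatment dummy, which reduces to a difference in means. The data are simulated, and the function name, the 0.15 noise level, and the 200 SKUs per group are all hypothetical; swap in your own model and historical data.

```python
import random
from math import sqrt
from statistics import NormalDist, variance


def placebo_mde(treated, control, alpha=0.05, power=0.80):
    """MDE for a treated-vs-control mean difference, estimated on placebo data.

    `treated` and `control` hold one outcome per SKU from a historical
    window in which no price change was actually applied.
    """
    n_t, n_c = len(treated), len(control)
    # Pooled residual variance from regressing the outcome on a treatment dummy
    pooled_var = (
        (n_t - 1) * variance(treated) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    std_error = sqrt(pooled_var * (1 / n_t + 1 / n_c))
    norm = NormalDist()
    # One-sided test: effect must exceed (z_alpha + z_power) standard errors
    return (norm.inv_cdf(1 - alpha) + norm.inv_cdf(power)) * std_error


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Simulated placebo outcomes: 5-week change in log demand, no true effect
treated = [rng.gauss(0.0, 0.15) for _ in range(200)]
control = [rng.gauss(0.0, 0.15) for _ in range(200)]

mde_small_catalog = placebo_mde(treated, control)
# Same noise level with four times as many SKUs per group
mde_large_catalog = placebo_mde(treated * 4, control * 4)
```

Because the outcome is a change in log demand, the resulting MDE reads approximately as a percentage change in demand. You would then compare it with the demand response you plausibly expect from the planned +5% price change.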
You don’t have to be wedded to this model – it’s just an example of how you might think about this problem.
Have fun analyzing your power, or if you would like our help to do this, reach out to book a call here.