Every AB test is wrong
The industry, driven by concepts like agility and the Lean Startup, has rapidly adopted the practice of running AB tests. In a nutshell, AB tests are cont...
We know that complete certainty about the value of \(\sigma\) (a Dirac delta distribution) yields a normal distribution for \(\mu\), with variance equal to \(\sigma^2/n\). Changing the probability density function of \(\sigma\) to a uniform distribution yielded (unsurprisingly) a different distribution for \(\mu\), no longer a normal one. Let's think about why they differ: as we said, this is analogous to drawing \(\sigma\) from the uniform distribution and using it to get a sample of \(\mu\). Since the \(\sigma\) distribution is uniform, a lot of the draws have a very small \(\sigma\), and those draws concentrate the resulting \(\mu\) samples tightly around the center. Let's change that a bit and reduce the weight of the low \(\sigma\) values. For example, let's use a lognormal distribution for \(\sigma\):
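To make this concrete, here is a minimal Monte Carlo sketch of the mixing procedure (the sample size \(n\), the centre of \(\mu\), and the lognormal parameters are arbitrary illustrative choices, not values from the article):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative assumptions (not from the article): n observations,
# mu centred at 0, and an arbitrary lognormal for sigma.
n = 10
n_draws = 200_000

# Step 1: draw sigma from F(sigma), here a lognormal.
sigma = rng.lognormal(mean=0.0, sigma=0.5, size=n_draws)

# Step 2: given each sigma, draw mu from Normal(0, sigma^2 / n).
mu = rng.normal(loc=0.0, scale=sigma / np.sqrt(n), size=n_draws)

# Compare the mixture G(mu) with a normal of the same standard deviation:
# the peak comes out taller and the tails heavier than the matching normal.
x = np.linspace(-2.0, 2.0, 400)
plt.hist(mu, bins=200, density=True, alpha=0.5, label=r"mixture over lognormal $\sigma$")
plt.plot(x, stats.norm.pdf(x, scale=mu.std()), label="normal, same std. dev.")
plt.legend()
plt.show()
```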
The center peak is still higher than that of a normal distribution, but lower than before, and the tails are starting to look much heavier than those of the normal distribution.
We can keep doing this with any distribution for \(\sigma\), and there is (at least for now) no reason to choose one over another. It's clear that the moment we choose a distribution for \(\sigma\), the distribution for \(\mu\) changes with it, so there is an association between the distribution \(\mathcal{F}(\sigma)\) and the resulting \(\mathcal{G}(\mu)\). We have already seen \(\mathcal{F}\) as the uniform and the lognormal distribution, and how each generated a specific \(\mathcal{G}\) distribution (we could give them names if we wanted). But among all of the possible choices for \(\mathcal{F}\) there is a special one.
Now we will propose another \(\mathcal{F}\) distribution, this time for the variance: the (scaled) inverse-chi-squared distribution, closely related to the \(\Gamma\) distribution (it is a special case of the inverse-gamma distribution). It takes two parameters: the degrees of freedom \(\nu\) and the scale \(x_0\), around which the distribution concentrates (its mean tends to \(x_0\) as \(\nu\) grows). Since \(\nu\) has a very suggestive name, let's check the distribution with \(\nu = n - 1\) and \(x_0 = s^2\):
It turns out that this specific \(\mathcal{F}\) distribution we set for the variance is so important that the generated \(\mathcal{G}\) distribution already has a name: it is the Student's t-distribution:
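As a quick sanity check, we can mix a normal over \(\text{Inv-}\chi^2(n-1, s^2)\) draws of the variance and compare the result against SciPy's Student's t density (the concrete \(n\) and \(s^2\) below are arbitrary illustrative values):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative assumptions (not from the article): sample mean 0,
# sample variance s^2 = 1, and n observations.
n, s2 = 10, 1.0
nu = n - 1
n_draws = 200_000

# Scaled Inv-chi^2(nu, s2) draws of the variance:
# if X ~ chi^2_nu, then nu * s2 / X ~ Inv-chi^2(nu, s2).
sigma2 = nu * s2 / rng.chisquare(nu, size=n_draws)

# Given each variance, mu ~ Normal(0, sigma^2 / n).
mu = rng.normal(loc=0.0, scale=np.sqrt(sigma2 / n), size=n_draws)

# The marginal of mu should match a Student's t with nu degrees of
# freedom, centred at 0 and scaled by s / sqrt(n).
x = np.linspace(-4.0, 4.0, 400)
plt.hist(mu, bins=200, density=True, alpha=0.5, label=r"mixture over Inv-$\chi^2$ variance")
plt.plot(x, stats.t.pdf(x, df=nu, scale=np.sqrt(s2 / n)), label=f"Student's t, {nu} d.o.f.")
plt.legend()
plt.show()
```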
As the sample size increases, we have more certainty about the variance, and the \(\text{Inv-}\chi^2\) distribution narrows around its center. Since \(\mathcal{F}\) becomes better defined around its center (closer to a Dirac delta), the \(\mathcal{G}\) distribution becomes closer to a normal distribution:
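A quick numerical way to see this convergence is to compare the standard Student's t density against the standard normal for growing degrees of freedom (the particular values of \(\nu\) below are arbitrary):

```python
import numpy as np
from scipy import stats

# As nu grows, the standard Student's t density approaches the standard
# normal; the maximum pointwise gap between the two densities shrinks.
x = np.linspace(-5.0, 5.0, 1001)
for nu in (2, 5, 30, 300):
    gap = np.max(np.abs(stats.t.pdf(x, df=nu) - stats.norm.pdf(x)))
    print(f"nu = {nu:3d}: max density gap vs. normal = {gap:.4f}")
```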
As a final note, let's see briefly why the inverse-chi-squared is of particular importance. In Student's original paper, William Gosset expands the moment coefficients of the distribution of \(\sigma\) that could have produced the observed variance \(s^2\) and, although he does not arrive at a formal proof, he finds \(\mathcal{F}\) to be the \(\text{Inv-}\chi^2\) (the derivation is in section I of the paper, only 4 pages long; it looks like he was not aware that he had arrived at a function related to \(\Gamma\)). As we mentioned in the introduction, nowadays the derivation can be greatly simplified through the Bayesian framework, with \(\text{Inv-}\chi^2\) arising as the posterior distribution for the variance under an uninformative prior. A derivation can be found in section 3.2 of the book Bayesian Data Analysis².
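For reference, here is a compressed sketch of that Bayesian route (following the standard normal-model derivation rather than Gosset's original argument): with observations \(y_1, \dots, y_n \sim \mathcal{N}(\mu, \sigma^2)\) and the uninformative prior \(p(\mu, \sigma^2) \propto \sigma^{-2}\), the posterior factorises as

\[
\sigma^2 \mid y \sim \text{Inv-}\chi^2(n - 1, s^2), \qquad \mu \mid \sigma^2, y \sim \mathcal{N}\!\left(\bar{y}, \tfrac{\sigma^2}{n}\right),
\]

and integrating \(\sigma^2\) out of the joint posterior leaves

\[
\mu \mid y \sim t_{n-1}\!\left(\bar{y}, \tfrac{s^2}{n}\right),
\]

that is, \((\mu - \bar{y}) / (s / \sqrt{n})\) follows a standard Student's t with \(n - 1\) degrees of freedom.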
All the calculations done for this article are in the notebook.