I’m currently in the middle of a Bayesian statistics course. It’s exciting for several reasons (yes, I get excited about statistics courses), but one thing was particularly troubling when the course first started: I didn’t fully understand what made Bayesian statistics different from all the classical (or frequentist) courses I had taken previously. Hopefully this post will help clarify things for others who are just starting to learn about Bayesian statistics.

One of the reasons it took me a while to understand the fundamental difference between Bayesian and frequentist statistics is that I could not find a satisfactory explanation online. I did find a number of posts, lecture notes, etc. that tried to explain the difference, but they didn’t really help that much. Most of them either said something along the lines of “the difference is in the interpretation of probability”, gave an explanation through a specific example (which is rarely helpful for me until I’ve seen some math), or just explained Bayes’ rule, which is not exclusive to Bayesian statistics. These approaches are not necessarily wrong; I just felt they were either too simplified for someone like me who has a math and stats background or didn’t really get at the core of the difference between the two approaches.

### The Bayesian Approach

To remedy this, I would like to give my take on the difference between Bayesian and frequentist statistics. For me, the difference is easiest to understand with the following table:

| | Bayesian | Frequentist |
|---|---|---|
| parameter | random | fixed |
| data | fixed | random |

To elaborate, given a model for the data, a frequentist [1] views the parameters of that model as fixed (but unknown) values and the data as random. The view is that the data are generated by some “true” model with a given value for the parameters. The randomness comes from the data itself, as each sample we take will be different.

A Bayesian takes sort of the opposite approach. Like a frequentist, a Bayesian views the data as coming from a model with a set of parameters. But the parameters of the model are considered random, and the data, once observed, are viewed as fixed.

To me, this is the fundamental difference between the classical and Bayesian approaches: how parameters and data are treated. The common explanations I mentioned above don’t capture this. One is more of a philosophical difference and the others, although they may be useful for some, are too simplified and I don’t think they really add anything to what a first course in probability or statistics would cover.

This difference in how data and parameters are treated imparts a difference in how inference is approached. In the frequentist world, we want to estimate the parameter (a fixed value) along with a confidence interval and typically a p-value. We then use that estimate to infer the model for the entire population. In the Bayesian context, we want to find the (new) distribution of the parameter, given the data. This lets us infer how the distribution of our parameter has changed based on the data at hand and gives us a current, updated estimate of the distribution of that parameter.
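As a minimal sketch of the frequentist side of this picture, the simulation below holds a parameter fixed and repeatedly draws new data, so the estimate changes from sample to sample even though the parameter does not. It assumes a Binomial model; the values `n=20` and `p_true=0.3` and the helper names are purely illustrative and not part of the discussion above.

```python
import random


def draw_binomial(n, p, rng):
    """Draw one Binomial(n, p) count as a sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def repeated_estimates(n, p_true, num_samples, seed=0):
    """Estimate p by x / n on num_samples independently drawn data sets."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [draw_binomial(n, p_true, rng) / n for _ in range(num_samples)]


# The parameter p_true is fixed (hypothetical value); only the data are random.
estimates = repeated_estimates(n=20, p_true=0.3, num_samples=5)
print(estimates)  # the estimates vary from sample to sample

# Averaged over many samples, the estimates settle near the fixed parameter.
many = repeated_estimates(n=20, p_true=0.3, num_samples=2000)
print(sum(many) / len(many))
```

The Bayesian counterpart, where the data are held fixed and the parameter gets a distribution, is what the example in the next section works through.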

### An Example

To illustrate the Bayesian approach, let’s consider a simple example. Suppose we have data from a Binomial distribution with parameters \(n\) and \(p\), where \(n\) is known. In the Bayesian approach, we consider \(p\), the success probability, to be a random variable instead of a fixed value. Here, we consider \(p\) to follow a Beta distribution. This distributional assumption is not necessary: \(p\) could have whatever distribution we want as long as it only takes values between 0 and 1, as a Binomial success probability must. However, the Beta makes the math easy (it is the conjugate prior for the Binomial, so the posterior is again a Beta), and it is flexible enough that it should suffice in most cases. We can write this model as follows:

\[X| p \sim \mathrm{Binomial}(n,p) \] where \[ p \sim \mathrm{Beta}(a,b). \]

The first equation is the model for the data (a conditional distribution) and the second is the model for the success probability \(p\). The distribution for \(p\) is called the **prior distribution**. The reason for this is that we think of having some knowledge about \(p\) before data is collected. For example, we might believe that \(p\) is unlikely to be larger than 0.8. The parameters \(a\) and \(b\) of the distribution for \(p\) are called **hyperparameters**. They are set independently of the data, based on our prior knowledge. What we are interested in is finding the **posterior distribution** of \(p\) given the data. That is, we want to find \(f(p|X)\). This will give us insight into how \(p\) behaves and can be used as the prior distribution for a future data analysis.
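To make the role of the hyperparameters concrete, here is a small sketch that evaluates a Beta prior and checks how much prior probability it places at or below 0.8. The values `a = 2` and `b = 5` are hypothetical choices for illustration, and the helpers `beta_pdf` and `beta_cdf` are written here with only the standard library. Note that a Beta prior cannot rule out values above 0.8 entirely; it can only make them very unlikely.

```python
import math


def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def beta_cdf(q, a, b, steps=20000):
    """Approximate P(p <= q) under Beta(a, b) with the midpoint rule."""
    width = q / steps
    return sum(beta_pdf((i + 0.5) * width, a, b) for i in range(steps)) * width


a, b = 2.0, 5.0  # hypothetical hyperparameters, chosen before seeing any data
prior_mean = a / (a + b)
prob_at_most_08 = beta_cdf(0.8, a, b)

print(f"prior mean of p: {prior_mean:.4f}")
print(f"prior P(p <= 0.8): {prob_at_most_08:.4f}")  # about 0.9984
```

Different choices of `a` and `b` move the prior mass around, which is how prior knowledge about \(p\) gets encoded.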

Because we chose the distribution for \(p\) to be a Beta, it’s not too difficult to find the posterior. We just need to invoke Bayes’ rule:

\[ f(p|X) = \frac{f(X|p)f(p)}{\int_0^1f(X|p)f(p) \hspace{3pt} dp}. \]

Substituting the appropriate densities, with \(x\) denoting the observed value of \(X\), gives

\[\begin{align*} f(p|X) &= \dfrac{\binom{n}{x}p^x(1-p)^{n-x} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}p^{a-1}(1-p)^{b-1}}{\int_0^1 \binom{n}{x}p^x(1-p)^{n-x} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} p^{a-1}(1-p)^{b-1} \hspace{3pt} dp} \\ &= \dfrac{p^{(x+a)-1}(1-p)^{(n-x+b)-1}}{\int_0^1 p^{(x+a)-1}(1-p)^{(n-x+b)-1} \hspace{3pt} dp} \\ &= \dfrac{p^{(x+a)-1}(1-p)^{(n-x+b)-1}}{\frac{\Gamma(x+a)\Gamma(n-x+b)}{\Gamma(n+a+b)}} \end{align*}\]

This may look ugly and complicated at first glance, so let’s walk through it. The first line simply substitutes \(f(X|p)\) and \(f(p)\) based on the Binomial and Beta distributions. In the second line, the constants that do not depend on \(p\), namely \(\binom{n}{x}\) and \(\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\), appear in both the numerator and the denominator, so they cancel. For the last line, note that the numerator is the “meat” (the kernel) of a Beta\((x+a, n-x+b)\) density. So the integral in the denominator is just the reciprocal of the normalizing constant for that Beta distribution, which means the posterior is a Beta\((x+a, n-x+b)\) distribution.
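The derivation can be sanity-checked numerically. The sketch below evaluates Bayes’ rule directly, approximating the integral in the denominator with the midpoint rule, and compares the result with the Beta\((x+a, n-x+b)\) density. The numbers `n = 10`, `x = 3`, `a = 2`, `b = 5` are hypothetical, and the function names are my own.

```python
import math


def binom_pmf(x, n, p):
    """Binomial(n, p) probability of observing x successes."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def evidence(x, n, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f(X|p) f(p) over (0, 1)."""
    width = 1.0 / steps
    mids = ((i + 0.5) * width for i in range(steps))
    return sum(binom_pmf(x, n, m) * beta_pdf(m, a, b) for m in mids) * width


def posterior_by_bayes_rule(p, x, n, a, b, norm):
    """Posterior density f(p|X) as likelihood times prior over the evidence."""
    return binom_pmf(x, n, p) * beta_pdf(p, a, b) / norm


n, x, a, b = 10, 3, 2.0, 5.0  # hypothetical data and hyperparameters
norm = evidence(x, n, a, b)
for p in (0.1, 0.3, 0.6):
    numeric = posterior_by_bayes_rule(p, x, n, a, b, norm)
    closed_form = beta_pdf(p, x + a, n - x + b)
    print(f"p={p}: Bayes' rule {numeric:.6f}, Beta(x+a, n-x+b) {closed_form:.6f}")
```

The two columns should agree up to the small error of the numerical integration.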

It turns out we can simplify the above calculation because the denominator is just a normalizing constant that ensures the density integrates to 1. We can also ignore the constants in each of the densities: they do not depend on \(p\), so they factor out of the integral and cancel. We only need to figure out the kernel of the density, that is, the part that depends on \(p\), because this is enough to tell us the distribution. This is illustrated as follows:

\[\begin{align*} f(p|X) &\propto p^x(1-p)^{n-x} p^{a-1} (1-p)^{b-1} \\ &= p^{(x+a)-1}(1-p)^{(n-x+b)-1}. \end{align*}\]

This is the kernel of a Beta distribution with parameters \(x+a\) and \(n-x+b\), so the posterior is Beta\((x+a, n-x+b)\). Typically, this is how posteriors are worked out because it lets us focus on the important bits without getting bogged down by writing extra stuff that we know will cancel anyway.
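In code, the whole update reduces to adding counts to the hyperparameters. The sketch below applies the Beta\((x+a, n-x+b)\) result and also shows the posterior from one analysis serving as the prior for the next. The function names and all of the numbers are hypothetical.

```python
def update_beta(a, b, x, n):
    """Posterior hyperparameters after observing x successes in n trials.

    A Beta(a, b) prior with a Binomial(n, p) likelihood gives a
    Beta(x + a, n - x + b) posterior.
    """
    if not 0 <= x <= n:
        raise ValueError("x must satisfy 0 <= x <= n")
    return a + x, b + (n - x)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


a0, b0 = 2, 5                             # hypothetical prior hyperparameters
a1, b1 = update_beta(a0, b0, x=3, n=10)   # first data set  -> (5, 12)
a2, b2 = update_beta(a1, b1, x=7, n=10)   # second data set -> (12, 15)

# Updating once with the pooled data gives the same posterior.
pooled = update_beta(a0, b0, x=10, n=20)  # -> (12, 15)

print(beta_mean(a0, b0), beta_mean(a1, b1), beta_mean(a2, b2))
```

Updating in two steps or once with the pooled counts lands on the same Beta distribution, which is what makes reusing a posterior as a prior convenient in this model.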

### Summary

To reiterate, the difference between the Bayesian and frequentist approaches to statistics boils down to how the data and parameters of a model are treated (fixed vs. random). I also think it’s easier to understand the difference in this way, particularly for someone who already knows a bit about statistics. Hopefully this post will help to clarify things for those who end up in a situation similar to mine.

[1] I use the terms “a frequentist” and “a Bayesian”, but I don’t see them as distinct camps: one should use whatever approach is more appropriate for the given situation.