11 min read

Bayesian Linear Regression

This is the first of two posts on Bayesian linear regression with a proper prior. In this post I give an overview of the model and how to set up the prior distribution using expert knowledge. The next post describes how to sample from the posterior using Gibbs sampling and using Stan.

Update: Actually, I got pretty busy and didn’t get around to the second post. I hope to come back to it later but for now it’s not going to happen.

The Model

In Bayesian linear regression, we model the data as

\[ Y = X\beta + \varepsilon, \hspace{10pt} \varepsilon|\sigma \sim N_n(0,\sigma^2I) \]

Here, \(Y\) is the response and \(X\) is the matrix of covariates. We can write this model more compactly as

\[ Y|\beta,\tau \sim N_n\left(X\beta, \frac{1}{\tau} I\right). \]

In other words, we model \(Y\) as multivariate normal with mean vector \(X\beta\) and covariance matrix \(\frac{1}{\tau}I\). \(Y\) is \(n\) dimensional and \(X\) is \(n \times r\). This looks a lot like the classical linear regression model but the difference here is that we condition on \(\beta\) and \(\tau\) which are random variables instead of fixed values. We write \(1 / \tau\) instead of \(\sigma^2\) which is often done in Bayesian statistics because it makes some of the math easier.

The Prior

Now that a model is specified for the data, we need to specify a prior distribution for the parameters. Here we use the following prior:

\[ \beta \sim N_r(\beta_0, C), \hspace{10pt} \tau \sim \text{Gamma}(a,b). \]

From here, we either need to have our own prior knowledge of the data at hand or have previous data we can use. Otherwise, we need to talk to an expert and try to get enough information out of them to specify the parameters. I consider this the hard case. The expert here is someone who knows about the underlying subject but may not have much knowledge of statistics, let alone Bayesian statistics. And even if they are well versed in Bayesian statistics, we can’t exactly ask them what the covariance matrix is for \(\beta\) or what the parameter \(a\) is for \(\tau\). So we need to ask our expert a series of questions, then use their answers to set values for \(\beta_0, C, a\) and \(b\).

An Example

To illustrate what questions to ask the expert and how to turn the information into prior parameters, let’s consider an example. The data we will use is on NBA players from the 2015 season1. There are 28 variables but we will only consider 4 here. The response is personal fouls (PF) and we are going to model this as a function of steals (STL), blocks (BLK), and minutes (MIN):

\[ PF|\beta,\tau \sim N(\beta_1 + \beta_2 STL + \beta_3 BLK + \beta_4 MIN, \hspace{5pt} 1/\tau) \]

These variables are recorded on a per game basis across the season (i.e. the first player below had 0.6 personal fouls per game, 0.3 steals per game, 0 blocks per game, and played 12.4 minutes per game for the 2015 season).

nba <- read_csv('../../static/data/nba2015.csv') %>%
  select(PF, STL, BLK, MIN)
## # A tibble: 492 x 4
##       PF   STL   BLK   MIN
##    <dbl> <dbl> <dbl> <dbl>
##  1   0.6   0.3   0    12.4
##  2   2.3   0.7   0.2  23  
##  3   1.8   0.4   0.5  17  
##  4   2.8   0.6   0.3  23.1
##  5   1.6   0.9   1.3  30.5
##  6   2.1   0.7   1.3  30.6
##  7   1.9   0.9   0.8  18.5
##  8   2     0.8   0.1  23.6
##  9   2.4   0.6   0.2  33.3
## 10   0.2   0     0     2.8
## # ... with 482 more rows

The Questions

I don’t really know anything about basketball so I had to consult with someone who does (my brother-in-law). For the \(\beta\)s, I followed the process described in Bayesian Ideas and Data Analysis. First, I asked

What is the average number of personal fouls per game for someone who has 0.5 steals, 1 block, and plays 10 minutes?

The expert’s answer was “1-2”. This was great because it actually answers a second question that I would have had to ask:

How big could that be?

We need to ask two questions here because \(\beta\) has two parameters, the mean and precision. The answer to the first question gives us the mean, and the answer to the second, in combination with the first, gives us the variance. Luckily, both questions were answered in one shot. This series of questions was then repeated for different values of STL, BLK, and MIN2. Since we have four \(\beta\)s here, we need to ask four questions.

For the prior on \(\tau\), the questions are a bit different. Instead of asking about the mean of the response (PF), we ask about the response itself. Depending on the expert and the data, it may be difficult to take this distinction into account. But we don’t need a ‘perfect’ prior, we just need something that seems to reasonably capture their knowledge. The questions we ask are:

  1. For someone with 2 steals, 2 blocks, and plays 30 minutes3, what is the biggest number of personal fouls you would expect?
  2. What is the biggest that could be?

We might also clarify first question by adding something like “What is the number of personal fouls that only 5% of players like this would exceed?” The point is that we want to ask about a percentile of the response (in this case the 95th percentile). The answer to the second question then gives us an idea of how much that percentile can vary (so it’s the percentile of the percentile of the data). The answers I got were 4.5 and 5.

Obtaining the Prior

Now we have some numbers but how do we translate that into parameters for \(\beta\) and \(\tau\)? We actually asked questions about things that are easier to think about so we will need to apply some transformations to obtain the parameters we need.

For \(\tau\), we actually asked about the distribution of a percentile of the data. We set the answer to question 1 to the mode of the percentile, and the answer to question 2 as the 95th percentile of this percentile. We then use facts about normal percentiles and gamma distributions to transform these into the mode and 95th percentile of \(\sigma\), \(\sigma^2\), then finally \(\tau\). It’s a bit messy to think about but easy to implement in code.

For the \(\beta\)s, the questions we ask give information about the conditional mean of the response given the number of steals, blocks, and minutes. Specifically, we ask about \(\tilde{m}_j = E[Y|\tilde{x}_j, \beta, \tau] = \tilde{x}_j'\beta\) for \(j = 1,2,\ldots,r\) (\(r=4\) in this example) . The answers to these questions establish a prior on \(\tilde{m} = (\tilde{m}_1, \ldots, \tilde{m}_r)'\) which we assume to be normal: \[ \tilde{m} \sim N_r(\tilde{Y}, D(\tilde{w})). \] Here \(D(\tilde{w})\) is a diagonal matrix with \(\tilde{w}_j\) as the entries. The components of \(\tilde{Y}\) are the answers to the first question (average # of personal fouls for players with \(x\) steals, \(y\) blocks, and \(z\) minutes play time). The \(\tilde{w}_j\)s reflect the variability of the expert’s best guess. These are the answers to the “How big could that be” question. \(\tilde{x}_j\) is the vector with given values for the covariates. For example, \(\tilde{x}_1 = (1, 0.5, 1, 10)'\) corresponding to 0.5 steals, 1 block, and 10 minutes, plus the intercept.

This is all set up to induce a prior on \(\beta\). \(\tilde{m}\) is easier to ask questions about and the way it is set up we have \(\tilde{m} = \tilde{X}\beta\) so that \(\beta = \tilde{X}^{-1}\tilde{m}\) (\(\tilde{X}\) is the matrix whose rows are \(\tilde{x}_j'\)). Since we specified \(\tilde{m}\) to be normal, \(\beta\) will also be normal. We get the parameters by setting \[ \beta_0 = \tilde{X}^{-1}\tilde{Y}, \hspace{10pt} C = \tilde{X}^{-1}D(\tilde{w})\left(\tilde{X}^{-1}\right)'.\] This induces a normal prior on \(\beta\). It will probably help to see this in the context of the data at hand.

For the questions for \(\beta\), these are the values I got:

\[ \tilde{X} = \begin{pmatrix} 1 & 0.5 & 1 & 10 \\ 1 & 1 & 1 & 20 \\ 1 & 1 & 1 & 30 \\ 1 & 2 & 2 & 30 \end{pmatrix}, \hspace{5pt} \tilde{Y} = \begin{pmatrix} 1 \\ 1.5 \\ 2.5 \\ 3 \end{pmatrix}, \hspace{5pt} \tilde{w} = \begin{pmatrix} 2 \\ 2.5 \\ 3.5 \\ 4 \end{pmatrix}. \]

We already saw the values in the first row of \(\tilde{X}\) and the first components of \(\tilde{Y}\) and \(\tilde{w}\). Let’s consider the last components. The question would be “What’s the average number of personal fouls per game for someone who has 2 steals, 2 blocks, and plays 30 minutes?” The answer was “3-4” so \(\tilde{Y}_4 = 3\) and \(\tilde{w}_j = 4\).

Obtaining the Prior (in R)

We implement this process in R below. We’ll start with the prior on \(\tau\) because it’s a bit less involved. Below are the computations that perform the transformations mentioned earlier. We use two functions here4. The first one, norm_perc_to_sd takes the mean and percentile of a normal distribution and outputs the standard deviation. We need to use this function twice. Once for the initial guess of the percentile, and one for the percentile of this. tau_gamma_params takes the mode and a percentile for a standard deviation \(\sigma\) and returns the gamma parameters for the precision \(\tau\).

# based on 2 STL, 2 BLK, 30 MIN
norm_perc_to_sd <- function(mean, percentile, p = 0.95){
  sd = (percentile - mean) / qnorm(p)

# mean = 3 b/c Y_tilde_4 = 3
sigma_mode <- norm_perc_to_sd(mean = 3, percentile = 4.5)
sigma_perc <- norm_perc_to_sd(mean = 3, percentile = 5)

tau_gamma_params <- function(sigma_mode, sigma_perc, p = 0.95,a_start = 1, a_end = 1000, increment = 0.01){
  a <- seq(a_start, a_end, increment)
  b <- sigma_mode^2 * (a + 1)
  index <- which.max(qgamma(1-p, a, b) > 1 / sigma_perc^2)
  out <- data.frame(shape = a[index], rate = b[index])

gamma_params <- tau_gamma_params(sigma_mode = sigma_mode, sigma_perc = sigma_perc)
tau_prior_shape <- gamma_params$shape
tau_prior_rate <- gamma_params$rate
## [1] 13.88
## [1] 12.37459

That takes care of \(\tau\) so now let’s get the prior for \(\beta\). We use another function here, norm_params, which takes the mean and percentile for a normal distribution and returns the normal parameters for a mean. We use this to compute the variances for the matrix \(D\). Note that when we create the \(\tilde{X}\) matrix, we should check that it has full rank as all the above assumes this.

norm_params <- function(mean, percentile, p = 0.95){
  sd <- 1 / ( qnorm(p) / (percentile - mean) )
  out <- data.frame(mu = mean, sigma = sd)

x_tilde <- matrix(c(1, 0.5, 1, 10, 1, 1, 1, 20, 1, 1, 1, 30, 1, 2, 2, 30),
                  nrow = 4, byrow = TRUE)
y_tilde <- c(1, 1.5, 2.5, 3)  # best guess
w_tilde <- c(2, 2.5, 3.5, 4)  # guess for .95 quantile
D <- diag(4)
for (i in 1:4){
  D[i, i] <- norm_params(mean = y_tilde[i], percentile = w_tilde[i])$sigma^2

x_tilde_inv <- solve(x_tilde)
beta_prior_mean <- x_tilde_inv %*% y_tilde
beta_prior_var <- x_tilde_inv %*% D %*% t(x_tilde_inv)

##      [,1]
## [1,] -1.0
## [2,] -1.0
## [3,]  1.5
## [4,]  0.1
##            [,1]       [,2]       [,3]        [,4]
## [1,]  4.0657266  5.1745611 -5.1745611 -0.14784460
## [2,]  5.1745611  8.8706762 -8.1314532 -0.22176691
## [3,] -5.1745611 -8.1314532  8.1314532  0.18480575
## [4,] -0.1478446 -0.2217669  0.1848058  0.00739223


From here, we would want to go back to our expert and make sure the priors reflect their knowledge. We could show them a density plot of \(\sigma\) for example and talk with them to see if it accurately reflects their beliefs. The \(\beta\)s will be harder because they are not as easy to think about. You may want to just examine them yourself and see if they seem reasonable based on your conversations with the expert. If you find they don’t quite agree with the priors you have, adjust them, go back to the expert, and repeat until they are satisfied. Remember though, the prior doesn’t need to be perfect, especially if you have a sizable amount of data. It just needs to reasonably capture the experts opinion.

It should be noted that the method described here is fairly limiting. If we have a model with more than just a few covariates, it will probably be very difficult to obtain a prior this way. If we can find another source of the same data, we can use a reference prior on that and use the posterior as a prior for the data at hand. Or we could just use a reference prior but this doesn’t take into account expert knowledge which we should try to incorporate because that’s kind of the point of Bayesian statistics. We could also use the method here on only some of the \(\beta\)s and a reference or a wide normal prior on the rest. Basically, we want to capture as much expert knowledge as we can without being too cumbersome.

Hopefully all this makes sense. It’s not an easy task to set up a prior but you definitely should if you can. I suggest you try setting up a prior yourself in two different ways. On data where you are an expert, asking and answering the questions as though you were talking to an expert. And on data you know little about and thus need to talk to an expert in the area. This will help you get more familiar with the process and let you get practice asking the questions in a way the expert understands so they can provide you with useful answers.

  1. The data was collected by Dr. JC Roman at San Diego State University.

  2. You pick some arbitrary values.

  3. Or 0.5 STL, 1 BLK, and 10 MIN. Again, the values are arbitrary here.

  4. All functions in this post are based on code by Dr. Roman.