I took Iain Murray’s Machine Learning and Pattern Recognition course this fall. It was fun and challenging. I often studied by coding demos, and some of them are kinda cool and I wanted to share them. I’m going to start with something that shows up a few times: priors on linear regression parameters. This is based on this lecture.
Quick review: least squares linear regression finds a single optimal fit to observations by minimizing the squared error.
Instead of finding a single fit, a Bayesian linear regression model gives a distribution over all possible fits. It’s a little weird to think about. One component of Bayesian linear regression is a prior on the linear regression parameters. If I have a linear model \(y = wx + b\),
instead of setting a single value for \(w\) and \(b\), I could say \(w\) and \(b\) are normally distributed with means \(\mu_w\) and \(\mu_b\) and variances \(\sigma_w^2\) and \(\sigma_b^2\).
To find those parameter distributions in Bayesian land, I’d start with a prior distribution that says what kinds of functions I’d expect before seeing any data. Then I’d combine that with my observations in the form of the likelihood. The result is the posterior distribution, which is what I’d base predictions on.
In this post, I’ll just show how to sample a few \((w, b)\) pairs from a 2D Gaussian prior and plot the resulting lines.
Sampling from the prior
Here’s some code that samples from a prior distribution on parameters.
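A minimal sketch of what that sampling could look like, assuming \(w\) and \(b\) are independent Gaussians (the particular \(\mu\) and \(\sigma\) values below are just illustrative choices, not the ones from the course):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Prior means and standard deviations for the slope w and intercept b.
mu_w, sigma_w = 0.0, 1.0
mu_b, sigma_b = 0.2, 0.3
n_samples = 10

# Draw (w, b) pairs from the independent 2D Gaussian prior.
w = rng.normal(mu_w, sigma_w, size=n_samples)
b = rng.normal(mu_b, sigma_b, size=n_samples)

# Each sampled pair defines one line y = w*x + b; plot them all.
x = np.linspace(-2, 2, 100)
for wi, bi in zip(w, b):
    plt.plot(x, wi * x + bi, alpha=0.6)

plt.xlabel("x")
plt.ylabel("y")
plt.title("Lines sampled from the prior")
plt.savefig("prior_samples.png")
```

Each line in the plot is one draw from the prior: a function the model considers plausible before seeing any data.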
`sigma_b` controls the most likely y-intercept. A small
`sigma_b` makes more lines pass near the origin, and a large
`sigma_b` spreads the intercepts out. And by setting the \(\mu\) values, the sampled slopes and intercepts tend to land near those means.