# Regression with Gaussian Processes

This post talks about a neat way to do regression using a Gaussian Process, or a huge multivariate Gaussian that represents a distribution over functions. Given some observations from a function I might not know much about, I’ll end up with an estimated function with nice error bars.

First I sample some points from some fake function to be my observation. Gaussian Processes can work with more arbitrary functions, so here’s a fun-looking function. The red line is the underlying function. The black x’s are samples with added noise.

Below is the result of the regression. The black x’s are the same observations as above and the black line is the same function as above. This time, the blue line is the estimate of the function and the blue shaded area is a measure of uncertainty.

This post is based on Iain Murray’s course notes on Gaussian processes, and parts are from Murray’s demo. It also seems to follow Chapter 19 of Bayesian Reasoning and Machine Learning. I also thought this video was useful.

## Generating Data

First I’ll generate some fake observations. I’ll sample \( N = 20 \) points from an underlying function \( f \) and add a little noise to get \( y \).

### Sample \( N \) points

One special detail with sampling the points is that I choose the \( x \) values by making a range of numbers representing the \( x \) axis of the points I want to plot. Then I choose \( N \) of those to be the observations. I split the range into those \( N \) points for observations and the remaining points. This will be important when I build the giant multivariate Gaussian.

As a result, `observed_xs`

will be of size \( N = 20 \), and `unobserved_xs`

will be of size \( 500 - N = 480 \).

### Generating toy observations

Now I’ll make some observed points \( y \), which I assume come from a function \( f \) with some added noise. So \( y = f(x) + \epsilon \) where \( \epsilon \sim \mathcal{N}(0, \sigma_w^2) \).

The code below doesn’t have much to do with Gaussian processes and usually you’d be working with real data rather than fake observations, so feel free to skip it. You can also check out my previous post to learn more about generating demo data.

I create the true function \( f \), compute the values of \( f \) at the \( N \) data points from above, and add Gaussian noise \( \epsilon \sim \mathcal{N}(0, \sigma_w) \) to make the observations \( y \).

### Plotting

Here I plot the function \( f \) and the observed \( y \).

Except for the interesting way of choosing the \( N \) observed x values, none of this is specific for regression using Gaussian Processes.

## Gaussian Processes

I’ll make a huge multivariate Gaussian distribution with all possible points I care about. In this case, the only points I care about are the 500 points in `all_xs`

. This is an interesting way to simplify this problem! The justification is that when I’m done, I’m only going to plot a finite number of points anyway, and because of the nature of Gaussians, I can do this ignoring the other infinite points in that range. (It’s also a little funny because a “Gaussian Process” is based on having an infinite variable Gaussian, but we quickly turn around and make it finite again!)

To define the 500-dimensional multivariate Gaussian distribution, I’ll need the mean for each dimension, a vector of size \( 500\times 1 \), and the covariance matrix, a matrix of size \( 500\times 500 \)*. For example, one of the rows in the mean vector will represent the mean at \( x = 0.00 \), another will be the mean at \( x = 0.01 \).

*The size of this covariance matrix is one of the reasons why GP’s tend to only be used for small datasets.

### Initializing the means and covariances

The means and covariances of \( p(\textbf f, \textbf g) \) don’t depend on the observations, and only depend on hyperparameters. The observations only come in when I compute \( p(\textbf f \mid \textbf g) \).

I’ll set the means \( \textbf{a} \) and \( \textbf{b} \) to be 0.

The covariance matrix is probably the most important part.

When I increase an element in the covariance matrix, it means the corresponding points are more likely to have similar values. This is how we get a smooth function instead of random points.

For example, adjusting the value representing `0.15`

vs `0.40`

in the covariance matrix corresponds affects how close they are together. Given a set value for `0.15`

, if the covariance is high, `0.40`

will usually be assigned a similar number. If the covariance is low, `0.40`

can be assigned different values. The below two images try to visualize this, inspired by this video.

The elements of the covariance matrix are typically populated using a kernel function, such as the RBF kernel function:

There are other kernel functions, but they are required to create a valid covariance matrix.

One way that’s fun to visualize this is a heatmap. The brighter points have a higher covariance value. This shows that points close to each other have a higher covariance, and those farther away have a lower one.

This can show the impact of \( \ell \). A larger \( \ell \) means that points farther away from each other are considered more related than if we had used a smaller \( \ell \).

Changes in \( \sigma_f \) will change how large each value is.

### Prior

I can plot what the Gaussian Process prior thinks of the world before it’s given any data.

To plot this, I look at the marginal distribution of this multivariate Gaussian for just one dimension, say for \( x = 0.01 \). I end up with the univariate Gaussian I know and love. I’ll use this Gaussian’s mean as my estimate for that point. I can also use its standard deviation to mark a region of uncertainty. Then I’ll repeat this for all 500 points in `all_xs`

, getting 500 univariate Gaussians, and plotting the mean and standard deviations of them.

## Given observations

Now I can incorporate the observations. I’ll split up the multivariate Gaussian’s dimensions into the dimensions corresponding to my observations, and those corresponding to the points I want to know. So if \( \textbf{f} \) represents the points I don’t know and \( \textbf{g} \) represents the points I do know, then I can rearrange my points and write a big matrix like

where \( \textbf a \) and \( \textbf b \) are vectors, and \( A \), \( B \), and \( C \) are matrices. I think it’s a little useful to imagine at the shapes of the covariance matrix like this:

The full covariance matrix is filled with the values dictated by the kernel function, and so are the \( A \), \( B \), and \( C \).

Now let’s predict the unobserved values given the observed values! To do this, I compute the posterior \( p(\textbf{f} \mid \textbf{g}) \), or the probability distribution of the unobserved given the values of the observed points. It turns out Gaussians have a convenient closed-form solution to this given by:

Since I’m assuming \( \textbf{a} \) and \( \textbf{b} \) are 0, this is just

So all I need to do is compute the big vector of \( 500 - N \) values that represent the mean \( \textbf{a} + CB^{-1}(\textbf{g} - \textbf{b}) \), and the diagonal of the covariance, \( A - CB^{-1}C^{\top} \) to get the variances.

There’s one catch: I’m assuming my observations also included some noise \( \sigma_y \), so I need to add that to the variances \( A \).

To summarize:

- Build \( A \), \( B \), and \( C \) using the kernel function. Add noise to \( A \).
- Compute the \( 500 - N \) values that represent the mean and the \( 500 - N \times 500 - N \) matrix representing covariance.
- Use this to plot the estimate and error bars

And now I’ll plot it!

The graph of these is interesting. When there are large gaps of missing values, the uncertainty increases, with a tendency towards the prior’s mean which is 0. The uncertainty pinches towards the observed points, but `sigma_w`

says I’m uncertain about my observation so there’s still uncertainty around the observations.

Super neat! I can also do this again with different hyperparameters. If I turn \( \sigma_f \) way down, it’ll pull the GP’s predictions closer to 0. Interestingly, this one confidently says the wrong thing. This is one of the things to watch out about GPs and Bayesian methods. The way to solve this is by adjusting the hyperparameters when it does a poor job like this.