by Tadas Vilkeliskis, September 8, 2013

In *the previous post* I covered linear
regression, which can be used to predict continuous values. This post is a
sequel to simple linear regression and talks about locally weighted linear
regression.

One of the problems with linear regression is that, once the model is trained, it fits a single straight line to all of your data. Such behaviour may be fine when your data follows a linear pattern and does not have much noise. However, when the data set is not linear, linear regression tends to underfit the training data.

Locally weighted regression (LWR) tries to alleviate this problem by assigning weights to your training data. Weights are bigger for the data points that are closer to the point you are trying to predict, hence the "local" in the name. Note also that LWR requires the entire training set every time you make a prediction, which makes it much more computationally expensive than simple linear regression. We have to do this because every prediction constructs a regression line that is local to the data point of interest.

The LWR model is very similar to the simple regression model; the only difference is that we introduce a weight matrix \(W\). Once we have the weight matrix, we can find the model parameters as follows:

\[\beta = (X'WX)^{-1}X'WY\]

To get the prediction we multiply our input \(X_0\) by the betas: \(\hat{y} = X_0 \beta\).
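As a quick sanity check of the weighted normal equations, here is a minimal NumPy sketch (the toy data and variable names are mine, not from the post; with \(W = I\) it reduces to ordinary least squares):

```python
import numpy as np

# Tiny synthetic data set following y = 2x + 1 with slight noise;
# the second column of 1s is the intercept term
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
Y = np.array([3.0, 5.1, 6.9, 9.0])

# Diagonal weight matrix; all points weighted equally here (W = I)
W = np.eye(len(Y))

# beta = (X' W X)^-1 X' W Y, solved without an explicit inverse
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Prediction at x = 2.5: y_hat = X0 @ beta
x0 = np.array([2.5, 1.0])
y_hat = x0 @ beta
```

Using `np.linalg.solve` instead of inverting \(X'WX\) directly is numerically more stable; the formula is the same.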

Kernel smoothing allows us to smooth out our point of interest (\(X_0\)) using nearby data points (\(X\)). The kernel is usually defined by a function \(D\) whose values decrease as the distance between \(X_0\) and \(X\) increases:

\[D\left(\frac{\lVert X - X_0 \rVert}{h_{\lambda}(X_0)}\right)\]

To construct the matrix \(W\) we evaluate the kernel for each training input (\(X\)) and the value we are trying to predict, \(X_0\). The weight matrix is diagonal: all off-diagonal cells are 0. After running the kernel on our data set we will essentially have a weight matrix where the weights decrease as the distance between \(X_0\) and \(X\) increases. This kernel property makes nearby data points more important when solving the linear regression.

I am going to use the Gaussian kernel function for illustration purposes:

\[D = ae^{ - \displaystyle\frac{\lVert X - X_0 \rVert^2}{2c^2}}\]
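A small sketch of how this kernel behaves (the toy data here is made up for illustration): with \(a = 1\), the weight is 1 when the two points coincide and decays toward 0 as the distance grows, and stacking the weights along a diagonal gives the matrix \(W\):

```python
import numpy as np

def gaussian_kernel(x, x0, c, a=1.0):
    """Gaussian kernel weight for two points given as 1-D arrays."""
    sq_dist = np.sum((x - x0) ** 2)
    return a * np.exp(-sq_dist / (2.0 * c ** 2))

# Weights for training points at increasing distance from x0
X_train = np.array([[0.0], [0.5], [1.0], [2.0]])
x0 = np.array([0.0])
w = [gaussian_kernel(x, x0, c=1.0) for x in X_train]
W = np.diag(w)  # diagonal weight matrix, off-diagonal entries are 0
# w ≈ [1.0, 0.882, 0.607, 0.135] — weights shrink with distance from x0
```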

To demonstrate LWR we are going to use the same Iris data set we used in my previous *post on linear
regression*. We
will try to estimate sepal length from its width.

The code is pretty straightforward and maps almost directly to mathematical expressions.

```
import numpy as np


def gaussian_kernel(x, x0, c, a=1.0):
    """
    Gaussian kernel.

    :Parameters:
      - `x`: nearby data point we are looking at.
      - `x0`: data point we are trying to estimate.
      - `c`, `a`: kernel parameters.
    """
    diff = x - x0
    # Squared Euclidean distance; reduce the 1x1 matrix to a scalar
    dot_product = (diff * diff.T)[0, 0]
    return a * np.exp(dot_product / (-2.0 * c**2))


def get_weights(training_inputs, datapoint, c=1.0):
    """
    Calculate the weight matrix for a given data point and training data.

    :Parameters:
      - `training_inputs`: training data set the weights should be assigned to.
      - `datapoint`: data point we are trying to predict.
      - `c`: kernel function parameter.

    :Returns:
      NxN weight matrix, where N is the size of `training_inputs`.
    """
    x = np.mat(training_inputs)
    n_rows = x.shape[0]
    # Create a diagonal weight matrix from the identity matrix
    weights = np.mat(np.eye(n_rows))
    for i in range(n_rows):
        weights[i, i] = gaussian_kernel(datapoint, x[i], c)
    return weights


def lwr_predict(training_inputs, training_outputs, datapoint, c=1.0):
    """
    Predict a data point by fitting a local regression.

    :Parameters:
      - `training_inputs`: training input data.
      - `training_outputs`: training outputs.
      - `datapoint`: data point we want to predict.
      - `c`: kernel parameter.

    :Returns:
      Estimated value at `datapoint`.
    """
    weights = get_weights(training_inputs, datapoint, c=c)
    x = np.mat(training_inputs)
    y = np.mat(training_outputs).T
    # beta = (X'WX)^-1 X'WY, then prediction is X0 * beta
    xt = x.T * (weights * x)
    betas = xt.I * (x.T * (weights * y))
    return datapoint * betas
```
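To see the method end to end, here is a condensed, standalone version of the same computation run on synthetic non-linear data (the Iris loading code is not part of the post's snippet, so I substitute a sine curve; function and variable names here are mine):

```python
import numpy as np

def lwr_predict(X, y, x0, c=1.0):
    """Locally weighted linear prediction at x0 (condensed sketch)."""
    X, y, x0 = np.asarray(X), np.asarray(y), np.asarray(x0)
    # Gaussian kernel weights for every training row relative to x0
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2.0 * c ** 2))
    W = np.diag(w)
    # beta = (X'WX)^-1 X'WY via a linear solve
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x0 @ beta

# Synthetic non-linear data: y = sin(x), with an intercept column of 1s
xs = np.linspace(0.0, 3.0, 30)
X = np.column_stack([xs, np.ones_like(xs)])
y = np.sin(xs)

# With a narrow kernel, the local line tracks the curve near x = 1.5
pred = lwr_predict(X, y, np.array([1.5, 1.0]), c=0.3)
```

Note that each call refits the regression from the full training set, which is the computational cost mentioned earlier.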

I ran LWR on the Iris data set with 3 different \(c\) values for the Gaussian kernel: 0.1, 0.2 and 1. You can see the results in the chart below.

It is important to pick good kernel parameters; otherwise the kernel may be too wide and include a non-linear region of the data set. In our case this is fine, since our data set is linear from start to end (except in the \(c=0.1\) case). A kernel that is too narrow might exclude data points that would improve the fit (see Figure 3). As \(c\) gets smaller, the curvature of our model increases. This happens because the kernel becomes narrower, so only very close data points receive substantial weights, rendering the other data points useless.
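To make the effect of \(c\) concrete, a back-of-the-envelope check (not taken from the chart): here is the weight the Gaussian kernel assigns to a point one unit away from \(X_0\) for the three values used above.

```python
import numpy as np

for c in (0.1, 0.2, 1.0):
    w = np.exp(-1.0 / (2.0 * c ** 2))  # weight at squared distance 1
    print(c, w)
# c=0.1 → ~2e-22, c=0.2 → ~4e-6, c=1.0 → ~0.61:
# a narrow kernel all but ignores points only one unit away
```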