by Tadas Vilkeliskis, August 14, 2013

This is my first post on machine learning, and hopefully not the last. The main goal of these posts is to serve as a quick reference for simple machine learning problems and their solutions, while also helping me get a better understanding of the field itself. That said, don’t take anything for granted.

Linear regression is one of the algorithms used to predict scalar values given some inputs. The model can be written like this:

\[y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots +\beta_p x_{ip} + \epsilon_i\]

It can also be written in a vector form as:

\[\boldsymbol Y = \boldsymbol X \boldsymbol \beta + \boldsymbol \epsilon\]

We first need to find the model parameters \(\beta\) before any predictions, \(y\), can be made for our inputs \(x\).

Given a set of inputs and outputs we would have to solve a set of equations for \(\beta\). The vector form can be solved using linear algebra; the solution takes the form of:

\[\beta = (X'X)^{-1}X'Y\]

This particular solution will find betas using the method of ordinary least squares. If linear algebra is not an option, another way to find betas is to minimize the difference between actual values and predicted values by using some function optimization method, e.g. gradient descent.
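To make the optimization route concrete, here is a minimal gradient descent sketch for a single-input model. The data, learning rate, and iteration count are made up for the example; they are not from the Iris data set.

```
def gradient_descent(xs, ys, learning_rate=0.01, iterations=10000):
    # Start with zero coefficients and step opposite the cost gradient.
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradients of the cost (half the mean squared error)
        # with respect to b0 and b1.
        grad0 = sum(b0 + b1 * x - y for x, y in zip(xs, ys)) / n
        grad1 = sum((b0 + b1 * x - y) * x for x, y in zip(xs, ys)) / n
        b0 -= learning_rate * grad0
        b1 -= learning_rate * grad1
    return b0, b1

# Synthetic noise-free data: y = 2 + 3x, so the betas should converge
# to roughly 2 and 3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]
b0, b1 = gradient_descent(xs, ys)
```

For noise-free data like this the estimates land very close to the true coefficients; real data would leave a residual error term.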

We are going to use a minimal version of the Iris data set in the code examples.

In the examples below I am going to implement a simple linear regression model that takes only a single input. This way we can much more easily visualize our data and model; however, the code can be easily extended to support more than one input.

In this example we are going to model the sepal length using only the sepal width for the setosa flower class.

```
import csv

def read_iris_data(filename):
    """
    :Parameters:
    - `filename`: name of the file that contains Iris data set.
    """
    sepalWidths = []
    sepalLengths = []
    with open(filename) as fd:
        reader = csv.DictReader(fd)
        for row in reader:
            # Widths will be our inputs, include intercept
            sepalWidths.append([1.0, float(row['sepalWidth'])])
            sepalLengths.append(float(row['sepalLength']))
    return sepalWidths, sepalLengths
```

One thing to note here is that we are adding an intercept term to our inputs (the leading 1.0). It’s not very important in this particular case since our inputs never evaluate to 0; in other cases it may be useful.

```
import numpy as np

def fit(inputs, outputs):
    """
    Solve a set of linear regression equations for betas using linear algebra.

    :Parameters:
    - `inputs`: a list of inputs.
    - `outputs`: a list of outputs for given inputs.
    """
    X = np.mat(inputs)
    Y = np.mat(outputs).T
    x_trans_x = X.T * X
    if np.linalg.det(x_trans_x) == 0:
        raise Exception('Cannot invert singular matrix')
    return (x_trans_x.I * (X.T * Y)).A.flatten()
```

The `fit()` function expects a list of tuples as inputs, and a list of output
values. The model parameters are returned as a flat array. If we run this
function on our data set we get these beta coefficients:
`[2.63900125 0.69048972]`. Thus, our regression function looks as follows:

\[sepalLength = 2.63900125 + 0.69048972 \times sepalWidth\]
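Making a prediction is then just a matter of plugging a width into this function. A small sketch, with the coefficients hard-coded from the fit above:

```
beta = [2.63900125, 0.69048972]  # intercept and slope from fit()

def predict(beta, sepal_width):
    # The intercept rides along as the coefficient of the constant 1.0 term.
    return beta[0] + beta[1] * sepal_width

predicted_length = predict(beta, 3.5)  # roughly 5.06
```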

The scikit-learn package comes with a linear model that can be used to solve linear regression problems.

```
import numpy as np
from sklearn import linear_model
X = np.array(inputs)
Y = np.array(outputs)
# Don't include intercept since we already added it while loading
# data.
model = linear_model.LinearRegression(fit_intercept=False)
model.fit(X, Y)
```

The model parameters are stored in `model.coef_`; for our dataset they are
`array([ 2.63900125, 0.69048972])`. Predictions can be made by calling the
`model.predict` method.
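Here is a self-contained sketch of the whole scikit-learn round trip; the widths and lengths below are made-up stand-ins for the Iris file, and each prediction input carries the same leading 1.0 intercept column as the training rows:

```
import numpy as np
from sklearn import linear_model

# Made-up stand-in data shaped like the loader's output:
# each input row is [1.0, sepalWidth].
inputs = [[1.0, 3.0], [1.0, 3.2], [1.0, 3.5], [1.0, 3.6]]
outputs = [4.7, 4.9, 5.0, 5.2]

# fit_intercept=False because the intercept column is already present.
model = linear_model.LinearRegression(fit_intercept=False)
model.fit(np.array(inputs), np.array(outputs))

# predict() takes rows shaped like the training inputs.
predictions = model.predict(np.array([[1.0, 3.4]]))
```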

For simple cases that have only a single input we can plot the regression line and the data points together to see how well our model fits the data.

The red line represents our original model, which I think fits the data fairly well. However, if we look at our dataset, the first data point looks like an outlier. We can try to build a model using a data set without the outlier. The green regression line represents the model where the outlier was removed from the training set.
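A plot like this takes only a few lines of matplotlib; this sketch uses made-up stand-in data and the coefficients from the fit above:

```
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Made-up sepal measurements standing in for the setosa subset.
widths = [3.0, 3.2, 3.5, 3.6, 3.9]
lengths = [4.7, 4.9, 5.0, 5.2, 5.4]
beta = [2.63900125, 0.69048972]  # coefficients from fit()

plt.scatter(widths, lengths)
xs = [min(widths), max(widths)]
plt.plot(xs, [beta[0] + beta[1] * x for x in xs], color='red')
plt.xlabel('sepal width')
plt.ylabel('sepal length')
plt.savefig('regression.png')
```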

By looking at the graphs it’s difficult to tell which model is better. We
can take a look at the coefficient of determination (\(R^2\)) for
each of them. The scikit-learn model provides a `score` method which can be
used to obtain it. For the first model we have an \(R^2\) of
`0.551375580392`, and for the second one `0.547248091457`. The closer
\(R^2\) is to 1, the better our model is.
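The `score` method computes \(R^2 = 1 - SS_{res} / SS_{tot}\), i.e. one minus the ratio of the residual sum of squares to the total sum of squares. A hand-rolled sketch, evaluated on a few made-up values:

```
def r_squared(actual, predicted):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_y = sum(actual) / len(actual)
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot

# Made-up actual/predicted pairs for illustration.
score = r_squared([4.7, 4.9, 5.2], [4.8, 4.9, 5.1])
```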