Machine Learning Notes - Linear Regression

Published: August 14, 2013

This is my first post on machine learning, and hopefully not the last one. The main goal of these posts is to serve as a quick reference for simple machine learning problems and their solutions, meanwhile allowing me to get a better understanding of the field itself. That said, don't take anything for granted.

Linear Regression

Linear regression is one of the algorithms used to predict scalar values given some inputs. Linear regression can be modelled like this:

y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots +\beta_p x_{ip} + \epsilon_i

It can also be written in a vector form as:

\boldsymbol Y = \boldsymbol X \boldsymbol \beta + \boldsymbol \epsilon

We first need to find the model parameters \beta before any predictions, y, can be made for our inputs x.

Finding Betas

Given a set of inputs and outputs we would have to solve a set equations for \beta. The vector form can be solved using linear algebra; the solution takes the form of:

\beta = (X'X)^{-1}X'Y

This particular solution will find betas using the method of ordinary least squares. If linear algebra is not an option, another way to find betas is to minimize the difference between actual values and predicted values by using some function optimization method, e.g. gradient descent.

Data Set

We are going to use a minimal version of the Iris data set in the code examples.

Implementing The Model

In the examples below I am going to implement a simple linear regression model that takes only a single input. This way we can much more easily visualize our data and model; however, the code can be easily extended to support more than one input.

Reading The Data Set

In this example we are going to model the sepal length using sepal width only for setosa flower class.

import csv

def read_iris_data(filename):
    """
    :Parameters:
      - `filename`: name of the file that contains Iris data set.
    """

    sepalWidths = []
    sepalLengths = []

    with open(filename) as fd:
        reader = csv.DictReader(fd)
        for row in reader:
            # Widths will be our inputs, include intercept
            sepalWidths.append([1.0, float(row['sepalWidth'])])
            sepalLengths.append(float(row['sepalLength']))

    return sepalWidths, sepalLengths

One thing to note here is that we are adding intercept to our inputs (ones). It's not very important in this particular case since our inputs will never evaluate to 0. In other cases it may be useful.

Solving For Betas With NumPy And Linear Algebra

import numpy as np

def fit(inputs, outputs):
    """
    Solve a set of linear regression equations for betas using linear algebra.

    :Parameters:
      - `inputs`: a list of inputs.
      - `outputs`: a list of outputs for given inputs.
    """
    X = np.mat(inputs)
    Y = np.mat(outputs).T
    x_trans_x = X.T * X
    if np.linalg.det(x_trans_x) == 0:
        raise Exception('Cannot inverse singular matrix')
    return (x_trans_x.I * (X.T * Y)).A.flatten()

The fit() function expects a list of tuples as inputs, and a list of output values. The model parameters will be returned as a list. If we are going to run this function on our data set we will get a these beta coefficients [2.63900125 0.69048972]. Thus, our regression function looks as follows:

sepalLength = 2.63900125 + 0.69048972 \times sepalWidth

Solving For Betas With scikit-learn

scikit-learn package comes with a linear model which can be used to solve linear regression problems.

import numpy as np
from sklearn import linear_model

X = np.array(inputs)
Y = np.array(outputs)
# Don't include intercept since we already added it while loading
# data.
model = linear_model.LinearRegression(fit_intercept=False)
model.fit(X, Y)

The model parameters are stored in model.coef_ and for our dataset they are array([ 2.63900125, 0.69048972]). Predictions can be done by calling model.predict method.

Assessing Your Model

For simple cases that have only a single input we can plot the regression line and the data points together to see how well our model fits the data.

The red line represents our original model, which I think looks fairly well. However, if we look at our dataset the first data point looks like an outlier. We can try to build a model using a data set without the outlier. The green regression line represents the model where the outlier was removed from the training set.

By looking at the graphs it's diffucult to tell which model is a better one. We can take a look at the coefficient of determination (R^2) for each of them. scikit-learn model provides score method which can be used to obtain it. For the first model we have R^2 of 0.551375580392 and for the second one—0.547248091457. The closer R^2 to 1 the better our model is.