by Tadas Vilkeliskis, August 14, 2013
This is my first post on machine learning, and hopefully not the last one. The main goal of these posts is to serve as a quick reference for simple machine learning problems and their solutions, meanwhile allowing me to get a better understanding of the field itself. That said, don’t take anything for granted.
Linear regression is one of the algorithms used to predict scalar values given some inputs. Linear regression can be modelled like this:
It can also be written in a vector form as:
We first need to find the model parameters \(\beta\) before any predictions, \(y\), can be made for our inputs \(x\).
Given a set of inputs and outputs we would have to solve a set equations for \(\beta\). The vector form can be solved using linear algebra; the solution takes the form of:
This particular solution will find betas using the method of ordinary least squares. If linear algebra is not an option, another way to find betas is to minimize the difference between actual values and predicted values by using some function optimization method, e.g. gradient descent.
In the examples below I am going to implement a simple linear regression model that takes only a single input. This way we can much more easily visualize our data and model; however, the code can be easily extended to support more than one input.
In this example we are going to model the sepal length using sepal width only for setosa flower class.
import csv def read_iris_data(filename): """ :Parameters: - `filename`: name of the file that contains Iris data set. """ sepalWidths =  sepalLengths =  with open(filename) as fd: reader = csv.DictReader(fd) for row in reader: # Widths will be our inputs, include intercept sepalWidths.append([1.0, float(row['sepalWidth'])]) sepalLengths.append(float(row['sepalLength'])) return sepalWidths, sepalLengths
One thing to note here is that we are adding intercept to our inputs (ones). It’s not very important in this particular case since our inputs will never evaluate to 0. In other cases it may be useful.
import numpy as np def fit(inputs, outputs): """ Solve a set of linear regression equations for betas using linear algebra. :Parameters: - `inputs`: a list of inputs. - `outputs`: a list of outputs for given inputs. """ X = np.mat(inputs) Y = np.mat(outputs).T x_trans_x = X.T * X if np.linalg.det(x_trans_x) == 0: raise Exception('Cannot inverse singular matrix') return (x_trans_x.I * (X.T * Y)).A.flatten()
The fit() function expects a list of tuples as inputs, and a list of output values. The model parameters will be returned as a list. If we are going to run this function on our data set we will get a these beta coefficients [2.63900125 0.69048972]. Thus, our regression function looks as follows:
scikit-learn package comes with a linear model which can be used to solve linear regression problems.
import numpy as np from sklearn import linear_model X = np.array(inputs) Y = np.array(outputs) # Don't include intercept since we already added it while loading # data. model = linear_model.LinearRegression(fit_intercept=False) model.fit(X, Y)
The model parameters are stored in model.coef_ and for our dataset they are array([ 2.63900125, 0.69048972]). Predictions can be done by calling model.predict method.
For simple cases that have only a single input we can plot the regression line and the data points together to see how well our model fits the data.
The red line represents our original model, which I think looks fairly well. However, if we look at our dataset the first data point looks like an outlier. We can try to build a model using a data set without the outlier. The green regression line represents the model where the outlier was removed from the training set.
By looking at the graphs it’s diffucult to tell which model is a better one. We can take a look at the coefficient of determination (\(R^2\)) for each of them. scikit-learn model provides score method which can be used to obtain it. For the first model we have \(R^2\) of 0.551375580392 and for the second one—0.547248091457. The closer \(R^2\) to 1 the better our model is.