Machine Learning Notes - Linear Regression
This is my first post on machine learning, and hopefully not the last one. The main goal of these posts is to serve as a quick reference for simple machine learning problems and their solutions, meanwhile allowing me to get a better understanding of the field itself. That said, don't take anything for granted.
Linear Regression
Linear regression is one of the algorithms used to predict scalar values given some inputs. Linear regression can be modelled like this:
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots +\beta_p x_{ip} + \epsilon_i
It can also be written in a vector form as:
\boldsymbol Y = \boldsymbol X \boldsymbol \beta + \boldsymbol \epsilon
We first need to find the model parameters \beta before any predictions, y, can be made for our inputs x.
Finding Betas
Given a set of inputs and outputs we would have to solve a set equations for \beta. The vector form can be solved using linear algebra; the solution takes the form of:
\beta = (X'X)^{-1}X'Y
This particular solution will find betas using the method of ordinary least squares. If linear algebra is not an option, another way to find betas is to minimize the difference between actual values and predicted values by using some function optimization method, e.g. gradient descent.
Data Set
We are going to use a minimal version of the Iris data set in the code examples.
Implementing The Model
In the examples below I am going to implement a simple linear regression model that takes only a single input. This way we can much more easily visualize our data and model; however, the code can be easily extended to support more than one input.
Reading The Data Set
In this example we are going to model the sepal length using sepal width only for setosa flower class.
import csv def read_iris_data(filename): """ :Parameters: - `filename`: name of the file that contains Iris data set. """ sepalWidths = [] sepalLengths = [] with open(filename) as fd: reader = csv.DictReader(fd) for row in reader: # Widths will be our inputs, include intercept sepalWidths.append([1.0, float(row['sepalWidth'])]) sepalLengths.append(float(row['sepalLength'])) return sepalWidths, sepalLengths
One thing to note here is that we are adding intercept to our inputs (ones). It's not very important in this particular case since our inputs will never evaluate to 0. In other cases it may be useful.
Solving For Betas With NumPy And Linear Algebra
import numpy as np def fit(inputs, outputs): """ Solve a set of linear regression equations for betas using linear algebra. :Parameters: - `inputs`: a list of inputs. - `outputs`: a list of outputs for given inputs. """ X = np.mat(inputs) Y = np.mat(outputs).T x_trans_x = X.T * X if np.linalg.det(x_trans_x) == 0: raise Exception('Cannot inverse singular matrix') return (x_trans_x.I * (X.T * Y)).A.flatten()
The fit() function expects a list of tuples as inputs, and a list of output
values. The model parameters will be returned as a list. If we are going to run
this function on our data set we will get a these beta coefficients
[2.63900125 0.69048972]. Thus, our regression function looks as follows:
sepalLength = 2.63900125 + 0.69048972 \times sepalWidth
Solving For Betas With scikit-learn
scikit-learn package comes with a linear model which can be used to solve linear regression problems.
import numpy as np from sklearn import linear_model X = np.array(inputs) Y = np.array(outputs) # Don't include intercept since we already added it while loading # data. model = linear_model.LinearRegression(fit_intercept=False) model.fit(X, Y)
The model parameters are stored in model.coef_ and for our dataset they are
array([ 2.63900125, 0.69048972]). Predictions can be done by calling
model.predict method.
Assessing Your Model
For simple cases that have only a single input we can plot the regression line and the data points together to see how well our model fits the data.
The red line represents our original model, which I think looks fairly well. However, if we look at our dataset the first data point looks like an outlier. We can try to build a model using a data set without the outlier. The green regression line represents the model where the outlier was removed from the training set.
By looking at the graphs it's diffucult to tell which model is a better one. We
can take a look at the coefficient of determination (R^2) for
each of them. scikit-learn model provides score method which can be used to
obtain it. For the first model we have R^2 of 0.551375580392 and
for the second one—0.547248091457. The closer R^2 to 1 the better
our model is.