Gradient descent for linear regression using Golang
December 18, 2019
I recently decided to dive into machine learning, a field I have wanted to understand for a long time but have never had the time to pursue.
I’ve been taking the free (and amazing!) course from Stanford University’s Andrew Ng on Coursera. The first two weeks are dedicated to the Linear Gradient algorithm.
In this post, I’ll provide an overview of how it works and share how I implemented the vectorized version and parts of the non-vectorized version in Golang using the gonum library.
Linear regression is a technique used in modeling the linear relationship between an input and its output. Given a set of known input/output values, linear regression will find the linear function that best fits these values and that can predict the outcome output for any new input value.
For multiple inputs, the process is called Multiple Linear Regression.
What problem does it solve?
Imagine you know multiple inputs and one output for a problem, and you want to be able to compute the output for any new input.
For instance, let’s say you know the characteristics of houses (size, age, number of rooms, etc.) and the selling prices in Amsterdam for the last month, and you would like to predict the selling price of a new house, knowing its size.
Linear regression will find the best prediction given our example training set.
Single input-output problem
For clarity, we will restrain our model to one input `x` and one output `y`. This will make it easier to plot the data and follow the graph alongside the explanation. For our example, we will predict the price of a house based on its size.
First, let’s plot the data. The house size is represented on the x-axis and the price on the y-axis:
What we want to find is the line – represented here in red – that best estimates our example data. The best estimate will be the line that reduces the most the error between the predicted value and the real one.
This line is called the hypothesis and its mathematical representation is the following function: h(x) = ax + b
The question becomes how do we best determine a and b so the error between the plot data and the line is the smallest?
The cost function will help us determine how far from our hypothesis the data points are. For that, we need to calculate the error between our prediction and the actual result:
If we minimize the sum of errors for all the points in our plot, we will find the best hypothesis.
To do so, we calculate the sum of the squared difference between the output we predicted, h(x), and the actual output y, for all values in the training set. Then we divide this sum by two times the amount of training examples.
In math, we represent the cost by the following function:
If we substitute h(x) with its mathematical representation we defined earlier, h(x) = ax + b, the above function becomes:
m = number of training examples
hi(x) = hypothesis calculated for row i of the training set
y = output row i of the training set
Getting the lowest cost becomes a matter of minimizing
Minimizing the cost with gradient descent
Gradient descent is an optimization algorithm for finding the minimum of a function and it is what we will use to find our linear regression.
Let’s consider for a moment that b=0 in our hypothesis, just to keep things simple and plot the cost function on a 2D graph. The cost function will consequently become dependent solely on a, which means that in order to minimize it, we will have to find where a minimizes it.
To find the minimum of the cost function we are going to take a small step along the path of the function towards the minima. Let’s denote α, the positive integer that indicates the size of our step.
For every step, we will calculate the slope of the tangent to this function and decide which way to go:
Find the minima of the cost function
- If the slope of the tangent is negative, we increase the value of a, thus moving it to the right, toward the minima.
- If the slope is positive, we decrease the value of a, thus moving it to the left, toward the minima.
- A slope of 0 means we have found our minima.
To calculate the minimum, here is the algorithm of gradient descent :
This will guide us to the minimum cost! You can stop the loop if you find that two iterations lead to a very small decrease in the value of cost function.
After applying the partial derivative term of Cost(a,b) for each expression, we end up with this algorithm:
Every iteration will compute a new hypothesis.
Here is a graph with some of the computed hypotheses while running the gradient descent algorithm:
hypotheses calculated while running gradient descent
To start the algorithm, we usually assign “a” and “b” to zero, which gives us the bottom line on this graph. Then for every iteration, the algorithm finds a better hypothesis.
α has to be chosen carefully:
- If α is too small then the gradient descent will take a really small step, calculate lots of hypotheses functions, and therefore slow down the algorithm.
- If α is too big, it will miss the minima (it will step over) resulting in a Cost function that increases instead of decreasing.
Multiple inputs problem
If the output depends on multiple inputs, the mathematical representation of the hypothesis will look like this:
where x1, x2, …, xn are our inputs.
The cost function will be relatively the same:
And our gradient descent will update all the parameters:
After a certain number of iterations and a good step α, we will find the minimum of the cost function and the theta parameters. We are then able to compute the y value for new inputs.
In order to have a better performing gradient descent algorithm, we have to scale our inputs so they are all in the same range (for example between [-1 1]).
There are various scaling methods out there, but in my example I am using one of the simpler ones, called mean normalization. Mean normalization computes the normalized value x` with respect to the original input value x, the mean of all inputs and the min/max input values:
We have to apply this formula on all of our inputs and before computing our hypothesis.
The GO Code
In the Coursera course, all examples are implemented in Octave/Matlab, which are very helpful tools when dealing with matrix manipulation.
But I’m a Golang software engineer so I wanted to implement this machine learning algorithm using Go.
Please keep in mind that my implementation isn’t by any means the most performant, optimized or production-ready one. The purpose of my code is mostly educational. If you are looking for examples or libraries that provide more optimized implementations, please check the References section at the end of the article.
How my code works
First, we need to read the data of our training set. I read it in a CSV format. The last column will be the expected output and the other columns represent all the inputs.
To read the data, you can use the CSV reader from Golang, it’s pretty easy.
First, we will instantiate three slices:
- inputs will hold all csv columns except the last one.
- y will hold the last column, the expected result
- theta is just a slice of (n+1)) dimension and we start by a 0 init (n is the amount of input)
Once we loaded all the data, we can normalize it:
Then we can run gradient descent. hypothesis.ComputeHypothesis calculates the hypothesis function.
Our model is now trained, and we can use our thetas to compute an output from new inputs:
In the Coursera course, it is strongly recommended to use a vectorized version using algebra and matrices in order to have an efficient algorithm. But matrix manipulation is not something that engineers use on a regular basis and therefore it’s not the version I have chosen to show here. However, if you do want to check the vectorized version out, you can have a look at my Golang implementation using gonum library for matrices calculus. All of my code is available on Github !