Over the past few years there has been a real buzz over machine learning and its possible applications, as shown by Google trends data. While machine learning is involved in hugely complex processes like computer vision and self-driving vehicles, it is useful to start with the basics. In this blog post I will be covering the maths involved in simple linear regression and the gradient descent algorithm, as well as applying this to identify a relationship in Google ads data.
This post does assume prior knowledge of python and linear algebra.
Machine learning is defined as the field of study that gives computers the ability to learn without being explicitly programmed. A way of thinking about this is that a program is said to learn if a programs performance on a specific task improves with experience.
For simple linear regression I believe it will be best to demonstrate with our Google ads example. For this post we will just be going over single variable linear regression. To generate the data I simply downloaded a day to day report from one of our clients accounts for the past 3 years which contained, among other variables, cost per conversions and the conversion rate for each day. When exploring the data I thought that it may seem logical to see that higher conversion rates lead to a lower cost per conversion, so let’s explore this further.
The first step is to visualise our data to see if there is indeed a relationship.
import matplotlib.pyplot as plt import pandas as pd import numpy as np data = pd.read_excel("C:\\...\\googleads.xlsx") #save data as numpy arrays of size (n,1), where n is the number of examples X = np.array(data['Conv. rate']) X = np.reshape(X,(len(X),1)) Y = np.array(data['Cost / conv.']) Y = np.reshape(Y,(len(Y),1)) #plot data to investigate relationship plt.figure(1) plt.scatter(X,Y,s=8,marker = 'x') plt.ylabel("Cost Per Conversion") plt.xlabel("Conversion Rate")
Here I am saving our X and Y variables as matrices (or column vectors in this case), the reason for doing this will become apparent later on.
Our plot seems to show a relationship between the two but from the shape of the data it appears it may be a power law.
A power law means our variables follow the relationship Y = aXk, where a and k are constants . The easiest way to find our coefficients a and k is to plot our values on a log-log graph instead. When data that follows a power law is fitted on a log-log graph the result is a linear graph. This is due to the fact that if we take logs of our power law equation we yield log(Y) = klog(X) + log(a) which resembles the well known linear equation Y = mX + c. Here k = m and c = log(a). The base that we choose for our logarithm doesn’t matter, you just need to take note of it to pull out a from c = log(a) after we calculate our coefficient c.
So lets plot our data on a logarithmic scale. For reference I will be using a log of base 10 but natural logarithms or any other base are fine.
#suspected power law so lets see what a log log graph looks like X_log = np.log10(X) Y_log = np.log10(Y) plt.figure(2) plt.scatter(X_log, Y_log, s = 8, marker = 'x') plt.ylabel("Log of Cost Per Conversion") plt.xlabel("Log of Conversion Rate")
This seems like a promising correlation. Now to fit our line of best fit.
From the data we can see that a fitted line of the equation hθ(X) = θ0 + θ1x1 would work well. This is called our hypothesis and has the expected form Y = c + mX.
To make this simpler to write we define 2 column vectors Θ = [ θ0 ; θ1 ] and X = [ x0 ; x1 ] , where x0 = 1. Here our Θ vector is a vector of our parameters that we need to fit. Our X vector is a vector of our features and in this example we only have a single feature, the logarithm of our conversion rate. Defining these vectors allows us to write our hypothesis as hθ(X) = ΘTX. Here ΘT is the transpose of the Θ vector and usual matrix multiplication rules apply.
As an aside we could actually fit higher order polynomial terms, such as x12, and it would still be classed as linear regression as our parameters still scale linearly. For example if hθ(X) = θ0 + θ1x1+θ2x2, our x2 feature could equal x12. To decide if these polynomial features should be included in your model its useful to explore things like residual plots, bias/variance curves and statistical fitting parameters such as the R2 value, however to keep the blog short I won’t go into that exploration for this data.
To decide if a fitted line is actually a good fit to our data we need to define a value that indicates this. This is our cost function. A cost function represents a cost associated with fitting that specific line to our data. A common cost function used for linear regression is the mean squared error. Lets go through this step by step.
Say we have a hypothesis hθ(X) that has some pre-determined parameter values. For a single examples feature vector, X(i), our hypothesis has an output of hθ(X(i)). We first look at the difference between this output and the actual expected output, Y(i), to obtain (hθ(X(i)) – Y(i)). We then square this to avoid negative values which would make this cost function useless for our purpose, (hθ(X(i)) – Y(i))2. Now we want to sum over all of our examples and find a mean, (1/(2m))*∑i((hθ(X(i)) – Y(i))2). Here m is the number of examples and we’ve also multiplied by a half for simplicity in later steps.
So our final cost function is:
J(Θ) = (1/(2m))*∑i((hθ(X(i)) – Y(i))2)
The idea here is that the values of our parameters that give the lowest possible value of our cost function results in the best possible fit to our data. So in order to find the best values of our parameters we have to somehow minimise our cost function.
Now we know we want to minimise the cost function to obtain our optimised parameters it may be useful to plot out how our cost function varies with θ0 and θ1. One thing we know about our cost function is that it will always be convex and only have a single minima which makes our life easier.
Our first step in plotting our graph is creating a function to return the value of our cost function for a specific value of θ0 and θ1. I will add comments to our code to explain each step.
def cost_function(theta_0, theta_1, X, Y): #build a column vector out of our input parameter values of size (2,1) theta_vec = np.array([[theta_0],[theta_1]]) #create a matrix of a column of ones (for our x_0 parameter) and a column of our example values X_mat = np.concatenate((np.ones((len(X),1)),X),axis=1) #vectorized approach to find our output value for each example hyp = X_mat@theta_vec #vectorized approach to calculate the total cost function cost_func = (1/(2*len(X)))*(((hyp - Y).T)@(hyp - Y)) return cost_func[0,0]
In order to make our calculations as efficient as possible we implement a vectorized approach as above. This removes the need for for loops and makes the function much less computationally expensive. This approach isn’t really necessary for our application as using for loops instead wouldn’t make a massive difference to execution speed. However when moving onto bigger data sets and looking at using something like multivariate linear regression this could make a huge impact on execution speed so it’s good practice to use it now. For clarity, .T and @ in the above code represent the transpose of a matrix and the dot product for matrix multiplication, respectively.
It’s possible to choose a sensible range of parameters just by looking at our log-log graph above. We expect θ0 to be approximately in the range 0 to 1. We also expect θ1 to definitely be negative and approximately be between 0 and -1. Having said this lets choose a bigger range for a nicer cost function graph.
#generate our parameters vs cost graph data theta_0_arr =  theta_1_arr =  cost_func_arr =  for theta_0 in np.linspace(-3,3, num = 100): for theta_1 in np.linspace(-3,3, num = 100): theta_0_arr.append(theta_0) theta_1_arr.append(theta_1) cost_func_arr.append(cost_function(theta_0, theta_1, X_log, Y_log))
Now for the plot. Our graph has a fairly shallow gradient differential in one direction on our graph so a traditional 3D scatter plot wouldn’t help us much as our minima isn’t obvious. Instead let’s plot a contour plot and see where our approximate minima occurs.
import scipy.interpolate #plot our data onto a contour plot to get an idea of where gradient descent should take us N = 1000 xi = np.linspace(min(theta_0_arr), max(theta_0_arr), N) yi = np.linspace(min(theta_1_arr), max(theta_1_arr), N) zi = scipy.interpolate.griddata((theta_0_arr, theta_1_arr), cost_func_arr, (xi[None,:], yi[:,None]), method = 'cubic') #levels sets out array of levels to set lines at (chosen based around min(cost_func_arr)) levels = np.arange(0,1,0.05) plt.contour(xi, yi, zi, levels = levels) plt.xlabel("Theta 0") plt.ylabel("Theta 1")
Here we are ignoring cost function value levels higher than 1 as this is of no interest to us.
Not the prettiest plot but placing my mouse over where I best believe the centre to be reveals an approximate θ0 = 0.24 and θ1 = -0.85.
However, we can do much better than a just-by-eye fit. This is where the gradient descent algorithm comes in.
For linear regression the values of our parameters can actually be found numerically and there are other more complex methods which have certain advantages over gradient descent that can also be used. Having said this, the gradient descent algorithm is a simple algorithm that gives a nice intuition into exactly what we are trying to do. So let’s take a look.
The algorithm is to repeat until convergence:
θj = θj – α*partial differential of J(Θ)
For all parameters j, updating all of them simultaneously. Here α is a constant. This makes sense logically as we are taking our value for θj and are taking away the gradient of our cost function at θj. This means that on our graph we are travelling down the contour lines and, if we repeat the process enough, we will eventually arrive at our minima.
Using θ1 as an example it is quite straightforward to show that our partial differential is:
(1/m)∑i(hθ(X(i)) – Y(i))x1(i)
Therefore our gradient descent algorithm for θ1 becomes:
θ1 = θ1 – α*(1/m)∑i(hθ(X(i)) – Y(i))x1(i)
This can actually be vectorized to be used for our Θ vector. This updates all of our parameters simultaneously. So in our code we will repeat
Θ = Θ – α*(1/m)XmatT(XmatΘ – Y)
until convergence, where Xmat is our matrix of ones and example values as defined in our code snippet for the cost function.
So let’s build our gradient descent algorithm in python.
def gradient_descent(initial_theta_0, initial_theta_1, X, Y, alpha, iterations): theta_vec = np.array([[initial_theta_0],[initial_theta_1]]) X_mat = np.concatenate((np.ones((len(X),1)),X),axis=1) for i in range(0,iterations): hyp = X_mat@theta_vec theta_vec = theta_vec - alpha*(1/len(X))*(X_mat.T)@(hyp - Y) return theta_vec
It is important to not choose an alpha value that is too large as this can actually lead to our function diverging as the algorithm completely misses the minima and higher and higher values for the modulus of our differential are attained.
Now let’s run this with a value of 0.1 for alpha and for 10000 iterations. These values were chosen via experimentation and could be improved upon. This is done by creating plots for how different values of alpha influence the cost function value over our iteration range.
init_theta_0 = 10 init_theta_1 = 10 cost_function_init = cost_function(init_theta_0, init_theta_1, X_log, Y_log) optimised_theta = gradient_descent(init_theta_0, init_theta_1, X_log, Y_log, 0.1, 10000) cost_function_at_optimised = cost_function(optimised_theta[0,0],optimised_theta[1,0],X_log, Y_log)
Our initial values for our parameters were chosen at random. For our initial values our cost function value is 35.97. After our gradient descent the cost function dropped to 0.018. Our parameter values at this cost function value are θ0 = 0.2309 and θ1 = -0.8643. These values are around the values we expected form our cost function plot and are not far off from our just-by-eye fit values.
Our hypothesis for our log-log graph is hθ(X) = 0.2309 – 0.8643x1. This means our values for our power law coefficients are k = -0.8643 and log(a) = 0.2309 therefore a = 1.702.
So our final relationship from our findings appears to be
Cost/Conv = 1.702*conv rate-0.8643
Here is how our model fits to our data
For a simple linear regression our model didn’t come out too bad. It seems to struggle at extremely low conversion rates with high costs per conversion however captures a general trend in the data. Our model shows that for lower conversion rates just a slight increase in the conversion rate could result in a large decrease in cost for each conversion. Easier said than done I know but CRO is a service offered right here at Blueclaw.
Our model may be improved by capturing further data at these extreme points. Having said that, there are probably other variables that impact our cost per conversion that our model doesn’t account for. We could improve this by moving to something like a multivariate regression approach that may include things like polynomial features and cross-interaction terms.
For reference the above curve has an R2 value of 0.79.
The main purpose of this blog post was to introduce the type of maths involved in machine learning. In practice, something like this could be done in a few minutes in excel, however when moving onto more complex analysis and models it’s necessary to move to something like python.
The maths involved above also provide a nice basis for more complex methods like multivariate linear regression, logistic regression and even neural networks. Even then there are plenty of Python libraries like keras, pytorch and sklearn that can build these models rapidly. Still I believe it’s useful to know the kind of maths going on behind the scenes to be able to potentially modify these models and at the very least understand the inputs, outputs and the arguments used in these pre-built functions.
To discuss anything mentioned in this blog post, or to have a chat about how machine learning can be implented into your digital strategy, get in touch.