- Blueclaw Media Ltd
- 46 The Calls
- Leeds
- LS2 7EY
- United Kingdom

June 12, 2020

Over the past few years there has been a real buzz over machine learning and its possible applications, as shown by Google trends data. While machine learning is involved in hugely complex processes like computer vision and self-driving vehicles, it is useful to start with the basics. In this blog post I will be covering the maths involved in simple linear regression and the gradient descent algorithm, as well as applying this to identify a relationship in Google ads data.

This post does assume prior knowledge of python and linear algebra.

Machine learning is defined as the field of study that gives computers the ability to learn without being explicitly programmed. A way of thinking about this is that a program is said to learn if a programs performance on a specific task improves with experience.

For simple linear regression I believe it will be best to demonstrate with our Google ads example. For this post we will just be going over single variable linear regression. To generate the data I simply downloaded a day to day report from one of our clients accounts for the past 3 years which contained, among other variables, cost per conversions and the conversion rate for each day. When exploring the data I thought that it may seem logical to see that higher conversion rates lead to a lower cost per conversion, so let’s explore this further.

The first step is to visualise our data to see if there is indeed a relationship.

import matplotlib.pyplot as plt import pandas as pd import numpy as np data = pd.read_excel("C:\\...\\googleads.xlsx") #save data as numpy arrays of size (n,1), where n is the number of examples X = np.array(data['Conv. rate']) X = np.reshape(X,(len(X),1)) Y = np.array(data['Cost / conv.']) Y = np.reshape(Y,(len(Y),1)) #plot data to investigate relationship plt.figure(1) plt.scatter(X,Y,s=8,marker = 'x') plt.ylabel("Cost Per Conversion") plt.xlabel("Conversion Rate")

Here I am saving our X and Y variables as matrices (or column vectors in this case), the reason for doing this will become apparent later on.

Our plot seems to show a relationship between the two but from the shape of the data it appears it may be a power law.

A power law means our variables follow the relationship *Y = aX ^{k}*, where

So lets plot our data on a logarithmic scale. For reference I will be using a log of base 10 but natural logarithms or any other base are fine.

#suspected power law so lets see what a log log graph looks like X_log = np.log10(X) Y_log = np.log10(Y) plt.figure(2) plt.scatter(X_log, Y_log, s = 8, marker = 'x') plt.ylabel("Log of Cost Per Conversion") plt.xlabel("Log of Conversion Rate")

This seems like a promising correlation. Now to fit our line of best fit.

From the data we can see that a fitted line of the equation *h _{θ}(X) = θ_{0} + θ_{1}x_{1 }*would work well. This is called our hypothesis and has the expected form

To make this simpler to write we define 2 column vectors Θ = [ *θ _{0 }*;

As an aside we could actually fit higher order polynomial terms, such as *x _{1}*

To decide if a fitted line is actually a good fit to our data we need to define a value that indicates this. This is our cost function. A cost function represents a cost associated with fitting that specific line to our data. A common cost function used for linear regression is the mean squared error. Lets go through this step by step.

Say we have a hypothesis *h _{θ}(X) *that has some pre-determined parameter values. For a single examples feature vector,

So our final cost function is:

*J(Θ) = (1/(2m))*∑ _{i}((h_{θ}(X(i)) – Y(i))^{2})*

The idea here is that the values of our parameters that give the lowest possible value of our cost function results in the best possible fit to our data. So in order to find the best values of our parameters we have to somehow minimise our cost function.

Now we know we want to minimise the cost function to obtain our optimised parameters it may be useful to plot out how our cost function varies with *θ _{0 }* and

Our first step in plotting our graph is creating a function to return the value of our cost function for a specific value of *θ _{0 }* and

def cost_function(theta_0, theta_1, X, Y): #build a column vector out of our input parameter values of size (2,1) theta_vec = np.array([[theta_0],[theta_1]]) #create a matrix of a column of ones (for our x_0 parameter) and a column of our example values X_mat = np.concatenate((np.ones((len(X),1)),X),axis=1) #vectorized approach to find our output value for each example hyp = X_mat@theta_vec #vectorized approach to calculate the total cost function cost_func = (1/(2*len(X)))*(((hyp - Y).T)@(hyp - Y)) return cost_func[0,0]

In order to make our calculations as efficient as possible we implement a vectorized approach as above. This removes the need for for loops and makes the function much less computationally expensive. This approach isn’t really necessary for our application as using for loops instead wouldn’t make a massive difference to execution speed. However when moving onto bigger data sets and looking at using something like multivariate linear regression this could make a huge impact on execution speed so it’s good practice to use it now. For clarity, .T and @ in the above code represent the transpose of a matrix and the dot product for matrix multiplication, respectively.

It’s possible to choose a sensible range of parameters just by looking at our log-log graph above. We expect *θ _{0} *to be approximately in the range 0 to 1. We also expect

#generate our parameters vs cost graph data theta_0_arr = [] theta_1_arr = [] cost_func_arr = [] for theta_0 in np.linspace(-3,3, num = 100): for theta_1 in np.linspace(-3,3, num = 100): theta_0_arr.append(theta_0) theta_1_arr.append(theta_1) cost_func_arr.append(cost_function(theta_0, theta_1, X_log, Y_log))

Now for the plot. Our graph has a fairly shallow gradient differential in one direction on our graph so a traditional 3D scatter plot wouldn’t help us much as our minima isn’t obvious. Instead let’s plot a contour plot and see where our approximate minima occurs.

import scipy.interpolate #plot our data onto a contour plot to get an idea of where gradient descent should take us N = 1000 xi = np.linspace(min(theta_0_arr), max(theta_0_arr), N) yi = np.linspace(min(theta_1_arr), max(theta_1_arr), N) zi = scipy.interpolate.griddata((theta_0_arr, theta_1_arr), cost_func_arr, (xi[None,:], yi[:,None]), method = 'cubic') #levels sets out array of levels to set lines at (chosen based around min(cost_func_arr)) levels = np.arange(0,1,0.05) plt.contour(xi, yi, zi, levels = levels) plt.xlabel("Theta 0") plt.ylabel("Theta 1")

Here we are ignoring cost function value levels higher than 1 as this is of no interest to us.

Not the prettiest plot but placing my mouse over where I best believe the centre to be reveals an approximate *θ _{0 }*= 0.24 and

However, we can do much better than a just-by-eye fit. This is where the gradient descent algorithm comes in.

For linear regression the values of our parameters can actually be found numerically and there are other more complex methods which have certain advantages over gradient descent that can also be used. Having said this, the gradient descent algorithm is a simple algorithm that gives a nice intuition into exactly what we are trying to do. So let’s take a look.

The algorithm is to repeat until convergence:

*θ _{j} = θ_{j} – α*partial differential of J(Θ)*

For all parameters *j*, updating all of them simultaneously. Here *α *is a constant. This makes sense logically as we are taking our value for *θ _{j }*and are taking away the gradient of our cost function at

Using *θ _{1}* as an example it is quite straightforward to show that our partial differential is:

*(1/m)∑ _{i}(h_{θ}(X(i)) – Y(i))x_{1}(i)*

Therefore our gradient descent algorithm for *θ _{1}* becomes:

*θ _{1} = θ_{1} – α*(1/m)∑_{i}(h_{θ}(X(i)) – Y(i))x_{1}(i)*

This can actually be vectorized to be used for our Θ vector. This updates all of our parameters simultaneously. So in our code we will repeat

Θ = Θ – *α*(1/m)X _{mat}^{T}(X_{mat}Θ – Y)*

until convergence, where *X _{mat}* is our matrix of ones and example values as defined in our code snippet for the cost function.

So let’s build our gradient descent algorithm in python.

def gradient_descent(initial_theta_0, initial_theta_1, X, Y, alpha, iterations): theta_vec = np.array([[initial_theta_0],[initial_theta_1]]) X_mat = np.concatenate((np.ones((len(X),1)),X),axis=1) for i in range(0,iterations): hyp = X_mat@theta_vec theta_vec = theta_vec - alpha*(1/len(X))*(X_mat.T)@(hyp - Y) return theta_vec

It is important to not choose an alpha value that is too large as this can actually lead to our function diverging as the algorithm completely misses the minima and higher and higher values for the modulus of our differential are attained.

Now let’s run this with a value of 0.1 for alpha and for 10000 iterations. These values were chosen via experimentation and could be improved upon. This is done by creating plots for how different values of alpha influence the cost function value over our iteration range.

init_theta_0 = 10 init_theta_1 = 10 cost_function_init = cost_function(init_theta_0, init_theta_1, X_log, Y_log) optimised_theta = gradient_descent(init_theta_0, init_theta_1, X_log, Y_log, 0.1, 10000) cost_function_at_optimised = cost_function(optimised_theta[0,0],optimised_theta[1,0],X_log, Y_log)

Our initial values for our parameters were chosen at random. For our initial values our cost function value is 35.97. After our gradient descent the cost function dropped to 0.018. Our parameter values at this cost function value are *θ _{0 }*= 0.2309 and

Our hypothesis for our log-log graph is *h _{θ}(X) = 0.2309 – 0.8643x_{1}*. This means our values for our power law coefficients are

So our final relationship from our findings appears to be

*Cost/Conv = 1.702*conv rate ^{-0.8643}*

Here is how our model fits to our data

For a simple linear regression our model didn’t come out too bad. It seems to struggle at extremely low conversion rates with high costs per conversion however captures a general trend in the data. Our model shows that for lower conversion rates just a slight increase in the conversion rate could result in a large decrease in cost for each conversion. Easier said than done I know but CRO is a service offered right here at Blueclaw.

Our model may be improved by capturing further data at these extreme points. Having said that, there are probably other variables that impact our cost per conversion that our model doesn’t account for. We could improve this by moving to something like a multivariate regression approach that may include things like polynomial features and cross-interaction terms.

For reference the above curve has an *R ^{2} *value of 0.79.

The main purpose of this blog post was to introduce the type of maths involved in machine learning. In practice, something like this could be done in a few minutes in excel, however when moving onto more complex analysis and models it’s necessary to move to something like python.

The maths involved above also provide a nice basis for more complex methods like multivariate linear regression, logistic regression and even neural networks. Even then there are plenty of Python libraries like keras, pytorch and sklearn that can build these models rapidly. Still I believe it’s useful to know the kind of maths going on behind the scenes to be able to potentially modify these models and at the very least understand the inputs, outputs and the arguments used in these pre-built functions.

To discuss anything mentioned in this blog post, or to have a chat about how machine learning can be implented into your digital strategy, get in touch.

We use cookies to ensure that we give you the best experience on our website. Please accept cookies for optional performance.AcceptPrivacy policy