In the previous part of the Introduction to Linear Regression, we discussed simple linear regression.

Simple linear regression is a basic model with just two variables: an independent variable x and a dependent variable y, related by the equation

y=b_{0}+b_{1} * x

In the real world, simple linear regression is seldom used.

In practice, we rarely find a variable that depends exclusively on just one other variable.

Usually, we have to create a model with a dependent variable which depends on more than one independent variable.

A model with more than one independent variable is called multiple linear regression.

In this post, we’ll explore two ways to find the best-fit coefficients for a multiple linear regression model:

- Ordinary Least Squares (OLS)
- Gradient Descent

**MULTIPLE LINEAR REGRESSION USING OLS:**

The following equation gives multiple linear regression,

y=\beta_{0}+\beta_{1} * x_{1}+\beta_{2} * x_{2}+\ldots+\beta_{n} * x_{n} + \epsilon

where x_{1}, x_{2}, …, x_{n} are the independent variables, y is the dependent variable, β_{0}, β_{1}, …, β_{n} are the coefficients, and \epsilon is the residual (error) term of the model.

Each coefficient β_{i} represents the change in the dependent variable (y) for a unit change in the independent variable (x_{i}), holding all the other variables constant.

In simple linear regression, we look for the best-fitting line. In multiple regression, we look for a hyperplane that best fits our data.

One major problem we have to deal with in multiple regression is multicollinearity.

With multiple independent variables, there is a chance that some of them might be correlated.

Multicollinearity is often a dire threat to our model. To detect this menace, we use the Variance Inflation Factor (VIF).
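As an aside, VIF can be computed with nothing but NumPy: for each feature, regress it on all the other features and set VIF_j = 1 / (1 - R²_j). The sketch below is a minimal illustration of that definition (the function name and data are illustrative, not from the article):

```python
import numpy as np
from numpy.linalg import lstsq

def vif(X):
    """Variance Inflation Factor for each column of X (features only, no intercept)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        # regress feature j on all the other features (with an intercept)
        others = np.hstack([np.ones((n, 1)), np.delete(X, j, axis=1)])
        beta, *_ = lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        # note: r2 == 1 (a perfectly collinear feature) would divide by zero
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)
```

A common rule of thumb is that a VIF above roughly 5 to 10 flags problematic multicollinearity; independent features give VIFs near 1.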

In fact, no multicollinearity is one of the assumptions made by OLS about the model.

There are several assumptions made by the OLS model. I have written an article explaining five such assumptions. To learn more about them you can read my article Assumptions made by OLS.

**ESTIMATION OF MODEL PARAMETERS:**

We know that multiple regression is expressed as,

y=\beta_{0}+\beta_{1} * x_{1}+\beta_{2} * x_{2}+\ldots+\beta_{n} * x_{n} + \epsilon

We can rewrite this equation in matrix form as,

\left[\begin{array}{c}{Y_{1}} \\ {Y_{2}} \\ {\vdots} \\ {Y_{n}}\end{array}\right]=\left[\begin{array}{ccccc}{1} & {X_{11}} & {X_{21}} & {\dots} & {X_{k 1}} \\ {1} & {X_{12}} & {X_{22}} & {\dots} & {X_{k 2}} \\ {\vdots} & {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {1} & {X_{1 n}} & {X_{2 n}} & {\dots} & {X_{k n}}\end{array}\right]\left[\begin{array}{c}{\beta_{0}} \\ {\beta_{1}} \\ {\vdots} \\ {\beta_{k}}\end{array}\right]+\left[\begin{array}{c}{\epsilon_{1}} \\ {\epsilon_{2}} \\ {\vdots} \\ {\epsilon_{n}}\end{array}\right]

This can be rewritten as,

Y=X \beta+\varepsilon

In simple linear regression, we find the best fit line in such a way that the sum of squared residuals is minimum.

We are going to do the same in multiple linear regression also.

That is, we want to minimize the sum of squared residuals, \sum_{i=1}^{N} \epsilon_{i}^{2}

In matrix form, this is given as e^{\prime} e, where e=y-X \hat{\beta}

\begin{aligned} e^{\prime} e &=(y-X \hat{\beta})^{\prime}(y-X \hat{\beta}) \\ &=y^{\prime} y-\hat{\beta}^{\prime} X^{\prime} y-y^{\prime} X \hat{\beta}+\hat{\beta}^{\prime} X^{\prime} X \hat{\beta} \\ &=y^{\prime} y-2 \hat{\beta}^{\prime} X^{\prime} y+\hat{\beta}^{\prime} X^{\prime} X \hat{\beta} \end{aligned}
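In the expansion above, the two cross terms \hat{\beta}^{\prime} X^{\prime} y and y^{\prime} X \hat{\beta} can be combined because each is a scalar and one is the transpose of the other. A quick numerical check of that identity, with arbitrary illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
beta = rng.normal(size=3)

t1 = y @ X @ beta    # y'Xb
t2 = beta @ X.T @ y  # b'X'y
print(np.isclose(t1, t2))  # the two cross terms are the same scalar
```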

To minimize this expression, we take the derivative of e^{\prime} e with respect to \hat{\beta},

\frac{\partial e^{\prime} e}{\partial \hat{\beta}}=-2 X^{\prime} y+2 X^{\prime} X \hat{\beta}

Setting this derivative to zero gives the normal equations,

\left(X^{\prime} X\right) \hat{\beta}=X^{\prime} y

Finally, multiplying both sides by the inverse \left(X^{\prime} X\right)^{-1}, we get the OLS estimator of β:

\hat{\beta}=\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \mathbf{y}

Let’s implement the OLS method with the help of numpy.

First, let's import the prerequisite packages.

```python
import numpy as np
from numpy.linalg import inv
from sklearn.datasets import make_regression
```

Next, we create a dataset of 200 samples with 7 features using sklearn’s make_regression.

```python
X, Y = make_regression(n_samples=200, n_features=7, n_targets=1, random_state=47)
```

Let’s add the intercept term to the data.

```python
intercept = np.ones((X.shape[0], 1))
X = np.hstack((intercept, X))
```

Now, we can calculate the coefficients using the formula we derived.

```python
coeffs = np.dot(inv(np.dot(X.T, X)), np.dot(X.T, Y))
print(coeffs.round(2))
```

```
OUTPUT: [ 0.   72.93 18.28 49.51 10.37 52.03 83.65 56.57]
```
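As a side note, explicitly forming the inverse is usually avoided in practice: solving the normal equations with np.linalg.solve, or calling np.linalg.lstsq on X directly, yields the same coefficients with better numerical stability. A sketch under the same setup as above:

```python
import numpy as np
from sklearn.datasets import make_regression

X, Y = make_regression(n_samples=200, n_features=7, n_targets=1, random_state=47)
X = np.hstack([np.ones((X.shape[0], 1)), X])  # add the intercept column

# solve (X'X) b = X'y without forming the inverse explicitly
coeffs_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# or let lstsq handle conditioning and rank-deficiency
coeffs_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(coeffs_solve, coeffs_lstsq))  # True
```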

However, the problem with this method is that it becomes computationally expensive for large datasets.

Recall that we compute the inverse of X^{\prime} X, a square matrix whose dimension is n × n, where n is the number of model parameters (the features plus the intercept).

Inverting such a square matrix has roughly cubic computational complexity, O(n^{3}).

So, an alternative approach to estimate the parameters of linear regression is by using a technique called gradient descent.

**GRADIENT DESCENT:**

We’ll see the classical analogy of a blindfolded person who is trying to get to the bottom of the valley to understand the gradient descent algorithm.

Assume you are at the top of a mountain with your eyes blindfolded and you want to go to the bottom of the valley.

Now, what will you do to get to the lowest point of the valley?

First, you will take a step in the direction where the land tends to descend, and then take another step in the direction of steepest decline.

If you continue the same, eventually, you will reach the lowest point of the valley.

This is precisely what Gradient Descent does.

Initially, we assign random values for the coefficients b.

From that point onward, the algorithm tries to adjust the value of the coefficient and attempts to decrease the cost function until the algorithm converges to a minimum.

**MULTIPLE LINEAR REGRESSION USING GRADIENT DESCENT:**

To estimate the coefficients (β), we minimize a cost function: the Mean Squared Error (MSE), which measures the average squared difference between the actual and predicted values.

The cost function(J) is defined as,

J(\beta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{a}\left(x^{(i)}\right)-y^{(i)}\right)^{2}

Where h_{a}\left(x^{(i)}\right) is defined as,

{h_{a}(x)=\beta^{T} x}

We multiply the cost function by ½ to simplify the derivative calculations.

We can use gradient descent to minimize this cost function. The gradient descent update rule is as follows,

\beta_{j} :=\beta_{j}-\alpha \frac{\partial}{\partial \beta_{j}} J(\beta)

Where \alpha is the learning rate and \frac{\partial}{\partial \beta_{j}} J(\beta) is the partial derivative of J with respect to \beta_{j}, defined as,

\frac{\partial}{\partial \beta_{j}} J(\beta)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{a}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}

Substituting this back into the update rule gives,

\beta_{j} :=\beta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{a}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
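One way to sanity-check this derived gradient is to compare it against a numerical finite-difference approximation of the cost J. A minimal sketch with illustrative random data (the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])  # intercept + 3 features
y = rng.normal(size=50)
beta = rng.normal(size=4)
m = len(y)

def cost(b):
    # J(b) = (1/2m) * sum of squared errors
    return np.sum((X @ b - y) ** 2) / (2 * m)

# analytic gradient from the derivation: (1/m) * X'(Xb - y)
grad = X.T @ (X @ beta - y) / m

# numerical gradient: central differences, one coordinate at a time
eps = 1e-6
num_grad = np.zeros_like(beta)
for j in range(len(beta)):
    e = np.zeros_like(beta)
    e[j] = eps
    num_grad[j] = (cost(beta + e) - cost(beta - e)) / (2 * eps)

print(np.allclose(grad, num_grad, atol=1e-6))  # True
```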

Let’s implement multiple linear regression with gradient descent.

First, let’s import the prerequisite packages

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
```

Next, we create a dataset of 200 samples with 7 features using sklearn’s make_regression.

```python
X, Y = make_regression(n_samples=200, n_features=7, n_targets=1, random_state=47)
```

Let’s add the intercept term to the data.

```python
intercept = np.ones((X.shape[0], 1))
X = np.hstack((intercept, X))
```

Now, we can implement our gradient_descent function.

```python
def gradient_descent(X, Y):
    m = len(X)
    b = np.zeros(X.shape[1])  # initialize the coefficients to 0
    alpha = 0.01              # learning rate
    iterations = 2500         # max number of iterations
    costs = []
    for i in range(iterations):
        pred = X.dot(b)
        cost = np.mean(np.square(pred - Y)) / 2.0
        costs.append(cost)
        gradient = np.dot(X.T, (pred - Y)) / m
        b = b - alpha * gradient
    return b, costs
```

We can now call this method to estimate the coefficients.

```python
b, costs = gradient_descent(X, Y)
print(b.round(4))
```

```
OUTPUT: [ 0.     72.9311 18.2797 49.5081 10.3699 52.0302 83.6473 56.5662]
```
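As a final sanity check, gradient descent should land on (approximately) the same coefficients as the closed-form OLS estimator on the same data. A minimal self-contained sketch, regenerating the data with an illustrative seed:

```python
import numpy as np
from sklearn.datasets import make_regression

X, Y = make_regression(n_samples=200, n_features=7, n_targets=1, random_state=47)
X = np.hstack([np.ones((X.shape[0], 1)), X])  # add the intercept column

# closed-form OLS via the normal equations
ols = np.linalg.solve(X.T @ X, X.T @ Y)

# gradient descent with the same hyperparameters as above
b = np.zeros(X.shape[1])
for _ in range(2500):
    b -= 0.01 * X.T @ (X @ b - Y) / len(Y)

print(np.allclose(b, ols, atol=1e-3))  # True
```

The agreement is expected here because the cost is convex, so gradient descent with a suitable learning rate converges to the unique OLS minimum.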

Let’s plot the costs.

```python
plt.title('Cost Function')
plt.xlabel('No. of Iterations')
plt.ylabel('Cost')
plt.plot(costs)
plt.show()
```

As you can see from the preceding plot, the cost starts at around 10,000 and gradually decreases as the number of iterations increases.

**SUMMARY:**

I hope this article has given an introduction to multiple linear regression.

In this article, we have discussed two methods to estimate the coefficients in multiple linear regression.

In the Ordinary Least Squares (OLS) method, we estimate the coefficients using the formula,

\hat{{\beta}}=\left({X}^{\prime} {X}\right)^{-1} {X}^{\prime} {y}

We then discussed why OLS becomes computationally expensive for large feature sets and introduced gradient descent as an alternative.

The gradient descent method estimates the coefficients by minimizing the following cost function,

J(\beta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{a}\left(x^{(i)}\right)-y^{(i)}\right)^{2}

We also implemented multiple regression using both OLS and gradient descent from scratch in Python using NumPy.