# Introduction to Linear Regression (Part 2)

In the previous part of the Introduction to Linear Regression, we discussed simple linear regression.

Simple linear regression is a basic model with just two variables: an independent variable x and a dependent variable y, related by the equation

$y = b_{0} + b_{1} x$

In the real world, simple linear regression is seldom used.

Practically, we rarely find a variable that depends exclusively on a single other variable.

Usually, we need a model in which the dependent variable depends on more than one independent variable.

A model with more than one independent variable is called multiple linear regression.

In this post, we’ll explore two ways to find the best-fit coefficients for a multiple linear regression model:

• Ordinary Least Squares (OLS)

• Gradient Descent

## MULTIPLE LINEAR REGRESSION USING OLS:

The following equation gives multiple linear regression,

$y=\beta_{0}+\beta_{1} * x_{1}+\beta_{2} * x_{2}+\ldots+\beta_{n} * x_{n} + \epsilon$

where x1, x2, …, xn are the independent variables, y is the dependent variable, β0, β1, …, βn are the coefficients, and $\epsilon$ is the residual (error) term of the model.

Each coefficient βi represents the change in the dependent variable (y) for each unit change in the independent variable (xi), holding all the other variables constant.

In simple linear regression, we look for the best-fitting line. In multiple regression, we look for the hyperplane that best fits our data.

One major problem we have to deal with in multiple regression is multicollinearity.

With multiple independent variables, there is a chance that some of them might be correlated.

Multicollinearity is a serious threat to our model: it inflates the variance of the coefficient estimates and makes them unstable. To detect it, we use the Variance Inflation Factor (VIF).

In fact, the absence of multicollinearity is one of the assumptions OLS makes about the model.
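To make the idea concrete, here is a minimal sketch of computing VIFs directly from the definition $\mathrm{VIF}_i = 1/(1 - R_i^2)$, where $R_i^2$ is the R-squared from regressing feature i on the remaining features. The toy dataset and the common warning threshold of 5–10 are illustrative assumptions, not from the original:

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2), where R^2 comes from regressing
    column i on the remaining columns (plus an intercept)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1, x2
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, i) for i in range(X.shape[1])]
# x3's VIF lands far above the common warning threshold of 5-10
```

Because x3 is almost a linear combination of x1 and x2, its VIF blows up, flagging the collinearity.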

### ESTIMATION OF MODEL PARAMETERS:

We know that multiple regression is expressed as,

$y=\beta_{0}+\beta_{1} * x_{1}+\beta_{2} * x_{2}+\ldots+\beta_{n} * x_{n} + \epsilon$

We can rewrite this equation in matrix form as,

$\left[\begin{array}{c}{Y_{1}} \\ {Y_{2}} \\ {\vdots} \\ {Y_{n}}\end{array}\right]=\left[\begin{array}{ccccc}{1} & {X_{11}} & {X_{21}} & {\dots} & {X_{k 1}} \\ {1} & {X_{12}} & {X_{22}} & {\dots} & {X_{k 2}} \\ {\vdots} & {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {1} & {X_{1 n}} & {X_{2 n}} & {\dots} & {X_{k n}}\end{array}\right]\left[\begin{array}{c}{\beta_{0}} \\ {\beta_{1}} \\ {\vdots} \\ {\beta_{k}}\end{array}\right]+\left[\begin{array}{c}{\epsilon_{1}} \\ {\epsilon_{2}} \\ {\vdots} \\ {\epsilon_{n}}\end{array}\right]$

This can be rewritten as,

$Y=X \beta+\varepsilon$

In simple linear regression, we find the best fit line in such a way that the sum of squared residuals is minimum.

We are going to do the same in multiple linear regression also.

That is, we want to minimize the sum of squared residuals, $\sum_{i=1}^{n} \epsilon_{i}^{2}$

In matrix form, this is given as $e^{\prime} e$, where $e=y-X \hat{\beta}$

$$\begin{aligned} e^{\prime} e &=(y-X \hat{\beta})^{\prime}(y-X \hat{\beta}) \\ &=y^{\prime} y-\hat{\beta}^{\prime} X^{\prime} y-y^{\prime} X \hat{\beta}+\hat{\beta}^{\prime} X^{\prime} X \hat{\beta} \\ &=y^{\prime} y-2 \hat{\beta}^{\prime} X^{\prime} y+\hat{\beta}^{\prime} X^{\prime} X \hat{\beta} \end{aligned}$$

To minimize this expression, we take the derivative of $e^{\prime} e$ with respect to $\hat{\beta}$:

$\frac{\partial e^{\prime} e}{\partial \hat{\beta}}=-2 X^{\prime} y+2 X^{\prime} X \hat{\beta}$

Setting this derivative to zero gives the normal equations,

$\left(X^{\prime} X\right) \hat{\beta}=X^{\prime} y$

Finally, multiplying both sides by $\left(X^{\prime} X\right)^{-1}$, we get the OLS estimator of β:

$\hat{\beta}=\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \mathbf{y}$

Let’s implement the OLS method with the help of numpy.

Importing the prerequisite packages

Next, we create a dataset of 200 samples with 7 features using sklearn’s make_regression.

Let’s add the intercept term to the data.

Now, we can calculate the coefficients using the formula we have derived
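Putting those steps together, a minimal sketch might look like this (the noise level and random seed are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression

# 200 samples, 7 features; noise and seed are illustrative choices
X, y = make_regression(n_samples=200, n_features=7, noise=10, random_state=42)

# Prepend a column of ones so that beta_0 acts as the intercept
X = np.column_stack([np.ones(X.shape[0]), X])

# OLS estimator: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
# beta_hat[0] is the intercept; beta_hat[1:] are the slopes
```

In practice, `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq` is preferred over forming the inverse explicitly, for numerical stability.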

However, the problem with this method is that it becomes computationally expensive for large datasets.

Recall that we compute the inverse of $X^{\prime} X$, which is a square matrix of dimension n × n, where n is the number of features.

Inverting such a matrix typically takes on the order of $n^{3}$ operations, so the cost grows quickly as features are added.

So, an alternative approach to estimate the parameters of linear regression is by using a technique called gradient descent.

To understand the gradient descent algorithm, we’ll use the classic analogy of a blindfolded person trying to reach the bottom of a valley.

Assume you are at the top of a mountain with your eyes blindfolded and you want to go to the bottom of the valley.

Now, what will you do to get to the lowest point of the valley?

First, you will take a step in the direction where the ground slopes downward, and then you take another step in the direction of steepest descent.

If you continue the same, eventually, you will reach the lowest point of the valley.

This is precisely what Gradient Descent does.


Initially, we assign random values to the coefficients β.

From there, the algorithm repeatedly adjusts the coefficients, decreasing the cost function until it converges to a minimum.

## MULTIPLE LINEAR REGRESSION USING GRADIENT DESCENT:

To estimate the coefficients β, we minimize a cost function, the Mean Squared Error (MSE), which measures the average of the squared differences between the actual and the predicted values.

The cost function(J) is defined as,

$J(\beta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{a}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$

Where $h_{a}\left(x^{(i)}\right)$ is defined as,

${h_{a}(x)=\beta^{T} x}$

We multiply the cost function by ½ to simplify the derivative calculations.

We can use gradient descent to minimize this cost function. The gradient descent update rule is as follows,

$\beta_{j} :=\beta_{j}-\alpha \frac{\partial}{\partial \beta_{j}} J(\beta)$

Where $\alpha$ is the learning rate and $\frac{\partial}{\partial \beta_{j}} J(\beta)$ is the partial derivative of J with respect to $\beta_{j}$, defined as,

$\frac{\partial}{\partial \beta_{j}} J(\beta)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{a}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$

Substituting this back into the update rule gives

$\beta_{j} :=\beta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{a}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$

Let’s implement multiple linear regression with gradient descent

First, let’s import the prerequisite packages

Next, we create a dataset of 200 samples with 7 features using sklearn’s make_regression.

Let’s add the intercept term to the data.

Now, we can implement our gradient_descent function

We can now call this method to estimate the coefficients.

Let’s plot the costs
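Putting the gradient descent steps together, a minimal sketch might look like this (the learning rate, iteration count, noise level, and seed are all illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression

def gradient_descent(X, y, alpha=0.05, n_iters=1000):
    """Estimate coefficients by minimizing the MSE cost
    J(beta) = 1/(2m) * sum((X beta - y)^2).
    Returns the coefficients and the cost at each iteration."""
    m = len(y)
    beta = np.zeros(X.shape[1])            # start from all-zero coefficients
    costs = []
    for _ in range(n_iters):
        error = X @ beta - y               # h(x) - y for every sample
        costs.append((error @ error) / (2 * m))
        beta -= alpha * (X.T @ error) / m  # simultaneous update of all beta_j
    return beta, costs

# Same dataset setup as in the OLS section
X, y = make_regression(n_samples=200, n_features=7, noise=10, random_state=42)
X = np.column_stack([np.ones(X.shape[0]), X])   # intercept term

beta_hat, costs = gradient_descent(X, y)
# costs[0] is large; costs[-1] is close to the minimum of the cost surface
```

The returned `costs` list can then be plotted, e.g. with matplotlib’s `plt.plot(costs)`, to visualize the convergence.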

As you can see from the preceding plot, the cost starts at around 10000 and gradually decreases as the number of iterations increases.

## SUMMARY:

I hope this article has given an introduction to multiple linear regression.

In this article, we have discussed two methods to estimate the coefficients in multiple linear regression.

In the Ordinary Least Squares (OLS) method, we estimate the coefficients using the formula,

$\hat{{\beta}}=\left({X}^{\prime} {X}\right)^{-1} {X}^{\prime} {y}$

We then discussed why OLS becomes computationally expensive for large numbers of features and introduced an alternative method, gradient descent.

The gradient descent method estimates the coefficients by minimizing the following cost function,

$J(\beta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{a}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$

We also implemented multiple regression from scratch, using both OLS and gradient descent, in Python with numpy.