Before we venture into linear regression, let’s first understand what regression analysis is.

**What is Regression?**

Regression is a statistical approach used for predicting real-valued quantities such as age, weight, or salary.

In regression, we have a dependent variable which we want to predict using some independent variables.

The goal of regression is to find the relationship between the independent variables and the dependent variable, which can then be used to predict the outcome of an event.

In this post, we’ll see one type of regression technique called linear regression.

**LINEAR REGRESSION:**

Linear regression is one of the simplest algorithms in machine learning.

The main objective of this algorithm is to find the straight line which best fits the data.

The best-fit line is chosen so that the overall distance from the data points to the line is as small as possible.

The linear regression model assumes that the dependent variable is linearly related to the independent variables.

There are two types of linear regression:

**Simple linear regression:** If we have a single independent variable, then it is called simple linear regression.

**Multiple linear regression:** If we have more than one independent variable, then it is called multiple linear regression.

In this post, we will concentrate on simple linear regression and implement it from scratch.

**SIMPLE LINEAR REGRESSION:**

If we have an independent variable x and a dependent variable y, then the linear relationship between both the variables can be given by the equation

Y = b0 + b1*X

Where b0 is the y-intercept and b1 is the slope.

There are many techniques to estimate these parameters. One such technique, and the most popular, is Ordinary Least Squares (OLS).

Ordinary least squares finds the parameters of the equation such that the sum of squared errors is minimized.

The squared error is calculated between the actual and predicted values.

The sum of squared errors is given by,

SSE = Σ (y - ŷ)²

where y is the actual value, ŷ is the predicted value, and the sum runs over all data points.
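To make the objective concrete, here is a minimal sketch of the quantity OLS minimizes. The numbers below are made up purely for illustration:

```python
# Tiny illustrative dataset: the points lie exactly on y = 1 + 2*x
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

def sse(b0, b1, xs, ys):
    # Sum of squared differences between actual and predicted values
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))

print(sse(1, 2, x, y))  # the true line gives zero error
print(sse(0, 2, x, y))  # a worse candidate line gives a larger error (4)
```

OLS searches over all candidate values of b0 and b1 for the pair that makes this quantity as small as possible.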

Ordinary least squares gives closed-form estimates for these coefficients. The slope b1 is defined as,

b1 = Σ (x - x̄)(y - ȳ) / Σ (x - x̄)²

and the intercept b0 then follows from it:

b0 = ȳ - b1 * x̄

where x̄ and ȳ are the means of x and y.

Now, let’s implement simple linear regression in Python.

You can download the dataset here.

Let’s take a quick look at the dataset.

We have to predict salaries based on years of experience.

Let’s import the data using pandas

```python
import pandas as pd

data = pd.read_csv('Salary.csv')
X = data.iloc[:, 0]
Y = data.iloc[:, 1]
```

Now let us calculate the coefficients b0 and b1

```python
from numpy import mean

x_mean = mean(X)
y_mean = mean(Y)
b1 = sum((X - x_mean) * (Y - y_mean)) / sum((X - x_mean) ** 2)
b0 = y_mean - (b1 * x_mean)
```

Next, we can use the coefficients to predict the y values

```python
regression_line = []
for x in X:
    regression_line.append((b1 * x) + b0)
```
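As a side note, since the data is held in pandas/NumPy structures, the prediction loop can also be written as a single vectorized array expression. A minimal sketch, with made-up coefficients standing in for the OLS estimates:

```python
import numpy as np

# Illustrative values only; in the post, b0 and b1 come from the OLS step
b0, b1 = 2.0, 3.0
X = np.array([1.0, 2.0, 3.0])

regression_line = b1 * X + b0  # computes all predictions at once
print(regression_line.tolist())  # → [5.0, 8.0, 11.0]
```

Vectorizing avoids an explicit Python loop and is noticeably faster on large datasets.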

Now we can use this list of predicted values to draw the regression line.

```python
import matplotlib.pyplot as plt

plt.scatter(X, Y, color='g')
plt.plot(X, regression_line, color='r')
plt.xlabel('Experience in Years')
plt.ylabel('Salary')
plt.show()
```

The red line represents our model’s predictions, and the green points represent our data.

**GOODNESS OF FIT:**

There are various regression metrics to measure the goodness of the regression line.

I have written an article discussing several important evaluation metrics for regression. If you want to learn more about these metrics, you can read the article on regression evaluation metrics.

In this post, we’ll discuss a metric called R-squared, which is one of the most commonly used regression evaluation metrics.

R-squared measures the proportion of variance in the dependent variable that is explained by the model. It ranges from 0 to 1, where 0 means the model explains none of the variance and 1 means it explains all of it.

If the R-squared value is 0.90, then we can say that the independent variables have explained 90% of the variance in the dependent variable.
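The two boundary cases can be checked directly. The sketch below uses made-up numbers to show that perfect predictions give an R-squared of 1, while simply predicting the mean gives 0:

```python
def r_squared(actual, predicted):
    # R² = 1 - SSE/SST, computed from scratch
    mean_a = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)
    return 1 - sse / sst

y = [2.0, 4.0, 6.0, 8.0]
print(r_squared(y, y))                # perfect fit → 1.0
print(r_squared(y, [5.0] * len(y)))   # predicting the mean → 0.0
```

A model whose predictions are worse than the mean can even produce a negative value, which is a strong sign of a poor fit.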

It is determined by,

R² = 1 - (SSE / SST)

where SSE is the sum of squared errors,

SSE = Σ (y - ŷ)²

and the total sum of squares (SST) measures the variance of the actual values around their mean,

SST = Σ (y - ȳ)²

```python
y_mean = mean(Y)
sst = sum((Y - y_mean) ** 2)
sse = sum((Y - regression_line) ** 2)
print(1 - (sse / sst))
```

```
OUTPUT: 0.95696
```
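If scikit-learn is available, the hand-computed coefficients can be cross-checked against its `LinearRegression` estimator. This is only a sanity-check sketch on synthetic data, not the Salary.csv dataset from the post:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): a noisy line y ≈ 4 + 2.5x
rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)
y = 4.0 + 2.5 * x + rng.normal(0.0, 0.5, size=x.size)

# Closed-form OLS estimates, as derived above
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# scikit-learn's fit of the same model; the two should agree
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(b0, b1)
print(model.intercept_, model.coef_[0])
```

Both approaches solve the same least-squares problem, so the coefficients should match to numerical precision.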

**SUMMARY:**

Finally, to sum up: in this post, we discussed linear regression and one of its types, simple linear regression.

The equation y = b0 + b1*X gives the linear relationship between an independent and a dependent variable.

We also saw how to implement simple linear regression from scratch using ordinary least squares.

Finally, we used R-squared to assess the goodness of fit of our model.

In the next part, we’ll discuss multiple linear regression, an extension of simple linear regression that lets us assess the association between two or more independent variables and the dependent variable.