Introduction:
In this tutorial, we’ll discuss how to build a linear regression model using statsmodels.
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and exploring the data.
Statsmodel is built explicitly for statistics; therefore, it provides a rich output of statistical information.
Before we build a linear regression model, let’s briefly recap Linear Regression.
In general, regression is a statistical technique which is used to investigate the relationship between variables.
The main objective of linear regression is to find a straight line which best fits the data.
The best fit line is chosen such that the distance from the line to all the points is minimum.
There are two types of linear regression, Simple and Multiple linear regression.
Simple Linear Regression:
If we have a single independent variable, then it is called simple linear regression.
For an independent variable x and a dependent variable y, the linear relationship between both the variables is given by the equation,
Y=b 0+b 1 * X
Where b0 is the y-intercept and b1 is the slope.
Multiple Linear Regression:
If we have more than one independent variable, then it is called multiple linear regression.
Multiple regression is given by the equation,
y=\beta_{0}+\beta_{1} * x_{1}+\beta_{2} * x_{2}+\ldots+\beta_{n} * x_{n}+\epsilon
where x1, x2, …, xn are independent variables, y is the dependent variable and β0, β1, …, β2 are coefficients and ϵ\epsilonϵ is the residual terms of the model.
If you want to learn more about linear regression and implement it from scratch, you can read my article Introduction to Linear Regression.
Now let’s use the statsmodels to build a linear regression model.
Linear Regression Using Statsmodels:
There are two ways in how we can build a linear regression using statsmodels; using statsmodels.formula.api or by using statsmodels.api
First, let’s import the necessary packages.
1 2 |
import statsmodels.formula.api as smf import pandas as pd |
Now we can import the dataset. You can download the dataset using the following link.
The dataset contains information on sales of a product in 200 different markets, together with advertising budgets in each of these markets for various media channels: TV, radio, and newspaper.
1 2 |
df = pd.read_csv('Advertising.csv', index_col=0) df.head() |
We’ll now run a linear regression on the data using the OLS function of the statsmodel.formula.api module.
1 |
lr = smf.ols(formula='sales ~ TV + radio + newspaper', data=df) |
The formula notation has two parts, where the name left to the tilde(~) indicates the response variable and the variable name to the right of the tilde is the predictor.
Now we can fit the data to the model by calling the fit method.
1 |
fitted_model = lr.fit() |
Let’s view the detailed statistics of the model.
1 |
fitted_model.summary() |
In the above diagram, we have the results of our linear regression model. As you can see, it provides a comprehensive output with various statistics about the fit of our model.
The first table is broken down into two columns. Let’s start by explaining the variables in the left column first.
Dep. Variable: It just tells us what the response variable was
Model: It reminds us of the model we have fitted
Method: How the parameters of the model were fitted
No. Of Observations: Number of observations used
DF Residuals: Degrees of freedom of the residuals, which is the sample size minus the number of parameters being estimated
DF Model: The number of estimated parameters in the model, without taking the constant into account
The right part of the first table gives information about the fit of our model
R-Squared: Also known as the coefficient of determination measures the goodness of fit
Adj. R-Squared: R-Squared adjusted for the number of predictor variables
F-Statistic: is the ratio of explained variance of the model by unexplained variance.
Prob(F-statistic): F-statistic transformed into a probability
AIC: Akaike Information Criterion, assesses model on the basis of the number of observations and the complexity of the model.
BIC: Bayesian Information Criterion, similar to AIC, but penalizes model more severely than AIC.
The second table gives us information about the coefficients and a few statistical tests
coeff: The estimated coefficient
std err: Represents the standard error of the coefficient
t: The t-statistic value, for testing the null hypothesis that the predictor coefficient is zero.
P > |t|: The p-value, if the p-value is <0.05, then that variable is statistically significant.
[95.0% Conf. Interval]: The lower and upper values of the coefficient, taking into account a 95% confidence interval
The last table gives us information about the distribution of residuals.
Skewness: A measure of the asymmetry of the distribution
Kurtosis: A measure of how peaked a distribution is
Omnibus D’Angostino’s test: A test of the skewness and kurtosis that indicates the normalcy of a distribution
Prob(Omnibus): Indicates the probability of the normality of a distribution
Jarque-Bera: Like the Omnibus, it tests for skewness and kurtosis
Prob (JB): JB statistic transformed into a probability
Durbin-Watson: A test for autocorrelation(occurs when there is a correlation between the error values)
Cond. No: A test for multicollinearity(occurs when the independent variables are highly correlated)
Now let’s use the statsmodel.api module to build the model
First, we need to define the X and y variables. The X will have the predictors, and the y variable will have the response variable.
1 2 3 |
import statsmodels.api as sm X = df.copy() y = X.pop('sales') |
As a second step, we need to add an intercept to the data. Unlike the formula API, where the intercept is added automatically, here we need to add it manually.
1 |
X = sm.add_constant(X) |
Now we can initialize the OLS and call the fit method to the data.
1 2 |
lr2 = sm.OLS(y, X) fitted_model2 = lr2.fit() |
Now we can call the summary() method.
1 |
fitted_model2.summary() |
SUMMARY:
In this article, you have learned how to build a linear regression model using statsmodels.
Statsmodels is an extraordinarily helpful package in python for statistical modeling. Since it is built explicitly for statistics; therefore, it provides a rich output of statistical information.
We can either use statsmodel.formula.api or statsmodel.api to build a linear regression model.
We also build a linear regression model using both of them and also discussed how to interpret the results.