Introduction:
Ordinary Least Squares (OLS) is a commonly used technique for linear regression analysis. OLS makes certain assumptions about the data, such as linearity, no multicollinearity, no autocorrelation, homoscedasticity, and normally distributed errors.
Violating these assumptions may reduce the validity of the results produced by the model.
There are several statistical tests to check whether these assumptions hold true.
In this article, we’ll discuss the assumptions made by OLS in detail and see how to test them in Python.
We’ll start by importing the necessary packages and loading the Boston housing dataset.
import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd
# numpy, matplotlib and seaborn are used for the diagnostic plots below
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
y = boston.target
We’ll use statsmodels to build a linear regression model. You can learn more about statsmodels by reading the article Introduction to Statsmodels.
X_constant = sm.add_constant(X)
lr = sm.OLS(y, X_constant).fit()
print(lr.summary())
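For intuition, the coefficients in the summary are the ordinary least squares solution, i.e., the values that minimize the sum of squared residuals. A minimal sketch (using the variables defined above) verifying this with NumPy:

# np.linalg.lstsq computes the same least squares solution as statsmodels,
# so the two sets of coefficients should match.
beta_hat, *_ = np.linalg.lstsq(X_constant, y, rcond=None)
print(beta_hat)    # should match lr.params
print(lr.params)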
Linearity:
The first assumption we are going to see is linearity.
This assumption states that the relationship between the dependent and the independent variable should be linear.
We can test this assumption with a simple scatter plot.
A plot of fitted values versus residuals can also be used to test this assumption; it reveals whether there are non-linear patterns in the residuals, and thus in the data as well.
fig, ax = plt.subplots(1, 1)
sns.residplot(lr.predict(), y, lowess=True, scatter_kws={'alpha': 0.5},
              line_kws={'color': 'red'}, ax=ax)
ax.title.set_text('Residuals vs Fitted')
ax.set(xlabel='Fitted', ylabel='Residuals')
We can clearly see a pattern in the plot indicating non-linearity. An ideal plot will have the residuals spread equally around the horizontal line.
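Besides the visual check, statsmodels also ships a formal check of linearity, the Rainbow test; here is an optional sketch (the import path may differ between statsmodels versions):

# Rainbow test for linearity: H0 is that the fit is linear.
# A small p-value suggests the linear specification is inadequate.
from statsmodels.stats.diagnostic import linear_rainbow

f_stat, p_value = linear_rainbow(lr)
print('Rainbow test: F =', f_stat, ', p =', p_value)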
No Multicollinearity:
Multiple linear regression assumes that there is no multicollinearity in the data.
Multicollinearity occurs when the independent variables are highly correlated, i.e., the independent variables depend on each other.
Multicollinearity is often a serious threat to our model. To detect multicollinearity among the variables we can use the Variance Inflation Factor (VIF).
If the VIF is high for an independent variable, then there is a chance that it is already explained by the other variables.
Independent variables with a VIF greater than 10 may be correlated with other variables, while a VIF greater than 100 indicates definite multicollinearity, and such variables should be eliminated.
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
vif["features"] = boston.feature_names
print(vif)
OUTPUT:
    VIF Factor  features
0     2.100373      CRIM
1     2.844013        ZN
2    14.485758     INDUS
3     1.152952      CHAS
4    73.894947       NOX
5    77.948283        RM
6    21.386850       AGE
7    14.699652       DIS
8    15.167725       RAD
9    61.227274       TAX
10   85.029547   PTRATIO
11   20.104943         B
12   11.102025     LSTAT
There is no serious violation of this assumption. Still, there are variables like NOX, RM, and PTRATIO with high VIF values, and these may require further investigation.
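For intuition, the VIF of a single feature is 1 / (1 − R²), where R² comes from regressing that feature on the remaining features. A minimal sketch for the first feature (CRIM), mirroring what variance_inflation_factor does on this design matrix:

# Regress CRIM on the other columns and convert the R-squared to a VIF.
r2_crim = sm.OLS(X[:, 0], X[:, 1:]).fit().rsquared
print('VIF for CRIM:', 1.0 / (1.0 - r2_crim))   # ~2.10, matching the table above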
Errors should be normally distributed:
Another assumption made by OLS is that the errors are normally distributed. Plotting the residuals will show their distribution.
A Q-Q plot is a graphical technique used to determine whether the residuals are normally distributed.
If the residuals are normally distributed, the points will fall along a straight diagonal line.
fig, ax = plt.subplots(1, 1)
sm.ProbPlot(lr.resid).qqplot(line='s', color='#1f77b4', ax=ax)
ax.title.set_text('QQ Plot')
From the above plot, it looks like many points fall away from the red line, indicating that the errors are not normally distributed.
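If a numeric check is preferred over the visual one, tests such as Jarque-Bera and Shapiro-Wilk can be applied to the residuals; a small sketch using scipy (an extra dependency not imported above):

# Formal normality tests on the residuals.
# Small p-values mean the normality assumption is rejected.
from scipy import stats

jb_stat, jb_p = stats.jarque_bera(lr.resid)
sw_stat, sw_p = stats.shapiro(lr.resid)
print('Jarque-Bera p-value:', jb_p)
print('Shapiro-Wilk p-value:', sw_p)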
No Autocorrelation:
Autocorrelation occurs when the error terms are correlated with each other. It is also called serial correlation.
It mostly occurs in time series data, because observations collected over time are more likely to depend on one another.
One of the common tests for autocorrelation of residuals is the Durbin-Watson test.
The statistic d ranges from 0 to 4. A value of 2 indicates that there is no autocorrelation, values approaching 4 indicate negative autocorrelation, and values close to 0 indicate positive autocorrelation.
The Durbin-Watson statistic is printed as part of the statsmodels summary.
The Durbin-Watson score for this model is 1.078, which indicates positive autocorrelation.
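The same statistic can also be computed directly from the residuals; a short sketch:

# Durbin-Watson statistic computed straight from the residuals.
from statsmodels.stats.stattools import durbin_watson

print('Durbin-Watson:', durbin_watson(lr.resid))   # ~1.078 for this model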
Homoscedasticity:
This tongue twister simply means that the variance of the error terms should be constant across the values of the independent variables.
We can use the scale-location plot to test this assumption.
If the plot shows a clear pattern, such as a cone shape, then the errors are not homoscedastic. The absence of homoscedasticity is called heteroscedasticity.
The presence of outliers in the data is one of the reasons homoscedasticity may not hold.
Violating this assumption can produce unreliable or biased standard errors for the coefficients.
fig, ax = plt.subplots(1, 1)
standardized_resid1 = np.sqrt(np.abs(lr.get_influence().resid_studentized_internal))
sns.regplot(lr.predict(), standardized_resid1, color='#1f77b4', lowess=True,
            scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'}, ax=ax)
ax.title.set_text('Scale Location')
ax.set(xlabel='Fitted', ylabel='Standardized Residuals')
We can see that the variance of the residuals is not random. Even though the plot is not cone-shaped, we can still notice a pattern.
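A formal alternative to the visual check is the Breusch-Pagan test; an optional sketch:

# Breusch-Pagan test: H0 is homoscedasticity, so a small p-value
# indicates heteroscedasticity.
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_p, f_stat, f_p = het_breuschpagan(lr.resid, X_constant)
print('Breusch-Pagan LM p-value:', lm_p)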
SUMMARY:
In this tutorial, we discussed several assumptions made by OLS: linearity, no multicollinearity, no autocorrelation, homoscedasticity, and normally distributed errors.
We then tested whether these assumptions hold on the Boston housing dataset.
Even though it is not an assumption, it is essential to check for the presence of outliers.
The existence of outliers in our data can lead to violations of some of the assumptions mentioned above. For example, the presence of outliers in the data may lead to heteroskedasticity.
It is better to remove outliers, since they affect the precision of the regression estimates.
Techniques like the box plot can be used to detect and remove outliers from the data, as sketched below.
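As a minimal sketch, the usual box-plot (IQR) rule can be applied to the target variable to flag potential outliers; the 1.5 multiplier is the common convention:

# Flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
outliers = (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)
print('Potential outliers in y:', outliers.sum())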