In this article, we’ll discuss a supervised machine learning algorithm: logistic regression.
Logistic regression is a probabilistic classifier. It outputs the probability of a point belonging to a specific class, and that output always lies in [0, 1].
Logistic regression is similar to linear regression; the difference is that linear regression models a continuous response variable and cannot be used when the response variable is dichotomous – for example, whether a customer will churn or not, or whether a tumor is malignant or benign.
In such cases, rather than predicting an output directly, logistic regression outputs the probability that a point belongs to a certain class.
Now let’s see an example of why we can’t use linear regression for a dichotomous response variable. Consider the plot below.
The response variable here is dichotomous. It takes only two values, either 0 or 1.
Now if we apply linear regression to the above data, it will look like the following.
As expected for dichotomous values, a linear function is not appropriate for representing the relation between the independent variable and the dependent variable.
The main problem with fitting a linear function to dichotomous values is that the predictions do not always fall in the anticipated range.
If we take a look at the fitted line, we can see that in some cases the predicted value is negative, and as the predictor increases the predicted value climbs above one.
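To see this concretely, here is a minimal sketch (the data values are made up purely for illustration) that fits an ordinary least-squares line to a 0/1 response; the predictions at the ends of the predictor range fall outside [0, 1]:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one predictor and a 0/1 response (illustrative values only)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

line = LinearRegression().fit(x, y)

# Predictions at the extremes of the predictor range fall outside [0, 1]
print(line.predict(np.array([[1], [10]])))  # roughly [-0.13, 1.13]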
Logistic regression uses an S-shaped (sigmoidal) curve to solve this problem.
As seen, the S-shaped curve can illustrate binary logistic probabilities much better.
We can see from the above plot that the predicted value approaches one if the predictor goes towards ∞. Similarly, the predicted value approaches 0 if the predictor goes towards -∞.
The horizontal line at 0.5 is the threshold. Anything above this value is classified as class 1, and anything below it is classified as class 0.
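As a quick illustration, the sigmoid below maps any real number into (0, 1), and applying the 0.5 threshold turns those probabilities into class labels (the input values are only illustrative):

import numpy as np

def sigmoid(z):
    # S-shaped (sigmoidal) curve: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10, -2, 0, 2, 10])
probs = sigmoid(z)
print(probs)                       # close to 0 for large negative z, close to 1 for large positive z
print((probs >= 0.5).astype(int))  # 0.5 threshold -> class labels [0 0 1 1 1]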
Probability, Odds and Log-Odds:
Logistic regression is based on concepts like probability and odds, so before proceeding further, let’s first discuss them.
Probability:
Probability is defined as the number of outcomes of interest divided by the number of all possible outcomes.
P=\frac{\text { outcomes of interest }}{\text { all possible outcomes }}
For example, let’s say we flip a fair coin. The probability of it landing heads is 0.5.
P(\text {heads})=\frac{1}{2}=0.5
Odds:
Odds are defined as the probability of something happening divided by the probability of it not happening.
o d d s=\frac{P}{1-P}
For example, the odds of landing a head when you flip a coin are 1.
o d d s(\text {heads})=\frac{0.5}{0.5}=1
Log-Odds:
If we apply the natural logarithm to the odds, we get the log-odds.
LogOdds=\ln \left(\frac{P}{1-P}\right)
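A quick numeric check of these three quantities, using the fair-coin probability of 0.5 and, for contrast, a made-up probability of 0.8:

import math

for p in (0.5, 0.8):
    odds = p / (1 - p)
    log_odds = math.log(odds)
    print(f"P = {p}: odds = {odds:.2f}, log-odds = {log_odds:.3f}")

# P = 0.5: odds = 1.00, log-odds = 0.000
# P = 0.8: odds = 4.00, log-odds = 1.386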
Logistic Regression:
If you recall, in linear regression the predicted output is given by
Y=a+b \cdot X
where X is the predictor, Y is the response variable, a is the intercept and b is the coefficient.
Similarly, for logistic regression, we could try to write
P=a+b \cdot X
where P is the probability that the point belongs to a specific class.
In linear regression, Y ranges from -∞ to +∞, and the right-hand side a + b·X also ranges from -∞ to +∞.
However, the problem with this logistic regression equation is that the probability P on the left-hand side lies in [0, 1], while the covariates on the right-hand side can take any real value.
As a first step towards making the ranges on both sides match, we transform the probability into odds.
odds=\frac{P}{1-P}=a+b \cdot X
Like probability, odds have a lower bound of 0, but there is no upper bound: odds range from 0 to ∞.
To remove this floor restriction, we take the logarithm of the odds, which ranges from -∞ to +∞.
Now the equation becomes
\ln \left(\frac{P}{1-P}\right)=a+b \cdot X
The log(odds) is called the logit function.
Now we can solve the logit equation for P to obtain,
\frac{P}{1-P}=e^{a+b \cdot X}
P=\frac{e^{a+b \cdot X}}{1+e^{a+b \cdot X}}
P=\frac{1}{1+e^{-(a+b \cdot X)}}
This equation will always give a value between 0 and 1.
We can expand this equation for multiple variables.
P=\frac{1}{1+e^{-\left(a+b_{1} X_{1}+b_{2} X_{2}+b_{3} X_{3}+\ldots+b_{n} X_{n}\right)}}
For the i-th observation x_{i}, we will write this probability compactly as
P\left(x_{i}\right)=\frac{1}{1+e^{-\left(a+b \cdot x_{i}\right)}}
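Here is a minimal sketch of this formula, assuming made-up values for the intercept a, the coefficient vector b and two feature vectors; it is the same computation a fitted model performs when producing probabilities:

import numpy as np

def predict_proba(X, a, b):
    # Logistic model: P = 1 / (1 + exp(-(a + X @ b))) for each row of X
    return 1.0 / (1.0 + np.exp(-(a + X @ b)))

# Illustrative (made-up) intercept, coefficients and data
a = -1.0
b = np.array([0.8, -0.5])
X = np.array([[1.0, 2.0],
              [3.0, 0.5]])

print(predict_proba(X, a, b))  # one probability in (0, 1) per row, roughly [0.23, 0.76]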
Now we need to estimate the parameters a and b. The most common method for estimating them is maximum likelihood estimation.
Let’s formulate the likelihood function.
The probability that y = 1 given x is denoted by P(x).
\operatorname{Pr}(\mathrm{y}=1 | \mathrm{x})=\mathrm{P}(\mathrm{x})
Similarly, the probability that y = 0 given x is 1 – P(x).
\operatorname{Pr}(\mathrm{y}=0 | \mathrm{x})=1-\mathrm{P}(\mathrm{x})
Combining these two we’ll get
\operatorname{Pr}(\mathrm{y} | \mathrm{x})=\mathrm{P}(\mathrm{x})^{\mathrm{y}}*(1-\mathrm{P}(\mathrm{x}))^{1-\mathrm{y}}
As the observations are assumed to be independent, the likelihood function is obtained as the product of these terms over all observations.
L=\prod_{i=1}^{N} P\left(x_{i}\right)^{y_{i}}\left(1-P\left(x_{i}\right)\right)^{1-y_{i}}
Now we’ll take the log of this likelihood function so it is easier to work with.
\log L=\sum_{i=1}^{N} y_{i} \log P\left(x_{i}\right)+\left(1-y_{i}\right) \log \left(1-P\left(x_{i}\right)\right)
This is the log-likelihood function for logistic regression. To estimate the parameters, we need to maximize the log-likelihood.
We can use the Newton-Raphson method to find the maximum likelihood estimates.
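The sketch below (an illustrative implementation on made-up data, not production code) maximizes the log-likelihood with Newton-Raphson updates: at each step the gradient is X^T(y − p) and the Hessian is −X^T W X, where W is a diagonal matrix with entries p(1 − p).

import numpy as np

def fit_logistic_newton(X, y, n_iter=15):
    # Maximize the logistic log-likelihood with Newton-Raphson updates
    X = np.column_stack([np.ones(len(X)), X])   # prepend a column of 1s for the intercept a
    beta = np.zeros(X.shape[1])                 # [a, b1, ..., bn], start at zero
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # current predicted probabilities
        gradient = X.T @ (y - p)                # derivative of the log-likelihood w.r.t. beta
        W = p * (1.0 - p)                       # diagonal entries of the weight matrix
        hessian = -(X * W[:, None]).T @ X       # second-derivative matrix
        beta = beta - np.linalg.solve(hessian, gradient)  # Newton-Raphson step
    return beta

# Tiny made-up example: one predictor, binary response
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 1, 0, 1, 1])
print(fit_logistic_newton(x, y))  # estimated [a, b]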
Now that we have covered what logistic regression is, let’s do some coding.
We’ll apply logistic regression to the breast cancer data set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the breast cancer data set
cancer = load_breast_cancer()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=43)

# Fit the logistic regression model and predict on the test set
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)

# Evaluate the predictions
accuracy = accuracy_score(y_test, pred)
print('Accuracy:', accuracy)
print('Confusion Matrix:\n', confusion_matrix(y_test, pred))
OUTPUT:
Accuracy: 0.9766081871345029
Confusion Matrix:
[[ 55   2]
 [  2 112]]
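Since logistic regression is a probabilistic classifier, we can also inspect the predicted class probabilities directly (continuing with the lr model fitted above); predict() simply applies the 0.5 threshold to these probabilities.

# Probabilities for the first few test points:
# column 0 is the probability of class 0, column 1 the probability of class 1
print(lr.predict_proba(X_test[:5]))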