Logistic regression is a widely used statistical method for binary classification problems, where the outcome or target variable is categorical and typically has two possible values (e.g., 0/1, True/False, Yes/No). Unlike linear regression, which predicts continuous outcomes, logistic regression predicts the probability of a categorical outcome and maps it to a binary decision.
Key Concepts:
- Logistic Function (Sigmoid Function): The logistic regression model uses the logistic function, also known as the sigmoid function, to map predicted values to probabilities: $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$. The output of the logistic function is a value between 0 and 1, which represents the probability that the given input belongs to the positive class.
- Decision Boundary: The model predicts the positive class (e.g., 1) if the probability is greater than a certain threshold (commonly 0.5) and the negative class (e.g., 0) otherwise.
- Loss Function: Logistic regression uses the log-loss (binary cross-entropy) as its loss function, which is minimized during training: $\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$, where $y_i$ is the actual label and $\hat{y}_i$ is the predicted probability.
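The three concepts above can be traced by hand on a tiny example. The sketch below is only illustrative: the coefficients `beta_0` and `beta`, the inputs `X`, and the labels `y` are made-up values, and the helper names `sigmoid` and `log_loss` are chosen here rather than taken from any library.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob):
    """Binary cross-entropy averaged over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log(0) can never occur
    y_prob = np.clip(np.asarray(y_prob, dtype=float), 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Hypothetical coefficients and data, chosen only for illustration
beta_0 = -1.0
beta = np.array([2.0, -0.5])
X = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, 1.0]])
y = np.array([1, 0, 1])

z = beta_0 + X @ beta              # linear combination, one value per row
probs = sigmoid(z)                 # probability of the positive class
preds = (probs > 0.5).astype(int)  # decision boundary at threshold 0.5
loss = log_loss(y, probs)

print("z:", z)
print("probabilities:", probs)
print("predictions:", preds)
print("log loss:", loss)
```

Because the sigmoid equals 0.5 exactly at z = 0, thresholding the probability at 0.5 is the same as checking the sign of z.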
Applications:
- Spam Detection: Classifying emails as spam or not spam.
- Customer Churn Prediction: Predicting whether a customer will leave or stay with a company.
- Medical Diagnosis: Predicting whether a patient has a certain disease (e.g., diabetes, cancer) based on medical test results.
Example Implementation in Python:
Let’s implement logistic regression using the scikit-learn library and evaluate its performance on a synthetic dataset.
Step 1: Import Necessary Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Generate or Load Data
We’ll generate a synthetic binary classification dataset for this example.
from sklearn.datasets import make_classification
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
# Convert to DataFrame for better readability (optional)
data = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(1, 11)])
data['Target'] = y
Step 3: Split Data into Training and Testing Sets
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Train the Logistic Regression Model
# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
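To connect the fitted model back to the sigmoid formula, the following sketch rebuilds the predicted probabilities from the learned intercept and coefficients. It assumes the same synthetic data and default settings as the steps above; the names `z`, `manual_probs`, and `sklearn_probs` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the walkthrough
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# z = beta_0 + beta_1 * X_1 + ... + beta_n * X_n, computed for every test row
z = model.intercept_[0] + X_test @ model.coef_[0]
manual_probs = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba holds the probability of the positive class
sklearn_probs = model.predict_proba(X_test)[:, 1]

print("Learned intercept:", model.intercept_[0])
print("Learned coefficients:", model.coef_[0])
print("Manual and library probabilities agree:", np.allclose(manual_probs, sklearn_probs))
```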
Step 5: Make Predictions
# Make predictions on the test data
y_pred = model.predict(X_test)
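`predict` returns hard class labels. If you need the underlying probabilities, or a threshold other than 0.5, a sketch along these lines may help. The 0.7 threshold and the names `default_pred` and `strict_pred` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the walkthrough
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Probability of the positive class (column 1) for each test sample
probs = model.predict_proba(X_test)[:, 1]

# Thresholding at 0.5 reproduces the labels returned by model.predict
default_pred = (probs > 0.5).astype(int)

# A stricter, arbitrary threshold: predict 1 only when the model is more confident
strict_pred = (probs > 0.7).astype(int)

print("Positives at threshold 0.5:", default_pred.sum())
print("Positives at threshold 0.7:", strict_pred.sum())
```

Raising the threshold generally trades recall for precision on the positive class, which can matter when false positives are costly.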
Step 6: Evaluate the Model
We can evaluate the performance of our logistic regression model using various metrics:
- Accuracy: The ratio of correctly predicted instances to the total instances.
- Confusion Matrix: A table of actual versus predicted classes that counts the correct and incorrect predictions for each class.
- Classification Report: Includes precision, recall, and F1-score for each class.
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)
Step 7: Interpretation of Results
After running the above code, you might see output like this:
Accuracy: 0.89
Confusion Matrix:
[[89 11]
 [12 88]]
Classification Report:
              precision    recall  f1-score   support
           0       0.88      0.89      0.88       100
           1       0.89      0.88      0.89       100
    accuracy                           0.89       200
   macro avg       0.89      0.89      0.89       200
weighted avg       0.89      0.89      0.89       200
- Accuracy: The model correctly predicted the outcome 89% of the time.
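- Confusion Matrix: Shows the number of true negatives, false positives, false negatives, and true positives. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so the example above has 89 true negatives, 11 false positives, 12 false negatives, and 88 true positives.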
- Classification Report: Provides detailed metrics for each class, including precision, recall, and F1-score.
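To see where the numbers in the classification report come from, the class 1 metrics can be recomputed by hand from the confusion matrix. This sketch hard-codes the illustrative matrix shown above rather than real model output; your own run may produce different counts.

```python
import numpy as np

# Illustrative confusion matrix from the sample output (rows: actual, columns: predicted)
conf_matrix = np.array([[89, 11],
                        [12, 88]])

# For a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)   # of all predicted positives, how many were right
recall = tp / (tp + fn)      # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```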
Full Code Example
Here’s the complete code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import make_classification
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
This implementation can be extended to handle multi-class classification problems and more complex datasets.
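As a rough sketch of the multi-class case, the same `LogisticRegression` class can be fit on a three-class synthetic dataset. The settings `n_informative=4`, `max_iter=1000`, and `stratify=y` are choices made for this example and are not part of the walkthrough above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Three classes need more informative features than the default of 2
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# max_iter is raised as a precaution against convergence warnings
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# One probability column per class; each row sums to 1
probs = model.predict_proba(X_test)
y_pred = model.predict(X_test)

print("Probability matrix shape:", probs.shape)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```

With more than two classes, the predicted label is the class with the highest probability rather than the result of a single 0.5 threshold.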