Linear regression is a fundamental statistical method used to predict a continuous outcome (the dependent variable) from one or more input features (independent variables). The goal is to model the relationship between the dependent variable and the independent variables by fitting a linear equation to the observed data, typically via ordinary least squares, which minimizes the sum of squared residuals.
Key Concepts:
- Simple Linear Regression: Involves a single independent variable. The relationship between the independent variable $X$ and the dependent variable $Y$ is modeled as:

  $$Y = \beta_0 + \beta_1 X + \epsilon$$

  where:
  - $Y$ is the predicted outcome.
  - $\beta_0$ is the intercept.
  - $\beta_1$ is the slope (coefficient) of the independent variable.
  - $\epsilon$ is the error term.
- Multiple Linear Regression: Involves two or more independent variables. The relationship is modeled as:

  $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$$

  where:
  - $X_1, X_2, \dots, X_n$ are the independent variables.
  - $\beta_1, \beta_2, \dots, \beta_n$ are the corresponding coefficients.
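To make the simple model concrete: with intercept $\beta_0 = 4$ and slope $\beta_1 = 3$, an input of $X = 2$ gives an expected outcome of

$$Y = 4 + 3 \cdot 2 = 10$$

before the noise term. The synthetic data in the walkthrough below is generated from exactly this relationship.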
Assumptions of Linear Regression:
- Linearity: The relationship between the independent and dependent variables is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The residuals (errors) have constant variance.
- Normality: The residuals are normally distributed.
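In practice, these assumptions are usually checked after fitting by examining the residuals. The helper below is a minimal sketch of two common visual checks, assuming matplotlib is available; you can call it with the `y_test` and `y_pred` arrays from the walkthrough later in this post.

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_diagnostics(y_true, y_pred):
    """Visual checks for homoscedasticity and normality of the residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    # Residuals vs. predictions: a random scatter of roughly constant
    # spread around zero supports linearity and homoscedasticity.
    ax1.scatter(y_pred, residuals, alpha=0.6)
    ax1.axhline(0, color="red", linestyle="--")
    ax1.set_xlabel("Predicted value")
    ax1.set_ylabel("Residual")
    # Histogram: a roughly bell-shaped distribution supports normality.
    ax2.hist(residuals.ravel(), bins=20)
    ax2.set_xlabel("Residual")
    ax2.set_ylabel("Count")
    plt.show()
```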
Applications:
- Predictive Modeling: Estimating future values of a variable (e.g., predicting sales based on advertising spend).
- Trend Analysis: Understanding the relationship between variables (e.g., studying how changes in temperature affect ice cream sales).
Example:
If you’re trying to predict house prices based on square footage and number of bedrooms, you could use multiple linear regression to model the relationship, where the house price is the dependent variable, and square footage and number of bedrooms are the independent variables.
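A minimal sketch of that setup with scikit-learn, using made-up numbers purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: columns are [square footage, bedrooms].
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2100, 5]])
y = np.array([245000, 312000, 279000, 308000, 499000])  # illustrative prices

model = LinearRegression().fit(X, y)

# Predict the price of a 1500 sq ft, 3-bedroom house.
print(model.predict([[1500, 3]]))
```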
Let’s walk through the implementation of a simple linear regression model in Python using the popular scikit-learn library. We will also discuss how to evaluate the model using common metrics.
Step 1: Import Necessary Libraries
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```
Step 2: Generate or Load Data
For this example, we’ll generate synthetic data with a known relationship, y = 4 + 3x plus Gaussian noise, so we can later check whether the model recovers the true intercept (4) and slope (3).
```python
# Generate synthetic data: y = 4 + 3x + Gaussian noise
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Convert to a DataFrame for better readability (optional)
data = pd.DataFrame(np.c_[X, y], columns=["X", "y"])
```
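Optionally, you can visualize the data before modeling (a minimal sketch, assuming matplotlib is installed):

```python
import matplotlib.pyplot as plt

plt.scatter(X, y, alpha=0.6)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Synthetic data: y = 4 + 3x + noise")
plt.show()
```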
Step 3: Split Data into Training and Testing Sets
```python
# Split the data into training and testing sets (80% train, 20% test);
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Step 4: Train the Linear Regression Model
```python
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
```
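Since the data was generated from y = 4 + 3x, it’s worth sanity-checking the fit by printing the learned parameters (`intercept_` and `coef_` are standard attributes of a fitted LinearRegression); they should land close to 4 and 3:

```python
# The fitted parameters should be close to the true values (4 and 3).
print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_}")
```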
Step 5: Make Predictions
```python
# Make predictions on the test data
y_pred = model.predict(X_test)
```
Step 6: Evaluate the Model
Now, let’s evaluate the performance of our model using two common metrics: Mean Squared Error (MSE) and R-squared (R²) score.
1. Mean Squared Error (MSE): Measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values.
2. R-squared (R²) Score: Represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model. It typically ranges from 0 to 1 (and can even be negative for a model that fits worse than simply predicting the mean), with values closer to 1 indicating a better fit.
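For reference, the standard definitions of these metrics are:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $y_i$ are the actual values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the actual values.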
```python
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²) Score: {r2}")
```
Step 7: Interpretation of Results
After running the above code, you might see output along these lines (the exact values depend on the random data):

```
Mean Squared Error (MSE): 0.71
R-squared (R²) Score: 0.94
```
- MSE: The smaller the MSE, the closer the predicted values are to the actual values. Note that MSE is expressed in the squared units of the target, which is why the root mean squared error (RMSE) shown below is often easier to interpret.
- R-squared (R²) Score: An R² score close to 1 indicates that a large proportion of the variance in the dependent variable is predictable from the independent variable(s).
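Computing the RMSE is a one-liner on top of the MSE above:

```python
rmse = np.sqrt(mse)  # RMSE is in the same units as y
print(f"Root Mean Squared Error (RMSE): {rmse}")
```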
Full Code Example
Here’s the complete code:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Generate synthetic data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Step 3: Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = model.predict(X_test)

# Step 6: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²) Score: {r2}")
```
This basic implementation can be extended to handle more complex scenarios, including multiple linear regression (with multiple independent variables), regularization, and more sophisticated evaluation metrics; a regularized variant is sketched below.
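For example, adding L2 regularization is nearly a drop-in change using scikit-learn’s Ridge. A minimal sketch reusing the train/test split from above (alpha controls the penalty strength):

```python
from sklearn.linear_model import Ridge

# Ridge adds an L2 penalty on the coefficients; larger alpha shrinks them more.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge R² on test data: {ridge.score(X_test, y_test)}")
```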