Regression is a cornerstone of supervised machine learning, enabling predictions of numerical outcomes, such as house prices or stock values, based on input data. Unlike classification, which deals with categorical labels, regression focuses on continuous variables. This blog explores the essentials of regression, from simple linear models to advanced techniques like ridge and lasso regression, and covers how to evaluate and validate these models. A quiz at the end tests your understanding, with answers provided.
Linear regression models the relationship between a dependent variable (the target) and one or more independent variables (predictors) using a straight line or hyperplane.
Imagine predicting someone's salary based on their years of experience. Simple linear regression uses a single predictor to estimate the target with the equation: $$ y = \theta_0 + \theta_1 x + \varepsilon $$ Here, $ y $ is the target (salary), $ x $ is the predictor (experience), $ \theta_0 $ is the intercept, $ \theta_1 $ is the slope, and $ \varepsilon $ is the error term capturing unmodeled factors. The goal is to find $ \theta_0 $ and $ \theta_1 $ that best fit the data.
When multiple factors influence the target, such as experience, education, and location affecting salary, we use multiple linear regression: $$ y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m + \varepsilon $$ This extends the model to $ m $ predictors ($ x_1, x_2, \ldots, x_m $), with coefficients $ \theta_1, \theta_2, \ldots, \theta_m $. The challenge is to estimate these coefficients accurately using training data.
OLS is the standard method for fitting linear regression models by minimizing the sum of squared errors between predicted and actual values. For $ N $ data points $ {(x_i, y_i)}
OLS can struggle when predictors outnumber observations or when predictors are highly correlated, leading to unstable or overfitting models. Ridge regression addresses this by adding an L2 penalty to the loss function: $$ J_{\text{ridge}}(\boldsymbol{\theta}) = \frac{1}{2} (\mathbf{y} - \mathbf{\Phi} \boldsymbol{\theta})^T (\mathbf{y} - \mathbf{\Phi} \boldsymbol{\theta}) + \lambda \boldsymbol{\theta}^T \boldsymbol{\theta} $$ The penalty term $ \lambda \boldsymbol{\theta}^T \boldsymbol{\theta} $ shrinks coefficients toward zero, improving stability and generalization. The solution is: $$ \widehat{\boldsymbol{\theta}}_{\text{ridge}} = (\mathbf{\Phi}^T \mathbf{\Phi} + \lambda \mathbf{I})^{-1} \mathbf{\Phi}^T \mathbf{y} $$ Here, $ \lambda $ controls the strength of regularization; larger $ \lambda $ values reduce coefficient magnitudes but may underfit the data.
Lasso regression uses an L1 penalty instead, promoting sparsity by driving some coefficients exactly to zero: $$ J_{\text{lasso}}(\boldsymbol{\theta}) = \frac{1}{2} (\mathbf{y} - \mathbf{\Phi} \boldsymbol{\theta})^T (\mathbf{y} - \mathbf{\Phi} \boldsymbol{\theta}) + \lambda \sum_{j=1}^m |\theta_j| $$ This makes lasso ideal for feature selection, as it identifies the most important predictors. Unlike ridge, lasso lacks an analytical solution and requires numerical optimization techniques like ISTA or FISTA. The choice of $ \lambda $ is critical: larger values increase sparsity, while smaller ones retain more predictors.
Evaluating regression models involves measuring how well predictions match actual values. Common metrics include:
- Mean Squared Error (MSE): Averages squared differences between predictions and actuals: $$ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2 $$ It penalizes large errors heavily due to squaring.
- Root Mean Squared Error (RMSE): The square root of MSE, in the same units as the target: $$ \text{RMSE} = \sqrt{\text{MSE}} $$ It’s widely used for its interpretability.
- Mean Absolute Error (MAE): Averages absolute differences, less sensitive to outliers: $$ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |\hat{y}_i - y_i| $$
- R-squared ($ R^2 $): Measures the proportion of variance explained by the model: $$ R^2 = 1 - \frac{\sum_{i=1}^n (\hat{y}i - y_i)^2}{\sum{i=1}^n (y_i - \bar{y})^2} $$ Values range from 0 to 1, with higher values indicating better fit. However, $ R^2 $ increases with more predictors, even if they’re irrelevant.
- Adjusted R-squared: Adjusts $ R^2 $ for the number of predictors: $$ \text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - m - 1} $$ It penalizes unnecessary variables, aiding model selection.
Linear regression assumes:
- Linearity: The relationship between predictors and target is linear, verifiable via scatter plots.
- Normality of Residuals: Errors should be normally distributed, checked with histograms or Q-Q plots.
- Zero Mean Residuals: The average error should be near zero.
- Multivariate Normality: Predictors should follow a multivariate normal distribution, assessed with Q-Q plots.
- Homoscedasticity: Residuals should have constant variance across predictor values. Heteroscedasticity (varying variance) can be detected in residual vs. predicted value scatter plots, often showing a funnel shape. Violations of these assumptions may require data transformations or alternative models.
When relationships are not linear, nonlinear regression models arbitrary functions: $$ y = f(x, \boldsymbol{\theta}) + \varepsilon $$ Examples include polynomial models ($ y = \theta_0 + \theta_1 x + \theta_2 x^2 + \varepsilon $) or rational models. While powerful, nonlinear regression is less common today due to the rise of neural networks, which handle complex relationships effectively.
Below are questions to reinforce your understanding. Answers follow in the next section.
- What is the key difference between simple and multiple linear regression?
- Why does OLS estimation fail when the number of predictors exceeds the number of observations?
- How does the L1 penalty in lasso regression differ from the L2 penalty in ridge regression in terms of model outcomes?
- Why is adjusted $ R^2 $ preferred over $ R^2 $ for model selection?
- What does heteroscedasticity indicate in a regression model, and how can it be detected?
Use Python with scikit-learn, numpy, and matplotlib to answer these questions based on the Auto MPG dataset.
- Fit a simple linear regression model to predict
mpgusingdisplacement. Report the intercept and slope. - Fit a multiple linear regression model using
displacement,horsepower, andweightto predictmpg. Calculate the RMSE on the training data. - Apply ridge regression to the same predictors as in question 2 with $ \lambda = 1.0 $. Compare the coefficients with those from OLS.
- Use lasso regression with $ \lambda = 0.1 $ on the same predictors. Identify which coefficients are set to zero.
- Create a scatter plot of residuals vs. predicted
mpgfrom the multiple linear regression model to check for heteroscedasticity.
- Answer: Simple linear regression uses one predictor variable to model the target, while multiple linear regression uses several predictors, allowing for more complex relationships.
- Answer: OLS fails when predictors exceed observations because $ \mathbf{\Phi}^T \mathbf{\Phi} $ becomes singular (non-invertible), leading to no unique solution for the coefficients.
- Answer: The L1 penalty in lasso promotes sparsity by setting some coefficients to zero, selecting key predictors, whereas the L2 penalty in ridge shrinks all coefficients toward zero without eliminating any, improving stability.
- Answer: Adjusted $ R^2 $ penalizes the addition of unnecessary predictors, preventing overfitting, unlike $ R^2 $, which always increases with more variables.
- Answer: Heteroscedasticity indicates non-constant residual variance across predictor values, violating linear regression assumptions. It’s detected via scatter plots of residuals vs. predicted values, often showing a funnel shape.
Below are sample solutions using Python and the Auto MPG dataset. Ensure the dataset is cleaned (e.g., handle missing values in horsepower) before running the code.
-
Fit a simple linear regression model to predict
mpgusingdisplacement. Report the intercept and slope.import pandas as pd from sklearn.linear_model import LinearRegression # Load and clean dataset df = pd.read_csv('auto-mpg.csv') df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median()) # Simple linear regression X = df[['displacement']] y = df['mpg'] model = LinearRegression().fit(X, y) print(f"Intercept: {model.intercept_:.4f}, Slope: {model.coef_[0]:.4f}")
Expected Output: Intercept and slope values depend on the dataset but might be around 35.0 and -0.06, respectively.
-
Fit a multiple linear regression model using
displacement,horsepower, andweightto predictmpg. Calculate the RMSE on the training data.import numpy as np from sklearn.metrics import mean_squared_error # Multiple linear regression X = df[['displacement', 'horsepower', 'weight']] y = df['mpg'] model = LinearRegression().fit(X, y) y_pred = model.predict(X) rmse = np.sqrt(mean_squared_error(y, y_pred)) print(f"RMSE: {rmse:.4f}")
Expected Output: RMSE typically around 4.0–5.0, depending on data preprocessing.
-
Apply ridge regression to the same predictors as in question 2 with $ \lambda = 1.0 $. Compare the coefficients with those from OLS.
from sklearn.linear_model import Ridge # Ridge regression ridge = Ridge(alpha=1.0).fit(X, y) print("OLS Coefficients:", model.coef_) print("Ridge Coefficients:", ridge.coef_)
Expected Output: Ridge coefficients are slightly smaller than OLS due to the L2 penalty shrinking them toward zero.
-
Use lasso regression with $ \lambda = 0.1 $ on the same predictors. Identify which coefficients are set to zero.
from sklearn.linear_model import Lasso # Lasso regression lasso = Lasso(alpha=0.1).fit(X, y) print("Lasso Coefficients:", lasso.coef_)
Expected Output: Some coefficients (e.g.,
horsepower) may be zero, indicating feature selection by lasso. -
Create a scatter plot of residuals vs. predicted
mpgfrom the multiple linear regression model to check for heteroscedasticity.import matplotlib.pyplot as plt # Residual plot residuals = y - y_pred plt.scatter(y_pred, residuals) plt.axhline(0, color='red', linestyle='--') plt.xlabel('Predicted MPG') plt.ylabel('Residuals') plt.title('Residuals vs Predicted MPG') plt.savefig('residual_plot.png') plt.close()
Expected Output: A scatter plot with no clear pattern (e.g., funnel shape) indicates homoscedasticity; a funnel shape suggests heteroscedasticity.
Regression is a versatile tool in machine learning, from simple linear models to regularized techniques like ridge and lasso. Understanding their assumptions, evaluation metrics, and validation methods is crucial for building robust models. The quiz above helps solidify your knowledge, and tools like scikit-learn make it easy to apply these concepts in practice. Keep experimenting with datasets like Auto MPG to deepen your skills!