Simplifying Regularization, Part 1: Ridge to Elastic Net

In the world of machine learning, overfitting is a common enemy: the model hugs the training data so closely that it performs great on the examples it was trained on but flops when faced with unseen inputs, unable to produce good results. This is a very common problem with machine learning models.
Regression models are among the most fundamental and widely used techniques in statistical modeling and machine learning. Their appeal lies in their simplicity, interpretability, and effectiveness when the underlying assumptions are met. However, traditional regression models such as linear regression often struggle in real-world applications where the dataset includes a large number of features, multicollinearity, or noisy observations. In such scenarios, the model tends to overfit, capturing noise rather than meaningful patterns, which leads to poor generalization on unseen data.
Model overfitting occurs for several reasons, such as:
- Excessive Model Complexity
  - A model with far more parameters than the data requires can memorize the training set
- Too Little Training Data
  - Not enough examples for the model to learn general patterns
- Too Many Features
  - Irrelevant or redundant features can cause the model to fit noise
- Too Many Training Epochs
  - The model keeps adjusting until it memorizes the training data
- Lack of Regularization
  - No penalties on large weights allow the model to fit extreme values
- Low Noise Tolerance
  - Sensitive models like decision trees are prone to overfitting noisy data
Imagine fitting a 15-degree polynomial to just 10 data points. The curve may pass through all points exactly but will wildly oscillate between them, producing absurd predictions on new data.
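Here is a minimal NumPy sketch of that effect on synthetic, purely illustrative data. A degree-9 polynomial is used because 10 points pin down a polynomial of at most degree 9 exactly; the idea is the same as the 15-degree example above.

```python
# Fit a high-degree polynomial to 10 noisy points: it matches the training
# points almost exactly but can swing wildly between them on new inputs.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

coeffs = np.polyfit(x, y, deg=9)          # near-exact fit on the training points

x_new = np.linspace(0, 1, 200)            # unseen inputs between the points
y_new = np.polyval(coeffs, x_new)
print("Prediction range on new inputs:", y_new.min(), y_new.max())
```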
Now, model underfitting occurs when the model is too simplistic to learn the underlying patterns in the data, leading to poor performance on both training and test sets.
Fitting a straight line (linear regression) to a clearly curved dataset, for example, will miss the true relationships and produce large errors.
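A minimal sketch of that situation, using synthetic quadratic data (illustrative only):

```python
# A straight line fit to clearly quadratic data: it cannot capture the curve,
# so even the training score is poor.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=100)

line = LinearRegression().fit(x, y)
print("R^2 on the training data:", line.score(x, y))  # close to 0: the line misses the curve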
A few of the major causes of underfitting:
- The Model Is Too Simple
  - A linear model for a non-linear problem
  - Too few layers or parameters
- Inadequate Training
  - Too few training epochs
  - The learning rate is too low, or the optimization is poor
- Poor Feature Representation
  - Input features do not carry enough signal
  - No feature engineering or data preprocessing
- Downsampling the Data Excessively
  - Losing important variance during preprocessing
We should always remember that when we create any model, be it a regression or classification model, we should aim for a generalised model. The model should have both low bias and low variance, which means it should perform decently on training, test, and validation data.
This is where regularization techniques come into play.
Regularization techniques combat overfitting and increase a model’s ability to generalize. They achieve this by adding a penalty to the linear regression cost function, which limits the model’s complexity and improves generalization.
Ridge Regression and Lasso Regression stand out as popular regularization methods. These are essentially linear regression extensions that incorporate a regularization term. This term helps reduce the model’s variance, potentially increasing bias slightly, ultimately leading to a more favorable balance between bias and variance.
- Ridge Regression, also known as L2 regularization, penalizes the sum of the squared coefficients. It is particularly effective when all features are relevant but multicollinearity is an issue. Ridge shrinks coefficients toward zero but never exactly to zero.
- Lasso Regression, or L1 regularization, penalizes the sum of the coefficients’ absolute values. It not only helps with overfitting but also performs feature selection by shrinking some coefficients exactly to zero, effectively removing them from the model.
To start with, let us take a simple linear regression and define its cost or loss function.
In simple linear regression, the cost function quantifies the error between the predicted values (ŷ) and the actual observed values (y). The most commonly used cost function is the Mean Squared Error (MSE):

$$J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2$$

Where:
- J(θ₀, θ₁) is the cost function.
- m is the number of training examples.
- ŷᵢ is the predicted value for the i-th observation (ŷᵢ = θ₀ + θ₁xᵢ).
- yᵢ is the actual observed value for the i-th observation.
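As a quick numeric check, the cost can be computed directly; the values below are made up purely for illustration.

```python
# Compute the MSE cost J for a handful of made-up predictions and targets.
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # actual values y_i
y_hat = np.array([2.5, 5.5, 6.0, 9.5])    # predicted values ŷ_i

J = np.mean((y_hat - y) ** 2)             # (1/m) * sum of squared errors
print("Cost J:", J)                        # 0.4375 for these values
```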
Further, the equation below gives the sum of squared errors (SSE), also called the residual sum of squares (RSS), which is commonly used in regression analysis to measure the discrepancy between the predicted and actual values:

$$\text{SSE} = \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2$$
Our main aim when creating a model is to lower this discrepancy, that is, to reduce the difference between the observed and the predicted values.
However, in the case of overfitting, this value is close to zero on the training dataset but increases sharply on the test and validation datasets.
In Ridge Regression, also known as Tikhonov regularization or L2 regularization, we add a penalty term to the loss function (the residual sum of squares), which shrinks the model coefficients and reduces model complexity.
Ridge Regression Formula

$$\text{Ridge Loss} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Where:
- yᵢ = actual output for the i-th data point
- ŷᵢ = Xᵢ ⋅ β = predicted output
- βⱼ = model coefficients
- λ ≥ 0 = regularization strength
- n = number of observations
- p = number of features
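To make the formula concrete, here is a small NumPy sketch on made-up random data (the intercept is ignored for simplicity and λ = 1.0 is arbitrary). It computes the closed-form ridge coefficients β = (XᵀX + λI)⁻¹Xᵀy and the corresponding ridge loss.

```python
# Closed-form ridge coefficients and the ridge loss on synthetic data.
import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

residuals = y - X @ beta_ridge
ridge_loss = np.sum(residuals ** 2) + lam * np.sum(beta_ridge ** 2)
print("Ridge coefficients:", beta_ridge)
print("Ridge loss:", ridge_loss)
```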
Here, our aim is to minimize the entire expression, the residuals together with the penalty term.
Overfitting can produce a training fit where the residual sum of squares is near zero and the slope of the regression line is very steep. Ridge Regression addresses this by introducing a penalty term into the Ridge Loss function. This penalty increases with the steepness of the slope, discouraging it from becoming excessively large. Consequently, to minimize the overall Ridge Loss, the best-fit line is adjusted, thereby mitigating the effects of overfitting.
In summary:
- In plain linear regression, if the line has a very large slope (steep line), small changes in input make large changes in output. That’s dangerous if there’s noise.
- In ridge regression, large slopes are penalized, so the model prefers flatter slopes that generalize better.
The line that gets selected has both a low residual sum of squares and a low penalty. Because this line is chosen over multiple iterations, the chance of overfitting is much lower, and the Ridge regression model ends up being a generalised one.
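As a small illustration of this idea, the sketch below fits plain linear regression and Ridge to the same noisy one-feature synthetic data and compares the slopes; the alpha value is arbitrary.

```python
# Ridge pulls the learned slope toward zero relative to plain linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 1))
y = 5.0 * X.ravel() + rng.normal(scale=2.0, size=20)

lin = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Linear regression slope:", lin.coef_[0])
print("Ridge slope (shrunk toward zero):", ridge.coef_[0])
```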
Lasso Regression, also known as the Least Absolute Shrinkage and Selection Operator, employs L1 regularization, incorporating the absolute value of the coefficients (the magnitude of the slope) into the loss function.
Key Difference
Unlike Ridge, Lasso can shrink some coefficients to exactly zero, making it great for feature selection.

Suppose there are multiple features in a dataset, so the regression model takes the form:

$$y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + c$$

Where:
- y = predicted output (dependent variable)
- x₁, x₂, …, xₙ = input features (independent variables)
- m₁, m₂, …, mₙ = coefficients (slopes) corresponding to each input feature
- c = intercept (bias term)
In the case of Lasso regression, the penalty term (for L1 regularization) is:

$$\lambda \sum_{i=1}^{n} |m_i| = \lambda \left( |m_1| + |m_2| + \dots + |m_n| \right)$$

Where:
- λ = regularization parameter (controls the strength of the penalty)
- m₁, m₂, …, mₙ = model coefficients (excluding the intercept)
- The absolute values |mᵢ| are used to encourage sparsity (some coefficients become zero)

This penalty is added to the residual sum of squares to form the full Lasso loss.
Now, unlike Ridge (which only shrinks coefficients), Lasso can make some weights exactly 0.
This means the model completely ignores those features. Hence, automatically, features with non-zero coefficients are kept and features with zero coefficients are removed. So, you’re left only with features that really help in predicting the output.
This often reduces the complexity of the model and makes it faster and easier to interpret. Thus, it helps avoid overfitting, especially when there are many features.
Imagine you have 100 features, but only 10 of them are actually useful. Lasso will (see the sketch after this list):
- Give non-zero weights to those 10 useful features.
- Set the weights of the other 90 to exactly zero.
- So your model now uses only the important 10 features — no manual selection needed!
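Here is a minimal sketch of that scenario, using scikit-learn's make_regression to create synthetic data with 100 features of which 10 are informative. The alpha value is arbitrary and would normally be tuned; Lasso should keep roughly the informative features and zero out most of the rest.

```python
# Lasso as automatic feature selection: count the non-zero coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0))
```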
| Feature | Ridge Regression | Lasso Regression |
|---|---|---|
| Type of penalty | L2 (squared magnitude) | L1 (absolute magnitude) |
| Feature selection | No | Yes |
| When to use | Many small effects | Few strong effects |
| Coefficient shrinkage | Yes | Yes (can become zero) |
| Model interpretability | Moderate | High (fewer features) |
Elastic Net is a type of regression that combines both Lasso (L1) and Ridge (L2) regularization techniques.
Elastic Net improves on the limitations of Lasso, especially when working with high-dimensional data and a small number of samples. Lasso tends to select just one variable from a group of highly correlated features and ignore the rest, which can be a problem when all those features carry useful information.
To fix this, Elastic Net adds a quadratic term (the L2 norm, like in Ridge Regression) to the penalty. This makes the loss function more stable and convex, and helps include more relevant variables rather than dropping them entirely. Essentially, Elastic Net combines the strengths of both Lasso and Ridge—it selects important features while handling correlated ones better.
The process of finding Elastic Net coefficients happens in two stages:
- First, it applies Ridge regression to get the initial coefficient estimates.
- Then, it applies Lasso-like shrinkage to refine those estimates.
Because it applies two layers of shrinkage, this naive approach can sometimes increase bias and reduce predictive accuracy. To balance things out, the final coefficients are rescaled by multiplying them by (1 + λ₂), where λ₂ is the strength of the L2 penalty, correcting the double-shrinkage effect.
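In symbols, following the common formulation of the naive elastic net, the rescaling step is:

$$\hat{\beta}_{\text{elastic net}} = (1 + \lambda_2)\, \hat{\beta}_{\text{naive}}$$

where λ₂ is the coefficient of the quadratic (L2) penalty and β̂_naive is the estimate obtained after both shrinkage stages.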
In short, it’s especially useful when:
- The data contains many correlated features
- You need the feature-selection power of Lasso
- You also want the stability of Ridge
The loss function for Elastic Net is:

$$\text{Elastic Net Loss} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$$
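A minimal scikit-learn sketch on synthetic data follows; in scikit-learn, alpha controls the overall penalty strength and l1_ratio the mix between the L1 and L2 parts, and both values here are arbitrary (they would normally be tuned, for example with ElasticNetCV).

```python
# Minimal Elastic Net sketch on synthetic data with correlated/irrelevant features.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print("Non-zero coefficients:", (enet.coef_ != 0).sum())
```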
Let’s walk through a Ridge and Lasso regression example using the California Housing dataset with scikit-learn.
Step 1: Load and Preprocess Data
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
Scaling is important for Ridge and Lasso since they are sensitive to feature magnitude.
Step 2: Train the Models
```python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)
```
Step 3: Evaluate the Models
```python
from sklearn.metrics import mean_squared_error

ridge_pred = ridge.predict(X_test)
lasso_pred = lasso.predict(X_test)

print("Ridge MSE:", mean_squared_error(y_test, ridge_pred))
print("Lasso MSE:", mean_squared_error(y_test, lasso_pred))
```
Step 4: Visualize Coefficients
```python
import matplotlib.pyplot as plt

plt.plot(ridge.coef_, label='Ridge')
plt.plot(lasso.coef_, label='Lasso')
plt.legend()
plt.title("Ridge vs Lasso Coefficients")
plt.xlabel("Feature Index")
plt.ylabel("Coefficient Value")
plt.grid(True)
plt.show()
```
You can easily tune Ridge Regression using RidgeCV:
```python
from sklearn.linear_model import RidgeCV

ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
ridge_cv.fit(X_train, y_train)
print("Optimal alpha:", ridge_cv.alpha_)
```
Optimal alpha: 0.1
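Lasso and Elastic Net can be tuned in the same way with LassoCV and ElasticNetCV. The sketch below reuses X_train and y_train from Step 1; the candidate alphas and l1_ratios are arbitrary starting points.

```python
# Analogous tuning for Lasso and Elastic Net via built-in cross-validation.
from sklearn.linear_model import LassoCV, ElasticNetCV

lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5)
lasso_cv.fit(X_train, y_train)
print("Optimal Lasso alpha:", lasso_cv.alpha_)

enet_cv = ElasticNetCV(alphas=[0.001, 0.01, 0.1, 1.0], l1_ratio=[0.2, 0.5, 0.8], cv=5)
enet_cv.fit(X_train, y_train)
print("Optimal Elastic Net alpha:", enet_cv.alpha_, "l1_ratio:", enet_cv.l1_ratio_)
```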
What is Ridge Regression in machine learning?
It’s a regularized form of linear regression that prevents overfitting by penalizing large coefficients using L2 regularization.
How does Ridge Regression prevent overfitting?
By adding a penalty term to the cost function, it discourages the model from fitting noise in the data.
Can Ridge Regression perform feature selection?
No. It shrinks coefficients but does not zero them out like Lasso does.
When should I use Ridge Regression over Lasso?
Use Ridge when all features are likely to contribute and multicollinearity is an issue.
How do I implement Ridge Regression in Python?
Using Ridge() from scikit-learn. See the code example above!
Ridge Regression vs ElasticNet—what’s the difference?
ElasticNet combines Ridge (L2) and Lasso (L1), balancing shrinkage and sparsity.
Regularization is not a fancy buzzword—it’s a must-have in real-world machine learning. Ridge Regression gives you stability in high dimensions, while Lasso can give you simplicity through feature selection.
Lasso and Ridge regression are both powerful regularization techniques, each with its own strengths: Lasso for feature selection and sparsity, and Ridge for handling multicollinearity and stabilizing models.
However, neither is perfect on its own. You typically need to iterate, experimenting to find the method that works best for the specific model, dataset, and features. This often involves trying different regularization strengths, tuning hyperparameters, and evaluating model performance using cross-validation.
The choice between Lasso, Ridge, or Elastic Net depends on the problem at hand—whether you need feature selection, stability with multicollinearity, or a balance of both. Ultimately, thoughtful experimentation and model evaluation are key to selecting the most effective regularization technique for your machine learning task.
Elastic Net shines when there is a need to combine the best of both worlds by applying both L1 and L2 penalties. This allows it to select important features like Lasso while maintaining the grouping effect and robustness of Ridge.
In high-dimensional datasets with correlated features, Elastic Net offers a more balanced and reliable approach, reducing overfitting and improving generalization. By understanding when and how to use these regularization techniques, you can build more interpretable and efficient machine learning models.