Skip to content
AA.

Ángel Arróspide • Cybersecurity & AI

Go back

Machine Learning Part 1: Linear Regression & Gradient Descent

An introduction to Linear Regression and Gradient Descent, exploring the mathematical foundation and vectorization techniques.

6 min
Table of Contents

Linear regression is a supervised Machine Learning (ML) algorithm that models the linear relationship between a dependent variable (the target or output) and one or more independent variables (the features or inputs).

1. Introduction to Linear Regression

Objective

Given a dataset DD containing nn entries, each with a set of features x1,x2,,xdx_1, x_2, \dots, x_d and the target variable yy, our objective is to find a mapping function hh.

h:xy(x,y)Dh:x\mapsto y \quad \forall\:(x, y) \in D

Notation

When you see a term like xj(i)x_j^{(i)}:

Real-World Example

We define an example for house pricing:

Then:

2. Mathematical Foundation

Linear Regression Function

We define the linear hypothesis function hθ(x)h_\theta(x). The goal is to find a set of parameters (called weights), denoted as θ0,θ1,,θd\theta_0, \theta_1, \dots, \theta_d, so that the prediction hθ(x)h_\theta(x) is as close to the actual yy as possible.

hθ(x)=θ0+θ1x1++θdxdyh_\theta(x) = \theta_0 + \theta_1 x_1 + \dots\: + \theta_d x_d \approx y

For simplicity we can introduce an intercept term x0=1x_0 = 1, which allows us to write the function as follows:

hθ(x)=θ0x0+θ1x1++θdxd=i=0dθi×xih_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots\: + \theta_d x_d = \sum_{i=0}^d \theta_i \times x_i

Or in a vectorized form:

hθ(x)=[θ0θ1θd][x0x1xd]=θTxh_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_d \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_d \end{bmatrix} = \theta^T \cdot\: x

Cost Function

To measure how far our predictions hθ(x)h_\theta(x) are from the real values yy based on our current weights, we define a cost function J(θ)J(\theta). The training process is entirely about minimizing this cost:

θargminθJ(θ)\theta \leftarrow \arg\min_\theta J(\theta)

A general cost function averages the loss over the entire dataset DD:

J(θ)=1D(x,y)Dloss(y,hθ(x))J(\theta) = \frac{1}{|D|}\sum_{(x, y)\in D} \text{loss}(y, h_\theta(x))

While many loss functions exist, Linear Regression standardly uses the Least Squares approach, also called Mean Squared Error (MSE):

J(θ)=12ni=1n(hθ(x(i))y(i))2J(\theta) = \frac{1}{2n}\sum_{i=1}^n(h_\theta(x^{(i)}) - y^{(i)})^2

(Note: The 1/21/2 is added for mathematical convenience when computing the derivative).

Interactive Linear Regression: h(x) = θ₁ * x

Data vs. Prediction

Feature (x)Target (y)Model: y = 1.00x

Cost Function Landscape

Slope (θ₁)MSE (Loss)Current MSE: 2.07
-24

3. Gradient Descent

Gradient Descent is an optimization algorithm used to find the weights θ\theta that minimize the cost function J(θ)J(\theta).

The Algorithm

  1. The weights are initially chosen at random.
θθ0\theta \leftarrow \theta_0
  1. Then, we iteratively update the weights in the opposite direction of the gradient until we reach a minimum:
θθγ×Jθ\theta \leftarrow \theta - \gamma \times \frac{\partial J}{\partial \theta}

Key Definitions

The Learning Rate Simulator

Cost Function: J(θ) = θ²  |  Derivative: dJ/dθ = 2θ

Current θ: 4.00

Gradient Breakdown

θjJ(θ)=θ(12n(hθ(x)y)2)=12n2(hθ(x)y)θ(hθ(x)y)=1n(hθ(x)y)θ(i=0dθixiy)=1n(hθ(x)y)xj\begin{aligned} \frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta} \left( \frac{1}{2n}\:(h_\theta(x) - y)^2 \right) \\ &= \frac{1}{2n} \cdot 2 \: (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta} (h_\theta(x) - y) \\ &= \frac{1}{n} (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta} \left(\sum_{i=0}^{d} \theta_i x_i - y\right) \\ &= \frac{1}{n} (h_\theta(x) - y) \cdot x_j\\ \end{aligned}

4. Vectorization in Practice

To implement this algorithm efficiently in code, we translate the summations and individual updates into matrix operations.

[y(1)y(2)y(n)]=[1x1(1)x2(1)xd(1)1x1(2)x2(2)xd(2)1x1(n)x2(n)xd(n)][θ0θ1θd]+[ϵ(1)ϵ(2)ϵ(n)]\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}= \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_d^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \dots & x_d^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} & \dots & x_d^{(n)} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}+ \begin{bmatrix} \epsilon^{(1)} \\ \epsilon^{(2)} \\ \vdots \\ \epsilon^{(n)} \end{bmatrix}

The training loop consists of three main steps:

  1. Forward Pass (Calculating predictions)
  2. Backward-propagation (Calculating gradients)
  3. Weight Update (Optimizing parameters)

These steps are repeated until convergence is achieved. Convergence is usually detected by storing the previous step MSE and comparing it with the new one. (No improvement equals convergence).

Step 1 - Forward Pass

We make a prediction with our hypothesis hθ(X)h_\theta(X) by multiplying our feature matrix by our weights:

hθ(X)=[1x1(1)xd(1)1x1(2)xd(2)1x1(n)xd(n)][θ0θ1θd]=[y^(1)y^(2)y^(n)]=Y^h_\theta(X) = \begin{bmatrix} 1 & x_1^{(1)} & \dots & x_d^{(1)} \\ 1 & x_1^{(2)} & \dots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \dots & x_d^{(n)} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} = \begin{bmatrix} \hat{y}^{(1)} \\ \hat{y}^{(2)} \\ \vdots \\ \hat{y}^{(n)} \end{bmatrix} = \hat{Y}

However, when our model makes the prediction Y^\hat{Y}, it is rarely perfect. We define the error vector e\vec{e} (also known as the residuals) as the difference between our predictions and the actual target values YY:

e=Y^Y=[y^(1)y(1)y^(2)y(2)y^(n)y(n)]=[e(1)e(2)e(n)]\vec{e} = \hat{Y} - Y = \begin{bmatrix} \hat{y}^{(1)} - y^{(1)} \\ \hat{y}^{(2)} - y^{(2)} \\ \vdots \\ \hat{y}^{(n)} - y^{(n)} \end{bmatrix} = \begin{bmatrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(n)} \end{bmatrix}

Step 2 - Backward-propagation

Instead of calculating the partial derivative for each weight θj\theta_j individually using a loop, we can compute the entire gradient vector θJ\nabla_\theta J at once. By multiplying the transposed feature matrix XTX^T by our error vector e\vec{e}, we sum the errors scaled by their respective feature values across all nn samples simultaneously.

XTe=[111x1(1)x1(2)x1(n)x2(1)x2(2)x2(n)xd(1)xd(2)xd(n)][e(1)e(2)e(n)]X^T \vec{e} = \begin{bmatrix} 1 & 1 & \dots & 1 \\ x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(n)} \\ x_2^{(1)} & x_2^{(2)} & \dots & x_2^{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ x_d^{(1)} & x_d^{(2)} & \dots & x_d^{(n)} \end{bmatrix} \begin{bmatrix} e^{(1)} \\ e^{(2)} \\ \vdots \\ e^{(n)} \end{bmatrix}

To keep the gradients stable regardless of dataset size, we usually average them by dividing by nn:

θJ=1nXTe=1nXT(Y^Y)\nabla_\theta J = \frac{1}{n}\: X^T \vec{e} = \frac{1}{n}\: X^T (\hat{Y} - Y)

Step 3 - Weight Update

Finally, we apply the Gradient Descent rule to all weights simultaneously. We multiply the computed gradient vector by our learning rate γ\gamma and subtract it from our current weights vector θ\theta:

θθγ×θJ\theta \leftarrow \theta - \gamma \times \nabla_\theta J

Share this post on: