Linear regression is a supervised Machine Learning (ML) algorithm that models the linear relationship between a dependent variable (the target or output) and one or more independent variables (the features or inputs).
1. Introduction to Linear Regression
Objective
Given a dataset D containing n entries, each with a set of features x1,x2,…,xd and the target variable y, our objective is to find a mapping function h.
h:x↦y∀(x,y)∈D
Notation
When you see a term like xj(i):
The base letter
Letter x represents an input feature.
Letter y represents the target variable.
The subscript (j):
Indicates the specific feature index (the column in our dataset).
Note that 0≤j≤d.
The superscript in parentheses ((n) or (i))
Indicates the specific sample or observation (the row in our dataset).
Note that 1≤i≤n.
Real-World Example
We define an example for house pricing:
x1 = Square footage.
x2 = Square foot price.
y = Price of the house.
Then:
x2(5) represents the square foot price for the fifth house in the dataset.
The target variable is y=x1×x2.
We try to find a hypothesis h(x)≈x1×x2.
2. Mathematical Foundation
Linear Regression Function
We define the linear hypothesis function hθ(x). The goal is to find a set of parameters (called weights), denoted as θ0,θ1,…,θd, so that the prediction hθ(x) is as close to the actual y as possible.
hθ(x)=θ0+θ1x1+…+θdxd≈y
For simplicity we can introduce an intercept term x0=1, which allows us to write the function as follows:
hθ(x)=θ0x0+θ1x1+…+θdxd=i=0∑dθi×xi
Or in a vectorized form:
hθ(x)=[θ0θ1⋯θd]x0x1⋮xd=θT⋅x
Cost Function
To measure how far our predictions hθ(x) are from the real values y based on our current weights, we define a cost function J(θ). The training process is entirely about minimizing this cost:
θ←argθminJ(θ)
A general cost function averages the loss over the entire dataset D:
J(θ)=∣D∣1(x,y)∈D∑loss(y,hθ(x))
While many loss functions exist, Linear Regression standardly uses the Least Squares approach, also called Mean Squared Error (MSE):
J(θ)=2n1i=1∑n(hθ(x(i))−y(i))2
(Note: The 1/2 is added for mathematical convenience when computing the derivative).
Interactive Linear Regression: h(x) = θ₁ * x
Data vs. Prediction
Cost Function Landscape
-24
3. Gradient Descent
Gradient Descent is an optimization algorithm used to find the weights θ that minimize the cost function J(θ).
The Algorithm
The weights are initially chosen at random.
θ←θ0
Then, we iteratively update the weights in the opposite direction of the gradient until we reach a minimum:
θ←θ−γ×∂θ∂J
Key Definitions
Simultaneous Update: All weights (θ1,θ2,…,θj) should be updated simultaneously.
Global Minimum: Because the linear hypothesis h(x) makes J(θ) a convex quadratic function (a bowl shape), there is only one minimum. Therefore, any local minimum is guaranteed to be the absolute (global) minimum.
Learning Rate (γ): Determines the step size of each update.
If γ is too small, convergence is slow.
If γ is too large, the algorithm may overshoot the minimum and diverge.
The Learning Rate Simulator
Cost Function: J(θ) = θ² | Derivative: dJ/dθ = 2θ
Current θ: 4.00
Stochastic Gradient Descent (SGD): Calculating the derivative over large datasets is computationally expensive, so we estimate it using a subset (Mini-batch).
These steps are repeated until convergence is achieved. Convergence is usually detected by storing the previous step MSE and comparing it with the new one. (No improvement equals convergence).
Step 1 - Forward Pass
We make a prediction with our hypothesis hθ(X) by multiplying our feature matrix by our weights:
However, when our model makes the prediction Y^, it is rarely perfect. We define the error vector e (also known as the residuals) as the difference between our predictions and the actual target values Y:
Instead of calculating the partial derivative for each weight θj individually using a loop, we can compute the entire gradient vector ∇θJ at once. By multiplying the transposed feature matrix XT by our error vector e, we sum the errors scaled by their respective feature values across all n samples simultaneously.
To keep the gradients stable regardless of dataset size, we usually average them by dividing by n:
∇θJ=n1XTe=n1XT(Y^−Y)
Step 3 - Weight Update
Finally, we apply the Gradient Descent rule to all weights simultaneously. We multiply the computed gradient vector by our learning rate γ and subtract it from our current weights vector θ: