In this blog, we are going to discuss not only how but also why gradient descent and stochastic gradient descent are used.
We already know about linear regression, and recently I wrote about it in the context of vectors and projections.
Now, we will try to understand gradient descent with the help of a linear regression problem.
But before that, I just want to briefly recall what we already know about linear regression and the math behind it, so that anyone starting out finds it easy to follow.
If you already know the basic math behind linear regression, then you can directly start from the section titled Why Do We Need Gradient Descent?
Let’s say we started our machine learning journey, and the first thing we did was implementing a linear regression model using Python.
We implemented it successfully and got the best values for the slope and intercept.
Now we have a question: What’s actually happening behind this algorithm?
We want to understand the math behind it.
For that, let’s consider this data.

Image by Author
Now, we want to understand the math behind the algorithm.

Image by Author
We come across these formulas for the slope and intercept.
\[
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\]
\[
\beta_0 = \bar{y} - \beta_1\bar{x}
\]
Now, by using these formulas we calculate the slope and intercept.
The Simple Linear Regression equation is:
\[
\hat{y}
=
\beta_0+\beta_1x
\]
The slope formula is:
\[
\beta_1
=
\frac{
\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
}{
\sum_{i=1}^{n}(x_i-\bar{x})^2
}
\]
The intercept formula is:
\[
\beta_0
=
\bar{y}
-
\beta_1\bar{x}
\]
The dataset is:
\[
x=
[1.2,1.4,1.6,2.1,2.3,3.0,3.1,3.3,3.3,3.8]
\] \[
y=
[39344,46206,37732,43526,39892,56643,60151,54446,64446,57190]
\]
Compute the mean of x:
\[
\bar{x}
=
\frac{1.2+1.4+1.6+2.1+2.3+3.0+3.1+3.3+3.3+3.8}{10}
\] \[
\bar{x}
=
\frac{25.1}{10}
=
2.51
\]
Compute the mean of y:
\[
\bar{y}
=
\frac{
39344+46206+37732+43526+39892+56643+60151+54446+64446+57190
}{10}
\] \[
\bar{y}
=
\frac{499576}{10}
=
49957.6
\]
Now compute:
\[
\sum(x_i-\bar{x})(y_i-\bar{y})
\]
After substitution and calculation:
\[
\sum(x_i-\bar{x})(y_i-\bar{y})
=
67555.54
\]
Now compute:
\[
\sum(x_i-\bar{x})^2
\]
After calculation:
\[
\sum(x_i-\bar{x})^2
=
7.489
\]
Now compute the slope:
\[
\beta_1
=
\frac{67555.54}{7.489}
\] \[
\beta_1
\approx
9020.64
\]
Now compute the intercept, using the unrounded slope \( \beta_1 \approx 9020.6356 \):
\[
\beta_0
=
49957.6-(9020.6356)(2.51)
\] \[
\beta_0
\approx
27315.80
\]
Therefore:
\[
\beta_0\approx 27315.80
\] \[
\beta_1\approx 9020.64
\]
Final regression equation:
\[
\hat{y}
=
27315.80+9020.64x
\]
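The same arithmetic can be checked with a few lines of plain Python. This is only a minimal sketch that uses the ten data points listed above; the variable names (`s_xy`, `s_xx`, and so on) are illustrative choices, not part of any library.

```python
# Years of experience (x) and salary (y) from the worked example above.
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Numerator and denominator of the slope formula.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

print(f"mean of x = {x_mean:.2f}, mean of y = {y_mean:.1f}")
print(f"slope     = {slope:.2f}")
print(f"intercept = {intercept:.2f}")
```

Running the formulas in code like this is a quick way to catch slips in the hand calculation.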
We got the values using the formulas, but we are not satisfied and want to go deeper.
Now our goal is to learn how we got these formulas.
To understand that, we will now look at a 3D bowl-shaped surface. We get this surface when we plot the mean squared error (MSE) against all the possible combinations of \( \beta_0 \) and \( \beta_1 \).

Image by Author
Now, by looking at the surface, we understand that we need the mean squared error to be as low as possible, and it reaches its minimum at the bottom of the bowl, where the gradient becomes zero.
We already know that to find the slope of any curve, we need differentiation.
Next, we differentiate the loss function, since the bowl is its 3D representation. The loss depends on two variables, \( \beta_0 \) and \( \beta_1 \).
So, we take a partial derivative with respect to each one, set both to zero, and then solve further to get the formulas for the slope and intercept.
Start with the Mean Squared Error (MSE) loss function:
\[
MSE(\beta_0,\beta_1)
=
\frac{1}{n}
\sum_{i=1}^{n}
(y_i-(\beta_0+\beta_1x_i))^2
\]
Rearrange the inner expression:
\[
=
\frac{1}{n}
\sum_{i=1}^{n}
(y_i-\beta_0-\beta_1x_i)^2
\]
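To get a feel for the bowl, we can evaluate this loss at a few parameter pairs. The sketch below is illustrative only: the function name `mse` and the sample points are my own choices, and the first point is approximately the least-squares solution for this dataset.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def mse(b0, b1):
    """Mean squared error of the line y_hat = b0 + b1 * x on the dataset."""
    n = len(x)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / n


# First point: approximately the least-squares estimates.
# Remaining points: arbitrary values chosen to show the loss rising around it.
points = [(27315.80, 9020.64), (27315.80, 12000.0), (20000.0, 9020.64), (2.0, 5.0)]
for b0, b1 in points:
    print(f"b0={b0:>9.2f}  b1={b1:>9.2f}  MSE={mse(b0, b1):,.0f}")
```

Every point away from the bottom of the bowl gives a larger MSE, which is exactly what the 3D picture shows.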
Now take partial derivative with respect to \( \beta_0 \):
\[
\frac{\partial MSE}{\partial \beta_0}
=
\frac{\partial}{\partial \beta_0}
\left(
\frac{1}{n}
\sum_{i=1}^{n}
(y_i-\beta_0-\beta_1x_i)^2
\right)
\]
Take constant outside:
\[
=
\frac{1}{n}
\frac{\partial}{\partial \beta_0}
\sum_{i=1}^{n}
(y_i-\beta_0-\beta_1x_i)^2
\]
Move derivative inside the summation:
\[
=
\frac{1}{n}
\sum_{i=1}^{n}
\frac{\partial}{\partial \beta_0}
(y_i-\beta_0-\beta_1x_i)^2
\]
Apply chain rule:
\[
=
\frac{1}{n}
\sum_{i=1}^{n}
2(y_i-\beta_0-\beta_1x_i)
\cdot
\frac{\partial}{\partial \beta_0}
(y_i-\beta_0-\beta_1x_i)
\]
Apply derivative rules:
\[
\frac{d}{d\beta_0}(y_i)=0
\] \[
\frac{d}{d\beta_0}(-\beta_0)=-1
\] \[
\frac{d}{d\beta_0}(-\beta_1x_i)=0
\]
So the inner derivative becomes:
\[
\frac{\partial}{\partial \beta_0}
(y_i-\beta_0-\beta_1x_i)
=
-1
\]
Substitute back:
\[
\frac{\partial MSE}{\partial \beta_0}
=
\frac{1}{n}
\sum_{i=1}^{n}
2(y_i-\beta_0-\beta_1x_i)(-1)
\]
Simplify:
\[
=
-\frac{2}{n}
\sum_{i=1}^{n}
(y_i-\beta_0-\beta_1x_i)
\]
Set derivative equal to zero:
\[
-\frac{2}{n}
\sum_{i=1}^{n}
(y_i-\beta_0-\beta_1x_i)
=
0
\]
Multiply both sides by:
\[
-\frac{n}{2}
\] \[
\sum_{i=1}^{n}
(y_i-\beta_0-\beta_1x_i)
=
0
\]
Expand:
\[
\sum_{i=1}^{n}y_i
-
n\beta_0
-
\beta_1\sum_{i=1}^{n}x_i
=
0
\]
Rearrange:
\[
n\beta_0
=
\sum_{i=1}^{n}y_i
-
\beta_1\sum_{i=1}^{n}x_i
\]
Divide by \( n \):
\[
\beta_0
=
\frac{1}{n}\sum_{i=1}^{n}y_i
-
\beta_1
\frac{1}{n}\sum_{i=1}^{n}x_i
\]
Using means:
\[
\bar{x}
=
\frac{1}{n}\sum_{i=1}^{n}x_i
\] \[
\bar{y}
=
\frac{1}{n}\sum_{i=1}^{n}y_i
\]
Final intercept formula:
\[
\beta_0
=
\bar{y}
-
\beta_1\bar{x}
\]
Now take partial derivative with respect to \( \beta_1 \):
\[
\frac{\partial MSE}{\partial \beta_1}
=
\frac{\partial}{\partial \beta_1}
\left(
\frac{1}{n}
\sum_{i=1}^{n}
(y_i-\beta_0-\beta_1x_i)^2
\right)
\]
Take constant outside:
\[
=
\frac{1}{n}
\frac{\partial}{\partial \beta_1}
\sum_{i=1}^{n}
(y_i-\beta_0-\beta_1x_i)^2
\]
Move derivative inside the summation:
\[
=
\frac{1}{n}
\sum_{i=1}^{n}
\frac{\partial}{\partial \beta_1}
(y_i-\beta_0-\beta_1x_i)^2
\]
Apply chain rule:
\[
=
\frac{1}{n}
\sum_{i=1}^{n}
2(y_i-\beta_0-\beta_1x_i)
\cdot
\frac{\partial}{\partial \beta_1}
(y_i-\beta_0-\beta_1x_i)
\]
Apply derivative rules:
\[
\frac{d}{d\beta_1}(y_i)=0
\] \[
\frac{d}{d\beta_1}(-\beta_0)=0
\] \[
\frac{d}{d\beta_1}(-\beta_1x_i)=-x_i
\]
So the inner derivative becomes:
\[
\frac{\partial}{\partial \beta_1}
(y_i-\beta_0-\beta_1x_i)
=
-x_i
\]
Substitute back:
\[
\frac{\partial MSE}{\partial \beta_1}
=
\frac{1}{n}
\sum_{i=1}^{n}
2(y_i-\beta_0-\beta_1x_i)(-x_i)
\]
Simplify:
\[
=
-\frac{2}{n}
\sum_{i=1}^{n}
x_i(y_i-\beta_0-\beta_1x_i)
\]
Set derivative equal to zero:
\[
-\frac{2}{n}
\sum_{i=1}^{n}
x_i(y_i-\beta_0-\beta_1x_i)
=
0
\]
Multiply both sides by:
\[
-\frac{n}{2}
\] \[
\sum_{i=1}^{n}
x_i(y_i-\beta_0-\beta_1x_i)
=
0
\]
Expand:
\[
\sum_{i=1}^{n}x_iy_i
-
\beta_0\sum_{i=1}^{n}x_i
-
\beta_1\sum_{i=1}^{n}x_i^2
=
0
\]
Substitute:
\[
\beta_0
=
\bar{y}
-
\beta_1\bar{x}
\]
into the equation:
\[
\sum_{i=1}^{n}x_iy_i
-
(\bar{y}-\beta_1\bar{x})
\sum_{i=1}^{n}x_i
-
\beta_1\sum_{i=1}^{n}x_i^2
=
0
\]
Expand:
\[
\sum_{i=1}^{n}x_iy_i
-
\bar{y}\sum_{i=1}^{n}x_i
+
\beta_1\bar{x}\sum_{i=1}^{n}x_i
-
\beta_1\sum_{i=1}^{n}x_i^2
=
0
\]
Since:
\[
\sum_{i=1}^{n}x_i=n\bar{x}
\]
Substitute:
\[
\sum_{i=1}^{n}x_iy_i
-
n\bar{x}\bar{y}
+
\beta_1n\bar{x}^2
-
\beta_1\sum_{i=1}^{n}x_i^2
=
0
\]
Group \( \beta_1 \) terms:
\[
\beta_1
(n\bar{x}^2-\sum_{i=1}^{n}x_i^2)
=
n\bar{x}\bar{y}
-
\sum_{i=1}^{n}x_iy_i
\]
Multiply both sides by -1:
\[
\beta_1
(\sum_{i=1}^{n}x_i^2-n\bar{x}^2)
=
\sum_{i=1}^{n}x_iy_i
-
n\bar{x}\bar{y}
\]
Final slope formula:
\[
\beta_1
=
\frac{
\sum_{i=1}^{n}x_iy_i
-
n\bar{x}\bar{y}
}{
\sum_{i=1}^{n}x_i^2
-
n\bar{x}^2
}
\]
Equivalent covariance form:
\[
\beta_1
=
\frac{
\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
}{
\sum_{i=1}^{n}(x_i-\bar{x})^2
}
\]
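The step from the previous formula to this covariance form is easy to miss, so here is a short sketch of why the two are the same. It only uses \( \sum_{i=1}^{n}x_i=n\bar{x} \) and \( \sum_{i=1}^{n}y_i=n\bar{y} \).
For the numerator:
\[
\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
=
\sum_{i=1}^{n}x_iy_i
-
\bar{y}\sum_{i=1}^{n}x_i
-
\bar{x}\sum_{i=1}^{n}y_i
+
n\bar{x}\bar{y}
=
\sum_{i=1}^{n}x_iy_i
-
n\bar{x}\bar{y}
\]
For the denominator:
\[
\sum_{i=1}^{n}(x_i-\bar{x})^2
=
\sum_{i=1}^{n}x_i^2
-
2\bar{x}\sum_{i=1}^{n}x_i
+
n\bar{x}^2
=
\sum_{i=1}^{n}x_i^2
-
n\bar{x}^2
\]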
Finally, substitute the computed value of \( \beta_1 \) into the intercept equation:
\[
\beta_0
=
\bar{y}
-
\beta_1\bar{x}
\]
Thus, the final regression equation becomes:
\[
\hat{y}
=
\beta_0
+
\beta_1x
\]
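One way to gain confidence in a derivation like this is to compare the derived partial derivatives with a numerical estimate. The following is a rough sketch, not a rigorous test: the function names and the step size `h` are my own assumptions. It also checks that both partial derivatives are essentially zero at the closed-form solution.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def mse(b0, b1):
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n


def analytic_gradient(b0, b1):
    """Partial derivatives of the MSE, as derived above."""
    residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    d_b0 = -2 / n * sum(residuals)
    d_b1 = -2 / n * sum(xi * r for xi, r in zip(x, residuals))
    return d_b0, d_b1


def numeric_gradient(b0, b1, h=0.01):
    """Central finite differences; h is an arbitrary small step."""
    d_b0 = (mse(b0 + h, b1) - mse(b0 - h, b1)) / (2 * h)
    d_b1 = (mse(b0, b1 + h) - mse(b0, b1 - h)) / (2 * h)
    return d_b0, d_b1


# Closed-form estimates from the final slope and intercept formulas.
x_mean, y_mean = sum(x) / n, sum(y) / n
slope = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_mean * y_mean) / (
    sum(xi * xi for xi in x) - n * x_mean ** 2
)
intercept = y_mean - slope * x_mean

print("analytic at (2, 5):  ", analytic_gradient(2.0, 5.0))
print("numeric  at (2, 5):  ", numeric_gradient(2.0, 5.0))
print("analytic at optimum: ", analytic_gradient(intercept, slope))
```

If the two gradients at an arbitrary point agree and the gradient at the closed-form solution is close to zero, the algebra above is very likely correct.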
Now, we learned how we got the formulas for the slope and intercept.
But one thing we need to consider here is that we derived these formulas for a case where we only have one feature, and even for one feature, we can see how complex the math was.
What if we have more than one feature, as most real-world datasets do?
The math becomes more complex, and this is where we use the matrix form to represent the equations. Using matrix notation, we can derive the normal equation, which generalizes to any number of features.
In Simple Linear Regression, we derived one intercept and one slope:
\[
\hat{y}
=
\beta_0+\beta_1x
\]
However, real-world problems usually contain multiple features.
For example, years of experience, education level, and age.
In such cases, Linear Regression becomes:
\[
\hat{y}
=
\beta_0
+\beta_1x_1
+\beta_2x_2
+\beta_3x_3
+\cdots
+\beta_px_p
\]
where:
\( \beta_0 \) is the intercept and
\( \beta_1,\beta_2,\beta_3,\dots,\beta_p \) are slopes for different features
As the number of features increases, solving separate equations for every parameter becomes difficult.
To solve this easily, Linear Regression is rewritten using matrix notation.
Suppose we have \( n \) observations and \( p \) features.
First define the target vector:
\[
Y
=
\begin{bmatrix}
y_1\\
y_2\\
y_3\\
\vdots\\
y_n
\end{bmatrix}
\]
Now define the feature matrix.
The first column contains only 1s to represent the intercept term.
\[
X
=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p}\\
1 & x_{21} & x_{22} & \cdots & x_{2p}\\
1 & x_{31} & x_{32} & \cdots & x_{3p}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\]
Now define the parameter vector:
\[
\beta
=
\begin{bmatrix}
\beta_0\\
\beta_1\\
\beta_2\\
\vdots\\
\beta_p
\end{bmatrix}
\]
Using matrix multiplication:
\[
X\beta
=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p}\\
1 & x_{21} & x_{22} & \cdots & x_{2p}\\
1 & x_{31} & x_{32} & \cdots & x_{3p}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\begin{bmatrix}
\beta_0\\
\beta_1\\
\beta_2\\
\vdots\\
\beta_p
\end{bmatrix}
\]
Performing the multiplication:
\[
=
\begin{bmatrix}
\beta_0+\beta_1x_{11}+\beta_2x_{12}+\cdots+\beta_px_{1p}\\
\beta_0+\beta_1x_{21}+\beta_2x_{22}+\cdots+\beta_px_{2p}\\
\beta_0+\beta_1x_{31}+\beta_2x_{32}+\cdots+\beta_px_{3p}\\
\vdots\\
\beta_0+\beta_1x_{n1}+\beta_2x_{n2}+\cdots+\beta_px_{np}
\end{bmatrix}
\]
This gives the prediction vector:
\[
\hat{Y}=X\beta
\]
Now define the residual vector.
Residuals are the differences between actual and predicted values.
\[
Y-\hat{Y}
\]
Substituting:
\[
Y-X\beta
\]
The Mean Squared Error (MSE) becomes:
\[
MSE
=
\frac{1}{n}
(Y-X\beta)^T(Y-X\beta)
\]
The transpose is required because:
\[
(Y-X\beta)
\]
is a column vector.
Multiplying by its transpose converts the expression into a scalar sum of squared residuals.
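To see that the matrix form is just a compact way of writing the same sum, here is a small sketch in plain Python. The helper `matvec` and the parameter values in `beta` are illustrative assumptions; any values would do.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]

# Feature matrix: a column of ones for the intercept, then the feature.
X = [[1.0, xi] for xi in x]
beta = [2.0, 5.0]  # arbitrary parameter values, only for illustration


def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


y_hat = matvec(X, beta)                              # prediction vector X beta
residuals = [yi - yh for yi, yh in zip(y, y_hat)]    # Y - X beta
mse_matrix = sum(r * r for r in residuals) / len(y)  # (1/n) r^T r

# The same loss written as the original scalar sum.
mse_scalar = sum((yi - (beta[0] + beta[1] * xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

print(mse_matrix, mse_scalar)
```

Both numbers agree, because \( r^Tr \) is exactly the sum of squared residuals.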
Now expand the expression.
\[
MSE
=
\frac{1}{n}
(Y-X\beta)^T(Y-X\beta)
\] \[
=
\frac{1}{n}
\left(
Y^TY
-
Y^TX\beta
-
(X\beta)^TY
+
(X\beta)^TX\beta
\right)
\]
Using transpose property:
\[
(X\beta)^T
=
\beta^TX^T
\]
Substitute into the equation:
\[
MSE
=
\frac{1}{n}
\left(
Y^TY
-
Y^TX\beta
-
\beta^TX^TY
+
\beta^TX^TX\beta
\right)
\]
Notice that:
\[
Y^TX\beta
\]
is a scalar.
Scalars are equal to their transpose.
Therefore:
\[
Y^TX\beta
=
\beta^TX^TY
\]
So the middle two terms combine:
\[
MSE
=
\frac{1}{n}
\left(
Y^TY
-
2\beta^TX^TY
+
\beta^TX^TX\beta
\right)
\]
To minimize MSE, take derivative with respect to \( \beta \).
Derivative of:
\[
Y^TY
\]
is zero because it does not contain \( \beta \).
Derivative of:
\[
-2\beta^TX^TY
\]
becomes:
\[
-2X^TY
\]
Derivative of:
\[
\beta^TX^TX\beta
\]
becomes:
\[
2X^TX\beta
\]
Therefore:
\[
\frac{\partial MSE}{\partial \beta}
=
\frac{1}{n}
\left(
-2X^TY
+
2X^TX\beta
\right)
\]
Simplify:
\[
=
\frac{-2}{n}X^TY
+
\frac{2}{n}X^TX\beta
\]
Set derivative equal to zero for minimization:
\[
\frac{-2}{n}X^TY
+
\frac{2}{n}X^TX\beta
=
0
\]
Multiply both sides by:
\[
\frac{n}{2}
\] \[
-X^TY
+
X^TX\beta
=
0
\]
Rearrange:
\[
X^TX\beta
=
X^TY
\]
Now multiply both sides by:
\[
(X^TX)^{-1}
\] \[
(X^TX)^{-1}X^TX\beta
=
(X^TX)^{-1}X^TY
\]
Using the identity matrix property:
\[
(X^TX)^{-1}(X^TX)=I
\]
we get:
\[
I\beta
=
(X^TX)^{-1}X^TY
\]
Since:
\[
I\beta=\beta
\]
the final Normal Equation becomes:
\[
\beta
=
(X^TX)^{-1}X^TY
\]
This equation simultaneously computes the intercept and all the slopes, that is, the optimal parameters that minimize the Mean Squared Error.
In general, the normal equation is derived by minimizing the RSS (Residual Sum of Squares). However, since MSE is simply RSS divided by the number of observations, minimizing MSE also produces the same normal equation.
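To make the general case concrete, here is a minimal sketch of the normal equation for any number of features, written in plain Python. A few things here are my own assumptions rather than anything from the derivation: the helper names, the tiny two-feature dataset (made up so that the true answer is known), and the choice to solve the system \( X^TX\beta = X^TY \) with Gaussian elimination instead of forming the inverse explicitly. The two approaches give the same \( \beta \) whenever \( X^TX \) is invertible.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - known) / M[i][i]
    return z


def normal_equation(features, targets):
    """Return [beta_0, beta_1, ..., beta_p] that minimizes the MSE."""
    X = [[1.0] + list(row) for row in features]   # prepend the column of ones
    Xt = [list(col) for col in zip(*X)]           # transpose of X
    XtX = [[sum(a * b for a, b in zip(ri, rj)) for rj in Xt] for ri in Xt]
    XtY = [sum(a * t for a, t in zip(ri, targets)) for ri in Xt]
    return solve(XtX, XtY)


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
features = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in features]

beta = normal_equation(features, targets)
print([round(b, 6) for b in beta])
```

Because the made-up targets follow the line exactly, the recovered parameters should be 1, 2 and 3 up to floating-point rounding.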
Now we have the normal equation. Let’s solve for the slope and intercept once again using this equation.
The matrix form of Linear Regression is:
\[
\beta=(X^TX)^{-1}X^TY
\]
Construct the feature matrix.
The first column contains 1s for the intercept term.
\[
X
=
\begin{bmatrix}
1 & 1.2\\
1 & 1.4\\
1 & 1.6\\
1 & 2.1\\
1 & 2.3\\
1 & 3.0\\
1 & 3.1\\
1 & 3.3\\
1 & 3.3\\
1 & 3.8
\end{bmatrix}
\]
Construct the target vector:
\[
Y
=
\begin{bmatrix}
39344\\
46206\\
37732\\
43526\\
39892\\
56643\\
60151\\
54446\\
64446\\
57190
\end{bmatrix}
\]
Parameter vector:
\[
\beta
=
\begin{bmatrix}
\beta_0\\
\beta_1
\end{bmatrix}
\]
Now compute the transpose:
\[
X^T
=
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\
1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8
\end{bmatrix}
\]
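Compute:
\[
X^TX
=
\begin{bmatrix}
10 & 25.1\\
25.1 & 70.49
\end{bmatrix}
\]
Now compute the inverse. The determinant is \( (10)(70.49)-(25.1)^2=74.89 \), so:
\[
(X^TX)^{-1}
=
\frac{1}{74.89}
\begin{bmatrix}
70.49 & -25.1\\
-25.1 & 10
\end{bmatrix}
\approx
\begin{bmatrix}
0.9412 & -0.3352\\
-0.3352 & 0.1335
\end{bmatrix}
\]
Now compute:
\[
X^TY
=
\begin{bmatrix}
499576\\
1321491.3
\end{bmatrix}
\]
Substitute into the Normal Equation, keeping the exact fraction so that rounding the inverse does not distort the result:
\[
\beta
=
\frac{1}{74.89}
\begin{bmatrix}
70.49 & -25.1\\
-25.1 & 10
\end{bmatrix}
\begin{bmatrix}
499576\\
1321491.3
\end{bmatrix}
\]
After multiplication:
\[
\beta
=
\frac{1}{74.89}
\begin{bmatrix}
2045680.61\\
675555.4
\end{bmatrix}
\approx
\begin{bmatrix}
27315.80\\
9020.64
\end{bmatrix}
\]
Therefore:
\[
\beta_0\approx 27315.80
\] \[
\beta_1\approx 9020.64
\]
Final regression equation:
\[
\hat{y}
=
27315.80+9020.64x
\]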
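The matrix calculation for our single-feature dataset can be reproduced in a few lines. This is a sketch under one simplifying assumption: since \( X^TX \) is only \( 2\times2 \) here, the inverse is written out by hand with the standard \( 2\times2 \) formula instead of using a linear algebra library.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)

# Entries of X^T X and X^T Y for X = [1, x].
sum_x = sum(x)
sum_xx = sum(xi * xi for xi in x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# X^T X = [[n, sum_x], [sum_x, sum_xx]]; its determinant:
det = n * sum_xx - sum_x ** 2

# Inverse of a 2x2 matrix [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
inv = [[sum_xx / det, -sum_x / det],
       [-sum_x / det, n / det]]

beta_0 = inv[0][0] * sum_y + inv[0][1] * sum_xy
beta_1 = inv[1][0] * sum_y + inv[1][1] * sum_xy

print(f"X^T X = [[{n}, {sum_x:.1f}], [{sum_x:.1f}, {sum_xx:.2f}]]")
print(f"X^T Y = [{sum_y}, {sum_xy:.1f}]")
print(f"beta_0 = {beta_0:.2f}, beta_1 = {beta_1:.2f}")
```

Since the normal equation and the slope and intercept formulas minimize the same loss, the two routes should agree up to rounding.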
Why Do We Need Gradient Descent?
Now, after getting the normal equation for linear regression, we might think that we can solve for the optimal parameters even when we have many features.
But one thing we need to observe here is that this method works well only for small or medium-sized datasets. When we have very large datasets, solving the normal equation becomes computationally expensive.
Let’s look at the normal equation:
\[
\beta = (X^TX)^{-1}X^TY
\]
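From the equation, we can observe the inverse calculation, and this is where solving for the slope and intercept using the normal equation becomes computationally expensive. \( X^TX \) has one row and one column per parameter, so the cost of inverting it grows roughly with the cube of the number of features, and just forming \( X^TX \) requires a pass over every row of the data.
This is manageable for small datasets, but in the real world, we often have thousands of features and millions of data points.
In such cases, solving the normal equation becomes slow and requires a lot of memory and computational power.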
This is where gradient descent is used, because instead of directly solving for the solution, we gradually move toward the optimal solution step by step.
Now, to understand how gradient descent works, let’s look at the math behind it.
When we were deriving the normal equation, we arrived at this equation.
\[
\frac{\partial MSE}{\partial \beta}
=
\frac{2}{n}X^T(X\beta-Y)
\]
This equation represents the gradient (slope) of the bowl-shaped loss curve.
We made it equal to zero and then solved further to get the normal equation, which is used to find the optimal solution.
But in gradient descent, we stop at this equation and initialize some random values for \( \beta \). Using these values, we calculate the gradient (slope) and gradually move toward the minimum loss step by step.
Let’s assume we initialize:
\( \beta_0 = 2 \) and \( \beta_1 = 5 \)
\[
\beta^{(0)}=
\begin{bmatrix}
\beta_0 \\
\beta_1
\end{bmatrix}
=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]
Next, we calculate the slope of the bowl curve by substituting these values into the gradient equation.
We already know that the gradient equation is:
\[
\frac{\partial MSE}{\partial \beta}
=
\frac{-2}{n}X^Ty
+
\frac{2}{n}X^TX\beta
\]
The initialized parameter values are:
\[
\beta^{(0)}=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]
These are just the starting values from where Gradient Descent begins searching for the minimum loss.
Now let’s construct the feature matrix.
Since we have one feature, the matrix \(X\) becomes:
\[
X=
\begin{bmatrix}
1 & 1.2 \\
1 & 1.4 \\
1 & 1.6 \\
1 & 2.1 \\
1 & 2.3 \\
1 & 3.0 \\
1 & 3.1 \\
1 & 3.3 \\
1 & 3.3 \\
1 & 3.8
\end{bmatrix}
\]
The first column contains ones for the intercept term.
Now calculate:
\[
X^T
\] \[
X^T=
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8
\end{bmatrix}
\]
Now calculate:
\[
X^TX
\] \[
X^TX=
\begin{bmatrix}
10 & 25.1 \\
25.1 & 70.49
\end{bmatrix}
\]
Next, let the target vector be:
\[
y=
\begin{bmatrix}
39344 \\
46206 \\
37732 \\
43526 \\
39892 \\
56643 \\
60151 \\
54446 \\
64446 \\
57190
\end{bmatrix}
\]
Now calculate:
\[
X^Ty
\] \[
X^Ty=
\begin{bmatrix}
499576 \\
1321491.3
\end{bmatrix}
\]
Our dataset contains:
\[
n=10
\]
Now substitute all the values into the gradient equation:
\[
\frac{\partial MSE}{\partial \beta}
=
\frac{-2}{10}
\begin{bmatrix}
499576 \\
1321491.3
\end{bmatrix}
+
\frac{2}{10}
\begin{bmatrix}
10 & 25.1 \\
25.1 & 70.49
\end{bmatrix}
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]
First, calculate the matrix multiplication:
\[
\begin{bmatrix}
10 & 25.1 \\
25.1 & 70.49
\end{bmatrix}
\begin{bmatrix}
2 \\
5
\end{bmatrix}
=
\begin{bmatrix}
(10)(2)+(25.1)(5) \\
(25.1)(2)+(70.49)(5)
\end{bmatrix}
\] \[
=
\begin{bmatrix}
20+125.5 \\
50.2+352.45
\end{bmatrix}
\] \[
=
\begin{bmatrix}
145.5 \\
402.65
\end{bmatrix}
\]
Now multiply by:
\[
\frac{2}{10}
\] \[
\frac{2}{10}
\begin{bmatrix}
145.5 \\
402.65
\end{bmatrix}
=
\begin{bmatrix}
29.1 \\
80.53
\end{bmatrix}
\]
Next, calculate:
\[
\frac{-2}{10}
\begin{bmatrix}
499576 \\
1321491.3
\end{bmatrix}
=
\begin{bmatrix}
-99915.2 \\
-264298.26
\end{bmatrix}
\]
Now substitute everything back:
\[
\frac{\partial MSE}{\partial \beta}
=
\begin{bmatrix}
-99915.2 \\
-264298.26
\end{bmatrix}
+
\begin{bmatrix}
29.1 \\
80.53
\end{bmatrix}
\]
Finally:
\[
\frac{\partial MSE}{\partial \beta}
=
\begin{bmatrix}
-99886.1 \\
-264217.73
\end{bmatrix}
\]
This gradient represents the slope of the bowl-shaped MSE loss curve at the current parameter values.
Here:
\[
-99886.1
\]
represents the slope with respect to \(\beta_0\)
and
\[
-264217.73
\]
represents the slope with respect to \(\beta_1\)
Since both values are negative, the loss decreases as \(\beta_0\) and \(\beta_1\) increase, so Gradient Descent moves both parameters toward the right (toward larger values) to reduce the loss.
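The gradient at the starting point can also be computed directly from the data, which is a handy cross-check on the matrix arithmetic. The sketch below uses the equivalent form \( \frac{2}{n}X^T(X\beta-y) \) written out for our two parameters; the function name `gradient` is an illustrative choice.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def gradient(beta, xs, ys):
    """Gradient of the MSE, (2/n) * X^T (X beta - y), for X = [1, x]."""
    n = len(xs)
    errors = [beta[0] + beta[1] * xi - yi for xi, yi in zip(xs, ys)]  # X beta - y
    d_b0 = 2 / n * sum(errors)                                  # row of ones in X^T
    d_b1 = 2 / n * sum(xi * e for xi, e in zip(xs, errors))     # row of x values in X^T
    return [d_b0, d_b1]


grad = gradient([2.0, 5.0], x, y)
print(f"d MSE / d beta_0 = {grad[0]:.2f}")
print(f"d MSE / d beta_1 = {grad[1]:.2f}")
```

Notice that this form never builds \( X^TX \) at all; it only needs the prediction errors for the current parameters.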
Now, instead of directly solving for the optimal parameters mathematically, Gradient Descent gradually updates the parameter values step by step until it reaches the minimum point of the bowl-shaped loss curve.
This update is performed using the Gradient Descent update equation:
\[
\beta:=\beta-\alpha\frac{\partial MSE}{\partial \beta}
\]
where:
\[
\alpha
\]
is called the learning rate and controls how large each update step should be.
The update equation can be understood step by step.
\[
\beta
\]
represents the current parameter values.
\[
\frac{\partial MSE}{\partial \beta}
\]
represents the slope (gradient) of the bowl-shaped loss curve at the current point.
The gradient tells us the direction in which the loss increases the fastest.
Therefore, to reduce the loss, we move in the opposite direction of the gradient.
This is why the update equation subtracts the gradient:
\[
\beta:=\beta-\alpha\frac{\partial MSE}{\partial \beta}
\]
Here:
\[
\alpha
\]
controls how large each step should be while moving toward the minimum point.
If the gradient is positive, Gradient Descent moves toward the left.
If the gradient is negative, Gradient Descent moves toward the right.
By repeatedly calculating gradients and updating parameters, Gradient Descent gradually moves toward the minimum point of the bowl-shaped loss curve.
After updating the parameters, the entire process is repeated again until the loss becomes minimum, and the model reaches the optimal parameters.
Notice that there is no inverse calculation involved.
One important thing we need to understand here is the learning rate.
Let’s assume:
\[
\alpha = 0.01
\]
and the calculated gradient is:
\[
\frac{\partial MSE}{\partial \beta}
=
\begin{bmatrix}
-99886.1 \\
-264217.73
\end{bmatrix}
\]
Now substitute these values into the update equation:
\[
\beta=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
-
0.01
\begin{bmatrix}
-99886.1 \\
-264217.73
\end{bmatrix}
\]
First, multiply the learning rate with the gradient:
\[
0.01
\begin{bmatrix}
-99886.1 \\
-264217.73
\end{bmatrix}
=
\begin{bmatrix}
-998.861 \\
-2642.1773
\end{bmatrix}
\]
Now substitute back:
\[
\beta=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
-
\begin{bmatrix}
-998.861 \\
-2642.1773
\end{bmatrix}
\]
then
\[
\beta=
\begin{bmatrix}
2+998.861 \\
5+2642.1773
\end{bmatrix}
\]
Finally:
\[
\beta=
\begin{bmatrix}
1000.861 \\
2647.1773
\end{bmatrix}
\]
After one iteration of Gradient Descent:
\[
\beta_0
\]
changed from:
\[
2 \rightarrow 1000.861
\]
and
\[
\beta_1
\]
changed from:
\[
5 \rightarrow 2647.1773
\]
These updated parameter values move us closer to the minimum point of the bowl-shaped MSE loss curve.
Now using these updated values, the entire process is repeated again:
\[
\text{Predictions}
\rightarrow
\text{Residuals}
\rightarrow
\text{Loss}
\rightarrow
\text{Gradient}
\rightarrow
\text{Parameter Update}
\]
This iterative process continues until the loss becomes minimum and the model reaches the optimal parameters.
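The whole loop fits in a few lines of Python. This is a minimal sketch rather than a production implementation: the fixed count of 20,000 iterations is my own assumption (in practice you would usually stop once the loss stops changing), and the first update is stored separately so it can be compared with the hand calculation.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def gradient(b0, b1):
    """Gradient of the MSE over the entire dataset (batch gradient)."""
    errors = [b0 + b1 * xi - yi for xi, yi in zip(x, y)]  # predictions - targets
    return 2 / n * sum(errors), 2 / n * sum(xi * e for xi, e in zip(x, errors))


b0, b1 = 2.0, 5.0   # starting values used in this blog
alpha = 0.01        # learning rate
first_step = None

for step in range(20000):            # iteration count is an arbitrary choice
    g0, g1 = gradient(b0, b1)
    b0, b1 = b0 - alpha * g0, b1 - alpha * g1   # update both parameters together
    if step == 0:
        first_step = (b0, b1)

print("after 1 iteration:      ", first_step)
print("after 20000 iterations: ", (round(b0, 2), round(b1, 2)))
```

With this learning rate the parameters settle at the same values as the closed-form solution, without any matrix inverse.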
Now let’s understand why choosing the learning rate is very important.
If the learning rate is very small:
\[
\alpha = 0.000001
\]
then the updates become extremely small.
As a result:
\[
\text{Very Slow Learning}
\]
and Gradient Descent may require thousands of iterations to reach the minimum point.
On the other hand, if the learning rate is very large:
\[
\alpha = 10
\]
then the updates become extremely large.
As a result, Gradient Descent may overshoot the minimum point repeatedly and fail to reach the solution.
Therefore, choosing a proper learning rate is very important for efficient optimization.
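We can see the effect of the learning rate by running the same loop with the three values mentioned above. This is only an illustrative sketch: the helper `run_gd` and the choice of 20 steps are my own assumptions, picked so that the differences show up quickly.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def mse(b0, b1):
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n


def gradient(b0, b1):
    errors = [b0 + b1 * xi - yi for xi, yi in zip(x, y)]
    return 2 / n * sum(errors), 2 / n * sum(xi * e for xi, e in zip(x, errors))


def run_gd(alpha, steps=20):
    """Run a few batch gradient descent steps from the same starting point."""
    b0, b1 = 2.0, 5.0
    for _ in range(steps):
        g0, g1 = gradient(b0, b1)
        b0, b1 = b0 - alpha * g0, b1 - alpha * g1
    return b0, b1


start_loss = mse(2.0, 5.0)
results = {}
for alpha in (0.000001, 0.01, 10.0):
    b0, b1 = run_gd(alpha)
    results[alpha] = mse(b0, b1)
    print(f"alpha={alpha:<8g} loss after 20 steps = {results[alpha]:.3e} (start {start_loss:.3e})")
```

On this dataset the tiny learning rate barely reduces the loss, the moderate one reduces it substantially, and the very large one makes the loss grow instead of shrink.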

GIF by Author
Now we have an idea about what gradient descent actually is.
In this method, we can observe that we used the entire dataset to calculate the gradients before updating the parameters.
This process can become slow for very large datasets, and this approach is called batch gradient descent because it uses the entire dataset for every update step.
Now imagine a dataset containing millions of data points.
For every single update step, Gradient Descent would need to:
\[
\text{Process Entire Dataset}
\] \[
\text{Calculate Loss}
\] \[
\text{Calculate Gradients}
\]
and then finally update the parameters.
This repeated computation is computationally expensive and time-consuming.
This is where Stochastic Gradient Descent (SGD) comes into the picture.
Instead of calculating gradients using the entire dataset, SGD randomly selects only one observation at a time and immediately updates the parameters.
The update equation still remains the same:
\[
\beta:=\beta-\alpha\frac{\partial MSE}{\partial \beta}
\]
The only difference is that the gradient is now calculated using a single observation instead of the entire dataset.
We can understand this by using one data point from our dataset.
The parameter values are:
\[
\beta^{(0)}=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]
and the learning rate is:
\[
\alpha = 0.01
\]
Now let’s say SGD randomly selected the following training example from our dataset:
\[
(x,y)=(3.0,56643)
\]
For this single observation:
\[
X=
\begin{bmatrix}
1 & 3.0
\end{bmatrix}
\]
and
\[
y=
\begin{bmatrix}
56643
\end{bmatrix}
\]
Now calculate:
\[
X^T=
\begin{bmatrix}
1 \\
3.0
\end{bmatrix}
\]
Next calculate:
\[
X^TX
\] \[
=
\begin{bmatrix}
1 \\
3.0
\end{bmatrix}
\begin{bmatrix}
1 & 3.0
\end{bmatrix}
\] \[
=
\begin{bmatrix}
1 & 3.0 \\
3.0 & 9.0
\end{bmatrix}
\]
Now calculate:
\[
X^Ty
\] \[
=
\begin{bmatrix}
1 \\
3.0
\end{bmatrix}
\begin{bmatrix}
56643
\end{bmatrix}
\] \[
=
\begin{bmatrix}
56643 \\
169929
\end{bmatrix}
\]
Since SGD is using only one observation:
\[
n=1
\]
Now substitute everything into the gradient equation:
\[
\frac{\partial MSE}{\partial \beta}
=
\frac{-2}{n}X^Ty
+
\frac{2}{n}X^TX\beta
\]
Substituting:
\[
=
\frac{-2}{1}
\begin{bmatrix}
56643 \\
169929
\end{bmatrix}
+
\frac{2}{1}
\begin{bmatrix}
1 & 3.0 \\
3.0 & 9.0
\end{bmatrix}
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]
First calculate the matrix multiplication:
\[
\begin{bmatrix}
1 & 3.0 \\
3.0 & 9.0
\end{bmatrix}
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\] \[
=
\begin{bmatrix}
(1)(2)+(3.0)(5) \\
(3.0)(2)+(9.0)(5)
\end{bmatrix}
\] \[
=
\begin{bmatrix}
2+15 \\
6+45
\end{bmatrix}
\] \[
=
\begin{bmatrix}
17 \\
51
\end{bmatrix}
\]
Now multiply by:
\[
\frac{2}{1}
\] \[
=
\begin{bmatrix}
34 \\
102
\end{bmatrix}
\]
Now calculate:
\[
\frac{-2}{1}
\begin{bmatrix}
56643 \\
169929
\end{bmatrix}
=
\begin{bmatrix}
-113286 \\
-339858
\end{bmatrix}
\]
Now substitute everything back:
\[
\frac{\partial MSE}{\partial \beta}
=
\begin{bmatrix}
-113286 \\
-339858
\end{bmatrix}
+
\begin{bmatrix}
34 \\
102
\end{bmatrix}
\]
Finally:
\[
\frac{\partial MSE}{\partial \beta}
=
\begin{bmatrix}
-113252 \\
-339756
\end{bmatrix}
\]
This gradient represents the slope of the bowl-shaped loss curve for this single training example.
Now update the parameters using:
\[
\beta:=\beta-\alpha\frac{\partial MSE}{\partial \beta}
\]
Substituting the values:
\[
\beta=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
-
0.01
\begin{bmatrix}
-113252 \\
-339756
\end{bmatrix}
\]
First multiply the learning rate:
\[
=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
-
\begin{bmatrix}
-1132.52 \\
-3397.56
\end{bmatrix}
\]
Now subtract:
\[
=
\begin{bmatrix}
2+1132.52 \\
5+3397.56
\end{bmatrix}
\]
Finally:
\[
\beta=
\begin{bmatrix}
1134.52 \\
3402.56
\end{bmatrix}
\]
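The single-observation update above is short enough to write as one small function. This is a sketch with an illustrative name (`sgd_step`); it uses the fact that for one observation the gradient reduces to twice the prediction error, multiplied by 1 for the intercept and by \( x \) for the slope.

```python
def sgd_step(beta, xi, yi, alpha):
    """One SGD update for the line y_hat = beta[0] + beta[1] * x."""
    error = beta[0] + beta[1] * xi - yi      # prediction minus target
    grad = [2 * error, 2 * error * xi]       # gradient for this single example
    new_beta = [beta[0] - alpha * grad[0], beta[1] - alpha * grad[1]]
    return new_beta, grad


beta, grad = sgd_step([2.0, 5.0], xi=3.0, yi=56643, alpha=0.01)
print("gradient:", grad)   # [-113252.0, -339756.0]
print("new beta:", [round(b, 2) for b in beta])   # [1134.52, 3402.56]
```

The printed values match the hand calculation, and no matrix is ever built.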
After solving for just one observation, the parameters immediately get updated.
Now SGD randomly selects another observation from the dataset and repeats the same process again.
Unlike batch gradient descent, which waits to process the entire dataset before updating the parameters, SGD updates the parameters after every single training example.
Because each update needs only one observation, SGD can start improving the parameters long before batch gradient descent would have finished a single pass over a large dataset.
We can observe how simple the calculation becomes when using just one observation.
SGD continues updating the parameters repeatedly using different training examples until the loss becomes minimum or stops changing significantly.
The trade-off is that each single-example gradient is only a noisy estimate of the full gradient, so the path toward the minimum point becomes noisy and zig-zag in nature.
Even so, the cheap updates make SGD highly useful for modern machine learning and deep learning problems involving very large datasets.
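Repeating the single-observation update in a loop gives a basic SGD routine. The sketch below makes a few assumptions of its own: a fixed random seed so the run is repeatable, a fixed count of 20,000 updates, and the same constant learning rate as before. With a constant learning rate, SGD does not settle exactly on the minimum; it keeps jittering around it, so the final numbers will differ slightly from the closed-form values.

```python
import random

x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def mse(b0, b1):
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n


rng = random.Random(0)   # fixed seed, only so the run is repeatable
b0, b1 = 2.0, 5.0
alpha = 0.01
initial_loss = mse(b0, b1)

for step in range(20000):                  # update count is an arbitrary choice
    i = rng.randrange(n)                   # pick one observation at random
    error = b0 + b1 * x[i] - y[i]          # prediction error for that observation
    b0, b1 = b0 - alpha * 2 * error, b1 - alpha * 2 * error * x[i]

final_loss = mse(b0, b1)
print(f"initial loss = {initial_loss:.3e}")
print(f"final loss   = {final_loss:.3e}")
print(f"beta_0 = {b0:.1f}, beta_1 = {b1:.1f}")
```

A common refinement, not shown here, is to shrink the learning rate over time so that the jitter around the minimum dies down.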
Now we have an idea of both gradient descent and stochastic gradient descent.
First, we derived the normal equation, and then we learned that the inverse matrix calculation becomes computationally expensive and memory usage becomes high for large datasets.
To solve this problem, we used gradient descent, which is not limited to linear regression but is also used in many machine learning and deep learning algorithms.
Next, we learned that even the first method of gradient descent that we used, called batch gradient descent, can become slow for very large datasets because it uses the entire dataset before updating parameters.
This led us to stochastic gradient descent (SGD), which updates the parameters using one training example at a time and works faster than batch gradient descent for large datasets.
We also have another variation of gradient descent called mini-batch gradient descent, in which we use a small batch of training examples from the dataset, such as 32 or 64 rows, before updating the parameters.
In this way, it becomes faster than batch gradient descent and more stable than stochastic gradient descent.
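Mini-batch gradient descent can be sketched with one extra parameter, the batch size. Everything specific here is an assumption for illustration: the function name `minibatch_gd`, the seed, the step count, and a batch size of 4, chosen only because our dataset has just 10 rows (real workloads would use sizes like the 32 or 64 mentioned above). Setting the batch size to the full dataset size turns the same code into batch gradient descent.

```python
import random

x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def mse(b0, b1):
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n


def minibatch_gd(batch_size, steps=20000, alpha=0.01, seed=0):
    """Gradient descent where each update uses `batch_size` random rows."""
    rng = random.Random(seed)
    b0, b1 = 2.0, 5.0
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)          # rows without repeats
        errors = [(b0 + b1 * x[i] - y[i], x[i]) for i in batch]
        g0 = 2 / batch_size * sum(e for e, xi in errors)
        g1 = 2 / batch_size * sum(e * xi for e, xi in errors)
        b0, b1 = b0 - alpha * g0, b1 - alpha * g1
    return b0, b1


mini = minibatch_gd(batch_size=4)
full = minibatch_gd(batch_size=n)    # identical to batch gradient descent
print("mini-batch (4 rows):", [round(v, 1) for v in mini], f"loss={mse(*mini):.3e}")
print("full batch (10 rows):", [round(v, 1) for v in full], f"loss={mse(*full):.3e}")
```

The full-batch run lands on the closed-form solution, while the mini-batch run ends close to it with a little leftover noise.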
Even though linear regression has a closed-form solution, we often prefer to use gradient descent when working with large datasets containing millions of observations because the normal equation becomes computationally expensive and impractical.
In deep learning, however, closed-form solutions usually do not exist, which makes optimization algorithms like gradient descent even more important.
Dataset License
The dataset used in this blog is the Salary dataset.
It is publicly available on Kaggle and is licensed under the Creative Commons Zero (CC0 Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.
I hope you now have a better understanding of what gradient descent and stochastic gradient descent actually are.
Thanks for reading!