Author: Yeshwanth Buggaveeti
In this post we discuss how the gradient descent method works and build it in Python from scratch. First, we define linear regression and its loss function, then we study how gradient descent works, and finally we apply it to a specific dataset to generate predictions.
Introduction:
Gradient descent is a first-order iterative optimisation technique for finding a local minimum of a differentiable function. The idea is to take repeated steps in the direction opposite to the gradient (or an approximate gradient) of the function at the current point, because this is the direction of steepest descent. Stepping in the direction of the gradient, on the other hand, leads towards a local maximum of the function; that technique is known as gradient ascent. The formula below summarises the whole gradient descent method in a single line.
![](https://static.wixstatic.com/media/6e3b57_7cb18204b66a4bafa2d9c76f75bfffba~mv2.jpeg/v1/fill/w_980,h_288,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/6e3b57_7cb18204b66a4bafa2d9c76f75bfffba~mv2.jpeg)
FIG:1
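To make the idea concrete before we touch regression, here is a minimal sketch of the update rule on a toy one-variable function. The function f(x) = (x − 3)² and the step size are illustrative choices, not part of the original article.

```python
# Toy example of the gradient descent update rule on f(x) = (x - 3)**2,
# whose derivative is 2 * (x - 3). The minimum is at x = 3.
learning_rate = 0.1   # the step size ("L" in the formula)
x = 0.0               # arbitrary starting point

for step in range(100):
    gradient = 2 * (x - 3)             # slope of f at the current point
    x = x - learning_rate * gradient   # move against the gradient

print(x)  # approaches 3.0, the minimiser of f
```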
Regression Model:
Linear regression is a statistical tool for modelling the relationship between a response variable and one or more explanatory variables. Assume that X is the independent variable and Y is the response variable. A linear relationship between these two variables is defined as follows:
Y = mX + c
This is the equation of a straight line that you learned in school. The slope of the line is m, and c is the y-intercept. We'll use this equation to train our model on a given dataset and predict the value of Y for any given X. Today's task is to find the values of m and c such that the corresponding line is the best-fitting line, i.e. the one with the least error.
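As a quick illustration, predicting Y for a given X is just evaluating this line; the slope and intercept below are placeholders, not values from the dataset.

```python
# Evaluate the line y = m*x + c for a given x.
def predict(x, m, c):
    return m * x + c

# Placeholder slope and intercept; gradient descent will find the real ones.
print(predict(2.0, m=1.5, c=0.3))  # -> 3.3
```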
Loss Function:
The loss is the error in our predicted values of y for the current m and c. Our objective is to reduce this error as much as possible in order to find the most accurate values of m and c. Strictly speaking, the loss function computes the error for a single training sample, whereas the cost function averages the loss over all training instances; I'll be using the two terms interchangeably from here on.
To compute the loss, we shall employ the Mean Squared Error (MSE) function. This function consists of three steps:
1. For each value of the independent variable, find the difference between the actual and the predicted y value (y = mx + c). This is the error between the prediction and the truth:
ERROR = Y’ (Predicted) – Y (Actual)
2. Square this difference.
3. Calculate the mean of these squared differences over all values of X.
A cost function essentially tells us "how good" our model is at making predictions for a given m and c.
E = \frac{1}{n}\sum_{i=0}^{n}\left(y_i - \bar{y}_i\right)^2
Mean Squared Error Formula
yᵢ is the actual value
ȳᵢ is the predicted value.
Let's substitute ȳᵢ with the regression formula. The substituted formula is given below:
E = \frac{1}{n}\sum_{i=0}^{n}\left(y_i - (m x_i + c)\right)^2
In short, we square the error and take the mean, hence the name Mean Squared Error. Now that we've defined the loss function, we can move on to the fun part: minimising it and finding the best m and c.
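A minimal sketch of this loss in NumPy is given below, assuming x and y are arrays of the same length; the toy points are illustrative only.

```python
import numpy as np

# Mean Squared Error for the line y = m*x + c.
def mse(x, y, m, c):
    y_pred = m * x + c                  # predicted values
    return np.mean((y - y_pred) ** 2)   # average of squared errors

# Toy check: points lying exactly on y = 2x + 1 give zero loss.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(mse(x, y, m=2.0, c=1.0))  # 0.0
print(mse(x, y, m=1.0, c=0.0))  # a worse line gives a larger error
```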
The Algorithm of Gradient Descent:
Consider a valley and a person with no sense of direction who wants to reach the bottom of the valley. He walks down the slope, taking big steps when the slope is steep and small ones when it is gentle. He decides his next position based on his current position, and stops when he reaches the bottom of the valley, which was his goal.
![](https://static.wixstatic.com/media/6e3b57_4a5a9b7679eb42cfb28a7c208d820b3b~mv2.jpeg/v1/fill/w_980,h_664,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_4a5a9b7679eb42cfb28a7c208d820b3b~mv2.jpeg)
FIG:2 Image explanation of gradient descent
Let's apply gradient descent to m and c, one step at a time:
Let m and c start at zero. Let L represent our learning rate, which determines how much m and c change with each step. For good precision, L can be set to a small value such as 0.0001.
Calculate the partial derivative of the loss function with respect to m, then plug in the current values of x, y, m, and c to obtain the derivative value Dₘ. Similarly, find Dc, the partial derivative with respect to c:
![](https://static.wixstatic.com/media/b670ab_633c7a24c27c466ea765b4f3ac274f34~mv2.jpg/v1/fill/w_636,h_239,al_c,q_80,enc_auto/b670ab_633c7a24c27c466ea765b4f3ac274f34~mv2.jpg)
FIG:3 Derivative with respect to m
![](https://static.wixstatic.com/media/6e3b57_e686761adc854038afb3ef3bec3494d9~mv2.jpg/v1/fill/w_802,h_211,al_c,q_80,enc_auto/6e3b57_e686761adc854038afb3ef3bec3494d9~mv2.jpg)
FIG:4 Derivative with respect to c
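In code, these two derivatives can be sketched as below, assuming x and y are NumPy arrays; the expressions follow the standard MSE derivatives shown in FIG:3 and FIG:4.

```python
import numpy as np

# Partial derivatives of the MSE loss with respect to m and c.
def gradients(x, y, m, c):
    n = len(x)
    y_pred = m * x + c                          # current predictions
    d_m = (-2 / n) * np.sum(x * (y - y_pred))   # Dm = dE/dm
    d_c = (-2 / n) * np.sum(y - y_pred)         # Dc = dE/dc
    return d_m, d_c
```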
We now use the following equations to update the current values of m and c:
![](https://static.wixstatic.com/media/b670ab_3be776c50aef4c6cb9024555e6b714e9~mv2.jpg/v1/fill/w_577,h_215,al_c,q_80,enc_auto/b670ab_3be776c50aef4c6cb9024555e6b714e9~mv2.jpg)
FIG:5
We repeat this process until our loss function is very small, or ideally zero (which would mean 0 error or 100 percent accuracy). The values of m and c that we are left with are the optimal values.
Returning to our analogy, m represents the person's current position. D represents the steepness of the slope, and L represents the speed at which he moves. The new value of m computed by the preceding equation is his next position, and L×D is the size of the step he takes. When the slope is steeper (D is larger), he takes longer steps; when the slope is shallower (D is smaller), he takes smaller steps. Eventually, he reaches the bottom of the valley, which corresponds to our loss reaching its minimum (ideally 0). Our model is now ready to make predictions using the optimal values of m and c!
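Before turning to the implementation, here is a minimal sketch of the whole loop just described, assuming x and y are NumPy arrays; the learning rate and number of iterations are illustrative defaults, not the article's exact settings.

```python
import numpy as np

# Fit y = m*x + c by gradient descent on the MSE loss.
def gradient_descent(x, y, learning_rate=0.0001, epochs=10000):
    m, c = 0.0, 0.0                                 # start both parameters at zero
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + c
        d_m = (-2 / n) * np.sum(x * (y - y_pred))   # Dm
        d_c = (-2 / n) * np.sum(y - y_pred)         # Dc
        m = m - learning_rate * d_m                 # step m against its gradient
        c = c - learning_rate * d_c                 # step c against its gradient
    return m, c
```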
Implementation using python:
![](https://static.wixstatic.com/media/6e3b57_b1e83024cf464b5ab0066c1e8beedd32~mv2.jpeg/v1/fill/w_980,h_820,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_b1e83024cf464b5ab0066c1e8beedd32~mv2.jpeg)
FIG:6
In the image above, we import the required libraries and plot all the data points of the dataset using the scatter plot function.
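A sketch of this setup is below; the file name 'data.csv' and the column names 'X' and 'Y' are assumptions, since the screenshot is not reproduced here.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset and plot the raw points.
data = pd.read_csv('data.csv')   # assumed file name
x = data['X'].values             # assumed column names
y = data['Y'].values

plt.scatter(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
```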
![](https://static.wixstatic.com/media/6e3b57_ea8e24308cb94ab29274c3308f797aa0~mv2.jpeg/v1/fill/w_980,h_475,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_ea8e24308cb94ab29274c3308f797aa0~mv2.jpeg)
FIG:7
We then found the values of m and c using the gradient descent update formulas, in order to draw the best-fitting line among the data points, the one with the least error/residuals.
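Roughly, this step amounts to running the gradient_descent sketch from the previous section on the x and y arrays loaded above; the hyperparameters are illustrative.

```python
# Fit the line; learning rate and epoch count are illustrative choices.
m, c = gradient_descent(x, y, learning_rate=0.0001, epochs=10000)
print(f"m = {m:.4f}, c = {c:.4f}")
```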
![](https://static.wixstatic.com/media/6e3b57_8d266713cb344a548a8bb4e30494f593~mv2.jpeg/v1/fill/w_980,h_742,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_8d266713cb344a548a8bb4e30494f593~mv2.jpeg)
FIG:8
The regression line drawn among the data points is the Best fit.
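The final plot can be sketched as below, assuming x, y, m and c from the earlier sketches.

```python
import matplotlib.pyplot as plt

# Plot the data points together with the fitted regression line.
plt.scatter(x, y, label='data')
plt.plot(x, m * x + c, color='red', label='best fit line')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
```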
Conclusion:
For linear regression, we employed gradient descent as our optimisation technique. Gradient descent is one of the simplest and most commonly used machine learning algorithms, primarily because it can be used to optimise almost any differentiable function. It lays the groundwork for understanding data science by using a line of best fit to model the relationship between X and Y. It is worth noting that the linear regression example was chosen for simplicity; gradient descent can be applied with other machine learning approaches as well.
GitHub:
https://github.com/yeshwanth69/Gradient-Descent