Satyajit Das posted an update in the group Machine Learning 1 month, 3 weeks ago
What is Gradient Descent and Stochastic Gradient Descent :-
==============================================
Mathematically speaking, a slope = delta(y)/delta(x) is a scalar that is used when there is a single-variable function. When there is a multi-variable function,
we use the term “Gradient”, which is a vector: each component is the partial derivative of the function with respect to one of its variables at that point,
and the vector as a whole points in the direction in which the function increases fastest.
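To make the idea of "a vector of partial derivatives" concrete, here is a small sketch that approximates the gradient numerically. The helper name numerical_gradient and the example function f(x, y) = x^2 + 3y are my own choices for illustration, not part of any library.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` using central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h   # nudge only the i-th variable up
        backward[i] -= h  # and down
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(p):
    # Hypothetical example function: f(x, y) = x^2 + 3y
    x, y = p
    return x ** 2 + 3 * y


gradient = numerical_gradient(f, [2.0, 1.0])
print(gradient)  # close to [4.0, 3.0], i.e. (df/dx, df/dy) at (2, 1)
```

Analytically df/dx = 2x and df/dy = 3, so at the point (2, 1) the gradient is the vector (4, 3), one component per variable.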
Now, in ML language, “Gradient Descent” is a first-order iterative optimization algorithm for finding the minimum of a function. In this algorithm we repeatedly take a step in the direction opposite to the gradient at the current point, since that is the direction in which the function decreases fastest.
Let us take a case with a function J(theta0,theta1). Suppose we want to find the best parameters theta0 and theta1 for our learning algorithm, whose cost
J(theta0,theta1) is a multivariate function.
Here we keep changing the values of theta0 and theta1 to reduce J(theta0,theta1) until we end up at a minimum value, called a “local minimum”.
repeat until convergence {
thetaj = thetaj - alpha * d(J(theta0,theta1))/d(thetaj)    for j=0 and j=1 (both updated simultaneously)
}
Note :-
Here theta0 and theta1 are the variable parameters; by adjusting these we want our function J(theta0,theta1) to reach a minimum.
The d term represents the partial derivative of J w.r.t. thetaj (j=0, j=1).
alpha is the “learning rate”. In practice it must be kept small enough to reach the minimum; otherwise each step overshoots the minimum and the value of J(theta0,theta1) can grow instead of shrinking.
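The update rule above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the cost J(theta0,theta1) = (theta0 - 3)^2 + (theta1 + 1)^2 is a made-up bowl-shaped function with its minimum at (3, -1), and the starting point, alpha = 0.1 and the stopping tolerance are arbitrary choices.

```python
def J(theta0, theta1):
    # Hypothetical cost function with its minimum at theta0 = 3, theta1 = -1
    return (theta0 - 3.0) ** 2 + (theta1 + 1.0) ** 2


def grad_J(theta0, theta1):
    # Partial derivatives of J with respect to theta0 and theta1
    return 2.0 * (theta0 - 3.0), 2.0 * (theta1 + 1.0)


def gradient_descent(theta0, theta1, alpha=0.1, tol=1e-10, max_iters=10_000):
    for _ in range(max_iters):
        d0, d1 = grad_J(theta0, theta1)
        # Simultaneous update: both new values use the old theta0, theta1
        new0 = theta0 - alpha * d0
        new1 = theta1 - alpha * d1
        # Stop when the cost has (almost) stopped decreasing
        if abs(J(theta0, theta1) - J(new0, new1)) < tol:
            return new0, new1
        theta0, theta1 = new0, new1
    return theta0, theta1


theta0, theta1 = gradient_descent(0.0, 0.0)
print(theta0, theta1)  # close to 3 and -1
```

Here "converge" is taken to mean that the cost changes by less than a small tolerance between two steps; other stopping rules (a small gradient, a fixed number of iterations) are equally common.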
Consider a 3-D space (x,y,z) where x represents theta0, y -> theta1 and z -> the function J(theta0,theta1).
The surface of J(theta0,theta1) in this space is also called the cost space (as it is 3-D), and J itself is the cost function.
The cost space/cost function gives the value of J(theta0,theta1) for different values of theta0 and theta1.
Note:- To check whether the loss function we have chosen is correct, we first check it on a small set of data and only then go for all the remaining datasets;
otherwise computation is wasted.
Broadly, “Gradient Descent” algorithms are classified in two ways:-
1) On the basis of data ingestion.
2) On the basis of differentiation techniques.
1) On the basis of data ingestion :-
=====================================
a) Full Batch Gradient Descent Algorithm.
b) Stochastic Gradient Descent Algorithm.
c) Mini-Batch Gradient Descent Algorithm.
a) Full Batch Gradient Descent Algorithm :-
=========================================
It calculates the error for each example within the training dataset, and only after all the training examples have been evaluated do the model parameters get updated.
Advantage:- It is computationally efficient per pass over the data, and it produces a stable error gradient and stable convergence.
Disadvantage:- It sometimes results in a state of convergence that is not the best model, and every single update needs the entire dataset to be processed, which is slow and memory-hungry for large datasets.
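As a rough sketch of full batch gradient descent, consider fitting a straight line y = theta0 + theta1*x with the usual mean squared error cost. The tiny dataset (exactly y = 2x + 1), alpha = 0.05 and the number of epochs are my own assumptions, chosen only so that the result is easy to check.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Evaluate the error on EVERY training example first...
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # ...and only then update the parameters, once per pass
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical noise-free data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

batch_theta0, batch_theta1 = batch_gradient_descent(xs, ys)
print(batch_theta0, batch_theta1)  # close to 1 and 2
```

Note that there is exactly one parameter update per pass over the data, which is what makes the gradient stable but the updates infrequent.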
b) Stochastic Gradient Descent Algorithm :-
==========================================
It calculates the error for each training example within the dataset. That means it updates the parameters for each training example, one by one.
Advantage:- The frequent updates let us monitor progress and give a fairly detailed picture of the rate of improvement.
Disadvantage :- The frequent updates are more computationally expensive. We also have to take care with the learning rate, since the updates are noisy and the error rate may jump around if the learning rate is too high.
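For comparison, here is a minimal sketch of the stochastic version on a straight-line fit y = theta0 + theta1*x. The data (exactly y = 2x + 1), alpha = 0.02, the epoch count and the fixed random seed are all assumptions I made for illustration.

```python
import random


def stochastic_gradient_descent(xs, ys, alpha=0.02, epochs=1000, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for i in indices:
            error = theta0 + theta1 * xs[i] - ys[i]
            # Update immediately, using this single example only
            theta0 -= alpha * error
            theta1 -= alpha * error * xs[i]
    return theta0, theta1


# Hypothetical noise-free data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

sgd_theta0, sgd_theta1 = stochastic_gradient_descent(xs, ys)
print(sgd_theta0, sgd_theta1)  # close to 1 and 2
```

With five examples there are five parameter updates per pass over the data. On real, noisy data the parameters would keep jittering around the minimum unless the learning rate is reduced over time.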
c) Mini-Batch Gradient Descent Algorithm:-
=========================================
Mini-batch Gradient Descent is usually the preferred method, since it combines the concepts of SGDA (Stochastic Gradient Descent Algorithm) and BGDA (Batch Gradient Descent Algorithm).
It simply splits the training dataset into small batches and performs an update for each of these batches; hence it maintains a balance between the robustness of SGDA and the efficiency of BGDA.
Common mini-batch sizes range between 50 and 256, but there is no fixed rule.
Note: Mini-batch is the usual choice when we are training a neural network, and it is the most common type of gradient descent within deep learning.
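A minimal sketch of the mini-batch idea, again on a straight-line fit y = theta0 + theta1*x. The batch size of 4 is far smaller than the typical sizes mentioned above only because the toy dataset has 20 points; the data, alpha = 0.1, the epoch count and the seed are my own assumptions.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=4, alpha=0.1,
                               epochs=500, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Walk through the shuffled data one small batch at a time
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [theta0 + theta1 * xs[i] - ys[i] for i in batch]
            grad0 = sum(errors) / len(batch)
            grad1 = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            # One update per batch, averaged over the batch
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical noise-free data generated from y = 2x + 1
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

mb_theta0, mb_theta1 = minibatch_gradient_descent(xs, ys)
print(mb_theta0, mb_theta1)  # close to 1 and 2
```

Setting batch_size to 1 gives the stochastic variant and setting it to len(xs) gives the full batch variant, which is why mini-batch is often described as the middle ground between the two.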
2) On the basis of differentiation techniques :-
================================================
a) First order Differentiation :- What we discussed above is first-order differentiation: only the first derivative (the gradient) is used, and the value of J must decrease as theta0 and theta1 change. If we are not satisfied with the cost function
value, then we can go for 2nd order differentiation.
b) Second order Differentiation :- It also takes the second derivative of the function and uses it to check the point of convergence and the function value.
The first derivative gives the slope, and the second derivative gives the curvature, which tells us whether a point of zero slope is a minimum (the second derivative is positive there).
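One well-known second-order method is Newton's method, which divides the slope by the curvature to decide the step size. The sketch below applies it to a single-variable function; the function f(x) = x^2 + exp(-x), the starting point and the tolerance are my own choices for illustration.

```python
import math


def f(x):
    # Hypothetical convex example function
    return x ** 2 + math.exp(-x)


def f_prime(x):
    # First derivative: the slope
    return 2.0 * x - math.exp(-x)


def f_double_prime(x):
    # Second derivative: the curvature (always positive for this f)
    return 2.0 + math.exp(-x)


def newton_minimize(x, tol=1e-12, max_iters=50):
    for _ in range(max_iters):
        # Newton step: slope divided by curvature
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x


x_min = newton_minimize(0.0)
print(x_min, f_prime(x_min), f_double_prime(x_min))
```

At the returned point the slope is practically zero and the second derivative is positive, which is the check for a minimum. No learning rate is needed here, but the second derivative has to be computed, which becomes expensive when there are many parameters.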
Note :- When applying Gradient Descent algorithms, we should look at the following points, which can help in finding out the problem.
1) Error rates :- You should check the error after a specific number of iterations; it must decrease, and if it does not there might be a problem.
2) Learning rate :- We have to keep checking the learning rate. Sometimes there is a sudden increase in J (the cost function);
this shows overshooting due to a high value of the learning rate. The value is most often taken in the range between 0.0 and 1.0.
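Both checks can be seen on a toy example. The sketch below records the cost after every step for the one-parameter cost J(theta) = theta^2, once with a small learning rate and once with a too-large one; the function, the two alpha values and the helper names are assumptions made for illustration.

```python
def cost_history(alpha, theta=1.0, steps=10):
    """Run gradient descent on J(theta) = theta^2 and record the cost."""
    history = [theta ** 2]
    for _ in range(steps):
        theta -= alpha * 2.0 * theta  # the derivative of theta^2 is 2*theta
        history.append(theta ** 2)
    return history


def is_decreasing(history):
    # True if every recorded cost is lower than the one before it
    return all(later < earlier for earlier, later in zip(history, history[1:]))


small = cost_history(alpha=0.1)  # each step shrinks theta by a factor 0.8
large = cost_history(alpha=1.1)  # each step flips theta and grows it by 1.2

print(is_decreasing(small))  # True
print(is_decreasing(large))  # False: the cost grows, a sign of overshooting
```

With alpha = 1.1 every step jumps past the minimum to a point further away on the other side, so the cost increases step after step instead of decreasing.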
Please let me know if there is any error! I will be happy to improve!
Thanks.