CODEBUG (Posts about gradient-descent)https://sijanb.com.np/enContents © 2024 <a href="mailto:sijanonly@gmail.com">Sijan Bhandari</a> Sat, 01 Jun 2024 05:11:05 GMTNikola (getnikola.com)http://blogs.law.harvard.edu/tech/rss- What are different types of gradient descent algorithm in machine learning ?https://sijanb.com.np/posts/what-are-different-types-of-gradient-descent-algorithm-in-machine-learning/Sijan Bhandari<p>There exist three distinct types of gradient descent learning algorithms: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.</p>
<p>Batch Gradient Descent (BGD)</p>
<p>In Batch Gradient Descent, the term 'batch' signifies that the entire training dataset is used during each iteration of the learning process. Incorporating all training examples into every update yields stable error gradients and a consistent trajectory towards the optimal solution, albeit at a significant computational cost per update.
Averaging over the full batch makes each update efficient in a vectorized sense, but for large training datasets it still leads to long processing times and requires the entire dataset to be held in memory. While Batch Gradient Descent typically yields a stable error gradient and reliable convergence, it occasionally converges to a local minimum rather than the global optimum.</p>
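As a minimal sketch of this full-dataset update rule, the following fits a simple linear model with batch gradient descent; the toy data, learning rate, and iteration count are all illustrative choices, not prescribed values:

```python
import numpy as np

# Hypothetical toy data: learn y = 2x + 1 under a mean-squared-error loss.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 1.0

w, b = 0.0, 0.0   # parameters to learn
lr = 0.1          # assumed learning rate
for _ in range(500):
    error = w * X + b - y
    # Gradients are averaged over the ENTIRE training set (the "batch").
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b
```

Because every update sees all examples, the trajectory of (w, b) is smooth, but each iteration touches the whole dataset.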
<p>Stochastic Gradient Descent (SGD)</p>
<p>Stochastic Gradient Descent (SGD) updates the parameters using a single data point in each iteration. Because only one training example needs to be held in memory at a time, SGD minimizes memory requirements. These frequent, per-example updates provide detailed and rapid adjustments, but they cannot exploit vectorized computation, which can make SGD less computationally efficient than batch gradient descent. The resulting gradients are noisy, yet this noise can facilitate the escape from local minima, thereby aiding the pursuit of a global minimum. The term "stochastic" reflects the random selection of each example; given sufficient iterations, SGD proves effective, albeit with inherent noisiness.</p>
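The per-example update can be sketched on the same kind of toy problem; the target function, learning rate, and epoch count here are hypothetical:

```python
import numpy as np

# Hypothetical toy problem: fit y = 3x with stochastic gradient descent,
# updating the weight after EACH individual example.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X

w = 0.0
lr = 0.05  # assumed learning rate
for epoch in range(20):
    for i in rng.permutation(len(X)):   # "stochastic": random example order
        error = w * X[i] - y[i]
        w -= lr * 2 * error * X[i]      # gradient from one example only
```

Reshuffling the examples each epoch is what makes successive gradients noisy but, on average, pointed downhill.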
<p>Mini Batch Gradient Descent</p>
<p>Mini-batch gradient descent integrates principles from both batch gradient descent and stochastic gradient descent. It partitions the training dataset into small batches and performs an update on each of these batches. This methodology strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. As in stochastic gradient descent, the average cost over epochs fluctuates, because only a limited number of examples is averaged at a time.</p>
gradient-descent machine-learning machine-learning-glossary
https://sijanb.com.np/posts/what-are-different-types-of-gradient-descent-algorithm-in-machine-learning/
Mon, 27 May 2024 17:12:35 GMT
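The mini-batch scheme described in the post above can be sketched as follows; the batch size of 16 and the toy regression target are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch: mini-batch gradient descent on y = -x + 4.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=128)
y = -1.0 * X + 4.0

w, b = 0.0, 0.0
lr, batch_size = 0.1, 16
for epoch in range(100):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]  # one mini-batch
        error = w * X[idx] + b - y[idx]
        w -= lr * 2 * np.mean(error * X[idx])  # gradients averaged over
        b -= lr * 2 * np.mean(error)           # the mini-batch only
```

Each update averages over only 16 examples, so individual steps are cheaper than a full-batch step yet far less noisy than single-example SGD.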
- What is gradient descent ?https://sijanb.com.np/posts/what-is-gradient-descent/Sijan Bhandari<p>Short:</p>
<p>Gradient descent is an optimization algorithm used to determine the coefficients / parameters of a function (f) that minimize a cost function.</p>
<p><a class="reference external" href="https://machinelearningmastery.com/gradient-descent-for-machine-learning/">reference</a></p>
<p>Detail:</p>
<p>Gradient descent is a widely used optimization approach for training machine learning models and neural networks. Optimization is the process of minimizing or maximizing an objective function.
For gradient descent, this entails computing the gradient (the partial derivatives) of the cost function with respect to each parameter (the weights and biases). To do this, the model is given the training data iteratively,
and the gradient is evaluated at the current parameter values. The gradient consistently indicates the direction of the steepest increase in the loss function, so the gradient descent algorithm proceeds by taking a step in the direction of the negative gradient to minimize the loss as efficiently as possible.</p>
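The negative-gradient step can be illustrated on a one-variable function. A minimal sketch, assuming the hypothetical objective f(w) = (w - 5)**2, whose gradient is 2(w - 5), with an arbitrary learning rate of 0.1:

```python
# Gradient descent on f(w) = (w - 5)**2; the minimum is at w = 5.
def grad(w):
    return 2 * (w - 5)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)   # step in the NEGATIVE gradient direction
```

Each step moves w against the gradient, so w climbs towards 5 and the steps shrink as the gradient approaches zero.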
<ul class="simple">
<li><p>The selection of the learning rate has a substantial influence on the effectiveness of gradient descent. An excessively high learning rate may cause the algorithm to overshoot the minimum, while an excessively low learning rate may result in prolonged convergence times.</p></li>
<li><p>Due to its non-convexity, the loss function L(w) of a neural network is generally known to potentially have multiple local minima. When multiple local minima are present, it is highly likely that the algorithm may fail to converge to a global minimum. Thus, local minima pose significant challenges as they may cause the training process to stall rather than progress towards the global minimum.</p></li>
<li><p>Gradient descent, despite being a widely utilized optimization algorithm, does not ensure convergence to an optimum in all scenarios. Various factors can hinder convergence. Saddle points are one: in high-dimensional spaces, gradient descent may become trapped at points where the gradient is zero but which do not correspond to a minimum.</p></li>
<li><p>Saddle points in a multivariable function are critical points where the function does not achieve either a local maximum or a local minimum value.</p></li>
<li><p>A common issue with both local minima and saddle points is the presence of plateaus with low curvature in the error landscape. Although gradient descent dynamics are repelled from a saddle point towards lower error by following directions of negative curvature, this repulsion can be slow due to the plateau.</p></li>
<li><p>Stochastic Gradient Descent (SGD) can occasionally escape simple saddle points if fluctuations occur in different directions and the step size is sufficiently large to overcome the flatness. However, saddle regions can sometimes be quite complex.</p></li>
<li><p>The gradient of the error, defined over the difference between the actual and predicted outputs, approaches zero near a local minimum. Since weight-correction steps are proportional to the gradient's magnitude, progress stalls there. Techniques such as random weight initialization can be employed to mitigate this issue.</p></li>
</ul>
gradient-descent machine-learning machine-learning-glossary
https://sijanb.com.np/posts/what-is-gradient-descent/
Sat, 11 May 2024 07:02:15 GMT
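The learning-rate sensitivity noted in the bullets above can be demonstrated on the simple convex function f(w) = w**2 (gradient 2w); the specific rates below are illustrative. A rate in the stable range shrinks w every iteration, while an oversized rate overshoots the minimum by more than it corrects:

```python
# Effect of the learning rate on f(w) = w**2, whose gradient is 2w.
def run(lr, steps=50, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w      # each step multiplies w by (1 - 2 * lr)
    return w

good = run(0.1)   # |1 - 2*0.1| = 0.8 < 1: w decays towards 0
bad = run(1.1)    # |1 - 2*1.1| = 1.2 > 1: w overshoots and diverges
```

On this quadratic the boundary is lr = 1, but in general the stable range depends on the curvature of the loss, which is why the choice has such a substantial influence in practice.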