Optimization is the process of minimizing the cost function for a given set of data. Yet for a model to actually perform better in the field, on data outside of the training set, we have to use Optimization strategies in conjunction with Regularization, which increases a model's ability to generalize.
Gradient Descent is like a blindfolded hiker trying to reach the bottom of a mountain by using his feet to feel the slope of the ground. He then takes a step of a certain size in the direction in which the slope descends. More formally, Gradient Descent updates the model parameters by subtracting the gradient, multiplied by a step size, from the current parameters. The gradient is that of the Loss Function with respect to the parameters, averaged over all examples in the training set.
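The update rule above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming a linear model with a mean-squared-error loss; the function name, the step size of 0.1, and the synthetic data are all assumptions made for the example, not part of the original text.

```python
import numpy as np

def gradient_descent(X, y, step_size=0.1, n_steps=500):
    """Full-batch gradient descent for linear regression with an MSE loss."""
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_steps):
        residual = X @ w - y                      # prediction error per example
        grad = 2.0 * X.T @ residual / n_examples  # gradient averaged over ALL examples
        w = w - step_size * grad                  # step against the gradient
    return w

# Synthetic, noiseless data (an assumption), so the minimum sits at true_w.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
w_hat = gradient_descent(X, y)
```

Note that every single update touches all 200 rows of X. That is cheap here, but it is exactly the cost that becomes a problem with millions of examples.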
Mini-batch/Stochastic Gradient Descent
In the full version of Gradient Descent, a single parameter update requires calculating the gradients for all training examples, which could number in the millions. Mini-batch Gradient Descent uses a much smaller batch of examples, e.g. 64, for each parameter update.
The reason this works is that most training data are largely correlated in the high-dimensional space. Thus a smaller batch often yields a gradient that is generally in the same direction as one given by the full batch.
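One way to check this claim on a toy problem is to compare the direction of a mini-batch gradient with that of the full-batch gradient using cosine similarity. The sketch below assumes a linear model with an MSE loss and synthetic Gaussian data; the helper name mse_gradient and all sizes are illustrative choices.

```python
import numpy as np

def mse_gradient(X, y, w):
    """Gradient of the mean-squared-error loss, averaged over the given rows."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
y = X @ true_w + 0.1 * rng.normal(size=100_000)

w = np.zeros(5)  # evaluate both gradients at the same parameter values
full_grad = mse_gradient(X, y, w)

# A random batch of 64 examples out of 100,000.
batch_idx = rng.choice(len(y), size=64, replace=False)
batch_grad = mse_gradient(X[batch_idx], y[batch_idx], w)

# Cosine similarity: 1.0 means the two gradients point the same way.
cosine = full_grad @ batch_grad / (
    np.linalg.norm(full_grad) * np.linalg.norm(batch_grad)
)
```

In a setup like this the similarity is typically close to 1 while the parameters are far from the minimum. Near the minimum the full gradient shrinks and the batch noise matters more, so the agreement should be expected to weaken.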
Each update in Stochastic Gradient Descent is much cheaper than in GD, so it allows for many more iterations in the same amount of time. A higher number of iterations on smaller batches often yields a better result than a small number of iterations on the complete dataset. Because of the smaller batch, SGD also introduces noise into its gradient calculation, which often helps generalization. Noise enables SGD to bounce around among different valleys and possibly into a deeper local minimum than the one GD brings the model to. From an application point of view, SGD tends to work well even if the cost function and/or training data changes, whereas GD is more likely to fall into an undesired local minimum.
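A mini-batch version of the update can be sketched as follows, again assuming a linear model with an MSE loss. The batch size of 64 follows the example in the text; the step size, epoch count, reshuffling scheme, and synthetic data are assumptions made for illustration.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=64, step_size=0.05, n_epochs=20, seed=0):
    """Mini-batch SGD for linear regression with an MSE loss."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        order = rng.permutation(n_examples)  # reshuffle the data every epoch
        for start in range(0, n_examples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # noisy gradient estimate
            w -= step_size * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=10_000)
w_hat = minibatch_sgd(X, y)
```

Each pass over the 10,000 examples now produces 157 parameter updates instead of one. Because the loss in this toy problem is convex, it does not show the "bouncing between valleys" effect; it only illustrates the mechanics and the cost per update.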
Maximum Information Content
A model often learns fastest from the most unexpected data. If we keep feeding the model unfamiliar training data, it takes big, confident steps and arrives at the optimum faster.
One way to do this is to place images from different classes in successive batches, so that the model learns from completely different domains at every update. Another way is to increase the frequency of unusual data so that the model is exposed to unexpected examples more often. What counts as unusual can be defined by how large the discrepancy is between prediction and ground truth. Note, however, that it is a bad idea to increase the frequency of outliers, because they tend to bias the model toward learning idiosyncrasies rather than the true pattern.
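The second strategy can be sketched as sampling examples in proportion to their current loss. The text does not say how to keep outliers from being over-sampled, so the sketch below makes an assumption: losses above a chosen percentile are capped, which means an extreme outlier is drawn no more often than an ordinary hard example. The function names, the clip_percentile parameter, and the sample losses are all hypothetical.

```python
import numpy as np

def sampling_probabilities(per_example_loss, clip_percentile=95.0):
    """Turn per-example losses into sampling probabilities.

    Losses above the clip percentile are capped (an assumed heuristic),
    so likely outliers are not favored over merely hard examples.
    """
    loss = np.asarray(per_example_loss, dtype=float)
    cap = np.percentile(loss, clip_percentile)
    weights = np.minimum(loss, cap) + 1e-8  # epsilon keeps every example reachable
    return weights / weights.sum()

def sample_batch(per_example_loss, batch_size, rng):
    """Draw distinct example indices, favoring high-loss examples."""
    p = sampling_probabilities(per_example_loss)
    return rng.choice(len(p), size=batch_size, replace=False, p=p)

# Made-up losses: index 4 is a hard example, index 7 a probable outlier.
losses = np.array([0.1, 0.1, 0.2, 0.1, 2.0, 0.1, 0.1, 50.0, 0.3, 0.1])
probs = sampling_probabilities(losses, clip_percentile=80.0)
rng = np.random.default_rng(0)
batch = sample_batch(losses, batch_size=4, rng=rng)
```

With the cap in place, the example with loss 50.0 receives the same sampling probability as the one with loss 2.0, while both are still drawn more often than the easy examples. In practice the losses would need to be refreshed as the model trains, since what is unexpected changes over time.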