Optimization is the process of minimizing the cost function for a given set of data. Yet for a model to actually perform well in the field, that is, on data outside the training set, we have to use optimization strategies in conjunction with Regularization, which increases the model’s ability to generalize.



Gradient Descent

Gradient Descent is like a blindfolded hiker trying to reach the bottom of a mountain by using his feet to feel the slope of the ground. He then takes a step of a certain size in the direction along which the slope descends. More formally, Gradient Descent updates model parameters by subtracting the parameter gradients times a step size from the current parameters. The gradient is taken with respect to the Loss Function, and is the average of the gradients over all examples in the training set.
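The update rule can be sketched in a few lines of NumPy; the quadratic loss below is a made-up toy example, not a real model:

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    """Repeatedly subtract gradient times step size from the parameters."""
    w = w0.astype(float)
    for _ in range(steps):
        w -= lr * grad_fn(w)  # w <- w - lr * dL/dw
    return w

# Toy loss L(w) = ||w - 3||^2, whose gradient is 2 * (w - 3);
# starting from zero, w converges to the minimum at 3.
w_star = gradient_descent(lambda w: 2 * (w - 3.0), np.zeros(2))
```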

Mini-batch/Stochastic Gradient Descent

In the full version of Gradient Descent, a single round of parameter updates requires calculating the gradients of all training examples, which could number in the millions. Mini-batch Gradient Descent instead uses a much smaller batch of examples, e.g. 64, to update the parameters.

The reason this works is that most training data are largely correlated in the high-dimensional space. Thus a smaller batch often yields a gradient pointing in roughly the same direction as the one computed from the full dataset.

Stochastic Gradient Descent is much faster than GD, so it allows for more iterations. A higher number of iterations on smaller batches almost always yields a better result than a small number of iterations on the complete dataset. SGD also introduces noise into its gradient estimate because of the smaller batch, which often helps generalization: the noise enables SGD to bounce among different valleys and possibly land in a deeper local minimum than GD would reach. From an application point of view, SGD keeps working even if the cost function and/or training data change over time, whereas GD will likely get stuck in an undesired local minimum.
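A single mini-batch update can be sketched as follows; the linear least-squares gradient is purely an illustrative stand-in for a real model’s gradient:

```python
import numpy as np

def sgd_step(w, X, y, lr=0.01, batch_size=64, rng=None):
    """One parameter update computed from a small random batch.

    The gradient here is for linear least squares, used only for
    illustration; any model's batch gradient plugs in the same way.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient on the batch only
    return w - lr * grad
```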



Maximum Information Content

A model often learns fastest from the most unexpected data. If we keep feeding the model unfamiliar training data, it can take confident, large steps and arrive at the optimum faster.

One way is to place images from different classes in successive batches. This way the model learns from completely different domains during every update. Another way is to increase the occurrence frequency of unusual data so that the model gets exposed to unexpected data more often. How unusual an example is can be measured by the discrepancy between prediction and ground truth. Note, however, that it is a bad idea to increase the frequency of outliers, because they tend to bias the model toward learning idiosyncrasies rather than true patterns.
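The second idea can be sketched with a hypothetical `error_weighted_batch` helper that samples each example with probability proportional to its most recent loss, so examples the model finds unexpected show up more often:

```python
import numpy as np

def error_weighted_batch(losses, batch_size, rng):
    """Sample a batch, favoring examples the model currently gets wrong.

    `losses` holds each example's most recent loss; sampling probability
    is proportional to it, so 'unusual' examples appear more often.
    """
    p = losses / losses.sum()
    return rng.choice(len(losses), size=batch_size, replace=False, p=p)
```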







Convex Optimization by Stanford

Efficient BackProp by Yann LeCun



Regularization is calming down your model so that it doesn’t start modeling noise. Any data we are interested in can be broken down into pattern plus noise. Regularization enables our model to capture the pattern without too much of the noise, at the cost of model flexibility. One way to do so is to keep the model weights from growing too large, so that the model’s predictions stay smooth and even.


The ultimate goal of regularization is to improve generalization. In L1 or L2 regularization, for example, the regularization term prevents model parameters from getting too big. The weights are encouraged to spread out evenly among all parameters, which prevents the model from relying too heavily on any particular parameter.

In practice, when our data are not complex enough to prevent overfitting, it might be tempting to choose a smaller network with less capacity. However, it is usually better to use a larger network with strong Regularization. The reason has to do with optimization. A simple model with fewer parameters usually converges to a local minimum faster, but that minimum is often a bad one: the loss at such minima has large variance, so we have a high chance of landing in one with huge loss. A more expressive model with more parameters has many more local minima that take longer to reach, but the variance of their loss is small, and with proper regularization we usually land in a good one. In a sentence, Regularization is the preferred way to control overfitting in a Neural Network.


Regularization adds a penalty term to the final Loss Function. The penalty term is usually a function of all parameters that accounts for their size, scaled by a hyper-parameter $\lambda$ that controls the degree of model flexibility. The value of $\lambda$ is often chosen by cross validation.
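As a minimal sketch, the total loss with an L2 penalty might look like this (the `regularized_loss` name and the $\lambda$ value are illustrative):

```python
import numpy as np

def regularized_loss(data_loss, W, lam=1e-3):
    """Total loss = data loss + lambda * sum of squared weights (L2)."""
    return data_loss + lam * np.sum(W ** 2)
```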

Note that to use Regularization, we usually need to normalize the data.


$R(W) = \lambda \sum_k \sum_l W_{k,l}^2$

L2-norm always keeps all the parameters non-zero.



$R(W) = \lambda \sum_k \sum_l |W_{k,l}|$

L1-norm tends to shrink less important features’ coefficients to zero, effectively removing those features from the model. This is a way to achieve feature selection.





Loss Function


Loss Function is a measure of how unsatisfied we are with the current model, and is the single driving force behind model optimization. The Loss Function does not depend solely on whether the model makes the right prediction. Instead, it is based on how confident the model is in its result, right or wrong. The loss is large both for a model confident in a wrong result and for one uncertain about a right prediction.


The Loss Function comes right after the Score Function, the function that actually makes the prediction. Based on this prediction, the Loss Function measures our unhappiness with the solution and then drives the model parameters to update themselves. The Loss Function is essentially a guidebook that instructs the model how to evolve. Just like different educational philosophies lead to different personalities in children, depending on which Loss Function is employed during training, model parameters are updated quite differently and the final models may be very different.


Hinge Loss


Hinge Loss wants the score of the correct class to be at least $\delta$ above those of the incorrect classes, where $\delta$ is usually set to one. As long as the minimum difference of $\delta$ is satisfied, Hinge Loss is completely happy: the loss becomes zero and the model parameters stop updating.

Mostly used in SVM.
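A minimal multi-class hinge loss for one example, with $\delta = 1$ by default:

```python
import numpy as np

def hinge_loss(scores, correct_class, delta=1.0):
    """Multi-class hinge loss for one example: sum of margin violations."""
    margins = np.maximum(0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0  # the correct class contributes no loss
    return margins.sum()
```

When the correct class beats every other score by at least `delta`, every margin clamps to zero and the loss vanishes, matching the description above.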




Cross Entropy Loss


Cross Entropy Loss works by assigning a probability of being the correct class to all possible classes based on the outputs of the Score Function, and defining the loss to be the negative log probability of the actual correct class. By minimizing this negative log probability through back-propagated gradient updates, we are essentially performing Maximum Likelihood Estimation. We can take this a step further and think of Regularization as placing a Gaussian prior distribution on the model parameters, so instead of doing MLE, we are performing Maximum a Posteriori (MAP) estimation.

The negative log part is a mathematical consequence of the definition of cross entropy, an idea from information theory. Cross entropy measures how different two probability distributions are. In our case, one distribution is the predicted distribution, where each class gets its assigned non-zero probability from the Score Function. The other is the true distribution, where the true class has a probability of 1 and all other classes have 0. Cross Entropy Loss thus wants the predicted distribution to put all its mass on the correct class. Note that this is equivalent to minimizing the KL divergence between the two distributions.

Note that we conveniently obtain the probability distribution over all classes, which could be useful when we need not just the prediction label but probability as well.
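A minimal softmax cross-entropy for a single example, assuming raw class scores as input:

```python
import numpy as np

def cross_entropy_loss(scores, correct_class):
    """Softmax over scores, then negative log probability of the true class."""
    shifted = scores - scores.max()  # shift for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[correct_class]), probs
```

The returned `probs` is exactly the probability distribution over all classes mentioned above.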


Mostly used in Softmax Classifier.






Cross Entropy Loss is never completely satisfied, because it wants the score of the correct class to be infinity and the scores of all other classes negative infinity. This means the model’s evolution never fully stops.





Linear Classifier


Linear Classifier is simply a use case of template matching. It classifies a given test vector by comparing it against one template vector per class, producing a score for each comparison. The classifier then gives the test vector the label with the highest score. The training process, then, is to find the most representative template for each class.

More specifically, the comparison metric used here is the dot product plus a bias term. As explained in Vector Distance, the dot product is one way to measure the degree of similarity between vectors. The bias term is an additional per-class threshold.

Note that the process sounds suspiciously similar to the Nearest Neighbor Classifier, where a test vector is given the same label as its closest neighbor in the vector space. This is no coincidence: a Linear Classifier is an optimized version of the Nearest Neighbor Classifier. It summarizes all training vectors into one representative template per class, and instead of comparing the test vector to every training vector, it compares it to each class template.



We can also think of a Linear Classifier as drawing linear boundaries in vector space to separate vectors into different groups. If we imagine images living in a 2-dimensional space, then an image vector is a point in that space, and the Linear Classifier is a set of lines trying to group the points into their corresponding classes. Training then rotates and translates these lines to better separate the points in this 2-D space.





Intuitive Example first




Formal Definition of Linear classifier

A Linear Classifier is based on the following linear mapping:

$f(x_i, W, b) = Wx_i + b$


Score Function

This is a Score Function in its simplest form. Notice that each row of W is essentially a template for one class, and we use a dot product plus the corresponding bias term in b to generate each class score.
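A sketch of this score function, with a hypothetical `predict` helper:

```python
import numpy as np

def predict(W, b, x):
    """Score each class with a dot product plus bias; pick the argmax."""
    scores = W @ x + b  # each row of W is one class template
    return int(np.argmax(scores))
```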


Loss Function

To train the Linear Classifier, we use multi-class SVM loss as our Loss Function. The Loss Function measures how unsatisfied we are with the current classification results and the direction in which we want our model to grow. Specifically, multi-class SVM loss wants the correct class score to be at least $\delta$ above any other class score.


For Regularization, the common choice is L2 norm which discourages any large weight by adding a quadratic penalty over all parameters to the final loss.






When to use it





Vector Normalization


Data Normalization is one type of data preprocessing that makes convergence faster. Three steps to normalize data are outlined below.

  1. Center data around zero.
  2. De-correlate data. E.g. we can remove linear correlation with Principal Component Analysis.
  3. Equalize covariance. This makes the variance of every dimension the same, usually 1.


Centering all data around zero makes convergence much faster. If the input vector has all positive elements, then the model parameters can only be updated all together in the positive or negative direction. To move in an arbitrary desired direction, the parameters must zigzag, which is extremely slow.

Data dimensions that are linearly dependent on each other can slow down or even completely stall convergence. Both removing correlation and equalizing the variances of the input dimensions effectively speed up convergence.
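The three steps above can be sketched with NumPy using eigendecomposition-based whitening; the small `eps` guards against division by zero on near-degenerate dimensions:

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Zero-center, de-correlate (rotate onto eigenvectors), equalize variance."""
    Xc = X - X.mean(axis=0)               # 1. center data around zero
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)
    Xrot = Xc @ eigvecs                   # 2. de-correlate
    return Xrot / np.sqrt(eigvals + eps)  # 3. equalize variance to ~1
```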



Efficient BackProp by Yann LeCun

Dimensionality Reduction

Why and What

Think of dimensionality as the number of descriptors we use to fully represent an object of interest. For example, to completely represent the price of an Oculus VR headset on the official website, the primary factors we need to know are its model name, build version, hardware performance, accessories included, and probably a few others. Minor details like color, size, and shape also affect the price of the headset, but to a much smaller extent. It is not hard to imagine that there are probably many more such minor factors than primary ones. All of these descriptors together determine the official online price, each with a different degree of significance. Note that some of these factors are related to each other. For instance, each build version might correspond to exactly one size, resulting in a strict one-to-one relationship between build version and size.

If one is given all the related factors and asked to predict the price, intuitively he will start ruling out noisy or insignificant factors. Information like the Oculus webpage color or the date of the last purchase is most likely not going to contribute to the prediction. By ruling out such information, he can put more focus on the factors that offer deeper insight into the price. As for factors that strongly correlate with each other, he can simply keep one of them and ignore the rest. This process of ruling out factors is, put more formally in mathematics, called Dimensionality Reduction.

For a vector, dimensionality is simply how many cells it has. A high-dimensional vector is one super long vector. If a vector lives in a high-dimensional space, it becomes extremely difficult for the machine to focus on the meaningful parts, as claimed by the notorious Curse of Dimensionality. Fortunately, it is often the case that most of the information is contained in a much smaller number of dimensions than the total dimensionality of the vector. Thus various Dimensionality Reduction techniques are employed to help us, and the machine, focus on a compact number of useful aspects.


What is dimensionality

why we need to reduce

intuitive example for dimensionality reduction


Principal Component Analysis

PCA is a lossy data compression technique. It finds a k-dimensional subspace of the original data’s full m-dimensional space, where k << m, and projects the data onto this subspace. The subspace is chosen to preserve the maximum amount of variation in the data among all k-dimensional subspaces. The projected data are much less noisy to process, while retaining most of the information in the original dataset.


Be careful not to choose too small a value for k, the dimension of the subspace we project our original data onto. An aggressively small k might drop information crucial for classification, because information essential for classification might not lie along the dimensions of largest data variation.

Imagine two classes separated vertically but spread widely horizontally: the dimension with the largest variation is horizontal, yet projecting the data onto the horizontal subspace loses the vertical information, making it impossible to separate the two classes effectively.
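A minimal PCA projection via SVD, assuming the rows of `X` are data points:

```python
import numpy as np

def pca_project(X, k):
    """Project centered data onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt (right singular vectors) are the principal directions,
    # ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # n x k projection
```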



PCA by Carlos Tomasi



Nearest Neighbor Classifier


Nearest Neighbor Classifier, as the name suggests, classifies a given vector by grabbing the closest vector to it from the ground truth and giving it the same label. As a Chinese proverb goes, birds of a feather flock together (人以类聚物以群分). This classifier presumes a strong relationship between spatial closeness and semantic similarity. Given a test vector, the classifier traverses all the training vectors and gives the test vector the same label as the closest training vector, based on a certain definition of Distance.
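A minimal sketch using L2 distance:

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Label x with the label of its closest training vector (L2 distance)."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    return y_train[np.argmin(d)]
```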

Intuitive explanation.

Relevant concepts



$ A detailed example with code/formula $

In practice, however, instead of finding the right label for a given test image, the Nearest Neighbor Classifier tends to associate two images if they share a similar-looking background, which usually occupies much more space than the object of interest in an image.


What’s the catch/ How does each component work individually and/or with each other



Nearest Neighbor Classifier’s performance is among the worst of all existing classifiers, not much better than guessing (NNC ~35% vs. guessing 10%). There are two primary reasons for this. First, the presumption that vectors with similar pixel values share similar semantic meaning is unsound, if not arbitrary.

{Image pairs with same distance but different meaning}

Second, even if the above presumption held, a poorly chosen definition of distance can also lead to bad performance.

Despite its bad performance, the Nearest Neighbor Classifier is usually the first model studied in CV research, simply because of its easy-to-understand and elegant (which is all that matters), although wrong, assumption. We eventually come to realize that spatial closeness does not directly establish semantic similarity. To find objects with similar meaning, a classifier needs to go one step further and build its own semantic understanding from the training data. This is where Neural Networks come into play.


Examples when NNC works well

Scenarios where it performs poorly


One popular extension to the Nearest Neighbor Classifier is the k Nearest Neighbor Classifier. It is basically the Nearest Neighbor Classifier plus voting. Just like a voting regime in a Distributed System, where we count on majority rule to elect a leader, in the k Nearest Neighbor Classifier we find the k closest images to the test image and take the majority label as the final prediction. This often gives much more reliable results than a single nearest neighbor and leads to better generalization.
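The voting step can be sketched as follows, using squared L2 distance (which ranks neighbors identically to true L2):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Take the majority label among the k closest training vectors."""
    d = ((X_train - x) ** 2).sum(axis=1)  # squared L2 is enough for ranking
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```

Note how a single odd neighbor can be outvoted: even if the single closest point belongs to another class, the majority among k neighbors decides the label.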



Post Template


Simple Definition for the concept. Extend slightly if necessary

Raise Example for Illustration

Delve into details of the concept, including different branches, sub-modules, various related theories


Why do do need this module and Possible Alternatives

Extension and Further Study based on this concept

Applications of the Concept

Business Models?


General Workflow for Similar Problem Domain

Possible Solution to a problem

Where to go from the base Solution



Any idea worth delving into.

Connections drawn among different concepts

Further materials to study