# What

Regularization is calming down your model so that it doesn’t start modeling noise. Any data we are interested in can be broken down into pattern plus noise. Regularization enables the model to capture the pattern without too much noise, at the cost of model flexibility. One way to do so is to keep the model weights from growing too large, so that the model’s predictions stay smooth and even.

# Why

The ultimate goal of regularization is to improve generalization. In L1 or L2 regularization, for example, the regularization term prevents the model parameters from getting too big. The weights are encouraged to spread out evenly across all parameters, which prevents the model from putting too much focus on any particular one.

In practice, when we are dealing with data that are not complex enough to prevent overfitting, it might be tempting to choose a smaller network with less capacity. However, it is usually better to use a larger network with strong Regularization. The reason has to do with optimization. A simple model with fewer parameters usually converges faster to a local minimum, but that minimum is often a bad one: the loss at such minima has large variance, so there is a high chance of falling into one with huge loss. A more expressive model with more parameters has many more local minima that take longer to converge to, but the variance of their loss is small, and with proper regularization we usually fall into a good one. In a sentence, Regularization is the preferred way to control overfitting in a Neural Network.

# How

Regularization adds a penalty term to the final Loss Function. The penalty term usually has one hyper-parameter $\lambda$ that controls the degree of model flexibility, and is usually a function of all parameters that accounts for their size. The value of $\lambda$ is often chosen by cross validation.
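As a minimal sketch of this structure (the function names and numbers are illustrative, not from any particular library), here is a total loss built from a data loss plus an L2-style penalty weighted by $\lambda$:

```python
import numpy as np

def l2_penalty(W):
    # Sum of squared weights; grows quickly as any single weight gets large.
    return np.sum(W ** 2)

def regularized_loss(data_loss, W, lam):
    # Total loss = data loss + lambda * penalty.
    # lam is the hyper-parameter controlling model flexibility,
    # typically chosen by cross validation.
    return data_loss + lam * l2_penalty(W)

W = np.array([[0.5, -1.0], [2.0, 0.1]])
loss = regularized_loss(1.25, W, lam=0.1)
```

A larger `lam` trades model flexibility for smoother, smaller weights.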

Note that to use Regularization, we usually need to normalize the data.

# L2-Norm

$R(W) = \lambda \sum_i w_i^2$

L2-norm shrinks every parameter toward zero but always keeps all the parameters non-zero.

# L1-Norm

$R(W) = \lambda \sum_i |w_i|$

L1-norm tends to shrink less important features’ coefficients to zero, effectively removing those features from the model. This is a way to achieve feature selection.
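The contrast can be sketched with the closed-form shrinkage each penalty induces on a single weight (the weights and `lam` are made-up values): the L2 update rescales every weight toward zero but never reaches it, while the L1 update (soft-thresholding) sets small weights exactly to zero.

```python
import numpy as np

def l2_shrink(w, lam):
    # Closed-form shrinkage for an L2 penalty: rescales w toward zero,
    # but the result is never exactly zero.
    return w / (1.0 + 2.0 * lam)

def l1_soft_threshold(w, lam):
    # Soft-thresholding, the proximal update for an L1 penalty:
    # weights with magnitude below lam become exactly zero.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, 0.05, -0.2, -4.0])
w_l2 = l2_shrink(w, lam=0.1)
w_l1 = l1_soft_threshold(w, lam=0.1)
```

After the L1 update, the small second weight is dropped entirely, which is the feature-selection effect described above.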

# Resource

http://cs231n.github.io/linear-classify/

https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a

# What

Loss Function is a measure of how unsatisfied we are with the current model, and is the single driving force for model optimization. The Loss Function does not depend exclusively on whether the model makes the right prediction. Instead, it is based on how confident the model is in its result, right or wrong. The loss is large for a model that is confident in a wrong result, or one that is uncertain about a right prediction.

# Why

Loss Function comes right after Score Function, the function that actually makes the prediction. Based on this prediction, Loss Function measures our unhappiness with the solution and then drives the model parameters to update themselves. Loss Function is essentially a guidebook that instructs the model how to evolve. Just like different educational philosophies lead to different personalities in children, depending on which Loss Function is employed during training, model parameters are updated quite differently and the final models may be very different.

# Hinge Loss

## What

Hinge Loss wants the score of the correct class to be at least $\delta$ above the scores of all incorrect classes, where $\delta$ is usually set to one. As long as this minimum margin of $\delta$ is satisfied, Hinge Loss is completely happy and the loss becomes zero, which means the model parameters stop updating.

Mostly used in SVMs.
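A minimal sketch of multi-class hinge loss for a single example (the scores and correct-class index are made up; $\delta = 1$):

```python
import numpy as np

def hinge_loss(scores, correct, delta=1.0):
    # Sum the margins by which each wrong class comes within delta
    # of the correct class's score.
    margins = np.maximum(0.0, scores - scores[correct] + delta)
    margins[correct] = 0.0  # the correct class never contributes
    return np.sum(margins)

scores = np.array([13.0, -7.0, 11.0])  # suppose class 0 is correct
loss = hinge_loss(scores, correct=0)
```

Here class 0 beats every other class by more than $\delta$, so the loss is zero and the parameters would stop updating.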


# Cross Entropy Loss

## What

Cross Entropy Loss works by assigning each possible class a probability of being the correct class, based on the outputs of the Score Function, and defining the loss as the negative log probability of the actual correct class. By minimizing this negative log probability and backpropagating to update the model, we are essentially performing Maximum Likelihood Estimation (MLE). We could take a step further and think of Regularization as placing a Gaussian prior distribution over the model parameters, so that instead of MLE we are performing Maximum a Posteriori (MAP) estimation.

The negative log part is a mathematical result of the definition of cross entropy, an idea from information theory. Cross entropy measures how different two probability distributions are. In our case, one distribution is the predicted distribution, where each class has a non-zero probability assigned from the Score Function. The other is the true distribution, where the true class has a probability of 1 and all other classes have 0. Cross Entropy Loss thus wants the predicted distribution to put all of its mass on the correct class. Note that this is equivalent to minimizing the KL divergence between the two distributions.

Note that we conveniently obtain a probability distribution over all classes, which is useful when we need not just the predicted label but its probability as well.

Mostly used in Softmax Classifier.
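A sketch of the computation (the scores are arbitrary made-up values): softmax turns raw scores into a probability distribution, and the loss is the negative log probability of the correct class.

```python
import numpy as np

def softmax(scores):
    # Shift by the max score for numerical stability before exponentiating.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

def cross_entropy_loss(scores, correct):
    # Negative log probability assigned to the true class.
    probs = softmax(scores)
    return -np.log(probs[correct])

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)  # the full distribution over classes, free of charge
loss = cross_entropy_loss(scores, correct=0)
```

The intermediate `probs` is exactly the per-class probability distribution mentioned above.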


## Why

Cross Entropy Loss is never completely satisfied: it wants the score of the correct class to be infinity, and the scores of all the others to be negative infinity. This means the model’s evolution never fully stops.

# Resource

http://cs231n.github.io/linear-classify/

# What

Linear Classifier is simply a use case of template matching. It classifies a given test vector by comparing it to multiple template vectors, one per class, and producing a score for each comparison. The classifier then gives the test vector the label with the highest score. The training process is to find the most representative template for each class. More specifically, the comparison metric used here is the dot product plus a bias term. As explained in Vector Distance, the dot product is one way to measure the degree of similarity between vectors. The bias term is an additional, class-dependent threshold.

Note that the process sounds suspiciously similar to the Nearest Neighbor Classifier, where a test vector is given the same label as its closest neighbor in the vector space. This is actually the case: Linear Classifier is an optimized version of Nearest Neighbor Classifier. It summarizes all training vectors into one representative template per class, and instead of comparing the test vector to every training vector, it compares it to each class template.

We can also think of Linear Classifier as drawing linear boundaries in the vector space to separate vectors into different groups. If we imagine images living in a 2-dimensional space, then an image vector is a point in that space, and the Linear Classifier is a set of lines trying to group the points into their corresponding classes. The training process is then rotating and translating these lines to better group the points in this 2-D space.

# How

## Formal Definition of Linear Classifier

A Linear Classifier is based on the following linear mapping:

$f(x_i, W, b) = Wx_i + b$

## Score Function

This is a Score Function in its simplest form. Notice that each row of $W$ is essentially a template for one class, and we use the dot product plus a bias term from $b$ to generate each class score.
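This mapping can be sketched with made-up shapes (3 classes, 4-dimensional inputs): each row of `W` is one class template, and the scores are dot products plus a per-class bias.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))  # one template (row) per class
b = rng.standard_normal(3)       # one bias threshold per class
x = rng.standard_normal(4)       # a test vector

scores = W @ x + b               # one score per class: dot product + bias
predicted_class = int(np.argmax(scores))  # label with the highest score
```

Training adjusts `W` and `b`; classification is just this one matrix-vector product.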

## Loss Function

To train the Linear Classifier, we use the multi-class SVM loss as our Loss Function. The Loss Function measures how unsatisfied we are with the current classification results and in what direction we want the model to improve. Specifically, the multi-class SVM loss wants the correct class score to be at least $\delta$ above every other class score.

For Regularization, the common choice is the L2 norm, which discourages large weights by adding a quadratic penalty over all parameters to the final loss.


# Resource

http://cs231n.github.io/linear-classify/

## Vector Distance

### L1

### L2

### Dot Product
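These three measures can be sketched directly (the vectors are made up): L1 and L2 are distances, where smaller means closer, while the dot product is a similarity score, where larger means more alike.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])

l1 = np.sum(np.abs(a - b))          # L1 (Manhattan): sum of absolute differences
l2 = np.sqrt(np.sum((a - b) ** 2))  # L2 (Euclidean): straight-line distance
dot = np.dot(a, b)                  # dot product: similarity of directions/magnitudes
```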

### Resource

http://cs231n.github.io/linear-classify/

# What

Data Normalization is a type of data preprocessing that makes convergence faster. The three steps below outline how to normalize data.

1. Center the data around zero.
2. De-correlate the data, e.g. by removing linear correlation with Principal Component Analysis.
3. Equalize the variance. This makes the variance along every dimension the same, typically 1.
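The three steps can be sketched with numpy (the data matrix `X` is made up; rows are examples, columns are dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up correlated data: 100 examples in 3 dimensions.
X = rng.standard_normal((100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 0.2]])

# 1. Center the data around zero.
X_centered = X - np.mean(X, axis=0)

# 2. De-correlate: rotate onto the principal axes (PCA).
cov = X_centered.T @ X_centered / X_centered.shape[0]
U, S, _ = np.linalg.svd(cov)
X_decorrelated = X_centered @ U

# 3. Equalize variance (whitening): scale every axis to unit variance.
X_white = X_decorrelated / np.sqrt(S + 1e-8)
```

After step 3 the covariance of `X_white` is (approximately) the identity matrix.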

# Why

Centering all data around zero makes convergence much faster. If the input vector has all positive elements, then the parameters of the model can only be updated together in the positive or negative direction. To update toward the desired direction, the parameters have to move in a zigzag fashion, which is extremely slow.

Data that are linearly dependent on each other can slow down or even completely kill convergence. Both removing correlation and equalizing the variance of the input dimensions effectively speed up convergence.

# Resource

Efficient BackProp by Yann LeCun

# Why and What

Think of dimensionality as the number of descriptors we use to fully represent an object of interest. For example, to completely represent the price of an Oculus VR headset on the official website, the primary factors we need to know are its model name, build version, hardware performance, accessories included, and probably a few others. There are minor details like color, size, and shape that also affect the price of the headset, but to a much smaller extent. It is not hard to imagine that there are probably many more such minor factors than the primary ones listed above. All of these descriptors together determine the official online price, each with a different degree of significance. Note that some of these factors are related to each other: for instance, each build version might correspond to exactly one size, which would result in a strict one-to-one relationship between build version and size.

If one is given all the related factors and asked to predict the price, intuitively one will start ruling out noisy or insignificant factors. Information like the Oculus webpage color or the date of last purchase is most likely not going to contribute to the prediction. By ruling out this information, one can put more focus on the factors that offer deeper insight into the price. For factors that strongly correlate with each other, one can simply choose one of them and ignore the rest. This process of ruling out factors, put more formally in mathematics, is called Dimensionality Reduction.

For a vector, dimensionality is simply how many cells it has. A high-dimensional vector is one super long vector. If a vector lives in a high-dimensional space, it becomes extremely difficult for the machine to focus on the meaningful parts, as described by the notorious Curse of Dimensionality. Fortunately, it is often the case that most of the information is contained in a much smaller number of dimensions than the total dimensionality of the vector. Thus various Dimensionality Reduction techniques are employed to help us, and the machine, focus on a compact number of useful aspects.


# Principal Component Analysis

PCA is a lossy data compression technique. It finds a k-dimensional subspace of the original data’s full m-dimensional space, where k << m, and projects the data onto this subspace. The subspace is chosen to preserve the maximum amount of variation present in the data among all k-dimensional subspaces. The projected data are much less noisy to process, while retaining most of the information in the original dataset.

Be careful not to choose too small a value for k, the dimension of the subspace we project the original data onto. An aggressively small k might drop information crucial for classification, because information essential for classification does not necessarily lie along the dimensions of largest variation. Imagine two classes whose data vary most along the horizontal axis but are separated only along the vertical axis: if we project the data onto the horizontal subspace, we lose the vertical information and can no longer separate the two classes.
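The projection itself can be sketched in a few lines (the data and the choice of k are made up): center the data, take the top-k right singular vectors, and project.

```python
import numpy as np

def pca_project(X, k):
    # Project n x m data onto the k-dimensional subspace of maximum variance.
    X_centered = X - np.mean(X, axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                # the k directions of largest variation
    return X_centered @ components.T  # n x k compressed representation

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 10))  # made-up data: 50 examples, 10 dimensions
X_reduced = pca_project(X, k=2)    # each example is now just 2 numbers
```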

### Resource

PCA by Carlos Tomasi

# What

Nearest Neighbor Classifier, as the name suggests, classifies a given vector by grabbing the closest vector to it from the ground truth and copying that vector’s label. As a Chinese proverb goes, like attracts like (人以类聚物以群分). This classifier presumes a strong relationship between spatial closeness and semantic similarity. Given a test vector, the classifier traverses all the training vectors and assigns the test vector the label of the closest training vector, based on a chosen definition of Distance.


# How

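A minimal sketch of the classifier (the tiny training set and its labels are made up; L2 distance is used here, though L1 works too):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_test):
    # Label the test vector with the label of its closest training vector.
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))  # L2 distance
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [10.0, 10.0], [0.5, 0.2]])
y_train = np.array(["cat", "dog", "cat"])
label = nearest_neighbor_predict(X_train, y_train, np.array([0.4, 0.3]))
```

There is no training step at all: the "model" is the entire training set, and every prediction is a full scan over it.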

In actual applications, however, instead of always finding the right label for a given test image, the Nearest Neighbor Classifier tends to associate two images if they share a similar-looking background, which usually takes up much more space than the object of interest in an image.


# Why

Nearest Neighbor Classifier’s performance is among the worst of all existing classifiers: it is not much better than guessing (around 35% accuracy vs. 10% for random guessing). There are two primary reasons for its bad performance. First, the presumption that vectors with similar pixel values share similar semantic meaning is not sound, if not arbitrary.

{Image pairs with same distance but different meaning}

Second, even if the above presumption holds, the definition of distance, if not chosen wisely, can also lead to bad performance.

Despite Nearest Neighbor Classifier’s bad performance, it is usually the first model studied in CV research, simply because of its easy-to-understand and elegant (which is all that matters), though flawed, assumption. We eventually come to realize that spatial closeness does not directly establish semantic similarity. To find objects with similar meaning, a classifier needs to take one step further and build its own semantic understanding from the training data. This is where Neural Networks come into play.


# Extension

One popular extension to the Nearest Neighbor Classifier is the k Nearest Neighbor Classifier. It is basically the Nearest Neighbor Classifier plus voting. Just like a voting scheme in a Distributed System, where we count on majority rule to elect a leader, in the k Nearest Neighbor Classifier we find the k closest images to the test image and take the majority label as the final label. This classifier often gives much more reliable results than a single Nearest Neighbor Classifier, and leads to better generalization.
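The voting step can be sketched as follows (the training data, labels, and choice of k are made up; L1 distance is used here):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Find the k closest training vectors and take the majority label.
    distances = np.sum(np.abs(X_train - x_test), axis=1)  # L1 distance
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
y_train = ["cat", "cat", "cat", "dog"]
label = knn_predict(X_train, y_train, np.array([0.05, 0.05]), k=3)
```

With k = 1 this reduces to the plain Nearest Neighbor Classifier; larger k smooths out the influence of a single noisy neighbor.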

# What

Simple Definition for the concept. Extend slightly if necessary

Raise Example for Illustration

Delve into details of the concept, including different branches, sub-modules, various related theories

# Why

Why we need this module and Possible Alternatives

Extension and Further Study based on this concept

Applications of the Concept

# How

General Workflow for Similar Problem Domain

Possible Solution to a problem

Where to go from the base Solution

# Thoughts

Any idea worth delving into.

Connections drawn among different concepts

Further materials to study

# What

Image Classification is categorizing pictures into different classes. The most common example is classifying an image as Cat or Dog. Notice, however, that the computer is completely ignorant of the definition of Cat or Dog in this process; it would make absolutely no difference, from the perspective of the computer, to label the Cat as Cute Beings and the Dog as Loyal Creature instead. The training, validation, and testing accuracy would stay exactly the same regardless of what labels we put on each category. To capture this abstraction, we usually represent each label as a distinct integer value. Each image, along the same line of thought, is nothing more than a 3-D array (height * width * channels) of integer values ranging from 0 to 255.
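These representations can be sketched directly (the shape and the label mapping are made up for illustration):

```python
import numpy as np

# Labels are just distinct integers; the names carry no meaning to the computer.
label_to_int = {"Cute Beings": 0, "Loyal Creature": 1}

# An image is a height x width x channels array of integers in [0, 255].
image = np.zeros((32, 32, 3), dtype=np.uint8)
image[0, 0] = [255, 128, 0]  # one pixel's red, green, and blue values
```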

# Why

Image Classification in its own form is not the most powerful CV algorithm ever designed. It is a supervised learning method that works well only with a limited number of discrete outputs. However, Image Classification is the foundation of numerous other tasks, including Object Detection and Segmentation. Being able to classify an image accurately not only boosts the accuracy of all tasks built on top of it, but also helps us understand the scope and mechanism of machine vision a whole lot more. A lot of CV breakthroughs happen exactly when researchers spend a long time delving into basic tasks like Edge Detection and Image Classification.

Although it sounds simple and naive, Image Classification has various applications even in its most basic form. Just to name some: emotion analysis can capture emotions through facial features; gesture classification relies on hand cues to capture the poser’s intention; Amazon Go utilizes object classification for instant checkout; etc.

# How

As a starter task in CV, Image Classification is already a daunting problem to attempt. To give a sense of its difficulty, consider the following scenario.

You are in a Guess It game. Behind the curtain there is an animal that could be a cat, a dog, a monkey, or a guinea pig. The host gives you information on its price, weight, and adorability (which is subjective and pretty much random in this context). With this information, you are given two attempts to guess the animal.

This sounds hard, but not impossible, right? Except in the context of Image Classification, the host gives you a couple hundred thousand more pieces of information, the number of possible animals behind the curtain goes from four to 1,000, the distinction between possible animals goes from cat vs. dog to Cairn Terrier vs. Norwich Terrier, and there is only one chance for you to guess right.

Particular challenges in this context include:

- Occlusion
- Different Illumination Conditions
- Scale Variation
- Deformation
- Viewpoint Variation
- Intra-class Variation

It is also a huge challenge for the computer to figure out which information is relevant to the labels and which is complete noise.

Image Classification is categorizing pictures. If someone shows us a picture of a dog, we can immediately think of the word ‘dog’, and dog experts can relate it to ‘Boston Terrier’.

Memorizing and making use of these names is not hard for humans, but it is way too complicated for a computer. Instead, we ask the computer to perform an easier task: categorize the picture into Category A or Category B. In common examples, A would be dog and B would be cat.

Image Classification is the process where the computer tries to associate the image with the correct category.

Contrary to our first impression, the computer is completely ignorant of the label ‘dog’ or ‘robot’, or of any idea of ‘category’. In the eyes of the computer, the target image is a 3-D array of integers, with dimensions 300x169x3 and integer values ranging from 0 to 255. The computer then needs to choose one option out of a finite number of choices to associate this array with. Upon making its choice, the computer signals us by passing back a numerical indicator.