Random Forest

What

Random Forest is Decision Tree plus Bagging, with a slight modification. A single Decision Tree uses all given features and is responsible for predicting the output on its own. Bagging (Bootstrap Aggregating) instead trains multiple Trees, each on a random subset of the data, and lets them vote on the final answer. Random Forest takes this a step further: instead of including all the given features in every Decision Tree, it randomly selects M features for each Tree. Each Tree is then trained on a randomly chosen subset of the data, and the final decision is still the aggregated vote of all Trees. Both Bagging and random feature selection introduce stochasticity and thus improve generalization.

The goal of Bagging in general is to reduce variance in the output. In comparison, Boosting is another ensemble technique, one that aims to reduce bias. Empirically, Boosting tends to give higher test accuracy.

How
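A minimal sketch of the mechanism described above, assuming scikit-learn is available. The key knobs map directly onto the description: `n_estimators` is the number of voting Trees, `max_features` is the M randomly selected features, and `bootstrap` turns on Bagging over the data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of Decision Trees that vote
    max_features="sqrt",  # M = sqrt(total features) considered per split
    bootstrap=True,       # each Tree sees a random subset sampled with replacement
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```

The dataset and hyperparameter values here are placeholders; the point is that Bagging and feature subsampling are both single-argument switches.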

 

Advantages

Random Forest deals with high-dimensional data very well.

Generalizes well to unseen data, thanks to the random feature selection mechanism.

Caveats

 

Boosting Decision Tree

What

Boosting Decision Tree is a type of Model Ensemble in which we iteratively learn better Decision Tree classifiers based on the errors of the prior Tree. All data points are first given the same weight. During training, wrongly classified points gain bigger weights, which makes the next Decision Tree focus specifically on them. Each Decision Tree also has a score indicating its classification accuracy; these scores are then used to weight each Tree in the final aggregated model.

The main goal of Boosting Decision Tree is to reduce bias, whereas the goal of Bagging is to reduce variance.

Algorithm

Stanford on Boost
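The weighting scheme described above can be sketched as a bare-bones AdaBoost loop over decision stumps. This is an illustrative sketch, not a production implementation: the dataset, number of rounds `T`, and the `1e-10` guard against division by zero are all arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1  # AdaBoost convention: labels in {-1, +1}

n, T = len(X), 10
w = np.full(n, 1 / n)  # all data start with equal weight
stumps, alphas = [], []

for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)       # next Tree focuses on heavy points
    pred = stump.predict(X)
    err = w[pred != y].sum()               # weighted error of this Tree
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # this Tree's score
    w *= np.exp(-alpha * y * pred)         # misclassified points gain weight
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final model: each Tree's vote is weighted by its score alpha
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
accuracy = (np.sign(scores) == y).mean()
```

Note how both weightings from the text appear: `w` re-weights the data between rounds, while `alpha` weights each Tree in the final aggregate.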

 

Advantages

Allows the use of different Loss Functions.

Caveats

Prone to overfitting. Stopping early acts as regularization.

Data Vectorization

What

Data Vectorization is the process of translating real-life information into a mathematical representation. This translation makes it possible for machines to learn.

Vector Space Model

Bag of Words

TF-IDF

N-Gram

Kernel Hashing
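Two of the listed schemes, Bag of Words and TF-IDF, can be sketched with the standard library alone. The toy documents below are placeholders; the point is the shape of each representation: Bag of Words counts words per document, and TF-IDF downweights words that appear in every document.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat on the mat"]
vocab = sorted({w for d in docs for w in d.split()})

# Bag of Words: each document becomes a count vector over the vocabulary
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

# TF-IDF: term frequency times inverse document frequency.
# A word appearing in every document gets idf = log(1) = 0.
n_docs = len(docs)
df = [sum(1 for d in docs if w in d.split()) for w in vocab]
idf = [math.log(n_docs / df_w) for df_w in df]
tfidf = [[tf * i for tf, i in zip(row, idf)] for row in bow]
```

Here "the" occurs in both documents, so its TF-IDF weight collapses to zero even though its raw count is the highest.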

Data Attribute Types

Nominal/Unordered Qualitative

E.g. (blue, yellow, green), (republic, democratic)

Ordinal/Ordered Qualitative

E.g. (short, medium, tall), (average, good, excellent)

Interval/Discrete Quantitative

E.g. year

Ratio/Continuous Quantitative

E.g. speed
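The qualitative types above call for different encodings before a model can consume them; a minimal sketch, using the examples from this list: nominal values get one-hot vectors (no order implied), while ordinal values get integers that preserve their order.

```python
colors = ["blue", "yellow", "green"]   # nominal: no order among categories
heights = ["short", "medium", "tall"]  # ordinal: order matters

def one_hot(value, categories):
    """Nominal encoding: a vector with 1 at the category's position."""
    return [1 if c == value else 0 for c in categories]

# Ordinal encoding: map each level to an integer that respects the order
ordinal_map = {h: i for i, h in enumerate(heights)}

print(one_hot("yellow", colors))  # [0, 1, 0]
print(ordinal_map["tall"])        # 2
```

Using integer codes for nominal data would invent a spurious order (e.g. "green > blue"), which is why the two types are encoded differently.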

ReLU Layer

Definition

ReLU Layer is used to threshold a feature map or a general input vector at zero.

Functionality

Because ReLU is often the only non-linear Layer in a network, it prevents the whole network from collapsing into a single matrix operation. For the same reason, it contributes to the network's expressiveness.

Input:

Layer:

Output
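The thresholding itself is a one-liner; a numpy sketch:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
relu = np.maximum(0.0, x)  # elementwise threshold: negatives become 0
```

Every negative entry is clamped to zero while positive entries pass through unchanged, which is exactly the non-linearity discussed above.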

Use Cases

ReLU Layer is mostly used right after a Convolution Layer.

Caveats

Dead ReLU

 

Convolution Layer

Definition

Convolution Layer is a systematic template-matching mechanism with learnable templates. It has a set of filters, each with the same depth as the input volume and a spatial extent specified by the user (the receptive field). A feature map is generated by sliding one filter across the entire input volume with a certain stride. Stacking together the feature maps generated by all the filters gives the output volume.

There are two features of this design that call for attention. The first is local connectivity. When we create the feature map, each value depends only on a small patch of the original image, with full depth included. In other words, each neuron (a filter spanning the full depth) is connected only to local input values, whereas in a Fully Connected Layer each neuron connects to all input values. This not only greatly reduces the number of parameters but also tells the layer where in the input to focus.

The second design feature is parameter sharing, which means the same filter is used across the entire image to generate its feature map. The rationale is that many low-level features recur across the image, so there is no need to relearn them at every position; this also greatly reduces the number of parameters. In cases where features are expected to differ substantially across positions, we can instead use different filters at different (x, y) positions, forfeiting the parameter-sharing scheme and learning each filter separately. A network constructed this way is called a Locally Connected Layer; a Convolution Layer is a special Locally Connected Layer with shared parameters.

Input:

Layer:

Output:
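The output volume's spatial size follows the standard formula (W - F + 2P) / S + 1 for input width W, filter size F, padding P, and stride S; a quick sketch with example numbers (the 32x32x3 input and ten 5x5 filters are illustrative):

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size for input width W, filter F, stride S, padding P."""
    return (W - F + 2 * P) // S + 1

# A 32x32x3 input with ten 5x5 filters, stride 1, pad 2 keeps spatial size:
print(conv_output_size(32, 5, S=1, P=2))  # 32, so the output volume is 32x32x10
```

The depth of the output volume equals the number of filters, since each filter contributes one feature map to the stack.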

Functionality

Convolution Layer is used to extract image features, from low-level features in the early layers to more abstract, high-level ones in the later layers. The cool part is that each Layer learns useful filters by itself.

A 1 x 1 Convolution Layer helps reduce the depth dimension of the input.

Tricks & Caveats

We often prefer stacking small Conv Layers over a single Conv Layer with a large receptive field. The first reason is that, for the same effective receptive field, the stack of smaller layers needs far fewer parameters. The second reason is that between the small layers we can insert ReLU non-linearities, an option we don't have with a single giant layer; the extra non-linearity yields more expressiveness.
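The parameter savings are easy to verify by arithmetic. Three stacked 3x3 layers have the same 7x7 effective receptive field as one 7x7 layer; with C channels in and out (C = 64 below is an arbitrary example, biases ignored), the stack is roughly 45% smaller:

```python
C = 64  # example channel count; any value shows the same ratio
one_7x7 = 7 * 7 * C * C          # single layer, 7x7 receptive field
three_3x3 = 3 * (3 * 3 * C * C)  # three 3x3 layers, same effective field
print(one_7x7, three_3x3)        # 200704 vs 110592
```

And the stack allows two extra ReLU layers in between, which the single 7x7 layer cannot accommodate.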

 

Pooling Layer

Definition

Pooling Layer usually applies a max-pool operation over a small spatial extent to reduce spatial size.

Pooling Layers are gradually falling out of favor, as people realize the same reduction can be achieved with a strided Convolution Layer as well.

Functionality

Pooling Layer is used to reduce the spatial size of the output, thus reducing the number of parameters in the network. This in turn helps control overfitting.

Input:

Layer:

Output
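A numpy sketch of 2x2 max pooling with stride 2 on a single-channel input (the reshape trick assumes the spatial size divides evenly by the pool size):

```python
import numpy as np

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 0, 2, 2],
              [1, 1, 3, 4]])

# Split the 4x4 input into 2x2 blocks, then take each block's maximum
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
```

The 4x4 input becomes 2x2, a 75% reduction in spatial values, which is exactly the down-sampling (and the potential information loss) discussed in this section.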

Use Cases

Pooling Layer is mostly used immediately after ReLU Layer.

Caveats

Pooling Layer is often labeled destructive, in the sense that careless down-sampling removes useful information. For this reason, a Pooling Layer should use a tiny receptive field (2-3), or be avoided completely.

FC Layer

Definition

Fully Connected Layer is just a traditional linear classifier. The layer holds N vectors (weights) and N scalars (biases), where the dimension of each weight vector is determined by the input dimension. Given an input vector, we compute its dot product with each of the N weights and add the corresponding bias to generate N scores. Each score roughly measures how similar the given vector is to its corresponding template vector (weight).

Input:

Layer:

Output
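The whole layer reduces to a single matrix-vector product plus a bias; a sketch with arbitrary dimensions (D = 4 inputs, N = 3 templates):

```python
import numpy as np

D, N = 4, 3  # input dimension, number of templates (classes)
rng = np.random.default_rng(0)
W = rng.standard_normal((N, D))  # N template vectors, one per row
b = rng.standard_normal(N)       # N biases
x = rng.standard_normal(D)       # input vector

scores = W @ x + b  # dot product with each template, plus its bias
```

Each of the N entries in `scores` is one template's similarity to the input, which is what the network's final layer uses for classification.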

Functionality

A single FC Layer is a universal approximator. This means an FC Layer can express any given function with proper tuning. Yet this statement carries little weight on its own, because such a layer is impractical to train. Only when we stack several such layers together does the network become feasible to optimize. That said, although a single FC layer has the same representational power as a deeper multi-layer network, the multi-layer network empirically works much better.

 

Use Cases

FC Layer is the primary component of a Neural Network. Notice that a Neural Net with 3 FC layers often outperforms one with 2 FC layers, yet going deeper than 3 layers rarely helps. This stands in stark contrast to Deep Convolutional Networks, where depth plays a much bigger role in performance.

FC Layers are used mostly as the last several layers of a ConvNet, to generate scores.

 

Caveats

FC Layer is very computationally expensive. Use an FC layer only when each neuron genuinely needs to connect to all input values.

Convolution

What

Image Convolution is template matching done in a streamlined fashion. In template matching, we need a template image and an image to match against. We sweep the template across the full image pixel by pixel and check how similar the template is to the overlapped patch of the full image below it. After a full sweep, we return the coordinate at which the underlying patch is most similar to the template. Convolution is almost exactly the same: we have a template and a full image, and we sweep the template across. Instead of returning a single position, however, we return a matrix in which each entry records the similarity between the template and the image at that position. If we think of the template as an image feature, then the matrix is a feature map of the full image, where each value indicates how strongly the feature shows. This is an important step in the Convolutional Neural Network.
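The sweep described above can be written as a naive double loop, with the dot product as the similarity measure. The 5x5 image with a single bright pixel is a made-up example chosen so the feature map's peak is easy to predict:

```python
import numpy as np

def feature_map(image, template):
    """Slide template over image (stride 1, valid positions only); each
    entry is the dot product, i.e. the similarity at that position."""
    H, W = image.shape
    h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * template)
    return out

image = np.zeros((5, 5))
image[2, 2] = 1.0                        # a single bright pixel
template = np.array([[0.0, 0.0],
                     [0.0, 1.0]])        # "feature": bottom-right corner lit
fm = feature_map(image, template)
```

The map `fm` peaks at the position where the template best overlaps the bright pixel, which is the "how strongly is the feature showing" reading from the text.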

Why

Before Convolution, an image was transformed into a vector by concatenating all its columns into a single column, and the number of parameters needed to process it was ridiculous and unnecessary. Since a lot of image information is present locally, we don't have to connect each neuron to every element of the vector; instead we connect neurons to inputs locally by performing convolution. Besides greatly reducing the number of parameters, Convolution helps the network focus on local information, an idea similar to the Attention Mechanism.

How

Various techniques are designed to deal with edge cases in Convolution.

Padding pads the original image with zeros on the edges so the sizes work out. It also prevents information at the edges of the image from being washed away too quickly. Stride defines how big each step is during the template sweep; it is usually set to 1, because a smaller stride works better in practice.

Thoughts

Convolution compares image similarity via the dot product of two image vectors. Is there a better way to measure such similarity, especially between images rather than vectors?

Smaller image patches make weaker assumptions about the template feature to be matched, and thus perform better than bigger image patches.

Visualization

What

 

Why

 

How

High-Dimensional Visualization

This refers to visualizing functions that live in a high-dimensional space. The most important example is the Loss Function, which has an enormous number of parameters. Visualizing such a function can provide insight into how to optimize it most efficiently.

One possible approach is to visualize the function along a random plane. We first generate a random point $W$ in the space and two orthogonal directions $W_1$ and $W_2$; for a Loss Function, $W$ corresponds to a complete set of model parameters. We then plot the loss values at the points $W + aW_1 + bW_2$ on a plane, using color to encode the loss. This gives some intuition about what the Loss Function looks like in space.
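A sketch of the plane-sampling procedure; the quadratic `loss` is a hypothetical stand-in (any scalar function of a parameter vector works), and the 100-dimensional space and 50x50 grid are arbitrary choices:

```python
import numpy as np

def loss(W):
    # Hypothetical stand-in loss: a simple bowl, sum of squared parameters
    return np.sum(W ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal(100)       # a random point in parameter space
W1 = rng.standard_normal(100)      # first random direction
W2 = rng.standard_normal(100)
W2 -= (W1 @ W2) / (W1 @ W1) * W1   # orthogonalize the second direction

# Sample the loss on the plane W + a*W1 + b*W2
a = np.linspace(-1, 1, 50)
b = np.linspace(-1, 1, 50)
grid = np.array([[loss(W + ai * W1 + bi * W2) for bi in b] for ai in a])
# grid can now be rendered as a color plot (e.g. matplotlib's imshow)
```

Each grid cell holds the loss at one (a, b) offset on the random plane, which is exactly the colored-plane plot described above.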

Figure: SVM Loss Function along a plane. The left image plots the loss from a single example and shows a piece-wise linear structure, whereas the right image shows the average loss over a hundred examples and demonstrates a bowl-shaped structure.