Random Forest is Decision Tree plus Bagging, with a slight modification. A single Decision Tree typically includes all given features and is responsible for predicting the output on its own. Bagging (Bootstrap Aggregating), a voting mechanism, instead trains multiple Trees on bootstrap subsets of the data and asks them to vote for the final answer. Random Forest takes this a step further: instead of including all features in every Decision Tree, it randomly selects M features for each Tree (common implementations draw a random feature subset at each split). Each Tree is then trained on a randomly chosen subset of the data, and the final decision is still the aggregated result from all Trees. Both Bagging and random feature selection introduce stochasticity and thus improve generalization.
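The mechanics above can be sketched in plain Python. This is a toy, not a real implementation: the "trees" here are hypothetical one-feature stumps (threshold at the feature's mean), and each one sees a bootstrap sample plus one randomly chosen feature, standing in for the random feature subset.

```python
import random
from collections import Counter

def bootstrap_sample(data, labels):
    """Sample n points with replacement (the 'bootstrap' in Bagging)."""
    n = len(data)
    idx = [random.randrange(n) for _ in range(n)]
    return [data[i] for i in idx], [labels[i] for i in idx]

def train_stump(data, labels, feature):
    """Toy one-feature 'tree': split at the feature's mean, predict the
    majority label on each side."""
    t = sum(x[feature] for x in data) / len(data)
    left = [y for x, y in zip(data, labels) if x[feature] <= t]
    right = [y for x, y in zip(data, labels) if x[feature] > t]
    maj = lambda ys: Counter(ys).most_common(1)[0][0] if ys else 0
    return feature, t, maj(left), maj(right)

def predict_stump(stump, x):
    f, t, left, right = stump
    return left if x[f] <= t else right

def random_forest(data, labels, n_trees=25):
    """Each 'tree' gets a bootstrap sample and one random feature."""
    n_features = len(data[0])
    forest = []
    for _ in range(n_trees):
        boot_x, boot_y = bootstrap_sample(data, labels)
        f = random.randrange(n_features)  # random feature selection
        forest.append(train_stump(boot_x, boot_y, f))
    return forest

def predict_forest(forest, x):
    """Aggregate by majority vote across all Trees."""
    votes = [predict_stump(s, x) for s in forest]
    return Counter(votes).most_common(1)[0][0]
```

Real libraries (e.g. scikit-learn's RandomForestClassifier) use full depth-grown trees and re-draw the feature subset at every split, but the bootstrap-plus-vote skeleton is the same.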
The goal of Bagging in general is to reduce variance in the output. In comparison, Boosting is another technique aimed at reducing bias. Empirically, Boosting tends to give higher test accuracy.
Random Forest deals with high-dimensional data very well.
It generalizes well to unseen data, thanks to the random feature selection mechanism.
Boosted Decision Trees are a type of Model Ensemble where we iteratively learn better Decision Tree classifiers based on the errors of the prior Tree. All data points are first given the same weight. During training, wrongly classified points gain bigger weights, allowing the next Decision Tree to focus specifically on them. Each Decision Tree also has a score indicating its classification accuracy; these scores are then used to weight each Tree in the final aggregated model.
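One round of this reweighting can be written out concretely. The sketch below follows the classic AdaBoost update (one specific Boosting scheme, chosen here for illustration): the weighted error of the current Tree determines both the Tree's score (alpha) and how much the misclassified points gain weight.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One Boosting round: weighted error -> Tree score -> new data weights.
    `predictions` are the current weak Tree's outputs on the training set."""
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    err = max(min(err, 1 - 1e-10), 1e-10)       # numerical safety
    alpha = 0.5 * math.log((1 - err) / err)     # the Tree's score
    # misclassified points gain weight, correct ones lose weight
    new_w = [w * math.exp(alpha if p != y else -alpha)
             for w, p, y in zip(weights, predictions, labels)]
    z = sum(new_w)                               # renormalize to sum to 1
    return alpha, [w / z for w in new_w]
```

A nice property of this update: after renormalization, the misclassified points always hold exactly half the total weight, which is what forces the next Tree to attend to them.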
The main goal of Boosted Decision Trees is to reduce bias, whereas the goal of Bagging is to reduce variance.
The ReLU Layer is used to threshold a feature map or a general input vector.
Because ReLU is often the only non-linear Layer in a network, it prevents the whole network from collapsing into a single matrix operation. For the same reason, it contributes to the network's expressiveness.
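The operation itself is just an elementwise threshold at zero, a one-liner:

```python
def relu(v):
    """Elementwise max(0, x): negative activations are zeroed out."""
    return [max(0.0, x) for x in v]
```

Without this non-linearity between them, two stacked linear layers W2(W1 x) would collapse into the single matrix (W2 W1) x.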
The ReLU Layer is mostly used right after a Convolution Layer.
The Convolution Layer is a systematic template matching mechanism with learnable templates. It has a set of filters, each with the same depth as the input volume and a spatial extent specified by the user (the receptive field). A feature map is generated by sliding one filter across the entire input volume with a certain stride. By stacking together the feature maps generated by all filters, we get the output volume.
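A minimal sketch of the sliding-filter step, in plain Python and with depth 1 for clarity (a real layer would sum over the depth dimension as well, and stack one such map per filter):

```python
def conv2d(image, filt, stride=1):
    """Slide `filt` over `image` (valid positions only), recording the
    dot product at each position -- one feature map."""
    H, W = len(image), len(image[0])
    h, w = len(filt), len(filt[0])
    out = []
    for i in range(0, H - h + 1, stride):
        row = []
        for j in range(0, W - w + 1, stride):
            row.append(sum(image[i + a][j + b] * filt[a][b]
                           for a in range(h) for b in range(w)))
        out.append(row)
    return out
```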
There are two features of this design that call for attention. One is local connectivity: when we create the feature map, each value depends only on a small patch of the original image (depth included). In other words, each neuron (a filter spanning the full depth) is connected only to local input values, whereas in a Fully Connected Layer each neuron connects to all input values. This not only reduces the number of parameters by a lot but also instructs the layer where in the input to focus.
The second feature by design is parameter sharing, which means the same filter is used across the entire image to generate the feature map. The reasoning is that many low-level features recur across the image, so there is no need to relearn them at every position. This also greatly reduces the number of parameters. In cases where features are expected to differ a lot by position, we can use different filters at different (x, y) positions; we then forfeit the parameter sharing scheme and learn each filter separately. A network constructed like this is called a Locally Connected Layer. A Convolution Layer is a special Locally Connected Layer with shared parameters.
The Convolution Layer is used to filter out image features, from low-level features in the early layers to more abstract, high-level ones in the later layers. The cool part is that the network learns useful filters at each layer by itself.
A 1 x 1 Convolution Layer helps reduce the depth dimension of the input.
Tricks & Caveats
We often prefer stacking small Conv Layers to a single Conv Layer with a large receptive field. The first reason is that, for the same effective receptive field, stacking smaller layers needs far fewer parameters. The second reason is that between the small layers we can insert more ReLU non-linearities, an option we don't have with a single giant layer. The extra non-linearity promises more expressiveness.
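The parameter count claim is easy to verify. Three stacked 3x3 layers have the same 7x7 effective receptive field as one 7x7 layer; counting weights plus biases (channel count C = 64 chosen arbitrarily for the comparison):

```python
def conv_params(receptive, in_depth, out_depth):
    """Parameters of one Conv Layer with square filters: each of the
    out_depth filters has receptive^2 * in_depth weights plus one bias."""
    return out_depth * (receptive * receptive * in_depth + 1)

C = 64
single = conv_params(7, C, C)        # one 7x7 layer
stacked = 3 * conv_params(3, C, C)   # three 3x3 layers, same effective field
```

With C = 64 this gives 200,768 parameters for the single layer versus 110,784 for the stack, and the stack additionally fits two extra ReLUs in between.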
The Pooling Layer usually applies a max-pool operation over a small spatial extent to reduce spatial size.
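A sketch of 2x2 max-pooling with stride 2, the most common configuration, on a single feature map:

```python
def max_pool(fmap, size=2, stride=2):
    """Keep only the max of each size x size window -- spatial size shrinks
    by roughly a factor of `stride` in each dimension."""
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, W - size + 1, stride)]
            for i in range(0, H - size + 1, stride)]
```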
The Pooling Layer is gradually falling out of favor, as people realize the same down-sampling can be done with strided Convolution Layers as well.
The Pooling Layer is used to reduce the spatial size of the output, thus reducing the amount of computation and the number of parameters in subsequent layers. This further helps control overfitting.
Pooling Layer is mostly used immediately after ReLU Layer.
The Pooling Layer is often labeled destructive, in the sense that down-sampling tends to remove useful information if done carelessly. For this reason, the Pooling Layer should use a tiny receptive field (2-3), or be avoided completely.
The Fully Connected Layer is just a traditional linear classifier. There are N vectors (weights) and N scalars (biases) in such a layer, and the dimension of each weight vector is determined by the input dimension. Given an input vector, we compute its dot product with each of the N weight vectors and add the corresponding bias, producing N scores. Each score roughly measures how similar the given vector is to its corresponding template vector (weight).
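The score computation described above, as a small sketch:

```python
def fc_layer(x, weights, biases):
    """N scores: dot(x, w_i) + b_i, one per template vector w_i."""
    return [sum(xi * wi for xi, wi in zip(x, w)) + b
            for w, b in zip(weights, biases)]
```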
A network with a single hidden FC Layer is a universal approximator, meaning it can express any given function with proper tuning. Yet this statement carries little weight on its own, because it is impractical to train such a layer. Only when we stack several such layers does the network become feasible to optimize. That said, although a single hidden layer has the same representational power as a deeper multi-layer network, empirically the multi-layer network works much better.
The FC Layer is the primary component of the Neural Network. Notice that a Neural Net with 3 FC layers often outperforms one with 2 FC layers, yet going deeper than 3 layers rarely helps. This stands in stark contrast to Deep Convolutional Networks, where depth plays a much bigger role in performance.
The FC Layer is used mostly in the last several layers of a ConvNet to generate scores.
The FC Layer is very computationally expensive. Use an FC layer only if there is a real need to connect each neuron to the entire input.
Image Convolution is template matching done in a streamlined fashion. To do template matching, we need a template image and an image to match against. We sweep the template across the full image pixel by pixel, checking how similar the template is to the overlapped portion of the image beneath it. After a full sweep, we return the coordinate at which the underlying portion of the image is most similar to the template. Convolution is almost exactly the same: we have a template and a full image, and we sweep the template across. Instead of returning a single position, however, we return a matrix, where each entry records the similarity between the template and the image at that position. If we think of the template as an image feature, then the matrix is a feature map of the image, where each value indicates how strongly the feature shows. This is an important step in the Convolutional Neural Network.
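The template-matching version (return the single best coordinate rather than the whole similarity matrix) can be sketched like this, using the dot product as the similarity measure:

```python
def match_template(image, template):
    """Sweep `template` over `image`; return the (row, col) of the
    top-left corner where the dot-product similarity is highest."""
    H, W = len(image), len(image[0])
    h, w = len(template), len(template[0])
    best, best_pos = float('-inf'), None
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            score = sum(image[i + a][j + b] * template[a][b]
                        for a in range(h) for b in range(w))
            if score > best:
                best, best_pos = score, (i, j)
    return best_pos
```

Keeping the full matrix of scores instead of taking the argmax turns this exact routine into the feature map described above.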
Before Convolution, the image would be transformed into a vector by concatenating all columns into a single column, and the number of parameters needed to fully connect to it is ridiculous and unnecessary. Since a lot of image information is present locally, we don't have to connect each neuron to every element of the vector; instead we connect neurons to inputs locally by performing convolution. In addition to greatly reducing the number of parameters, Convolution helps the network focus on local information, an idea similar to the Attention Mechanism.
Various techniques are designed to deal with edge cases in Convolution.
Padding pads the original image with zeros at the edges so the sizes work out. It also prevents information at the edge of the image from being washed away too quickly. Stride defines how big each step is during the template sweep; it is usually set to 1, because smaller strides work better in practice.
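How padding and stride make "the sizing work out" follows the standard output-size formula, (n + 2p - f) / s + 1 for an n-wide input and f-wide filter:

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial size of the feature map after convolving an n-wide input
    with an f-wide filter, given zero-padding p and stride s."""
    return (n + 2 * padding - f) // stride + 1
```

For example, a 3-wide filter with padding 1 and stride 1 preserves spatial size, which is why that configuration is so common.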
Convolution compares image similarity via the dot product of two image vectors. Is there a better way to measure such similarity, especially between images rather than generic vectors?
Smaller image patches make weaker assumptions about the template feature to be matched, and thus tend to perform better than bigger patches.
This refers to the visualization of functions that live in a high-dimensional space. The most important example is the Loss Function, which has an enormous number of parameters. Visualizing such a function can provide insight into how we should optimize most efficiently.
One possible approach is to visualize along a random plane. We first generate a random point $W$ in parameter space and two orthogonal directions $W_1$ and $W_2$; for a Loss Function, $W$ corresponds to a complete set of model parameters. Then we plot the loss values at different points $W + aW_1 + bW_2$ on a plane, using different colors. By doing so we gain some intuition about what the Loss Function looks like in space.
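Evaluating the loss over such a plane can be sketched as below (the function and parameter names here are illustrative; a real plot would feed the resulting grid to a color map):

```python
def plane_slice(loss, W, W1, W2, span=1.0, steps=5):
    """Evaluate loss(W + a*W1 + b*W2) over a grid of (a, b) in
    [-span, span] x [-span, span] -- a 2D slice of the loss landscape."""
    coords = [span * (2 * k / (steps - 1) - 1) for k in range(steps)]
    grid = []
    for a in coords:
        row = []
        for b in coords:
            point = [w + a * u + b * v for w, u, v in zip(W, W1, W2)]
            row.append(loss(point))
        grid.append(row)
    return grid
```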