Kept reading papers. Some interesting ideas include the claim that SGD happens in a small subspace whose dimension is data-dependent (see the two papers Yikai sent to Teams), and the dynamics of GAN training from a landscape perspective (https://arxiv.org/1906.04848).
Interesting new ideas (that might be worth discussing) include:
- Using the computed Hessian information, restrict the optimization process to the subspace spanned by a few dominant eigenvectors of the Hessian and see what happens.
- Compute the eigenvalue spectrum of the Hessian on the full training set and compare it with the spectrum computed on smaller subsets of the data (e.g. of mini-batch size).
- Look into other architectures (recurrent neural nets, GANs, etc.).
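The first idea above can be sketched on a toy problem. This is a minimal illustration, not code from our code base: for a quadratic loss the Hessian is constant and its eigenvectors are known exactly, so we can project each gradient step onto the span of the top eigenvectors and watch what happens to the restricted and unrestricted components.

```python
import numpy as np

# Toy sketch (assumed setup, not our code base): gradient descent restricted
# to the span of the top-k Hessian eigenvectors, for f(w) = 0.5 * w^T H w.
rng = np.random.default_rng(0)
H = np.diag([10.0, 5.0, 0.1, 0.01])   # Hessian with two dominant eigenvalues
eigvals, eigvecs = np.linalg.eigh(H)  # ascending eigenvalues
V = eigvecs[:, -2:]                   # top-2 eigenvectors as columns
P = V @ V.T                           # orthogonal projector onto their span

w = rng.normal(size=4)
w0 = w.copy()                         # keep the start point for comparison
lr = 0.05
for _ in range(200):
    grad = H @ w                      # gradient of the quadratic loss
    w -= lr * (P @ grad)              # restrict the update to the subspace

# The components inside the dominant subspace are driven to zero, while the
# components orthogonal to it never move.
print(np.abs(P @ w).max())
```

In a real network the projector would be built from the numerically computed top eigenvectors, and the projection would be applied to the SGD update at each step.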
Most of the time was spent constructing and finalizing the code base, so not many experiments were run. So far, the experiments include removing the shortcut connections and batch normalization from ResNet34 and varying the training mini-batch size. More experiments will be run next week now that the code base is mature.
Some of the current experiment results can be seen at https://users.cs.duke.edu/~xz231/
It is confirmed that training from different Gaussian-initialized models can lead to different local minima (as shown on the site above).
The local Hessian's eigenvectors and eigenvalues can be computed numerically with acceptable accuracy. However, for large networks (ResNet34) it takes ~2 hrs to compute the top 10 eigenvectors and eigenvalues, while for smaller networks (e.g. VGG11) it takes ~10 min. So experiments involving the Hessian along the training trajectory should be done on smaller nets.
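For reference, the eigenpair computation is matrix-free: an iterative eigensolver only needs Hessian-vector products, which autograd supplies via Pearlmutter's trick without ever forming the Hessian. A sketch of the idea, with a small explicit symmetric matrix standing in for the Hessian (the `hvp` function is the piece that autograd would replace):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

# Sketch of the matrix-free eigenpair computation. Here a random symmetric
# matrix stands in for the Hessian; in the real setting hvp(v) would be the
# gradient of (grad(loss) . v) w.r.t. the parameters, so the full Hessian
# is never materialized.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = (A + A.T) / 2                     # stand-in "Hessian" (symmetric)

def hvp(v):
    # Hessian-vector product; the only access to H the solver needs.
    return H @ v

op = LinearOperator((50, 50), matvec=hvp)
top_vals, top_vecs = eigsh(op, k=10, which='LA')  # 10 largest eigenvalues
```

The cost is dominated by the number of Hessian-vector products, each of which is roughly the price of one extra backward pass, which is consistent with the ~2 hr timing on ResNet34.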
The currently implemented visualization includes 1D and 2D plots along random directions and Hessian eigenvectors (filter-wise normalization can be applied). There are currently no new ideas on this part.
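The 1D visualization with filter-wise normalization can be sketched as follows. This is an illustrative toy (names, shapes, and the placeholder loss are assumptions, not our actual code): each "filter" of a random direction is rescaled to match the norm of the corresponding filter of the weights, and the loss is then sampled along that direction.

```python
import numpy as np

# Toy sketch of a filter-wise normalized 1D loss slice. W plays the role of
# a conv layer's weights viewed as 8 filters of 16 entries each; loss() is a
# placeholder for evaluating the network on the training set.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))

def filterwise_normalize(d, w):
    # Rescale each filter of direction d to the norm of the matching filter
    # of w, so the slice is comparable across layers of different scales.
    d = d.copy()
    for i in range(w.shape[0]):
        d[i] *= np.linalg.norm(w[i]) / (np.linalg.norm(d[i]) + 1e-12)
    return d

def loss(w):
    return 0.5 * np.sum(w ** 2)       # placeholder loss surface

d = filterwise_normalize(rng.normal(size=W.shape), W)
alphas = np.linspace(-1.0, 1.0, 21)
curve = [loss(W + a * d) for a in alphas]   # 1D slice of the landscape
```

The 2D version is the same idea with two normalized directions and a grid of (alpha, beta) pairs; eigenvector directions skip the random draw and use the computed Hessian eigenvectors instead.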
The basic code base is finished; it includes training, testing, visualization, and local Hessian computation. The organization of experiment management (collaboration through git, etc.) has also been set up.