
Summary of the second week.

Theory:

Kept reading papers. Some interesting ideas include the claim that SGD happens in a small subspace whose dimension is data-dependent (see the two papers Yikai sent on Teams), and the dynamics of GAN training from a landscape perspective (https://arxiv.org/1906.04848).

Ideas:

Interesting new ideas (that might be worth discussing) include:

  1. Using the Hessian information already being computed, restrict the optimization process to the subspace spanned by several dominant eigenvectors of the Hessian and see what happens (see the sketch after this list).
  2. Calculate the eigenvalue spectrum of the Hessian on the full training set and compare it with the spectrum calculated on smaller subsets of the data (e.g. of mini-batch size).
  3. Look into other architectures (recurrent neural nets, GANs, etc.).
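
For idea 1, here is a minimal sketch of how it could be done, assuming PyTorch and assuming the top eigenvectors are already available as flattened, orthonormal vectors; the function name and calling pattern are illustrative, not part of the current code base.

```python
import torch

def project_grads_to_subspace(model, eigvecs):
    """Replace each parameter's gradient by its projection onto the subspace
    spanned by the given flattened, orthonormal Hessian eigenvectors."""
    # Flatten all existing gradients into one long vector.
    flat = torch.cat([p.grad.reshape(-1) for p in model.parameters()
                      if p.grad is not None])

    # Projection onto span{v_1, ..., v_k}: sum_i <g, v_i> v_i.
    proj = torch.zeros_like(flat)
    for v in eigvecs:
        proj += torch.dot(flat, v) * v

    # Write the projected gradient back into the parameters.
    offset = 0
    for p in model.parameters():
        if p.grad is None:
            continue
        n = p.grad.numel()
        p.grad.copy_(proj[offset:offset + n].reshape(p.grad.shape))
        offset += n

# Illustrative training step:
#   loss.backward()
#   project_grads_to_subspace(model, top_eigvecs)  # restrict the update direction
#   optimizer.step()
```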

Experiments:

Most of the time was spent constructing and finalizing the code base, so not many experiments have been run. So far, the experiments include removing the shortcut connections and batch normalization from ResNet34, and varying the mini-batch size during training. More experiments will be done next week now that the code base is mature.

Some of the current experiment results can be seen at https://users.cs.duke.edu/~xz231/

It is confirmed that training from different Gaussian-initialized models can lead to different local minima (as shown on the site above).

The local Hessian's eigenvectors and eigenvalues can be calculated numerically with acceptable accuracy, but for a large network (ResNet34) it takes ~2 hrs to compute the top 10 eigenvectors and eigenvalues, while for a smaller network (e.g. VGG11) it takes ~10 mins. Experiments involving the Hessian along the training trajectory should therefore be done on smaller nets.
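
For reference, a minimal sketch of the kind of computation involved: estimating the dominant Hessian eigenpair with Hessian-vector products and power iteration in PyTorch. This is an assumption about the method, not a description of the actual hessian_calc implementation; for the top-k pairs one would deflate or use a Lanczos routine instead.

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Compute H v via double backprop: d/dtheta (grad(loss) . v)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = torch.dot(flat_grad, vec)
    hv = torch.autograd.grad(grad_dot_v, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def top_eigenpair(loss, params, iters=100, tol=1e-4):
    """Estimate the dominant Hessian eigenvalue/eigenvector by power iteration."""
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device)
    v /= v.norm()
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(loss, params, v)
        new_eig = torch.dot(v, hv).item()
        v = hv / (hv.norm() + 1e-12)
        if abs(new_eig - eig) < tol * (abs(eig) + 1e-12):
            break
        eig = new_eig
    return eig, v
```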

Visualization:

The currently implemented visualizations include 1D and 2D plots of the loss along random directions and Hessian eigenvectors (filter-wise normalization can be applied). There are currently no new ideas for this part.
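
As a reference for the 1D case, a minimal sketch of plotting the loss along one direction with filter-wise normalization (in the spirit of Li et al.'s loss-landscape visualization); the function names and the single-batch loss evaluation are simplifying assumptions, not the actual code base.

```python
import torch

def filterwise_normalize(direction, model):
    """Rescale each filter of a random direction to match the norm of the
    corresponding filter in the model's weights."""
    for d, p in zip(direction, model.parameters()):
        if p.dim() <= 1:      # biases / BN parameters: ignore this direction
            d.zero_()
        else:                 # conv / linear weights: normalize filter by filter
            for df, pf in zip(d, p):
                df.mul_(pf.norm() / (df.norm() + 1e-10))

def loss_along_direction(model, loss_fn, data, target, direction, alphas):
    """Evaluate the loss at theta + alpha * d for each alpha in `alphas`."""
    base = [p.detach().clone() for p in model.parameters()]
    losses = []
    with torch.no_grad():
        for a in alphas:
            for p, p0, d in zip(model.parameters(), base, direction):
                p.copy_(p0 + a * d)
            losses.append(loss_fn(model(data), target).item())
        for p, p0 in zip(model.parameters(), base):   # restore original weights
            p.copy_(p0)
    return losses

# Illustrative usage:
#   direction = [torch.randn_like(p) for p in model.parameters()]
#   filterwise_normalize(direction, model)
#   losses = loss_along_direction(model, loss_fn, x, y, direction,
#                                 torch.linspace(-1, 1, 51))
```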

Coding:

The basic code base is finished; it includes training, testing, visualization, and local Hessian computation. The organization of experiment management (collaboration through git, etc.) was also set up.

 


Progress update 6.5-6.7

Conducted tests on several Hessian-calculation packages. Completed the hessian_calc code and integrated it into the training pipeline.
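
A rough sketch of what the integration could look like, assuming the Hessian analysis is run periodically during training; train_with_hessian_tracking and compute_top_eigenpairs are hypothetical stand-ins, not the actual hessian_calc interface.

```python
def train_with_hessian_tracking(model, train_loader, optimizer, loss_fn,
                                num_epochs, compute_top_eigenpairs, eig_every=5):
    """Hypothetical integration sketch: run the Hessian analysis every few
    epochs so the spectrum can be tracked along the training trajectory."""
    history = []
    for epoch in range(num_epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

        if epoch % eig_every == 0:
            # Evaluate the loss on one fixed batch and record the top eigenpairs.
            x_eval, y_eval = next(iter(train_loader))
            eval_loss = loss_fn(model(x_eval), y_eval)
            history.append((epoch, compute_top_eigenpairs(
                eval_loss, list(model.parameters()), k=10)))
    return history
```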

Read an interesting paper about GAN training from a landscape perspective (https://arxiv.org/1906.04848).


Some rough ideas for the topic

There are mainly four paths to follow:

  1. Propose a new measure and conduct experiments to justify its usefulness, as most similar papers do (rigorous proofs would be better but harder):
    1. Some properties of low-dimensional submanifolds of the loss landscape (e.g. curvature) and their probability distribution (if the low-dimensional submanifolds are selected randomly).
    2. The statistical distribution of some well-known measures (if they are not invariant).
    3. Some stable indicators, as mentioned by inFERENCe (https://www.inference.vc/sharp-vs-flat-minima-are-still-a-mystery-to-me/), such as the ratio of two traditional measures.
    4. The fluctuation of the training error during training (possibly also the gradient and even the trace along the trajectory), based on the idea that sampling mini-batches from the training set is analogous to sampling the training set from the entire data distribution.
  2. Disprove some of the popular measures (as was done in Sharp Minima Can Generalize Well). We can probably show through some theoretical work that filter-wise rescaling is not useful; we may also search for other measures.
    1. E.g., filter-wise normalization may have significant error under certain reparameterizations, following logic similar to Sharp Minima Can Generalize Well.
  3. Combine the ideas of several papers (e.g. calculating Hessian information on a rescaled loss landscape).
    1. E.g., try to combine filter-wise normalization with Hessian-based measures.
  4. Show that some disproved measures may actually work in practice (i.e. even though, through reparameterization, ReLU models always have some sharp minima that generalize well, we can compute the distribution of sharpness to show that such sharp minima lie in a small tail of the distribution). See the sketch after this list.
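
As a concrete reference for paths 2 and 4, a minimal sketch of the ReLU rescaling reparameterization discussed in Sharp Minima Can Generalize Well: scaling one layer by alpha and the next by 1/alpha leaves the network function unchanged while distorting weight-space curvature, which is why raw sharpness measures (and possibly filter-wise fixes) need scrutiny. The toy two-layer model below is illustrative, not part of the code base.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer ReLU network: f(x) = W2 * relu(W1 * x)
net = nn.Sequential(nn.Linear(10, 20, bias=False),
                    nn.ReLU(),
                    nn.Linear(20, 1, bias=False))

x = torch.randn(32, 10)
y_before = net(x)

# Reparameterize: scale the first layer by alpha and the second by 1/alpha.
# ReLU is positively homogeneous, so the network function is unchanged.
alpha = 10.0
with torch.no_grad():
    net[0].weight.mul_(alpha)
    net[2].weight.div_(alpha)

y_after = net(x)
print(torch.allclose(y_before, y_after, atol=1e-5))   # True: same function

# Any measure built from raw weight-space curvature (e.g. Hessian eigenvalues
# with respect to net[0].weight) changes under this rescaling, which is the
# core objection to naive sharpness measures and the motivation for
# reparameterization-aware alternatives.
```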

Hello world!

I am Xingyu Zhu (Jupiter), a rising junior majoring in Mathematics and Computer Science at Duke University, Trinity College of Arts and Sciences.

On this site, I will post updates on the progress of my Duke CS+ research project: Visualizing Optimization Landscape for Deep Neural Networks. These updates include daily progress, weekly reflections, comments on research papers, and ideas for research.