arXiv:1601.00917 [cs.LG]
Distilling Reverse-Mode Automatic Differentiation (DrMAD) for Optimizing Hyperparameters of Deep Neural Networks
Jie Fu, Hongyin Luo, Jiashi Feng, Tat-Seng Chua
Published 2016-01-05 (Version 1)
The performance of deep neural networks is sensitive to the setting of their hyperparameters (e.g. L2-norm penalties). Recent advances in reverse-mode automatic differentiation have made it possible to optimize hyperparameters with gradients. The standard way of computing these gradients involves a forward and a backward pass, similar to its cousin, back-propagation, used for training the weights of neural networks. However, the backward pass usually needs to exactly reverse the training procedure, starting from the trained parameters and working back to the initial random ones. This incurs prohibitive memory consumption, since all intermediate variables must be stored. Here we propose to distill the knowledge of the forward pass into a shortcut path, through which we approximately reverse the training trajectory. Experiments carried out on the MNIST dataset show that our approach reduces memory consumption by orders of magnitude without sacrificing effectiveness. Our method makes it feasible, for the first time, to automatically tune hundreds of thousands of hyperparameters of deep neural networks in practice.
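To make the "shortcut path" concrete, the sketch below illustrates the core idea under stated assumptions (it is not the authors' implementation): the forward pass keeps only the initial and final weights, and the reverse hypergradient pass reconstructs each intermediate weight vector by linearly interpolating between them instead of storing or replaying the trajectory. The model (a linear regressor), the optimizer settings (SGD with momentum), the per-weight L2 hyperparameters, and all function names are illustrative assumptions.

```python
# Minimal JAX sketch of interpolation-based reverse-mode hypergradients.
# Assumption: SGD with momentum and one (log-)L2 penalty per weight;
# names, model, and data are illustrative, not from the paper's code.
import jax
import jax.numpy as jnp

def train_loss(w, log_l2, x, y):
    """Regularized training loss; one L2 penalty per weight (hyperparameters)."""
    data_loss = jnp.mean((x @ w - y) ** 2)
    reg = jnp.sum(jnp.exp(log_l2) * w ** 2)
    return data_loss + reg

def valid_loss(w, x, y):
    """Validation loss that the hyperparameters are tuned against."""
    return jnp.mean((x @ w - y) ** 2)

def sgd_momentum_train(w0, log_l2, x, y, T=100, lr=0.05, gamma=0.9):
    """Forward pass: plain SGD with momentum; only w0 and wT are kept."""
    grad_w = jax.grad(train_loss, argnums=0)
    w, v = w0, jnp.zeros_like(w0)
    for _ in range(T):
        v = gamma * v - (1 - gamma) * grad_w(w, log_l2, x, y)
        w = w + lr * v
    return w

def approx_reverse_hypergrad(w0, wT, log_l2, xt, yt, xv, yv,
                             T=100, lr=0.05, gamma=0.9):
    """Approximate reverse pass: reconstruct w_t by linear interpolation
    between w0 and wT (the shortcut path) instead of storing the trajectory."""
    grad_w = jax.grad(train_loss, argnums=0)
    dw = jax.grad(valid_loss, argnums=0)(wT, xv, yv)
    dv = jnp.zeros_like(dw)
    dhyper = jnp.zeros_like(log_l2)
    for t in range(T, 0, -1):
        beta = t / T
        w_t = (1 - beta) * w0 + beta * wT          # interpolated weight
        dv = dv + lr * dw
        # Mixed second derivative: d/d(log_l2) of <grad_w, dv>
        dhyper = dhyper - (1 - gamma) * jax.grad(
            lambda h: jnp.vdot(grad_w(w_t, h, xt, yt), dv))(log_l2)
        # Hessian-vector product: d/dw of <grad_w, dv>, evaluated at w_t
        dw = dw - (1 - gamma) * jax.grad(
            lambda w: jnp.vdot(grad_w(w, log_l2, xt, yt), dv))(w_t)
        dv = gamma * dv
    return dhyper

# Usage: one hyper-iteration on synthetic data (training set doubles as
# validation set here purely for brevity).
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 5))
y = x @ jnp.arange(1.0, 6.0)
w0 = jnp.zeros(5)
log_l2 = jnp.full(5, -3.0)                         # per-weight hyperparameters
wT = sgd_momentum_train(w0, log_l2, x, y)
print(approx_reverse_hypergrad(w0, wT, log_l2, x, y, x, y))
```

Because only `w0` and `wT` are retained, memory is independent of the number of training iterations; the trade-off is that the hypergradient is computed along an approximate, linearized trajectory rather than the exact one.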