
EnVision: The Black Magic of Deep Learning

The Black Magic of Deep Learning - Tips and Tricks for the practitioner

  • I then caught wind from Academia: the new hype of Deep Learning was here and it would solve everything.
  • There is an excellent thesis about that (Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning)
  • For a history lesson and a great introduction read Deep Learning: Methods and Applications (Foundations and Trends in Signal Processing) 
    If you really want to start implementing from scratch, check out Deep Belief Nets in C++ and CUDA C, Vol.

Thanks, fascinating article! You wrote: "Always use dropout to minimize the chance of overfitting. Use it after large > 256 (fully connected layers or convolutional layers)." 256 layers is a crazy number of layers, with the exception of some versions of ResNet. I have come across dropout in deep NN architectures with a normal number of layers. Is there a typo? Did you mean a number of hidden units?

@kdnuggets: The Black Magic of #Deep Learning – Tips and Tricks for the practitioner

Surprisingly, it was no hype: Deep Learning works, and it works well. However, it is such a new concept (even though the foundations were laid in the 70’s) that a lot of anecdotal tips and tricks started coming out on how to make the most of it (Alex Krizhevsky covered a lot of them and in some ways pre-discovered batch normalization).

Always shuffle. Never allow your network to go through exactly the same minibatch twice. If your framework allows it, shuffle at every epoch.
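
As a minimal sketch of per-epoch shuffling, assuming the dataset fits in memory as a plain list: the helper name `minibatches` and its parameters are illustrative and do not come from any particular framework.

```python
import random

def minibatches(samples, batch_size, epochs, seed=0):
    """Yield (epoch, batch) pairs, reshuffling the sample order every epoch."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    for epoch in range(epochs):
        rng.shuffle(order)  # draw a new permutation at the start of each epoch
        for start in range(0, len(order), batch_size):
            yield epoch, [samples[i] for i in order[start:start + batch_size]]

data = list(range(10))
batches = list(minibatches(data, batch_size=4, epochs=2))
```

Every epoch still visits each sample exactly once; only the grouping into minibatches changes.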

Expand your dataset. DNNs need a lot of data, and the models can easily overfit a small dataset. I strongly suggest expanding your original dataset. If it is a vision task, add noise, whiten, drop pixels, rotate, color shift, blur, and everything in between. There is a catch, though: if the expansion is too big you will be training mostly with the same data. I solved this by creating a layer that applies random transformations so no sample is ever the same. If you are working with voice data, shift it and distort it.
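
The random-transformation idea can be sketched roughly as follows. This is not the layer described above, only an illustration under simple assumptions: a grayscale image stored as rows of floats in [0, 1], and a hypothetical `augment` function that applies a random flip, brightness shift and noise each time it is called.

```python
import random

def augment(image, rng, noise_std=0.02, max_shift=0.1):
    """Return a randomly transformed copy of a grayscale image (rows of floats in [0, 1])."""
    flip = rng.random() < 0.5                    # horizontal flip half of the time
    shift = rng.uniform(-max_shift, max_shift)   # one brightness shift per image
    out = []
    for row in image:
        if flip:
            row = row[::-1]
        # Add the shift plus per-pixel Gaussian noise, then clip back to [0, 1].
        out.append([min(1.0, max(0.0, p + shift + rng.gauss(0.0, noise_std)))
                    for p in row])
    return out

rng = random.Random(0)
image = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
variants = [augment(image, rng) for _ in range(3)]
```

Because the transformation is sampled at load time, the stored dataset stays the same size while the network sees a different variant on every pass.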

This tip is from Karpathy: before training on the whole dataset, try to overfit on a very small subset of it. That way you know your network can converge.
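
To illustrate the sanity check, here is a toy sketch that swaps a real network for a 1-D logistic regression trained on four points. The model, the data and the name `train_tiny` are assumptions made for the example; the point is only that the loss on a tiny subset should fall close to zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_tiny(xs, ys, steps=500, lr=0.5):
    """Fit a 1-D logistic regression with plain gradient descent; return the loss per step."""
    w, b, losses = 0.0, 0.0, []
    n = len(xs)
    for _ in range(steps):
        ps = [sigmoid(w * x + b) for x in xs]
        # Mean binary cross-entropy over the tiny subset.
        losses.append(-sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                           for p, y in zip(ps, ys)) / n)
        w -= lr * sum((p - y) * x for p, y, x in zip(ps, ys, xs)) / n
        b -= lr * sum(p - y for p, y in zip(ps, ys)) / n
    return losses

losses = train_tiny([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
print(losses[0], losses[-1])
```

If the final loss on such a small subset does not drop far below the initial one, the problem is more likely in the model or the training loop than in the amount of data.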

Avoid LRN pooling, prefer the much faster MAX pooling.

Avoid Sigmoid and TanH gates: they are expensive, they saturate, and they may stop backpropagation. In fact, the deeper your network, the less attractive Sigmoids and TanHs are. Use the much cheaper and more effective ReLUs and PReLUs instead. As mentioned in Deep Sparse Rectifier Neural Networks, they promote sparsity and their backpropagation is much more robust.
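
The saturation argument can be checked numerically. The sketch below compares the derivatives of the three activations at a few inputs, using the standard closed forms; the function names are illustrative.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# The sigmoid and tanh gradients shrink toward zero as the input grows,
# while the ReLU gradient stays at 1 for any positive input.
for x in (0.5, 5.0, 10.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

In a deep stack these small factors multiply together during backpropagation, which is why saturated units can stall learning in the earlier layers.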

Don’t use ReLU or PReLU gates before max pooling; apply them after it to save computation.
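
The reordering is safe because ReLU is monotonic, so it commutes with taking a maximum. A small sketch on a flat list, with a hypothetical non-overlapping `max_pool_1d` helper, shows both orders giving the same result:

```python
def relu(x):
    return max(0.0, x)

def max_pool_1d(values, size):
    """Non-overlapping max pooling over a flat list."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

activations = [-1.5, 0.3, 2.0, -0.7, -3.0, -0.2, 4.1, 0.0]

# ReLU first: the activation is evaluated on all 8 values.
relu_then_pool = max_pool_1d([relu(v) for v in activations], size=2)

# Pooling first: the activation is evaluated on only 4 values.
pool_then_relu = [relu(v) for v in max_pool_1d(activations, size=2)]

print(relu_then_pool == pool_then_relu)
```

The same holds for a PReLU whose slope is positive, since that function is also monotonically increasing.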

Don’t use ReLUs, they are so 2012. Yes, they are a very useful non-linearity that solved a lot of problems. However, try fine-tuning a new model and watch nothing happen because of bad initialization, with ReLUs blocking backpropagation. Instead use PReLUs with a very small multiplier, usually 0.1. PReLUs converge faster and will not get stuck like ReLUs during the initial stages. ELUs are still good but expensive.
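
A minimal sketch of the PReLU forward pass and its input gradient follows. It assumes the 0.1 multiplier is the starting value of the negative-side slope; in a real PReLU that slope is a learned parameter, and with a fixed slope the same function is usually called Leaky ReLU.

```python
def prelu(x, slope=0.1):
    """PReLU forward pass; `slope` is the (normally learned) negative-side multiplier."""
    return x if x > 0 else slope * x

def prelu_grad(x, slope=0.1):
    """Gradient with respect to the input: never exactly zero, unlike ReLU."""
    return 1.0 if x > 0 else slope

for x in (-2.0, 3.0):
    print(x, prelu(x), prelu_grad(x))
```

Because the gradient on the negative side is small but nonzero, a unit that starts with negative pre-activations can still receive updates.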

Use Batch Normalization (see the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift) ALWAYS. It works and it is great. It allows faster convergence (much faster) and smaller datasets. You will save time and resources.
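
For readers who have not seen it written out, here is a rough sketch of the training-time forward pass of batch normalization on a batch of feature vectors. The names `batch_norm`, `gamma` and `beta` are illustrative, and the sketch leaves out the running statistics that are used at inference time.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    n = len(batch)
    columns = list(zip(*batch))                   # one tuple per feature
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return [
        [g * (v - m) / math.sqrt(var + eps) + b
         for v, m, var, g, b in zip(row, means, variances, gamma, beta)]
        for row in batch
    ]

# Two features on very different scales end up with comparable ranges.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to 1 and `beta` set to 0, every feature comes out with roughly zero mean and unit variance over the batch.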

I don’t like removing the mean as many do; I prefer squeezing the input data to [-1, +1]. This is more of a training and deployment trick than a performance trick.
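
Squeezing to [-1, +1] is a fixed linear map. The sketch below assumes 8-bit pixel values in [0, 255]; the function name and the default bounds are illustrative.

```python
def to_unit_range(pixels, lo=0.0, hi=255.0):
    """Linearly map values from [lo, hi] to [-1, +1]."""
    return [2.0 * (p - lo) / (hi - lo) - 1.0 for p in pixels]

scaled = to_unit_range([0, 127.5, 255])
print(scaled)
```

One likely reason this helps with deployment is that the map depends only on the known input range, so no dataset mean has to be shipped alongside the model.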

Always go for the smaller models. If you are working on and deploying deep learning models like me, you quickly understand the pain of pushing gigabytes of models to your users or to a server on the other side of the world. Go for the smaller models even if you lose some accuracy.

If you use the smaller models, try ensembles. You can usually boost your accuracy by ~3% with an ensemble of 5 networks.
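
One common way to combine an ensemble is to average the class probabilities of its members. The text above does not say which scheme was used, so treat this as an assumption; the probabilities below are made up for illustration.

```python
def ensemble_predict(per_model_probs):
    """Average class probabilities across models; return (class index, averaged probs)."""
    n_models = len(per_model_probs)
    avg = [sum(col) / n_models for col in zip(*per_model_probs)]
    return max(range(len(avg)), key=avg.__getitem__), avg

# Softmax outputs of 5 hypothetical small networks for one input, 3 classes.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.4, 0.1],
    [0.3, 0.5, 0.2],
]
label, avg = ensemble_predict(probs)
print(label, avg)
```

Two of the five networks prefer class 0 here, but the averaged vote settles on class 1.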

If your input data has a spatial parameter, try to go for CNNs end to end. Read and understand SqueezeNet; it is a new approach and works wonders. Try applying the tips above.

Modify your models to use 1×1 CNN layers where possible; the locality is great for performance.
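
A 1×1 convolution only mixes channels at each pixel and never looks at neighbouring pixels. The sketch below assumes a feature map laid out as H x W x C_in nested lists and a weight matrix of shape C_out x C_in; the function name and the numbers are illustrative.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution: feature_map is H x W x C_in, weights is C_out x C_in."""
    return [
        [[sum(w * c for w, c in zip(kernel, pixel)) for kernel in weights]
         for pixel in row]
        for row in feature_map
    ]

feature_map = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]]   # 1 x 2 x 3
weights = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]         # 2 output channels, 3 input channels
out = conv1x1(feature_map, weights)
print(out)
```

Each output pixel depends only on the same input pixel, so the whole layer reduces to one small matrix multiply per position.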

Don’t even try to train anything without a high end GPU.

If you are making templates out of models or your own layers, parameterize everything; otherwise you will be rebuilding your binaries all the time. You know you will.

And last but not least, understand what you are doing. Deep Learning is the Neutron Bomb of Machine Learning. It is not to be used everywhere and always. Understand the architecture you are using and what you are trying to achieve; don’t mindlessly copy models.

