Over the past decade, research in deep learning has been directed towards building larger, more complex, and faster networks. In this context, AlexNet, the network trained on the ImageNet dataset, is considered one of the first deep convolutional neural networks, and it impressed the community with its excellent performance in image recognition tasks. In its initial configuration, this network had 60 million parameters and 650,000 neurons, organized in five convolutional layers followed by fully connected layers. To achieve the feat of training this architecture in 2012, the authors had to resort to non-saturating neurons and also optimize the convolution operator for GPUs. Since then, convolutional neural networks (CNNs) have become deeper and deeper, going from fewer than ten layers to more than a hundred, and even over a thousand in some experiments, in just a few years.
The use of these huge CNN architectures posed new challenges for training. More specifically, it was becoming increasingly difficult to guarantee convergence. This is because the distribution of the inputs to each layer changes during training as the parameters of the previous layers change, a phenomenon known as internal covariate shift. This drastically slows down training, as it requires significantly lower learning rates and very careful initialization, and it makes models with saturating nonlinearities notoriously difficult to train. This is the reason why the authors of AlexNet used non-saturating neurons to train their model: it was less a design decision than a way of ensuring that the model converged. Another way to mitigate this problem is to use ‘batch normalization’, which normalizes the inputs of each layer over every mini-batch and allows the use of much higher learning rates and less care with parameter initialization.
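To make the idea of batch normalization more concrete, the following is a minimal sketch in plain Python, not the implementation of any particular library. The function name `batch_norm` and the scalar `gamma` and `beta` parameters are assumptions made for illustration; in practice the scale and shift are learned per feature, and separate running statistics are used at inference time.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of `batch` over the batch dimension.

    batch: list of samples, each a list of feature values.
    gamma, beta: scale and shift (hypothetical scalars here for simplicity).
    eps: small constant that avoids division by zero.
    """
    n = len(batch)
    num_features = len(batch[0])
    normalized = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [sample[j] for sample in batch]
        mean = sum(column) / n
        # Biased variance computed over the mini-batch
        var = sum((v - mean) ** 2 for v in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            normalized[i][j] = gamma * (column[i] - mean) * inv_std + beta
    return normalized

# Two features on very different scales
activations = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
normalized = batch_norm(activations)
```

After normalization, both features have approximately zero mean and unit variance over the batch, regardless of their original scale, which is what keeps the input distribution of the next layer stable while earlier layers are being updated.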
Of course, the search for deeper architectures is not just a frivolous endeavor of data scientists. The motivation for this research is to answer a fundamental question of deep learning: ‘is getting better networks as easy as adding more layers?’ The authors of ‘Deep Residual Networks (ResNet)’ aimed to answer this question and, in the process, created one of the most influential architectures in the field. In their initial experiments, they discovered that, for typical network architectures without residual connections, the answer to the question was a clear no. In fact, they found that a network with 56 layers had a higher error, on both training and test data, than another with only 20. If depth (i.e., the number of layers) were the only thing that mattered for accuracy, the deeper network should have performed better than the shallower one, and that definitely was not the case. Although it may seem counterintuitive, this discrepancy was explained not by a lack of ‘representation’ capacity but by the difficulty of optimizing such a deep network. ResNet addressed this problem by adding residual connections to the network architecture, so that each block learns a correction to its input rather than a full transformation. This made the entire network significantly easier to optimize, and the results followed suit (i.e., deeper architectures gave better results).
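A residual connection can be sketched in a few lines of plain Python. This is an illustrative simplification under stated assumptions: the helper names (`linear`, `relu`, `residual_block`) are hypothetical, dense layers stand in for the convolutions of a real ResNet, and normalization and the activation after the addition are omitted.

```python
def linear(x, weights, bias):
    """Apply a dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Non-saturating activation: max(0, v) element-wise."""
    return [max(0.0, v) for v in x]

def residual_block(x, w1, b1, w2, b2):
    """Compute x + F(x), where F is two dense layers with a ReLU in between."""
    hidden = relu(linear(x, w1, b1))
    residual = linear(hidden, w2, b2)
    # The skip connection adds the unchanged input back to the output of F
    return [xi + ri for xi, ri in zip(x, residual)]

x = [0.5, -1.0, 2.0]
zeros_w = [[0.0] * 3 for _ in range(3)]
zeros_b = [0.0] * 3

# With all weights at zero, F(x) = 0 and the block reduces to the identity
identity_out = residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b)
```

The point of the example is that a residual block with zero weights passes its input through unchanged. A deeper network built from such blocks can therefore fall back on the behavior of a shallower one, which is one intuition for why residual networks are easier to optimize.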
Another interesting fact about training these neural networks is that only a small part of the network (both weights and neurons) seems to contribute significantly to the prediction. In fact, some studies report that between 95 and 99% of a neural network can be pruned, provided the parts to remove are chosen with suitable criteria, with almost no loss in the accuracy of the original network. The ‘lottery ticket hypothesis’ offers an explanation for this observation.
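One common pruning criterion is weight magnitude: the weights with the smallest absolute value are removed first. The sketch below illustrates this on a toy list of weights; the function name `magnitude_prune`, the `sparsity` parameter, and the weight values are all made up for illustration, and the studies mentioned above may use different criteria.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Returns the pruned weights and a 0/1 mask marking the surviving weights.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_to_prune = int(round(len(weights) * sparsity))
    # Indices sorted from smallest to largest absolute value
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    mask = [0 if i in pruned_indices else 1 for i in range(len(weights))]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask

# Hypothetical weights: a few large values and many near zero
weights = [0.01, -0.8, 0.03, 0.5, -0.02, 0.9, 0.04, -0.05, 0.002, 0.07]
pruned, mask = magnitude_prune(weights, sparsity=0.7)
```

Here 70% of the weights are removed, and only the three with the largest magnitude (-0.8, 0.5 and 0.9) survive. In a real network the accuracy would have to be measured again after pruning to confirm that it was preserved.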
The weights of a neural network are initialized randomly. At this point, the network contains many random subnetworks, and some of them seem to have more ‘potential’ for prediction than others. During training, the optimizer ends up updating this set of weights at the expense of the others, because doing so reduces the loss more effectively. At the end of this procedure, the optimizer has developed one subnetwork that does most of the work, while the other parts of the network are mostly useless. Each subnetwork is a ‘lottery ticket’ with a random initialization, and the favorable initializations are the ‘winning tickets’ identified by the optimizer. Therefore, the more random subnetworks there are, the more likely we are to find a winning ticket. This is one proposed explanation for why larger networks generally perform better, in line with the trend towards larger networks discussed earlier.
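The lottery analogy can be illustrated with a toy calculation. Assume, purely for the sake of the example, that each subnetwork is an independent ticket that wins with the same small probability; real subnetworks overlap and are not independent, so this is only an intuition aid, and the function name and the numbers are hypothetical.

```python
def chance_of_winning_ticket(num_tickets, win_probability):
    """Probability that at least one of `num_tickets` independent tickets wins."""
    return 1.0 - (1.0 - win_probability) ** num_tickets

# Same (made-up) win probability per ticket, different numbers of tickets
small = chance_of_winning_ticket(100, 0.001)
large = chance_of_winning_ticket(10_000, 0.001)
```

Under these assumptions, the chance of holding at least one winning ticket is just under 10% with 100 tickets and very close to 100% with 10,000, which mirrors the argument that a larger network is more likely to contain a favorable subnetwork.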
Although there are still many open questions in the area of deep learning, the last decade of research has shed a little more light on how these neural networks function and behave, while also opening up new questions and hypotheses that should allow the state of the art in machine learning to advance.