Sunday, 15 December 2019

Bringing back Occam's razor to modern connectionist machine learning:

A simple complexity measure based on statistical physics
Cascading Periodic Spectral Ergodicity (cPSE)

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro with the title Applying Occam's razor to Deep Learning 

Kindly reviewed by Cornelius Weber

Preamble: Changing concepts in machine learning due to deep learning

Occam's razor, or the principle of parsimony, has been the guiding principle in statistical model selection. When comparing two models that provide similar predictions or descriptions of reality, we would vouch for the one that is less complex. This boils down to the problem of how to measure the complexity of a statistical model. What constitutes a model, as discussed by McCullagh (2002) in the context of statistical models, is a different discussion; here we assume that machine learning algorithms are statistical models. Classically, the complexity of a statistical model is usually measured with the Akaike information criterion (AIC) or similar. Using a complexity measure, one would choose the less complex model to use in practice, other things being equal.
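
As a reminder of how such a classical criterion works, AIC is defined as 2k - 2 ln(L), where k is the number of fitted parameters and L is the maximised likelihood; the model with the lower value is preferred. The sketch below applies it to two hypothetical models. The names and log-likelihood values are made up purely for illustration and do not come from any experiment.

```python
def aic(n_params: int, log_likelihood: float) -> float:
    """Akaike information criterion: 2k - 2 ln(L); lower is preferred."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates (illustrative numbers only).
candidates = {
    "small_model": {"n_params": 3, "log_likelihood": -120.0},
    "large_model": {"n_params": 12, "log_likelihood": -118.5},
}

scores = {name: aic(**spec) for name, spec in candidates.items()}
best = min(scores, key=scores.get)  # the model with the lowest AIC
print(scores, best)
```

Here the larger model fits slightly better, but not by enough to justify its nine extra parameters, so the smaller model is selected.
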
Figure 1: Arbitrary architecture; each node represents
a layer of a given deep neural network, such as
convolutions or a set of units. Süzen-Cerdà-Weber (2019)


The surge of interest in complex neural network architectures, i.e., deep learning, due to their unprecedented success in certain tasks, pushes the boundaries of "standard" statistical concepts such as overfitting/overtraining and regularisation.

Overfitting/overtraining is now often used as an umbrella term to describe any unwanted performance drop of a machine learning model, Roelofs et al. (2019), and nearly anything that improves generalization is called regularization, Martin and Mahoney (2019).

A complexity measure for connectionist machine learning

Deep learning practitioners rely on choosing the best-performing model and do not practise Occam's razor. The advent of Neural Architecture Search and of new complexity measures that take the structure of the network into account gives rise to the possibility of practising Occam's razor in deep learning. Here, we cover one very practical and simple measure called cPSE, i.e., cascading periodic spectral ergodicity. This measure takes into account the depth of the neural network and computes fluctuations of the weight structure over the entire network, Süzen-Cerdà-Weber (2019), Figure 1. The measure is shown to correlate almost perfectly with generalisation performance, see Figure 2.


Practical usage of cPSE

Figure 2: Evolution of PSE, periodic spectral ergodicity;
it is shown that the complexity measure cPSE saturates
after a certain depth, Süzen-Cerdà-Weber (2019)

The cPSE measure is implemented in the Bristol Python package, starting from version 0.2.6. If a trained network is wrapped in a PyTorch model object, cPSE can be used to compare two different architectures. If two architectures give similar test performance, we would select the one with the higher cPSE value: the lower the cPSE, the more complex the model.

An example of usage requires only a couple of lines; example measurements for VGG and ResNet are given in Süzen-Cerdà-Weber (2019):

from bristol import cPSE
import torchvision.models as models

# Load a pretrained torchvision network by name
netname = 'vgg11'
pmodel = getattr(models, netname)(pretrained=True)
# Compute the cPSE measure over the whole network
(d_layers, cpse) = cPSE.cpse_measure(pmodel)
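
The selection rule described above (among similarly performing architectures, prefer the one with the higher cPSE) can be sketched as a small helper. This is only an illustration: `select_by_cpse`, the `tolerance` threshold for "similar" performance, and all numbers below are hypothetical and are not part of the Bristol package or taken from the paper.

```python
def select_by_cpse(candidates, tolerance=0.01):
    """Among models whose test accuracy is within `tolerance` of the best,
    return the name of the one with the highest cPSE value."""
    best_acc = max(c["test_accuracy"] for c in candidates.values())
    similar = {
        name: c
        for name, c in candidates.items()
        if best_acc - c["test_accuracy"] <= tolerance
    }
    return max(similar, key=lambda name: similar[name]["cpse"])


# Hypothetical measurements (made-up values for illustration only).
measurements = {
    "arch_a": {"test_accuracy": 0.912, "cpse": 31.0},
    "arch_b": {"test_accuracy": 0.915, "cpse": 24.5},
    "arch_c": {"test_accuracy": 0.850, "cpse": 40.0},
}

chosen = select_by_cpse(measurements)
print(chosen)
```

With these numbers, `arch_c` is excluded because its accuracy is not comparable, and `arch_a` is preferred over `arch_b` because it has the higher cPSE at a similar accuracy.
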




Conclusion and take-home message

Using a less complex deep neural network that gives similar performance is not practised by the deep learning community, owing to the complexity of training and designing new architectures. However, quantifying the complexity of similarly performing neural network architectures would bring the advantage of using less computing power to train such less complex models and deploy them into production. Bringing Occam's razor back to modern connectionist machine learning is not only theoretically and philosophically satisfying; the practical advantages for the environment and for computing time are immense.

Postscript: Appendix: Vanilla computation of cPSE

The recipe is quite easy to implement in practice: get all weight matrices of your trained architecture as a list of 2D matrices (special layers and biases are subject to mapping), then:


from bristol import cPSE

# layer_matrices: list of 2D weight matrices of the trained network
spectrum_set = cPSE.get_eigenvals_layer_matrix_set(layer_matrices)
periodic_spectrum_set = cPSE.eigenvals_set_to_periodic(spectrum_set)
spectral_ergodicity_over_layers = cPSE.d_layers_pse(periodic_spectrum_set)
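
The recipe leaves open how weights that are not already 2D get mapped into `layer_matrices`. One possible mapping, shown here as an assumption rather than as what Bristol does internally, is to keep the first axis, flatten the remaining axes, and skip 1D arrays such as biases. The helper `to_layer_matrices` and the random example weights are hypothetical.

```python
import numpy as np


def to_layer_matrices(weights):
    """Map weight arrays to 2D: keep the first axis, flatten the rest.
    Arrays with fewer than two dimensions (e.g. biases) are skipped."""
    matrices = []
    for w in weights:
        w = np.asarray(w)
        if w.ndim < 2:
            continue  # assumption: biases are left out in this sketch
        matrices.append(w.reshape(w.shape[0], -1))
    return matrices


# Random stand-ins for a conv kernel, a bias vector and a dense layer.
rng = np.random.default_rng(0)
weights = [
    rng.normal(size=(8, 3, 3, 3)),
    rng.normal(size=8),
    rng.normal(size=(10, 8)),
]
layer_matrices = to_layer_matrices(weights)
print([m.shape for m in layer_matrices])
```

For a PyTorch model, the input arrays could plausibly be collected with something like `[p.detach().cpu().numpy() for p in model.parameters()]` before applying the mapping.
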


One should stop adding more layers once spectral_ergodicity_over_layers, a decreasing curve, saturates to a constant value.
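
A minimal sketch of how the saturation point might be located automatically, assuming the curve is available as a plain list of numbers. The function `saturation_depth`, its `tol` and `patience` parameters, and the example curve are all hypothetical; suitable thresholds would depend on the network at hand.

```python
def saturation_depth(values, tol=1e-3, patience=3):
    """Return the index where the curve starts to plateau, i.e. the point
    followed by `patience` consecutive changes smaller than `tol`.
    Returns None if no such plateau is found."""
    run = 0
    for i in range(1, len(values)):
        if abs(values[i] - values[i - 1]) < tol:
            run += 1
            if run == patience:
                return i - patience
        else:
            run = 0  # the plateau was interrupted, start counting again
    return None


# Made-up decreasing curve that flattens out from index 3 onwards.
curve = [5.0, 3.0, 2.0, 1.5, 1.4995, 1.4993, 1.4992, 1.4992]
depth = saturation_depth(curve)
print(depth)
```

Requiring several consecutive small changes, rather than a single one, makes the check less sensitive to one-off flat steps in an otherwise still-decreasing curve.
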