Showing posts with label connectionist machine learning. Show all posts

Tuesday, 22 April 2025

Numerical stability showcase: Ranking with SoftMax or Boltzmann factor

Preamble 

Image: Babylonian table for computation (Wikipedia)
Numerical stability is probably one of the most important aspects of computational work in quantitative fields such as physics and data science. It means that, for given inputs, the outputs should not deviate wildly to very large numbers and the computation must not distort the results, such as a ranking based on scores, one of the most common computations in data science tasks such as classification or clustering. In this short post, we provide a striking example of SoftMax producing wrong results when it is applied naively.

SoftMax:  Normalisation with Boltzmann factor

SoftMax is actually more of a physics concept than a data science one. Its most common use in data science is ranking. Given a vector with components $x_{i}$, softmax is computed with the expression $$\exp(x_{i})/\sum_{j} \exp(x_{j}).$$ This originates from statistical physics, i.e., the Boltzmann factor.

Source of Numerical Instability

Summing exponentials in the denominator creates a numerical instability if one of the numbers in the vector deviates significantly from the others. In finite precision, this makes all other entries of the softmax output zero.

Example Instability: Ranking with SoftMax

Let's say we have the following scores 

scores = [1.4, 1.5, 1.6, 170]

for teams A, B, C and D on some metric. We want to turn these into a probabilistic interpretation with SoftMax, which gives [0., 0., 0., 1.]. We see that D comes out on top, but A, B and C are tied.
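Whether the small entries collapse to exact zeros depends on the floating-point precision. The following is a minimal sketch using only the Python standard library. It assumes that the scores are handled in single precision (float32, PyTorch's default floating-point type) and that the maximum is subtracted before exponentiation, as numerically careful implementations commonly do. The helper names `to_float32` and `softmax_float32` are hypothetical and serve only as an illustration.

```python
import math
import struct


def to_float32(value: float) -> float:
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def softmax_float32(scores: list) -> list:
    """Softmax with intermediates rounded to float32 (hypothetical helper).

    The maximum is subtracted first so that exp() cannot overflow.
    """
    top = max(scores)
    exps = [to_float32(math.exp(to_float32(s - top))) for s in scores]
    total = to_float32(sum(exps))
    return [to_float32(e / total) for e in exps]


scores = [1.4, 1.5, 1.6, 170]
probs = softmax_float32(scores)
print(probs)  # A, B and C underflow to exactly 0.0 in single precision

# In double precision the same exponential is tiny but still nonzero.
print(math.exp(1.4 - 170))
```

The tie between A, B and C is therefore an artefact of rounding: the information about their order is lost once the exponentials underflow.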

LogSoftMax

We can rectify this instability by using LogSoftMax, which reads $$\log \exp(x_{i})-\log\Big(\sum_{j} \exp(x_{j})\Big).$$ For the scores above it gives [-168.6000, -168.5000, -168.4000, 0.0000], so that we can induce a consecutive ranking without ties, as follows: D, C, B, A.
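A standard-library sketch of LogSoftMax is given here, assuming the usual log-sum-exp shift by the maximum score; the function name `log_softmax` and the team labels are illustrative choices, not part of any library used in this post.

```python
import math


def log_softmax(scores: list) -> list:
    """Log-softmax via the log-sum-exp shift: x_i - m - log(sum_j exp(x_j - m))."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


teams = ["A", "B", "C", "D"]
scores = [1.4, 1.5, 1.6, 170]
log_probs = log_softmax(scores)

# Rank teams from the highest to the lowest log-probability.
ranking = [team for _, team in sorted(zip(log_probs, teams), reverse=True)]
print([round(v, 4) for v in log_probs])
print(ranking)
```

Because the logarithm is monotonic, sorting by log-probability gives the same order as sorting by the raw scores, with D first and A last, while keeping all four values distinct.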

Conclusion

There is a similar practice in statistics for likelihood computations, as Gaussians bring in exponentials repeatedly. Taking the log of the given operations removes the numerical instabilities caused by repeated exponentiation. This shows the importance of being aware of numerical pitfalls in data science.
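The likelihood remark can be illustrated with a small sketch using only the standard library. The data set of 2000 identical observations and the helper names `gaussian_pdf` and `gaussian_log_pdf` are assumptions made for illustration.

```python
import math


def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_log_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Logarithm of the normal density, computed without calling exp()."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


# Illustrative data (an assumption): 2000 identical observations at x = 1.0.
data = [1.0] * 2000

# The product of many densities below one underflows to zero.
likelihood = math.prod(gaussian_pdf(x) for x in data)
# The sum of log-densities stays finite.
log_likelihood = sum(gaussian_log_pdf(x) for x in data)
print(likelihood, log_likelihood)
```

The product of densities is useless for comparing parameter values once it underflows, whereas the log-likelihood remains a finite number that can still be maximised.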

Cite as follows

 @misc{suezen25softmax, 
     title = {Numerical stability showcase: Ranking with SoftMax or Boltzmann factor}, 
     howpublished = {\url{https://memosisland.blogspot.com/2025/04/softmax-numerical-stability.html}}, 
     author = {Mehmet Süzen},
     year = {2025}
}  

Appendix: Python method 

A Python function using PyTorch that computes the softmax example from the main text.

import torch

List = list
Tensor = torch.Tensor


def get_softmax(scores: List, log: bool = False) -> Tensor:
    """
    Compute softmax of a list of scores.

    Defaults to plain SoftMax; set log=True for LogSoftMax.
    """
    # Convert the input list to a tensor (float32 by default)
    scores = torch.tensor(scores)
    if log:
        scores = torch.log_softmax(scores, dim=0)
    else:
        scores = torch.softmax(scores, dim=0)
    return scores




Saturday, 28 March 2020

Book review: A tutorial introduction to the mathematics of deep learning

Preamble
Artificial Intelligence Engines: An introduction to the Mathematics of Deep Learning by Dr James V. Stone, the book and Github repository.
(c) 2019 Sebtel Press
Deep learning and associated connectionist approaches are now applied routinely in industry and academic research, from image analysis to natural language processing and areas as cool as reinforcement learning. As practitioners, we use these techniques as black-box algorithms shipped in well-designed, tested and reliable libraries like Tensorflow or Pytorch. However, most practitioners lack foundational mathematical knowledge and an understanding of the core algorithms. Unfortunately, many academic books and papers subliminally try to give an impression of superiority and avoid a simple pedagogical approach. In this post, we review a unique book that tries to fill this gap with a pedagogical approach to the mathematics of deep learning, avoiding a display of mathematical complexity and aiming instead at conveying an understanding of how things work from the ground up. Moreover, the book provides pseudo-code that can be used to implement things from scratch, along with a supporting implementation in a Github repo. The author, Dr James V. Stone, a trained cognitive scientist and researcher in mathematical neuroscience, has taken this approach in his other books for many years now, writing for students, not for his peers to show off. One important note is that this is not a cookbook or a practical tutorial but an upper-intermediate level academic book.

Building associations and classifying with a network

The logical and conceptual separation of association and classification tasks is introduced in the initial chapters. It is ideal to start by learning one association with one connection, via a gentle introduction to gradient descent for learning the weights, before going on to two associations and two connections, and then to many. This reminds me of George Gamow's term 1, 2 and infinity as a pedagogical principle. The perceptron is introduced later, showing how classification rules can be generated via a network and the problems it encounters with the XOR problem.

Backpropagation, Hopfield networks and Boltzmann machines

A detailed implementation of backpropagation is provided from scratch, with great clarity and without cluttered index notation. This is probably the best explanation I have ever encountered. The following chapters introduce Hopfield networks and Boltzmann machines from the ground up to an applied level. Unfortunately, many modern deep learning books skip these two great models, but Dr Stone's chapters make them implementable for a practitioner. It is very impressive. I am a bit biased towards Hopfield networks, as I see them as an extension of Ising models and their stochastic counterparts, but I have not seen anywhere else such explanations of how to use Hopfield networks in learning, with a pseudo-code algorithm to use in a real task.

Advanced topics

Personally, I see the remaining chapters as advanced topics: deep Boltzmann machines, variational autoencoders, GANs and an introduction to reinforcement learning, with the probable exception of deep backpropagation in Chapter 9. I would say that what is now known as deep learning began with the architectures mentioned in Sections 9.1 to 9.7.

Glossary, basic linear algebra and statistics

The appendices provide a fantastic conceptual introduction to the jargon and to the basics of the main mathematical techniques. Of course, this isn't a replacement for a fully-fledged linear algebra or statistics book, but it provides immediate, concise explanations.

Not a cookbook: Not an "import tensorflow as tf" book

One other crucial advantage of this book is that it is definitely not a cookbook, unlike almost all books related to deep learning. However, it is supplemented by a full implementation in a repository supporting each chapter, URL here.

Conclusion

This little book achieves so much with a down-to-earth approach, introducing basic concepts with a respectful attitude and assuming the reader is very smart but inexperienced in the field. If you are a beginner or even an experienced research scientist, this is a must-have book. I still see this book as an academic book that can be used as the main text in an upper-undergraduate elective such as "Mathematics of Deep Learning".

Enjoy reading and learning from this book. Thank you, Dr Stone, for your efforts on making academic books more accessible.

Disclosure: I received a review copy of the book but I have bought another copy for a family member. 

Sunday, 15 December 2019

Bringing back Occam's razor to modern connectionist machine learning:

A simple complexity measure based on statistical physics
Cascading Periodic Spectral Ergodicity (cPSE)

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro with the title Applying Occam's razor to Deep Learning 

Kindly reviewed by Cornelius Weber

Preamble: Changing concepts in machine learning due to deep learning

Occam's razor, or the principle of parsimony, has been the guiding principle in statistical model selection. In comparing two models that provide similar predictions or descriptions of reality, we would vouch for the one that is less complex. This boils down to the problem of how to measure the complexity of a statistical model. What constitutes a model, as discussed by McCullagh (2002) in the context of statistical models, is a different discussion, but here we assume that a machine learning algorithm is a statistical model. Classically, the complexity of statistical models is usually measured with the Akaike information criterion (AIC) or similar. Using a complexity measure, one would choose the less complex model to use in practice, other things being equal.
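As a reminder of how such a classical criterion works, here is a minimal sketch of the AIC, defined as $2k - 2\ln \hat{L}$ for a model with $k$ parameters and maximised likelihood $\hat{L}$. The two models and their numbers are hypothetical and chosen only to show the comparison.

```python
def aic(num_params: int, log_likelihood: float) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln(L); lower is preferred."""
    return 2.0 * num_params - 2.0 * log_likelihood


# Hypothetical models with nearly identical fit but different parameter counts.
simple_model = aic(num_params=3, log_likelihood=-120.4)
complex_model = aic(num_params=12, log_likelihood=-119.9)
preferred = "simple" if simple_model < complex_model else "complex"
print(simple_model, complex_model, preferred)
```

The slightly better fit of the larger model does not compensate for its extra parameters, so the criterion favours the simpler one.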
Figure 1: Arbitrary architecture; each node represents a layer of a given deep neural network, such as convolutions or a set of units. Süzen-Cerdà-Weber (2019)


The surge of interest in complex neural network architectures, i.e., deep learning, due to their unprecedented success in certain tasks, pushes the boundaries of "standard" statistical concepts such as overfitting/overtraining and regularisation.

Overfitting/overtraining is now often used as an umbrella term to describe any unwanted performance drop-off of a machine learning model, Roelofs et al. (2019), and nearly anything that improves generalization is called regularization, Martin and Mahoney (2019).

A complexity measure for connectionist machine learning

Deep learning practitioners rely on choosing the best performing model and do not practise Occam's razor. The advent of Neural Architecture Search and new complexity measures that take the structure of the network into account gives rise to the possibility of practising Occam's razor in deep learning. Here, we cover one very practical and simple measure called cPSE, i.e., cascading periodic spectral ergodicity. This measure takes into account the depth of the neural network and computes fluctuations of the weight structure over the entire network, Süzen-Cerdà-Weber (2019), Figure 1. It is shown that the measure correlates almost perfectly with the generalisation performance, see Figure 2.


Practical usage of cPSE

Figure 2: Evolution of PSE, periodic spectral ergodicity. It is shown that the complexity measure cPSE saturates after a certain depth, Süzen-Cerdà-Weber (2019)
The cPSE measure is implemented in the Bristol Python package, starting from version 0.2.6. If a trained network is wrapped in a PyTorch model object, cPSE can be used to compare two different architectures. If two architectures give similar test performance, we would select the one with the higher cPSE value: the lower the cPSE, the more complex the model.

An example of usage requires only a couple of lines; example measurements for VGG and ResNet are given in Süzen-Cerdà-Weber (2019):

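from bristol import cPSE
import torchvision.models as models

# Load a pretrained VGG11 from torchvision by name
netname = 'vgg11'
pmodel = getattr(models, netname)(pretrained=True)
# Compute the complexity measure over the layers of the model
(d_layers, cpse) = cPSE.cpse_measure(pmodel)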
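The selection rule described above (among similarly performing architectures, prefer the higher cPSE) can be sketched as a small helper. This is not part of the Bristol package: the function `select_architecture`, the tolerance that defines "similar" performance, and all numbers are assumptions for illustration, and each cPSE value is assumed to be a single number per model.

```python
def select_architecture(candidates, tolerance=0.01):
    """Pick, among similarly performing models, the one with the highest cPSE.

    candidates: list of (name, test_accuracy, cpse_value) tuples.
    tolerance: largest accuracy gap to the best model still counted as similar.
    """
    best_accuracy = max(acc for _, acc, _ in candidates)
    similar = [c for c in candidates if best_accuracy - c[1] <= tolerance]
    # Higher cPSE means a less complex model in this post's convention.
    return max(similar, key=lambda c: c[2])[0]


# Made-up numbers for illustration only; not measurements from the paper.
candidates = [
    ("net_a", 0.912, 0.35),
    ("net_b", 0.915, 0.20),
    ("net_c", 0.850, 0.60),
]
chosen = select_architecture(candidates)
print(chosen)
```

Here `net_c` is excluded because its accuracy is not comparable, and between the two remaining models the one with the higher cPSE is chosen even though its accuracy is marginally lower.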




Conclusion and take-home message

Using a less complex deep neural network that would give similar performance is not practised by the deep learning community, due to the complexity of training and designing new architectures. However, quantifying the complexity of similarly performing neural network architectures would bring the advantage of using less computing power to train such less complex models and deploy them into production. Bringing Occam's razor back to modern connectionist machine learning is not only theoretically and philosophically satisfying; the practical advantages for the environment and computing time are immense.

Postscript: Appendix: Vanilla computation of cPSE

The recipe is quite easy to implement in practice: get all weight matrices of your trained architecture as a list of 2D matrices, with special layers and biases subject to mapping, then:


from bristol import cPSE

spectrum_set = cPSE.get_eigenvals_layer_matrix_set(layer_matrices)

periodic_spectrum_set = cPSE.eigenvals_set_to_periodic(spectrum_set)

spectral_ergodicity_over_layers = cPSE.d_layers_pse(periodic_spectrum_set)


One should stop adding more layers where spectral_ergodicity_over_layers saturates to a constant value, i.e., where the decreasing curve flattens.
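One possible way to turn this stopping rule into code is sketched below. The helper `saturation_depth`, its `tolerance` and `patience` parameters, and the example curve are all hypothetical; the Bristol package is not assumed to provide such a function.

```python
def saturation_depth(values, tolerance=1e-3, patience=3):
    """Return the first index after which `patience` consecutive changes
    are all smaller than `tolerance`, or None if the curve never flattens."""
    run = 0
    for i in range(1, len(values)):
        if abs(values[i] - values[i - 1]) < tolerance:
            run += 1
            if run == patience:
                return i - patience
        else:
            run = 0
    return None


# Made-up decreasing curve standing in for spectral_ergodicity_over_layers.
curve = [0.90, 0.55, 0.32, 0.21, 0.2004, 0.2001, 0.2000, 0.2000]
depth = saturation_depth(curve)
print(depth)
```

Requiring several consecutive small changes, rather than a single one, guards against stopping at a brief plateau; suitable values for both thresholds depend on the scale of the measure.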




(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License