Tuesday 29 December 2020

Practice causal inference: Conventional supervised learning can't do inference

Domino OR-gate (Wikipedia)

A trained model may produce predictions for input values it has never seen before, but this is not inference, at least not for 'classical' supervised learning. In reality it provides an interpolation from the training set, i.e., via function approximation:

Interpolation here doesn't mean that all predictions lie within the convex hull of the training set; rather, interpolation in the sense of a numerical procedure that uses the training data only.

What does inference mean?

By inference, we imply going beyond the training data; distributional shift, compositional learning, or similar types of learning should come to mind. This is especially apparent in the following example: a human infant may learn how 3 is similar to 8, but without labels, supervised learning in the naïve setting can't establish this; i.e., a model trained on MNIST without the digit 8 cannot learn what 8 is in plain form.
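A minimal numpy sketch of this limitation, with synthetic data standing in for MNIST and a nearest-centroid classifier as an illustrative stand-in for the supervised learner: a model fitted only on classes 0-7 can never emit the unseen label 8, whatever the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: 10 classes, 16-dimensional "images".
centres = rng.normal(size=(10, 16))
X_train = np.vstack([c + 0.1 * rng.normal(size=(50, 16)) for c in centres[:8]])
y_train = np.repeat(np.arange(8), 50)  # classes 0..7 only, no 8

# Nearest-centroid classifier: predict the label of the closest class mean.
class_means = np.array([X_train[y_train == k].mean(axis=0) for k in range(8)])

def predict(x):
    return int(np.argmin(np.linalg.norm(class_means - x, axis=1)))

# Even a genuine "8" is forced onto one of the seen labels.
x_eight = centres[8] + 0.1 * rng.normal(size=16)
print(predict(x_eight) in range(8))  # True: the label 8 can never be produced
```

The model interpolates among the labels it was trained on; no amount of test-time data makes the missing class reachable.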

In the case of ontology inference, the ontology being a causal graph, that is a "real" inference, as it symbolically traverses a graph of causal connections.


We might not be able to transfer that directly to a regression scenario, but it is probably possible by augmenting our models with SCMs and a hybrid symbolic-regression approach.
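As a toy illustration of what an SCM adds beyond regression (a hypothetical two-variable model of my own, not from the post): an intervention do(X = x) is answered by mutating the structural equations, not by filtering observed data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Structural causal model:  X := U_x,   Y := 2*X + U_y.
u_x, u_y = rng.normal(size=n), rng.normal(size=n)
x = u_x
y = 2 * x + u_y

# Observational regression recovers the mechanism's slope...
slope = np.cov(x, y)[0, 1] / np.var(x)

# ...while the intervention do(X = 3) keeps Y's mechanism but fixes X:
x_do = np.full(n, 3.0)
y_do = 2 * x_do + u_y

print(round(slope, 1), round(y_do.mean(), 1))
```

In this simple linear model the observational and interventional answers coincide; with confounding they would not, which is exactly where the causal graph earns its keep.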

  • The Looper repo provides a resource list for causal inference.
  • Thanks to Patrick McCrae for the ontology inference comparison.

@misc{suezen20causal,
     title = {Practice causal inference: Conventional supervised learning can't do inference},
     howpublished = {\url{https://memosisland.blogspot.com/2020/12/practice-causal-inference-conventional.html}},
     author = {Mehmet Süzen},
     year = {2020}
}

Sunday 1 November 2020

Gems of data science: 1, 2, infinity


Figure: George Gamow's book. (Wikipedia)
Problem-solving is the core activity of data science, using scientific principles and evidence. On our side, there is an irresistible urge to solve the most generic form of the problem. We do this almost always, from programming to formulation of the problem. But don't try to solve a generalised version of the problem. Solve it for N=1 if N is 1 in your setting, not for any integer: save time and resources, and try to embed this culture in your teams and management. Extend later when needed, on demand.

Solving for N=1 is sufficient if N=1 is the setting

This generalisation phenomenon manifests itself in algorithmic design: from programming to problem formulation, strategy, and policy setting. The core idea can be expressed as a mapping. Let's say the solution to a problem is a function, mapping from one domain to a range:

$$ f : \mathbb{R} \to \mathbb{R} $$

Trying to solve for the most generic setting of the problem means the multivariate setting,

$$ f : \mathbb{R}^{m} \to \mathbb{R}^{n} $$

where $m, n$ are integers generalising the problem.


It is elegant to solve a generic version of a problem. But is it really needed? Does it reflect reality, and would it be used? If N=1 is sufficient, then implement that solution first before generalising the problem. An exception to this basic pattern arises when there is no solution at N=1, but a solution appears once you move to larger N: this might sound absurd, but SVMs work exactly in this setting, solving classification problems for disconnected regions by mapping the data into a higher-dimensional space.
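A tiny numpy sketch of that exception, using XOR as the classic disconnected-regions case (the explicit product feature is my illustrative choice; kernel methods do this lifting implicitly): no line separates XOR in 2D, but one extra coordinate makes it linearly separable.

```python
import numpy as np

# XOR: two disconnected positive regions, not linearly separable in 2D.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])  # class = -sign(x1 * x2)

# Lift to 3D with the product feature, as a kernel map would do implicitly.
X3 = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the lifted space, the plane with normal (0, 0, -1) already separates.
w = np.array([0.0, 0.0, -1.0])
pred = np.sign(X3 @ w)
print((pred == y).all())  # True: separable once we generalise the space
```

The solution did not exist at the "small" setting but appears in the larger one, which is the one legitimate reason to generalise early.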


  • The title intentionally omits three; it is a reference to physics' inability to solve the three-body problem, or rather its mathematical intractability.

Sunday 28 June 2020

Conjugacy and Equivalence for Deep Neural Networks: Architecture compression to selection


A recently shown phenomenon can classify deep learning architectures using only the knowledge gained from trained weights [suezen20a]. The classification produces a measure of equivalence between two trained neural networks and, astonishingly, captures a family of closely related architectures as equivalent within a given accuracy. In this post, we will look into this from a conceptual perspective.

Figure 1: VGG architecture spectral difference in the long
positive tail [suezen20a]
The concept of conjugate matrix ensembles and equivalence

Conjugacy is a mathematical construct reflecting the idea that different approaches to the same system should yield the same outcome: it is reflected in the statistical mechanics concept of ensembles. However, for matrix ensembles, like the ones offered in Random Matrix Theory, conjugacy is not well defined in the literature. One possible resolution is to look at the cumulative spectral difference between two ensembles in the long positive tail of the spectrum [suezen20a]. If this difference vanishes, we can say that the two matrix ensembles are conjugate to each other. We observe this for VGG versus circular ensembles.

Conjugacy is the first step in building equivalence among different architectures. If two architectures are conjugate to the same third matrix ensemble and their fluctuations in the spectral difference are very close over the spectral locations, they are equivalent within a given accuracy [suezen20a].
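A rough numpy sketch of the basic ingredient (synthetic Gaussian matrices stand in for trained weight matrices here; the actual procedure and ensembles are in [suezen20a]): compare the cumulative spectra of two ensembles over the long positive tail.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 256

def tail_spectrum(matrix, frac=0.1):
    """Sorted singular values restricted to the long positive tail."""
    s = np.sort(np.linalg.svd(matrix, compute_uv=False))
    k = int(frac * len(s))
    return s[-k:]

# Two "ensembles": here, two independent scaled Gaussian matrices.
a = tail_spectrum(rng.normal(size=(n, n)) / np.sqrt(n))
b = tail_spectrum(rng.normal(size=(n, n)) / np.sqrt(n))

# Relative cumulative spectral difference over the tail:
# vanishingly small would indicate conjugacy in this sketch.
diff = np.abs(np.cumsum(a) - np.cumsum(b)).max() / np.cumsum(a).max()
print(round(diff, 4))
```

For trained networks one would replace the Gaussian draws with layer weight matrices and compare against a reference ensemble, then check whether the fluctuations of two architectures agree over the same spectral locations.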

Outlook: Where to use equivalence in practice?

The equivalence can be used for selecting or compressing an architecture, or for classifying different neural network architectures. A Python notebook demonstrating this with different vision architectures in PyTorch is provided here.


[suezen20a] Equivalence in Deep Neural Networks via Conjugate Matrix Ensembles, Mehmet Suezen, arXiv:2006.13687 (2020)

Tuesday 12 May 2020

Collaborative data science: High level guidance for ethical scientific peer reviews


Catalan Castellers are
collaborating (Wikipedia)
The availability of distributed code-tracking tools and associated collaborative tools makes life much easier in building collaborative scientific tools and products. This is now especially important in data science, which is applied across many industries as a de facto standard. Essentially, a computational science field from academia has now become an industry-wide practice.

Peer-review is a pull request

Peer reviews usually appear as pull requests; a pull request entails a change to a base work that achieves a certain goal. It is a nice coincidence that the acronym PR corresponds to both peer review and pull request.

Technical excellence does come with decent behaviour

Aiming at technical excellence is all we need to practice. Requesting technical excellence in PRs is our duty as peers. However, it must come with decent behaviour. PRs are tools for collaborative work, even if it isn't your project or you sit in a different cross-function. Here we summarise some high-level points for PRs; the work under review can be software code, an algorithmic method, or a scientific or technical article:
  • Don’t be a jerk. We should not request things that we do not practice ourselves, or invent new standards on the fly. If we do, then we should give a hand in implementing them.
  • Focus on the scope and be considerate. We should not request things that extend the task far beyond its originally stated scope.
  • Nitpicking is not attention to detail. Attention to detail is quite valuable; excessive nitpicking is not.
  • Be objective and don’t seek revenge. If one of your recommendations on a PR is not accepted by a colleague, don’t retaliate by declining their suggestions on your PRs or by creating hostility towards that person.


We have provided some basic, high-level guidance on peer-review processes. Life is funny; there is a saying in Cyprus, and probably in Texas too: what you sow, you will harvest.

Saturday 28 March 2020

Book review: A tutorial introduction to the mathematics of deep learning

Artificial Intelligence Engines:
An introduction to the Mathematics
of Deep Learning
by Dr James V. Stone
the book and Github repository.
(c) 2019 Sebtel Press
Deep learning and associated connectionist approaches are now applied routinely in industry and academic research, from image analysis to natural language processing and areas as cool as reinforcement learning. As practitioners, we use these techniques from well-designed, tested, and reliable libraries like TensorFlow or PyTorch, shipped as black-boxed algorithms. However, most practitioners lack the mathematical foundations and core algorithmic understanding. Unfortunately, many academic books and papers subliminally try to make an impression of superiority and avoid a simple pedagogical approach. In this post we review a unique book that tries to fill this gap with a pedagogical approach to the mathematics of deep learning, avoiding showy mathematical complexity and aiming to convey an understanding of how things work from the ground up. Moreover, the book provides pseudo-code that one can use to implement things from scratch, along with a supporting implementation in a GitHub repo. The author, Dr James V. Stone, a trained cognitive scientist and researcher in mathematical neuroscience, has taken this approach in other books for many years now, writing for students, not for his peers to show off. One important note: this is not a cookbook or practice tutorial but an upper-intermediate-level academic book.

Building associations and classify with a network

The logical and conceptual separation of association and classification tasks is introduced in the initial chapters. It is ideal to start by learning one association with one connection, via a gentle introduction to gradient descent for learning the weights, before moving to 2 associations and 2 connections. This reminds me of George Gamow's "1, 2, infinity" as a pedagogical principle. The perceptron is introduced later, showing how classification rules can be generated via a network, and the problems it encounters with XOR.
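The "one association, one connection" step can be sketched in a few lines (my own toy example, not the book's code): gradient descent on a single weight, minimising a squared error.

```python
# One association (x -> y) and one connection (weight w), learned by gradient descent.
x, y_target = 2.0, 6.0   # the association to learn: input 2 maps to output 6
w, lr = 0.0, 0.05        # initial weight and learning rate

for _ in range(200):
    y_pred = w * x                        # forward pass through the single connection
    grad = 2 * (y_pred - y_target) * x    # d/dw of the loss (y_pred - y_target)^2
    w -= lr * grad                        # gradient descent step

print(round(w, 3))  # → 3.0, since 3.0 * 2 = 6
```

Each update is a contraction towards the fixed point w = 3, which is the whole of backpropagation in miniature before any chain rule is needed.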

Backpropagation, Hopfield networks and Boltzmann machines

A detailed implementation of backpropagation is provided from scratch, with great clarity and without too much cluttering index notation; this is probably the best explanation I have ever encountered. The following chapters introduce Hopfield networks and Boltzmann machines from the ground up to the applied level. Unfortunately, many modern deep learning books skip these two great models, but Dr Stone makes both of them implementable for a practitioner. It is very impressive. I am a bit biased towards Hopfield networks, as I see them as an extension of Ising models and their stochastic counterparts, but I have not seen anywhere else such an explanation of how to use Hopfield networks for learning, complete with pseudo-code ready for a real task.

Advanced topics

Personally, I see the remaining chapters as advanced topics: deep Boltzmann machines, variational autoencoders, GANs, and an introduction to reinforcement learning, probably with the exception of deep backpropagation in Chapter 9. I would say that what is now known as deep learning had its inception in the architectures covered in sections 9.1 to 9.7.

Glossary, basic linear algebra and statistics

The appendices provide a fantastic conceptual introduction to the jargon and the basics of the main mathematical techniques. Of course, this isn't a replacement for fully-fledged linear algebra and statistics books, but it provides immediate, concise explanations.

Not a cookbook: Not import tensorflow as tf book

One other crucial advantage of this book is that it is definitely not a cookbook. Unfortunately, almost all books related to deep learning are written in a cookbook style; this book is not. However, it is supplemented by a full implementation in a repository supporting each chapter, URL here.


This little book achieves so much with a down-to-earth approach, introducing basic concepts with a respectful attitude, assuming the reader is very smart but inexperienced in the field. Whether you are a beginner or an experienced research scientist, this is a must-have book. I still see it as an academic book that could serve as the main text in an upper-undergraduate elective such as "Mathematics of Deep Learning".

Enjoy reading and learning from this book. Thank you, Dr Stone, for your efforts in making academic books more accessible.

Disclosure: I received a review copy of the book but I have bought another copy for a family member. 

Sunday 22 March 2020

Computational Epidemiology and Data Scientists: Don't post analysis on outbreak arbitrarily


Many data scientists are trained or experienced in using tools for statistical modelling, forecasting, or machine learning solutions. This doesn't necessarily mean that they should jump in, run ad-hoc analyses on the available public data on the COVID-19 outbreak, draw policy conclusions, and publish them in their blogs or other media. A rule of thumb for doing such a thing: you should have at least one published paper, article, or software solution related to outbreaks that appeared before December 2019. Please be considerate; epidemiological modelling is not merely fitting an exponential distribution.

Please refrain from posting blogs or similar pieces on infection modelling, or from giving advice based on ad-hoc data analysis done over your lunch break, if you have not worked on computational epidemiology before. There is a vast academic literature on computational epidemiology. Let the experts in those fields present their modelling efforts first. Let us value expertise in an area.
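To make the "not merely an exponential" point concrete, here is a minimal SIR sketch (forward-Euler integration; the rates beta and gamma are made-up illustrative values, nothing calibrated to any real outbreak): infections grow roughly exponentially only at first, then turn over as susceptibles deplete.

```python
# Minimal SIR compartmental model, forward-Euler integration.
beta, gamma = 0.3, 0.1       # transmission and recovery rates (illustrative)
s, i, r = 0.999, 0.001, 0.0  # susceptible, infected, recovered fractions
dt, peak = 0.1, 0.0

for _ in range(2000):        # integrate over 200 time units
    ds = -beta * s * i
    di = beta * s * i - gamma * i
    dr = gamma * i
    s, i, r = s + dt * ds, i + dt * di, r + dt * dr
    peak = max(peak, i)

# The epidemic peaks and then declines -- no pure exponential does that.
print(round(s + i + r, 6))   # population fraction is conserved
```

Even this three-line model has qualitative behaviour (a peak, a final size, herd-immunity threshold at 1/R0) that an exponential fit simply cannot represent, and real computational epidemiology goes far beyond it.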

Appendix: Computational epidemiology introductory resources

Here we provide limited pointers to the computational epidemiology literature. Google Scholar is your friend for finding many more resources.

  • Computational Epidemiology
    Madhav Marathe, Anil Kumar S. Vullikanti
    Communications of the ACM, July 2013, Vol. 56 No. 7, Pages 88-96 doi
  • Broadwick: a framework for computational epidemiology
    O’Hare, A., Lycett, S.J., Doherty, T. et al. Broadwick
    BMC Bioinformatics 17, 65 (2016).
  • Mathematical Tools for Understanding Infectious Disease Dynamics
    (Princeton Series in Theoretical and Computational Biology)
    Odo Diekmann, Hans Heesterbeek, and Tom Britton
    Princeton Press
  • Agent-Based Simulation Tools in Computational Epidemiology
    Patlolla P., Gunupudi V., Mikler A.R., Jacob R.T. (2006)
  • DIMACS 2002-2011 Special Focus on Computational and Mathematical Epidemiology Rutgers working group
  • Containment strategy for an epidemic based on fluctuations in the SIR model
    Philip Bittihn, Ramin Golestanian
    Oxford/Max Planck
  • SIAM Epidemiology Collection (2020)
  • The collection provided by the American Physical Society (APS) Physical Review COVID-19 collection
  • Modeling epidemics by the lattice Boltzmann method Alessandro De Rosis Phys. Rev. E 102, 023301
  • Implications of vaccination and waning immunity, Heffernan-Keeling,  2009 Jun 7; 276(1664): 2071–2080, url

Saturday 22 February 2020

A simple and interpretable performance measure for a binary classifier

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro 


A core application of machine learning models is the binary classification task. It appears in a plethora of areas, from diagnostic tests in medicine to credit risk decision making for consumers. Techniques for building classifiers vary from simple decision trees to logistic regression and, lately, super cool deep learning models that leverage multilayered neural networks. While these are mathematically different in construction and training methodology, when it comes to measuring their performance, things get tricky. In this post, we propose a simple and interpretable performance measure for a binary classifier in practice. Some background in classification is assumed.

Why is ROC-AUC not interpretable?

Varying the threshold produces different
confusion matrices (Wikipedia)
The de-facto standard for reporting classifier performance is the Receiver Operating Characteristic (ROC) - Area Under Curve (AUC) measure. It originates from the 1940s, during the development of radar by the US Navy, for measuring detection performance. There are at least 5 different definitions of what ROC-AUC means, and even people with a PhD in machine learning have an excessively difficult time explaining what AUC means as a performance measure. AUC functionality is available in almost all libraries, and it has become almost a religious ritual to report it in machine learning papers as classification performance. However, its interpretation is not easy, quite apart from its absurd comparison issues; see hmeasure. AUC measures the area under the True Positive Rate (TPR) curve as a function of the False Positive Rate (FPR), both extracted from confusion matrices at different thresholds.

$$\int_{0}^{1} f(x) dx = AUC$$

where $y = f(x)$ is the TPR and $x$ is the FPR. Apart from the multitude of interpretations and the ease of confusing them, there is no clear purpose in taking the integral over the FPR. Obviously, we would like perfect classification at an FPR of zero, but what the area represents as a mathematical object is not clear.

Probability of correct classification (PCC)

A simple and interpretable performance measure for a binary classifier would be great for both highly technical data scientists and non-technical stakeholders. The basic tenet in this direction is that the purpose of classifier technology is the ability to differentiate two classes. This boils down to a probability value: the probability of correct classification (PCC). An obvious choice is the so-called balanced accuracy (BA). This is usually recommended for unbalanced problems, even by SAS, though they used multiplication of probabilities. Here we will call BA the PCC and use addition instead, due to statistical dependence:
$PCC = (TPR+TNR)/2$


PCC tells us how good the classifier is at detecting either of the classes, and it is a probability value in $[0,1]$. Note that using total accuracy over both positive and negative cases is misleading: even if our training data is balanced, the batches on which we measure performance in production may not be, so accuracy alone is not a good measure.

Production issues
An immediate question would be: how do we choose the threshold for generating the confusion matrix? One option is to choose the threshold that maximises PCC on the test set for use in production. To improve the estimation of PCC, resampling on the test set can be performed to obtain a good uncertainty estimate.
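A short numpy sketch of this procedure (synthetic scores and a helper function of my own naming, purely illustrative): compute PCC at each candidate threshold on the test set and keep the maximiser.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic test-set scores: positives score higher on average; classes unbalanced.
y_true = np.repeat([1, 0], [200, 800])
scores = np.where(y_true == 1,
                  rng.normal(1.0, 1.0, 1000),
                  rng.normal(0.0, 1.0, 1000))

def pcc(y, y_hat):
    """Probability of correct classification: (TPR + TNR) / 2."""
    tpr = np.mean(y_hat[y == 1] == 1)
    tnr = np.mean(y_hat[y == 0] == 0)
    return (tpr + tnr) / 2

# Sweep thresholds and keep the one maximising PCC on the test set.
thresholds = np.linspace(scores.min(), scores.max(), 101)
best_t = max(thresholds, key=lambda t: pcc(y_true, (scores >= t).astype(int)))
print(round(best_t, 2), round(pcc(y_true, (scores >= best_t).astype(int)), 2))
```

Resampling the test set (e.g. bootstrap) and repeating the sweep would give the uncertainty band on both the threshold and the PCC estimate mentioned above.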


We have tried to circumvent reporting AUCs by introducing PCC, or balanced accuracy, as a simple and interpretable performance measure for a binary classifier. It is easy to explain to a non-technical audience. An improved PCC with better estimation properties could be introduced, but the main interpretation remains the same: the probability of correct classification.

(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License