Sunday, 15 December 2019

Bringing back Occam's razor to modern connectionist machine learning:

A simple complexity measure based on statistical physics
Cascading Periodic Spectral Ergodicity (cPSE)

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro with the title Applying Occam's razor to Deep Learning 

Kindly reviewed by Cornelius Weber

Preamble: Changing concepts in machine learning due to deep learning

Occam's razor or principle of parsimony has been the guiding principle in statistical model selection. In comparing two models, which they provide similar predictions or description of reality, we would vouch for the one which is less complex. This boils down to the problem of how to measure the complexity of a statistical model and model selection. What constitutes a model, as discussed by McCullagh (2002) in statistical models context is a different discussion, but here we assume a machine learning algorithms are considered as a statistical model.  Classically, the complexity of statistical models usually measured with Akaike information criterion (AIC) or similar. Using a complexity measure, one would choose a less complex model to use in practice, other things fixed.
Figure 1: Arbitrary architecture, each node represents
a layer for a given deep neural network, such as
convolutions or set of units.  Süzen-Cerd
à-Weber (2019)


The surge in interest in using complex neural network architectures, i.e., deep learning due to their unprecedented success in certain tasks,  pushes the boundaries of "standard" statistical concepts such as overfitting/overtraining and regularisation

Now Overfitting/overtraining is often used as an umbrella term to describe any unwanted performance drop off a machine learning model Roelofs et. al. (2019) and nearly anything that improves generalization is called regularization, Martin and Mahoney (2019).

A complexity measure for connectionist machine learning

Deep learning practitioners rely on choosing the best performing model and do not practice Occam's razor.  The advent of Neural Architecture Search and new complexity measures  
that take the structure of the network into account gives rise the possibility of practising Occam's razor in deep learning. Here, we would cover one of the very practical and simple measures called cPSE, i.e., cascading periodic spectral ergodicity. This measure takes into account the depth of the neural network and computes fluctuations of the weight structure over the entire network,  Süzen-Cerdà-Weber (2019) Figure 1. It is shown that the measure is correlated with the generalisation performance almost perfectly, see Figure 2.


Practical usage of cPSE

Figure 2: Evolution of PSE, periodic spectral ergodicity,
it is shown that complexity measure cPSE saturates
after a certain depth,  Süzen-Cerdà-Weber (2019)
The cPSE measure is implemented in Bristol python package, starting from version 0.2.6. If a trained network wrapped into a PyTorch model object, cPSE can be used to compare two different architectures. If two architectures give similar test performance, we would select the one with higher cPSE value. Lower the cPSE, more complex the model.

An example of usage requires a couple of lines, example measurements for VGG and ResNet are given in Süzen-Cerdà-Weber (2019):

from bristol import cPSE
import torchvision.models as models
netname = 'vgg11'
pmodel = getattr(models, netname)(pretrained=True)
(d_layers, cpse) = cPSE.cpse_measure(pmodel)




Conclusion and take-home message

Using a less complex deep neural network that would give similar performance is not practised by the deep learning community due to the complexity of training and designing new architectures. However, quantifying the complexity of similarly performing neural network architecture would bring the advantage of using less computing power to train and deploy such less complex models into production. Bringing back the Occam's razor to modern connectionist machine learning is not only a theoretical and philosophical satisfaction but the practical advantages for environment and computing time is immense.

Postscript : Appendix : Vanilla computation of cPSE

Recipe is quite easy to implement in practice: Get all weight matrices of your trained architecture as list of 2D matrices, special layers and biases subject to mapping, then:


from bristol import cPSE

spectrum_set = cPSE.get_eigenvals_layer_matrix_set(layer_matrices)

periodic_spectrum_set = cPSE.eigenvals_set_to_periodic(spectrum_set)

spectral_ergodicity_over_layers = cPSE.d_layers_pse(periodic_spectrum_set)


One should stop adding more layers where spectral_ergodicity_over_layers saturates to a constant value, i.e., decreasing curve.




Saturday, 23 November 2019

A simple pedagogical one-line python code for the Fibonacci sequence with recursion (and memoization)


Fibonacci, Leonardo da Pisa
(Wikipedia)
Updated with memoization on 26 Feb 2020

Preamble

The Fibonacci sequence $F_{n} = F_{n-1} + F_{n-2}$ is famous for both artistic and pure mathematical curiosity as an integer sequence. A simple implementation of this sequence appears as a standard programmer's entry-level job question, along with finding factorial and greatest common divisor. In this short post, we will show how to implement $F_{n}$ with recursion in Python one-liner.

Fibonacci Recursion in Python

Here is a one-liner
f = lambda n : n if n < 2 else f(n-1)+f(n-2) ; [f(n) for n in range(0,13)]

Let's dissect the one line. $f$ is our function that computes the $nth$ member of the sequence in a recursive fashion.

For example, a recursion of $f$ at $n=3$ follows the following depth as recursive calls:
  1. Upper level, n=3 : f(2)+f(1)
  2. n=2 : f(1) + f(0) 
    1. Returns 1+0 = 1
  3. n=1 : f(1) 
    1.  Returns 1
  4. Returns 2
Recall that every recursion needs a termination condition, which is less than two in our case. The second part of one-liner after semi-column, which indicates another line,  is a simple list-comprehension to repeat the call to $f$ from 0 to 12. Resulting
 [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144] 

Computational efficiency

To compute the sequence for a very large set, this implementation is obviously very slow. One would need to introduce caching within the recursion to avoid repeated computation.

Reading list for novice coders
Conclusion

One of the most interesting programming tasks usually comes in implementing integer sequences. Note that Fibonacci day is celebrated on 23rd of November, the day this post is published.

Postscript: Time complexity, memoization version with timings

The complexity of 1 liner code above is $O(2^N)$, with caching it can be $O(N)$. Thanks to Tim Scarfe for pointing out Big-O complexities. The solution is to introduce caching, this is so-called memoization.

The following gives timing for $m=35$ without and with memoization. It is seen that memoization is almost  5 orders of magnitude faster and will actually memory efficient as well. For larger $m$ memoization version may return a value but without memoization, we would quickly overflow on recursive calls.


import time
m = 35

# no memoization
t0 = time.process_time()
f = lambda n : n if n < 2 else f(n-1)+f(n-2) ;  sfib = [f(i) for i in range(1,m)]
time.process_time() - t0

# memoization
t0 = time.process_time()
mem={1:1,2:2} ; f = lambda n : n if n < 2 else mem.get(n) if mem.get(n) else mem.update({n:f(n-1)+f(n-2)}) ; sfib_mem = [f(i) for i in range(1,m)]; sfib_mem = [1]+list(mem.values())
time.process_time() - t0
 


Friday, 27 September 2019

On modern data scientist: A blind empiricist is not a data scientist

As mentioned by Professor Pearl graciously on twitter 

Preamble
Hubble Space Telescope (Wikipedia)
Computational science is to
modern data scientist, as telescopes are
for astrophysics.

A better software developer than a statistician and better statistician than a software developer would have been a good definition for the early 2010s in identifying who would be a data scientist. In the late 2010s, trends changed dramatically, a data scientist is now identified as who can turn any set of data to run through machine learning libraries and getting a model to deploy for service.  Unfortunately, this blind empiricism is now considered as a data science practice in many industrial places and the term "scientist" lost its intellectual practice and turn into the mass hysteria of producing "junk science" blindly in the name of "democratisation of data science".

Who is the modern data scientist? 

Modern data science actually goes beyond statistics and machine learning. Modern data scientist practice computational science from dynamical systems to game theory or graph theory. One could think of such practice as applied mathematics or statistical physics as well.  For example, most of the neural networks is actually originating from statistical physics. In that sense, a modern data scientist is a computational scientist building mechanics of data.


  1. The exploratory analysis goes beyond basic PCA or clustering to be able to form causal relationships or establish mechanics of the data.
  2. Can express the mechanics of data in mathematical models and build parametric inference. Not all parameter estimations are learning.
  3. Use machine learning algorithms from libraries by knowing the underlying algorithm and can relate this to the mechanics of data.
  4. Build algorithms fusing above work.
  5. Explainable and transparent work.
  6. Document the findings as in the scientific paper and scientific software. 

Ignoring the above practice and treating data science similar to a web-based software development activity is not a fair practice and an immense waste of time. Organisations should understand that investing in data science means investing in the new computational science of building mechanics of data. Pushing the outcome of such a scientific practice to make a real-world impact lies in the novelty of scientist and as in any scientific funding, this is a very risky investment.

Misconception in democratisation of data science

The democratisation of data science does not mean that anyone should build learning or statistical models using machine learning libraries and put lots of data to get a black-box model as a blind empiricist. Democratisation was about the availability of tools and services at very low cost and open culture of transparency in algorithmic and software work.

Artificial Intelligence is modern data science

The separation of AI from the above definition of data science is not really clear. While AI combines the same characteristics to build so-called intelligent agents.

Conclusion

Having a perspective and understanding of what is modern data science about will help organisations better orient in building modern data science capabilities.

Postscript: Further reading and on the mechanics of data

We used a term the mechanics of data, it implies the effort to put in finding signatures of causal relationships and make sense of the correlations within the data. The reason is one of the core scientific methods that give rise to modern science lies in Newton-Leibniz mechanics. Coveney and his co-worker's deep dives in intricacies of practising science and data science.
  • Big data: the end of the scientific method?
    Sauro Succi and Peter V. Coveney
    [article]
  • Big data need big theory too
    Peter V. Coveney, Edward R. Dougherty and Roger R. Highfield
    [article]
Post-Postscript
Judea Pearl, a pioneering scientist on causal inference field, a quiet revolution in statistics and data science, Turing award laureate has similar critique on excessive empiricism. His post explains: 

Radical Empiricism and Machine Learning Research, which is also published as an article here: doi

Thursday, 10 January 2019

The fundamental problem of causal inference: causality resource list Looper

Preamble


One of the main tenants of practical data science is performing statistical inference on data sets which are assumed to be representations of populations of activity, natural or man-made. A specific case is called causal inference, which probably the core interest of decision makers and probably the main reason why businesses or industry funds data science projects in the first place. Either by trying to find a cause of an event or understand the impact of data science and AI products. A recent popular book of "Why" once again put causal inference into the major market [pearl2018]. However, the nightmare for a data scientist is that there is a fundamental limitation of performing such an inference. It stems from the fact that, the condition whereby to test causal link, i.e., establishing cause and effect, on a given data point cannot occur simultaneously. This is the so-called FPCI, the fundamental problem of causal inference [Holland86] and it can only be resolved approximately. This implies a data point cannot be both treated and not-treated simultaneously. For example, a patient cannot have a condition that receives treatment or no treatment at the same time, or a customer cannot be in the campaign or excluded at the same time. This post provides possible approaches in tackling it with pointers to resource and software tools.  The core problem is a missing data issue, but note that the missingness here originates from not a measurement error but fundamentally non-existing data point, so it is not only simple data imputation issue. More extensive resource list can be found, a resource list called looper is available on GitHub,  A resource list for causality in statistics, data science and physics. [looper]
Figure: Looper the movie revolves
around the causal loop (Wikimedia)

Rubin causal model: A-null test with the inherent contradiction


The question of why as in what is the cause of an event inherently a physics question and goes into space-time concept of general relativity and in classical mechanics in general. Popular time-travel movies, such as looper, see Figure, causality loops creates a curious non-intuitive phenomenon. From a data analysis perspective, the Rubin causal model [Rubin1974]  asserts that the causal effect can be quantified with A-Null test. What does this mean? Cause, i.e., treatment as a medical connotation, is the cause of the event, i.e., the effect or the outcome, can be quantified by the algebraic difference between the expected value of the outcome and it's counterfactual, a fancy name for what would have been the outcome in the absence of the treatment. Unfortunately, this is inherently contradictory as mentioned above, so-called FPCI, the fundamental problem of causal inference [Holland86]. On a single sample, i.e. event,  this A-null test cannot be applied because data is missing causally.  In estimating the A-null test, two groups are designed, so that they are drawn from the same population, recall the law of large numbers. One group receives the treatment and the other does not by design. Over time the effect of the treatment measured on both groups and Rubins A-Null test is applied. Not surprisingly they are called control and treatment groups and this procedure is called Average Treatment Effects (ATE). The mathematical exposure of this procedure is well established and can be found in standard texts, see Looper, but mathematically ATE reads  $\mathbb{E}(Y_{i} | T_{i} = 1) - \mathbb{E}(Y_{i} | T_{i} = 0) $.

Matching or balancing

Randomized control trails (RCT) constructs control and treatment groups. To compute ATE robustly, the statistical properties of covariates, set of features used in the study, from control and treatment must be similar. Any technique try to address the balancing issue is called matching [stuart2010].  Prominienet example in this direction is propensity score matching, using similarity distances between covariates, implemented in R via matching package based on genetic search. 

Imputation approach: Causal inference as missing data problem

One way to overcome FPCI is applying an imputation to missing data, predicting outcome value in case of control sample, vice versa. Using advanced imputation techniques one can resolve FPCI.  One prominent example is via Multiple Imputation, implemented in R mice package.

Summary

We have reviewed the fundamental problem of causal inference (FPCI) . For causality resources, a resource list called looper is available on GitHub,  A resource list for causality in statistics, data science and physics, please sent  a pull request for additions.Note that causality is a large research area from Bayesian Networks, Pearl's do calculus and uplift modelling.


References

[pearl2018] Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect,  Basic Books; 1 edition (May 15, 2018)
[Holland1986] Holland, Paul W. (1986). "Statistics and Causal Inference". J. Amer. Statist. Assoc. 81 (396): 945–960. 
[Rubin1974] Rubin DB (1974). “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology, 66, 688–701.
[stuart2010]  Elizabeth A. Stuart, Matching methods for causal inference: A review and a look forward Stat Sci. 2010 Feb 1; 25(1): 1–21.
[looper] A resource list for causality in statistics, data science and physics http:github.com/msuzen/looper.

Thursday, 3 January 2019

Core principles of sustainable data science, machine learning and AI product development: Research as a core driver

Kindly reposto to KDnuggets  by Gregory Piatetsky-Shapiro


Preamble 

Almost all businesses and industry embraced Machine learning (ML) technologies. Apart from ROI concerns, as it is an expensive endeavour to develop and deploy a service driven by ML techniques, sustainability as in going beyond proof-of-concept core development appears to be one of the roadblocks in data science. In this post, we will outline basic logical core principles that can help organisations for sustainable AI product development cycle, apart from reproducibility issues. The aim is giving a coarser view, rather than listing fine-grain good practice advice.

Research as a core driver: Research Environment

Regardless of the size of your organisation, if you are developing machine learning or AI products, the core asset you have is a research professional, data scientist or AI scientist, regardless of their academic background. Developing a model using software libraries blindly won't resolve issues you might encounter after deployment of the product. For example, even if you need to do a simple hyperparameter search, this can easily yield to research. Why? Because most probably no one ever tried building a model or try a modelling task using your dataset and you might need a different approach than ML libraries provide. A simplest different angle or deviation from ML libraries will yield to a research problem. 
  • No full 'black-box' approaches.
  • No blind usage of software libraries.
  • Awareness and skills in the mathematical and algorithmic aspect in detail.
Figure: A schematic of core principles for AI product development.
Separate out research code and production code

Software development is an integral part of ML product development. However, during research, a code development can go very wild and a scientist, even if they are very good software developers, would end up creating hard to follow and poor code. Once there is a confidence in reproducibility and robustness of results, the production code should be re-written with high-quality software engineering principles.

Data Standardisation: Release data-sets for research 

A cold start problem for ML products is to release and design data-sets before even doing any research like work. This, of course, has to be aligned with industrial requirements. Imagine datasets like MINST or imagenet for benchmarking. Released sets will be the first step for any model building or product development, and would constitute a data product themselves. Data versioning is also a must.

Do not obsess with workflows: All workflows are ad-hoc

There is no such thing as a universal or generic workflow. A workflow depends on a human understanding of processes and steps. Human understanding is based on language and linguistically there is no such thing as universal language, at least it isn't practical yet c.f., universal grammar. Loosely defined steps are sufficient for research steps. However, once it entered into production, then much more strict workflow design might be needed, but be aware all workflows are ad-hoc.

Do not run sprints for core data science 

Agile principles are suitable for software development innovations. Sprints or Agile is not suitable for AI research and research environment, it is a different kind of innovation than software engineering. Thinking that Agile is a cure to do scientific innovation is naive wishful thinking. Structuring a research group,  periodic reviews and releases of the results via presentations and detailed technical reports are much more suitable for data science on top of mini-workshops. A simple proposal runs can also be made to decide which direction to invest, akin to research proposals.

Feedback loop: Service to Business decision making back to research

A service using ML technologies should produce more data. The very first service monitoring is A/Null testing, meaning that what would happen in the absence of the AI product. Detailed analysis of the service data would bring more insights both for business and to research. 
  • produce impact assessment: A/null testing
  • quality of service: Quality of service can be measured basically on what is the success of the ML model, this has to be technical.

Conclusion and outlook

There is no such thing as free-lunch and developing AI products won't be fully automated soon. Tools may improve the productivity immensely but AI replacing a data scientist or AI scientist is far from reality, at least for now. If you are investing in AI products, basically you are investing in research at the core, missing that important point may cost organisations a lot. The basic core principles or variation of them may help in sustaining AI products longer and form your teams accordingly.