Tuesday, 22 June 2021

Full cross-validation and generating learning curves for time-series models

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro


Time-series analysis is needed in almost any quantitative field and in any real-life system that collects data over time, i.e., that produces temporal datasets. Building predictive models on temporal datasets for the future evolution of the system under consideration is usually called forecasting. The validation of such models deviates from the standard holdout method used in supervised learning, which relies on random disjoint splits into train, test and validation sets. This stems from the fact that time-series are ordered, and the order induces all sorts of statistical properties that should be retained. For this reason, directly applying cross-validation to time-series model building is not possible, and validation is restricted to out-of-sample (OOS) validation, using the end-tail of a temporal set as a single test set. A recent work proposed an approach that overcomes this known limitation and achieves full cross-validation for time-series. The approach also opens up the possibility of producing learning curves for time-series models, which is usually not possible either, for similar reasons.

Reconstructive Cross-validation (rCV): Design principles of a meta-algorithm

rCV was proposed recently in the paper titled Generalised learning of time-series: Ornstein-Uhlenbeck processes. The design of rCV for time-series aims at the following principles:

   Figure 1: rCV meta-algorithm for time-series cross-validation and learning curves.

  1. Logically close to standard cross-validation: arbitrary test-set size and number of folds.
  2. Preserves correlations and data order.
  3. Does not create the absurdity of predicting the past from future data.
  4. Applicable in a generic fashion regardless of the learning algorithm.
  5. Applicable to multi-dimensional time-series.
  6. Agnostic to the evaluation metric.

Idea of introducing missing data: Temporal cross-validation and learning curves

The key idea of rCV is to create cross-validation sets by creating missing-data sets K times, as in K-fold, with a given missing ratio, i.e., by random removal of data points. Each fold has a disjoint set of missing data points. Using an imputation method, we fill in the K disjoint missing-data sets and generate K different training datasets. This gives us K different models, and we can measure the generalised performance of the modelling approach by testing the primary model's predictions on the out-of-sample (OOS) test set. To avoid confusion about what "a model" is here: what we are trying to assess is the hypothesis, i.e., the modelling approach. Changing the ratio of missing data and repeating the cross-validation exercise yields a set of missing-data ratios and their corresponding rCV errors; the plot of these is nothing but a learning curve from the supervised learning perspective. Note that the imputation model and the prediction model are different models. The primary model we are trying to build is the prediction model, which is used for producing the OOS predictions. The procedure is summarised in Figure 1.
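The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the paper's implementation: linear interpolation stands in for the imputation model, a least-squares AR(1) fit stands in for the primary prediction model, and all function names, the toy series and the parameter values are hypothetical.

```python
import numpy as np

def make_disjoint_missing_folds(n_train, k, missing_ratio, rng):
    """Return k disjoint index sets, each removing `missing_ratio` of the points."""
    n_missing = int(round(missing_ratio * n_train))
    # Never remove the first or last point, so imputation stays an interpolation.
    candidates = rng.permutation(np.arange(1, n_train - 1))
    if k * n_missing > candidates.size:
        raise ValueError("k * missing_ratio is too large for disjoint folds")
    return [np.sort(candidates[i * n_missing:(i + 1) * n_missing]) for i in range(k)]

def impute_linear(t, y, missing_idx):
    """Imputation model (assumption): linear interpolation over removed points."""
    keep = np.ones(y.size, dtype=bool)
    keep[missing_idx] = False
    y_rec = y.copy()
    y_rec[missing_idx] = np.interp(t[missing_idx], t[keep], y[keep])
    return y_rec

def fit_ar1(y):
    """Primary model (assumption): least-squares AR(1) coefficient."""
    return float(np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1]))

def forecast_ar1(phi, last_value, horizon):
    """Multi-step AR(1) forecast starting from the last training value."""
    return last_value * phi ** np.arange(1, horizon + 1)

def rcv_error(t, y, n_test, k, missing_ratio, rng):
    """Mean OOS RMSE over k reconstructed training sets."""
    t_train, y_train, y_test = t[:-n_test], y[:-n_test], y[-n_test:]
    folds = make_disjoint_missing_folds(y_train.size, k, missing_ratio, rng)
    errors = []
    for missing_idx in folds:
        y_rec = impute_linear(t_train, y_train, missing_idx)  # reconstruct the fold
        phi = fit_ar1(y_rec)                                  # train primary model
        pred = forecast_ar1(phi, y_rec[-1], n_test)           # predict the end-tail
        errors.append(np.sqrt(np.mean((pred - y_test) ** 2)))
    return float(np.mean(errors))

# Toy AR(1) series (hypothetical data, only to exercise the sketch).
rng = np.random.default_rng(0)
n = 300
t = np.arange(n, dtype=float)
y = np.zeros(n)
for i in range(1, n):
    y[i] = 0.9 * y[i - 1] + rng.normal()

# Learning curve: rCV error as a function of the missing-data ratio.
ratios = [0.05, 0.1, 0.2]
curve = [(r, rcv_error(t, y, n_test=30, k=4, missing_ratio=r, rng=rng)) for r in ratios]
```

The test set is always the end-tail of the series, so no fold ever predicts the past from the future; only the training part is perturbed and reconstructed.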

    Figure 2 : Synthetic data and reconstructions. 

Show-case with Gaussian process models on Ornstein-Uhlenbeck processes

To demonstrate the utility of rCV, the paper uses synthetic data generated by an Ornstein-Uhlenbeck process, i.e., a Gaussian process with a certain parameter setting. Figure 2 shows the synthetic data together with example locations of the reconstruction errors of the generated missing-data sets. Figure 3 shows the learning curves for different ratios of missing data.
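For readers who want to generate similar synthetic data, here is a minimal sketch of simulating an Ornstein-Uhlenbeck process with its exact discretisation. The function name and the parameter values (`theta`, `mu`, `sigma`, `dt`) are assumptions chosen for illustration; they are not the settings used in the paper.

```python
import numpy as np

def simulate_ou(n_steps, dt=0.1, theta=1.0, mu=0.0, sigma=0.5, x0=0.0, seed=0):
    """Simulate dx = theta * (mu - x) dt + sigma dW on a regular time grid."""
    rng = np.random.default_rng(seed)
    decay = np.exp(-theta * dt)
    # Exact conditional standard deviation over one time step.
    noise_std = sigma * np.sqrt((1.0 - decay ** 2) / (2.0 * theta))
    x = np.empty(n_steps)
    x[0] = x0
    for i in range(1, n_steps):
        x[i] = mu + (x[i - 1] - mu) * decay + noise_std * rng.normal()
    return x

series = simulate_ou(5000)
```

With these values the stationary variance is sigma^2 / (2 * theta) = 0.125, which gives a quick sanity check on the simulated series.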

    Figure 3: Learning curves for the Gaussian Process model generated by rCV.


rCV provides a logically consistent way of practising cross-validation for time-series. It is usually not possible to produce learning curves on the same time window for a time-series model; using rCV with different ratios of missing data achieves this as well. rCV paves the way for generalised learning for time-series.

Further Reading

Besides the paper, Generalised learning of time-series: Ornstein-Uhlenbeck processes, there is a Python prototype implementation with which the results can be reproduced.

Tuesday, 29 December 2020

Practice causal inference: Conventional supervised learning can't do inference

This is a bit philosophical but goes into causal inference.

A trained model may provide predictions for input values it has never seen before, but this isn't inference, at least for 'classical' supervised learning. In reality, it provides an interpolation from the training set, i.e., via function approximation. If "inference" is taken to imply going beyond the training data, then distributional shift, compositional learning or a similar type of learning should be brought into the discussion.
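A small numerical sketch makes the interpolation point concrete. It assumes a degree-5 polynomial as the "trained model" and sin(x) as the unknown target; both are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set: 50 samples of sin(x), all drawn from [0, pi].
x_train = np.sort(rng.uniform(0.0, np.pi, 50))
y_train = np.sin(x_train)

# "Trained model": least-squares polynomial, i.e., a function approximation.
coeffs = np.polyfit(x_train, y_train, deg=5)

# Unseen inputs inside the training range versus far outside it.
x_inside = np.linspace(0.2, np.pi - 0.2, 100)
x_outside = np.linspace(2.0 * np.pi, 3.0 * np.pi, 100)

err_inside = float(np.max(np.abs(np.polyval(coeffs, x_inside) - np.sin(x_inside))))
err_outside = float(np.max(np.abs(np.polyval(coeffs, x_outside) - np.sin(x_outside))))
```

Both sets of inputs are "never seen before", but only the ones inside the training range are predicted well: the model interpolates, it does not infer how the system behaves elsewhere.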

In the case of ontology inference, with the ontology being a causal graph, we do have "real" inference, as it symbolically traverses a graph of causal connections. It is not clear whether this can be directly transferred to a regression scenario, but it is probably possible by augmenting our models with structural causal models (SCMs) and a hybrid symbolic-regression approach.
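As a toy illustration of what symbolically traversing a graph of causal connections means, the sketch below walks a tiny hand-written causal graph. The graph contents and the function name `infer_effects` are made up for this example; a real ontology reasoner is far richer.

```python
from collections import deque

# Hypothetical causal graph: each key causes the nodes in its list.
causal_graph = {
    "rain": ["wet_road"],
    "wet_road": ["longer_braking_distance"],
    "speeding": ["longer_braking_distance"],
    "longer_braking_distance": ["accident"],
    "accident": [],
}

def infer_effects(graph, cause):
    """Return all downstream effects of `cause`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([cause])
    while queue:
        node = queue.popleft()
        for effect in graph.get(node, []):
            if effect not in seen:
                seen.add(effect)
                order.append(effect)
                queue.append(effect)
    return order

effects_of_rain = infer_effects(causal_graph, "rain")
# effects_of_rain == ["wet_road", "longer_braking_distance", "accident"]
```

The conclusion "rain can lead to an accident" is never stored as a data point; it is derived by following causal links, which is different from interpolating between training examples.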

  • The Looper repo provides a resource list for causal inference.
  • Thanks to Patrick McCrae for invoking ontology inference comparison.

Sunday, 1 November 2020

Gems of data science: 1, 2, infinity


Figure: George Gamow's book. (Wikipedia)
Problem-solving using scientific principles and evidence is the core activity of data science. On our side, there is an irresistible urge to solve the most generic form of the problem. We do this almost always, from programming to the formulation of the problem. But don't try to solve a generalised version of the problem. Solve it for N=1 if N is 1 in your setting, not for any integer: save time and resources, and try to embed this culture in your teams and management. Extend later, on demand, when needed.

Solving for N=1 is sufficient if it is the setting

This generalisation phenomenon manifests itself in algorithmic design, from programming to problem formulation, strategy and policy setting. The core idea can be expressed as a mapping: let's say the solution to a problem is a function, mapping from a domain to a range,

$$ f : \mathbb{R} \to \mathbb{R} $$

Trying to solve the most generic setting of the problem, namely the multivariate setting, means solving for

$$ f : \mathbb{R}^{m} \to \mathbb{R}^{n} $$

where $m, n$ are the integers generalising the problem.  


It is elegant to solve a generic version of a problem. But is it really needed? Does it reflect reality, and would it be used? If N=1 is sufficient, then implement that solution first before generalising the problem. An exception to this basic pattern is when you don't have a solution at N=1 but a solution appears once you move to larger N. You might think this is absurd, but SVMs work exactly in this way: classes that occupy disconnected regions cannot be separated linearly in the original space, yet become separable after mapping to a higher-dimensional space.
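The exception can be seen in a tiny example. The sketch below does not use an SVM; it assumes a hand-written feature map x -> (x, x^2) as a stand-in for what a kernel does implicitly, and the data points and the helper `best_threshold_accuracy` are hypothetical.

```python
import numpy as np

# 1-D data: the positive class lives in two disconnected regions.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
labels = np.array([1, 1, -1, -1, -1, 1, 1])

def best_threshold_accuracy(values, labels):
    """Best accuracy of any single-threshold (linear) classifier on 1-D values."""
    v = np.sort(values)
    cuts = np.concatenate(([v[0] - 1.0], (v[:-1] + v[1:]) / 2.0, [v[-1] + 1.0]))
    best = 0.0
    for cut in cuts:
        for sign in (1, -1):
            pred = np.where(sign * (values - cut) > 0, 1, -1)
            best = max(best, float(np.mean(pred == labels)))
    return best

# In the original 1-D space no single threshold classifies every point.
acc_1d = best_threshold_accuracy(x, labels)

# After lifting to (x, x^2), a threshold on the second coordinate alone,
# i.e., a linear boundary with weight vector (0, 1), separates the classes.
acc_lifted = best_threshold_accuracy(x ** 2, labels)
```

Here the problem has no linear solution in one dimension but has one in two, which is the sense in which moving to a larger N can create a solution.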


  • The title intentionally omits three; it is a reference to physics's inability to solve the three-body problem, or rather to the mathematical difficulty of that problem.

Sunday, 28 June 2020

Conjugacy and Equivalence for Deep Neural Networks: Architecture compression to selection


A recently shown phenomenon makes it possible to classify deep learning architectures using only the knowledge contained in the trained weights [suezen20a]. The classification produces a measure of equivalence between two trained neural networks and, astonishingly, captures a family of closely related architectures as equivalent within a given accuracy. In this post, we will look into this from a conceptual perspective.

Figure 1: VGG architecture spectral difference in the long positive tail [suezen20a]

The concept of conjugate matrix ensembles and equivalence

Conjugacy is a mathematical construct reflecting that different approaches to the same system should yield the same outcome; it is reflected in the concept of ensembles in statistical mechanics. However, for matrix ensembles, like the ones offered in Random Matrix Theory, conjugacy is not well defined in the literature. One possible resolution is to look at the cumulative spectral difference between two ensembles in the long positive tail of the spectrum [suezen20a]. If this difference vanishes, we can say that the two matrix ensembles are conjugate to each other. We observe this between the matrix ensembles of VGG and the circular ensembles.

Conjugacy is the first step in building equivalence among different architectures. If two architectures are conjugate to the same third matrix ensemble and their fluctuations in the spectral difference are very close over the spectral locations, they are equivalent within a given accuracy [suezen20a].
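To give a rough feel for the quantity involved, here is a heavily simplified sketch of a cumulative spectral difference between two matrix ensembles. Everything in it is an assumption made for illustration: random symmetric Gaussian matrices stand in for the ensembles, histogram densities stand in for the spectral densities, and the function names and bin range are hypothetical. The precise definition is the one given in [suezen20a].

```python
import numpy as np

def ensemble_eigenvalues(n_matrices, size, scale, rng):
    """Pooled eigenvalues of random symmetric matrices (stand-in ensemble)."""
    eigs = []
    for _ in range(n_matrices):
        a = rng.normal(scale=scale, size=(size, size))
        eigs.append(np.linalg.eigvalsh((a + a.T) / 2.0))
    return np.concatenate(eigs)

def cumulative_spectral_difference(eigs_a, eigs_b, bins):
    """Running sum of the absolute density difference over the given bins."""
    dens_a, _ = np.histogram(eigs_a, bins=bins, density=True)
    dens_b, _ = np.histogram(eigs_b, bins=bins, density=True)
    return np.cumsum(np.abs(dens_a - dens_b) * np.diff(bins))

rng = np.random.default_rng(0)
eigs_a = ensemble_eigenvalues(20, 50, 1.0, rng)
eigs_b = ensemble_eigenvalues(20, 50, 1.2, rng)

# Restrict the comparison to the positive part of the spectrum.
tail_bins = np.linspace(0.0, 15.0, 31)
csd_ab = cumulative_spectral_difference(eigs_a, eigs_b, tail_bins)
csd_aa = cumulative_spectral_difference(eigs_a, eigs_a, tail_bins)
```

In this sketch an ensemble compared with itself gives a vanishing curve, while two ensembles with different scales accumulate a non-zero difference.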

Outlook: Where to use equivalence in practice?

The equivalence can be used to select or compress an architecture, or to classify different neural network architectures. A Python notebook demonstrating this with different vision architectures in PyTorch is provided.


[suezen20a] Equivalence in Deep Neural Networks via Conjugate Matrix Ensembles, Mehmet Suezen, arXiv:2006.13687 (2020)

(c) Copyright 2008-2020 Mehmet Suzen (suzen at acm dot org)

This work is licensed under a Creative Commons Attribution 4.0 International License