Saturday, 9 September 2023

Practical causal ordering:
Why weighted-Directed Acyclic Graphs (DAGs) are powerful for causal inference?


Fractal Tree (Wikipedia)
A quiet causal inference revolution is underway in industry. We see immense success of transformers deep learning architectures. However their success should also be attributed to causal modelling. Large Language Models (LLMs), specially closed-sourced ones, elevates their performance with encoding causal mechanism with human designed deep learning components, i.e., innovative layers such as causal convolutions with multi-head self-attention layers. Now a classical approach,  one of the corner-stone of causality, is expressing modelling variates' causal relationships with  Directed Acyclic Graphs (DAGs). This lead to causal analysis and preserving ordering. In this short tutorial, we cover these without graph theoretic language for practical causal ordering.

Understanding weighted Directed Acyclic Graphs (wDAGs) as causal data structure  

We first define what is a directions and weights. Providing a notational definition via tuples of objects.    

Definition (wDAG): A weighted Directed Acyclic Graph (wDAG)  $\mathscr{G_{c}}$  is defined as set of ordered triplets of weights and connected random variables, such that, $k$th  triplet $(w_{k}, x_{i}, x_{j})$ where by $w_{k} \in \mathbb{R}$ is the weight, an effect size, between two variates that $x_{i}$ effects $x_{j}$. There are constraints : 

(i) No cyclic effects can be defined, necessarily $x_{i}$ can not be equal to $ x_{j}$.

(ii) If there is a definition,  $(w_{k}, x_{i}, x_{j})$ the reverse can't be defined, i.e.,  so that $(w_{k}, x_{j}, x_{i})$ does not exist.    

(iii) No two causal effects sizes can't be exactly equal, $w_{k}$ can not be equal to $w_{l}$, from the same causal variable,  meaning no simultaneous events caused by the same random variable. This prevents ambiguity of ordering and random tie-breaks are unnatural.

This definition is practical and do not introduce any graph theory jargon. We left the sizes of indices as an exercise. 

Inducing Causal Order via wDAGs

By the very definition of wDAGs, the power of this definition is one can construct causal ordering. 

Definition (Causal Ordering from wDAG): Given $\mathscr{G_{c}}$, we can construct causal ordering among random variates $O(i)$ for $x_{i}$ using directionality and weights from  $\mathscr{G_{c}}$:

(i) if there exist a triplet  $(w_{k}, x_{i}, x_{j})$, then ordering $x_{j} \succ x_{i}$, implies $x_{j}$ occurred before $x_{j}$, or cause of $x_{i}$ was $x_{j}$

(ii) if there are two or more triplets having the same first variates, ordering is induces by the effect size $w_{k}$ among them.

To provide a simple example, let's say we formed  a wDAG, $\mathscr{G_{c}} = \{ (0.1, x_{1}, x_{2}),(0.2, x_{1}, x_{3}), (1.1, x_{2}, x_{4})  \}$ then the following causal ordering is established  $x_{1} \succ x_{3} \succ x_{2} \succ x_{4}$, note the ordering of $x_{3}$ that took precedence on $x_{2}$ due to its weight.

Why LLMs with causal ordering are so successful?

Probably not very well spelled property of LLMs are having causal layers with deep learning elevating their ability to capture causal ordering in natural language so well, not only sequence. This is still in infancy from research perspective as LLMs are biologically not plausible engineered software systems act as lossy knowledge compressors, lossy part usually identified as hallucination. 


We introduce basic definition of wDAGs without heavy graph theory jargon and provide hints on why causal ordering with wDAGs has an immense contribution in constructing useful LLMs.

Further reading

  • looper : Causality Resource List. [link]
  • Pearl et. al. Causal Inference in Statistics [link]
  • Shimizu et. al., A Linear Non-Gaussian Acyclic Model for Causal Discovery [link]
  • Hamad, et. al, Dilated causal convolution with multi-head self attention for sensor human activity recognition [link].
  • Lui et. al DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks [link]
Please cite as follows:

     title = {Practical causal ordering: Why weighted-Directed Acyclic Graphs (DAGs) are powerful for causal inference? }, 
     howpublished = {\url{}}, 
     author = {Mehmet Süzen},
     year = {2023}

Tuesday, 22 June 2021

Full cross-validation and generating learning curves for time-series models

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro


Time-series analysis is needed almost in any quantitative field and real-life systems that collects data over time, i.e., temporal datasets. Building predictive models on temporal datasets for future evolution of systems in consideration are usually called forecasting. The validation of such models deviates from the standard holdout method of having random disjoint splits of train, test and validation sets used in supervised learning. This stems from the fact that time-series are ordered and order induces all sorts of statistical properties that should be retained. For this reason, applying direct cross-validation to time-series model building is not possible and only restricted to out-of-sample (OOS) validation, using the end-tail of a temporal set as a single test set. A recent work proposed an approach that overcomes the known limitation achieving full cross-validation for time-series. The approach opens up for a possibility to produce learning curves for the time-series models as well, which is usually also not possible due to similar reasons.

Reconstructive Cross-validation (rCV) : A meta-algorithm design principles

rCV is proposed recently in the paper titled Generalised learning of time-series: Ornstein-Uhlenbeck processes. The design principles of rCV for time-series aims at the following principles:

   Figure 1 : rCV meta-algorithm for time-series
cross-validation and learning curves.

  1. Logically close to standard cross-validation: Arbitrary test-set size and number of folds. 
  2. Preserve correlations and data order. 
  3. Does not create absurdity of predicting past from the future data. 
  4. Applicable in generic fashion regardless of learning algorithm. 
  5. Applicable to multi-dimensional time-series. 
  6. Evaluation metric agnostic.

Idea of introducing missing-data  : Temporal cross-validation and learning curves

The key idea of rCV is to create cross-validation sets via creating missing-data sets K-times, as in K-fold, with a given degree of missing ratio, i.e., random data point removal. Each fold will have disjoint set of missing data points. By an imputation method, we would fill out the K-disjoint missing data sets and generate K-different training datasets.  This would allow us to have K-different models and we could measure the generalised performance of the modelling approach by testing the primary models prediction on the Out-of-sample (OOS) test set. To avoid confusion about what is a model?, what we are trying to achieve is to find out hypothesis, i.e., the modelling approach.  By changing the ratio of missing data and repeating the cross-validation exercise will yield to set of ratio of missing-missing data introduced and their corresponding rCV errors, the plot is nothing but a learning-curve from supervised learning perspective.  Note that the imputation and prediction models are different models. The primary model we are trying to build is the prediction model we used for producing OOS predictions. The procedure is summarised in Figure 1. 

    Figure 2 : Synthetic data and reconstructions. 

Show-case with Gaussian process models on Ornstein-Uhlenbeck processes

To demonstrate the utility of rCV, the mentioned paper uses a synthetic data generated by Ornstein-Uhlenbeck process, i.e., Gaussian process with certain parameter setting.  Figure 2, shows the synthetic data and example locations of generated missing-data sets's reconstruction errors. Figure 3 shows learning curves depending on the different ratios of missing data setting. 

    Figure 3: Learning curves for the Gaussian Process model
generated by rCV.


rCV provides logically consistent way of practicing cross-validation in time-series. It is usually not possible to produce learning-curves on the same time-window for time-series model: by using rCV with different ratio missing data achieves this as well. rCV paves way to do generalised learning for time-series.

Further Reading

Apart from the paper Generalised learning of time-series: Ornstein-Uhlenbeck processes. the results can be reproduced with the Python prototype implementation,  here.

Tuesday, 29 December 2020

Practice causal inference: Conventional supervised learning can't do inference

This is a bit philosophical but goes into causal inference.

A trained model may provide predictions about input values it may never seen before but it isn't an inference, at least for 'classical' supervised learning. In reality it provides an interpolation from the training-set, i.e., via function approximation. By "inference implies going beyond training data", reference to distributional shift, compositional learning or similar type of learning should have been raised. 

In the case of ontology inference, ontology being a causal graph, that is a "real" inference as it symbolically traverse a graph of causal connections. Not sure if we can directly transfer that to regression scenario but probably it is possible with altering our models with SCMs and hybrid symbolic-regression approach. 

  • Looper repo provides a resource list for causal inference looper 
  • Thanks to Patrick McCrae for invoking ontology inference comparison.

Sunday, 1 November 2020

Gems of data science: 1, 2, infinity


Figure: George Gamow's book. (Wikipedia)
Problem-solving is the core activity of data science using scientific principles and evidence. On our side, there is an irresistible urge to solve the most generic form of the problem. We do this almost always from programming to formulation of the problem. But, don't try to solve a generalised version of the problem. Solve it for N=1 if N is 1 in your setting, not for any integer: Save time and resources and try to embed this culture to your teams and management. Extent later when needed on demand.

Solving for N=1 is sufficient if it is the setting

This generalisation phenomenon manifests itself as an algorithmic design: From programming to problem formulation, strategy and policy setting. The core idea can be expressed as mapping, let's say the solution to a problem  is a function, mapping from one domain to a range 

$$ f : \mathbb{R} \to \mathbb{R} $$

Trying to solve for the most generic setting of the problem, namely multivariate setting

$$ f : \mathbb{R}^{m} \to \mathbb{R}^{n} $$

where $m, n$ are the integers generalising the problem.  


It is elegant to solve a generic version of a problem. But is it really needed? Does it reflect reality and would be used? If N=1 is sufficient, then try to implement that solution first before generalising the problem. An exception to this basic pattern would be if you don't have a solution at N=1 but once you move larger N that there is a solution: you might think this is absurd, but SVM works exactly in this setting by solving classification problem for disconnected regions.


  • The title intentionally omits three, while it is a reference to Physics's inability to solve, or rather a mathematical issue of the three-body problem.

(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License