Saturday, 9 September 2023

Practical causal ordering:
Why weighted-Directed Acyclic Graphs (DAGs) are powerful for causal inference?

Preamble 

Fractal Tree (Wikipedia)
A quiet causal inference revolution is underway in industry. We see the immense success of transformer deep learning architectures. However, their success should also be attributed to causal modelling. Large Language Models (LLMs), especially closed-source ones, elevate their performance by encoding causal mechanisms with human-designed deep learning components, i.e., innovative layers such as causal convolutions with multi-head self-attention layers. A classical approach, one of the cornerstones of causality, is to express the causal relationships among modelling variates with Directed Acyclic Graphs (DAGs). This leads to causal analysis and to preserving ordering. In this short tutorial, we cover these without graph-theoretic language, for practical causal ordering.

Understanding weighted Directed Acyclic Graphs (wDAGs) as a causal data structure

We first define what directions and weights are, providing a notational definition via tuples of objects.

Definition (wDAG): A weighted Directed Acyclic Graph (wDAG) $\mathscr{G_{c}}$ is defined as a set of ordered triplets of weights and connected random variables, such that the $k$th triplet is $(w_{k}, x_{i}, x_{j})$, where $w_{k} \in \mathbb{R}$ is the weight, an effect size, between two variates such that $x_{i}$ affects $x_{j}$. There are constraints:

(i) No cyclic effects can be defined; necessarily, $x_{i}$ cannot be equal to $x_{j}$.

(ii) If a triplet $(w_{k}, x_{i}, x_{j})$ is defined, the reverse cannot be defined, i.e., $(w_{k}, x_{j}, x_{i})$ does not exist.

(iii) No two causal effect sizes from the same causal variable can be exactly equal, i.e., $w_{k}$ cannot be equal to $w_{l}$ for two triplets sharing the same first variate, meaning no simultaneous events are caused by the same random variable. This prevents ambiguity in the ordering, since random tie-breaks are unnatural.

This definition is practical and does not introduce any graph theory jargon. We leave the sizes of the indices as an exercise.
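The definition above maps naturally onto a plain list of tuples. The following is a minimal Python sketch, not a reference implementation: the function name `validate_wdag` and the error messages are hypothetical, and the depth-first search used to catch cycles longer than two variates is one common choice among several. Repeated triplets for the same pair of variates are not checked, since the definition does not mention them.

```python
def validate_wdag(triplets):
    """Check constraints (i)-(iii) on a list of (w, cause, effect) triplets.

    Returns True if all constraints hold, raises ValueError otherwise.
    """
    edges = set()            # (cause, effect) pairs seen so far
    weights_by_cause = {}    # cause -> set of weights already used
    children = {}            # cause -> list of effects

    for w, cause, effect in triplets:
        # (i) special case: a variate cannot affect itself.
        if cause == effect:
            raise ValueError(f"(i) self-effect on {cause}")
        # (ii) the reverse direction must not be defined.
        if (effect, cause) in edges:
            raise ValueError(f"(ii) reverse of ({cause}, {effect}) is defined")
        # (iii) weights from the same cause must be distinct.
        used = weights_by_cause.setdefault(cause, set())
        if w in used:
            raise ValueError(f"(iii) duplicate weight {w} from {cause}")
        used.add(w)
        edges.add((cause, effect))
        children.setdefault(cause, []).append(effect)

    # (i) general case: depth-first search for cycles of any length.
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "visiting":
            raise ValueError(f"(i) cyclic effect through {node}")
        if state.get(node) == "done":
            return
        state[node] = "visiting"
        for child in children.get(node, []):
            visit(child)
        state[node] = "done"

    for node in children:
        visit(node)
    return True


# A small wDAG with four variates and three triplets.
g_c = [(0.1, "x1", "x2"), (0.2, "x1", "x3"), (1.1, "x2", "x4")]
validate_wdag(g_c)
```

Weights are compared with exact equality here, following the wording "exactly equal" in constraint (iii); with measured effect sizes a tolerance might be more appropriate.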

Inducing Causal Order via wDAGs

The power of this definition is that one can construct a causal ordering directly from a wDAG.

Definition (Causal Ordering from wDAG): Given $\mathscr{G_{c}}$, we can construct a causal ordering among the random variates, $O(i)$ for $x_{i}$, using the directionality and weights from $\mathscr{G_{c}}$:

(i) if there exists a triplet $(w_{k}, x_{i}, x_{j})$, then the ordering is $x_{i} \succ x_{j}$, which implies that $x_{i}$ occurred before $x_{j}$, i.e., the cause of $x_{j}$ was $x_{i}$.

(ii) if there are two or more triplets having the same first variate, the ordering among them is induced by the effect size $w_{k}$, with the larger effect size taking precedence.


To provide a simple example, let's say we formed a wDAG, $\mathscr{G_{c}} = \{ (0.1, x_{1}, x_{2}),(0.2, x_{1}, x_{3}), (1.1, x_{2}, x_{4})  \}$. Then the following causal ordering is established: $x_{1} \succ x_{3} \succ x_{2} \succ x_{4}$. Note that $x_{3}$ takes precedence over $x_{2}$ due to its larger weight.
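The example can be reproduced with a few lines of Python. This is a sketch under stated assumptions rather than the definitive procedure: the function name `causal_order` is hypothetical, and the traversal (a breadth-first topological sort in which the effects of each cause are visited in descending weight order) is one way to combine rules (i) and (ii). Where the two rules leave the relative position of two variates open, for instance $x_{3}$ versus $x_{4}$, the queue order decides.

```python
from collections import defaultdict, deque


def causal_order(triplets):
    """Return a causal ordering of variates from (w, cause, effect) triplets."""
    children = defaultdict(list)   # cause -> [(w, effect), ...]
    in_degree = defaultdict(int)   # effect -> number of causes not yet placed
    nodes = []                     # variates in order of first appearance

    for w, cause, effect in triplets:
        children[cause].append((w, effect))
        in_degree[effect] += 1
        for v in (cause, effect):
            if v not in nodes:
                nodes.append(v)

    # Start from variates that have no cause of their own.
    queue = deque(v for v in nodes if in_degree[v] == 0)
    order = []
    while queue:
        cause = queue.popleft()
        order.append(cause)
        # Rule (ii): among effects of the same cause, larger weight goes first.
        for w, effect in sorted(children[cause], reverse=True):
            in_degree[effect] -= 1
            # Rule (i): an effect is placed only after all of its causes.
            if in_degree[effect] == 0:
                queue.append(effect)

    if len(order) != len(nodes):
        raise ValueError("cyclic effects found: not a valid wDAG")
    return order


g_c = [(0.1, "x1", "x2"), (0.2, "x1", "x3"), (1.1, "x2", "x4")]
print(" ≻ ".join(causal_order(g_c)))  # prints: x1 ≻ x3 ≻ x2 ≻ x4
```

Because constraint (iii) forbids equal weights from the same cause, the sort over `(w, effect)` pairs never has to break a tie on the weight.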

Why are LLMs with causal ordering so successful?

A property of LLMs that is probably not well articulated is that causal layers within their deep learning architectures elevate their ability to capture causal ordering in natural language, not only sequence. This is still in its infancy from a research perspective, as LLMs are biologically implausible engineered software systems that act as lossy knowledge compressors; the lossy part is usually identified as hallucination.
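To make the idea of a causal layer a little more concrete, here is a toy sketch of causally masked attention weights written in plain Python. It is an illustration only and not a description of any particular LLM: the function names `causal_attention_weights` and `attention_to_triplets` and the score values are hypothetical. Reading the resulting weights as $(w_{k}, x_{i}, x_{j})$ triplets is an interpretation, in which an earlier position $x_{i}$ can affect a later position $x_{j}$ but never the reverse.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where position j only sees positions 0..j."""
    n = len(scores)
    weights = []
    for j in range(n):
        visible = scores[j][: j + 1]       # causal mask: drop later positions
        peak = max(visible)                # subtract the max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (later) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - j - 1))
    return weights


def attention_to_triplets(weights, names):
    """Read off (w, x_i, x_j) for i < j: earlier x_i affects later x_j."""
    n = len(names)
    return [(weights[j][i], names[i], names[j])
            for j in range(n) for i in range(j)]


# Hypothetical raw attention scores (rows: queries, columns: keys).
scores = [
    [0.2, 1.5, -0.3, 0.8],
    [1.0, 0.1, 0.4, -1.2],
    [-0.5, 0.9, 0.3, 0.7],
    [0.6, -0.4, 1.1, 0.0],
]
names = ["x1", "x2", "x3", "x4"]
weights = causal_attention_weights(scores)
triplets = attention_to_triplets(weights, names)
for w, cause, effect in triplets:
    print(f"({w:.3f}, {cause}, {effect})")
```

Since every triplet points from an earlier position to a later one, no cycle or reverse triplet can arise. Distinct weights from the same variate, as required by constraint (iii), are not guaranteed by the softmax and would have to be checked separately.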

Conclusion

We introduced a basic definition of wDAGs without heavy graph theory jargon and provided hints on why causal ordering with wDAGs makes an immense contribution to constructing useful LLMs.

Further reading

  • looper: Causality Resource List [link]
  • Pearl et al., Causal Inference in Statistics [link]
  • Shimizu et al., A Linear Non-Gaussian Acyclic Model for Causal Discovery [link]
  • Hamad et al., Dilated causal convolution with multi-head self attention for sensor human activity recognition [link]
  • Luo et al., DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks [link]
Please cite as follows:

 @misc{suezen23pco, 
     title = {Practical causal ordering: Why weighted-Directed Acyclic Graphs (DAGs) are powerful for causal inference? }, 
     howpublished = {\url{https://memosisland.blogspot.com/2023/09/causal-ordering-dags-.html}}, 
     author = {Mehmet Süzen},
     year = {2023}
}  

Postscript Notes
Last Update 30 Oct 2024
  • Transformers mathematically mimic wDAGs, i.e., they induce causal ordering among distributed representations via “Attention”, that is, by weighting inputs and linking them to different types of output sequences.
  • Directed Acyclic Graphs (DAGs) are actually built implicitly by transformer architectures. This leads to causal analysis and to preserving the ordering of the most meaningful word embeddings at an empirical level, i.e., heuristic causality bound by a data-driven model with the inductive bias of the modeller (LLM trainer). Strikingly, it is possible to express these relationships using triplet notation without graph theory, and that is what attention layers store rather than a graph.
  • The primary driver here is not only that attention can be placed on certain portions of the embedded vector space to approximate the separation of meaning. These data processing instructions actually build causal mappings among sub-spaces, i.e., a causal ordering among embedding sub-spaces. In short, transformers are heuristically executing causal discovery.
  • Transformers in the context of machine translation perform 'causal graph discovery' heuristically, inducing causal ordering. That is why they exceeded previous translation architectures.
    • Grateful to see Judea Pearl's kind note on this via Twitter: "What's the input? Statistical data or human authored texts? What are the assumptions behind the discovery?"
      • This is critical so as not to jump to the conclusion that these neural network layers can do causal discovery automatically; rather, they do so under human-supervised conditions. The answer was, "Assumption is using an established causal model -human text/translation, directionality is already established. Then transformer compute weights on this graph. Other enabler is word embeddings & large examples. Can't do causal inference as in SCMs, still a statistical model."
      • Self-attention for LLMs acts as a causal discovery machine over human experience

        The success of the transformer deep learning architecture can also be attributed to causal modelling. But how? Probably the most prominent application of transformers is machine translation, where translation models are trained on human-translated data.

        Causality is inbuilt in these datasets, and transformers behave as causal discovery layers. We could imagine the weights on the translation matrix and other query matrices as DAGs, as the connection between connectivity matrices and graphs is well known. Directionality is dictated by asymmetric weights. This manifests as causal analysis preserving the ordering of the most meaningful word embeddings empirically, i.e., heuristic causality. This can be demonstrated by triplets without graph-theoretic notation, which would induce causal ordering discovery.

        Causal set theory is also quite striking in quantum gravity. DAGs appear as partially ordered sets of Planck-scale relativistic events within discrete space-time. See the monograph by Benjamin Dribus, Discrete Causal Theory, Springer (2017).