Showing posts with label causality. Show all posts

Saturday, 24 February 2024

Inducing time-asymmetry on reversible classical statistical mechanics via Interventional Thermodynamic Ensembles (ITEs).

Preamble 

Probably one of the most fundamental issues in classical statistical mechanics is extending reversible dynamics to many-particle systems that behave irreversibly. In other words, how does time's arrow appear even though the constituent systems evolve under reversible dynamics? This is the essence of Loschmidt's paradox. A resolution to this paradox lies in something called interventional thermodynamic ensembles (ITEs).

Leaning Tower of Pisa: Recall Galileo's experiments (Wikipedia)

Time-asymmetry is about different histories: Counterfactual dynamics

Before trying to understand how ITEs resolve Loschmidt's paradox, we note that inducing a different trajectory on an identical dynamical system in "a parallel universe" implies time-asymmetry, while each trajectory itself remains reversible. The "parallel universe" here means imagining a different dynamical history via sampling; this corresponds to counterfactuals in causal inference frameworks.

Interventional Thermodynamic Ensembles (ITEs)

An interventional ensemble builds upon another ensemble. For simplicity, we can think of an ensemble as an associated, chosen sampling scheme. From this perspective, a sampling scheme $\mathscr{E}$ has an interventional counterpart $do(\mathscr{E})$ if the adjusted scheme introduces only a change that leaves the inherent dynamics intact but affects the dynamical history. One of the first examples of this appeared recently: single-spin-flip vs. dual-spin-flip dynamics [suezen23], demonstrated with simulations.
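The intervention can be sketched in a few lines. The following is a minimal illustration, not the simulation code of [suezen23]: a 1D Ising-like chain under Metropolis dynamics, where the base ensemble proposes single-spin flips and the interventional ensemble $do(\mathscr{E})$ proposes dual-spin flips; the acceptance rule (the inherent dynamics) is unchanged, only the sampling scheme differs.

```python
import math
import random

def chain_energy(spins):
    """Nearest-neighbour Ising energy with periodic boundaries."""
    n = len(spins)
    return -sum(spins[i] * spins[(i + 1) % n] for i in range(n))

def metropolis_step(spins, beta, flip_sites, rng):
    """Propose flipping flip_sites and accept with the Metropolis rule;
    the acceptance rule (the 'inherent dynamics') is identical for both
    ensembles -- only the proposal scheme differs."""
    trial = list(spins)
    for i in flip_sites:
        trial[i] = -trial[i]
    dE = chain_energy(trial) - chain_energy(spins)
    if dE <= 0 or rng.random() < math.exp(-beta * dE):
        return trial
    return spins

def trajectory(n=16, steps=200, beta=0.8, dual=False, seed=7):
    """dual=False is the base ensemble E (single-spin-flip);
    dual=True is the interventional ensemble do(E) (dual-spin-flip)."""
    rng = random.Random(seed)
    spins = [rng.choice([-1, 1]) for _ in range(n)]
    history = []
    for _ in range(steps):
        sites = rng.sample(range(n), 2) if dual else [rng.randrange(n)]
        spins = metropolis_step(spins, beta, sites, rng)
        history.append(sum(spins))  # magnetisation as a coarse summary
    return history

single = trajectory(dual=False)
dual = trajectory(dual=True)
```

Running both from the same seed, hence the same initial configuration, yields different magnetisation histories: the "parallel universe" trajectories discussed above.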

Outlook

Reversibility and time-asymmetry in classical dynamics are long-standing issues in physics. Introducing a causal inference perspective into computing the dynamical evolution of many-body systems, i.e., an interpretation of the $do$-operator, leads to a reconciliation of reversibility and time-asymmetry.

References

[suezen23] M. Süzen, H-theorem do-conjecture (2023), arXiv:2310.01458 (simulation code on GitHub).

Please Cite as:

 @misc{suezen24ite, 
     title = {Inducing time-asymmetry on reversible classical statistical mechanics via  Interventional Thermodynamic Ensembles (ITEs)}, 
    howpublished = {\url{https://memosisland.blogspot.com/2024/02/inducing-time-asymmetry-on-reversible.html}}, 
     author = {Mehmet Süzen},
     year = {2024}
}  





Saturday, 25 November 2023

Why should there be no simultaneity rule for causal models?

Dominos in motion
(Wikipedia)

Preamble
 

The definition of weighted directed acyclic graphs (wDAGs) provides a great opportunity to express causal relationships among given variates. Usually this is expressed as an SCM, a Structural Causal Model, or more generally a causal model. A given causal model can be expressed as a set of simultaneous equations, given a direction for the equality from right to left, meaning $A=B$ implies B causes A to happen, $B \to A$. Then what happens if A is a function of B and C, $A=f(B,C)$? We would say $B \to A$ and $C \to A$ occur simultaneously. In this post we discuss why causal models should obey a no-simultaneity rule, regardless of whether they are time-series models.

Understanding  causal models

The basic definition of a causal model follows a functional form with a set of equations, realistically with added noise. Visually, the model forms a weighted Directed Acyclic Graph (wDAG). Here is the mathematical definition due to Pearl (Causality, 2009), stated here a bit more coarsely:

Definition (Causal Model): Given a set of $n$ variables $X \in \mathbb{R}^{n}$ and two subsets $X = x_{1} \cup x_{2}$, they can form a set of equations $x_{2}=f(x_{1}; \alpha; \epsilon)$, with $\alpha$ the causal effect sizes of the causes $x_{1}$ on $x_{2}$ and $\epsilon$ some noise. This corresponds to a wDAG formed among $X$ with weights $\alpha$, so that there is a graph $\mathscr{G}(X, \alpha)$ representing this set of equations, where each equality is directed from the right-hand side to the left-hand side.
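As a minimal illustration of the definition, here is a sketch of the linear case $x_{2} = \alpha \cdot x_{1} + \epsilon$; the variable names B and C and their effect sizes are hypothetical:

```python
import random

def linear_scm(alpha, n_samples=1000, noise_sd=0.1, seed=0):
    """Simulate x2 = f(x1; alpha; eps) for the linear case
    f = alpha . x1 + eps, matching the definition above.
    alpha maps each cause's name to its causal effect size."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x1 = {name: rng.gauss(0, 1) for name in alpha}   # causes
        eps = rng.gauss(0, noise_sd)                      # noise term
        x2 = sum(alpha[name] * x1[name] for name in alpha) + eps
        data.append((x1, x2))
    return data

# Hypothetical effect sizes: B -> A with weight 0.8, C -> A with 0.3
samples = linear_scm({"B": 0.8, "C": 0.3})
```

The weights in `alpha` are exactly the edge weights of the corresponding wDAG $\mathscr{G}(X, \alpha)$.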

However, this definition does not set any constraints on the values of $\alpha$. Any two or more $\alpha$ values can be equal on the same path within $X$. Interestingly, this implies that a set of variates could simultaneously cause the same thing. This sounds plausible and physically possible to a degree, within Planck-time resolution. However, it brings in the ambiguity of breaking ties when ordering events.

Perfect Causal Ordering

A wDAG given as a causal model induces a causal ordering among all members of $X$, as we defined in a recent post: Practical causal ordering. In this context, perfect causal ordering requires that the $\alpha$ values within the first-order paths into a given end variable are all different. Mathematically, the definition follows.

Definition (No simultaneity rule): Given all $k$ triplets $(x_{i}, y, \alpha_{i})$, where $x_{i}$ is one of the causes of $y$ and $\alpha_{i}$ is its causal effect size, all $\alpha_{i}$ are distinct numbers, inducing a perfect causal ordering.

This rule ensures we don't need to break ties randomly, as the causal ordering is established uniquely.
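The rule can be checked mechanically over triplets; in the sketch below, the helper name and the triplet encoding are illustrative, not from any library:

```python
def satisfies_no_simultaneity(triplets):
    """Check the rule above: among all triplets (x_i, y, alpha_i)
    sharing the same effect y, all effect sizes alpha_i must be
    distinct, so a perfect causal ordering exists without random
    tie-breaking."""
    by_effect = {}
    for cause, effect, alpha in triplets:
        by_effect.setdefault(effect, []).append(alpha)
    return all(len(alphas) == len(set(alphas))
               for alphas in by_effect.values())

# ("B", "A", 0.8) reads: B causes A with effect size 0.8
ok = satisfies_no_simultaneity([("B", "A", 0.8), ("C", "A", 0.3)])
bad = satisfies_no_simultaneity([("B", "A", 0.5), ("C", "A", 0.5)])
```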

Conclusion: Importance of no simultaneity 

By this definition we have ruled out any simultaneous causes. This may sound too restrictive for modelling, but it impacts decision making significantly: ranking the causes of an outcome determines how to prioritise policy in addressing that outcome, e.g., a medical intervention targeting the primary cause first. Moreover, it may not be feasible to intervene on simultaneous causes. Hence, establishing primary causes in order is paramount for decision making and for the execution of any reliable policy.


Further reading

Please cite this article as: 

 @misc{suezen23nos, 
     title = {Why should there be no simultaneity rule for causal models?}, 
    howpublished = {\url{https://memosisland.blogspot.com/2023/11/causal-model-simultaneous.html}}, 
     author = {Mehmet Süzen},
     year = {2023}
}
  





Saturday, 9 September 2023

Practical causal ordering:
Why weighted-Directed Acyclic Graphs (DAGs) are powerful for causal inference?

Preamble 

Fractal Tree (Wikipedia)
A quiet causal inference revolution is underway in industry. We see the immense success of transformer deep learning architectures; however, their success should also be attributed to causal modelling. Large Language Models (LLMs), especially closed-source ones, elevate their performance by encoding causal mechanisms with human-designed deep learning components, i.e., innovative layers such as causal convolutions with multi-head self-attention. A classical approach, one of the cornerstones of causality, is expressing the causal relationships among modelled variates with Directed Acyclic Graphs (DAGs). This enables causal analysis and preserves ordering. In this short tutorial, we cover these topics without graph-theoretic language, for practical causal ordering.

Understanding weighted Directed Acyclic Graphs (wDAGs) as causal data structure  

We first define directions and weights, providing a notational definition via tuples of objects.

Definition (wDAG): A weighted Directed Acyclic Graph (wDAG) $\mathscr{G_{c}}$ is defined as a set of ordered triplets of weights and connected random variables, such that the $k$th triplet $(w_{k}, x_{i}, x_{j})$ has weight $w_{k} \in \mathbb{R}$, an effect size between two variates, where $x_{i}$ affects $x_{j}$. There are constraints:

(i) No cyclic effects can be defined; necessarily $x_{i}$ cannot equal $x_{j}$.

(ii) If a triplet $(w_{k}, x_{i}, x_{j})$ is defined, the reverse cannot be, i.e., $(w_{k}, x_{j}, x_{i})$ does not exist.

(iii) No two causal effect sizes from the same causal variable can be exactly equal; $w_{k}$ cannot equal $w_{l}$, meaning no simultaneous events are caused by the same random variable. This prevents ambiguity in ordering, as random tie-breaks are unnatural.

This definition is practical and does not introduce any graph-theory jargon. We leave the sizes of the indices as an exercise.
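A minimal sketch of checking constraints (i)-(iii) on a set of triplets follows; the helper is hypothetical, and note that full acyclicity over longer paths would additionally require a cycle check beyond constraints (i) and (ii) as stated.

```python
def valid_wdag(triplets):
    """Check constraints (i)-(iii) on triplets (w_k, x_i, x_j),
    where x_i affects x_j with weight w_k."""
    edges = {(i, j) for _, i, j in triplets}
    # (i) no self-effects
    if any(i == j for _, i, j in triplets):
        return False
    # (ii) no reversed edge may coexist
    if any((j, i) in edges for _, i, j in triplets):
        return False
    # (iii) effect sizes out of the same causal variable are distinct
    by_cause = {}
    for w, i, _ in triplets:
        by_cause.setdefault(i, []).append(w)
    return all(len(ws) == len(set(ws)) for ws in by_cause.values())

good = valid_wdag([(0.1, "x1", "x2"), (0.2, "x1", "x3"),
                   (1.1, "x2", "x4")])
cyclic = valid_wdag([(0.1, "x1", "x2"), (0.3, "x2", "x1")])
```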

Inducing Causal Order via wDAGs

The power of this definition is that, by the very construction of a wDAG, one can derive a causal ordering.

Definition (Causal Ordering from wDAG): Given $\mathscr{G_{c}}$, we can construct causal ordering among random variates $O(i)$ for $x_{i}$ using directionality and weights from  $\mathscr{G_{c}}$:

(i) if there exists a triplet $(w_{k}, x_{i}, x_{j})$, then the ordering is $x_{i} \succ x_{j}$, implying $x_{i}$ occurred before $x_{j}$, or that the cause of $x_{j}$ was $x_{i}$;

(ii) if there are two or more triplets having the same first variate, the ordering among them is induced by the effect sizes $w_{k}$.


To provide a simple example, say we formed a wDAG $\mathscr{G_{c}} = \{ (0.1, x_{1}, x_{2}), (0.2, x_{1}, x_{3}), (1.1, x_{2}, x_{4}) \}$; then the causal ordering $x_{1} \succ x_{3} \succ x_{2} \succ x_{4}$ is established. Note that $x_{3}$ took precedence over $x_{2}$ due to its larger weight.
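The worked example can be reproduced mechanically; the following sketch derives the ordering depth-first from the triplets, with stronger effects visited first (the function name is illustrative):

```python
def causal_order(triplets):
    """Derive a causal ordering from wDAG triplets (w, cause, effect):
    causes precede effects; effects of the same cause are ordered by
    descending weight (larger effect size takes precedence)."""
    order = []

    def visit(node):
        if node in order:
            return
        order.append(node)
        # children of node, strongest effect first
        children = sorted(((w, j) for w, i, j in triplets if i == node),
                          reverse=True)
        for _, child in children:
            visit(child)

    # roots: causes that never appear as an effect
    effects = {j for _, _, j in triplets}
    roots = [i for _, i, _ in triplets if i not in effects]
    for root in dict.fromkeys(roots):  # deduplicate, keep order
        visit(root)
    return order

g = [(0.1, "x1", "x2"), (0.2, "x1", "x3"), (1.1, "x2", "x4")]
ordering = causal_order(g)  # ['x1', 'x3', 'x2', 'x4']
```

This reproduces the ordering above: $x_{3}$ precedes $x_{2}$ because its weight 0.2 exceeds 0.1.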

Why are LLMs with causal ordering so successful?

A property of LLMs that is probably not spelled out very often is that their causal deep learning layers elevate their ability to capture causal ordering in natural language, not only sequence. From a research perspective this is still in its infancy, as LLMs are biologically implausible engineered software systems acting as lossy knowledge compressors, with the lossy part usually identified as hallucination.

Conclusion

We introduced a basic definition of wDAGs without heavy graph-theory jargon and provided hints on why causal ordering with wDAGs contributes immensely to constructing useful LLMs.

Further reading

  • looper: Causality Resource List. [link]
  • Pearl et al., Causal Inference in Statistics [link]
  • Shimizu et al., A Linear Non-Gaussian Acyclic Model for Causal Discovery [link]
  • Hamad et al., Dilated causal convolution with multi-head self-attention for sensor human activity recognition [link]
  • Lui et al., DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks [link]
Please cite as follows:

 @misc{suezen23pco, 
     title = {Practical causal ordering: Why weighted-Directed Acyclic Graphs (DAGs) are powerful for causal inference? }, 
     howpublished = {\url{https://memosisland.blogspot.com/2023/09/causal-ordering-dags-.html}}, 
     author = {Mehmet Süzen},
     year = {2023}
}  

Postscript Notes
Last Update 30 Oct 2024
  • Transformers mathematically mimic wDAGs, i.e., they induce causal ordering among distributed representations via "attention", that is, by weighting inputs and linking them to different types of output sequences.
  • Directed Acyclic Graphs (DAGs) are actually built implicitly by transformer architectures. This leads to causal analysis preserving the ordering of the most meaningful word embeddings at an empirical level, i.e., heuristic causality bounded by a data-driven model with the inductive bias of the modeller (the LLM trainer). Strikingly, it is possible to express these relationships using triplet notation without graph theory, and that is what the attention layers store, rather than a graph.
  • The primary driver of this is not only that attention can focus on certain portions of the embedded vector space to approximate the separation of meaning. These data-processing instructions actually build causal mappings among sub-spaces, i.e., a causal ordering among embedding sub-spaces. In short, transformers heuristically execute causal discovery.
  • In the context of machine translation, transformers perform 'causal graph discovery' heuristically, inducing causal ordering. That is why they exceeded previous translation architectures.
    • Grateful to see Judea Pearl's kind note on this via Twitter: "What's the input? Statistical data or human authored texts? What are the assumptions behind the discovery?"
      • This is critical, so as not to jump to the conclusion that these neural network layers can do causal discovery automatically; rather, it happens under human-supervised conditions. The answer was: "Assumption is using an established causal model -human text/translation, directionality is already established. Then transformer compute weights on this graph. Other enabler is word embeddings & large examples. Can't do causal inference as in SCMs, still a statistical model."
      • Self-attention in LLMs acts as a causal discovery machine over human experience

        The success of the transformer deep learning architecture can also be attributed to causal modelling. But how? Probably the most prominent application of transformers is machine translation, with translation models trained on human-translated data.

        Causality is inbuilt in these datasets, and transformers behave as causal discovery layers. We can imagine the weights of the translation matrix and the other query matrices as DAGs, since the connection between connectivity matrices and graphs is well known. Directionality is dictated by asymmetric weights. This manifests as causal analysis preserving the ordering of the most meaningful word embeddings empirically, i.e., heuristic causality. It can be demonstrated by triplets without graph-theoretic notation, which induces causal ordering discovery.

        Causal set theory is also quite striking in quantum gravity: DAGs appear as partially-ordered sets of Planck-scale relativistic events within discrete space-time. See the monograph by Benjamin Dribus, Discrete Causal Theory, Springer (2017).


Tuesday, 29 December 2020

Practice causal inference: Conventional supervised learning can't do inference

Domino OR-gate (Wikipedia)
Preamble

A trained model may provide predictions for input values it has never seen before, but this is not inference, at least for 'classical' supervised learning. In reality it provides an interpolation from the training set, i.e., via function approximation:

Interpolation here doesn't mean that all predictions lie within the convex hull of the training set, but interpolation as in the numerical procedure of using the training data only.

What does inference mean?

By inference we imply "going beyond the training data", with reference to distributional shift, compositional learning, or a similar type of learning. This is especially apparent in an example: a human infant may learn how 3 is similar to 8, but without labels, supervised learning in a naïve setting cannot establish this, i.e., trained on an MNIST set without 8s, it cannot learn what an 8 is in plain form.
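A toy sketch of this limitation, using a nearest-centroid classifier on synthetic two-dimensional "digits" instead of MNIST (all names and data here are illustrative):

```python
import random

def train_nearest_centroid(samples):
    """'Classical' supervised learner: one centroid per seen label."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = [a + b for a, b in zip(sums.get(y, [0.0] * len(x)), x)]
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Interpolates among training labels only -- it can never emit a
    label absent from the training set (e.g. '8' held out)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: dist(centroids[y]))

rng = random.Random(0)
# synthetic 'digits': class k clusters around the point (k, k)
train = [([k + rng.gauss(0, 0.1), k + rng.gauss(0, 0.1)], k)
         for k in range(8) for _ in range(20)]   # labels 0..7, no 8
model = train_nearest_centroid(train)
pred_for_eight = predict(model, [8.0, 8.0])  # a true '8' at test time
```

The held-out class is simply not in the model's vocabulary, so the prediction for a true "8" falls back to the nearest seen label: interpolation, not inference.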

In the case of ontology inference, with the ontology being a causal graph, there is "real" inference, as it symbolically traverses a graph of causal connections.

Outlook

We might not be able to directly transfer this to the regression scenario, but it is probably possible by augmenting our models with SCMs and a hybrid symbolic-regression approach.

Postscript
  • Looper repo provides a resource list for causal inference looper 
  • Thanks to Patrick McCrae for invoking ontology inference comparison.
Cite

 @misc{suezen20pci, 
     title = {Practice causal inference: Conventional supervised learning can't do inference}, 
    howpublished = {\url{https://memosisland.blogspot.com/2020/12/practice-causal-inference-conventional.html}}, 
     author = {Mehmet Süzen},
     year = {2020}
}
  

Thursday, 10 January 2019

The fundamental problem of causal inference: causality resource list Looper

Preamble


One of the main tenets of practical data science is performing statistical inference on data sets that are assumed to be representative of populations of activity, natural or man-made. A specific case is causal inference, which is probably the core interest of decision makers and probably the main reason businesses and industry fund data science projects in the first place: either trying to find the cause of an event or understanding the impact of data science and AI products. A recent popular book, The Book of Why, once again put causal inference onto the mass market [pearl2018]. However, the nightmare for a data scientist is that there is a fundamental limitation to performing such inference. It stems from the fact that the conditions whereby a causal link is tested, i.e., establishing cause and effect, cannot occur simultaneously on a given data point. This is the so-called FPCI, the fundamental problem of causal inference [Holland1986], and it can only be resolved approximately. It implies that a data point cannot be both treated and not treated simultaneously. For example, a patient cannot both receive and not receive a treatment at the same time, and a customer cannot be both in a campaign and excluded from it at the same time. This post provides possible approaches to tackling FPCI, with pointers to resources and software tools. The core problem is a missing data issue, but note that the missingness here originates not from measurement error but from a fundamentally non-existent data point, so it is not a simple data imputation issue. A more extensive resource list, called looper, is available on GitHub: A resource list for causality in statistics, data science and physics [looper].
Figure: Looper the movie revolves
around the causal loop (Wikimedia)

Rubin causal model: A-null test with the inherent contradiction


The question of why, as in what the cause of an event is, is inherently a physics question and goes into the space-time concept of general relativity and classical mechanics in general. Popular time-travel movies such as Looper (see Figure) show that causality loops create curious, non-intuitive phenomena. From a data analysis perspective, the Rubin causal model [Rubin1974] asserts that the causal effect can be quantified with an A-null test. What does this mean? Whether the treatment, in the medical connotation, is the cause of the event, i.e., the effect or the outcome, can be quantified by the algebraic difference between the expected value of the outcome and its counterfactual, a fancy name for what the outcome would have been in the absence of the treatment. Unfortunately, this is inherently contradictory as mentioned above, the so-called FPCI, the fundamental problem of causal inference [Holland1986]. On a single sample, i.e., a single event, this A-null test cannot be applied because the data is missing causally. In estimating the A-null test, two groups are designed so that they are drawn from the same population, recall the law of large numbers. One group receives the treatment and the other does not, by design. Over time the effect of the treatment is measured on both groups and Rubin's A-null test is applied. Not surprisingly, they are called the control and treatment groups, and this procedure is called Average Treatment Effect (ATE). The mathematical exposition of this procedure is well established and can be found in standard texts (see looper); mathematically, the ATE reads $\mathbb{E}(Y_{i} | T_{i} = 1) - \mathbb{E}(Y_{i} | T_{i} = 0)$.
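A minimal sketch of the ATE estimator on simulated RCT data follows; the effect size and noise model are hypothetical:

```python
import random

def simulate_rct(n=2000, effect=2.0, seed=1):
    """Simulate a randomised trial: T assigned by coin flip, so the two
    groups are draws from the same population; Y = baseline + effect*T."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.randrange(2)                   # randomised treatment
        y = rng.gauss(10.0, 1.0) + effect * t  # observed outcome
        data.append((t, y))
    return data

def ate(data):
    """E(Y | T=1) - E(Y | T=0): Rubin's A-null test applied to group
    averages, since the per-sample counterfactual is missing (FPCI)."""
    treated = [y for t, y in data if t == 1]
    control = [y for t, y in data if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

estimate = ate(simulate_rct())  # close to the simulated effect of 2.0
```

Randomisation is what licenses replacing the impossible per-sample comparison with the group-level difference.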

Matching or balancing

Randomised controlled trials (RCTs) construct control and treatment groups. To compute the ATE robustly, the statistical properties of the covariates, the set of features used in the study, must be similar between control and treatment. Any technique that tries to address this balancing issue is called matching [stuart2010]. A prominent example in this direction is propensity score matching, using similarity distances between covariates, implemented in R via the Matching package based on genetic search.
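A simplified sketch of matching follows: greedy 1:1 nearest-neighbour pairing on a propensity-like score, rather than the genetic search of the Matching package (all data and names here are illustrative):

```python
import math
import random

def match_on_score(treated, control):
    """Greedy 1:1 nearest-neighbour matching on a propensity-like
    score: each treated unit is paired with the closest unused control
    unit, approximating covariate balance between the groups."""
    pairs, used = [], set()
    for score_t, y_t in treated:
        best = min((i for i in range(len(control)) if i not in used),
                   key=lambda i: abs(control[i][0] - score_t))
        used.add(best)
        pairs.append((y_t, control[best][1]))
    return pairs

rng = random.Random(3)

def unit(is_treated):
    """Toy confounded unit: a covariate x drives both treatment odds
    and the outcome; the 'propensity score' here is simply x squashed
    through a logistic function."""
    x = rng.gauss(1.0 if is_treated else 0.0, 1.0)
    score = 1.0 / (1.0 + math.exp(-x))
    y = x + (2.0 if is_treated else 0.0) + rng.gauss(0, 0.2)
    return (score, y)

treated = [unit(True) for _ in range(100)]
control = [unit(False) for _ in range(1000)]
pairs = match_on_score(treated, control)
matched_ate = sum(y_t - y_c for y_t, y_c in pairs) / len(pairs)
```

Because the groups are compared only through matched pairs, the confounding shift in x between the groups is largely removed from the estimate.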

Imputation approach: Causal inference as missing data problem

One way to overcome FPCI is to apply imputation to the missing data, predicting the counterfactual outcome value for a treated sample, and vice versa. Using advanced imputation techniques one can resolve FPCI approximately. One prominent example is Multiple Imputation, implemented in the R mice package.
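A single-imputation sketch of this idea: fit an outcome model on the control group and impute the missing counterfactual $Y_{i}(0)$ for each treated unit (a stand-in for full multiple imputation as in mice; the data are simulated):

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept -- the imputation model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

rng = random.Random(4)
# observed data: covariate x, treatment t, outcome y = x + 2*t + noise
data = [(x, t, x + 2.0 * t + rng.gauss(0, 0.2))
        for t in (0, 1) for x in [rng.gauss(0, 1) for _ in range(200)]]

# fit the Y(0) model on controls, impute the missing counterfactual
# Y_i(0) for every treated unit, then average the differences
slope, intercept = fit_line([x for x, t, _ in data if t == 0],
                            [y for x, t, y in data if t == 0])
effects = [y - (slope * x + intercept) for x, t, y in data if t == 1]
att = sum(effects) / len(effects)  # recovers the simulated effect
```

The per-unit difference between the observed outcome and its imputed counterfactual replaces the causally missing data point.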

Summary

We have reviewed the fundamental problem of causal inference (FPCI). For causality resources, a resource list called looper is available on GitHub, A resource list for causality in statistics, data science and physics; please send a pull request for additions. Note that causality is a large research area, spanning Bayesian networks, Pearl's do-calculus and uplift modelling.


References

[pearl2018] Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect,  Basic Books; 1 edition (May 15, 2018)
[Holland1986] Holland, Paul W. (1986). "Statistics and Causal Inference". J. Amer. Statist. Assoc. 81 (396): 945–960. 
[Rubin1974] Rubin DB (1974). “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology, 66, 688–701.
[stuart2010]  Elizabeth A. Stuart, Matching methods for causal inference: A review and a look forward Stat Sci. 2010 Feb 1; 25(1): 1–21.
[looper] A resource list for causality in statistics, data science and physics, https://github.com/msuzen/looper.
(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License