Monday, 25 July 2016

Economy and dynamic modelling: Haavelmo's approach

Updated on 25 August 2017
Preamble:
Predictions using dynamic modelling
Machine learning and neural networks are not the only ways to do data science or AI. There are other techniques to explore, for example from quantitative economics. Apart from game theory, dynamic modelling is suitable for many prediction problems, especially those with temporal datasets. Here is one example technique from the somewhat forgotten Norwegian Nobel prize winning economist and data scientist Trygve Haavelmo. An advantage of such models is that they can be explained; they are not fully black box. Make no mistake: for very large systems with large datasets this is not trivial to implement, and GPUs are well suited to the task. In this post we give one hands-on exercise.

Summary

Econometrics aims at estimating observables in the economy and their inter-dependencies, and at testing those estimates against economic reality. A quantitative approach expresses these inter-dependencies as simultaneous equations, i.e. a system of linear equations; this mathematical structure of economic relationships was made possible by the pioneering work of Nobel prize winning economist Trygve Haavelmo [1,2]. This approach and its dynamic variants are now used routinely in dynamic modelling of econometric systems. From a computational perspective, the R project provides an efficient and very rich computational environment, with a large set of extensions for econometrics in general [3].

Dynamic Modelling

The simplest relationship that can be constructed between two arbitrary economic variables, or instruments, $X(t)$ and $Y(t)$ was shown by Haavelmo [1]. For example, these variables could be the unemployment rate and Gross Domestic Product (GDP), as in Okun's law. The simplest bi-variate simultaneous system of equations reads as follows,

$X(t) =a Y(t) + \epsilon_{x}(t)$,
$Y(t) =b X(t) + \epsilon_{y}(t)$,

where $a$ and $b$ are constant coefficients, and $\epsilon_{x}$ and $\epsilon_{y}$ are non-deterministic disturbances that are not observed in the modelled economic system. Disturbances are usually expressed as random variables drawn from the normal distribution. At this stage, one can follow two approaches in doing economic scenario analysis for forecasting. We can aim at finding the coefficients $a$ and $b$ from economic data via dynamic regression, or, if $a$ and $b$ are known, we can study the effect of different disturbances over time.
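Note that the system can be solved jointly: provided $ab \neq 1$, substituting one equation into the other yields the reduced form,

$X(t) = \frac{\epsilon_{x}(t) + a\,\epsilon_{y}(t)}{1-ab}$,
$Y(t) = \frac{\epsilon_{y}(t) + b\,\epsilon_{x}(t)}{1-ab}$,

so both observables are driven entirely by the disturbances. This joint dependence is also why fitting either equation alone by ordinary least squares is biased, and why simultaneous-equation methods are used below.
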
Dynamic Haavelmo Model
A model of the propensity to consume in an economic system was given by Haavelmo [4], based on his analysis of US economic conditions between 1929 and 1941. His analysis leads to the following simultaneous system of equations,
$c(t)= \alpha y(t) + \beta + u(t)$,
$r(t)= \mu (c(t)+x(t)) +\nu + w(t)$,
$y(t)= c(t)+x(t)-r(t)$,
where $\alpha,\beta,\mu,\nu$ are constants and $u(t)$ and $w(t)$ are disturbances. The economic variables have the following meanings:

$c(t)$ : personal consumption expenditures,
$y(t)$ : personal disposable income,
$r(t)$ : gross business savings,
$x(t)$ : gross investment.

However, this model is considered static, since all relationships hold at the same time point. Zellner and Palm [5] provided a dynamic version of Haavelmo's model. Here we write down a version of it,

$c(t)= \alpha Dy(t) + \beta + u(t)$,
$r(t)= \mu D(c(t)+x(t)) +\nu + w(t)$,
$y(t)= c(t)+x(t)-r(t)$,

where $D$ is the first-difference operator, $Dy(t) = y(t)-y(t-1)$.
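As a quick illustration (a minimal sketch with made-up numbers), the difference operator corresponds to R's diff, padded with NA for the first period; the same padding idea is used to build the lag columns below:

# D is the first-difference operator: Dy(t) = y(t) - y(t-1)
y  <- c(100, 120, 115, 130)   # made-up series
Dy <- c(NA, diff(y))          # NA: no difference defined for the first point
Dy                            # NA  20  -5  15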

Krokozhia Case Study

Data

Krokozhia is a fictional country depicted in Steven Spielberg's movie The Terminal. Let's generate fictional data for our dynamic Haavelmo model's economic instruments for this country from 1950 to 2013 in R,

set.seed(4242)
# KH: Krokozhia Haavelmo Model
# Random levels for consumption (c), income (y), savings (r), investment (x)
KH.df <- data.frame(Year=seq(1950,2013),
                    c=sample(300:1500,64,replace=TRUE),
                    y=sample(1000:5000,64,replace=TRUE),
                    r=sample(100:500,64,replace=TRUE),
                    x=sample(100:500,64,replace=TRUE))
# Add lagged series for the difference operator D
KH.df$y.lag <- c(NA, KH.df$y[1:63])
KH.df$c.lag <- c(NA, KH.df$c[1:63])
KH.df$x.lag <- c(NA, KH.df$x[1:63])


Determining Constants

Since the equations are simultaneous, fitting each one by ordinary least squares would give biased estimates; an instrumental-variable method such as two-stage least squares is needed to fit the data and determine the constants of the dynamic model. The R package sem, developed by John Fox, can do such an analysis.

library(sem)
# Two-stage least squares
#   Eq1: c(t) = alpha*Dy(t) + beta + u(t)
#        beta is the intercept; u(t) enters as the residual
#   Eq2: r(t) = mu*D(c(t)+x(t)) + nu + w(t)
#        nu is the intercept; w(t) enters as the residual
KH.eq1 <- tsls(c~I(y-y.lag), ~c+y+r, data=KH.df)
KH.eq2 <- tsls(r~I(c-c.lag+x-x.lag), ~c+y+r, data=KH.df)
coef(KH.eq1)  # intercept beta=875.414, slope alpha=0.015
coef(KH.eq2)  # intercept nu=300.675,   slope mu=-0.028


Here tsls performs two-stage least squares estimation.
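As a usage note (assuming the tsls fits above), the fitted objects can be inspected with the usual accessors, and the equation residuals play the role of the unobserved disturbances:

# Standard errors and fit diagnostics for each equation
summary(KH.eq1)
summary(KH.eq2)
# The residuals approximate the disturbances u(t) and w(t)
head(residuals(KH.eq1))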

Propagating a disturbance in the economy

We have not used any disturbances in determining the system coefficients above. However, given the fitted dynamic model, we can propagate the values of the economic observables forward once we set a disturbance value at a given time. Imagine that in the year 2001 we set the disturbances to $u=200$ and $w=150$. The dynamic model at $t=2001$ then reads

     $c = 0.015\,(y-2570) + 875.414 + 200$,
     $r = -0.028\,(c+x-1281-479) + 300.675 + 150$,
     $y = c+x-r$,

Treating investment $x$ as exogenous (taken from the data), this becomes a small linear system in $c$, $r$ and $y$; solving it determines the values in the year 2001, and the dynamics after 2001 can be propagated in the same way. The resulting new series gives us a quantitative idea of the effect of a single disturbance in the simulated economy.
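A minimal sketch of this step, assuming the coefficient values above and an assumed investment figure for 2001 (x1 below is set arbitrarily for illustration): rewrite the three equations in matrix form $Az = b$ for $z = (c, r, y)$ and solve with R's solve,

alpha <- 0.015;  beta <- 875.414   # from coef(KH.eq1)
mu    <- -0.028; nu   <- 300.675   # from coef(KH.eq2)
u <- 200; w <- 150                 # disturbances imposed in 2001
y0 <- 2570; c0 <- 1281; x0 <- 479  # year-2000 levels
x1 <- 350                          # assumed exogenous investment in 2001
# Rearranged so that A %*% z = b for z = (c, r, y):
#    c - alpha*y = beta - alpha*y0 + u
#   -mu*c + r    = mu*(x1 - c0 - x0) + nu + w
#   -c + r + y   = x1
A <- rbind(c( 1,  0, -alpha),
           c(-mu, 1,  0),
           c(-1,  1,  1))
b <- c(beta - alpha*y0 + u,
       mu*(x1 - c0 - x0) + nu + w,
       x1)
z <- solve(A, b)
names(z) <- c("c", "r", "y")
z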

Conclusions and outlook

In this post, we have briefly reviewed possible uses of R in simulating dynamic econometric models, in particular simultaneous equation models. A simple demonstration of determining the model coefficients of a Haavelmo-type toy model on generated synthetic data was provided. We also mentioned one use case of this type of approach in economic scenario analysis and forecasting: monitoring the propagation of the econometric instruments over time.


References

[1] The Statistical Implications of a System of Simultaneous Equations,
     Trygve Haavelmo, Econometrica, Vol. 11, No. 1 (Jan., 1943), pp. 1-12.
[2] Econometric Analysis, William H. Greene, Prentice Hall (2011).
[3] Applied Econometrics with R,
     Christian Kleiber and Achim Zeileis, Springer (2008).
[4] Methods of Measuring the Marginal Propensity to Consume,
     T. Haavelmo, Journal of the American Statistical Association (1947).
[5] Time Series Analysis and Simultaneous Equation Econometric Models,
     Arnold Zellner and Franz Palm, Journal of Econometrics,
     Vol. 2, No. 1, pp. 17-54 (1974).

Saturday, 16 January 2016

S-shaped data: Smoothing with quasibinomial distribution

Figure 1: Synthetic data and fitted curves.
S-shaped data appear in many applications, and such data can be approximated with the logistic distribution [1]. The cumulative distribution function of the logistic distribution is a logistic function, i.e. a sigmoid, whose inverse is the logit.

To demonstrate this, in this short example we will generate synthetic data and then fit a quasibinomial regression model to the different observations.

ggplot2 [2], an implementation of the grammar of graphics [3], provides the capability to apply regression or customised smoothing to raw data during plotting.

Generating Synthetic Data

Let's generate $k$ observation series, denoted $X_{1}, X_{2}, \ldots, X_{k}$, each evaluated on a common grid of $n$ points $x=(x_{1}, x_{2}, \ldots, x_{n})$. We will use the cumulative distribution function of the logistic distribution [4],
$$F(x;\mu,s) = \frac{1}{2} + \frac{1}{2} \tanh\left(\frac{x-\mu}{2s}\right),$$
adding some random noise to make it realistic.

Let's say there are $k=6$ observation series with the following parameter sets, $\mu = \{1,2,3,5,7,8\}$ and $s=\{2,2,4,3,4,2\}$; we will utilise `mapply` [5] to generate a synthetic data frame.


generate_logit_cdf <- function(mu, s,
                               sigma_y=0.1,
                               x=seq(-5,20,0.1)) {
  # Logistic CDF: F(x; mu, s) = 1/2 + 1/2 * tanh((x - mu) / (2s))
  x_ms <- (x-mu)/(2*s)
  y    <- 0.5 + 0.5 * tanh(x_ms)
  # Add noise, then clip to the unit interval
  y    <- abs(y + rnorm(length(x), 0, sigma_y))
  y[y >= 1.0] <- 1.0
  return(y)
}
set.seed(424242)
x      <- seq(-5,20,0.025) # 1001 grid points
mu_vec <- c(1,2,3,5,7,8)   # 6 observation series
s_vec  <- c(2,2,4,3,4,2)
# Synthetic observations: one column per (mu, s) pair
observations_df <- mapply(generate_logit_cdf,
                          mu_vec,
                          s_vec,
                          MoreArgs = list(x=x))
# Give them names
colnames(observations_df) <- c("Var1", "Var2", "Var3", "Var4", "Var5", "Var6")
head(observations_df)

Smoothing of observations

Using the synthetic data we have generated, `observations_df`, we can now use `ggplot` and a `quasibinomial` `glm` to visualise and smooth the variables.


library(ggplot2)
library(reshape2)
# Melt the matrix into long format; Var1 is the row index, Var2 the column
df_all <- reshape2::melt(observations_df)
colnames(df_all) <- c("x", "observation", "y")
df_all$x <- x[df_all$x]   # map row indices back to the actual grid values
df_all$observation <- as.factor(df_all$observation)
# Raw points
p1 <- ggplot(df_all, aes(x=x, y=y, colour=observation)) + geom_point() +
        scale_color_brewer(palette = "Reds") +
        theme(
              panel.background = element_blank(),
              axis.text.x      = element_text(face="bold", color="#000000", size=11),
              axis.text.y      = element_text(face="bold", color="#000000", size=11),
              axis.title.x     = element_text(face="bold", color="#000000", size=11),
              axis.title.y     = element_text(face="bold", color="#000000", size=11)
              )
# Points with a quasibinomial GLM smoother per observation
l1 <- ggplot(df_all, aes(x=x, y=y, colour=observation)) +
        geom_point(size=3) + scale_color_brewer(palette = "Reds") +
        geom_smooth(aes(group=observation), method="glm",
                    method.args=list(family=quasibinomial()),
                    formula=y~x, se=FALSE, size=1.5) +
        xlab("x") +
        ylab("y") +
        theme(
              panel.background = element_blank(),
              axis.text.x      = element_text(face="bold", color="#000000", size=11),
              axis.text.y      = element_text(face="bold", color="#000000", size=11),
              axis.title.x     = element_text(face="bold", color="#000000", size=11),
              axis.title.y     = element_text(face="bold", color="#000000", size=11)
              )
library(gridExtra)
gridExtra::grid.arrange(p1, l1)


References
[1] https://en.wikipedia.org/wiki/Logistic_distribution#Applications
[2] http://www.ggplot.org
[3] The Grammar of Graphics, L. Wilkinson, http://www.amzn.com/038724544
[4] http://en.wikipedia.org/wiki/Logistic_distribution#Cumulative_distribution_function
[5] https://stat.ethz.ch/R-manual/R-devel/library/base/html/mapply.html

Notes:
* "Quasibinomial" is a term used in the context of R's GLM implementation: it is not a distribution in its own right, but a quasi-likelihood family that uses the binomial variance function while estimating the dispersion parameter from the data.
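As a minimal sketch of that point (assuming `df_all` from the example above), fitting one series directly shows the estimated dispersion parameter in the summary:

# Fit one series with a quasibinomial GLM; y is a proportion in [0, 1]
one_var <- subset(df_all, observation == "Var1")
fit <- glm(y ~ x, family = quasibinomial(), data = one_var)
summary(fit)  # reports an estimated dispersion parameter, unlike binomial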