Saturday, 16 January 2016

S-shaped data: Smoothing with quasibinomial distribution

Figure 1: Synthetic data and fitted curves.
S-shaped distributed data can be found in many applications. Such data can be approximated with logistic distribution function [1].  Cumulative distribution function of logistic distribution function is a logistic function, i.e., logit.

To demonstrate this, in this short example, after generating a synthetic data, we will fit quasibinomial regression model to different observations.

ggplot [2], an implementation of grammar of graphics [3], provides capability to apply regression or customised smoothing onto a raw data during plotting.

Generating Synthetic Data

Let generate set of $n$ observation over time $t$, denoted, $X_{1}, X_{2}, ..., X_{n}$ for $k$ observation $X=(x_{1}, x_{2}, ..., x_{k})$. We will use cumulative function for logistic distribution [4],
$$F(x;\mu,s) = \frac{1}{2} + \frac{1}{2} \tanh((x-\mu)/2s)$$, adding some random noise to make
it realistic.

Let's say there are $k=6$ observations with the following parameter sets, $\mu = \{9,2,3,5,7,5\}$  and $s=\{2,2,4,3,4,2\}$, we will utilise `mapply` [5] in generating a syntetic data frame.


generate_logit_cdf <- function(mu, s, 
                               sigma_y=0.1, 
                               x=seq(-5,20,0.1)) {
  x_ms <- (x-mu)/s
  y    <- 0.5 + 0.5 * tanh(x_ms)  
  y    <- abs(y + rnorm(length(x), 0, sigma_y))
  ix   <- which(y>=1.0)
  if(length(ix)>=1) { 
    y[ix] <- 1.0
  }
  return(y)
}
set.seed(424242)
x      <- seq(-5,20,0.025) # 1001 observation
mu_vec <- c(1,2,3,5,7,8)   # 6 variables
s_vec  <- c(2,2,4,3,4,2)
# Syntetic variables
observations_df<- mapply(generate_logit_cdf, 
                              mu_vec, 
                              s_vec, 
                              MoreArgs = list(x=x))
# Give them names
colnames(observations_df) <- c("Var1", "Var2", "Var3", "Var4", "Var5", "Var6")
head(observations_df)  

Smoothing of observations

Using the syntetic data we have generated, `observations_df`,
we can noq use `ggplot` and `quasibinomial` `glm` to visualise
and smooth the variables.


library(ggplot2)
 library(reshape2)
 df_all <- reshape2:::melt(observations_df)
 colnames(df_all) <- c("x", "observation", "y")
 df_all$observation <- as.factor(df_all$observation)
 p1<-ggplot(df_all, aes(x=x, y=y, colour=observation)) + geom_point() + 
        scale_color_brewer(palette = "Reds") +
        theme(
              panel.background = element_blank(), 
              axis.text.x      = element_text(face="bold", color="#000000", size=11),
              axis.text.y      = element_text(face="bold", color="#000000", size=11),
              axis.title.x     = element_text(face="bold", color="#000000", size=11),
              axis.title.y     = element_text(face="bold", color="#000000", size=11)
#              legend.position = "none"
              )
 l1<-ggplot(df_all, aes(x=x, y=y, colour=observation)) +
        geom_point(size=3) + scale_color_brewer(palette = "Reds") + 
        scale_color_brewer(palette = "Reds") + 
        #geom_smooth(method="loess", se = FALSE, size=1.5) +
        geom_smooth(aes(group=observation),method="glm", family=quasibinomial(), formula="y~x",
                       se = FALSE, size=1.5) +
        xlab("x") + 
        ylab("y") +
        #scale_y_continuous(breaks=seq(0.0,1,0.1)) +
        #scale_x_continuous(breaks=seq(0.0,230,20)) +
        #ggtitle("")  + 
        theme(
              panel.background = element_blank(), 
              axis.text.x      = element_text(face="bold", color="#000000", size=11),
              axis.text.y      = element_text(face="bold", color="#000000", size=11),
              axis.title.x     = element_text(face="bold", color="#000000", size=11),
              axis.title.y     = element_text(face="bold", color="#000000", size=11)
              )
 library(gridExtra)
 gridExtra:::grid.arrange(p1,l1)


References
[1] https://en.wikipedia.org/wiki/Logistic_distribution#Applications
[2] http://www.ggplot.org
[3] The Grammar of Graphics, L. Wilkinson, http://www.amzn.com/038724544
[4] http://en.wikipedia.org/wiki/Logistic_distribution#Cumulative_distribution_function.
[5] https://stat.ethz.ch/R-manual/R-devel/library/base/html/mapply.html

Notes:
* Quasibinomial distribution is as a term used in R's GLM implementation context, see here.


3 comments:

al said...

I get:
Error: Unknown parameters: family

I think family=quasibinomial() is the cause of the error, although PKG "stats" is loaded!!

thank you for your help.
Al

msuzen said...

@al Probably it is an R or ggplot version issue. This works on my laptop with the following session information.
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] gridExtra_2.0.0 reshape2_1.4.1 ggplot2_1.0.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 digest_0.6.8 MASS_7.3-44 grid_3.2.3
[5] plyr_1.8.3 gtable_0.1.2 magrittr_1.5 scales_0.3.0
[9] stringi_1.0-1 labeling_0.3 proto_0.3-10 RColorBrewer_1.1-2
[13] tools_3.2.3 stringr_1.0.0 munsell_0.4.2 colorspace_1.2-6
>

al said...

Thanks muzen, I used R 3.1.0 and it did work!!

(c) Copyright 2008-2015 Mehmet Suzen (suzen at acm dot org)

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License.