Memo's Island: 2015

Summary

In data science and machine learning applications, features (predictors) commonly appear on different scales. This makes the coefficients of a linear regression hard to interpret: one feature may take very large or very small values compared with the other predictors, and the predictors may be measured in different units to begin with. The common remedy is to replace each feature with its z-score, that is, to centre and scale it.
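
As a small illustration of centring and scaling, the sketch below standardizes two made-up predictors on very different scales. The column names `height_cm` and `income` and their distributions are assumptions chosen only for the example; `scale()` is base R.

```r
# Two hypothetical predictors on very different scales (illustrative data only)
set.seed(1)
x <- cbind(height_cm = rnorm(50, 170, 10),
           income    = rnorm(50, 50000, 15000))

# z-score by hand: subtract the column mean, then divide by the column sd
z_manual <- sweep(sweep(x, 2, colMeans(x), "-"), 2, apply(x, 2, sd), "/")

# scale() performs the same centring and scaling with its default arguments
z_scale <- scale(x)

# After standardization each column has mean ~0 and standard deviation 1
colMeans(z_scale)
apply(z_scale, 2, sd)
```

Both routes give the same standardized matrix; `scale()` additionally stores the means and standard deviations it used as attributes.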

This approach makes the effects easier to interpret and compare. A natural question, however, is how to map regression coefficients obtained with the scaled data back to coefficients on the original scale. In the context of ridge regression, this question was posed by Mark Seeto on the R mailing list, where he provided a solution for the two-predictor case with R code. In this post we formalize his approach. Note that, on the question of how to scale, Professor Gelman suggests dividing by two standard deviations. We do not cover that approach here and use the usual scaling by one standard deviation.

Algebraic Solution: No error term

An arbitrary linear regression with $n$ predictors reads as follows
$$y=\left(\sum_{i=1}^{n} \beta_{i} x_{i}\right) + \beta_{0}$$
Here $y$ is the response variable and $x_{i}$ are the predictors, $i=1,\ldots,n$. Let's use primes for the regression equation on the scaled data, again with $n$ predictors.
$$y'=\left(\sum_{i=1}^{n} \beta_{i}' x_{i}'\right) + \beta_{0}'$$
We would like to express $\beta_{i}$ using only $\beta_{i}'$ and two kinds of statistics of the data, namely the means and standard deviations $\mu_{x_{i}}$, $\mu_{y}$, $\sigma_{x_{i}}$ and $\sigma_{y}$.

The following transformation can be shown by substituting the z-scores and rearranging,

$$\beta_{0}=\beta_{0}' \sigma_{y} + \mu_{y} - \sum_{i=1}^{n} \frac{\sigma_{y}}{\sigma_{x_{i}}}\beta_{i}' \mu_{x_{i}}$$
$$\beta_{i} = \beta_{i}' \frac{\sigma_{y}}{\sigma_{x_{i}}}$$
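
For readers who want the intermediate steps, here is one way to carry out the algebra. It assumes the usual z-score definitions $x_{i}' = (x_{i}-\mu_{x_{i}})/\sigma_{x_{i}}$ and $y' = (y-\mu_{y})/\sigma_{y}$, which the post uses implicitly. Substituting them into the scaled equation gives
$$\frac{y-\mu_{y}}{\sigma_{y}} = \sum_{i=1}^{n} \beta_{i}' \frac{x_{i}-\mu_{x_{i}}}{\sigma_{x_{i}}} + \beta_{0}'$$
Multiplying by $\sigma_{y}$, adding $\mu_{y}$ and separating the terms that contain $x_{i}$ from the constants,
$$y = \sum_{i=1}^{n} \left(\beta_{i}' \frac{\sigma_{y}}{\sigma_{x_{i}}}\right) x_{i} + \left(\beta_{0}' \sigma_{y} + \mu_{y} - \sum_{i=1}^{n} \frac{\sigma_{y}}{\sigma_{x_{i}}}\beta_{i}' \mu_{x_{i}}\right)$$
Matching this term by term with the unscaled equation, the factor multiplying $x_{i}$ is $\beta_{i}$ and the constant in the second bracket is $\beta_{0}$.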

Ridge regression in R

There are many packages and tools in R to perform ridge regression. One of the prominent ones is glmnet. Following Mark Seeto's example, we extend it here to the case of many predictors with a helper function, scaleBack.lm, from the R1magic package. The function provides a transform utility for the $n$-variate case. Here we demonstrate it using 6 predictors (the code is also available as a gist):

rm(list=ls())
library(glmnet)
library(R1magic) # https://github.com/msuzen/R1magic
set.seed(4242)
n <- 100 # observations
X <- model.matrix(~., data.frame(x1 = rnorm(n, 1, 1),
                                 x2 = rnorm(n, 2, 2),
                                 x3 = rnorm(n, 3,2),
                                 x4 = rnorm(n, 4,2),
                                 x5 = rnorm(n, 5,1),
                                 x6 = rnorm(n, 6,1)
                                ))[,-1] # glmnet adds the intercept
Y          <- matrix(rnorm(n, 1, 2),n,1)
# Now apply scaling 
X.s        <- scale(X)
Y.s        <- scale(Y)
# Ridge regression (alpha = 0) & coefficients with scaled data
glm.fit.s    <- glmnet(X.s, Y.s, alpha=0)
# Column 80 of the lambda path: intercept first, then the 6 slopes
betas.scaled <- as.matrix(as.vector(coef(glm.fit.s)[,80]), 1, 7)
# Transform the coefficients back to the original scale
betas.transformed <- scaleBack.lm(X, Y, betas.scaled)
# Now verify the correctness of the transformed coefficients:
# ridge regression & coefficients on the unscaled data
glm.fit    <- glmnet(X, Y, alpha=0)
betas.fit  <- as.matrix(as.vector(coef(glm.fit)[,80]), 1, 7)
# Verify correctness: the summed difference is smaller than 1e-12
# (signed differences can cancel; max(abs(...)) would be a stricter check)
sum(betas.fit-betas.transformed) < 1e-12 # TRUE
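
The transformation itself can also be sketched in base R without extra packages. The helper `scale_back` below is a hypothetical name introduced for this illustration and is not the R1magic implementation. It is checked against ordinary least squares with `lm`, where the scaled and unscaled fits are related exactly by the formulas above; with ridge regression the comparison additionally depends on how the penalty is chosen in each fit.

```r
# Hypothetical helper (not the R1magic code): map coefficients fitted on
# z-scored data back to the original scale.
# betas_scaled = c(intercept', beta_1', ..., beta_n')
scale_back <- function(X, y, betas_scaled) {
  mu_x <- colMeans(X)
  sd_x <- apply(X, 2, sd)
  mu_y <- mean(y)
  sd_y <- sd(y)
  slopes    <- betas_scaled[-1] * sd_y / sd_x                      # beta_i
  intercept <- betas_scaled[1] * sd_y + mu_y - sum(slopes * mu_x)  # beta_0
  unname(c(intercept, slopes))
}

# Illustrative data: three predictors on different scales
set.seed(4242)
n <- 100
X <- cbind(x1 = rnorm(n, 1, 1), x2 = rnorm(n, 2, 2), x3 = rnorm(n, 3, 2))
y <- as.numeric(1 + X %*% c(0.5, -1, 2) + rnorm(n))

# Ordinary least squares on the raw and on the z-scored data
Xs <- scale(X)
ys <- as.numeric(scale(y))
fit_raw    <- lm(y ~ X)
fit_scaled <- lm(ys ~ Xs)

# Transform the scaled coefficients and compare with the raw fit
betas_back <- scale_back(X, y, coef(fit_scaled))
max(abs(betas_back - unname(coef(fit_raw))))  # largest absolute discrepancy
```

Using the largest absolute difference rather than a plain sum avoids positive and negative discrepancies cancelling each other.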

Conclusion

Multiple regression is used by many practitioners. In this post we have shown how to scale continuous predictors and how to transform the resulting regression coefficients back to the original scale. Scaled coefficients help us interpret the results better. The question of when to standardize the data is a separate issue.

Friday, 10 April 2015

Scale back or transform back multiple linear regression coefficients: Arbitrary case with ridge regression