## Wednesday, 16 August 2017

### Understanding overfitting: an inaccurate meme in supervised learning

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro with the title Understanding overfitting: an inaccurate meme in machine learning

#### Preamble

There is a lot of confusion among practitioners regarding the concept of overfitting. A kind of urban legend, or meme, seems to be circulating in data science and allied fields in the form of the following statement:

> Applying cross-validation prevents overfitting, and a good out-of-sample performance, i.e. a low generalisation error on unseen data, indicates that the model is not overfitted.

This statement is of course not true: cross-validation does not prevent your model from overfitting, and good out-of-sample performance does not guarantee a model that is not overfitted. What people actually refer to in one aspect of this statement is called overtraining. Unfortunately, this meme is propagated not only in industry but in some academic papers as well. At best, this might be a confusion of jargon. But it is good practice to set the jargon right and to be clear about what we refer to when we say overfitting in communicating our results.

#### Aim

In this post, we will give an intuition for why model validation, understood as approximating the generalization error of a model fit, and detection of overfitting cannot be resolved simultaneously on a single model. After some conceptual introduction, we will work through a concrete example workflow to understand overfitting, overtraining and a typical final model building stage. We will avoid Bayesian interpretations and regularisation, and restrict the post to regression and cross-validation, since regularisation has different ramifications due to its mathematical properties and prior distributions have different implications in Bayesian statistics. We assume an introductory background in machine learning, so this is not a beginner's tutorial.

A recent question from Andrew Gelman, a Bayesian guru, regarding What is overfitting? was one of the reasons this post was developed, along with my frustration at seeing practitioners being muddy on the meaning of overfitting, and at recently published data science articles, and even some academic papers, that continue to circulate the above statement.

#### What do we need to satisfy in supervised learning?

One of the most basic tasks in mathematics is to find a solution to an equation defined by a function. Let us restrict ourselves to real numbers in $n$ dimensions, so that our domain of interest is $\mathbb{R}^{n}$. Now imagine that a set of $p$ points $x_{i}$ living in this domain forms a dataset; this is actually a partial solution to such an equation. The main purpose of modelling is to find an explanation of the dataset, meaning that we need to determine $m$ parameters, $a \in \mathbb{R}^{m}$, which are unknown. (Note that a non-parametric model does not mean no parameters.) Mathematically speaking, this manifests as a function, as we said before, $f(x, a)$. This modelling is usually called regression, interpolation or supervised learning, depending on the literature you are reading. This is a form of inverse problem: we don't know the parameters, but we have partial information regarding the variables. The main issue here is ill-posedness, meaning, among other things, that the solution need not be unique. Omitting axiomatic technical details, the practical problem is that we can find many functions $f(x, a)$, or models, explaining the dataset. So, we seek the following two properties to be satisfied by our model solution, $f(x, a)=0$.

1. Generalized: A model should not depend on the dataset. This step is called model validation.
2. Minimally complex: A model should obey Occam's razor or principle of parsimony. This step is called model selection.

Figure 1: A workflow for model validation and selection in supervised learning.

Generalization of a model can be measured by goodness-of-fit. It essentially tells us how well our model (the chosen function) explains the dataset. Finding a minimally complex model requires comparison against another model.

Up to now, we have not named a technique for checking whether a model is generalized and whether it is selected as the best model. Unfortunately, there is no unique way of doing either, and that is the task of the data scientist or quantitative practitioner, which requires human judgement.

#### Model validation: An example

One way to check whether a model is generalized enough is to come up with a metric for how well it explains the dataset. Our task in model validation is to estimate the model error. For example, root mean square deviation (RMSD) is one metric we can use. If the RMSD is low, we could say that our model fit is good; ideally it should be close to zero. But the model is not shown to be generalized if we use the same dataset to measure the goodness-of-fit. We could use a different dataset, specifically an out-of-sample dataset, to validate this as much as we can, i.e. the so-called hold-out method. Out-of-sample is just a fancy way of saying that we did not use the same dataset to find the values of the parameters $a$. An improved way of doing this is cross-validation: we split our dataset into $k$ partitions, and we obtain $k$ RMSD values to be averaged over. This is summarised in Figure 1. Note that a different parameterisation of the same model does not constitute a different model.
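
The two error estimates described above can be written out explicitly. As a sketch, with $\hat{f}$ denoting the fitted model and $n$ the number of points in the evaluation set (this notation is an assumption introduced here for illustration):

$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{f}(x_{i}) - y_{i}\right)^{2}}, \qquad \mathrm{CV\text{-}RMSD} = \frac{1}{k}\sum_{j=1}^{k}\mathrm{RMSD}_{j},$$

where $\mathrm{RMSD}_{j}$ is computed on the $j$-th held-out partition after fitting the parameters on the remaining $k-1$ partitions. In the hold-out method there is a single split, so only one such RMSD value is available.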

#### Model selection: Detection of overfitting

Overfitting comes into play when we try to satisfy the 'minimally complex model' requirement. This is a comparison problem, and we need more than one model to judge whether a given model is an overfit. Douglas Hawkins, in his classic paper The Problem of Overfitting, states that

> Overfitting of models is widely recognized as a concern. It is less recognized however that overfitting is not an absolute but involves a comparison. A model overfits if it is more complex than another model that fits equally well.

The important point here is what we mean by a complex model, or how we can quantify model complexity. Unfortunately, again, there is no unique way of doing this. One of the most used approaches is to say that a model with more parameters is more complex. But this is again a bit of a meme and not generally true. One could actually resort to different measures of complexity. For example, by this definition $f_{1}(a,x)=ax$ and $f_{2}(a,x)=ax^2$ have the same complexity, having the same number of free parameters, but intuitively $f_{2}$ is more complex, since it is nonlinear. There are many information-theory-based measures of complexity, but a discussion of those is beyond the scope of our post. For demonstration purposes, we will consider a model with more parameters and a higher degree of nonlinearity to be more complex.

Figure 2: Simulated data and the non-stochastic part of the data.

#### Hands-on example

We have intuitively covered the reasons why we can't resolve model validation and judge overfitting simultaneously. Now we try to demonstrate this with a simple dataset and simple models that nevertheless capture the essence of the above premise.

A usual procedure is to generate a synthetic (simulated) dataset from a model as a gold standard, and to use this dataset to build other models. Let's use the following functional form, from the classic text of Bishop, but with added Gaussian noise: $$f(x) = \sin(2\pi x) + \mathcal{N}(0,0.1).$$ We generate a large enough set, about 100 points, to avoid the sample size issue discussed in Bishop's book; see Figure 2. Let's decide on two models we would like to apply to this dataset in a supervised learning task. Note that we won't be discussing the Bayesian interpretation here, so the equivalence of these models under a strong prior assumption is not an issue, as we are using this example for ease of demonstrating the concept. Polynomial models of degree $3$ and degree $6$, which we call $g(x)$ and $h(x)$ respectively, are used to learn from the simulated data: $$g(x) = a_{0} + a_{1} x + a_{2} x^{2} + a_{3} x^{3}$$ and $$h(x) = b_{0} + b_{1} x + b_{2} x^{2} + b_{3} x^{3} + b_{4} x^{4} + b_{5} x^{5} + b_{6} x^{6}.$$

Figure 3: Overtraining occurs after around 40 percent of the data usage for g(x).

#### Overtraining is not overfitting

Overtraining means that a model's performance degrades while learning the model parameters as a function of an objective variable that affects how the model is built; for example, the objective variable can be the training data size, or the number of iteration cycles in a neural network. This is more prevalent in neural networks (see Dayhoff 2011). In our practical example, this manifests itself when the hold-out method is used to measure RMSD in modelling with g(x). In other words, we look for an optimal number of data points to use in training the model to give better performance on unseen data; see Figures 3 and 4.

#### Overfitting with low validation error

We can also estimate the 10-fold cross-validation error, CV-RMSD. For this sampling, g and h have CV-RMSD values of 0.13 and 0.12 respectively. So, as we can see, we have a situation where the more complex model reaches similar predictive power under cross-validation, and we cannot detect this overfitting just by looking at the CV-RMSD value or by detecting the 'overtraining' curve in Figure 4. We need two models to compare, hence both Figures 3 and 4, along with both CV-RMSD values. One might argue that in small datasets we might be able to tell the difference by looking at the difference between test and training errors; this is exactly how Bishop explains overfitting, where he points out overtraining in small datasets.

#### Which trained model to deploy?

Now the question is this: we have found the best performing model with minimal complexity empirically. All well, but which trained model should we use in production?

Actually, we have already built the model during model selection. In the above case, since we got similar predictive power from g and h, we will obviously use g, trained on the splitting sweet spot from Figure 3.
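
The choice of g over h can be phrased as a small rule: among the models whose cross-validation error is close to the best one, prefer the least complex. The sketch below is only an illustration of that reasoning. The helper name `select_parsimonious` and the tolerance `tol` are hypothetical and not part of the original analysis, and complexity is taken as the number of coefficients in the two polynomial equations above, as agreed for demonstration purposes.

```r
# Hypothetical helper (not from the original post): pick the least complex
# model among those whose CV error is within `tol` of the best CV error.
# `tol` is an assumed threshold for "fits equally well"; choosing it still
# requires human judgement.
select_parsimonious <- function(cv_error, complexity, tol = 0.02) {
  stopifnot(length(cv_error) == length(complexity),
            !is.null(names(cv_error)))
  # Models that fit about as well as the best one
  candidates <- which(cv_error <= min(cv_error) + tol)
  # Among those, take the one with the smallest complexity
  names(cv_error)[candidates[which.min(complexity[candidates])]]
}

cv_error   <- c(g = 0.13, h = 0.12)  # CV-RMSD values quoted in the post
complexity <- c(g = 4, h = 7)        # number of polynomial coefficients
chosen <- select_parsimonious(cv_error, complexity)
print(chosen)
```

With a tolerance this wide the two models count as fitting equally well, so the rule follows Hawkins and returns the simpler one. With `tol = 0` it would reduce to picking the lowest CV-RMSD, which is exactly the habit the post warns against.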

Figure 4: Overtraining occurs after around 30 percent of the data usage for h(x).

#### Conclusion

The essential message here is that good validation performance does not guarantee the detection of an overfitted model, as we have seen from examples using synthetic data in one dimension. Overtraining is actually what most practitioners mean when they use the term overfitting.

#### Outlook

As more and more people use techniques from machine learning or inverse problems, both in academia and industry, some key technical concepts drift a bit and take on different definitions and meanings for different people. This happens because people learn some concepts not from reading the literature carefully but verbally, from their line managers or senior colleagues. This creates memes which are actually wrong, or which at least create a lot of confusion in jargon. It is very important for all of us as practitioners to question all technical concepts and to seek their origins in the published scientific literature, rather than relying entirely on verbal explanations from our experienced colleagues. Also, we should strongly avoid ridiculing questions from colleagues even if they sound too simple. At the end of the day, we don't stop learning, and naive-looking questions might have very important consequences for the fundamentals of the field.

P.S. As I mentioned above, the inspiration for writing this post was a recent post from Gelman. He defined 'overfitting' as follows:

> Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.

Priors and equivalence of two models aside, Gelman's definition is weaker than Hawkins' definition, in that he accepts a complex model having similar predictive power. So, if we use Gelman's definition, it is OK to deploy either g or h in our toy example above. But strictly speaking, from Hawkins' perspective, we need to deploy g.

Figure 5: The deployed models h and g on the testing set, with the original data.

#### Appendix: Reproducing the example using R

The code used in producing the synthetic data, the modelling step and the visualisation of the results can be found in the GitHub repository. In this appendix, we present this R code with detailed comments; the visualisation code is omitted, but it is available in the GitHub repository.

R (GNU S) provides a very powerful formula interface. It is probably the most advanced and expressive formula interface in statistical computing, along with S, of course.

The two polynomials above can be expressed both as a formula and as a function that we can evaluate.

```r
#'
#' Two polynomial models: g and h, 3rd and 6th degree respectively.
#'
g_fun <- function(x, params) as.numeric(params[1] + x*params[2] + x^2*params[3] + x^3*params[4])
h_fun <- function(x, params) as.numeric(params[1] + x*params[2] + x^2*params[3] + params[4]*x^3 +
                                        params[5]*x^4 + params[6]*x^5 +
                                        params[7]*x^6)
g_formula <- ysim ~ I(x) + I(x^2) + I(x^3)
h_formula <- ysim ~ I(x) + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6)
```

Learning from the data is achieved with the lm function from R:

```r
#'
#' Given data.frame with x and ysim, and an R formula with ysim=f(x),
#' fit a linear model
#'
get_coefficients <- function(data_portion, model_formula) {
  model <- lm(model_formula, data=data_portion)
  return(model$coefficients)
}
```

The resulting approximated function can be applied to a new data set with the following helper function, which measures RMSD as a performance metric.

```r
#'
#' Find the prediction error for a given model_function and model_formula
#'
lm_rmsd <- function(x_train, y_train, x_test, y_test, model_function, model_formula) {
  params <- get_coefficients(data.frame(x=x_train, ysim=y_train), model_formula)
  params[as.numeric(which(is.na(params)))] <- 0 # if there is any co-linearity
  f_hat <- sapply(x_test, model_function, params=params)
  return(sqrt(sum((f_hat-y_test)^2)/length(f_hat)))
}
```

We can generate the simulated data discussed above by adding Gaussian noise, drawn with rnorm, to the sine function on a regular grid.

```r
#'
#' Generate a synthetic dataset
#' A similar model from Bishop :
#'
#' f(x) = sin(2pi*x) + N(0, 0.1)
#'
set.seed(424242)
f <- function(x) return(sin(2*pi*x))
fsim <- function(x) return(sin(2*pi*x)+rnorm(1,0,0.1))
x <- seq(0,1,1e-2)
y <- sapply(x,f)
ysim <- sapply(x,fsim)
simdata <- data.frame(x=x, y=y, ysim=ysim)
```

To detect overtraining, we can split the data at different places with increasing training size, and measure the performance on the training data itself and on unseen test data.

```r
#'
#' Demonstration of overtraining with g
#'
#'
library(reshape2) # assumed source of melt(); the post does not name the package
set.seed(424242)
model_function <- g_fun
model_formula <- g_formula
split_percent <- seq(0.05,0.95,0.03)
split_len <- length(split_percent)
data_len <- length(simdata$ysim)
splits <- as.integer(data_len*split_percent)
test_rmsd <- vector("numeric", split_len-1)  # RMSD values are doubles
train_rmsd <- vector("numeric", split_len-1)
for(i in 2:split_len) {
  train_ix <- sample(1:data_len,splits[i-1])
  test_ix <- (1:data_len)[-train_ix]
  train_rmsd[i-1] <- lm_rmsd(simdata$x[train_ix],
                             simdata$ysim[train_ix],
                             simdata$x[train_ix],
                             simdata$ysim[train_ix],
                             model_function,
                             model_formula)
  test_rmsd[i-1] <- lm_rmsd(simdata$x[train_ix],
                            simdata$ysim[train_ix],
                            simdata$x[test_ix],
                            simdata$ysim[test_ix],
                            model_function,
                            model_formula)
}
# Iteration i trains on splits[i-1], so the matching fractions are
# split_percent[1..split_len-1] (the original used split_percent[-1]).
rmsd_df <- data.frame(test_rmsd=test_rmsd, train_rmsd=train_rmsd, percent=split_percent[-split_len])
rmsd_df2 <- melt(rmsd_df, id=c("percent"))
colnames(rmsd_df2) <- c("percent", "Error_on", "rmsd")
rmsd_df2$test_train <- as.factor(rmsd_df2$Error_on)
```

And the last portion of the code does 10-fold cross-validation.

```r
#' CV for g(x) and h(x)
split_percent <- seq(0,1,0.1)
split_len <- length(split_percent)
data_len <- length(simdata$ysim)
splits <- as.integer(data_len*split_percent)
cv_rmsd_g <- 0
cv_rmsd_h <- 0
for(i in 2:split_len) { # 10-fold cross validation
  test_ix <- (splits[i-1]+1):splits[i]
  train_ix <- (1:data_len)[-test_ix]
  x_train <- simdata$x[train_ix]
  y_train <- simdata$ysim[train_ix]
  x_test <- simdata$x[test_ix]
  y_test <- simdata$ysim[test_ix]
  cv_rmsd_g <- lm_rmsd(x_train,
                       y_train,
                       x_test,
                       y_test,
                       g_fun,
                       g_formula)+cv_rmsd_g
  cv_rmsd_h <- lm_rmsd(x_train,
                       y_train,
                       x_test,
                       y_test,
                       h_fun,
                       h_formula)+cv_rmsd_h
}
# Note: the loop visits split_len-1 = 10 folds, while the sums are divided by
# split_len = 11 below; the quoted values were obtained with this divisor.
cat("10-fold CV error G = ", cv_rmsd_g/split_len,"\n") # 0.1304164
cat("10-fold CV error H = ", cv_rmsd_h/split_len,"\n") # 0.1206458
```
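
The loop above uses contiguous folds on data ordered by x, so the first and last folds test the model slightly outside the range it was trained on. A common variant assigns points to folds at random. The following is a minimal self-contained sketch of that variant, not the code behind the figures in this post: the helper name `cv_rmsd` is hypothetical, the polynomials are fitted through `poly(..., raw = TRUE)` rather than the explicit formulas, and the sum is averaged over exactly `k` folds, so its numbers are not expected to match the values quoted above.

```r
# Sketch (assumed variant): 10-fold CV with randomly assigned folds.
set.seed(424242)
x    <- seq(0, 1, 1e-2)
ysim <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
simdata <- data.frame(x = x, ysim = ysim)

cv_rmsd <- function(data, degree, k = 10) {
  # Shuffle fold labels so that every fold samples the whole x range
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_rmsd <- vapply(seq_len(k), function(j) {
    train <- data[fold != j, ]
    test  <- data[fold == j, ]
    # Raw polynomial of the requested degree, fitted by least squares
    fit <- lm(ysim ~ poly(x, degree, raw = TRUE), data = train)
    sqrt(mean((predict(fit, newdata = test) - test$ysim)^2))
  }, numeric(1))
  mean(fold_rmsd)  # average over exactly k folds
}

cv_g <- cv_rmsd(simdata, degree = 3)  # g: 4 coefficients
cv_h <- cv_rmsd(simdata, degree = 6)  # h: 7 coefficients
cat("random-fold CV-RMSD g =", cv_g, " h =", cv_h, "\n")
```

Whatever the fold assignment, the point of the post stands: the two CV-RMSD values only become informative about overfitting when they are compared with each other alongside the complexity of g and h.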

Unknown said...

Thank you for this clear and informative article. Do you have suggestions of how to approach this for time series data? In my work on forecasting, the entire measure of performance is based on the model performance "out of sample", i.e. train the model up to time t, then evaluate on data from t+1 to t+n, where n is the number of periods we would like to forecast. The model is built using multiple linear regression, and 10-fold cross validation is done with good (apparently) results. But there is no measure of over-training as you describe here. To put this concretely, I have data from 2012 through end of 2016, and would like to forecast 2017. I assessed different models on data through 2015 only, then used the "best" with the addition of 2016 data to re-train and make predictions. From this article, it seems I should have used all the data and done testing varying how large a sample was used vs. model performance. I have not seen this approach in time series literature, but of course I only know a small subset of that literature!

msuzen said...

@Baline Bateman. Indeed, modelling with time-series data is more tricky due to serial correlations and stationarity issues, along with small sample sizes. I have also not seen people doing early stopping assessment for time-series modelling against overtraining as in this post. But note that out-of-sample and k-fold evaluation manifest a bit differently in time-series. Rob J Hyndman has some work on how to properly do cross-validation for time-series models (https://robjhyndman.com/publications/cv-time-series/). I think it is perfectly fine to try to check against training size; that would be useful in deploying the "final" model, but re-sampling has to be done a bit differently, as far as I understood from Hyndman's work.

Unknown said...

Thank you. I will read the reference.

msuzen said...

@Baline Bateman As I said, there are not many works in this direction for time-series modelling, but effect of splitting is pointed out here:

A bootstrap evaluation of the effect of data splitting on financial time series
IEEE Transactions on Neural Networks ( Volume: 9, Issue: 1, Jan 1998 )
https://doi.org/10.1109/72.655043

So my conclusion is that it is perfectly good practice to apply the approach we discuss in the post to time-series models, as long as re-sampling does not violate serial correlations in your data.

Fabio Veronesi said...

Thank you for this post!
In your example it is fairly easy to understand that one model is more complex than the other. However, in other cases it may be a bit more difficult. Is there a way to classify ML algorithms based on their complexity?
Something that clearly says, for example, that boosting is definitely more complex than bagging, or along these lines?

Many thanks,
Fabio

Unknown said...

You can use jitter to measure overfitting directly https://www.kaggle.com/miniushkin/jitter-test-for-overfitting-notebook

Unknown said...

You can use jitter on features to measure overfitting https://www.kaggle.com/miniushkin/jitter-test-for-overfitting-notebook

msuzen said...

@Fabio Veronesi Unfortunately there is no universal classifier for ranking complexity of models as I mentioned in the post. The reason is that there are lots of complexity measures, maybe over 100 or more, in the literature and which part of the model one has to apply has no unique definition, even with algorithmic complexity. But my take on this; information theoretic measures tend to be more robust. Recent work has some nice measures for Boltzmann machines:

* Comparing Information-Theoretic Measures of Complexity in Boltzmann Machines (https://arxiv.org/abs/1706.09667)

Something along similar lines could be applied to boosting/bagging, I think.

msuzen said...

@Alexander Minushkin Interesting idea, but are you sure this is not a training sample size issue? What happens if you have a large enough training sample, like 100 in your regression example? How do you decide on the level of jitter? Actually, this reminds me of an old idea of stochastic resonance from physics, that one can induce directed motion with a small noise.

Unknown said...

@msuzen On the classification example you can see it works on 1000 and 200 sample points, so it needs to be investigated, but I don't think there are any issues except computation time.
Concerning the level of jitter, I agree it is serious. I think it should be comparable to the standard deviation of each feature. But I'm not sure what to do with categorical features.

msuzen said...

Ophir Sarusi had a nice comment: "one refers to how well the samples in the training data represent the space you are working in (Algebra - do they span the entire space), and the other is how well your algorithm "understands" the given samples in the dataset"

Remember again: the first problem in supervised learning (SL), the model bias-variance dilemma, is about error optimisation in the generalisation problem (whether the model is overtrained or not). The second problem in SL is whether your model is 'minimally complex' (overfitted or not). At this point, one can also think in terms of 'manifolds': whether your training data would be able to teach your algorithm to localise the manifold your data maps into, so a localisation issue with a precision-accuracy problem, hence a bias-variance issue. The second problem in SL, how well your function "understands" as you called it, is whether your model uses too much "information" to explain something that can be expressed with "less information".

msuzen said...

@Alexander Minushkin I think your approach is worth investigating in further detail on regression, once you "prove" that it can detect 'overfitting' and 'overtraining' in regression. As you said, though, categorical data is not easy to handle intuitively due to encoding and mixed spaces.

msuzen said...

Another interesting historical reference:
Neural network studies. 1. Comparison of overfitting and overtraining
J. Chem. Inf. Comput. Sci., 1995, 35 (5), pp 826–833
http://pubs.acs.org/doi/abs/10.1021/ci00027a006

Unknown said...

Hi, thank you for this post. As you said, I think it is really important to question all the technical concepts, even if they seem consolidated in literature, and the kind of effort done with posts like yours is very healthy for the community.

However, I strongly disagree with your exposition, in almost every point. Sorry. Given that I don't know if you actually want to discuss this, I'll keep this comment short, discussing only the main point. If you are interested we can continue the discussion afterwards.

Getting to the point, there's no such thing as overtraining. You describe this problem as dependent on "training data size or iteration cycles in a neural network".

The first thing is: adding data points never hurts the performance of a model. It can be useless (in a high bias situation), it can help (in a high variance situation), but it never hurts. Again, I can elaborate if you want, but to be brief, this can be seen easily, in the classification context, thanks to the VC analysis. The VC bound provides an upper bound to the generalisation error that decreases with the number of data points. Adding more data can hurt the in sample (training) error, but never the out of sample (test) error - that is what you care about. In your experiment, the trend you are seeing is due to the fact that, together with the training size, you are varying the test size. This provides a better estimate of the error (thanks to a bigger test set) when your training set is small, making this kind of analysis meaningless: you are using different estimators of the error for different models. If you keep the test set size steady (e.g. 20% of the total data) when varying the training set size, this is what you get: the test error goes to 15% at approx. 30% of the data, and, apart from small random effects, increasing the data points becomes useless (but not harmful) after that point. I can't post the plot in the comment, unfortunately.

Regarding the iteration cycles, the thing is a bit more complex. The number of iteration cycles of the training process affects how close the algorithm gets to a minimum of a cost function. Now, if the model is too complex for the problem - i.e. is overfitting the data - NOT getting to the minimum acts like a kind of regularisation, helping to generalise. This is similar to what happens in the context of non-convex optimisation problems, like in neural networks. The fact that the objective function is not convex has been proven to not be such a bad thing, because, if you got to the global optimum instead of a local optimum or a saddle point, in many cases, you would probably overfit (I can find the paper if you want). So, in this case, what you are calling overtraining is overfitting. Training less regularises the result, fighting it.

msuzen said...

@Alessio Burani Thank you for your comments; all viewpoints are welcome here. As I mentioned in the post, unfortunately, the concept of 'overfitting' is not uniform in the literature or in industry. Check out the papers from Shun-ichi Amari that I came across recently; he has an even more confusing distinction between overtraining and overfitting. Also, some people think VC is not really good as a complexity measure; at least, it fails to identify deep learning as a good method, which is obviously "mistaken".