Reference: Overfitting
August 21, 2020
Overfitting is a phenomenon in statistics and machine learning in which a model fits its training data so closely that it becomes unsuitable for unseen data. An overfit model fails to generalize: because it corresponds too closely to one particular dataset, it predicts future observations poorly.
In overfitting, a model can be thought of as having “memorized” the training data rather than “learned” its general patterns. Overfitting is a common problem in machine learning and can occur for a variety of reasons, most notably a model with too many adjustable parameters relative to the available data. An optimal model typically needs less flexibility, and therefore less data to fit reliably, than an overfit one. For this reason, highly flexible nonparametric and nonlinear models can be more susceptible to overfitting than linear models.
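A minimal sketch of this effect, using NumPy and made-up synthetic data (the sine signal, noise level, and polynomial degrees are all illustrative assumptions): as the polynomial degree grows, the number of adjustable parameters grows with it, training error falls steadily, but error on held-out data eventually rises again.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n noisy observations of a smooth underlying signal."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)  # signal plus noise

x_train, y_train = sample(12)
x_test, y_test = sample(200)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)            # degree + 1 parameters
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# The high-degree fit tracks the 12 training points almost perfectly but
# typically swings between them, so its error on the 200 unseen points is
# much larger than that of the simpler fits.
```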
Training data can be thought of as containing two components: signal that is relevant to future observations, and noise that is not. In overfitting, a model learns too much of the noise, which makes it less reliable on unseen data. Compare this with underfitting, where the model learns too little from its training data and therefore performs poorly even on the data it was trained on.
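A small sketch of “memorizing” noise, again with NumPy and synthetic targets (the sample size and degrees are illustrative assumptions): when the targets are pure noise with no underlying signal, a ten-parameter polynomial still drives its training error to essentially zero, while a two-parameter line cannot. The flexible model is fitting the noise itself, and nothing it learns here carries over to new observations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y_noise = rng.normal(0, 1, 10)  # targets are pure noise: there is no signal to learn

for degree in (1, 9):
    coeffs = np.polyfit(x, y_noise, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y_noise) ** 2)
    print(f"degree {degree}: training MSE {train_mse:.6f}")

# Degree 1 leaves most of the noise unexplained; degree 9 (10 coefficients for
# 10 points) interpolates it almost exactly, i.e. it memorizes the noise.
```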
Several techniques are available to address overfitting, including cross-validation, pruning, regularization, and Bayesian priors, among others. A held-out validation set is particularly useful: a growing gap between training performance and validation performance is a telltale sign that the model is overfitting.
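A minimal sketch of two of these remedies using scikit-learn (the synthetic data, polynomial degree, and ridge penalty alpha are illustrative assumptions, not prescriptions): k-fold cross-validation estimates performance on held-out data, and L2 regularization via ridge regression constrains the flexible model’s coefficients.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)

# Same flexible feature set; one pipeline unregularized, one with an L2 penalty.
unregularized = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-3))

for name, model in [("no regularization", unregularized), ("ridge", regularized)]:
    # 5-fold cross-validation: each fold serves once as held-out validation data.
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: mean validation MSE {-scores.mean():.4f}")

# Comparing the cross-validated errors shows which model generalizes better;
# training error alone would not reveal this, since both pipelines can fit the
# training points closely.
```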