arXiv Analytics

arXiv:1906.05271 [cs.LG]

Does Learning Require Memorization? A Short Tale about a Long Tail

Vitaly Feldman

Published 2019-06-12, Version 1

State-of-the-art results on image recognition tasks are achieved using over-parameterized learning algorithms that (nearly) perfectly fit the training set. This phenomenon is referred to as data interpolation or, informally, as memorization of the training data. The question of why such algorithms generalize well to unseen data is not adequately addressed by the standard theoretical frameworks and, as a result, significant theoretical and experimental effort has been devoted to understanding the properties of such algorithms. We provide a simple and generic model for prediction problems in which interpolating the dataset is necessary for achieving close-to-optimal generalization error. The model is motivated and supported by the results of several recent empirical works. In our model, data is sampled from a mixture of subpopulations and the frequencies of these subpopulations are chosen from some prior. The model makes it possible to quantify the effect of not fitting the training data on the generalization performance of the learned classifier and demonstrates that memorization is necessary whenever frequencies are long-tailed. Image and text data are known to follow such distributions and therefore our results establish a formal link between these empirical phenomena. To the best of our knowledge, this is the first general framework that demonstrates statistical benefits of plain memorization for learning. Our results also have concrete implications for the cost of ensuring differential privacy in learning.
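
A small simulation can help build intuition for the mixture-of-subpopulations setup described in the abstract. The sketch below is not the paper's construction: the Zipf-like frequencies, the function names `zipf_frequencies` and `singleton_mass`, and all parameter values are assumptions chosen for illustration. It estimates how much of the population's probability mass falls in subpopulations that appear exactly once in a training sample, first for a long-tailed frequency profile and then for a handful of equally frequent subpopulations.

```python
import random
from collections import Counter


def zipf_frequencies(num_subpops, exponent=1.0):
    """Normalized Zipf-like frequencies: subpopulation k gets weight 1 / k**exponent.

    The Zipf shape is an illustrative assumption, not the prior used in the paper.
    """
    weights = [1.0 / (k ** exponent) for k in range(1, num_subpops + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def singleton_mass(freqs, n_train, rng):
    """Sample n_train subpopulation indices and return the total frequency
    of the subpopulations that occur exactly once in that sample."""
    sample = rng.choices(range(len(freqs)), weights=freqs, k=n_train)
    counts = Counter(sample)
    return sum(freqs[i] for i, c in counts.items() if c == 1)


rng = random.Random(0)  # fixed seed so the run is reproducible

long_tailed = zipf_frequencies(num_subpops=5000, exponent=1.0)
uniform_few = [1.0 / 10] * 10  # ten equally frequent subpopulations

long_tail_mass = singleton_mass(long_tailed, n_train=2000, rng=rng)
uniform_mass = singleton_mass(uniform_few, n_train=2000, rng=rng)

print(f"long-tailed frequencies: singleton mass = {long_tail_mass:.3f}")
print(f"ten equal subpopulations: singleton mass = {uniform_mass:.3f}")
```

Under these assumptions, the long-tailed profile leaves a visible share of the test distribution represented by a single training example each, while the equal-frequency profile leaves essentially none. This is only a rough illustration of why the cost of not fitting individual training points could depend on the shape of the frequency distribution.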

Related articles:
arXiv:1404.7456 [cs.LG] (Published 2014-04-28)
Automatic Differentiation of Algorithms for Machine Learning
arXiv:2003.10113 [cs.LG] (Published 2020-03-23)
Algorithms for Non-Stationary Generalized Linear Bandits
arXiv:2007.13185 [cs.LG] (Published 2020-07-26)
Dimensionality Reduction for $k$-means Clustering