Preventing “Overfitting” of Cross-Validation data
Abstract

Suppose that, for a learning task, we have to select one hypothesis out of a set of hypotheses (that may, for example, have been generated by multiple applications of a randomized learning algorithm). A common approach is to evaluate each hypothesis in the set on some previously unseen cross-validation data, and then to select the hypothesis with the lowest cross-validation error. But when the cross-validation data is partially corrupted, for example by noise, and the set of hypotheses we are selecting from is large, “folklore” warns about “overfitting” the cross-validation data [Klockars and Sax, 1986; Tukey, 1949; Tukey, 1953]. In this paper, we explain how this “overfitting” really occurs, and show the surprising result that it can be overcome by selecting a hypothesis with a higher cross-validation error, over others with lower cross-validation errors. We give reasons for not selecting the hypothesis with the lowest cross-validation error, and propose a new algorithm, LOOCVCV, that uses a computationally efficient form of leave-one-out cross-validation to select such a hypothesis. Finally, we present experimental results for one domain that show LOOCVCV consistently outperforming selection of the hypothesis with the lowest cross-validation error, even when reasonably large cross-validation sets are used.
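
For concreteness, the baseline selection rule that the abstract argues against can be sketched as follows. This is a minimal Python illustration under our own naming, not code from the paper; we assume hypotheses is a list of callables mapping an array of inputs to predicted labels.

    import numpy as np

    def select_lowest_cv_error(hypotheses, X_cv, y_cv):
        # Baseline rule: score every candidate hypothesis on the same
        # cross-validation set, then keep the one with the lowest
        # cross-validation error. With many hypotheses and noisy
        # cross-validation labels, this is the rule that "overfits"
        # the cross-validation data.
        errors = [np.mean(h(X_cv) != y_cv) for h in hypotheses]
        return hypotheses[int(np.argmin(errors))]

The proposed LOOCVCV algorithm replaces this argmin with a selection criterion based on leave-one-out cross-validation, which can prefer a hypothesis whose cross-validation error is not the lowest.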