On Feature Selection: Learning with Exponentially many Irrelevant Features as Training Examples
We consider feature selection in the “wrapper” model of feature selection. This typically involves an NP-hard optimization problem that is approximated by heuristic search for a “good” feature subset. First considering the idealization where this optimization is performed exactly, we give a rigorous bound for generalization error under feature selection. The search heuristics typically used are then immediately seen as trying to achieve the error given in our bounds, and succeeding to the extent that they succeed in solving the optimization. The bound suggests that, in the presence of many “irrelevant” features, the main source of error in wrapper model feature selection is from “overfitting” hold-out or cross-validation data. This motivates a new algorithm that, again under the idealization of performing search exactly, has sample complexity (and error) that grows logarithmically in the number of “irrelevant” features – which means it can tolerate having a number of “irrelevant” features exponential in the number of training examples – and search heuristics are again seen to be directly trying to reach this bound. Experimental results on a problem using simulated data show the new algorithm having much higher tolerance to irrelevant features than the standard wrapper model. Lastly, we also discuss ramifications that sample complexity logarithmic in the number of irrelevant features might have for feature design in actual applications of learning.