Feature Selection

We are now searching for the features. In practice, since we are limited by time and computational constraints, the goal is to find the smallest set of features that still yields satisfactory results. Features should be relevant, highly informative and accurate, and they should preserve the class distribution as closely as possible.

And how do we determine which features are the best? Here comes the trouble. We can start with the biggest set, containing every available feature, and measure its performance. Then we discard each feature in turn, test whether performance has improved (with outlying points left out), and keep discarding the least powerful features until performance cannot be improved any further. Or, the other way around, we can begin by measuring each feature individually and choose the best one. Then we measure 2-tuples made of that best feature plus each remaining feature, keep the feature that performs best together with the first, and so forth. This is not that bad and sounds reasonable, but it is computationally very expensive.

A more questionable situation arises if we measure the features individually and then simply take some number of the top-ranked ones. It should be emphasized that those features were evaluated individually, not together. The same goes for the solution of choosing good but mutually orthogonal, independent features. Why am I so stubborn about depicting some choices as bad, no good, and so on? The answer will come in the next topic. Patience, please.
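To make the second procedure (start from the single best feature and grow the set one feature at a time) more concrete, here is a minimal sketch in Python. Everything named in it is an assumption made for illustration: the function forward_selection, the feature names "a", "b" and "c", and above all toy_score, which stands in for whatever real performance measurement you would run (for example, training and testing a classifier on the chosen subset). The numbers are invented and carry no meaning beyond the example.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    max_features: int,
) -> Tuple[List[str], float]:
    """Greedily grow a feature set, adding one feature per round.

    `score` is a hypothetical stand-in for a real performance
    measurement of a feature subset (higher is better).
    """
    selected: List[str] = []
    remaining = list(features)
    best_score = float("-inf")

    while remaining and len(selected) < max_features:
        # Measure every remaining feature together with those already chosen.
        candidates = [(score(frozenset(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])

        # Stop as soon as adding a feature no longer improves performance.
        if top_score <= best_score:
            break

        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score

    return selected, best_score


# Invented toy data: individual usefulness of each feature, plus a
# penalty applied when two features carry overlapping information.
INDIVIDUAL = {"a": 60, "b": 55, "c": 30}
OVERLAP = {frozenset({"a", "b"}): 50}


def toy_score(subset: FrozenSet[str]) -> int:
    """Hypothetical performance measure for a subset of features."""
    total = sum(INDIVIDUAL[f] for f in subset)
    for pair, penalty in OVERLAP.items():
        if pair <= subset:  # both overlapping features are present
            total -= penalty
    return total


selected, best = forward_selection(["a", "b", "c"], toy_score, max_features=2)
print(selected, best)  # ['a', 'c'] 90
```

Note where the cost comes from: with n features, the first round measures n subsets, the second n-1, and so on, and in a real setting every single measurement means training and testing a classifier. The procedure that starts from the full set and discards features can be sketched the same way, with the loop removing the feature whose absence helps most.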
