Classes and feature dependence



In pattern recognition we are not given just one class; rather, we are challenged with making choices between several classes, based on the discriminating power of the feature set we have established. Now I can go back to the question from the bottom of the last page. If you recall what you read on the previous page about possible schemes for feature selection (if you haven't read it, I suggest you do, because the explanations that follow rely on it), you will probably ask yourself what could possibly be wrong with selecting the top k features from a list built by examining each feature individually, that is, by its own discriminating power with respect to classifying objects into some number of classes.

Consider a two-class example: purple egg-plants and purple dirty socks. It might happen that the two features that look worst at first glance, color (an intensity scale of purple) and softness (a gradual scale of the object's softness), have the same mean values in both classes. Individually, then, they have no discriminating power: both egg-plants and purple dirty socks can take all values from the given interval of purple intensities, and both can be more or less soft, so we could not tell one from the other using just one of the proposed measures. But there is something interesting going on between these two features: they are strongly correlated, and, crucially, the sign of the correlation differs between the classes. A dirty sock is more purple when it is dirtier (it could be white if clean) and hence stiffer (because of some chemical processes), while an egg-plant is riper when it is more purple and thus softer.
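To make the example concrete, here is a minimal Python sketch of the situation. All numbers (means, covariances, sample sizes) are invented for illustration: both classes share the same feature means, and only the sign of the correlation differs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # samples per class (invented for illustration)

# Both classes share the same feature means (purple intensity, softness);
# only the sign of the feature correlation differs between the classes.
mean = [0.5, 0.5]
cov_socks    = [[0.02,  0.018], [ 0.018, 0.02]]  # correlation +0.9
cov_eggplant = [[0.02, -0.018], [-0.018, 0.02]]  # correlation -0.9

socks    = rng.multivariate_normal(mean, cov_socks, n)
eggplant = rng.multivariate_normal(mean, cov_eggplant, n)

# Each feature has (nearly) the same sample mean in both classes,
# so neither feature alone can separate socks from egg-plants.
print("socks    feature means:", socks.mean(axis=0))
print("eggplant feature means:", eggplant.mean(axis=0))
```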

Someone may argue about the validity of my conclusions in the bio-chemical sense, but the issue here was really about finding a descriptive example that clarifies the classification problem; what goes on with egg-plants and socks in real life was not important at all.

If we assume that the data is normally distributed in the two-feature space (a common rule of thumb once we have more than 30 dirty socks and the same number of egg-plants), then the described relationship between the features, and the separation of the classes in feature space, can be shown in the following figure:


The two ellipses depict two-dimensional Gaussian distributions: red for egg-plants, blue for socks. The correlation between the features is positive for socks and negative for egg-plants.

From the figure it should now be clearer that a single feature has no power, but that by joining two very, very bad individual features we get a fantastic discriminator, at least with respect to the majority of points. Some socks will, inevitably, end up in the stew (the green area in the figure above).
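The joint rule can be demonstrated directly. In the sketch below (same invented parameters as above) the classes are separated by the sign of the centered product of the two features, which has positive expectation for socks and negative for egg-plants, while a threshold on either single feature performs at chance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mean = 500, [0.5, 0.5]
cov_socks    = [[0.02,  0.018], [ 0.018, 0.02]]   # correlation +0.9
cov_eggplant = [[0.02, -0.018], [-0.018, 0.02]]   # correlation -0.9
socks    = rng.multivariate_normal(mean, cov_socks, n)
eggplant = rng.multivariate_normal(mean, cov_eggplant, n)

def threshold_acc(x_socks, x_eggplant, t):
    # Accuracy of the rule "call it a sock if the feature exceeds t".
    return 0.5 * ((x_socks > t).mean() + (x_eggplant <= t).mean())

# A single feature, thresholded at the shared mean: essentially a coin flip.
print("color alone:   ", threshold_acc(socks[:, 0], eggplant[:, 0], 0.5))
print("softness alone:", threshold_acc(socks[:, 1], eggplant[:, 1], 0.5))

# Joint rule: the centered product (x1 - 0.5) * (x2 - 0.5) tends to be
# positive for socks and negative for egg-plants, so its sign discriminates.
p_socks    = (socks[:, 0] - 0.5) * (socks[:, 1] - 0.5)
p_eggplant = (eggplant[:, 0] - 0.5) * (eggplant[:, 1] - 0.5)
print("both together:", 0.5 * ((p_socks > 0).mean() + (p_eggplant <= 0).mean()))
```

With correlations of ±0.9 the joint rule gets roughly 85% of points right (for a bivariate normal, the product has the "right" sign with probability 1/2 + arcsin(ρ)/π), while either feature alone hovers around 50%; the remaining errors are exactly the socks in the stew.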

Now we are ready to draw a conclusion, which is one of the golden rules of pattern recognition:

The set of the best (individually evaluated) features need not be the best set of features at all. Furthermore, practice shows that it usually is not.
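A small demonstration of this rule, under the same invented parameters: add a hypothetical third feature ("weight") that is individually mildly informative. A univariate score such as the ANOVA F-statistic (the default in scikit-learn's SelectKBest) ranks "weight" far above color and softness, yet the two "worst" features together form the stronger subset.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(2)
n, mean = 500, [0.5, 0.5]
cov_socks    = [[0.02,  0.018], [ 0.018, 0.02]]
cov_eggplant = [[0.02, -0.018], [-0.018, 0.02]]

# Columns: color, softness, plus a hypothetical "weight" feature whose
# class means differ slightly, so it carries some individual power.
socks = np.column_stack([rng.multivariate_normal(mean, cov_socks, n),
                         rng.normal(0.45, 0.10, n)])
eggplant = np.column_stack([rng.multivariate_normal(mean, cov_eggplant, n),
                            rng.normal(0.55, 0.10, n)])

X = np.vstack([socks, eggplant])
y = np.array([0] * n + [1] * n)   # 0 = sock, 1 = egg-plant

# Univariate ranking: "weight" wins by a huge margin, while color and
# softness score near zero and would be dropped by a top-k selector.
print("F-scores (color, softness, weight):", f_classif(X, y)[0])

# Accuracy of "weight" alone (threshold midway between its class means)
# versus the joint color-softness rule from the previous sketch.
acc_weight = ((X[:, 2] > 0.5) == (y == 1)).mean()
acc_joint  = (((X[:, 0] - 0.5) * (X[:, 1] - 0.5) > 0) == (y == 0)).mean()
print("weight alone:    ", acc_weight)
print("color + softness:", acc_joint)
```

A selector that keeps only the individually best feature keeps "weight" (about 69% accuracy here) and throws away the pair that, taken together, reaches about 85%.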



