What and Why:


Why is this useful?

A major class of classifiers in pattern recognition is the supervised classifiers: they are trained for the recognition task using a set of samples for which the answer (the class label) is known. Using a student analogy, these classifiers "learn" with a "teacher" who can confirm (and correct) their answers on the practice problems.

In practice, however, collecting a sufficient number of training samples and correctly labeling them can be a tedious, if not impossible, task. Thus, practical supervised classifiers are better described as "imperfectly supervised classifiers".

Moreover, some classification problems exhibit a certain inherent overlap between the class distributions in feature space. This overlap can confuse a supervised classifier during training ("What do you mean this apple is neither small nor big, but somewhere in between?").

Example: NN classifier

The Nearest-Neighbor (NN) classifier is very simple: given a set of samples with known labels (the training set), it classifies a new sample with the label of the closest training sample in feature space. Typically, "closest sample" means: the training sample situated at the smallest Euclidean distance in the (multi-dimensional) feature space.
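To make the rule concrete, here is a minimal sketch of a 1-NN classifier in Python (using NumPy). The function and variable names are my own, chosen only for illustration:

    # Minimal 1-NN classifier sketch (illustrative only).
    import numpy as np

    def nn_classify(x, train_X, train_y):
        """Label a new sample x with the label of its nearest training sample.

        train_X : (n_samples, n_features) array of training feature vectors
        train_y : (n_samples,) array of class labels
        """
        # Euclidean distance from x to every training sample
        dists = np.linalg.norm(train_X - x, axis=1)
        # index of the closest training sample
        nearest = np.argmin(dists)
        return train_y[nearest]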

The Nearest-Neighbor classifier has several advantages: it is very simple (easy to implement and easy to understand), and does not make any assumptions about the data distributions in feature space (it is non-parametric). However, it trusts the training samples to be correctly labeled! For example, here is a training set, the decision boundary determined by the NN classifier (shown in green), and the boundary determined by a Bayes classifier, assuming Gaussian data distributions (shown in gray):

[Figure: decision boundaries of the NN and Bayes-Gaussian classifiers]

Note that the NN classifier fitted the training data very closely. However, because of the "rogue" samples and the small number of training samples, the resulting boundary looks unnatural. In contrast, the Bayes classifier determined its boundary from statistical measures of the data (mean, variance), and on these particular data this boundary is more likely to be correct (hence, to correctly classify new samples).
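For comparison, a Bayes classifier with Gaussian class-conditional densities can be sketched as follows: estimate each class's mean, covariance, and prior from the training samples, then assign a new sample to the class with the highest (log) posterior. This is only an illustration of the idea, not code from this tutorial:

    # Sketch of a Bayes classifier under Gaussian class-conditional densities
    # (essentially quadratic discriminant analysis). Assumes feature vectors
    # of dimension >= 2 and a few samples per class.
    import numpy as np

    def fit_gaussians(train_X, train_y):
        """Estimate mean, covariance, and prior for each class."""
        params = {}
        n = len(train_y)
        for c in np.unique(train_y):
            Xc = train_X[train_y == c]
            params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / n)
        return params

    def bayes_classify(x, params):
        """Pick the class with the highest (log) posterior probability."""
        best_c, best_score = None, -np.inf
        for c, (mu, cov, prior) in params.items():
            diff = x - mu
            inv = np.linalg.inv(cov)
            # log of the Gaussian density plus log prior (constants dropped)
            score = (-0.5 * diff @ inv @ diff
                     - 0.5 * np.log(np.linalg.det(cov))
                     + np.log(prior))
            if score > best_score:
                best_c, best_score = c, score
        return best_c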

In conclusion, an automatic method of editing the training set to remove the "bad apples" from it (that is, both obviously mis-labeled samples and samples lying at the overlap of the class distributions in feature space) is of high practical interest.

Terminology:

There are a couple of different meanings of "editing a training set" in the pattern recognition literature. In this tutorial, "editing" is used in the sense of pruning, depuration, decontamination, correcting, or simply cleaning the mis-labeled samples from the set.
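One well-known editing rule of this kind from the literature is Wilson's editing: remove every training sample whose label disagrees with the majority label of its k nearest neighbours. The sketch below is illustrative only; the specific method used later in this tutorial may differ:

    # Sketch of Wilson's editing rule (illustration of the term "editing").
    import numpy as np
    from collections import Counter

    def wilson_edit(train_X, train_y, k=3):
        keep = []
        for i in range(len(train_y)):
            # distances from sample i to all other training samples
            dists = np.linalg.norm(train_X - train_X[i], axis=1)
            dists[i] = np.inf                      # exclude the sample itself
            neighbours = np.argsort(dists)[:k]
            majority = Counter(train_y[neighbours]).most_common(1)[0][0]
            if majority == train_y[i]:             # keep samples that agree
                keep.append(i)
        return train_X[keep], train_y[keep]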

Another frequently used meaning of editing refers to an operation that reduces the size of the training set in such a way that the decision boundary (of a NN classifier) is not significantly distorted. This is also known as condensing or reducing the training set.
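A classic example of condensing is Hart's Condensed Nearest Neighbour rule: keep only a subset of the training samples that still classifies all original samples correctly under the 1-NN rule. Again, this sketch is for illustration and is not part of this tutorial's method:

    # Sketch of condensing in the sense of Hart's Condensed Nearest Neighbour.
    import numpy as np

    def condense(train_X, train_y):
        store = [0]                                # start with one sample
        changed = True
        while changed:
            changed = False
            for i in range(len(train_y)):
                if i in store:
                    continue
                dists = np.linalg.norm(train_X[store] - train_X[i], axis=1)
                nearest = store[int(np.argmin(dists))]
                if train_y[nearest] != train_y[i]:  # misclassified: add it
                    store.append(i)
                    changed = True
        return train_X[store], train_y[store]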



Chris Cocosco <crisco@bic.mni.mcgill.ca>

Last modified: 2002/01/23 16:10:33 GMT