Divergence



One of the most useful criteria in feature selection problems is the divergence, J, defined as:


\begin{displaymath}
J=\int[p(X\vert C_1)-p(X\vert C_2)]log\frac{p(X\vert C_1)}{p(X\vert C_2)}dX
\end{displaymath} (1)

where $p(X\vert C_i)$, $i=1,2$, are the class-conditional probability densities of the $d$-dimensional feature vector X. This measure of the distance between two classes is especially convenient when X follows a multivariate normal (Gaussian) distribution over the domains of classes C1 and C2.
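To make the definition concrete, here is a minimal Python sketch that approximates Eq. (1) numerically for two univariate Gaussian class-conditional densities; all parameter values are assumed purely for illustration.

\begin{verbatim}
import numpy as np

# Assumed example parameters for the two class-conditional densities.
m1, s1 = 0.0, 1.0    # mean and std of p(x|C1)
m2, s2 = 2.0, 1.5    # mean and std of p(x|C2)

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s)**2) / (s * np.sqrt(2.0 * np.pi))

# Riemann-sum approximation of Eq. (1) on a grid covering both densities.
x = np.linspace(-10.0, 12.0, 200001)
dx = x[1] - x[0]
p1, p2 = gauss(x, m1, s1), gauss(x, m2, s2)
J = np.sum((p1 - p2) * np.log(p1 / p2)) * dx
print(J)   # approximately 3.24 for these parameters
\end{verbatim}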

If $M_i$ and $\Sigma_i$, $i=1,2$, denote the mean vector and the covariance matrix of class i, the divergence can be written in the more convenient form:


\begin{displaymath}
J=\frac{1}{2}(M_1-M_2)^T(\Sigma_1^{-1}+\Sigma_2^{-1})(M_1-M_2)+\frac{1}{2}tr(\Sigma_1^{-1}\Sigma_2+\Sigma_2^{-1}\Sigma_1-2I)
\end{displaymath} (2)

where $T$ denotes the matrix transpose, tr denotes the trace, and $I$ is the identity matrix. When the covariance matrices are equal ($\Sigma_1=\Sigma_2=\Sigma$), the trace term vanishes and the divergence reduces to the Mahalanobis distance, M:


\begin{displaymath}
J=M=(M_1-M_2)^T\Sigma^{-1}(M_1-M_2)
\end{displaymath} (3)
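Both closed forms are easy to evaluate directly. The sketch below implements Eq. (2) and Eq. (3) with NumPy; the mean vectors and covariance matrix are assumed example values, and with equal covariances the two functions return the same number.

\begin{verbatim}
import numpy as np

def divergence(M1, S1, M2, S2):
    # Eq. (2): divergence between two Gaussian class distributions.
    d = M1 - M2
    S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
    mean_term  = 0.5 * d @ (S1inv + S2inv) @ d
    trace_term = 0.5 * np.trace(S1inv @ S2 + S2inv @ S1 - 2.0 * np.eye(len(M1)))
    return mean_term + trace_term

def mahalanobis(M1, M2, S):
    # Eq. (3): the equal-covariance special case.
    d = M1 - M2
    return d @ np.linalg.inv(S) @ d

# Assumed example values.
M1 = np.array([0.0, 0.0])
M2 = np.array([1.0, 2.0])
S  = np.array([[1.0, 0.3],
               [0.3, 2.0]])

# With equal covariance matrices the trace term vanishes, so both agree.
print(divergence(M1, S, M2, S), mahalanobis(M1, M2, S))
\end{verbatim}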

Now consider a 2-class, 2-feature problem. I am going to derive the relation that connects the joint divergence J(x1,x2) of two features x1 and x2 with their individual divergences:


\begin{displaymath}
J(x_1,x_2)=\frac{J(x_1)+J(x_2)-2\rho\sqrt{J(x_1)J(x_2)}}{1-\rho^2}
\end{displaymath} (4)

where $\rho$ is the correlation coefficient between x1 and x2.

The feature vector X, the mean vectors, and the (shared) covariance matrix are defined as follows:


\begin{displaymath}
X=\left(\begin{array}{c}x_1\\ x_2\end{array}\right)\qquad
M_1=\left(\begin{array}{c}m_{11}\\ m_{12}\end{array}\right)\qquad
M_2=\left(\begin{array}{c}m_{21}\\ m_{22}\end{array}\right)\qquad
\Sigma=\left(\begin{array}{cc}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{array}\right)
\end{displaymath} (5)

You have probably noticed that the covariance between the two features ($\sigma_{12}=\sigma_{21}=\rho\sigma_1\sigma_2$) was expressed in terms of the correlation coefficient $\rho$. Also notice that x1 and x2 could be replaced by any pair of individual features xi and xj from the feature vector X.

The inverse of the covariance matrix is given by:


\begin{displaymath}
\Sigma^{-1}=\left(\begin{array}{cc}
\frac{1}{\sigma_1^2(1-\rho^2)} & \frac{-\rho}{\sigma_1\sigma_2(1-\rho^2)}\\
\frac{-\rho}{\sigma_1\sigma_2(1-\rho^2)} & \frac{1}{\sigma_2^2(1-\rho^2)}
\end{array}\right)
\end{displaymath} (6)
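This closed-form inverse is easy to sanity-check numerically; in the sketch below $\sigma_1$, $\sigma_2$ and $\rho$ are assumed example values.

\begin{verbatim}
import numpy as np

s1, s2, rho = 1.0, 2.0, 0.5   # assumed example values
Sigma = np.array([[s1**2,     rho*s1*s2],
                  [rho*s1*s2, s2**2   ]])

# Closed-form inverse from Eq. (6).
c = 1.0 - rho**2
Sigma_inv_closed = np.array([[ 1.0/(s1**2*c),   -rho/(s1*s2*c)],
                             [-rho/(s1*s2*c),    1.0/(s2**2*c)]])

print(np.allclose(np.linalg.inv(Sigma), Sigma_inv_closed))   # True
\end{verbatim}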

The individual divergences (each measuring the distance between the two classes in a single-feature space) follow from Eq. (3) applied in one dimension:


\begin{displaymath}
J(x_1)=\frac{(m_{11}-m_{21})^2}{\sigma_1^2}\qquad
J(x_2)=\frac{(m_{12}-m_{22})^2}{\sigma_2^2}
\end{displaymath} (7)

Combining the above definitions with Eq. (3), we can write:


\begin{displaymath}
J(x_1,x_2)=\left(\begin{array}{cc}m_{11}-m_{21} & m_{12}-m_{22}\end{array}\right)\Sigma^{-1}\left(\begin{array}{c}m_{11}-m_{21}\\ m_{12}-m_{22}\end{array}\right)
\end{displaymath} (8)


\begin{displaymath}
\Rightarrow J(x_1,x_2)=\left(\begin{array}{cc}m_{11}-m_{21} & m_{12}-m_{22}\end{array}\right)\left(\begin{array}{c}
\frac{m_{11}-m_{21}}{\sigma_1^2(1-\rho^2)}-\frac{\rho(m_{12}-m_{22})}{\sigma_1\sigma_2(1-\rho^2)}\\
-\frac{\rho(m_{11}-m_{21})}{\sigma_1\sigma_2(1-\rho^2)}+\frac{m_{12}-m_{22}}{\sigma_2^2(1-\rho^2)}
\end{array}\right)
\end{displaymath} (9)


\begin{displaymath}
\Rightarrow J(x_1,x_2)=\frac{(m_{11}-m_{21})^2}{\sigma_1^2(1-\rho^2)}-\frac{2\rho(m_{11}-m_{21})(m_{12}-m_{22})}{\sigma_1\sigma_2(1-\rho^2)}+\frac{(m_{12}-m_{22})^2}{\sigma_2^2(1-\rho^2)}
\end{displaymath} (10)


\begin{displaymath}
\Rightarrow J(x_1,x_2)=\frac{J(x_1)}{1-\rho^2}+\frac{J(x_2)}{1-\rho^2}-\frac{2\rho}{1-\rho^2}\,\frac{m_{11}-m_{21}}{\sigma_1}\,\frac{m_{12}-m_{22}}{\sigma_2}
\end{displaymath} (11)


Assuming positive mean differences, so that $(m_{11}-m_{21})/\sigma_1=\sqrt{J(x_1)}$ and $(m_{12}-m_{22})/\sigma_2=\sqrt{J(x_2)}$ (for negative differences the sign is absorbed by the cross term), the ratios can be replaced by the square roots of the individual divergences:

\begin{displaymath}
\Rightarrow J(x_1,x_2)=\frac{J(x_1)}{1-\rho^2}+\frac{J(x_2)}{1-\rho^2}-\frac{2\rho}{1-\rho^2}\sqrt{J(x_1)}\sqrt{J(x_2)}
\end{displaymath} (12)


\begin{displaymath}
\Rightarrow J(x_1,x_2)=\frac{J(x_1)+J(x_2)-2\rho\sqrt{J(x_1)J(x_2)}}{1-\rho^2}
\end{displaymath} (13)

which is exactly the relation stated in Eq. (4).
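As a final sanity check, the sketch below verifies Eq. (13) against the direct Mahalanobis form of Eq. (8) for assumed example parameters (chosen with positive mean differences, in line with the sign assumption above).

\begin{verbatim}
import numpy as np

m11, m12 = 1.0, 3.0            # class-1 means (assumed)
m21, m22 = 0.0, 1.0            # class-2 means (assumed)
s1, s2, rho = 1.0, 2.0, 0.4    # assumed standard deviations and correlation

Sigma = np.array([[s1**2,     rho*s1*s2],
                  [rho*s1*s2, s2**2   ]])
d = np.array([m11 - m21, m12 - m22])

# Direct joint divergence, Eq. (8).
J12_direct = d @ np.linalg.inv(Sigma) @ d

# Individual divergences, Eq. (7), combined via Eq. (13).
J1 = (m11 - m21)**2 / s1**2
J2 = (m12 - m22)**2 / s2**2
J12_formula = (J1 + J2 - 2.0*rho*np.sqrt(J1*J2)) / (1.0 - rho**2)

print(np.isclose(J12_direct, J12_formula))   # True
\end{verbatim}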



