V.     Kernel Adaptation Analysis

A.     Data-Dependent Composite Kernel:

 

In this section we present a kernel-adapted kNN classifier for the detection of polyps in CT colonography. This method exploits the idea presented in [35]. We first present the data-dependent kernel methodology as follows:

Let $x_1, x_2, \ldots, x_n$ be the $n$-dimensional training samples of the given data, where $y_i \in \{+1, -1\}$ represents the class label of sample $x_i$. We formulate a data-dependent kernel to capture the relationship among the data in this classification task. This data-dependent composite kernel is formulated as

$$k(x, y) = q_1(x)\, q_1(y)\, k_1(x, y) + q_2(x)\, q_2(y)\, k_2(x, y) \qquad (20)$$

where $k_1(x, y)$, $k_2(x, y)$, and $k_0(x, y)$ are kernels we choose, and $q_i(\cdot)$, the factor function, takes the form of

$$q_i(x) = \alpha_{i0} + \sum_{m=1}^{n} \alpha_{im}\, k_0(x, x_m), \qquad i = 1, 2 \qquad (21)$$

$$k_0(x, x_m) = e^{-\gamma \lVert x - x_m \rVert^2} \qquad (22)$$

in which $\alpha_{im}$, $m = 0, 1, \ldots, n$, are the combination coefficients. Let us denote the vectors $(q_i(x_1), q_i(x_2), \ldots, q_i(x_n))^T$ and $(\alpha_{i0}, \alpha_{i1}, \ldots, \alpha_{in})^T$ by $q_i$ and $\alpha_i$ respectively, where $i = 1, 2$. We have $q_i = K_0 \alpha_i$, where $K_0$ is an $n \times (n+1)$ matrix given as follows.

 

$$K_0 = \begin{bmatrix} 1 & k_0(x_1, x_1) & \cdots & k_0(x_1, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & k_0(x_n, x_1) & \cdots & k_0(x_n, x_n) \end{bmatrix} \qquad (23)$$

 

Let the kernel matrices corresponding to $k(x, y)$, $k_1(x, y)$, and $k_2(x, y)$ be $K$, $K_1$, and $K_2$ respectively. Therefore we can express the composite kernel matrix $K$ as:

$$K = \left[\, q_1(x_i)\, q_1(x_j)\, k_1(x_i, x_j) \,\right]_{n \times n} + \left[\, q_2(x_i)\, q_2(x_j)\, k_2(x_i, x_j) \,\right]_{n \times n} \qquad (24)$$

Defining $Q_i$ as the diagonal matrix with elements $q_i(x_1), q_i(x_2), \ldots, q_i(x_n)$, we can express eq. (24) as:

$$K = Q_1 K_1 Q_1 + Q_2 K_2 Q_2 \qquad (25)$$

 This kernel model was first introduced in [32] and called “conformal transformation of a kernel.”

However, the authors in [32] did not consider how to optimize the kernel model. In the following section we perform kernel optimization based on the method presented in [35] and [36].
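To make the construction concrete, the following sketch (Python/NumPy; the function and variable names are ours, and the Gaussian width gamma is an assumed parameter) builds the factor matrix $K_0$ of eq. (23) and assembles the composite kernel matrix of eq. (25):

```python
import numpy as np

def factor_matrix(X, gamma):
    """n x (n+1) matrix K_0 of eq. (23): a column of ones followed by
    the Gaussian factor kernel k_0(x_i, x_m) = exp(-gamma ||x_i - x_m||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.hstack([np.ones((X.shape[0], 1)), np.exp(-gamma * sq_dists)])

def composite_kernel_matrix(K1, K2, alpha1, alpha2, K0):
    """Composite kernel of eq. (25): K = Q1 K1 Q1 + Q2 K2 Q2, where
    Q_i = diag(q_i) and q_i = K_0 alpha_i."""
    q1, q2 = K0 @ alpha1, K0 @ alpha2
    return q1[:, None] * K1 * q1[None, :] + q2[:, None] * K2 * q2[None, :]
```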

 

B. Kernel Optimization:

 

The optimization of the data-dependent kernel in (25) consists of setting the values of the combination coefficient vectors $\alpha_i$ so that the class separability of the training data in the mapped feature space is maximized. For this purpose, the Fisher scalar is adopted as the objective function of our kernel optimization. The Fisher scalar measures the class separability of the training data in the mapped feature space and is formulated as

$$J_i = \frac{\operatorname{tr}(S_{bi})}{\operatorname{tr}(S_{wi})}, \qquad i = 1, 2 \qquad (26)$$

where $S_{b1}$, $S_{b2}$ represent the “between-class scatter matrices” and $S_{w1}$, $S_{w2}$ are the “within-class scatter matrices.” Suppose that the training data are grouped according to their class labels, i.e., the first $n_1$ data belong to one class and the remaining $n_2$ data belong to the other class ($n_1 + n_2 = n$). Then the basic kernel matrices $K_1$ and $K_2$ can be partitioned as:

 

$$K_i = \begin{bmatrix} K_{i,11} & K_{i,12} \\ K_{i,21} & K_{i,22} \end{bmatrix}, \qquad i = 1, 2 \qquad (27)$$

where the sizes of the submatrices $K_{i,11}$, $K_{i,12}$, $K_{i,21}$, $K_{i,22}$, $i = 1, 2$, are $n_1 \times n_1$, $n_1 \times n_2$, $n_2 \times n_1$, $n_2 \times n_2$, respectively.

 

A close relation between the class separability measure $J_i$ and the kernel matrices has been established [37] as

$$J_i(\alpha_i) = \frac{\alpha_i^T M_{0i}\, \alpha_i}{\alpha_i^T N_{0i}\, \alpha_i}, \qquad i = 1, 2 \qquad (28)$$

where

$$M_{0i} = K_0^T B_{0i} K_0 \quad \text{and} \quad N_{0i} = K_0^T W_{0i} K_0, \qquad i = 1, 2 \qquad (29)$$

 

and, for $i = 1, 2$,

$$B_{0i} = \operatorname{diag}\!\left(\tfrac{1}{n_1} K_{i,11},\; \tfrac{1}{n_2} K_{i,22}\right) - \tfrac{1}{n} K_i \qquad (30)$$

$$W_{0i} = D_i - \operatorname{diag}\!\left(\tfrac{1}{n_1} K_{i,11},\; \tfrac{1}{n_2} K_{i,22}\right) \qquad (31)$$

where $D_i$ is the diagonal matrix whose entries are the diagonal elements of $K_i$.
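A minimal sketch of these quantities, assuming (as in eq. (27)) that the samples are sorted so that the first $n_1$ belong to class one (all names are ours):

```python
import numpy as np

def separability_matrices(K, K0, n1):
    """M_0i and N_0i of eq. (29) for one basic kernel matrix K,
    using B_0i of eq. (30) and W_0i of eq. (31)."""
    n = K.shape[0]
    n2 = n - n1
    # diag(K_11 / n1, K_22 / n2): the block-diagonal term of eqs. (30)-(31)
    block = np.zeros_like(K)
    block[:n1, :n1] = K[:n1, :n1] / n1
    block[n1:, n1:] = K[n1:, n1:] / n2
    B0 = block - K / n                     # between-class matrix, eq. (30)
    W0 = np.diag(np.diag(K)) - block       # within-class matrix, eq. (31)
    return K0.T @ B0 @ K0, K0.T @ W0 @ K0  # M_0i, N_0i of eq. (29)
```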

 

To maximize $J_i(\alpha_i)$, the standard gradient approach is followed.

Let

$$J_{1i} = \alpha_i^T M_{0i}\, \alpha_i\,; \qquad J_{2i} = \alpha_i^T N_{0i}\, \alpha_i \qquad (32)$$

Therefore,

$$\frac{\partial J_{1i}}{\partial \alpha_i} = 2 M_{0i}\, \alpha_i \qquad (33)$$

$$\frac{\partial J_{2i}}{\partial \alpha_i} = 2 N_{0i}\, \alpha_i \qquad (34)$$

 

We now find $\partial J_i / \partial \alpha_i$ for $i = 1, 2$. From eq. (28) and eq. (32),

$$J_i = \frac{J_{1i}}{J_{2i}} \qquad (35)$$

 

$$\frac{\partial J_i}{\partial \alpha_i} = \frac{1}{J_{2i}^2}\left(J_{2i}\, \frac{\partial J_{1i}}{\partial \alpha_i} - J_{1i}\, \frac{\partial J_{2i}}{\partial \alpha_i}\right) \qquad (36)$$

$$\frac{\partial J_i}{\partial \alpha_i} = \frac{2}{J_{2i}}\left(M_{0i}\, \alpha_i - J_i\, N_{0i}\, \alpha_i\right) \qquad (37)$$

 

The condition for obtaining the maximum value of $J_i$ is

$$\frac{\partial J_i}{\partial \alpha_i} = 0$$

This implies that

$$M_{0i}\, \alpha_i\, J_{2i} = N_{0i}\, \alpha_i\, J_{1i} \qquad (38)$$

$$M_{0i}\, \alpha_i = \frac{J_{1i}}{J_{2i}}\, N_{0i}\, \alpha_i \qquad (39)$$

From eq. (32) and eq. (35), $J_{1i}/J_{2i} = J_i(\alpha_i)$, so

$$M_{0i}\, \alpha_i = J_i(\alpha_i)\, N_{0i}\, \alpha_i \qquad (40)$$

$$\left(M_{0i} - J_i(\alpha_i)\, N_{0i}\right) \alpha_i = 0 \qquad (41)$$

If $N_{0i}^{-1}$ exists, then

$$N_{0i}^{-1} M_{0i}\, \alpha_i = J_i(\alpha_i)\, \alpha_i \qquad (42)$$

 

If the matrix $N_{0i}$ is nonsingular, the optimal $\alpha_i$ that maximizes $J_i(\alpha_i)$ is the eigenvector corresponding to the maximum eigenvalue of the system

 

$$N_{0i}^{-1} M_{0i}\, \alpha_i = \lambda_i\, \alpha_i \qquad (43)$$

 

Unfortunately, in practice, especially in analyzing gene expression data, the matrices $N_{01}$, $N_{02}$ are frequently singular, which makes the eigendecomposition method inapplicable. A gradient-based learning algorithm was proposed in [38] and [36] to solve this optimization problem. However, an alternative way to handle the problem, as suggested in [35], is to introduce a regularization parameter $\mu > 0$ and substitute the matrix $N_{0i}$ with $N_{0i} + \mu I$, where $I$ denotes the identity matrix. Therefore, the optimal $\alpha_i$ that maximizes the class separability measure $J_i(\alpha_i)$ ($i = 1, 2$) is given by the eigenvector corresponding to the maximum eigenvalue of the system

 

$$\left(N_{0i} + \mu I\right)^{-1} M_{0i}\, \alpha_i = \lambda_i\, \alpha_i, \qquad i = 1, 2 \qquad (44)$$
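In code, the regularized system (44) reduces to an ordinary eigenproblem; a minimal sketch follows ($\mu$ is a user-chosen value, here given an arbitrary default):

```python
import numpy as np

def optimal_alpha(M0, N0, mu=1e-3):
    """Largest eigenpair of (N_0i + mu*I)^{-1} M_0i (eq. (44)); the
    eigenvector is the optimal combination coefficient vector alpha_i."""
    A = np.linalg.solve(N0 + mu * np.eye(N0.shape[0]), M0)
    eigvals, eigvecs = np.linalg.eig(A)
    k = int(np.argmax(eigvals.real))
    return eigvals.real[k], eigvecs[:, k].real
```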

 

Selecting the Best Kernels that Suit the Data:

 

The criterion for selecting the best kernel functions is to find the kernels that produce the largest eigenvalues in (44), i.e.,

$$\lambda^{*} = \max_{k}\; \lambda_{\max}\!\left(\left(N_{0k} + \mu I\right)^{-1} M_{0k}\right) \qquad (45)$$

The idea is that choosing the eigenvector $\alpha$ corresponding to the maximum eigenvalue, which maximizes the class separability measure $J(\alpha)$, will result in the optimum solution.

We find the maximum eigenvalues for all candidate kernel functions and arrange them in descending order to choose the most suitable kernels. For example, one possible ordering is

$$\lambda_{\max}^{(1)} > \lambda_{\max}^{(2)} > \cdots > \lambda_{\max}^{(m)} \qquad (46)$$

and we pick the kernels corresponding to the first two largest eigenvalues to form the composite kernel.
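Reusing the two helpers sketched above (separability_matrices and optimal_alpha), the selection rule of eqs. (45)-(46) amounts to ranking the candidate kernels by their maximum eigenvalue and keeping the top two; a sketch with illustrative names:

```python
def rank_candidate_kernels(kernel_matrices, K0, n1, mu=1e-3):
    """Rank candidate basic kernels by the maximum eigenvalue of eq. (44),
    in descending order as in eq. (46); returns (eigenvalue, name, alpha)
    triples, of which the first two give the composite kernel."""
    scored = []
    for name, K in kernel_matrices.items():
        M0, N0 = separability_matrices(K, K0, n1)
        lam, alpha = optimal_alpha(M0, N0, mu)
        scored.append((lam, name, alpha))
    return sorted(scored, key=lambda t: t[0], reverse=True)
```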

 

Optimization Algorithm:

Step 1: Group the data according to their class labels. Calculate the kernel matrices $K_1$, $K_2$, and $K_0$ first, and then $B_{0i}$ and $W_{0i}$, through which we can calculate $M_{0i}$ and $N_{0i}$ for $i = 1, 2$.

Step 2: Calculate the maximum eigenvalue $\lambda_i$ and its corresponding eigenvector $\alpha_i$ according to (43), or (44) when $N_{0i}$ is singular.

Step 3: Arrange the eigenvalues in descending order of magnitude.

Step 4: Choose the kernels corresponding to the most dominant eigenvalues.

Step 5: Calculate $q_i = K_0 \alpha_i$;
Step 6: Calculate $Q_i = \operatorname{diag}(q_i(x_1), \ldots, q_i(x_n))$ and then compute $K = Q_1 K_1 Q_1 + Q_2 K_2 Q_2$.

 

Weighted Composite Kernel Function:

 

In this section we propose a composite kernel function defined as the weighted sum of the set of different optimized kernel functions. In Section V-A, we proposed a data-dependent composite kernel of the form

$$K = Q_1 K_1 Q_1 + Q_2 K_2 Q_2.$$

However, to obtain the optimum classification accuracy, we have to redefine the above composite kernel as

$$K_c = \rho_1\, Q_1 K_1 Q_1 + \rho_2\, Q_2 K_2 Q_2$$

where the value of each composite coefficient $\rho_i$ is a constant scalar and $\rho = (\rho_1, \rho_2)$ is the weight vector. The characteristics of the composite kernel are determined by the values of $\rho$; through this approach, the relative contribution of both kernels to the model can be varied. Now, instead of using $K$ as the kernel function, we will use $K_c$. According to [46], $K_c$ satisfies Mercer's condition.

Method to Assign the Weight Factor:

We set up a threshold to decide the contribution of each dominant kernel. For example, consider eq. (46), where $\lambda_{\max}^{(1)} > \lambda_{\max}^{(2)}$. We therefore define the weights by thresholding the normalized eigenvalues:

If $\lambda_{\max}^{(1)} / (\lambda_{\max}^{(1)} + \lambda_{\max}^{(2)}) > 0.9$, then $\rho_1 > 0.9$;

if $\lambda_{\max}^{(1)} / (\lambda_{\max}^{(1)} + \lambda_{\max}^{(2)}) > 0.8$, then $\rho_1 > 0.8$; and so on…

The weighted composite kernel can then be implemented as follows:
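A sketch of one possible implementation, under our reading of the rule above that the dominant kernel's weight tracks its normalized maximum eigenvalue (the names and the normalization are assumptions):

```python
def composite_weights(lam1, lam2):
    """Weight of each optimized kernel from its share of the two largest
    eigenvalues of eq. (46); rho1 + rho2 = 1 keeps K_c a Mercer kernel."""
    rho1 = lam1 / (lam1 + lam2)
    return rho1, 1.0 - rho1

def weighted_composite(K1_opt, K2_opt, rho1, rho2):
    """K_c = rho1 * (Q1 K1 Q1) + rho2 * (Q2 K2 Q2); the arguments are the
    already optimized (conformally transformed) kernel matrices."""
    return rho1 * K1_opt + rho2 * K2_opt
```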

Choosing Appropriate Parameters and Kernels:

The following algorithm represents the method adopted for selecting the optimum kernels and their respective parameters for the given data set.

 

Step 1: Set aside a part of the training data for validation.

Step 2: Select the pair of kernels that gives the highest validation accuracy.

Step 3: Find the weights by the method mentioned above.

Step 4: Search for the optimum kernel parameters (the largest weight should be assigned to the kernel corresponding to maximum classification accuracy).

Step 5: Build the model with $K_c$ by using the whole training data.

Step 6: Test the built model on the test set.
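Since the final model is a kernel-adapted kNN, testing in Step 6 requires distances in the mapped feature space, $d^2(x, x_i) = k(x, x) - 2k(x, x_i) + k(x_i, x_i)$. A minimal sketch (the neighbor count k and all names are our choices):

```python
import numpy as np

def kernel_knn_predict(K_train, K_cross, k_test_diag, y_train, k=5):
    """kNN in the feature space induced by the composite kernel.
    K_train:     n x n kernel matrix on the training data
    K_cross:     m x n kernel block between test and training points
    k_test_diag: length-m vector of k(x, x) for the test points
    y_train:     training labels in {+1, -1}."""
    d2 = k_test_diag[:, None] - 2 * K_cross + np.diag(K_train)[None, :]
    neighbors = np.argsort(d2, axis=1)[:, :k]       # k nearest per test point
    return np.sign(y_train[neighbors].sum(axis=1))  # majority vote (odd k)
```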