The support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifiers


The support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifier systems have been drawing much attention recently due to their robustness and generalization capability. The import vector machine (IVM) of Zhu and Hastie (2005) approximates the full kernel classifier by sequentially selecting a subset of the training observations, called import vectors. We pose the same problem with the roles played by observations and dimensions exchanged in IVM. Hence the idea is to create a dimension/feature screening/selection methodology via a sequential search strategy over the original feature space. Our proposal has the following features:

- It uses a kernel machine for classifier construction.
- It produces a nonlinear classification boundary in the original input space.
- The feature selection is done in the original input space, not in the kernel-transformed feature space.
- Unlike SVM, it yields estimates of classification probabilities.

Let the training data be {(x_i, y_i), i = 1, …, n}, with labels y_i ∈ {−1, 1} and inputs x_i ∈ R^p. The SVM can be written as the penalized optimization problem

min_{f ∈ H_K} (1/n) Σ_i [1 − y_i f(x_i)]_+ + λ ||f||²_{H_K},   (3.1)

where [·]_+ denotes the hinge loss, λ > 0 is the smoothing or regularization parameter, and H_K is a space of functions on R^p which is called a reproducing kernel Hilbert space (RKHS). In this article we will employ the radial basis function (RBF) kernel, which is given by

K(x, x′) = exp(−||x − x′||² / (2σ²)).

The representer theorem relies on a positive definite reproducing kernel to achieve this seemingly impossible (infinite-dimensional) computation: the optimal solution of (3.1) is given by equation (2.2), a finite expansion of the form f(x) = Σ_i a_i K(x, x_i). It turns out that for most cases a sizeable number of the coefficients a_i are exactly zero; the observations with nonzero coefficients are the support vectors.

The SVM outputs a class label but not P(Y = 1 | X = x), although this classification probability is often of interest by itself. Noting the similarity of the hinge loss of SVM and the negative log-likelihood (NLL) of the binomial distribution (plotted in Figure 1), Zhu and Hastie (2005) proposed to replace the hinge loss in equation (3.1) with the NLL. This essentially produces kernel logistic regression (KLR), given by

min_{f ∈ H_K} (1/n) Σ_i ln(1 + exp(−y_i f(x_i))) + λ ||f||²_{H_K}.

[Figure 1: Hinge loss of SVM and NLL of the binomial distribution for two-class classification.]

For a two-class problem with labels y_i ∈ {−1, 1} and inputs x_i, IVM selects a subset of the training observations to approximate the full model. For feature selection, however, we face a problem of a different kind: selection of dimensions. Reduction of dimension (as opposed to selection) is another possibility in the recently popular high-dimensional context, but in this paper our focus is exclusively on dimension/feature selection.
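As a concrete illustration of the KLR formulation above, the following is a minimal sketch that fits the objective (1/n) Σ ln(1 + exp(−y_i f(x_i))) + λ aᵀKa with f(x) = Σ a_i K(x, x_i) by plain gradient descent on toy data. The function names, step sizes, and toy data are illustrative assumptions, not part of the original paper; the paper's own fitting procedure is Newton-Raphson.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_klr(X, y, lam=0.1, sigma=1.0, steps=1000, lr=0.1):
    """Minimize (1/n) sum log(1 + exp(-y_i f(x_i))) + lam * a' K a,
    where f(x) = sum_j a_j K(x, x_j) by the representer theorem."""
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    a = np.zeros(n)
    for _ in range(steps):
        f = K @ a
        # derivative of log(1 + exp(-y f)) with respect to f is -y / (1 + exp(y f))
        dloss = -y / (1.0 + np.exp(y * f))
        grad = K @ dloss / n + 2.0 * lam * (K @ a)
        a -= lr * grad
    return a

# Toy data: two well-separated Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]
a = fit_klr(X, y)
f = rbf_kernel(X, X) @ a
acc = np.mean(np.sign(f) == y)
```

Unlike the SVM score, the fitted f also yields the probability estimate P(Y = 1 | X = x) = 1 / (1 + exp(−f(x))), which is the practical payoff of the KLR formulation.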
In the variable selection context, the Lasso proposed by Tibshirani (1996) is a very successful method for automatic feature selection. In the penalized regression context with p > n, however, the Lasso can select at most n dimensions only. Also, in situations where two (or more) dimensions have high correlation, the Lasso tends to select only one dimension from the group. Park and Hastie (2008) considered the NLL of the binomial distribution with an L1 penalty. With our proposal (introduced later), as many or as few dimensions can be selected as desired.

4 Feature Selection in KLR Framework

Let us denote the index set of input dimensions by F = {1, 2, …, p}. The kernel maps the input into a transformed feature space whose dimension may far exceed p, and the classification boundary, which is a hyperplane in the transformed feature space, is nonlinear in the original input space. Suppose the first q coordinates of x are true features (or signals) and its last p − q coordinates are noises; we can partition the coordinates of x accordingly, and a KLR problem on all of F then reduces to one on the signal coordinates alone.

To fit KLR we set the derivative of the objective with respect to the coefficient vector a equal to zero and use the Newton-Raphson method to iteratively solve the score equation. With a little bit of algebra it can be shown that each Newton-Raphson step is a weighted least squares fit. The selection itself proceeds sequentially:

1. Start with the selected set S = ∅ and k = 1.
2. For each j ∈ F \ S, fit KLR using the dimensions in S ∪ {j}, and let j* be the dimension giving the smallest regularized NLL.
3. Set S = S ∪ {j*} and k = k + 1.
4. Repeat Steps 2 and 3 until the convergence criteria are satisfied.

The dimensions in S are called imported features.

4.3 Convergence Criteria

In their original IVM algorithm, Zhu and Hastie (2005) compared the value of the regularized NLL, at step k, with its value Δ steps earlier, where Δ is a pre-chosen small integer, say Δ = 1. If the relative ratio of the change is less than a pre-chosen small number, say ε = 0.001, the algorithm stops adding new observations. This convergence criterion is fine in the IVM context, as their algorithm compares individual observations without altering dimensions. For FIVM, since different iterations work with different sets of dimensions, we instead compute the proportion of correctly classified training observations with k imported features.
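The sequential search over F described above can be sketched as follows. To keep the sketch short, a plain (linear) logistic model fitted by gradient descent stands in for the KLR fit at each candidate subset; the structure of the greedy forward loop is the point, and all names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def train_nll(X, y, steps=300, lr=0.1):
    """Fit logistic regression by gradient descent and return the final
    training NLL. (A linear stand-in for the KLR fit in the algorithm.)"""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        f = X @ w
        g = X.T @ (-y / (1.0 + np.exp(y * f))) / n
        w -= lr * g
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def greedy_select(X, y, k):
    """Forward selection: at each step add the candidate dimension whose
    inclusion gives the lowest training NLL."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: train_nll(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: dimensions 0 and 2 are signals, the rest are noise.
rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] + 0.8 * X[:, 2] + 0.1 * rng.normal(size=n))
sel = greedy_select(X, y, 2)
```

On such data the greedy loop picks out the two signal dimensions; each outer step costs one model fit per remaining candidate, which is the price of searching in the original input space rather than the kernel-transformed space.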
At step k we compare this proportion with its value Δ steps earlier; if the relative ratio of the change is less than a pre-chosen small number, say ε = 0.001, the algorithm stops adding new features. Though it was not discussed in their original paper, we would like to mention that the convergence criterion described in Zhu and Hastie (2005) carries a mild assumption, namely that no observation is repeated, for it to be successfully applicable. If there are two (or more) identical data points and Δ = 1 is chosen, and if it turns out that one of those identical points becomes an import point at some step of the iteration, then the algorithm will stop there, as the criterion ratio will be exactly zero. One may also argue that the denominator of the ratio could be an unstable quantity. While this is true, in the way we define the relative ratio (of the change to the current value) we closely follow the computational consideration of Zhu and Hastie (2005).
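The relative-ratio stopping rule discussed above can be written as a few lines; the function name and the sample sequence of objective values are illustrative assumptions, and the degenerate case noted in the text (identical data points driving the change to exactly zero) shows up here as an immediate stop.

```python
def should_stop(H, k, delta=1, eps=1e-3):
    """Stop when |H[k] - H[k - delta]| / |H[k]| < eps, following the
    relative-ratio convergence criterion. H holds the values of the
    monitored quantity (regularized NLL for IVM, classification
    proportion for FIVM) at each completed step."""
    if k - delta < 0:
        return False  # not enough history yet
    return abs(H[k] - H[k - delta]) / abs(H[k]) < eps

# Example trace of the monitored quantity across steps.
H = [1.00, 0.40, 0.25, 0.2499]
```

Here `should_stop(H, 2)` is False (the value is still changing by 60%), while `should_stop(H, 3)` is True (the relative change 0.0001/0.2499 falls below ε = 0.001), so the search would stop after the fourth step.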