ORIGINAL ARTICLE

A parsimonious SVM model selection criterion for classification of real-world data sets via an adaptive population-based algorithm

Omid Naghash Almasi¹ · Mohammad Hassan Khooban²

Received: 30 November 2016 / Accepted: 1 March 2017
© The Natural Computing Applications Forum 2017

Abstract This paper proposes and optimizes a two-term cost function consisting of a sparseness term and a generalized v-fold cross-validation term by a new adaptive particle swarm optimization (APSO). APSO updates its parameters adaptively based on dynamic feedback from the success rate of each particle's personal best. Since the proposed cost function favors solutions with fewer support vectors, the complexity of the SVM model is decreased while the accuracy remains in an acceptable range. Therefore, the testing time decreases, making SVM more applicable to practical applications on real data sets. A comparative study on data sets from the UCI database is performed between the proposed cost function and the conventional cost function to demonstrate the effectiveness of the proposed cost function.

Keywords Parameter selection · Model complexity · Support vector machines · Adaptive particle swarm optimization · Classification · Real-world data sets

1 Introduction

Support vector machines (SVMs) were proposed by Vapnik [1]. SVM is based on statistical learning theory and implements the structural risk minimization principle. It has proven to be a powerful machine learning method that attracts a great deal of research in the fields of classification, function estimation, and distribution estimation [2]. The generalization ability of an SVM depends on the proper choice of a set of two adjustable parameters, which is called the SVM model selection problem [3–5].
Another important feature of SVM is its sparseness property, which allows only a small part of the training data, called support vectors (SVs), to contribute to the construction of the final hyper-plane. As a result, the SVM model has a small size, and hence less time is consumed in the testing phase in comparison with a model built from all of the training data. The solution of a model selection problem not only controls the generalization performance, but also affects the SVM model size. Large problems generate large data sets, and consequently on these data sets the SVM model size (the number of SVs) will increase. One would expect SVM, as a sparse machine learning method, to deal with this problem, but the model reduction is not as large as expected in real-world applications, and the number of support vectors grows with the size of the data set. Generally, two crucial problems arise in SVM applications. The first is the lack of a definitive method for tuning the SVM parameters, and the other is the size of the model on large data sets. In fact, the model selection problem plays an important role in SVM generalization performance for both small and large data sets, but for large real-world data sets the model selection complexity dramatically increases. Various model selection methods have been proposed based on different criteria, such as the Jaakkola–Haussler bound [6], the Opper–Winther bound [7], the span bound [8], the radius/margin bound [9], the distance between two classes

Corresponding author: Mohammad Hassan Khooban (khooban@sutech.ac.ir)
¹ Young Researchers and Elite Club, Mashhad Branch, Islamic Azad University, Mashhad, Iran
² Department of Electrical Engineering, Shiraz University of Technology, Shiraz, Iran

Neural Comput & Applic, DOI 10.1007/s00521-017-2930-y
[10], and v-fold cross-validation [11]. Generally, gradient descent-based algorithms are used to optimize differentiable criteria. Although these methods are fast, they may get stuck in local minima and are therefore not applicable to all of the aforesaid criteria [4, 9, 10, 12, 13]. To overcome these drawbacks, global optimization methods such as PSO [14–16], simulated annealing [17], ant colony optimization [18], and GA [19, 20] have been introduced for non-differentiable and non-smooth cost function optimization problems. More recently, a PSO-based method has been proposed that uses PSO to tune the SVM parameters and to evolve artificial instances that make imbalanced data sets balanced [21]. Many researchers have used v-fold cross-validation instead of conventional validation to evaluate the generalization performance, because in some cases there are not enough data available to partition into separate training and test sets without losing significant modeling or testing capability [4, 9, 11, 22–26]. Moreover, the aim of v-fold cross-validation is to ensure that every datum from the original data set has the same chance of appearing in both the training and the testing sets. The main contribution of this paper is summarized as follows: (1) A new criterion is proposed for the model selection problem that considers both the tuning of the SVM parameters and model size reduction at once. Building a parsimonious model and efficiently tuning the SVM parameters play important roles in reducing the testing time and increasing the generalization performance of an SVM, respectively. To achieve these goals concurrently, a two-term cost function consisting of sparseness and generalization performance measures of the SVM is proposed. (2) To find the global optimal solution of the proposed cost function, a new adaptive particle swarm optimization (APSO) is also proposed.
APSO uses success rate feedback to update the inertia weight, and its cognitive and social weights are also changed adaptively during the optimization process to improve performance. The efficiency of APSO is evaluated by comparison with standard PSO on static benchmark test functions. Finally, the effectiveness of the proposed cost function is assessed in comparison with a one-term cost function consisting only of a generalization performance criterion on nine data sets.

The rest of this paper is organized as follows. The SVM formulation for binary classification is reviewed in Sect. 2. In Sect. 3.1, the generalized v-fold cross-validation formulation is stated; in Sect. 3.2, the new APSO is introduced; and in Sect. 3.3, the proposed model selection criterion is presented. Section 4 states the experimental conditions and discusses the experimental results. Finally, conclusions are drawn in Sect. 5.

2 Support vector machine

Assume a given two-class labeled data set X = {(x_i, y_i)}. Each data point x_i ∈ R^n belongs to either of two classes, as determined by a corresponding label y_i ∈ {−1, 1}, for i = 1, …, n. The optimal hyper-plane is obtained by solving the quadratic optimization problem in Eq. (1):

  \min_{w,\xi}\ \varphi(w,\xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
  s.t.  y_i(w^T x_i + b) \ge 1 - \xi_i,  i = 1, 2, \dots, n
        \xi_i \ge 0,  i = 1, 2, \dots, n    (1)

where ξ_i is a slack variable representing the violation of the pattern separation condition for each datum, and C is a penalty factor, called the regularization parameter, that controls the SVM model complexity. This is one of the model selection parameters in the SVM formulation. For non-linearly separable data, the kernel trick is used to map the input space into a high-dimensional space called the feature space; the optimal hyper-plane is then obtained in the feature space. The primal optimization problem Eq.
(1) is transformed into its dual form, written as:

  \max\ Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)
  s.t.  \sum_{i=1}^{n} \alpha_i y_i = 0,  0 \le \alpha_i \le C,  i = 1, \dots, n    (2)

where k(·, ·) is a kernel function. Some conventional kernel functions are listed in Table 1. The kernel parameter strongly affects the generalization performance as well as the model complexity of the SVM; kernel parameters are therefore considered as the other model selection parameters. Furthermore, in Eq. (2), α = (α_1, …, α_n) is the vector of non-negative Lagrange multipliers [1]. The solution vector α is sparse, i.e., α_i = 0 for most indices of the training data. This is the so-called SVM sparseness property.

Table 1 Conventional kernel functions

  Linear kernel:      k(x_i, x_j) = x_i^T x_j
  Polynomial kernel:  k(x_i, x_j) = (t^* + x_i^T x_j)^{d^*}
  RBF kernel:         k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^{*2})
  MLP kernel:         k(x_i, x_j) = \tanh(\beta_0^* x_i^T x_j + \beta_1^*)

  * Kernel parameter

The points x_i that correspond to nonzero α_i are called
support vectors. The points x_i with α_i = 0 therefore make no contribution to the construction of the optimal hyper-plane; only a part of the training data, the support vectors, constructs it. Let ν be the index set of support vectors; then the optimal hyper-plane is

  f(x) = \sum_{i \in \nu} \alpha_i y_i k(x_i, x) + b = 0    (3)

and the resulting classifier is

  y(x) = \mathrm{sgn}\left[ \sum_{i \in \nu} \alpha_i y_i k(x_i, x) + b \right]    (4)

where b is the bias parameter, determined by the Karush–Kuhn–Tucker (KKT) conditions [1].

3 Proposed model selection

3.1 Generalized v-fold cross-validation criterion

The generalized v-fold cross-validation (CV) criterion was first introduced by Craven et al. [27]. Consider a given training set of n data points {(x_k, y_k) | k = 1, 2, …, n}. The following definition is assumed in order to formulate the generalized v-fold CV estimator.

Definition 3.1 (Linear smoother) An estimator \hat{f} of f is called a linear smoother if, for each x ∈ R^d, there exists a vector L(x) = (l_1(x), …, l_n(x))^T ∈ R^n such that

  \hat{f}(x) = \sum_{k=1}^{n} l_k(x) Y_k.    (5)

In matrix form, this can be written as \hat{f} = LY, with L ∈ R^{n×n}; L is called the smoother matrix. Craven et al. [27] demonstrated that the deleted residuals Y_k − \hat{f}^{(−k)}(X_k; θ) can be written in terms of Y_k − \hat{f}(X_k; θ) and the trace of the smoother matrix L. Moreover, the smoother matrix depends on the tunable parameters θ = (c, σ). The generalized v-fold CV criterion satisfies

  \text{Generalized } v\text{-fold CV}(\theta) = \frac{1}{n} \sum_{k=1}^{n} \left[ \frac{Y_k - \hat{f}(X_k; \theta)}{1 - n^{-1} \mathrm{tr}\, L(\theta)} \right]^2.    (6)

The generalized v-fold CV estimate of θ is obtained by minimizing (6); for more details, see [27, 28]. Li [29] and Cao et al. [30] investigated the effectiveness of generalized v-fold CV and found it to be a robust criterion: regardless of the magnitude of the noise, the same θ is obtained.
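As a concrete illustration of the sparse decision rule in Eqs. (3) and (4), the sketch below evaluates an RBF-kernel classifier over a handful of support vectors. The support vectors, multipliers, labels, and bias are made-up toy values for illustration only, not results from the paper:

```python
import math

def rbf_kernel(xi, xj, sigma):
    # Table 1: k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-d2 / sigma ** 2)

def classify(x, svs, alphas, labels, b, sigma):
    # Eq. (4): y(x) = sgn( sum_i alpha_i * y_i * k(x_i, x) + b )
    s = sum(a * y * rbf_kernel(xi, x, sigma)
            for xi, a, y in zip(svs, alphas, labels))
    return 1 if s + b >= 0 else -1

# toy "trained" model: two support vectors, one per class
svs = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]        # hypothetical nonzero Lagrange multipliers
labels = [-1, 1]
print(classify((1.9, 1.9), svs, alphas, labels, 0.0, 1.0))  # -> 1
```

Only points with nonzero α_i enter the sum, which is why a smaller support set directly shortens the testing time — the property the proposed cost function exploits.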
3.2 Adaptive particle swarm optimization

PSO is one of the modern population-based optimization algorithms, first introduced by Kennedy and Eberhart [31]. It uses a swarm of particles to find the global optimum in a search space. Each particle represents a candidate solution of the cost function and has its own position and velocity. Assume the particle swarm is in a D-dimensional search space, and let the ith particle be represented as x_i = (x_{i1}, …, x_{id}, …, x_{iD}). The best previous position of the ith particle is recorded and represented as pb_i = (pb_{i1}, …, pb_{id}, …, pb_{iD}); it is called Pbest and gives the best cost function value found by that particle. The global best position, gbest, denoted p_{gb}, is the best of the Pbest positions among all particles. The velocity of the ith particle is represented as v_i = (v_{i1}, …, v_{id}, …, v_{iD}). In each iteration, the velocity and position of each particle are updated according to Eqs. (7) and (8), respectively:

  v_{id} = w v_{id} + C_1 r_1 (pb_{id} - x_{id}) + C_2 r_2 (p_{gb,d} - x_{id})    (7)
  x_{id} = x_{id} + v_{id}    (8)

where w is an inertia weight, typically selected within the interval [0, 1]; C_1 is a cognitive weight factor; C_2 is a social weight factor; and r_1 and r_2 are generated randomly within the interval [0, 1]. Standard PSO has some shortcomings: it can converge to local minima in multimodal optimization problems, and it has parameters that must be tuned to obtain acceptable exploration and exploitation properties [32, 33]. In [34], by considering a stability condition and an adaptive inertia weight, the acceleration parameters of PSO are determined adaptively, and a simple adaptive nonlinear strategy is introduced.
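Equations (7)–(8) amount to a few lines of code per particle. The following is a minimal sketch of one standard-PSO update step; the parameter values (w, C1, C2) are illustrative defaults, not the adaptive ones introduced below:

```python
import random

def pso_step(x, v, pbest, gbest, w=0.5, c1=1.5, c2=1.5):
    # Eqs. (7)-(8): per-dimension velocity and position update
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = random.random(), random.random()
        vd = (w * v[d]
              + c1 * r1 * (pbest[d] - x[d])
              + c2 * r2 * (gbest[d] - x[d]))
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v

# a lone particle whose pbest equals its position is pulled toward gbest
random.seed(0)
x, v = [5.0, -5.0], [0.0, 0.0]
for _ in range(100):
    x, v = pso_step(x, v, pbest=x, gbest=[0.0, 0.0])
print(x)  # after many steps, close to the attractor [0.0, 0.0]
```

In a full optimizer this update runs for every particle each iteration, with pbest and gbest refreshed from the cost function values.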
This strategy mainly depends on each particle's performance, which is determined by calculating the absolute distance between the particle's personal best (Pbest) and the global best position (gbest) among all particles in each iteration of the algorithm [35]. In [36], the inertia weight is given as a function of an evolution speed factor and an aggregation degree factor, and its value is adjusted dynamically according to these quantities. To improve the performance of standard PSO, the inertia, cognitive, and social weight factors should all be modified. In this paper, the main idea for modifying the inertia weight is inspired by the 1/5 success rule introduced by Schwefel [37, 38] for evolution strategies. Here, in each iteration, a particle counts as a success if its Pbest achieves a better cost function value than in the previous iteration. The success rate is formulated in Eq. (9), and the percentage of successes is then calculated using Eq. (10).
  SuccessRate(i, iter) = 1 if CostFcn(Pbest_i^{iter}) < CostFcn(Pbest_i^{iter-1}), and 0 otherwise    (9)

  P_{succ} = \frac{1}{n} \sum_{i=1}^{n} SuccessRate(i, iter)    (10)

where n is the number of particles. The value of P_{succ} varies within the interval [0, 1]. Clearly, when P_{succ} is high, the swarm is still improving and the Pbest positions are far from the optimum point of the cost function, and vice versa; the inertia weight should therefore be correlated with P_{succ}. Because the inertia weight is frequently given in linear form, we formulate it as a linear function of P_{succ}:

  w(iter) = (w_{max} - w_{min}) P_{succ} + w_{min}    (11)

The range of the inertia weight, [w_{min}, w_{max}], is selected to be [0.2, 0.9]. To control the trade-off between the exploitation and exploration properties of the PSO algorithm, a large cognitive weight and a small social weight should be chosen at the beginning of the optimization process, which enhances the exploration property of PSO. In contrast, toward the ending stages of the algorithm, a small cognitive weight and a large social weight should be assigned so as to improve convergence to the global optimum [39]. It is therefore necessary to change the cognitive and social weights adaptively during the optimization process. To this end, the following formulas are utilized for APSO [32, 33, 38]:

  With C_1^{final} < C_1^{initial}:  C_1 = (C_1^{final} - C_1^{initial}) \frac{iter}{iter_{max}} + C_1^{initial}    (12)

  With C_2^{final} > C_2^{initial}:  C_2 = (C_2^{final} - C_2^{initial}) \frac{iter}{iter_{max}} + C_2^{initial}    (13)

where the superscripts "initial" and "final" indicate the initial and final values of the cognitive and social weight factors, respectively. To demonstrate the superior performance of APSO, it is compared with standard PSO on three common static benchmark test functions. Finally, APSO is used to solve the model selection problem of the SVM.
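The adaptive schedule of Eqs. (9)–(13) can be sketched as follows; the initial/final values chosen for C1 and C2 (2.5 → 0.5 and 0.5 → 2.5) are illustrative assumptions, since the paper does not list them at this point:

```python
def success_percentage(prev_costs, curr_costs):
    # Eqs. (9)-(10): fraction of particles whose Pbest cost improved
    n = len(curr_costs)
    return sum(1 for p, c in zip(prev_costs, curr_costs) if c < p) / n

def inertia_weight(p_succ, w_min=0.2, w_max=0.9):
    # Eq. (11): linear map from the success percentage to the inertia weight
    return (w_max - w_min) * p_succ + w_min

def ramp(it, it_max, c_initial, c_final):
    # Eqs. (12)-(13): linear ramp of the cognitive/social weight over the run
    return (c_final - c_initial) * it / it_max + c_initial

# whole swarm still improving -> large inertia (keep exploring)
print(inertia_weight(success_percentage([3.0, 2.0], [1.0, 1.5])))
# cognitive weight decays while social weight grows over 50 iterations
print(ramp(0, 50, 2.5, 0.5), ramp(50, 50, 2.5, 0.5))   # C1: 2.5 -> 0.5
print(ramp(0, 50, 0.5, 2.5), ramp(50, 50, 0.5, 2.5))   # C2: 0.5 -> 2.5
```

Plugging `inertia_weight` and the two ramps into the velocity update of Eqs. (7)–(8) yields the APSO variant described above.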
The test functions are used to investigate the convergence speed and solution quality of PSO and APSO; Table 2 provides a detailed description. All the test functions are minimization problems. The first function (Rosenbrock) is unimodal, while the others (Rastrigin and Ackley) are multimodal. The termination criterion for both PSO and APSO is reaching the maximum iteration number. In this study, the maximum number of iterations and the number of particles are selected to be 50 and 30, respectively, for both algorithms. The dimension of the search space (D) is 30. For each test function, x* is the global optimum and f(x*) is the best achievable fitness. Figure 1 compares PSO and APSO in terms of final accuracy and convergence speed over 100 iterations. The results demonstrate that APSO performs considerably better on both unimodal and multimodal optimization problems. In solving the model selection problem of the SVM, APSO is used to optimize the proposed cost function; after the maximum number of iterations is reached, the global best particle represents the optimal solution, consisting of the best regularization parameter and the best kernel parameter for the SVM model.

3.3 Proposed cost function for the model selection problem

A successful selection of the SVM model rests on two important parameters affecting both the generalization performance and the model size of the SVM. As discussed earlier, these are the regularization and kernel parameters. In non-separable problems, noisy training data introduce slack variables that measure the violation of the margin. Therefore, a penalty factor C is included in the SVM formulation to control the amount of margin violation.
In other words, the penalty factor C determines the trade-off between minimizing the empirical error and the structural risk, and also guarantees the accuracy of the classifier in the presence of noisy training data. Selecting a large value for C makes the margin hard and the cost of violation too high, so the separating surface over-fits the training data. In contrast, choosing a small value for C allows the margin to be soft, which results in an under-fitting separating surface. In both cases the generalization performance of the classifier is unsatisfactory, making the SVM model useless [40]. The kernel parameter(s) implicitly characterize the geometric structure of the data in the high-dimensional feature space. In the feature space, the data become linearly separable in such a way that the maximal margin of separation between the two classes is achieved. The selection of the kernel parameter(s) changes the shape of the separating surface in the input space. Selecting an improperly large or small value for the kernel parameter results in
over-fitting or under-fitting of the SVM model, so the model cannot accurately classify the data set [13, 41]. Therefore, we cast model selection as an optimization problem by proposing a cost function that concurrently boosts both the generalization performance and the sparseness property of an SVM. Using only the generalization performance error obtained from the generalized v-fold CV method as the model selection criterion guarantees high generalization performance, but it neither avoids the over-/under-fitting problem nor steers the search toward improving the sparseness property of the SVM; both issues are more likely in real data sets because of the large number of SVs. The one-term cost function, consisting of the generalized v-fold CV error alone, is defined as:

  One-Term Cost Fun = Generalized v-fold CV Error    (14)

A modification is needed to overcome the mentioned drawbacks of the one-term cost function. The proposed two-term cost function is formulated as:

  Two-Term Cost Fun = a_1 · Generalized v-fold CV Error + a_2 · Sparseness    (15)

where a_1 = 0.8 and a_2 = 0.2 are coefficients reflecting the significance of the generalized v-fold CV error and of sparseness in the cost function, respectively. The sparseness term is obtained by dividing the number of SVs by the total number of training data. The proposed cost function is thus the weighted sum of the generalized v-fold cross-validation error and the sparseness property of the SVM. By including SVM sparseness as the second term, the over-/under-fitting problem is controlled; the sparsity of the solution is improved, and the model size as well as the testing time is decreased.

4 Computational experiments

4.1 Experimental configuration

To evaluate the performance of the proposed cost function, a PC with a Dual-Core E2160 @ 1.8 GHz CPU and 1 GB RAM is used.
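Equation (15) is cheap to evaluate once the CV error and the support-vector count are known. A minimal sketch, using invented toy numbers for the errors and SV counts:

```python
def two_term_cost(cv_error, n_svs, n_train, a1=0.8, a2=0.2):
    # Eq. (15): a1 * generalized v-fold CV error + a2 * sparseness,
    # with sparseness = (#SVs) / (#training data)
    return a1 * cv_error + a2 * (n_svs / n_train)

# a slightly less accurate but much sparser model wins under Eq. (15):
dense = two_term_cost(cv_error=0.040, n_svs=400, n_train=1000)
sparse = two_term_cost(cv_error=0.050, n_svs=150, n_train=1000)
print(dense, sparse, sparse < dense)
```

This is the quantity APSO minimizes over (C, σ); with a_1 = 0.8 and a_2 = 0.2, a one-percentage-point rise in CV error is traded against roughly a four-point drop in the SV ratio.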
Nine data sets from the UCI database, commonly used in the literature, are used to assess the effectiveness of the proposed cost function in comparison with the one-term cost function for solving the model selection problem. The value of v in generalized v-fold CV is set to 10 in this study. The data set descriptions are presented in Table 3.

Table 2 Benchmark test functions [34]

  Rosenbrock: f(x) = \sum_{i=1}^{D-1} [100(x_{i+1} - x_i^2)^2 + (x_i - 1)^2], D = 30, search space [-5, 10]^D, x* = [1, …, 1], f(x*) = 0
  Rastrigin:  f(x) = \sum_{i=1}^{D} [x_i^2 - 10\cos(2\pi x_i) + 10], D = 30, search space [-5.12, 5.12]^D, x* = [0, …, 0], f(x*) = 0
  Ackley:     f(x) = -20\exp(-0.2\sqrt{\tfrac{1}{D}\sum_{i=1}^{D} x_i^2}) - \exp(\tfrac{1}{D}\sum_{i=1}^{D}\cos 2\pi x_i) + 20 + e, D = 30, search space [-32, 32]^D, x* = [0, …, 0], f(x*) = 0

Fig. 1 Comparison results between the PSO algorithm and the new APSO algorithm on three benchmark test functions: a Rosenbrock, b Rastrigin, c Ackley (convergence curves; plot data omitted)

Although the proposed method could
be applied with any kernel function, all experiments reported here use the RBF kernel, for the following reasons. First, the RBF kernel nonlinearly maps data sets into the feature space, so it can handle data sets in which the relation between the desired output and the input attributes is nonlinear. Second, it has fewer hyper-parameters, which reduces the complexity of the model selection problem. Finally, the RBF kernel has fewer numerical difficulties [10, 13, 41]. As a result, the model selection parameters are the regularization parameter (C) and the RBF kernel parameter (σ). The search space (model selection range) for C is [1, 1000] and for σ is [0.01, 100]. The performance of the SVM model is obtained by averaging over 1000 optimal models built from the optimal parameters.

4.2 Experimental results and discussion

For each data set in Table 3, a comparative study between the optimal models obtained by the proposed two-term cost function and the one-term cost function is performed, covering the generalization performance accuracy, the model size, and the testing time. The results are presented in Table 4, which shows that the parsimonious model obtained from the two-term cost function markedly reduces the model size in comparison with the model obtained from the one-term cost function; consequently, the testing time is considerably reduced. Overall, the data sets show on average a 46% reduction in model size and a 37% reduction in testing time. For instance, for the smallest data set of the experiment (Wine) and the largest (DNA), the model size reduction is 58 and 64%, respectively, and the testing time reduction is 26.51 and 66.00%, in comparison with the one-term cost function.
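The two averages quoted above can be checked directly from the reduction columns of Table 4:

```python
# per-data-set reductions from Table 4 (one-term -> two-term), in percent
model_size_red = [58.67, 43.28, 55.99, 36.70, 42.67, 38.84, 43.21, 32.12, 64.59]
test_time_red = [26.51, 25.38, 24.34, 35.53, 49.72, 33.10, 49.85, 28.58, 66.00]

avg_size = sum(model_size_red) / len(model_size_red)
avg_time = sum(test_time_red) / len(test_time_red)
print(f"{avg_size:.1f}% {avg_time:.1f}%")  # -> 46.2% 37.7%
```

These agree with the ~46% and ~37% quoted in the text (the time figure is rounded down there).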
Table 3 Description of data sets

  Data set        #Data    #Feature
  Wine              178          13
  Ionosphere        351          35
  Breast cancer     699          10
  German           1000          20
  Splice           2991          60
  Waveform         5000          21
  Two norm         7400          20
  Banana         10,000           2
  DNA            10,372         181

Table 4 Results of the comparative study for one-term and two-term cost functions on nine data sets

  Data set       Cost fn    Accuracy % (±SD)   Acc. red. (%)   Model size #SVs (±SD)   Size red. (%)   Test time (s)   Time red. (%)
  Wine           One-term   99.62 ± 0.57       -0.78           28.53 ± 2.59            58.67           2.49            26.51
                 Two-term   98.84 ± 0.21                       11.79 ± 1.65                            1.83
  Ionosphere     One-term   91.86 ± 1.87       -0.86           117.11 ± 4.10           43.28           3.90            25.38
                 Two-term   91.07 ± 1.19                       66.42 ± 4.35                            2.91
  Breast cancer  One-term   97.07 ± 0.68       -0.42           60.45 ± 4.03            55.99           3.41            24.34
                 Two-term   96.66 ± 0.83                       26.60 ± 4.38                            2.58
  German         One-term   72.78 ± 0.52       -0.74           409.26 ± 8.43           36.70           29.92           35.53
                 Two-term   72.24 ± 0.43                       259.03 ± 8.07                           19.29
  Splice         One-term   90.04 ± 0.69       -0.99           1029.73 ± 16.05         42.67           209.19          49.72
                 Two-term   89.16 ± 0.70                       590.23 ± 18.38                          105.17
  Waveform       One-term   90.32 ± 0.49       -0.12           722.60 ± 17.85          38.84           234.61          33.10
                 Two-term   90.20 ± 0.47                       441.94 ± 19.16                          156.94
  Two norm       One-term   97.78 ± 0.19       -0.06           398.20 ± 11.51          43.21           190.50          49.85
                 Two-term   97.72 ± 0.16                       226.13 ± 12.38                          95.52
  Banana         One-term   96.28 ± 0.20       -0.23           705.12 ± 43.70          32.12           485.70          28.58
                 Two-term   96.05 ± 0.21                       478.6 ± 34.28                           346.84
  DNA            One-term   95.60 ± 1.80       -1.07           1180.11 ± 154.85        64.59           565.97          66.00
                 Two-term   94.57 ± 1.24                       417.79 ± 38.12                          192.39
Fig. 2 Two examples of the model selection problem with one-term and two-term cost functions for data sets described in Table 3: a German, b Banana (cost function surfaces versus log10(C) and log10(σ))

Fig. 3 Three visual examples of the one-term cost function (blue) and two-term cost function (green) extracted from Table 4, for the Two norm, Splice, and DNA data sets: accuracy (left bars), model size (middle bars), and testing time (right bars) (colour figure online)
Although one might expect the generalization performance to decrease considerably when the model size is reduced, the experimental results show only a slight drop for all data sets: the accuracy reduction is below 0.58% on average. Considering the importance of testing time, a slight decrease in the generalization performance of the SVM is acceptable. The parameters of the optimal model selection process obtained by APSO are shown in the "Appendix". In Fig. 2, two examples of the one-term and two-term cost function surfaces are plotted versus the two model selection parameters to illustrate the difference between the one-term and the proposed two-term cost functions. In addition, three examples of the results listed in Table 4 are visualized in Fig. 3 to show the efficiency of the proposed two-term cost function over the one-term cost function.

5 Conclusion

A new two-term cost function based on the generalized v-fold generalization performance and the sparseness property of SVM was proposed for the SVM model selection problem. In addition, a new APSO was introduced to solve the resulting non-convex, multimodal optimization problem. The feasibility of this cost function in comparison with the one-term cost function was evaluated on nine data sets. The proposed cost function shows an acceptable loss in generalization performance while providing a parsimonious model and preventing the SVM model from over-/under-fitting. The experimental results demonstrated that, in comparison with the model obtained by the one-term cost function, the parsimonious model is on average 46% smaller and consumes on average 37% less time in the SVM testing phase.

Compliance with ethical standards

Conflict of interest The authors declare that there is no conflict of interest regarding the publication of this paper.

Appendix

The optimal model selection parameters for all experimental data sets are presented in Table 5.

References
1. Vapnik VN (1998) Statistical learning theory. Wiley, New York
2. Almasi ON, Rouhani M (2016) Fast and de-noise support vector machine training method based on fuzzy clustering method for large real world datasets. Turk J Electr Eng Comput Sci 24(1):219–233
3. Peng X, Wang Y (2009) A geometric method for model selection in support vector machine. Expert Syst Appl 36:5745–5749
4. Wang S, Meng B (2011) Parameter selection algorithm for support vector machine. Environ Sci Conf Proc 11:538–544
5. Chapelle O, Vapnik VN, Bousquet O, Mukherjee S (2002) Choosing multiple parameters for support vector machines. Mach Learn 46(1):131–159
6. Jaakkola T, Haussler D (1999) Probabilistic kernel regression models. Artif Int Stat 126:1–4
7. Opper M, Winther O (2000) Gaussian processes and SVM: mean field and leave-one-out estimator. In: Smola A, Bartlett P, Schölkopf B, Schuurmans D (eds) Advances in large margin classifiers. MIT Press, Cambridge, MA
8. Vapnik V, Chapelle O (2000) Bounds on error expectation for support vector machines. Neural Comput 12(9):2013–2016
9. Keerthi SS (2002) Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Trans Neural Netw 13(5):1225–1229
10. Sun J, Zheng C, Li X, Zhou Y (2010) Analysis of the distance between two classes for tuning SVM hyperparameters. IEEE Trans Neural Netw 21(2):305–318

Table 5 Optimal model selection parameters

  Data set        Cost function   C        σ
  Wine            One-term        49.60    2.58
                  Two-term        855.06   13.08
  Ionosphere      One-term        31.08    2.86
                  Two-term        354.26   4.90
  Breast cancer   One-term        19.56    34.08
                  Two-term        997.48   26.32
  German          One-term        7.91     2.01
                  Two-term        24.34    5.73
  Splice          One-term        3.36     4.80
                  Two-term        636.01   25.82
  Waveform        One-term        1.01     2.73
                  Two-term        9.10     7.93
  Two norm        One-term        1.03     6.87
                  Two-term        992.21   53.36
  Banana          One-term        9.28     0.27
                  Two-term        25.50    0.30
  DNA             One-term        348.60   8.56
                  Two-term        870.91   52.83
11. Guo XC, Yang JH, Wu CG, Wang CY, Liang YC (2008) A novel LS-SVMs hyper-parameter selection based on particle swarm optimization. Neurocomputing 71:3211–3215
12. Glasmachers T, Igel C (2005) Gradient-based adaptation of general Gaussian kernels. Neural Comput 17(10):2099–2105
13. Lin KM, Lin CJ (2003) A study on reduced support vector machines. IEEE Trans Neural Netw 14(6):1449–1459
14. Wang S, Meng B (2010) PSO algorithm for support vector machine. In: Electronic commerce and security conference, pp 377–380
15. Lei P, Lou Y (2010) Parameter selection of support vector machine using an improved PSO algorithm. In: Intelligent human–machine systems and cybernetics conference, pp 196–199
16. Lin SW, Ying KC, Chen SC, Lee ZJ (2008) Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst Appl 35(4):1817–1824
17. Zhang W, Niu P (2011) LS-SVM based on chaotic particle swarm optimization with simulated annealing and application. In: Intelligent control and information processing, 2nd international conference, vol 2, pp 931–935
18. Blondin J, Saad A (2010) Metaheuristic techniques for support vector machine model selection. In: Hybrid intelligent systems, 10th international conference, pp 197–200
19. Almasi ON, Akhtarshenas E, Rouhani M (2014) An efficient model selection for SVM in real-world datasets using BGA and RGA. Neural Netw World 24(5):501
20. Lihu A, Holban S (2012) Real-valued genetic algorithms with disagreements. Stud Comput Intell 4(4):317–325
21. Cervantes J, Garcia-Lamont F, Rodriguez L, Lopez A, Castilla JR, Trueba A (2017) PSO-based method for SVM classification on skewed data sets. Neurocomputing 228:187–197
22. Williams P, Li S, Feng J, Wu S (2007) A geometrical method to improve performance of the support vector machine. IEEE Trans Neural Netw 18(3):942–947
23.
An S, Liu W, Venkatesh S (2007) Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression. Pattern Recognit 40(8):2154–2162
24. Huang CM, Lee YJ, Lin DK, Huang SY (2007) Model selection for support vector machines via uniform design. Comput Stat Data Anal 52(1):335–346
25. Almasi ON, Rouhani M (2016) A new fuzzy membership assignment and model selection approach based on dynamic class centers for fuzzy SVM family using the firefly algorithm. Turk J Electr Eng Comput Sci 24(3):1797–1814
26. Almasi BN, Almasi ON, Kavousi M, Sharifinia A (2013) Computer-aided diagnosis of diabetes using least square support vector machine. J Adv Comput Sci Technol 2(2):68–76
27. Craven P, Wahba G (1978) Smoothing noisy data with spline functions. Numer Math 31(4):377–403
28. Efron B (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81(394):461–470
29. Li KC (1987) Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. Ann Stat 15(3):958–975
30. Cao Y, Golubev Y (2006) On oracle inequalities related to smoothing splines. Math Methods Stat 15(4):398–414
31. Kennedy J, Eberhart RC (2001) Swarm intelligence. Academic Press, USA
32. Beyer HG, Schwefel HP (2002) Evolution strategies: a comprehensive introduction. Nat Comput 1:3–52
33. Yuan X, Wang L, Yuan Y (2008) Application of enhanced PSO approach to optimal scheduling of hydro system. Energy Convers Manag 49:2966–2972
34. Taherkhani M, Safabakhsh R (2016) A novel stability-based adaptive inertia weight for particle swarm optimization. Appl Soft Comput 31:281–295
35. Chauhan P, Deep K, Pant M (2013) Novel inertia weight strategies for particle swarm optimization. Memet Comput 5:229–251
36. Yang X, Yuan J, Yuan J, Mao H (2007) A modified particle swarm optimizer with dynamic adaptation. Appl Math Comput 189:1205–1213
37. Schwefel HP (1993) Evolution and optimum seeking: the sixth generation. John Wiley & Sons, Inc
38.
Almasi ON, Naghedi AA, Tadayoni E, Zare A (2014) Optimal design of T-S fuzzy controller for a nonlinear system using a new adaptive particle swarm optimization algorithm. J Adv Comput Sci Technol 3(1):37–47
39. Wang Y, Li B, Weise T, Wang J, Yuan B, Tian Q (2011) Self-adaptive learning based particle swarm optimization. Inf Sci 181:4515–4538
40. Keerthi SS, Lin CJ (2003) Asymptotic behavior of support vector machines with Gaussian kernel. Neural Comput 15(7):1667–1689
41. Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6:1579–1619