Improved feature-selection method considering the imbalance problem in text categorization.

1. Introduction

Text categorization [1], which assigns predefined categories to unlabeled text documents [2], has become a very efficient way to manage the vast volumes of digital documents available on the Internet. In recent years, many sophisticated machine learning algorithms, such as the support vector machine (SVM) [3], naive Bayes (NB) [4], and K-nearest neighbor (KNN) [2, 5], have been extensively applied to text categorization.

High dimensionality is a major characteristic of text categorization: the number of features can easily reach tens of thousands even for a moderate-sized dataset [6, 7]. Most features are irrelevant and lead to poor classifier performance [8]. Therefore, dimensionality reduction, which attempts to reduce the size of the feature space without sacrificing the performance of the text categorization, has become a critical step [1, 9]. Feature selection [10], which selects a subset of the original feature space according to an evaluation criterion, is the most commonly used dimensionality-reduction method in text categorization [11]. Feature-selection methods can be divided into three classes [12]: the embedded approach, in which feature selection is embedded in the induction algorithm; the wrapper approach, in which an evaluation function selects the feature subset as a wrapper around the classifier algorithm [11, 13, 14]; and the filtering approach, in which the evaluation function used to select the feature subset is independent of the classifier algorithm [14]. In this paper, we focus on the filtering approach. Many efficient and effective filtering feature-selection methods have been applied to text categorization, such as Information Gain (IG) [7], Chi-square statistics (CHI) [7, 15], Mutual Information (MI) [16], Document Frequency (DF) [7], improved Gini index (GINI) [17], DIA association factor (DIA) [1, 6], Comprehensive Measurement Feature Selection (CMFS) [11], Orthogonal Centroid Feature Selection (OCFS) [18], and Deviation from Poisson Feature Selection (DFPFS) [15].

So far, almost all feature-selection algorithms evaluate the significance of a term as if the dataset were balanced, without considering the influence of the imbalance factor. In fact, most real-world data are imbalanced. There are two reasons why imbalanced data exist. One is the intrinsic nature of the events themselves: rare events yield fewer samples. The other is the expense of collecting samples and legal or privacy constraints [19]. The imbalance factors in a dataset degrade the performance of learning algorithms [20]. In recent years, the imbalanced learning problem has received broad attention from numerous experts and scholars [21-23]. In this paper, an improved scheme for existing feature-selection methods is proposed, which weakens the influence of the imbalance factors occurring in the dataset. In our experiments, we applied the improved scheme to NB and SVM using three benchmark corpora. We show the effectiveness of our approach by demonstrating that it significantly outperforms nine existing feature-selection algorithms.

The rest of this paper is organized as follows. Section 2 presents the nine existing feature-selection algorithms used in the paper. Section 3 describes the basic idea and implementation of the improved scheme for these nine methods. The experimental setup is given in Section 4 and the experimental results are presented in Section 5. Section 6 presents the statistical analysis and discussion. Our conclusions and directions for future work are provided in the last section.

2. Related Feature-Selection Algorithms

2.1. Information Gain (IG). Information Gain [24] is a criterion commonly used in machine learning [7]. The Information Gain of feature $t_k$ with respect to class $c_i$ is the reduction in uncertainty about the value of $c_i$ when the value of $t_k$ is known. It can be calculated as follows:

$$\mathrm{IG}(t_k, c_i) = \sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \log \frac{P(t, c)}{P(t)\,P(c)}, \qquad (1)$$

where $P(c)$ is the fraction of documents in category $c$ over the total number of documents, $P(t, c)$ is the fraction of documents in category $c$ that contain the word $t$ over the total number of documents, and $P(t)$ is the fraction of documents containing the term $t$ over the total number of documents [25].
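To make the computation concrete, here is a minimal NumPy sketch of (1). It assumes a binary document-term incidence matrix `X` and a boolean vector `in_class` marking the documents of category $c_i$; both names are illustrative, not from the paper, and a small epsilon guards against $\log 0$.

```python
import numpy as np

def information_gain(X, in_class, eps=1e-12):
    """Information Gain of every term for one category, following eq. (1).

    X        : (n_docs, n_terms) binary matrix, X[d, k] = 1 if term k occurs in doc d
    in_class : (n_docs,) boolean array, True if the document belongs to the category
    """
    n = X.shape[0]
    p_t = X.mean(axis=0)                       # P(t)
    p_c = in_class.mean()                      # P(c)
    p_tc = X[in_class].sum(axis=0) / n         # P(t, c)
    p_tnc = X[~in_class].sum(axis=0) / n       # P(t, not c)
    p_ntc = p_c - p_tc                         # P(not t, c)
    p_ntnc = (1.0 - p_c) - p_tnc               # P(not t, not c)
    return (p_tc   * np.log((p_tc   + eps) / (p_t * p_c + eps))
          + p_tnc  * np.log((p_tnc  + eps) / (p_t * (1.0 - p_c) + eps))
          + p_ntc  * np.log((p_ntc  + eps) / ((1.0 - p_t) * p_c + eps))
          + p_ntnc * np.log((p_ntnc + eps) / ((1.0 - p_t) * (1.0 - p_c) + eps)))
```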

2.2. Chi-Square (CHI). Chi-square testing [7] is used in mathematical statistics to evaluate the independence of two variables. Here, the independence of feature $t_k$ and category $c_i$ is measured by Chi-square. The greater the value of $\mathrm{CHI}(t_k, c_i)$, the more category information the feature $t_k$ contains. The Chi-square formula is defined as follows:

$$\mathrm{CHI}(t_k, c_i) = \frac{N\,(a_{ki} d_{ki} - b_{ki} c_{ki})^2}{(a_{ki} + c_{ki})(b_{ki} + d_{ki})(a_{ki} + b_{ki})(c_{ki} + d_{ki})}, \qquad (2)$$

where $N$ is the number of documents in the training set; $a_{ki}$ is the number of times feature $t_k$ occurs in category $c_i$; $b_{ki}$ is the number of times feature $t_k$ occurs in all categories except $c_i$; $c_{ki}$ is the number of documents of category $c_i$ that do not contain feature $t_k$; and $d_{ki}$ is the number of times neither $c_i$ nor $t_k$ occurs.
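A small sketch of (2) follows, assuming the four quantities are the document counts defined above; the function name and signature are ours.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a (term, category) pair from the counts of eq. (2)."""
    n = a + b + c + d                               # total number of documents
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```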

2.3. Mutual Information (MI). Mutual Information is a concept from information theory that measures the dependence between random variables and can be applied to measure the information content carried by a feature [26]. In feature selection, Mutual Information measures the dependence between feature $t_k$ and category $c_i$: the higher the Mutual Information of $t_k$ with category $c_i$, the more information about $c_i$ the feature $t_k$ contains:

$$\mathrm{MI}(t_k, c_i) = \log \frac{P(t_k, c_i)}{P(t_k)\,P(c_i)}, \qquad (3)$$

where $P(t_k, c_i)$ is the probability that feature $t_k$ occurs in category $c_i$.
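A minimal sketch of (3), assuming the probabilities are estimated from document counts; the argument names are illustrative.

```python
import math

def mutual_information(df_tc, df_t, n_c, n):
    """MI(t_k, c_i) = log(P(t_k, c_i) / (P(t_k) P(c_i))), eq. (3).

    df_tc : documents of category c_i containing t_k
    df_t  : documents containing t_k in the whole training set
    n_c   : documents in category c_i
    n     : total number of documents
    """
    if df_tc == 0:
        return float("-inf")                  # t_k never occurs in c_i
    p_tc, p_t, p_c = df_tc / n, df_t / n, n_c / n
    return math.log(p_tc / (p_t * p_c))
```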

2.4. Document Frequency (DF). Document Frequency counts the number of documents in which a feature occurs. The basic idea is that rare terms are not useful for category prediction and may degrade the global performance [7]. The larger the number of documents containing feature $t_k$ in category $c_i$, the more predictive information for category $c_i$ the feature $t_k$ carries [1]. The Document Frequency of a term is calculated as follows:

$$\mathrm{DF}(t_k, c_i) = P(t_k, c_i). \qquad (4)$$

2.5. Improved Gini Index (GINI). The Gini index was originally developed to find the best split in decision tree induction [15]. In order to use it in text categorization with a multiclass setting, the original Gini index was improved by Shang et al. [27]. The improved Gini index measures the purity of feature $t_k$ toward a category $c_i$; the larger the purity, the better the feature. The formula of the improved Gini index is defined as follows:

$$\mathrm{Gini}(t_k) = \sum_{i} P(t_k \mid c_i)^2\, P(c_i \mid t_k)^2, \qquad (5)$$

where $P(t_k \mid c_i)$ is the probability that feature $t_k$ occurs in category $c_i$ and $P(c_i \mid t_k)$ is the conditional probability that a document belongs to category $c_i$ given that feature $t_k$ occurs in it.
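The sketch below assumes the probabilities in (5) are estimated from a term-to-category frequency matrix such as Table 1 (one possible estimate; the paper does not fix the estimator), with rows as terms and columns as categories.

```python
import numpy as np

def gini_index(tf):
    """Improved Gini index of every term, eq. (5).

    tf : (n_terms, n_cats) term-to-category frequency matrix; every term is
         assumed to occur at least once so the row sums are nonzero.
    """
    p_t_given_c = tf / tf.sum(axis=0, keepdims=True)   # P(t_k | c_i)
    p_c_given_t = tf / tf.sum(axis=1, keepdims=True)   # P(c_i | t_k)
    return (p_t_given_c ** 2 * p_c_given_t ** 2).sum(axis=1)
```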

2.6. DIA Association Factor (DIA). The DIA association factor [1, 28] evaluates the conditional probability that a document is assigned to category $c_i$ when it contains the term $t_k$, and thus determines the significance of term $t_k$ for category $c_i$. The larger the DIA factor of $t_k$ with respect to $c_i$, the more significant $t_k$ is for $c_i$. The DIA association factor is defined by

$$\mathrm{DIA}(t_k, c_i) = P(c_i \mid t_k), \qquad (6)$$

where $P(c_i \mid t_k)$ is the conditional probability that a document belongs to category $c_i$ given that feature $t_k$ occurs in it.

2.7. Comprehensive Measurement Feature Selection (CMFS). CMFS [11] is a feature-selection algorithm proposed in our previous work, in which the significance of a term is comprehensively measured both inter-category and intra-category. The experimental results showed that CMFS can significantly improve the performance of the classifier:

$$\mathrm{CMFS}(t_k, c_i) = P(t_k \mid c_i)\, P(c_i \mid t_k). \qquad (7)$$
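Under the same probability estimates as the GINI sketch above, (7) becomes a one-liner that returns the full matrix of local scores; the names are ours.

```python
import numpy as np

def cmfs(tf):
    """CMFS(t_k, c_i) = P(t_k | c_i) * P(c_i | t_k), eq. (7), for all pairs."""
    p_t_given_c = tf / tf.sum(axis=0, keepdims=True)   # P(t_k | c_i)
    p_c_given_t = tf / tf.sum(axis=1, keepdims=True)   # P(c_i | t_k)
    return p_t_given_c * p_c_given_t                   # (n_terms, n_cats) local scores
```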

2.8. Orthogonal Centroid Feature Selection (OCFS). Orthogonal Centroid Feature Selection selects features optimally according to the objective function implied by the Orthogonal Centroid algorithm [17, 18]. The centroid of each category and the centroid of the entire training set are used to compute the score of a term. The score of a term $t_k$ is calculated as follows:

$$\mathrm{OCFS}(t_k) = \sum_{i=1}^{|C|} \frac{n_i}{n} \left(m_i^k - m^k\right)^2, \qquad (8)$$

where $n_i$ is the number of documents in category $c_i$, $n$ is the number of documents in the training set, $m_i^k$ is the $k$th element of the centroid vector $m_i$ of class $c_i$, $m^k$ is the $k$th element of the centroid vector $m$ of the entire training set, and $|C|$ is the number of categories in the corpus.
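A sketch of (8) assuming the training documents are given as a dense matrix of term weights and a label vector; the names are illustrative.

```python
import numpy as np

def ocfs(X, labels):
    """OCFS score of every term, eq. (8).

    X      : (n_docs, n_terms) document vectors (e.g., term frequencies)
    labels : (n_docs,) category index of each document
    """
    n = X.shape[0]
    m_all = X.mean(axis=0)                       # centroid m of the whole training set
    score = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        m_c = Xc.mean(axis=0)                    # centroid m_i of category c
        score += (Xc.shape[0] / n) * (m_c - m_all) ** 2
    return score
```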

2.9. Deviations from Poisson Feature Selection (DFPFS). The Poisson distribution has been used successfully to select effective query words in information retrieval. DFPFS is derived from the Poisson distribution and measures the degree to which a feature deviates from it [15]. The farther a feature departs from the Poisson distribution, the more effective it is; conversely, a feature that is well predicted by the Poisson distribution is a poor one:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (9)

where $F_i$ is the total frequency of term $t_i$ in all messages, and $n(C_j)$ and $n(\bar{C}_j)$ are the numbers of messages that belong to $C_j$ and do not belong to $C_j$, respectively.

3. Algorithms

3.1. Motivation. Prior to feature selection for text categorization, a term-to-category matrix [11], in which rows are features and columns are category vectors, must be generated. This matrix is the foundation of most feature-selection algorithms. These algorithms consider only the term frequency of a feature in a given category and do not take the influence of the imbalance problem into consideration. Table 1 shows 5 features of the term-to-category matrix for the top 10 categories of the Reuters-21578 corpus; the number in parentheses is the number of documents in the corresponding category. It can be seen from Table 1 that categories C1 and C4 have significantly more training documents than the other categories, and hence the term frequency of many features in these two categories is significantly higher than in the others; for example, the total term frequency of the five features in categories C1 and C4 is 3853 and 5700, respectively. However, the raw term frequency of a feature in a majority category does not reveal the true importance of the feature for that category, and likewise the raw count of a feature in a minority category does not reflect its true importance there. Based on this observation, a scheme that eliminates the influence of the imbalance problem on feature-selection algorithms is proposed in this paper.

3.2. The Improved Scheme. Feature selection consists of three steps. The first step is to calculate the significance of a particular feature $t_k$ for a given category $c_i$, $\mathrm{FS}(t_k, c_i)$, which is the local significance of the feature. The second step is to combine the category-specific scores of each feature into one score, $\mathrm{FS}(t_k)$, the global significance of the feature [7]. The last step is to rank all features in the training set by their global significance and select the top $k$ features as the new feature subset. To eliminate the negative influence of the imbalance problem, the local significance of feature $t_k$ can be calculated using

$$\mathrm{FSX}(t_k, c_i) = \frac{\mathrm{FS}(t_k, c_i)}{P(c_i)}, \qquad (10)$$

where $P(c_i)$ is the probability of category $c_i$ in the entire training set. Two alternative ways can be used to calculate $P(c_i)$: one uses the number of documents in category $c_i$, the other uses the total number of feature occurrences in category $c_i$. In this paper, (12) is used:

$$P(c_i) = \frac{n_i}{n}, \qquad (11)$$

$$P(c_i) = \frac{tf_i}{\sum_{j=1}^{|C|} tf_j}, \qquad (12)$$

where $n$ is the total number of documents in the training set; $n_i$ is the number of documents in category $c_i$; $tf_i$ is the total number of feature occurrences in category $c_i$; and $|C|$ is the number of categories.
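The two estimates (11) and (12) can be computed as follows; the helper name and arguments are ours, and `tf` is the same term-to-category frequency matrix used in the sketches above.

```python
import numpy as np

def category_priors(doc_counts=None, tf=None):
    """P(c_i) from document counts, eq. (11), or from term frequencies, eq. (12)."""
    if tf is not None:
        cat_tf = tf.sum(axis=0)              # total term frequency of each category
        return cat_tf / cat_tf.sum()         # eq. (12)
    return doc_counts / doc_counts.sum()     # eq. (11)
```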

There are two alternative ways to calculate the global significance of a feature from its local significance. One takes the average value of the feature over all categories as the global value, as shown in (13); the other takes the maximum value over all categories as the global score, as shown in (14). In order to weaken the influence of the imbalance problem, we substitute (15) and (16) for (13) and (14) in this paper:

$$\mathrm{FS}_{avg}(t_k) = \sum_{i=1}^{|C|} P(c_i)\, \mathrm{FS}(t_k, c_i), \qquad (13)$$

$$\mathrm{FS}_{max}(t_k) = \max_{1 \le i \le |C|} \mathrm{FS}(t_k, c_i), \qquad (14)$$

$$\mathrm{FSX}_{avg}(t_k) = \sum_{i=1}^{|C|} \frac{\mathrm{FS}(t_k, c_i)}{P(c_i)}, \qquad (15)$$

$$\mathrm{FSX}_{max}(t_k) = \max_{1 \le i \le |C|} \frac{\mathrm{FS}(t_k, c_i)}{P(c_i)}. \qquad (16)$$

Based on the idea proposed in this paper, the feature-selection algorithms listed in Section 2 can be improved. Table 2 shows the improved formulas of the nine existing feature-selection algorithms. Since a category-specific score of GINI is not provided in the literature on the GINI algorithm, the local version of the improved GINI is not listed in Table 2. The category-specific score of OCFS is not described in the literature either; however, it can be deduced from the formula of OCFS that $(m_i^k - m^k)^2$ is the local significance of feature $t_k$.
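To make the whole scheme concrete, the sketch below applies (10) to a matrix of local scores produced by any base method and then combines the adjusted scores into a global score; the function names, the top-$k$ selection step, and the exact sum/max combination follow our reading of (13)-(16), not the authors' code.

```python
import numpy as np

def improved_feature_selection(local_scores, priors, k, combine="max"):
    """Select the top-k features with the improvement scheme of Section 3.2.

    local_scores : (n_terms, n_cats) matrix of FS(t_k, c_i) from any base method
    priors       : (n_cats,) category probabilities P(c_i), eq. (11) or (12)
    k            : number of features to keep
    """
    adjusted = local_scores / priors                 # eq. (10): FS(t_k, c_i) / P(c_i)
    if combine == "max":
        global_scores = adjusted.max(axis=1)         # maximum combination, cf. (16)
    else:
        global_scores = adjusted.sum(axis=1)         # sum/average combination, cf. (15)
    return np.argsort(global_scores)[::-1][:k]       # indices of the k best terms
```

For instance, combining this with the earlier sketches, `improved_feature_selection(cmfs(tf), category_priors(tf=tf), k=2000)` would produce a CMFSX-style subset of 2000 features.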

4. Experiment Setup

4.1. Classifiers. In this paper, both NB and SVM are used to compare each of the nine existing feature-selection methods before and after it is improved.

NB [4] is an excellent algorithm for text categorization. It is based on the assumption that a term occurring in a document is independent of the other terms. There are two commonly used models for the Bayesian classifier: the multivariate Bernoulli model and the multinomial model; the latter is used in this paper.

SVM, which was applied to spam categorization by Drucker et al. [3] and to text categorization by Joachims [29], is a highly efficient classifier for text categorization. In our study, the LIBSVM toolkit [30] is used, with its options left at their default values.

4.2. Datasets. Three benchmark datasets (Reuters-21578, WebKB, and 20-Newsgroups) were used to evaluate the performance of the proposed method in our experiments. In the preprocessing step, all words were converted to lower case, punctuation marks were removed, a stop list was applied, and no stemming was used. The Document Frequency of a term was used in the text representation, and 10-fold cross-validation was adopted in this paper.

The 20-Newsgroups dataset is one of the standard corpora for text categorization. It contains 19,997 newsgroup postings, assigned evenly to 20 different UseNet groups.

The Reuters-21578 dataset contains 21,578 stories from the Reuters newswire, nonuniformly divided into 135 categories. In this paper, the top 10 categories are used.

WebKB, a collection of web pages from four different college web sites, contains 8,282 web pages, nonuniformly assigned to 7 categories. In this paper, four categories ("course," "faculty," "project," and "student") are used.

4.3. Performance Measures. Text categorization effectiveness is usually measured by F1, accuracy, and AUC [1, 31]. The F1 measure is a combined effectiveness measure determined by precision and recall. Precision is the conditional probability that a decision to classify a random document under a specific category is correct. Recall is the conditional probability that a random document which ought to be classified under a specific category is indeed assigned to it. Precision and recall for category $c_i$ are defined as

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \qquad (17)$$

where $TP_i$ is the number of documents correctly classified into category $c_i$; $FP_i$ is the number of documents misclassified into category $c_i$; and $FN_i$ is the number of documents that belong to category $c_i$ but are misclassified into other categories. To average performance across categories, microaveraging was used in our experiments. The microprecision and microrecall are obtained as

$$P_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad R_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}, \qquad (18)$$

where $|C|$ is the number of categories. The micro-F1 and accuracy are defined as follows:

$$F1_{micro} = \frac{2 P_{micro} R_{micro}}{P_{micro} + R_{micro}}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \qquad (19)$$
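A minimal sketch of (17)-(19) computed from per-category counts; the array names are ours.

```python
import numpy as np

def micro_f1(tp, fp, fn):
    """Micro-averaged F1 from per-category TP, FP, FN counts, eqs. (17)-(19)."""
    p_micro = tp.sum() / (tp.sum() + fp.sum())       # eq. (18), microprecision
    r_micro = tp.sum() / (tp.sum() + fn.sum())       # eq. (18), microrecall
    return 2 * p_micro * r_micro / (p_micro + r_micro)
```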

The receiver operating characteristic (ROC) curve provides a powerful way to visualize classifier performance [22]. The area under the ROC curve (AUC) has become a widely used measure of the performance of supervised classification rules. However, the simple form of AUC is only applicable to the two-class case [32]. To calculate the multiclass AUC, the method proposed by Provost and Domingos [33] is used in our experiments: first, the ROC curve of each class versus all other classes [34] is generated and its AUC is measured; second, the expected AUC is computed as the weighted average of all the per-class AUCs.
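A sketch of this procedure, assuming scikit-learn's binary `roc_auc_score` and class-prevalence weights (the paper does not state the weights explicitly); the score matrix `y_score` is a hypothetical classifier output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_ovr_auc(y_true, y_score, classes):
    """Multiclass AUC as the prevalence-weighted mean of one-vs-rest AUCs.

    y_true  : (n_docs,) true class labels
    y_score : (n_docs, n_cats) classifier scores, column j for classes[j]
    """
    aucs, weights = [], []
    for j, c in enumerate(classes):
        binary = (y_true == c).astype(int)           # class c versus all other classes
        aucs.append(roc_auc_score(binary, y_score[:, j]))
        weights.append(binary.mean())                # weight = prevalence of class c
    return float(np.average(aucs, weights=weights))
```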

5. Results

5.1. The Experimental Results on 20-Newsgroups. Tables 3 and 4 show the performance comparison of the nine improved and existing feature-selection algorithms on 20-Newsgroups in terms of micro-F1 and AUC, respectively. It can be seen from Tables 3 and 4 that the improved versions of CHI, DIA, MI, DF, GINI, CMFS, and OCFS perform significantly better than the original versions. Although the micro-F1 and AUC of NB based on the improved version of IG are inferior to those of the existing IG, the performance of SVM based on the improved IG is superior to that of IG. Moreover, the performance of the improved version of Deviation from Poisson Feature Selection is inferior to that of the original version.

Figures 1 and 2 show the accuracy curves of NB and SVM on 20-Newsgroups based on the nine pairs of feature-selection methods, respectively. The x-axis in Figures 1 and 2 is the number of features selected by the different feature-selection algorithms. Figure 1 indicates that the accuracy curves of NB based on CHIX, MIX, DFX, GINIX, DIAX, CMFSX, and OCFSX are significantly higher than those of CHI, MI, DF, GINI, DIA, CMFS, and OCFS. The performance gain of DIAX is the largest, with a peak growth rate of 165 percent. The accuracy curves of NB based on IGX and IG coincide almost completely. However, the curve of NB based on DFPFSX is lower than that of DFPFS. It can be seen from Figure 2 that the curves of SVM based on the improved versions are higher than those of the existing versions, except for DFPFS.

5.2. The Experimental Results on Reuters-21578. Table 5 shows the comparison of the nine improved and existing feature-selection methods in terms of micro-F1 on Reuters-21578. It can be seen from Table 5 that the micro-F1 of NB based on CHIX, DFX, DIAX, OCFSX, and DFPFSX is superior to that of CHI, DF, DIA, OCFS, and DFPFS. The micro-F1 of NB based on IGX is superior to that of IG when the number of selected features is 800 or 1200. The micro-F1 of NB based on MIX is superior to that of MI when the number of selected features is 400, 1600, or 2000. The micro-F1 of NB based on GINIX is superior to that of GINI when the number of selected features is 400, 800, 1600, or 2000. The micro-F1 of NB based on CMFSX is superior to that of CMFS when the number of selected features is 400, 800, 1200, or 1600. The micro-F1 of SVM based on IGX, CHIX, DFX, GINIX, DIAX, CMFSX, OCFSX, and DFPFSX is significantly superior to that of IG, CHI, DF, GINI, DIA, CMFS, OCFS, and DFPFS. The micro-F1 of SVM based on MIX is superior to that of MI when the number of selected features is 1200, 1600, or 2000.

Table 6 indicates that the AUCs of SVM on Reuters-21578 based on the improved feature-selection methods are almost always superior to those of the nine existing methods. Although some AUCs of NB on Reuters-21578 based on the improved methods are inferior to those of the existing feature-selection algorithms, there is no significant difference between them.

Based on the nine pairs of feature-selection methods, the accuracy curves of NB and SVM on Reuters-21578 are shown in Figures 3 and 4, respectively. It can be seen from Figure 3 that the accuracy curve of NB based on IGX almost coincides with that of IG. The accuracy curve of NB based on CHIX is higher than that of CHI except when the number of features is 200, 400, 1200, or 1400. The accuracy curve of NB based on MIX is higher than that of MI when the number of features is greater than 1000. The accuracy curves of NB based on DFX, DIAX, and DFPFSX are higher than those of DF, DIA, and DFPFS, but the performance gain of DFX is quite small. The accuracy of NB based on GINIX is superior to that of GINI except when the number of features is 1400, 1600, or 1800. The accuracy of NB based on CMFSX is superior to that of CMFS except when the number of features is 200, 400, or 800. The accuracy curve of NB based on OCFSX is higher than that of OCFS when the number of features is greater than 400. Figure 4 indicates that the accuracy curves of SVM based on IGX, DFX, DIAX, CMFSX, OCFSX, and DFPFSX are significantly higher than those of IG, DF, DIA, CMFS, OCFS, and DFPFS, respectively. When the number of features is greater than 200, the accuracy curve of SVM based on CHIX is higher than that of CHI. The performance of SVM based on MIX is superior to that of MI when the number of features is greater than 1000. The accuracy curve of SVM based on DIAX is higher than that of DIA when the number of selected features is greater than 400.

5.3. The Experimental Results on WebKB. Table 7 presents the comparison of the nine improved and existing feature-selection methods with respect to the micro-F1 measure on WebKB. It can be seen from Table 7 that the micro-F1 of NB based on CHIX, DFX, GINIX, DIAX, CMFSX, OCFSX, and DFPFSX is significantly superior to that of CHI, DF, GINI, DIA, CMFS, OCFS, and DFPFS, respectively; the micro-F1 of NB based on IGX is superior to that of IG when the number of selected features is 400 or 2000; and the micro-F1 of NB based on MIX is superior to that of MI when the number of selected features is greater than 200. The micro-F1 of SVM based on IGX, CHIX, DFX, GINIX, CMFSX, OCFSX, and DFPFSX is significantly superior to that of IG, CHI, DF, GINI, CMFS, OCFS, and DFPFS.

Table 8 lists the AUCs of NB and SVM on WebKB based on the nine improved and existing feature-selection algorithms. The AUCs of SVM based on the improved feature-selection methods are superior to those of the existing methods, except for DIAX and MIX. The AUC of NB based on IGX is higher than that of IG when the number of selected features is 400 or 2000. The AUC of NB based on DIAX is superior to that of DIA when the number of features is 1200, 1600, or 2000.

Figures 5 and 6 show the accuracy curves of NB and SVM on WebKB based on the nine pairs of feature-selection methods, respectively. The accuracy curve of NB based on IGX is very close to that of IG. The accuracy curves of NB based on CHIX, MIX, DFX, GINIX, CMFSX, OCFSX, and DFPFSX are significantly higher than those of CHI, MI, DF, GINI, CMFS, OCFS, and DFPFS. When the number of features is greater than 800, the accuracy of NB based on DIAX is superior to that of DIA. The accuracy curves of SVM based on IGX, CHIX, DFX, GINIX, CMFSX, OCFSX, and DFPFSX are significantly higher than those of IG, CHI, DF, GINI, CMFS, OCFS, and DFPFS. However, the accuracy curves of SVM based on MIX and DIAX are lower than those of MI and DIA, respectively.

6. Discussion

Because the number of documents in every category is equal, 20-Newsgroups is a balanced dataset in terms of the number of documents per category. However, the lengths of the documents differ, and so does the number of terms contained in each document. Figure 7 shows the total term frequency of each category in the 20-Newsgroups dataset. It can be seen from Figure 7 that the total term frequency of the category "talk.politics.mideast" is the largest, while that of the category "misc.forsale" is the smallest. Hence, as shown in Tables 3 and 4 and Figures 1 and 2, the performance of the improved feature-selection algorithms, which alleviate the effect of the imbalance factor, is significantly superior to that of the existing feature-selection methods.

The expected cross-entropy (ECE) is a feature-selection algorithm used by Zhang and Qiu [35]; its formula is given in (20). It can be concluded from the experiments that the performance of ECE is superior to that of most feature-selection algorithms. Table 9 lists the accuracy comparison for NB between ECE and the nine existing feature-selection algorithms on 20-Newsgroups when the number of selected features is 400, 800, 1200, 1600, or 2000. It can be seen from Table 9 that the performance of ECE is superior to that of CHI, DF, IG, MI, OCFS, DIA, and DFPFS and inferior to that of GINI and CMFS. By analyzing the formula of ECE, we found that the imbalance factor $P(c_i)$ is already taken into account by ECE, which is why ECE is more effective than the other methods:

$$\mathrm{ECE}(t_k) = P(t_k) \sum_{i=1}^{|C|} P(c_i \mid t_k) \log \frac{P(c_i \mid t_k)}{P(c_i)}. \qquad (20)$$
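A sketch of (20) that estimates all probabilities from the term-to-category frequency matrix used earlier (only one possible estimate); the names are ours.

```python
import numpy as np

def expected_cross_entropy(tf):
    """ECE of every term, eq. (20), from a (n_terms, n_cats) frequency matrix."""
    p_t = tf.sum(axis=1) / tf.sum()                      # P(t_k)
    p_c = tf.sum(axis=0) / tf.sum()                      # P(c_i)
    p_c_given_t = tf / tf.sum(axis=1, keepdims=True)     # P(c_i | t_k)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_c_given_t > 0,
                         p_c_given_t * np.log(p_c_given_t / p_c), 0.0)
    return p_t * terms.sum(axis=1)
```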

The time complexity of the improved feature-selection algorithms is higher than that of the original versions, because the cost of calculating the prior probability $P(c_i)$ must be taken into account. This cost depends on how $P(c_i)$ is calculated. Assume that the size of the vocabulary is $|V|$ and the number of categories is $|C|$. If $P(c_i)$ is estimated from the number of documents in each category, the cost of computing $P(c_i)$ is $O(|C|)$. If $P(c_i)$ is estimated from the total term frequency of all features in each category, the cost is $O(|C| \cdot |V|)$.

To learn more about our experiments, readers can visit the web site (http://pan.baidu.com/s/1y8z7 K).

7. Conclusion

Feature-selection algorithms are designed to measure the significance of a feature for categorization under the assumption of a balanced dataset. Although many datasets are balanced in terms of the number of documents per category, they are imbalanced in terms of the number of features per category. Thus traditional feature-selection algorithms do not achieve the best performance, due to the adverse effect of the imbalance factor in the corpus. In this paper, we proposed an improved scheme that weakens this adverse effect. In our experiments, nine well-known feature-selection algorithms were improved using the proposed scheme. The experimental results indicate that the improved scheme can effectively enhance the performance of text categorization.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

http://dx.doi.org/10.1155/2014/625342

Acknowledgment

This research is supported by the project development plan of science and technology of Jilin province under Grant no. 20140204071GX.

References

[1] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[2] S. Jiang, G. Pang, M. Wu, and L. Kuang, "An improved Knearest-neighbor algorithm for text categorization," Expert Systems with Applications, vol. 39, no. 1, pp. 1503-1509, 2012.

[3] H. Drucker, D. Wu, and V. N. Vapnik, "Support vector machines for spam categorization," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.

[4] A. McCallum and K. Nigam, "A comparison of event models for naive bayes text classification," in Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998.

[5] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, pp. 21-27, 1967.

[6] D. Fragoudis, D. Meretakis, and S. Likothanassis, "Best terms: an efficient feature-selection algorithm for text categorization," Knowledge and Information Systems, vol. 8, no. 1, pp. 16-33, 2005.

[7] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 412-420, Morgan Kaufmann, Nashville, Tenn, USA, 1997.

[8] R. H. W. Pinheiro, G. D. C. Cavalcanti, R. F. Correa, and T. I. Ren, "A global-ranking local feature selection method for text categorization," Expert Systems with Applications, vol. 39, no. 17, pp. 12851-12857, 2012.

[9] D. Hernandez-Lobato and J. M. Hernandez-Lobato, "Learning feature selection dependencies in multi-task learning," in Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS '13), Nevada, USA, 2013.

[10] M. Kolar and H. Liu, "Feature selection in high-dimensional classification," in Proceedings of the 30th International Conference on Machine Learning (ICML '13), S. Dasgupta and D. McAllester, Eds., pp. 329-337, 2013.

[11] J. Yang, Y. Liu, X. Zhu, Z. Liu, and X. Zhang, "A new feature selection based on comprehensive measurement both in intercategory and intra-category for text categorization," Information Processing and Management, vol. 48, pp. 741-754, 2012.

[12] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.

[13] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Proceedings of the 11th International Conference on Machine Learning, pp. 121-129, Morgan Kaufmann, 1994.

[14] D. Mladenic and M. Grobelnik, "Feature selection on hierarchy of web documents," Decision Support Systems, vol. 35, no. 1, pp. 45-87, 2003.

[15] H. Ogura, H. Amano, and M. Kondo, "Feature selection with a measure of deviations from Poisson in text categorization," Expert Systems with Applications, vol. 36, no. 3, pp. 6826-6832, 2009.

[16] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.

[17] S. S. R. Mengle and N. Goharian, "Measure feature-selection algorithm," Journal of the American Society for Information Science and Technology, vol. 60, no. 5, pp. 1037-1050, 2009.

[18] J. Yan, N. Liu, B. Zhang et al., "OCFS: optimal orthogonal centroid feature selection for text categorization," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 122-129, ACM, Salvador, Brazil, 2005.

[19] Y. Liu, H. T. Loh, and A. Sun, "Imbalanced text classification: a term weighting approach," Expert Systems with Applications, vol. 36, no. 1, pp. 690-701, 2009.

[20] J. Wang, J. You, Q. Li, and Y. Xu, "Extract minimum positive and maximum negative features for imbalanced binary classification," Pattern Recognition, vol. 45, no. 3, pp. 1136-1145, 2012.

[21] B. Das, N. C. Krishnan, and D. J. Cook, "wRACOG: a Gibbs sampling-based oversampling technique," in Proceedings of the 13th IEEE International Conference on Data Mining (ICDM '13), pp. 111-120, 2013.

[22] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009.

[23] G. Liang and A. G. Cohn, "An effective approach for imbalanced classification: unevenly balanced bagging," in Proceedings of the 27th AAAI Conference on Artificial Intelligence, Bellevue, Wash, USA, 2013.

[24] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.

[25] E. Youn and M. K. Jeong, "Class dependent feature scaling method using naive Bayes classifier for text datamining," Pattern Recognition Letters, vol. 30, no. 5, pp. 477-485, 2009.

[26] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537-550, 1994.

[27] W. Shang, H. Huang, H. Zhu, Y. Lin, Y. Qu, and Z. Wang, "A novel feature selection algorithm for text categorization," Expert Systems with Applications, vol. 33, no. 1, pp. 1-5, 2007.

[28] N. Fuhr, S. Hartmann, G. Lustig et al., "AIR/X--a rule-based multistage indexing system for large subject fields," in Proceedings of the 3rd International Conference "Recherche d'Information Assistée par Ordinateur" (RIAO '91), pp. 606-623, Barcelona, Spain, 1991.

[29] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the 10th European Conference on Machine Learning (ECML '98), C. Nedellec and C. Rouveirol, Eds., pp. 137-142, Springer, Chemnitz, Germany, 1998.

[30] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[31] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," IEEE Transactions on Systems, Man, and Cybernetics B: Cybernetics, vol. 39, no. 2, pp. 539-550, 2009.

[32] D. J. Hand and R. J. Till, "A simple generalisation of the area under the ROC curve for multiple class classification problems," Machine Learning, vol. 45, no. 2, pp. 171-186, 2001.

[33] F. Provost and P. Domingos, "Well-trained PETs: improving probability estimation trees," CeDER Working Paper IS-00-04, Stern School of Business, New York University, 2000.

[34] R. K. Eichelberger and V. S. Sheng, "Does one-against-all or one-against-one improve the performance of multiclass classifications?" in Proceedings of the 27th AAAI Conference on Artificial Intelligence, Bellevue, Wash, USA, 2013.

[35] W. Zhang and Y. Qiu, "The research of the feature selection method based on the ECE and quantum genetic algorithm," in Proceedings of the 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE '10), pp. V6-193-V6-196, August 2010.

Jieming Yang, Zhaoyang Qu, and Zhiying Liu

College of Information Engineering, Northeast Dianli University, Jilin, Jilin 132012, China

Correspondence should be addressed to Jieming Yang; yjmlzy@gmail.com

Received 13 February 2014; Revised 18 April 2014; Accepted 23 April 2014; Published 26 May 2014

Academic Editor: Yudong Cai

TABLE 1: The term-to-category feature appearance matrix.

Features    C1 (2369)     C2 (237)     C3 (578)    C4 (3964)

Billion        345           60          251          1828
Company        2128          6           303          1515
April          622          113          121          1578
Bank           487           5            67          527
Oil            271           17          2018         252

Total          3853         201          2760         5700

Features     C5 (582)     C6 (478)     C7 (717)     C8 (286)

Billion        110          344          461           26
Company         22           6            14           42
April          304          210          243          121
Bank            24          780          1138          19
Oil             48           28           36          210

Total          508          1368         1892         418

Features     C9 (486)    C10 (283)

Billion        992           36
Company         24           9
April          202          156
Bank           141           2
Oil             94           32

Total          1453         235

TABLE 2: The improved formulas of the nine feature-selection algorithms
listed in Section 2 (local score first, global score second).

IG:    the improved formulas (IGX) could not be reproduced from the source.
CHI:   the improved formulas (CHIX) could not be reproduced from the source.
MI:    $\mathrm{MIX}(t_k, c_i) = \frac{1}{P(c_i)} \log \frac{P(t_k, c_i)}{P(t_k) P(c_i)}$;
       $\mathrm{MIX}(t_k) = \sum_{i=1}^{|C|} \log \frac{P(t_k, c_i)}{P(t_k) P(c_i)}$
DF:    $\mathrm{DFX}(t_k, c_i) = P(t_k \mid c_i)$;
       $\mathrm{DFX}(t_k) = \sum_{i=1}^{|C|} P(t_k \mid c_i)$
GINI:  no local score (see Section 3.2);
       $\mathrm{GINIX}(t_k) = \sum_{i=1}^{|C|} P(t_k \mid c_i)^2 P(c_i \mid t_k)^2 / P(c_i)$
DIA:   $\mathrm{DIAX}(t_k, c_i) = P(c_i \mid t_k) / P(c_i)$;
       $\mathrm{DIAX}(t_k) = \sum_{i=1}^{|C|} P(c_i \mid t_k)$
CMFS:  $\mathrm{CMFSX}(t_k, c_i) = P(t_k \mid c_i) P(c_i \mid t_k) / P(c_i)$;
       $\mathrm{CMFSX}(t_k) = \sum_{i=1}^{|C|} P(t_k \mid c_i) P(c_i \mid t_k)$
OCFS:  $\mathrm{OCFSX}(t_k, c_i) = (m_i^k - m^k)^2 / P(c_i)$;
       $\mathrm{OCFSX}(t_k) = \sum_{i=1}^{|C|} (m_i^k - m^k)^2$
DFPFS: the improved formulas (DFPFSX) could not be reproduced from the source.

TABLE 3: The comparison of nine improved and existing feature-selection
methods with respect to the micro-F1 measure on 20-Newsgroups for NB and
SVM, respectively. Values marked with # indicate the better performance
within each pair of existing and improved feature-selection methods.

                             Naive Bayes

           400       800      1200      1600      2000

IG       49.55#    57.76#    62.88#    65.99#    68.34#
IGX       49.54     57.76     62.85     65.92     68.33
CHI       67.29     73.26     75.46     76.58     77.61
CHIX     69.48#    74.66#    76.39#    77.46#    78.35#
MI        23.13     33.80     51.88     55.47     60.50
MIX       27.39    39.06#    53.78#    59.97#    63.87#
DF        59.23     66.37     70.32     72.78     74.07
DFX      60.88#    67.35#    70.97#    73.62#    74.94#
GINI      73.79     76.68     77.83     78.21     78.74
GINIX    74.45#    77.17#    78.22#    78.94#    79.18#
DIA       15.45     22.97     25.70     28.51     30.96
DIAX     47.90#    59.59#    66.08#    73.45#    76.23#
CMFS      73.05     75.68     77.13     78.57     79.17
CMFSX    74.24#    76.84#    78.36#    79.56#    80.07#
OCFS      46.50     55.37     60.64     64.72     66.87
OCFSX    55.49#     62.33    66.25#    68.36#    69.89#
DFPFS    67.36#    71.41#    73.11#    74.14#    74.77#
DFPFSX    56.74     61.90     64.15     64.94     65.40

                      Support vector machines

           400       800      1200      1600      2000

IG        58.15     62.53     65.17     67.28     68.45
IGX      59.01#    63.51#    66.06#    67.91#    69.32#
CHI       73.67     76.07     76.26     76.18     75.60
CHIX     75.14#    77.09#    77.22#    77.27#    76.96#
MI        27.46     38.24     54.21     57.02     60.62
MIX      29.17#    41.18#    56.71#    61.61#    64.19#
DF        63.57     66.69     69.57     71.23     72.06
DFX      65.71#    68.17#    70.37#    71.92#    72.97#
GINI      77.13     75.58     74.50     73.99     74.26
GINIX     77.13    76.75#    75.68#    74.97#    75.16#
DIA       33.53     42.42     46.56     50.22     52.19
DIAX     59.57#    67.94#    72.12#    76.22#    77.80#
CMFS      76.91     77.50     76.80     76.20     75.75
CMFSX    77.37#    77.96#    77.55#    77.24#    77.00#
OCFS      54.87     60.30     63.59     66.36     67.61
OCFSX    63.08#    66.51#    68.60#    69.45#    70.21#
DFPFS    69.97#    71.21#    72.00#    72.50#    72.92#
DFPFSX    62.15     65.11     66.22     66.91     67.31


TABLE 4: The comparison of nine improved and existing feature-selection
methods with respect to AUC on 20-Newsgroups for NB and SVM,
respectively. Values marked with # indicate the better performance
within each pair of existing and improved feature-selection methods.

Feature                           Naive Bayes
selection
              400        800        1200       1600       2000

IG          0.7183#    0.7579#    0.7864#    0.8048#     0.8187
IGX          0.7182     0.7578     0.7863     0.8045     0.8187
CHI          0.8234     0.8545     0.8660     0.8717     0.8771
CHIX        0.8346#    0.8621#    0.8710#    0.8770#    0.8815#
MI           0.5830     0.6283     0.7205     0.7405     0.7695
MIX         0.6020#    0.6597#    0.7315#    0.7677#    0.7902#
DF           0.7685     0.8054     0.8302     0.8454     0.8531
DFX         0.7763#    0.8111#    0.8338#    0.8501#    0.8585#
GINI         0.8569     0.8726     0.8778     0.8801     0.8829
GINIX       0.8603#    0.8749#    0.8800#    0.8843#    0.8858#
DIA          0.5571     0.6040     0.6107     0.6165     0.6245
DIAX        0.7429#    0.7933#    0.8219#    0.8577#    0.8728#
CMFS         0.8536     0.8671     0.8748     0.8828     0.8865
CMFSX       0.8598#    0.8736#    0.8821#    0.8887#    0.8915#
OCFS         0.7083     0.7375     0.7892     0.8017     0.8113
OCFSX       0.7550#    0.7882#    0.8091#    0.8205#    0.8290#
DFPFS       0.8175#    0.8398#    0.8499#    0.8561#    0.8600#
DFPFSX       0.7591     0.7873     0.8000     0.8056     0.8086

                           Support vector machines

              400        800        1200       1600       2000

IG           0.7746     0.7998     0.8144     0.8257     0.8318
IGX         0.7774#    0.8038#    0.8183#    0.8284#    0.8361#
CHI          0.8485     0.8671     0.8702     0.8707     0.8690
CHIX        0.8589#    0.8726#    0.8747#    0.8763#    0.8753#
MI           0.6165     0.6733     0.7569     0.7721     0.7905
MIX         0.6269#    0.6894#    0.7700#    0.7961#    0.8089#
DF           0.8043     0.8224     0.8380     0.8464     0.8509
DFX         0.8146#    0.8297#    0.8416#    0.8500#    0.8557#
GINI         0.8711     0.8678     0.8631     0.8609     0.8625
GINIX       0.8720#    0.8735#    0.8690#    0.8660#    0.8672#
DIA          0.6331     0.6838     0.7086     0.7302     0.7419
DIAX        0.7519#    0.8038#    0.8324#    0.8585#    0.8690#
CMFS         0.8689     0.8769     0.8747     0.8723     0.8705
CMFSX       0.8722#    0.8798#    0.8788#    0.8776#    0.8766#
OCFS         0.7962     0.8164     0.8260     0.8301     0.8351
OCFSX       0.7984#    0.8193#    0.8313#    0.8365#    0.8408#
DFPFS       0.8377#    0.8446#    0.8492#    0.8521#    0.8543#
DFPFSX       0.7952     0.8119     0.8181     0.8218     0.8240


TABLE 5: The comparison of nine improved and existing feature-selection
methods with respect to the micro-F1 measure on Reuters-21578 for NB and
SVM, respectively. Values marked with # indicate the better performance
within each pair of existing and improved feature-selection methods.

Feature                         Naive Bayes
selection
               400       800      1200      1600      2000

IG           62.09#     64.60     64.76    65.11#     65.22
IGX           62.07    64.61#    64.77#     65.10     65.22
CHI           62.89     64.02     64.96     64.92     65.35
CHIX         64.66#    65.70#    66.00#    66.07#    65.94#
MI            34.65    54.87#    59.71#     61.58     62.77
MIX          39.55#     51.46     59.64    61.84#    63.43#
DF            62.41     64.09     64.99     65.36     65.51
DFX          63.99#    65.38#    65.65#    65.92#    65.94#
GINI          65.13     65.82    66.64#     66.16     65.78
GINIX        66.43#    66.45#     66.39    66.54#    66.49#
DIA           30.10     30.82     31.49     32.85     37.60
DIAX         48.11#    57.71#    63.69#    64.23#    64.95#
CMFS          66.03     66.52     66.38     66.66    66.64#
CMFSX        66.94#    67.21#    66.87#    66.84#     66.59
OCFS          60.90     63.43     64.41     64.39     64.96
OCFSX        63.91#    65.62#    65.56#    65.65#    65.85#
DFPFS         57.20     57.04     57.03     57.05     57.03
DFPFSX       62.55#    63.71#    63.67#    63.40#    63.40#

Feature                   Support vector machines
selection
               400       800      1200      1600      2000

IG            61.31     62.19     62.66     62.59     62.88
IGX          64.82#    65.89#    66.56#    66.66#    67.01#
CHI           62.85     63.12     62.96     62.93     62.73
CHIX         67.36#    67.29#    67.28#    67.13#    66.82#
MI           43.02#    51.67#     56.94     58.98     60.11
MIX           39.47     51.35    61.41#    63.44#    64.59#
DF            61.07     62.75     62.69     62.70     62.82
DFX          65.90#    66.65#    67.05#    67.01#    67.58#
GINI          63.42     63.17     63.07     62.95     62.71
GINIX        67.44#    67.18#    66.99#    67.11#    67.20#
DIA           45.06     48.87     51.05     51.85     53.04
DIAX         49.79#    59.15#    65.73#    66.39#    66.96#
CMFS          63.73     63.52     63.14     63.06     62.97
CMFSX        67.48#    67.41#    67.10#    67.27#    67.45#
OCFS          60.28     61.49     62.69     62.52     62.69
OCFSX        65.82#    66.11#    66.19#    66.55#    66.92#
DFPFS         61.56     61.39     61.27     61.44     61.17
DFPFSX       64.92#    65.42#    65.37#    65.46#    65.53#


TABLE 6: The comparison of nine improved and existing feature-selection
methods with respect to AUC on Reuters-21578 for NB and SVM,
respectively. Values marked with # indicate the better performance
within each pair of existing and improved feature-selection methods.

                                  Naive Bayes
Feature
selection      400        800        1200       1600       2000

IG           0.8978#     0.9068     0.9073    0.9088#     0.9093
IGX           0.8977     0.9068     0.9073     0.9087     0.9093
CHI          0.8988#     0.9055    0.9093#     0.9091     0.9112
CHIX          0.8864    0.9058#     0.9074    0.9101#    0.9116#
MI            0.6923    0.8521#     0.8799     0.8902     0.8979
MIX          0.7282#     0.8256    0.8815#    0.8928#    0.9007#
DF            0.8977     0.9075     0.9098     0.9107     0.9107
DFX          0.9008#    0.9091#    0.9105#    0.9111#    0.9116#
GINI          0.9082     0.9119     0.9135    0.9138#     0.9123
GINIX        0.9112#    0.9129#    0.9139#     0.9135    0.9135#
DIA           0.7005     0.7146     0.7249     0.7334     0.7565
DIAX         0.7884#    0.8654#    0.9040#    0.9066#    0.9088#
CMFS         0.9109#    0.9141#     0.9133     0.9139     0.9134
CMFSX         0.9071     0.9135    0.9143#    0.9147#    0.9144#
OCFS         0.8983#    0.9074#     0.9088     0.9092    0.9104#
OCFSX         0.8914     0.9065    0.9091#    0.9095#     0.9102
DFPFS         0.8159     0.8157     0.8159     0.8161     0.8162
DFPFSX       0.8828#    0.8882#    0.8884#    0.8880#    0.8883#

                             Support vector machines
Feature
selection      400        800        1200       1600       2000

IG            0.9005     0.9050     0.9065     0.9079     0.9086
IGX          0.9083#    0.9124#    0.9143#    0.9159#    0.9168#
CHI           0.9053     0.9071     0.9066     0.9070     0.9079
CHIX         0.9090#    0.9126#    0.9143#    0.9144#    0.9138#
MI           0.7957#    0.8541#     0.8810     0.8904     0.8974
MIX           0.7357     0.8341    0.8897#    0.8970#    0.9070#
DF            0.9012     0.9060     0.9091     0.9086     0.9090
DFX          0.9103#    0.9138#    0.9159#    0.9169#    0.9167#
GINI          0.9083     0.9068     0.9075     0.9088     0.9083
GINIX        0.9165#    0.9147#    0.9152#    0.9157#    0.9164#
DIA          0.8501#     0.8673     0.8750     0.8790     0.8811
DIAX          0.7954    0.8778#    0.9124#    0.9143#    0.9159#
CMFS          0.9095     0.9094     0.9084     0.9087     0.9094
CMFSX        0.9150#    0.9171#    0.9153#    0.9156#    0.9162#
OCFS          0.9027     0.9029     0.9046     0.9057     0.9075
OCFSX        0.9068#    0.9094#    0.9111#    0.9127#    0.9136#
DFPFS         0.8875     0.8869     0.8865     0.8871     0.8863
DFPFSX       0.9032#    0.9066#    0.9064#    0.9062#    0.9067#


TABLE 7: The comparison of nine improved and existing feature-selection
methods with respect to the micro-F1 measure on WebKB for NB and SVM,
respectively. Values marked with # indicate the better performance
within each pair of existing and improved feature-selection methods.

                                 Naive Bayes
Feature
selection      400       800      1200      1600      2000

IG            71.36    74.17#    75.84#    76.61#     77.46
IGX          71.64#     73.96     75.64     76.57    77.48#
CHI           71.92     74.16     74.65     76.19     76.93
CHIX         74.98#    76.28#    77.39#    78.57#    78.83#
MI           33.96#     34.63     36.32     40.71     49.58
MIX           33.69    38.31#    44.49#    51.00#    54.30#
DF            71.04     73.74     75.82     76.87     77.23
DFX          71.93#    74.48#    76.57#    77.52#    77.55#
GINI          72.13     75.58     77.68     77.84     78.42
GINIX        73.75#    76.96#    78.06#    78.80#    78.97#
DIA           47.75     48.40     46.06     48.77     51.00
DIAX         50.11#    53.23#    60.99#    63.86#    63.68#
CMFS          72.14     73.22     75.62     76.81     77.33
CMFSX        73.37#    75.89#    77.63#    78.06#    78.70#
OCFS          73.07     75.23     76.46     77.09     78.02
OCFSX        73.79#    76.52#    78.05#    78.27#    78.53#
DFPFS         65.17     64.74     64.65     64.66     65.33
DFPFSX       67.96#    68.81#    68.37#    68.58#    68.83#

                          Support vector machines
Feature
selection      400       800      1200      1600      2000

IG            83.52     85.66     86.10     86.80     86.77
IGX          84.10#    86.29#    86.76#    87.27#    87.33#
CHI           85.88     86.52     86.08     86.89     86.89
CHIX         86.69#    86.97#    87.26#    87.63#    87.34#
MI           44.96#    48.60#    57.23#    61.75#    64.44#
MIX           38.37     46.86     50.56     56.63     60.85
DF            83.84     85.87     87.27     87.04     87.05
DFX          84.51#    86.28#    87.73#    87.64#    87.59#
GINI          86.17     86.81     87.09     87.13     86.90
GINIX        86.61#    87.52#    88.20#    87.56#    87.44#
DIA          62.66#    69.13#    74.84#    76.95#    79.03#
DIAX          56.42     63.42     69.97     71.73     71.36
CMFS          85.64     86.06     86.72    87.63#     87.17
CMFSX        86.04#    86.66#    87.29#     87.54    87.75#
OCFS          84.39     85.64     86.49     86.84     87.04
OCFSX        87.19#    86.86#    86.85#    87.44#    87.78#
DFPFS         80.69     81.46     81.27     81.26     80.28
DFPFSX       81.83#    82.08#    82.11#    82.00#    82.01#


TABLE 8: The comparison of nine improved and existing feature-selection
methods with respect to AUC on WebKB for NB and SVM, respectively.
Values marked with # indicate the better performance within each pair
of existing and improved feature-selection methods.

Feature                            Naive Bayes
selection
               400        800        1200       1600       2000

IG            0.8051    0.8253#    0.8369#    0.8399#     0.8442
IGX          0.8055#     0.8242     0.8357     0.8398    0.8444#
CHI           0.8214     0.8312     0.8365     0.8402     0.8423
CHIX         0.8345#    0.8416#    0.8478#    0.8521#    0.8520#
MI            0.5009     0.5030     0.5113     0.5419     0.6082
MIX          0.5237#    0.5505#    0.5925#    0.6359#    0.6593#
DF            0.8041     0.8242     0.8344     0.8404     0.8422
DFX          0.8051#    0.8262#    0.8382#    0.8432#    0.8439#
GINI          0.8177     0.8358     0.8463     0.8476     0.8492
GINIX        0.8229#    0.8420#    0.8480#    0.8534#    0.8533#
DIA          0.6252#    0.6372#     0.6058     0.6192     0.6465
DIAX          0.5700     0.5931    0.6111#    0.6279#    0.6511#
CMFS          0.8132     0.8216     0.8354     0.8397     0.8430
CMFSX        0.8183#    0.8351#    0.8445#    0.8475#    0.8513#
OCFS          0.8189     0.8349     0.8436     0.8453     0.8490
OCFSX        0.8244#    0.8401#    0.8480#    0.8488#    0.8508#
DFPFS         0.7504     0.7507     0.7511     0.7517     0.7525
DFPFSX       0.7723#    0.7770#    0.7772#    0.7782#    0.7786#

Feature                      Support vector machines
selection
               400        800        1200       1600       2000

IG            0.8933     0.9044     0.9093     0.9128     0.9120
IGX          0.8973#    0.9080#    0.9134#    0.9153#    0.9151#
CHI           0.9090     0.9114     0.9095     0.9131     0.9133
CHIX         0.9130#    0.9137#    0.9158#    0.9191#    0.9170#
MI           0.6357#    0.6618#    0.7137#    0.7478#    0.7620#
MIX           0.5681     0.6359     0.6653     0.7024     0.7310
DF            0.8952     0.9078     0.9157     0.9140     0.9143
DFX          0.8981#    0.9103#    0.9181#    0.9174#    0.9168#
GINI          0.9105     0.9144     0.9171     0.9155     0.9138
GINIX        0.9123#    0.9179#    0.9229#    0.9184#    0.9174#
DIA          0.7867#    0.8151#    0.8420#    0.8575#    0.8696#
DIAX          0.5791     0.6142     0.6473     0.6842     0.7052
CMFS          0.9067     0.9098     0.9123     0.9181     0.9152
CMFSX        0.9079#    0.9123#    0.9165#    0.9185#    0.9190#
OCFS          0.9106     0.9041     0.9119     0.9160     0.9158
OCFSX        0.9151#    0.9112#    0.9141#    0.9168#    0.9191#
DFPFS         0.8735     0.8750     0.8743     0.8741     0.8729
DFPFSX       0.8815#    0.8824#    0.8819#    0.8806#    0.8809#


TABLE 9: The accuracy comparison of ECE with nine feature-selection
algorithms when the NB is used on 20-Newsgroups. The numbers in the
parentheses are the difference of accuracy of the corresponding
feature-selection algorithm from that of ECE.

Feature
selections         400               800              1200

ECE            70.81 (--)        74.54 (--)        75.51 (--)
CHI           66.44 (-4.37)     72.36 (-2.18)     74.54 (-0.97)
DF            56.01 (-14.80)    63.02 (-11.52)    67.74 (-7.77)
GINI          72.81 (+2.00)     75.79 (+1.25)     76.78 (+1.27)
IG           46.47 (-24.34)    53.99 (-20.55)    59.42 (-16.09)
MI           20.77 (-50.04)    29.38 (-45.16)    46.90 (-28.61)
DIA          15.86 (-54.95)    24.76 (-49.78)    26.02 (-49.49)
CMFS          72.19 (+1.38)     74.76 (+0.22)     76.20 (+0.69)
OCFS         43.10 (-27.71)    51.05 (-23.49)    56.82 (-18.69)
DFPFS         65.33 (-5.48)     69.56 (-4.98)     71.48 (-4.03)

Feature
selections        1600              2000

ECE            76.73 (--)        77.16 (--)
CHI           75.62 (-1.11)     76.66 (-0.50)
DF            70.64 (-6.09)     72.10 (-5.06)
GINI          77.22 (+0.49)     77.76 (+0.60)
IG           62.92 (-13.81)    65.55 (-11.61)
MI           50.70 (-26.03)    56.21 (-20.95)
DIA           27.14 (-49.59)    28.64 (-48.52)
CMFS          77.72 (+0.99)     78.43 (+1.27)
OCFS         61.41 (-15.32)    63.87 (-13.29)
DFPFS         72.66 (-4.07)     73.39 (-3.77)