Iterative ensemble feature selection for multiclass classification of imbalanced microarray data
© Yang et al. 2016
Published: 4 July 2016
Microarray technology allows biologists to monitor expression levels of thousands of genes among various tumor tissues. Identifying relevant genes for sample classification of various tumor types is beneficial to clinical studies. One of the most widely used classification strategies for multiclass classification data is the One-Versus-All (OVA) schema, which divides the original problem into multiple binary classifications of one class against the rest. Nevertheless, multiclass microarray data tend to suffer from imbalanced class distribution between majority and minority classes, which inevitably deteriorates the performance of the OVA classification.
In this study, we propose a novel iterative ensemble feature selection (IEFS) framework for multiclass classification of imbalanced microarray data. In particular, filter feature selection and balanced sampling are performed iteratively and alternately to boost the performance of each binary classification in the OVA schema. The proposed framework is tested and compared with other representative state-of-the-art filter feature selection methods using six benchmark multiclass microarray data sets. The experimental results show that the IEFS framework provides superior or comparable performance to the other methods in terms of both classification accuracy and area under the receiver operating characteristic curve. The more classes the data have, the better the IEFS framework performs.
Balanced sampling and feature selection together work well in improving the performance of multiclass classification of imbalanced microarray data. The IEFS framework is readily applicable to other biological data analysis tasks facing the same problem.
Microarray gene expression data are widely used for cancer clinical studies [1, 2]. The identification of genes relevant to cancers is a common biological challenge. It is crucial to explore a list of high-potential biomarkers and signature candidates that are strongly associated with the disease among a large number of simultaneously observed genes. From a machine learning perspective, gene selection is regarded as feature selection over the candidate genes that can be used to distinguish the classes of sample tissues.
Multiclass cancer prediction based on gene selection has attracted increasing research interest [5–8]. For instance, Li et al. compared different feature selection and multiclass classification methods for gene expression data. The paper indicated that the multiclass classification problem is much more difficult than the binary one for gene expression data. By comparing several filter feature selection methods and representative classifiers including naive Bayes, k-nearest neighbor (KNN), and support vector machine (SVM), they also suggested that the classification accuracy degrades rapidly as the number of classes increases. Kim-Anh et al. developed a One-Versus-One schema based optimal feature weighting approach using classification-and-regression tree and SVM classifiers. Zhou et al. extended support vector machine recursive feature elimination (SVM-RFE) to solve the multiclass gene selection problem based on different frameworks of multiclass SVMs, and improved the classification accuracy. Yeung et al. utilized the Bayesian model averaging method for gene selection, which was reported to be applicable to microarray data sets with any number of classes. It is capable of obtaining high accuracy with only a small number of selected genes, while also providing posterior probabilities for the predictions. To alleviate the siren-pitfall problem, Rajapakse et al. proposed a novel algorithm that decomposes multiclass ranking statistics into class-specific statistics and uses Pareto-front analysis for the selection of genes. Experiments showed a significant improvement in classification performance and a reduction in redundancy among the top-ranked genes.
The aforementioned methods have achieved success on multiclass microarray data. However, the inherently imbalanced nature of such data, i.e., some minority classes may have relatively few samples compared to other classes (denoted as majority classes), still poses major challenges to gene selection methods. In this study, we propose an iterative ensemble feature selection (IEFS) framework based on the One-Versus-All (OVA) classification schema to improve the classification performance in terms of both classification accuracy and area under the receiver operating characteristic curve (AUC). The OVA schema is a widely used ensemble solution for solving multiclass problems. In each binary sub-classification of the OVA schema, samples of the majority class outnumber those from the minority class [14–17]. Therefore, a binary classifier would obtain good overall accuracy on the majority class but not on the minority class. The informative genes that help to separate the minority class are overwhelmed by those that are discriminating in the majority class, due to the lack of samples in the minority class. Known as the siren-pitfall, this problem has not yet been well addressed in multiclass classification of microarray data. In this paper, we use a sampling method prior to gene selection in binary classification to solve this problem caused by imbalanced data distribution.
Data sampling is one of the most widely used approaches to address the imbalanced classification problem. It turns data with an imbalanced distribution into data with a balanced or optimal distribution; undersampling and oversampling, the two representative approaches, have been thoroughly studied. Undersampling removes samples from the majority class to match the size of the minority class. In contrast, oversampling duplicates samples from the minority class to match the size of the majority class.
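The way the OVA decomposition creates imbalance can be illustrated with a minimal Python sketch. The class names and class sizes below are hypothetical and are not taken from any of the benchmark data sets; the helper names `one_versus_all` and `imbalance_ratio` are likewise only illustrative.

```python
from collections import Counter

def one_versus_all(labels):
    """Return {class: binary labels}, where 1 marks the class and 0 marks the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def imbalance_ratio(binary_labels):
    """Number of 'rest' samples divided by the number of 'one' samples."""
    counts = Counter(binary_labels)
    return counts[0] / counts[1]

# Hypothetical four-class data set with 40 samples (assumed sizes).
labels = ["A"] * 16 + ["B"] * 12 + ["C"] * 8 + ["D"] * 4

problems = one_versus_all(labels)
ratios = {c: imbalance_ratio(b) for c, b in problems.items()}
for c in sorted(ratios):
    print(c, round(ratios[c], 2))
```

Even the largest class ends up outnumbered once the remaining classes are merged into a single "rest" class, and the smallest class faces the most skewed sub-problem.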
The IEFS framework is tested on six benchmark multiclass microarray data sets and the experimental results show that the framework significantly improves the prediction accuracy of both minority and majority classes.
Results and discussion
Microarray data sets
Summary of microarray data sets
In the experiment, we investigate the combinations of two sampling methods, i.e., oversampling and undersampling, and three filter feature selection methods in the IEFS framework. The filter feature selection methods include one ranking method and two space search methods. The ranking method measures the relevance between features and the class label vector based on mutual information. The two space search methods are fast correlation-based filter selection (FCBF) and minimum redundancy maximum relevance feature selection (mRMR). FCBF identifies relevant features as well as redundancy among them based on symmetric uncertainty. mRMR penalises a feature's relevance by its redundancy in the presence of the other selected features. Both relevance and redundancy are measured using mutual information.
In the IEFS framework, an undersampling or oversampling technique is applied to correct the skewness of the sample distribution before feature selection. In particular, random undersampling and the synthetic minority oversampling technique (SMOTE) are used. Sampling and feature selection are performed iteratively and alternately until a satisfactory performance is obtained.
The classification performance of the selected feature subset obtained by the IEFS framework is evaluated using both KNN and SVM. KNN and SVM classifiers are sensitive to imbalanced class distributions [28, 29]. Their performance on imbalanced data sets can easily be affected if the skewness of the sample distribution is not corrected. The IEFS framework is expected to improve the performance of KNN and SVM.
Most classifiers obtain good overall classification accuracy on the whole data but poor accuracy on the minority classes. When applied to imbalanced data, a good classifier should perform well on minority classes even at the expense of performance on the majority classes. AUC reflects both sensitivity and specificity, which are defined as the proportions of samples that are correctly classified in the positive and the negative classes, respectively. Therefore, AUC is a better metric than classification accuracy for evaluating classifier performance on the minority class. In addition to classification accuracy, the classification performance in terms of AUC is also reported.
In our empirical studies, the number of selected features in the filter ranking method is increased from 5 to 100 at intervals of 5. The performance obtained using all features serves as the baseline. The number of nearest neighbors used in the oversampling method is set to 5. For the controlled size of selected features with the filter ranking method, the step T of sample balancing and feature selection is set to 1 and 4, respectively. Because FCBF is capable of deciding the number of selected features itself, the step T of sample balancing and feature selection in IEFS with FCBF is set to 1. Consistently, the step T for the selected feature subset with mRMR is set to 1. The classification accuracies on the data sets Lung, ALL-AML-3 and ALL-AML-4 are evaluated with threefold stratified cross-validation, as the sizes of some classes are smaller than 10. The classification accuracies on the other three data sets, i.e., GCM, ALL and Thyroid, are evaluated using tenfold stratified cross-validation. All experiments are conducted in the WEKA environment. The other parameters for FCBF, mRMR and the classifiers [KNN (K = 3) and SVM] are used with default settings in WEKA.
The computational cost of the IEFS framework depends on the sampling preprocessing, the step T, and the number of classes. The IEFS framework might consume more computational resources than the other filter feature selection methods, yet the extra effort for accuracy improvement is acceptable considering that the classification task is normally conducted offline.
This paper proposes an iterative ensemble feature selection framework for imbalanced multiclass microarray data. The performance of conventional filter feature selection methods, including filter ranking, FCBF, and mRMR, is compared to that of the IEFS framework on six gene microarray data sets. The results show that our proposed framework and the OVA ensemble schema can obtain promising performance on multiclass gene selection problems. Within this framework, different concrete oversampling methods can be applied to various multiclass gene selection problems. Undersampling does not work as well as oversampling in this framework due to the lack of training samples. In future work, more effective oversampling methods that benefit specific filter feature selection techniques will be developed and investigated with the OVA classification schema. Moreover, the optimal combination of sampling method and feature selection will be explored. The IEFS framework is also applicable to other domains suffering from the same problem.
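For orientation, the mRMR idea described above is often written in the following "difference" form. This is a sketch of one common variant from the mRMR literature; the text does not state which variant the WEKA implementation used, and the symbols $F$ (all candidate features), $S$ (features already selected), and $I(\cdot;\cdot)$ (mutual information) are notation introduced here.

```latex
f^{*} \;=\; \arg\max_{f \in F \setminus S}
\left[\, I(f; c) \;-\; \frac{1}{|S|} \sum_{s \in S} I(f; s) \,\right]
```

The first term rewards relevance to the class label vector $c$, and the second term subtracts the average redundancy of the candidate with respect to the features chosen so far.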
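The alternation of balancing and feature selection can be sketched for a single binary OVA sub-problem as follows. This is only one plausible reading, not the authors' algorithm: the per-round random undersampling, the mean-difference score (a stand-in for the mutual-information filter), and the vote-based aggregation over T rounds are all assumptions made for illustration, as are the function names.

```python
import random
from collections import Counter

def balance(X, y, rng):
    """Random undersampling: keep the smaller class and an equal-sized random part of the larger."""
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    small, large = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = small + rng.sample(large, len(small))
    return [X[i] for i in keep], [y[i] for i in keep]

def score_features(X, y):
    """Hypothetical filter score: absolute difference of the class means of each feature."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, t in zip(X, y) if t == 1]
        neg = [row[j] for row, t in zip(X, y) if t == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores

def iterative_selection(X, y, top_k, T, seed=0):
    """Alternate balancing and scoring for T rounds; return the top_k most-voted feature indices."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(T):
        Xb, yb = balance(X, y, rng)
        scores = score_features(Xb, yb)
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
        votes.update(top)
    return [j for j, _ in votes.most_common(top_k)]

# Toy data (assumed): feature 0 separates the 3 minority samples, features 1 and 2 are noise.
X = [[0.0, float(i % 3), float(i % 2)] for i in range(12)] + [[5.0, 1.0, 0.5] for _ in range(3)]
y = [0] * 12 + [1] * 3
selected = iterative_selection(X, y, top_k=1, T=4)
print(selected)
```

Scoring on a balanced subsample in every round keeps the majority class from dominating the feature scores, which is the effect the framework aims for.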
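The gap between accuracy and AUC on a skewed problem can be made concrete with a small Python sketch. The labels and scores are made up for illustration, and AUC is computed here through its rank interpretation, i.e., the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def sensitivity_specificity(y_true, y_pred):
    """Fractions of positive and of negative samples that are classified correctly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Probability that a positive outscores a negative; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up binary sub-problem: 2 minority (positive) and 8 majority (negative) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.4, 0.05, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # every sample is predicted negative

acc = accuracy(y_true, y_pred)
sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
print(acc, sens, spec, area)
```

The classifier reaches 80 % accuracy by never predicting the minority class, while its sensitivity of 0 and AUC of 0.5 reveal that it carries no useful information about that class.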
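The feature-count grid and the reason for using fewer folds on small classes can be sketched in Python. The experiments themselves were run in WEKA; the fold assignment below is a simplified stand-in written for illustration, and the class sizes are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so that each class is spread as evenly as possible."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal the samples of this class round-robin
    return folds

# Feature subset sizes 5, 10, ..., 100 used with the filter ranking method.
feature_counts = list(range(5, 101, 5))

# Hypothetical data set with one class of fewer than 10 samples.
labels = ["maj"] * 30 + ["min"] * 6
folds3 = stratified_folds(labels, k=3)
folds10 = stratified_folds(labels, k=10)

def folds_without(folds, cls):
    return sum(1 for f in folds if not any(labels[i] == cls for i in f))

print(len(feature_counts), folds_without(folds3, "min"), folds_without(folds10, "min"))
```

With six minority samples, threefold splitting puts two of them in every test fold, whereas tenfold splitting leaves four folds with no minority sample at all, so the per-class performance could not be measured on those folds.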
The iterative ensemble feature selection
Random undersampling and SMOTE oversampling are used in the IEFS framework. The random undersampling method balances the two classes by reducing the size of the majority one. This is accomplished by randomly removing samples from the majority class until the sizes of the majority and minority classes are equal. The SMOTE algorithm generates new samples for the minority class. These samples are created artificially based on the feature space similarities between existing minority samples. By interpolating between the existing minority samples, a denser minority class containing more samples is obtained.
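Both sampling schemes can be sketched in a few lines of Python. The oversampler below is a simplified SMOTE-style interpolation (it picks the base sample at random rather than following the original per-sample schedule), and the toy points and the choice k = 2 are assumptions made because the example has only three minority samples.

```python
import random

def random_undersample(majority, minority, rng):
    """Randomly keep as many majority samples as there are minority samples."""
    return rng.sample(majority, len(minority)), list(minority)

def smote_like(minority, n_new, k, rng):
    """Create n_new points on segments between a minority sample and one of its k nearest minority neighbours."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sqdist(base, m))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic

# Toy two-feature data (assumed): 8 majority and 3 minority samples.
majority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
            [2.0, 0.0], [0.0, 2.0], [2.0, 1.0], [1.0, 2.0]]
minority = [[4.0, 4.0], [5.0, 4.0], [4.0, 6.0]]

rng = random.Random(0)
under_major, under_minor = random_undersample(majority, minority, rng)
synthetic = smote_like(minority, n_new=len(majority) - len(minority), k=2, rng=rng)
oversampled_minority = minority + synthetic
print(len(under_major), len(under_minor), len(oversampled_minority))
```

Undersampling shrinks the training set to twice the minority size, while the interpolation keeps every original sample and fills the region spanned by the minority class.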
Filter ranking feature selection
The filter ranking feature selection method first evaluates the univariate correlation between each feature and the class label vector based on mutual information and then ranks the features in descending order. Afterward, a predefined number of top-ranked features are selected. Filter ranking is widely used thanks to its easy implementation and high efficiency, but it cannot handle the redundancy between features.
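A minimal Python sketch of such a ranking, assuming the expression values have already been discretized (here into 0/1), is shown below. The gene names and values are made up; `g2` is an exact copy of `g1` to show how a redundant feature still enters the selection.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())

def rank_features(columns, labels, top_k):
    """Score every feature against the labels and return the top_k names plus all scores."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly informative
    "g2": [0, 0, 0, 0, 1, 1, 1, 1],  # duplicate of g1, hence redundant
    "g3": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the labels
}
top, scores = rank_features(columns, labels, top_k=2)
print(top, scores)
```

Because each feature is scored on its own, the duplicate `g2` is selected alongside `g1` even though it adds no new information.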
Fast correlation-based filter feature selection
FCBF is a fast correlation-based filter feature selection method used in the IEFS framework. It begins by ranking the features in descending order of their correlation with the class label vector and then removes those with correlation values smaller than a threshold δ. FCBF then goes through the ranked feature list in decreasing order, and a feature f_i is removed if there exists another feature f_j such that SU(c; f_j) ≥ SU(c; f_i) and SU(f_i; f_j) ≥ SU(f_i; c), where SU(a; b) denotes the symmetrical uncertainty between variables a and b. These two inequalities mean that f_j is a better predictor of the class label vector c and that f_i is more similar to f_j than to c. The threshold δ can be adjusted to obtain the desired number of features.
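The procedure can be sketched in Python as follows, assuming discretized features and the usual definition SU(a; b) = 2 I(a; b) / (H(a) + H(b)). As in the published FCBF algorithm, each candidate is compared only against features that have already been retained; the toy data and δ = 0.1 are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(x):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in Counter(x).values())

def symmetrical_uncertainty(x, y):
    """SU(x; y) = 2 * I(x; y) / (H(x) + H(y)), with I(x; y) = H(x) + H(y) - H(x, y)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mi / (hx + hy)

def fcbf(columns, labels, delta):
    """Return the names of the features kept by the FCBF-style filter."""
    su_c = {f: symmetrical_uncertainty(col, labels) for f, col in columns.items()}
    # Drop features below the relevance threshold and rank the rest in descending order.
    ranked = sorted((f for f in su_c if su_c[f] >= delta), key=su_c.get, reverse=True)
    selected = []
    for f_i in ranked:
        # Every f_j in `selected` already satisfies SU(c; f_j) >= SU(c; f_i).
        redundant = any(
            symmetrical_uncertainty(columns[f_i], columns[f_j]) >= su_c[f_i]
            for f_j in selected
        )
        if not redundant:
            selected.append(f_i)
    return selected

labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],  # strongly relevant
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],  # duplicate of g1
    "g3": [0, 0, 1, 0, 0, 1, 1, 1],  # weakly relevant, nearly unrelated to g1
    "g4": [0, 1, 0, 1, 0, 1, 0, 1],  # irrelevant
}
selected = fcbf(columns, labels, delta=0.1)
print(selected)
```

In this toy case `g4` falls below δ, `g2` is discarded as redundant with `g1`, and `g3` survives because it is closer to the class labels than to `g1`.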
Minimum redundancy maximum relevance feature selection
JY and ZJ conceived the study, performed the experiments, and wrote the paper. JZ, ZZ, and XM reviewed and revised the manuscript. All authors read and approved the manuscript.
This work was supported in part by National Natural Science Foundation of China Joint Fund with Guangdong (U1201256), the National Natural Science Foundation of China (61471246, 61171125, and 61501138), the Guangdong Foundation of Outstanding Young Teachers in Higher Education Institutions (Yq2013141), Guangdong Special Support Program of Top-notch Young Professionals (2014TQ01X273), Guangdong Natural Science Foundation (S2012010009545), Shenzhen Scientific Research and Development Funding Program (JCYJ20130329115450637, KQC201108300045A, and ZYC201105170243A), Innovation R&D Project of Nanshan District of Shenzhen (KC2014JSQN0008A), and Nanshan Innovation Institution Construction Program (KC2014ZDZJ0026A and KC2013ZDZJ0011A).
The authors declare that they have no competing interests.
Publication of this article was funded by the National Natural Science Foundation of China (61171125). This article has been published as part of Journal of Biological Research—Thessaloniki, Volume 23, Supplement 1, 2016: Proceedings of the 2014 International Conference on Intelligent Computing. The full contents of the supplement are available online at http://jbiolres.biomedcentral.com/articles/supplements/volume-23-supplement-1.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Fehrmann RS, Karjalainen JM, Krajewska M, Westra HJ, Maloney D, Simeonov A, et al. Gene expression analysis identifies global gene dosage sensitivity in cancer. Nat Genet. 2015;47:115–25.
- Gerstung M, Pellagatti A, Malcovati L, Giagounidis A, Della Porta MG, Jädersten M, et al. Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes. Nat Commun. 2015;6:5901.
- Chambers AH, Pillet J, Plotto A, Bai J, Whitaker VM, Folta KM. Identification of a strawberry flavour gene candidate using an integrated genetic-genomic-analytical chemistry approach. BMC Genomics. 2014;15:217.
- Hausser J, Zavolan M. Identification and consequences of miRNA-target interactions - beyond repression of gene expression. Nat Rev Genet. 2014;15:599–612.
- Madahian B, Deng LY, Homayouni R. Development of sparse Bayesian multinomial generalized linear model for multi-class prediction. BMC Bioinformatics. 2014;15:S10.
- Engchuan W, Chan JH. Pathway activity transformation for multi-class classification of lung cancer datasets. Neurocomputing. 2015;165:81–9.
- Zhou X, Tuck DP. MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics. 2007;23:1106–14.
- Rajapakse JC, Mundra PA. Multiclass gene selection using Pareto-fronts. IEEE/ACM Trans Comput Biol Bioinform. 2013;10:87–97.
- Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004;20:2429–37.
- Cao KAL, Bonnet A, Gadat S. Multiclass classification and gene selection with a stochastic algorithm. Comput Stat Data Anal. 2009;53:3601–15.
- Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46:389–422.
- Yeung K, Bumgarner RA, Raftery AE. Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics. 2005;21:2394–402.
- Fürnkranz J. Round robin classification. J Mach Learn Res. 2002;2:721–47.
- Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA. 2001;98:15149–54.
- Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA. 2001;98:13790–5.
- Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1:133–43.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–7.
- Forman G. A pitfall and solution in multi-class feature selection for text classification. Proc Twenty-first Int Conf Mach Learn. 2004;6441:38.
- He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21:1263–84.
- Liu X-Y, Wu J, Zhou Z-H. Exploratory undersampling for class-imbalance learning. In: IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol 39. IEEE; 2009. p. 539–50.
- Yukinawa N, Oba S, Kato K, Taniguchi K, Iwao-Koizumi K, Tamaki Y, et al. A multi-class predictor based on a probabilistic model: application to gene expression profiling-based diagnosis of thyroid tumors. BMC Genomics. 2006;7:190.
- Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C, et al. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans Comput Biol Bioinform. 2012;9:1106–19.
- Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2002;3:1157–82.
- Yu L, Liu H. Feature selection for high-dimensional data: a fast correlation-based filter solution. Proc Eight Int Conf Mach Learn. 2003;2:856–63.
- Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 27. IEEE; 2005. p. 1226–38.
- Japkowicz N. The class imbalance problem: significance and strategies. In: Proceedings of the international conference on artificial intelligence. 2002. p. 111–7.
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
- Akbani R, Kwek S, Japkowicz N. Applying support vector machines to imbalanced datasets. Mach Learn. 2004;3201:39–50.
- Liu W, Chawla S. Class confidence weighted kNN algorithms for imbalanced data sets. Adv Knowl Discov Data Min. 2011;6635:345–56.
- Chawla NV, Japkowicz N, Kotcz A. Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explor Newsl. 2004;6:1–6.
- Japkowicz N. Learning from imbalanced data sets: a comparison of various strategies. In: AAAI workshop on learning from imbalanced data sets, vol. 68; 2000. p. 10–15.
- Do KA, Ambroise C. Analyzing microarray gene expression data, vol. 14. New York: Wiley; 2004. p. 1080–7.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM Sigkdd Explor Newsl. 2009;11:10–8.
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes in C. Cambridge University Press, vol. 10; 1992. p. 195–6.
- Gutlein M, Frank E, Hall M, Karwath A. Large-scale attribute selection using wrappers. In: IEEE Symposium on Computational Intelligence and Data Mining. 2009. p. 332–9.