
01-07-2016 | Systems-Level Quality Improvement

Distance Metric Based Oversampling Method for Bioinformatics and Performance Evaluation

Authors: Meng-Fong Tsai, Shyr-Shen Yu

Published in: Journal of Medical Systems | Issue 7/2016


Abstract

A classification problem is imbalanced when the classes in a dataset are unequally distributed. On such data, most classification methods predict the majority class with high accuracy but the minority class with significantly lower accuracy. To address this problem, this study took several imbalanced datasets from the well-known UCI repository and designed and implemented an efficient algorithm that couples Top-N Reverse k-Nearest Neighbor (TRkNN) with the Synthetic Minority Over-sampling Technique (SMOTE). The proposed algorithm was evaluated with classification methods such as logistic regression (LR), C4.5, the Support Vector Machine (SVM), and the Back-Propagation Neural Network (BPNN). The study also applied different distance metrics when classifying the same UCI datasets. The empirical results show that the Euclidean and Manhattan distances are not only more accurate but also more computationally efficient than the Chebyshev and Cosine distances. The proposed algorithm based on TRkNN and SMOTE can therefore be widely used to handle imbalanced datasets, and the recommendations on choosing suitable distance metrics can serve as a reference for future studies.
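The abstract describes the approach only at a high level; the exact TRkNN + SMOTE procedure is given in the full paper. The following is a minimal Python sketch of the general idea under stated assumptions: SMOTE-style interpolation between minority samples, a reverse k-NN count used to pick Top-N seed samples, and a pluggable distance metric covering the four metrics compared in the study. Function names such as `reverse_knn_counts` and `oversample`, and all parameter defaults, are hypothetical illustrations, not the authors' implementation.

```python
# Illustrative sketch only: distance-metric-based, SMOTE-style oversampling
# with a reverse k-NN seed filter. Not a reproduction of the paper's TRkNN.
import numpy as np

def pairwise_distance(a, b, metric="euclidean"):
    """Distance between two feature vectors for the four metrics in the study."""
    diff = a - b
    if metric == "euclidean":
        return np.sqrt(np.sum(diff ** 2))
    if metric == "manhattan":
        return np.sum(np.abs(diff))
    if metric == "chebyshev":
        return np.max(np.abs(diff))
    if metric == "cosine":
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 1.0 - np.dot(a, b) / denom if denom > 0 else 1.0
    raise ValueError(f"unknown metric: {metric}")

def reverse_knn_counts(X, k, metric="euclidean"):
    """For each sample, count how many other samples list it among their k nearest neighbours."""
    n = len(X)
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        dists = np.array([pairwise_distance(X[i], X[j], metric) if j != i else np.inf
                          for j in range(n)])
        for j in np.argsort(dists)[:k]:
            counts[j] += 1
    return counts

def oversample(X_min, k=5, n_new=100, top_n=10, metric="euclidean", rng=None):
    """Generate synthetic minority samples by interpolating a selected seed with
    one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(rng)
    # Keep the Top-N minority samples with the highest reverse k-NN counts as seeds.
    seeds = np.argsort(-reverse_knn_counts(X_min, k, metric))[:top_n]
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(seeds)
        dists = np.array([pairwise_distance(X_min[i], X_min[j], metric) if j != i else np.inf
                          for j in range(len(X_min))])
        j = rng.choice(np.argsort(dists)[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Example: generate 30 synthetic points for a toy minority class of 20 samples.
if __name__ == "__main__":
    X_minority = np.random.default_rng(0).normal(size=(20, 4))
    X_synth = oversample(X_minority, k=5, n_new=30, top_n=8, metric="manhattan", rng=1)
    print(X_synth.shape)  # (30, 4)
```

In this sketch the metric passed to `oversample` controls both the reverse k-NN counts and the neighbour search used for interpolation, which is one simple way to study how the choice of Euclidean, Manhattan, Chebyshev, or Cosine distance affects the generated samples.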
Metadata
Title
Distance Metric Based Oversampling Method for Bioinformatics and Performance Evaluation
Authors
Meng-Fong Tsai
Shyr-Shen Yu
Publication date
01-07-2016
Publisher
Springer US
Published in
Journal of Medical Systems / Issue 7/2016
Print ISSN: 0148-5598
Electronic ISSN: 1573-689X
DOI
https://doi.org/10.1007/s10916-016-0516-3
