
01-07-2016 | Systems-Level Quality Improvement

Distance Metric Based Oversampling Method for Bioinformatics and Performance Evaluation

Authors: Meng-Fong Tsai, Shyr-Shen Yu

Published in: Journal of Medical Systems | Issue 7/2016


Abstract

A classification problem is imbalanced when the classes in a dataset are unequally distributed. On such data, most classification methods predict the majority class with high accuracy but the minority class with significantly lower accuracy. To address this problem, this study took several imbalanced datasets from the well-known UCI repository and designed and implemented an efficient algorithm that couples Top-N Reverse k-Nearest Neighbor (TRkNN) with the Synthetic Minority Over-sampling Technique (SMOTE). The proposed algorithm was evaluated with classification methods such as logistic regression (LR), C4.5, the Support Vector Machine (SVM), and the Back-Propagation Neural Network (BPNN). The study also applied different distance metrics when classifying the same UCI datasets. The empirical results show that the Euclidean and Manhattan distances are not only more accurate but also more computationally efficient than the Chebyshev and Cosine distances. The proposed algorithm based on TRkNN and SMOTE can therefore be widely used to handle imbalanced datasets, and the recommendations on choosing suitable distance metrics can serve as a reference for future studies.
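The abstract describes the approach only at a high level; the exact TRkNN + SMOTE procedure is given in the full paper. The following is a minimal Python sketch of the general idea under stated assumptions: SMOTE-style interpolation between minority samples, a reverse k-NN count used to pick Top-N seed samples, and a pluggable distance metric covering the four metrics compared in the study. Function names such as `reverse_knn_counts` and `oversample`, and all parameter defaults, are hypothetical illustrations, not the authors' implementation.

```python
# Illustrative sketch only: distance-metric-based, SMOTE-style oversampling
# with a reverse k-NN seed filter. Not a reproduction of the paper's TRkNN.
import numpy as np

def pairwise_distance(a, b, metric="euclidean"):
    """Distance between two feature vectors for the four metrics in the study."""
    diff = a - b
    if metric == "euclidean":
        return np.sqrt(np.sum(diff ** 2))
    if metric == "manhattan":
        return np.sum(np.abs(diff))
    if metric == "chebyshev":
        return np.max(np.abs(diff))
    if metric == "cosine":
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 1.0 - np.dot(a, b) / denom if denom > 0 else 1.0
    raise ValueError(f"unknown metric: {metric}")

def reverse_knn_counts(X, k, metric="euclidean"):
    """For each sample, count how many other samples list it among their k nearest neighbours."""
    n = len(X)
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        dists = np.array([pairwise_distance(X[i], X[j], metric) if j != i else np.inf
                          for j in range(n)])
        for j in np.argsort(dists)[:k]:
            counts[j] += 1
    return counts

def oversample(X_min, k=5, n_new=100, top_n=10, metric="euclidean", rng=None):
    """Generate synthetic minority samples by interpolating a selected seed with
    one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(rng)
    # Keep the Top-N minority samples with the highest reverse k-NN counts as seeds.
    seeds = np.argsort(-reverse_knn_counts(X_min, k, metric))[:top_n]
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(seeds)
        dists = np.array([pairwise_distance(X_min[i], X_min[j], metric) if j != i else np.inf
                          for j in range(len(X_min))])
        j = rng.choice(np.argsort(dists)[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Example: generate 30 synthetic points for a toy minority class of 20 samples.
if __name__ == "__main__":
    X_minority = np.random.default_rng(0).normal(size=(20, 4))
    X_synth = oversample(X_minority, k=5, n_new=30, top_n=8, metric="manhattan", rng=1)
    print(X_synth.shape)  # (30, 4)
```

In this sketch the metric passed to `oversample` controls both the reverse k-NN counts and the neighbour search used for interpolation, which is one simple way to study how the choice of Euclidean, Manhattan, Chebyshev, or Cosine distance affects the generated samples.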
Metadata
Title
Distance Metric Based Oversampling Method for Bioinformatics and Performance Evaluation
Authors
Meng-Fong Tsai
Shyr-Shen Yu
Publication date
01-07-2016
Publisher
Springer US
Published in
Journal of Medical Systems / Issue 7/2016
Print ISSN: 0148-5598
Electronic ISSN: 1573-689X
DOI
https://doi.org/10.1007/s10916-016-0516-3
