Abstract
The class imbalance problem is a pervasive issue in many real-world domains. Oversampling methods that inflate the rare class by generating synthetic data are among the most popular techniques for resolving class imbalance. However, they concentrate on the characteristics of the minority class and use them to guide the oversampling process. By completely overlooking the majority class, they lose the global view of the classification problem and, while alleviating the class imbalance, may negatively impact learnability by generating borderline or overlapping instances. This becomes even more critical under extreme class imbalance, where the minority class is so strongly underrepresented that, on its own, it does not contain enough information to guide the oversampling process. We propose a framework for synthetic oversampling that, unlike existing resampling methods, is robust in cases of extreme imbalance. The key feature of the framework is that it uses the density of the well-sampled majority class to guide the generation process. We demonstrate implementations of the framework using the Mahalanobis distance and a radial basis function. We evaluate over 25 benchmark datasets and show that the framework offers a distinct performance improvement over the existing state of the art in oversampling techniques.
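The density-guided generation described in the abstract can be illustrated with a minimal sketch (this is not the authors' reference implementation, which is linked in the Notes; the function name `swim_md_sketch` and the `jitter` parameter are illustrative). Each minority seed is perturbed in a space whitened by the majority-class covariance and then rescaled back onto the seed's original distance from the majority mean, so that synthetic points remain on roughly the same Mahalanobis density contour of the majority class as their seeds:

```python
import numpy as np

def swim_md_sketch(X_maj, X_min, n_samples, jitter=0.25, seed=0):
    """Illustrative density-guided oversampling: synthetic minority
    points keep their seed's Mahalanobis distance to the majority mean."""
    rng = np.random.default_rng(seed)
    mu = X_maj.mean(axis=0)
    cov = np.cov(X_maj, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T      # whitening transform
    W_inv = vecs @ np.diag(vals ** 0.5) @ vecs.T   # inverse (un-whitening)
    Z = (X_min - mu) @ W                           # minority seeds, whitened
    out = []
    for _ in range(n_samples):
        z = Z[rng.integers(len(Z))]                # pick a random seed
        cand = z + rng.normal(scale=jitter, size=z.shape)
        # Rescale so the candidate keeps the seed's distance to the
        # majority mean; in whitened space this Euclidean distance
        # equals the Mahalanobis distance in the original space.
        cand *= np.linalg.norm(z) / np.linalg.norm(cand)
        out.append(cand @ W_inv + mu)              # map back
    return np.asarray(out)
```

Because the majority class is well sampled, its mean and covariance are reliable even when the minority class is tiny, which is the core argument the abstract makes for sampling "with the majority class".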
Notes
The SWIM code is available here: https://github.com/cbellinger27/SWIM.
In practice, we soften the equality to be less than or equal to.
This takes advantage of the whitening done in the preprocessing: after whitening, the Mahalanobis distance (MD) reduces to the Euclidean distance.
Using our default setting based on the mean distance between majority class samples, ε is set to 1.68.
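The whitening shortcut mentioned in the notes can be verified directly: with the symmetric whitening matrix W = Σ^(−1/2), the Euclidean norm of a whitened, centered point equals its Mahalanobis distance to the mean. A minimal check (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data standing in for the majority class
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.6], [0.0, 1.0]])
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric whitening matrix

x = X[0]
md = np.sqrt((x - mu) @ np.linalg.inv(cov) @ (x - mu))  # Mahalanobis distance
eu = np.linalg.norm((x - mu) @ W)                       # Euclidean, whitened
assert np.isclose(md, eu)
```

This is why the distance computations after preprocessing can use plain Euclidean distance instead of repeatedly inverting the covariance matrix.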
Bellinger, C., Sharma, S., Japkowicz, N. et al. Framework for extreme imbalance classification: SWIM—sampling with the majority class. Knowl Inf Syst 62, 841–866 (2020). https://doi.org/10.1007/s10115-019-01380-z