Abstract
The class imbalance problem is a pervasive issue in many real-world domains. Oversampling methods that inflate the rare class by generating synthetic data are among the most popular techniques for resolving class imbalance. However, they concentrate on the characteristics of the minority class and use them to guide the oversampling process. By completely overlooking the majority class, they lose the global view of the classification problem and, while alleviating the class imbalance, may negatively impact learnability by generating borderline or overlapping instances. This becomes even more critical under extreme class imbalance, where the minority class is so strongly underrepresented that, on its own, it does not contain enough information to guide the oversampling process. We propose a framework for synthetic oversampling that, unlike existing resampling methods, is robust in cases of extreme imbalance. The key feature of the framework is that it uses the density of the well-sampled majority class to guide the generation process. We demonstrate implementations of the framework using the Mahalanobis distance and a radial basis function. We evaluate over 25 benchmark datasets and show that the framework offers a distinct performance improvement over the existing state of the art in oversampling techniques.
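The density-guided generation described in the abstract can be illustrated with a minimal sketch (this is not the authors' reference implementation, which is linked in the Notes; the function name `swim_md_sketch` and the `jitter` parameter are illustrative). Each minority seed is perturbed in a space whitened by the majority-class covariance and then rescaled back onto the seed's original distance from the majority mean, so that synthetic points remain on roughly the same Mahalanobis density contour of the majority class as their seeds:

```python
import numpy as np

def swim_md_sketch(X_maj, X_min, n_samples, jitter=0.25, seed=0):
    """Illustrative density-guided oversampling: synthetic minority
    points keep their seed's Mahalanobis distance to the majority mean."""
    rng = np.random.default_rng(seed)
    mu = X_maj.mean(axis=0)
    cov = np.cov(X_maj, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T      # whitening transform
    W_inv = vecs @ np.diag(vals ** 0.5) @ vecs.T   # inverse (un-whitening)
    Z = (X_min - mu) @ W                           # minority seeds, whitened
    out = []
    for _ in range(n_samples):
        z = Z[rng.integers(len(Z))]                # pick a random seed
        cand = z + rng.normal(scale=jitter, size=z.shape)
        # Rescale so the candidate keeps the seed's distance to the
        # majority mean; in whitened space this Euclidean distance
        # equals the Mahalanobis distance in the original space.
        cand *= np.linalg.norm(z) / np.linalg.norm(cand)
        out.append(cand @ W_inv + mu)              # map back
    return np.asarray(out)
```

Because the majority class is well sampled, its mean and covariance are reliable even when the minority class is tiny, which is the core argument the abstract makes for sampling "with the majority class".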
Notes
The SWIM code is available here: https://github.com/cbellinger27/SWIM.
In practice, we soften the equality to be less than or equal to.
This takes advantage of the whitening done in the preprocessing: after whitening, the Mahalanobis distance (MD) reduces to the Euclidean distance.
Using our default setting based on the mean distance between majority class samples, ε is set to 1.68.
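The whitening shortcut mentioned in the notes can be verified directly: with the symmetric whitening matrix W = Σ^(−1/2), the Euclidean norm of a whitened, centered point equals its Mahalanobis distance to the mean. A minimal check (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data standing in for the majority class
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.6], [0.0, 1.0]])
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric whitening matrix

x = X[0]
md = np.sqrt((x - mu) @ np.linalg.inv(cov) @ (x - mu))  # Mahalanobis distance
eu = np.linalg.norm((x - mu) @ W)                       # Euclidean, whitened
assert np.isclose(md, eu)
```

This is why the distance computations after preprocessing can use plain Euclidean distance instead of repeatedly inverting the covariance matrix.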
Bellinger, C., Sharma, S., Japkowicz, N. et al. Framework for extreme imbalance classification: SWIM—sampling with the majority class. Knowl Inf Syst 62, 841–866 (2020). https://doi.org/10.1007/s10115-019-01380-z