Published in

2012 11th International Conference on Machine Learning and Applications

DOI: 10.1109/icmla.2012.162

An Experimental Design to Evaluate Class Imbalance Treatment Methods

Proceedings article published in 2012 by Gustavo Batista, Diego Furtado Silva, Ronaldo Cristiano Prati
This paper is available in a repository.

Preprint: archiving allowed
Postprint: archiving allowed
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

In the last decade, class imbalance has attracted a great deal of attention from researchers and practitioners. Class imbalance is ubiquitous in Machine Learning, Data Mining and Pattern Recognition applications, and these research communities have responded with dozens of methods and techniques. Surprisingly, many fundamental questions remain open, such as "Are all learning paradigms equally affected by class imbalance?", "What is the expected performance loss for different imbalance degrees?" and "How much of the performance losses can be recovered by the treatment methods?". In this paper, we propose a simple experimental design to assess the performance of class imbalance treatment methods. This experimental setup uses real data sets with artificially modified class distributions to evaluate classifiers across a wide range of class imbalance degrees. We employ this experimental design in a large-scale experimental evaluation with twenty-two data sets and seven learning algorithms from different paradigms. Our results indicate that the expected performance loss, as a percentage of the performance obtained with the balanced distribution, is quite modest (below 5%) for distributions ranging from balanced down to 10% of minority examples. However, the loss tends to increase quickly for higher degrees of class imbalance, reaching 20% for 1% of minority class examples. Support Vector Machine is the classifier paradigm least affected by class imbalance, being almost insensitive to all but the most imbalanced distributions. Finally, we show that the sampling algorithms only partially recover the performance losses. On average, random over-sampling and SMOTE recovered about 30% or less of the performance lost due to class imbalance.
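
The two quantities the abstract reports, the loss relative to the balanced distribution and the share of that loss recovered by a treatment method, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the subsampling-without-replacement procedure, and the scores in the usage lines are all assumptions made for illustration, and the abstract does not state which performance measure the paper uses.

```python
import random


def resample_to_minority_share(majority, minority, minority_share, total, seed=0):
    """Draw `total` examples without replacement so that a fraction
    `minority_share` of them comes from the minority class.

    Hypothetical helper: one plausible way to "artificially modify" the
    class distribution of a real data set.
    """
    rng = random.Random(seed)
    n_min = round(total * minority_share)
    n_maj = total - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough examples for the requested distribution")
    return rng.sample(majority, n_maj), rng.sample(minority, n_min)


def relative_loss(balanced_score, imbalanced_score):
    """Performance loss as a percentage of the balanced-distribution score."""
    return 100.0 * (balanced_score - imbalanced_score) / balanced_score


def recovered_share(balanced_score, imbalanced_score, treated_score):
    """Percentage of the lost performance that a treatment method wins back."""
    lost = balanced_score - imbalanced_score
    if lost <= 0:
        raise ValueError("no performance was lost, so recovery is undefined")
    return 100.0 * (treated_score - imbalanced_score) / lost


# Usage with made-up placeholder data and scores (not results from the paper).
majority_pool = list(range(1000))
minority_pool = list(range(1000, 1200))
maj_sample, min_sample = resample_to_minority_share(
    majority_pool, minority_pool, minority_share=0.10, total=200
)
print(len(maj_sample), len(min_sample))

loss = relative_loss(balanced_score=0.90, imbalanced_score=0.72)
recovery = recovered_share(balanced_score=0.90, imbalanced_score=0.72, treated_score=0.774)
print(round(loss, 2), round(recovery, 2))
```

Expressing both quantities relative to the balanced-distribution score makes them comparable across data sets whose absolute scores differ, which is the normalisation the abstract describes.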