A Survey of Methods for Managing the Classification and Solution of Data Imbalance Problem

Class imbalance is a widespread problem in numerous real-world applications. In such a situation, nearly all of the examples are labeled as one class, called the majority class, while far fewer examples are labeled as the other, usually more important, class, called the minority class. Over the last few years, several lines of research have addressed the issue of class imbalance, including data sampling, cost-sensitive analysis, Genetic Programming based models, bagging, and boosting. In this survey paper, we review 24 related studies from the years 2003, 2008, 2010, 2012, and 2014 to 2019, focusing on the architecture of single, hybrid, and ensemble method designs, to understand the current status of improving the classification performance of machine learning techniques on class imbalance problems. This survey paper also includes a statistical analysis of the classification algorithms under various methods and several other experimental conditions, as well as the datasets used in the different research papers.


Introduction
Data is the most valuable asset and one of the best possible sources for any research and development. It can be viewed as one of the significant components of educational and business strategy decisions. Therefore, the data used for research and development in any major decision should be balanced and accurate. Balanced data is one of the major concerns nowadays. Data imbalance problems impede the performance of classification algorithms (Singh and Purohit, 2015), and the efficiency of predictive models is significantly impacted when a real-world data set is highly imbalanced (Amin et al., 2016). When the class distribution in a dataset is not uniform, the data are called imbalanced (Haykin, 1999). In such cases, the class represented by only a limited number of instances is known as the minority class, and the remainder of the dataset, consisting of the other classes, is known as the majority class. Recent work has shown that an unequal distribution of class examples in the learning process can skew performance: the classifier offers high accuracy on the majority class but performs poorly on the minority class. Class disparity in the datasets can drastically skew the performance of classifiers in a majority-minority classification problem, introducing a prediction bias for the majority class (Leevy et al., 2018). Although high imbalance affects the output significantly, some small levels of imbalance were found to be beneficial (Koziarski et al., 2018).
Researchers have divided the data imbalance problem into two major categories: binary-class and multiclass data imbalance (Bhowan et al., 2011). A binary dataset has just two classes, whereas a multiclass dataset has more than two. There have been plenty of attempts to fix the issue of binary class imbalance, but the different types of problems relevant to multiclass imbalance are not yet solved (Rout et al., 2018). Learning from imbalanced data has been studied extensively in standard classification and, in recent times, also in multilevel classification (Charte et al., 2019). Two major approaches, external and internal, are used to build methods that solve the problem of data imbalance (Eggermont et al., 2004). Sampling, bagging, and boosting are the most common external approaches for handling the data imbalance problem. Researchers have proposed only a few internal approaches. Among them, Genetic Programming (GP) (Haykin, 1999) and the Extreme Learning Machine (ELM) (Yu et al., 2016) are the most popular techniques for solving the data imbalance problem.
Over the years, many scholars and researchers have done successful work on handling imbalanced datasets. This study analyzes relevant research on imbalanced data published in ten separate years. In total, 24 articles from 2003, 2008, 2010, 2012, and 2014 to 2019 are included in this survey paper. For each study, the survey covers the proposed classification technique and its learning algorithms. A statistical comparison is provided of the classification techniques and algorithms used, along with the feature selection steps. The paper is organized as follows: Section 2 gives an overview of the research subject and describes a variety of techniques for imbalanced datasets. Section 3 offers a statistical analysis of the papers, in which the various approaches are discussed year by year. Section 4 presents the discussion and conclusion and indicates some issues for future research on imbalanced datasets using machine learning approaches.

Research Paper Overview
Much research is ongoing in the field of data mining and machine learning, on topics such as keyword extraction (Showrov and Sobhan, 2019), summarization (Abulaish et al., 2018; Showrov et al., 2019a), and breast cancer detection (Showrov et al., 2019b). The most challenging problem in this field nowadays is class imbalance. Several scholars have suggested various approaches for dealing with it. These approaches are categorized as data-level methods, algorithm-level methods, and ensemble methods. Hybrid methods form another group of approaches for dealing with the class imbalance problem.

Data Level Methods
This approach aims to balance the class distributions. Sampling methods balance the class distribution by resizing the training dataset. They can be categorized into under-sampling and over-sampling techniques.

Over-Sampling Technique
Oversampling increases the number of instances in the minority class, randomly or otherwise, to improve the imbalance ratio so that conventional classification algorithms can then be used to classify the data. The benefit of this technique is that no information is lost from the dataset: the primary dataset is preserved, and new data is only appended to it to balance the classes (Kaur and Gosain, 2018).
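As an illustration of the random variant, the following minimal sketch duplicates randomly chosen minority examples until every class is as large as the largest one. The function name `random_oversample`, the toy data, and the choice to balance all classes fully are assumptions made for this example and are not taken from the surveyed papers.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Append random duplicates of smaller classes until all classes
    match the size of the largest class (illustrative sketch only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            # The original examples are kept; only copies are appended.
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

# Toy data: four majority examples and one minority example.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["maj", "maj", "maj", "maj", "min"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because the original rows are left untouched and copies are only appended, the sketch reflects the property described above that the primary dataset is preserved.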

SMOTE
The Synthetic Minority Over-sampling Technique (SMOTE) enlarges the minority class by generating new synthetic data from the existing data (Junsomboon and Phienthrakul, 2017). It is capable of producing patterns that follow a distribution similar to the true one (Elreedy and Atiya, 2019). Other methods, such as fuzzy methods or locally linear embedding (Verbiest et al., 2012), have been used to improve SMOTE further.
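The core idea of generating synthetic data from existing data can be sketched as interpolation between a minority example and one of its nearest minority neighbours. This is a simplified sketch, not the reference SMOTE procedure: the function name `smote_sketch`, the random choice of the base example, and the default `k=2` are assumptions for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority example and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with three 2-D examples.
minority_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Each synthetic point lies between two real minority examples, so the new data stays inside the region already occupied by the minority class.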

Cluster-Based Over-Sampling (CBO)
This approach clusters the training data of each class separately using the k-means method. Random over-sampling is then carried out on all clusters (Popel et al., 2018).

ADASYN
The adaptive synthetic (ADASYN) sampling method improves learning with respect to the data distribution in two ways: (1) it reduces the bias introduced by the class imbalance, and (2) it adaptively shifts the classification decision boundary towards the difficult examples (He et al., 2008).
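The adaptive part can be sketched as an allocation step: minority examples with more majority-class neighbours are treated as harder and receive more synthetic points. The sketch below assumes the neighbour counts have already been computed; the function name, the parameter names, and the handling of the all-zero case are assumptions for this example.

```python
def adasyn_allocation(majority_neighbour_counts, k, n_majority, beta=1.0):
    """Decide how many synthetic points to create per minority example.

    majority_neighbour_counts[i] is the number of majority examples among
    the k nearest neighbours of minority example i (assumed precomputed).
    beta in (0, 1] sets the desired balance level after generation.
    """
    n_minority = len(majority_neighbour_counts)
    total_to_generate = (n_majority - n_minority) * beta
    ratios = [count / k for count in majority_neighbour_counts]
    ratio_sum = sum(ratios)
    if ratio_sum == 0:
        # Sketch-level choice: no example is hard, so nothing is generated.
        return [0] * n_minority
    # Normalise the ratios into a distribution and scale by the total.
    return [round(r / ratio_sum * total_to_generate) for r in ratios]

# Three minority examples with 0, 1, and 4 majority neighbours out of k = 5.
allocation = adasyn_allocation([0, 1, 4], k=5, n_majority=13)
print(allocation)  # [0, 2, 8]
```

The example surrounded mostly by majority neighbours receives most of the ten synthetic points, which is how the decision boundary is pushed towards the difficult examples.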

Under-Sampling Technique
Under-sampling operates on the majority class: instances are removed from it, either randomly or by using some technique, to balance the classes. The method is used to improve the imbalance ratio of unbalanced data, after which the classes are classified using conventional classification algorithms (Kaur and Gosain, 2018). It also has the benefit of reducing the time required to train the models, because the size of the training data set is reduced (Seiffert et al., 2009).

Random Under-Sampling (RUS)
Random Under-Sampling (RUS) is an under-sampling method that removes majority-class instances at random to balance the class distribution (Popel et al., 2018). Despite its simplicity, RUS has been shown to perform very well. Simplicity, speed, and efficiency are the reasons for introducing RUS into the boosting process (Seiffert et al., 2009).
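A minimal sketch of RUS is given below. It assumes that every class is reduced to the size of the smallest class; the function name `random_undersample` and the toy data are illustrative only.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class so that every class has as many
    examples as the smallest class (illustrative sketch only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    keep = []
    for cls in counts:
        indices = [i for i, y in enumerate(labels) if y == cls]
        keep.extend(rng.sample(indices, target))
    keep.sort()  # preserve the original ordering of the kept examples
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Toy data: six majority examples and two minority examples.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [0.9], [1.0]]
y = ["maj"] * 6 + ["min"] * 2
X_small, y_small = random_undersample(X, y)
print(Counter(y_small))
```

The reduced training set is half the size of the original in this toy case, which illustrates why under-sampling also shortens training time.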

Tomek Link (T-Link)
T-Link is an under-sampling technique proposed by Tomek. It is seen as an improvement of the Nearest-Neighbor Rule (NNR). The T-Link technique can be used as a guided under-sampling method in which the majority-class observations are deleted (Rahman et al., 2011).
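A Tomek link is a pair of examples from different classes that are each other's nearest neighbour. The sketch below finds such pairs by brute force and removes the majority member of each pair. It assumes a binary problem and Euclidean distance; the function names and toy data are hypothetical.

```python
import math

def tomek_links(samples, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours
    and carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(samples)) if j != i),
                   key=lambda j: math.dist(samples[i], samples[j]))
    links = []
    for i in range(len(samples)):
        j = nearest(i)
        if i < j and nearest(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

def remove_majority_in_links(samples, labels, majority_label):
    """Delete the majority-class member of every Tomek link (binary case)."""
    drop = set()
    for i, j in tomek_links(samples, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [i for i in range(len(samples)) if i not in drop]
    return [samples[i] for i in keep], [labels[i] for i in keep]

# One majority point (4.9) sits right next to a minority point (5.0).
X = [[0.0], [1.0], [2.5], [4.9], [5.0], [6.0]]
y = ["maj", "maj", "maj", "maj", "min", "min"]
links = tomek_links(X, y)
X_clean, y_clean = remove_majority_in_links(X, y, majority_label="maj")
print(links, X_clean)
```

Only the borderline majority example is removed, so the class boundary is cleaned without discarding majority examples that lie far from the minority class.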

Algorithm Level Methods
Algorithm-level approaches, which are often called internal approaches, concentrate on improving the ability of existing classifier algorithms to learn from minority classes. For example, adjusting the probability estimate or modifying the cost per class may favor the minority class.
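Both adjustments can be sketched in a few lines. The weighting rule used here (weights inversely proportional to class frequency) and the threshold values are common heuristics chosen for illustration; they are assumptions of this sketch rather than methods prescribed by the surveyed papers.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class cost: n_samples / (n_classes * class_count), so rare
    classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

def predict_with_threshold(p_minority, threshold=0.5):
    """Turn an estimated minority-class probability into a label.
    Lowering the threshold favors the minority class."""
    return "min" if p_minority >= threshold else "maj"

# A 90/10 split: each minority error costs nine times a majority error.
y = ["maj"] * 90 + ["min"] * 10
weights = balanced_class_weights(y)
print(weights)
print(predict_with_threshold(0.3), predict_with_threshold(0.3, threshold=0.2))
```

The weights could be passed to any learner that accepts per-class costs, while the threshold adjustment works on the output of any classifier that produces probability estimates.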

Support Vector Machine (SVM)
The Support Vector Machine was introduced in the mid-1990s (Rahman et al., 2011). This technique discriminates between classes over the input space, and the classifier is obtained by learning from the training sample (Durgesh and Lekha, 2010). Traditional SVM classification methods take as input training data consisting of examples that belong to one of two classes (Catania et al., 2012).

K-Nearest Neighbor (KNN)
K-nearest neighbors can adopt a number of distance measures. For a test sample, KNN finds the k nearest samples in the training data and assigns the class label that occurs most frequently among them. KNN is known as one of the simplest nonparametric classification methods (Friedl and Brodley, 1997). It can also be described as an instance-based learner (Bishop, 1995).
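The voting rule can be written as a short sketch. Euclidean distance and the default `k=3` are assumptions of this example; any other distance measure could be substituted.

```python
import math
from collections import Counter

def knn_predict(train_samples, train_labels, query, k=3):
    """Assign the label that occurs most often among the k training
    samples closest to the query (Euclidean distance)."""
    order = sorted(range(len(train_samples)),
                   key=lambda i: math.dist(train_samples[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: class "a" near the origin, class "b" near (5, 5).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
y_train = ["a", "a", "a", "b", "b"]
print(knn_predict(X_train, y_train, [0.5, 0.5]))
print(knn_predict(X_train, y_train, [5, 5.5]))
```

There is no training step: the stored samples themselves are the model, which is why KNN is described as an instance-based learner.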

Naïve Bayes
Despite its simplifying assumption, Naïve Bayes often produces strong classification results. For many classification tasks, only one scan of the training data is required (Mitchell, 1997). Naïve Bayes assumes that the attributes are conditionally independent given the class label and uses this assumption to determine the class-conditional probability (Kotu and Deshpande, 2018).
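The single scan and the independence assumption can be illustrated with a small sketch for categorical attributes. The function names, the Laplace smoothing, and the parameter `n_values` (the number of distinct values per attribute, assumed equal for all attributes) are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Count classes and per-class attribute values in one pass."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> counts
    for row, y in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(y, j)][value] += 1
    return class_counts, value_counts

def predict_naive_bayes(model, row, n_values=2):
    """Pick the class with the highest log prior plus summed log
    class-conditional probabilities (attributes treated as independent)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    def log_score(y):
        score = math.log(class_counts[y] / total)
        for j, value in enumerate(row):
            # Laplace smoothing avoids zero probabilities for unseen values.
            score += math.log((value_counts[(y, j)][value] + 1)
                              / (class_counts[y] + n_values))
        return score
    return max(class_counts, key=log_score)

# Toy binary attributes; the first attribute separates the two classes.
rows = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = fit_naive_bayes(rows, labels)
print(predict_naive_bayes(model, (1, 0)), predict_naive_bayes(model, (0, 1)))
```

Because the class prior enters the score directly, a heavily imbalanced training set shifts predictions towards the majority class, which is one reason the sampling methods above are applied before training.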

Decision Tree
A decision tree classifies an unknown test instance by way of a series of decisions. Decision tree classifiers are widely used, in particular for their high adaptability to complex classification problems (Friedl and Brodley, 1997). The decision tree is simple and easy to implement, which makes it popular as a single classifier (Farid et al., 2013).

Ensemble Methods
Ensemble approaches combine multiple models. Ensembles based on bagging and boosting are commonly used and are effective solutions for the class imbalance problem. Breiman (1996) presented the idea of bootstrap aggregating to create ensembles.

Bagging
Bagging (bootstrap aggregating) is a way to improve the results of classification algorithms (Machová et al., 2006). It trains multiple independent learners and combines them using an averaging technique. It works well for reducing variance and bias (Sanabila and Jatmiko, 2018).
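A minimal sketch of bagging is shown below, with majority voting as the combination step. The base learner `fit_nearest_mean` is a hypothetical one-dimensional classifier introduced only to keep the example self-contained; any classifier could take its place.

```python
import random
from collections import Counter

def fit_nearest_mean(samples, labels):
    """Hypothetical base learner for 1-D data: predict the class whose
    mean value is closest to the query."""
    counts, sums = Counter(labels), {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def bagging_fit(samples, labels, fit_base, n_estimators=5, seed=0):
    """Train each learner on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_base([samples[i] for i in idx],
                               [labels[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the independent learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

X = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
y = ["a"] * 4 + ["b"] * 4
ensemble = bagging_fit(X, y, fit_nearest_mean)
print(bagging_predict(ensemble, 0.15), bagging_predict(ensemble, 5.15))
```

Each learner sees a slightly different resampled dataset, and the vote averages out their individual errors.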

Boosting
Schapire (1990) introduced boosting and showed that a weak learner (one slightly better than random guessing) can be transformed into a strong learner. AdaBoost is the most influential algorithm of the boosting family. Like bagging, boosting resamples the training data. Unlike bagging, however, boosting assigns a weight to each data sample, so some samples are used more frequently than others (Breiman, 1996).
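The role of the sample weights can be sketched with the weight update of one AdaBoost round for binary classification. The function name is hypothetical, and the sketch assumes the weighted error of the weak learner lies strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, is_correct):
    """One AdaBoost round: compute the learner's vote (alpha) from its
    weighted error, then raise the weights of misclassified samples and
    lower the weights of correctly classified ones."""
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # renormalise to sum to 1

# Four equally weighted samples; the weak learner gets the last one wrong.
alpha, new_weights = adaboost_reweight([0.25] * 4, [True, True, True, False])
print(alpha, new_weights)
```

After the update, the single misclassified sample carries about half of the total weight, so the next weak learner is pushed to classify it correctly. In imbalanced data the misclassified samples are often minority examples, which is why boosting pairs naturally with the sampling methods described earlier.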

Hybrid Methods
Hybrid approaches combine data sampling with boosting. While many data sampling techniques are specifically designed to address the class imbalance problem, boosting can improve the performance of any weak classifier, regardless of whether the data is imbalanced (Seiffert et al., 2009).

SMOTEBoost
SMOTEBoost produces synthetic examples of the rare or minority class, thereby indirectly changing the updating weights and compensating for skewed distributions. In each boosting iteration, it focuses on the minority class examples sampled in that iteration and creates new synthetic examples from them (Chawla et al., 2003).

RUSBoost
The Random Under-Sampling Based Boosting (RUSBoost) method is a simple, fast, and less complex method for learning from imbalanced data. Its authors presented a detailed empirical study comparing the performance of several strategies for improving classification when data is imbalanced (Seiffert et al., 2009).

LIUBoost
The Locality Informed Underboosting (LIUBoost) approach combines a sampling technique with cost-sensitive learning. Like RUSBoost, it under-samples the data set in every boosting iteration, while also incorporating a cost term for each instance, based on its hardness, into the weight update formula to minimize the information loss introduced by under-sampling (Ahmed et al., 2019).

RHSBoost
To further enhance classification accuracy, Random Hybrid Sampling Boosting (RHSBoost) uses both random under-sampling and ROSE sampling within a boosting scheme (Gong and Kim, 2017).

HUSBoost
The Hybrid Under-Sampling Based Boosting (HUSBoost) approach handles imbalanced data in three basic steps: data cleaning, data balancing, and classification. It is motivated by the observation that algorithms whose goal is to optimize overall accuracy neglect the minority class most of the time (Popel et al., 2018).

Distribution of Papers by Year of Publications
This survey covers 24 research articles published in 2003, 2008, 2010, 2012, and 2014 to 2019. It includes 3 papers each for 2014 and 2019. The largest number of surveyed papers, 6, comes from 2017. Figure 1 shows the distribution of papers by year of publication.

Method Design
The methods for imbalanced data can be categorized into three design types, namely single, ensemble, and hybrid. The single-level design type includes 10 papers, the ensemble design type covers 7 papers, and the hybrid design type contains 7 papers. The 24 surveyed papers are given at a glance in Table 1. Figure 2 shows the number of research papers based on single, ensemble, and hybrid methods in each year.

Single Level Method
A number of the surveyed papers use different types of sampling techniques and traditional machine learning algorithms in the single-level method. The single-level methods surveyed mostly use the Synthetic Minority Oversampling Technique (SMOTE), Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Edited Nearest Neighbor (ENN), and Naïve Bayes.
Table 2 shows the year-wise distribution of the single-level methods, with the advantage or limitation and the citation of each.

Ensemble Method
Multiple algorithms are combined in this method. Table 3 shows the year-wise distribution of the ensemble methods analyzed here, with the advantage or limitation of each. AdaBoost, ROS, and RUS are among the algorithms used in ensemble methods. Table 3 also lists the method proposed in each paper and the citation of each article.

Hybrid Method
Table 4 shows the year-wise distribution of the hybrid methods, with the results and citation of each paper. Hybrid methods are used to handle imbalanced data sets in mainstream studies because of the accuracy they have recently achieved. The statistics show that the largest number of publications on the hybrid approach appeared in 2017. The table also lists the algorithm used in each article and its success in solving the problem of an imbalanced dataset.

Used Dataset in Researches
Datasets are designated for default tasks such as classification, clustering, and prediction. The datasets analyzed in this survey paper are for classification purposes. Table 5 displays the year-by-year distribution of randomly used datasets.
The study also shows that few private or non-public datasets were used over the time frame. It briefly highlights the UCI machine learning repository datasets, which are considered standard datasets for handling and solving imbalanced data. Some medical datasets are also used, such as Ovarian I and Ovarian II (Yu and Ni, 2014), as well as the Breast Cancer, ILPD, Pima Indians, Fertility, and Haberman medical datasets (Junsomboon and Phienthrakul, 2017). In addition, banking transaction data on customer behavior (Sanabila and Jatmiko, 2018) is used to measure accuracy.

Discussion of Surveyed Works
The surveyed works and their specific data sets were summarized in Tables 2 to 5 to provide a high-level overview and to better compare the existing approaches for various learning models under class imbalance. Table 2 covers single-level methods, which involve both data-level and algorithm-level approaches in which different kinds of methods are used to balance imbalanced datasets. The ADASYN (He et al., 2008) and FRIPS (Verbiest et al., 2012) methods are used to handle multiple-class imbalance learning and to control noise when classifying imbalanced data with traditional algorithms such as DT and KNN. Single-level approaches have been investigated in 10 studies and can be further divided into new loss functions, cost-sensitive approaches, and performance thresholds. Ensemble methods involve 7 studies, which are listed in Table 3. The resampling method of Sanabila and Jatmiko (2018) combined SMOTE and ENN to reduce the bias toward the majority class and to lower misclassification, and the AMCS method (Yijing et al., 2016) used Naïve Bayes, DT, and KNN to choose the best route for different types of data when classifying multiclass imbalanced data. Multiple authors (Popel et al., 2018; Ahmed et al., 2019) suggested using a machine learning model with DT to address class imbalance with the hybrid methods listed in Table 4. Combining sampling with a cost-sensitive method sometimes lets the hybrid method achieve better accuracy than the others when balancing data.
About half of the researchers used only a limited range of imbalanced data sets to test their machine learning approaches to class imbalance. The authors are not at fault for this, as most of them were concentrating on a particular problem or a benchmark task. However, more rigorous experiments testing these approaches across a broader variety of data sets, with differing degrees of class imbalance and difficulty, would help explain their strengths and limitations. Also, only one-third of the studies report the number of rounds or repetitions performed for each experiment. In other words, the remaining groups either did not conduct several runs or neglected to provide the specifics and reported the most favorable findings. In their defense, training a machine learning model on a large data set can take days or even weeks, which makes it difficult to perform many rounds of experiments. This opens a range of opportunities for future study, as applying different machine learning approaches to several data sets with replication can improve confidence in the outcomes and help guide future practitioners in model selection.
The methods discussed cannot currently be compared directly, since they are tested on separate data sets with differing class imbalances, and their results are reported with inconsistent performance metrics. Besides, some studies report inconsistent findings, further indicating that performance is highly dependent on problem complexity, class representation, and the performance metrics reported. Overall, there is a lack of evidence that distinguishes any hybrid or ensemble machine learning method as superior for learning from class-imbalanced data, and additional experiments are needed before such conclusions can be drawn. Class imbalance is not limited to low or high imbalance data, and further analysis is needed to test the application of these machine learning class imbalance approaches in other domains.

Conclusion and Future Works
Data imbalance is a common concern that has long held the attention of researchers. The use of various classifier techniques for imbalanced datasets is an emerging area of data science. Existing classifiers do not perform as expected on an imbalanced dataset unless the dataset is balanced, so the data has to be reprocessed to obtain more accurate results. This survey has reviewed the application of different classifiers to the classification of imbalanced data. It provides a clear comparison of the surveyed papers and a fair viewpoint on this area, but it does not claim to be an in-depth analysis of those papers. The following points may be useful in future research:
• System performance is a key factor. In the training phase, removing redundant and irrelevant features increases system performance
• In classification techniques, a feature selection step will play a vital role in the future
• For performance, hybrid or ensemble classifiers are more feasible than a single classifier
There are some evident areas for future work. Applying the newly developed methods to a wider range of data sets and class imbalance levels, comparing outcomes with several complementary performance indicators, and reporting statistical evidence can help to identify the optimal deep learning approaches, as well as traditional machine learning methods, for future class imbalance applications. Experimenting with new hybrid and cluster-based machine learning approaches, along with deep learning approaches, to handle class imbalance in the context of big data and class rarity may prove useful for the future of big data analytics.

Fig. 1: Distribution of papers based on years

Table 2: Year-wise distribution of single level method regarding advantage/limitation and citation

Table 3: Year-wise distribution of ensemble method regarding advantages/limitations and citation

Table 4: Year-wise distribution of hybrid method regarding advantage/limitation and citation

Table 5: Year-wise distribution of randomly used dataset