Data Pre-Processing with Sampling to Reduce Uneven Data Distribution in Disease Datasets
Sushruta Mishra, Hrudaya Kumar Tripathy and Soumya Sahoo
Journal of Computer Science
Data aggregated from varied sources is constantly updated, which makes proper analysis difficult, and evaluating a classification algorithm under these conditions is hard. Modern datasets are massive, and handling them is a challenging task. The problem is compounded when data samples are unevenly distributed among the classes, which makes classification difficult. Although most classifiers favor the majority class, classes with fewer samples must also be taken into account. Uneven class distribution thus leads to data skew, an issue that merits the attention of researchers. This paper analyzes the effect of several sampling techniques on data skew using disease datasets. Three sampling methods are used in our research: SMOTE, Spread Subsampling and Resampling. A Multilayer Perceptron serves as the classifier, and Particle Swarm Optimization is used as the feature selection algorithm to select an optimized subset of the raw disease data. Several key performance metrics are used to assess classification performance. We infer that pre-processing with sampling techniques acts as an optimizing agent and thereby improves classification accuracy.
© 2017 Sushruta Mishra, Hrudaya Kumar Tripathy and Soumya Sahoo. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
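To make the idea of sampling-based pre-processing concrete, the following is a minimal sketch of SMOTE-style oversampling: synthetic minority samples are created by interpolating between a minority point and one of its nearest minority neighbours until the classes are balanced. It is an illustration only, not the implementation used in the paper. The function name `smote_like_oversample`, the parameter `k`, and the toy data are all assumptions introduced for this example.

```python
import random
from collections import Counter


def smote_like_oversample(X, y, minority_label, k=3, seed=0):
    """Return a balanced copy of (X, y) with synthetic minority samples added.

    Hypothetical helper for illustration: each synthetic point lies on the
    line segment between a minority sample and one of its k nearest
    minority neighbours.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")

    counts = Counter(y)
    # Number of synthetic samples needed to match the largest class.
    n_new = max(counts.values()) - counts[minority_label]

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])

    # Build new lists so the caller's data is not modified.
    return X + synthetic, y + [minority_label] * n_new


# Toy data (assumed): 3 minority samples (label 1), 6 majority samples (label 0).
X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
     [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [5.3, 5.1], [4.8, 4.7], [5.2, 5.3]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0]

X_bal, y_bal = smote_like_oversample(X, y, minority_label=1)
print(Counter(y_bal))
```

After the call, both classes contain six samples, and every synthetic point falls within the region spanned by the original minority samples. A balanced set such as `X_bal, y_bal` would then be passed to the classifier in place of the skewed original.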