Review Article Open Access

Visualization of Data Mining Techniques for the Prediction of Breast Cancer with High Accuracy Rates

Vasudev Sharma1, Raj Kumar Rajasekaran1 and Shreya Badhrinarayanan2
  • 1 Vellore Institute of Technology, India
  • 2 Brighton and Sussex Medical School, United Kingdom
Journal of Computer Science
Volume 15 No. 1, 2019, 118-130


Submitted On: 27 August 2018 Published On: 21 January 2019

How to Cite: Sharma, V., Rajasekaran, R. K. & Badhrinarayanan, S. (2019). Visualization of Data Mining Techniques for the Prediction of Breast Cancer with High Accuracy Rates. Journal of Computer Science, 15(1), 118-130.


Breast cancer is one of the leading causes of death in women worldwide. Around one in 30 women is affected by breast cancer. Mammography has helped detect breast cancer in its early stages, which has reduced mortality. The diagnosis of breast cancer depends on a variety of parameters. In this paper, we aim to build the best model for predicting breast cancer through preprocessing, feature extraction, data visualization and prediction using breast cancer data. Visualization techniques such as the violin plot, grid plot, swarm plot and heat plot were used to guide feature extraction, which improved the accuracy of our results. For prediction, we used algorithms such as the random forest and the decision tree with single and multiple predictors, along with the commonly used statistical model, logistic regression. We also relied on 5-fold cross-validation to check that the measured performance of the prediction models was unbiased. The models were then compared and the best one was selected based on its accuracy. The random forest model performed best, with an accuracy rate of 94.724% and a decent 5-fold cross-validation score. The decision tree model followed, with an accuracy rate of 100% but a poor 5-fold cross-validation score. The logistic regression model came next, with an accuracy rate of 88.442% and a low 5-fold cross-validation score.
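The contrast the abstract draws between a 100% accuracy rate and a poor 5-fold cross-validation score can be illustrated with a minimal sketch. This is not the authors' code or data. The dataset is synthetic, the helper names (`k_fold_indices`, `predict_1nn`, `accuracy`) are hypothetical, and a 1-nearest-neighbour classifier that memorizes its training set stands in for a fully grown decision tree. Only the Python standard library is assumed.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k (train, test) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

def predict_1nn(train_x, train_y, x):
    """Return the label of the training point closest to x (pure memorization)."""
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(predict_1nn(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

# Synthetic, hypothetical data: one feature, label 1 for the upper half,
# with every 7th label flipped to simulate noisy measurements.
n = 100
xs = list(range(n))
ys = [(1 if i >= n // 2 else 0) ^ (1 if i % 7 == 0 else 0) for i in range(n)]

# Accuracy on the same data the model was fitted on.
train_acc = accuracy(xs, ys, xs, ys)

# Accuracy averaged over 5 held-out folds.
fold_scores = []
for train_idx, test_idx in k_fold_indices(n, k=5):
    fold_scores.append(accuracy(
        [xs[i] for i in train_idx], [ys[i] for i in train_idx],
        [xs[i] for i in test_idx], [ys[i] for i in test_idx],
    ))
cv_acc = sum(fold_scores) / len(fold_scores)

print(f"training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

A model that memorizes its training data scores perfectly on that data, so a gap between the two printed figures is one plausible reading of a pattern such as 100% accuracy paired with a poor cross-validation score. Whether that explanation applies to the decision tree in this paper is an assumption, not something the abstract states.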




  • Mammography
  • Data Visualization
  • Violin Plot
  • Swarm Plot
  • Random Forest
  • Logistic Regression
  • Decision Tree
  • 5-Fold Cross Validation