Research Article Open Access

A Reformed K-Nearest Neighbors Algorithm for Big Data Sets

Vo Ngoc Phu¹ and Vo Thi Ngoc Tran²
  • ¹ Nguyen Tat Thanh University, Vietnam
  • ² Vietnam National University, Vietnam
Journal of Computer Science
Volume 14 No. 9, 2018, 1213-1225

DOI: https://doi.org/10.3844/jcssp.2018.1213.1225

Submitted On: 1 February 2018
Published On: 8 March 2018

How to Cite: Phu, V. N. & Ngoc Tran, V. T. (2018). A Reformed K-Nearest Neighbors Algorithm for Big Data Sets. Journal of Computer Science, 14(9), 1213-1225. https://doi.org/10.3844/jcssp.2018.1213.1225

Abstract

Data mining offers many algorithms, and the K-Nearest Neighbors algorithm (K-NN) is one of the best known among researchers. K-NN is very effective on small data sets, but it takes a long time to run on big data sets. Data sets today often contain millions of records, which makes K-NN difficult to apply to big data. In this research, we propose an improvement to K-NN that processes big data sets in a shorter execution time. The Reformed K-Nearest Neighbors algorithm (R-K-NN) can be applied to large data sets with millions or even billions of records. R-K-NN is tested on a data set with 500,000 records, where its execution time is much shorter than that of K-NN. In addition, R-K-NN is implemented in a parallel network system with Hadoop Map (M) and Hadoop Reduce (R).
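
The abstract does not describe the internals of R-K-NN, so the following is not the authors' method. It is only a minimal sketch, in plain Python, of one common way to split ordinary K-NN classification into a map phase and a reduce phase. It assumes Euclidean distance and majority voting. The function names (`map_phase`, `reduce_phase`, `knn_map_reduce`) and the toy data are hypothetical, and no Hadoop API is used.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(partition, query, k):
    """Map step (illustrative): the k nearest (distance, label) pairs in one partition."""
    candidates = ((dist(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, candidates)


def reduce_phase(partial_results, k):
    """Reduce step (illustrative): merge local candidates, keep the global k, then vote."""
    merged = heapq.nsmallest(k, (pair for part in partial_results for pair in part))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


def knn_map_reduce(partitions, query, k):
    # On a cluster each partition would be handled by a separate mapper;
    # here the partitions are processed one after another for illustration.
    partials = [map_phase(partition, query, k) for partition in partitions]
    return reduce_phase(partials, k)


# Toy data: two partitions of (features, label) records, invented for this sketch.
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b")],
    [((0.2, 0.1), "a"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")],
]

print(knn_map_reduce(partitions, (0.0, 0.1), k=3))
```

Each mapper returns at most k candidates, so the reducer merges only k records per partition rather than the whole data set. The result is the same as a single-machine K-NN, because every globally nearest record is also among the k nearest in its own partition.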

Keywords

  • K-Nearest Neighbors Algorithm
  • K-NN
  • Parallel Network Environment
  • Distributed System
  • Data Mining
  • Association Rules
  • Cloudera
  • Hadoop Map
  • Hadoop Reduce