A Reformed K-Nearest Neighbors Algorithm for Big Data Sets

Abstract: Data mining already includes many algorithms, among which the K-Nearest Neighbors algorithm (K-NN) is well known to researchers. K-NN is very effective on small data sets, but it takes a long time to run on big data sets. Today, data sets often contain millions of records, so it is difficult to apply K-NN to big data. In this research, we propose an improvement to K-NN that processes big data sets in a shortened execution time. The Reformed K-Nearest Neighbors algorithm (R-K-NN) can be applied to large data sets with millions or even billions of records. R-K-NN is tested on a data set with 500,000 records; its execution time is much shorter than that of K-NN. In addition, R-K-NN is implemented in a parallel network system with Hadoop Map (M) and Hadoop Reduce (R).


Introduction
The field of data mining has been studied for many years. The K-Nearest Neighbors algorithm (K-NN) is a popular algorithm in this field. K-NN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). It is a non-parametric technique and has also been used in statistical estimation and pattern recognition.
It is clear that social networks, information technology and computer science are developing at a dramatic rate, generating vast amounts of data, information and knowledge. The information in the big data sets belonging to large corporations is very valuable, particularly if it can be exploited in beneficial ways.
Hence, many algorithms have been proposed to run on big data sets. K-NN is a well-known algorithm in data mining and other fields; it is efficient on small data sets, but it takes a long time to run on big data sets. Many algorithms perform efficiently on small data sets but are not effective on big data sets with millions of records. To examine this, we studied K-NN on big data sets and reviewed the existing research.
In this survey, we design an improved K-NN algorithm (R-K-NN) that uses Hadoop Map (M) and Hadoop Reduce (R) to process data sets containing millions of records on parallel systems. The results show that R-K-NN is able to process big data sets in a shortened time. R-K-NN is also tested on the Cloudera distributed system.
According to the existing research related to the K-Nearest Neighbors algorithm (Larose, 2005;Fukunaga and Narendra, 2006;Keller et al., 2012;Kuncheva, 1995;Pan et al., 2015;Franco-Lopez et al., 2001;Callahan and Kosaraju, 1995;Horton and Nakai, 1997;Denoeux, 2002;Zhang and Zhou, 2005;Seidl and Kriegel, 1998;Mouratidis et al., 2005;Song and Roussopoulos, 2001), there is no work related to K-NN which is similar to this work, nor is there any study related to the K-Nearest Neighbors algorithm in the parallel network environment (the distributed system).
In the research related to the distributed environment reported in (Favuzza et al., 2006;Satyanarayanan et al., 2002;Babaoğlu et al., 1992;Fujimoto, 2001;Phu et al., 2016), we considered the benefits and drawbacks of K-NN. The benefits of K-NN are as follows: it is a very simple classifier that works well on basic recognition problems, and it can work with noisy training data (for example, when the inverse square of the weighted distance is used as the 'distance'). The drawbacks of K-NN are as follows: the algorithm finds the K closest neighbors to the new instance in the training data and sets the predicted class label to the most common label among those K neighbors, so the distances to all the training data must be computed and sorted at each prediction; if there is a large number of training examples, this can be slow.
Our motivation is as follows: K-NN is a popular data mining algorithm which is increasingly applied to many different fields, so it is very useful for researchers and commercial applications. In today's information age, massive big data sets are being generated which current algorithms and methods cannot process successfully. Many current methods, applications and systems work correctly on today's data sets, but they cannot do so on big data sets. Therefore, we propose a new model to address these limitations.
The novelty of the proposed approach is as follows: K-NN can be run in the parallel network environment and can handle big data sets. Consequently, the many models, methods, applications and systems that use K-NN can be upgraded to big data sets and distributed systems. This is why the model is proposed.
We also present the crucial contributions of our survey at the end of this paper; in light of these contributions, the superiority of R-K-NN over K-NN is demonstrated.
This work comprises five sections. This section is the Introduction. Section 2 details the related studies on the K-Nearest Neighbors algorithm. Section 3 presents the methodology for implementing the K-Nearest Neighbors algorithm in the parallel network environment. Section 4 presents the details of the experiments. Section 5 presents the conclusion of this work.

Related Work
In this section, we detail the existing research related to the K-Nearest Neighbors algorithm and the parallel network system.
Algorithms, applications and studies in the distributed system are shown in (Hadoop, 2016;Apache, 2016;Cloudera, 2016). We use Hadoop, an Apache-based framework (Hadoop, 2016;Apache, 2016). The surveys related to the K-Nearest Neighbors algorithm are presented in (Larose, 2005;Fukunaga and Narendra, 2006;Keller et al., 2012;Kuncheva, 1995;Pan et al., 2015;Franco-Lopez et al., 2001;Callahan and Kosaraju, 1995;Horton and Nakai, 1997;Denoeux, 2002;Zhang and Zhou, 2005;Seidl and Kriegel, 1998;Mouratidis et al., 2005;Song and Roussopoulos, 2001). A discussion of the differences between supervised and unsupervised methods is given in (Larose, 2005), where the k-nearest neighbor algorithm is introduced in the context of a patient-drug classification problem. The authors of (Fukunaga and Narendra, 2006) examined k-nearest neighbor methods for estimation and prediction, along with methods for choosing the best value of k. Computing the k-nearest neighbors generally requires a large number of expensive distance computations.
To the best of our knowledge, there is no research related to the K-Nearest Neighbors algorithm in the parallel system.

Methodology
In this section, we first detail the implementation of the K-Nearest Neighbors algorithm in the sequential environment and then present the implementation of the reformed K-Nearest Neighbors algorithm in the Cloudera environment.

K-Nearest Neighbors Algorithm in the Sequential Environment
An overview of K-NN in the sequential system is shown in Fig. 2. We use Algorithm 1 to transfer the 500,000 data records of our data set (The Data Set, 2016) into 500,000 vectors.
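Algorithm 1 itself is not reproduced here and the fields of the data set are not specified, so the following is only a hypothetical illustration of what transferring a data record into a vector might look like, assuming comma-separated numeric records:

```python
# Hypothetical sketch only: the paper does not publish Algorithm 1 or the
# record format, so we assume each record is a comma-separated numeric string.
def record_to_vector(record: str) -> list[float]:
    """Parse a comma-separated data record into a vector of floats."""
    return [float(field) for field in record.split(",")]

records = ["5.1,3.5,1.4", "4.9,3.0,1.3"]  # made-up sample records
vectors = [record_to_vector(r) for r in records]
```

In the paper, this transfer is applied to all 500,000 records before clustering begins.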

Input: a vector group, including the 500,000 vectors.
Output: the results of clustering of K-NN.
Begin
1. Identify the K parameter (K nearest neighbors): in this survey, we choose K = 5;
2. Calculate the distance between the vectors which need to be clustered and the vectors in the training data by Euclidean distance;
3. Arrange the distances in ascending order and identify the K nearest neighbors of the vectors which need to be clustered;
4. Get all the clusters of the K nearest neighbors which are identified;
5. Identify the cluster of the vector according to all the clusters of K-NN;
6. End.
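Steps 1 to 5 above can be sketched as follows. This is a minimal illustration, not the authors' Java implementation; `training` is a hypothetical list of (vector, cluster label) pairs:

```python
import math
from collections import Counter

def knn_classify(query, training, k=5):
    """Assign `query` to a cluster by majority vote of its K nearest neighbors."""
    # Step 2: Euclidean distance from the query vector to every training vector
    dists = [(math.dist(query, vec), label) for vec, label in training]
    # Step 3: arrange distances in ascending order and keep the K nearest
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]
    # Steps 4-5: collect the neighbors' clusters and pick the most common one
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

For example, with three nearby 'a' vectors and two distant 'b' vectors, a query near the 'a' group is assigned to cluster 'a'.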
We also present R-K-NN using Hadoop Map (M) and Hadoop Reduce (R) in Cloudera in Fig. 3.
Transferring the data records comprises two phases: Hadoop Map of Cloudera in the first phase and Hadoop Reduce of Cloudera in the second phase. Fig. 4 shows the first phase of transferring the 500,000 data records into the 500,000 vectors in Cloudera.
Fig. 5 illustrates the second phase.
After transferring the 500,000 data records into the 500,000 vectors in Cloudera, we reform K-NN in the distributed system, which involves two phases as shown in Fig. 6.
Fig. 7 illustrates the first phase of K-NN in Hadoop Map in the parallel system.
Fig. 8 illustrates the second phase of K-NN in Hadoop Reduce of the Cloudera distributed system: based on all the clusters of the nearest neighbors, the cluster of the vector is identified. We used the Java programming language to implement the novel model on a data set with 500,000 observations (The Data Set, 2016).
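The two phases of R-K-NN can be sketched as a single-process simulation. This is an illustration only, assuming (as Figs. 7 and 8 suggest) that the Map phase emits each node's locally nearest neighbors and the Reduce phase merges them and votes; it is not the authors' Hadoop code:

```python
import heapq
import math
from collections import Counter

def map_phase(partition, query, k):
    """Hadoop-Map analogue: one node emits its K locally nearest neighbors."""
    dists = [(math.dist(query, vec), label) for vec, label in partition]
    return heapq.nsmallest(k, dists)

def reduce_phase(partial_results, k):
    """Hadoop-Reduce analogue: merge the partial lists, keep the global K
    nearest neighbors and vote on the cluster label."""
    merged = heapq.nsmallest(
        k, (pair for part in partial_results for pair in part))
    return Counter(label for _, label in merged).most_common(1)[0][0]

# Training vectors split across two simulated nodes
partitions = [
    [([0, 0], 'a'), ([5, 5], 'b')],
    [([0, 1], 'a'), ([6, 5], 'b')],
]
partials = [map_phase(p, [0.5, 0.5], 3) for p in partitions]
cluster = reduce_phase(partials, 3)
```

Because each node only returns K candidates, the Reduce step never needs the full distance list, which is what makes the distributed version scale to millions of records.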
We used one node (one server) to perform our survey in the sequential environment.We also used the Java programming language to program R-K-NN.
The server in the sequential system was configured with an Intel® Server Board S1200V3RPS, an Intel® Pentium® Processor G3220 (3M Cache, 3.00 GHz) and 2 GB PC3-10600 ECC 1333 MHz LP Unbuffered DIMMs. The Cloudera system was used as the operating system of the server.
We implemented R-K-NN in the Cloudera parallel network environment, again using the Java language. The Cloudera system comprised five nodes (five servers), each running Cloudera as its operating system. All five nodes had the same configuration: an Intel® Server Board S1200V3RPS, an Intel® Pentium® Processor G3220 (3M Cache, 3.00 GHz) and 2 GB PC3-10600 ECC 1333 MHz LP Unbuffered DIMMs.
Due to space limitations, we do not present the detailed results in this study; the results of K-NN in the sequential environment are similar to those of R-K-NN in the Cloudera parallel environment. However, the execution time of K-NN in the sequential system is much longer than that of R-K-NN in the Cloudera distributed system.
The execution times of K-NN in the sequential system and R-K-NN in the distributed system are shown in Table 1.
To better understand the advantages of this study, we compare this work with other work related to the K-Nearest Neighbors algorithm in (Larose, 2005;Fukunaga and Narendra, 2006;Keller et al., 2012;Kuncheva, 1995;Pan et al., 2015;Franco-Lopez et al., 2001;Callahan and Kosaraju, 1995;Horton and Nakai, 1997;Denoeux, 2002;Zhang and Zhou, 2005;Seidl and Kriegel, 1998;Mouratidis et al., 2005;Song and Roussopoulos, 2001), as shown in Table 2. As indicated in Tables 2 and 3, no other study used K-NN in the parallel network environment, nor was there any research similar to the model developed in this work.
Next, we compare our study with those related to the distributed system reported in (Favuzza et al., 2006;Satyanarayanan et al., 2002;Babaoğlu et al., 1992;Fujimoto, 2001;Phu et al., 2016;Shirazi et al., 1995;Chen et al., 2011;Sulistio et al., 2004;Borges et al., 2001;Fujita et al., 1998), as shown in Tables 4 and 5. The entries of these comparison tables (studies, advantages, disadvantages) are summarized below.

Larose (2005) — Advantages: the k-nearest neighbor algorithm is introduced in the context of a patient-drug classification problem; voting for different values of k is shown to sometimes lead to different results; the distance function, or distance metric, is defined, with Euclidean distance typically chosen for this algorithm; the combination function is defined for both simple unweighted voting and weighted voting; stretching the axes is shown as a method for quantifying the relevance of the various attributes; database considerations, such as balancing, are discussed; finally, k-nearest neighbor methods for estimation and prediction are examined, along with methods for choosing the best value of k. Disadvantages: no mention.

Fukunaga and Narendra (2006) — Advantages: experimental results demonstrate the efficiency of the algorithm; typically, an average of only 61 distance computations was needed to find the nearest neighbor of a test sample among 1,000 design samples. Disadvantages: no mention.

One of the surveyed studies reports a sequence of experiments with both synthetic and real point data sets; in the experiments, the authors' algorithms always outperform the existing ones, fetching 70% fewer disk pages, and in some settings the saving can be as much as one order of magnitude.

Our work — Advantages: the execution time is shorter; the improved algorithm can process big data sets with millions of records in the shortest time. Disadvantages: it takes a lot of time and cost to implement this model; the implementation can be confusing; many errors had to be resolved in the process of building the model.

Table 4: Comparison of our work with the studies related to the parallel environment in (Favuzza et al., 2006;Satyanarayanan et al., 2002;Babaoğlu et al., 1992;Fujimoto, 2001;Phu et al., 2016;Shirazi et al., 1995;Chen et al., 2011;Sulistio et al., 2004;Borges et al., 2001;Fujita et al., 1998).

Table 5: Comparison of the advantages and disadvantages of our work with the studies related to the parallel environment in (Favuzza et al., 2006;Satyanarayanan et al., 2002;Babaoğlu et al., 1992;Fujimoto, 2001;Phu et al., 2016;Shirazi et al., 1995;Chen et al., 2011;Sulistio et al., 2004;Borges et al., 2001;Fujita et al., 1998).

Favuzza et al. (2006) — Advantages: reinforcement of the distribution system with the installation of feeders and substations is viewed as a new option for solving distribution system capacity problems over several years; the objective to be minimized is therefore the overall cost of the distribution system reinforcement strategy in a given time frame; an application on a medium-size network is carried out using the proposed technique, which allows the identification of optimal paths in extremely large or non-finite spaces; the proposed algorithm uses an adaptive parameter in order to push exploration or exploitation when the search procedure stops in a local minimum; the algorithm allows the easy investigation of these kinds of complex problems and allows useful comparisons to be made as the intervention strategy and type of DG sources vary. Disadvantages: no mention.

Satyanarayanan et al. (2002) — Advantages: the authors' goal in building Coda is to develop a distributed file system that retains the positive characteristics of AFS while providing substantially better availability; the authors show how these goals have been achieved through the use of two complementary mechanisms, server replication and disconnected operation; they also show how disconnected operation can be used to support portable workstations; the authors believe that a well-tuned version of Coda will indeed meet its goal of providing high availability without serious loss of performance, scalability or security. Disadvantages: Coda is far from maturity, although the initial experience with it reflects favorably on its design; performance measurements from the Coda prototype are promising, although they also reveal areas where further improvement is possible; a general question about optimistic replication schemes that remains open is whether users will indeed be willing to tolerate occasional conflicts in return for higher availability.

Results and Discussion
As shown in Table 1, the execution time of K-NN in the sequential system is 14,603 sec whereas that of R-K-NN in the Cloudera distributed environment (four nodes) is 1233 sec; and that of R-K-NN in the Cloudera parallel system (five nodes) is 890 sec.
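The speedup of R-K-NN over K-NN can be checked directly from these figures with a few lines of arithmetic (a sketch using only the timings reported above):

```python
# Execution times from Table 1, in seconds
t_seq = 14603      # K-NN, sequential system
t_4nodes = 1233    # R-K-NN, Cloudera with four nodes
t_5nodes = 890     # R-K-NN, Cloudera with five nodes

speedup_4 = t_seq / t_4nodes   # ~11.8x with four nodes
speedup_5 = t_seq / t_5nodes   # ~16.4x with five nodes
```

Going from four to five nodes also improves the parallel run itself (1,233 sec down to 890 sec), consistent with the observation below that a higher-performance distributed system makes R-K-NN faster.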
The results of the sequential environment are similar to those in the Cloudera distributed system.The execution time in the sequential system is longer than that in the Cloudera parallel system.
The execution time of R-K-NN in the distributed environment depends on the performance of the parallel system: the higher the performance of the distributed system, the faster R-K-NN runs.

Conclusion
In this research, we developed an improved K-Nearest Neighbors algorithm to implement in the Cloudera parallel network environment for processing big datasets containing millions of records.The data set used in our work comprises 500,000 data records (The Data Set, 2016).
Our model has advantages and disadvantages. The advantages are that the execution time is shorter and that it can process big data sets with millions of records in a shortened time. The disadvantages are that it takes considerable time and cost to construct and implement this model, that the implementation can be confusing and that many errors had to be resolved in the process of building the model.
As shown in Table 3, the studies related to the distributed network environment did not use K-NN; hence, they are not similar to this work. It is well known that K-NN is not efficient for big data sets. Our proposed model solves this problem very effectively and R-K-NN can be implemented on very large data sets. As shown in Table 1, the experimental results prove that the execution time of R-K-NN is much shorter than that of K-NN on the same data set.
In the near future, we will use the results of the proposed model to classify the semantics (positive, negative, neutral) of millions of English documents.
The crucial contributions of this work are as follows:
a) This work implements K-NN in the parallel environment.
b) This work implements K-NN to process big data with millions of records.
c) This research implements K-NN in both sequential systems and distributed environments.
d) We propose K-NN related algorithms which are implemented in distributed systems.
e) The results of this novel model can be used in many other studies and commercial applications.
f) Using the results of this work, other studies and systems related to K-NN can be successfully enhanced.

Hadoop is an Apache-based framework used to handle large data sets on clusters consisting of multiple computers, using the Map and Reduce programming model. There are two components of Hadoop: the Hadoop Distributed File System (HDFS) and Hadoop M/R (Hadoop Map/Reduce). Engineers use Hadoop Map and Hadoop Reduce to write applications for the parallel processing of large data sets on such clusters. An M/R task has two main components: (1) Map and (2) Reduce. Cloudera (2016) is the global provider of the fastest, easiest and most secure data management and analytics platform built on Apache™ Hadoop® and the latest open source technologies; it will submit proposals for Impala and Kudu to join the Apache Software Foundation (ASF).
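The Map and Reduce components of an M/R task can be illustrated with a toy single-process model (an illustration of the programming model only, not Hadoop itself; the word-count mapper and reducer are the standard introductory example, not part of this paper's system):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Toy single-process model of an M/R task: map, shuffle by key, reduce."""
    pairs = [kv for rec in records for kv in mapper(rec)]   # Map
    pairs.sort(key=itemgetter(0))                           # shuffle/sort
    return {key: reducer(key, [v for _, v in group])        # Reduce
            for key, group in groupby(pairs, key=itemgetter(0))}

# Classic word count: the mapper emits (word, 1) pairs, the reducer sums them
counts = run_mapreduce(
    ["big data", "big sets"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: sum(values),
)
# counts == {"big": 2, "data": 1, "sets": 1}
```

Real Hadoop distributes the map and reduce calls across the cluster's nodes and moves the intermediate (key, value) pairs through HDFS, but the contract between the two components is the same.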

Fig. 2: Overview of K-NN in the sequential system

Fig.4: The first phase of transferring the 500,000 data records into the 500,000 vectors in Cloudera

Fig. 7: Overview of the K-Nearest Neighbors algorithm in Hadoop Map (M) in Cloudera

Table 1: The execution times of K-NN in the sequential system and R-K-NN in the distributed system

Table 2: Comparison of our work with the studies related to K-NN in (Larose, 2005;Fukunaga and Narendra, 2006;Keller et al., 2012;Kuncheva, 1995;Pan et al., 2015;Franco-Lopez et al., 2001;Callahan and Kosaraju, 1995;Horton and Nakai, 1997;Denoeux, 2002;Zhang and Zhou, 2005;Seidl and Kriegel, 1998;Mouratidis et al., 2005;Song and Roussopoulos, 2001)

Table 3: Comparison of the advantages and disadvantages of our work with the studies related to K-NN in (Larose, 2005;Fukunaga and Narendra, 2006;Keller et al., 2012;Kuncheva, 1995;Pan et al., 2015;Franco-Lopez et al., 2001;Callahan and Kosaraju, 1995;Horton and Nakai, 1997;Denoeux, 2002;Zhang and Zhou, 2005;Seidl and Kriegel, 1998;Mouratidis et al., 2005;Song and Roussopoulos, 2001)

Table 3 (continued):
Song and Roussopoulos (2001) — Advantages: four different methods are proposed for solving the problem; a discussion of the parameters affecting the performance of the algorithms is also presented. Disadvantages: no mention.