A Modified Key Partitioning for BigData Using MapReduce in Hadoop

Corresponding Author: Gothai Ekambaram, Department of CSE, Kongu Engineering College, Erode 638052, Tamilnadu, India. Email: kothaie@yahoo.co.in
Abstract: In the era of BigData, massive amounts of structured and unstructured data are created every day by a multitude of ever-present sources. BigData is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that manipulate BigData. For MapReduce to work, it has to divide the workload between the computers in the network. As a result, the performance of MapReduce depends strongly on how evenly it distributes this workload. This can be a challenge, particularly in the presence of data skew. In MapReduce, workload allocation depends on the algorithm that partitions the data. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the samples are analyzed by the partitioning method. This study proposes an enhanced partitioning algorithm using modified key partitioning that improves load balancing and memory utilization. This is achieved via an enhanced sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort. Experiments demonstrate that the proposed algorithm is faster, more memory efficient and more accurate than the existing implementation.


Introduction
Over the past decades, computer technology has become increasingly ubiquitous. Computing devices have numerous uses and are essential for businesses, scientists, governments, engineers and the everyday consumer. What all these devices have in common is the potential to produce data. In essence, data can come from anywhere. Most types of data tend to have their own distinctive set of characteristics, as well as their own distribution. Data that is not analyzed or used has little value and can be a waste of space and resources. Conversely, data that is processed or analyzed can be of immeasurable value. The data itself may be too large to store on a single computer. Therefore, to reduce the time it takes to process the data and to have the storage space to hold it, software engineers have to write programs that can run on 2 or more computers and distribute the workload among them. While conceptually the computation to perform may be simple, historically the implementation has been difficult. In response to these very same issues, engineers at Google developed the Google File System (GFS) (Ghemawat et al., 2003), a distributed file system architecture for large-scale data processing, and created the MapReduce programming model.
Hadoop is an open source implementation of MapReduce, written in Java, initially developed by Yahoo. Tan et al. (2009) stated that Hadoop was built in response to the need for a MapReduce framework unencumbered by proprietary licenses, in addition to the growing need for the technology in Cloud computing. Hive, Pig, ZooKeeper and HBase are all examples of commonly used extensions to the Hadoop framework. Likewise, this study focuses on Hadoop and examines the load balancing mechanism in Hadoop's MapReduce framework for small-sized to medium-sized clusters.
In summary, this study presents a technique for improving the workload distribution among nodes in the MapReduce framework, a technique to reduce the required memory footprint and improved execution time for MapReduce when these techniques are applied on small or medium sized clusters of computers. The remainder of this study is organized as follows. Section 2 gives background on MapReduce and its internal workings. Section 3 presents related work and existing methods applied for TeraSort in Hadoop. Section 4 proposes an improved load balancing methodology and a way to better utilize memory. Section 5 presents experimental results and a discussion of this study's findings. Section 6 concludes this study with a brief outlook on future work.

Background
MapReduce
Dean and Ghemawat (2008) described MapReduce as a programming model created as a way for programs to cope with large amounts of data. It achieves this goal by distributing the workload among multiple computers and then working on the data in parallel. Hsu et al. (2007) stated that programs that run on a MapReduce framework need to divide the work into 2 phases known as Map and Reduce. Each phase has key-value pairs for both input and output. To implement these phases, a programmer needs to specify 2 functions: a map function called a Mapper and its corresponding reduce function called a Reducer. When a MapReduce program is executed on Hadoop, it is expected to run on multiple computers or nodes. Therefore, a master node is required to run all the services needed to coordinate the communication between Mappers and Reducers. An example of MapReduce dataflow is shown in Fig. 1. Kavulya et al. (2010) reported that in the MapReduce framework, the workload has to be balanced in order for resources to be used efficiently.
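To make the Map and Reduce phases concrete, the following is a minimal single-process sketch in Python (chosen here for brevity; Hadoop itself is written in Java). The word-count task and the names `mapper`, `reducer` and `run_mapreduce` are illustrative assumptions, not part of Hadoop's API, and the grouping step only imitates what the framework does between the two phases.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit one (key, value) pair per word in the input line."""
    for word in line.split():
        yield word, 1


def reducer(key, values):
    """Reduce phase: combine all values that share the same key."""
    return key, sum(values)


def run_mapreduce(lines):
    """Run map, group by key (shuffle), then reduce, all in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


print(run_mapreduce(["big data", "big cluster"]))
# prints {'big': 2, 'cluster': 1, 'data': 1}
```

On a real cluster the grouped keys would be spread over several Reducers, and how that spreading is done is the partitioning problem this study addresses.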

HashCode
Hadoop uses a hash code as its default method to partition key-value pairs. The hash code can be expressed mathematically and is given by Kenn et al. (2013) as the following equation, where $W_i$ is the ASCII value of the $i$-th character of the key and $n$ is the length of the key:
$$\mathrm{HashCode} = W_1 \times 31^{n-1} + W_2 \times 31^{n-2} + \cdots + W_n \times 31^{0} \qquad (1)$$
The hash code given in Equation 1 is the default hash code used by a string object in Java, the programming language on which Hadoop is based. A partition function typically uses the hash code of the key modulo the number of reducers to decide which reducer to send the key-value pair to. It is important then that the partition function evenly distributes key-value pairs among reducers for proper workload distribution.
O'Malley (2008) stated that Hadoop broke the world record for sorting a Terabyte of data by using its TeraSort method. Winning first place, it managed to sort 1 TB of data in 209 sec (3.48 min). This was the first time either a Java program or an open source program had won the contest. TeraSort was able to speed up the sorting process by distributing the workload evenly within the MapReduce framework. This was done via data sampling and the use of a trie (Panda et al., 2010). Even though the main goal of TeraSort was to sort 1 TB of data as quickly as possible, it has since been incorporated into Hadoop as a standard benchmark.
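The default hash partitioning described in this subsection can be sketched as follows. This is an illustrative Python emulation, assuming ASCII keys: `java_string_hashcode` reproduces the base-31 polynomial of Java's `String.hashCode()` with 32-bit wrap-around, and `hash_partition` (a hypothetical helper name) masks the sign bit before taking the modulo, in the manner of Hadoop's default hash partitioner.

```python
def java_string_hashcode(key):
    """Emulate Java's String.hashCode(): base-31 polynomial, signed 32-bit."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap around like a Java int
    # Reinterpret the unsigned 32-bit value as a signed integer.
    return h - (1 << 32) if h >= (1 << 31) else h


def hash_partition(key, num_reducers):
    """Pick a reducer index in [0, num_reducers) from the key's hash code."""
    # Clearing the sign bit keeps the modulo result non-negative.
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_reducers


print(java_string_hashcode("ate"))  # prints 96914
print(hash_partition("ate", 3))     # prints 2
```

Because the polynomial ignores how the keys are actually distributed, a skewed key set can still send most records to one reducer, which is why TeraSort relies on sampling instead.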

Fig. 1. MapReduce dataflow
On the whole, the TeraSort algorithm is very similar to the standard MapReduce sort. Its efficiency relies on how it distributes its data between the Mappers and Reducers. To achieve a good load balance, TeraSort uses a custom partitioner. Since the original goal of TeraSort was to sort data as quickly as possible, its implementation adopted a space-for-time approach. For this reason, TeraSort uses a 2-level trie to partition the data. Kenn et al. (2013) describe a trie that limits the strings stored in it to 2 characters as a 2-level trie. This 2-level trie is built using cut points extracted from the sampled data. Once the trie is constructed using the cut points, the partitioner can begin partitioning strings based on where in the trie each string would go if it were inserted into the trie.
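A rough sketch of a 2-level trie partitioner stored as a flat array is given below. It is a simplification under stated assumptions rather than TeraSort's actual code: keys are ASCII, cut points are taken at evenly spaced positions of the sorted sample, each cut point is truncated to its first 2 characters, and all function names are hypothetical.

```python
from bisect import bisect_right


def prefix_code(key, levels=2):
    """Index of the key's first `levels` bytes in a flat 256**levels array."""
    data = key.encode("ascii", errors="replace")[:levels].ljust(levels, b"\0")
    code = 0
    for byte in data:
        code = code * 256 + byte
    return code


def choose_cut_points(sample, num_reducers):
    """Pick num_reducers - 1 evenly spaced keys from the sorted sample."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def build_trie_table(cut_points, levels=2):
    """Fill the flat array so that each prefix slot holds a reducer index."""
    cut_codes = sorted(prefix_code(cut, levels) for cut in cut_points)
    return [bisect_right(cut_codes, code) for code in range(256 ** levels)]


def trie_partition(table, key, levels=2):
    """Look up the reducer for a key using only its first `levels` bytes."""
    return table[prefix_code(key, levels)]


sample = ["apple", "bat", "cat", "dog", "eel", "fox"]
table = build_trie_table(choose_cut_points(sample, 3))
print([trie_partition(table, key) for key in sample])
# prints [0, 0, 1, 1, 2, 2]
```

The table always has 256^2 = 65536 slots regardless of how many keys are sampled, which reflects the space-for-time approach; keys that share their first 2 characters always land on the same reducer.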

Related Works
Sorting is a fundamental concept and a required step in countless algorithms. Heinz et al. (2002) stated that Burst Sort is a sorting algorithm developed for sorting strings in large data collections. The TeraSort algorithm also uses these burst trie techniques as a way to sort data, but does so in the context of the Hadoop architecture and the MapReduce framework. An important issue for the MapReduce framework is load balancing. Over the years, considerable research has been done in the area of load balancing. Where data is located (Hsu and Chen, 2012), how it is communicated (Hsu and Chen, 2010), what environment it is located on (Hsu and Tsai, 2009; Hsu et al., 2008; Zaharia et al., 2008) and the statistical distribution of the data can all affect a system's efficiency. Most of these algorithms can be found in a variety of papers and were used by frameworks and systems before the existence of the MapReduce framework (Krishnan, 2005; Stockinger et al., 2006). As stated by Candan et al. (2010), RanKloud uses its own uSplit method for partitioning large media data sets. The uSplit method is needed to reduce the data duplication costs and wasted resources that are specific to its media-based algorithms. To work around perceived limitations of the MapReduce model, various extensions or changes to the MapReduce model have been proposed. BigTable was introduced by Google to manage structured data (Chang et al., 2008). BigTable resembles a database, but does not support a full relational database model. It uses rows with consecutive keys grouped into tablets that form the unit of distribution and load balancing, and it suffers from the same load and memory balancing problems faced by shared-nothing databases. HBase of Hadoop is the open source version of BigTable and mimics the same functionality.
Because of its ease of use, the MapReduce model is quite popular and has several implementations (Liu and Orban, 2011; Miceli et al., 2009). Consequently, there has been a variety of research on MapReduce aimed at improving the performance of the framework or the performance of specific applications that run on the framework, such as graph mining (Jiang and Agrawal, 2011), data mining (Papadimitriou and Sun, 2008; Xu et al., 2009), genetic algorithms (Jin et al., 2008; Verma et al., 2009) and text analysis (Vashishtha et al., 2010).
Occasionally, researchers find the MapReduce framework too strict or rigid in its current implementation. Fadika and Govindaraju (2011) stated that DELMA is one such framework; it imitates the MapReduce model in the same way as Hadoop MapReduce. Such a system is likely to have interesting load balancing problems, which are beyond the scope of this paper. Another alternative framework to MapReduce is Jumbo (Groot and Kitsuregawa, 2010). The Jumbo framework may be a useful tool for researching load balancing, but it is not compatible with current MapReduce technologies. To work around load balancing problems resulting from joining tables in Hadoop, Lynden et al. (2011) introduced an adaptive MapReduce algorithm for multiple joins using Hadoop that works without modifying its environment. This study also attempts workload balancing in Hadoop without modifying the original framework, but concentrates on sorting text.
Kenn et al. (2013) stated that the XTrie algorithm provides a way to improve the cut point algorithm inherited from TeraSort. The main weakness of the TeraSort algorithm is that it uses the Quick Sort algorithm to handle the cut points. By using Quick Sort, TeraSort needs to store all the keys it samples in memory, which reduces the possible sample size; this in turn reduces the accuracy of the selected cut points and affects load balancing (O'Malley, 2008). Another problem TeraSort has is that it only considers the first 2 characters of a string during partitioning. This also reduces the effectiveness of the TeraSort load balancing algorithm. The main issue shared by TeraSort and XTrie is that they use an array to represent the trie. The major concern with this approach is that it tends to contain a lot of wasted space. Kenn et al. (2013) also presented the ReMap algorithm, which reduces the memory requirements of the original trie by reducing the number of elements it considers.
The ReMap chart maps each of the 256 characters on an ASCII chart to the reduced set of elements expected by the ETrie. Since the purpose of ETrie is to mimic words found in English text, ReMap reassigns the ASCII characters to 64 elements. By reducing the number of elements to consider from 256 to 64 per level, the total memory required is reduced to 1/16th of its original footprint for a 2-level trie. In order to use the ETrie, the TrieCode presented in Equation 2 has to be modified. The EtrieCode shown in Equation 3 is similar to the TrieCode in Equation 2, but has been changed to reflect the smaller memory footprint. Even though it is superior to XTrie, the problem with this method is that it still tends to have a lot of wasted space.
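Separately from the EtrieCode itself, the 1/16 memory figure quoted for the 2-level trie follows from counting array slots: a trie over an alphabet of a given size, stored as a flat array, needs one slot per possible prefix. The short check below assumes exactly that flat-array layout; `trie_slots` is a hypothetical helper, not a function from the cited work.

```python
from fractions import Fraction


def trie_slots(alphabet_size, levels):
    """Number of array slots needed for a flat trie with the given depth."""
    return alphabet_size ** levels


full = trie_slots(256, 2)    # every ASCII byte value at both levels
reduced = trie_slots(64, 2)  # 64 remapped elements at both levels

print(full, reduced, Fraction(reduced, full))
# prints 65536 4096 1/16
```

Under the same assumption, the saving grows with depth: for a 3-level trie the ratio would be 64^3 / 256^3 = 1/64.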

The Proposed Method
This section describes key partitioning using Horner's Rule as an alternative to hash code partitioning, to be incorporated into TeraSort in Hadoop. In addition, this section discusses how memory can be saved by means of a ReMap technique. According to the experimental results for XTrie and ETrie, the irregularity rate is lower (lower being better) when a trie has more levels. This is because the deeper a trie is, the longer the prefix each key represents. Therefore, in this study, the full-length key is used as the prefix instead of only 2 or 3 characters, and the hash value is also calculated over the full key.
A trie has 2 advantages over the Quick Sort algorithm. First, the time complexity for insert and search using the trie algorithm is O(k), where k is the length of the key. In contrast, the Quick Sort algorithm is O(n log n) in the best and average case and O(n^2) in the worst case, where n is the number of keys in its sample. Second, a trie has a fixed memory footprint. This means the number of samples inserted into the trie can be very large if so desired.
In the proposed HTrie algorithm, the HTrie is an array accessed via an HTrie code. An HTrie code is similar to a hash code, but the codes it generates occur in sequential ASCII order, computed using Horner's Rule. The HTrie code is itself a hash code computed over the whole key rather than over a trie structure, and it uses the next prime number in Horner's Rule. Equations 2 and 3 used 256 and 64, respectively, to obtain the code and gave good values because only 2 or 3 prefix characters were considered. To obtain a different yet good result, the next prime number, 37, is used instead of 31. The equation is as follows, where $W_i$ is the ASCII value of the $i$-th character of the key and $n$ is the length of the key:
$$\mathrm{HTrieCode} = W_1 \times 37^{n-1} + W_2 \times 37^{n-2} + \cdots + W_n \times 37^{0} \qquad (4)$$
Figure 2 illustrates how the hash code works for a typical partitioner. In this example, there are 3 reducers and 3 strings. Each string comes from a key in a (key, value) pair. The first string 'ate' consists of the 3 characters 'a', 't' and 'e', which have the ASCII values 97, 116 and 101. These ASCII values are then supplied to Equation 4 to obtain the hash value 97 × 37^2 + 116 × 37 + 101 = 137186. Because there are 3 reducers, modulo 3 is applied, which gives the value 2. The value is then increased by one in the illustration, since there is no reducer 0, which changes the value to 3. This sends the key-value pair to reducer 3. Using the same method, the 2 other strings 'bad' and 'can' are assigned to reducers 2 and 1, respectively.
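The worked example can be checked with a few lines of Python. The sketch assumes that Equation 4 is the base-37 Horner polynomial over the ASCII values of the whole key, a reading that reproduces the value 137186 quoted for 'ate'; the function names are hypothetical.

```python
def htrie_code(key, base=37):
    """Evaluate the key's ASCII values as a base-37 polynomial (Horner's Rule)."""
    code = 0
    for ch in key:
        code = code * base + ord(ch)
    return code


def htrie_reducer(key, num_reducers):
    """Map a key to a reducer numbered from 1, as in the illustration."""
    return htrie_code(key) % num_reducers + 1


for key in ["ate", "bad", "can"]:
    print(key, htrie_code(key), htrie_reducer(key, 3))
# prints:
# ate 137186 3
# bad 137851 2
# can 139230 1
```

Python integers do not overflow, so long keys are handled transparently here. A Java implementation would have to decide how to treat overflow for long keys, a detail the text does not specify.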

Results
To evaluate the performance of the proposed method, this study examines how well the algorithms distribute the workload and how well memory is used. Tests in this study were performed using the LastFm dataset, with each record containing a user profile with fields such as country, gender, age and date. Using these records as input, computer networks were simulated using VMware for the Hadoop file system. The tests were carried out with datasets of various sizes: 1 Lakh (100,000), 3 Lakhs, 5 Lakhs, 10 Lakhs, 50 Lakhs and 1 Crore (10,000,000) records. In the first experiment, an input file containing 1 Lakh records was used. As described for the MapReduce framework, the input set is divided into splits and forwarded to the Map phase. For this input file, only one mapper is used, since the number of mappers depends on the size of the input file. After mapping, the partition algorithm is used to reduce the number of output records by grouping records based on the HTrie value of the country attribute, which is taken as the key here. After grouping, 4 partitions are created using the procedure Gender-Group-by-Country. All the corresponding log files and counters are analyzed to assess the performance. In the other 5 experiments, input files with 3 Lakhs, 5 Lakhs, 10 Lakhs, 50 Lakhs and 1 Crore records are used. Following the method described above, all the input files are partitioned into 4 partitions.
In order to compare the different methodologies presented in this study and determine how balanced the workload distributions are, this study uses the metrics Effective CPU, Rate and Skew. These were selected from among clock time, CPU, Bytes, Memory, Effective CPU, Rate and Skew, since only these 3 parameters show a significant difference in outcomes. Rate is the number of bytes from the Bytes column divided by the number of seconds elapsed since the previous report, rounded to the nearest kilobyte. No number appears for values less than one KB per second. Effective CPU is the CPU-seconds consumed by the job between reports, divided by the number of seconds elapsed since the previous report. The result is expressed in units of CPU-seconds per second, a measure of how processor-intensive the job is from each report to the next. The skew of a data or flow partition is the amount by which its size deviates from the average partition size, expressed as a percentage of the size of the largest partition:
$$\text{skew of a data partition} = \frac{\text{partition size} - \text{average partition size}}{\text{size of largest partition}} \times 100$$
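As an illustration of the skew metric, the sketch below computes the skew of each partition from a list of partition sizes. It assumes the reading of the formula in which the deviation from the average is divided by the size of the largest partition; the function name is hypothetical, and the sizes in the example are made up for illustration and are not measurements from this study.

```python
def partition_skew(sizes):
    """Return the skew (in percent) of every partition in `sizes`."""
    average = sum(sizes) / len(sizes)
    largest = max(sizes)
    return [100.0 * (size - average) / largest for size in sizes]


# Hypothetical record counts for 4 partitions (not data from the experiments).
print(partition_skew([40, 30, 20, 10]))
# prints [37.5, 12.5, -12.5, -37.5]
```

A perfectly balanced job would report a skew of 0 for every partition, so values closer to 0 indicate a better partitioner.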

Discussion
Tables 1-3 show the results obtained with input files of various sizes, comparing the performance of ETrie, XTrie and HTrie on the parameters Skew, Effective CPU and Rate, respectively. Similarly, Fig. 3-5 show comparison charts of the same results. The tables and figures show that the proposed method (HTrie) performs better than XTrie and ETrie on all 3 parameters.

Conclusion
This study presented HTrie, a comprehensive partitioning technique, to improve load balancing for distributed applications. By improving load balancing, MapReduce programs can become more efficient at handling tasks by reducing the overall computation time spent processing data on each node. TeraSort was developed based on randomly generated input data on a very large cluster of 910 nodes. In that particular computing environment and for that data configuration, each partition created by MapReduce appeared on only one or 2 nodes. In contrast, our work concentrates on small-sized to medium-sized clusters. This study modifies their design and optimizes it for a smaller environment. A series of experiments has shown that, given a skewed data sample, the HTrie architecture was able to conserve more memory, allocate more computing resources on average and do so with lower time complexity.
In future, further research can introduce new partitioning mechanisms that can be incorporated into Hadoop for applications using different input samples, since the Hadoop file system does not have any partitioning mechanism other than key partitioning.