E-GENMR: Enhanced Generalized Query Processing using Double Hashing Technique through MapReduce in Cloud Database Management System

Abstract: Big Data, cloud computing and data science are the booming future of the IT industry. The common thread among these new techniques is that they deal not just with data but with Big Data. Users store various kinds of data in cloud repositories, and a Cloud Database Management System (CDBMS) manages these large data sets. Cloud database service providers face many obstacles while delivering their services; among them, the processing of very large amounts of data, interoperability and security are the major concerns explained in this study. Enhanced Generalized Query Processing through MapReduce (E-GENMR) is a prototype model that provides a solution to these problems. Firstly, traditional approaches are not suitable for processing such gigantic amounts of data. Various solutions have been developed, such as Hadoop, MapReduce programs, HIVE and PIG, but these technologies do not address all of these problems at the same time, and moreover most users are not comfortable with the newer technologies such as MapReduce code. E-GENMR provides interoperability: it accepts queries written in various RDBMS dialects such as SQL Server, Oracle, DB2 and MySQL and converts them into MapReduce code, which is considered an efficient way of processing large data. Secondly, the client's data is stored in encrypted form and processing is done on this encrypted data, which ensures the security aspect. Indexing plays a very important role in query processing; in E-GENMR, indexing is implemented using a closed double hashing technique. We compared the query processing time of E-GENMR for encrypted and unencrypted data. A comparison over various queries has been carried out to evaluate the performance of E-GENMR against recent systems such as HadoopDB, SQLMR, HIVE and PIG, and it is concluded that E-GENMR shows better performance.


Introduction
One of the most influential services that a cloud service provider offers is the cloud database. Many cloud provider companies such as Amazon, Yahoo, EMC2, Microsoft, Google and Rackspace provide database services in both SQL and NOSQL form. Users can access a cloud database service in two ways: either by running their databases on virtual machines provided by the cloud provider, or by directly using the database services offered by the cloud service provider. MySQL, PostgreSQL, Microsoft SQL Server and NuoDB are some of the SQL services provided by cloud service providers. Cassandra, MongoDB and CouchDB are some examples of NOSQL database services (Bloor, 2011). CDBMS is attractive for various reasons: organizations are not bothered about hardware maintenance, software cost or any administrative cost, and can focus only on the efficiency of their business.
In 2016, the latest Beckman Report on Database Research (Abadi et al., 2016) discussed various research challenges in this field. It concluded that, among the various challenges of cloud databases, the processing, interoperability and security of data stored in cloud repositories are the major ones, and these are the concern of this paper.
Some interesting points of the report are: "Many big data applications will be deployed in the cloud, both public and private, on a massive scale. This requires new techniques to offer predictable performance and flexible interoperation" and "A diverse and data-driven world requires diverse programming abstractions to operate on very large datasets" (Abadi et al., 2016).
Cloud database service providers deal with many obstacles while providing the service. Firstly, processing of the data present on the cloud has become one of the biggest issues nowadays. A huge amount of data is being generated from various sources such as sensors and social networking sites (Manyika et al., 2011). Traditional database management systems are not able to process data of such hefty size. New technologies such as MapReduce, Hive, PIG and Hadoop are emerging as solutions for processing this data. However, to date, users are much more comfortable with traditional DBMSs than with MapReduce code.
MapReduce codes available in the market are attractive as they provide several benefits: they are expressed in a simple key-value form and hence are easy to use; they are a cost-effective solution for processing large volumes of data, as they provide parallel processing; they offer flexibility, since they are not based on any schema and the data can be either structured or unstructured; and they also provide scalability (Dean and Ghemawat, 2008).
One of the main characteristics of the cloud is that it is based on a multitenant environment, which means many clients share the same datacenter provided by the cloud provider (Pippal et al., 2001). Since multiple clients store their data with the cloud service provider, security is the biggest challenge for the provider. Earlier models deal with either security or processing, but E-GENMR provides a solution for both of these problems, as shown in Fig. 1. E-GENMR provides security features at each step: the client's data is stored in encrypted form and processing is done on this encrypted data.
Indexing plays a very important role in processing queries. A survey of various indexing techniques for Big Data in the cloud is given in (Adamu et al., 2015; Gani et al., 2016). In E-GENMR, indexing is implemented using the closed double hashing technique, whereas previous techniques like MapReduce use the inverted indexing technique. Table 1 summarizes the indexing techniques used in the latest systems for processing Big Data.
The inverted indexing technique used in the MapReduce programming paradigm takes O(|q|·D·|D|) time, where |q| is the length of the query and |D| is the length of the document, whereas the B+ tree indexing technique used in SQLMR takes O(log n) time for searching any data. With bitmap indexing, the space complexity is low for low-cardinality attributes but very high for high-cardinality attributes. The overall advantage of double hashing is that the search complexity is of the order of 1, i.e., O(1); it takes only the time needed to compute the hash functions.

The rest of the paper is organized as follows. Section 2 describes the work that has been done so far in the fields of cloud database management systems, Big Data and cloud security. In section 3, we briefly define our proposed model along with the algorithms used for the implementation of E-GENMR. In section 4, results and analysis are described and E-GENMR is compared with the latest techniques. Lastly, we conclude our work with future possibilities in section 5.

Related Work
The literature was reviewed to bring out the salient features and techniques used in this field. The review has been grouped into the following categories: Big Data and the latest techniques for processing Big Data on the cloud, the generalized query interface, cloud security and indexing techniques used for Big Data on the cloud.
Big Data is nowadays characterized by seven characteristics, namely volume, velocity, variety, veracity, variability, value and complexity (6 Vs and complexity), as described in (Manyika et al., 2011). Simple MapReduce codes in key-value pair form (Dean and Ghemawat, 2008) are considered a suitable solution for large-scale parallel data processing. Dean and Ghemawat (2008) introduced the MapReduce programming paradigm based on parallel and distributed computing, in which an inverted index scheme is used (McCreadie et al., 2012).
Hive is another technique used for processing; it acts as a data warehouse system built on top of the Hadoop file system. It provides users with a platform where they can write queries similar to SQL in a language called HiveQL, which are compiled into MapReduce jobs that are executed using Hadoop. In the Hive system, bitmap indexing, i.e., a simple single-attribute index, is used and index creation scales linearly (Liu et al., 2013; Fuad et al., 2014). Abouzeid et al. (2009) introduced HadoopDB, a data management system that combines the capability of an RDBMS with the MapReduce programming paradigm. It inherits its scalability from Hadoop and combines it with the basic features of a DBMS, and it achieves better results compared with parallel databases such as Vertica and DB-X. Indexes in HadoopDB are maintained internally by the local DBMS, and a lot of time is consumed in the pre-partitioning phase.
Hsieh et al. (2011) implemented a system named "SQLMR", a hybrid approach that fills the gap between SQL-based and MapReduce data processing. With effective data partitioning, B+ tree indexing, low-overhead file construction, an optimized rack-awareness algorithm and a query result cache mechanism, the system produced better results compared to HadoopDB. YSmart (Lee et al., 2011a) is another system similar to SQLMR, based on a correlation-aware SQL-to-MapReduce translator.
An enhancement of MapReduce codes is provided through the pipelining concept: whenever the mapper function produces its results in intermediate form, they are passed to the reducer function for generating the output (Lee et al., 2011b; Dahiphale et al., 2014; Condie et al., 2009), providing further parallel processing of data. Jayalath et al. (2013) described an efficient way to process Big Data across geographically distributed data centers. Li-Yung et al. (2011) explained an optimization algorithm for cross-rack optimization of the reducer program; here, the generalized model takes the mapper function into account as well. Details of the theoretically proposed model are given in (Malhotra et al., 2015), which explains the interoperability in a model that takes queries in SQL, MySQL, DB2 and Oracle form and converts them into MapReduce form. Big Data analysis on the cloud has become an issue; Ramamoorthy and Rajalakshmi (2013) provide a solution using a MapReduce algorithm and Big Data analytic techniques. Another paper explained the enhancements happening in the cloud computing world: MySQL provides a way to process and manipulate data, but it is not applicable to very large data sets. In (Mongia and Kataria, 2015), the authors discussed the layer-wise security issues related to cloud database management systems and the solutions implemented or proposed so far for each security issue. In this study, the implemented model provides a solution for data security by encrypting the data with a complex algorithm.

Prototype Model
The problem in today's world is that users are not comfortable with MapReduce-style code for processing the large data sets present in cloud repositories. Secondly, the cloud is based on a multitenant environment in which multiple clients use the services provided by the cloud; since multiple clients store their databases with the cloud service provider, security is the biggest challenge for the provider. Earlier models deal with either security or processing, but E-GENMR provides a solution for both problems while also solving the interoperability issue.
The prototype model provides interoperability: it accepts user queries in any of the syntaxes defined by RDBMSs such as SQL Server, DB2, Oracle and MySQL, hence it is called generalized, and with the help of the model's compiler module the queries are converted into MapReduce form. MapReduce is a splendid solution for processing large amounts of data, as these codes process data in parallel. The client's data is stored in encrypted form and the queries run on this encrypted data, hence the system also ensures the security aspect.
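As a simple illustration of what this interoperability involves (the model's actual compiler module is described in section 3 and Table 4), the hypothetical C# snippet below sketches how a row-limiting clause written in different RDBMS dialects could be normalized into one canonical form before the query is translated to MapReduce. The method name, regular expressions and the canonical LIMIT form are assumptions made only for illustration.

    using System.Text.RegularExpressions;

    // Hypothetical pre-processing step: rewrite vendor-specific row-limit syntax
    // (SQL Server TOP, DB2/Oracle 12c FETCH FIRST, MySQL LIMIT) into one canonical
    // LIMIT form before handing the query to the query-to-MapReduce compiler.
    static class DialectNormalizer
    {
        public static string NormalizeRowLimit(string query)
        {
            // SQL Server: "SELECT TOP 10 ..." -> strip TOP and append "LIMIT 10"
            var top = Regex.Match(query, @"\bTOP\s+(\d+)\b", RegexOptions.IgnoreCase);
            if (top.Success)
                return Regex.Replace(query, @"\bTOP\s+\d+\s*", "", RegexOptions.IgnoreCase).TrimEnd()
                       + " LIMIT " + top.Groups[1].Value;

            // DB2 / Oracle 12c: "... FETCH FIRST 10 ROWS ONLY" -> append "LIMIT 10"
            var fetch = Regex.Match(query, @"FETCH\s+FIRST\s+(\d+)\s+ROWS\s+ONLY",
                                    RegexOptions.IgnoreCase);
            if (fetch.Success)
                return Regex.Replace(query, @"FETCH\s+FIRST\s+\d+\s+ROWS\s+ONLY", "",
                                     RegexOptions.IgnoreCase).TrimEnd()
                       + " LIMIT " + fetch.Groups[1].Value;

            return query;   // MySQL-style LIMIT is already the canonical form here
        }
    }

For example, "SELECT TOP 5 name FROM Emp" and "SELECT name FROM Emp FETCH FIRST 5 ROWS ONLY" would both be normalized to the same LIMIT form before compilation into key-value MapReduce code.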
A five-layer architecture for a cloud database management system has been proposed in (Mongia et al., 2013; Alam and Shakil, 2013). In the sections below, a detailed working description of each layer is provided in the form of algorithms and is briefly summarized in Fig. 2. Figures 3 and 4 describe the architecture of the proposed generalized model: the data storage phase and the data processing phase.

External Layer: User Interface
The external layer is the layer closest to the user and provides the interfacing. The main function of this layer is to provide transparency and to manage different types of users. Users send their queries in the syntax of SQL Server, DB2, Oracle or MySQL. Existing data is pre-partitioned horizontally, indexed with the help of double hashing and stored in encrypted form on a number of datanodes of the racks to allow parallel and distributed processing, as explained by Algorithm 1. The double hashing technique is used for storing data efficiently, as it takes only O(1) time to search for any data due to the hash indexing. The cloud repository consists of big data centers, which in turn consist of many racks, where data is stored either within one rack or across racks. Algorithm 1 also describes the way data is stored to achieve inter-rack or intra-rack communication. Tables 2 and 3 list the symbols and assumptions used throughout the paper.

Algorithm 1 (data storage phase):
6. Until all the data is placed at the datanodes of the rack.
7. Case 2: Inter Rack.
8. The total number of datanodes required over all the racks will be the same, i.e., total datanodes = z/q.
9. for i = 1 to n        // for n racks
10. for j = 1 to m       // for m datanodes
11. Partition the data so that total datanodes = z/q. For indexing, double hashing is used: h_i(data) = (h1(data) + f(i)) mod datanode_size, where f(i) = i * hash2(data). Store the data in encrypted form with the help of the AES algorithm.
12. Until all the data is placed at the datanodes of the racks.
13. End of for loop.
14. End of for loop.
In Algorithm 1, intra-rack communication is covered by lines 2-6 and inter-rack communication by lines 7-12. In intra-rack communication, data is stored at the datanodes of the same rack, while in inter-rack communication data can be stored at the datanodes of any rack (line 9). The double hashing technique is used for indexing the data on the datanodes, since indexing enables efficient searching; after indexing, the data is stored in encrypted form using a symmetric-key algorithm (AES) for security reasons. At the external layer, when the data is partitioned, other clients cannot infer anything about it, as they can only see its encrypted form.
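Since the exact hash functions used by E-GENMR are not fixed above, the following is only a minimal C# sketch of the placement step of Algorithm 1: a record is probed onto a datanode slot using double hashing and only its ciphertext is kept. The hash functions H1 and H2 and the encrypt delegate are illustrative assumptions; datanode_size should be prime so that the probe sequence can reach every slot.

    using System;

    // Sketch of the Algorithm 1 placement step: double-hash probing over datanode
    // slots, storing only the encrypted record. H1/H2 are illustrative choices;
    // encrypt() stands in for the AES step of the model.
    class DoubleHashPlacement
    {
        private readonly byte[][] slots;                   // one slot per datanode position

        public DoubleHashPlacement(int datanodeSize)       // choose a prime size
        {
            slots = new byte[datanodeSize][];
        }

        private int H1(string key) => (key.GetHashCode() & 0x7FFFFFFF) % slots.Length;

        private int H2(string key)                         // second, independent hash (FNV-1a)
        {
            uint h = 2166136261u;
            foreach (char c in key) h = (h ^ c) * 16777619u;
            return 1 + (int)(h % (uint)(slots.Length - 1));
        }

        // Probe sequence: h_i(key) = (h1(key) + i * h2(key)) mod datanode_size
        public int Store(string key, string record, Func<string, byte[]> encrypt)
        {
            int index = H1(key);
            int step = H2(key);
            for (int i = 0; i < slots.Length; i++)
            {
                if (slots[index] == null)
                {
                    slots[index] = encrypt(record);        // only ciphertext is kept at rest
                    return index;                          // expected O(1) probes
                }
                index = (index + step) % slots.Length;     // next probe position
            }
            throw new InvalidOperationException("All datanode slots are full");
        }
        // A lookup recomputes the same probe sequence for the key and stops at the first
        // empty slot, which is why the expected search cost stays O(1).
    }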

Conceptual Middleware Layer: Any Database to MapReduce Compiler
This layer provides interoperability, which means it hides the presence of different databases from the users and operates irrespective of the underlying databases. Users write their queries in the database language they are comfortable with. To date, users are comfortable with RDBMS tools, but an RDBMS is not a viable solution for processing large amounts of data, and users are not yet familiar with newer technologies such as the MapReduce programming paradigm, Hive, Pig and HBase, which can process such data. This layer therefore converts the users' queries into the NOSQL MapReduce key-value form. The compiler takes input queries from the user interface at the external layer and converts them into MapReduce code. A query takes the pre-partitioned data from the text files stored at the datanodes of the racks, and on the basis of the query the data is partitioned again; the data obtained after this partitioning is called intermediate data. Table 4 details the queries considered in this study and the corresponding key-value pairs defined by the model's compiler.
The prototype model can handle queries with more than one filter and also takes up complex queries such as ORDER BY, GROUP BY and JOIN over more than one data table. At the conceptual middleware and conceptual layers the data is processed; the MapReduce key-value pairs generated by this process are in encrypted form, so no client can observe the processing of another client.
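Table 4 fixes the exact key-value pairs that the compiler emits; the C# snippet below is therefore only an assumed, simplified shape of a generated map function for a single-filter SELECT query, with a hypothetical comma-separated record layout. It illustrates how each record of the pre-partitioned file becomes an intermediate key-value pair; in E-GENMR these pairs are additionally kept in encrypted form.

    using System.Collections.Generic;

    // Assumed shape of a compiler-generated map function for a query such as
    // "SELECT * FROM Emp WHERE dept = 'Sales'": the filter column value becomes
    // the key and the matching record becomes the value. The record layout and
    // table are hypothetical examples.
    static class SelectMapper
    {
        public static IEnumerable<KeyValuePair<string, string>> Map(
            IEnumerable<string> partitionLines, int filterColumn, string filterValue)
        {
            foreach (string line in partitionLines)
            {
                string[] fields = line.Split(',');          // one comma-separated record per line
                if (fields.Length > filterColumn && fields[filterColumn] == filterValue)
                    yield return new KeyValuePair<string, string>(fields[filterColumn], line);
            }
        }
    }

A multi-filter query would simply add further conditions to the if-statement, while GROUP BY and JOIN queries would choose the grouping or join attribute as the emitted key.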

Conceptual Layer: Data Processing
This layer deals with the actual processing of the data, i.e., the actual processing of the key-value pairs. The reducer is applied to the partitioned intermediate data. Table 4 lists the queries for data processing; with the help of the conceptual middleware layer's compiler these queries are converted into key-value pairs. At the conceptual layer, the reducer program takes the key-value pairs and produces the results accordingly, as described by Algorithm 2.
Algorithm 2 details the reducer function, which returns the result to the user.
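As Algorithm 2 itself is not reproduced above, the C# snippet below gives only a generic, assumed shape of the reducer stage it describes: the intermediate pairs produced by the map step are grouped by key and each group is collapsed into one result value returned to the user. A COUNT-style aggregation is used purely as an example; other queries (SUM, JOIN, the sort step of ORDER BY) would plug in a different reduce body.

    using System.Collections.Generic;
    using System.Linq;

    // Generic reducer shape: group intermediate key-value pairs by key and reduce
    // each group to a single output value (here, a count of records per key).
    static class GenericReducer
    {
        public static IDictionary<string, int> Reduce(
            IEnumerable<KeyValuePair<string, string>> intermediatePairs)
        {
            return intermediatePairs
                .GroupBy(pair => pair.Key)                        // shuffle/group step
                .ToDictionary(group => group.Key,                 // reduce step
                              group => group.Count());
        }
    }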

Physical Middleware Layer and Physical Layer: Mapper Reducer Placement and Storage Issues
Two important aspects are considered at these layers: storage issues such as inter-rack or intra-rack communication, shown in Fig. 5, and the placement of the mapper and reducer functions over encrypted data. The physical middleware layer provides interoperability, while the main storage-related issues are handled at the physical layer.
Initially, at the physical layer, the data is pre-partitioned and stored in encrypted form as per Algorithm 1. At this layer the data is kept in encrypted form so that none of the clients can see each other's data.
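Since the model only requires that records at the physical layer exist as ciphertext, the sketch below shows one straightforward way the AES step of Algorithm 1 could be realized with the standard .NET cryptography API. The key length (128/192/256 bits), the default CBC mode with PKCS7 padding and the choice to prepend the IV to the ciphertext are assumptions for illustration; the paper does not fix these details.

    using System;
    using System.Security.Cryptography;
    using System.Text;

    // Illustrative AES encryption/decryption for records stored at the physical layer.
    // The randomly generated IV is prepended to the ciphertext so the record can later
    // be decrypted with the same key; key management is outside the scope of this sketch.
    static class RecordCipher
    {
        public static byte[] Encrypt(string record, byte[] key)   // key: 16, 24 or 32 bytes
        {
            using (Aes aes = Aes.Create())
            {
                aes.Key = key;
                aes.GenerateIV();
                using (ICryptoTransform enc = aes.CreateEncryptor())
                {
                    byte[] plain = Encoding.UTF8.GetBytes(record);
                    byte[] cipher = enc.TransformFinalBlock(plain, 0, plain.Length);
                    byte[] stored = new byte[aes.IV.Length + cipher.Length];
                    Buffer.BlockCopy(aes.IV, 0, stored, 0, aes.IV.Length);
                    Buffer.BlockCopy(cipher, 0, stored, aes.IV.Length, cipher.Length);
                    return stored;                                 // layout: IV || ciphertext
                }
            }
        }

        public static string Decrypt(byte[] stored, byte[] key)
        {
            using (Aes aes = Aes.Create())
            {
                aes.Key = key;
                byte[] iv = new byte[aes.BlockSize / 8];           // 16-byte AES block/IV
                Buffer.BlockCopy(stored, 0, iv, 0, iv.Length);
                aes.IV = iv;
                using (ICryptoTransform dec = aes.CreateDecryptor())
                {
                    byte[] plain = dec.TransformFinalBlock(stored, iv.Length,
                                                           stored.Length - iv.Length);
                    return Encoding.UTF8.GetString(plain);
                }
            }
        }
    }

With these defaults, the same key together with the prepended IV is sufficient to recover the plaintext, while clients without the key see only ciphertext on the datanodes.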

Results and Performance Analysis
In our research, we carried out an experimental analysis on data of up to 512 GB. In addition, we compared E-GENMR with two SQL-based database systems, MySQL and DB2, and four NOSQL MapReduce-based systems: HadoopDB, SQLMR, PIG and HIVE. Since these MapReduce systems only handle read-only operations, our experiments compare performance for read-only operations, including range, JOIN and ORDER BY queries.
Two traditional databases, MySQL and DB2, have been used to show that such databases do not provide scalability. HIVE and PIG are considered suitable choices for testing such situations. HIVE offers an SQL-like language: users send their queries in SQL form and, with the help of the Hadoop framework, the queries are internally converted into MapReduce code before the results are returned. Similarly, PIG is a scripting language in which users write their queries as scripts, which are internally converted into MapReduce form by the Hadoop framework before the results are returned. HadoopDB and SQLMR are hybrid systems that combine MapReduce and DBMS technologies for analytical workloads. The prototype model E-GENMR is implemented in C# with the help of the .NET 2012 framework.
The experiment contains two parts. The first part shows the scalability with respect to data size; in the second part, E-GENMR is compared on encrypted and unencrypted data. Table 5 shows the system requirements considered in this study.

Performance Comparison with MYSQL, DB2, HadoopDB, SQLMR, HIVE and PIG on Different Data Sizes
This set of experiments compares the scalability with respect to increasing data size for two queries, i.e., SELECT and JOIN with ORDER BY. The data size varies up to 512 GB. Tables 6 and 7 show the execution times of MySQL, DB2, HadoopDB, PIG, HIVE, SQLMR and E-GENMR for different data sizes for the SELECT and JOIN with ORDER BY queries. Figures 5 and 6 give the graphical representation of the performance comparison for the SELECT query, and Fig. 7 gives the graphical representation for the JOIN with ORDER BY query. The SQL SELECT and JOIN with ORDER BY queries are as follows:

Figure 6 shows the graphical representation for the small data sizes, while Fig. 7 shows it for the large data sizes. For small data sizes, as shown in Fig. 6, DB2 performed better than MySQL because it uses primary memory for processing. The SQL-based systems outperformed the MapReduce-based systems for data sizes up to 2 GB: the SQL-based systems do not provide parallelism, whereas the MapReduce-based systems spend initial time setting up the parallel execution. When the data size increased beyond 4 GB, the MapReduce-based systems outperformed them.
In Fig. 7, when the data size reaches 512 GB, the RDBMSs MySQL and DB2 perform very poorly, while all the other systems perform better because of their parallel processing. The execution time of the MapReduce-based system HadoopDB increases sharply with the data set size because of the higher input workload caused by the pre-partitioning done in the HadoopDB system. SQLMR performs well because its effective data partitioning with B+ tree indexing, low-overhead file construction and optimized rack-awareness algorithm produce better results compared to HadoopDB, HIVE and PIG. However, E-GENMR consistently performed better than all the other MapReduce-based systems because of various optimizations in terms of hash-based pre-partitioning, double-hash indexing and flexibility in mapper and reducer placement, described in section 3. As shown in Fig. 8, JOIN with ORDER BY queries take 2.70 times longer than SELECT queries, as additional time is required for the sort operation along with the joining of two tables.
In general, for SELECT queries the E-GENMR model is 4.17 times faster than HadoopDB, 1.43 times faster than PIG, 1.19 times faster than HIVE and 1.11 times faster than SQLMR. For JOIN with ORDER BY queries, the E-GENMR model is 1.28 times faster than SQLMR, 1.59 times faster than HIVE and 6.38 times faster than HadoopDB.

E-GENMR Data Processing with Encrypted and Unencrypted Data
This set of experiments analyzed E-GENMR by processing 21 queries on a data file of 16 GB, as shown in Table 8 and Fig. 9. These queries were applied to both encrypted and unencrypted data. It has been observed that the data processing time for encrypted data is 1.08 times that for unencrypted data, which is acceptable because the advantages of encryption, namely the security it provides to the system, outweigh this overhead.

Conclusion
As users are comfortable with relational databases, in this study a model has been implemented which takes users' queries and, through the model's compiler, converts them into MapReduce key-value form. It is easier to process large amounts of data with MapReduce code than with traditional databases. The model has been implemented with pre-partitioning and indexing using double hash functions, and it is also flexible in terms of choosing the number of mappers and reducers for parallel processing. E-GENMR is evaluated and compared with the latest technologies in the field of cloud and Big Data, i.e., HadoopDB, SQLMR, Pig and Hive, with respect to increasing data size. It has been observed from the conducted experiments that the prototype model E-GENMR improves query processing time with improvement ratios of 4.17 against HadoopDB, 1.43 against PIG, 1.19 against HIVE and 1.11 against SQLMR for the SELECT query; for JOIN with ORDER BY queries, the E-GENMR model is 1.28 times faster than SQLMR, 1.59 times faster than HIVE and 6.38 times faster than HadoopDB. It has also been observed that the data processing time for encrypted data is 1.08 times that for unencrypted data, which is acceptable because the advantages of encryption, i.e., the security it provides to the system, outweigh this overhead.
In future work, artificial intelligence based indexing techniques could be applied in place of the conventional indexing techniques used here.