Parallel Two-Dimensional Quicksort Algorithm (PTSA)

: In this research, a Parallel Two-Dimensional Sorting Algorithm (PTSA) is presented that has better performance than the classical Quicksort, to sort a data vector of size n = r (rows) × c(columns). PTSA algorithm divides the input vector into n/r sub-vectors, which represents tow-dimensional vector of Slave Processor Elements (VPE ), the maximum number of VPE for parallel sorting is equal to r×c, VPE just the read,and write operations. The number of Master Processors (MP) which do the sort operation is equal to c, The time needed for PTSA algorithm is reduced by q (n/r log n/r) with respect to the time needed by Quicksort q (n log n) to sort the same vector. Simulation results show that the efficiency of sorting using PTSA algorithm is increased and the complexity is reduced significantly compared with classical Quicksort.


INTRODUCTION
Sorting is an integral component of most database management systems (DBMSs) and data stream management systems (DSMSs), its efficiency can influence drastically the overall system performance.
To speed up the performance of database system, parallelism is applied to the data administration operations.
We all know that Quicksort is one of the fastest algorithms for sorting, in spite of its slow running time in the worst case that is θ(n 2 ) on an input array of n numbers. However, its running time in the average case is θ(n log(n)). Given the importance of parallelism in improving the running time of Quicksort many researches have been done in this area. Previous results show that the parallel Quicksort outperform the sample sort algorithm. In [11] , the experiments show that the speed of the parallel Quicksort is more than six units higher than the speed of sample sort for three processors of the Enterprise 1000. While, for one processor, parallel Quicksort achieved 15% faster execution times than the traditional Quicksort. This is due to its low memory requirement and that parallel Quicksort could sort data sets twice the size that of traditional Quicksort could under the same system memory restrictions.
In this research, we present a parallel two dimensional sorting algorithm that outperform traditional Quicksort and many of the previous parallel Quicksort algorithms in running time and in the number of comparisons. The idea is dividing the vector data into number of processors to sort data in parallel. Due to that, the transactions of comparison and data transfer is reduced and the complexity time to sort an array of size n is reduced to O(n/r (logn/r)). In addition, the complexity time of this algorithm depends on the number of processors, where many of the previous work in parallel Quicksort have higher complexity time and generally the complexity time is dependent on the number of processors used in parallel and in some cases is dependent on the number of partitions.
In [8] , the cost of parallel Quicksort algorithm is measured given the pair (t,p) where t is the time needed and p is the number of processors that are both dependent on the input size. However, to sort n words all of length l over an alphabet $\Sigma$ of size O(n) it requires only O(log(nl)/log(nl/p)×nl/p) times and using p<nl processors.

RELATED WORKS
An early advancement in parallel sorting comes in 1968 when Batcher discovered the elegant θ(log 2 n) depth bitoni sorting networks [3] .
Sorting is the most studied problem in computer science for two reasons: First, it is used as a substep in many applications. Second, it is a simple combinatorial problem with many interesting and diverse solutions. Parallel algorithms for sorting have been studied since at least the 1960's. Batcher bitonic sorting technique provides a parallel algorithm for sorting n numbers in θ(log 2 n) time with n processers. Bitonic sort used parallelism to increase the chip speed by treating several bits in parallel at each step.
The idea on which bitonic sort depends is the sorting and merging operations; given two inputs A and B, the algorithm must define the two correspondent output in a manner that if (A<B) so, A = L that represents the smallest element and B = H that represents the highest element. If (B<A ) then B = L and A = H.
This procedure can be applied to two input lists to obtain only one-sorted lists. The merging operation between the two lists is done in two stages: the first is the odd stage in which all the elements in both lists are compared with the odd index and arranged given H, L as described above. In the same manner, in the second stage (the even stage) all the elements in both lists are compared with even index. This means, given two lists a 1 , a 2 ,.........a s , b 1 ,b 2 ,.....b t the algorithm must produce a sorted list C 1 , C 2 ,... C t+s . Figure.1 represent a simple description of how bitonic sort work.
The main advantage of bitonic sort is that it can be extended to compare 4 by 4 elements in each step. Using bitonic sort, 2 p words can be sorted in only 1/2 p (p+1) steps. In the literature, many parallel sorting Fig. 1: Bitonic sort algorithms have been proposed like radixsort and Quicksort [5] and a variant of Quicksort is called columnsort [1] . In [11] , a parallel Quicksort algorithm is presented and implemented that works in stages. In the first stage, a parallel partition of data is done in which the array is divided in k blocks of fixed dimension B. A number of processors p is assigned to sort these blocks.
The algorithm works as follow: each processor sorts in parallel two blocks one from the head and the other from the tail of the array. This procedure is repeated until most of the blocks are sorted. In the second stage, it assigns one processor to sort the remaining unsorted blocks. The third stage is the process partition in which the array is divided into two partitions and the processors are divided into two groups, each group is assigned to a partition. In the Final stage, it uses Quicksort in each group of processors to sort its partition. If one processor finishes sorting its blocks it helps the other processors in the same group by a consecutive operation of push and pop.
The time complexity of the algorithm depends on the size of block B and on the number of processors P and does not depend on the distribution of keys. In [10] , comparisons of three parallel sorting algorithms are: odd-even transition Sort [2] , Parallel Rank Sort, Parallel Merge Sort is presented. The odd-even transposition algorithm [2] , works as follow: first of all the data is divided into (n*p) partitions, where p is the number of processors. These partitions are distributed to all the processors. Each processor sorts sequentially its part. Then, the merging stage consists of two substages: odd stage and even stage. In the even stage, the even processors (i) communicate with the odd processors (i+1) to merge their data in a manner that higher data values are kept in the higher number of processors. The lower values are kept in the lower number of processors. The same procedure is done in the odd stage where the odd processors (i) communicate with the even processors (i-1) to merge their data in the same previous manner. The whole data will be sorted in at most p stages and the time complexity is O(bn 2 ) where b = 1/2p 2 . That means the time will be reduced to 1/p 2 . The second algorithm compared was Parallel Merge Sort, this algorithm is based on the Divide and Conquer Algorithm, but here we have the concept of slaves and master processors. The initial array is divided into (n/p) elements. Each processor denoted as slave sorts its part in parallel and then these sorted subarrays will be returned to the master processor that sorts the entire final array. The time complexity of this algorithm is O(n/p log(n/p)). The third algorithm compared and implemented in [10] , is Parallel Rank Sort Algorithm that works as follow: the basic idea is to determine the rank of the elements. The data is divided in different partitions, each slave processor is assigned a data of size (n/p) elements where n is the size of the input data and p is the number of processors. Each slave processor is responsible to rank its elements and return the ranks to the master processor, which in turn is responsible of constructing the whole sorted list. The time complexity is O(n 2 ) = 1/p. Parallel Two-Dimensional Sorting Algorithm PTSA: Quicksort [7] is a sequential sorting algorithm that is widely believed to be the fastest comparison-based sequential sorting algorithm. It is a recursive algorithm that uses the Divide and Conquer method to sort all keys. The standard Quicksort first picks a key from the list, the pivot and finds its position in the list where the key should be placed. This is done by walking through the array of keys from one side to the other. When doing this, all other keys are swapped into two parts in the memory: first the keys less than or equal to the pivot are placed in one part of the pivot and the keys larger than the pivot are placed in the other part of the pivot.

PTSA Algorithm Description:
The main idea of PTSA algorithm is to divide the input into number of rows and then to use Quicksort in parallel manner to sort theses rows and obtain the following: • Sorted rows in parallel, where each row has its own processors • Sorted columns in parallel, where each column has its own processors PTSA passes through three stages: distribution data is the first stage, parallel sorting rows is the second stage and parallel sorting columns is the third stage.
Distribution data stage: In this stage, given an input array A to a matrix of processor Elements of size n, where, n = rows (r)×columns (c), A will be divided into a r×c, where r is an odd number greater than or equal to 3 and c is greater than or equal to r, where c = n/r. The first row is assigned to the first processor (Mp 0 ), the second row to the second processor (Mp 1 ) and so on to till the last row (Mp r-1) . Next, the first column is assigned to the first processor (Mp 0 ), second column (Mp 1 ) to the second processor and so on to till the last column (Mp c-1 ). Parallel sorting rows stage: In this stage, the processors MP 0 , MP 1, MPr -1 , are assigned to the data rows respectively to sort them in parallel fashion. The above steps are repeated until it ends-up with one middle row. Finally, the middle row is sorted using quicksort algorithm. Figure. 2, illustrates the three stages mentioned above.
PTSA algorithm analysis: PTSA as a two dimensional sorting algorithms is based on Quicksort algorithm which is used to sort rows and columns in parallel.
However, it is known that Quicksort performs better when the input data is increased, but here in PTSA algorithm the additional time required by Quicksort to sort a number of data inputs (which is equal to n) is reduced to n/r. In each iteration tow values (maximum, minimum) are transferred at the same time, while in the Divide and Conquer the transfer is made once only either the min or the max. The algorithm analyses steps are as follows: Step 0: It assumed that: i. Data Vector (A) of size n ii. Cl is a number of clock cycles required to send one data item from one master processor to another Step 1: The required clock cycle (S r1 ) to divide the data vector (DV) into rows and columns is computed and each row is sent to one of the row processor elements: where, cl is a number of clock cycle required to send one data item from one processor to another.
Step 2: The required clock cycles (S r2 ) to sort all rows in parallel is computed using quick sort algorithm: ( ) r 2 S n / r log(n / r) * j = , where, j is the number of clock cycles needed to process one iteration of traditional quicksort algorithm. Throughout this work, j is assumed to be equal to 10* cl, where cl = 1.
Step 3: The required clock cycles (S r3 ) to distribute all sorted rows item to the processors in VPE is computed as: r3 S r * (c 1) * cl. = − Step 4: The required clock cycles (S r4 ) to sort in parallel r items using quicksort algorithm is compute an shifted as: r 4 S (r log r) * j. = Step 5: Compute the required clock cycles (Sr5) to take maximum and minimum items then shift: r5 S cl. = Step 6: Steps 4, 5 are repeated c/2 times.
Step 8: The required clock cycles (S r8 ) to sort the middle row processors using Quicksort algorithm is computed as: r8 S (n / r log(n / r)) * j. = Step 9: The required lock cycles (S r9 ) to send in parallel middle row processors values to the suitable location is computed as: r9 S cl. = Step 10: The mathematical model to calculate the total number of clock cycles to sort input data using PTSA is:- Complexity Analysis: The time complexity of PTSA algorithm is calculated for the three stages together. The time needed by PTSA algorithm to sort an array of n elements denoted as T tot . T Sor refers to the time needed by the Quicksort to sort one subarray. T mer is the time needed to put maximum and minimum items in their locations in the output vector. T par is the time needed to distribute data to VPE of size r × c.

Simulation results:
The idea behind the work in this simulation is to compare the time needed for Quicksort (to sort an array of n items) with the time needed by PTSA (to sort the same array) by dividing it into r×c VPE.In the case of sorting an array of size n, f(n) becomes the count function. That is; f(n) gives the number of basic operation (clock cycles) done by the PTSA. Suppose that the time spent in each clock cycle is t, then the total time it would take to execute f(n) is tf(n). Clearly, the constant time t depends on the speed of the computer and therefore varies from computer to computer. However, f(n), the number of clock cycles, is the same for each computer. If we know how the function f(n) grows as the size of the problem grows,we can determine the efficiency of the PTSA with respect to Quicksort algorithm. We compare its results of the traditional Quicksort on the same standalone PC with Pentium III processor running at 800Mhz under windows XP platform using Microsoft Excel.  The dynamic data size and fixed number of rows: For different data size and fixed number of rows = 99, suppose that a computer can executes 1 billion operations per second, Table. 1 and Figure. 3 shows the time required to execute PTSA and the time required to execute traditional Quicksort.

The fixed data size and dynamic number of rows:
For fixed data size = 10 7 and different number of rows, suppose that a computer can executes 1 billion operations per second, Table 2 and Figure 4 shows the time that the computer takes to execute PTSA and Quicksort algorithm.
The PTSA accelerate: Analyzing the results presented in Table. 3 it can be seen the accelerate (Ac = Quicksort time on one processor / PTSA time on c processors) is around (1.5-1.7).
Example 1: Figure. 5, shows an example of an array of size 9, At the end we obtain a sorted array S, The initial values for the counters are zero ( i = 0, j = 0 ) : Step 1: The vector is distributed into three rows and three columns (to the VPE ).  Step 2: The first row is sent to the first MP (p 0 ) and the second row is sent to the second MP (p 1 ) and the third row is sent to the third MP (p 2 ).
Step 3: Sorting is carried out in parallel in the same order for all rows.
Step 4: The first column is sent to the first MP column (p 0 ) and the second column is sent to the second MP column (p 1 ) and third column is sent to the third MP column (p 2 ) Step 5: Sorting is carried out in parallel in the same order for all columns. Step 6: In the location VPE (p 11) we get the minimum element and in location VPE (p 33 ) we get the maximum element for giving vector.
Step 7: In parallel put the value at VPE (p 11 ) in the first location (minimum) and the value at VPE (p 33 ) (maximum) in last location for output vector.
Step 8: The first row of processor elements is shifted one position to the left and third row of processor elements shifted one position to the right.
Step 11: Sort the middle row and the result send it in parallel to the output vector (S) locations from s i to s J .

CONCLUSION
In this research a new algorithm is proposed for sorting an array of size n. The idea of this algorithm is to use parallelism to reduce the running time. We have obtained a scalable system, in which its performance is improved when the number of input data is increased. The reduction becomes significant and that is confirmed by our simulation results where the reduction in number of comparisons respectively to the traditional Quicksort is about 55-65%.