Real Time Density-Based Clustering (RTDBC) Algorithm for Big Data

Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a well-known density-based clustering algorithm, is an advanced data clustering method with applications in numerous fields such as satellite images, X-ray crystallography, and anomaly detection in temperature data. However, its $O(n^2)$ run time complexity poses a major challenge. In this research paper, we propose a new algorithm called Real Time Density-Based Clustering (RTDBC) to minimize the problems in DBSCAN. In the proposed algorithm, objects are assigned to clusters using label representatives rather than by propagating labels directly, which reduces the label propagation time considerably. In contrast to DBSCAN, RTDBC produces fast results that are refined continuously during runtime; additionally, users are permitted to suspend the algorithm to examine the result and to continue it to obtain better results.


Introduction
DBSCAN is a familiar density-based data clustering algorithm introduced by Ester et al. (1996). It offers a fast solution for complicated clusters, requires only one input parameter, and suggests a value of that parameter to the user. On huge databases it was reported to be 1900 times faster, with improved final results. DBSCAN identifies clusters of arbitrary shape and also finds outliers. A cluster is a set of connected dense objects separated from other clusters by low-density regions, where an object is dense if it has more than p objects inside its ε-radius neighborhood. DBSCAN is an important clustering algorithm with various applications and extensions (Brecheisen et al., 2004; Gan and Tao, 2015), such as satellite images, X-ray crystallography, anomaly detection in temperature data, astronomy (Settles, 2009) and neuroscience (Mai et al., 2012). But its real weakness is complex data sets in which clusters are located too close to each other, even if they have different densities.
During the cluster expansion process, DBSCAN (Ester et al., 1996) executes ε-radius neighborhood queries for all objects in order to group the data. Thus, its cost has two sources:
• The range query process for n objects, which costs $O(d_x n^2)$, where $d_x$ is the worst-case complexity of the distance function and n is the number of objects
• The label propagation process, which costs $O(d_x n^2)$: the time complexity of assigning labels to objects
These two sources quickly become bottlenecks as the data volume increases, and various works have aimed to improve DBSCAN. These techniques accelerate DBSCAN by addressing one of the two sources, but they do so without exploiting information in the data to improve performance. A filter technique (Fast Lower Bound) reduces the number of true distance calculations, but it increases the label propagation time (the second source) in order to maintain the order of the initial list. The Grid Based Technique (GBT) divides the data space into grid cells and saves run time by processing each cell locally.
None of these techniques makes substantial use of the information in the data; they incur excessive distance computations, which limits their efficiency. Hence, we propose a new algorithm called Real Time Density-Based Clustering (RTDBC) to minimize the problems in DBSCAN.
Compared to previous techniques, RTDBC always maintains accurate results: it learns the current structure of the data and then selects only a small subset of objects for refinement in each iteration. It also replaces direct label propagation by assigning objects to clusters through label representatives. Hence the label propagation time is decreased considerably.
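One way to picture assigning objects to clusters through label representatives, instead of relabeling every object whenever two groups join, is a union-find (disjoint-set) structure. The sketch below is only an illustration under that assumption; it is not taken from the RTDBC pseudocode, and the class name `LabelRepresentatives` and its methods are hypothetical.

```python
class LabelRepresentatives:
    """Disjoint-set forest: each object points towards a representative label."""

    def __init__(self, n):
        # Every object starts as its own representative.
        self.parent = list(range(n))

    def find(self, x):
        # Follow parent links to the representative, then compress the path.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def merge(self, a, b):
        # Joining two groups rewrites one representative, not every member.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


labels = LabelRepresentatives(6)
labels.merge(0, 1)
labels.merge(2, 3)
labels.merge(1, 3)  # the two groups become one cluster
same_cluster = labels.find(0) == labels.find(2)
separate = labels.find(4) != labels.find(0)
```

With this layout a merge costs a single pointer update, whereas direct propagation would touch every object of the absorbed group.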
Existing approaches work in a batch scheme and do not permit user interaction during execution (Brecheisen et al., 2004; Gan and Tao, 2015). In contrast, an anytime algorithm (Zhou et al., 2000) rapidly generates an approximate result and refines it continuously during runtime; it permits users to suspend it to verify the result and to resume it until a satisfactory result is obtained. Anytime algorithms such as RTDBC are therefore a suitable method and are broadly applied in various areas (Zhou et al., 2000). But all existing methods are designed mainly for small datasets because of their high time and space complexity. Hence, the proposed RTDBC algorithm aims to handle very large datasets.
Very few algorithms work on complex data such as images and graphs (Brecheisen et al., 2004), and those face scalability problems. The proposed RTDBC algorithm works effectively on very large complex data and reduces the high time and space complexity.
In this research paper, we propose the new RTDBC algorithm for clustering very large complex datasets, which addresses all the problems described above. The proposed RTDBC algorithm has the following advantages:
• RTDBC dynamically learns information about the data and applies it to decrease both the label propagation time and the number of range queries. It therefore speeds up the runtime by orders of magnitude compared with DBSCAN and other approaches
• RTDBC has a very low initial runtime before producing its first results, and it supports user interaction so that good results can be obtained at an arbitrary time
• RTDBC is useful for clustering very large complex datasets

Density based Data Clustering Algorithm
Definition 1: ε-Neighborhood. Figure 1 describes the ε-neighborhood of an object, that is, the objects within radius ε of it. The ε-neighborhood of an object p is represented by $N_\varepsilon(p)$, where d denotes the distance function:

$N_\varepsilon(p) = \{ q \mid d(p, q) \le \varepsilon \}$

Definition 2: High Density

Density Reachability: Figure 4 describes direct density reachability, which is asymmetric: an object q is directly density-reachable from an object p if p is a core object and q is in p's ε-neighborhood.

Density Connectivity:

If an object o is not yet classified, then:
    if o is a core object, collect all objects density-reachable from o and assign them to a new cluster;
    else assign o to NOISE.

DBSCAN draws an arbitrary unlabelled object p and executes the range query $N_\varepsilon(p)$. If p is a core object, a label is assigned to p and to all objects that are density-connected to p.
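The definitions and the DBSCAN procedure can be sketched in a few lines of Python. This is a minimal brute-force illustration of classic DBSCAN, not the authors' implementation; the function names are hypothetical, and the sketch assumes that an object is a core object when its ε-neighborhood (including itself) holds at least `mu` objects.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i]."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, mu):
    labels = [None] * len(points)  # None means "not yet classified"
    cluster = 0
    for o in range(len(points)):
        if labels[o] is not None:
            continue
        neighbors = region_query(points, o, eps)
        if len(neighbors) < mu:  # assumption: core means at least mu neighbors
            labels[o] = NOISE
            continue
        labels[o] = cluster
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:  # border object reached from a core object
                labels[q] = cluster
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= mu:  # q is a core object: keep expanding
                seeds.extend(q_neighbors)
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
result = dbscan(points, eps=1.5, mu=3)
```

Each call to `region_query` scans all n objects, and up to n such calls are made, which is where the quadratic cost of the range query process comes from.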

Proposed Algorithm: RTDBC (Real Time Density Based Clustering)
The RTDBC algorithm is a solution for time-consuming problems in many areas such as object recognition (Kobayashi et al., 2013) and robotics (Zhou et al., 2000). The main idea of this algorithm is to produce approximate results immediately and to refine them continuously until acceptable results are obtained. The algorithm can also be interrupted while running so that the intermediate results can be analyzed, and then resumed to obtain acceptable solutions; this is shown in Fig. 5. Figure 6 shows the progress of different algorithms (A, B, C) and indicates that the performance of A (Zhou et al., 2000) is of better quality than that of the others (B, C).
Hence A is preferred in many works for obtaining a better solution, while C shows the worst performance.
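The suspend, inspect, and resume behaviour described here maps naturally onto a Python generator. The sketch below only illustrates that control flow under stated assumptions: `anytime_blocks` is a hypothetical helper, and a running maximum stands in for the intermediate clustering result.

```python
def anytime_blocks(items, block_size, update, state):
    """Process items block by block, yielding the state after each block."""
    for start in range(0, len(items), block_size):
        for item in items[start:start + block_size]:
            state = update(state, item)
        yield state  # the caller may inspect this result, stop, or resume


# Toy task: a running maximum plays the role of the intermediate result.
data = [3, 9, 1, 7, 12, 5, 8]
run = anytime_blocks(data, block_size=3, update=max, state=float("-inf"))

first = next(run)    # approximate result after the first block
second = next(run)   # the user resumes and obtains a refined result
final = second
for final in run:    # run to completion
    pass
```

Execution is suspended at every `yield`, so the caller decides after each block whether the intermediate result is good enough.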
The main approach of the proposed RTDBC algorithm is shown in Fig. 7.
As Fig. 7 illustrates, cluster C1 is determined completely once the two objects f and g are selected:
• C1 and C2 are the final clusters
• Two small clusters are formed inside C1 by a and b together with their neighbors
• Two more small clusters are formed inside C2 by d and e together with their neighbors
• c is an outlier
• a and b are density-connected because f is a core object
• The border object g makes it possible to identify h as a core object without performing a query, since h already has at least µ neighbors
• C2 is determined in the same way, with d and e density-connected to each other
Hence the proposed RTDBC obtains the same results as DBSCAN without executing all queries, which reduces the clustering time.
The pseudocode of the RTDBC algorithm, shown in Algorithm 2, is described in the following major steps.
Step 1: Designing the structure of an initial cluster
Step 2: Developing the cluster graph G = (V, E)
Step 3: Identifying the connected components
Step 4: Merging the connected components
Step 5: Verifying a stopping condition
Step 6: Choosing objects for queries
Step 7: Activating queries
Step 8: Updating the cluster graph

RTDBC queries objects in blocks of size α in step 1 and in blocks of size β in steps 6 and 7. Selecting α and β objects for the queries in every iteration of steps 1, 6 and 7 provides two main benefits:
• The quality of the intermediate clustering at the earlier steps is enhanced by the overlapping of primitive circles
• The overall overhead of the anytime scheme is reduced by using α = β

If RTDBC is run to the end, its final results are identical to those of DBSCAN.
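Steps 2 to 4 above can be pictured with a small graph routine. The following is a hedged sketch, not the authors' Algorithm 2: it assumes the cluster graph is given as a list of nodes plus the edges already known to connect them, and the function name `connected_components` and the node names are illustrative only.

```python
from collections import deque


def connected_components(nodes, yes_edges):
    """Group graph nodes that are linked by confirmed edges (breadth-first search)."""
    adjacency = {v: set() for v in nodes}
    for a, b in yes_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for v in nodes:
        if v in seen:
            continue
        seen.add(v)
        queue, component = deque([v]), []
        while queue:
            u = queue.popleft()
            component.append(u)
            for w in adjacency[u] - seen:
                seen.add(w)
                queue.append(w)
        components.append(sorted(component))
    return components


# Four small clusters; two confirmed connections merge them into two clusters.
nodes = ["a", "b", "d", "e"]
yes_edges = [("a", "b"), ("d", "e")]
merged = connected_components(nodes, yes_edges)
```

Each component found this way would become one node of the updated graph, which is why merging shrinks the graph from one iteration to the next.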
Here we analyze the worst-case complexity of the RTDBC algorithm. The real time complexities of RTDBC are much smaller than the worst-case bounds, as the experimental analysis shows. Therefore:
• $v_i \ll v \ll n$ and $b \ll b_{max}$, where $b_{max} = (n - v)/\beta$ is the maximum number of iterations in RTDBC
• The run time complexity of RTDBC is $O(n^2)$ in the worst case, but in practice it is much smaller than that of DBSCAN

So, RTDBC requires:
• Space for storing the graph G: $O(v^2 + vn + n + v + l\mu)$
• The space complexity of RTDBC in the worst case is $O(n^2)$, although in practice $v \ll n$
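As a rough numeric illustration of the iteration bound, the snippet below plugs in figures reported in the experimental section for DS1x300. Treating v as the average initial number of graph nodes is an assumption made here for illustration only.

```python
n = 2_783_567   # objects in DS1x300, as reported in the experiments
v = 3441.6      # average initial number of graph nodes reported there (assumed to be v)
beta = 512      # block size used in the experiments

# Upper bound on the number of iterations: b_max = (n - v) / beta
b_max = (n - v) / beta
```

Under these assumptions `b_max` is roughly 5430 iterations, the number needed if every remaining object had to be queried.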

Experimental Results
We create larger data sets from four 2D synthetic data sets, DS1 to DS4, which have 16 to 32 clusters and contain 3254 to 9554 randomly placed points. To analyze arbitrarily shaped clusters in RTDBC, 99 more objects are added for every object of the original DS1 data set (DS1x100). We also study the characteristics of RTDBC when the number of objects increases while the cluster structure is maintained.
We use α = β = 512, µ = 5 and ε = 1. The performance of RTDBC for an increasing number of objects is shown in Fig. 8 for DS1 to DS4. Figure 8b shows that RTDBC is significantly faster than DBSCAN. The denser the clusters, the higher the speedup factors; the explanation is found in Fig. 8c and 8d.
First, RTDBC uses very few queries compared to DBSCAN. On average it needs only 0.25% (6964.4) range queries for clustering DS1x300, which has 2783567 objects. Second, the initial number of graph nodes is very small: on average 0.12% (3441.6) for clustering DS1x300.
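The two percentages follow directly from the reported counts, as this short check shows. The variable names are illustrative.

```python
n_objects = 2_783_567    # objects in DS1x300
range_queries = 6964.4   # average number of range queries used by RTDBC
initial_nodes = 3441.6   # average initial number of graph nodes

query_share = 100 * range_queries / n_objects   # percent of objects queried
node_share = 100 * initial_nodes / n_objects    # percent of objects that are nodes
```

DBSCAN, by comparison, issues one range query per object, that is 100% of the objects.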
Moreover, the number of graph nodes is reduced considerably in every iteration during runtime, as shown in Fig. 8e, and the label propagation time is reduced as well. Thus RTDBC is significantly faster than DBSCAN at the end.
Normalized Mutual Information (NMI) is used to evaluate the intermediate clustering results against the real results. NMI scores lie between 0 and 1, where 1 means a perfect clustering result. The results are shown in Fig. 9.

The initial number of nodes in G increases with α because the primitive circles overlap in step 1 (Fig. 11b), and the merging in step 4 then leads to a faster reduction in graph nodes. The number of graph nodes also decreases more rapidly over the iterations, as shown in Fig. 11d. Hence the cumulative run times of RTDBC are reduced considerably, as shown in Fig. 11a. Identifying the states of the edges requires more queries, and the increase in queries is steady when β is large. At the same time, additional core objects are identified in each step, which leads to rapid detection of the "yes" states of the edges, as shown in Fig. 11c. So the effect of the increased queries on the operation cost is very small, and the cumulative run times of RTDBC are still reduced. Overall performance decreases because of redundant queries when α and β are very large. Hence, RTDBC prefers α = β for all iterations.

The RTDBC runtime increases slightly as the parameter µ increases, because more queries are needed to find unprocessed objects, as shown in Fig. 12. At the same time, the graph size decreases, which tends to reduce the cost. This happens when there are more noise objects.
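For reference, NMI between a reference labelling and a clustering result can be computed as follows. This is a minimal sketch; normalizing by the geometric mean of the two entropies is an assumption, since the text does not state which normalization was used.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, predicted):
    n = len(truth)
    count_t, count_p = Counter(truth), Counter(predicted)
    joint = Counter(zip(truth, predicted))
    # Mutual information between the two labelings.
    mutual = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    h_t, h_p = entropy(truth), entropy(predicted)
    if h_t == 0 and h_p == 0:
        return 1.0  # both labelings are a single cluster
    if h_t == 0 or h_p == 0:
        return 0.0
    # Assumption: normalize by the geometric mean of the entropies.
    return mutual / math.sqrt(h_t * h_p)


perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])     # same partition, labels renamed
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])   # statistically independent partition
```

The score depends only on how objects are grouped, not on the label names, so a renamed but identical partition still scores 1.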
Increasing the value of ε decreases the initial number of graph nodes, because more objects are labeled inside each primitive circle. The number of queries and the runtimes decrease as well. Thus, a larger ε allows RTDBC to obtain faster clustering results in all iterations.
Figure 13 shows the performance of RTDBC on different synthetic datasets created by DBSCANR, a DBSCAN variant (Gan and Tao, 2015): Synthetic 1 (9 dimensions, 2000000 points) and Synthetic 2 (11 dimensions, 2000000 points), for different values of ε with µ = 5. RTDBC is much faster than DBSCAN and its variant DBSCANR.

Fig. 13. RTDBC performance on different synthetic datasets
On the Synthetic 1 data set with ε = 4000, RTDBC needs 3.68 sec, whereas DBSCAN and DBSCANR need 1093.6 sec and 221 sec respectively. Thus, RTDBC is 297.1 times faster than DBSCAN and 60 times faster than DBSCANR.
The scalability of RTDBC with respect to DBSCANR is shown in Fig. 13b, with µ = 5 and ε = 5000 for the number of objects and µ = 5 and ε = 4000 for the data dimension. RTDBC performs efficiently for larger numbers of objects and higher data dimensions. For clustering 5000000 objects, RTDBC completes in below 9.3 sec, whereas DBSCANR and DBSCAN need 505.4 sec and 19388.8 sec respectively. Overall, RTDBC is nearly 55.5 times faster than DBSCANR and DBSCAN.
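The speedup factors are simple ratios of the reported runtimes, as the check below shows. For the 5000000-object case only an upper bound on the RTDBC runtime (9.3 sec) is given, so the ratios computed from it are lower bounds on the speedup. The variable names are illustrative.

```python
# Synthetic 1 with eps = 4000 (runtimes in seconds)
rtdbc, dbscanr, dbscan = 3.68, 221.0, 1093.6
speedup_vs_dbscan = dbscan / rtdbc     # about 297
speedup_vs_dbscanr = dbscanr / rtdbc   # about 60

# 5000000 objects: RTDBC finishes in below 9.3 sec, so these are lower bounds.
rtdbc_bound, dbscanr_5m, dbscan_5m = 9.3, 505.4, 19388.8
min_speedup_vs_dbscanr = dbscanr_5m / rtdbc_bound
min_speedup_vs_dbscan = dbscan_5m / rtdbc_bound
```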

Conclusion
DBSCAN, a well-known density-based clustering algorithm, is an advanced data clustering method with applications in numerous fields, but its $O(n^2)$ run time complexity poses a major challenge. RTDBC is a solution that minimizes the problems in DBSCAN. In RTDBC, objects are assigned to clusters using label representatives rather than by propagating labels directly, which reduces the label propagation time considerably. In contrast to DBSCAN, RTDBC produces fast results that are refined continuously during runtime; additionally, users are permitted to suspend the algorithm to examine the result and to continue it to obtain better results. RTDBC is 297.1 times faster than DBSCAN and 60 times faster than DBSCANR. For clustering 5000000 objects, RTDBC completes in below 9.3 sec, whereas DBSCANR and DBSCAN need 505.4 sec and 19388.8 sec respectively. Overall, RTDBC is nearly 55.5 times faster than DBSCANR and DBSCAN.