A Density Based Dynamic Data Clustering Algorithm based on Incremental Dataset

Problem statement: Clustering and visualizing high-dimensional dynamic data is a challenging problem. Most existing clustering algorithms are based on static statistical relationships among data. Dynamic clustering is a mechanism for adapting and discovering clusters in real-time environments. Many applications, such as incremental data mining in data warehousing and sensor networks, rely on dynamic data clustering algorithms. Approach: In this work, we present a density based dynamic data clustering algorithm for clustering an incremental dataset and compare its performance with a full run of normal DBSCAN and Chameleon on the dynamic dataset. Most clustering algorithms perform well when measured by clustering accuracy, which is calculated from the original class labels and the calculated class labels. However, measuring performance with a cluster validation metric can give a different kind of result. Results: This study addresses the problem of clustering a dynamic dataset that grows in size over time as more data is added. To evaluate the performance of the algorithms, we used the Generalized Dunn Index (GDI) and the Davies-Bouldin index (DB) as cluster validation metrics, as well as the time taken for clustering. Conclusion: In this study, we have successfully implemented and evaluated the proposed density based dynamic clustering algorithm. The performance of the algorithm was compared with the Chameleon and DBSCAN clustering algorithms. The proposed algorithm performed significantly well in terms of clustering accuracy as well as speed.


INTRODUCTION
Data mining is the process of extracting potentially useful information from a data set. Clustering is a popular data mining technique intended to help the user discover and understand the structure or grouping of the data according to a certain similarity measure. Clustering divides data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification: the data is modeled by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics and numerical analysis. The search for clusters is unsupervised learning and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology and many others (Berkhin, 1988).
Existing clustering algorithms integrate static components. However, most applications are being converted into real-time applications, which requires objects to be clustered during the process based on their properties. Dynamic clustering is a mechanism for adapting clustering to real-time environments such as mobile computing and war-end movement observation (Crespoa and Weber, 2005). Dynamic data mining is increasingly attracting attention from the research community. Users of installed data mining systems are also interested in the related techniques, and will be even more so, since most of these installations will need to be updated in the future for each data mining technique used. Different methodologies are therefore needed for dynamic data mining. In this study, we present a methodology for a Density Based Dynamic Data Clustering Algorithm based on Incremental DBSCAN.

Clustering of dynamic data:
Clustering is a field of active research in data mining. Most of the work has focused on static data sets (Han and Kamber, 2011). Traditional clustering algorithms used in data mining do not perform well on dynamic data sets. A clustering algorithm must consider the elements' history in order to find clusters in dynamic data efficiently and effectively. There has been little work on clustering of dynamic data. We define a dynamic data set as a set of elements whose parameters change over time. A flock of flying birds is an example of a dynamic data set. We are interested in exploring algorithms that are capable of finding relationships amongst the elements in a dynamic data set. In this study, we evaluate the use of data clustering techniques developed for static data sets on dynamic data.

Recent developments of dynamic data mining:
Within the area of data mining, various methods have been developed to find useful information in a set of data. Among the most important ones are decision trees, neural networks, association rules and clustering methods (Crespoa and Weber, 2005; Loganantharaj et al., 2000).
For each of the above-mentioned data mining methods, updating has different aspects and some updating approaches have been proposed, as we will see next.
Decision trees: Various techniques for incremental learning and tree restructuring, as well as for the identification of concept drift, have been proposed in the literature.
Clustering: Below, we describe in more detail the approaches for dynamic data mining using clustering techniques that can be found in the literature.
Recent developments of clustering systems using dynamic elements are concerned with modeling the clustering process dynamically, i.e., adaptations of the algorithm are performed while applying it to a static set of data.

MATERIALS AND METHODS
The cluster validation methods: Major difficulties in cluster validation: The large variability in cluster geometric shapes, and the fact that the number of clusters cannot always be known a priori, are the main reasons for validating the quality of the identified clusters. Different distance measures also lead to different types of clusters, so deciding on the 'best' clustering depends on several aspects of the application. As a result, a cluster validation algorithm does not always give the best result from the application's point of view (Bezdek and Pal, 1998).

Cluster validity:
If cluster analysis is to make a significant contribution to engineering applications, much more attention must be paid to cluster validity issues, which are concerned with determining the optimal number of clusters and checking the quality of clustering results. Many different indices of cluster validity have been proposed, such as Bezdek's partition coefficient, Dunn's separation index, the Xie-Beni separation index, the Davies-Bouldin index and the Gath-Geva index. Most of these validity indices tacitly assume that the data points have constant density within the clusters. However, this is not guaranteed in real problems (Bezdek and Pal, 1998).

Indices of cluster validity:
Cluster validation refers to procedures that evaluate clustering results in a quantitative and objective fashion. Validity indices are usually adopted to measure the adequacy of a structure recovered through cluster analysis. Determining the correct number of clusters in a data set has been, by far, the most common application of cluster validity. In general, indices of cluster validity fall into one of three categories. Some validity indices measure partition validity by evaluating the properties of the crisp structure imposed on the data by the clustering algorithm. In the case of fuzzy clustering algorithms, some validity indices, such as the partition coefficient and classification entropy, use only the information of fuzzy membership grades to evaluate clustering results. The third category consists of validity indices that make use of not only the fuzzy membership grades but also the structure of the data.

The cluster validity measures: Dunn's index vD:
This index is used to identify compact and well-separated clusters (Eq. 1):
$$v_D = \min_{1 \le i \le c} \left\{ \min_{1 \le j \le c,\, j \ne i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le c} \Delta(C_k)} \right\} \right\} \qquad (1)$$
Where: δ is a distance function between two clusters, Δ is the diameter of a cluster, c is the number of clusters and C_i, C_j, C_k are the sets whose elements are the data points assigned to the i-th, j-th and k-th clusters respectively. The main drawback of a direct implementation of Dunn's index is computational: calculating it becomes very expensive as the number of clusters and the total number of points increase. Larger values of vD correspond to good clusters and the number of clusters that maximizes vD is taken as the optimal number of clusters.
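The cost noted above comes from the pairwise distance computations. A minimal brute-force sketch of Dunn's index makes this visible. It assumes Euclidean distance, the closest-pair distance between clusters and the farthest-pair diameter within a cluster; the function name `dunn_index` and the sample data are illustrative and not taken from the original implementation.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(clusters):
    """Dunn's index for a list of clusters, each a list of coordinate tuples.

    Assumes at least two clusters and that at least one cluster holds two
    or more distinct points, so the largest diameter is non-zero.
    """
    # Set distance: smallest distance between points of two different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Diameter: largest distance between two points of the same cluster.
    max_diameter = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


# Hypothetical toy partitions of six points.
compact = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
stretched = [[(0, 0), (0, 1), (5, 5)], [(1, 0), (10, 10), (11, 10)]]
print(dunn_index(compact), dunn_index(stretched))
```

Both loops visit every pair of points, so the running time grows quadratically with the number of points, which is the expense the paragraph refers to.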

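Generalized Dunn Index vGD (Eq. 2):
$$v_{GD} = \min_{1 \le i \le c} \left\{ \min_{1 \le j \le c,\, j \ne i} \left\{ \frac{\delta_p(C_i, C_j)}{\max_{1 \le k \le c} \Delta_q(C_k)} \right\} \right\} \qquad (2)$$
Here c is the number of clusters, δ_p is any set distance function and Δ_q is any diameter function. Five set distance functions and three diameter functions have been defined; of these, we have used two combinations: δ_3 and Δ_3 (which is recommended in (Karypis et al., 1999) as being most useful for cluster validation) in one, and δ_5 and Δ_3 in the other. In their standard form, with d the distance between two points and \bar{v}_S the centroid of a cluster S, the three measures δ_3, δ_5 and Δ_3 are defined as follows:
$$\delta_3(S, T) = \frac{1}{|S|\,|T|} \sum_{x \in S,\, y \in T} d(x, y)$$
$$\delta_5(S, T) = \frac{1}{|S| + |T|} \left( \sum_{x \in S} d(x, \bar{v}_T) + \sum_{y \in T} d(y, \bar{v}_S) \right)$$
$$\Delta_3(S) = 2 \left( \frac{\sum_{x \in S} d(x, \bar{v}_S)}{|S|} \right)$$
Larger values of vGD correspond to good clusters and the number of clusters that maximizes vGD is taken as the optimal number of clusters. In this evaluation, we used δ_3 as the set distance function and Δ_3 as the diameter function when evaluating the algorithms under consideration.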
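The combinations named above can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: it uses Euclidean distance, takes the third set distance as the average distance over all cross-cluster pairs, the fifth as the average distance from each point to the other cluster's centroid, and the third diameter as twice the average distance to the cluster's own centroid. All function names (`delta3`, `delta5`, `diameter3`, `generalized_dunn`) are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def centroid(cluster):
    """Component-wise mean of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def delta3(s, t):
    """Average distance over all cross-cluster point pairs."""
    return sum(dist(x, y) for x in s for y in t) / (len(s) * len(t))


def delta5(s, t):
    """Average distance from each point to the centroid of the other cluster."""
    vs, vt = centroid(s), centroid(t)
    total = sum(dist(x, vt) for x in s) + sum(dist(y, vs) for y in t)
    return total / (len(s) + len(t))


def diameter3(s):
    """Twice the average distance from the points of a cluster to its centroid."""
    v = centroid(s)
    return 2 * sum(dist(x, v) for x in s) / len(s)


def generalized_dunn(clusters, set_distance=delta3, diameter=diameter3):
    """Smallest set distance between two clusters over the largest diameter."""
    separation = min(set_distance(a, b) for a, b in combinations(clusters, 2))
    spread = max(diameter(c) for c in clusters)
    return separation / spread


# Hypothetical toy partition: two pairs of points on a line.
toy = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
print(generalized_dunn(toy))                        # third set distance, third diameter
print(generalized_dunn(toy, set_distance=delta5))   # fifth set distance, third diameter
```

Passing the set distance and the diameter as parameters keeps both combinations used in the evaluation behind a single function.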

Davies-Bouldin index:
This index (Davies and Bouldin, 1979) is a function of the ratio of the sum of within-cluster scatter to between-cluster separation (Eq. 3):
$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \ne i} \left\{ \frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)} \right\} \qquad (3)$$
where n is the number of clusters, S_n is the average distance of all objects in a cluster to their cluster centre and S(Q_i, Q_j) is the distance between cluster centres. Hence the ratio is small if the clusters are compact and far from each other. Consequently, the Davies-Bouldin index will have a small value for a good clustering (Bezdek and Pal, 1998).

Chameleon: A limitation of existing agglomerative clustering schemes is that the merging decisions are based upon static modeling of the clusters to be merged. These schemes fail to take into account special characteristics of individual clusters and thus can make incorrect merging decisions when the underlying data does not follow the assumed model, or when noise is present. There are two major limitations of the agglomerative mechanisms used in existing schemes. First, these schemes do not make use of information about the nature of the individual clusters being merged. Second, one set of schemes (CURE and related schemes) ignores the information about the aggregate interconnectivity of items in two clusters, whereas the other set of schemes (ROCK, the group averaging method and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters (Karypis et al., 1999; Bezdek and Pal, 1998).
Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters. Chameleon uses a novel approach to model the degree of interconnectivity and closeness between each pair of clusters. This approach considers the internal characteristics of the clusters themselves. Thus, it does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the merged clusters. Chameleon operates on a sparse graph in which nodes represent data items and weighted edges represent similarities among the data items. This sparse-graph representation allows Chameleon to scale to large data sets and to successfully use data sets that are available only in similarity space and not in metric spaces. Data sets in a metric space have a fixed number of attributes for each data item, whereas data sets in a similarity space only provide similarities between data items.
Chameleon finds the clusters in the data set by using a two-phase algorithm. During the first phase, Chameleon uses a graph-partitioning algorithm to cluster the data items into several relatively small sub-clusters. During the second phase, it uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters (Crespoa and Weber, 2005; Goura et al., 2011; Goyal et al., 2011).

DBSCAN: DBSCAN (Density Based Spatial Clustering of Applications with Noise) and DENCLUE (DENsity-based CLUstEring) will be implemented to represent density based partitioning algorithms. DBSCAN creates clusters from highly connected elements, while DENCLUE clusters elements in highly populated areas. Both algorithms handle outliers well and will not include them in any cluster.
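For reference, the static DBSCAN procedure can be sketched as follows. This is a minimal brute-force version using Euclidean distance and a linear-scan neighborhood query, written for illustration only; the names `dbscan`, `region_query`, `eps` and `min_pts` are assumptions of this sketch rather than identifiers from the implementation evaluated here.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)   # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE       # may later be relabelled as a border point
            continue
        cluster_id += 1             # i is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point reached from a core point
            if labels[j] is not None:
                continue                 # already assigned, do not expand again
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))
```

The isolated point is left labelled as noise, which is how DBSCAN keeps outliers out of every cluster.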

The proposed density based dynamic DBSCAN:
We modeled the proposed Density based Dynamic DBSCAN algorithm using the ideas mentioned in earlier work (1996; Ester and Wittmann, 1998; Su et al., 2009; Sarmah and Bhattacharyya, 2010). Our implementation differs slightly from the standard approach. First, we only considered problems related to data insertion. Second, we dynamically changed the epsilon during each batch of insertion. The most important variation is that, during each step of batch insertion, the data points which were classified as noise or border objects (outliers) were treated as unclassified points and combined with the new data to be inserted. These small changes allowed our algorithm to perform very well and to form good clusters on the dynamic incremental data set.

The density based dynamic clustering algorithm:
The main aspects of the dynamic clustering process: When inserting an object p into the database D, it may be treated in one of the following ways:

Noise:
If there is no nearby point in the epsilon neighborhood, or the number of neighbors does not satisfy the density criterion, then p is a noise object and nothing else is changed.

Absorption of point p:
If all the nearby points in the epsilon neighborhood belong to some cluster, then the newly inserted point p also belongs to the same class ID. In other words, the new point will simply be absorbed by that existing cluster.

Merging of clusters:
If the nearby points in the epsilon neighborhood are members of different clusters, then the newly inserted point p will connect all these existing clusters and form one cluster out of them.

Creation of a cluster:
At the location of insertion, if some noise objects are already present and the point p can be treated as a core point after insertion by satisfying the condition of cluster membership, then a new cluster will form in that region.
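The four cases above can be sketched as a single decision function. This is a simplified illustration, not the evaluated implementation: it inspects only the epsilon neighborhood of p itself, counts p towards its own neighborhood, and treats p as a connector between clusters only when p satisfies the density criterion. That last refinement, the `NOISE` label value and the name `classify_insertion` are assumptions of this sketch.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label used here for noise points (an assumption of this sketch)


def classify_insertion(points, labels, p, eps, min_pts):
    """Return which of the four cases applies when p is inserted.

    points : list of coordinate tuples already in the database
    labels : cluster id for each existing point, or NOISE
    """
    neighbors = [i for i, q in enumerate(points) if dist(p, q) <= eps]
    # p counts towards its own neighborhood when checking the density criterion.
    p_is_core = len(neighbors) + 1 >= min_pts
    cluster_ids = {labels[i] for i in neighbors if labels[i] != NOISE}

    if not cluster_ids:
        # Only noise (or nothing) nearby: a cluster forms if p is dense enough.
        return "creation" if p_is_core else "noise"
    if len(cluster_ids) == 1 or not p_is_core:
        return "absorption"
    return "merge"


# Hypothetical database: two clusters and two noise points.
db = [(0, 0), (0, 1), (1, 0), (5, 0), (5, 1), (6, 0), (20, 20), (20, 21)]
ids = [0, 0, 0, 1, 1, 1, NOISE, NOISE]
for new_point in [(40, 40), (0.5, 0.5), (3, 0), (20.5, 20.5)]:
    print(new_point, classify_insertion(db, ids, new_point, eps=3, min_pts=3))
```

A complete incremental DBSCAN would also re-examine existing neighbors whose core status changes because of the insertion; the sketch leaves that step out to keep the case analysis readable.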
A dynamic DBSCAN algorithm for clustering evolving data over time: Let:
• D_Ex be the existing dataset, which is already clustered into C_ex classes.
• D_New be the new dataset, which is to be added to D_Ex and clustered into C_new classes.

RESULTS
The performance of the algorithms was evaluated using a synthetic dataset and real data sets from the UCI Data repository.
In terms of the Generalized Dunn Index, the Davies-Bouldin Index and clustering time, on both the synthetic dataset and the real datasets, the performance of the proposed dynamic clustering algorithm was good, and almost equal to or slightly better than that of the normal DBSCAN algorithm.

DISCUSSION
Results with synthetic data set: To evaluate the performance of clustering in a very controlled manner, multi-dimensional synthetic data sets were used.
Fig. 1 shows the two-dimensional plot of one such dataset.
The parameters of the algorithm used to create the synthetic spheroid form of data points using Gaussian distribution: : 600.00
Fig. 2 shows the clustering performance on a dataset with the above-mentioned attributes. The line chart shows the performance of the algorithm as the data size increases. The bar chart shows the average performance of the algorithms.
Fig. 3 shows the performance in terms of the Generalized Dunn Index with the synthetic dataset. The performance of the proposed dynamic clustering algorithm was good, and almost equal to or slightly better than that of the normal DBSCAN algorithm.
Fig. 4 shows the average performance in terms of the Generalized Dunn Index. The performance of the proposed dynamic clustering algorithm was equal to that of the normal DBSCAN algorithm.
Fig. 5 shows the performance in terms of the Davies-Bouldin Index. Fig. 6 and 7 show the average performance in terms of the Davies-Bouldin Index.
Performance in terms of time: The following graph shows the performance in terms of time. The speed of the proposed dynamic clustering algorithm was better than that of Chameleon as well as the DBSCAN algorithm.
The results with UCI data sets: To validate the performance of the algorithms, we used some of the real data sets from the UCI Data repository. The performance of the algorithms was evaluated on the "UCI Wine Data" with different sizes of incremental data.
Fig. 8-15 show the performance in terms of the Generalized Dunn Index, the Davies-Bouldin Index and clustering time. The performance of the proposed dynamic clustering algorithm was good and, in most cases, its accuracy in terms of the validation metrics was slightly better than that of the normal DBSCAN algorithm and Chameleon.
The performance with different UCI datasets: The following graph shows the average performance of the algorithm with different UCI data sets. The performance was measured in terms of the Generalized Dunn Index, the Davies-Bouldin Index and clustering time.
The performance of the proposed dynamic clustering algorithm was good and, in most cases, its accuracy in terms of the validation metrics was slightly better than that of the normal DBSCAN algorithm and Chameleon. The average performance of the algorithms in terms of the Generalized Dunn Index with different sizes of incremental data was good and almost equal across all four evaluated datasets.

Average performance in terms of Davies-Bouldin index:
The average performance of the algorithms in terms of the Davies-Bouldin Index is almost equal to or slightly higher than that of the normal DBSCAN.
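When reading these values, recall that a smaller Davies-Bouldin index indicates more compact and better separated clusters. The following sketch shows how such a value can be computed for a given partition. It assumes Euclidean distance, cluster centres taken as centroids and scatter taken as the average distance to the centre; the function names and the toy data are illustrative only and are not the code used to produce the reported figures.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(cluster):
    """Component-wise mean of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def davies_bouldin(clusters):
    """Average, over clusters, of the worst scatter-to-separation ratio.

    Assumes at least two clusters with distinct centres.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: average distance of the objects of a cluster to its centre.
    scatters = [
        sum(dist(x, v) for x in c) / len(c) for c, v in zip(clusters, centers)
    ]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max(
            (scatters[i] + scatters[j]) / dist(centers[i], centers[j])
            for j in range(n)
            if j != i
        )
    return total / n


# Hypothetical toy partitions: the second has one much looser cluster.
tight = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
loose = [[(0, 0), (6, 0)], [(10, 0), (12, 0)]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

In this toy example the looser partition yields the larger value, so a slightly higher index corresponds to slightly less compact or less separated clusters.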

Average performance in terms of time:
The average clustering time is very low for the proposed dynamic clustering algorithm. The performance of the proposed algorithm was very good on all the data sets.

CONCLUSION
In this study, we have successfully implemented and evaluated the proposed density based dynamic clustering algorithm. The algorithm was able to insert data objects one by one and re-estimate the cluster IDs after each point was inserted. The algorithm is capable of creating, modifying and inserting clusters over time. The performance of the algorithm was compared with the Chameleon and DBSCAN clustering algorithms. As shown in the results of the previous section, the proposed algorithm performed significantly well in terms of clustering accuracy as well as speed.
It is possible to handle batch insertion, by which the run time of the algorithm can be reduced. Future work will therefore address ways to improve the performance of the algorithm in terms of speed and accuracy. This work only addressed the problem of clustering an incremental data set in which data is only added over time.
Future work may address the other dynamic operations, such as deletions and modifications of data points, and remodel the algorithm to cluster the data as the dataset changes dynamically. Even though the performance of Chameleon was poor in terms of speed, it also possesses the capabilities to become a dynamic clustering algorithm. Future work may explore these possibilities and address hybrid dynamic clustering algorithms.