Integrating Correlation Clustering and Agglomerative Hierarchical Clustering for Holistic Schema Matching

Holistic schema matching is the process of taking several schemas as input and producing the correspondences among them. Handling a large number of schemas can be time-consuming and can yield poor-quality matches. Therefore, several clustering approaches have been proposed to reduce the search space by partitioning the data into smaller portions, which facilitates the matching process. However, there is still a demand for improving the partitioning mechanism by avoiding the random initial solutions (centroids) produced by the clustering process, since such random solutions have a significant impact on the matching results. This study integrates correlation clustering and agglomerative hierarchical clustering to improve the effectiveness of holistic schema matching. The proposed integrated method avoids both random initial solutions and a predefined number of centroids. Several preprocessing steps are performed using auxiliary information (a domain dictionary). The experiments were carried out on the Airfare, Auto and Book datasets from the UIUC Web Integration Repository. The proposed method was compared with the K-means and K-medoids clustering methods. As a result, the proposed method outperformed K-means and K-medoids, achieving F-measure values of 0.90, 0.93 and 0.90 for Airfare, Auto and Book, respectively.


Introduction
A huge number of heterogeneous data sources have become reachable through web interfaces, the so-called deep web, which refers to online databases whose content is generated dynamically in response to user queries. Such interfaces are crucial for e-business search engines that provide unified access to multiple sites, allowing the end user to search for and compare products easily. It is therefore necessary to reconcile the semantic heterogeneity among the query interfaces. This process is called schema matching, and it aims to find the attribute correspondences among the schemas in order to provide a unified interface for the user (Chen et al., 2012). Schema matching has become more essential and challenging for many applications such as data warehouses, e-commerce and the semantic web (Rahm, 2011). Although it has attracted the attention of many researchers, schema matching is still an open problem because of the varied representations of data and schema structures (Pei et al., 2006). Generally, schema matching is hard because columns often come with no documentation of their semantics; that is, column names and values may be opaque (Jaiswal et al., 2013).
Holistic schema matching is the process of taking a group of schemas as input and producing their semantic correspondences at the same time (Su et al., 2006). Because of its efficiency and effectiveness, holistic schema matching has become a promising, though challenging, way to solve the large-scale matching problem, and many approaches based on it have been proposed (Chuang and Chang, 2008; Yuchen et al., 2009). Furthermore, holistic schema matching has drawn in several techniques for large-scale schema matching, such as clustering. Dealing with large-scale data may reduce the effectiveness of the matching results because of the large number of elements, which may carry different meanings. Efficiency may also be affected, given the time and space consumed during the matching task (Rahm, 2011).
Hence, search space reduction has attracted researchers' attention as a way to partition the data into smaller sectors that can be matched easily and accurately. Clustering contributes to solving this issue by dividing the data into groups of similar elements. Several clustering-based approaches to holistic schema matching have been proposed using techniques such as k-means and hierarchical clustering. K-means is fast, but it requires the user to specify the number of clusters k, which is not easy to determine in holistic schema matching (Wirth, 2010). In contrast, hierarchical clustering produces effective results but is restricted by its time complexity (Alofairi, 2012). Some approaches have integrated k-means and hierarchical clustering in order to get better results. However, there is still room for improvement in the accuracy of search space reduction in holistic schema matching. Therefore, this study proposes an integrated method of correlation clustering and agglomerative hierarchical clustering for holistic schema matching. Correlation clustering does not require the user to specify the number of clusters, and it avoids random initial solutions (Wirth, 2010). The proposed method utilizes the characteristics of both the schema elements, such as column names and labels, and the schema constraints, such as the data types of the attributes. Moreover, several preprocessing steps are performed, including transformation, normalization and the use of a domain dictionary. The experiments were carried out on web interfaces, namely the Airfare, Auto and Book datasets from the ICQ Query Interface data sets in the UIUC Web Integration Repository. Finally, a prototype based on the proposed method was developed in order to match holistic schemas.
He and Chang (2004) presented an integrator tool that matches the attributes between query interfaces by exploiting names, labels and data types. The authors utilized the matching results to build a global interface. The matching was performed using a cluster-based method that groups attributes of the same domain into clusters based on their names and then merges those clusters using semantics (synonyms). Each resulting cluster is considered a representative attribute for the global interface. Wu et al. (2004) proposed an interactive clustering-based approach using agglomerative hierarchical clustering to match query interfaces. It is an effective approach that reduces the search space and handles both simple and complex mappings, performing the matching based on the similarity of name, label and domain. Furthermore, Pei et al. (2006) proposed a novel clustering-based approach using the k-means algorithm in three steps: (i) clustering schemas, (ii) clustering attributes within the same schema and (iii) clustering attributes across different schemas. Alofairi (2012) proposed an integrated clustering algorithm of k-means and agglomerative hierarchical clustering for holistic schema matching, which exploited name, label and data type together with a domain-specific dictionary. However, the clustering techniques used in the existing approaches have some limitations; for instance, k-means requires a priori specification of the number of clusters and starts from random initial solutions. Therefore, there is still a strong demand for enhancing the clustering techniques used for search space reduction in holistic schema matching.

Materials and Methods
The proposed method consists of three phases. The first phase is preprocessing, which comprises transformation and normalization; it aims to turn the data into an internal representation and to eliminate noisy data. The second phase is clustering, which comprises the integrated correlation clustering and agglomerative hierarchical clustering. The third phase is evaluation. Figure 1 shows the framework of the proposed method.

Dataset
This research used three datasets, each containing 20 web interface schemas collected using online directories. The chosen datasets are the Airfare, Auto and Book datasets from the ICQ Query Interface data sets in the UIUC Web Integration Repository (Chang et al., 2003). Each schema is described in a text file that includes the strings of the attribute names and labels for each field.

Preprocessing
This phase contains the procedures that turn the data into a format that can be processed. It includes transformation and normalization, described as follows.

Transformation
For preprocessing and clustering, the datasets have to be represented in an appropriate, unified scheme (internal representation) so that they can be processed. Each dataset is represented as a table with the following attributes: (Schema_Number, Field_Number, Name, Label, Domain).
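As a rough illustration of this internal representation, the table row can be sketched as a small record type. The class name `FieldRecord`, the helper `to_records` and the sample field values are assumptions made for this example only; the source does not specify how the table is stored or how the schema text files are parsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRecord:
    """One row of the internal representation (one field of one schema)."""
    schema_number: int
    field_number: int
    name: str
    label: str
    domain: str


def to_records(schema_number, fields):
    """Turn (name, label, domain) triples of one schema into numbered rows."""
    return [
        FieldRecord(schema_number, i, name, label, domain)
        for i, (name, label, domain) in enumerate(fields, start=1)
    ]


# Hypothetical fields of one airfare query interface (not taken from the dataset).
rows = to_records(1, [
    ("depCity", "Departure city", "string"),
    ("retDate", "Return date", "date"),
])
```

Keeping the schema number and field number on every row makes it possible to trace a clustered field back to the interface it came from.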

Clustering
The proposed integrated clustering consists of correlation clustering and agglomerative hierarchical clustering. It aims to assign a point from each schema using correlation clustering. As mentioned earlier, each schema has several fields; thus, in our method a point refers to a field. The key characteristic of correlation clustering lies in maximizing the agreements (similarities) within a cluster (intra-cluster) and maximizing the disagreements (dissimilarities) between clusters (inter-cluster) (Wirth, 2010). To achieve this objective, the Levenshtein distance is measured between each field (point) and its neighbors.
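The edit distance used for this measurement can be sketched with the standard dynamic-programming formulation. This is a generic implementation offered for illustration, not the code of the prototype; the function name `levenshtein` is chosen here.

```python
def levenshtein(x: str, y: str) -> int:
    """Edit distance counting insertions, deletions and replacements."""
    # prev[j] is the distance between the characters of x processed so far
    # (before the current one) and the first j characters of y.
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]  # distance from the first i characters of x to the empty word
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # replacement, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the distance table are kept at a time, which is enough because each cell depends only on the current and previous rows.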
The Levenshtein distance (LD) is the number of operations that have to be performed to convert one word into another; these operations are insertion, deletion and replacement (Chowdhury et al., 2013). LD is commonly used to quantify the variation between words. Let x and y be two words; the Levenshtein distance between their first i and j characters is calculated as in Equation 1:

$$\mathrm{lev}_{x,y}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \mathrm{lev}_{x,y}(i-1,j) + 1 \\ \mathrm{lev}_{x,y}(i,j-1) + 1 \\ \mathrm{lev}_{x,y}(i-1,j-1) + 1_{(x_i \neq y_j)} \end{cases} & \text{otherwise,} \end{cases} \qquad (1)$$

where $1_{(x_i \neq y_j)}$ is an indicator function equal to 0 when $x_i = y_j$ and 1 otherwise. The greater the LD between two words, the greater their dissimilarity; the lower the LD, the greater their similarity. Hence, the proposed integrated clustering measures the LD values of each field (point) and its neighbors, then selects the group with the maximum LD value and assigns its members as the initial solutions (centroids). Agglomerative hierarchical clustering is then used to compute the cosine similarity between each field and each centroid. In cosine similarity, the two values to be compared are represented as vectors, and their correlation is the cosine of the angle between them (Li and Han, 2013). Given two values $\vec{t_a}$ and $\vec{t_b}$, the cosine similarity between them is given by Equation 2:

$$\cos(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}| \times |\vec{t_b}|} \qquad (2)$$

where $\vec{t_a}$ and $\vec{t_b}$ are m-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$. The cosine result is non-negative and bounded in [0,1]. Using a specific threshold, the hierarchical clustering merges each field with the appropriate centroid. This step is sensitive because it directly influences the matching results; choosing the threshold is therefore a challenging issue for researchers in the field of schema matching. Our proposed method tries several threshold values in search of the best matching results. The algorithm of the proposed method is stated below:

Generate_Centroids ();
Repeat
    For each schema
        Assign a random field f;
        Compute Lev of f and its neighbors;
        Store schema No. n and its result r in list D(n,r);
    End For;
Until schema number is reached;
For each element in D
    Find the maximum r;
    Store all f of n that are associated with Max r in an array C;
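The centroid generation above and the subsequent threshold-based merging can be sketched in Python as follows. Several points are assumptions rather than details confirmed by the source: the "neighbors" of a field are taken to be the other fields of the same schema, the result r is taken to be the sum of their Levenshtein distances, cosine similarity is computed over word counts of the field names, and fields below the threshold are simply left unassigned. The `pick` parameter is a hypothetical hook that makes the random choice replaceable.

```python
import math
import random
from collections import Counter


def levenshtein(x, y):
    """Standard edit distance (insertion, deletion, replacement)."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (xi != yj)))
        prev = curr
    return prev[-1]


def generate_centroids(schemas, pick=random.choice):
    """schemas maps a schema number n to the list of its field names."""
    scores = {}  # plays the role of the list D(n, r)
    for n, fields in schemas.items():
        f = pick(fields)
        # Assumption: neighbors of f are the other fields of schema n.
        scores[n] = sum(levenshtein(f, g) for g in fields if g != f)
    best = max(scores.values())
    # All fields of the schema(s) reaching the maximum r become centroids (array C).
    return [f for n, fields in schemas.items() if scores[n] == best
            for f in fields]


def cosine(a, b):
    """Cosine similarity of two names over their word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(fields, centroids, threshold=0.4):
    """Merge each field with its most similar centroid if above the threshold."""
    clusters = {c: [] for c in centroids}
    for f in fields:
        best = max(centroids, key=lambda c: cosine(f, c))
        if cosine(f, best) >= threshold:
            clusters[best].append(f)
    return clusters


# Hypothetical field names; pick is fixed to the first field for repeatability.
schemas = {1: ["from", "to"],
           2: ["departure city", "arrival city", "departure date"]}
centroids = generate_centroids(schemas, pick=lambda fields: fields[0])
clusters = assign(["city of departure", "date of departure", "airline"], centroids)
```

In this toy input the second schema has the larger distance score, so its three fields become the centroids, and "airline" stays unassigned because it shares no term with any of them.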

Results
Three clustering methods, namely correlation clustering, K-means and K-medoids, were each integrated with hierarchical clustering for partitioning and applied to the three datasets Airfare, Auto and Book. The evaluation was performed using the common information retrieval metrics Precision, Recall and F-measure. Several threshold values were tried in search of the optimal one. Table 1 shows the results of the integrated K-medoids and agglomerative hierarchical clustering.
As shown in Table 1, the results of the integrated K-medoids and agglomerative hierarchical clustering are reported for several threshold values. A threshold of 0.4 achieved the highest F-measure values, which are 0.84, 0.84 and 0.85 for Airfare, Auto and Book, respectively.
In the same manner, Table 2 shows the results of the integrated K-means and agglomerative hierarchical clustering.
As shown in Table 2, the results of the integrated K-means and agglomerative hierarchical clustering are reported for several threshold values. A threshold of 0.4 achieved the highest F-measure values, which are 0.87, 0.88 and 0.86 for Airfare, Auto and Book, respectively.
On the other hand, Table 3 shows the results of the integrated correlation clustering and agglomerative hierarchical clustering.
As shown in Table 3, the results of the integrated correlation clustering and agglomerative hierarchical clustering are reported for several threshold values. A threshold of 0.4 achieved the highest F-measure values, which are 0.90, 0.93 and 0.90 for Airfare, Auto and Book, respectively.
Finally, Table 4 shows the F-measure results of the three clustering methods for the three datasets with a threshold of 0.4.
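For reference, the three metrics used in this section can be computed as in the following sketch, which assumes that a matching result is a set of attribute pairs compared against a manually prepared reference set. The function name and the sample pairs are illustrative only and do not come from the datasets.

```python
def precision_recall_f(found, expected):
    """Compare found correspondences against a reference set of matches."""
    found, expected = set(found), set(expected)
    correct = len(found & expected)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical correspondences between attributes of two interfaces.
found = {("a1", "b1"), ("a2", "b2"), ("a3", "b4")}
expected = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
p, r, f = precision_recall_f(found, expected)
```

Here two of the three found pairs are correct and two of the four expected pairs are recovered, so precision is 2/3, recall is 1/2 and the F-measure is 4/7.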

Discussion
As shown in Table 4, correlation clustering outperformed both K-medoids and K-means. The improvement was slight on the Airfare and Book datasets and more pronounced on the Auto dataset. As expected from other studies (Velmurugan and Santhanam, 2010; Wirth, 2010), unlike k-means and k-medoids, correlation clustering does not require a predefined number of clusters and avoids random initial solutions, which have a significant impact on the matching results.

Conclusion
This research addresses the problem of schema matching by proposing an integrated clustering method, consisting of correlation clustering and agglomerative hierarchical clustering, to improve the effectiveness of search space reduction for holistic schema matching. The experiments were carried out on web interfaces, namely the Airfare, Auto and Book datasets. The proposed method exploits the characteristics of both the schema elements, such as column names and labels, and the schema constraints, such as the data types of the attributes. In addition, the matching cardinality in this research is restricted to 1:1 because of the limited information available for the datasets. The proposed method was evaluated by applying different clustering methods and comparing their results. Correlation clustering outperformed the other clustering methods.