A Comparative Study of Classification Techniques for an Unsupervised Record Linkage Model

INTRODUCTION
Data integration is defined as the process of merging data from various sources, such as flat files, data cubes and databases, into a coherent store such as a data warehouse. Data integration is widely used in current information systems. Since heterogeneous data sources follow different formats and standards, a real-world entity may be represented in a different style in each of these sources. Moreover, data entry mistakes such as typing errors, or the use of Optical Character Recognition (OCR), can also produce different representations of the same object. These issues lead to duplication, which is considered one of the major data quality problems. Hence, finding such duplicates in order to decide how to handle them is an essential requirement of information systems. This task is also known as record linkage.
Record linkage, also known as citation matching (McCallum et al., 2000), authority control (Warnner and Brown, 2001), object matching (Surajit et al., 2003) and entity resolution (Sarawagi and Bhamidipaty, 2002), is a difficult and computationally heavy step of data integration. The goal of record linkage is to find, match and aggregate duplicate tuples in an integrated database. Record linkage is widely used in different contexts, including digital libraries, bioinformatics and business customer information. Moreover, it is also a common pre-processing step in data mining projects.
Web datasets often lack proper quality, so finding duplicate records in such databases is a challenging task. The bibliographic entities in online digital libraries can be mentioned as an example. Figure 1 shows the common steps of the record linkage process. This model consists of several steps, including data cleaning, blocking, field comparison and classification. Noisy, incomplete and incorrect data are common problems of real-world databases (Churches et al., 2002). Data cleaning is an initial preprocessing step to remedy such problems.
Comparison is an important step applied in all record linkage frameworks. In order to detect duplicate records in two input datasets, every record of the first dataset must be compared with every record of the second. This can involve an enormous number of comparisons, which makes the task inefficient. To address this problem, blocking techniques are used in record linkage frameworks to decrease the number of comparisons by putting records that are likely to be duplicates into the same block. Only the records within each block are compared in the next step.
Determining a similarity function and matching records are two important steps applied in most record linkage frameworks. Records in the same block are compared with each other by similarity functions.
A comparison similarity function can be a simple string function or a complicated combination of several functions. The results of applying comparison functions are scores (also called weight vectors) which show the degree of similarity between record pairs. Classification techniques are applied to the results of the previous step in order to classify the record pairs into three classes: match, non-match and possible match. Both supervised and unsupervised classification techniques have been utilized in various studies for this purpose. One technique for classifying record pairs is to build the training data by selecting the record pairs with the highest and the lowest similarity scores as the match and non-match classes, respectively. The remaining pairs are considered possible matches and are classified by data mining techniques based on the known match and non-match samples.
The concept of record linkage was initially presented by Newcombe et al. (1959) in the context of medical records. Fellegi and Sunter (1969) proposed an EM-based method to determine error rates and set matching parameters. Their theory was followed by Winkler (1999), in which EM-based methods were utilized for setting optimal matching rules.
One of the significant aspects of record linkage is blocking. In order to decrease the number of comparisons between record pairs and achieve faster execution times, a variety of blocking strategies have been proposed. To mention a few, standard blocking (Jaro, 1989), the sorted neighborhood method (Hernández and Stolfo, 1998) and the canopy clustering algorithm (McCallum et al., 2000) are some popular ones.
Since the 1990s, the use of techniques from areas such as machine learning, artificial intelligence, data mining and information retrieval has been explored for record linkage and duplicate detection. Most of these strategies are supervised, meaning that classification is done based on available training samples which are labeled manually. Two prominent machine learning techniques which have been applied to classifying record pairs in the area of record linkage are decision trees (Elfeky et al., 2002) and Support Vector Machines (Nahm et al., 2002). However, labeling data manually can be a costly task. Consequently, unsupervised techniques have also been employed when training samples are not available or not sufficient. In Gu and Baxter (2006), a clustering technique, namely k-means, was utilized for classifying record pairs into match and non-match classes. Elfeky et al. (2002) proposed a hybrid approach in which supervised and unsupervised techniques were combined; this approach performed well when training samples were scarce. Christen (2007) proposed a two-step classification approach for classifying record pairs. In this approach, after computing the similarity between record pairs, training examples are selected based on their similarity scores. Then, the remaining instances are classified based on the training samples by Support Vector Machines.

In this study, we follow the approach of Christen (2007) with different classifiers in order to determine their effectiveness in detecting duplicate records. The applied classification techniques are then compared with each other.

MATERIALS AND METHODS
We follow the two-step classification method presented in Christen (2007). The data used in this study are the Restaurant and Cora datasets; their details are described later. Sorted neighborhood (Hernández and Stolfo, 1995) is used as the blocking algorithm and Longest Common Subsequence (LCS) (Allison and Dix, 1986) is utilized as the similarity function. In the next step, SVM, C4.5, Naïve Bayes and Bayesian network classifiers are applied to the selected training records in order to train and build the models. Finally, the test set is classified based on the training results to assess the effectiveness of each classifier on these datasets.
Blocking technique: The sorted neighborhood (Hernández and Stolfo, 1995) blocking algorithm is used in this study. For the Restaurant dataset, a blocking key used for sorting is produced by combining the first three characters of the Name, Address and City attributes. The blocking key of the Cora dataset is composed of the first three characters of its Title, Author and Venue fields. All records of each dataset are then sorted on the blocking key, with the window size set to three, and the records within the same window are compared with each other by the comparison function.
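As a minimal sketch of this step, assuming each record is held as a Python dictionary keyed by the Restaurant dataset's field names (the function names and record layout here are illustrative, not the original implementation):

    def blocking_key(record):
        # Concatenate the first three characters of Name, Address and City
        # (Title, Author and Venue would play this role for Cora).
        return (record["Name"][:3] + record["Address"][:3]
                + record["City"][:3]).lower()

    def sorted_neighborhood_pairs(records, window_size=3):
        # Sort all records on the blocking key, then pair up records
        # that fall inside the same sliding window of the given size.
        ordered = sorted(records, key=blocking_key)
        pairs = []
        for i in range(len(ordered)):
            for j in range(i + 1, min(i + window_size, len(ordered))):
                pairs.append((ordered[i], ordered[j]))
        return pairs

Only the pairs produced here, rather than the full cross product of the two datasets, are passed to the comparison step.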
Similarity function: The next step is computing the similarity between records within the same block. Longest Common Subsequence (LCS) is an algorithm proposed in Allison and Dix (1986) which finds the longest subsequence common to two strings. It has been experimented with successfully in several contexts, including record linkage. A normalized version of LCS, in which the result is normalized by the lengths of both input strings, is proposed in Islam and Inkpen (2008) as follows:

NLCS(s_1, s_2) = \frac{length(LCS(s_1, s_2))^2}{length(s_1) \times length(s_2)}

where s_1 and s_2 are the two input strings. In this study, the normalized version of LCS is applied in order to calculate the similarity of the fields of two records. The output of this task is a similarity score per compared field; the scores for a record pair are collectively known as its weight vector.
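The normalized LCS above can be sketched in Python as follows, using the standard dynamic-programming computation of LCS length (an illustration, not the authors' code):

    def lcs_length(s1, s2):
        # Classic dynamic-programming table for the length of the
        # longest common subsequence of two strings.
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if s1[i - 1] == s2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def nlcs(s1, s2):
        # Normalized LCS as defined above; returns a score in [0, 1].
        if not s1 or not s2:
            return 0.0
        return lcs_length(s1, s2) ** 2 / (len(s1) * len(s2))

For example, nlcs("restaurant", "restarant") yields a score close to 1, while two unrelated strings score near 0.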
Classification: Each weight vector consists of several values. These values are the result of comparing the fields of two records and lie in the range 0 to 1. In the classification step, the distance of every weight vector from the all-ones vector and from the all-zeros vector is first computed using the Euclidean distance measure. Afterwards, the weight vectors nearest to the ones vector and to the zeros vector are selected as the match and non-match classes, respectively. These weight vectors are used as the training set for a classifier. The remaining weight vectors are regarded as the test set and are classified by different supervised classification techniques based on the known training samples.

Support Vector Machines (SVM) are a set of techniques which analyze data in order to recognize patterns. SVM is used as a classification tool that builds a hyperplane, or a set of hyperplanes, to separate instances into two classes: -1 and +1. The greater the distance of the hyperplane from the nearest training data points, the smaller the classification error on unseen data instances. A separating hyperplane can be written as:

W \cdot X + b = 0

where W = \{w_1, w_2, \ldots, w_n\} is a weight vector over the n attributes A = \{A_1, A_2, \ldots, A_n\}, b is a scalar and X = \{x_1, x_2, \ldots, x_n\} are the values of the attributes (Han and Kamber, 2006). There are more details on SVM in (Han and Kamber, 2006; Pugazhenthi and Rajagopalan, 2009; Lee et al., 2010).

The decision tree is one of the significant data mining techniques for classification. This technique facilitates the decision making process by dividing it into several steps. It uses labeled training instances to classify unseen data. The most common algorithm for building decision trees is C4.5 (Quinlan, 1992; Kusrini et al., 2010), which is an extension of the ID3 algorithm (Quinlan, 1979).
A Bayes classifier is a simple probabilistic, statistical classifier which can predict class membership probabilities and is based on applying Bayes' rule of conditional probability. Naïve Bayes classifiers assume that all predictor variables are independent (Han and Kamber, 2006). Naïve Bayes has been utilized in several studies (Al-Salemi and Aziz, 2011; Wagner, 2010).
A Bayesian network, also called a Bayesian belief network or directed acyclic graphical model, is represented as a directed acyclic graph in which each node holds a random variable and each variable corresponds to a particular attribute in the data. These variables may have continuous or discrete values. A Bayesian network is based on the assumption that each variable is conditionally independent of its nondescendants in the graph given its parents (Han and Kamber, 2006). Bayesian networks are used in a variety of domains (Ting and Phon-Amnuaisuk, 2009; Mustapha et al., 2011; Mehdi et al., 2007).
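A minimal Python sketch of this two-step classification, using scikit-learn as a stand-in for the Weka package used in our experiments (scikit-learn's DecisionTreeClassifier is a CART-style approximation of C4.5, and the Bayesian network classifier is omitted because scikit-learn has no built-in one; all names here are illustrative):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB

    def select_training_set(weight_vectors, fraction=0.05):
        # Euclidean distance of every weight vector to the all-ones
        # vector (exact match) and the all-zeros vector (non-match).
        W = np.asarray(weight_vectors, dtype=float)
        d_match = np.linalg.norm(W - 1.0, axis=1)
        d_nonmatch = np.linalg.norm(W, axis=1)
        k = max(1, int(fraction * len(W)))
        match_idx = np.argsort(d_match)[:k]        # nearest to ones
        nonmatch_idx = np.argsort(d_nonmatch)[:k]  # nearest to zeros
        # Assumes the two index sets do not overlap, which holds in
        # practice when match and non-match scores are well separated.
        train_idx = np.concatenate([match_idx, nonmatch_idx])
        X_train = W[train_idx]
        y_train = np.array([1] * k + [0] * k)
        test_mask = np.ones(len(W), dtype=bool)
        test_mask[train_idx] = False
        return X_train, y_train, W[test_mask]

    # Each classifier is trained on the selected vectors and then
    # labels the remaining (test) vectors:
    # X_train, y_train, X_test = select_training_set(vectors, 0.05)
    # for clf in (SVC(), DecisionTreeClassifier(), GaussianNB()):
    #     predictions = clf.fit(X_train, y_train).predict(X_test)

The fraction parameter corresponds to the 1, 5 and 10 percent training-set sizes examined in the experiments below.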

RESULTS
The experiments are conducted on two real-world datasets, namely Cora and Restaurant. In the classification step, the 1, 5 and 10 percent of weight vectors nearest to the ones and zeros vectors are selected as the training set in different experiments. The rest of the weight vectors in each experiment are considered the test set. Finally, the Weka classifier package is used as the tool to classify the test set instances with the different classification techniques.

Restaurant dataset:
Restaurant is a standard dataset which has been used in several record linkage studies (Christen, 2008; Kopcke and Rahm, 2010; Stoermer et al., 2010). It was created by merging information about restaurants from two websites: Zagat (331 non-duplicate restaurants) and Fodors (533 non-duplicate restaurants). There are 864 records in this dataset, 112 of which are duplicates. The Name, Address, City, Phone and Type of the restaurants are the attributes of this dataset.

Cora dataset:
The second dataset is Cora, a real-world dataset which contains 1295 citations of 112 computer science papers gathered from the Cora Computer Science Research Paper Engine. The attributes of each citation are as follows: Author, Volume, Title, Institution, Venue, Address, Publisher, Year, Pages, Editor, Note and Month. Moreover, the Class attribute of this dataset is used for determining whether two records are duplicates or not. Cora has also been used in several record linkage studies (Kopcke and Rahm, 2010; Ojokoh et al., 2011; Christen, 2008; Hassanzadeh and Miller, 2009).

Evaluation metrics:
The effectiveness of each classifier can be measured by the precision, recall and F-score metrics. The following counts are required in order to calculate these metrics:

• True Positive (TP): the number of record pairs classified as matches that really are matches
• False Positive (FP): the number of record pairs classified as matches that are really non-matches
• False Negative (FN): the number of record pairs classified as non-matches that really are matches

Precision is defined as TP/(TP + FP), recall as TP/(TP + FN) and the F-score as the harmonic mean of precision and recall.
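From these counts the metrics follow directly; a minimal Python sketch:

    def evaluation_metrics(tp, fp, fn):
        # Precision, recall and F-score from the counts defined above.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return precision, recall, f_score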
For the Restaurant dataset, the experimental results show that SVM outperforms the other algorithms in terms of precision, recall and F-measure. After SVM, the Bayesian network classifier is more effective than the remaining two. Figure 2 shows a comparison of the effectiveness of the different classifiers on the Restaurant dataset; as can be seen from Fig. 2, SVM outperforms the other techniques on all of the evaluation metrics.

On the Cora dataset, unlike Restaurant, the Naïve Bayes method outperforms the others on all evaluation metrics. The F-measure values for SVM, decision tree, Naïve Bayes and Bayesian network are 85.73, 68.30, 89.70 and 89.13, respectively; the results of Naïve Bayes are thus slightly better than those of the Bayesian network. Figure 3 compares the effectiveness of the different classifiers on the Cora dataset.

DISCUSSION
Finding and matching duplicate records is an essential task for improving data quality. In this study, several prominent classification techniques were utilized in order to detect duplicate records in two integrated real-world datasets. As the experimental results show, the effectiveness of the classifiers in detecting duplicate records differs depending on the input dataset. While SVM outperforms the other methods in detecting duplicate objects in the Restaurant dataset, Naïve Bayes produces the best results on the Cora dataset. However, the precision of SVM is still noticeable on the Cora dataset.

CONCLUSION
Considering the results, there is no single best classification technique for all datasets. Users should try different classification techniques on a new dataset in order to determine the best technique for it. However, SVM, which is known as a robust and prominent classification technique, is a good default option for the classification task.
Applying the record linkage task in data integration improves the quality of data significantly, which leads to more accurate decisions in information systems. Other methods to enhance the effectiveness of detecting duplicate records, such as combining similarity measures in classification or finding more appropriate similarity measures, will be examined in future work.