A Comparative Study in Classification Techniques for Unsupervised Record Linkage Model

Mohammadreza Ektefa; Fatimah Sidi; Hamidah Ibrahim; Marzanah A. Jabar; Sara Memar

doi:10.3844/jcssp.2011.341.347

Research Article Open Access

Abstract

Problem statement: Record linkage is a technique for detecting and matching duplicate records generated during data integration. A variety of record linkage algorithms with different steps have been developed to detect such duplicates. To decide whether two records are duplicates, different studies have used supervised and unsupervised classification techniques. To use supervised classification algorithms without spending a lot of time labeling data manually, previous studies proposed a two-step method that selects the training data automatically. However, the effectiveness of the different classification techniques must be taken into account if record linkage systems are to classify records more accurately. Approach: To determine and compare the effectiveness of different supervised classification techniques used in an unsupervised manner, several prominent classification methods were applied to duplicate record detection. Experiments on duplicate detection and record classification were run on two real-world datasets, Cora and Restaurant, using Support Vector Machines, Naïve Bayes, Decision Tree and Bayesian Networks. Results: The experimental results show that Support Vector Machines performs best on the Restaurant dataset, with an F-measure of 96.27%, while Naïve Bayes performs best on the Cora dataset, with an F-measure of 89.7%. Conclusion/Recommendation: The results of detecting duplicate records with different classification techniques vary with the dataset used. Overall, Support Vector Machines and Naïve Bayes outperformed the other methods in our experiments.
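The results above are reported as F-measure. The abstract does not define it, so the following is a minimal sketch that assumes the standard definition (the harmonic mean of precision and recall) computed over record pairs classified as duplicates. The function name `f_measure` and the toy record identifiers are hypothetical and do not come from the paper.

```python
def f_measure(true_matches, predicted_matches):
    """Harmonic mean of precision and recall over duplicate record pairs.

    Each pair is an iterable of two record identifiers; order within a
    pair is ignored, so (1, 2) and (2, 1) count as the same pair.
    """
    truth = {frozenset(pair) for pair in true_matches}
    predicted = {frozenset(pair) for pair in predicted_matches}

    # Pairs that were predicted as duplicates and really are duplicates.
    true_positives = len(truth & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (illustrative only, not from the Cora or Restaurant datasets).
gold = [(1, 2), (3, 4), (5, 6)]
guess = [(2, 1), (3, 4), (7, 8), (9, 10)]
score = f_measure(gold, guess)
```

In this toy case two of the four predicted pairs are correct (precision 0.5) and two of the three true pairs are found (recall 2/3), so the score is 4/7. Whether the paper computes the measure over pairs in exactly this way is an assumption.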

Journal of Computer Science

Volume 7 No. 3, 2011, 341-347

Submitted On: 13 October 2010 Published On: 7 March 2011

How to Cite: Ektefa, M., Sidi, F., Ibrahim, H., Jabar, M. A. & Memar, S. (2011). A Comparative Study in Classification Techniques for Unsupervised Record Linkage Model. Journal of Computer Science, 7(3), 341-347. https://doi.org/10.3844/jcssp.2011.341.347

Copyright: © 2011 Mohammadreza Ektefa, Fatimah Sidi, Hamidah Ibrahim, Marzanah A. Jabar and Sara Memar. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Record linkage
duplicate detection
classification techniques
Optical Character Recognition (OCR)
Longest Common Subsequence (LCS)
data integration
support vector machines
heterogeneous data
ID3 algorithm
Bayesian network