Research Article Open Access

CLUSTER BASED DUPLICATE DETECTION

A. Venkatesh Kumar1 and S. Vengataasalam1
  • 1 , India

Abstract

We propose a clustering technique for entropy-based text dissimilarity calculation in a de-duplication system. To improve the quality of grouping, we propose a Multi-Level Group Detection (MLGD) algorithm, which produces accurate groups of closely related objects using the Alternative Decision Tree (ADT) technique. We propose two new algorithms. The first is MLGD formation using an Alternative Decision Tree (ADTree), which splits the set of records into self-sized clusters to reduce the volume of data for text comparisons. The second calculates the dissimilarity percentage using entropy and Information Gain (IG). We show experimentally that the proposed technique achieves higher average accuracy than existing traditional de-duplication systems. Further, our technique does not require any manual tuning, either for cluster formation or for dissimilarity calculation, for any kind of business data. In summary, this study presents a new, efficient method for cluster formation using the ADTree algorithm for duplicate detection. The new method offers a more accurate dissimilarity measure for the data in each cluster, without manual intervention at the time of duplicate detection.
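For readers unfamiliar with the two quantities named in the abstract, the following is a minimal sketch of Shannon entropy and Information Gain in their standard textbook form. It is not the authors' dissimilarity formula, which the abstract does not state. The function names `entropy` and `information_gain` and the toy records are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the size-weighted entropy within each group.

    `groups[i]` is the partition (for example, a cluster id) of record i.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    partitions = {}
    for label, group in zip(labels, groups):
        partitions.setdefault(group, []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four records, two clusters (assumed, not from the paper).
labels = ["dup", "dup", "unique", "unique"]
clusters = ["a", "a", "b", "b"]
print(entropy(labels))                     # entropy of the label mix
print(information_gain(labels, clusters))  # gain from splitting by cluster
```

In this toy case the clusters separate the two labels perfectly, so the gain equals the full entropy of the labels. A split that leaves each cluster with the same label mix as the whole set would yield a gain of zero.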

Journal of Computer Science
Volume 9 No. 11, 2013, 1514-1518

DOI: https://doi.org/10.3844/jcssp.2013.1514.1518

Submitted On: 9 September 2013 Published On: 28 September 2013

How to Cite: Kumar, A. V. & Vengataasalam, S. (2013). CLUSTER BASED DUPLICATE DETECTION. Journal of Computer Science, 9(11), 1514-1518. https://doi.org/10.3844/jcssp.2013.1514.1518

Keywords

  • Clustering Algorithm
  • Alternative Decision Tree Algorithm
  • Duplicate Detection
  • Efficient Method
  • Manual Intervention
  • Cluster Data
  • Similarity Measure
  • Clustering Formation