CLUSTER BASED DUPLICATE DETECTION
A. Venkatesh Kumar and S. Vengataasalam
DOI : 10.3844/jcssp.2013.1514.1518
Journal of Computer Science
Volume 9, Issue 11
We propose a clustering technique for entropy-based text dissimilarity calculation in a de-duplication system. To improve the quality of grouping, we propose a Multi-Level Group Detection (MLGD) algorithm, which uses the Alternative Decision Tree (ADTree) technique to produce accurate groups of the most closely related objects. We present two new algorithms. The first is MLGD formation using ADTree, which splits a batch of records into self-sized clusters to reduce the volume of data for text comparisons. The second calculates the dissimilarity percentage using entropy and Information Gain (IG). We show experimentally that the proposed technique achieves higher average accuracy than existing traditional de-duplication systems. Further, the technique does not require any manual tuning for cluster formation or dissimilarity calculation, for any kind of business data. In summary, this study introduces a new, efficient method for cluster formation using the ADTree algorithm for duplicate detection. The new method offers a more accurate dissimilarity measure for each cluster's data without manual intervention at the time of duplicate detection.
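The abstract names entropy and Information Gain as the basis of the dissimilarity measure but does not state the formulas used. As a minimal sketch, assuming the standard Shannon definitions, the two quantities can be computed as follows. The function names `entropy` and `information_gain` are hypothetical, and the step that converts these values into a dissimilarity percentage is not shown because the abstract does not specify it.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of hashable labels.

    Assumption: standard definition H = sum(p * log2(1 / p)); the
    abstract does not say which variant the authors use.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the size-weighted entropy of `groups`.

    `groups` is assumed to be a partition of `labels` (for example,
    the clusters produced by some split of the records).
    """
    total = len(labels)
    if total == 0:
        return 0.0
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder
```

For instance, `information_gain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]])` gives a gain of 1 bit, because the split separates the two labels completely, while a split into `[["a", "b"], ["a", "b"]]` gives a gain of 0.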
© 2013 A. Venkatesh Kumar and S. Vengataasalam. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.