COMPARATIVE STUDY OF K-MEANS AND K-MEANS++ CLUSTERING ALGORITHMS ON CRIME DOMAIN

Bashar Aubaidan; Masnizah Mohd; Mohammed Albared

doi:10.3844/jcssp.2014.1197.1206

Research Article Open Access


Bashar Aubaidan¹, Masnizah Mohd² and Mohammed Albared²

¹ , Iraq
² Universiti Kebangsaan Malaysia, Malaysia

Abstract

This study presents an experimental comparison of two document clustering techniques, k-means and k-means++, applied to crime document clustering. A drawback of k-means is that the user must define the initial centroid points. This becomes more critical in document clustering because each center point is represented by a word, and calculating the distance between words is not a trivial task. To overcome this problem, k-means++ was introduced to find good initial center points. Since k-means++ has not been applied before to crime document clustering, this study investigates whether the initialization process in k-means++ yields better results than k-means. We propose using the k-means++ clustering algorithm to identify the best seeds for the initial cluster centers when clustering crime documents. The method includes a pre-processing phase, which involves tokenization, stop-word removal and stemming. In addition, we evaluate the impact of two similarity/distance measures (cosine similarity and the Jaccard coefficient) on the results of the two clustering algorithms. Experimental results on several settings of the crime data set show that, by identifying the best seeds for the initial cluster centers, k-means++ performs significantly better than k-means (at the 95% confidence level). These results demonstrate the accuracy of the k-means++ clustering algorithm in clustering crime documents.
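The seeding idea and the two similarity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the document representation (term-frequency `Counter` objects), the set-based form of the Jaccard coefficient, the use of `1 - similarity` as the distance, and the function names `cosine_similarity`, `jaccard_coefficient` and `kmeanspp_seeds` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Set-based Jaccard coefficient over the terms of two documents.

    Assumption: term frequencies are ignored; only term presence counts.
    """
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def kmeanspp_seeds(docs, k, similarity, rng=None):
    """Pick k seed indices in the style of k-means++.

    The first seed is chosen uniformly at random. Each further seed is
    drawn with probability proportional to the squared distance to its
    nearest already-chosen seed, where distance = 1 - similarity
    (an assumption of this sketch).
    """
    if not 1 <= k <= len(docs):
        raise ValueError("k must be between 1 and the number of documents")
    rng = rng or random.Random()
    seeds = [rng.randrange(len(docs))]
    while len(seeds) < k:
        weights = []
        for i, doc in enumerate(docs):
            if i in seeds:
                weights.append(0.0)  # never re-pick an existing seed
                continue
            nearest = min(1.0 - similarity(doc, docs[s]) for s in seeds)
            weights.append(max(nearest, 0.0) ** 2)
        if sum(weights) == 0:
            # Every remaining document coincides with a seed:
            # fall back to a uniform choice among the non-seeds.
            remaining = [i for i in range(len(docs)) if i not in seeds]
            seeds.append(rng.choice(remaining))
        else:
            seeds.append(rng.choices(range(len(docs)), weights=weights, k=1)[0])
    return seeds
```

Calling `kmeanspp_seeds(docs, k, cosine_similarity)` or `kmeanspp_seeds(docs, k, jaccard_coefficient)` returns the indices of the documents that would serve as initial cluster centers; passing the measure as a parameter keeps the seeding independent of the similarity choice, in line with the comparison described in the abstract.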

Journal of Computer Science

Volume 10 No. 7, 2014, 1197-1206


Submitted On: 21 November 2013 Published On: 19 February 2014

How to Cite: Aubaidan, B., Mohd, M. & Albared, M. (2014). COMPARATIVE STUDY OF K-MEANS AND K-MEANS++ CLUSTERING ALGORITHMS ON CRIME DOMAIN. Journal of Computer Science, 10(7), 1197-1206. https://doi.org/10.3844/jcssp.2014.1197.1206

Copyright: © 2014 Bashar Aubaidan, Masnizah Mohd and Mohammed Albared. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



Keywords

Crime Document Clustering
K-Means++
K-Means Algorithm
Similarity/Distance Measures