Improving the Performance of Multivariate Bernoulli Model based Documents Clustering Algorithms using Transformation Techniques
Abstract
Problem statement: Document clustering is one of the most important areas of data mining and is currently the subject of significant global research, since it underpins web intelligence, web mining, web search engine design and related applications. Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Approach: This study explores the suitability of a multivariate Bernoulli model-based probabilistic algorithm for text clustering. In a multivariate Bernoulli model, a document is represented as a binary vector over the space of words, where 1 and 0 indicate whether or not a word occurs in the document. The number of occurrences is not considered, so word frequency information is lost. In this work, we propose an FFT-based transformation technique to improve the clustering performance of the multivariate Bernoulli model-based probabilistic algorithm. The transformation converts the actual term frequency counts into a time-domain signal, so that the weight of each word's frequency is distributed across the whole row of each record. Applying the multivariate Bernoulli model to the transformed values, split into those less than zero and those greater than zero, improves performance because this representation does not discard the term frequency information. Results: Bernoulli model-based clustering and the improved version are implemented and evaluated using suitable metrics, and the results are presented. Conclusion: The transformation technique significantly improves the document clustering performance of the multivariate Bernoulli model.
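The contrast between the plain binary representation and the sign of the transformed values can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy term-frequency matrix, the use of the real part of NumPy's `numpy.fft.fft` along each document row, and the rule that only strictly positive values map to 1 are all assumptions made for illustration.

```python
import numpy as np


def binarize_presence(tf):
    """Standard Bernoulli input: 1 if the word occurs at all, else 0."""
    return (tf > 0).astype(np.uint8)


def binarize_fft_sign(tf):
    """Assumed variant: FFT each document row, then binarize by sign.

    Taking the real part and mapping values > 0 to 1 (everything else
    to 0) is an illustrative choice, not a detail given in the abstract.
    """
    transformed = np.fft.fft(tf, axis=1).real
    return (transformed > 0).astype(np.uint8)


# Hypothetical term-frequency matrix: 2 documents x 4 vocabulary terms.
# Both documents contain the same words but with different counts.
tf = np.array([
    [3, 0, 1, 0],
    [1, 0, 3, 0],
])

presence = binarize_presence(tf)
fft_bits = binarize_fft_sign(tf)

print(presence)  # both rows identical: the counts are lost
print(fft_bits)  # rows differ: the counts influence the signs
```

In this toy case the two documents are indistinguishable under the presence/absence encoding, while the sign pattern of the transformed rows separates them, which is the effect the abstract attributes to spreading each word's frequency weight across the row.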
DOI: https://doi.org/10.3844/jcssp.2011.762.769
Copyright: © 2011 Perumal Pitchandi and Nedunchezhian Raju. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Text clustering
- text classification
- document clustering
- model based clustering
- term document matrix
- Text to Matrix Generator (TMG)
- Bernoulli model
- Fast Fourier Transformation (FFT)
- transformation technique
- clustering algorithms