Journal of Computer Science

Web Document Segmentation Using Frequent Term Sets for Summarization

Chitra Pasupathi, Baskaran Ramachandran and Sarukesi Karunakaran

DOI: 10.3844/jcssp.2012.2053.2061

Volume 8, Issue 12

Pages 2053-2061

Abstract

Query-sensitive summarization aims at extracting the query-relevant content from web documents. Web page segmentation reduces the run-time overhead of summarization systems by grouping the related contents of a web page into segments ahead of time. At query time, the query-relevant segments of the web page are identified and important sentences from these segments are extracted to compose the summary. The DOM tree structures of the web documents are used to segment the contents: leaf nodes of the DOM trees are merged into segments according to statistical and linguistic similarity measures. The proposed system has been evaluated intrinsically, using a user satisfaction index, and its performance is compared with summarization that does not use preprocessed segments. The proposed approach is more promising than measures such as cosine similarity and the Jaccard measure, which rely on sparse term-frequency vectors, because only the most frequent term sets are considered when measuring relevance. Only the relevant segments need to be processed at run time, which reduces the time complexity of the summarization process.
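The pipeline described above (merge DOM leaf nodes into segments, then pick query-relevant segments at run time) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the similarity measure, so the `overlap` function, the `threshold` and `k` parameters, the small stopword list, and all function names here are assumptions chosen for illustration. DOM leaf nodes are represented simply as a list of text strings in document order.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list (not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on", "with"}


def terms(text):
    """Return lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def frequent_terms(text, k=5):
    """Return the k most frequent terms of a text as a set."""
    return {t for t, _ in Counter(terms(text)).most_common(k)}


def overlap(a, b):
    """Shared fraction of the smaller term set (an assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def segment(leaves, threshold=0.3, k=5):
    """Merge consecutive leaf texts whose frequent term sets overlap enough."""
    segments = []
    for leaf in leaves:
        if segments and overlap(frequent_terms(segments[-1], k),
                                frequent_terms(leaf, k)) >= threshold:
            # Similar to the current segment: extend it.
            segments[-1] = segments[-1] + " " + leaf
        else:
            # Dissimilar: start a new segment.
            segments.append(leaf)
    return segments


def relevant_segments(segments, query, k=5):
    """Keep segments whose frequent terms share at least one term with the query."""
    q = set(terms(query))
    return [s for s in segments if q & frequent_terms(s, k)]
```

Calling `segment` once per page and `relevant_segments` once per query mirrors the split the abstract describes: segmentation is done as preprocessing, so only the matching segments have to be examined when a query arrives.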

Copyright

© 2012 Chitra Pasupathi, Baskaran Ramachandran and Sarukesi Karunakaran. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.