Research Article Open Access

Web Document Segmentation Using Frequent Term Sets for Summarization

Chitra Pasupathi1, Baskaran Ramachandran2 and Sarukesi Karunakaran3
  • 1 RMK Engineering College, India
  • 2 Anna University, India
  • 3 Hindustan University, India
Journal of Computer Science
Volume 8 No. 12, 2012, 2053-2061

DOI: https://doi.org/10.3844/jcssp.2012.2053.2061

Submitted On: 4 June 2012
Published On: 19 December 2012

How to Cite: Pasupathi, C., Ramachandran, B. & Karunakaran, S. (2012). Web Document Segmentation Using Frequent Term Sets for Summarization. Journal of Computer Science, 8(12), 2053-2061. https://doi.org/10.3844/jcssp.2012.2053.2061

Abstract

Query-sensitive summarization aims to extract the query-relevant content from web documents. Web page segmentation reduces the run-time overhead of summarization systems by grouping the related content of a web page into segments ahead of time. At query time, the query-relevant segments of the web page are identified, and important sentences from these segments are extracted to compose the summary. The DOM tree structure of each web document is used to segment its content: leaf nodes of the DOM tree are merged into segments according to a statistical and linguistic similarity measure. The proposed system has been evaluated intrinsically using a user satisfaction index, and its performance is compared with summarization that does not use preprocessed segments. Because relevance is measured over the most frequent term sets, the system performs more promisingly than measures such as cosine similarity and the Jaccard measure, which rely on sparse term-frequency vectors. Only the relevant segments need to be processed at run time, which reduces the time complexity of the summarization process.
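The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, so the overlap score, the stopword list, the threshold, the `top_k` cutoff, and all function names below are assumptions chosen for illustration. Leaf-node texts are taken as already extracted from the DOM tree, adjacent leaves are merged when their most frequent terms overlap, and at query time only segments whose frequent terms match the query are kept.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "are"}

def terms(text):
    """Lowercase alphabetic tokens of text with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def frequent_terms(text, top_k=5):
    """Set of the top_k most frequent terms in text."""
    return {term for term, _ in Counter(terms(text)).most_common(top_k)}

def overlap(a, b):
    """Share of the smaller term set that also occurs in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def segment(leaf_texts, threshold=0.3, top_k=5):
    """Greedily merge adjacent leaf-node texts whose frequent terms overlap."""
    segments = []
    for text in leaf_texts:
        if segments and overlap(frequent_terms(segments[-1], top_k),
                                frequent_terms(text, top_k)) >= threshold:
            segments[-1] = segments[-1] + " " + text  # extend the current segment
        else:
            segments.append(text)  # start a new segment
    return segments

def relevant_segments(segments, query, top_k=5):
    """Keep segments whose frequent terms share at least one term with the query."""
    query_terms = set(terms(query))
    return [s for s in segments if query_terms & frequent_terms(s, top_k)]

# Example leaf-node texts, standing in for the leaves of a parsed DOM tree.
leaves = [
    "Solar panels convert sunlight. Solar panels need sunlight.",
    "Solar panels on roofs catch sunlight all day.",
    "Pasta recipes use tomato sauce. Tomato sauce suits pasta.",
    "Good pasta needs fresh tomato sauce.",
]
segments = segment(leaves)
hits = relevant_segments(segments, "tomato sauce")
```

Segmentation runs once per page as a preprocessing step, so at query time only the comparison in `relevant_segments` and the later sentence extraction remain. This is the run-time saving the abstract refers to.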

Keywords

  • Search Engine Optimization
  • Segmentation
  • Summarization
  • Pre-Processing
  • Query Sensitive