WEB DOCUMENT SEGMENTATION USING FREQUENT TERM SETS FOR SUMMARIZATION

Query sensitive summarization aims at extracting the query relevant contents from web documents. Web page segmentation focuses on reducing the run time overhead of summarization systems by grouping the related contents of a web page into segments. At query time, query relevant segments of the web page are identified and important sentences from these segments are extracted to compose the summary. DOM tree structures of the web documents are utilized to perform the segmentation of the contents. Leaf nodes of DOM trees are merged to form segments according to a statistical and linguistic similarity measure. The proposed system has been evaluated by an intrinsic approach making use of a user satisfaction index. The performance of the system is compared with summarization without using preprocessed segments. Performance of this system is more promising than that of other measures like cosine similarity and the Jaccard measure, which make use of sparse term-frequency vectors, since the most frequent term sets are considered to measure the relevance. Only the relevant segments need to be processed at run time for summarization, which reduces the time complexity of the summarization process.


INTRODUCTION
The exponential growth of the volume of web documents poses a hard challenge to users in locating, retrieving and using the huge contents pooled over the WWW. Search engines help users to search and locate the information to an extent. For each query, the search engine returns thousands of URLs as the result set, which includes redundant as well as irrelevant links. Improving the retrieval efficiency to meet the users' personalized needs becomes critical in the information retrieval domain.
Summarization techniques focus on reducing the time and effort required for the user to understand the core concept of a large document by providing a short summary. Query-based summarization extracts or abstracts pieces of information from web documents to answer a query. Processing the entire document dynamically at run time according to the query is a challenging task for the processing capacity and response time of automatic summarizers.
Some kind of pre-processing methods like topic based or content based segmentation, sentence clustering can be applied to reduce the processing overhead at run time. This study focuses on content segmentation employing the relevance measurement technique using statistical and linguistic measures.

JCS
representation of text streams using two-sided contextual information. The Latent Semantic Analysis (LSA) (Landauer et al., 1998;Landauer and Dumais, 1997) and the Hyperspace Analogues to Language (HAL) model (Landauer et al., 1998) were two well-known methods in corpus-based similarity. The basic idea of LSA was that the aggregation of all the word contexts in which a given word did or did not appear would represent the similarity between text units. LSA did not take into account any syntactic information and was thus more appropriate for longer texts.
The HAL (Landauer and Dumais, 1997) method used lexical co-occurrence to produce a high-dimensional semantic space. Similarity between two sentences was calculated using Euclidean distance. The drawback of HAL was due to the building of the memory matrix and its approach to forming sentence vectors, which did not capture sentence meaning well.
The vector-based document model methods were commonly used in Information Retrieval (IR) systems (Mohamed and Rajasekaran, 2006), where the document most relevant to an input query was determined by representing a document as a word vector and then matching queries to similar documents in the document database via a similarity metric (Chen and Shen, 2009). An extension of word co-occurrence methods led to the pattern matching methods which were used in text mining and conversational agents (Iosif and Potamianos, 2010). This technique relied on the assumption that more similar documents would have more words in common. But it is not always the case that texts with similar meaning share many words (Wang et al., 2008).
The Semantic Text Similarity (STS) method, which uses the Longest Common Subsequence (LCS) measure for string similarity, was proposed by Islam and Inkpen (2008). This approach was improved to compute a similarity measure between text units using feature vectors. Kogilavani (2012) used Term Synonym Concept Frequency-Inverse Sentence Frequency (TSCF-ISF) to measure the weight of a word to detect dominant concepts in web documents.
The VIsion-based Page Segmentation (VIPS) algorithm was proposed by Cai et al. (2003), which segmented the web page by simulating the way humans understand the web layout structure. This approach used a human visual perception model. Kohlschütter and Nejdl (2008) proposed a segmentation approach which utilized the notion of text density as a measure to identify the individual segments of the web page by reducing the problem to a 1D-partitioning task. Pnueli et al. (2009) described an algorithm that recursively segments the layout of the page and the UI components using the page's rendered image.
Most of these works were not query relevant and the generated segments were not in view of improving summarization process.
The original contributions of this study are:
• Focuses on reducing the time complexity of extractive summarization process at run time
• Relevant pieces of scattered information in the web document are grouped as segments during preprocessing
• DOM tree structures of the web documents are utilized and the relevant leaf nodes are merged to form the segments
• Frequent term sets and the WordNet (wordnet.princeton.edu/wordnet/download/) distance between term sets of the nodes are used to measure the semantic relevance between blocks of text
• The segment details are materialized in relational database and could be used for generating query sensitive summaries on the fly during query time

Frequent Term Set Based Segmentation and Summary Generation
Query sensitive summarization techniques aim at providing the gist of a document with respect to the user query (Mohamed and Rajasekaran, 2006;Chen and Shen, 2009). This short summary is useful in understanding the larger document without reading the entire content. Summarization can be abstractive or extractive in nature. In abstractive summarization, NLP techniques are used to generate an abstract of the content by framing sentences. In the latter method, the summary is composed by extracting the important sentences from the document.
Generating the summary of a document at run time based on the dynamic query given by the user requires huge processing capacity of the information processing servers. Each sentence in the selected web page needs to be verified for relevance to the given query and assigned a score which is a measure of the importance of the sentence. The time required for generating the summary can be reduced drastically by limiting the size of the text unit to be processed for summarization.

This study proposes a preprocessing technique in which the relevant pieces of information that are scattered throughout the document are merged (Chitra et al., 2011) into segments. Information about the segments and the keywords is stored in a relational database (Feng et al., 2011). At run time, only the segments relevant to the query, i.e., those having matching keywords, are identified and processed to generate the summary. This system uses an unsupervised technique for segmenting the web documents, which can be extended to any domain. Figure 1 shows the segmentation and summarization process of the proposed system. The web crawler crawls the WWW starting from the seed pages provided, and captures and saves the documents in the server's database. The indexer component periodically indexes these documents by creating a keyword based inverted index for the documents.
Indexed documents are segmented using frequent term set based segmentation technique and their segment ID and the frequent terms are saved in segments database (relationalDB) for further usage during summarization.
The user enters the query string through the search engine user interface, based on which the search engine identifies the set of matching documents from the database and presents the URLs to the user in rank order. The user is required to choose the URL from which he/she wishes to get the gist of the content. Segments relevant to the user query are extracted from the Segment Database and processed to generate the summary. The scope for the selection of summary sentences is now reduced to only the few segments that are relevant to the query string. This technique minimizes the processing overhead of the information servers at run time for dynamic summary generation.

Frequent Term Sets Based Segmentation
A segment on the web page is the collection of content from the page that is identified as distinct from the rest of the page in some way. Figure 2 (Frequent term set based segmentation of HTML document) depicts the segmentation (Yen and Hsu, 2009;Chitra et al., 2011) using DOM (http://www.w3c.org/DOM/) tree structure. The nodes from left to right of a parent constitute a coherent semantic string of the content (Li et al., 2006).
Leaf nodes (Li et al., 2006) are considered as micro blocks, which are the basic building blocks. Adjacent micro blocks of the same parent tag are merged to form the topic blocks. These topic blocks are stemmed after removing stop words like a, an, it and to, which do not contribute much to the core content of the blocks. Frequent term sets and their frequencies in each topic block are identified. The frequent term set based relevance measure is used to measure the semantic relevance between the topic blocks. Topic blocks having similarity above the threshold value α (0.6) are combined to form the concept block. The value of α is chosen so that intra-segment relevance is high and inter-segment relevance is low. Relevant topic blocks are expected to have content about the same concept (for example, placement and training in a college web site). The similarity measure also considers the WordNet distance between the frequent terms (Pnueli et al., 2009), which is considered to be a better measure that simulates the human thought process.
Segmentation is carried out offline for all web documents in the repository and segment details are materialized in the relational database for further processing during summarization. The sentences in each of these segments may actually be present in different parts of the document.
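The grouping of leaf nodes into topic blocks can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the implementation of the proposed system. The class `TopicBlockParser`, the helper `topic_blocks` and the sample markup are hypothetical. As a simplifying assumption, each element that directly contains text is treated as a micro block, and consecutive micro blocks sharing the same parent element are merged.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and therefore never enclose text.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TopicBlockParser(HTMLParser):
    """Collects text leaves and merges adjacent leaves of the same parent."""

    def __init__(self):
        super().__init__()
        self._stack = []        # (tag, unique id) of the currently open elements
        self._next_id = 0
        self.topic_blocks = []  # list of (parent id, [leaf texts])

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._stack.append((tag, self._next_id))
        self._next_id += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or len(self._stack) < 2:
            return
        parent_id = self._stack[-2][1]  # parent of the element holding the text
        if self.topic_blocks and self.topic_blocks[-1][0] == parent_id:
            self.topic_blocks[-1][1].append(text)        # same parent: merge
        else:
            self.topic_blocks.append((parent_id, [text]))  # new topic block


def topic_blocks(html):
    """Return the text of each topic block in document order."""
    parser = TopicBlockParser()
    parser.feed(html)
    parser.close()
    return [" ".join(texts) for _, texts in parser.topic_blocks]


SAMPLE = ("<html><body><ul><li>Placement cell</li><li>Training programs</li></ul>"
          "<div><p>Admission details</p></div></body></html>")
blocks = topic_blocks(SAMPLE)
```

In this sample the two list items share the same `ul` parent and form one topic block, while the paragraph under the `div` forms another. A production version would also need to skip script and style content.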
Since DOM nodes are processed, the time taken for processing is less when compared to other vector based and document graph (Wang et al., 2008;Mohamed and Rajasekaran, 2006) based models. The processing time required to build the document graph is eliminated in this approach.

Frequent Term Set Based Segmentation Algorithm
The Frequent Term Sets (FTS) and their WordNet distances are the important factors in measuring the similarity between topic blocks. The segmentation algorithm is described below:

Step 1: Mark all leaf nodes as individual micro blocks in the DOM tree.
Step 2: Extend the border of the micro block to include all leaf nodes of the same parent tag to form a topic block, so as to have a set of topic blocks TB = {tb_1, tb_2, …, tb_n}, TB ⊂ d_i.
Where:
TF_(tb1,i) = Frequency of the i-th term in tb_1
TF_(tb2,j) = Frequency of the j-th term in tb_2
tfweight_(tb1,i), tfweight_(tb2,j) = Weights of terms i and j with respect to topic blocks tb_1 and tb_2, normalized by the frequency vectors of tb_1 and tb_2, calculated (Chitra et al., 2011;Islam and Inkpen, 2008;Hao et al., 2011) as in Equation 2
Where:
tfweight_i = The importance of the i-th term with respect to the frequent term vector of topic block tb_i
tf_i = Frequency of the i-th term in tb_i
R((tb1,i),(tb2,j)) = Relevance between the i-th term in tb_1 and the j-th term in tb_2, measured using the WordNet distance (Yen and Hsu, 2009;Shehata et al., 2010) between the terms as in Equation 3
The similarity score is normalized by the frequency vectors of both topic blocks so that the resulting score lies in the range 0 to 1.
Step 5: Merge the topic blocks having similarity measure above the predefined threshold α. Segment S_k = {set of topic blocks tb_i} | ∀ tb_i, tb_j ∈ S_k, sim(tb_i, tb_j) > α, tb_i ⊂ TB, tb_j ⊂ TB, k = 1..n
Step 6: Output segments S_1, S_2, …, S_n and their respective frequent term sets.
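The merging in Step 5 can be illustrated with a short sketch. It assumes a greedy strategy, which the text does not specify: a topic block joins the first segment in which its similarity to every existing member exceeds α, and otherwise starts a new segment. The function names are hypothetical, and `toy_sim` is a simple word-overlap stand-in used only to make the example runnable; it is not the similarity metric of this study.

```python
def merge_topic_blocks(blocks, sim, alpha=0.6):
    """Greedily group topic blocks so that all pairs in a segment satisfy sim > alpha."""
    segments = []
    for block in blocks:
        for segment in segments:
            # Join only if the block is similar to every member of the segment.
            if all(sim(block, member) > alpha for member in segment):
                segment.append(block)
                break
        else:
            segments.append([block])  # no compatible segment: start a new one
    return segments


def toy_sim(block_a, block_b):
    """Stand-in similarity: shared words divided by all distinct words."""
    words_a, words_b = set(block_a.split()), set(block_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


blocks = ["placement training", "training placement drive", "hostel fees"]
segments = merge_topic_blocks(blocks, toy_sim, alpha=0.6)
```

Here the first two blocks share two of three distinct words (score about 0.67) and are merged, while the third block shares none and forms its own segment.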
Similarity between topic blocks is measured by considering the frequent term sets of the topic blocks tb1 and tb2. The frequencies of these terms and their topic block based weightage are used to measure the similarity score. tfweight (tb1,i) represents the importance of a particular term within the topic block, where term frequencies are normalized by the topic block wide frequency factor. The WordNet distance (Li et al., 2006;Hao et al., 2011) between the terms is a useful measure for identifying the semantic relevance between the terms. Words that are semantically closer get a higher score, which in turn increases the similarity score between the topic blocks when added to the TF based score.
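A minimal sketch of how frequent terms and their weights might be obtained for one topic block is given below. Several choices are assumptions made for illustration: the stop word list is abbreviated, `crude_stem` is a placeholder for a real stemmer, a term counts as frequent when it occurs at least `min_count` times, and the weight is taken as the term frequency divided by the total frequency of the block's frequent terms. The exact form of Equation 2 may differ.

```python
import re
from collections import Counter

# Abbreviated, illustrative stop word list.
STOP_WORDS = {"a", "an", "the", "it", "to", "of", "and", "in", "is", "for"}


def crude_stem(word):
    """Placeholder stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def frequent_terms(text, min_count=2):
    """Return {stemmed term: frequency} for terms occurring at least min_count times."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(stems)
    return {term: c for term, c in counts.items() if c >= min_count}


def tf_weights(term_counts):
    """Assumed weighting: each frequency divided by the block-wide total."""
    total = sum(term_counts.values())
    return {term: c / total for term, c in term_counts.items()} if total else {}


text = "Training and placement: the placement cell arranges training for students."
terms = frequent_terms(text)
weights = tf_weights(terms)
```

With this normalization the weights of a block sum to 1, so blocks of different lengths become comparable.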
Segments of all web documents in the repository are identified during pre-processing stage and stored in the database.

Similarity Metric to Measure Topic Blocks Relevance
Cosine similarity is the most common measure used to find the relevance between text segments (Cai et al., 2003; Kumar, 2011). This measure uses a bag-of-words approach in which the term vectors are sparse, containing mostly null values. The Jaccard measure relies on the words common to both texts, and such overlap is uncommon in real documents. Relevant documents need not contain the same set of words to convey the same meaning. Yen and Hsu (2009) proposed a measure that was tested using dmoz.org (www.dmoz.org/) web pages and proved to work effectively. Yen and Hsu (2009) considered only the term frequency of all terms in both documents. The importance of a particular term within the document, which contributes more to measuring relevance, is not taken into account. For one particular term t1, the frequency may be very low in document d1 and very high in document d2. For the same term t1, documents d3 and d4 may have moderate frequencies compared to d1 and d2. The two cases make no difference in the above mentioned metric. Longer documents are likely to have high frequencies for many terms, which need to be normalized to find the relevance.
The proposed metric given in Equation 1 considers term frequency, term weight with respect to the topic block and also the WordNet distance to find the relevance between the topic blocks. Hence the segmentation process is more promising than the other approaches. Consider the previous scenario with four sample documents and a term t1. The term weightage changes according to the importance of the term, and the relevance score changes with it. Our metric is more effective since it considers term frequency, term weightage and also the WordNet distance. Term weightage itself is normalized by the length of the topic block before being multiplied by the WordNet based relevance. The final score is again normalized by the length of the topic blocks, as in the cosine similarity measure, which produces better results than the existing measures.
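The limitation of cosine similarity and the Jaccard measure noted above can be seen in a small example. The two term-frequency vectors below are invented for illustration: they describe related content but share no word, so both measures report zero similarity.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard measure over the sets of terms in the two vectors."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


# Hypothetical topic blocks on related subjects with disjoint vocabularies.
block1 = Counter({"college": 3, "campus": 1})
block2 = Counter({"education": 2, "university": 2})

cos_score = cosine(block1, block2)
jac_score = jaccard(block1, block2)
```

Both scores are zero here because neither measure has any notion of relatedness between distinct words, which is the gap a WordNet based relevance term is meant to fill.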

Unlike the vectors used in the cosine similarity and Jaccard measures, the term frequency vector of a topic block contains only the words of that block and their frequencies. The vectors of two topic blocks need not have the same set of words, nor the same order (Chitra et al., 2011). The terms can appear in any order, as they are obtained after preprocessing the topic blocks. The frequency of each term is added to the frequency of every term in the other vector, and the sum is then multiplied by the relevance between the two words as per the WordNet tree structure. Closer term pairs, which need not be similar to each other, get a higher score as per Equation (1). For example, "college" and "education" are dissimilar but relevant words and would get a high score as per Equation (1).
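The computation described above can be sketched as follows. This is an assumed form that follows the prose description and should not be read as Equation 1 itself: for every pair of terms the two normalized weights are added and multiplied by the pair's relevance, and the total is divided by the number of terms in both blocks so that the score stays in the range 0 to 1 when each block's weights sum to 1. The `relevance` lookup and its table values are hypothetical stand-ins for a WordNet distance.

```python
import math
from itertools import product


def relevance(term_a, term_b, table):
    """Stand-in for a WordNet based relevance in [0, 1]; identical terms score 1."""
    if term_a == term_b:
        return 1.0
    return table.get(frozenset((term_a, term_b)), 0.0)


def block_similarity(weights_1, weights_2, table):
    """Assumed similarity: pairwise (w1 + w2) * relevance, normalized to [0, 1].

    Expects each weight dictionary to sum to 1.
    """
    if not weights_1 or not weights_2:
        return 0.0
    total = sum(
        (w1 + w2) * relevance(t1, t2, table)
        for (t1, w1), (t2, w2) in product(weights_1.items(), weights_2.items())
    )
    # With all relevances equal to 1, total equals len(weights_1) + len(weights_2).
    return total / (len(weights_1) + len(weights_2))


# Invented relevance values for illustration only.
table = {
    frozenset(("college", "education")): 0.8,
    frozenset(("college", "university")): 0.9,
    frozenset(("campus", "university")): 0.7,
}
weights_1 = {"college": 0.75, "campus": 0.25}
weights_2 = {"education": 0.5, "university": 0.5}
score = block_similarity(weights_1, weights_2, table)
```

Although the two blocks share no word, the score is well above zero because related term pairs such as "college" and "education" contribute through the relevance factor.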

Materializing the Segments
Information about the identified segments is saved in the relational database (Wang et al., 2008). The structures of the segment table and the keyword index table are shown in Tables 1 and 2.
All leaf nodes are numbered from left to right starting from 1 according to the DOM tree traversal technique. After segmentation, details of the segments are stored in the Segment Table (ST) (Table 1), which contains the DOCID, the SEGID, the NODEIDs of all nodes constituting that segment and the KEYWORDS present in that segment. The Keyword Index Table (KWIT) (Table 2) contains an entry for each concept present in the document repository and the DOCIDs of all documents containing information about that concept keyword. The URL Table (Table 3) contains the mapping from DOCID to the actual location of the document.
At query time, the DOCIDs of relevant documents are identified from the Keyword Index Table based on the query keywords. Then the corresponding URLs of these DOCIDs are identified and presented to the user as the search result. The user then selects a URL to view the summary. From the ST, the segments relevant to the query keywords are identified, and the summarization algorithm is applied only to these identified segments.
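The three tables and the query-time lookup can be sketched with an in-memory SQLite database. The column layout follows the fields named above, but the table names, the comma-separated encoding of NODEIDs and KEYWORDS, the sample rows and the example URL are all assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create the three tables in memory and load a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE segment_table (docid INTEGER, segid INTEGER,
                                    nodeids TEXT, keywords TEXT);
        CREATE TABLE keyword_index (keyword TEXT, docid INTEGER);
        CREATE TABLE url_table (docid INTEGER PRIMARY KEY, url TEXT);
    """)
    conn.executemany("INSERT INTO segment_table VALUES (?, ?, ?, ?)", [
        (1, 1, "1,2", "admission,course"),
        (1, 2, "3,4,5", "placement,training"),
    ])
    conn.executemany("INSERT INTO keyword_index VALUES (?, ?)", [
        ("admission", 1), ("placement", 1), ("training", 1),
    ])
    conn.execute("INSERT INTO url_table VALUES (?, ?)",
                 (1, "http://example.org/college.html"))
    return conn


def matching_docs(conn, keyword):
    """Step 1 at query time: keyword -> (DOCID, URL) pairs shown to the user."""
    return conn.execute(
        "SELECT u.docid, u.url FROM keyword_index k "
        "JOIN url_table u ON u.docid = k.docid WHERE k.keyword = ?",
        (keyword,),
    ).fetchall()


def relevant_segments(conn, docid, query_keywords):
    """Step 2: segments of the chosen document that contain a query keyword."""
    wanted = {k.lower() for k in query_keywords}
    rows = conn.execute(
        "SELECT segid, nodeids, keywords FROM segment_table "
        "WHERE docid = ? ORDER BY segid", (docid,),
    ).fetchall()
    return [(segid, nodeids) for segid, nodeids, keywords in rows
            if wanted & set(keywords.split(","))]


conn = build_demo_db()
docs = matching_docs(conn, "placement")
segments = relevant_segments(conn, 1, ["placement"])
```

Only the node IDs returned by `relevant_segments` would then be passed to the summarizer. A normalized design would store one keyword per row instead of a comma-separated string.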

Experimentation and Evaluation
Experimentation of the proposed system was conducted using the WEBKB (www.cs.cmu.edu/~webkb/) dataset and also real time datasets. WEBKB (Mohamed and Rajasekaran, 2006) contains web pages downloaded from four universities, with information related to faculty, students, courses offered, activities and achievements. The real time corpus was built by downloading the top 10 web pages from the Google search engine for the keywords "Engineering education", "Efficiency of optimization algorithms" and "Global warming". Three diversified domains were chosen to show that the proposed system is domain independent and unsupervised. Table 4 shows the identified keywords for this real time corpus and the DOCIDs of the documents containing the keywords. All these pages were cleaned by removing irrelevant HTML tags (like the meta tag, which does not contribute to content mining) and the segmentation algorithm was applied. Segment details were saved in the relational database. Document di contains n segments, di = {s1, s2, …, sn}. Only m segments are selected for summarization. This improves the processing efficiency of the information server by (n-m)/n, since only m of the n segments are processed at query time.
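The saving (n-m)/n can be checked with a small helper. The function name and the sample values are hypothetical and are not taken from the experiments.

```python
def processing_saving(n_segments, m_selected):
    """Fraction of segments skipped at query time: (n - m) / n."""
    if n_segments <= 0 or not 0 <= m_selected <= n_segments:
        raise ValueError("require n > 0 and 0 <= m <= n")
    return (n_segments - m_selected) / n_segments


# A document with 10 segments of which 3 match the query.
saving = processing_saving(10, 3)
```

The saving grows as the number of query relevant segments becomes small relative to the size of the document.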
Our data collection process was carried out using the Google search engine. Of the five major or "core" search engines, Google has held a substantial lead over its rivals for more than five years (Pasupathi et al., 2011) (according to comScore research and Search Marketing Standard). Ebiz MBA Knowledge database statistics state that more than 9 billion monthly visitors use Google for information search on the WWW. For testing purposes we took 15 web documents from the real time corpus, on which the segmentation algorithm was applied. Segment details (for only 3 documents) were stored in the Segment Table as shown in Table 5.

RESULTS
In the above mentioned example, document d1 contains 3 segments and 8 nodes. If the query string given is "engineering college + training and Placement", then only segments s2 and s4 of document d1 need to be processed for summarization. This improves the processing efficiency by 2/4 at the segment level and 6/9 at the node level. Starting from this set of leaf nodes, the specific branch of the DOM tree can be considered for generating the summary. The efficiency improves for larger web documents, as the share of segments that must be processed becomes remarkably smaller.

DISCUSSION
Table 6 shows that segmentation helps to improve the summarization efficiency considerably, which is also depicted in the following graph. Instead of processing the complete document, only a part of it, that is, the few segments relevant to the query, is processed. Figure 3 indicates that segmentation as pre-processing improves the summarization efficiency by reducing the size of the text units to be processed for generating the summary. As the number of nodes in the document increases, the efficiency of summarization using segmentation increases and the time complexity and processing overhead of the server decrease drastically. Summarization without segments needs to process all nodes of the document, which in turn increases the time complexity of the process. Segmentation as pre-processing for summarization is an innovative idea which has not yet been applied in any summarization system.

CONCLUSION
Query based summarization focuses on extracting query relevant pieces of information from the web page at query time. Information servers need to process the entire content of the selected web pages to compose the summary page. This study proposed an innovative idea of identifying relevant sentences from the web page as segments and materializing the segment information in a relational database during the pre-processing stage, i.e., offline. Web documents were segmented based on frequent term sets and the WordNet distance between term sets. Query relevant segments of the user selected URL from the search result were identified and considered for the summary generation process. This reduces the load on information servers to produce on the fly summaries at query time. A query relevant summary is really a boon to information seekers who need to understand the content of the web page quickly.
Pre-processed segments help in reducing the processing overload of the information servers by reducing the scope of summarization to a few relevant segments instead of processing the entire document at query time. In this scenario, the size of the document does not have much impact on the summarization process.