Automatic Multi Document Summarization Approaches

Problem statement: Text summaries range from indicative summaries, which identify the topics of a document, to informative summaries, which give a concise description of the original document and convey what its whole content is about. Approach: Single document summarization seems to capture both kinds of information well, but this has not been the case for multi document summarization, where the overall quality of informative summaries is often lacking. Most existing methods tend to focus on sentence scoring and give less consideration to the contextual information content in multiple documents. Results: This study presents a survey of multi document summarization approaches. We focus on four well known approaches, namely the feature based method, cluster based method, graph based method and knowledge based method, and describe the general ideas behind them. Conclusion: Besides the general ideas and concepts, we discuss the benefits and limitations of these methods. With the aim of enhancing multi document summarization, specifically for news documents, we outline a novel approach to be developed in the future that takes into account the generic components of a news story in order to generate a better summary.


INTRODUCTION
The need for automatic text summarization has increased with the proliferation of information on the Internet. With the availability and speed of the Internet, searching online documents has been brought to users' fingertips. However, it is not easy for users to manually summarize large collections of online documents. For example, a user who searches for information about the earthquake that occurred in Sendai, Japan, will probably receive an enormous number of articles related to that event and would benefit from a system that could summarize them. The goal of automatic text summarization is to condense the source text into a shorter version while preserving its information content and overall meaning.
The objective and approach of document summarization determine the kind of summary that is generated. For example, a summary can be indicative of what a particular subject is about (closely related to a user query), or informative about the whole content of the document. In addition, the approach to text summarization can be either extractive or abstractive (Radev et al., 2002). In extractive summarization, important sentences are identified and directly extracted from the original document, i.e., the final summary consists of original sentences. In abstractive summarization (Ganesan et al., 2010), on the other hand, the sentences selected from the original document are further processed and restructured before being concatenated into the final summary. This process usually involves deep natural language analysis and sentence compression.
Once the type of summary is understood, i.e., indicative, informative, extractive or abstractive, it can be applied to either a single document or multiple documents. This study focuses mainly on informative, extractive multi document text summarization. The characteristic that distinguishes multi document summarization from single document summarization is that it involves multiple sources of information that overlap and supplement each other, and are occasionally contradictory. The key tasks are therefore not only identifying and coping with redundancy across documents, but also ensuring that the final summary is both coherent and complete.
The contributions of this study can be summarized as follows: We discuss four notable approaches to multi document summarization and present them together with related research from the literature. The benefits and limitations of these approaches are also discussed. At the end of this study, we outline a novel approach for news document summarization, to be developed in the future. We aim to incorporate the generic components of a news document in order to generate a better summary.
The rest of the study is organized as follows: First we present the survey of four multi document summarization approaches, namely the feature based method, cluster based method, graph based method and knowledge based method. Then we outline the proposed multi document summarization method, i.e., the component based method. Finally we end with the conclusion.

Multi document summarization approaches:
A number of research studies have addressed multi document summarization in academia (Erkan and Radev, 2004a, Wan and Yang, 2008, Haribagiu and Lacatusu, 2010) and illustrated different types of approaches and available systems for multi document summarization. In this study we focus on four well known approaches to multi document summarization. Our discussion follows the same pattern for each method: we first discuss its main idea, then look at related studies from the literature, and finally comment on the benefits and limitations of the method.
Feature based method: Extractive summarization involves identifying the most relevant sentences in the text and putting them together to create a concise summary. In the process of identifying important sentences, the features that influence the relevance of sentences are determined. Here we list some of the common features that have been considered for sentence selection.
Word frequency: The idea of using word frequency is that important words appear many times in the document. The most widely used measure for weighting word frequency is tf-idf.
Title/headline word: The occurrence of words from the document title in a sentence indicates that the sentence is highly relevant to the document.
Sentence location: Important information in a document is often placed by writers at the beginning of the article. Thus the beginning sentences are assumed to contain the most important content.
Sentence length: Very short sentences are usually not included in the summary as they convey less information. Very long sentences are also not suitable for a summary.
Cue word: Certain words in a sentence indicate that the sentence carries an important message in the document (e.g., "significantly", "in conclusion").
Proper noun: Sentences containing a proper noun that represents a unique entity, such as the name of a person, organization or place, are considered important to the document.
Let us show a simple example of scoring a sentence S. Suppose that we select three features, namely title word, sentence length and sentence location. The score for each of these features, as in (Suanmali et al., 2009), is calculated as follows:

$$T = \frac{\text{No. of title words in } S}{\text{No. of words in title}}$$

$$L = \frac{\text{No. of words in } S}{\text{No. of words in longest sentence}}$$

$$P = \begin{cases} 5/5 & \text{for the 1st sentence} \\ 4/5 & \text{for the 2nd sentence} \\ 3/5 & \text{for the 3rd sentence} \\ 2/5 & \text{for the 4th sentence} \\ 1/5 & \text{for the 5th sentence} \\ 0/5 & \text{for other sentences} \end{cases}$$

After computing each feature score, the total score of a sentence is given by Eq. 1:

$$\text{Score}_{S_i} = w_1 T_i + w_2 L_i + w_3 P_i \tag{1}$$

where $\text{Score}_{S_i}$ is the total score of sentence $S_i$. The terms $T_i$, $L_i$ and $P_i$ are the feature scores of sentence $S_i$ based on the title words it contains, its length and its position respectively, and $w_1$, $w_2$ and $w_3$ are the weights for the linear combination of the three features.

Figure 1 depicts the generalized architecture of a feature based summarizer. The text feature scores are combined for sentence scoring, as shown in Eq. 1. However, not all text features are equally important: some deserve more weight and some less. Emphasis should therefore be placed on treating the text features according to their importance, which can be achieved with a weight learning method. Many researchers have used various weight learning methods in their studies. Binwahlan et al. (2009) introduced a novel text summarization model based on a swarm intelligence technique known as Particle Swarm Optimization (PSO). The feature scores were adjusted using the weights obtained from training with PSO. The training set consisted of pairs of documents and human reference summaries. They implemented the sentence scoring using Eq. 2:

$$\text{Score}(S) = \sum_{i=1}^{5} w_i \times \text{Score\_f}_i(S) \tag{2}$$

where $\text{Score}(S)$ is the score of the sentence $S$, $w_i$ is the weight of feature $i$ produced by PSO, $i = 1, \dots, 5$ indicates that 5 text features were used and $\text{Score\_f}_i(S)$ is the score of feature $i$.
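To make the three-feature scoring concrete, the following is a minimal Python sketch of the linear combination described above. The tokenizer, the function names and the default weights of 1.0 are illustrative assumptions and are not taken from Suanmali et al. (2009); in particular, title words are counted as distinct words, which is only one possible reading of the title feature.

```python
import string

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace, strip punctuation."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]

def feature_scores(sentences, title):
    """Return a (T, L, P) tuple for every sentence, in document order."""
    title_words = set(tokenize(title))
    tokenized = [tokenize(s) for s in sentences]
    longest = max((len(words) for words in tokenized), default=0)
    scores = []
    for index, words in enumerate(tokenized):
        # T: share of the distinct title words that occur in the sentence.
        t = len(title_words & set(words)) / len(title_words) if title_words else 0.0
        # L: sentence length relative to the longest sentence.
        l = len(words) / longest if longest else 0.0
        # P: 5/5 for the 1st sentence down to 1/5 for the 5th, 0 afterwards.
        p = max(0, 5 - index) / 5
        scores.append((t, l, p))
    return scores

def sentence_scores(sentences, title, weights=(1.0, 1.0, 1.0)):
    """Linear combination w1*T + w2*L + w3*P for every sentence."""
    w1, w2, w3 = weights
    return [w1 * t + w2 * l + w3 * p for t, l, p in feature_scores(sentences, title)]

# Hypothetical toy input, used only to show how the functions are called.
example_sentences = [
    "Earthquake hits Sendai.",
    "Rescue teams arrive.",
    "It rained.",
]
for sentence, score in zip(example_sentences,
                           sentence_scores(example_sentences, "Sendai earthquake")):
    print(f"{score:.2f}  {sentence}")
```

A weight learning method such as PSO or a genetic algorithm would replace the fixed `weights` tuple with values tuned against human reference summaries; that training step is not shown here.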
The use of PSO for optimization has also proven to be robust in other domains (Nacy et al., 2009, Balaji and Kamaraj, 2011). Another weight learning approach was described by Bossard and Rodrigues (2011), who approximated the best weight combination using a genetic algorithm for their multi document summarizer. By using the genetic algorithm, a suitable combination of feature weights can be found. Table 1 lists some of the weight learning methods applied to text summarization.

Table 1: Weight learning methods applied to text summarization
Reference | Weight learning method
(Osborne, 2002) | Conjugate gradient descent search method
(Fattah and Ren, 2009) | Mathematical Regression (MR) model
(Binwahlan et al., 2009) | Particle Swarm Optimization (PSO)
(Dehkordi et al., 2009), (Bossard and Rodrigues, 2011) and (Suanmali et al., 2011) | Genetic Algorithm (GA) model

Besides optimizing feature weights, the impact of combining different features has been investigated by Hariharan (2010) for multi document summarization. In his study, the author showed that term frequency weight combined with position and node weight features yields significantly better results.
The feature based method has been widely used by researchers because of its simple and direct approach to sentence selection for summarization. Cue words, title words and location in combination are the features most commonly relied upon (Gupta and Lehal, 2010). An important issue in the context of multi document summarization is that important or relevant information is usually spread across documents, and feature based methods often fail to handle this problem. In addition, the feature based method is knowledge poor in terms of capturing the contextual information content that exists in the sentences and across multiple documents. These limitations are due to the sentence scoring process, which depends solely on a flat feature representation of a sentence while omitting cross-document concepts and the varying context in different documents.

Cluster based method:
The idea of clustering is to group similar objects into classes. For multi document summarization, the objects are sentences and the classes are the clusters to which the sentences belong. Because document sets typically address several subjects or topics, some researchers incorporate clustering into their studies. Using the concept of similarity, sentences which are highly similar to each other are grouped into one cluster, thus generating a number of clusters. The most common technique for measuring similarity between a pair of sentences is the cosine similarity measure, where sentences are represented as weighted tf-idf vectors. Once sentences are clustered, sentence selection is performed by selecting sentences from each cluster, based on their closeness to the top ranking tf-idf terms in that cluster. The selected sentences are then put together to form the final summary (Fig. 2). Typically, clustering algorithms can be categorized as agglomerative or partitional (Jain et al., 1999). In agglomerative clustering (also known as the "bottom-up" approach), each sentence is initially considered a separate cluster on its own. These individual clusters are then merged into successively larger clusters. This iterative process ends when some stopping criterion is reached.
In the partitional clustering approach, by contrast, all sentences are initially grouped into one big cluster. Smaller clusters are then generated iteratively by dividing the largest cluster into several sub-clusters, so that each sub-cluster contains sentences with higher similarity. A well known partitional clustering algorithm is the K-Means algorithm. Radev et al. (2004) pioneered the use of cluster centroids in their multi-document summarizer, MEAD. Centroids are the top ranking tf-idf terms that represent the cluster. These cluster centroids are then used to identify the sentences in each cluster that are most similar to the centroid. Thus, the summarizer selects the sentences which are most relevant to each cluster.
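The selection step of a cluster based summarizer can be sketched in a few lines of Python. The sketch below assumes that the sentences have already been tokenized and assigned to clusters (for example by K-Means), represents each sentence as a tf-idf vector, and picks from every cluster the sentence with the highest cosine similarity to the cluster centroid. It is a simplification and not the MEAD implementation: the centroid here is the plain mean of the member vectors rather than a thresholded list of top ranking terms, and the smoothed idf formula and all function names are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_sentences):
    """Build one sparse tf-idf vector (dict term -> weight) per sentence."""
    n = len(tokenized_sentences)
    document_frequency = Counter()
    for words in tokenized_sentences:
        document_frequency.update(set(words))
    vectors = []
    for words in tokenized_sentences:
        term_frequency = Counter(words)
        # Smoothed idf (an assumption) so that very common terms keep a small weight.
        vectors.append({
            term: count * (math.log((1 + n) / (1 + document_frequency[term])) + 1.0)
            for term, count in term_frequency.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Mean vector of a cluster (a simplification of the centroid idea)."""
    total = Counter()
    for vector in vectors:
        for term, weight in vector.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def pick_representatives(vectors, clusters):
    """clusters is a list of lists of sentence indices; return one index per cluster."""
    chosen = []
    for members in clusters:
        center = centroid([vectors[i] for i in members])
        chosen.append(max(members, key=lambda i: cosine(vectors[i], center)))
    return chosen

# Hypothetical toy data: five tokenized sentences in two given clusters.
tokenized = [
    ["quake", "hits", "sendai"],
    ["quake", "strikes", "sendai"],
    ["quake", "shakes", "sendai", "coast"],
    ["markets", "fall"],
    ["markets", "fall", "sharply"],
]
vectors = tfidf_vectors(tokenized)
print(pick_representatives(vectors, [[0, 1, 2], [3, 4]]))
```

Selecting more than one sentence per cluster, or stopping at a target summary length, would be a straightforward extension of `pick_representatives`.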
Building on the benefits of the clustering approach, efforts have been made to make the overall process of summarizing multiple documents more effective. One issue worth mentioning here is determining the optimal number of clusters, for which Xia et al. (2011) adopted co-clustering theory. They determine the weights of sentences and terms based on the sentence-term co-occurrence matrix, which is designed to represent diversity and redundancy within multiple articles. Finally, the top-weighted sentence in every cluster is picked to form the summary, until a user-preferred summary length is met.
Some researchers employ a clustering-based hybrid strategy (Yu et al., 2006) to combine local and global search for sentence selection. This approach does not depend only on similarity to the cluster for sentence selection but also considers the overall document content similarity. Attention has also been given to strengthening cluster diversity. To achieve this, Aliguliyev (2010) used a PSO algorithm with an added mutation operation adopted from genetic algorithms to optimize intra-cluster similarity and inter-cluster dissimilarity.
Cluster based methods have been successful in representing diversity and reducing redundancy within multiple articles. Although these can be considered advantages of clustering methods, a multi document summary cannot be meaningful enough if the relevance of a sentence is judged merely on the basis of the clusters. This is because in cluster based methods, sentences are eventually ranked according to their similarity with the cluster centroid, which simply represents frequently occurring terms. Thus, this method is also considered knowledge poor in terms of its inability to capture the contextual information content that exists in the sentences.

Graph based method:
The fundamental idea of graph representation is the connection or linking between objects. These connections exist based on an underlying relation. In the case of text documents, the underlying relation is usually the similarity between objects, in this case sentences.
Generally, a graph can be denoted as G = (V, E), where V is the set of vertices (nodes) and E is the set of edges between vertices. In the context of text documents, a vertex represents a sentence and an edge carries the weight between two sentences. Documents can therefore be represented as a graph in which each sentence is a vertex and the weight of the edge between two vertices corresponds to the similarity between the two sentences. In most of the literature on graph based approaches, the most widely used similarity measure is the cosine similarity measure. An edge exists if the similarity weight is above some predefined threshold. Figure 3 shows an example graph for multiple documents (Erkan and Radev, 2004b), where each node represents a sentence. Once the graph is constructed for a set of documents, important sentences are identified. This follows the idea that a sentence is considered important if it is strongly connected to many other sentences (Erkan and Radev, 2004b).
This approach differs from the cluster based approach, where sentences are ranked based on their closeness to the cluster centroid. Two well known graph based ranking algorithms are the HITS algorithm (Kleinberg, 1999) and Google's PageRank (Brin and Page, 1998). Both methods have traditionally been used in Web-link analysis and social networks. LexRank (Erkan and Radev, 2004b) and TextRank (Mihalcea and Tarau, 2004) are two successful graph-based ranking systems that implement these algorithms.
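As an illustration of graph based ranking, the following Python sketch builds an unweighted sentence graph from a precomputed similarity matrix using a threshold, and then ranks the sentences with a PageRank-style power iteration. It is a simplified sketch in the spirit of LexRank and TextRank rather than a reproduction of either system; the threshold, the damping factor of 0.85, the fixed iteration count and the function names are all assumptions.

```python
def build_graph(similarity, threshold=0.1):
    """Adjacency matrix with an edge wherever similarity exceeds the threshold."""
    n = len(similarity)
    return [
        [1.0 if i != j and similarity[i][j] > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]

def pagerank(adjacency, damping=0.85, iterations=50):
    """PageRank-style scores computed by power iteration."""
    n = len(adjacency)
    rank = [1.0 / n] * n
    out_degree = [sum(row) for row in adjacency]
    for _ in range(iterations):
        # Rank held by isolated sentences is spread evenly over all sentences.
        dangling = sum(rank[i] for i in range(n) if out_degree[i] == 0)
        new_rank = []
        for j in range(n):
            incoming = sum(
                rank[i] * adjacency[i][j] / out_degree[i]
                for i in range(n)
                if out_degree[i] > 0
            )
            new_rank.append((1 - damping) / n + damping * (incoming + dangling / n))
        rank = new_rank
    return rank

# Hypothetical similarity matrix: sentence 0 is similar to all other sentences,
# while sentences 1-3 are not similar to each other.
similarity = [
    [1.0, 0.5, 0.5, 0.5],
    [0.5, 1.0, 0.0, 0.0],
    [0.5, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 1.0],
]
ranks = pagerank(build_graph(similarity))
best = max(range(len(ranks)), key=lambda i: ranks[i])
print(best, [round(r, 3) for r in ranks])
```

In this toy graph the sentence connected to all others receives the highest score, which matches the intuition that a sentence is important if it is strongly connected to many other sentences.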
Further studies have sought improvements by modifying the ranking algorithm. Wan and Yang (2006) assigned different weights to intra-document links and inter-document links, giving more priority to sentences with high inter-document links. Hariharan and Srinivasan (2009) approached the graph based method differently, discounting an already selected sentence by removing it from further consideration when ranking the remaining sentences.
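The discounting idea, in which a selected sentence is removed before the remaining sentences are ranked again, can be sketched as a simple greedy loop. The sketch below is only an illustration under stated assumptions: it uses degree centrality (the number of neighbours above a similarity threshold) as a stand-in ranking function, which is not necessarily the ranking used by Hariharan and Srinivasan (2009), and the function names and threshold are hypothetical.

```python
def degree_scores(similarity, active, threshold):
    """Degree centrality restricted to the sentences that are still active."""
    return {
        i: sum(1 for j in active if j != i and similarity[i][j] > threshold)
        for i in active
    }

def select_with_discounting(similarity, k, threshold=0.1):
    """Pick k sentences, re-ranking the remainder after every selection."""
    active = set(range(len(similarity)))
    chosen = []
    while active and len(chosen) < k:
        scores = degree_scores(similarity, active, threshold)
        # Ties are broken in favour of the earlier sentence.
        best = max(sorted(active), key=lambda i: scores[i])
        chosen.append(best)
        # Discount the selected sentence by removing it from the graph.
        active.remove(best)
    return chosen

# Hypothetical similarity matrix for five sentences:
# sentence 0 is linked to 1, 2 and 3; sentence 3 is also linked to 4.
similarity = [
    [1.0, 0.4, 0.4, 0.4, 0.0],
    [0.4, 1.0, 0.0, 0.0, 0.0],
    [0.4, 0.0, 1.0, 0.0, 0.0],
    [0.4, 0.0, 0.0, 1.0, 0.4],
    [0.0, 0.0, 0.0, 0.4, 1.0],
]
print(select_with_discounting(similarity, k=2))
```

Because the first selected sentence no longer contributes links, sentences that were important only through their connection to it lose rank in the next round.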
Apart from sentence level information, Wan (2008) and Wei et al. (2010) devised document-sensitive graph models to explore the impact of documents on graph-based summarization, incorporating both document-level information and the sentence-to-document relationship in the graph-based ranking process. The document-level relations are used to adjust the weights of the vertices and the strength of the edges in the graph.
Graph based methods have been received positively by the multi document summarization research community, as they are able to identify 'prestigious' sentences across documents. The resulting graph is also able to capture distinct topics from unconnected sub-graphs. However, since this approach depends heavily on sentence similarity to generate the graph, it treats a sentence only as a bag of words without "understanding" the text. As a result, the final summary may not be complete enough, particularly for informative summary generation. We discuss this issue further in our proposed component based approach.
Knowledge based method: Most documents or articles have content related to a particular topic or event. These topics or events generally belong to a particular domain, and each domain normally has its own common knowledge structure. Researchers have therefore made efforts to utilize background knowledge (i.e., an ontology) to improve summarization results. In fact, many other applications have tailored their models to be ontology-driven (Shareha et al., 2009, Nasir and Noor, 2011).
An ontology, equipped with concise concepts and rich domain-related information, can capture hidden semantic information. With the support of the ontology, pieces of information can be related to each other through the shared and common understanding of a domain (Khelif et al., 2007). Li et al. (2010) developed the Ontology-enriched Multi-Document Summarization (OMS) system to generate a query-relevant summary from a collection of documents. OMS first links the sentences from the documents onto a domain-related ontology, then maps the given query to a specific node in the ontology and finally extracts the summary from the sentences in the sub-tree rooted at the corresponding query node. Another example is the utilization of knowledge from UMLS (www.nlm.nih.gov/research/umls/), a medical ontology, to summarize biomedical documents (Verma et al., 2007). Here, the authors apply the medical ontology as a dictionary of valid concepts and choose sentences that contain only those words corresponding to concepts in the ontology. Kogilavani and Balasubramanie (2009) also utilized UMLS, but they instead used the ontology to expand users' natural language queries with synonyms and semantically related concepts.
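The three OMS steps described above (linking sentences to ontology nodes, mapping the query to a node, and extracting sentences from the sub-tree rooted at that node) can be illustrated with a small Python sketch. Everything in it is hypothetical: the toy ontology, the naive word-match linking and the function names are assumptions made for illustration and do not reproduce the system of Li et al. (2010), which also ranks the candidate sentences.

```python
import string

# Hypothetical toy ontology: each concept maps to its child concepts.
ONTOLOGY = {
    "disaster": ["earthquake", "flood"],
    "earthquake": ["tsunami", "aftershock"],
    "flood": [],
    "tsunami": [],
    "aftershock": [],
}

def subtree(ontology, root):
    """Return the root concept together with all of its descendants."""
    nodes, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in nodes:
            nodes.add(node)
            stack.extend(ontology.get(node, []))
    return nodes

def link_sentences(sentences, ontology):
    """Link each concept to the sentences that mention it (naive word match)."""
    concepts = set(ontology) | {c for children in ontology.values() for c in children}
    links = {concept: [] for concept in concepts}
    for index, sentence in enumerate(sentences):
        words = {w.strip(string.punctuation).lower() for w in sentence.split()}
        for concept in concepts & words:
            links[concept].append(index)
    return links

def query_summary(sentences, ontology, query_concept):
    """Collect the sentences linked to the sub-tree rooted at the query concept."""
    links = link_sentences(sentences, ontology)
    selected = set()
    for concept in subtree(ontology, query_concept):
        selected.update(links.get(concept, []))
    return [sentences[i] for i in sorted(selected)]

# Hypothetical input sentences.
sentences = [
    "A tsunami followed the earthquake.",
    "The flood damaged roads.",
    "An aftershock was felt overnight.",
    "Markets fell.",
]
print(query_summary(sentences, ONTOLOGY, "earthquake"))
```

Querying a node higher in the ontology, such as "disaster", widens the sub-tree and therefore the set of candidate sentences.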
In an earlier related study, Wu and Liu (2003) manually constructed a domain specific ontology for business news articles. They determine the main subtopics of the articles of interest by comparing the words in the sentences to the ontology. The sentences which are closest to the subtopics are then selected. A similar idea, but with additional ontology features, was proposed by Hennig et al. (2008) for sentence scoring. The features they used were tag overlap, subtree depth and subtree count.
An ontology can be useful for domain specific documents where key concepts pertaining to the domain can be identified. In most ontology based text summarization, the ontology serves to bootstrap the process of sentence selection by picking sentences which contain predefined ontology concepts. However, the ontology is mainly used as a similarity measure and as a dictionary of valid concepts. One of the major concerns in ontology based summarization is the availability of the ontology itself: the approach is only feasible when an ontology is available, and for this reason the ontology is in most cases manually constructed by experts. Unlike the previous three methods, which can be applied to any domain, current ontology based methods depend on a particular domain where the ontology design requires input from domain experts.
Proposed component based approach: If we look back at previous approaches described in this study, we can observe that those approaches were mainly based on flat text feature representations without any attempts to "understand" the text. Moreover, until now, most text summarization models incorporate only bag of words as text representation and do not include much contextual information. We believe that providing comprehensive contextual information coverage would be ideal for summary creation.
In the context of news documents (which will be our research focus), different news sources reporting on a particular event tend to contain common components that make up the main story of the news. The common components of a news article consist of WHO, WHEN, WHERE, WHAT and HOW. For example, news articles on natural disaster events often contain components such as information about persons, the location, a description of the disaster, the damage to people and property, the relief efforts, the organizations involved, the disruption of services and so on. These components and their information content are what readers usually look for while reading a news story. Moreover, in the multi document setting, these components usually overlap each other and often reappear in different parts of a document as well as in documents from other news sources. We aim to capture the content of these components to better represent the coverage of a news story across multiple documents.
Our proposed approach will incorporate ontology learning as part of our effort to learn the relationships linking certain pairs of component contents. This is relevant to the understanding of the text. In contrast, existing ontology based methods discussed in the literature merely use the ontology to identify important concepts in documents. For example, Wu and Liu (2003) perform term based mapping of sentences to the ontology to find the most informative concepts in a document, while Hennig et al. (2008) classify sentences to nodes of the ontology to identify the main topics in a document. Their efforts were mainly focused on matching the ontology concepts which appear in the text, so that frequently occurring concepts can be labeled as important topics. As opposed to their approach, we do not use the ontology to identify important concepts or topics; instead we use the idea of ontology learning to capture the relations that exist among predefined components in the news documents. Ontology learning can also be used to generate terms which are relevant to the underlying domain, in order to capture only the relations among the generated terms.
This way of utilizing knowledge of the components' content will benefit the summarization process in two ways. First, breaking documents down into their components and capturing the links between them would produce broad information coverage for the summary, thus generating an informative summary. Second, this approach gives an "intuitive thought" about the kind of information that we know is essential to include in the summary. The latter is close to the way humans prepare a news related summary.

CONCLUSION
This study provides a general survey of multi document summarization approaches. It has been written so that researchers who are new to the area of text summarization can grasp the ideas behind the various multi document summarization approaches. Four types of approaches have been discussed, namely the feature based method, cluster based method, graph based method and knowledge based method. Each of these methods has its own advantages for multi document summarization. At the same time, there are some issues or limitations pertaining to these methods. For future improvement, we propose a novel approach that takes into account the generic components of a news story in order to generate a better summary, one that is well suited to informative summary generation. We believe that the proposed component based approach can alleviate some of the aforementioned limitations.