Enhancement of Search Results Using Dynamic Document Seed Reranking Algorithm



INTRODUCTION
Searching the World Wide Web (WWW) is the process of discovering pages that are relevant to a given query. The most commonly used tool for searching the web is a search engine. Determining the relevance ranking of search results for a given query is still a challenge. The discomfort faced by users of search engines is twofold. First, users feel they are unable to specify clearly, in the form of a query, what they need to search for. Second, even when a query is given, the results returned by a search engine are not well ranked. The objective of this paper is to address these two problems. When a user initiates a search, he is often not clear himself about what exactly he needs from it. The user refines his search query based on the initial search results. The user-given query is thus the starting point of the process of searching the web. Jon M. Kleinberg [1] has classified queries into the following types.
Specific queries. Example: "Database support by Java using JDBC"
Broad topic queries. Example: "Find information about Database connectivity"
Similar-page queries. Example: "Find pages 'similar' to java.sun.com"
The difficulty in handling specific queries is centered, roughly, on what could be called the scarcity problem: there are very few pages that contain the required information, and it is often difficult to determine the identity of these pages. For broad topic queries, on the other hand, one expects to find many thousands of relevant pages on the web, which may be generated by variants of term matching. The fundamental difficulty here lies in what could be termed the abundance problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest. For the third type of query, the challenge is to extract the features of a given page and then initiate the search. The user may feel that a particular web document is close to what he is searching for and may look for documents similar to it. This can be achieved by document reranking based on the features of the selected document.
In our work we have developed an algorithm in which the user can dynamically choose a particular web document and present it as a seed. The key features of this seed document are heuristically extracted to create an index for that document. Similarly, a document index is created for each of the search result documents. Based on the similarity measure between the seed document and each of the other documents, the algorithm reranks the remaining search results so as to minimize the similarity mean absolute distance between them. The most similar documents are ranked higher than the dissimilar ones.
The objective of a ranking function is to match the documents in a text collection against a query and order them in descending order of their predicted relevance. The similarity between a query and a document can be calculated by the widely used cosine measure given by Salton [2]. Documents are then ordered by decreasing values of this measure. In the vector space model, the term weights that make up the query and document vectors are commonly derived from statistical properties, or statistical features, of the terms. For example, one of the most widely used statistical features in term weighting is term frequency (TF), which measures how many times the term appears in the document or query [2]. Another commonly used feature is the inverse document frequency (IDF), which is calculated as log(N/DF), where N is the total number of documents in the text collection and DF is the document frequency, i.e. the number of documents in the collection in which the term appears.
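The TF, IDF and cosine measures described above can be illustrated with a minimal sketch. It assumes a plain TF x IDF weighting and a natural logarithm; the function names and the toy collection are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def inverse_document_frequency(term, documents):
    """IDF = log(N / DF), where DF is the number of documents containing the term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n / df) if df else 0.0

def tf_idf_vector(tokens, vocabulary, documents):
    """Weight each vocabulary term by TF * IDF (an assumed weighting scheme)."""
    tf = Counter(tokens)
    return [tf[t] * inverse_document_frequency(t, documents) for t in vocabulary]

def cosine_similarity(x, y):
    """Cosine of the angle between two weight vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Toy collection of already tokenized documents (illustrative data only).
documents = [
    ["java", "jdbc", "database"],
    ["database", "connectivity"],
    ["java", "tutorial"],
]
query = ["java", "database"]
vocabulary = sorted({t for doc in documents for t in doc})

query_vec = tf_idf_vector(query, vocabulary, documents)
scores = [cosine_similarity(query_vec, tf_idf_vector(doc, vocabulary, documents))
          for doc in documents]
# Documents ordered by decreasing cosine value, as in the ranking described above.
ranking = sorted(range(len(documents)), key=lambda d: scores[d], reverse=True)
```

In this toy collection the first document contains both query terms, so it receives the highest cosine score and is ranked first.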
Rorvig [3] studied the impact of ranking / similarity functions on visual information retrieval (IR). In visual IR, not only the similarity between query and document, but also the relationships among documents need to be visualized. Rorvig used multidimensional scaling to visualize document similarities using five different similarity functions. A key finding in all of these studies is that a single ranking function cannot work well for all contexts.
Many methods have been proposed to rerank documents. Lee et al [4] proposed a document reranking method based on document clusters. They build a hierarchical cluster structure for the whole document set and use the structure to rerank the documents. Balinski [5] proposed a document reranking method that uses the distance between documents to modify the initial relevance weights. Crouch et al [6] used the unstemmed words in the queries to reorder the documents. Xu et al [7] made use of global and local information to perform local context analysis and then used the information acquired to rerank documents. A manually built thesaurus has also been used to rerank retrieved documents [8]: each term in a query topic is expanded with a group of terms from the thesaurus. Bear et al [9] used manually crafted grammars for topics to reorder documents by matching grammar rules against segments of an article. Kamps [10] proposed a reranking method based on assigned, controlled vocabularies. Yang et al [11] used terms that occurred in both the query and the top N (N<=30) retrieved documents to rerank documents.
For a given query q, we first obtain a set of documents retrieved and ranked by an external search engine. We propose a document reranking algorithm in which the user selects a document as the seed for the reranking procedure. The similarity weightage is calculated from the query key term weightage, the document term frequency and the document distance, as in the vector model [2]. However, these importance values are not calculated globally over all search result documents but only over the subset whose members are relevant to the given query q. Consequently, the implementation depends on the search engine used. We have considered the Google web search engine for the purpose of our research.
Our algorithm initially accepts a query from the user and extracts the key terms from the query. The top N search results are then acquired from a search engine. The dynamic reranking algorithm generates a distance matrix for the top N documents.
Once the user selects a particular document as the seed, the search results are reranked on the basis of the document distance metrics, in such a way that documents similar to the seed document appear closer to it. The objective of the algorithm is to minimize the similarity mean absolute distance between the documents.

SYSTEM ARCHITECTURE
Figure 1 depicts the system architecture. Initially the user gives a query to search for. This query is given to the search system. The stop words are removed from the query and the key terms are given to an external search engine to search the Internet. From these results the user can browse and choose the seed document.
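The query preparation step can be sketched as follows. This is a minimal illustration, assuming a small hypothetical stop-word list (the paper does not state which list it uses) and joining the key terms with '+' as the Search System module does; the function names are illustrative only.

```python
import re

# Hypothetical stop-word list; the paper does not specify the list it uses.
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "of", "in", "on", "to", "for", "by", "about",
})

def extract_key_terms(query):
    """Lowercase the query, split it into words and drop the stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_search_string(key_terms):
    """Join the key terms with '+' before handing them to the search engine."""
    return "+".join(key_terms)

key_terms = extract_key_terms("data structures and algorithms")
search_string = build_search_string(key_terms)
# search_string is "data+structures+algorithms"
```

How the resulting string is submitted depends on the external search engine and is deliberately left out of the sketch.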
The Document Index Generator generates the index vector for every web document. The key features are extracted and stored as an index vector for each document. When the user selects a seed document and requests reranking, the Dynamic Seed Reranker algorithm is initiated. The similarity metrics discussed below are applied and the documents are reranked based on their similarity to the selected seed document. The reranked results are then given to the user. The user can choose a new seed and request reranking again, or may opt to rephrase the query itself.
Search System: Keywords are extracted from the given query. The extracted keywords are converted into a string by placing a '+' symbol between them; this string is given to an external search engine (say Google) and the search is triggered. The results from the search engine are captured and the system stores the URLs of the search result documents in a database for further use by the index generator.
Document Index Generator: In this module, the URLs of the search result documents are retrieved from the database. Every web document is retrieved and detagged. We have restricted our work to web documents in HTML and text format. The task of feature extraction focuses on key term extraction. All the stop words are removed. Stemming of words is also considered. For example, 'network', 'networking' and 'networked' are considered alike. The following three parameters are calculated.
The term frequencies of the key terms are tabulated. Term frequency (TF) is the number of times a particular key term occurs in the document or query [1]. For the similarity measure we define a heuristic which states that the density of the distribution of a key term reflects the importance of that term in the document. Hence the Term Density Measure is also calculated. The Term Density Measure (TDM) is the mean distance between successive occurrences of the term. Let $x_1, x_2, x_3, \ldots, x_n$ be the occurrences of the keyword x in the document. Then the mean distance is calculated as
$$\mathrm{TDM}(x) = \frac{1}{n-1} \sum_{i=1}^{n-1} d(x_i, x_{i+1})$$
where $d(x_i, x_{i+1})$ is the number of words between successive occurrences of a particular key term. The maximum inter-term distance measure is limited to a cutoff value, which in our case is set to 8. For documents with TF = 1, TDM is set to zero. The smaller the TDM, the closer together the occurrences of the term appear in the document. Hence for every document a term index is generated from the terms whose TDM is below the cutoff value.
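A minimal sketch of the Term Density Measure follows. It assumes that the document has already been detagged, tokenized and stemmed, and it reads the cutoff as a filter on the mean distance (terms with TDM below 8 enter the index); that reading, the function names and the sample tokens are assumptions made for illustration.

```python
def term_density_measure(tokens, term):
    """Mean number of words between successive occurrences of `term`.

    Returns 0.0 when the term occurs fewer than twice (TF = 1 gives TDM = 0).
    """
    positions = [i for i, t in enumerate(tokens) if t == term]
    if len(positions) < 2:
        return 0.0
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)

def build_term_index(tokens, key_terms, cutoff=8):
    """Map each key term present in the document to (TF, TDM), keeping TDM < cutoff."""
    index = {}
    for term in key_terms:
        tf = tokens.count(term)
        if tf == 0:
            continue  # term does not occur in this document
        tdm = term_density_measure(tokens, term)
        if tdm < cutoff:
            index[term] = (tf, tdm)
    return index

# Illustrative token stream: 'network' is dense, 'sparse' is spread far apart.
tokens = (["network", "fast", "reliable", "network", "secure", "network", "sparse"]
          + ["filler"] * 10 + ["sparse"])
term_index = build_term_index(tokens, ["network", "secure", "sparse", "absent"])
```

Here 'network' has gaps of 2 and 1 words (TDM 1.5) and is indexed, 'secure' occurs once (TDM 0) and is indexed, while 'sparse' has a gap of 10 words and is left out.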
In this algorithm we have also considered key phrases. To limit the processing time, we have restricted key phrases to a length of two words. If there are n key terms, a Key Phrase Matrix (KPM) of size n x n is generated, and the frequencies of occurrence of the key phrases are computed and stored in this matrix.

KPM(i,j) = x
indicates that key terms i and j occur next to each other x times. We treat KPM(i,j) as equal to KPM(j,i). For example, the phrase 'Programming in network' is considered the same key phrase as 'network and programming'. Hence only the upper triangular part of the matrix has to be calculated. KPM(i,i) is ignored. Though the number of key terms is very high, we found that the KPM is highly sparse and does not need much memory, since we store only the nonzero elements of the matrix.
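The sparse, symmetric Key Phrase Matrix can be sketched with a dictionary that stores only nonzero upper-triangular entries. The sketch assumes the token stream has already had stop words removed and been stemmed, so that 'programming in network' and 'network and programming' both reduce to adjacent key terms; the function name and data are hypothetical.

```python
from collections import defaultdict

def key_phrase_matrix(tokens, key_terms):
    """Count adjacent key-term pairs, storing only entries (i, j) with i < j."""
    index_of = {term: i for i, term in enumerate(key_terms)}
    kpm = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left in index_of and right in index_of:
            i, j = index_of[left], index_of[right]
            if i != j:  # KPM(i,i) is ignored
                # Order-insensitive key, so KPM(i,j) and KPM(j,i) share one cell.
                kpm[(min(i, j), max(i, j))] += 1
    return dict(kpm)

key_terms = ["network", "programming", "java"]
tokens = ["network", "programming", "java", "programming", "network"]
kpm = key_phrase_matrix(tokens, key_terms)
# kpm == {(0, 1): 2, (1, 2): 2}
```

Storing the pairs in a dictionary keeps memory proportional to the number of nonzero entries rather than to n x n.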

Document Seed Reranking Metrics:
The key features of each document are indexed by the index generator. The following metrics are applied and the overall similarity of each document with respect to the given seed document is calculated. The vector of key terms of the seed document is taken as X. Y is the vector of key terms of the document (one of the remaining N-1 documents) whose similarity to the seed document is to be calculated.

Matching Coefficient (MC):
The Matching Coefficient is a simple vector based approach which counts the number of terms (dimensions) on which both vectors are nonzero. So for the vector set X of document A and the vector set Y of document B, the matching coefficient is $|X \cap Y|$. This can be seen as the vector based count of co-referent terms. The position at which a term occurs is not taken into account. Hence for any two documents A and B, the Matching Coefficient (MC) based on terms is denoted as
$$MC_t(A,B) = |X \cap Y|$$

Dice Coefficient (DC):
The Dice coefficient is a term based similarity measure (in the range 0-1), defined as twice the number of terms common to the compared entities divided by the total number of terms in both entities. For any two documents A and B, the Dice coefficient (DC) based on terms is denoted as
$$DC_t(A,B) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

Jaccard Similarity (JS):
Jaccard Similarity uses word sets from the compared instances to evaluate similarity. It penalizes a small number of shared entries (as a proportion of all nonzero entries) more than the Dice coefficient does. Each instance is represented as a term vector, and the Jaccard vector similarity function is applied to the pair of vectors.
The Jaccard similarity between two vectors X and Y is
$$JS(X,Y) = \frac{X \cdot Y}{\lVert X \rVert^2 + \lVert Y \rVert^2 - X \cdot Y}$$
where $X \cdot Y$ is the inner product of X and Y, and $\lVert X \rVert = (X \cdot X)^{1/2}$ is the Euclidean norm of X. The Jaccard similarity between two documents A and B, represented by the term vectors X and Y respectively, is denoted by $JS_t(A,B)$. For key phrase similarity, it is denoted by $JS_{kp}(A,B)$.
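The three metrics can be sketched as follows. The matching and Dice coefficients are computed on term sets; the Jaccard function uses the extended (Tanimoto) vector form with denominator |X|^2 + |Y|^2 - X.Y, which is an assumption of this sketch and reduces to the set form for binary vectors. Function names and sample data are hypothetical.

```python
def matching_coefficient(x_terms, y_terms):
    """Number of terms present in both documents."""
    return len(set(x_terms) & set(y_terms))

def dice_coefficient(x_terms, y_terms):
    """Twice the shared term count divided by the total term count of both."""
    x, y = set(x_terms), set(y_terms)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0

def jaccard_similarity(x, y):
    """Extended Jaccard (Tanimoto) similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0

doc_a = ["java", "jdbc", "database"]
doc_b = ["database", "connectivity", "java", "odbc"]

mc = matching_coefficient(doc_a, doc_b)      # 2 shared terms
dc = dice_coefficient(doc_a, doc_b)          # 2 * 2 / (3 + 4)

# Binary term vectors over the combined vocabulary of both documents.
vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = [1 if t in doc_a else 0 for t in vocabulary]
vec_b = [1 if t in doc_b else 0 for t in vocabulary]
js = jaccard_similarity(vec_a, vec_b)        # 2 / (3 + 4 - 2)
```

With these documents the Jaccard value (0.4) is lower than the Dice value (about 0.57), which matches the remark that Jaccard penalizes a small overlap more heavily.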

Document Seed Reranking Algorithm:
The similarity between two documents is measured on the basis of term similarity (TS) and key phrase similarity (KPS). The overall document similarity metric DS(B,A) is computed by giving additional weightage to term similarity over key phrase similarity. The DS values for all pairs of the N search result documents form the document similarity matrix. For a seed document D, the D-th column of the matrix is extracted as a linear array $x_1$ to $x_N$. The documents are reranked in descending order of their similarity to the seed document D. This ordering minimizes the overall similarity mean absolute distance (MAD) between documents. The similarity mean absolute distance (MAD) between documents, for a given query Q and N search result documents, is the mean of the absolute values of x, where x denotes the similarity distance between two successive documents.
Algorithm:
Initiate the search process using the query given by the user.
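The reranking step can be sketched under explicit assumptions. The document similarity matrix is taken as given, and the distance between two successive documents is assumed to be the absolute difference of their similarities to the seed; the paper's exact MAD formula is not reproduced here, so this is only one plausible reading. Function names and matrix values are illustrative.

```python
def rerank_by_seed(similarity, seed):
    """Order the non-seed documents by descending similarity to the seed."""
    column = [row[seed] for row in similarity]  # the seed's column of the matrix
    others = [d for d in range(len(similarity)) if d != seed]
    return sorted(others, key=lambda d: column[d], reverse=True)

def similarity_mad(similarity, seed, order):
    """Mean absolute difference in seed similarity between successive documents.

    Assumed reading of MAD; the original defining equation is not available.
    """
    scores = [similarity[d][seed] for d in order]
    steps = [abs(a - b) for a, b in zip(scores, scores[1:])]
    return sum(steps) / len(steps) if steps else 0.0

# Hypothetical symmetric document similarity matrix for four search results.
ds = [
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.3, 0.4],
    [0.8, 0.3, 1.0, 0.6],
    [0.5, 0.4, 0.6, 1.0],
]
seed = 0
original_order = [1, 2, 3]
reranked_order = rerank_by_seed(ds, seed)       # [2, 3, 1]
mad_before = similarity_mad(ds, seed, original_order)
mad_after = similarity_mad(ds, seed, reranked_order)
```

Under this reading, a monotone ordering makes the successive differences sum to the gap between the largest and smallest score, which no other ordering can beat, so sorting by similarity to the seed minimizes the MAD.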

RESULTS AND DISCUSSION
For the search results obtained from Google for the query "data structures and algorithms", we have generated the Document Similarity matrix for the top 10 search results. Taking the first search result document as the seed document, the reranking based on this seed is shown in Graph 1. Table 1 gives the computation of MAD before reranking and Table 2 gives the computation of MAD after reranking. Applying our algorithm yields a significant improvement in the mean document distance.

CONCLUSIONS
Search engines are enhancing their search algorithms so as to answer the queries of the user better. Various ranking and reranking algorithms focus on the similarity between the user query and the search results. In this paper we have developed an algorithm which accepts a web document as the seed document, extracts the features from the search result documents and reranks the documents based on their degree of similarity to the seed document. This is useful for users who are not very specific about their search and would like to explore starting from the initial search results. Future work will focus on an algorithm which lists the features of the seed document, so that the user can choose the features of interest to initiate the reranking process. The work can also be extended to multiple seed documents.