Performance Evaluation of Search Engines Using Enhanced Vector Space Model

Abstract: The vector space model computes a continuous degree of similarity between queries and retrieved documents and then ranks the documents in decreasing order of cosine (similarity) value. The similarity value is obtained from the cosine function, which requires the weight of each term in the documents to be computed using a weighting scheme; computing these weights is a complex process. The model also sometimes fails to produce a similarity score: first, when there is only one document in the corpus and the query terms match that document, and second, when the number of documents containing the query terms equals the total number of documents retrieved. To address these problems and improve performance, we propose an enhanced approach to computing the cosine (similarity) value by enhancing the vector space model. We analyze and apply the proposed method in the performance evaluation of three search engines: Google, Yahoo and MSN. To verify our method, we compared it with a manually computed relevance score and found that our evaluations match those of the manual method.


Introduction
A search engine is an information retrieval system that helps users find useful information on the web, which is itself a system of interlinked documents. The information retrieved is usually key words or phrases that indicate what a web page contains as a whole, together with the URL of the page, the code that makes up the page and the links into and out of the page. A search engine has a user interface where users enter a search term, a word or a phrase, in an attempt to find specific information. It is important to choose the most appropriate search engine for a query so that the information retrieved is of the greatest interest to the user. Hence, performance evaluation of search engines is a great challenge. Many measures can be used to evaluate the performance of a search engine, including precision, recall, coverage, response time and interface. In this study, we focus on the precision of search engines. Precision is commonly defined as the fraction of retrieved documents that are judged relevant. Performance evaluation of search engines has also been done manually based on precision (Chu and Rosenthal, 1996;Leighton and Srivastava, 1999). The major benefit of the manual precision evaluation used in the existing methods is its high accuracy; its drawback is that it is time consuming. Automatic evaluation of search engine performance is now preferred because both the web and the search engines change quickly. In evaluating the precision of search engines, automatic relevance evaluation is critical. Such evaluation uses a similarity measure for the relevance of web documents, as is generally done in Information Retrieval (IR).
The commonly used similarity measures are the Vector Space Model (VSM) (Hiemstra, 2009;Salton, 1989), the Okapi similarity measurement (Okapi) (Robertson and Walker, 1999) and Cover Density Ranking (CDR) (Cormack et al., 1999). The Vector Space Model and the Okapi similarity measurement face certain problems when used on the web because some of the parameters they require, such as the total number of documents on the web and the number of documents that contain the query terms, are unknown. The VSM, however, can be used by analysing only a fixed number of hits for a query on the web. In a previous study (Singh and Dwivedi, 2012), we analyzed different approaches to the Vector Space Model and various derivations of its weighting scheme and observed a few problems. To improve the Vector Space Model, we present a new method for evaluating the performance of search engines on the web.
Search engines are evaluated in two steps based on sample queries: (a) computing relevance scores of the hits from each search engine and (b) ranking the search engines based on a statistical comparison of the relevance scores. Statistical metrics, including the probability of win, may be used in the performance comparison of search engines. In our experiment, the proposed method has been applied to three popular search engines, Google, Yahoo and MSN, based on TREC pattern queries. The accuracy of our method was compared to the existing VSM and to a manual method.

Classical Method of VSM
The classical vector space model (Singh and Dwivedi, 2012;2013) computes the similarity score using the following formula. We take this method as the base of our research on the computation of similarity scores. The similarity is computed using the cosine function (Lee et al., 1997), given by Equation 1:

$$\mathrm{sim}(Q, D_i) = \frac{\sum_{j} w_{Q,j}\, w_{i,j}}{\sqrt{\sum_{j} w_{Q,j}^{2}}\; \sqrt{\sum_{j} w_{i,j}^{2}}} \qquad (1)$$

where w_{i,j} is the weight of term j in document i and w_{Q,j} is the weight of term j in the query Q. The denominator of this equation, called the normalization factor, discards the effect of document length on document scores. The weight of a term is computed by the TF-IDF method (Buckley, 1993;Takao et al., 2000;Stephen, 2004;Jung et al., 2000), as given by Equation 2:

$$w_{i,j} = TF_{i,j} \times IDF_{j} \qquad (2)$$

TF is the term frequency (the number of occurrences of a query term in a document) and IDF is the inverse document frequency (global information). The simple method for computation of IDF (Salton and Buckley, 1988;Polettini, 2004;Papineni, 2001) is given by Equation 3:

$$IDF_{j} = \log\left(\frac{D}{df_{j}}\right) \qquad (3)$$

where D is the number of documents in the document collection and df_j is the number of documents containing the query term j.
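The classical computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard forms of the three equations (cosine of the weight vectors, weight = TF × IDF, IDF = log(D / df)), uses whitespace tokenization, and runs on a tiny made-up corpus. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def idf(term, docs):
    """Classical IDF, assumed here to be log(D / df); 0.0 if the term is absent."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def tfidf_vector(tokens, vocab, docs):
    """Weight of each vocabulary term: raw term frequency times IDF."""
    tf = Counter(tokens)
    return [tf[t] * idf(t, docs) for t in vocab]


def cosine(u, v):
    """Cosine of two weight vectors; 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classical_similarity(query, docs):
    """Similarity of the query to every document in the collection."""
    vocab = sorted({t for doc in docs for t in doc} | set(query))
    q_vec = tfidf_vector(query, vocab, docs)
    return [cosine(q_vec, tfidf_vector(doc, vocab, docs)) for doc in docs]


# Hypothetical three-document corpus, tokenized on whitespace.
docs = [
    "search engine ranking".split(),
    "vector space model ranking".split(),
    "cooking recipes".split(),
]
scores = classical_similarity("vector space".split(), docs)
```

Only the second document shares terms with the query, so it is the only one that receives a non-zero score in this sketch.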

Issues in Similarity Value Computation
After analysing the classical vector space model, we derived the following observations.
If there is only one document in the corpus and the query terms match that document, intuition suggests that the cosine similarity should be one, but the IDF computed by the existing IDF method is zero. The similarity value therefore becomes zero in this condition.
If all query terms are present in all documents, the IDF value computed by the existing IDF method becomes zero, so the method fails to compute a similarity value for such a corpus.
The model favours long documents, yet computing a similarity score for long documents is very difficult because of their high dimensionality.
Computing the weight of each term in the document is very difficult and requires a large processing time.
The presence of stop words (a, an, the, etc.) in the documents also affects the computation of the similarity score.
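The first two observations follow directly from the arithmetic of the IDF. The short sketch below assumes the simple IDF form log(D / df); the function name and the document counts are hypothetical and serve only to illustrate the two failure cases.

```python
import math


def classical_idf(num_docs, doc_freq):
    """Simple IDF, assumed to be log(D / df)."""
    return math.log(num_docs / doc_freq)


# Case 1: a single document that contains the query term, so D = df = 1.
single_doc = classical_idf(1, 1)

# Case 2: every retrieved document contains the query term, so D = df.
all_docs = classical_idf(10, 10)

# Contrast: the term occurs in only some documents, so the IDF is positive.
some_docs = classical_idf(10, 2)
```

In both failure cases the ratio D / df equals one, its logarithm is zero, and every TF-IDF weight (and hence the similarity score) collapses to zero regardless of how well the document matches the query.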

Proposed Method
Having made these observations on the computation of similarity values using the classical vector space model, we further explored the literature to analyse some prominent methods for computing IDF (Salton and Buckley, 1988;Ramos, 2003;Takao et al., 2000), since the IDF has a key role in term weight computation and the term weight in turn influences the similarity value. We used the IDF method of Buckley (1993), given by Equation 4. With this variation in the inverse document frequency, the weight of a term is computed as given by Equation 5, where TF is the term frequency. To make the similarity computation easier, we propose a new, simpler cosine similarity function, given by Equation 6, where the length of the document is the number of unique terms in document j. Since the IDF formula of Equation 4 used in our proposed method cannot remove the stop words from the documents, they are removed using our new cosine function of Equation 6. The similarity score is computed for each query as an average across the number of links considered.

Similarity Values Computation Using Proposed and Classical Method of VSM
In the process of similarity computation, we applied both our proposed method and the classical method of VSM to compute the similarity scores between documents and queries. These experiments were based on an acceptable number of short TREC pattern queries.
These queries contain two or three terms. The set of 50 queries is given in Table 1. Of the various search modes available, we applied only the keyword-based (default) search mode and considered only the top 10 of the documents retrieved. This is the mode most users rely on, and the vector space model mainly supports keyword-based searching. Only lower-case queries were used because different search engines treat capitalized queries differently. We applied these queries to three search engines, Google, Yahoo and MSN, and computed their similarity scores.
For the selected queries, we computed the similarity for the top 10 retrievals of each query using the classical method of VSM (Equation 1) and the proposed method (Equation 6). Table 2 shows the average similarity values obtained for the three search engines by our proposed method and by the classical method of VSM on the queries listed in Table 1.

Comparison Between Proposed Method and Classical Method
We compared the similarity values computed by our proposed method with those computed by the classical method of VSM on three search engines: Google, Yahoo and MSN. The comparison is shown in Figs. 1, 2 and 3.
Figs. 1, 2 and 3 show the comparison of similarity scores between the two methods for the three search engines. The observation is clear: the similarity scores computed by the proposed method are higher than those of the classical method of VSM in all cases. Since a document with a higher similarity score is assumed to be more relevant to the user and is ranked higher, we can say that our method has a better chance of evaluating and ranking the documents for the queries.
Based on the similarity values and figures, we conclude that our proposed method yields a stronger correlation between documents and query terms for each of the three search engines than the classical method, and hence is more effective in evaluating the performance of search engines.

Manual Scoring Method
To check the accuracy of our proposed method, we also gave these queries to ten students (whom we selected for this task and carefully instructed in how to perform it) to assess the similarity scores of the search engines. The manual scoring method we used extends the existing methods (Leighton and Srivastava, 1999). The following criteria were used for manual scoring.
• Documents that are related to the information need of a query and may be useful for the given query are termed relevant. They get a score of 2
• Documents that are slightly related to the query or contain some short description relevant to the query are termed slightly relevant. They get a score of 1
• Duplicate links are pages that appear in the returned links with the same URL more than once. They are given a score of zero
• Inactive links are those which give an error message, such as file not found (404) or server not responding (603) errors. They are also given a score of zero
• Irrelevant links are links that contain irrelevant information. They also get a score of zero
Based on the above criteria, the score of a search engine is computed as the average score per page per query.
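The scoring criteria above can be sketched in a few lines. This is one plausible reading of "average score per page per query" (average over the pages of each query, then over the queries); the label names, function names and sample judgements are hypothetical and are not taken from the study's data.

```python
# Score assigned to each judgement category, following the criteria above.
SCORES = {
    "relevant": 2,
    "slightly_relevant": 1,
    "duplicate": 0,
    "inactive": 0,
    "irrelevant": 0,
}


def query_score(labels):
    """Average score over the pages returned for one query."""
    return sum(SCORES[label] for label in labels) / len(labels)


def engine_score(per_query_labels):
    """Average of the per-query scores for one search engine (assumed reading)."""
    per_query = [query_score(labels) for labels in per_query_labels]
    return sum(per_query) / len(per_query)


# Hypothetical judgements for two queries on one search engine.
judgements = [
    ["relevant", "relevant", "slightly_relevant", "irrelevant"],
    ["duplicate", "relevant"],
]
score = engine_score(judgements)
```

With ten judges, the same functions could be applied to each judge's labels and the resulting scores averaged.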

Probability of WIN
The performance of the search engines has been compared based on the similarity values reported in Tables 2 and 3. The statistical metric probability of win (P_win) (Li and Shang, 2000a;2000b;Shang and Li, 2002;Ieumwananonthachai and Wah, 1996) measures statistically how much better (or worse) the sample mean of one hypothesis, µ_1, is compared with that of another, µ_2. In this form of hypothesis testing, the hypothesis {H: µ_1 > µ_2} is specified without an alternative hypothesis and is evaluated based on sample values. P_win is computed from the mean and the variance of the performance data. First, we compute the difference in performance value (similarity value) between two search engines: if P1 and P2 are the performance values of the two search engines under consideration, we compute (P1 - P2) for n sample queries. We then compute µ as the sample mean of P1 - P2, followed by the sample variance σ^2. P_win is then defined by Equation 7:

$$P_{win} = F_t\left(n - 1,\; \frac{\mu}{\sqrt{\sigma^{2}/n}}\right) \qquad (7)$$

where F_t(v, x) is the cumulative distribution function of Student's t-distribution with v degrees of freedom. To compare a pair of search engines (say S1 and S2), if the P_win value is larger than 0.5, then S1 is better than S2; otherwise S2 is better than S1.
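The procedure above can be sketched as follows. The sketch assumes that P_win takes the form F_t(n - 1, µ / sqrt(σ^2 / n)), with σ^2 the sample variance of the per-query differences. To stay within the standard library, the t-distribution CDF is approximated by Simpson's rule rather than taken from a statistics package; the function names, the step count and the handling of zero variance are illustrative assumptions.

```python
import math
from statistics import mean, variance


def t_pdf(x, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    log_norm = (math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
                - 0.5 * math.log(v * math.pi))
    return math.exp(log_norm - (v + 1) / 2 * math.log1p(x * x / v))


def t_cdf(x, v, steps=2000):
    """Approximate CDF: Simpson's rule on [0, |x|] plus symmetry about zero."""
    if x == 0:
        return 0.5
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    area = min(0.5, total * h / 3)  # clamp numerical overshoot
    return 0.5 + area if x > 0 else 0.5 - area


def p_win(p1, p2):
    """Probability that engine 1 beats engine 2, from paired per-query scores."""
    diffs = [a - b for a, b in zip(p1, p2)]
    n = len(diffs)
    mu = mean(diffs)
    var = variance(diffs)  # sample variance (n - 1 in the denominator)
    if var == 0:
        # Assumption: with identical differences, decide on the sign of the mean.
        return 0.5 if mu == 0 else (1.0 if mu > 0 else 0.0)
    return t_cdf(mu / math.sqrt(var / n), n - 1)
```

Calling `p_win(scores_s1, scores_s2)` with two equally long lists of per-query similarity values returns a number in [0, 1]; a value above 0.5 favours the first engine. For very large test statistics the fixed step count loses accuracy, so a library CDF would be preferable in practice.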

Similarity Values Computation Using Manual Method
To check the accuracy of our relevance method, we provided these queries to ten students, as discussed in the manual scoring method, and the similarity scores in this section are based on that method. For each of the three search engines, each student assigned scores manually (as per the criteria of the manual method) for each query. The final similarity score for a query was obtained as the average of the scores given by the ten students, as shown in Table 3.

Performance Comparison of Search Engines Using Proposed and Manual Method
The performance of the search engines has been compared based on the similarity values reported in Tables 2 and 3, using the statistical metric probability of win discussed in the section above. The P_win values shown in Table 4 have been computed for each pair of search engines. The values are similar for our proposed method and for the manual method. For example, between Google and Yahoo, both methods gave values greater than 0.5, which means that the performance of Google is better than that of Yahoo. Similarly, the performance of Yahoo is found to be better than that of MSN, as the P_win value is greater than 0.5 with both the proposed and the manual method. Both methods arrive at similar comparison results: Google outperformed the other two search engines, Yahoo took the second spot and MSN the third. These results show that our method of computing similarity values is accurate, as it is corroborated by the manual scoring method.

Conclusion
In this study, we have proposed an enhancement to the existing vector space model to compare the performance of search engines. Using an acceptable number of TREC pattern queries, we computed similarity values with our proposed method and with the classical method of VSM. The similarity values computed by our proposed method were higher than those of the classical method of VSM, as shown in the figures. We also compared our proposed method with the manual method using the same query set for the top 10 hits of each query. Both the manual method and our method obtained similar results, in which Google outperformed the other two search engines, while Yahoo and MSN obtained the second and third spots respectively.