Wordnet and Ontology Based Query Expansion for Semantic Information Retrieval in Sports Domain

Abstract: Semantic Search has long been a central goal of the Semantic Web vision. Information on the Web is growing at a very rapid pace and has become voluminous over the past few years. Traditional Search systems are purely keyword based and do not consider the semantics of the query. To increase the number of relevant documents retrieved, queries need to be disambiguated by looking at their context. A query expansion algorithm for Semantic Information Retrieval in Sports Domain (SIRSD) is proposed to perform Semantic Search and improve search over large document repositories. The algorithm reformulates user queries using WordNet and a domain ontology to improve the returned results. Our proposal is illustrated with sample experiments showing improvements with respect to Traditional Search and providing ground for further research and discussion. SIRSD reduces the issue of semantic interoperability during user query search. The experiments indicate higher average precision and recall for the Semantic Search system than for Traditional Search. The results show its effectiveness in generating a suitable number of search queries, with an accuracy of 87.1% compared with generic search engines.


Introduction
The World Wide Web Consortium (W3C) defines the Semantic Web as data on the World Wide Web enriched with semantic content. Information retrieval is the process of obtaining information resources relevant to the query given by the end user. It is often difficult to specify an information need using exact query terms.
The most critical language issue for retrieval effectiveness is the term mismatch problem: indexers and users often do not use the same words. This is known as the vocabulary problem (Furnas et al., 1987), and it is compounded by synonymy (different words with the same or similar meanings, such as "tv" and "television") and polysemy (the same word with different meanings, such as "apple" ('company' vs. 'fruit')). Synonymy may result in a failure to retrieve relevant documents, with a decrease in recall. Polysemy may cause retrieval of erroneous or irrelevant documents, implying a decrease in precision.
Several approaches, such as interactive query refinement, relevance feedback, word sense disambiguation and search results clustering, have been proposed to deal with the vocabulary problem. The most natural and successful technique is to expand the original query with suitable words that best capture the actual user intent. The accuracy of the search can be improved by Semantic Search, which understands the intent of the searcher and the contextual meaning of terms to generate more relevant results. Patil and Jadhav (2012) designed semantic information retrieval using ontology and SPARQL and applied it to the cricket domain.

Why Query Expansion
Query expansion is needed to overcome the ambiguity of natural language. It adds meaningful terms to the initial query. New terms can be added manually, automatically or with user assistance. In manual query expansion, user expertise is required to decide which terms to add to the new query, whereas in automatic query expansion the terms with the highest weights are added to the initial query. In user-assisted query expansion, the system generates possible expansion terms and the user selects which of them to include. The term selection method should be the one that provides the most contextual information in the retrieved results. Bhogal et al. (2006) acquired contextual information from relevance feedback and term co-occurrence, and also derived it from knowledge models such as ontologies. In automatic query expansion, the user's original query is augmented with new features of similar meaning. Salton and McGill (1983) defined relevance feedback as a technique for modifying the initial query using words from top-ranked or identified relevant documents. It is an easier way of improving the retrieved documents than having the user construct a new query. The relevance feedback loop requires the user to enter an initial query, which results in a display of ranked documents (usually titles or abstracts). From this display, the user makes relevance judgments and selects the relevant documents. The relevant terms from these documents are added to the initial query. An alternative is pseudo-relevance feedback, where the top-ranked 'n' documents are assumed to be relevant. Terms from these documents are selected and used for expanding the query. Whether pseudo-relevance feedback or traditional relevance feedback is used, the term selection method is a key factor in the performance of expanded queries.
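The pseudo-relevance feedback loop described above can be sketched as follows. This is a minimal illustration, not the term selection method of any cited work: it assumes whitespace tokenization and raw term counts as the weighting, and the function name, parameters and sample documents are hypothetical.

```python
from collections import Counter

def pseudo_relevance_expand(query, ranked_docs, n_docs=2, n_terms=1,
                            stopwords=frozenset()):
    """Add the most frequent new terms from the top n_docs documents to the query."""
    query_terms = query.lower().split()
    counts = Counter()
    # The top-ranked n_docs documents are simply assumed to be relevant.
    for doc in ranked_docs[:n_docs]:
        for term in doc.lower().split():
            if term not in query_terms and term not in stopwords:
                counts[term] += 1
    # Term selection: here, plain frequency in the assumed-relevant documents.
    new_terms = [term for term, _ in counts.most_common(n_terms)]
    return query_terms + new_terms

# Hypothetical ranked result list for the initial query "football".
ranked = [
    "football club league champions",
    "football league table results",
    "cooking recipes pasta",
]
expanded = pseudo_relevance_expand("football", ranked, n_docs=2, n_terms=1)
print(expanded)
```

Replacing the raw count with a different weighting changes only the term selection step, which is the factor the text identifies as decisive for the performance of expanded queries.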

Query Expansion Using Ontology
Ontologies are applicable to domain-independent retrieval, such as web information retrieval, and are even more useful in specialized information retrieval tasks. They have also been used in query expansion. Ontologies are formal and explicit specifications of shared conceptualizations, expressed as concepts and relations. Wilson et al. (2012) learnt ontologies from text. Ontologies range from general to domain-specific. WordNet and EuroWordNet are examples of general ontologies. Many domain-specific ontologies exist, for example in the medical and legal domains. WordNet has been a popular general ontology in the area of query expansion.

Traditional Method
Classical search engines perform searching in three stages: document indexing, term weighting and similarity computation. In the first stage, documents are indexed. In the second stage, the importance of each term within a document is calculated with the help of its term frequency. In the last stage, documents and queries are represented by vectors of term weights, and retrieval is done by cosine similarity. In this method documents are represented poorly, the search keywords must be precise, and documents with a similar context but different terms are not retrieved. Salton and McGill (1983) showed that this method gives very low precision and recall values.
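A minimal sketch of the three stages, assuming whitespace tokenization and raw term frequency as the only weight (real engines typically add further weighting). The function names and sample documents are hypothetical; the example also shows the limitation mentioned above, since the document that says "television" instead of "tv" scores lower.

```python
import math
from collections import Counter

def tf_vector(text):
    """Indexing and term weighting: map each term to its frequency."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, document) pairs sorted by decreasing similarity."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical three-document collection.
docs = [
    "tv schedule for football",
    "television schedule for football",
    "apple pie recipe",
]
results = rank("tv football", docs)
for score, doc in results:
    print(round(score, 3), doc)
```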

Semantic IR Method
Traditional methods are unable to understand the meaning of the content. In the Semantic Web, knowledge representation is much easier: information related to a specific domain is stored in an ontology. Vallet et al. (2005) used a semantic knowledge base instead of a keyword-based index for efficient searching, but were not able to achieve greater performance.

Semantic Indexing
Semantic indexing is one of the techniques used to improve precision and recall. A semantic index is created on ontological data and keyword search is performed on it. Semantic indexing based on Lucene indexing extends the traditional full-text index with extracted data. Documents containing ontological data receive higher scores during ranking. The performance of the system is improved compared with Traditional Search, although Kara et al. (2010) were not able to achieve 100% precision and recall.

Term Co-Occurrence
Term co-occurrence refers to two or more terms that are situated next to or near each other in the source document. Smeaton and Rijsbergen (1983) experimented with new terms generated from sources such as maximum spanning trees and found very little improvement. Peat and Willett (1991) used query terms that have high collection frequencies for query expansion. Since high-frequency terms do not discriminate between relevant and non-relevant documents, the addition of these terms for query expansion is ineffective.
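As an illustration of "situated next to or near each other", the following sketch counts unordered term pairs that fall within a fixed window of tokens. The window size, the function name and the sample sentence are assumptions made for the example and do not come from the cited studies.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count unordered pairs of distinct terms at most `window` positions apart."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

# Hypothetical source text.
tokens = "football club wins football league title".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts.most_common(3))
```

Pairs with high counts are candidates for expansion terms, subject to the caveat above that very frequent terms discriminate poorly between relevant and non-relevant documents.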

Problem Statement
One of the important concerns in search is to capture the semantics of the user query. Indexers and users tend to use related words and synonyms rather than the same words. This can be handled by expanding the original query with other words that best capture the actual user intent. The system is implemented for the sports domain. Data is extracted from the sports domain and stored in the ontology. The ontology is stored in OWL format, which is based on RDF and RDFS. The objective of the work is to retain the user-friendly nature of the search query while improving retrieval performance.

System Implementation
The overall architecture of the system is shown in Fig. 1. The system is divided into a number of segments for better understanding. Its integral segments consist of the following phases: parsing of the input query, extracting synonym words, ontology construction, keyword extraction from the ontology, and Semantic Search with the expanded query using the Google API. The following subsections describe these phases in more detail.

Parsing of Input Query and Extracting Synonym Words
The user enters a natural-language query to fetch data on the sports domain. The input query is passed through the parser, which analyzes it grammatically. Parsing analyzes the query syntactically to identify every word in it. The parsed output is stored in a file as a parsed sentence. The parse tree of the query "name of football clubs in EEFA" is depicted in Fig. 2. The parsed words are shown in Fig. 4.

Tool Used: Stanford Parser
Related synonym sets for the words in the query are also obtained using the WordNet API. Similar words for the query are also obtained from the similarity measure given in Equation 1. We define the document vector of $D_i$ as $d_i = (d_{i,1}, \ldots, d_{i,M})$, where $d_{i,j}$ is the weight of the annotation of document $D_i$ with $I_j$ if such an annotation exists, and zero otherwise. We define the extended query vector as $q = (q_1, q_2, \ldots, q_M)$. The similarity measure of $D_i$ for the query $Q$ is then computed as in Equation 1. The semantic similarity between two words P and Q is also modeled as a function sim(P, Q) that returns a value in the range [0, 1]. The popular co-occurrence measure Web Jaccard is used to compute semantic similarity using page counts.
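The Web Jaccard measure can be sketched as below, using its commonly cited form: the page count of the conjunctive query "P AND Q" divided by the number of pages containing either word. The noise threshold (a default of 5 here), the function name and the page counts in the example are assumptions for illustration; in practice the counts would come from a search engine.

```python
def web_jaccard(count_p, count_q, count_pq, threshold=5):
    """Web Jaccard similarity of two words from search-engine page counts.

    count_p, count_q: page counts for each word alone.
    count_pq: page count for the conjunctive query "P AND Q".
    threshold: co-occurrence counts at or below this value are treated as
    noise and mapped to 0 (an assumed default, not taken from the paper).
    """
    if count_pq <= threshold:
        return 0.0
    return count_pq / (count_p + count_q - count_pq)

# Hypothetical page counts for two words.
score = web_jaccard(count_p=1000, count_q=800, count_pq=300)
print(score)
```

As long as the conjunctive count does not exceed either individual count, the result stays in the range [0, 1] required of sim(P, Q).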

Ontology Construction and Keyword Extraction from Ontology
An ontology is a formal representation of knowledge as a set of concepts within a specific domain. A strong ontology is constructed from information gathered from various sports websites. The Protégé tool is used to build the ontological hierarchy and to represent the exact relationships between classes using properties. RDF is used to represent information and to exchange knowledge on the Web. The sample RDF for the Sports class is shown in Fig. 3. The class hierarchy for the sports domain is depicted in Fig. 6. The set of keywords most related to the given user query is extracted from the sports ontology. At the end, a collection of words that are semantically related to the sports domain is obtained.
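Keyword extraction from a class hierarchy can be illustrated with a toy example. The sketch below is not the Protégé/OWL ontology of the paper: it assumes the hierarchy has already been loaded into a plain dictionary mapping each class to its subclasses, and every class name in it is hypothetical.

```python
# Hypothetical fragment of a sports class hierarchy (class -> subclasses).
ONTOLOGY = {
    "Sports": ["Football", "Cricket"],
    "Football": ["FootballClub", "FootballPlayer", "League"],
    "Cricket": ["CricketTeam", "Batsman"],
}

def descendants(ontology, concept):
    """Collect all subclasses of `concept`, breadth first."""
    found = []
    queue = list(ontology.get(concept, []))
    while queue:
        node = queue.pop(0)
        if node not in found:
            found.append(node)
            queue.extend(ontology.get(node, []))
    return found

def related_keywords(ontology, query_terms):
    """Return the subclasses of every ontology class named in the query."""
    lookup = {name.lower(): name for name in ontology}
    keywords = []
    for term in query_terms:
        concept = lookup.get(term.lower())
        if concept is None:
            continue  # the query word is not a class in the ontology
        for keyword in descendants(ontology, concept):
            if keyword not in keywords:
                keywords.append(keyword)
    return keywords

keywords = related_keywords(ONTOLOGY, ["name", "football", "clubs"])
print(keywords)
```

A fuller version would also follow object properties between classes rather than only the subclass relation.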

Query Expansion and the Semantic Search
Various queries are formed from permutations and combinations of the similar words and the words obtained from the ontology. The expanded form of the input query is given in Fig. 5. The queries formed are more refined and are sent to the Google Search API, which fetches the web links related to the user query.
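One plausible reading of "permutations and combinations" is sketched below: each query word may be replaced by one of its synonyms, and each resulting query may additionally be extended with an ontology keyword. This is an assumed combination scheme, not the authors' algorithm; the synonym table and keyword list are hypothetical stand-ins for the WordNet and ontology outputs, and no search API is called.

```python
from itertools import product

def expand_query(terms, synonyms, ontology_keywords):
    """Build expanded queries from synonym alternatives and ontology keywords."""
    # For every query term, the alternatives are the term itself plus its synonyms.
    alternatives = [[term] + synonyms.get(term, []) for term in terms]
    base_queries = [" ".join(combo) for combo in product(*alternatives)]
    expanded = list(base_queries)
    # Append each ontology keyword to each base query.
    for query in base_queries:
        for keyword in ontology_keywords:
            expanded.append(f"{query} {keyword}")
    return expanded

# Hypothetical inputs.
synonyms = {"football": ["soccer"], "clubs": ["teams"]}
queries = expand_query(["football", "clubs"], synonyms, ["league"])
for query in queries:
    print(query)
```

The number of queries grows multiplicatively with the number of synonyms per term, so a real system would need to cap or rank the alternatives before sending them to a search service.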

Example:
The set of expanded queries is the input to the web search, which provides the web links corresponding to the given queries. The web links retrieved by the semantic search are obtained by means of the domain keywords extracted from the ontology. The algorithm for query expansion is as follows.

Results
The added value of Semantic Information Retrieval with respect to traditional keyword-based retrieval, as envisioned in our approach, relies on the additional explicit words obtained from the synsets and the ontology, leading to higher recall and precision values. In summary, our system achieves the following improvements with respect to keyword-based search:
• Better precision by using structured semantic queries. Structured queries allow a more precise expression of the information need, leading to more accurate answers. For instance, in a keyword-based system it is not possible to distinguish a query for USA players in Catalan basket teams from a query for Catalan players in USA teams, which is possible with a Semantic Search.
The recall calculated here is the relative recall, in which performance is measured relative to the traditional Google search engine.

$$\frac{\text{Number of relevant links given by the system}}{\text{Total number of links retrieved}}$$
The precision calculated here measures how much of the information that the system returns is correct (accuracy).
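The two measures can be sketched as follows. Precision is taken as the share of retrieved links that are relevant. For relative recall, the sketch assumes the commonly used definition: relevant links found by one system divided by the relevant links found by both systems together. The link identifiers and relevance judgments are hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved links that are judged relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for link in retrieved if link in relevant)
    return hits / len(retrieved)

def relative_recall(system_retrieved, other_retrieved, relevant):
    """Relevant links found by the system over relevant links found by either system."""
    system_hits = {link for link in system_retrieved if link in relevant}
    other_hits = {link for link in other_retrieved if link in relevant}
    pooled = system_hits | other_hits
    if not pooled:
        return 0.0
    return len(system_hits) / len(pooled)

# Hypothetical result lists and relevance judgments for one query.
semantic = ["a", "b", "c", "d"]
traditional = ["a", "e", "f", "g"]
relevant = {"a", "b", "c", "e"}

print(precision(semantic, relevant), precision(traditional, relevant))
print(relative_recall(semantic, traditional, relevant),
      relative_recall(traditional, semantic, relevant))
```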
When the SIRSD system was tested, an improvement in performance was observed as a result of semantic analysis, although a small fraction of the queries had a performance similar to that of a Traditional Search Engine. Table 1 gives the sample queries, and Table 2 shows samples of the performance levels of our system in comparison with Traditional Search systems.
After detailed analysis, it is found that this query expansion algorithm improves performance. The precision and recall values are observed to be high for all the sample queries. The algorithm gives an average 40% increase in precision and a 20 to 30% increase in recall.
Fig. 7 shows that the precision of our SIRSD system is higher than the values obtained with Traditional Search Engines. The recall of our system is also higher than the values obtained with Traditional Search Engines, as depicted in Fig. 8.
The recall and precision values for all eight queries, for both Semantic Search and Traditional Search, are plotted in Fig. 9.

Discussion
In this study, the precision and recall values of our system are found to be higher than those of the Traditional Search system. Ra et al. (2013) constructed a military ontology for semantic data processing in the Army Tactical Command Information System (ATCIS), which can be used for constructing intelligent military information systems. Li et al. (2012) implemented a query expansion algorithm that improves the performance of an image retrieval system by expanding the original query based on information from the top-ranked images. Chen et al. (2012) expanded the query using relevance feedback fusion to enhance IR effectiveness; the linguistic and structural information of feedback documents was not exploited to improve performance. Faiz et al. (2012) proposed a learning information retrieval system based on a semantic annotation process with contextual exploration; to improve the results, a machine learning technique is applied to sort the results according to their similarity with the request term. Kumar et al. (2010) developed an ontology for the education domain using Protégé to retrieve university information. Kotov and Zhai (2012) presented a study of methods leveraging the ConceptNet knowledge base for query expansion to improve the search results for poorly performing (or difficult) queries, where the initial search results are of very poor quality and other techniques such as relevance feedback and pseudo-relevance feedback become ineffective. Luo et al. (2012) performed Explicit Semantic Analysis (ESA) for query expansion. Tanapaisankit and Song (2012) developed a personalized search system, Query In Context (QIC), which tries to enhance individual search by incorporating user preferences in query expansion and capturing meanings embedded in documents. They were also exploring other techniques to incorporate the semantic properties of terms in user queries. Anh et al. (2012) proposed an approach to word sense disambiguation by creating a new model of semantic network. They summarized how to add semantics to a query in the Vietnamese language based on the Ontology of Object Member Property (OOMP) using the improved Similar Noun Phrase Expansion (iSNPE) algorithm. Gandhi and Srivatsa (2006) constructed an algorithm to improve the accuracy of sessionizers in web usage analysis. Kanimozhi and Christy (2013) presented a system incorporating ontology and SPARQL for semantic image annotation. Gandhi and Srivatsa (2007) reported a study on guarding web users against malicious activities using web logs. Magesh and Thangaraj (2013) developed methods to retrieve images from an ontology and compared image retrieval performance using the SPARQL query language, a decision tree algorithm and Lire, an open-source image search engine. The SPARQL query language is used to retrieve semantically relevant images using keyword-based annotation, and the decision tree algorithms are used to retrieve relevant images using the visual features of an image.
Sathianesan and Sankaranarayanan (2013) proposed a blog summarizer that checks the similarity of each sentence in a blog, sorts the sentences by their similarity to the query word and reduces the number of sentences. This produces more relevant and better summarized blogs compared with various search engines. SBMF confirms one hundred percent relevancy compared with other blog search engines. Vadivu and Hopper (2012) developed a Knowledge framework for Indian Medicinal Plants (KIMP) that includes ontology creation and a user interface for querying the system. Jena is used to build semantic web applications, and the SPARQL Protocol and RDF Query Language (SPARQL) is used to retrieve various query patterns. This system achieved automated mapping by considering lexical and edge-based relatedness.
After a detailed analysis of Figs. 7 to 9, we can infer that the average precision and recall are higher for the Semantic Search system than for Traditional Search. This indicates that our system has better performance and accuracy in retrieving results than the generic search engines. The average relative recall of the SIRSD system is also higher than that of the Traditional Search system. The higher value denotes better coverage by our system compared with the generic search engines. Table 3 shows the average precision and recall values of both SIRSD and Traditional Search for the sample queries taken.

Conclusion
Semantically relevant information has been retrieved by our SIRSD system. The system is successfully implemented using technologies such as ontology, OWL and information extraction using the Jena API. Query expansion was introduced and Semantic Search was performed, which overcomes the drawbacks of Traditional Search. Considerable improvement in the performance of the system using domain-specific information extraction is observed. As shown in the Results section, the system achieves better recall and precision values, and the approach can be extended to various domains and to multilingual databases. Another domain can be implemented by changing the domain ontology and the information extraction in line with the sports domain. The other modules can be reused in the new implementation without any changes. SIRSD reduces the issue of interoperability during the search for the user's need. Our ongoing research work is to use semantic similarity measures in query expansion, where a user query is modified using semantically similar words to improve the relevancy of the search. Multilinguality can be achieved by storing semantic information from multiple languages.