AN AGENT BASED FRAMEWORK FOR SENTIMENT CLASSIFICATION OF ONLINE REVIEWS USING ONTOLOGY

In this study, we design and develop an agent-based framework for sentiment classification of online reviews using ontology. Book reviews are ranked according to the sentiment classification result. We propose a novel approach, built on the JADE platform, to solve the problem of non-visual, automatic sentiment classification. The book review ranking is generated from an ontology-based mapping. The approach employs three agents. The first is the data extraction agent, which retrieves comments about books, i.e., the user reviews, from the specified blogs. The second is the recommendation agent, which uses a domain ontology to identify domain-related features in the comments. The third is the feature selection agent, which splits the XML document content into single sentences. Each word in a sentence is mapped to the ontology, and this mapping process identifies the domain-related sentences in that context. These processes are used to rank the book results based on customer reviews. The book review ranking system can easily be extended to reviews of other products.


INTRODUCTION
Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three types: Web usage mining, Web content mining and Web structure mining. Web usage mining is the process of extracting useful information from server logs, that is, of finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web content mining is the process of discovering useful information from text, image, audio or video data on the web. It is sometimes called web text mining, because text content is the most widely researched area. The technologies normally used in web content mining are Natural Language Processing (NLP) and information retrieval. Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site.
Sentiment classification/analysis is becoming a promising topic in the field of Customer Relationship Management (CRM). Customer profiling becomes more effective and enterprises can move towards one-to-one marketing.
A basic task in sentiment classification/analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level, that is, deciding whether the opinion expressed in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Sentiment analysis refers to the application of natural language processing, computational linguistics and text analytics to identify and extract subjective information in source materials. It aims to determine the attitude of a writer with respect to some topic, or the overall tonality of a document (Wanga et al., 2013). The attitude may be the writer's judgment, affective state or intended emotional communication.
Opinions are also important when someone wants to hear what others think before making a decision. There are two types of opinion: direct opinions and comparisons. Direct opinions are opinions expressed on products, events, topics and people (e.g., this book is very easy to read). Comparisons express the similarities or differences between two or more objects (e.g., this book explains concepts better than JAVA EDITION 5).
Consumers can use sentiment analysis to research books before making a purchase. Marketers can use this to research public opinion of their books, or to analyze customer satisfaction. Publishers can also use this to gather critical feedback about problems in newly released books.
The main objective of this study is to retrieve book names and the corresponding recent reviews from specified blogs. The first agent is the data extraction agent, which retrieves comments about books, i.e., the user reviews, from the specified blogs. The second agent is the recommendation agent, which uses a domain ontology to identify domain-related features in the comments. The third agent is the feature selection agent, which splits the XML document content into single sentences. Each word in a sentence is mapped to the ontology, and the mapping process identifies the domain-related sentences in that context. These processes are used to re-rank the book results based on customer reviews.
An agent can act as an information collector, preprocessor (Othman et al., 2007) and classifier (Bakar et al., 2008) to a user.
The structure of the study is as follows. Section 2 surveys work related to the study. Section 3 explains the proposed system architecture. Section 4 shows the design of each module and the implementation details. Section 5 discusses the results. Section 6 summarizes the study and outlines future enhancements.

RELATED WORK
Several techniques have been used for opinion mining tasks. The following works are related to the present approach. Liu et al. (2012) designed and developed a movie-rating and review-summarization system for a mobile environment. They used a sentiment classification approach based on Latent Semantic Analysis (LSA) to identify product features. Tong et al. (2008) proposed a real-time Data Mining and Multi-Agent System, called DMMAS, for modeling chronic disease data. They suggest that the DMMAS approach employs data partitioning and multiple agents, with an option to employ heterogeneous or homogeneous data mining techniques, distributing agent-based processing for modeling. Xia et al. (2011) applied an ensemble framework to sentiment classification tasks, with the aim of efficiently integrating different feature sets and classification algorithms to synthesize a more accurate classification procedure. The authors applied two types of feature sets for opinion mining and three well-known text classification algorithms, namely naive Bayes, maximum entropy and support vector machines, as base classifiers for each of the feature sets. Three types of ensemble methods, namely fixed combination, weighted combination and meta-classifier combination, were then evaluated for three ensemble strategies. Zhang et al. (2009) presented an ontology-based e-commerce product information retrieval system and proposed an ontology-based adaptation of the classical Vector Space Model that takes the weight of product attributes into account. An ontology of computers and their components was built and used to annotate HTML documents and construct concept vectors of the documents. Mistry and Shah (2011) proposed an architecture for a hospital system built on the JADE platform.
Their work describes the different agents used, how communication occurs between them and how the agents are managed. A Multi-Agent System (MAS) provides an efficient, decentralized way for agents to communicate. Wiebe et al. (2004) used review data on automobiles, banks, movies and travel destinations. They classified words into two classes (positive or negative) and counted the overall positive or negative score for the text. If a document contains more positive than negative terms, it is assumed to be a positive document; otherwise, it is negative. These classifications operate at the document and sentence level. They are useful and improve the effectiveness of sentiment classification, but they cannot determine what the opinion holder liked or disliked about each feature. Zhang et al. (2008) used customer feedback reviews and product reviews. They used decision tree learning for sentiment classification. Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability. These learning methods are among the most popular inductive inference algorithms and have been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants. Chen and Chiu (2009) proposed a Neural Network (NN) based index which combines the advantages of machine learning techniques and semantic orientation indices to classify sentiment effectively. Tao and Tan (2004) used emotional function words instead of emotional keywords to evaluate emotional states. Hu and Liu (2004) used adjective synonym sets and antonym sets in WordNet to judge the semantic orientations of adjectives.
Existing works use the semantic orientation of words to classify positive and negative sentiments. These classifications cannot find domain-related features. The proposed system combines POS tagging, a domain ontology and a classifier, with the intention of enhancing sentiment classification.

AGENT TECHNOLOGY AND MULTI AGENT SYSTEM
Agent technology is a new concept derived from artificial intelligence. The term agent describes a software abstraction, an idea, or a concept, similar to Object Oriented Programming (OOP) terms such as methods, functions and objects. The concept of an agent provides a convenient and powerful way to describe a complex software entity that is capable of acting with a certain degree of autonomy in order to accomplish tasks on behalf of the user. Unlike objects, which are defined in terms of methods and attributes, an agent is defined in terms of its behavior. The term agent, or software agent, has found its way into a number of technologies and has been widely used, for example, in the artificial intelligence, database, operating system and computer network literature. Agents have several characteristics that make researchers interested in exploring agent technology. An agent is autonomous, because it operates without the direct intervention of humans or others and has control over its actions and internal state. An agent is social, because it cooperates with humans or other agents in order to achieve its tasks. An agent is reactive, because it perceives its environment and responds in a timely fashion to changes that occur in the environment. An agent is proactive, because it does not simply act in response to its environment but is able to exhibit goal-directed behavior by taking the initiative (Mistry and Shah, 2011).
For real-world applications a single agent is not enough, so multiple agents are used. A Multi-Agent System (MAS) is a system composed of multiple agents acting collectively to reach goals that are difficult to achieve with an individual agent or a monolithic system. We decided to use JADE as our implementation tool for agents. The Java Agent Development Environment (JADE) is a middleware that facilitates the development of multi-agent systems. It provides an environment compliant with the Foundation for Intelligent Physical Agents (FIPA) specifications and an implementation of a multi-agent system (Bellifemine et al., 2010).
A multi-agent system was selected for the proposed system for several reasons. First, integrating data from various sources, i.e., from various web pages, is a very complex task; web pages are highly dynamic and uncertain. Second, agents are capable of independent action on behalf of a user or owner and can act, capture and manage information automatically when necessary. Third, agents can interact with other external systems and can be used to manage both distributed and local knowledge. Fourth, agents can learn from their own experience. This is particularly important in the field of web mining, as the data is constantly modified and updated; the system performs better over time because the agents have learnt from their previous experiences. Finally, agents have autonomy and social ability, and a multi-agent system is inherently multithreaded for control. A multi-agent approach is therefore suitable for the development of a sentiment classification system.

METHODOLOGY
In the proposed agent-based framework, software agents guide a user who has no prior knowledge of sentiment classification. The proposed system has three agents: the data extraction agent, the recommendation agent and the feature selection agent. The first agent is the data extraction agent, which retrieves comments about books, i.e., the user reviews, from the specified blogs.
The second agent is the recommendation agent, which uses a domain ontology to identify domain-related features in the comments. It uses the existing ACM Classification hierarchical structure to construct the domain ontology. Opinions are stored in an XML document.
The third agent is the feature selection agent, which splits the XML document content into single sentences. Each word in a sentence is mapped to the ontology. The mapping process identifies the domain-related sentences in that context. The Java WordNet Interface (JWI) is used to access the WordNet database. WordNet recognizes only four parts of speech: NOUN, VERB, ADJECTIVE and ADVERB. Product features are usually nouns or noun phrases in review sentences. We used the Brill tagger on each review to split the text into sentences and to produce the part-of-speech tag for each word. Each sentence is saved in the review database along with the POS tag information of each word in the sentence.
The features are defined by labeling positive and negative sentiment words. For example, positive sentiment words are 'strong', 'clear' and 'neat', and negative sentiment words are 'disagree', 'difficult' and 'bad'. The classifier classifies each sentence as either positive or negative. The final outcome of the proposed work is to re-rank the book results based on the opinions about each book.
The overall system architecture is shown in Fig. 1. The architecture focuses on domain-related sentiment classification. The overall system comprises four main approaches.

Data Source
The proposed data source is www.amazon.com. This website has a large number of book reviews. These book reviews are downloaded using a crawler and used as the opinions for the algorithm.

Data Extraction Agent
This is the first agent of the proposed system. The data extraction agent retrieves the book name and the corresponding customer reviews of the book from specified blogs. Two functions are used to implement this module efficiently. The first function captures the book name and the corresponding customer review links of the book from specified blogs. The second function captures the customer comments/opinions from these links. A path-ascending crawling algorithm is used to implement the crawler module.
Some crawlers are intended to download as many resources as possible from a particular web site. The path-ascending crawler was introduced for this purpose: it ascends to every path in each URL that it intends to crawl. For example, when given a seed URL of http://www.amazon.com/Web-Haralambos/product-reviews/19339, it will attempt to crawl /Web-Haralambos/product-reviews/19339/, /Web-Haralambos/product-reviews/, /Web-Haralambos/ and /. A path-ascending crawler is very effective at finding isolated resources, or resources for which no inbound link would have been found in regular crawling. Based on this algorithm the crawler is able to capture all the book review links.
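The path-ascending step described above can be sketched in a few lines of Python. This is only an illustration of how the ancestor paths of a seed URL might be enumerated, not the implementation used in the proposed system; the function name ascending_paths is an assumption introduced here.

```python
from urllib.parse import urlsplit, urlunsplit


def ascending_paths(seed_url):
    """Return the seed URL's path and every ancestor path, deepest first.

    Hypothetical helper: the name and signature are not from the paper.
    """
    parts = urlsplit(seed_url)
    segments = [s for s in parts.path.split("/") if s]
    urls = []
    # Drop one trailing path segment at a time until only the root "/" is left.
    for depth in range(len(segments), -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth > 0:
            path += "/"
        urls.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return urls


if __name__ == "__main__":
    seed = "http://www.amazon.com/Web-Haralambos/product-reviews/19339"
    for url in ascending_paths(seed):
        print(url)
```

For the seed URL used in the text, the sketch yields the four paths listed above, from the full review path down to the site root.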
The user of the proposed system specifies the starting URL (www.amazon.com) and the search word that the crawler should use. The data extraction agent reads all the content and converts it into a string. The crawler captures only the links whose path contains "product-review".
For example: http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/product-reviews/19339. After capturing the links, the agent retrieves the opinions from the corresponding links and stores them in an XML document. This XML document is the input for the next module.

Path Ascending Crawling Algorithm for Data Extraction Agent
The first function captures the book name and the corresponding customer review links of the book from specified blogs. The second function captures the customer comments and opinions from these links. The user specifies the starting URL that the data extraction agent should crawl. The crawler reads all the content and converts it into a string.
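As a rough illustration of the first function, the following Python sketch scans a page that has already been read into a string and keeps only the links whose path contains a review marker. The class and function names, and the marker string "product-review" (taken from the example URL), are assumptions made for this sketch; fetching the page and extracting the review text are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReviewLinkParser(HTMLParser):
    """Collect anchor targets whose URL contains a review marker (hypothetical)."""

    def __init__(self, base_url, marker="product-review"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumed marker, based on the example URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # skip duplicate links
                self.links.append(absolute)


def extract_review_links(page_html, base_url):
    """Return the review links found in a page given as a string."""
    parser = ReviewLinkParser(base_url)
    parser.feed(page_html)
    return parser.links
```

A standard-library parser is used here only to keep the sketch self-contained; the paper does not state how the string content is scanned.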

Recommendation Agent-Domain Ontology
Ontology is a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts. It is used to reason about the entities within that domain and may be used to describe the domain. A domain ontology (or domain-specific ontology) models a specific domain, or part of the world. It represents the particular meanings of terms as they apply to that domain. For example the word "ACID" has different meanings in different domains. ACID is a chemical substance in the domain of chemistry while in the domain of database management system, ACID means properties of transaction.
Domain Ontologies are used in artificial intelligence, the Semantic Web, systems engineering, software engineering, biomedical informatics, library science and enterprise bookmarking and information architecture as a form of knowledge representation.
A domain ontology is constructed that contains the domain-related concepts. ACM, the world's largest educational and scientific computing society, provides a hierarchical structure of Computing Systems. This structure is used to construct the domain ontology. A General Search Tree algorithm is used for the construction. Normally, non-binary trees are used to construct an ontology. In a binary search tree, each node contains a key and points to two subtrees (left and right). A node of a non-binary tree contains a key and may point to more than two subtrees.
In this algorithm each node of the tree contains a key and three pointers: a child pointer, a parent pointer and a brother pointer. Only the first child is connected directly to the parent node; every other child of that parent is connected as the brother node of the previous child. This algorithm is used for inserting a new node in hierarchical order and for accessing all the nodes in a fixed sequence.
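The node layout described above (a key plus child, parent and brother pointers) corresponds to what is commonly called a left-child, right-sibling representation. The Python sketch below shows one way such a tree could be built and traversed; the class and function names are illustrative assumptions, and the concept names in the test are invented rather than taken from the ACM classification.

```python
class OntologyNode:
    """One concept in the ontology tree (hypothetical class name)."""

    def __init__(self, key):
        self.key = key
        self.parent = None   # node one level up
        self.child = None    # first child only
        self.brother = None  # next sibling under the same parent


def insert_child(parent, key):
    """Attach a new concept under parent, after any existing children."""
    node = OntologyNode(key)
    # Assumption: every child records its parent for convenience, although
    # only the first child is reachable through the parent's child pointer.
    node.parent = parent
    if parent.child is None:
        parent.child = node
    else:
        last = parent.child
        while last.brother is not None:
            last = last.brother
        last.brother = node
    return node


def preorder(node):
    """Yield keys in a fixed order: a node, its subtree, then its brothers."""
    while node is not None:
        yield node.key
        yield from preorder(node.child)
        node = node.brother
```

Because each node stores only a first-child and a brother pointer, a concept can have any number of sub-concepts without changing the node structure.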

Feature Selection Agent-POS Tagger
Part-of-speech (POS) tagging is the process of marking up the words in a text as corresponding to a particular part of speech, based on both the word's definition and its context, i.e., its relationship with adjacent and related words in a sentence. The POS tagger module contains a set of tags: Noun (N), Verb (V), Adjective (AJ), Adverb (AV), To (TO), Not (NOT), Conjunction (CJ), Preposition (PP), Determiner (DT) and Other (OTH). Every word of the input sentence is matched with one of the tags present in the tagset. We used the Brill tagger algorithm to assign a tag to each word. WordNet is used to find the POS tag of each word in the sentence. Delimiters are used to split a paragraph into sentences. The delimiters are the full stop (.), the exclamation mark (!) and the question mark (?).
The Brill tagger algorithm is a method for part-of-speech tagging. It was described by Eric Brill. It can be summarized as an "error-driven transformation-based tagger". It is error-driven in the sense that it resorts to supervised learning, and transformation-based in the sense that a tag is assigned to each word and then changed using a set of predefined rules. The output of the POS tagger is stored in an XML document.
The Java API for XML Processing (JAXP) is a Java interface that provides a standard approach to parsing XML documents. JAXP provides parsers for the DOM and SAX approaches to processing XML documents.
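The delimiter-based sentence splitting can be illustrated with a short Python sketch. It assumes that a delimiter ends a sentence only when it is followed by whitespace; this is a simplification introduced here, and abbreviations such as "e.g." would be split incorrectly.

```python
import re


def split_sentences(paragraph):
    """Split a paragraph after '.', '!' or '?' when followed by whitespace.

    Hypothetical helper; the delimiter stays attached to its sentence.
    """
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]
```

Keeping the delimiter with each sentence preserves whether a review sentence was a statement, an exclamation or a question.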

Brill Tagger Algorithm for POS Tagger
Generally, WordNet captures only basic POS tags such as noun, verb, adjective and adverb. If a word has more than one tag (such as "book" or "saw"), the Brill tagger algorithm is used to find the appropriate tag.

Algorithm:
Known words (present in WordNet):
    If (probability of the word is equal to one)
        Assign the tag associated with that form of the word
    Else (probability of the word is less than one)
        If (the word is a determiner)
            Assign the tag DT
        Else if (the word is a conjunction but not the first word in the sentence)
            Assign the tag CJ
        Else if (the word has more than one tag)
            Apply contextual rules to find the appropriate tag
        Else
            Assign the most frequent tag associated with that form of the word
Unknown words (not in WordNet):
    Assign the tag OTH
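The decision procedure above can be sketched in Python as follows. This is a toy illustration and not a full Brill tagger: the lexicon, its tag frequencies and the two contextual rules are invented for the example, whereas a real system would obtain candidate tags from WordNet and learn its transformation rules from a tagged corpus.

```python
# Hypothetical toy lexicon: word -> {tag: relative frequency}.
LEXICON = {
    "the": {"DT": 1.0},
    "that": {"DT": 0.6, "CJ": 0.4},
    "and": {"CJ": 1.0},
    "to": {"TO": 1.0},
    "is": {"V": 1.0},
    "clear": {"AJ": 1.0},
    "book": {"N": 0.8, "V": 0.2},
    "saw": {"V": 0.7, "N": 0.3},
}


def contextual_tag(entry, previous_tag):
    """Pick a tag for an ambiguous word using two illustrative rules."""
    if previous_tag == "DT" and "N" in entry:
        return "N"  # a determiner is usually followed by a noun
    if previous_tag == "TO" and "V" in entry:
        return "V"  # "to" is usually followed by a verb
    return max(entry, key=entry.get)  # fall back to the most frequent tag


def tag_sentence(words):
    """Assign one tag per word, following the branches of the pseudocode."""
    tags = []
    for i, word in enumerate(words):
        entry = LEXICON.get(word.lower())
        if entry is None:
            tags.append("OTH")  # unknown word
        elif len(entry) == 1:
            tags.append(next(iter(entry)))  # only one possible tag
        elif "DT" in entry:
            tags.append("DT")
        elif "CJ" in entry and i > 0:
            tags.append("CJ")
        else:
            tags.append(contextual_tag(entry, tags[-1] if tags else None))
    return tags
```

With this toy lexicon, "book" is tagged as a noun after "the" and as a verb after "to", which is the kind of disambiguation the contextual rules are meant to provide.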

Classifier
The classifier analyzes data and recognizes patterns, and is used for sentiment classification. It uses two datasets. The first dataset contains 250 positive words (such as "good", "fabulous", "recommended") and the second dataset contains 150 negative words (such as "bad", "not", "difficult").
Each positive and negative word is assigned a value in the range 0.25 to 0.75. The classifier identifies the positive and negative words in the book reviews. Equations (1) and (2) are used to calculate the cumulative positive and negative values of a book review.
[Equations (1) to (3) are missing from the source text; only the label "Positive value" survives.] Figure 2 shows the angle θ between the reference vector and the result vector. If θ is 0°, both the reference book and the reviewed book are in the positive category, so the book is considered in a positive sense. If θ is 90°, both the reference book and the reviewed book are in the negative category, so the book is considered in a negative sense. The result vector is calculated for all other books, and the angle between the reference vector and the result vector is calculated for each of them. All the values are then plotted in a graph. The book results are re-ranked based on the classifier result.

Pseudo Code for Classifier Module
Do {
    Read the XML file.
    Find the positive and negative words of the book review.
    Calculate the positive value and negative value of each book.
    Find the angle between the reference vector and the result vector of the book using the dot product.
    Plot the value in the graph.
} While (not all books have been read)
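A minimal Python sketch of this loop is given below, under stated assumptions. The word weights are invented values inside the 0.25 to 0.75 range; the result vector is assumed to be the pair (positive value, negative value) obtained by summing the weights of the matched words; the reference vector is assumed to be purely positive, (1, 0); and the angle is obtained from the standard dot-product relation cos θ = (r · v) / (|r| |v|). The paper's own equations are not available, so none of these choices should be read as the authors' exact formulation. Reading the XML file and plotting are omitted.

```python
import math

# Hypothetical weights in the 0.25-0.75 range; the real word lists are larger.
POSITIVE = {"good": 0.5, "fabulous": 0.75, "recommended": 0.5}
NEGATIVE = {"bad": 0.5, "difficult": 0.5, "not": 0.25}


def result_vector(words):
    """Return (positive value, negative value) as sums of matched weights."""
    pos = sum(POSITIVE.get(w.lower(), 0.0) for w in words)
    neg = sum(NEGATIVE.get(w.lower(), 0.0) for w in words)
    return (pos, neg)


def angle_degrees(reference, result):
    """Angle between two 2-D vectors via the dot product, or None if undefined."""
    dot = reference[0] * result[0] + reference[1] * result[1]
    norm = math.hypot(*reference) * math.hypot(*result)
    if norm == 0:
        return None  # a review with no sentiment words has no direction
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def rank_books(reviews, reference=(1.0, 0.0)):
    """Order book titles by increasing angle to the reference vector."""
    scored = []
    for title, words in reviews.items():
        theta = angle_degrees(reference, result_vector(words))
        # Assumption: books without sentiment words are ranked last.
        scored.append((theta if theta is not None else 90.0, title))
    return [title for _, title in sorted(scored)]
```

Under these assumptions a smaller angle means a review profile closer to the purely positive reference, so sorting by angle places the most favourably reviewed books first.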

Data Extraction Agent
The data extraction agent of the proposed system was used to retrieve the comments/opinions from the www.amazon.com website. Figure 3 shows the user interface of the data extraction agent. The user enters the URL (www.amazon.com) and the search word in the user interface.
Results are shown in the user interface. Book names and the corresponding book reviews are stored in XML documents: book names are stored in the initial.xml file and book reviews are stored in the bookname.xml file. Figure 4 shows the XML content of the opinions about book 4.

Recommendation Agent-Domain Ontology
The domain ontology is used to identify domain-related sentences. Clas.txt is the domain ontology file that contains the domain concepts. The output of the data extraction agent is given as input to the recommendation agent. Figure 5 shows the user interface of the domain ontology module. The user enters the ontology file name and the opinion file name in the user interface.
A tree is constructed and the domain-related sentences are stored in the domain.xml file. Figure 6 shows the XML content of the domain-related sentences of book 4.

POS Tagger
The POS tagger module marks up each word in a text with its corresponding part-of-speech tag, based on the Brill tagger algorithm. The output of the domain ontology module (i.e., domain.xml) is given as input to the POS tagger. Tagged sentences are stored in the pos.xml file. Figure 7 shows the XML content of the POS tagger module. The output of the POS tagger module (i.e., pos.xml) is given as input to the classifier module. The re-ranked book results are stored in finalresult.xml. The following XML content shows the re-ranked result. Table 1 shows the results before and after the sentiment process. The POS tagger, the domain ontology and the classifier are the main approaches involved in the sentiment process. The re-ranked results are produced by the classifier based on the positive and negative comments in the reviews.

True positive is the number of positive sentences that the system classified correctly. False negative is the number of positive sentences that the system classified incorrectly. True negative is the number of negative sentences that the system classified correctly. False positive is the number of negative sentences that the system classified incorrectly. Figure 8 illustrates the accuracy of the sentiment process. A line graph is drawn of the percentage of accuracy against the book opinions: the percentage of accuracy is plotted on the Y axis and the books on the X axis. The graph indicates that the performance of the classifier improves as the number of inputs increases. Figure 9 illustrates the precision of the sentiment process. A line graph is drawn of the percentage of precision against the book opinions: the percentage of precision is plotted on the Y axis and the books on the X axis.
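For reference, accuracy and precision can be computed from the four counts with their standard definitions, as in the Python sketch below. The paper does not print the formulas it used, so the standard forms are assumed here, and the counts in the usage note are made-up numbers rather than results from the study.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all sentences that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision(tp, fp):
    """Fraction of sentences predicted positive that are truly positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0
```

With illustrative counts of 40 true positives, 30 true negatives, 10 false positives and 20 false negatives, accuracy(40, 30, 10, 20) gives 0.7 and precision(40, 10) gives 0.8.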

CONCLUSION
Sentiment classification of reviews is an important objective and challenge in customer relationship management. The proposed system uses an online data source (www.amazon.com) for implementing the study. The system uses agents to classify user opinions. In this study, the agents retrieved book names and the corresponding recent reviews from specified blogs. The first agent is the data extraction agent, which retrieves comments about books, i.e., the user reviews, from the specified blogs. The second agent is the recommendation agent, which uses a domain ontology to identify domain-related features in the comments. The third agent is the feature selection agent, which splits the XML document content into single sentences. Each word in a sentence is mapped to the ontology. The mapping process identifies the domain-related sentences in that context. These processes are used to re-rank the book results based on customer reviews.
However, we used only 500 sentiment words to evaluate ontology-based sentiment classification. More sentiment words are needed to improve the classifier; achieving greater accuracy in this way will be our future work.
The proposed method can also be applied to other languages, for which a multilingual sentiment lexicon needs to be developed in the future. The proposed system used a single domain ontology to identify domain-related sentences. Further research could also use multi-domain ontologies for this purpose.