Sentiment Analysis: Comparative Study between GSVM and KNN

Abstract: Sentiment classification aims to detect the general opinion of social media users towards business products or daily life events. The classification tells whether a sentiment is positive or negative. Sentiment classification techniques are categorized into lexical analysis and machine learning techniques. In this paper, we present a comparative study between SVM applied genetics (GSVM) and the KNN algorithm in terms of speed and accuracy. We also present an experimental study of sentiment classification on different domains: movie reviews, financial posts and Amazon toys product reviews. The experimental results show that GSVM achieves a classification accuracy of 92% and KNN achieves 87% on the movie reviews dataset. For classification speed, KNN shows a remarkable improvement (above 10%) in comparison with GSVM.


Introduction
People have become used to expressing their sentiment on social media towards daily events in different areas of life, such as sports and politics. Sentiment analysis is defined as the process of detecting the sentiment or opinion in a user's statement about daily activities (Mohamed et al., 2017).
A sentiment is either positive, negative or neutral. A positive sentiment expresses a good feeling, for example towards a movie, while a negative sentiment expresses the opposite.
Table 1 shows a sample of positive and negative posts from the Cornell movie reviews dataset (Heather Whiting et al., 2017).
Different approaches to sentiment classification are presented in (Mohamed and Ezzat, 2015; Shivhare and Khethawat, 2012; Kalaivani and Shunmuganathan, 2014; Guo et al., 2003), and are categorized as based on either lexical analysis or Machine Learning (ML) techniques.
The main idea of lexical analysis is to detect affective words in posts based on a common lexicon such as WordNet or WordNet-Affect. The role of machine learning, in contrast, is to find the class label of an input text based on training data and a predictive model (Shivhare and Khethawat, 2012). This problem is called text classification, and it differs from classification in other domains because of its large number of features. SVM, Naive Bayes and KNN are popular techniques in sentiment classification, and they depend on the bag-of-words model to generate the unique words of an input text. Many of the features generated by this model are irrelevant, redundant or noisy, so a filtration mechanism has to be applied. SVM Applied Genetics (GSVM) (Mohamed et al., 2017) is an enhanced technique that filters the selected features to achieve better classification accuracy.
The main contribution of this paper is a comparison between GSVM and the K-nearest neighbor (KNN) algorithm in terms of speed and accuracy. The experiments are based on three different datasets: the movie review dataset (Zhang et al., 2011), the financial dataset (Mohamed et al., 2017) and the Amazon toys review dataset (He and McAuley, 2016).
The rest of this paper is organized as follows. Section II discusses popular approaches in the emotion detection area, from lexical to machine learning approaches. Section III presents the methodology implemented in the comparative study. Section IV shows the evaluation results. Finally, the conclusion and future work are given in Section V.

Background and Related Work
In sentiment analysis, there are specific research challenges. Text informality, language acronyms, language mixture, emotion icons and relevance are examples (Mohamed et al., 2017).

Tweet | Sentiment
"jaws" is a rare film that grabs your attention before it shows you a single image on screen. | Positive
One cannot observe a star trek movie and expect to see serious, science fiction. The purpose of star trek is to provide flashy, innocent fun | Positive
"snake eyes" is the most aggravating kind of movie: the kind that show so much potential then becomes unbelievably disappointing | Negative
whether you like the beatles or not, nobody wants to see the bee gee's take on some of the fab four's best known songs | Negative

Neviarouskaya et al. (2009) construct a domain-oriented sentiment lexicon by clustering sentiment words, extending the information-bottleneck clustering algorithm with additional constraints to build an appropriate knowledge context for every sentiment word. Opinion-Finder, WordNet-Affect, MPQA and SenticNet are popular lexical resources, other than SentiWordNet, that are widely used in sentiment analysis. Point-wise Mutual Information (PMI) (Kamble and Deshmukh, 2016) is a criterion commonly used in statistical language models of word association and their related applications. This method calculates the mutual information between two words to obtain a numeric score, as in Equation 1:

$$PMI(word_1, word_2) = \frac{prob(word_1, word_2)}{prob(word_1)\, prob(word_2)} \quad (1)$$

Here, each word is scored by the strength of its association with positive emotion, PMI(word1, positive word), or with negative emotion, PMI(word1, negative word). Finally, the Semantic Orientation (SO) of a word is calculated from these two scores, with PMI written in its conditional form as in Equation 2:

$$PMI(x, y) = \frac{P(x, y)}{P(x)\, P(y)} = \frac{P(y \mid x)}{P(y)} \quad (2)$$

Sentiment classification of movie reviews is proposed in (Neviarouskaya et al., 2009) by applying three machine learning techniques: SVM, Naive Bayes and a character-based N-gram model. The evaluation results show that the accuracy of all approaches is above 80% and that the SVM and N-gram approaches outperform the Naive Bayes technique.
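As a minimal sketch of the PMI score described above, the ratio of the joint probability to the product of the marginal probabilities can be computed as follows. It assumes that probabilities are estimated from document-level co-occurrence, which the source does not specify; the function name `pmi` and the toy posts are illustrative only. Taking the base-2 logarithm of this ratio gives the logarithmic form of PMI that is common in the literature.

```python
def pmi(word1, word2, docs):
    """Ratio-form PMI of two words, estimated from document co-occurrence.

    Assumption: a probability is the fraction of documents containing the word(s).
    """
    tokenised = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenised)
    p1 = sum(word1 in toks for toks in tokenised) / n_docs
    p2 = sum(word2 in toks for toks in tokenised) / n_docs
    p12 = sum(word1 in toks and word2 in toks for toks in tokenised) / n_docs
    if p1 == 0 or p2 == 0:
        return 0.0  # a word that never occurs has no measurable association
    return p12 / (p1 * p2)


# Toy corpus (hypothetical posts, not taken from the datasets in this paper).
posts = ["good movie", "good fun", "bad movie", "bad plot"]
fun_vs_good = pmi("fun", "good", posts)
fun_vs_bad = pmi("fun", "bad", posts)
```

In this toy corpus "fun" co-occurs only with "good", so its score against the positive seed word is higher than against the negative one.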
The K-Nearest Neighbor (KNN) algorithm is used in combination with TF-IDF for classifying sentiment in (Guo et al., 2003). Key advantages of KNN are its simplicity and execution speed. KNN is based on finding the most similar objects (documents) in a sample using mutual Euclidean distance (He and McAuley, 2016). The reported results indicate that KNN combined with TF-IDF is a good choice, given that the number of unusable words in documents has a significant impact on the final classification quality.
A new approach to sentiment analysis that combines lexicon-based and learning-based methods is proposed in (Zhang et al., 2011). The method first adopts a lexicon-based approach to perform entity-level sentiment analysis, which gives high precision but low recall. A classifier is then trained to assign polarities to the entities in newly identified tweets, which is shown to give a better F-score. A corpus of microblog posts (or "tweets") was collected from Twitter and annotated at the tweet level with seven emotions: anger, disgust, fear, joy, love, sadness and surprise. This research presents the framework of the EmpaTweet system for annotating and detecting emotion on Twitter. The system uses a series of binary SVM classifiers to detect each of the seven emotions annotated in the corpus. Each classifier operates independently on a single emotion.
Another novel method is introduced in (Li et al., 2016). It applies a pre-training method to deep neural networks based on restricted Boltzmann machines, aiming for competitive and stable classification of user emotions over short text. The results indicate that this method performs competitively in terms of accuracy and robustness.
An improved Naive Bayes algorithm is presented in (Kang et al., 2011) for sentiment analysis of restaurant reviews based on unigram and bigram features. The experiments showed an improvement of up to 10.2% in recall and of up to 26.2%.
A new method of SVM applying genetics is presented in (Mohamed et al., 2017). It is based on an important feature selection method, information gain, which removes irrelevant or redundant features. Information gain, which is calculated based on entropy, outperformed other feature selection methods (Preotiuc-Pietro et al., 2012; Neviarouskaya et al., 2009; Kamble and Deshmukh, 2016). Entropy is a common measure of impurity in the information retrieval area, where impurity refers to the class distribution within the dataset; high impurity leads to high classification accuracy. Entropy is calculated as in Equation 3:

$$Entropy = -\sum_{i} P_i \log_2 P_i \quad (3)$$

where P_i is the probability of class i; higher entropy leads to better accuracy and high information content. Information Gain (IG) is then calculated to check which features are the most important in our classification problem. IG is calculated, in its standard form, as in Equation 4:

$$IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v) \quad (4)$$

where S is the dataset, A is a feature and S_v is the subset of S in which A takes the value v. The experiments in (Mohamed et al., 2017) show an improved classification accuracy (89.9%) compared with the plain support vector machine technique (88.6%).
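The entropy and information gain calculations described above can be illustrated with a short sketch. It assumes the standard definitions (class entropy in bits, and information gain as the entropy reduction after splitting on whether a word is present in a post); the function names and the toy labels are hypothetical and are not taken from the GSVM implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_present, labels):
    """Entropy reduction obtained by splitting on a binary feature.

    feature_present[i] is True when the word occurs in post i.
    """
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in zip(feature_present, labels) if f == value]
        if subset:  # skip empty branches of the split
            gain -= (len(subset) / total) * entropy(subset)
    return gain


# Toy example: "great" separates the classes perfectly, "movie" not at all.
labels = ["pos", "pos", "neg", "neg"]
ig_great = information_gain([True, True, False, False], labels)
ig_movie = information_gain([True, False, True, False], labels)
```

A feature with a gain close to zero, like "movie" here, is a candidate for removal during feature filtration.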

Methodology

Preprocessing Steps
The sentiment analysis process starts with tokenization, which splits the tweet text into a sequence of words. To assure accuracy, all characters are converted to lowercase. A stemming phase is then applied to remove morphological affixes from the words generated in the previous step; Lancaster stemming is the algorithm used in our research. A filtering step then removes tokens such as stop words from the word list. We use the Python NLTK (Natural Language Toolkit) for text processing.
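The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: a regular expression stands in for the tokenizer, the stop-word list is a small hand-written sample, and the code falls back to leaving words unstemmed when NLTK is not installed. Only `LancasterStemmer` comes from NLTK; every other name is hypothetical.

```python
import re

try:
    # Lancaster stemming, as named in the text (needs no extra NLTK data).
    from nltk.stem import LancasterStemmer
    _stem = LancasterStemmer().stem
except ImportError:
    def _stem(word):
        # Fallback (assumption): leave the word unchanged without NLTK.
        return word

# Small illustrative stop-word list; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "to", "and", "it"}


def preprocess(text):
    """Lowercase, tokenize, stem, then drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stemmed = [(tok, _stem(tok)) for tok in tokens]
    # Stop words are matched on the surface form so stemming cannot hide them.
    return [stem for tok, stem in stemmed if tok not in STOP_WORDS]


tokens = preprocess("The movie was GREAT")
```

Matching stop words on the unstemmed token is a design choice of this sketch; it keeps the stop-word list readable while preserving the order of steps described in the text.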

KNN
The KNN algorithm is based on finding the most similar objects (tweets) from sample groups using mutual Euclidean distance (Zhang et al., 2011). The process starts with the preparation of a weight matrix that evaluates the importance of words in the given dataset based on term frequency-inverse document frequency (TF-IDF). Assume an N*M matrix, where N is the number of unique words generated by the preprocessing stage and M is the number of collected posts. The matrix is thus constructed as a relational matrix between each word and each tweet. Equation 5 is used to calculate the weight of word i in tweet j:

$$a_{ij} = tf_{ij} \cdot \log\left(\frac{M}{df_i}\right) \quad (5)$$

Where:
a_ij = The weight of term i in tweet j
M = The number of tweets in the dataset
tf_ij = The term frequency of term i in tweet j
df_i = The tweet frequency of term i in the dataset

Equation 6 determines the vector distance between any two tweets x and y:

$$d(x, y) = \sqrt{\sum_{r=1}^{N} (a_{rx} - a_{ry})^2} \quad (6)$$
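A minimal sketch of the KNN procedure described in this section is given below: TF-IDF weights, Euclidean distance, and a majority vote among the k nearest training tweets. It assumes whitespace tokenization, a natural logarithm in the weight, and that the query tweet is vectorized together with the training tweets; these are simplifications for illustration, and all function names and tweets are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tweets):
    """One TF-IDF vector per tweet: tf * log(n_tweets / tweet_frequency)."""
    tokenised = [t.lower().split() for t in tweets]
    vocab = sorted({w for toks in tokenised for w in toks})
    n_tweets = len(tokenised)
    df = Counter(w for toks in tokenised for w in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append([tf[w] * math.log(n_tweets / df[w]) for w in vocab])
    return vectors


def euclidean(x, y):
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_vectors, train_labels, query, k):
    """Majority label among the k training vectors closest to the query."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: euclidean(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train = ["great fun film", "great acting", "boring plot", "boring film waste"]
labels = ["pos", "pos", "neg", "neg"]
vectors = tfidf_vectors(train + ["great fun"])  # last vector is the query
prediction = knn_predict(vectors[:-1], labels, vectors[-1], k=3)
```

With k=3 the two positive tweets sharing "great" outvote the nearest negative tweet, so the query is labelled positive.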

GSVM
The SVM classification algorithm uses a set of training instances and predicts new instances with two possible class labels, -1 and 1 (Zaguai and Beizak, 2015). The process starts by applying the bag-of-words model to capture the frequency, or number of occurrences, of each word in a tweet. The main problem here is that bag-of-words generates hundreds or thousands of features in the input space, which is not an efficient way of vectorizing features: many of the generated features are irrelevant, redundant or noisy (Buck et al., 2014). Based on a genetic algorithm, a feature selection method is applied to the features generated from the text to select the best chromosome of features with high information gain. Algorithm 2 shows the implementation of GSVM; it starts with an initialized chromosome of generated features. The objective function is to maximize the F1 score of the best generated chromosome.
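The feature selection loop described above can be sketched as a mutation-only search over feature masks. This is an illustration under loud assumptions, not the GSVM implementation: the chromosome is a 0/1 mask, a mutation swaps one selected feature for an unselected one with higher information gain, and the fitness function is passed in by the caller. In GSVM the fitness would be the F1 score of an SVM trained on the selected features; here a toy fitness (the sum of the selected gains) stands in for it, and all names are hypothetical.

```python
import random


def mutate(chromosome, gains, rng):
    """Swap one selected feature for an unselected one with higher gain."""
    selected = [i for i, bit in enumerate(chromosome) if bit]
    unselected = [i for i, bit in enumerate(chromosome) if not bit]
    child = chromosome[:]
    if not selected or not unselected:
        return child
    out_idx = rng.choice(selected)
    better = [i for i in unselected if gains[i] > gains[out_idx]]
    if better:
        in_idx = rng.choice(better)
        child[out_idx], child[in_idx] = 0, 1
    return child


def select_features(gains, n_selected, fitness, rounds=200, seed=0):
    """Keep the best chromosome found over a fixed number of mutation rounds."""
    rng = random.Random(seed)
    best = [0] * len(gains)
    for i in rng.sample(range(len(gains)), n_selected):
        best[i] = 1
    best_score = fitness(best)
    for _ in range(rounds):
        child = mutate(best, gains, rng)
        score = fitness(child)
        if score > best_score:  # keep the child only if fitness improves
            best, best_score = child, score
    return best, best_score


# Toy setup: information gain per feature, and a stand-in fitness function.
gains = [0.9, 0.1, 0.8, 0.05, 0.4, 0.02]


def toy_fitness(mask):
    return sum(g for g, bit in zip(gains, mask) if bit)


best_mask, best_score = select_features(gains, 3, toy_fitness)
```

A full genetic algorithm would also maintain a population and apply crossover; this sketch keeps a single chromosome to show only the mutate-and-evaluate step.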

Experiment Setup
In this research, we use the Cornell movie review dataset (Li et al., 2016) collected from Twitter; it contains 1000 positive reviews and 1000 negative reviews. The rating classifier determines whether a review is positive or negative from the rating specified by the user, which is a numerical rating (ranging from 1 to 10). A sample of the published posts is shown in Table 1.
Another case study in this research is the Amazon toys reviews dataset (He and McAuley, 2016). This dataset contains product reviews and metadata from Amazon, including 167,597 reviews spanning up to July 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand and image features) and links (also viewed/also bought graphs). The dataset is represented in JSON format; Fig. 1 shows a sample of these reviews.
The financial dataset is the third case study in this research. A sample of posts published on 03-Mar-2017 for the EURUSD currency pair is shown in Fig. 2; the data is represented in CSV format. In these posts, investors express their opinions and expectations for EURUSD after a critical FED event held in the USA on the subject of interest rates. Each record contains the following fields: author name, number of followers, tags, time, post and sentiment, as shown in Fig. 2.
The key software and hardware specifications of our server that we think may affect the performance are shown in Table 2.

Evaluation Criteria
We consider accuracy, precision, recall, F-Measure and classification speed as evaluation criteria.
Accuracy: It is a powerful factor for evaluating an ML technique. It is calculated as the ratio of correctly predicted sentiments to all movie reviews, as per Equation 7:

$$Accuracy = \frac{\text{number of correctly predicted reviews}}{\text{total number of reviews}} \quad (7)$$
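The evaluation criteria listed above can be computed with a short sketch such as the following. It assumes the standard definitions of precision, recall and F-measure for the positive class, and measures classification speed as CPU time with `time.process_time`; the function names and the toy labels are illustrative only.

```python
import time


def evaluate(actual, predicted, positive="pos"):
    """Accuracy, precision, recall and F-measure for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def cpu_time(func, *args):
    """Run func(*args) and return (result, CPU seconds used)."""
    start = time.process_time()
    result = func(*args)
    return result, time.process_time() - start


actual = ["pos", "pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg"]
report, seconds = cpu_time(evaluate, actual, predicted)
```

In practice `cpu_time` would wrap the classifier's prediction call rather than the metric computation, so that the measured time reflects classification speed.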

Results
In this experiment, we select the k values with the highest performance. Tables 3, 4 and 5 show the experimental results of both the GSVM and KNN classifiers on the three datasets. On the movie reviews, KNN achieves its best result when k=30, while on the Amazon toys reviews it achieves its best result when k=20. On the financial dataset, KNN also achieves its best result when k=20. As shown in all results, GSVM achieves better accuracy than the KNN classifier for every k value in all our case studies. On the classification speed side, the results in Table 6 show that classification based on KNN is faster than GSVM. The reason is that GSVM uses multiple rounds of feature filtration of the input vector space.
In general, from the above results, the choice between the two techniques depends on the business domain and the real case study. GSVM is the better option if classification accuracy is the priority. On the other hand, if classification speed is the priority, as in real-time settings, KNN is the better option.

Conclusion and Future Work
The scope of this research is to present a comparative study between GSVM and KNN. We use a movie review dataset from Twitter as input for this experimental study. From the classification accuracy perspective, the results show that the GSVM approach outperforms the KNN technique, while KNN takes less time to perform classification. A future work direction is to implement parallel processing of GSVM to speed up the calculation. Moreover, more comparative studies are needed with the Nonstationary LDA (NSLDA) classification rule, which is based on the Kalman Smoother algorithm.
Early works in sentiment analysis depended on lexical resources. In Preotiuc-Pietro et al. (2012), the SentiWordNet lexicon was applied by counting the positive and negative terms found in a review, and the sentiment polarity was determined based on which class received the highest score.

Algorithm 1: KNN implementation - Euclidean distance
Procedure KNNAlgorithm (K)
> Initialize
  - T <- number of tweets
  - N <- number of unique words
> Steps
  - For each tweet in training dataset (i in T)
    o For each word in the list of unique words [...]
  - Return k tweets that have the least distance d[i]

Algorithm 2: GSVM implementation [...]
  - While r < N do
    o P <- Mutate (P); reconstruct population by replacing one or more features by other ones that have high information gain.
    o F(r) = ComputeFitness (P)
  - End while
> Return
  - Return best P
End procedure

Where (Equation 6):
d(x, y) = The distance between any two tweets
N = The number of unique words in the given dataset
a_rx = Weight of term r in tweet x
a_ry = Weight of term r in tweet y

Algorithm 1 shows an implementation of KNN.

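Fig. 1: Sample of Amazon review of toys product

Precision: It is calculated as the ratio of correctly predicted positive sentiment to all movie reviews predicted as positive, as per Equation 8:

$$Precision = \frac{TP}{TP + FP} \quad (8)$$

Recall: It is calculated as the ratio of correctly predicted positive sentiment to all sentiment in the actual positive class, as per Equation 9:

$$Recall = \frac{TP}{TP + FN} \quad (9)$$

F-Measure: It is calculated as the weighted average of precision and recall, as per Equation 10:

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \quad (10)$$

Here TP is the number of correctly predicted positive reviews, FP is the number of negative reviews predicted as positive and FN is the number of positive reviews predicted as negative.

Classification speed: It indicates the time required for sentiment classification on a given dataset; the classification speed is measured by CPU time.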

Table 1: Sample of positive and negative tweets from movie review dataset

Table 3: Comparative results between GSVM and KNN on movie reviews

Table 4: Comparative results between GSVM and KNN on Amazon toys reviews

Table 5: Comparative results between GSVM and KNN on financial dataset