Sentiment Analysis of Arabic Tweets in e-Learning

Abstract: In this study, we present the design and implementation of Arabic text classification of university students’ opinions using different algorithms such as Support Vector Machine (SVM) and Naive Bayes (NB). The aim of the study is to develop a framework to analyse Twitter “tweets” as having negative, positive or neutral sentiments in education or, in other words, to illustrate the relationship between the sentiments conveyed in Arabic tweets and the students’ learning experiences at universities. Two experiments were carried out, one using negative and positive classes only and the other one with a neutral class. The results show that, for Arabic sentiment classification, SVM with an n-gram feature achieved higher accuracy than NB, both when using negative and positive classes only and when the neutral class was included.


Introduction
The number of Arabic-speaking users on microblogging platforms such as Twitter has been rapidly increasing (Mourtada and Salem, 2014). In March 2014, the total number of active Arabic Twitter users reached 5,797,500 and the estimated number of tweets created was 533,165,900, with an average of 17,198,900 tweets per day (Mourtada and Salem, 2014). With the rapid growth of social media use in the Arab world has come a growth in Arabic reviews, comments, ratings, feedback and opinions.
Today, students use social media sites on a daily basis to express their opinions and describe their activities. Universities can therefore utilise social media in educational practice, both to improve their teaching processes and to analyse opinions and learn about students' experiences. This is especially relevant for students who mostly depend on the web, such as online distance education students. Therefore, this paper aims to understand distance education students' learning experiences by analysing students' opinions from the Twitter account of the Deanship of e-Learning and Distance Education (DELDE) at King Abdulaziz University in Saudi Arabia. Opinion mining and sentiment analysis identify how sentiments are expressed in texts as negative, positive or neutral toward the subject (Nasukawa and Yi, 2003). Many studies (Pak and Paroubek, 2010; Agarwal et al., 2011; Deng et al., 2014) have been carried out to obtain users' opinions in different fields, such as e-commerce (Hu and Liu, 2004), the stock market (Bollen et al., 2011; Martin, 2013; Zhang and Skiena, 2010) and politics (Tumasjan et al., 2010; O'Connor et al., 2010).
The rest of the paper is organised as follows: Section 2 covers related work. Section 3 presents the process of obtaining the Arabic sentiments from Twitter. Section 4 covers the classification methods used. The experiment and the results of the study are described in section 5. The final section covers the conclusion and plans for future work.

Related Work
Tian et al. (2009) developed an e-learner's affective category model based on a questionnaire. They first extract an emotion feature from the corpus on the Internet. They then compare emotion words with one another to measure the intensity of sentiment in each category. This is a comprehensive decision-tree approach that enables one to achieve positive results in dealing with classification challenges common to the analysis of interactive texts (Chinese, for example), which are characterised by the richness of emotions and short abbreviations. Wang et al. (2010) described a monitoring and analysis system of Internet education for public sentiment. They worked on the basic function of public sentiment in education over the Internet. Wang (2010) established the Student Feedback Mining System (SFMS) on the basis of text analytics and an opinion-mining approach. In their studies, they performed an in-depth analysis of the qualitative student feedback and provided insight into teaching practices. As a result, they significantly improved student learning. Donovan et al. (2006) found that at least 54% of online student comments were more detailed and informative than typical written feedback on papers. That emphasises the importance of paying more attention to online comments. However, the downside is that the analysis of online comments is more time-consuming and requires an automated feedback system to collect and analyse online comments. It also requires performing a visual analysis of the opinions-sentiments classification.
In contrast to traditional text classification, which assigns a document to a pre-defined category, sentiment classification identifies the attitudes and opinions of users by mining and analysing their interests and other personal information (Fu and Wang, 2010).

Sentiment Analysis
Sentiment Analysis (SA) is a form of natural language processing that utilises computational linguistics and data mining to identify text sentiments as negative, positive or neutral. Text mining has been used to detect patterns in text. The major advantage of the text-mining technique is that it is capable of analysing unstructured data (Gharehchopogh and Khalifelu, 2011). However, Duncan and Zhang (2015) highlighted that traditional sentiment analysis is fundamentally different from Twitter sentiment analysis and that this difference should be considered. Because of the 140-character limit in tweets, there is a high level of misspellings and colloquial language on Twitter in comparison to any other source. As a result, their application of neural networks achieved only 74.15% accuracy. Pang and Lee (2008) were among the few researchers who attempted to use the sentiment analysis method for opinion-oriented information. Thus, there is a scarcity of studies on applying this method to blogs in general and even less research specifically in the area of microblogging platforms such as Twitter.

Sentiment Analysis Approach
Most studies on sentiment analysis are focused on text written in English using the sentiment lexicons technique. However, applying the lexicons technique to other languages will cause a domain adaptation problem (Cambria et al., 2013). Also, there is no lexicon for Arabic sentiment words and thus the machine-learning or corpus-based technique is usually used. Both techniques are supervised methods, where a set of data is first categorised as negative, positive or neutral and is represented by feature vectors. Afterwards, the classifier uses those vectors as training data to identify similar features and to group the data into a certain class (Shoukry and Rafea, 2012).

The Uniqueness of the Arabic Language
This paper discusses the structural complexity of the Arabic language and, as a result, the challenges associated with Arabic text analysis. These challenges are also coupled with the recent spread of social media and the use of colloquial Arabic language on social platforms. This work highlights the lack of research in the field and the limitations of the previous studies. As an alternative, it also proposes a different method for text analysis, called sentiment analysis, which is further discussed in detail in the below sections.
In comparison to other language families, including German, Hindi and Chinese, the structure of the Arabic language is fundamentally different. First, the script is written from right to left, while the form of a letter depends on its position in a word. Second, gender plays a key role in determining the affixes of words. Third, capital letters do not exist in the Arabic language and there are grammatical rules that aim to detect entities, acronyms and abbreviations (Fu and Wang, 2010). All these complexities result in difficulties in Arabic text analysis, which are also coupled with the lack of research in that field. A few authors have attempted to analyse Arabic texts by applying different methods. However, their approaches have a number of limitations. For instance, Arabic web documents from the Al Jazeera website were classified into five main categories using Naïve Bayes (Gharehchopogh and Khalifelu, 2011; El Kourdi et al., 2004). Their dataset contained 300 Arabic articles. The average accuracy in those identified categories did not exceed 68.78%. El-Halees applied a different method for extracting opinions from Arabic texts (Duncan and Zhang, 2015; El-Halees, 2007). That method encompassed three stages. First, the documents were classified according to a lexicon; second, those documents were utilised to train the maximum entropy model; and finally, the rest of them were categorised by the k-Nearest Neighbour. This approach achieved a higher accuracy of 80.40%. Ibrahim et al. (2015) investigated the application of Modern Standard Arabic (MSA) and colloquial Arabic to Egyptian dialects. Even though they emphasised that it is a Natural Language Processing (NLP) approach that works relatively well for other languages, the main limitation of this approach is that it is not applicable to the Arabic language.
The main reason for that is the morphological richness of Arabic as a language (Pang and Lee, 2008; Cambria et al., 2013; Duwairi et al., 2014; Farghaly and Shaalan, 2009; Farra et al., 2010). In fact, the Arabic informal (colloquial) language lacks structure and is difficult to standardise. There is limited research on how to analyse the Arabic colloquial language. This limitation is also coupled with the complexity of the social media phenomenon, which tends to include a large number of words with ambiguous meanings. Applying the opinion-mining technique to sentiment analysis in social media is still under development, especially in the handling of spelling mistakes, emoticons and other special characters. To deal with the discussed complexities of the Arabic language, this study applies data mining, using the sentiment analysis technique on Modern Standard Arabic together with the colloquial Saudi Arabian dialects. The main complexity of this research design is in the analysis of the isolated semantics common to the structure of the Arabic language. To deal with this complexity and to facilitate the process of text mining, the ontology of the incipient keyword-processing model will be constructed. The suggested method will enable the extraction of unobserved information, applying the results for higher-order n-grams. An n-gram is a subsequence of N items from a given text. Higher-order n-grams are more accurate in detecting the context, as they give a better understanding of the word location. The strategy is to use tokens such as trigrams (N = 3) in the feature space rather than just unigrams (N = 1) (Tripathi and Naganna, 2015). This model demonstrates the importance of a neutral class in sentiment analysis in the case of the Arabic language. A neutral class with generalisation of the proposed classification methods will lead to a higher classification accuracy (Koppel and Schler, 2006; Hamed et al., 2016).
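To make the n-gram definition concrete, the following is a minimal sketch of word-level n-gram extraction. The function names `word_ngrams` and `ngram_features` are hypothetical, the English sentence is an illustrative placeholder rather than data from the study, and whitespace tokenisation is an assumption made for brevity.

```python
def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=3):
    """Collect all n-grams for n = 1..max_n as space-joined strings."""
    tokens = text.split()  # assumption: simple whitespace tokenisation
    features = []
    for n in range(1, max_n + 1):
        features.extend(" ".join(gram) for gram in word_ngrams(tokens, n))
    return features


# Illustrative sentence (not taken from the study's dataset).
tokens = "the lecture was not useful".split()
unigrams = word_ngrams(tokens, 1)
trigrams = word_ngrams(tokens, 3)
```

In this toy sentence the trigram ("was", "not", "useful") keeps the negation next to the word it modifies, which is the kind of context a unigram feature space loses.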
Figure 1 presents the process of obtaining the Arabic sentiments from Twitter.

Data Collection
To collect the Arabic tweets, an application was developed to download tweets from Twitter. Two thousand tweets by different students on the DELDE Twitter account (K.A. University, 2016) were downloaded by this application. The application was developed in C# and Twitter's official developer API was utilised to download tweets.
The tweets were stored in a database. The tweets could then be tagged in the grid view based on the text. Each tweet could be manually marked as negative (-1), positive (1) or neutral (0). Marked tweets were updated in the database and could then be exported to text files, Excel sheets and .csv files.
The REST APIs were used to read and write Twitter data, for instance, GET statuses/mentions_timeline and GET statuses/user_timeline.

Labelling Process
The labelling process started with all the Arabic tweets saved in the main dataset. Then, all the saved tweets were filtered. The filtering process was performed by applying the following criteria:
• Tweets should not contain hashtags
• Tweets should not contain links/URLs
• Duplicate tweets are eliminated
• Tweets should not contain special characters
Then, all filtered tweets were saved in a dataset called the 'ToBeMarked' dataset. The user selected the sentiment of the tweet to be rated and then saved it in the database. After the tweet was saved, it was moved back to the 'ToBeMarked' dataset. The final dataset holds only tweets that are certain to be marked correctly. A tweet is considered correctly marked once the label assigned by the owner matches the label assigned by the supervisor. If the two labels do not match, a third person (the manager) is brought in and all three decide the final outcome. If these conditions are met, the tweet moves on to the final stage and is then used in the subsequent phases. The supervisor is one of the authors and the owner is the employee who deals with the Twitter account. The manager is the vice dean of the Deanship of e-Learning and Distance Education, who is responsible for the Twitter account and also has experience dealing with students' problems. Table 1 displays the number of collected tweets (2000 Arabic tweets); 1121 tweets were marked for the training dataset and a total of 879 unrelated tweets were deleted.
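The filtering criteria and the agreement rule can be sketched as follows. This is an illustration and not the authors' C# application: the function names, the URL pattern and the `SPECIAL_CHARS` set are assumptions, since the paper does not say which characters count as special. The `decide_jointly` callback is a hypothetical stand-in for the three-person decision.

```python
import re

URL_RE = re.compile(r"(https?://|www\.)\S+")
# Assumed set of "special characters"; the paper does not enumerate them.
SPECIAL_CHARS = set("@$%^&*~|<>{}[]\\")


def passes_filter(text):
    """Apply the hashtag, URL and special-character criteria."""
    if "#" in text:
        return False
    if URL_RE.search(text):
        return False
    if any(ch in SPECIAL_CHARS for ch in text):
        return False
    return True


def filter_tweets(tweets):
    """Drop duplicates and tweets that fail the criteria, keeping order."""
    seen, kept = set(), []
    for text in tweets:
        key = text.strip()
        if key in seen or not passes_filter(key):
            continue
        seen.add(key)
        kept.append(key)
    return kept


def final_label(owner, supervisor, decide_jointly=None):
    """Return the agreed label (-1, 0 or 1).

    If owner and supervisor disagree, defer to `decide_jointly`;
    return None when no joint decision is available.
    """
    if owner == supervisor:
        return owner
    if decide_jointly is None:
        return None
    return decide_jointly(owner, supervisor)


# Placeholder tweets for illustration only.
sample = ["good course", "good course", "see http://x.y",
          "#exam tomorrow", "price $5", "hard exam"]
kept = filter_tweets(sample)
```

Returning None for unresolved disagreements keeps uncertain tweets out of the final dataset, which matches the stated goal of keeping only tweets that are certain to be marked correctly.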

Data Pre-Processing
Many studies in sentiment analysis note that pre-processing consists of several steps: online text cleaning, stop-word removal, white-space removal, stemming, abbreviation expansion, negation handling and feature selection (Gokulakrishnan et al., 2012; Haddi et al., 2013).

Upper Case and Lower Case
The goal is to make the cases consistent for the classifier and to map tokens to the corresponding feature irrespective of casing.

URL Extracting
URLs are used to share more explanation than what can be given in a short post. Some studies replace URLs with an equivalence class <URL> in order to reduce feature size, while others simply remove URLs from the text.

Detection of Pointers (Usernames and Hashtags)
In Twitter, messages can point to another user by placing the @ token in front of the username. Moreover, a hashtag is used to categorise the post. Replacing usernames and hashtags with the placeholder symbols <user> and <hashtag> will reduce the feature size.

Identification of Punctuation
In microblogging, punctuation is often used informally, without regard for proper grammar, to communicate emotions more easily. Removing punctuation from the text reduces the number of redundant features in the training set.

Removal of Stop Words
It is common practice in the classification process to remove words that carry little information, such as "a", "an" and "the".

Compression of Words
Twitter users often express strong emotions in an informal way, for example, writing "happyyyyyyy" and "coooooool". Reducing such repeated characters to three preserves the distinction between the regular usage and the emphasised usage of these words. Rapidminer (2016) was used to pre-process the collected Arabic tweets and then replace any Arabic words that use different styles. For example, Table 2 shows the replacement of some words and icons that use different styles in the pre-process phase.
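The normalisation steps described in this section (case folding, URL and pointer replacement, punctuation removal, stop-word removal and compression of elongated words) can be sketched in a few lines. This is an illustration and not the Rapidminer process used in the study. The regular expressions, the placeholder spellings, the three-word stop list and the `normalise` function are assumptions, and the sample tweet is in English for readability.

```python
import re
import string

URL_RE = re.compile(r"(https?://|www\.)\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# A word character repeated three or more times in a row.
ELONGATION_RE = re.compile(r"(\w)\1{2,}")
# ASCII punctuation plus the Arabic comma, semicolon and question mark.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "،؛؟")
PLACEHOLDERS = {"<URL>", "<user>", "<hashtag>"}
STOP_WORDS = {"a", "an", "the"}  # toy list; a real run would use Arabic stop words


def normalise(text):
    """Return the list of cleaned tokens for one tweet."""
    text = text.lower()                        # consistent casing
    text = URL_RE.sub(" <URL> ", text)         # URLs -> equivalence class
    text = USER_RE.sub(" <user> ", text)       # @username -> placeholder
    text = HASHTAG_RE.sub(" <hashtag> ", text)
    text = ELONGATION_RE.sub(r"\1\1\1", text)  # keep at most three repeats
    tokens = []
    for tok in text.split():
        if tok not in PLACEHOLDERS:
            tok = tok.translate(PUNCT_TABLE)   # strip punctuation
        if tok and tok not in STOP_WORDS:
            tokens.append(tok)
    return tokens


tokens = normalise("@Ali The course is coooooool!!! http://t.co/abc #exam")
```

Placeholders are substituted before punctuation is stripped and are then exempted from stripping, so the angle brackets survive; the order of the steps matters for that reason.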
The following steps were used to pre-process tweets using Rapidminer:
• Tokenization: Splits the text of a tweet into a sequence of tokens
• Stop-word process: Filters Arabic stop words
• Light stem: Stems Arabic tweets by removing the suffixes and prefixes
• Filter token by length: Filters tokens based on their number of characters
Then, the 'Process Documents from Data' operator generates a word vector from the dataset using the TF-IDF method, in which the weight of a particular word is the product of its TF and IDF values. This scheme is often utilised in information retrieval and text mining (Hamed et al., 2016; Tripathi and Naganna, 2015). NB and SVM were used to build the classification models used to classify tweets as negative, positive and neutral. Finally, precision and recall were used to evaluate the classification results.
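As a minimal sketch of the TF-IDF weighting mentioned above, the code below multiplies a term frequency by an inverse document frequency. The exact variant is an assumption: TF is taken as the term count divided by document length and IDF as the natural log of N/df, which may differ from the normalisation Rapidminer applies. The function name `tf_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Map each tokenised document to a {term: weight} dictionary.

    Assumed variant: tf = count / len(doc), idf = ln(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy tokenised documents for illustration only.
docs = [["good", "course"], ["bad", "course"], ["good", "good", "teacher"]]
vectors = tf_idf(docs)
```

Under this variant a term that occurs in every document gets weight zero, which is why frequent but uninformative words contribute little to the word vector.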

Classification Methods
The NB model has many advantages: It treats each feature independently, it is relatively fast to compute, it is easy to construct with no need for any complicated iterative parameter estimation schemes and it has less over-fitting compared to other models. On the other hand, the strong assumption of conditional independence between features reduces the potential of NB (Alsaedi et al., 2014). In addition, SVM has been applied successfully in many sentiment analysis tasks. SVM has outperformed other machine-learning techniques because of its primary advantages. For instance, most text categorisation problems are linearly separable, and SVM is robust in high-dimensional spaces and powerful when there is a sparse set of samples and any feature may be relevant (Rushdi-Saleh et al., 2011). The fundamental idea of SVM is to find a hyperplane, represented by a vector, that not only separates the document vectors of one class from those of the other, but does so with the largest possible margin.
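To illustrate why NB is easy to construct, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing. It is not the Rapidminer operator used in the study; the class name, the smoothing choice and the four toy training documents are assumptions. An SVM is not sketched here because it requires an iterative optimiser.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for w in tokens:
                if w in self.vocab:       # ignore unseen words
                    # Each word contributes independently: the NB assumption.
                    score += math.log((self.word_counts[label][w] + 1)
                                      / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data for illustration only (1 = positive, -1 = negative).
train_docs = [["good", "course"], ["great", "teacher"],
              ["bad", "exam"], ["bad", "course"]]
train_labels = [1, 1, -1, -1]
model = NaiveBayes().fit(train_docs, train_labels)
```

Training amounts to counting words per class, with no iterative parameter estimation, which is the property the paragraph above describes.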

Evaluation
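The precision and recall metrics were used to evaluate the classification results (Khan et al., 2014). For a given class, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN are the numbers of true positives, false positives and false negatives for that class.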
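As an illustration, per-class precision and recall can be computed directly from the true and predicted labels using the standard definitions (true positives over all predicted positives, and true positives over all actual positives). The function name `precision_recall` and the label lists are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(y_true, y_pred, target):
    """Return (precision, recall) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: -1 negative, 0 neutral, 1 positive.
y_true = [1, 1, -1, 0, -1, 1]
y_pred = [1, -1, -1, 0, 1, 1]
positive = precision_recall(y_true, y_pred, 1)
negative = precision_recall(y_true, y_pred, -1)
neutral = precision_recall(y_true, y_pred, 0)
```

Computing the pair once per class gives the per-class breakdown needed when a neutral class is included alongside the negative and positive classes.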

Experiment
This experiment explores Arabic text classification in e-learning as an education pattern with different algorithms such as NB and SVM. It was necessary to pre-process the MSA Arabic language together with the dialects and analyse students' feedback (comments) to test the performance of the sentiment classification method, which can help improve the e-learning system process. NB and SVM were the classifiers used to explore the polarity of the 2000 tweets from the DELDE account. Table 3 shows the classification results using NB and SVM. Table 4 shows the classification results using NB and SVM with an n-gram equal to three.
Receiver Operating Characteristics (ROC) graphs are a useful technique to evaluate and visualise the predictive power of a binary classifier. ROC graphs show the trade-off between the True Positive Rate (TPR), which is plotted on the y-axis, and the False Positive Rate (FPR), which is plotted on the x-axis (Fawcett, 2004). The best point in the ROC space is located in the upper left corner (0,1).
The conclusion of the first experiment is that the best precision was achieved by SVM without an n-gram feature. On the other hand, the worst precision was obtained by NB when the n-gram feature was involved. In addition, researchers tend to ignore the neutral class under the hypothesis that less sentiment can be learned from neutral texts than from positive or negative ones. At the sentiment level, neutral usually means no opinion (Liu, 2012). Koppel and Schler (2006) showed that it is important to use neutral examples in learning polarity. Table 5 shows the classification results using NB and SVM with all classes.
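The ROC points discussed in this section (TPR against FPR) can be traced by sorting the examples by classifier confidence and lowering the threshold one example at a time. The sketch below is a simplified illustration: the function name `roc_points` is hypothetical, the scores and labels are toy values, and it assumes distinct scores with at least one positive and one negative example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the decision threshold is lowered.

    labels: 1 for the positive class, 0 for the negative class.
    Assumes distinct scores and at least one example of each class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy confidences for illustration only.
points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
```

With these toy scores every positive example is ranked above every negative one, so the curve passes through the upper left corner (0, 1).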
Lastly, to show the difference in the accuracy and the class precision and recall, the n-gram feature was added to this experiment. Table 6 shows the classification results using NB and SVM with n-gram for all classes.
A lift chart is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model (Jaffery and Liu, 2009). The lift chart, a discrete representation used to visualise classifier performance, is easier to read and interpret than the ROC chart. The highest confidence values are shown first and, as can be seen, the confidence values decrease at some point. For example, Fig. 6 shows the lift charts for the neutral class with the NB classifier.
In conclusion, the first run shows that the NB performance drops when the n-gram feature is used. On the other hand, there was a minor improvement in the precision of the result when SVM used the same feature.
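The ratio behind a lift chart can be sketched as follows. This is one common formulation and is an assumption here: the hit rate among the examples the model is most confident about, divided by the hit rate obtained without the model (the overall base rate). The function name `lift_at` and the toy values are illustrative and are not taken from the study's results.

```python
def lift_at(confidences, labels, fraction):
    """Lift for the top `fraction` of examples ranked by confidence.

    labels: 1 if the example belongs to the target class, else 0.
    Assumes at least one example belongs to the target class.
    """
    ranked = sorted(zip(confidences, labels), key=lambda pair: -pair[0])
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k  # with the model
    base_rate = sum(labels) / len(labels)                 # without the model
    return top_rate / base_rate


# Toy confidences for illustration only.
confidences = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
lift_half = lift_at(confidences, labels, 0.5)
lift_all = lift_at(confidences, labels, 1.0)
```

A lift above 1 for the top-ranked fraction means the model concentrates the target class among its most confident predictions; over the whole dataset the lift is 1 by construction.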

Conclusion and Future Works
This paper presents the design and implementation of Arabic text classification of King Abdulaziz University students' opinions using different algorithms such as SVM and NB. The analysis shows promising results with online student comments. We found that their text was more traditional and that local Saudi Arabian dialects are used in feedback more often than Modern Standard Arabic (MSA). To summarise, in the experiment that used negative and positive classes only, the best accuracy was achieved by SVM with the n-gram feature. Likewise, when the neutral class was added, the best accuracy was achieved by SVM with the n-gram feature.
In future work, we plan to gather structured Blackboard data from King Abdulaziz University and then label the student rows into three classes, negative, positive and neutral, based on study behaviour patterns. Then, a semantic schema will be established using the same approach as in this paper. The extracted features can help to understand the behaviour of the students during their study.