Sentimental Analysis on Health-Related Information with Improving Model Performance using Machine Learning

Abstract: Social media platforms are used extensively to exchange and share information and user experiences, which has led to the wide dissemination and viewing of personal experiences in many fields of life. Informative health-related videos on YouTube are therefore highly visible. Many users obtain medical treatment advice and health-related information from social media, particularly from YouTube, when searching for treatments for chronic illnesses. Sometimes these sources contain misinformation that can have fatal effects on users' health. Many sentiment analyses and classifications have been conducted on social media platforms to study user posts and comments in many life science fields. However, no study has analysed Arabic user comments that provide details on herbal treatments for people with diabetes. Therefore, this study proposes a model that detects the emotions/opinions of YouTube users on herbal treatment videos by analysing user comments with machine learning classifiers. In addition, a new Arabic Dataset on Herbal Treatments for Diabetes (ADHTD), which is based on user comments from several YouTube videos, is introduced. This study examines the impact of four representation methods on ADHTD to show the performance of machine learning classifiers. These methods are the removal of repeated characters in Arabic dialect and of the character extension known as 'TATAWEEL' or 'MAD', stemming of Arabic words, Arabic stop word removal and N-grams with Arabic words. Experiments have been conducted based on the aforementioned methods to handle the imbalanced proposed dataset and to identify the best machine learning classifiers for Arabic dialect textual data. The model achieved a higher accuracy, reaching 95%, on the dataset balanced with the Synthetic Minority Oversampling TEchnique (SMOTE) than on the imbalanced dataset.


Introduction
Social media platforms occupy a large part of our daily activities. They allow users to exchange information in seconds. YouTube is the most popular video platform, generating billions of views through uploaded video content. More than two billion logged-in users visit YouTube every month, and the platform allows people to share their perspectives on daily activities, thoughts, experiences, advertisements and educational resources (Bhuiyan et al., 2017; Burns et al., 2020). YouTube lets users like, comment on and share content. Advances in technology have made YouTube easily accessible, so it has become popular among kids, adults and the elderly as a means of education and entertainment. However, users find it difficult to distinguish desirable from undesirable content because of the massive amount of unstructured data on YouTube (Awal et al., 2018; Chen et al., 2017).
As a solution, researchers use natural language processing methods, such as text mining and sentiment analysis with machine learning or deep learning approaches, to rank and analyse the most suitable content based on user comments and the numbers of views and likes (Awal et al., 2018; Choi and Segev, 2020; Dabas et al., 2019; Tahir et al., 2019; Vedula et al., 2017). Through sentiment analysis, these comments can indicate user perspectives, emotions and opinions towards the content, which can be positive, neutral or negative. Sentiment analysis is the process of extracting and discovering user opinions. It can be useful for service improvement and for obtaining user feedback on products and services. It can also help users choose between different video content on YouTube according to their requirements, such as content suitable for kids (Tahir et al., 2019; Alghowinem, 2018).
Sentiment analysis is carried out using machine learning approaches, especially supervised classification techniques. The most popular Machine Learning Classifiers (MLCs) used in the process are Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), k-Nearest Neighbours (KNN), Support Vector Machine (SVM) and Logistic Regression (LR). These MLCs classify data into two classes: positive or negative. The classification is based on a prior training process on a dataset called the training dataset. The accuracy of the model depends on the training dataset and the pre-processing steps. The pre-processing steps used in dataset preparation can be applied easily to English-language textual datasets. A lot of scholarly attention is also given to user comments on social media in languages and dialects other than English.
Many studies have been conducted on sentiment analysis in domains such as advertisement classification (Vedula et al., 2017; Gauba et al., 2017; Chauhan and Meena, 2019), business intelligence (feedback on products) (Aufar et al., 2020; Tafesse, 2020), education (Veletsianos et al., 2018; Lee et al., 2017; Anggraini and Tursina, 2019; Thelwall and Mas-Bleda, 2018) and the quality of videos dedicated to kids (Tahir et al., 2019; Alghowinem, 2018; Araújo et al., 2017). However, only a few studies have been conducted on health-related misinformation. Many YouTube videos on disease treatments and health information are created from individual experiences. Thus, the content of such videos can mislead followers and cause health problems among them. Most scholarly studies focus on user opinions (sentiment analysis) and on detecting rumours/misinformation on health-related issues through social media platforms such as Twitter and Facebook (Daabes and Kharbat, 2019; Oksanen et al., 2015; Sicilia et al., 2017; Alayba et al., 2017; Alsaeedi and Khan, 2019; Alsaeedi, 2019).
To the best of my knowledge, no study has focused on building a dataset on herbal treatments for diabetes in the Arabic dialect from user comments on YouTube videos. Therefore, this study addresses two issues: sentiment analysis of YouTube content on herbal treatments for diabetes, and improving the performance of MLCs on Arabic datasets, using sentiment analysis as a case study. The study proposes a model that detects user opinions on herbal treatments in YouTube videos related to diabetes, based on user comments, by using MLCs. In addition, a new dataset called the Arabic Dataset on Herbal Treatments for Diabetes (ADHTD) is introduced. The ADHTD dataset includes 21,320 user comments from 22 YouTube videos on herbal treatments for diabetes. After noise removal, data cleaning and pre-processing, the remaining 4,111 comments were divided into two classes (positive and negative) through a human annotation process. Additionally, this study examines the impact of four representation methods on ADHTD to show the model performance. These methods are the removal of repeated characters in Arabic dialect and of the character extension known as 'TATAWEEL' or 'MAD', stemming of Arabic words, Arabic stop word removal and N-grams with Arabic words. Furthermore, the study handles the imbalanced dataset and identifies the best MLCs in terms of accuracy. The experimental findings show that stop word removal does not improve model performance, whereas the removal of repeated characters and character extensions and the stemming of Arabic words do improve it. In addition, LR and SVM outperform KNN, DT and NB. The worst accuracy in all experiments was recorded with KNN. The model achieved a higher accuracy level, reaching 95%, when using the Synthetic Minority Oversampling TEchnique (SMOTE).
The rest of this paper is organized as follows: Section 2 presents the related studies. The methods, model architecture and ADHTD dataset are explained in Section 3. Section 4 presents the results and discussion of the proposed model, and Section 5 concludes the paper.

Related Studies
Qi et al. (2017) focused on the analysis of rumours about cancer on the internet by collecting data from two websites. The experiment was conducted with the assistance of doctors, nurses and students specialising in medicine, who classified and verified the data into two classes, namely dread and wish. The ANOVA statistical method was used because the dataset was small. In the same way, Song et al. (2019) conducted experiments by collecting health information from a Chinese website. The dataset of 872 items was verified by 218 participants. It was found that dread health rumours are more credible than wish rumours. Sicilia et al. (2017) proposed a model to detect health-related rumours about the Zika virus on Twitter with a dataset of 800 items, which was classified into rumours and non-rumours. The model focuses on feature extraction based on three levels of influence on users and the network, using a random forest classifier. It achieved a performance of approximately 89% in detecting rumours. Alayba et al. (2017) introduced a new Arabic dataset on health issues by extracting the related information from user comments on Twitter, and applied several machine learning algorithms and deep learning methods using Convolutional Neural Networks (CNN) to the imbalanced dataset. The experiment shows that the model achieved a success rate of approximately 85 to 90%. However, the issue of the imbalanced dataset was not considered. In addition, deep learning was applied to a small dataset. Daabes and Kharbat (2019) demonstrated the importance of studying Arabic health information through cancer-based YouTube videos by collecting video attributes such as the numbers of likes, comments, views and dislikes.
Al-Tamimi et al. (2017) carried out sentiment analysis on Arabic-language user comments from YouTube. Approximately 5,986 comments were collected and annotated into three classes. The dataset was collected from the top-rated videos on YouTube and is imbalanced. Several machine learning algorithms were applied, and the model performance reached a success rate of 88.8%. Najadat and Abushaqra (2018) proposed a multimodal sentiment analysis on twenty-one Arabic videos collected from YouTube. Five machine learning models were applied to the voice and facial features of the speaker, and the model performance reached 76%.
An Arabic dataset based on hotel reviews was introduced by Elnagar et al. (2018). This imbalanced dataset contains 373772 reviews and can be used in further research. Six types of machine learning classifiers were applied to balanced and imbalanced versions of the dataset in three experiments, namely polarity, rating and lexicon classification.
Al-Horaibi and Khan (2016) constructed a general dataset from Twitter and selected Naïve Bayes and decision trees as machine learning classifiers for sentiment analysis of the Arabic language. They report that Naïve Bayes had the best accuracy rate, 64.85%, while the decision tree had an accuracy rate of 53.75%.
Al-Rubaiee et al. (2016) introduced a dataset of 1121 tweets to study student emotions on e-learning. They applied two machine learning classifiers, Naïve Bayes and support vector machine, in four experiments on two and three classes. The best accuracy rate, 84.62%, was obtained with the Naïve Bayes classifier on two classes. Even though the dataset was small, Heikal et al. (2018) used deep learning with CNN and Long Short-Term Memory (LSTM) together with ensemble methods to obtain the best accuracy rate, which was 64.46%. The experiment was conducted on ASTD: Arabic Sentiment Tweets Dataset (Nabil et al., 2015).

Methods and Model Architecture
This section presents the methods and model architecture of this research. The model architecture consists of seven phases, namely data collection, data cleaning, annotation, data pre-processing, feature extraction, model building and model evaluation, as shown in Fig. 1.

Data Collection
This study aims to analyse user opinions on herbal treatments for diabetes based on YouTube content. A suitable dataset therefore needs to be collected and prepared. This initial step of data collection is Phase 1. In Phase 1, the criteria for data collection are that the videos were published between 2015 and 2020 and have more than 100K views, 5000 likes and 2000 comments. The videos are published by members of the public, professional doctors, herbal treatment experts and users of such medical substances.
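The selection criteria above can be expressed as a simple filter. The following is a minimal sketch only: the `Video` record and its field names are hypothetical, and it assumes that "more than" applies to the view, like and comment thresholds alike.

```python
from dataclasses import dataclass


@dataclass
class Video:
    # Hypothetical metadata record; the field names are illustrative only.
    video_id: str
    year: int
    views: int
    likes: int
    comments: int


def meets_criteria(v: Video) -> bool:
    """Apply the Phase 1 selection criteria stated in the text."""
    return (
        2015 <= v.year <= 2020
        and v.views > 100_000
        and v.likes > 5_000
        and v.comments > 2_000
    )


# Toy metadata, invented for illustration.
videos = [
    Video("a", 2018, 250_000, 7_000, 3_100),
    Video("b", 2013, 900_000, 9_000, 4_000),  # published too early
    Video("c", 2019, 120_000, 1_200, 2_500),  # too few likes
]
selected = [v.video_id for v in videos if meets_criteria(v)]
print(selected)
```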
Data was collected from 22 YouTube videos, with a focus on comments written by speakers of Arabic dialect. Several keywords were used to retrieve the videos, such as herbal treatments for diabetes "علاج السكرى بالاعشاب", diabetes cure "علاج مرض السكرى" and cure from diabetes without medications "علاج السكرى بدون ادوية". Python version 3.8 and the YouTube API were used to extract the user comments on the videos. Figure 2 presents a sample of the comments extracted from a YouTube video.
Several steps were carried out before combining the extracted comments into a single file. A total of 22 files were created, one for the comments extracted from each YouTube video. After these steps, the dataset was named the Arabic Dataset on Herbal Treatments for Diabetes (ADHTD). This was followed by the data cleaning process.

Data Cleaning
This is Phase 2. After data collection and storage, the following data cleaning steps were carried out to remove the noise:
Then, the annotation process starts to identify the positive and negative user comments.

Annotation Process
In Phase 3, annotation is a manual, human-based process conducted with the assistance of three native Arabic speakers as annotators. If at least two annotators agree that a comment is positive, it is considered positive; otherwise, it is considered negative. If the annotators find a comment ambiguous, it is removed and not considered.
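The labelling rule can be sketched as a small function. This is an illustration under stated assumptions: the function name and the use of `None` to mark an ambiguous judgement are hypothetical, and a comment is discarded as soon as any annotator marks it ambiguous.

```python
from typing import Optional, Sequence


def resolve_label(votes: Sequence[Optional[str]]) -> Optional[str]:
    """Combine three annotator votes into one label.

    Each vote is 'positive', 'negative', or None when the annotator
    found the comment ambiguous. Returns None for discarded comments.
    """
    if any(v is None for v in votes):
        return None  # assumption: any ambiguity discards the comment
    positives = sum(v == "positive" for v in votes)
    # At least two of three positive votes make the comment positive.
    return "positive" if positives >= 2 else "negative"


print(resolve_label(["positive", "positive", "negative"]))
print(resolve_label(["positive", "negative", "negative"]))
print(resolve_label(["positive", None, "positive"]))
```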
The user comments with negative and positive denotations were inspected and verified, reducing the dataset from 21362 to 4111 comments, of which 1013 were positive and 3098 were negative. ADHTD is therefore considered an imbalanced dataset because the two classes differ considerably in size. Table 1 describes the balanced dataset and the original imbalanced dataset, with the attributes of the positive and negative classes, including the maximum and minimum length of comments in characters.

Fig. 2: Extracted comments
Based on the annotated ADHTD dataset, the pre-processing step is carried out in Phase 4.

Data Pre-Processing
In Phase 4, the data pre-processing step is performed. It consists of four steps: normalization, stop word removal, tokenization and Arabic stemming. Python version 3.8 was used to carry out these steps. Normalization returns words to their normal form. The Arabic stop words were removed using the NLTK package, and tokenization was used to separate the words. In addition, the following conditions, which appear frequently in Arabic dialect comments, were handled. Such pre-processing steps help to improve the MLCs, particularly with Arabic comments:
• Remove repeated characters, such as "رااااااائع" to "رائع"
• Remove the character extension in the Arabic language, such as "جميـــــــــــل" to "جميل"
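The two dialect-specific rules can be sketched with the standard `re` module. This is a minimal sketch, not the study's implementation: the function names are hypothetical, and it assumes that only runs of three or more identical characters are collapsed, so that legitimately doubled letters are left intact.

```python
import re

TATWEEL = "\u0640"  # Arabic elongation character (tatweel / kashida)
# Assumption: a run of three or more identical characters is an emphasis
# repetition; runs of two are kept because they can be valid spelling.
_REPEATS = re.compile(r"(.)\1{2,}")


def remove_tatweel(text: str) -> str:
    """Delete every elongation character from the text."""
    return text.replace(TATWEEL, "")


def collapse_repeats(text: str) -> str:
    """Replace each run of 3+ identical characters with one occurrence."""
    return _REPEATS.sub(r"\1", text)


def normalize_comment(text: str) -> str:
    """Remove elongation first, then collapse repeated characters."""
    return collapse_repeats(remove_tatweel(text))


stretched = "ر" + "ا" * 7 + "ئع"        # repeated-character example
extended = "جمي" + TATWEEL * 11 + "ل"   # character-extension example
print(normalize_comment(stretched))
print(normalize_comment(extended))
```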

Feature Extraction and N-Gram
In Phase 5, after the pre-processing steps performed in Phase 4, the words/terms extracted from the comments are represented as n-grams. Five forms of n-gram were applied: unigram, bigram, trigram, 4-gram and 5-gram. The best results were achieved with bigram and trigram, and no further improvement in performance was observed beyond that. Then, the n-grams were converted into numeric values using "CountVectorizer" from "sklearn.feature_extraction.text" in Python 3.8. The term frequency-inverse document frequency (TF-IDF) weighting was applied through "TfidfTransformer" in the "sklearn" package. The TF-IDF algorithm is given in (1):

$$\mathrm{TFIDF}(w, C) = TF(w)_C \times \log\left(\frac{N}{CF_t}\right) \quad (1)$$

Where:
TF(w)_C = the number of occurrences of word (w) in comment (C)
CF_t = the number of comments containing word (w)
N = the total number of comments in the dataset

The output of this phase is used as input for the MLCs in Phase 6.
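To make the n-gram and TF-IDF steps concrete, the sketch below computes word n-grams and the classical weight, term frequency times log(N / comment frequency), in plain Python. It is an illustration rather than the study's code: the function names and the toy English comments are hypothetical, and the default "TfidfTransformer" in "sklearn" uses a smoothed variant of the inverse document frequency and normalizes each row, so its values differ from this textbook form.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(comments, n_min=1, n_max=3):
    """Weight each n-gram in each comment by TF * log(N / CF)."""
    term_counts = [
        Counter(word_ngrams(c.split(), n_min, n_max)) for c in comments
    ]
    n_comments = len(comments)
    comment_freq = Counter()  # number of comments containing each n-gram
    for counts in term_counts:
        comment_freq.update(counts.keys())
    return [
        {w: tf * math.log(n_comments / comment_freq[w])
         for w, tf in counts.items()}
        for counts in term_counts
    ]


# Toy comments, invented for illustration.
comments = ["herb helps sugar", "herb is useless", "sugar sugar drops"]
weights = tfidf(comments, n_min=1, n_max=2)
print(weights[2]["sugar"])
```

An n-gram that appears in every comment receives a weight of zero under this form, which is one reason library implementations add smoothing.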

Build Model
In Phase 6, the most popular MLCs were selected (Pranckevičius and Marcinkevičius, 2017; Maxwell et al., 2018; Saharudin et al., 2020). They were implemented using the "sklearn" package in Python 3.8. These MLCs are NB, DT, RF, KNN, SVM and LR. The results are analysed, discussed and presented in the next section. The dataset was divided for training and testing in three configurations: 70% training and 30% testing, 60% training and 40% testing, and 80% training and 20% testing. The results did not show a major difference in model performance across the three configurations. The results presented in this research are based on 70% for training and 30% for testing.
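The three train/test configurations can be sketched as follows. This is a minimal illustration with a hypothetical helper; it assumes a stratified split (class proportions preserved in both parts), which the text does not state, and uses toy labels.

```python
import random


def stratified_split(labels, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels, invented for illustration.
labels = ["pos"] * 10 + ["neg"] * 30
for ratio in (0.3, 0.4, 0.2):  # the 70/30, 60/40 and 80/20 configurations
    train_part, test_part = stratified_split(labels, ratio)
    print(ratio, len(train_part), len(test_part))
```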

Model Evaluation
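In Phase 7, the MLCs are evaluated. Two types of evaluation are conducted, on the dataset and on the performance of the MLCs. First, cross-validation is conducted to evaluate the quality of the dataset and to avoid the over-fitting problem. The cross-validation splits the data into 3, 5 and 10 folds. The resulting accuracy on the test data (MLC performance) is shown in Table 1 for 5 folds. Secondly, the performance of the MLCs has been evaluated using Precision (2), Recall (3), F1-score (4) and Accuracy (5):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (3)$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (4)$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (5)$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.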
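The four measures can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of precision, recall, F1-score and accuracy; the function name and the toy labels are hypothetical.

```python
def evaluate(y_true, y_pred, positive="positive"):
    """Return precision, recall, F1-score and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy labels, invented for illustration: TP=2, FN=1, FP=1, TN=4.
y_true = ["positive"] * 3 + ["negative"] * 5
y_pred = ["positive", "positive", "negative",
          "positive", "negative", "negative", "negative", "negative"]
scores = evaluate(y_true, y_pred)
print(scores)
```

On an imbalanced dataset, accuracy alone can look high even when the minority class is poorly recognized, which is why precision, recall and F1-score are reported alongside it.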

Experiments, Results and Discussion
This section presents the experiment settings and the results. Two types of experiments have been carried out: experiments with word representation methods and experiments with handling the imbalanced dataset.

Experiments with Word Representation Methods
The experiments were carried out to examine the impact of the four word representation methods, applied to the extracted features, on the performance of the proposed model. These methods are the removal of repeated characters in the Arabic dialect and of the character extension known as 'TATAWEEL' or 'MAD', stemming of Arabic words, Arabic stop word removal and N-grams with Arabic words. The experiments also identify the best MLCs and handle the imbalanced dataset. The MLCs included are NB, DT, RF, KNN, SVM and LR. In all experiments, the cross-validation process was carried out with five folds to avoid the over-fitting problem, as shown in Table 2. In addition, unigram, bigram, trigram and 4-gram were applied in all experiments.
In the first Experiment (E1), the aforementioned methods were not applied. The results of E1 are shown in Table 3. The model accuracy using SVM and LR reached 90 and 89%, respectively, and the worst accuracy was 69%, using KNN.
In the second Experiment (E2), the removal of repeated characters in Arabic dialect and of the character extension known as 'TATAWEEL' or 'MAD', together with stemming, was considered. Stemming reverts words to their original form, the repetition of characters is a common way of writing comments in Arabic dialect, particularly on social media, and the character extension is frequently used by Arabic commenters. In this experiment, the performance of the MLCs improved compared with that in E1, as shown in Fig. 2-5 for Unigram, Bigram, Trigram and 4-gram, respectively. In addition, SVM and LR had the best accuracy in E2. The results show that the accuracy with trigram and 4-gram is the same, reaching 91 and 90% for SVM and LR, respectively. These results are better than those of Unigram and Bigram, as shown in Fig. 2-5. However, the accuracy of KNN decreased to 64, 65, 67 and 67% with Unigram, Bigram, Trigram and 4-gram, respectively, as shown in Fig. 2-5.
In the third Experiment (E3), where the NLTK stop word removal for Arabic is used, there is no significant improvement in model performance. Accuracy decreases slightly in all four classifiers except KNN, which shows a slight improvement with Unigram from 64 to 67% and with Bigram from 65 to 66%, but no improvement or a decrease in accuracy with Trigram and 4-gram.
Overall, the accuracy of the MLCs in E2 and E3 is higher than in E1. This improvement is due to the removal of repeated characters in Arabic dialect and of the character extension known as 'TATAWEEL' or 'MAD', together with stemming, in E2. Such methods can improve the performance of the MLCs, especially when handling Arabic dialect textual data extracted from social media. In contrast, stop word removal for Arabic textual data does not show any improvement. Furthermore, when the N-gram method is combined with the removal of repeated characters and character extensions and with the stemming of Arabic words, the accuracy of the model improves with bigram and trigram, while no further improvement is observed with 5-grams.

Experiments with Handling Imbalanced Dataset Techniques
The dataset considered in this research is imbalanced. Therefore, techniques for handling imbalanced datasets were applied to it. These techniques were oversampling, under sampling and SMOTE (Jeatrakul et al., 2010). They were used to bring the two classes close to each other in the number of comments.
For all three techniques, the cross-validation process was conducted, and the results of 5-fold cross-validation based on the SMOTE technique are shown in Table 4. In these experiments, unigram and trigram were used. The results of the MLCs using the oversampling, under sampling and SMOTE techniques are shown in Table 5. The best accuracy was achieved by SMOTE using the two classifiers SVM and LR, with rates of 94.94 and 92.58%, respectively, with unigram and 95.27 and 92.85%, respectively, with trigram.
With the SMOTE technique, the performance of the model improved significantly for all MLCs except KNN, for which a decrease in accuracy was noticed. In under sampling, the number of comments in the majority class (high number of comments/rows in positive class) is reduced to the number of comments in the minority class (low number of comments/rows in negative class). In contrast, oversampling randomly duplicates comments in the minority class until its size is close to that of the majority class. SMOTE increases the minority class to the size of the majority class in a different way. It generates new comments (data points in the feature space) from the distances between existing minority data points and their neighbours until the minority class reaches the size of the majority class. Therefore, this method achieved better accuracy than under sampling or oversampling.
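The difference between the three techniques can be sketched on toy data. The following is a simplified illustration, not the implementation used in the study: the function names, the toy two-dimensional vectors and the neighbour count k are hypothetical, and the SMOTE variant shown creates each synthetic point by interpolating between a minority point and one of its k nearest minority neighbours.

```python
import math
import random


def random_undersample(majority, minority, rng):
    """Randomly drop majority items until both classes have equal size."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """Randomly duplicate minority items until both classes have equal size."""
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote(majority, minority, rng, k=2):
    """Add synthetic minority points lying between near minority neighbours."""
    synthetic = []
    for _ in range(len(majority) - len(minority)):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted(
            (m for m in minority if m is not x),
            key=lambda m: math.dist(x, m),
        )[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return list(majority), list(minority) + synthetic


rng = random.Random(0)
# Toy 2-D feature vectors, invented for illustration.
majority = [(float(i), 5.0) for i in range(6)]
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

for balance in (random_undersample, random_oversample, smote):
    maj, mino = balance(majority, minority, rng)
    print(balance.__name__, len(maj), len(mino))
```

Because the synthetic points are new vectors rather than copies, the classifier sees more varied minority examples than with plain duplication, at the cost of assuming that points between two minority examples also belong to the minority class.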
The main contributions of this study are as follows:

Conclusion
A lot of misinformation is spread across the digital world, and such information can harm people when it relates to health. In this study, user opinions on herbal treatments for diabetes on YouTube were detected by analysing user comments. In addition, this study examines the impact of four representation methods on Arabic datasets to evaluate the model performance. Based on the findings of this study, these methods can improve models and future research on data mining based on textual data. The experimental results show that the model performance can be improved by removing repeated characters in Arabic dialect and the character extension known as 'TATAWEEL' or 'MAD' and by stemming Arabic words, whereas with Arabic stop word removal the model performance increases only slightly, and only for KNN, compared with the other machine learning classifiers. Furthermore, the results demonstrate that SVM and LR outperform all the other machine learning classifiers in all experiments, while KNN performs worst. The best accuracy was recorded when applying SMOTE to the imbalanced dataset, which achieved rates of 94 and 92% using SVM and LR, respectively, compared with the under sampling and oversampling techniques.