Multimodal Sentiment Analysis: A Comparison Study

Abstract: Sentiments and emotions play a pivotal role in our daily lives. They assist decision making, learning, communication and situation awareness in human environments. Sentiment analysis is mainly focused on the automatic recognition of opinions' polarity, as positive or negative. Nowadays, sentiment analysis is replacing the old web-based and traditional survey methods conducted by different companies for finding public opinion about entities such as products and services, in order to improve their marketing strategies and advertising; at the same time, sentiment analysis improves customer service. A large number of videos are uploaded online every day. Video files contain text, visual and audio features that complement each other. Multimodality is defined as analyzing more than one modality; Multimodal Sentiment Analysis refers to the combination of two or more input modalities in order to improve the performance of the analysis, a combination of text and audio-visual inputs being one example. The automatic analysis of multimodal opinion involves a deep understanding of natural language, audio and video processing, which researchers are continuing to improve. This paper focuses on multimodal sentiment analysis over text, audio and video, giving a complete picture of the field and its available datasets, providing brief details for each type and exploring recent research trends in multimodal sentiment analysis and its related fields.


Introduction
Several scientific fields, such as machine learning, signal processing, computer vision, computational linguistics, cognitive science, social psychology and neuroscience, attract great research interest because of their practical and theoretical contributions to recognition and classification. Such scientific fields form an emerging research area today (Picard, 2010) and (D'mello and Kory, 2015). People now extensively use social media environments, such as YouTube, Facebook, blogs and microblogs, to express their opinions, and they increasingly make use of images, audio and videos on different social media platforms to disclose and express those opinions. Thus, it is highly crucial to mine opinions and identify sentiments from the various modalities (Cambria et al., 2014) and (KgaogeloLetsebe, 2017). At the same time, there is an increasing need to know not only what information a user conveys but also how it is being conveyed. Many studies by psychologists and neuroscientists have shown that emotion plays a significant role in the rational actions of human beings, as it is closely related to decision making (Damasio, 1994).
To date, much of the work on sentiment analysis has focused on textual data, and a number of resources have been created, including lexicons (Liu and Zhang, 2012) and (Pang et al., 2002), but very little of the literature examines the vocal correlates and other relevant aspects of emotion in human speech and video. A recent development in multimodal sentiment analysis is visual sentiment analysis. Users of social media frequently share their text messages along with images and videos; these kinds of visual multimedia are an additional source of information for expressing users' sentiment (Fig. 1).

Fig. 1: General sentiment analysis
The aim of multimodal analysis is to increase accuracy and achieve the best prediction. This research presents a detailed discussion of the literature describing textual opinion, vocal emotion and visual expression, along with its principal findings. The voice and video parameters affected by emotion will be described in detail for a domain of specific emotions.
The remainder of the paper is organized as follows: Part II presents the techniques and features of multimodal sentiment analysis, part III discusses multimodal sentiment analysis in more detail, part IV presents multimodal sentiment datasets, part V presents the applications of sentiment analysis, part VI presents the challenges and gaps in sentiment analysis approaches and finally, part VII is the conclusion.

A. Multimodal Sentiment Analysis Technique
Multimodal fusion is the process of combining data collected from various modalities for analysis tasks. Fusing multiple modalities yields an information gain and improves the accuracy of the overall results, which in turn supports decision making.
Three main levels (types) of fusion have been studied by researchers: Feature-level fusion (early fusion), decision-level fusion (late fusion) and the hybrid fusion approach. Furthermore, there are many different multimodal fusion models, such as model-level fusion, rule-based, estimation-based and classification-based methods.
Feature Level (Early Fusion)  Features of text, audio and video are extracted from the various modalities separately. These features are first combined into a general feature vector, which is then presented for analysis and classification. The advantage of this approach is that the relations between the various multimodal features are captured at an early stage, which may lead to better performance. The disadvantage is time synchronization: the obtained features belong to various modalities and can vary in many aspects, so all features must be converted to the same format before fusion analysis takes place.
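As an illustration, the following is a minimal sketch of feature-level fusion in Python, assuming each modality has already been reduced to a fixed-length feature vector (the dimensions, random placeholder data and choice of classifier below are illustrative assumptions, not those of any surveyed system):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200  # number of opinion samples (placeholder data)
text_feats = rng.random((n, 300))   # e.g., bag-of-words or embedding features
audio_feats = rng.random((n, 40))   # e.g., MFCC statistics per utterance
video_feats = rng.random((n, 128))  # e.g., facial-expression descriptors
labels = rng.integers(0, 2, n)      # 0 = negative, 1 = positive

# Early fusion: concatenate the per-modality vectors into one joint
# feature vector, then train a single classifier on the fused representation.
fused = np.hstack([text_feats, audio_feats, video_feats])
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
```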
Decision Level (Late Fusion) (Celli et al., 2014) The features of each modality are extracted, analyzed and classified separately and all the analysis results are merged into a vector in order to obtain the final decision. The advantage of this approach is that it is easier than feature-level fusion, since the decisions obtained from the various modalities have the same form of data, so there is no need to convert the data to a common format; every modality can use its best learning model and the most appropriate classifier for its features. However, this could also be considered a disadvantage, because more than one classifier is used and the separate learning process for each model becomes tedious and time consuming.
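For comparison, here is a sketch of decision-level fusion under the same placeholder assumptions, where each modality keeps its own classifier and only the posterior probabilities are merged (simple averaging is used here; weighted voting and other combination rules are equally common):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200  # placeholder data, as in the early-fusion sketch
text_feats, audio_feats, video_feats = (rng.random((n, 300)),
                                        rng.random((n, 40)),
                                        rng.random((n, 128)))
labels = rng.integers(0, 2, n)

# Late fusion: a separate, modality-appropriate classifier per stream.
text_clf = LogisticRegression(max_iter=1000).fit(text_feats, labels)
audio_clf = SVC(probability=True).fit(audio_feats, labels)
video_clf = GaussianNB().fit(video_feats, labels)

# Merge the per-modality posteriors by averaging; the fused decision is
# the class with the highest combined probability.
probs = (text_clf.predict_proba(text_feats) +
         audio_clf.predict_proba(audio_feats) +
         video_clf.predict_proba(video_feats)) / 3.0
final_decision = probs.argmax(axis=1)
```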
Hybrid Multimodal Fusion (Wöllmer et al., 2013) This is a combination of the feature-level and decision-level fusion methods. It aims to obtain the advantages of both approaches while overcoming their disadvantages.

B. Multimodal Sentiment Analysis Features
Multimodal sentiment analysis features are a compilation of two or more different feature types: text, audio and image. In textual opinions, the only available source of information consists of the words in the opinion and the dependencies among them, which may sometimes be insufficient to convey the exact sentiment of the consumer. In contrast, video opinions provide multimodal data in the form of vocal as well as visual responses; the vocal modulations and visual cues in the recorded response provide additional information that helps to identify the sentiment of the opinion holder.

Linguistic Features
Sentiment analysis from text aims at the extraction of appraising meaning, which starts with the automatic detection of subjectivity. Here, an overview is provided of sentiment analysis approaches in NLP (Natural Language Processing), including supervised and unsupervised methods, as well as future directions and limitations in the field (Table 1).
Feature types can be explicit or implicit; explicit features have four types, namely syntactic, semantic, link-based and stylistic features, while implicit features focus on semantic and linguistic rules (Sharef et al., 2016).
In natural language processing, sentiment analysis refers to determining whether the sentiment of a text written in natural language is positive, neutral or negative (Obaidat et al., 2015). This can be achieved using a supervised "corpus-based" sentiment analysis approach, which relies on manually labeled samples; an unsupervised "knowledge- or lexicon-based" approach; or a hybrid (both lexicon-based and corpus-based) approach.
Supervised sentiment analysis (corpus based) aims at building predictive models for sentiment by exploiting a machine learning classifier that is trained on labeled data and then used to classify test data. This approach builds a feature vector for each text entry, in which certain aspects or word frequencies are quantified, then trains standard machine learning tools and validates them against reference annotated texts. As a first approach to supervised sentiment analysis, Wiebe et al. (1999) use annotations that tag the evaluated content, but not its orientation, so the result of text classification is either subjective or objective. However, most supervised approaches to sentiment analysis are trained on a specific domain and require a huge, manually labeled annotated corpus; this process is expensive and time consuming.
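The corpus-based pipeline can be sketched minimally with scikit-learn, assuming a toy hand-labeled corpus (a real system would need a much larger annotated dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative labeled corpus.
train_texts = ["great product, works well",
               "terrible battery, waste of money",
               "I love this phone",
               "awful service, very disappointed"]
train_labels = ["positive", "negative", "positive", "negative"]

# Quantify word frequencies into a feature vector per text, then train
# a standard classifier and apply it to unseen text.
vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)
print(clf.predict(vectorizer.transform(["the battery life is great"])))
```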
Unsupervised (lexicon based) approaches allow an estimation based on expert knowledge, without the need for annotated data. The expert knowledge used for the estimation is often encoded in a lexicon, in which words or phrases are annotated with their sentimental meanings. These lexica can be manually annotated by raters who interpret the meaning of words. The General Inquirer (GI) (Stone, 1997) is one of the most widely used reference lexica for sentiment analysis and includes a list of positive and negative terms. Wilson et al. (2005) define a method to increase the recall of unsupervised techniques by combining GI with other lexica. Pennebaker et al. (2015a) apply the Linguistic Inquiry and Word Count (LIWC), a method that counts the positive and negative affect terms in a text. In information retrieval, De Luca and Nürnberger (2006a) implemented relation-based methods to merge SynSets with hypernyms, hyponyms and context information.
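To make the idea concrete, here is a minimal sketch of how a lexicon-based scorer operates, assuming a hand-made toy lexicon in place of real resources such as GI or LIWC:

```python
# Toy polarity lexicon: word -> sentiment score (illustrative only).
LEXICON = {"good": 1, "great": 1, "love": 1,
           "bad": -1, "terrible": -1, "hate": -1}

def lexicon_sentiment(text: str) -> str:
    """Sum the lexicon scores of the words and map the total to a polarity."""
    score = sum(LEXICON.get(token, 0) for token in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this great phone"))  # -> positive
```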
The Affective Norms for English Words (ANEW) (Bradley and Lang, 1999) and (Dodds and Danforth, 2010) were not originally designed for sentiment analysis, but they are beneficial for classifying sentiment, such as happiness, in text. However, lexicon construction is difficult: making a lexicon deal with a variety of languages is very expensive if done manually and unreliable if done automatically. On the other hand, the semi-supervised approach is a class of supervised approach that makes use of unlabeled data for training. It minimizes the cost associated with the labeling process; however, it depends highly on the performance of the initial labeled set. Sentiment analysis can be grouped into three different levels based on the target of study. At the document level, the entire document is classified as either positive or negative using a machine learning approach or a lexicon-based approach. At the sentence level, opinions are evaluated sentence by sentence in order to decide whether each is positive, neutral or negative. The drawback of both levels is that they provide only a coarse classification; studies at these levels include (Ibrahim et al., 2015), (Duwairi et al., 2014), (Salameh et al., 2015), (Duwairi, 2015), (Al-Kabi et al., 2016), (ElSahar and El-Beltagy, 2015), (Wang et al., 2015) and (Ghareb et al., 2015). At the aspect level, Poria et al. (2014) enable distinguishing the polarity of each aspect, clarifying whether each aspect is positive or negative.
Sentiment analysis systems are categorized into statistics-based and knowledge-based systems (Pang et al., 2002). Initially, the use of knowledge bases was popular for identifying emotions and polarity in text; later, supervised statistical methods became common among researchers. Pang et al. (2002) apply and compare different machine learning algorithms on a review dataset using large textual feature sets, reaching an accuracy of 82.9%, whereas Socher et al. (2013) use a Recursive Neural Tensor Network (RNTN) and obtain an accuracy of 85% on the same dataset.
Another approach, by Yu and Hatzivassiloglou (2003), uses the semantic orientation of words to identify polarity at the sentence level. Melville et al. (2009) develop a framework that exploits word-class association information for domain-dependent sentiment analysis. Other unsupervised or knowledge-based approaches to sentiment analysis include: Turney (2002) uses seed words to calculate the polarity and semantic orientation of phrases; Hu et al. (2013) propose a mathematical model to extract emotional clues from blogs and use them for sentiment recognition; Gangemi et al. (2014) present an unsupervised frame-based approach to identify opinion holders and topics; and in Sentic Computing, Cambria and Hussain (2015) use a hybrid approach for sentiment analysis that exploits an ensemble of deep learning, commonsense reasoning and linguistics to better grasp the semantics and sentics (i.e., denotative and connotative information) associated with natural language concepts (Fig. 2). Narr et al. (2011) apply a natural language processing (NLP) approach to extract information from tweets and transform it into a semantic knowledge base.

Audio Features
Today, a rich body of literature has been established, including many surveys such as (Schuller et al., 2011), (Crouch and Khosla, 2012) and (Pérez-Rosas and Mihalcea, 2013). Specific feature components exist for emotion and sentiment analysis through audio. Various prosodic and acoustic features have been used in the literature to teach machines to detect emotions (Navas et al., 2006), (Morrison et al., 2007), (Wu and Liang, 2011), (Murray and Arnott, 1993), (Luengo et al., 2005) and (Koolagudi et al., 2011). Psychological studies of emotion have found that vocal parameters, especially pitch, intensity, speaking rate and voice quality, play an important role in the recognition of emotion and in sentiment analysis (Murray and Arnott, 1993). Different voice parameters are affected by emotion, such as voice quality, utterance timing and utterance pitch contour.
Further works focus on sentiment analysis of the textual content present in speech. Pereira et al. (2014) use sentiment analysis in speech for information retrieval; their proposed approach takes a spoken query and retrieves documents. Kaushik et al. (2013a) and its extension (Kaushik et al., 2013b) observe that sentiment analysis on natural speech data can perform well even when faced with low word recognition rates, in line with what was proposed by (Metze et al., 2010). Pérez-Rosas and Mihalcea (2013) use speech recognition and also focus on the linguistic content of the reviews.
Further studies show that the acoustic parameters change depending not only on personality traits, but also on oral variations. Much research has been performed on the types of features needed for better analysis (Muda et al., 2010). Researchers find that pitch- and energy-related features play a key role in affect recognition. Other features that have been used for feature extraction, and which are affected by emotion, include Mel Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Prediction Cepstral Coefficients (LPCC), pauses, Teager-energy-based features and formants. Some of the affected audio features are described briefly below:
• MFCC: One of the most commonly used feature extraction methods in ASR; MFCCs are coefficients that collectively form a mel-frequency cepstrum (MFC), computed from the power of each frequency band of an audio clip
• Spectral centroid: One of the measures used in DSP to characterize a spectrum; it indicates the center of mass of the spectrum and provides an indication of the brightness of a sound
• Spectral flux: A measure of how quickly the power spectrum of a signal is changing. This feature is usually calculated by comparing the power spectrum of one frame against the power spectrum of the previous one, measuring the Euclidean distance between the two normalized spectra
An acoustic study of emotions expressed in speech (Yildirim et al., 2004) investigates the acoustic properties of speech associated with four different emotions (sadness, anger, happiness and neutral), intentionally expressed in speech by an actress. The authors aimed to obtain detailed acoustic knowledge of how speech is modulated when a speaker's emotion changes from neutral to a certain emotional state. Their experiments show that happiness, anger, neutral and sadness share similar acoustic properties for a specific speaker. Speech associated with anger and happiness is characterized by longer utterance duration, shorter inter-word silence and higher pitch and energy values with wider ranges, showing the characteristics of exaggerated or hyper-articulated speech; however, this means that acoustic separability is relatively poor.
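To make these definitions concrete, here is a sketch of extracting the above features with the librosa library ('speech.wav' is a placeholder path; the spectral flux is computed manually, following the Euclidean-distance definition given above):

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # placeholder audio file

# MFCCs: per-frame coefficients describing the mel-frequency cepstrum.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Spectral centroid: per-frame "center of mass" of the spectrum.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# Spectral flux: Euclidean distance between successive normalized spectra.
S = np.abs(librosa.stft(y))
S = S / (S.sum(axis=0, keepdims=True) + 1e-10)  # normalize each frame
flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))

# Frame-level features are commonly summarized per utterance (e.g., means)
# before being passed to a classifier.
utterance_vector = np.concatenate([mfcc.mean(axis=1),
                                   centroid.mean(axis=1),
                                   [flux.mean()]])
```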

Visual Features
Visual language is a type of non-verbal communication in which physical behavior, as opposed to words, is used to express or convey information. Such behavior includes facial expressions, body posture, gestures, eye movement, touch and the use of space.
Processing sentiment with computer vision is a relatively recent area of research. The main research tasks in visual sentiment analysis focus on detecting, modeling and extracting sentiment information expressed through facial expressions, body posture, gestures and any other sentiment cues that can be observed in visual multimedia. Ekman and Keltner (1970) are pioneers in this field of research; they carried out extensive studies on facial emotions and argued that it is possible to detect basic emotions such as anger, joy, sadness, disgust and surprise from facial expression cues. In this section, we present various studies on the use of visual features for multimodal affect analysis (Fig. 3).

Facial Action Coding System
Many measurement systems for facial expressions have been developed (Paul and Friesen, 1978), (Izard et al., 1983) and (Kring and Sloan, 1991). One of these systems, the Facial Action Coding System (FACS) developed by Paul and Friesen (1978), has been widely used. FACS relies on Action Units (AUs) to reconstruct facial expressions. Human facial muscles are almost identical across individuals and AUs are based on the movements of these muscles; each AU consists of three basic parts: AU number, FACS name and muscular basis. FACS differentiates between various facial actions but cannot by itself recognize emotions; it was therefore later combined with other resources to reconstruct emotions (Ekman et al., 2002), (Ekman et al., 1998) and (Ekman and Rosenberg, 1997). Matsumoto (1992) added a new emotion ('contempt', or disrespect) to the set of six previously defined emotions.
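For illustration, here is a sketch of the kind of AU-to-emotion mapping that such combined resources encode; the combinations below are commonly cited EMFACS-style examples, and exact mappings vary across sources, so they are assumptions for demonstration rather than a definitive table:

```python
# Commonly cited EMFACS-style AU combinations (illustrative, not definitive).
AU_TO_EMOTION = {
    frozenset({6, 12}): "happiness",       # cheek raiser + lip corner puller
    frozenset({1, 4, 15}): "sadness",      # inner brow raiser + brow lowerer + lip corner depressor
    frozenset({1, 2, 5, 26}): "surprise",  # brow raisers + upper lid raiser + jaw drop
    frozenset({4, 5, 7, 23}): "anger",     # brow lowerer + lid tightener + lip tightener
}

def emotion_from_aus(active_aus):
    """Return an emotion label if a known AU combination is fully active."""
    for combo, emotion in AU_TO_EMOTION.items():
        if combo <= set(active_aus):
            return emotion
    return "unknown"

print(emotion_from_aus({6, 12}))  # -> happiness
```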
An important aspect of video-based methods is preserving accurate tracking throughout the video sequence. A wide range of deformable models, such as muscle-based models (Ohta et al., 1998), 3D wireframe models (Cohen et al., 2003), elastic net models (Kimura and Yachida, 1997) and geometry-based shape models (Verma et al., 2005) and (Davatzikos, 2001), have been used to track facial features in videos. Deformable models have demonstrated improvements in both facial tracking and facial expression analysis accuracy (Wen, 2003). Pantic and Rothkrantz (2000a; 2000b) and Fasel and Luettin (2003) proposed automatic methods.

Body Gestures
Most research has concentrated on facial feature extraction for emotion and sentiment analysis; however, there are some contributions based on features extracted from body gestures, which provide a valuable source of features for emotion and sentiment recognition. A relation between body gestures and emotion was explored in (De Meijer, 1989), including the qualities and dimensions of different emotions. Another study shows that the basic emotions can easily be distinguished from some simple statistical measures of motion dynamics (Kapur et al., 2005).
A mathematical model to analyze body gestures for emotion expressiveness was developed by (Caridakis et al., 2007). Piana et al. (2014) extracted two kinds of features, facial and hand gesture features, and used them in emotion analysis.
A sentiment prediction framework was developed by (Xu et al., 2014) to predict the sentiment of images using convolutional neural networks; one of its advantages is that it does not require domain knowledge for visual sentiment. Deep 3D convolutional networks (C3D) have been proposed for spatio-temporal feature learning (Tran et al., 2015), showing more successful spatio-temporal feature learning than 2D networks. A recurrent neural network (RNN) was developed by (Poria et al., 2017) to extract visual features (Fig. 4).
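As a simplified illustration of CNN-based visual feature extraction (not the specific architecture of any of the works above), the following sketch uses a pretrained 2D network from torchvision as a generic frame-level feature extractor; 'frame.jpg' is a placeholder for a frame sampled from an opinion video:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained CNN and drop its classification head so that it
# outputs a 512-dimensional feature vector per input frame.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

frame = preprocess(Image.open("frame.jpg")).unsqueeze(0)
with torch.no_grad():
    visual_features = model(frame)  # shape: (1, 512)
```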

Multimodal Sentiment Analysis
Sentiments and emotions play a pivotal role in our daily lives. They assist decision making, learning, communication and situation awareness in human environments. Recently, most research in this field has focused on multimodal emotion recognition using visual and aural information, while literature on multimodal sentiment analysis remains rare. Most of the work and available data resources are restricted to text opinion mining and the field of natural language processing. Moreover, most research is based on the English language; sentiment analysis experiments are rarely based on other languages (especially Arabic) in comparison to English. Lee and Narayanan (2005) explore domain-specific emotion recognition from speech signals using data obtained from real-world call center dialogs. Their experimental results show that significant improvements can be made by combining language and discourse information with the acoustic features that most studies have focused on within the same framework; however, their drawback is domain specificity. Eyben et al. (2010) contribute on three different points: First, they address the task of tri-modal sentiment analysis by integrating three different modalities, namely visual, audio and linguistic features, in order to determine the polarity of an input stream. Second, they present qualitative and statistical analyses that identify five multimodal features found helpful in differentiating between negative, neutral and positive sentiments: Polarized words, smiles, gaze, pauses and voice pitch. Third, they introduce a new real-world dataset consisting of video opinions collected from the YouTube website.
Yamasaki et al. (2015) propose a method to accurately predict multiple impression-related user ratings for a given video talk. Their proposal considers multimodal features, including linguistic as well as acoustic features, correlations between different user ratings (labels) and correlations between different feature types, by using a single Markov Random Field (MRF) and optimizing the label assignment problem in order to obtain a consistent set of multiple labels for a given video. Their experimental results show that the proposed method obtains an accuracy of 93.3%.
D'mello and Kory (2015) design a survey and discuss both unimodal and multimodal accuracy comparisons using statistical measures; the experiments compare the accuracy of different algorithms on different datasets. Zeng et al. (2009) design a survey on multimodal emotion recognition, focusing mainly on collecting and processing audio, visual and audio-visual material in order to identify the challenges involved in multimodal data. Rosas et al. (2013) and Morency et al. (2011) address multimodal sentiment analysis by designing experiments on a new dataset consisting of Spanish videos collected from social media websites; combining the three feature types in comparative experiments, they show that using visual, audio and textual features jointly improves over using one modality at a time.

B. Other Languages
Pérez-Rosas et al. (2013) present a multimodal approach for utterance-level sentiment classification. The paper introduces a new multimodal dataset, which consists of sentiment-annotated utterances extracted from video reviews, where each utterance is associated with a video, acoustic and linguistic data stream. The experiments show that sentiment annotation of utterance-level visual data streams can be effectively performed and that the use of multiple modalities can lead to a reduction in error rate of up to 10.5% compared to the use of one modality at a time.

Dataset for Multimodal Sentiment Analysis
Many exhaustive surveys on sentiment analysis of text input are available, while surveys rarely focus on the analysis of audio, video and multimodal input. Surveys that review recent progress in the field with a focus on available datasets and sentiment analysis techniques are (D'mello and Kory, 2015) and (Zeng et al., 2009).
There are two main methodologies for dataset collection: Video recordings that depend on specific scripts and natural videos. Multimodal frameworks achieve better performance than unimodal systems, but the improvement is much lower when trained on natural data versus acted data (D'mello and Kory, 2015). It is important to track and label the emotion of the opinion in a video; therefore, labeling is done at the utterance level, where every utterance is associated with a sentiment label in both approaches.
Few datasets are available for multimodal sentiment analysis; the recently covered datasets in multimodal affect recognition are (Table 2):
• YouTube Dataset: The dataset was developed by Morency et al. (2011).

Sentiment Analysis Applications
Sentiment analysis can be used in several applications, such as marketing strategies; for example, to understand and analyze customers' demands. It helps organizations to increase innovation, retain customers and increase operational efficiency. It can be used in prediction, to understand customers' needs and predict future possibilities in every aspect, replacing old surveys and focus groups, which were much slower and much more expensive. In government policy, politicians and governments often use sentiment analysis to understand how people feel about them and their policies, to contextualize the likes and dislikes of users (Langlet and Clavel, 2016) and to extract topic words from each user's speech.
Another domain for sentiment analysis is broadcast video news: Ellis et al. (2014) utilize multimodal sentiment analysis on broadcast news for the automatic analysis and summarization of TV programs. Multimodal sentiment analysis technologies can also be used to identify politically persuasive content (Siddiquie et al., 2015). These technologies make it possible, easy and fast to obtain and mine opinions expressed through the many broadcast TV channels or any other online channels on the Internet (Langlet and Clavel, 2016).

Multimodal Sentiment Analysis Challenges
Feature extraction in sentiment analysis faces different problems such as redundancy, domain dependency, difficulty in implicit feature identification and limited work on lexico-structural features. The following are the general challenges in feature extraction identified by different researchers (Yildirim et al., 2004), (Pennebaker et al., 2015b) and (Redondo et al., 2007):
• Domain Dependency: The performance of classification and clustering based on feature extraction techniques is domain dependent, which creates cross-domain and generalization problems (Redondo et al., 2007). One solution is clustering: a clustering process is used to improve the categorization of documents (De Luca and Nürnberger, 2006b) and (De Luca et al., 2004) show that a clustering process can enhance semantic classification
• High Dimensionality: Large feature sets cause performance degradation due to computational problems, so proper feature selection methods are essential (Wilson et al., 2005)
• Different Writing Styles: The same word can be considered positive in one situation and negative in another. For example, the word 'long' expresses a positive opinion in the sentence 'The laptop battery's life is long' but a negative opinion in the sentence 'The laptop boot time is long'. In addition, people's opinions change over time
• Comparative Manner Expression: A serious challenge in sentiment analysis comes from the fact that people usually express their opinions in a comparative manner; they express positive and negative reviews in the same sentence
• Context Quality: Multimedia content on social media is a rich resource of data that provides scale, but the quality and context of recorded material can vary and the data is limited to certain demographics that are more represented on the internet (Poria et al., 2017)

Conclusion
Sentiment analysis is mainly focused on the automatic recognition of opinions' polarity, as positive or negative. Multimodality is defined as analyzing more than one modality; Multimodal Sentiment Analysis refers to the combination of two or more input modalities in order to improve the performance of the analysis. A huge number of videos are uploaded online continuously, so the analysis of such media is important; the automatic analysis of multimodal opinion involves a deep understanding of natural language, audio and video processing, which researchers are continuing to improve. This paper provides a comprehensive overview of the concept and goals of multimodal sentiment analysis and, at the end, discusses some challenges related to the field. We found that most research is based on the English language and rarely targets other languages. We also presented the most widely available datasets. This review encourages further research in this field; in future work we will focus on methods targeted at the Arabic language.