Fusion of Features and Extreme Learning Machine for Facial Expression Recognition

Human emotion is highly correlated with facial expressions. Because demand for emotion recognition is growing in different sectors, a method that recognizes emotion through facial expressions is proposed. The input image is preprocessed, and the resulting image is segmented into four facial expression regions using a newly proposed segmentation method. Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP) features are fused to extract the necessary features from the four segmented parts. The dimension of the feature vector is reduced using Principal Component Analysis (PCA). An Extreme Learning Machine (ELM) is used to classify the expressions. To evaluate the performance of the proposed method, three widely used and publicly available facial expression datasets (JAFFE, CK+, RaFD) are used. The proposed method achieved 95.3%, 99.84% and 98.65% accuracy on images from the JAFFE, CK+ and RaFD datasets respectively. The performance of the proposed method is compared with that of other facial expression recognition methods on the same datasets to indicate that the proposed method achieves state-of-the-art performance.


Introduction
The ability to detect the mental condition and feelings of a person can be of great importance. Facial expressions play a vital role in both nonverbal and social communication. According to Mehrabian (1968), facial expressions contribute the most (55%) in reflecting the feelings of a speaker. Successful recognition of human emotion and feelings is therefore highly dependent on the successful recognition of facial expressions. Recognized facial expressions can be used in different sectors of our lives to improve the everyday experience. They can be used for security purposes as well.
Whether a person is mentally fit for a sensitive task can be determined by recognizing the emotion of that person in advance. Social networks can incorporate emotion recognition and suggest posts for users' timelines depending on the expression in their uploaded photos. Robots can be given this feature to extract the best out of them. As the world is heading towards automation, the ultimate target is for machines to recognize facial expressions in real time, as humans do spontaneously.
Seven basic facial expressions are usually considered in Facial Expression Recognition (FER) problems: neutral, fear, disgust, sad, happy, angry and surprise (Sandbach et al., 2012). Other expressions, such as contempt and pain, are considered in some works, and some works exclude the neutral expression. Because FER has applications in many sectors, researchers have been trying to develop FER systems that recognize expressions accurately in the least possible time. Expression recognition has therefore been a prime research topic in computer vision, image processing and human-computer interaction for the last few decades.
FER systems usually include image preprocessing, feature extraction and classification steps. Different methods use different techniques to perform these steps. Gabor wavelets (Islam et al., 2018a), Principal Component Analysis (PCA) (Mahmud et al., 2015), Linear Discriminant Analysis (LDA) (Mahmud et al., 2015), Scale Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), moments and the gray level co-occurrence matrix are a few of the popular feature extraction techniques. Each of them has some drawbacks. For example, Gabor wavelets are effective against photometric disturbances, but the high dimension of the extracted Gabor feature vector is a concern. Considering these issues, a fusion of Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP) is used to extract the features. This paper is not the first to combine the two; several FER systems have already used this combination successfully (An and Ruan, 2016; Islam et al., 2018b; 2018c).
For classification, Artificial Neural Networks (ANN) (Islam et al., 2018a), multiclass Support Vector Machines (SVM) (Islam et al., 2018b), SVM (Jayalekshmi and Mathew, 2017), random forests (Jayalekshmi and Mathew, 2017), nearest neighbor classification (Agrawal and Yadav, 2018), AdaBoost classification (Verma and Dabbagh, 2013) and Extreme Learning Machine (ELM) (Liu et al., 2015; Ghimire and Lee, 2014; Alphonse and Dejey, 2016) have been widely used. Nowadays, deep learning is also used extensively to classify expressions (Fathallah et al., 2017). ANN requires a long training time, many training samples and many adjustable parameters to be tuned. Choosing a proper kernel function and regularization term, overfitting and training time are challenges with SVM. The number of neighbors to be considered and the distance metric are crucial when using nearest neighbor classification. Noisy data can be problematic when using AdaBoost classification. To handle these problems, the proposed method uses ELM to classify the expressions into their corresponding classes. The next section briefly describes the proposed method, and the following sections describe each of its steps in detail. Finally, the last few sections analyze the performance of the proposed method, compare it with other methods and discuss the limitations and future possibilities of the work.

Proposed Method
First, the input image is converted to grayscale. The face region is detected in the grayscale image using the Viola-Jones face detection method (Viola and Jones, 2001). The facial image is then resized to a fixed size. These three steps are collectively termed image preprocessing. Following the proposed image segmentation method, the preprocessed image is segmented into four facial expression regions (right eye, left eye, nose, mouth). From these segmented parts, features are extracted using HOG and LBP. PCA is then used to reduce the dimensionality of the feature vector. Finally, ELM is trained with some sample images, and the rest of the images are used to test the performance of the system. The whole process is illustrated in Fig. 1.

Image Preprocessing
The input image may be a color image or a grayscale image. If it is a color image, it is converted to grayscale; otherwise, this step is omitted. If the grayscale image contains a face, the facial region is detected using the Viola-Jones face detection method (Viola and Jones, 2001). The facial region is then resized to a fixed size of 150×150 pixels for use in the image segmentation step. Images from the JAFFE, CK+ and RaFD datasets have dimensions of 256×256, 640×490 and 681×1024 respectively. After applying the Viola-Jones face detection technique, the resulting facial images have dimensions of 155×155, 319×319 and 388×388 respectively. For a uniform system, it is necessary to convert them to a fixed size. They are converted to 150×150 because converting to a much lower or much higher dimension could cause the image to lose vital information. The process is shown in Fig. 2 on a sample image from the RaFD dataset (Langner et al., 2010).
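The grayscale conversion and resizing steps can be sketched as follows. This is only an illustration under stated assumptions: the luminance weights, the nearest-neighbour resizing and the `face_box` parameter (a hypothetical stand-in for the bounding box returned by a Viola-Jones detector, which is not implemented here) are not taken from the paper.

```python
import numpy as np

def to_grayscale(image):
    """Return a 2-D float array; a colour image (H, W, 3) is reduced to one channel."""
    image = np.asarray(image, dtype=float)
    if image.ndim == 2:                      # already grayscale: the step is omitted
        return image
    weights = np.array([0.299, 0.587, 0.114])  # assumed RGB luminance weights
    return image[..., :3] @ weights

def resize_nearest(gray, size=150):
    """Resize a 2-D array to size x size using nearest-neighbour sampling (assumed)."""
    h, w = gray.shape
    rows = np.arange(size) * h // size       # source row for every target row
    cols = np.arange(size) * w // size       # source column for every target column
    return gray[np.ix_(rows, cols)]

def preprocess(image, face_box=None, size=150):
    """Grayscale -> crop to the detected face (if a box is given) -> fixed size."""
    gray = to_grayscale(image)
    if face_box is not None:                 # hypothetical detector output (x, y, w, h)
        x, y, w, h = face_box
        gray = gray[y:y + h, x:x + w]
    return resize_nearest(gray, size)

# A CK+-sized colour image (640 wide, 490 high) with a made-up 319x319 face box.
face = preprocess(np.ones((490, 640, 3)), face_box=(100, 50, 319, 319))
print(face.shape)
```

In practice the face box would come from a trained cascade detector; the placeholder box above only demonstrates the data flow.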

Image Segmentation
The purpose of the image segmentation step is to segment the facial image into the four parts that contribute most to facial expressions. Different methods are available to segment these parts from a facial image. The Viola-Jones object detection method (Viola and Jones, 2001) and Active Appearance Models (AAM) (Cootes et al., 2001) are two established methods, but each has drawbacks in real-world problems. The Viola-Jones method fails to detect the eyes when they are closed or almost closed. Dependency on proper initialization and wide object variations in the training set are challenges with AAM. To overcome these problems, a manual segmentation method is proposed in this paper to segment the right eye, left eye, nose and mouth from any facial image of size 150×150. First, four coordinate points are defined, one for each part, and then four regions are cropped using four defined width and height values. Consider the example of segmenting the nose. First, the coordinate point (54.33, 81.84) is located in the preprocessed facial image of size 150×150. Then a region with width 45.43 and height 38 is selected and used as the nose region. The same procedure, with different coordinate, width and height values, is applied to all four parts. The values required for segmenting the four parts from a preprocessed image of size 150×150 are given in Table 1. These values were defined by analyzing many facial images and the positions of the four parts in those images. The analyzed images were of people of different races and ethnic groups from different parts of the world. The challenge with this segmentation method was to segment the four parts as accurately as possible while keeping their dimensions as small as possible. When the values of Table 1 are applied to an image of size 150×150, the four parts are segmented as shown in Fig. 3.
The whole image segmentation method is illustrated step by step on a sample image from the RaFD dataset (Langner et al., 2010) in Fig. 4.
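The fixed-coordinate cropping can be sketched with the nose values quoted above. The sketch assumes that the coordinate pair is the (x, y) position of the top-left corner of the region and that fractional values are rounded to the nearest pixel; neither is stated in the text. The values for the other three parts are in Table 1 and are not reproduced here.

```python
def crop_region(image, x, y, width, height):
    """Crop a rectangle from a row-major 2-D image given as a list of rows.

    Assumption: (x, y) is the top-left corner and values are rounded to pixels.
    """
    left, top = int(round(x)), int(round(y))
    right, bottom = int(round(x + width)), int(round(y + height))
    return [row[left:right] for row in image[top:bottom]]

# Nose values quoted in the text for a 150x150 preprocessed face.
NOSE = dict(x=54.33, y=81.84, width=45.43, height=38)

# Synthetic 150x150 "image" whose pixel value encodes its own position.
face = [[r * 150 + c for c in range(150)] for r in range(150)]
nose = crop_region(face, **NOSE)
print(len(nose), len(nose[0]))  # rows, columns of the cropped nose region
```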

Feature Extraction
A fusion of HOG features and LBP features is used to extract useful features from the segmented facial parts.

A. Histogram of Oriented Gradients (HOG)
Robustness against geometric and photometric transformations is the key benefit that led to the use of HOG as a feature extraction method. Due to its effectiveness, it has been used in many image processing tasks. Dalal and Triggs (2005) showed a widely accepted use of HOG to detect objects. The steps to calculate HOG features from an image are given below:
- Create small adjacent cells by dividing the whole image
- Calculate both the gradient magnitude and direction for all pixels in each cell
- Compute the appropriate bin for every gradient from its magnitude and direction, and represent each cell using a histogram of gradients
- Build blocks from adjacent cells and calculate the feature vector by performing block normalization
Choosing an appropriate number of bins and appropriate cell and block sizes is challenging and problem specific. In the implementation, unsigned gradient orientations, 9 bins in the histogram, cells of size 8×8 pixels and blocks of size 2×2 cells were used. Fig. 5 shows feature extraction on the left eye of a sample image from the RaFD dataset (Langner et al., 2010) using HOG.
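The steps above can be sketched in NumPy with the stated settings (unsigned orientation, 9 bins, 8×8 cells, 2×2 blocks). This is a simplified illustration, not the implementation used in the paper: it assumes central-difference gradients, hard assignment of each pixel to a single bin (no interpolation), a block stride of one cell and L2 block normalization.

```python
import numpy as np

def hog_features(gray, cell=8, bins=9, block=2, eps=1e-6):
    """Simplified HOG descriptor of a 2-D grayscale patch."""
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)                        # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0    # unsigned orientation
    bin_index = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    n_rows, n_cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((n_rows, n_cols, bins))
    for r in range(n_rows):
        for c in range(n_cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            # Magnitude-weighted vote of every pixel in the cell for its bin.
            hist[r, c] = np.bincount(bin_index[sl].ravel(),
                                     weights=magnitude[sl].ravel(),
                                     minlength=bins)

    blocks = []
    for r in range(n_rows - block + 1):               # blocks overlap by one cell
        for c in range(n_cols - block + 1):
            v = hist[r:r + block, c:c + block].ravel()
            blocks.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))  # L2 normalization
    return np.concatenate(blocks)

patch = np.random.default_rng(0).random((32, 48))     # synthetic stand-in for a facial part
print(hog_features(patch).shape)
```

With these settings each block contributes 2×2×9 = 36 values, so the descriptor length depends only on the size of the segmented part.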

B. Local Binary Patterns (LBP)
LBP can handle illumination changes and is computationally efficient. LBP has been widely used in many problems such as image analysis, facial analysis and motion analysis. A popular work using LBP for texture analysis is that of Ojala et al. (2002). The steps to calculate LBP features of an image are as follows:
- Create small cells with a specified radius and number of neighbors from the whole image
- Perform thresholding considering the center pixel and its neighbor pixels
- A binary number (LBP) is found as the outcome of thresholding; convert it to a decimal number (LBP)
- Store the 'count' of each LBP
- Calculate the histogram over the cell, considering the frequency of each 'count' occurring
- Concatenate the histograms of all the cells to compute the feature vector
Choosing proper values for the radius and the number of neighbors is crucial and problem dependent. In the implementation, a radius of 1 with 8 neighbors surrounding the center pixel was used.
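A minimal sketch of the thresholding and histogram steps for a radius of 1 and 8 neighbors is shown below. It assumes the 8 neighbors are the pixels of the surrounding 3×3 square (no circular interpolation), a clockwise bit order starting at the top-left pixel, and a plain 256-bin histogram over one region; the paper does not specify these details.

```python
import numpy as np

def lbp_codes(gray):
    """Decimal LBP code of every interior pixel (radius 1, 8 neighbours)."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left pixel (assumed bit order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dr, dc) in enumerate(offsets):
        neighbour = gray[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        # Thresholding: the bit is 1 where the neighbour is >= the centre pixel.
        codes |= (neighbour >= center).astype(np.int64) << bit
    return codes

def lbp_histogram(gray):
    """Normalized histogram of LBP codes over one region (256 bins)."""
    hist = np.bincount(lbp_codes(gray).ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# Only the top-left neighbour reaches the centre value, so only bit 0 is set.
print(lbp_codes([[9, 0, 0], [0, 5, 0], [0, 0, 0]]))
```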

C. Fusion
First, HOG features are extracted from the four segmented parts, and then LBP features are extracted from the same four parts. The final feature vector is formed by concatenating the HOG and LBP features. The final feature vector had a length of 1892, of which 1656 features were contributed by the HOG descriptor and the remaining 236 by LBP.
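The concatenation can be sketched as below. The total lengths (1656 HOG features, 1892 overall) come from the text; the per-part lengths used for the placeholder vectors are hypothetical, chosen only so that they add up to the reported totals. The LBP share of 236 equals 4 × 59, which would be consistent with a 59-bin uniform-pattern histogram per part, although the text does not state this.

```python
import numpy as np

HOG_LENGTH = 1656                        # reported in the text
TOTAL_LENGTH = 1892                      # reported in the text
LBP_LENGTH = TOTAL_LENGTH - HOG_LENGTH   # remaining features contributed by LBP

def fuse(hog_parts, lbp_parts):
    """Concatenate HOG vectors of all parts, followed by LBP vectors of all parts."""
    return np.concatenate([np.ravel(p) for p in list(hog_parts) + list(lbp_parts)])

# Placeholder vectors with hypothetical per-part lengths (not from the paper).
hog_parts = [np.zeros(n) for n in (400, 400, 424, 432)]
lbp_parts = [np.ones(59) for _ in range(4)]
fused = fuse(hog_parts, lbp_parts)
print(LBP_LENGTH, fused.shape)
```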

Dimension Reduction
PCA can represent data in terms of its principal components (Smith, 2003). Because of this ability, PCA is used to reduce the dimension of the feature vector.
The steps of PCA are as follows:
- Normalize the data and calculate the covariance matrix
- Calculate the eigenvectors and eigenvalues of the covariance matrix
- Analyze the principal components and express the data in terms of the components
In the implementation, PCA was configured to retain 99% of the variance, which resulted in up to 89.85% dimension reduction. For example, with the JAFFE dataset (Lyons et al., 1998), 1892 features were reduced to 192 features, which is an 89.85% (1 - 192/1892) dimension reduction. Although the dimensions of the four segmented parts differed from each other, each part had the same dimension across all sample images, and so did their sum. Hence, every image presented to the classifier had the same number of features for each segmented part, and there was no conflict when comparing them for classification.
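A minimal NumPy sketch of PCA with a retained-variance threshold follows. It assumes mean-centering as the only normalization and uses the singular value decomposition of the centered data, which yields the same principal directions as the eigen-decomposition of the covariance matrix; the function names are illustrative. The last lines evaluate the reduction ratio for the JAFFE example of 1892 features reduced to 192.

```python
import numpy as np

def pca_fit(X, variance_to_keep=0.99):
    """Return the mean and the leading components that retain the requested variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the eigenvectors of the covariance matrix of the centred data.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)               # variance ratio per component
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    return mean, vt[:min(k, len(s))]

def pca_transform(X, mean, components):
    """Project data onto the retained principal components."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

# Synthetic data that is essentially two-dimensional inside a 20-dimensional space.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20)) + 1e-6 * rng.normal(size=(100, 20))
mean, components = pca_fit(X)
Z = pca_transform(X, mean, components)

reduction = 1 - 192 / 1892                            # JAFFE example from the text
print(Z.shape, round(100 * reduction, 2))
```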

Classification
Extreme Learning Machine (ELM) is a single hidden layer feed-forward neural network. It is so named because of its ability to learn thousands of times faster than traditional neural network approaches. ELM works by finding the minimum norm least squares solution of a linear system. Unlike networks trained with backpropagation, ELM does not need to propagate the current state of the objective function back through the network; the solution is found in a single pass, so the algorithm is extremely fast. With ELM, the only parameter that must be tuned is the number of nodes in the hidden layer, whereas conventional neural network approaches require many parameters to be tuned to extract the best out of the algorithm. ELM also aims to use the minimum norm of weights, which leads to better generalization performance. These properties make ELM a strong choice for classification tasks in image processing problems like FER (Huang et al., 2004). Given a training set $\aleph = \{(x_j, t_j) \mid x_j \in \mathbb{R}^n, t_j \in \mathbb{R}^m, j = 1, 2, \ldots, N\}$, the number of nodes in the hidden layer $L$ and the hidden node output function $g(w, b, x)$, the algorithm is as follows:
- Randomly generate input weights $w_i$ and biases $b_i$, $i = 1, 2, \ldots, L$
- Use the following formula to calculate the hidden layer output matrix $H$:
$$H = \begin{bmatrix} g(w_1, b_1, x_1) & \cdots & g(w_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ g(w_1, b_1, x_N) & \cdots & g(w_L, b_L, x_N) \end{bmatrix}_{N \times L}$$
- Calculate the output weights as the minimum norm least squares solution $\beta = H^{\dagger} T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$ and $T = [t_1, \ldots, t_N]^{T}$
A sigmoid activation function and 2500 nodes in the hidden layer were used in the implementation.
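The training procedure can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the implementation used in the paper: input weights and biases are drawn uniformly from [-1, 1], targets are one-hot encoded, and the toy example uses 50 hidden nodes on synthetic two-class data instead of the 2500 nodes reported above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, seed=0):
    """Train an ELM: random hidden layer, then least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, size=(X.shape[1], n_hidden))  # random input weights (assumed range)
    b = rng.uniform(-1, 1, size=n_hidden)                # random biases (assumed range)
    H = sigmoid(X @ W + b)                               # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                         # minimum norm least squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Predict the class with the largest output score."""
    return np.argmax(sigmoid(X @ W + b) @ beta, axis=1)

# Toy data: two well-separated clusters standing in for two expression classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
T = np.eye(2)[y]                                         # one-hot targets
W, b, beta = elm_train(X, T, n_hidden=50)
print(np.mean(elm_predict(X, W, b, beta) == y))          # training accuracy on the toy data
```

No iterative weight updates occur: the only learned quantity is `beta`, obtained from one pseudo-inverse.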

Result Analysis
A computer with a 64-bit operating system, 4 GB of memory and a Core i5 processor was used to implement the proposed FER method. Three publicly available facial expression datasets, JAFFE (Lyons et al., 1998), CK+ (Lucey et al., 2010) and RaFD (Langner et al., 2010), were used. All 213 images from the JAFFE dataset, 1219 facial expression images of 22 people from the CK+ dataset and 1407 front-facing facial expression images of 67 people from the RaFD dataset were used. Sample images from these datasets are shown in Figs. 6 to 8. Seven facial expressions were considered; contemptuous expressions from the RaFD dataset were not considered. K-fold cross-validation was used to avoid biased results. The training and testing sets were split randomly.
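A random K-fold split can be sketched as follows. The value K = 10 and the fixed random seed are illustrative assumptions; only the JAFFE image count of 213 is taken from the text.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k randomly assigned folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # random assignment of samples
    folds = [indices[i::k] for i in range(k)]      # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Example: the 213 JAFFE images with an illustrative K of 10.
splits = list(k_fold_indices(213, 10))
print([len(test) for _, test in splits])
```

Each image is used for testing exactly once, and the reported accuracy would be the average over the K test folds.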
The average accuracy achieved by the proposed method on these three datasets is given in Table 2. Here, accuracy means Correct Recognition Rate (CRR). The average accuracy of the traditional method is also given. The traditional method refers to extracting features from the full face, without any segmentation or region of interest. The feature length without any dimension reduction is also given in Table 2. There is not a large difference between the average accuracy achieved by the proposed method and that of the traditional method, but there is a large difference in the length of the feature vector. Dimension reduction techniques can of course be applied to reduce the dimension of the feature vector, as done in this paper. However, reducing 1892 features to a few hundred and reducing 10463 features to a few hundred are not computationally equivalent. The latter would take longer and would be a hindrance to creating a real-time system. A comparison of the performance of the proposed method on the three datasets is shown in Fig. 9, which plots CRR against the number of folds for the three datasets. It shows that with a limited amount of training data the results are inferior on the JAFFE dataset, whereas with fewer training images the results are quite similar for the CK+ and RaFD datasets. As the number of folds, and thus the number of training images, increases, the correct recognition rate also increases. Accuracy is not always the ultimate metric for analyzing an FER system, so confusion matrices are also analyzed. Tables 3 to 5 present confusion matrices for the JAFFE, CK+ and RaFD datasets respectively on random test cases. From the confusion matrix, it is evident that sad expression images had the lowest CRR for the JAFFE dataset in that particular test case; the system classified some of the sad images as anger images and some as disgust images.
Fear expression images had the lowest correct recognition rate for the CK+ dataset on a random test case. Only neutral and disgust images were classified properly while considering the RaFD dataset. Because these are random test cases, the same confusion matrices are not guaranteed when new test cases are generated.
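The relationship between a confusion matrix and CRR can be sketched as below. The sketch assumes that rows hold the true classes and columns the predicted classes; the 3×3 matrix contains made-up numbers for illustration only and is not taken from Tables 3 to 5.

```python
def overall_crr(cm):
    """Fraction of all test images that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def per_class_crr(cm):
    """Correct recognition rate of each true class (rows = true, columns = predicted)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

# Made-up counts for three classes, purely illustrative.
toy_cm = [
    [9, 1, 0],
    [0, 10, 0],
    [2, 0, 8],
]
print(overall_crr(toy_cm), per_class_crr(toy_cm))
```

The per-class values show which expression is confused most often, information that a single accuracy figure hides.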

State-of-The-Art
Comparing the performance of the proposed method with other available FER methods shows its relative position. Table 6 compares the performance of the system with other FER systems on the three datasets used. Liu et al. (2015) used the Viola-Jones face detection method to detect faces in images and then used an Active Appearance Model (AAM) to segment the facial images into parts. Gabor filters were used to extract features from those parts and PCA was used to reduce the dimension of the feature vector. Finally, ELM was used to classify the obtained features. Training and testing sets were generated from the JAFFE and CK+ datasets, and their system achieved 94% and 95% accuracy on these two datasets respectively. Ghimire and Lee (2014) used HOG to extract features from facial images, applied an ensemble of ELMs and selected the winner by a voting scheme. They used images from the JAFFE and CK+ datasets and achieved 94.37% and 97.3% CRR respectively. Alphonse and Dejey (2016) used images from the JAFFE dataset to develop an FER system. They considered 77 facial points from each facial image as features and also extracted features from the facial images using Gabor filters. The features were then fused and PCA was used to reduce the dimensionality. Finally, they used Extreme Learning Machine, Support Vector Machine, K Nearest Neighbour and Partial Least Squares (PLS) to classify the features, and concluded that ELM outperformed the other classifiers. Rao et al. (2015) proposed a SURF boosting framework and used gentle AdaBoost with logistic regression for classification. They used images from the RaFD dataset for their experiment. Mavani et al. (2017) developed an FER system using visual saliency and deep learning. They used the CFEE and RaFD datasets; by training with the CFEE dataset and testing with the RaFD dataset they obtained an accuracy of 95.7%. Zeng et al. (2018) obtained 95.79% accuracy when testing images from the CK+ dataset on their proposed system, which considered facial geometric features and appearance-based features. Deep sparse autoencoders were the basis of their framework.

Conclusion
A unique, simple yet effective image segmentation method, together with a fusion of features and the selection of a proper classification algorithm, led to higher accuracy compared to other FER systems. One of the most controversial steps of the proposed method is its image segmentation method. The segmentation process might initially seem absurd, but the result section justifies its effectiveness. The system was tested with almost 3000 front-facing images and all of them were handled properly. However, if a person has an unusual facial structure, the facial image might not be segmented properly due to the unusual position of the four parts. Some facial expressions are tough even for a human to recognize, let alone machines.
For the time being, the system can handle only front-facing images, which is an obstacle to creating a robust system. Developing a more robust system capable of handling images rotated at any angle in any direction is left as future work. Besides facial information, vocal information also contributes to determining the emotion of a person. Developing a system that recognizes the emotion of a person using both facial and vocal information would be another interesting direction in the near future.