Symbolic Aggregate approXimation-Local Binary Pattern Feature Descriptor Combination for Automatic Facial Expression Recognition

Automatic identification of facial expressions is a significant research area with anticipated use in real-time processing in the Human-Computer Interaction domain. Along with an efficient classifier for assigning a class label to each input face image, a strong feature vector is necessary for training the classifier. This paper proposes an effective combination of the Local Binary Pattern (LBP) and Symbolic Aggregate approXimation (SAX) methods for generating the feature vector. Twenty-one facial patches are extracted from the face image, and the LBP value and SAX string of these twenty-one patches are used for feature vector generation. The feature vectors of the images are submitted to the Ensemble Bag classifier for training. Images which were not used for training are used for testing. An average accuracy of 98.7% was obtained when tested on the JAFFE data set for seven expressions, and an accuracy of 96.96% was obtained for nine expressions on the fused database. A detailed analysis of the testing conducted on images with partial occlusion and illumination variance is presented here.


Introduction
The Internet of Things (IoT) is an emerging field in which communication between humans and machines becomes a pivotal element. The digital era demands computation based on human thought processes. Human thoughts and emotions are communicated through facial expressions, and affective computing allows these to be conveyed to machines effectively (Cruz et al., 2014). Ekman and Friesen (1971) published a paper titled "Constants across cultures in the face and emotions" in which the six facial emotions of happiness, sadness, anger, fear, surprise and disgust are suggested as universal human expressions. Suwa et al. (1978) presented their research on the analysis of facial expressions from a sequence of facial images. However, research in this field gathered pace only after 1991.
Facial expression identification begins with the detection of the face in an image. Many popular face detection algorithms are used for this purpose. The Viola-Jones face detection algorithm is a robust method which detects the face using Haar features. The next important step is the generation of feature descriptors from the face image which carry enough detail for recognition. Two types of feature extraction can be considered: local and global. In local feature extraction, only a localised set of pixels is considered and the features are extracted from these pixels alone. In global feature extraction, the whole image matrix is used for computing the features. Some of the widely used feature extraction methods are Principal Component Analysis (PCA), used by Bakhshi et al. (2016) in their work "Face Recognition using SIFT, SURF and PCA for Invariant Faces"; the Eigenface method proposed by Turk and Pentland (1991); the Fisherface method proposed by Belhumeur et al. (1997); and Local Binary Pattern (LBP), used by Lajevardi and Hussain (2012). Selection of the classifier is yet another important research area for facial expression identification. Some of the popular classifiers used in this application are Support Vector Machines (SVM), k-Nearest Neighbour (kNN), Convolutional Neural Networks (CNN), Hidden Markov Models (HMM), rule-based classifiers, multi-class SVM, Bayesian networks and decision trees.

Challenges and Motivation
An emotion recognition system can be at risk when the appearance of the input face image differs from that of the training images in aspects such as illumination, presence of occlusion and pose. Imaging parameters such as resolution, focus and noise also play a vital role.
The increasing demand for effective Human-Machine-Interface (HMI) based applications is a further motivation for this work.

Applications
Facial expression recognition can be utilised in scenarios where robots are used as workers in industries or in household activities. There have been many incidents where robots harmed human beings by hitting them, because they failed to recognise the presence of a human being. The working environment of such places can be made far better by adding a module to recognise the various expressions elicited by humans.
In IoT-driven healthcare sectors, facial expressions can be included to improve effectiveness. For example, if a medical procedure is being carried out which demands the active functioning of brain cells and sensory organs, the patient is not supposed to fall asleep during the procedure. By adding a facial expression recognition module, it becomes easy to identify the state of the patient and generate appropriate alert signals.
The rest of the paper is organised as follows: Section 2 describes related research carried out in past years, Section 3 describes the proposed methodology and the algorithm used for its implementation, Section 4 describes the experimental setup, Section 5 discusses the results obtained and Section 6 gives the summary.

Related Work
Studies in this field indicate that only those portions of the face which are responsible for eliciting the various expressions need to be examined and processed. Zhong et al. (2012) propose that 19 active facial patches are necessary for categorising the six universal facial expressions. This calls for identifying and extracting these facial patches from the image. The changes in the texture of the 19 facial patches can be taken into account for expression recognition (Happy and Routray, 2014). The Regions Of Interest (ROI) for the eyes and nose are located with the help of Haar classifiers, which return rectangular regions around the detected facial features. The coordinates of these rectangles can be used for localising other facial features. Essa (1997) describes a computer vision system for finding facial motion by using an optimal estimation optical flow method along with geometric, physical and motion-based dynamic models describing the facial structure.
Feature descriptors must be selected very carefully to obtain accurate classification results. Local Binary Pattern (LBP) is one of the popular methods which can efficiently describe the textural content necessary for this application. Lajevardi and Hussain (2012), in their paper titled "Automatic facial expression recognition: Feature extraction and selection", propose this method as a strong feature descriptor. Principal Component Analysis (PCA) of faces can be utilised as another feature descriptor which is capable of identifying and extracting the inter-correlated variables present in the image. The visual information necessary for recognising various emotions can be made available from PCA (Calder et al., 2001). The Grey Level Co-occurrence Matrix (GLCM) is yet another textural descriptor for image features. Haralick et al. (1973) describe how GLCM can provide a second-order distribution of grey levels which is helpful in identifying the textural details in an image. The relative positions of neighbouring pixels can be obtained from GLCM. In the research article "Facial expression recognition development and applications to HMI", Cherifa and Eddine (2018) show how HMI can benefit from facial expression recognition systems.
Much research has been conducted on classification algorithms for facial expression analysis. In Support Vector Machines (SVM), learning is based on statistical learning theory and a hyperplane is created as the decision boundary. Kernel functions map the input data into a space in which such a hyperplane can separate the classes, which is advantageous for input data that is not linearly separable. Michel and El Kaliouby (2015), in their paper titled "Facial expression recognition using SVM", propose an SVM classifier for automatic recognition of six facial expressions; they achieved a total accuracy of 87.9% on the Cohn-Kanade database. Dauda and Bhoi (2016) propose a combination of kNN and Hidden Markov Model (HMM) classifiers and observe that this pair is capable of achieving improved performance. Lopes et al. (2017) suggest that a Convolutional Neural Network (CNN) can be used for classifying facial expressions; they achieved an accuracy of 96.76% when tested on the CK+ database.

Objectives of the Work
The objectives of the work are to apply the SAX method to facial expression recognition and to verify the recognition accuracy on normal face images, partially occluded face images and face images with illumination variance.

Proposed Method
The proposed work recognises nine facial expressions: anger, disgust, fear, happiness, sadness, surprise, neutral, confused and sleepiness. An overview of the whole process is illustrated in Fig. 1.
Step 1: Pre-processing and Face Detection. Histogram equalisation has been applied to the input image to enhance its contrast by adjusting the pixel intensities (Celik, 2012). The Viola-Jones face detection algorithm has been used in this work for locating the human face in the given input image. The Haar-like features used in the Viola-Jones algorithm require fewer computations on the data and give fast detection of the face (Lienhart and Maydt, 2002).
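The contrast-enhancement part of Step 1 can be sketched as follows. This is a minimal illustration of standard histogram equalisation on an 8-bit greyscale image stored as a list of rows; the function name `equalize_histogram` and the data layout are assumptions made for this example, and the Viola-Jones face detection step is not reproduced here.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalise a greyscale image given as a list of rows of ints."""
    pixels = [p for row in image for p in row]
    total = len(pixels)

    # Count how often each grey level occurs.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of grey levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # A perfectly flat image has no contrast to stretch.
        return [row[:] for row in image]

    # Look-up table that spreads the occupied grey levels over the full range.
    lut = [max(0, round((c - cdf_min) * (levels - 1) / (total - cdf_min)))
           for c in cdf]
    return [[lut[p] for p in row] for row in image]


low_contrast = [[50, 50, 60], [60, 70, 70], [80, 80, 80]]
print(equalize_histogram(low_contrast))
```

In the toy input the grey levels occupy only the range 50 to 80; after equalisation they span the full 0 to 255 range while keeping their relative order.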
Step 2: Active Facial Patches Extraction. The various emotions elicited by the human face can be differentiated by analysing a few facial muscles, which can be marked as active facial patches in the image. The variations in the textural behaviour of these facial patches can be used for generating feature descriptors. Figure 2 illustrates the locations of the 21 extracted facial patches. In order to extract these patches, it is necessary to locate facial landmarks such as the eyebrows, eyes, nose and lip corners. This landmark detection has been done using the algorithms and methods proposed by Happy and Routray (2014) in their work "Automatic facial expression recognition using features of salient patches". Both irises have been located from the coordinate values obtained for the eyes.

Fig. 3: Images obtained after introducing partial occlusion

Step 3: Partial Occlusion Handling. The proposed work handles partial occlusion in the image by exploiting facial symmetry. Twenty-one facial patches are used for differentiating the facial expressions from one another. The patches are selected from the following locations: 2 patches from the medial corners of both eyebrows, 1 patch from the forehead, 2 patches from the irises, 1 from the nose, 2 patches from the lip corners, 1 from the middle of the lips, 6 patches from the left side of the nose and 6 from the right side of the nose. When the image is partially occluded, some of the facial patch values may be missing. This is overcome by selecting the facial patches in this symmetrical manner: even if one eyebrow corner, one side of the nose or one lip corner is occluded, it is possible to substitute the patch value from the corresponding patch on the other side. Figure 3 demonstrates the partial occlusion applied to the face image for testing. Partial occlusion was applied to the face image using MATLAB code.
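The symmetry-based substitution described in Step 3 can be sketched as below. The patch names, the `MIRROR` pairing table and the function `fill_occluded` are hypothetical, introduced only for this illustration; the sketch also copies the counterpart's value unchanged, whereas a real implementation might need to mirror the patch horizontally first.

```python
# Hypothetical patch names. Only the pairs the text mentions as substitutable
# (eyebrow corners, nose sides, lip corners) are mirrored here.
MIRROR = {
    "left_eyebrow": "right_eyebrow",
    "left_lip_corner": "right_lip_corner",
}
for i in range(1, 7):
    MIRROR[f"left_nose_{i}"] = f"right_nose_{i}"
# Make the table symmetric so that right-side patches map back to the left.
MIRROR.update({right: left for left, right in list(MIRROR.items())})


def fill_occluded(patches):
    """Return a copy in which each missing (None) patch takes its twin's value."""
    filled = dict(patches)
    for name, value in patches.items():
        if value is None:
            twin = MIRROR.get(name)
            if twin is not None and patches.get(twin) is not None:
                filled[name] = patches[twin]
    return filled


example = {
    "left_eyebrow": None,          # occluded
    "right_eyebrow": [12, 40, 7],  # visible
    "forehead": None,              # occluded, no symmetric counterpart
}
print(fill_occluded(example))
```

Patches without a symmetric counterpart, such as the forehead patch, stay missing in this sketch; the paper does not state how such cases are treated.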
Step 4: Feature Extraction. This work proposes a combination of two feature descriptors, LBP and SAX. LBP is a robust feature extractor that is largely insensitive to illumination changes, which helps in classifying facial images with variations in illumination. SAX, being based on approximating the data, offers dimensionality reduction of the feature descriptors while still retaining an approximation of the original data in its abstraction.

Local Binary Pattern (LBP)
Originally introduced by Ojala et al. (1996), LBP is a powerful textural feature extractor. To obtain the LBP operator, the image is divided into square grids of equal size. An LBP value is then generated from each grid by thresholding the neighbouring pixel values against the centre pixel value: any neighbour whose value is greater than the centre pixel value is assigned '1' and the others '0'. A binary string of 8 elements is obtained in this manner from each grid. A single value is then obtained by multiplying each bit by its corresponding binomial weight (a power of two) and summing the products. Figure 4 illustrates the working of the LBP operator on a 3×3 grid of pixels.
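The thresholding and weighting just described can be illustrated on a single 3×3 grid. This is a minimal sketch: the function name `lbp_code` is hypothetical, the clockwise neighbour order starting at the top-left corner is an assumption (the order in Fig. 4 may differ), and the strict "greater than" test follows the description above, whereas many implementations use "greater than or equal".

```python
def lbp_code(grid):
    """LBP value for the centre pixel of a 3x3 grid (a list of three rows)."""
    centre = grid[1][1]
    # Neighbours read clockwise from the top-left corner (an assumed order).
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if grid[r][c] > centre else 0 for r, c in order]
    # Weight bit i by 2**i and sum the products.
    return sum(bit << i for i, bit in enumerate(bits))


grid = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
print(lbp_code(grid))  # bits 0,0,0,0,1,1,1,1 -> 16 + 32 + 64 + 128 = 240
```

Changing the neighbour order or the starting corner changes the numeric code but not the information it carries, as long as the same convention is used for every grid.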

Symbolic Aggregate approXimation (SAX)
Symbolic Aggregate approXimation (SAX) is a symbolic representation of a time series, where a time series is a sequence of real numbers representing the measurements of a real variable at equal time intervals. An image can also be treated as a time series. From the literature it is clear that a symbolic representation of time series facilitates easy pattern recognition and computation (Chen et al., 2010).

Converting Time Series into SAX
In the conversion of a time series to SAX, the real-valued time series data is taken as the input and the corresponding symbolic representation of the input time series is computed as the output.
The conversion includes two major steps:
(i) Conversion of the time series to Piecewise Aggregate Approximation (PAA)
(ii) Conversion of the PAA to a string of defined symbols
Steps in the generation of PAA:
Step 1: Normalise the input series according to the application-specific domain.
Step 2: Divide the input time series of length 'N' into 'w' portions of equal length.
The value of 'w' is the parameter that determines the length of the PAA and of the output string. If 'w' has a higher value, more detailed information is available from the generated PAA, whereas if 'w' has a lower value, only an abstract of the input time series is available. Figure 5 shows the division of the input series into equal-sized portions.
Before the PAA conversion of the input time series is carried out, the lower and upper bounding values of the domain on which the PAA is applied are computed. This computation helps in the structural representation of data in that particular domain and in the analysis of the input time series.
Once the PAA conversion is over, the next step is the generation of the output string. This is done by assigning each of the 'w' portions of the time series (each of length 'N/w') a symbol from a predefined alphabet, using the look-up table of the particular application domain. In this step the average value of each portion is replaced by a symbol from the predefined alphabet set. Research in this area has found that an alphabet of 5 to 8 symbols gives a good representation of the underlying data (Chen et al., 2010). Table 1 illustrates the assignment of alphabets in PAA conversion when the alphabet size varies from 5 to 8. The input time series is divided into 'w' portions of equal length and each portion is assigned an alphabet according to its value. The value ranges of the alphabets are bounded by break-points chosen according to the domain of the application. The break-points assigned for the different alphabets are structured as a look-up table. The alphabets thus obtained are then concatenated sequentially to generate a SAX string. Figure 6 demonstrates how each portion of the time series is assigned an alphabet.
The generated SAX string is the feature vector which can be used for training the classifier. After training is completed the learned model can be tested for known or unknown patterns of time series of the same domain.
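The two conversion stages (time series to PAA, PAA to symbols) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the series is z-normalised, and the break-points are the equiprobable cut points of a standard normal distribution used in the original SAX formulation, whereas the break-points in Table 1 are domain-specific and may differ. The function names `paa`, `breakpoints` and `to_sax` are hypothetical.

```python
import bisect
from statistics import NormalDist, mean, pstdev


def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal portions."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes len(series) is divisible by w")
    size = n // w
    return [mean(series[i * size:(i + 1) * size]) for i in range(w)]


def breakpoints(alphabet_size):
    """Cut points that split a standard normal into equiprobable regions (assumed)."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def to_sax(series, w, alphabet_size=5):
    """Convert a real-valued series into a SAX string of length w."""
    mu, sigma = mean(series), pstdev(series)
    normalised = [(x - mu) / sigma if sigma else 0.0 for x in series]
    cuts = breakpoints(alphabet_size)
    letters = "abcdefgh"[:alphabet_size]
    # The number of cut points at or below a value is the index of its letter.
    return "".join(letters[bisect.bisect_right(cuts, v)]
                   for v in paa(normalised, w))


print(to_sax([0, 0, 10, 10, 0, 0, 10, 10], w=4, alphabet_size=4))  # adad
```

A larger `w` keeps more of the shape of the series, while a larger alphabet gives a finer quantisation of each portion's mean.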

Fig. 7: Illustration of assigning alphabets to pixel grids
The SAX string generated here from the figure is "Cbacedcd".

Implementation of SAX in Proposed Work
Image-to-SAX conversion in the proposed method is implemented by dividing the face image into 3×3 pixel grids of equal size. The average intensity value of the 9 pixels in each grid (block) is computed. This average value is then checked against the look-up table to obtain the corresponding alphabet. Figure 7 shows the computation of block mean values from an image.
Step 2: Divide the image into 'N' blocks of equal sized grid 3×3 pixels.
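The block-mean procedure can be sketched as follows. The look-up table here is an assumption: the 0 to 255 intensity range is split into equal-width bins, one per alphabet, since the break-points actually used are not reproduced in this section. The function names `block_means` and `image_to_sax` are hypothetical.

```python
def block_means(image, block=3):
    """Mean intensity of each non-overlapping block, scanned row by row."""
    rows, cols = len(image), len(image[0])
    means = []
    # Trailing rows or columns that do not fill a whole block are ignored.
    for r in range(0, rows - rows % block, block):
        for c in range(0, cols - cols % block, block):
            cells = [image[r + i][c + j]
                     for i in range(block) for j in range(block)]
            means.append(sum(cells) / len(cells))
    return means


def image_to_sax(image, alphabet="abcde", max_value=255):
    """Map each block mean to an alphabet using equal-width bins (assumed)."""
    width = (max_value + 1) / len(alphabet)
    return "".join(alphabet[min(int(m // width), len(alphabet) - 1)]
                   for m in block_means(image))


# A 3x6 toy image: one dark block followed by one bright block.
image = [
    [10, 10, 10, 250, 250, 250],
    [10, 10, 10, 250, 250, 250],
    [10, 10, 10, 250, 250, 250],
]
print(image_to_sax(image))  # ae
```

Each 3×3 block contributes one symbol, so the string length equals the number of complete blocks in the image or patch.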

Step 3: Compute the block mean, that is, the average intensity of the 9 pixels in each block, and obtain the corresponding alphabet from the look-up table.

For classification, the Ensemble Bag classifier is used, which combines many classifiers. When a feature vector of dimension "N" is submitted to the classifier, a random subset of the feature variables (attributes) is selected from the feature vector. Ensemble Bag uses supervised learning, in which learning for each class is driven by the class label assigned at training time. The training is done for nine classes. The feature descriptor used for training is one of the major factors that affect the performance of the classifier. When an input image with a class label is used in the training phase, the classifier adjusts the decision boundary for that class. This adjustment continues until the classifier has learned completely and further training causes no changes in the decision boundaries. Hyperplanes are formed as decision boundaries, which allow accurate classification even with feature sets which are not linearly separable. Testing of the proposed system has been carried out with images of nine expressions, with partially occluded images and with illumination-variant images containing shadow effects.
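The bagging idea behind this classifier (many base classifiers, each trained on a random resample with a random subset of features, combined by voting) can be sketched as below. This is not the MATLAB implementation used in the paper: bagged ensembles are commonly built on decision trees, and a nearest-centroid base learner is swapped in here purely to keep the example short. The functions `train_bagged` and `predict_bagged` and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_bagged(X, y, n_models=15, n_features=None, seed=0):
    """Train nearest-centroid learners on bootstrap samples and feature subsets."""
    rng = random.Random(seed)
    dims = len(X[0])
    k = n_features or max(1, dims // 2)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(X) training indices with replacement.
        idx = [rng.randrange(len(X)) for _ in X]
        # Random subset of feature positions for this base learner.
        feats = rng.sample(range(dims), k)
        sums, counts = {}, Counter()
        for i in idx:
            counts[y[i]] += 1
            acc = sums.setdefault(y[i], [0.0] * k)
            for j, f in enumerate(feats):
                acc[j] += X[i][f]
        centroids = {lab: [s / counts[lab] for s in acc]
                     for lab, acc in sums.items()}
        models.append((feats, centroids))
    return models


def predict_bagged(models, x):
    """Majority vote over the base learners."""
    votes = Counter()
    for feats, centroids in models:
        point = [x[f] for f in feats]
        nearest = min(centroids, key=lambda lab: sum(
            (p - q) ** 2 for p, q in zip(point, centroids[lab])))
        votes[nearest] += 1
    return votes.most_common(1)[0][0]


# Toy feature vectors for two well-separated classes.
X = [[i % 2, (i // 2) % 2, 0, 1] for i in range(10)] + \
    [[10 + i % 2, 10 + (i // 2) % 2, 10, 11] for i in range(10)]
y = ["happy"] * 10 + ["sad"] * 10
models = train_bagged(X, y)
print(predict_bagged(models, [0.5, 0.5, 0.5, 0.5]))
```

Because each base learner sees a different resample and feature subset, individual errors tend to differ between learners and are reduced by the vote.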

Experimental Setup
Dataset
JAFFE and Yale were the databases used for experiments and analysis.
The JAFFE database includes a total of 213 images (256×256 pixels) with seven facial expressions, the six basic expressions plus neutral, captured from 10 Japanese female subjects. In the proposed work JAFFE was used to test the seven facial expressions. The Yale data set was used for testing facial images with illumination variance. It contains images of 15 persons, including images with expressions such as sleepiness and images with spectacles. The size of each image is 320×243 pixels. Figure 9 shows samples taken from the JAFFE and Yale data sets.

Tools
Implementation was done in MATLAB R2017a on an Intel® Core™ i3-500U CPU with a 2.00 GHz processor frequency and 4 GB of RAM, running the 64-bit Windows 10 operating system.

Results and Discussion
Sample images from the JAFFE data set are shown in Fig. 8. Figure 9 shows intermediate results such as the active patches generated, the SAX image, the SAX string formation and the classifier outcome, in which a label is assigned to the input image. The accuracy of classification has been found by counting the occurrences of true positives and false positives. When testing was conducted for nine expressions on the JAFFE and Yale data sets, a total accuracy of 96.96% was achieved. The proposed work achieves an accuracy of 98.7% for recognition of seven expressions when evaluated on the JAFFE database. The confusion matrix obtained during testing with JAFFE for seven expressions is shown in Table 2. Table 3 shows the confusion matrix obtained for nine facial expressions on the JAFFE and Yale databases.
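As an illustration of how accuracy figures of this kind can be derived from a confusion matrix, the sketch below computes the overall accuracy and the per-class ratio of true positives to true plus false positives. The matrix is a made-up three-class example and is not taken from Table 2 or Table 3; the function names are hypothetical, and rows are assumed to be actual classes and columns predicted classes.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_precision(cm):
    """TP / (TP + FP) for each predicted class (one value per column)."""
    n = len(cm)
    result = []
    for j in range(n):
        predicted_as_j = sum(cm[i][j] for i in range(n))
        result.append(cm[j][j] / predicted_as_j if predicted_as_j else 0.0)
    return result


# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
cm = [
    [9, 1, 0],
    [0, 10, 0],
    [0, 0, 10],
]
print(round(overall_accuracy(cm) * 100, 2))  # 96.67
print(per_class_precision(cm))
```

In this toy matrix one sample of the first class is mislabelled as the second, which lowers the overall accuracy to 29/30 and the second class's precision to 10/11.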

(b) Experiments on Images with Partial Occlusion for Recognition of Expressions
Partial occlusion was successfully handled by the proposed work. Figure 11 illustrates the various stages of recognising the facial expression "Happiness" in a partially occluded image. Similarly, testing was conducted with all the possibilities, including images partially occluded at the right eyebrow, left eyebrow, right side of the nose, left side of the nose, right lip corner and left lip corner. Table 4 shows the average accuracy in percentage for each of the nine expressions. When the images are partially occluded at different locations in the face, the total average accuracy obtained by the proposed work is 91.42%.

Figures 10 and 11 show the intermediate results obtained during the testing process. An image with illumination variance was given whose expression was "Neutral". In Fig. 11 the expression "Neutral" has been recognised correctly and in Fig. 10 the expression "sad" has been recognised correctly. The steps completed during recognition are illustrated as follows: face detection, extraction of the 21 facial patches, SAX representation of the image, the SAX string generated for the 21 facial patches, the LBP histogram for the 21 facial patches and finally the expression recognition. Figure 12 illustrates the test results on an illumination-variant face image. Figure 13 illustrates the intermediate results and the recognition result for the expression sleepiness.

Key Findings
The proposed work was found to be effective in recognising nine facial expressions using the Ensemble Bag classifier. Experiments were carried out with partially occluded images and illumination-variant images. Partial occlusion was introduced at different positions on the face image and, by exploiting the property of face symmetry, we were able to compensate for the occlusion and identify the facial expression accurately. Illumination variance in the input images was overcome with LBP feature vectors along with the proposed SAX method.

Comparing the Results with Related Work
Table 5 shows a comparison of related research works with the proposed work on the JAFFE database for seven expressions. From the comparison it is evident that the accuracy of the proposed work for recognising seven expressions exceeds that of the related works.

Conclusion
An automatic facial expression recognition system for recognising nine facial expressions has been implemented, which handles partial occlusion, works effectively on images with illumination variance and works on person-independent images. A new combination of LBP and SAX feature descriptors for generating the feature vector has been proposed in this paper for automatically recognising nine facial expressions. Twenty-one active facial patches were extracted from the image and the feature vector for these twenty-one patches was extracted. The experiments were conducted on the JAFFE facial expression database and the Yale data set. The classification was done using the Ensemble Bag classifier, which uses a supervised learning strategy. An accuracy of 98.7% was achieved for the seven expressions available in the JAFFE database and an average accuracy of 96.96% was obtained for all nine expressions together on the fused database of JAFFE and Yale. Experiments were conducted successfully on normal face images, partially occluded face images and face images with illumination variance.