QUALITY BASED SPEAKER VERIFICATION SYSTEMS USING FUZZY INFERENCE FUSION SCHEME

Performances of single biometric speaker verification systems are outstanding in clean condition but drop significantly in noisy condition. Implementation of multibiometric systems is one of the solutions to this limitation. However, in order to ensure the performances of multibiometric systems are sustained, the optimum weight for the fusion system must be determined correctly according to the quality of current data. This study proposes the use of Fuzzy Inference System for weight inference. Two traits i.e., speech and lip are used while Support Vector Machine (SVM) is employed as the classifier in this study. The speech features are extracted using the Mel Frequency Cepstrum Coefficient (MFCC) method and the lip features are extracted using Region of Interest (ROI) method. The performances of single modal system (i


INTRODUCTION
Traditionally, verification has relied on passwords, keys or smart cards, which are less secure because passwords can be forgotten, keys duplicated and smart cards stolen. Nowadays, biometric data for verification systems are used commercially in data security, internet access, ATMs, network logins, credit cards and government records. Research on biometric systems has grown with the increasing demand for automatic information processing in many industrial fields (Chia and Ramli, 2011). Biometrics is defined as the development of statistical and mathematical methods applicable to data analysis problems in the biological sciences. Biometrics is also a technology that uses various individual attributes of a person to verify his or her identity. Biometric characteristics can be divided into two main classes, i.e., physiological and behavioral characteristics. Physiological characteristics refer to the human body, such as face, fingerprints, palm print, iris, DNA, hand geometry and finger vein structure, while behavioral characteristics are related to the actions of a person, such as voice, keystroke dynamics, gait, typing rhythm and signature (Jain et al., 2004). This study implements a biometric system for speaker verification. A speaker verification system verifies a person's claimed identity against the enrollment database using the speech signal as the input data.
Single biometric systems face several limitations, such as non-universality, noisy sensor data, large intra-user variations and susceptibility to spoof attacks. For example, a single biometric system that uses voice patterns to identify individuals may fail to operate because of a noisy data signal captured by the system. The limitations faced by single biometric systems can be overcome by applying a multibiometric system. A multibiometric system enhances the matching accuracy of a biometric system in noisy conditions and increases the population coverage by using multiple traits (i.e., lip, iris, voice and face). Studies on multibiometrics are further discussed in Ben-Yacoub et al. (1999) and Pan et al. (2000). Besides that, a multibiometric system may continue to operate even though a certain trait is unreliable due to user manipulation or to sensor or software malfunctions. However, this is only true when fusion is done at the decision level, where hard decision fusion (for example, the OR operator) is executed. For score level fusion, multibiometric systems are at their best performance only when all traits operate in clean conditions. In noisy conditions, the unreliable speech signal tends to cause the system to obtain false scores for genuine and imposter signals. This problem does not occur in clean conditions, since both the speech and lip signals give reliable scores for genuine and imposter signals.
This study proposes a quality based score fusion approach to improve the performance of multibiometric systems. Quality based fusion depends on the current condition of the input. This method is very useful for keeping the speaker verification system at its best performance, especially in noisy conditions. Quality based fusion uses a quality measure identification system to identify the quality of the sample data. Research on quality measure identification systems has been discussed in Fierrez-Aguilar et al. (2005) and Nandakumar et al. (2008). In order to take full advantage of quality based fusion approaches, this study implements the fusion mechanism for different biometric information. For this purpose, a Fuzzy Inference System is developed to infer the optimum weight for robust and reliable multimodal biometric based security systems. The use of fuzzy logic as the fusion scheme for the quality based fusion approach improves the system performance.
According to Vasuhi et al. (2010), fuzzy logic decision-making closely resembles human decision-making. Fuzzy design can accommodate the ambiguities of human languages and logic. It provides an intuitive method for describing systems in human terms and automates the conversion of those system specifications into effective models. Fuzzy logic has the ability to add human-like subjective reasoning capabilities to machine intelligence, as described in Prade and Dubois (1996). The general block diagram of fuzzy logic with Mamdani-type and Sugeno-type is shown in Fig. 1. Fuzzification is the process in which each input is assigned to a linguistic variable. The degree of membership can be obtained from the linguistic variable. The degrees of membership are combined using fuzzy rules, which may be expressed in terms such as "if x is A, then y is B". The process of converting the fuzzy output, based on the strength of membership, into a crisp value is called defuzzification. Defuzzification is used in fuzzy modeling and in fuzzy logic control to convert the fuzzy outputs from the systems to crisp values.
There are two types of Fuzzy Inference System (FIS), i.e., Mamdani and Sugeno. A Mamdani-type FIS has fuzzy inputs and a fuzzy output. For the Mamdani-type, the input is transformed into a set of linguistic variables during the fuzzification process. The FIS uses the input variables and the fuzzy rules to derive a set of conclusions, which are used during the defuzzification process. A crisp number is the output of the defuzzification process (Jassbi et al., 2007). The Mamdani-type FIS is widely accepted for capturing expert knowledge. It allows the expertise to be described in a more intuitive and human-like manner. The advantages of the Mamdani-type FIS are that it has widespread acceptance, is intuitive and is well suited to human inputs. However, the Mamdani-type FIS entails a substantial computational burden.
In short, Mamdani-type and Sugeno-type are similar in terms of the fuzzification and rule evaluation processes. The main difference between them is that the output of the Sugeno-type is linear or constant. Besides that, the Mamdani-type uses a defuzzification method to extract the output, while the Sugeno-type uses a weighted average method. The Sugeno-type FIS is computationally efficient and works well with optimization and adaptive techniques, which makes it very attractive in control problems, particularly for dynamic nonlinear systems. It also works well with linear techniques and is well suited to mathematical analysis (FLT, 2010).
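For reference, the two output-extraction schemes contrasted above can be written in their usual textbook form. This is a sketch under stated assumptions: centroid defuzzification is assumed for the Mamdani-type (the method this paper names in its defuzzification step), and the symbols $\mu_{\mathrm{agg}}$, $\alpha_r$, $z_r$ and $R$ are notation introduced here for illustration only.

```latex
% Mamdani-type: centroid of the aggregated output fuzzy set \mu_agg over the output range
w_{\mathrm{Mamdani}} = \frac{\int y \, \mu_{\mathrm{agg}}(y) \, dy}{\int \mu_{\mathrm{agg}}(y) \, dy}

% Sugeno-type: weighted average of the R rule outputs z_r (constant or linear in the inputs),
% each weighted by its rule firing strength \alpha_r
w_{\mathrm{Sugeno}} = \frac{\sum_{r=1}^{R} \alpha_r \, z_r}{\sum_{r=1}^{R} \alpha_r}
```

The Sugeno form needs only a finite sum, which is why it is usually cheaper to evaluate than the integral in the Mamdani form.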
The first objective of this study is to analyze the performance of the single modal systems, i.e., speech and lip, at different quality conditions. Second, a Fuzzy Inference System is designed for weight inference. Finally, the performance of the fusion systems with the weight inferred from the FIS is compared to the performance of the single systems.

MATERIALS AND METHODS
Data Acquisition: In data acquisition, voice, which is a continuous electrical signal, is converted to a digital signal using a sampler and an Analog-to-Digital (A/D) converter. The digitization process consists of sampling, quantization and coding. The sampling process is discussed extensively in (Rabiner and Schafer, 1978). After sampling, the signal is discrete in the time domain but still continuous in the amplitude domain. The quantization process divides the continuous amplitude range into a finite number of subranges (Furui, 2000). Finally, the coding process assigns these finite values to a sequence of codes for binary number representation.
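The three digitization stages can be illustrated with a minimal sketch. It assumes a synthetic unit-amplitude sine wave as the input and uses the 32 kHz, 16-bit format reported for the database as default values; the function names are hypothetical and this is not the acquisition code used in the study.

```python
import math

def sample_and_quantize(freq_hz, duration_s, fs_hz=32000, bits=16):
    """Sample a unit-amplitude sine wave and quantize it to signed integer codes."""
    n_samples = int(round(duration_s * fs_hz))
    max_code = 2 ** (bits - 1) - 1  # 32767 for 16-bit
    codes = []
    for n in range(n_samples):
        # Sampling: discrete in time, still continuous in amplitude.
        x = math.sin(2 * math.pi * freq_hz * n / fs_hz)
        # Quantization: map the amplitude to one of a finite number of levels.
        codes.append(int(round(x * max_code)))
    return codes

def to_binary(code, bits=16):
    """Coding: two's-complement bit string for one quantized sample."""
    return format(code & (2 ** bits - 1), "0{}b".format(bits))

codes = sample_and_quantize(1000, 0.001)  # 1 ms of a 1 kHz tone
peak_bits = to_binary(max(codes))
```
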
In this study, the audio and visual data are obtained from the Audio-Visual digit database (Sanderson and Paliwal, 2001). The database consists of 20 repetitions of the number zero from 37 different subjects. Mel Frequency Cepstrum Coefficient (MFCC) is used to obtain the features for the speech modality. This study uses 12 MFCC features to form the feature vector. The data are recorded at 32 kHz in 16-bit mono format. For lip verification, the Region of Interest (ROI) of the lip images is cropped and stored as JPEG files with a resolution of 512×384 pixels. The ROI method used to extract the lip features in this study follows (Potamianos et al., 2000; Iyengar et al., 2001).
The database is divided into two sessions, training and testing. During the enrolment process, 2220 audio data are developed for all 37 subjects. For training purposes, 740 data are used to train the system. Each subject is treated as the claimant and the other subjects as the imposters during the verification process. Therefore, for each claimant the database has 40 testing data from the authentic speaker and 1440 from the imposter speakers. The visual data consist of 60 sequences of images (20 for training and 40 for testing), where each sequence consists of 10 images. In total, 22200 data are developed for all 37 subjects. As in speaker verification, each subject is treated as the claimant and the other subjects as the imposters during the verification process. Hence, for each claimant the database has 400 testing data from the authentic lip images and 14400 from the imposter lip images.
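The data counts quoted above can be checked with a few lines of arithmetic. The sketch assumes that the 60 sequences per subject (20 for training, 40 for testing) stated for the visual data also apply to the audio data, and that the genuine and imposter counts are per claimant; the variable names are illustrative.

```python
subjects = 37
train_per_subject = 20
test_per_subject = 40
sequences_per_subject = train_per_subject + test_per_subject  # 60
images_per_sequence = 10

# Audio data
audio_total = subjects * sequences_per_subject            # all enrolled audio data
audio_train = subjects * train_per_subject                # audio data used for training
genuine_audio = test_per_subject                          # genuine trials per claimant
imposter_audio = (subjects - 1) * test_per_subject        # imposter trials per claimant

# Visual data: every sequence contributes 10 lip images
visual_total = audio_total * images_per_sequence
genuine_visual = genuine_audio * images_per_sequence
imposter_visual = imposter_audio * images_per_sequence
```

These reproduce the figures in the text: 2220, 740, 40 and 1440 for audio, and 22200, 400 and 14400 for the lip images.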

Feature Extraction
Pre-emphasis of high frequencies is required to compress the signal dynamic range by flattening the spectral tilt, which raises the SNR. A first-order FIR filter is used to filter the speech signal. A window function is then applied to minimize the signal discontinuities at the beginning and end of each frame by zeroing out the signal outside the region of interest. This study uses Mel Frequency Cepstrum Coefficient (MFCC) processing to extract the audio features. The MFCC process involves several steps. First, the discrete Fourier transform of each frame of the signal is computed. Next, filter bank processing produces the spectral features at defined frequencies at its output. After that, the log energy is computed as the logarithm of the squared magnitude of the filter bank outputs. Finally, the mel frequency cepstrum is computed (Becchetti and Ricotti, 1999).
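The front end of this pipeline, pre-emphasis followed by framing and windowing, can be sketched as below. The pre-emphasis coefficient of 0.97 and the Hamming window are common choices assumed here, since the text names neither; the frame length and hop are likewise illustrative. The later MFCC stages (DFT, mel filter bank, log energy, cepstrum) are not shown.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order FIR filter y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

def hamming(length):
    """Hamming window (assumed window type); tapers both ends of a frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply the window to each one."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

emphasized = pre_emphasize([1.0, 1.0, 1.0])
frames = frame_and_window([float(n) for n in range(10)], frame_len=4, hop=2)
```

A constant input is almost cancelled by the pre-emphasis filter, which shows how the filter attenuates low frequencies relative to high ones.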

Classification
This study implements the Support Vector Machine (SVM) as the classifier. An SVM performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. The SVM model is a supervised learning method that generates input-output mapping functions from a set of labeled training data. The foundation of the SVM was developed in (Vapnik, 1995), and the method has become popular and widely accepted due to its many attractive features and promising empirical performance. The theory of SVM is further explained in (Gunn, 1998). In brief, the decision boundary of a support vector machine is presented in Fig. 2.
The SVM identifies the data points that lie at the edge of a region in space, i.e., on the boundary between one class and another. These points are the support vectors. The space between the regions containing data points of different classes is the margin between those classes. The SVM identifies a hyperplane that separates the classes such that the margin between them is maximized. An advantage of this method is that the modeling only deals with these support vectors, rather than the whole training dataset.
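The maximum-margin idea can be stated compactly for the simplest case. The formulation below is the standard linear, hard-margin SVM and is given only as a sketch: the text does not state which kernel or margin type was used, and the symbols $\mathbf{w}$, $b$, $\mathbf{x}_i$ and $y_i$ are notation introduced here (this $\mathbf{w}$ is the hyperplane normal, unrelated to the fusion weight inferred later).

```latex
% Decision function: the sign of f gives the predicted class
f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b

% Training problem for labels y_i \in \{-1, +1\}:
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top}\mathbf{x}_i + b \right) \ge 1 \quad \text{for all } i

% Width of the margin between the two classes
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
```

Minimizing $\lVert \mathbf{w} \rVert$ therefore maximizes the margin, and only the training points that satisfy the constraint with equality (the support vectors) determine the solution.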

Fusion Scheme
A fuzzy fusion mechanism for robust and reliable multimodal biometric based security systems is developed. The use of a fuzzy logic system as the fusion scheme improves the system performance. For this experiment, the fuzzy logic system consists of two inputs (speech and lip) and one output (weight). The parallel nature of the rules is one of the most important aspects of fuzzy logic (Hellmann, 2001). Initially, the input verification scores (speech and lip) are scaled to a common range using the min-max normalization in Equation (1):

$$x_i' = \frac{x_i - \min_{1 \le k \le K} x_k}{\max_{1 \le k \le K} x_k - \min_{1 \le k \le K} x_k} \qquad (1)$$

where $x_i$ denotes the ith match score output, $x_i'$ is its normalized value and K is the number of match scores available in the set (Jain et al., 2005). The fuzzy logic system procedure is as follows (Zadeh, 1965; 1984).
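The min-max normalization referred to as Equation (1) can be sketched in a few lines. The target range [0, 1] and the handling of a constant score set are assumptions made here; the function name is hypothetical.

```python
def min_max_normalize(scores):
    """Scale a set of match scores to [0, 1] using the set's own minimum and maximum."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case (all scores equal): return zeros rather than divide by zero.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

normalized = min_max_normalize([2.0, 4.0, 6.0])
```

Applying the same mapping to the speech and lip scores separately puts both modalities on a comparable scale before they are weighted and combined.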

Step 1: Fuzzification
In this study, two fuzzy models are built, one Mamdani-type and one Sugeno-type. Each model has two inputs, speech and lip, and one output, which is the weight. Figure 3 shows the fuzzy inference systems using the Mamdani-type and Sugeno-type methods in the Matlab Fuzzy Toolbox.
Next, the inputs are identified and the degree of each input is determined according to the appropriate fuzzy sets via membership functions. The membership functions are Gaussian in shape because a Gaussian can cover several values within one membership function. The inputs are always crisp numerical values. For input 1 (speech), the interval is [0, 40] dB SNR and for input 2 (lip), the interval is [0, 1] quality density. The output (weight) varies over [0, 1].
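Fuzzification with Gaussian membership functions can be sketched as follows. The input ranges come from the text and the labels Qlow, Qmed and Qhigh follow the paper's naming, but the centres and widths are hypothetical values chosen for illustration; the actual parameters are defined in the paper's figures and are not reproduced here.

```python
import math

def gaussmf(x, mean, sigma):
    """Gaussian membership function; returns a degree of membership in (0, 1]."""
    return math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))

# Hypothetical (mean, sigma) pairs, NOT the parameters used in the study.
SPEECH_MFS = {"Qlow": (0.0, 8.0), "Qmed": (20.0, 8.0), "Qhigh": (40.0, 8.0)}  # SNR in [0, 40]
LIP_MFS = {"Qlow": (0.0, 0.2), "Qmed": (0.5, 0.2), "Qhigh": (1.0, 0.2)}       # density in [0, 1]

def fuzzify(x, mfs):
    """Map one crisp input to its degree of membership in every fuzzy set."""
    return {label: gaussmf(x, mean, sigma) for label, (mean, sigma) in mfs.items()}

speech_degrees = fuzzify(20.0, SPEECH_MFS)
lip_degrees = fuzzify(0.5, LIP_MFS)
```

A single crisp input generally belongs to several sets at once with different degrees, which is what allows neighbouring rules to fire together.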
Then, the speech fuzzy set is modeled with three membership functions (MFs): speech (Qlow), speech (Qmed) and speech (Qhigh). Three MFs are also modeled for the lip fuzzy set: lip (Qlow), lip (Qmed) and lip (Qhigh), as shown in Fig. 4. For the output fuzzy set, three MFs are used: weight (Qlow), weight (Qmed) and weight (Qhigh). The outputs for Mamdani-type and Sugeno-type are illustrated in Fig. 5.

Step 2: Rule Evaluation
There are nine rules for the system. In the experiments, lip performs better than speech. Therefore, this study relies more on lip, since uncertain input conditions are involved during the process. For example, when speech (Qhigh) and lip (Qlow) are determined, the weight output is mapped to weight (Wmed). The rule editor is used to define the rules for each model. The rule editor for each model is shown in Fig. 6.
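Rule evaluation over a nine-entry rule base can be sketched as below. Only the rule speech (Qhigh) and lip (Qlow) giving weight (Wmed) is stated in the text; the other eight consequents are placeholders chosen only so that all three output labels occur, and they should not be read as the study's rule base. The use of min for the fuzzy AND and max for combining rules with the same consequent is also an assumption.

```python
# (speech label, lip label) -> weight label.
# Only ("Qhigh", "Qlow") -> "Wmed" comes from the text; every other entry is a placeholder.
RULES = {
    ("Qlow", "Qlow"): "Wlow",
    ("Qlow", "Qmed"): "Wlow",
    ("Qlow", "Qhigh"): "Wlow",
    ("Qmed", "Qlow"): "Wmed",
    ("Qmed", "Qmed"): "Wmed",
    ("Qmed", "Qhigh"): "Wlow",
    ("Qhigh", "Qlow"): "Wmed",
    ("Qhigh", "Qmed"): "Whigh",
    ("Qhigh", "Qhigh"): "Whigh",
}

def evaluate_rules(speech_deg, lip_deg, rules=RULES):
    """Return the firing strength of each output label."""
    strengths = {}
    for (s_label, l_label), out_label in rules.items():
        # Fuzzy AND of the two antecedents (min operator, assumed).
        alpha = min(speech_deg[s_label], lip_deg[l_label])
        # Rules sharing a consequent are combined with max.
        strengths[out_label] = max(strengths.get(out_label, 0.0), alpha)
    return strengths

strengths = evaluate_rules(
    {"Qlow": 0.0, "Qmed": 0.2, "Qhigh": 0.9},
    {"Qlow": 0.7, "Qmed": 0.3, "Qhigh": 0.0},
)
```

In this example the stated rule dominates: speech is mostly Qhigh and lip is mostly Qlow, so Wmed receives the largest firing strength.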

Step 3: Aggregation
Aggregation is the process of unifying the outputs of all rules. The membership functions for all rules are scaled and combined into a single fuzzy set. The inputs to aggregation are the scaled membership functions and the output is one fuzzy set for each output variable. The Mamdani-type and Sugeno-type methods for aggregating the fuzzy rules and computing the output are shown in Fig. 7 and 8, respectively. All the rules must be combined and tested in order to make a decision.

Step 4: Defuzzification
The output of aggregation is used as the input to the defuzzification process, whose output is a single number (the weight). For defuzzification, the Mamdani-type applies the centroid method to obtain the centre of the area under the curve, while the Sugeno-type uses the weighted average of a few data points. The output (w) obtained from the fuzzy logic system is used in Equation (2) to calculate the fusion scores, where Y is the score and W is the weight applied to the input data of the speaker's two modalities.
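The two defuzzification routes and the final score fusion can be sketched as below. Several parts are assumptions rather than the study's settings: the output membership function parameters and Sugeno constants are hypothetical, min implication and max aggregation are assumed for the Mamdani-type, the fallback value of 0.5 when no rule fires is arbitrary, and the fusion rule is written as a convex weighted sum with the weight applied to the speech score, which the text does not confirm.

```python
import math

def gaussmf(y, mean, sigma):
    return math.exp(-((y - mean) ** 2) / (2.0 * sigma ** 2))

# Hypothetical output sets on [0, 1]; not the parameters used in the study.
OUT_MFS = {"Wlow": (0.0, 0.2), "Wmed": (0.5, 0.2), "Whigh": (1.0, 0.2)}
OUT_CONST = {"Wlow": 0.0, "Wmed": 0.5, "Whigh": 1.0}  # zero-order Sugeno outputs

def mamdani_centroid(strengths, steps=1000):
    """Clip each output set at its firing strength, aggregate with max, take the centroid."""
    num = den = 0.0
    for k in range(steps + 1):
        y = k / steps
        mu = max(min(a, gaussmf(y, *OUT_MFS[label])) for label, a in strengths.items())
        num += y * mu
        den += mu
    return num / den if den > 0 else 0.5  # arbitrary fallback when no rule fires

def sugeno_weighted_average(strengths):
    """Weighted average of the constant rule outputs, weighted by firing strength."""
    total = sum(strengths.values())
    if total == 0:
        return 0.5  # arbitrary fallback when no rule fires
    return sum(a * OUT_CONST[label] for label, a in strengths.items()) / total

def fuse(score_speech, score_lip, w):
    """Assumed fusion rule: w on the speech score and (1 - w) on the lip score."""
    return w * score_speech + (1.0 - w) * score_lip

example = {"Wlow": 0.0, "Wmed": 0.7, "Whigh": 0.3}
w_mamdani = mamdani_centroid(example)
w_sugeno = sugeno_weighted_average(example)
fused = fuse(0.8, 0.2, w_sugeno)
```

Both routes return a crisp weight in [0, 1] from the same firing strengths; they generally give slightly different values because the centroid depends on the shape of the output sets.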

RESULTS
System performances for fuzzy logic fusion using Mamdani-type and Sugeno-type, based on the Equal Error Rate (EER) at different levels of SNR, are shown in Tables 1 and 2, respectively. System performances based on the Receiver Operating Characteristic (ROC), showing the trade-off between Genuine Acceptance Rate (GAR) and False Acceptance Rate (FAR) percentages, are presented in Fig. 9-11. Results obtained by the single biometric systems and by the multibiometric systems using the Mamdani-type and Sugeno-type fusion methods are also compared in terms of GAR and FAR at selected speech and lip quality conditions in Fig. 9-11. Figure 9 shows the performance of the fusion systems compared to the single systems at 5 dB SNR with 0.2, 0.5 and 0.8 quality densities. The performance of the fusion systems compared to the single systems at 15 dB SNR with 0.2, 0.5 and 0.8 quality densities is illustrated in Fig. 10. At 15 dB SNR and 0.2 quality density, GAR performances of 94, 95, 86 and 7% are observed for Mamdani-type, Sugeno-type, lip and speech, respectively, at 0.1% FAR. Meanwhile, at 5 dB SNR and 0.5 quality density, GAR performances of 90, 82, 82 and 50% are observed for Mamdani-type, Sugeno-type, lip and speech, respectively, at 10% FAR. At the same FAR, i.e., 10%, at 5 dB SNR and 0.8 quality density, the GAR performances for Mamdani-type, Sugeno-type, lip and speech are 57, 56, 28 and 50%, respectively.
Finally, the performance of the fusion systems compared to the single systems at 35 dB SNR with 0.2, 0.5 and 0.8 quality densities is illustrated in Fig. 11. At 35 dB SNR and 0.2 quality density, GAR performances of 99, 99, 95 and 83% are observed for Mamdani-type, Sugeno-type, speech and lip, respectively, at 0.1% FAR. At 35 dB SNR and 0.5 quality density, the GAR performances for Mamdani-type, Sugeno-type, speech and lip are 97, 96, 95 and 10%, respectively, at 0.1% FAR. GAR performances of 96, 96, 96 and 2% are then observed for Mamdani-type, Sugeno-type, speech and lip, respectively, at 0.1% FAR at 35 dB SNR and 0.8 quality density.
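The metrics reported above (GAR at a fixed FAR, and EER) can be computed from lists of genuine and imposter scores as in the following sketch. It assumes that a higher score means a better match and that a claim is accepted when the score is at least the threshold; the function names and the toy scores are illustrative and are not the study's data.

```python
def far_gar(genuine, imposter, threshold):
    """Accept when score >= threshold. Returns (FAR, GAR) as fractions."""
    far = sum(s >= threshold for s in imposter) / len(imposter)
    gar = sum(s >= threshold for s in genuine) / len(genuine)
    return far, gar

def gar_at_far(genuine, imposter, target_far):
    """Highest GAR among the candidate thresholds whose FAR does not exceed target_far."""
    best = 0.0
    for t in sorted(set(genuine) | set(imposter)):
        far, gar = far_gar(genuine, imposter, t)
        if far <= target_far:
            best = max(best, gar)
    return best

def equal_error_rate(genuine, imposter):
    """Approximate EER: operating point where FAR and FRR = 1 - GAR are closest."""
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(imposter)):
        far, gar = far_gar(genuine, imposter, t)
        frr = 1.0 - gar
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

genuine_scores = [0.9, 0.8, 0.7, 0.4]
imposter_scores = [0.6, 0.3, 0.2, 0.1]
gar_strict = gar_at_far(genuine_scores, imposter_scores, 0.0)
gar_loose = gar_at_far(genuine_scores, imposter_scores, 0.25)
eer = equal_error_rate(genuine_scores, imposter_scores)
```

Sweeping the threshold and plotting GAR against FAR yields the ROC curve; a low FAR target such as 0.1% corresponds to a strict threshold, which is where differences between the single and fused systems are most visible.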

DISCUSSION
From the experimental results illustrated in Fig. 9-11, it is observed that the fusion systems based on Mamdani-type FIS and Sugeno-type FIS are able to improve on the performance of the single systems, i.e., speech and lip, when one of the traits is in clean condition or under minor quality degradation. The fusion systems based on Sugeno-type FIS and Mamdani-type FIS are observed to be the most outstanding systems compared to the other fusion schemes.
When both of the traits are severely corrupted by noise, the performance of the single systems tends to decrease. However, by implementing the Sugeno-type FIS and Mamdani-type FIS fusion schemes, the systems are able to maintain their performance.

CONCLUSION
This study presents a multibiometric verification system that combines speaker and lip verification using fuzzy logic with Mamdani-type and Sugeno-type inference. The experimental results show that Mamdani-type and Sugeno-type are quite similar in accuracy and perform much better than the single biometric systems. In conclusion, the limitation faced by score level fusion in multibiometric systems can be overcome using the fuzzy logic system, owing to its capability to infer the optimum weight according to the quality of the verification data.