Evaluation on Score Reliability for Biometric Speaker Authentication Systems

Problem statement: Tuning fusion weights according to score reliability is essential to sustain the performance of multibiometric systems. Approach: In this study, multibiometric systems are evaluated under two varying conditions, i.e., different performances of the individual subsystems and inconsistent quality of the test samples. Following a multialgorithm scheme, two feature extraction methods, i.e., Linear Predictive Coding (LPC) and Mel Frequency Cepstrum Coefficient (MFCC), are used. A Support Vector Machine (SVM) serves as the classifier for both subsystems in the pattern matching process. Scores from the LPC and MFCC based subsystems are fused at the score level using fixed weighting and adaptive weighting approaches. For fixed weighting, the sum-rule method is employed, while for adaptive weighting, the sum-rule based on weight adaptation and the sum-rule with weights produced by fuzzy logic inference are used. The performances of the single, fixed and adaptive systems are then compared. Results: Experimental results show that for 40dB and 20dB SNR signals, the EER performances of the single systems are 1.1730 and 38.2695%, respectively. The EER performances observed for the sum-rule based on weight adaptation and the sum-rule with weights produced by fuzzy logic are 2.7355 and 1.1359%, respectively. Conclusion: The results show that the fusion system based on fuzzy logic is advantageous because it can adjust the weight according to both the subsystem performance and the quality of the current data.


INTRODUCTION
Speaker recognition is a biometric technique that uses an individual's voice for recognition purposes and has become one of the premier applications of machine learning and pattern recognition technology. The speaker recognition process relies on features influenced by both the physical structure of an individual's vocal tract and the behavioral characteristics of the speech. Biometric speaker recognition has co-evolved with speech recognition technology because the two share similar characteristics and challenges. Such a system uses specific information contained in the speech signal for authentication and identification purposes. An authentication system decides whether to accept or reject a claimed identity, accepting the genuine user and rejecting the imposter, while an identification system determines the identity of an unknown user for authorization purposes.
The advantages of using the speech signal as a biometric trait are that the signal is natural and easy to produce, requires little custom hardware, has a low computation requirement and is highly accurate in clean, noise-free conditions (Ramli et al., 2008). However, a single biometric system sometimes fails to authenticate the identity of a person because of insufficient information or spoofing. For instance, the major setback of using speech signals for biometric systems is the severe degradation in performance as the Signal to Noise Ratio (SNR) of the speech signal drops in noisy conditions. In addition, since voice is categorized as a behavioral signal, the information tends to vary with changes in speaking rate and environment, for instance, sickness (e.g., a head cold can alter the vocal tract), extreme emotional state (e.g., stress or duress), a long interval between the enrolment and verification processes, poor or inconsistent room acoustics and aging (Campbell et al., 2003; Samad et al., 2007).
One solution to these limitations is to compare different existing algorithms on the specific problem and select the best one that can be applied. However, selecting the best algorithm is not an easy task. Hence, combining multiple algorithms that employ multiple feature extraction and/or multiple matching algorithms on the same biometric is used as an alternative approach, since the supplementary information from multiple algorithms also helps to improve performance. Moreover, no new sensor is required, so the approach is cost effective. Many researchers have shown that the fusion approach can help to improve the performance of a biometric system (Ramli et al., 2009). It is also essential to assign different fusion weights to each biometric trait in order to vary the contribution of its matching scores, since the optimum weight can maximize the performance of a multibiometric system. This study evaluates the score reliability of multialgorithm approaches by fusing the data at the match score level. The database consists of 2220 audio data obtained from 37 speakers over three recording sessions. The experiment is conducted on clean audio signals and on signals with 40, 35, 30, 25, 20, 15, 10, 5, 0 and -5dB Signal to Noise Ratio (SNR). Two features, Mel Frequency Cepstrum Coefficient (MFCC) and Linear Predictive Coding (LPC), are used in this study. The Support Vector Machine (SVM) is used as the classifier for both subsystems in the pattern matching process. The objectives of this research are as follows. The first is to develop single biometric systems based on two different feature extraction algorithms, MFCC and LPC. Subsequently, the MFCC and LPC subsystems are combined at the score level to form a multialgorithm speaker authentication system.
The second objective is to evaluate the performance of this fusion system under fixed and adaptive weighting schemes. For fixed weighting, the sum-rule method is employed, while for adaptive weighting, the sum-rule based on weight adaptation and the sum-rule with weights produced by a fuzzy logic inference system are used. The third objective of this study is to compare the performances of the single systems and the fixed and adaptive weighting fusion systems.

MATERIALS AND METHODS
Data acquisition processing: Audio features are extracted from the speech recordings based on analysis of the speaker's tone and inflection. In this study, the audio is obtained from the Audio-Visual Digit Database (Sanderson and Paliwal, 2003). The digital audio is monophonic, 16 bit, 32 kHz and in WAV format. The database consists of 2220 audio data obtained from 37 speakers over three recording sessions, to which simulated Additive White Gaussian Noise (AWGN) has been added. Each audio datum undergoes a series of speech processing steps, i.e., pre-emphasis, framing and windowing, as shown in Fig. 1 (Kisku et al., 2010; Daugman, 2000).
Pre-emphasis compresses the dynamic range of the signal by passing it through a filter that emphasizes the higher frequencies in order to raise the SNR. In this process, the speech signal is filtered with a first-order FIR filter whose transfer function in the z-domain is given in Eq. 1:
$$H(z) = 1 - a z^{-1}$$
where a is the pre-emphasis parameter (Furui, 2001; Kisku et al., 2010). In the time domain, the relationship between the output x'(n) and the input x(n) of the pre-emphasis filter is given in Eq. 2:
$$x'(n) = x(n) - a\,x(n-1)$$
In this study, a is set to 0.95, a value that gives more than 20dB amplification of the high frequency spectrum (Becchetti and Ricotti, 1999).
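The time-domain form of the pre-emphasis filter can be illustrated with a short Python sketch. It assumes the standard first-order difference y(n) = x(n) - a x(n-1) with a = 0.95 as stated above; the function name `pre_emphasize` and the choice to pass the first sample through unchanged are assumptions made for illustration only.

```python
def pre_emphasize(x, a=0.95):
    """Return y with y(n) = x(n) - a * x(n-1).

    The first sample has no predecessor, so it is passed through
    unchanged (an assumption of this sketch).
    """
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


# A constant signal (zero frequency) is almost cancelled, while a signal
# that alternates at the highest representable frequency is almost doubled.
low = pre_emphasize([1.0, 1.0, 1.0, 1.0])
high = pre_emphasize([1.0, -1.0, 1.0, -1.0])
```

After the first sample, the constant input is scaled by 1 - a = 0.05 and the alternating input by 1 + a = 1.95. The ratio of these two gains, 39, is roughly 32 dB, which is in line with the figure of more than 20dB quoted for a = 0.95.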
Digitization converts the speech signal from analog to digital form through Analog to Digital Conversion (ADC). For speech signals, spectral evaluation can be performed using short-time analysis by windowing the pre-emphasized signal $x'(n)$ into a sequence of windowed segments $x_t(n)$, t = 1,2,...,T, called frames, which are processed individually as in Eq. 3 and 4, where w(n) is the impulse response of the window. In this process, the audio signal is divided into frames of N samples, where N is the length of each frame. Each frame is shifted by a temporal length M with M<N, so that the N-M samples at the end of frame $x'_t(n)$ are duplicated at the beginning of the following frame $x'_{t+1}(n)$. Choosing a suitable value for N is important according to Kondoz (1969). If N is very large, the short-time energy is averaged over a long time and will not reflect the changing properties of the speech signal. However, if N is small, the short-time energy will change rapidly. A frame length N of 20 ms with 50% overlap is an ideal choice. Windowing is then applied to minimize the signal discontinuities at the beginning and end of each frame by zeroing out the signal outside the region of interest. The Fourier Transform $X_t(e^{j\omega})$ of the discrete-time signal $x_t(n)$ can be written as Eq. 5:
$$X_t(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x_t(n)\, e^{-j\omega n}$$
To increase the resolution and avoid frequency leakage, the ideal window function should have a narrow main lobe and no side lobes. In this study, the Hamming window $w_H(n)$ is used as the window function because its side lobes are lower than those of other windows. Although this window reduces resolution, a high resolution is not required in speaker recognition. The Hamming window $w_H(n)$ is defined as in Eq. 7:
$$w_H(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
Feature extraction: In this study, two features, MFCC and LPC, have been used for the development of the multibiometric systems.
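Both features are computed from the frames produced by the framing and windowing steps described above. A minimal Python sketch of these steps follows, assuming 32 kHz audio, 20 ms frames and 50% overlap as stated in the text. The function names, the decision to drop a trailing partial frame and the N-1 denominator in the Hamming window are assumptions of this sketch, not details confirmed by the study.

```python
import math


def hamming(N):
    """Hamming window of length N (N > 1), using the common N-1 denominator."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def frame_signal(x, N, M):
    """Split x into frames of N samples, advancing M samples per frame (0 < M < N).

    Consecutive frames share N - M samples. A trailing partial frame
    is dropped (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + N <= len(x):
        frames.append(x[start:start + N])
        start += M
    return frames


def window_frames(frames):
    """Multiply every frame sample by the matching Hamming window sample."""
    if not frames:
        return []
    w = hamming(len(frames[0]))
    return [[s * wn for s, wn in zip(frame, w)] for frame in frames]


SAMPLE_RATE = 32000              # Hz, from the database description
N = SAMPLE_RATE * 20 // 1000     # 20 ms frame length in samples
M = N // 2                       # 50% overlap means a shift of N/2 samples
signal = [1.0] * SAMPLE_RATE     # one second of placeholder samples
frames = window_frames(frame_signal(signal, N, M))
```

With these settings each frame holds 640 samples and the shift is 320 samples, so one second of audio yields 99 complete frames.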
The MFCC feature is based on the known variation of the human ear's critical bandwidths with frequency, expressed on the mel-frequency scale (Chen and Luo, 2009). The mel-frequency scale is linearly spaced below 1000Hz and logarithmically spaced above 1000Hz. The system therefore uses two types of filters, linearly and logarithmically spaced, which operate on the Fourier transform of $x_t(n)$, $X_t(e^{j\omega})$, evaluated only for a discrete number of ω values. MFCC processing consists of several steps. The first step is the computation of the Discrete Fourier Transform (DFT) of all frames of the signal. By considering $\omega = 2\pi k/N$, the DFT of each frame, $X_t(k)$, is obtained as in Eq. 8:
$$X_t(k) = \sum_{n=0}^{N-1} x_t(n)\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \dots, N-1$$
The computational complexity can be reduced with the Fast Fourier Transform (FFT) if the number of samples N is a power of 2. The result obtained after this step is called the signal's spectrum.
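The DFT step can be illustrated with a direct, unoptimized Python sketch of the standard definition, X(k) = sum over n of x(n) exp(-j 2 pi k n / N). The function name `dft` is hypothetical, and the direct summation is chosen for clarity only; a practical system would normally use an FFT routine.

```python
import cmath
import math


def dft(frame):
    """Direct O(N^2) evaluation of the DFT of one frame."""
    N = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


# A cosine completing two cycles in eight samples concentrates its energy
# in bin k = 2 and in the mirrored bin k = 6.
frame = [math.cos(2 * math.pi * 2 * n / 8) for n in range(8)]
spectrum = dft(frame)
magnitudes = [abs(value) for value in spectrum]
```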
Filter bank processing is the second step in MFCC processing. The filter bank integrates the spectrum over defined frequency ranges, and spectral features are obtained after this process. The outputs of the filter bank are denoted as $Y_t(m)$, 1 ≤ m ≤ M, where M is the number of band-pass filters. In general, a set of 24 band-pass filters is used since it simulates human ear processing. The third step is the computation of the log energy, i.e., the logarithm of the squared magnitude of the filter bank outputs $Y_t(m)$.
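The filter bank and log energy steps can be sketched in Python as follows. The study does not state the mel conversion formula or the filter shape, so this sketch assumes the widely used mapping mel = 2595 log10(1 + f/700) and triangular filters equally spaced on the mel axis; all function names and the small constant that guards the logarithm are hypothetical.

```python
import math


def hz_to_mel(f):
    """Commonly used Hz-to-mel mapping (an assumption, not taken from the study)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def log_filterbank_energies(magnitude_spectrum, sample_rate, num_filters=24):
    """Log of the squared output of each triangular mel filter.

    magnitude_spectrum holds |X_t(k)| for the bins from 0 Hz up to the
    Nyquist frequency (at least two bins).
    """
    f_max = sample_rate / 2.0
    step = hz_to_mel(f_max) / (num_filters + 1)
    edges = [mel_to_hz(step * i) for i in range(num_filters + 2)]
    bin_hz = f_max / (len(magnitude_spectrum) - 1)
    energies = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        output = 0.0
        for k, mag in enumerate(magnitude_spectrum):
            f = k * bin_hz
            if lo < f <= mid:
                output += mag * (f - lo) / (mid - lo)    # rising slope
            elif mid < f < hi:
                output += mag * (hi - f) / (hi - mid)    # falling slope
        energies.append(math.log(output ** 2 + 1e-12))   # guard against log(0)
    return energies


# Flat placeholder spectrum with 321 bins (a 640-sample frame at 32 kHz).
energies = log_filterbank_energies([1.0] * 321, 32000)
```

Because the filters widen with frequency, a flat spectrum gives larger log energies in the upper filters than in the lower ones.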
In this study, the database of MFCC features consists of 2220 sets of MFCC features from 37 persons, with 60 speech signals per person. There are 12 mel cepstrum coefficients, one log energy coefficient and three delta coefficients per frame. The overall MFCC process is shown in Fig. 2.
LPC feature extraction models the process of speech production and is defined as a digital method for encoding an analogue signal in which a particular value is predicted by a linear function of the past values of the signal (Rabiner and Juang, 1993; Furui, 1981). The most important aspect of LPC is the linear predictive filter, which allows the value of the next sample to be determined by a linear combination of previous samples. In other words, linear prediction filters attempt to predict future values of the input signal based on past values. LPC analysis is based on the assumption that the current sample x(n) can be approximated by a first-order linear combination of the previous p samples, as given in Eq. 10:
$$x(n) \approx \alpha_1 x(n-1) + \dots + \alpha_p x(n-p)$$
The linear predicted value $\tilde{x}(n)$ of x(n) with prediction coefficients $\alpha_i$ is given in Eq. 11:
$$\tilde{x}(n) = \sum_{i=1}^{p} \alpha_i\, x(n-i)$$
Consequently, the LPC cepstrum can be derived through the LPC model. For a time sequence x(n), the complex cepstrum $\tilde{c}(n)$ is represented as in Eq. 12-14. The database of LPC features in this study consists of 2220 sets of LPC features from 37 persons, with 60 speech signals per person. In this method, 14 cepstrum coefficients are extracted per frame. The overall LPC process is shown in Fig. 3.
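The linear prediction model can be illustrated with a short Python sketch. The study does not state how the prediction coefficients are estimated, so this sketch assumes the autocorrelation method solved with the Levinson-Durbin recursion, a common textbook choice. The function names and the prediction order used in the example are hypothetical.

```python
def autocorrelation(x, max_lag):
    """r(lag) = sum over n of x(n) * x(n - lag), for lag = 0 .. max_lag."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, p):
    """Solve the normal equations for the predictor coefficients alpha_1..alpha_p."""
    alpha = [0.0] * (p + 1)      # alpha[0] is unused
    error = r[0]                 # prediction error energy, assumed non-zero
    for i in range(1, p + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k = acc / error          # reflection coefficient of order i
        updated = alpha[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = alpha[j] - k * alpha[i - j]
        alpha = updated
        error *= 1.0 - k * k
    return alpha[1:]


def predict(x, alpha, n):
    """Predicted value of x(n) from the previous len(alpha) samples."""
    return sum(a * x[n - i] for i, a in enumerate(alpha, start=1))


# A decaying exponential obeys x(n) = 0.9 * x(n-1), so a second-order
# predictor should recover alpha_1 close to 0.9 and alpha_2 close to 0.
x = [0.9 ** n for n in range(200)]
alpha = levinson_durbin(autocorrelation(x, 2), 2)
```

The cepstrum coefficients used as features would then be derived from these prediction coefficients; that conversion is not sketched here.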

Fig. 3: LPC block diagram
Classification using SVM: SVM is a classifier that can assign samples to two or more classes. In its simplest form, the linearly separable case, it finds the optimal hyperplane that maximizes the distance between the separating hyperplane and the closest training data points, called the support vectors (Gunn, 2005; Wan and Campbell, 2000). The solution of the linearly separable case starts by considering the problem of separating a set of training vectors belonging to two separate classes, as given in Eq. 15:
$$D = \{(x_1, y_1), \dots, (x_L, y_L)\}, \quad x_i \in \mathbb{R}^n, \quad y_i \in \{-1, +1\}$$
with a hyperplane as in Eq. 16:
$$\langle w, x \rangle + b = 0$$
where w and b are the direction and position in space, respectively, and w is normal to the plane. The hyperplane is equidistant from the nearest points of each class, and the margin is twice this distance. The optimal hyperplane is a linear combination of a small subset of the data, $x_s$, s ∈ {1,..., N}, called the support vectors. The hyperplane that optimally separates the data minimizes Eq. 17, which is equivalent to minimizing an upper bound on the VC dimension:
$$\Phi(w) = \frac{1}{2}\|w\|^2$$
The VC dimension is a scalar value that measures the capacity of the learning function. The saddle point of the Lagrange functional (Lagrangian) is used to solve the optimization problem and is given in Eq. 18:
$$\Phi(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{L} a_i \left( y_i \left[ \langle w, x_i \rangle + b \right] - 1 \right)$$
where $a_i$ are the Lagrange multipliers. The Lagrangian has to be maximized with respect to a ≥ 0 and minimized with respect to w and b. The solution of the linearly separable case is given by Eq. 19:
$$a^* = \arg\min_{a} \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} a_i a_j y_i y_j \langle x_i, x_j \rangle - \sum_{k=1}^{L} a_k$$
with the constraints in Eq. 20:
$$a_i \ge 0, \quad i = 1, \dots, L \quad \text{and} \quad \sum_{j=1}^{L} a_j y_j = 0$$
A nonlinear mapping is used when a linear boundary is inappropriate. In this case, the SVM maps the input vector x into a manifold embedded in a high dimensional feature space z and constructs an optimal separating hyperplane in the higher dimensional space (Chen and Luo, 2009). Common nonlinear mappings are polynomial functions, radial basis functions and certain sigmoid functions. In this study, the polynomial kernel is employed. Hence, the optimization problem becomes as in Eq. 21:
$$a^* = \arg\min_{a} \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} a_i a_j y_i y_j K(x_i, x_j) - \sum_{k=1}^{L} a_k$$
with the constraints in Eq. 22:
$$0 \le a_i \le C, \quad i = 1, \dots, L \quad \text{and} \quad \sum_{j=1}^{L} a_j y_j = 0$$
where $K(x_i, x_j)$ is the kernel function that performs the nonlinear mapping into the feature space. For the polynomial kernel, it is defined as in Eq. 23:
$$K(x_i, x_j) = \left( \gamma \langle x_i, x_j \rangle + r \right)^d$$
where γ > 0 and γ, r and d are kernel parameters.
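The polynomial kernel and its use in scoring a test vector can be sketched in Python as follows. The kernel has the form (gamma <x, z> + r)^d with the parameters named in the text, and the score uses the standard SVM decision function f(x) = sum of a_i y_i K(x_i, x) + b, which is not written out in the study. The function names and all numeric values below are hypothetical; the kernel parameters actually used in the study are not reported.

```python
def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    """K(x, z) = (gamma * <x, z> + r) ** d, with gamma > 0."""
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return (gamma * dot + r) ** d


def decision_score(x, support_vectors, labels, multipliers, bias, **kernel_params):
    """Standard SVM score: sum over support vectors of a_i * y_i * K(x_i, x), plus b."""
    return bias + sum(
        a * y * polynomial_kernel(sv, x, **kernel_params)
        for sv, y, a in zip(support_vectors, labels, multipliers)
    )


# Toy model with two support vectors, one per class (illustrative values only).
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
labels = [1, -1]
multipliers = [0.5, 0.5]
score = decision_score([2.0, 0.0], support_vectors, labels, multipliers, bias=0.0)
```

In this toy case the test vector lies closer to the positive support vector, so the score is positive. Scores of this kind from the two subsystems are what the fusion stage would combine.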
Fusion System: In this study, the MFCC and LPC subsystems are combined into a fusion system as shown in Fig. 4. Taking advantage of score level fusion as discussed before, the scores from the MFCC and LPC subsystems are fused before the decision is made. Two types of fusion schemes, i.e., fixed weighting and adaptive weighting, are implemented and compared at different SNR levels. In the fixed weighting approach, the sum-rule fusion scheme is applied, and the optimum weight for the weight adaptation fusion system is then computed from it. The sum-rule fusion method is given in Eq. 24, where w is the fusion weight, varied from 0 to 1 in steps of 0.1. The fixed weighting fusion is based on clean data. For this purpose, each speaker model is trained using 20 client training data and 720 (20×36) imposter training data. During testing, each speaker model is tested on 40 client data and 1440 (40×36) imposter data from the other 36 persons using the clean signal, so that 1480 scores are obtained for each type of testing data.
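The sum-rule fusion and the weight sweep can be sketched in Python as follows. The sketch assumes the usual weighted sum in which w multiplies the MFCC score and 1 - w multiplies the LPC score; this assignment is an assumption chosen to agree with the MFCC to LPC ratio of 0.7:0.3 reported in the discussion. The function and variable names are hypothetical.

```python
def sum_rule(score_mfcc, score_lpc, w):
    """Fused score, with w weighting MFCC and (1 - w) weighting LPC (assumed)."""
    return w * score_mfcc + (1.0 - w) * score_lpc


def fuse_all(scores_mfcc, scores_lpc, w):
    """Apply the sum rule to paired subsystem scores."""
    return [sum_rule(m, l, w) for m, l in zip(scores_mfcc, scores_lpc)]


# Fusion weights from 0 to 1 in steps of 0.1, as in the text.
WEIGHTS = [i / 10 for i in range(11)]

# Test scores per speaker model: 40 client trials plus 40 trials
# from each of the other 36 speakers.
SCORES_PER_MODEL = 40 + 40 * 36

# Placeholder scores for two trials, fused at every candidate weight.
fused_by_weight = {w: fuse_all([2.0, -0.5], [-1.0, 0.4], w) for w in WEIGHTS}
```

In a full experiment, the EER of the fused scores would be computed for each weight and the weight with the lowest EER kept as the optimum.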
In the adaptive weighting approach, the sum-rule with weight adaptation and the sum-rule with weights produced by a fuzzy logic inference system are applied (Vasuhi et al., 2010). For the sum-rule with weight adaptation, the weight is taken from the optimum weight of the fixed weighting system and the audio systems are evaluated at different SNR levels. Each speaker model is trained using 20 client training data and 720 imposter training data, while 40 client data and 1440 imposter data are used as the testing data. The clean data are corrupted to 10 levels of SNR, i.e., 40, 35, 30, 25, 20, 15, 10, 5, 0 and -5dB. During testing, each speaker model is tested on 40 client data and 1440 (40×36) imposter data from the other 36 persons for each level of the corrupted signals, so that 1480 scores are obtained for each type of testing data. In the sum-rule with weights produced by the fuzzy logic inference system, the range of SNR levels is divided into three levels: high, medium and low. The important part of the fuzzy logic is to determine the optimum weight of the fusion system according to the SNR level and the performance of both subsystems. For the MFCC feature, the range 25 to 40dB is defined as the high SNR level, 5 to 30dB as the medium level and 5 to 10dB as the low level. For the LPC feature, the high SNR level is 34 to 40dB, the medium level is 19 to 36dB and the low level is -5 to 21dB.
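The overlapping SNR levels can be turned into fuzzy membership values, as in the Python sketch below for the LPC ranges given above (high 34 to 40dB, medium 19 to 36dB, low -5 to 21dB). The study does not specify the shape of the membership functions or the rule base, so the linear transitions across the overlap regions (19 to 21dB and 34 to 36dB) are an assumption, and the function names are hypothetical.

```python
def ramp_up(x, start, end):
    """0 at or below start, 1 at or above end, linear in between."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


def lpc_snr_memberships(snr_db):
    """Degree to which an SNR value belongs to the low, medium and high levels."""
    leaving_low = ramp_up(snr_db, 19.0, 21.0)      # overlap of low and medium
    entering_high = ramp_up(snr_db, 34.0, 36.0)    # overlap of medium and high
    return {
        "low": 1.0 - leaving_low,
        "medium": min(leaving_low, 1.0 - entering_high),
        "high": entering_high,
    }


levels_at_35 = lpc_snr_memberships(35.0)
levels_at_20 = lpc_snr_memberships(20.0)
```

At 35dB the value belongs equally to the medium and high levels, and at 20dB equally to the low and medium levels. An inference stage would map such memberships, together with those of the MFCC subsystem, to a fusion weight; that mapping is not sketched because the rules are not given in the text.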

RESULTS
Performance of single biometric system: System performances based on Equal Error Rate (EER) for the MFCC-SVM systems at different SNR levels are shown in Table 1. The corresponding Receiver Operating Characteristic (ROC) curves are presented in Fig. 5. Table 2 shows the EER performance of the LPC-SVM systems at different SNR levels, and the corresponding ROC curves are presented in Fig. 6.
Performance of fixed weighting systems: The score ratios between the MFCC-SVM and LPC-SVM subsystems and their corresponding EER performances using the sum-rule scheme are shown in Table 3.
Performance of adaptive weighting system compared to other systems: The performance of the adaptive weighting system is compared to that of the fixed weighting system and the single system (LPC) at 20dB SNR using the ROC curves shown in Fig. 8. The overall performances are also given in Table 5.
Table 1 and Fig. 5 show the performances of the MFCC-SVM systems at 40dB, 20dB, 10dB and -5dB. At 40dB SNR, the Genuine Acceptance Rate (GAR) is almost 100% at a False Acceptance Rate (FAR) of 6%. At the same FAR, the GAR performances for 20, 10 and -5dB are 70, 23 and 7%, respectively. The 40dB SNR gives the lowest EER, which signifies the highest performance. Table 2 and Fig. 6 show the performances of the LPC-SVM systems at 40dB, 20dB, 10dB and -5dB. At 40dB SNR, the GAR is almost 100% at a FAR of 12%. At the same FAR, the GAR performances for 20, 10 and -5dB are 59, 30 and 24%, respectively.
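The error measures used in these results can be illustrated with a short Python sketch. It assumes that a claim is accepted when its score is at or above a threshold, that FAR is the fraction of imposter scores accepted, that GAR is the fraction of genuine scores accepted, and that the EER is read at the threshold where FAR and the false rejection rate (1 - GAR) are closest. The function names and the toy scores are hypothetical and are not taken from the experiments.

```python
def far_gar(genuine, imposter, threshold):
    """FAR and GAR when a claim is accepted if its score >= threshold."""
    far = sum(s >= threshold for s in imposter) / len(imposter)
    gar = sum(s >= threshold for s in genuine) / len(genuine)
    return far, gar


def equal_error_rate(genuine, imposter):
    """Approximate EER: average of FAR and FRR where their gap is smallest."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(genuine) | set(imposter)):
        far, gar = far_gar(genuine, imposter, threshold)
        frr = 1.0 - gar
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Toy scores: one genuine score falls below one imposter score.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
imposter_scores = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine_scores, imposter_scores)
```

For these toy scores the two error rates meet at a threshold of 0.6, where one of four imposter scores is accepted and one of four genuine scores is rejected, giving an EER of 25%. Sweeping the threshold and plotting GAR against FAR gives an ROC curve of the kind referred to above.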

DISCUSSION
Based on the experimental results of the fixed weighting system given in Table 3, the score ratio between the MFCC and LPC subsystems for the sum-rule fusion method is fixed at 0.7:0.3 for the adaptive system. Table 4 and Fig. 7 show the performances of the sum-rule fusion method at different SNR levels.
For the adaptive weighting system, fuzzy logic is applied as a second approach to determine the optimum weight. For comparison, the EER performances of both single systems, i.e., MFCC and LPC, and of the fusion systems using the adaptive weighting approach, i.e., the sum-rule with weight adaptation and fuzzy logic, have been computed. Table 5 summarizes the performances of the single systems (LPC and MFCC subsystems) and the fusion systems (sum-rule with weight adaptation and fuzzy logic inference system). The fusion system using the sum-rule with weight adaptation performs well only when both subsystems are in clean conditions or when the LPC subsystem is at a high SNR compared to the MFCC subsystem. Otherwise, its performance is worse than that of the single systems. This trend occurs because the weight for this fusion scheme is set to an MFCC to LPC ratio of 0.7:0.3, which is based only on the performance of the individual subsystems and not on the quality of the testing data.
Consequently, the advantage of multibiometric systems over single systems can be observed in the fuzzy logic based fusion system, whose weight tuning is effective because it considers both the variation in subsystem performance and the data quality.

CONCLUSION
This study reveals the importance of fusion weight tuning in maintaining the advantage of multibiometric systems over single biometric systems. By considering two varying conditions, i.e., different performances of the individual subsystems and inconsistent quality of the test samples, fusion weight tuning is applied to ensure that the fusion system performs at its best. Future research should focus on other sources that influence the reliability of biometric scores and on proper approaches to weight tuning for multibiometric systems.