A FRAMEWORK FOR MULTILINGUAL TEXT-INDEPENDENT SPEAKER IDENTIFICATION SYSTEM

This article evaluates the performance of the Extreme Learning Machine (ELM) and the Gaussian Mixture Model (GMM) for text-independent multilingual speaker identification on recorded and synthesized speech. The type and number of filters in the filter bank, the number of samples in each frame of the speech signal and the fusion of model scores play a vital role in speaker identification accuracy and are analyzed in this article. ELM uses a single hidden layer feed forward neural network for multilingual speaker identification. The individual Gaussian components of a GMM represent speaker-dependent spectral shapes that are effective in characterizing speaker identity. Both modeling techniques use Linear Predictive Residual Cepstral Coefficient (LPRCC), Mel Frequency Cepstral Coefficient (MFCC), Modified Mel Frequency Cepstral Coefficient (MMFCC) and Bark Frequency Cepstral Coefficient (BFCC) features to represent the speaker-specific attributes of speech signals. Experimental results show that GMM outperforms ELM, reaching a speaker identification accuracy of 97.5% with a frame size of 256 samples, a frame shift of half the frame size and a filter bank of 40 filters.


INTRODUCTION
In automatic speaker recognition, an algorithm plays the listener's role in decoding the speech into a hypothesis concerning the speaker's identity. Speaker identification is the task of determining a given speaker out of a set of known speakers using the speaker-specific characteristics extracted from their voice signal. Voiced speech is produced when the vocal folds vibrate during airflow from the lungs, and unvoiced speech is produced when the vocal folds do not vibrate (Justin and Vennila, 2013). Voiced segments carry more information about vocal source production than unvoiced speech (Salam et al., 2009). Speaker identification involves extracting the acoustic features of the speakers, modeling the features and performing the identity test. The acoustic patterns of these features reflect both anatomy and learned behavioral patterns. The speaker identification process consists of two phases: training and testing. During training, the speaker's voice is recorded and typically a number of features are extracted to form a voice print model. This is called enrollment. In the testing phase, a speech sample or utterance is compared against multiple voice print models in the feature database and the most likely pattern is identified. If the text uttered is different for enrollment and verification, this is called text-independent speaker identification; otherwise it is called text-dependent speaker identification. The proposed speaker identification task uses LP residual Cepstral Coefficients, MFCC features and their variants. Linear prediction (LP) analysis of speech (Prathosh et al., 2013) assumes the source-filter model, in which the formants required to synthesize the speech are filtered adaptively (Tiun et al., 2012). The LP residual signal can be derived even for noisy signals. The residual signal is used to excite the time-varying all-pole filter to generate the enhanced speech.
MFCC features show discriminative ability (Hasan et al., 2012) in their coefficients, which is important in speaker identification applications. Analysis of speaker identification with features extracted for different frame sizes (Jayanna and Prasanna, 2009) helps in improving the speaker identification accuracy. Technologies used to process and store voice prints include frequency estimation, Hidden Markov Models (HMM) (Justin and Vennila, 2013), Gaussian Mixture Models (GMM) (Quiros and Wilson, 2012), the Student's-t mixture model (tMM), pattern matching algorithms, Neural Networks (NN) (Al-Ani et al., 2007), matrix representation, Vector Quantization (VQ) and the wavelet transform. The Extreme Learning Machine (ELM) modeling technique is used because it can outperform traditional tuning-based learning methods (Bharathi and Natarajan, 2011). It offers good generalization performance at extremely fast learning speed. It is a new learning algorithm based on single hidden layer feed forward neural networks, in which the input weights and hidden neurons or kernel parameters need not be tuned. GMM is also known to perform well for text-independent speaker identification.
This study focuses on text-independent and multilingual speaker identification, where there is no constraint on what the speaker says or on the language spoken. The languages used in this work are Tamil, English, Telugu and Hindi. This study aims at:
• Achieving higher speaker identification accuracy with varying frame and filter bank sizes
• Increasing the speed of speaker identification using the Extreme Learning Machine
• Analyzing multilingual speaker identification
This study is organized as follows: Section 2 describes the materials and methods. Section 3 gives the results of the various methods. In Section 4 the results are discussed in detail. Finally, Section 6 concludes the work.

Database Description
The materials used are speech databases. The database used for this work encompasses both synthesized voices from the jyamagis toolkit (jyamagis homepages) and recorded voices. The speaker database contains 50 speakers: recorded speech from 25 speakers and synthesized speech from another 25 speakers from the jyamagis toolkit. The speech is recorded using a high quality microphone in a sound proof booth at a sampling frequency of 16 kHz, with a session interval of one month between recordings. The speech is designed to have rich phonetic content in four languages (Tamil, English, Telugu and Hindi), and four sessions are recorded for each language. GoldWave software is used to record the voices in mono recording mode, and the recorded voice is encoded using PCM encoding. The synthesized voices are generated from the 'jyamagis-the center for speech technology and research' for 25 speakers belonging to 6 categories: Scottish, English and American, male and female.

Feature Extraction
After preprocessing the speech signal by silence removal, wavelet based denoising, pre-emphasis, frame blocking and windowing processes, the features of the speech signal are extracted. Transforming the input data into the set of features is called feature extraction. In this work LPRCC, MFCC, MMFCC and BFCC feature extraction techniques are used.
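The last three preprocessing steps can be illustrated with a minimal NumPy sketch. The pre-emphasis coefficient of 0.97 and the Hamming window are common choices assumed here, not values stated in this work, and the function names are hypothetical. Silence removal and wavelet based denoising are omitted.

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_size=256, shift_ratio=0.5):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    shift = int(frame_size * shift_ratio)
    if len(signal) < frame_size:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_size) // shift
    # Row i holds the sample indices belonging to frame i.
    idx = np.arange(frame_size)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_size)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for 1 s of speech at 16 kHz
frames = frame_signal(preemphasize(speech))
print(frames.shape)                   # (124, 256)
```

With a frame size of 256 and a shift of half the frame size, one second of 16 kHz audio yields 124 frames, each of which is passed on to feature extraction.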

Residual Cepstral Coefficient
LP analysis of speech estimates a residual that represents the excitation source of the speaker. The prediction error is also referred to as the residual signal. In the linear predictive modeling of speech, a speech sample s(n) is approximated as the weighted sum of a limited number of past samples. The residual signal r(n) is obtained for each frame y(n) of the signal s(n). The predicted version of the frame y(n) is y'(n), given by Equation 1:
$$y'(n) = \sum_{k=1}^{p} a_k\, y(n-k) \tag{1}$$
where $a_k$ are the LP coefficients for k = 1, 2, …, p and p is the order of prediction, that is, the number of past samples used.
The LP residual signal is given by Equations 2 and 3:
$$r(n) = y(n) - y'(n) \tag{2}$$
$$r(n) = y(n) - \sum_{k=1}^{p} a_k\, y(n-k) \tag{3}$$
Weighted LP introduces a temporal weighting of the squared residual in model coefficient optimization. This study proposes calculating the log energy of each frame of the LP residual signal r(n) and subjecting it to the Gaussian Mel scale filter bank and cosine transform to arrive at the LP Residual Cepstral Coefficients (LPRCC).
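As a rough illustration of how a residual can be obtained from one frame, the sketch below estimates the LP coefficients with the autocorrelation method and subtracts the predicted frame. The estimation method, the prediction order and the function names are assumptions made for illustration; this work does not state them. The demonstration signal is a synthetic second-order autoregressive process, so an order of 2 is used.

```python
import numpy as np

def lp_coefficients(frame, order=12):
    """Autocorrelation-method LP coefficients a_1..a_p for one frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Normal equations R a = r[1:], where R is Toeplitz in r[0..order-1].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)   # small ridge for numerical safety
    return np.linalg.solve(R, r[1:])

def lp_residual(frame, a):
    """r(n) = y(n) - sum_k a_k y(n-k); samples before the frame are taken as zero."""
    frame = np.asarray(frame, dtype=float)
    predicted = np.zeros_like(frame)
    for k, a_k in enumerate(a, start=1):
        predicted[k:] += a_k * frame[:-k]
    return frame - predicted

# Synthetic test signal: y(n) = 1.3 y(n-1) - 0.6 y(n-2) + white noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4000)
y = np.zeros(4000)
for n in range(2, 4000):
    y[n] = 1.3 * y[n - 1] - 0.6 * y[n - 2] + noise[n]

a = lp_coefficients(y, order=2)
res = lp_residual(y, a)
print(a, np.var(res) / np.var(y))
```

The estimated coefficients should land close to the true values 1.3 and -0.6, and the residual has much lower variance than the signal because the predictable, filter-shaped part has been removed.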

Mel Frequency Cepstral Coefficient
Mel is a unit of pitch. Pairs of sounds perceptually equidistant in pitch are separated by an equal number of mels. The Mel frequency of a given signal is given by Equation 4:
$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{4}$$
where mel(f) is the subjective pitch in mels corresponding to the actual frequency f in Hz. The bandwidth of human speech communication is approximately the frequency range up to 7 kHz (Dhanaskodi and Arumugam, 2011), because both the production and perception organs are most efficient at these low frequencies.
Here the upper frequency f of the speech signal is taken as 8 kHz, on the assumption that the high-frequency portion of the speech signal also carries a small amount of speaker-specific information.
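For a sense of scale, the sketch below evaluates the Mel mapping at the 8 kHz upper frequency used here. It assumes the widely used form mel(f) = 2595 log10(1 + f/700); the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (assumed form: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

top_mel = hz_to_mel(8000.0)
print(round(top_mel, 1))             # about 2840 mels at 8 kHz
print(round(mel_to_hz(top_mel / 2)))  # the Hz value halfway up the mel axis
```

Half of the mel range lies well below half of the 8 kHz bandwidth, which is why filters equally spaced in mels are packed densely at low frequencies.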
Mel-Frequency Cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency (Bharathi and Shanthi, 2012). MFCCs are calculated by first taking the N-point DFT of each frame y(n), as in Equation 5:
$$Y(k) = \sum_{n=1}^{N} y(n)\, e^{-j 2\pi n k / N} \tag{5}$$
whose energy spectrum is $|Y(k)|^2$, where 1 ≤ k ≤ N. The triangular filter bank is reshaped to a Gaussian filter bank to obtain higher correlation with adjacent sub bands. A triangular filter provides crisp partitions in an energy spectrum by giving non-zero weights to the portion covered by it and zero weight outside it. This causes a loss of correlation between a sub band output and the adjacent spectral components that are present in the other sub band, whereas the Gaussian shaped filters shown in Fig. 1 provide a much smoother transition from one sub band to the next, preserving most of the correlation between them.
The cepstral mean subtracted MFCC coefficients are calculated as in Equation 6, where 1 ≤ i ≤ Q, Q is the number of filters in the bank and R is the number of cepstral features.
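A Gaussian shaped Mel filter bank of this kind can be sketched as follows. The centres are placed at equal spacing on the Mel scale, and each filter's standard deviation is set to the distance to the next centre divided by a spread factor `alpha`. That variance rule, the value `alpha = 2`, the 2595 log10(1 + f/700) Mel mapping and the function names are all assumptions for illustration, not parameters taken from this work.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def gaussian_mel_filterbank(n_filters=40, n_fft=256, fs=16000, alpha=2.0):
    """Return an (n_filters, n_fft // 2 + 1) matrix of Gaussian shaped filters."""
    n_bins = n_fft // 2 + 1
    # n_filters centres plus two edge points, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bin_pts = mel_to_hz(mel_pts) / (fs / 2.0) * (n_bins - 1)   # fractional bin positions
    k = np.arange(n_bins)
    bank = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        centre = bin_pts[i]
        sigma = (bin_pts[i + 1] - centre) / alpha   # assumed spread rule
        bank[i - 1] = np.exp(-((k - centre) ** 2) / (2.0 * sigma ** 2))
    return bank

bank = gaussian_mel_filterbank()
rng = np.random.default_rng(0)
frame = rng.standard_normal(256) * np.hamming(256)    # stand-in for one speech frame
energy_spectrum = np.abs(np.fft.rfft(frame)) ** 2      # |Y(k)|^2 over 129 bins
filter_energies = bank @ energy_spectrum
print(bank.shape, filter_energies.shape)
```

Unlike a triangular filter, each Gaussian filter keeps a small non-zero weight on neighbouring sub bands, which is the smoother transition described above.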
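The step from filter bank energies to cepstral mean subtracted coefficients can be sketched as a DCT-II of the log energies followed by removal of the per-utterance mean. The DCT scaling, the choice of 13 coefficients and the function names are assumptions for illustration and may differ from the exact form of Equation 6.

```python
import numpy as np

def mfcc_from_energies(energies, n_ceps=13):
    """energies: (n_frames, Q) filter-bank energies -> (n_frames, n_ceps) cepstra."""
    energies = np.asarray(energies, dtype=float)
    Q = energies.shape[1]
    log_e = np.log(energies + 1e-12)           # small floor avoids log(0)
    m = np.arange(n_ceps)[:, None]             # cepstral index 0..R-1
    l = np.arange(Q)[None, :]                  # filter index 0..Q-1
    basis = np.sqrt(2.0 / Q) * np.cos(np.pi * m * (2 * l + 1) / (2.0 * Q))  # DCT-II
    return log_e @ basis.T

def cepstral_mean_subtract(ceps):
    """Subtract the mean of each coefficient taken over all frames."""
    return ceps - ceps.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
energies = rng.uniform(0.5, 2.0, size=(100, 40))   # stand-in filter-bank energies
ceps = cepstral_mean_subtract(mfcc_from_energies(energies))
louder = cepstral_mean_subtract(mfcc_from_energies(3.0 * energies))
print(ceps.shape, np.allclose(ceps, louder))
```

Scaling every energy by a constant changes only the unsubtracted zeroth coefficient, so after mean subtraction the two feature sets coincide. This is the sense in which mean subtraction removes a fixed gain or channel offset.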

Modified Mel Frequency Cepstral Coefficient
A Modified Mel Frequency Cepstral Coefficient (MMFCC) is an improved version of the conventional MFCC. MMFCC applies compensation based on the magnitude of spread, through a frame-based weighting function, to preserve the speaker-dependent information in different frames. The intensity or loudness at different segments of a spoken word may influence the magnitude of the coefficients, affecting cluster formation in the parameter space for a speaker. MMFCC is a frame-based technique that reduces these effects by normalizing the coefficients in each frame by their total spread, so that the coefficients of all frames are brought to the same level of spread. The cepstral mean subtraction procedure is followed by normalization as in Equation 7. The weighting function is defined in Equation 8. Applying the weighting function gives the modified MFCC coefficients of Equation 9.

Bark Frequency Cepstral Coefficient
The Bark scale provides an alternative, perceptually motivated scale to the Mel scale. The Bark is a unit based on critical band boundaries. Speech intelligibility perception in humans begins with spectral analysis performed by the Basilar Membrane (BM). Each point on the BM can be considered as a band pass filter with a bandwidth equal to one critical bandwidth, or one Bark (Singh, 2010). The bandwidths of several auditory filters were observed empirically and used to formulate the Bark scale. The function given by Sumithra et al. (2011) transforms real (linear) frequency to Bark frequency (Equation 10). After converting the filter bank spacing to Bark-scaled spacing, the remaining conversion into cepstral coefficients is similar to that for MFCC. Barks relate very strongly to mels.
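The exact form of Equation 10 is not shown here. As a stand-in, the sketch below uses one widely cited Hz-to-Bark approximation (the Zwicker and Terhardt form) purely to illustrate the shape of such a mapping; it is an assumption and may not be the function used in this work.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Assumed approximation: 13 * atan(0.00076 f) + 3.5 * atan((f / 7500)^2)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

freqs = np.array([200.0, 1000.0, 5600.0, 8000.0])
print(np.round(hz_to_bark(freqs), 2))
```

Like the Mel scale, the mapping is close to linear at low frequencies and compressive at high frequencies, so filters spaced evenly in Bark are narrow at the low end of the spectrum and wide at the high end.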

Extreme Learning Machine Modeling
ELM is an algorithm designed for single hidden layer feed forward neural networks (Bharathi and Natarajan, 2011). It takes as input the number of input neurons, hidden neurons and output neurons, and the activation function.
For a given training set $\{(x_i, t_i) \mid x_i \in R^n,\ t_i \in R^m,\ i = 1, 2, \ldots, N\}$, activation function g(x) and hidden node number Ñ:
Step 1: The input weights $w_i$ and biases $b_i$, i = 1, 2, …, Ñ, are randomly assigned.
Step 2: The hidden layer output matrix H is calculated.
Step 3: The output weight β is calculated using $\beta = H^{\dagger} T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of H and T is the matrix of training targets.
The activation function is usually an abstraction representing the rate of action potential firing in the node. Activation functions may include the sigmoidal functions as well as the radial basis, sine, cosine, exponential and many non-regular functions. Single hidden Layer Feed forward Neural networks (SLFNs) with N hidden nodes can exactly learn N distinct observations. If input weights and hidden biases are allowed to be tuned, SLFNs with at most N hidden nodes and almost any nonlinear activation function can exactly learn N distinct observations; these activation functions include differentiable and non-differentiable functions, continuous and non-continuous functions. ELM runs 170 times faster than conventional back propagation (BP) algorithms. The testing time for Support Vector machine for Regression (SVR) is 190 times longer than the testing time for ELM. The overall proposed system for speaker identification using ELM and GMM is shown in Fig. 2.
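The three training steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: a sigmoid activation, standard normal random weights, one-hot targets, 30 hidden nodes and a toy three-class data set standing in for speaker features. The function names are hypothetical and none of these settings are taken from this work.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """X: (N, n) inputs, T: (N, m) targets. Returns (W, b, beta)."""
    W = rng.standard_normal((X.shape[1], n_hidden))   # Step 1: random input weights
    b = rng.standard_normal(n_hidden)                 #         and hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # Step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                      # Step 3: beta = pinv(H) T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

rng = np.random.default_rng(0)
centres = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])   # three toy "speakers"

def make_data(n_per_class):
    X = np.vstack([c + 0.3 * rng.standard_normal((n_per_class, 2)) for c in centres])
    labels = np.repeat(np.arange(3), n_per_class)
    return X, labels

X_train, y_train = make_data(50)
X_test, y_test = make_data(50)
T_train = np.eye(3)[y_train]                          # one-hot targets
W, b, beta = elm_train(X_train, T_train, n_hidden=30, rng=rng)
accuracy = np.mean(elm_predict(X_test, W, b, beta).argmax(axis=1) == y_test)
print(accuracy)
```

Training involves no iterative weight updates: the only fitted quantity is β, obtained from a single pseudo-inverse, which is where the speed advantage over back propagation comes from.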

RESULTS
In this work, Extreme Learning Machine (ELM) and Gaussian Mixture Model (GMM) based speaker identification is performed under different frame size and filter bank size conditions and the identification performance is analyzed. Human speakers and machine synthesizers produce speech signals, while human listeners and machine recognizers receive and analyze such signals to estimate the underlying textual message and to identify the speaker. Hence, to cover speech produced by both human speakers and machine synthesizers, this work uses both recorded and synthesized speech. The recorded and synthesized speech waveforms for the utterance 'is', both obtained at a sampling frequency of 16 kHz, are shown in Fig. 3.
In Table 1, the identification accuracy when testing with the same language is calculated as the percentage of correct identifications for same-language pairs such as English-English and Tamil-Tamil for the respective speakers. Table 2 illustrates the speaker identification accuracy with the ELM classifier for different frame sizes of the speech signal. The effect of different frame sizes and filter bank sizes on speaker identification using the MMFCC feature with the GMM and ELM techniques is shown in Fig. 4.
Table 3
Combining classifier decisions to obtain a further improved decision has been successful in speaker identification. The classifier decision combining method used in this study is shown in Fig. 5. Table 5 shows the performance of the GMM based speaker identification system for different combinations of score fusion.

DISCUSSION
The speaker identification task uses a priori information and determines which speaker from a set of possible speakers is the one currently talking. This a priori information is captured in the form of features of the registered user's speech signal. This work uses Linear Predictive Residual Cepstral Coefficient (LPRCC), Mel Frequency Cepstral Coefficient (MFCC), Modified Mel Frequency Cepstral Coefficient (MMFCC) and Bark Frequency Cepstral Coefficient (BFCC) features. The human speech production and hearing mechanisms are likely to have evolved in parallel, each system taking advantage of properties of the other. The BFCC feature used in this work is related to the hearing mechanism, which helps in analyzing the speaker-specific information present in the speech signal in the frequency range 200-5600 Hz.
In the Mel Frequency Cepstral Coefficient feature, the initial coefficient $c_0$ represents the average energy in the speech frame and is discarded for amplitude normalization. The coefficient $c_1$ reflects the energy balance between low and high frequencies, with positive values indicating sonorants and negative values indicating frication. For i > 1, $c_i$ represents increasingly fine spectral detail. For the ELM classifier the primary focus is on labeling and retrieval. Training and testing files are generated from the features extracted after pre-emphasis. These files are fed into the ELM classifier to compute the training and testing accuracy. When the training and testing data files are loaded, the ELM classifier outputs the training time, testing time, training accuracy and testing accuracy. The output weights are calculated for the training and testing data and the output label of the given speaker is classified. Reducing the frame size increases the number of frames obtained. A larger number of frames gives more opportunities to find a good match between the training and testing samples. In GMM, the maximum likelihood score is used to identify the speaker.
For evaluation purposes, speakers are asked to utter different short utterances in Tamil, English, Telugu and Hindi. The maximum length of the speech signal in each session is limited to 4 s. In total, four sessions are recorded for each speaker, of which 2 sessions are used for training and the remaining 2 sessions are used for testing. Most researchers have concentrated on either clean speech or noisy speech for the speaker identification task, whereas this work focuses on a combination of noisy recorded speech and synthesized speech. After preprocessing, the speech signal is framed with frame sizes of 2048, 1024, 512, 256 and 128 samples. Frame shifts of 40, 50, 60 and more than 60% of the frame size are also tried. To improve the speaker identification accuracy, the Mel scale filter bank is constructed using Gaussian shaped filters, in contrast to the triangular filters used in conventional systems. The triangular filter bank is reshaped to Gaussian filters to obtain higher correlation with adjacent sub bands. Mel scale filter banks with 20 and 40 Gaussian shaped filters are evaluated. ELM performs well when the number of classes to which the test signal may belong is small; as the number of classes (in this case, the number of speakers) increases, the performance of ELM drops. Identification accuracy when testing with other languages is calculated by testing with languages other than the training language. From Table 2 it is inferred that when the filter bank has 20 filters and the frame size is 1024, the accuracy is lower. Moreover, when testing is done with speech in the training language (for example English-English), the identification accuracy increases, whereas when testing with speech in another language (for example English-Tamil) the identification accuracy decreases.
Identification accuracy when testing with the same language is calculated by averaging the identification accuracies of all same-language tests (for example Tamil utterance testing with Tamil utterance training, English with English, Telugu with Telugu) for a specific feature. Similarly, identification accuracy when testing with other languages is calculated by averaging the identification accuracies of each language tested against utterances in the other three languages. Experimental evaluation indicates that characterizing the speakers with varying frame sizes and filter bank sizes plays a significant role in capturing the identity of the speaker. Because a larger filter bank can capture finer variations in the sound and so aids identification, the identity of a human speaker can be exploited more robustly by increasing the filter bank size. Speech signals are assumed to be stationary over durations of 10-20 ms. Consistent with this, a frame size of 256 samples (16 ms at 16 kHz) with a frame shift of 50% of the frame size performs better than frames of 512 and 1024 samples. When the frame size decreases below 256 or increases beyond 1024, the filter bank size increases beyond 40, or the frame shift increases above 50% of the frame size, there is a reduction in identification accuracy.
The ELM runs 20 times faster than the GMM algorithm in testing. The overall results reveal that, of the four cepstral features, the MMFCC feature with cepstral mean subtraction contributes most to the speaker-specific attributes, in turn enhancing speaker identification accuracy. Ranjan et al. (2010) used LPC, RC, APSD, number of zero crossings and formant frequency features with an Artificial Neural Network, using a back propagation learning algorithm and a clustering algorithm for the training and identification of 20 speakers speaking Hindi,
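Maximum likelihood identification with GMMs can be sketched as follows: each enrolled speaker is represented by a mixture model, the log-likelihood of the test frames is summed under each model, and the speaker with the highest total is chosen. The sketch assumes diagonal covariances and hand-set toy parameters. Model training (for example by expectation-maximization) is omitted, and the function names and speaker labels are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of frames X (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                            # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)   # (T, M)
    log_comp = log_comp + np.log(weights)
    # Log-sum-exp over components, stabilised by the per-frame maximum.
    peak = log_comp.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.sum(np.exp(log_comp - peak), axis=1))
    return float(frame_ll.sum())

def identify(X, speaker_models):
    """Return the best-scoring speaker name and all per-speaker scores."""
    scores = {name: gmm_log_likelihood(X, *params)
              for name, params in speaker_models.items()}
    return max(scores, key=scores.get), scores

# Two toy speaker models with two mixture components each (illustrative values).
speaker_models = {
    "A": (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    "B": (np.array([0.5, 0.5]), np.array([[4.0, 4.0], [5.0, 5.0]]), np.ones((2, 2))),
}
rng = np.random.default_rng(0)
test_frames = rng.normal(4.5, 1.0, size=(50, 2))   # frames drawn near speaker B
best, scores = identify(test_frames, speaker_models)
print(best)
```

Because the frame log-likelihoods are summed, shorter frames (and therefore more frames per utterance) give the decision more evidence to accumulate.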
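The relation between frame size, frame duration and frame count can be checked with a few lines of arithmetic. The sketch assumes the 16 kHz sampling frequency and the 4 s maximum utterance length stated in this work, with a 50% frame shift; the function names are hypothetical.

```python
FS = 16000  # Hz, sampling frequency of the recordings

def frame_duration_ms(frame_size, fs=FS):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_size / fs

def n_frames(n_samples, frame_size, shift_ratio=0.5):
    """Number of complete frames obtained with the given shift."""
    shift = int(frame_size * shift_ratio)
    if n_samples < frame_size:
        return 0
    return 1 + (n_samples - frame_size) // shift

for size in (128, 256, 512, 1024, 2048):
    print(size, frame_duration_ms(size), n_frames(4 * FS, size))
# 128 8.0 999
# 256 16.0 499
# 512 32.0 249
# 1024 64.0 124
# 2048 128.0 61
```

Of the frame sizes tried, only 256 samples falls inside the 10-20 ms window over which speech is assumed stationary, while still giving roughly 500 frames for a 4 s utterance.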
Telugu, Sanskrit and Punjabi. An average identification rate of 83.29% was achieved when the network was trained using the back propagation algorithm, and this improved by about 9%, reaching 92.78%, when the clustering algorithm was used.
In the proposed work with 50 speakers, an identification accuracy of 79.25% is achieved when the single hidden layer feed forward neural network ELM is used, and 94% is achieved when GMM is used. A combination of classifiers performs better if the classifiers are provided with information that is relevant in nature. Using this concept, the log-likelihood score of the MMFCC feature is combined with the log-likelihood score of the MFCC feature for the GMM modeling.
The weighting factor w used in this work is 0.77. This value was arrived at after several trials. The new log-likelihood score, obtained by summing the weighted scores of MMFCC and MFCC, results in an improved speaker identification accuracy of 97.5%.
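The score fusion can be sketched as a weighted sum of per-speaker log-likelihoods. The sketch assumes that w is applied to the MMFCC score and (1 - w) to the MFCC score, which is one natural reading but is not stated explicitly here. The scores are made-up illustrative numbers and the function name is hypothetical.

```python
import numpy as np

def fuse_scores(ll_mmfcc, ll_mfcc, w=0.77):
    """Weighted sum of per-speaker log-likelihood scores from two feature streams."""
    ll_mmfcc = np.asarray(ll_mmfcc, dtype=float)
    ll_mfcc = np.asarray(ll_mfcc, dtype=float)
    return w * ll_mmfcc + (1.0 - w) * ll_mfcc   # assumed assignment of w

# Hypothetical log-likelihoods of one test utterance under three speaker models.
ll_mmfcc = [-410.0, -395.0, -402.0]   # MMFCC-based GMM scores
ll_mfcc = [-380.0, -399.0, -390.0]    # MFCC-based GMM scores

fused = fuse_scores(ll_mmfcc, ll_mfcc)
print(np.round(fused, 2), int(np.argmax(fused)))   # fused scores and chosen speaker index
```

In this toy case the MFCC scores alone would pick speaker 0, while the fused scores pick speaker 1, following the more heavily weighted MMFCC stream. The weight controls how far one stream can overrule the other.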

CONCLUSION
In this study, the task of finding the speaker's identity using the voice characteristics of multilingual speakers is evaluated with different frame sizes, frame shifts and filter bank sizes. A frame size of 256 samples together with a frame shift of 50% of the frame size performs better than frames of 512 and 1024 samples. A filter bank of 40 Gaussian shaped filters performs better than a bank of 20 filters. An overall identification rate of 79.25% is achieved for the MMFCC feature with frame size 256 using the ELM modeling technique. The maximum identification rate of 97.5% is achieved for the MFCC feature with frame size 256 and the mixture weight 16 using the GMM modeling technique. Experimental results show that Modified Mel Frequency Cepstral Coefficient features perform better with both the GMM and ELM algorithms. Since ELM runs faster than GMM, the ELM algorithm is suitable for speaker identification applications that require a faster response and can tolerate some loss of accuracy. GMM outperforms ELM by a large margin in identification accuracy. The robust performance exhibited by the GMM model is promising and can promote further work in the area of speaker identification when combined with emerging feature extraction and modeling techniques.