Speaker Identification: A Hybrid Approach Using Neural Networks and Wavelet Transform

In speaker identification systems, a database is constructed from the speech samples of known speakers. The approach implemented in this paper is hybrid: the wavelet transform and neural networks are used together to form a system with improved performance. Features are extracted by applying a discrete wavelet transform (DWT), while a neural network (NN) is used to build the system database and to handle decision making. The neural network is trained with the feature vectors as inputs. A criterion that depends on both the false acceptance ratio (FAR) and the false rejection ratio (FRR) is used to evaluate system performance. To test the proposed system, a set of 25 male and female speakers of randomly varied ages was used. Results of admitting the members of this set to a secure system were computed and are presented. The evaluation parameters obtained are FAR = 14.5% and FRR = 24.5%.


INTRODUCTION
Speaker identification has been a wide and attractive area of research, and many approaches based on speech features have been proposed. A speaker recognition system has three important components: feature extraction, the speaker models, and the matching algorithm. Feature extraction derives a set of speaker-specific vectors from the input signal. A speaker model is then generated from these vectors for each speaker. The matching procedure compares the speaker models [1]. Recent work on speaker identification relies on classical features and techniques, including the cepstrum and its many variants [2], sub-band processing [3][4][5][6], Gaussian mixture models (GMM) [7], linear prediction coding [8,9], the wavelet transform [10][11][12], and neural networks [11][12][13]. An overview of several modeling techniques is given in [14]. In [11], a hybrid approach combining the wavelet transform and neural networks is adopted to classify sounds heard over the chest wall, rather than uttered sounds, so that they can be used for diagnosing pulmonary diseases. The same hybrid approach is studied in [12] together with a number of other approaches, and their performances are compared for the recognition of phonemes uttered by a single speaker.
In this study, we consider a hybrid approach in which feature extraction is performed using the discrete wavelet transform (DWT), while speaker modeling and speaker matching are both performed using neural networks. Our choice is motivated by the fact that the wavelet transform offers fine approximation characteristics compared with other spectral analysis techniques, such as the discrete cosine transform (DCT). The possibility of selectively zeroing coefficients is another merit of the wavelet transform. With wavelets, a signal can be analyzed at several levels of resolution, which makes it possible to capture transient, high-frequency bursts with poor frequency resolution and also slowly varying characteristics with high frequency resolution. It is therefore possible to trade frequency resolution for better time resolution (for analyzing transients) and time resolution for better frequency resolution (for analyzing slow variations), a facility not afforded by the short-time Fourier transform [15]. Sentences spoken by a relatively large population of random speakers were used in this work to form the system database. The diversity of this population posed a challenge to the performance of the system, the training process, the input data needed to train the neural networks, and the choice of features to extract. The selection of features varies from one application to another, and it is desirable that dissimilar acoustic vectors be clearly separable from each other (forming separate clusters). However, detailed analysis of feature vectors does not support this assumption: the distribution of the feature vectors is found to resemble a continuous probability distribution rather than a set of data clusters [16]. Accordingly, we consider a vector composed of a set of features without concentrating on any one feature.

Concepts of speaker identification systems:
Speaker identification systems may be classified into two categories based on their principle of operation. Text-dependent systems use a fixed utterance for both test and training and rely on specific features of the test utterance in order to effect a match. Text-independent systems use different utterances for test and training and rely on long-term statistical characteristics of speech to make a successful identification.
Text-dependent systems require less training than text-independent systems and are capable of producing good results with a fraction of the test speech sample required by a text-independent system. The pitch period or fundamental frequency of speech varies from one individual to another; pitch frequency is high for female voices and low for male voices. This suggests that pitch might be a suitable parameter to distinguish one speaker from another, or at least to narrow down the set of probable matches [17] . This concept of speaker identification is adopted in this paper. The analysis of the frequency spectrum of the test utterance provides valuable information about speaker identification. The spectrum contains both pitch harmonics and vocal-tract resonant peaks, making it possible to identify the speaker with a high probability of being correct. The vocal-tract filter parameters (filter coefficients) can also be used to good effect for speaker identification. This is due to the fact that different speakers have different vocal-tract configurations for the same utterance, depending on their physical and emotional conditions, as well as whether the speaker is a native or non-native speaker [9] .
In any text-dependent speaker identification system, an important decision is the choice of test utterance. The source-filter model is most accurate at representing voiced sounds, such as the vowels. Vowels have a definite, consistent pitch period. The vocal-tract configuration for vowel-utterances exhibits a clear formant (resonant) structure. The frequency spectrum corresponding to vowel-utterances therefore contains a wealth of information that can be used for speaker identification. In general, it is difficult to guarantee a hundred percent recognition even with the best speaker identification approaches.
Generally speaking, two parameters may be used to describe the overall performance of a speaker identification system.
A false acceptance occurs when the system incorrectly identifies an unregistered individual as an enrolled one, or when one registered individual is mistaken for another. The FAR (False Acceptance Ratio) is the ratio of the number of false acceptances to the total number of trials. The value of FAR can be reduced by setting a strict (low) threshold.
A false rejection occurs when the system incorrectly refuses to identify an individual who is registered with the system. The FRR (False Rejection Ratio) is the ratio of the number of false rejections to the total number of trials. Setting the threshold to a liberal (high) value can minimize the value of FRR. The requirements for low FAR and low FRR conflict, so the two parameters cannot be lowered simultaneously. However, a low FAR is vital for good speaker identification systems, and most systems are biased toward good FAR performance at the expense of FRR.
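The two ratios defined above can be illustrated with a short calculation. The following is a minimal sketch in Python; the function name and the trial counts are hypothetical and are not taken from the experiments reported here. Because both ratios are defined over the same total number of trials, the fraction of correct decisions is what remains after false acceptances and false rejections are removed.

```python
def error_rates(false_acceptances, false_rejections, total_trials):
    """Return (FAR, FRR, correct_rate), each as a fraction of total_trials."""
    if total_trials <= 0:
        raise ValueError("total_trials must be positive")
    if false_acceptances < 0 or false_rejections < 0:
        raise ValueError("error counts must be non-negative")
    if false_acceptances + false_rejections > total_trials:
        raise ValueError("error counts exceed the number of trials")
    far = false_acceptances / total_trials
    frr = false_rejections / total_trials
    # Every trial that is neither a false acceptance nor a false rejection
    # is counted as a correct decision.
    correct = (total_trials - false_acceptances - false_rejections) / total_trials
    return far, frr, correct


# Hypothetical counts, chosen only for illustration.
far, frr, correct = error_rates(false_acceptances=29,
                                false_rejections=49,
                                total_trials=200)
print(f"FAR = {far:.1%}, FRR = {frr:.1%}, correct = {correct:.1%}")
```

With these hypothetical counts, 29 false acceptances and 49 false rejections in 200 trials leave 122 correct decisions.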
Spectral analysis using wavelets:
The spectral analysis tool used in this work is the wavelet transform (WT). The discrete wavelet transform (DWT) is a special case of the WT that provides a compact representation of a signal in time and frequency and can be computed efficiently [18,19]. DWT analysis can be performed using a fast, pyramidal algorithm related to multirate filterbanks. The main process performed by this algorithm is a succession of highpass and lowpass filtering operations on the time-domain signal, defined by the following equations [12,20]:

$$Y_{high}(k) = \sum_{n} x(n)\, g(2k - n) \qquad (1)$$

$$Y_{low}(k) = \sum_{n} x(n)\, h(2k - n) \qquad (2)$$

where x(n) is the input signal and Y_high(k), Y_low(k) are the outputs of the highpass (g) and lowpass (h) filters, respectively, after subsampling by 2. In this work, the 9th-level wavelet decomposition obtained for each sampled speech input is a vector of 22 coefficients, which serves as the model for the speaker.
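The pyramidal algorithm described above can be sketched in a few lines of Python. This is an illustration only: the Haar filters, the zero padding at the borders, the choice to keep the odd-indexed outputs of the full convolution when subsampling by 2, and the function names are all assumptions, since the text does not state which wavelet or border handling was used.

```python
import math

# Haar analysis filters (an assumption made for illustration only).
H = [1 / math.sqrt(2), 1 / math.sqrt(2)]    # lowpass filter h
G = [1 / math.sqrt(2), -1 / math.sqrt(2)]   # highpass filter g


def filter_and_downsample(x, taps):
    """Convolve x with taps (zero padding) and keep every second output."""
    full_len = len(x) + len(taps) - 1
    out = []
    for m in range(1, full_len, 2):          # odd positions: subsampling by 2
        acc = 0.0
        for j, tap in enumerate(taps):
            n = m - j
            if 0 <= n < len(x):
                acc += x[n] * tap
        out.append(acc)
    return out


def dwt_pyramid(x, levels):
    """Return (final approximation, list of detail vectors for levels 1..levels)."""
    approx = list(x)
    details = []
    for _ in range(levels):
        # The highpass output is stored; the lowpass output feeds the next level.
        details.append(filter_and_downsample(approx, G))
        approx = filter_and_downsample(approx, H)
    return approx, details


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = dwt_pyramid(signal, levels=2)
print([len(d) for d in details], len(approx))
```

Each level halves the number of coefficients, which is why a deep level yields a short vector. Under the assumptions of this sketch, a 10,000-sample window decomposed to level 9 gives 20 coefficients; the exact count at a given level depends on the filter length and the border extension used.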
The matching and decision making processes:
Neural networks are able to accumulate knowledge about objects and processes using learning algorithms. The type of neural network adopted in our work is based on multi-valued neurons (MVN). The MVN-based neural network was chosen because it supports multi-valued threshold logic and can implement arbitrary mappings between inputs and outputs described by a partially defined multiple-valued function. This type of neural network is also known for its quickly converging learning algorithms. A comprehensive treatment of MVN, covering its theoretical aspects, learning, and properties, is presented in [21].
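To make the idea of multi-valued threshold logic concrete, the following Python sketch shows the discrete MVN activation as it is commonly formulated in the literature on multi-valued neurons: the complex plane is divided into k equal sectors, and the neuron outputs the k-th root of unity that marks the sector containing the weighted sum. This formulation, the function names, and the value k = 4 are assumptions made for illustration; the details of the neuron used in this work are deferred to [21].

```python
import cmath
import math


def mvn_activation(z, k):
    """Return (sector index j, exp(i*2*pi*j/k)) for the sector containing arg(z)."""
    angle = cmath.phase(z) % (2 * math.pi)         # arg(z) mapped to [0, 2*pi)
    # The final "% k" guards against rounding that could push the index to k.
    sector = int(angle // (2 * math.pi / k)) % k
    return sector, cmath.exp(2j * math.pi * sector / k)


def mvn_output(weights, inputs, k):
    """Weighted sum with bias weights[0], followed by the MVN activation."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return mvn_activation(z, k)


# Hypothetical example: a weighted sum in the second quadrant with k = 4
# falls in sector 1, whose output value is approximately 1j.
sector, value = mvn_activation(-1 + 1j, 4)
print(sector, value)
```

The neuron's output is therefore one of k discrete values on the unit circle rather than a binary decision.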
The proposed system:
The proposed speaker identification system is composed of two main phases. The first is off-line processing to generate a model (pattern) matching data file. This involves a number of sequential steps, as shown in the block diagram of Fig. 1. Once the samples are collected, preprocessing is applied to remove unwanted data and noise. The samples are then converted to digital form and stored in data files. A rectangular window is applied to limit the data to a specific period. The data patterns of the collected samples are then used to train the neural network. The second phase of the system, whose steps are shown in Fig. 2, implements the speaker identification strategy. This phase applies a model matching approach that compares average features derived from the test data with the collection of stored average speaker templates built during the training process. Based on the mathematical properties of MVN and their learning policy, we propose the NN structure. The general structure of the MVN-based neural network used for identification is presented in Fig. 3.
In Fig. 3, the input layer is composed of n neurons corresponding to n input values (n = 22, the values contained in the ninth-level DWT). The output layer contains p = 25 neurons, one for each of the p speakers. A hidden layer with eleven neurons (k = 11) is used, as this number was found suitable. Throughout the experiments, we found that increasing the number of hidden neurons does not improve the results, while reducing it may make them worse. This scheme complies with the set of the adopted number of ... Our goal is to identify a representative speaker related to such an organization. This population consists basically of two classes (male and female). These two classes are further divided according to age, giving six classes (up to 20 years, 21-40 years, and above 40 years, each for male and female speakers).

Results and evaluation of system performance:
Referring to the structure of the proposed system, a series of processing steps is applied to the input speech samples, as shown in Figs. 1 and 2. In the digitization and windowing stage of Fig. 1, the acquired samples are passed through a window that truncates the data to a set of 10,000 values containing data of high speech entropy. All simulations are performed using MATLAB 7.0. Figures 4 and 5 show the output of processing two speech samples (segments) acquired from speaker 1, a male aged above 40 years. We refer to them as S11 and S12. Figures 6 and 7 show the corresponding output for two samples, S21 and S22, of another speaker (speaker 2, a female aged 21-40 years). A similarity between the DWT levels (1-9) of Fig. 6 and their counterparts in Fig. 7 can be clearly noticed, since they belong to the same speaker. The same form of similarity can also be seen in the spectral analysis presented in Figs. 4 and 5.
As with the samples S11, S12, S21, and S22, the elements of the sets {S1}, {S2}, ..., {Sp} belonging to a population of p = 25 speakers are used to train the neural network, with each set comprising 30 to 100 elements. The system is then tested with speech samples that are either stored (i.e., from the training set) or not stored. The non-stored samples may belong to an enrolled speaker or to some other speaker. All neurons were trained using a learning algorithm based on the equation proposed in [22]. The final classification results are given in Table I. In addition to the values appearing in this table, the two parameters of the evaluation criterion (FAR and FRR) were calculated and found to be FAR = 14.5% and FRR = 24.5%. Compared with the outcomes of other works, these results are good, considering that the speech samples used in our work are not restricted to a certain class of speakers but are gathered from a relatively random population of speakers. We therefore consider the overall performance of the system successful and promising.

CONCLUSION
Neural networks and wavelet transform techniques have been used as a hybrid approach to speaker identification, with the aim of obtaining better identification performance. Using the wavelet transform, specific properties of speakers are extracted as vectors (patterns) and then presented to a neural network based on multi-valued neurons. The activation function and learning properties of these neural networks are exploited to widen the threshold of accurate identification. In speaker identification systems, accurate results can never be guaranteed 100% of the time. Our system, with its hybrid structure, gives acceptable results in speaker identification, and these results can be considered reasonably accurate given that the population of speakers is relatively random. An advantage of the wavelet transform is that it concentrates on extracting the dominant features of a speech sample, which reduces the required storage capacity. Storage reduction is an important factor in internet and telecommunication applications. Finally, with a total of 61% correct identification, we cannot claim that our system can be directly deployed in practice. Further work can be carried out to improve it, especially in the feature extraction phase.