Speaker Identification Using Discrete Wavelet Transform

This study presents an experimental evaluation of Discrete Wavelet Transforms for speaker identification. The features are tested using speech data provided by the CHAINS corpus. The system consists of two stages: a feature extraction stage and an identification stage. Parameters are extracted and used in a closed-set, text-independent speaker identification task. The signals are pre-processed and features are extracted using discrete wavelet transforms. The energy of the wavelet coefficients is used for training the Gaussian Mixture Model. Daubechies wavelets are used and the speech samples are analyzed using 8 levels of decomposition.


Introduction
The goal of Automatic Speaker Recognition is to extract, characterize and recognize information about the speaker's identity (Da Wu and Fu Lin, 2009). Speaker Recognition is divided into Speaker Identification and Speaker Verification. In a Speaker Identification (SI) system, an unknown speaker is compared with a database of N known speakers. SI is further divided into open-set and closed-set identification. Identifying a speaker who is assumed to be one of the N registered speakers is known as closed-set speaker identification. If the target speaker may not be a member of the set of registered speakers, the task is known as open-set identification. An SI system can also be text-dependent or text-independent. If a known test utterance is presented to the recognizer, it is a text-dependent task; otherwise it is a text-independent task.
The speaker-specific information is mainly represented by spectral features such as Mel-Frequency Cepstral Coefficients (MFCCs), Linear Prediction Cepstral Coefficients (LPCCs) and Short-Time Fourier Transforms (STFT). MFCCs, which are calculated by taking the Discrete Cosine Transform (DCT) of mel-scaled log filter bank energies, have two drawbacks:
• Since the basis vectors of the DCT cover all frequency bands, corruption of one frequency band of speech by noise affects all MFCCs
• A frame of speech may contain information from two adjacent phonemes
The LPCC method is based on a linear (all-pole) model of speech production. All the above methods assume the signal to be stationary within a given time frame and lack the ability to analyze localized events correctly.
The speaker identification system is composed of two distinct phases, a training phase and a test phase. In the first phase, parameters are extracted from speech.
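The first drawback can be illustrated with a minimal sketch. The log filter bank energies below are hypothetical values chosen for illustration, and the DCT is an unnormalised DCT-II; neither is taken from the experiments in this study. Corrupting a single filter bank channel changes every DCT coefficient.

```python
import math

def dct2(x):
    """Unnormalised DCT-II of a list of values."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
            for k in range(n)]

# Hypothetical log filter bank energies for one frame (illustration only).
clean = [5.0, 6.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0]

# Corrupt a single channel, as band-limited noise might.
noisy = list(clean)
noisy[3] += 2.0

# Change in each cepstral-like coefficient caused by the corruption.
delta = [b - a for a, b in zip(dct2(clean), dct2(noisy))]
changed = sum(1 for d in delta if abs(d) > 1e-9)
print(changed, "of", len(clean), "coefficients changed")
```

Because every DCT basis vector is non-zero at the corrupted channel in this example, the perturbation is spread over all coefficients.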

Wavelet Based Feature Extraction
Wavelets are shifted and scaled versions of an original, or mother, wavelet (Yung Lung, 2010). Wavelets split a signal into components that are not pure sine waves. Wavelets can examine signals simultaneously in both time and frequency. Therefore, wavelet transforms are useful for analyzing noisy, transient signals.
The main advantage of wavelets (Hariharan et al., 2013) is that they have a varying window size, wide for low frequencies and narrow for high frequencies. This is because low frequency components complete a cycle over a long time interval. Slowly varying components can therefore only be identified over long time intervals, whereas fast varying components can be identified over short time intervals. Because the windows are adapted to the transients of each scale, wavelets do not require the signal to be stationary during the analysis interval. This leads to good time-frequency resolution in all frequency ranges: the transform gives high frequency resolution in the low frequency part of the signal and high time resolution in the high frequency part of the signal.
The Discrete Wavelet Transform (DWT) decomposes (Claude et al., 2011) non-stationary signals into different frequency intervals with various resolutions. The signal passes through a low pass and a high pass filter and is thus decomposed into a rough approximation and a detail component. In the orthogonal wavelet decomposition procedure, the generic step splits the approximation coefficients into two parts. After splitting, we obtain a vector of approximation coefficients and a vector of detail coefficients, both at a coarser scale. The information lost between two successive approximations is captured in the detail coefficients. The next step consists of splitting the new approximation coefficient vector; successive details are never re-analysed. Another factor favoring the DWT over conventional methods is that it allows time-frequency localization: within the resolution available at each scale, it is possible to know both the frequency content of a signal and the time at which that content occurs.
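The decomposition procedure described above can be sketched as follows. This is a minimal illustration that uses the Haar wavelet as a stand-in, because its two-tap filters can be written out directly; it is not the Daubechies filter used in this study. The function names are hypothetical, and the input length is assumed to be divisible by 2 at every level.

```python
import math

def haar_step(signal):
    """One analysis step: low pass and high pass filtering, each followed
    by downsampling by 2. Assumes an even number of samples."""
    s = 1.0 / math.sqrt(2.0)
    idx = range(0, len(signal) - 1, 2)
    approx = [s * (signal[i] + signal[i + 1]) for i in idx]
    detail = [s * (signal[i] - signal[i + 1]) for i in idx]
    return approx, detail

def wavedec(signal, levels):
    """Return [cA_L, cD_L, ..., cD_1]. Only the approximation is split
    again at each level; detail vectors are never re-analysed."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coeffs = wavedec(x, 3)
print([len(c) for c in coeffs])
```

Each level halves the length of the approximation vector, and because the transform is orthonormal the total energy of the coefficients equals the energy of the input.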
Wavelet spaces (Daqrouq, 2011) are a series of function spaces that are highly decorrelated from each other and are particularly suitable for the representation of signals and operators at different resolution scales that exhibit speech and speech feature behaviour.

Daubechies' Discrete Wavelet Transforms
The most common family of wavelets, Daubechies (DWT-db), has its low-pass filter coefficients determined by solving the system in Equation 1 (Da Wu and Fu Lin, 2009).
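As an illustration of the kind of conditions such a system imposes, the sketch below checks the standard constraints (normalisation, orthogonality under even shifts, and vanishing moments) on the 4-tap Daubechies low-pass filter, whose coefficients have a well-known closed form. This is an assumption-laden example: the normalisation used here (coefficients summing to the square root of 2) may differ from that of Equation 1, and the db6 to db10 filters used in this study are longer and have no comparably simple closed form.

```python
import math

SQRT3 = math.sqrt(3.0)

# Closed-form 4-tap Daubechies low-pass filter (two vanishing moments).
h = [c / (4.0 * math.sqrt(2.0))
     for c in (1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3)]

def shifted_dot(f, m):
    """Inner product of the filter with itself shifted by 2*m taps."""
    return sum(f[k] * f[k + 2 * m] for k in range(len(f) - 2 * m))

def alternating_moment(f, p):
    """Sum of (-1)^k * k^p * f[k]; zero for each vanishing moment."""
    return sum((-1) ** k * (k ** p) * f[k] for k in range(len(f)))

checks = {
    "sum": sum(h),                      # expected sqrt(2)
    "energy": shifted_dot(h, 0),        # expected 1
    "shift": shifted_dot(h, 1),         # expected 0
    "m0": alternating_moment(h, 0),     # expected 0
    "m1": alternating_moment(h, 1),     # expected 0
}
for name, value in checks.items():
    print(name, round(value, 12))
```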

Database Description
The features are extracted using speech data provided by the CHAINS Corpus as mentioned in Table 1. The corpus contains the recordings of 36 speakers obtained in two different sessions with a time separation of about two months. The first recording session provided speech in three different speaking styles (SOLO, SYNCHRONOUS and RETELL). The SOLO condition is used as training and testing material in this study. Sentences s10 to s33 are used to generate the training sets and sentences s1 to s9 are used as speech samples for testing.

Pre-Processing
The sound files contained in the corpus are .WAV files sampled at 44.1 kHz with a resolution of 16 bits. They were downsampled to 22.05 kHz. Only the frequency range 400 Hz to 8 kHz was used in this study. The speech was pre-emphasized using a factor of 0.97 to boost the higher frequencies.
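The pre-emphasis step can be sketched as below. The text gives only the factor 0.97; the first-order difference form y[n] = x[n] - 0.97 x[n-1] is the commonly used filter and is assumed here, as is the choice to pass the first sample through unchanged. The function name is hypothetical.

```python
def pre_emphasis(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] (assumed filter form).
    The first sample has no predecessor and is kept unchanged."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out

# A constant (zero-frequency) signal is strongly attenuated ...
slow = pre_emphasis([1.0, 1.0, 1.0, 1.0])
# ... while a signal alternating at the Nyquist rate is boosted.
fast = pre_emphasis([1.0, -1.0, 1.0, -1.0])
print(slow, fast)
```

The two test signals show the high-pass character of the filter: after the first sample, the constant signal is scaled to about 0.03 and the alternating signal to about 1.97 in magnitude.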

Feature Extraction
The original signal is split as shown in Fig. 1. The accuracy of the reference SI system is estimated using DWT for parameterization.
The Daubechies wavelets db6 to db10 are used to encode the signals, with 8 levels of decomposition. The total length of the training material is approximately 50 s per speaker, the number of speakers is 16, and utterances of 10 s are used for testing. The energy of the wavelet coefficients is taken as the feature set for training the Gaussian Mixture Model. The training of the GMM is performed using SOLO recordings from the first recording session. The accuracy of the SI system is expressed as the mean of ten runs.
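A minimal sketch of sub-band energy features for an 8-level decomposition is given below. Several points are assumptions rather than details from this study: the Haar wavelet stands in for db6 to db10 (a wavelet library such as PyWavelets could supply those filters), the approximation band is included as a feature, raw energies are used without normalisation or logarithm, and the test signal is a synthetic 1 kHz tone. The band edges are nominal values for ideal half-band filters at the 22.05 kHz sampling rate.

```python
import math

def haar_wavedec(signal, levels):
    """Multi-level Haar DWT, returning [cA_L, cD_L, ..., cD_1].
    Assumes len(signal) is divisible by 2**levels."""
    s = 1.0 / math.sqrt(2.0)
    approx = list(signal)
    details = []
    for _ in range(levels):
        idx = range(0, len(approx) - 1, 2)
        detail = [s * (approx[i] - approx[i + 1]) for i in idx]
        approx = [s * (approx[i] + approx[i + 1]) for i in idx]
        details.append(detail)
    return [approx] + details[::-1]

def subband_energies(signal, levels=8):
    """One energy value per sub-band, ordered [A_L, D_L, ..., D_1]."""
    return [sum(c * c for c in band) for band in haar_wavedec(signal, levels)]

def band_edges(sample_rate, levels=8):
    """Nominal frequency range in Hz of each sub-band, in the same order."""
    nyquist = sample_rate / 2.0
    edges = [(0.0, nyquist / 2 ** levels)]
    for j in range(levels, 0, -1):
        edges.append((nyquist / 2 ** j, nyquist / 2 ** (j - 1)))
    return edges

fs = 22050
# Synthetic 1 kHz tone, 1024 samples (illustration only).
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(1024)]
features = subband_energies(x, levels=8)
edges = band_edges(fs, levels=8)
for (lo, hi), e in zip(edges, features):
    print(f"{lo:8.1f} to {hi:8.1f} Hz  energy {e:10.4f}")
```

With 8 levels the feature vector has 9 entries, and the finest detail band nominally spans 5512.5 Hz to 11025 Hz.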

Results and Discussion
Figure 2 shows the accuracy of the GMM classifier as the number of Gaussian components is varied. The maximum accuracy is 83.3%. The identification accuracy decreases as the number of Gaussian components is increased and reaches a minimum for 64 Gaussian components. The speaker set consisting of 16 female speakers achieved a maximum accuracy of 74.31% when decomposed to 8 levels using db7.
Four sets of speakers, each consisting of 16 speakers, are employed. The first set contains 8 male and 8 female speakers. The second and the third sets contain 16 male and 16 female speakers, respectively. The fourth set contains 16 female speakers speaking the same dialect (Co. Dublin-IE).
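The closed-set decision made with one GMM per speaker can be sketched as follows. The decision rule is an assumption: the text does not state it, and the usual maximum-likelihood rule is used here, in which the test utterance is assigned to the registered speaker whose model gives the highest log-likelihood. The speaker names, model parameters and feature vectors are hypothetical, and diagonal covariances are assumed.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of feature vectors under one speaker's GMM.
    gmm is a list of (weight, mean, variance) components."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def identify(frames, models):
    """Closed-set decision: return the best-scoring registered speaker."""
    return max(models, key=lambda name: gmm_log_likelihood(frames, models[name]))

# Hypothetical two-component models in a 2-D feature space.
models = {
    "spk_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "spk_b": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 4.0], [1.0, 1.0])],
}
test_frames = [[5.2, 4.9], [5.8, 4.3]]
print(identify(test_frames, models))
```

In a closed-set task the rule always returns one of the registered speakers; an open-set system would additionally need a rejection threshold.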