Speaker Recognition Using Spectral Cross-correlation: A Fast Algorithm

This study presents an original algorithm for computing the cross-correlation function, applied to speech recognition. A spectral correlation estimation algorithm based on comparing the magnitude spectra of the two signals is presented. The number of samples is reduced by a factor of two after eliminating the image spectrum. A moving average filter is used to smooth the magnitude spectrum, and resampling is performed in the frequency domain, which reduces the spectrum size by a factor of 8. The algorithm shows good results in recognizing the voice of a specific person, hence its application in speaker identification.


INTRODUCTION
Speech recognition has been an active field of research during the last three decades. The rapid pace of business today requires employees and customers to have fast, constant access to information. Technology has provided a mechanism for efficient communication and information storage, but it has also introduced new levels of complexity into the business environment. Long learning curves impact productivity, and complex interfaces make software difficult to use. Speech-enabled interfaces to computers can help solve important business problems and improve security systems for sensitive areas. Speech-enabled applications can help reduce the training costs of rapidly changing software products by providing a more intuitive user interface, allowing users to replace complex drop-down menu commands with simple spoken commands. Customer service departments can provide customers with automated access to information and services, since speech-enabled applications can eliminate the constraints of the telephone keypad and allow easier-to-use automated systems.
Because speech is a learned function, any interference with learning ability may be expected to cause speech impairment. The most common interfering conditions are certain neuroses and psychoses, mental retardation and brain damage, whether congenital or acquired. Voice disorders, so-called dysphonias, may be the product of disease or accidents that affect the larynx. They may also be caused by physical anomalies such as incomplete development or other congenital defects of the vocal cords. Disorders of rate and rhythm are generally either psychogenic or have a basis in some neurological disturbance. Hence, a speaker-independent speech recognition system must have a robust algorithm to identify a special phrase pronounced by the speaker and to permit access to a specific location or the use of a given device. A computer application was developed which activates a hardware device based on the result of recognition (Fig. 1). The predefined recorded phrase is compared in real time with speech recorded online from a microphone placed in front of the device to be controlled.
Many algorithms based on cepstral analysis or homomorphic analysis are used in research on automated speech recognition [1][2][3]. The hidden Markov model (HMM) is one of the most frequently used methods [4]. An HMM uses statistical modeling and libraries of words and grammar rules to select the highest-probability outcome from a sequence of samples. Cepstral analysis supplanted the direct use of linear prediction (LP) analysis, derived from hidden Markov modeling. Other works have implemented a person identification system based on acoustic and visual features [5].
Speaker recognition, or person identification, is a process that automatically authenticates a person's identity based on his or her voice. Although speaker recognition includes diverse tasks that discriminate between people in terms of their voices, most studies focus on speaker identification and verification. A speaker recognition system often works in either of two operating modes: text-dependent and text-independent. In text-dependent mode, the same or known text is used for training and testing. The algorithm presented in this study is based on text-dependent speaker recognition.

TIME AND SPECTRAL CROSS-CORRELATION
The correlation function is used to measure the similarity between the pass-phrase and the person's speech. The cross-correlation function is given, in its standard form, by:
$$r_{xy}(l) = \sum_{n} x(n)\, y(n-l)$$
The result of the cross-correlation function depends on the strength of the recorded speech and the duration of the recording. To perform the cross-correlation independently of scale and duration, the normalized cross-correlation is used:
$$\rho_{xy}(l) = \frac{r_{xy}(l)}{\sqrt{r_{xx}(0)\, r_{yy}(0)}}$$
where r_xx(0) and r_yy(0) are the energies of x(n) and y(n) respectively. The recognition results are expressed in the interval [0, 100]. Comparing the signals in the time domain is not suitable for speech signals because of their statistical behaviour.
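As a minimal sketch of the normalized time-domain cross-correlation described above, the Python code below assumes the common convention r_xy(l) = sum of x(n) y(n - l), normalizes by the square root of the product of the two signal energies, and scales the result to the interval 0 to 100. The function names and the choice of the peak magnitude over all lags as the score are illustrative assumptions, not taken from the original equations.

```python
import math

def cross_correlation(x, y, lag):
    """r_xy(lag) = sum over n of x[n] * y[n - lag] (assumed convention)."""
    total = 0.0
    for n in range(len(x)):
        m = n - lag
        if 0 <= m < len(y):
            total += x[n] * y[m]
    return total

def similarity_percent(x, y):
    """Peak magnitude of the normalized cross-correlation, scaled to 0..100."""
    r_xx = sum(v * v for v in x)   # energy of x, i.e. r_xx(0)
    r_yy = sum(v * v for v in y)   # energy of y, i.e. r_yy(0)
    if r_xx == 0.0 or r_yy == 0.0:
        return 0.0                 # an all-zero signal has no defined score
    lags = range(-(len(y) - 1), len(x))
    peak = max(abs(cross_correlation(x, y, lag)) for lag in lags)
    return 100.0 * peak / math.sqrt(r_xx * r_yy)
```

Because of the normalization, multiplying either signal by a constant gain leaves the score unchanged, which is the scale independence the text refers to.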
The magnitude spectrum is smoothed with a moving average (MA) filter, which is a finite impulse response (FIR) system. The filter length L is chosen to be 10, YM is the magnitude spectrum and YF is the filtered magnitude spectrum.
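The smoothing step can be sketched as follows, assuming a causal L-point moving average, YF(k) = (1/L) * sum of YM(k - i) for i = 0..L-1, with samples before the start of the spectrum treated as zero. The causal form and the edge handling are assumptions; the text only states that the filter is an FIR moving average with L = 10.

```python
def moving_average(ym, L=10):
    """Causal L-point moving average of a magnitude spectrum.

    yf[k] = (1/L) * sum(ym[k - i] for i in range(L)), where samples
    before index 0 are taken as zero (an assumption about edge handling).
    """
    yf = []
    running = 0.0
    for k, value in enumerate(ym):
        running += value            # newest sample enters the window
        if k >= L:
            running -= ym[k - L]    # oldest sample leaves the window
        yf.append(running / L)
    return yf
```

A single spectral line of height 10 is spread into ten consecutive samples of height 1, which shows how the filter trades peak sharpness for a smoother envelope.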

Spectral cross-correlation algorithm
Step 1: Compute the FFTs of the windowed signals x(n) and y(n).
Step 2: Eliminate the upper half (the image) of each spectrum.
Step 3: Smooth each magnitude spectrum using the MA filter.
Step 4: Resample the magnitude spectra X(f) and Y(f) every 4 Hz (i.e., one sample in eight is kept).
Step 5: Compute the spectral cross-correlation S(f).
Step 6: Find the peak value of the spectral cross-correlation, S(0).
Step 7: If S(0) > Threshold, then recognition procedure; else recognition failure.
When processing 2 seconds of speech digitized at 8 kHz, the signal contains 16000 samples. The FFT is computed using a radix-2 algorithm on 16384 samples, so 8192 samples represent the actual spectrum and the other 8192 samples represent the image spectrum, which is eliminated. The actual spectrum has a resolution of 0.5 Hz (i.e., the inverse of the signal duration), which is reduced to 4 Hz. The new spectrum therefore contains only 1024 samples after resampling, and the correlation function has a length of 2048 samples instead of 32768 samples in the time domain. The gain is thus a factor of 16 in terms of correlation size. This method is more appropriate for person identification or speaker-dependent speech recognition. The spectral analysis is based on calculating the cross-correlation between the FFTs of the two signals.
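The seven steps can be sketched in plain Python as shown below. This is an illustration under stated assumptions rather than the authors' implementation: the 16000-sample signal is zero-padded to 16384 points, the window is the standard 4-term Blackman-Harris in periodic form, the MA filter is causal, S(0) is taken as the normalized zero-lag product of the two reduced spectra expressed in percent, and the threshold (whose value the text does not give) is a required argument. All function names are hypothetical.

```python
import cmath
import math

def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def blackman_harris(n_points):
    """4-term Blackman-Harris window (periodic form, assumed)."""
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    return [a0
            - a1 * math.cos(2 * math.pi * n / n_points)
            + a2 * math.cos(4 * math.pi * n / n_points)
            - a3 * math.cos(6 * math.pi * n / n_points)
            for n in range(n_points)]

def reduced_spectrum(signal, n_fft=16384, L=10, step=8):
    """Steps 1 to 4; requires len(signal) <= n_fft, n_fft a power of two."""
    window = blackman_harris(len(signal))
    frame = [s * w for s, w in zip(signal, window)]
    frame += [0.0] * (n_fft - len(frame))            # zero-padding (assumption)
    half = [abs(c) for c in fft(frame)[: n_fft // 2]]  # drop the image half
    smooth = [sum(half[max(0, k - L + 1): k + 1]) / L  # causal MA filter
              for k in range(len(half))]
    return smooth[::step]                            # keep one sample in `step`

def similarity(x, y, **kwargs):
    """Steps 5 and 6: normalized zero-lag spectral correlation, in percent."""
    xs, ys = reduced_spectrum(x, **kwargs), reduced_spectrum(y, **kwargs)
    s0 = sum(a * b for a, b in zip(xs, ys))
    norm = math.sqrt(sum(a * a for a in xs) * sum(b * b for b in ys))
    return 100.0 * s0 / norm if norm > 0.0 else 0.0

def recognize(x, y, threshold, **kwargs):
    """Step 7: accept when S(0) exceeds the user-chosen threshold."""
    return similarity(x, y, **kwargs) > threshold
```

With the default parameters, a 16000-sample input yields 16384 / 2 / 8 = 1024 spectral samples, matching the sizes quoted above. A production version would normally use an optimized FFT library instead of the recursive routine shown here.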
Sidelobe leakage can be reduced by selecting a window with low sidelobes. Hence, each signal is multiplied by a Blackman-Harris window before calculating the corresponding FFT (Fig. 1). This window has a peak sidelobe level of -92 dB. N denotes the length of the window.
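For reference, the 4-term Blackman-Harris window with a -92 dB peak sidelobe is commonly written as below. The periodic form, with N in the denominators, is assumed here; the symmetric form uses N - 1 instead, and the text does not say which variant was used.

```latex
\[
w(n) = a_0 - a_1 \cos\!\left(\frac{2\pi n}{N}\right)
           + a_2 \cos\!\left(\frac{4\pi n}{N}\right)
           - a_3 \cos\!\left(\frac{6\pi n}{N}\right),
\qquad 0 \le n \le N-1
\]
\[
a_0 = 0.35875,\quad a_1 = 0.48829,\quad a_2 = 0.14128,\quad a_3 = 0.01168
\]
```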

$$x_{BH}(n) = x(n)\, w(n)$$
The actual spectrum is given by:
The spectral correlation and the similarity criterion are given by:

RESULTS AND DISCUSSION
We implemented a new recognition algorithm based on spectral cross-correlation. The first step was to program a cross-correlation between the reference signal and the pre-recorded (target) voice in the time domain (Fig. 2). The second method uses a spectral correlation algorithm (Fig. 3). The spectrum is estimated on 1024 samples. Table 1 compares the similarity criterion results for different signal-to-noise ratios (SNR).
Table 1: Similarity criterion versus SNR
Fig. 6-a: SNR 20 dB, similarity 68%
With child voice (Fig. 6-b): SNR 20 dB, similarity 49%
Figure 2 shows two signals which are almost the same, yet the time-domain cross-correlation gives a similarity of less than 20% because of the statistical behaviour of speech signals. When the same signals are compared in terms of spectral content using magnitude spectrum correlation, the result shows a similarity of 91%.
Another advantage of spectral correlation is the computation time, which is critical for a real-time process. As a comparison, the time-domain cross-correlation of 16384 samples for each signal takes 2 seconds, whereas the calculation of the two FFTs and the spectral correlation is done in less than 0.1 seconds; the second method is therefore three times faster for 2 seconds of speech processing. The number of samples in the frequency domain is reduced by a factor of two when eliminating the image of the spectrum (i.e., the upper half of the spectrum). Recognition based on the filtered spectral content shows robustness to noise. Another main feature is that only one value of the spectral cross-correlation, S(0), needs to be computed. The overall results show that time-domain correlation is not suitable for speech recognition (Fig. 4 and 5). A recognition failure test was carried out on a different short phrase pronounced by a woman and a child (Fig. 6). The spectral recognition algorithm takes 0.80 seconds on a personal computer with a Celeron processor running at a 550 MHz CPU clock speed. This time includes the computation of the two FFTs, the magnitude spectra, the moving average filtering of the spectra, the cross-correlation and the calculation of the similarity criterion. The recognition algorithm can be sped up by calculating the spectral correlation in a reduced range (Rr) of 5% (Fig. 6-b), since the similarity criterion is defined at S(0). Experiments show that only S(0) is necessary, hence a great reduction in the computation requirements.
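To illustrate why evaluating only S(0) reduces the computation, the sketch below counts the multiplications needed by a direct cross-correlation over all lags and by the zero-lag term alone. For two reduced spectra of N = 1024 samples, a direct evaluation of all 2N - 1 = 2047 lags needs N^2 = 1,048,576 multiplications, whereas S(0) needs only 1024. The function names are hypothetical, and direct (non-FFT) evaluation of the correlation is assumed.

```python
def full_correlation(a, b):
    """Direct cross-correlation over all 2N-1 lags, with a multiply count."""
    n = len(a)
    values, multiplies = [], 0
    for lag in range(-(n - 1), n):
        total = 0.0
        for i in range(n):
            j = i - lag
            if 0 <= j < n:
                total += a[i] * b[j]
                multiplies += 1
        values.append(total)
    return values, multiplies

def zero_lag(a, b):
    """Only the lag-0 term: a plain inner product with N multiplications."""
    return sum(p * q for p, q in zip(a, b)), len(a)
```

The lag-0 entry of the full result sits at index N - 1 and equals the inner product returned by `zero_lag`, so nothing is lost when the criterion depends on S(0) alone.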

CONCLUSION
Recently, speaker verification has been increasingly in demand for security in miscellaneous information systems [6][7][8]. This study introduced a modified cross-correlation computation algorithm applied to person identification for an access control security system. The presented algorithm compares the spectral content of a reference signal (a short sentence or word) with a pre-recorded signal. The results show a great improvement in the computation speed of the recognition process. The algorithm is applied to short words or segmented sentences, and it is more appropriate for person identification or text-dependent speaker recognition.