Robust Speech Recognition Using Fusion Techniques and Adaptive Filtering

This study proposes an algorithm that performs noise cancellation using recursive least squares (RLS) and pattern recognition using a fusion of Dynamic Time Warping (DTW) and the Hidden Markov Model (HMM). Speech signals are often corrupted by background noise, and their characteristics can change quickly, so robustness is a key issue in speech recognition. The algorithm is tested on speech samples from a Malay corpus. It is shown that the fusion technique can be used to fuse the pattern recognition outputs of DTW and HMM. Furthermore, refinement normalization using a weight mean vector was introduced to obtain better performance. The fusion of HMM and DTW achieved a pattern recognition accuracy of 94%, compared with 80.5% for DTW alone and 90.7% for HMM alone. RLS adaptive noise cancellation increased the accuracy of the proposed algorithm further, to 98%.


INTRODUCTION
Speech Recognition (SR) is a technique aimed at converting a speaker's spoken utterance into a text string or into input for other applications. SR is still far from a solved problem. The best reported word-error rates on English broadcast news and conversational telephone speech are 10 and 20%, respectively [1]. Meanwhile, error rates on conversational meeting speech are about 50% higher, and much higher still under noisy conditions [2].
Robustness has been a key research topic in speech recognition for the past 50 years. The main issues in robustness are invariance to extraneous background noise and channel conditions, as well as to speaker and accent variations [3]. The Recursive Least Squares (RLS) algorithm is used here to enhance speech in the presence of background noise. The RLS algorithm performs well for models with accurate initial information on the parameter or state to be estimated [4]. In many noise cancellation applications the signal characteristics can change quite fast, which calls for adaptive algorithms that converge rapidly. From this perspective the best choice is the RLS [5]. After noise cancellation, the system must detect the beginning and end of each word before processing it.
Pattern recognition in this study uses a fusion of Dynamic Time Warping (DTW) and the Hidden Markov Model (HMM). DTW was widely used in speech recognition in the 1970s and 1980s [6,7], and HMM has been popular from the 1990s until now [8]. Fusion techniques came into use in the mid-1990s so that the methods could complement each other's benefits [9,10]. Several types of fusion are used in speech recognition, among them HMM with Artificial Neural Network (ANN) [11] and HMM with Bayesian Network (BN) [12].
The algorithm is tested on a Malay digit speech corpus. A hundred speakers were involved in this project, and each spoke every digit 10 times. The Malay isolated digits from 0-9 are spoken as KOSONG, SATU, DUA, TIGA, EMPAT, LIMA, ENAM, TUJUH, LAPAN and SEMBILAN.

MATERIALS AND METHODS
The system proceeds through speech recording, RLS noise cancellation, endpoint detection, framing, normalization, filtering, MFCC extraction, signal weighting, time normalization, Vector Quantization (VQ) and labeling. HMM is then used to calculate the reference patterns, and DTW is used to normalize the training data against the reference patterns, as in Fig. 1. In this paper the Mel-Frequency Cepstral Coefficient (MFCC) is chosen as the feature because of the sensitivity of the low-order cepstral coefficients to overall spectral slope and the sensitivity properties of the high-order cepstral coefficients [13]. WAV files were recorded for 60 speakers. Each speaker says KOSONG, SATU, DUA, TIGA, EMPAT, LIMA, ENAM, TUJUH, LAPAN and SEMBILAN with a one-second pause between numbers.
The RLS was used in preprocessing for noise cancellation as shown in Fig. 2 [14]. The symbols in Fig. 2 are as follows:
n = Background noise of any type
n̂ = Noise correlated to n
s = Speech signal
d = Desired signal
W = Optimum filter weight matrix
y = Output of adaptive process
e = Error signal, in the ideal case the clean speech
Figure 3 shows the results of applying RLS adaptive filtering to the noisy signal. Figure 3a shows the amplitude of the noisy speech and Fig. 3b shows the amplitude after processing with RLS. Once the filtered speech sample is obtained, the first process is endpoint detection. For detection, two basic parameters are used: Zero Crossing Rate (ZCR) and short-time energy. The energy parameter has been used in endpoint detection since the 1970s [15]. Combining it with the ZCR can make the speech detection process very accurate [16].
To label the segmented speech frames, the zero crossing rate and energy were computed for each frame. Unfortunately, the segments still contained some background noise, because the energy of breath and surrounding sounds can quite easily be confused with the energy of a fricative sound [17].
As a result, this algorithm performs almost perfect segmentation for voices recorded by male speakers. For recordings made in noisy places, segmentation problems arise because in some cases background noise causes the algorithm to produce different values. The cut-off for silence must then be raised, since noise interpreted as speech keeps the values from being quite zero. For clean speech, on the other hand, both the zero crossing rate and the short-term energy should be zero in silent regions.
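The two endpoint parameters can be illustrated with a short Python sketch. It is only one possible way to combine them, under assumed settings: the frame length, hop size, thresholds and the boundary-extension rule are illustrative choices and are not taken from the study, and all function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames (any trailing partial frame is dropped)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def detect_endpoints(x, frame_len=256, hop=128, energy_ratio=0.05,
                     zcr_thresh=0.25, max_extend=5):
    """Return (start_sample, end_sample) of the detected word, or None."""
    x = np.asarray(x, dtype=float)
    frames = frame_signal(x, frame_len, hop)
    energy = short_time_energy(frames)
    zcr = zero_crossing_rate(frames)
    # Coarse boundaries: frames whose energy exceeds a fraction of the peak.
    active = np.flatnonzero(energy > energy_ratio * energy.max())
    if active.size == 0:
        return None
    start, end = active[0], active[-1]
    # Refinement: extend over neighbouring low-energy frames with a high ZCR,
    # which may be weak fricatives rather than silence.
    for _ in range(max_extend):
        if start > 0 and zcr[start - 1] > zcr_thresh:
            start -= 1
        if end < len(frames) - 1 and zcr[end + 1] > zcr_thresh:
            end += 1
    return int(start * hop), int(end * hop + frame_len)
```

With clean silence both parameters are zero outside the word, so fixed thresholds work well; with residual background noise the thresholds have to be raised, which matches the segmentation problems described above.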
Feature extraction: Mel Frequency Cepstral Coefficients (MFCC) are chosen because of the sensitivity of the low-order cepstral coefficients to overall spectral slope and the sensitivity properties of the high-order cepstral coefficients [18]. MFCC is currently the most popular feature extraction method [18,19]. MFCC is produced after the recorded signal is pre-emphasized, framed and Hamming windowed. The signal is then normalized and lowpass filtered. The lowpass filter removes the potential artificial high frequencies that appear in the modulation spectrum due to transmission errors.
The Hamming window was calculated after getting the results from the endpoint process. The equation used is as follows:

$$w(n) = \beta_w \left[ \alpha_w - (1 - \alpha_w) \cos\left( \frac{2\pi n}{N_w - 1} \right) \right], \quad 0 \le n \le N_w - 1$$

where N_w is the window length in samples, α_w is equal to 0.54 and β_w normalizes the energy through the operation so that the signal energy does not change. To obtain the desired frequency resolution on a Mel scale in front-end processing, the simple Fourier Transform (FT) is used. The average spectral magnitude for each amplitude coefficient is calculated by averaging over N samples with a weighting function. The cepstral coefficients are then computed from S_avg, the average signal value at index k, to minimize the non-information-bearing variability in that amplitude.
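The windowing and Fourier transform steps can be sketched in Python as follows. This is a hedged illustration: the text states only that β_w normalizes the energy, so scaling the window to unit root-mean-square value is an assumption made here, and the function names are hypothetical.

```python
import numpy as np

def hamming_window(n_samples, alpha_w=0.54, normalize=True):
    """Generalized Hamming window with alpha_w = 0.54 (n_samples must be > 1)."""
    n = np.arange(n_samples)
    w = alpha_w - (1.0 - alpha_w) * np.cos(2.0 * np.pi * n / (n_samples - 1))
    # beta_w: assumed here to scale the window to unit RMS, so that windowing
    # leaves the average signal energy roughly unchanged.
    beta_w = 1.0 / np.sqrt(np.mean(w ** 2)) if normalize else 1.0
    return beta_w * w

def magnitude_spectrum(frame):
    """Magnitude of the Fourier transform of one Hamming-windowed frame."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * hamming_window(len(frame))
    return np.abs(np.fft.rfft(windowed))
```

With `normalize=False` the function reproduces the standard Hamming window, which makes it easy to check against a library implementation.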
Dynamic time warping (DTW): DTW is, after HMM, one of the main recognition algorithms in this system. Because speech varies widely between different instances from the same speaker, some type of non-linear time warping must be applied before two speech instances are compared. DTW is the preferred method for doing this, whereby the principles of dynamic programming are applied to align the speech signals optimally. DTW has also been used to compute a more robust distance for time series data when detecting similar shapes with different phases, and it can measure similarity between sequences of different lengths. Because of these advantages many researchers use DTW, for example for generic analysis and mining tasks on time series data, voice recognition and signature verification [20]. The distance metric used is a Euclidean distance for the cepstral coefficients over all frames after DTW is applied to align the frames optimally. The distance metric between frame i of the test word T_MFCC and frame j of the reference word R_MFCC is calculated as:

$$d(i, j) = \sqrt{\sum_{k=1}^{K} \left( T_{\mathrm{MFCC}}(i, k) - R_{\mathrm{MFCC}}(j, k) \right)^2}$$

where K is the number of cepstral coefficients per frame. This DTW algorithm has been tested with 80.5% correctness [21]. For this fusion system, however, the distance is calculated for one digit at a time. This distance is used by decision fusion to process the weight mean vector for one digit.
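A minimal Python sketch of DTW with a Euclidean frame distance is given below. It assumes the basic symmetric step pattern (match, insertion, deletion) with no slope constraints or path-length normalization, which may differ from the variant used in the study; the function name and array layout (frames by coefficients) are illustrative.

```python
import numpy as np

def dtw_distance(test_mfcc, ref_mfcc):
    """Accumulated DTW distance between two feature sequences.

    test_mfcc : array of shape (n_test_frames, n_coeffs)
    ref_mfcc  : array of shape (n_ref_frames, n_coeffs)
    """
    T = np.asarray(test_mfcc, dtype=float)
    R = np.asarray(ref_mfcc, dtype=float)
    # Local cost: Euclidean distance between test frame i and reference frame j.
    local = np.linalg.norm(T[:, None, :] - R[None, :, :], axis=2)
    n, m = local.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Best predecessor among deletion, insertion and match.
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],
                                                  acc[i, j - 1],
                                                  acc[i - 1, j - 1])
    return acc[n, m]
```

For isolated-digit recognition, the test utterance would be compared against one reference per digit and assigned to the digit with the smallest accumulated distance.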

Hidden Markov model (HMM):
An HMM is typically an interconnected group of states, each of which is assumed to emit a new feature vector for each frame according to an emission probability density function associated with that state. The Viterbi algorithm is the most suitable for estimating the parameters of the HMM under the maximum likelihood criterion [22]. An HMM is defined as λ = (A, B, π), where A is the state transition probability matrix, B is the output probability matrix and π is the initial state probability. Given a multidimensional observation sequence o, known as the feature vectors, the probability of the observation sequence is p(o|λ). For word-level HMMs, the recognizer computes and compares all the p(o|λ_v), where v = 1, 2, …, W and W is the number of digit word models. For left-to-right HMMs, p(o|λ_v) is computed using the Log-Viterbi algorithm as follows [23]:

$$\delta_1(i) = \log \pi_i + \log b_i(o_1), \quad 1 \le i \le N$$

for initialization,

$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(o_t), \quad 2 \le t \le T, \; 1 \le j \le N$$

for recursion, and

$$\log p(o|\lambda_v) \approx \max_{1 \le i \le N} \delta_T(i)$$

for termination, where N is the number of states, T is the number of frames, a_ij are the elements of A and b_j(o_t) is the output probability of observation o_t in state j.

For example, if the recognition percentage for one digit is h for HMM and d for DTW, then in the fusion model, after the query is recognized by DTW and HMM individually, the final percentage is calculated from h and d.
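The Log-Viterbi scoring of one word model can be sketched in Python as below. This is an illustrative sketch, assuming the log output probabilities of every frame in every state have already been evaluated; the function name and the array layout are assumptions, not part of the study.

```python
import numpy as np

def log_viterbi(log_pi, log_A, log_B):
    """Best-path log score of an observation sequence under one HMM.

    log_pi : (N,)   log initial state probabilities
    log_A  : (N, N) log transition probabilities, log_A[i, j] = log a_ij
    log_B  : (T, N) log output probabilities, log_B[t, j] = log b_j(o_t)
    Returns (best log score, best state path).
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]                      # initialization
    psi = np.zeros((T, N), dtype=int)              # back-pointers
    for t in range(1, T):                          # recursion
        scores = delta[:, None] + log_A            # scores[i, j] = delta(i) + log a_ij
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(N)] + log_B[t]
    best_last = int(np.argmax(delta))              # termination
    path = [best_last]
    for t in range(T - 1, 0, -1):                  # backtracking
        path.append(int(psi[t, path[-1]]))
    return float(delta[best_last]), path[::-1]
```

In a left-to-right model the forbidden transitions have probability zero, so their entries in `log_A` are negative infinity and are never selected by the maximization. A recognizer would run this function once per digit model λ_v and pick the model with the highest score.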

RESULTS AND DISCUSSION
We have evaluated the algorithm using the data described in the methodology section. The recognition algorithms HMM, DTW and the DTW-HMM pattern recognition fusion are tested for accuracy. The test is limited to Malay digits from 0-9. Digits are uttered at random and the accuracy on 100 samples is analyzed. The accuracy obtained is about 80.5% for DTW, 90.7% for HMM and 94% for pattern recognition fusion. These results are shown in Table 1. For robustness, the speech is first filtered using RLS noise cancellation; the results obtained are shown in Table 2. Noise cancellation increases the accuracy for HMM, DTW and fusion to 94.2, 91.4 and 98.1%, respectively.

CONCLUSION
This research has presented a speech recognition algorithm that uses MFCC vectors to provide an estimate of the vocal tract filter. DTW and HMM are the two recognition algorithms used. DTW is used to detect the nearest recorded voice, while the HMM models each state as emitting a new feature vector for each frame according to an emission probability density function associated with that state. The results showed a promising speech recognition module as tested on a Malay digit database. This paper has shown that the fusion technique can be used to fuse the pattern recognition outputs of DTW and HMM. Furthermore, it introduced refinement normalization using a weight mean vector to obtain better performance, with an accuracy of 94% for the pattern recognition fusion of HMM and DTW. This compares with accuracies of 80.5 and 90.7% for DTW and HMM, respectively. After RLS noise cancellation, the accuracy of the fusion technique increased further to 98.1%.