Isolated Malay Digit Recognition Using Pattern Recognition Fusion of Dynamic Time Warping and Hidden Markov Models

This paper presents a pattern recognition fusion method for isolated Malay digit recognition using Dynamic Time Warping (DTW) and Hidden Markov Models (HMM). The aim of the project is to increase the accuracy of Malay speech recognition. The study proposes an algorithm for pattern recognition fusion of the two recognition models. Endpoint detection, framing, normalization, Mel Frequency Cepstral Coefficients (MFCC) and vector quantization are used to process the speech samples for recognition. A pattern recognition fusion method based on weight mean vectors then combines the results of DTW and HMM. The algorithm is tested on speech samples that are part of a Malay corpus. The paper shows that the fusion technique can be used to fuse the pattern recognition outputs of DTW and HMM. It also introduces refinement normalization using the weight mean vector, which gives an accuracy of 94% for the fusion of HMM and DTW, compared with 80.5% for DTW alone and 90.7% for HMM alone.


INTRODUCTION
In many speech recognition systems, endpoint detection and pattern recognition are used to detect the presence of speech in a background of noise. The system that processes a word should detect where the word begins and ends. Detecting these endpoints seems easy for humans, but it has proved complicated for machines. In the last three decades, a number of endpoint detection methods have been developed to improve the speed and accuracy of speech recognition systems. This study uses the Malay language, a branch of the Austronesian (Malayo-Polynesian) language family, spoken as a native language by more than 33,000,000 people distributed over the Malay Peninsula, Sumatra, Borneo, and the numerous smaller islands of the area, and widely used in Malaysia and Indonesia as a second language [1].
Speech Recognition (SR) is a technique aimed at converting a speaker's spoken utterance into a text string or into input for other applications. SR is still far from a solved problem. The best reported word-error rates on English broadcast news and conversational telephone speech were 10% and 20%, respectively [2]. Error rates on conversational meeting speech are about 50% higher, and higher still under noisy conditions [3]. This paper proposes a fusion pattern recognition method for isolated Malay digit recognition using Dynamic Time Warping (DTW) and Hidden Markov Models (HMM). DTW was used in speech recognition in the 1970s and 1980s [4,5], while HMM became popular after the 1990s and is still in use [6]. Fusion techniques, meanwhile, are used to solve biometric problems, especially at the sensor, extractor, classifier and supervisor levels, as shown in Fig. 1 [7]. For further improvement in speech recognition, [8] used decision fusion for making decisions, as shown in Fig. 2.
Figure 2 shows the Confidence Measure (CM) used to assess the reliability of recognition results in two ways: feature-level fusion in Fig. 2(a) and decision-level fusion in Fig. 2(b). The CM gives a confidence estimate, which is followed by a decision whether to accept or reject the hypothesized isolated digit. This paper uses decision-level fusion, where decision maker X1 is taken to be DTW, decision maker X2 is taken to be HMM, and the fusion centre (the pattern recognition fusion method) is the weight mean vector, as shown in Fig. 3.
The aims of this project are: (i) to increase the accuracy of Malay speech recognition; (ii) to develop reference patterns for the Malay digits in the recognition database using HMM and DTW; and (iii) to fuse them using the weight mean vector to improve recognition. This paper is organized in four sections: Introduction, Materials and Methods, Results and Discussion, and Conclusion.

MATERIALS AND METHODS
The algorithm is tested on a Malay digit speech corpus. The isolated Malay digits from 0 to 9 are spoken as KOSONG, SATU, DUA, TIGA, EMPAT, LIMA, ENAM, TUJUH, LAPAN and SEMBILAN, with 10 repetitions of each digit. The system takes the input speech through endpoint detection, framing, normalization, filtering, MFCC extraction and time normalization, and then uses HMM and DTW to calculate the reference patterns. DTW is used to normalize the training data against the respective reference patterns, as shown in Fig. 3. Finally, the weighted mean vector is applied to the results from DTW and HMM to obtain the final decision output.

End Point Detection:
After the speech sample is acquired, the first process is endpoint detection. Two basic parameters are used for detection: Zero Crossing Rate (ZCR) and short-time energy. The energy parameter has long been used in endpoint detection [9]. Combined with the ZCR, it makes the speech detection process very accurate [10], so that the beginning and end of each utterance can be detected.
The measurements of the short-time energy can be defined as follows [11]: (c) sum of absolute energy. In this definition, E denotes the energy and N the number of samples in a frame. The frame size is 256 samples, the sample rate is 8 kHz, the upper energy level is -10 dB and the lower energy level is -20 dB.
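The sum-of-absolute-energy measure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes non-overlapping frames, and it assumes that the -10 dB and -20 dB levels are measured relative to the loudest frame, which the text does not state. The function names are hypothetical.

```python
import math

FRAME_SIZE = 256  # samples per frame, as stated in the text


def frame_signal(samples, frame_size=FRAME_SIZE):
    """Split samples into non-overlapping frames, dropping any short tail."""
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


def abs_energy(frame):
    """Sum of absolute energy: E is the sum of |s(n)| over the N frame samples."""
    return sum(abs(s) for s in frame)


def energy_db(frames):
    """Per-frame energy in dB relative to the loudest frame (assumed reference)."""
    energies = [abs_energy(f) for f in frames]
    floor = 1e-12  # avoids log10(0) on all-zero frames
    peak = max(max(energies), floor)
    return [20.0 * math.log10(max(e, floor) / peak) for e in energies]


def frames_above(frames, level_db):
    """Indices of frames whose relative energy reaches the given dB level."""
    return [i for i, e in enumerate(energy_db(frames)) if e >= level_db]
```

With this convention, `frames_above(frames, -10.0)` would give the frames that reach the upper level and `frames_above(frames, -20.0)` those that reach the lower level.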
The flowchart of the endpoint detection is shown in Fig. 4. The system begins by reading a WAV file recorded from 15 male and 15 female speakers. Each speaker says KOSONG, SATU, DUA, TIGA, EMPAT, LIMA, ENAM, TUJUH, LAPAN and SEMBILAN with a pause of 1 second between digits.
Next, the ZCR, which is the number of times the amplitude of the sound wave changes sign within a sound sample, is adjusted using its mean. A tolerance threshold of 10% of the maximum ZCR is included in the function that calculates the zero crossings. The logarithmic (log) energy then gives the amount of energy in the sound at specific instants. There are no standard energy values for a given window size, because log energy depends on the energy in the signal, which in turn depends on how the sound was recorded. In a clean recording of speech, the log energy is higher for voiced speech and zero or close to zero for silence. The algorithm extends the endpoint at the lower energy level by moving the sound index backward until it reaches the first point whose energy falls below the low-level energy threshold. It then extends the endpoint over the high-ZCR area: if the ZCR at the index is greater than the ZCR threshold, the index is moved to the first such point. Lastly, it converts the beginning and ending indices into sample-point-based indices. Figure 5 shows the waveform, zero crossing rate and energy for continuous digits recorded from a male speaker. As Fig. 5 also shows, voiced speech can be distinguished from unvoiced speech because it has much greater amplitude displacement when viewed as a waveform. The figure also marks the boundary lines for the beginning and end points of each segment.
This endpoint technique separates voiced speech from unvoiced speech (including silence). The endpoint detection algorithm has been tested in various places [12] and on Malay digits [13], showing good segmentation for male and female speakers with a reasonable accuracy rate of 87.5%.
To label the segmented speech frames, the ZCR and energy are applied to each frame. Unfortunately, the frames contain some background noise, because the energy of breath and of the surroundings can quite easily be confused with the energy of a fricative sound [14]. For voiced speech, the energy is high and the ZCR is low. For unvoiced speech, the energy is low and the ZCR is high.
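The labeling rule described above (high energy for voiced speech, low energy with high ZCR for unvoiced speech) can be illustrated with a short sketch. It is a simplification under stated assumptions: the energy threshold is supplied by the caller, the ZCR threshold is taken as 10% of the maximum ZCR as in the text, and the three labels and the function names are hypothetical.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def label_frames(energies, zcrs, energy_threshold, zcr_fraction=0.10):
    """Label each frame as 'voiced', 'unvoiced' or 'silence'.

    Assumed rule: energy at or above the threshold means voiced speech;
    otherwise a ZCR above zcr_fraction * max(zcrs) means unvoiced speech;
    anything else is treated as silence.
    """
    zcr_threshold = zcr_fraction * max(zcrs)
    labels = []
    for energy, zcr in zip(energies, zcrs):
        if energy >= energy_threshold:
            labels.append("voiced")
        elif zcr > zcr_threshold:
            labels.append("unvoiced")
        else:
            labels.append("silence")
    return labels
```

As the text notes, breath and background noise can resemble a fricative under such a rule, so the thresholds would need tuning per recording condition.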
Feature Extraction:
In this project, Mel Frequency Cepstral Coefficients (MFCC) are chosen because of the sensitivity of the low-order cepstral coefficients to overall spectral slope and the sensitivity properties of the high-order cepstral coefficients [15]. MFCC is currently the most popular feature extraction method [15,16]. The MFCCs are produced after the recorded signal is pre-emphasized, framed and Hamming windowed. The signal is then normalized and lowpass filtered. The lowpass filter removes potential artificial high frequencies that appear in the modulation spectrum due to transmission errors.
The Hamming window is applied after the results of the endpoint process are obtained. In the window equation, α_w is equal to 0.54 and β_w normalizes the energy through the operation so that the signal level does not change. In the front end, a simple Fourier Transform (FT) is used to obtain the desired frequency resolution on a Mel scale. The average spectral magnitude is then calculated for each amplitude coefficient, where N denotes the number of samples used to obtain the average value, w_FB(n) denotes the weighting function, and the magnitude of the frequency is the one computed by the Fourier transform. The cepstral coefficients are computed from these amplitudes to minimize the non-information-bearing variability, as in Eq. (6), where S_avg denotes the k-th average signal value.
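The front end described above can be approximated by the conventional MFCC pipeline. The sketch below is an assumption-laden illustration rather than the paper's exact equations: it uses the textbook Hamming window with α_w = 0.54 and no β_w normalization, a plain DFT, triangular mel filters, and a DCT of the log filter-bank energies. The filter count (20), coefficient count (12) and all function names are hypothetical, and pre-emphasis, normalization and lowpass filtering are omitted.

```python
import cmath
import math


def hamming(n_samples, alpha=0.54):
    """Textbook Hamming window: alpha - (1 - alpha) * cos(2*pi*n / (N - 1))."""
    return [alpha - (1.0 - alpha) * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (plain O(N^2) DFT, chosen for clarity)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular weighting functions with edges equally spaced in mel."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):      # rising slope
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling slope
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate=8000, n_filters=20, n_coeffs=12):
    """Cepstral coefficients 1..n_coeffs of one frame."""
    window = hamming(len(frame))
    spectrum = magnitude_spectrum([x * w for x, w in zip(frame, window)])
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_energies = [math.log(max(sum(w * s for w, s in zip(row, spectrum)), 1e-10))
                    for row in bank]
    # DCT-II of the log filter-bank energies gives the cepstral coefficients.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for c in range(1, n_coeffs + 1)]
```

Each 256-sample frame at 8 kHz would yield one 12-dimensional feature vector under these assumed settings; a production front end would normally use an FFT instead of the direct DFT.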

Dynamic Time Warping (DTW):
DTW is, after HMM, one of the main recognition algorithms in this system. Because speech varies widely between different instances from the same speaker, some type of non-linear time warping must be applied before two speech instances are compared. DTW is the preferred method for this, since the principles of dynamic programming can be applied to align the speech signals optimally. DTW has also been used to calculate a more robust distance for time series data when detecting similar shapes with different phases, and it can measure the similarity between sequences of different lengths. Because of these advantages, many researchers use DTW, for example for generic analysis and mining tasks on time series data, voice recognition and signature verification [5,18]. The distance metric used here is the Euclidean distance between the cepstral coefficients, taken over all frames after DTW has aligned the frames optimally. It is evaluated between frame i of the test word T_MFCC and frame j of the reference word R_MFCC.

Fig. 4: Flowchart of Segmentation
This DTW algorithm has been tested with 80.5% correctness [19], and one of the screenshots is shown in Fig. 6. For this fusion system, however, the distance is calculated as in Eq. (8) so that only one digit is processed at a time. Decision fusion then uses this distance to process the weight mean vector for that digit.
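The alignment described in this section can be sketched with the classic dynamic-programming recurrence. This is a generic DTW under stated assumptions, not the authors' Eq. (8): the local cost is the Euclidean distance between two MFCC frames, the allowed steps are the three standard ones (insertion, deletion, match), and no path constraints or length normalization are applied. The names `dtw_distance`, `recognize` and `templates` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref):
    """Accumulated distance along the optimal warping path.

    test and ref are sequences of feature vectors (one vector per frame)
    and may have different lengths.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeated
                                 cost[i][j - 1],      # reference frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(test, templates):
    """Return the label of the reference template nearest to the test word."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))
```

For example, `recognize(test_frames, {"SATU": satu_frames, "DUA": dua_frames})` would return the label whose template has the smallest accumulated distance.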

Hidden Markov Model (HMM):
An HMM is typically an interconnected group of states, each of which is assumed to emit a new feature vector for each frame according to an emission probability density function associated with that state. The Viterbi algorithm is the most suitable for estimating the HMM parameters under the maximum likelihood criterion [14]. An HMM is defined as λ = (A, B, π), where A is the state transition probability matrix, B is the output probability matrix and π is the initial state probability. Given a multidimensional observation sequence o, whose elements are known as feature vectors, the probability of the observation sequence is P(o|λ).
For word-level HMMs, the recognizer computes and compares all P(o|λ_v), where v = 1, 2, …, W and W is the number of digit word models. For left-to-right HMMs, P(o|λ_v) is computed using the Log-Viterbi algorithm [20], which consists of a recursion step and a termination step. The notation used in the algorithm is as follows: N is the number of states, T is the number of frames of feature vectors, a_ij is the state transition between states i and j, A = {a_ij} is the N-by-N transition matrix, B = {log b_j(o_t)} is an N-by-T matrix of log output probabilities, and δ_t(j) is the likelihood value at time index t and state j.
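The Log-Viterbi scoring can be illustrated with the standard textbook recurrence. This sketch is an assumption and may differ in its initialization and termination from the left-to-right variant of [20]: it initializes with δ_1(j) = log π_j + log b_j(o_1), recurses with δ_t(j) = max_i [δ_(t-1)(i) + log a_ij] + log b_j(o_t), and terminates with the maximum of δ_T(j) over all states. The function names and the `word_models` structure are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def log_viterbi(log_pi, log_a, log_b):
    """Log-likelihood of the best state path.

    log_pi: length-N list of log initial state probabilities
    log_a:  N-by-N list of log transition probabilities, log_a[i][j]
    log_b:  N-by-T list of log output probabilities, log_b[j][t]
    """
    n_states = len(log_a)
    n_frames = len(log_b[0])
    # Initialization at t = 0.
    delta = [log_pi[j] + log_b[j][0] for j in range(n_states)]
    # Recursion over the remaining frames.
    for t in range(1, n_frames):
        delta = [max(delta[i] + log_a[i][j] for i in range(n_states)) + log_b[j][t]
                 for j in range(n_states)]
    # Termination: best score over all final states.
    return max(delta)


def recognize(word_models):
    """Pick the word whose (log_pi, log_a, log_b) triple scores highest."""
    return max(word_models, key=lambda word: log_viterbi(*word_models[word]))
```

Working in the log domain turns the products of probabilities into sums, which avoids numerical underflow on long frame sequences.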

Pattern Recognition Fusion HMM and DTW:
The pattern recognition fusion method used to fuse the results of DTW and HMM is the weight mean vector. DTW measures the distance between the recorded speech and a template, expanding or shrinking the temporal axis of the target to find the path, or warping function, that maximizes the similarity between the two speech signals. The distance between the signals is computed at each instant along the warping function. The HMM approach, meanwhile, trains clusters and iteratively moves between them based on the likelihoods given by the various HMMs. In the weight mean vector equation, w_1 is the query recognition rate in the HMM test phase, w_2 is the query recognition rate in the DTW test phase, and x_n is the real-time value of the recorded speech. For example, if the recognition percentage for one digit is h for HMM and d for DTW, then after the query has been recognized by DTW and HMM individually, the fusion model calculates the final percentage as follows: =(((h * w_1) + (d * w_2)) + h-d (14)
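As a rough illustration of decision-level fusion with a weighted mean, the sketch below uses the textbook definition, the sum of w_i * x_i divided by the sum of w_i. This is an assumption: it is not claimed to reproduce Eq. (14) exactly, and the per-digit score dictionaries and the function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Textbook weighted mean: sum(w_i * x_i) / sum(w_i)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total


def fuse(hmm_scores, dtw_scores, w_hmm, w_dtw):
    """Fuse per-digit scores from the two recognizers and pick the best digit.

    hmm_scores and dtw_scores map each digit label to a score where higher
    means more likely; w_hmm and w_dtw play the roles of w_1 and w_2.
    """
    fused = {digit: weighted_mean([hmm_scores[digit], dtw_scores[digit]],
                                  [w_hmm, w_dtw])
             for digit in hmm_scores}
    best = max(fused, key=fused.get)
    return best, fused
```

Because DTW produces distances (lower is better) while HMM produces likelihoods (higher is better), the DTW output would first need converting to a comparable score before a fusion of this kind.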

RESULTS AND DISCUSSION
We evaluated our algorithm using the data described in the Materials and Methods section. The recognition algorithms HMM, DTW and the DTW-HMM pattern recognition fusion were each tested for accuracy. The test was limited to the Malay digits from 0 to 9. Digits were uttered in random order and the accuracy over 100 samples was analyzed. The accuracy test gave 80.5% for DTW, 90.7% for HMM and 94% for pattern recognition fusion. The results are shown in Table 1.
To test robustness, we added Gaussian noise to the original speech signals. Table 2 compares the digit recognition percentages after noise was added at various signal-to-noise ratios (SNRs). The SNRs chosen were: greater than 30 dB (original speech), 20 dB, 15 dB, 10 dB and 5 dB. The results show that pattern recognition fusion performs better than stand-alone recognition even in noisy conditions.
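One common way to produce such test conditions is to scale white Gaussian noise so that it reaches a target SNR. The sketch below shows that approach as an assumption, since the text does not describe how the noise was generated; the function names and the fixed seed are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the target SNR in dB."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from the clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)
```

Calling `add_gaussian_noise(samples, 20.0)`, then repeating with 15, 10 and 5, would give one noisy copy of an utterance per SNR condition listed above.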

CONCLUSION
This paper has presented a speech recognition algorithm for Malay digits that uses MFCC vectors to provide an estimate of the vocal tract filter. DTW and HMM are the two recognition algorithms used. DTW is used to find the nearest recorded voice, while HMM models each frame as a feature vector emitted according to the emission probability density function associated with a state. The results show a promising Malay digit speech recognition module. The paper has shown that the fusion technique can be used to fuse the pattern recognition outputs of DTW and HMM. It also introduced refinement normalization using the weight mean vector, which gives an accuracy of 94% for the fusion of HMM and DTW, compared with 80.5% for DTW alone and 90.7% for HMM alone. The recognition percentage can be increased by tuning the cut-off values that the algorithm uses to label the different parts of speech, especially for breathy-voiced female speakers. This is because the ZCR is low for both silence and voiced speech, which leaves more chance of confusing the two, whereas the energy is high only when voiced speech occurs.