SMOOTH FORMANT PEAK VIA DISCRETE TCHEBICHEF TRANSFORM

With the growth in computing power, speech recognition carries strong potential for the near future, and it has become increasingly popular with the development of mobile devices. Mobile devices, however, have limited computational power, memory size and battery life. In general, speech recognition requires heavy computation due to the large number of samples used per window. The Fast Fourier Transform (FFT) is the most popular transform used to search for formant frequencies in speech recognition, but it operates in the complex field with imaginary numbers. This paper proposes an approach based on the Discrete Tchebichef Transform (DTT) as a possible alternative to the FFT in searching for the formant frequencies. The experimental outputs in terms of the frequency formants using FFT and DTT have been compared. The experimental results show that both produce nearly identical formant shapes for basic vowel and consonant recognition. DTT has the same capability to recognize the speech formants F1, F2 and F3 in the real domain.


INTRODUCTION
Speech recognition systems have become useful applications of pattern recognition, machine learning, computer-assisted translation and mobile devices. Speech is a natural interface for human-machine communication (Erzin, 2009). Formant frequency is a significant parameter for interpreting linguistic as well as non-linguistic speech (Tomas and Obad, 2009). It is an important speech feature and a rich source of information about the uttered word in speech recognition. The formant is associated with the free resonance of the vocal-tract system (Fattah et al., 2009).
Detection of the formant frequencies via the Fast Fourier Transform (FFT) is one of the fundamental operations in speech recognition. The FFT is often used to compute numerical approximations to the continuous Fourier transform. However, a straightforward application of the FFT often requires a large window, even though most of the input data to the FFT may be zero. The FFT algorithm is computationally complex and requires operating on the imaginary domain. Its kernel is a complex exponential function that defines a complex sinusoid.
The Discrete Tchebichef Transform (DTT) is another transform method, based on discrete Tchebichef polynomials. DTT has a lower computational complexity and, unlike continuous orthonormal transforms, does not require a complex transform (Ernawan et al., 2011a). At the same time, DTT does not involve any numerical approximation and works on a computationally friendly domain. The Tchebichef polynomials have unit weight and algebraic recurrence relations involving real coefficients. These factors make DTT suitable for transforming the speech signal from the time domain into the frequency domain. In previous work, DTT has been applied in audio processing and image processing applications. For example, DTT has been used in speech recognition (Ernawan et al., 2012a), image projection, image super resolution (Abu et al., 2009), image dithering (Ernawan et al., 2012b) and image compression (Ernawan et al., 2011b; 2013a; Abu et al., 2010).

MATERIAL AND METHODS
The input sounds of five vowels and five consonants used in this paper come from male voices sampled at 11 kHz, taken from the International Phonetic Alphabet. A sample sound of the vowel 'O' is shown in Fig. 1. This section provides a brief overview of the existing mathematical transforms, namely, the Fast Fourier Transform (FFT) and the Discrete Tchebichef Transform (DTT).

Fast Fourier Transform
The standard spectrum analysis method for speech analysis is the FFT (Saeidi et al., 2010). The FFT is a class of special algorithms that perform the Discrete Fourier Transform (DFT) with considerable savings in computational time. It is applied to convert time-domain speech signals into the frequency domain. The FFT takes advantage of the symmetry and periodicity properties of the Fourier transform to reduce the computational time. In this process, the transform is partitioned into a sequence of reduced-length transforms that are collectively performed with reduced computation. The saving grows with N, where N is the number of samples in the sequence (Sukumar et al., 2010). In short, the FFT is a complex transform that operates on imaginary numbers by means of a special algorithm. It has remained essentially unchanged for several decades.
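The partitioning into reduced-length transforms can be illustrated with a minimal sketch. It assumes the textbook recursive radix-2 decimation-in-time scheme, which is only one member of the FFT family and is not necessarily the variant used in the experiments; the function names `dft` and `fft` are illustrative.

```python
import cmath


def dft(x):
    """Direct DFT: every output bin sums over all N inputs (about N^2 products)."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # half-length transform of even-indexed samples
    odd = fft(x[1::2])   # half-length transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Symmetry and periodicity of the complex exponential let one
        # product serve two output bins.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out
```

Both functions return the same complex spectrum up to rounding; the difference lies only in the number of arithmetic operations.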

Discrete Tchebichef Transform
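In previous research, Mukundan found that the discrete orthonormal Tchebichef moments appear to provide much better support than continuous orthogonal moments (Ernawan et al., 2013b). The discrete orthonormal Tchebichef polynomials are more stable, especially when Tchebichef polynomials of large degree have to be evaluated. Speech recognition requires large samples of data in speech signal processing. To avoid such problems, the orthonormal Tchebichef polynomials use a set of recurrence relations to approximate the speech signals. For a given positive integer N (the vector size) and a value n in the range [0, N-1], the orthonormal version of the one-dimensional Tchebichef function is given by the following recurrence relation for the polynomials t_k(n) of moment order k (Jassim and Paramesran, 2009):

$$t_k(n) = \alpha_1\, n\, t_{k-1}(n) + \alpha_2\, t_{k-1}(n) + \alpha_3\, t_{k-2}(n) \qquad (1)$$

for k = 2, 3, …, N-1 and n = 0, 1, …, N-1, where:

$$\alpha_1 = \frac{2}{k}\sqrt{\frac{4k^2-1}{N^2-k^2}} \qquad (2)$$

$$\alpha_2 = \frac{1-N}{k}\sqrt{\frac{4k^2-1}{N^2-k^2}} \qquad (3)$$

$$\alpha_3 = \frac{1-k}{k}\sqrt{\frac{2k+1}{2k-3}}\sqrt{\frac{N^2-(k-1)^2}{N^2-k^2}} \qquad (4)$$

The starting values for the above recursion can be obtained from the following equations:

$$t_0(n) = \frac{1}{\sqrt{N}} \qquad (5)$$

$$t_1(n) = (2n+1-N)\sqrt{\frac{3}{N(N^2-1)}} \qquad (6)$$

The recurrence relation to compute the polynomial value t_k(n) by recursion in n is given below (Jassim and Paramesran, 2009):

$$t_k(0) = -\sqrt{\frac{N-k}{N+k}}\sqrt{\frac{2k+1}{2k-1}}\; t_{k-1}(0) \qquad (7)$$

$$t_k(1) = \left(1 + \frac{k(1+k)}{1-N}\right) t_k(0) \qquad (8)$$

$$t_k(n) = \gamma_1\, t_k(n-1) + \gamma_2\, t_k(n-2) \qquad (9)$$

for k = 1, 2, …, N-1 and n = 2, 3, …, N/2 - 1, where:

$$\gamma_1 = \frac{-k(k+1) - (2n-1)(n-N-1) - n}{n(N-n)} \qquad (10)$$

$$\gamma_2 = \frac{(n-1)(n-N-1)}{n(N-n)} \qquad (11)$$

The forward discrete Tchebichef transform of order N is defined in Equation 12 as follows:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, t_k(n) \qquad (12)$$

for k = 0, 1, …, N-1, where X(k) denotes the coefficient of the orthonormal Tchebichef polynomials.
The inverse discrete Tchebichef transform is given in Equation 13 by:

$$x(n) = \sum_{k=0}^{N-1} X(k)\, t_k(n) \qquad (13)$$

for n = 0, 1, …, N-1. The Tchebichef transform involves only algebraic expressions and can be computed easily using the set of recurrence relations in Equations 1-11 above. The first five discrete orthonormal Tchebichef polynomials are shown in Fig. 2.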
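As an illustration of the recurrence in the moment order k, the sketch below builds the polynomial values and applies the forward and inverse transforms. It assumes the commonly published closed forms for the recurrence coefficients and starting values of the orthonormal Tchebichef polynomials; the function names are illustrative, and the example is only exercised at small N.

```python
import math


def tchebichef_matrix(N):
    """Return T with T[k][n] = t_k(n), using the three-term recurrence in k."""
    T = [[0.0] * N for _ in range(N)]
    for n in range(N):
        T[0][n] = 1.0 / math.sqrt(N)  # starting value t_0(n)
        if N > 1:
            # starting value t_1(n)
            T[1][n] = (2 * n + 1 - N) * math.sqrt(3.0 / (N * (N * N - 1)))
    for k in range(2, N):
        root = math.sqrt((4 * k * k - 1) / (N * N - k * k))
        a1 = (2.0 / k) * root
        a2 = ((1.0 - N) / k) * root
        a3 = (
            ((1.0 - k) / k)
            * math.sqrt((2 * k + 1) / (2 * k - 3))
            * math.sqrt((N * N - (k - 1) ** 2) / (N * N - k * k))
        )
        for n in range(N):
            T[k][n] = (a1 * n + a2) * T[k - 1][n] + a3 * T[k - 2][n]
    return T


def dtt(x, T):
    """Forward transform: X(k) = sum over n of x(n) * t_k(n)."""
    return [sum(T[k][n] * x[n] for n in range(len(x))) for k in range(len(x))]


def idtt(X, T):
    """Inverse transform: x(n) = sum over k of X(k) * t_k(n)."""
    return [sum(T[k][n] * X[k] for k in range(len(X))) for n in range(len(X))]
```

Because the rows of T are orthonormal, applying `idtt` to the output of `dtt` recovers the input up to rounding, and only real arithmetic is involved throughout.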

RESULTS
This section presents the speech recognition algorithm step by step, together with the experimental results of each step. The process involves a silence detector, pre-emphasis, speech signal windowing, power spectral density and an autoregressive model.

Silence Detector
Speech signals are highly redundant and typically contain a variety of background noise (Dalen and Gales, 2011). Background noise has a severe impact on the performance of a speech recognition system, and a certain level of it interferes with the speech. Once the silent parts are removed, the remaining speech sound provides the useful information of each utterance. Silence regions have quite a high zero-crossing rate, as the signal changes from one side of the zero amplitude to the other and back again. For this reason, a threshold is included so that zero-crossings caused by low-level noise are not counted. In this experiment, the threshold is set to 0.1. This means that any zero-crossings that start and end within the range t_α, where -0.1 < t_α < 0.1, are not included in the total number of zero-crossings in that window.
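The thresholded count can be sketched as follows. This is one reading of the rule above, assuming amplitudes normalised to the range [-1, 1] and treating a crossing as "starting and ending within the range" when both neighbouring samples lie inside the band; the function name is illustrative.

```python
def count_zero_crossings(frame, threshold=0.1):
    """Count sign changes, ignoring those that stay inside (-threshold, threshold)."""
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        crosses = (prev < 0 <= cur) or (cur < 0 <= prev)
        # A crossing whose two samples both lie inside the band is
        # attributed to background noise and skipped.
        inside_band = abs(prev) < threshold and abs(cur) < threshold
        if crosses and not inside_band:
            count += 1
    return count
```

A frame such as `[0.05, -0.05, 0.05, -0.05]` yields a count of 0, whereas `[0.5, -0.5, 0.5]` yields 2, so low-level noise no longer inflates the zero-crossing rate of a window.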

Pre-Emphasis
Pre-emphasis is a technique used in speech processing to enhance high-frequency signals. It reduces the high spectral dynamic range. The purpose of pre-emphasis is to flatten the spectrum so that it consists of formants of similar heights. Pre-emphasis is implemented as a first-order Finite Impulse Response (FIR) filter, which is defined in Equation 14 as follows:

$$y(n) = E(n) - \alpha\, E(n-1) \qquad (14)$$

where y(n) is the filtered output and α is the pre-emphasis coefficient, typically around 0.9 to 0.95. E(n) is the sample data representing the speech signal, with 0 ≤ n ≤ N-1, where N is the sample size. The speech signal of the vowel 'O' after pre-emphasis is shown in Fig. 3.
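A minimal sketch of such a filter is given below, assuming the usual first-order form in which each output sample is the current sample minus α times the previous one, with the first sample passed through unchanged. The function name and the default α = 0.95 are illustrative choices within the range quoted above.

```python
def pre_emphasis(signal, alpha=0.95):
    """First-order FIR pre-emphasis: out[n] = signal[n] - alpha * signal[n - 1]."""
    if not signal:
        return []
    out = [signal[0]]  # no previous sample exists for n = 0
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out
```

With α = 0.9, a constant input `[1.0, 1.0, 1.0]` is reduced to roughly `[1.0, 0.1, 0.1]`, while the rapidly alternating input `[1.0, -1.0, 1.0]` becomes roughly `[1.0, -1.9, 1.9]`: slow variations are attenuated and fast variations are boosted.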

Speech Signal Windowed
Speech recognition is a heavy process that requires large samples of data to represent the speech signal in each frame. The FFT is calculated on a window of the speech frame (Mahmood et al., 2012). A windowing function is applied to each frame to smooth the signal and make it more amenable to spectral analysis. The Hamming window is a window function commonly used in speech analysis to reduce the sudden changes and undesirable frequencies occurring in the framed speech. It is defined in Equation 15 as follows:

$$w(k) = 0.54 - 0.46\cos\left(\frac{2\pi k}{L-1}\right) \qquad (15)$$

where L represents the width of S_n and k is an integer with values 0 ≤ k ≤ L-1. The resulting windowed segment, defined in Equation 16, is the product of the signal function S_n and the window function w(k) used with the FFT. By contrast, DTT consists only of algebraic expressions, and the Tchebichef polynomial matrix can be constructed easily using a set of recurrence relations. Windowing is also inefficient where the sample data are multiplied by a value close to zero: any transition occurring in that part of the window is lost, so the spectrum is no longer truly real time. For these reasons, speech recognition using DTT does not use a windowing function.
In this paper, a sample speech signal has been divided into 4 frames, as illustrated in Fig. 4. Each frame consists of 1024 samples of the speech signal. This blocking assumes that the signal is stationary within each frame. The framed signal is then transformed into the spectral domain, giving good discrimination and energy compaction. In this experiment, the third frame, covering samples 2049-3072, is used. The speech signals of the vowel 'O' using FFT are shown in Fig. 5. The speech signals of the vowel 'O' using DTT are shown in Fig. 6.
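The framing and windowing steps can be sketched as below. The sketch assumes non-overlapping frames of 1024 samples and the standard Hamming coefficients 0.54 and 0.46; the function names are illustrative, and the window is applied only on the FFT path, since the DTT path described above uses the frame unwindowed.

```python
import math


def hamming(L):
    """Standard Hamming window of width L (L must be at least 2)."""
    if L < 2:
        raise ValueError("window width must be at least 2")
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (L - 1)) for k in range(L)]


def split_frames(signal, frame_len=1024):
    """Cut the signal into non-overlapping frames; a trailing partial frame is dropped."""
    last_start = len(signal) - frame_len
    return [signal[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def apply_window(frame, window):
    """Element-wise product of a frame and a window of the same length."""
    return [s * w for s, w in zip(frame, window)]


# Usage: a 4096-sample signal gives 4 frames; the third frame is index 2.
signal = [math.sin(0.05 * i) for i in range(4096)]
frames = split_frames(signal)
frame3_for_fft = apply_window(frames[2], hamming(1024))  # windowed, FFT path
frame3_for_dtt = frames[2]                               # unwindowed, DTT path
```

With zero-based indexing, `frames[2]` spans indices 2048 to 3071, which corresponds to samples 2049-3072 in the one-based numbering used in the text.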

DTT Coefficient
Given the definition of the discrete orthonormal Tchebichef polynomials in Equations 1-12 above, the set of coefficients of the discrete Tchebichef transform is given in Equations 17 and 18. A kernel matrix of 1024 Tchebichef polynomials is computed with the speech signal on each window. In these equations, which give the DTT coefficients of order n = 1024 for each window, C is the coefficient vector of the discrete Tchebichef transform, represented by c_0, c_1, c_2, …, c_(n-1); T is the matrix of discrete orthonormal Tchebichef polynomials t_k(n) for k = 0, 1, …, N-1; and S is the sample of the speech signal window, given by x_0, x_1, x_2, …, x_(n-1). The coefficient of DTT is given in Equation 19. Next, the speech signal on frame 3 is computed with 1024 discrete orthonormal Tchebichef polynomials.
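The kernel matrix and coefficient vector can be sketched as a matrix-vector product. The sketch assumes that the coefficients follow the forward transform of Equation 12, and it builds each row with the recurrence in n plus the symmetry t_k(N-1-n) = (-1)^k t_k(n), using the commonly published closed forms. Function names are illustrative, the example is only checked at small N, and no claim is made about numerical behaviour at N = 1024.

```python
import math


def tchebichef_kernel(N):
    """Return T with T[k][n] = t_k(n), built row by row with the recurrence in n."""
    T = [[0.0] * N for _ in range(N)]
    half = (N + 1) // 2
    for k in range(N):
        if k == 0:
            T[0][0] = 1.0 / math.sqrt(N)
        else:
            T[k][0] = (
                -math.sqrt((N - k) / (N + k))
                * math.sqrt((2 * k + 1) / (2 * k - 1))
                * T[k - 1][0]
            )
        if N > 1:
            T[k][1] = (1.0 + k * (k + 1) / (1.0 - N)) * T[k][0]
        for n in range(2, half):
            g1 = (-k * (k + 1) - (2 * n - 1) * (n - N - 1) - n) / (n * (N - n))
            g2 = ((n - 1) * (n - N - 1)) / (n * (N - n))
            T[k][n] = g1 * T[k][n - 1] + g2 * T[k][n - 2]
        sign = -1.0 if k % 2 else 1.0
        for n in range(half, N):
            T[k][n] = sign * T[k][N - 1 - n]  # mirror the first half of the row
    return T


def dtt_coefficients(frame, T):
    """Coefficient c_k as the inner product of row k of T with the frame."""
    return [sum(t * x for t, x in zip(row, frame)) for row in T]
```

For a constant frame of ones with N = 16, only c_0 is non-zero (it equals 4, the square root of N), which is a quick sanity check that the rows are orthonormal.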

Spectrum Analysis
The spectrum analysis using FFT is generated by Equation 20. For the spectrum analysis using DTT, c(n) is the coefficient of DTT, x(n) is the sample data at time index n and t_k(n) is the computation matrix of orthonormal Tchebichef polynomials. The spectrum analysis using DTT of the vowel 'O' is shown in Fig. 8.

Power Spectral Density
Power Spectral Density (PSD) shows the strength of the variations (energy) as a function of frequency. In other words, it shows at which frequencies the variations are strong and at which they are weak. The one-sided power spectral density using FFT is computed in Equation 23, where X(k) is a vector of N values at frequency index k and the factor 2 is included to account for the contributions from both positive and negative frequencies.
The result is precisely the average power of the spectrum in the time range (t1, t2). The power spectral densities in Equations 23 and 24 are plotted on a decibel (dB) scale of 20 log10. The power spectral density using FFT for the vowel 'O' on frame 3 is shown in Fig. 9. The power spectral density using DTT is generated in Equation 24, where c(n) is the coefficient of the discrete Tchebichef transform. The power spectral density using DTT for the vowel 'O' on frame 3 is shown in Fig. 10.
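As a rough sketch of this step, the function below assumes the one-sided density takes the form 2|X(k)|² / (t2 - t1), which is inferred from the factor 2 and the averaging over the time range described above rather than taken from the equations themselves. The function name and the small floor used to avoid a logarithm of zero are illustrative assumptions.

```python
import math


def one_sided_psd_db(coeffs, t1, t2, floor=1e-12):
    """Assumed form: 2 * |c|^2 / (t2 - t1), reported as 20 * log10 (dB)."""
    duration = t2 - t1
    if duration <= 0:
        raise ValueError("t2 must be greater than t1")
    out = []
    for c in coeffs:
        # abs() covers both complex FFT values and real DTT coefficients.
        power = 2.0 * abs(c) ** 2 / duration
        out.append(20.0 * math.log10(max(power, floor)))
    return out
```

The same function accepts either transform's output, so the FFT and DTT densities can be compared on a common scale.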

Autoregressive Model
Autoregressive (AR) models are used as linear prediction models (Hsu and Liu, 2010) to obtain an all-pole estimate of the signal's power spectrum. The autoregressive model is used to determine the characteristics of the vocal tract and to evaluate the formants. The autoregressive process of a series y_j using FFT of order v is given in Equation 25, where a_k are real-valued autoregression coefficients, q_j represents the inverse FFT of the power spectral density and v is set to 12. The peaks of the frequency formants using FFT in the autoregressive model for the vowel 'O' on frame 3 are shown in Fig. 11. The autoregressive process of a series y_j using DTT of order v is given in Equation 26, where a_k are real-valued autoregression coefficients, v is 12 and c_j is the coefficient of DTT at frequency index j. The term e_j represents the error, which is independent of past samples. The frequency formants using DTT in the autoregressive model for the vowel 'O' on frame 3 are shown in Fig. 12.
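The text does not state how the autoregression coefficients are estimated. The sketch below therefore substitutes a common choice, the Levinson-Durbin recursion on an autocorrelation sequence, and should be read as an illustration rather than the procedure used in the experiments. It uses the sign convention y[j] + a[1] y[j-1] + … + a[v] y[j-v] = e[j]; all function names are illustrative.

```python
import cmath


def autocorrelation(x, max_lag):
    """Biased (unnormalised) autocorrelation for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns (a, err) with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k  # remaining prediction error power
    return a, err


def ar_spectrum(a, err, n_points=512):
    """All-pole power spectrum err / |A(e^{jw})|^2 sampled on [0, pi)."""
    spec = []
    for m in range(n_points):
        w = cmath.pi * m / n_points
        denom = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        spec.append(err / abs(denom) ** 2)
    return spec
```

For the order used in the paper one would call `levinson_durbin(r, 12)` with at least 13 autocorrelation lags; the resulting smooth curve from `ar_spectrum` is the kind of envelope whose peaks are read off as formants.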

Frequency Formants
The uniqueness of each vowel is measured by its formants. The resonance frequencies known as formants can be detected as the peaks of the magnitude spectrum of the speech signal. Formants are defined as the resonance frequencies of the vocal tract, which are determined by the shape of the vocal tract (Ozkan et al., 2009).
A formant is a characteristic resonant region (peak) in the power spectral density of a sound. The formants of the autoregressive curve are found at its peaks using a numerical derivative. The vector positions of the formants are used to characterize a particular vowel. The first two formants (F1, F2) of a vowel utterance cue the phonemic identity of the vowel (Patil et al., 2010). The third formant F3 is also important for vowel categorisation (Kiefte et al., 2010). However, it is frequently excluded from vowel plots, overshadowed by the first two formants.
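Peak detection with a numerical derivative can be sketched as below. The sketch assumes a simple forward difference and treats a change of sign from positive to non-positive as a peak; the mapping from bin index to frequency assumes the curve is sampled uniformly from 0 Hz up to half the sampling rate. Both function names are illustrative.

```python
def find_peaks(curve):
    """Indices of local maxima: forward difference goes from > 0 to <= 0."""
    diff = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
    return [i + 1 for i in range(len(diff) - 1) if diff[i] > 0 and diff[i + 1] <= 0]


def bin_to_hz(index, n_bins, sample_rate):
    """Map a bin on a uniform [0, sample_rate / 2) grid of n_bins points to Hz."""
    return index * (sample_rate / 2.0) / n_bins


# Usage: take the first three peaks of a smooth curve as F1, F2 and F3.
curve = [0, 1, 3, 2, 1, 4, 6, 5, 2, 7, 3]
formant_bins = find_peaks(curve)[:3]  # [2, 6, 9]
```

On a real autoregressive curve sampled at 512 points with the 11 kHz rate used here, `bin_to_hz(i, 512, 11000)` would convert each peak index into a formant frequency.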
The frequency peak formants F1, F2 and F3 of the experimental result are compared to reference formants to decide on the output of the vowel. The frequency formants of the five vowels and the five consonants using FFT and DTT on frame 3 are shown in Tables 1 and 2 respectively.

Table 2 (consonants)

| Consonant | FFT F1 | FFT F2 | FFT F3 | DTT F1 | DTT F2 | DTT F3 |
| k | 796 | 1152 | 2347 | 721 | 1130 | 2336 |
| n | 764 | 1324 | 2519 | 839 | 1345 | 2508 |
| p | 785 | 1076 | 2573 | 753 | 1065 | 2562 |
| r | 635 | 1281 | 2121 | 624 | 1248 | 2131 |
| t | 829 | 1152 | 2519 | 796 | … | … |

DISCUSSION
The experimental results above show how the vowels and consonants are recognized, and the results of speech recognition using FFT and DTT are compared and analyzed here. The speech signals of the vowel 'O' using FFT, as in Fig. 5, are clearer than those using DTT. The speech signals of the vowel 'O' and the consonant 'RA' using DTT, as presented in Fig. 6, contain more noise.
Next, the spectrum analysis of the vowel 'O' using FFT, as in Fig. 7, produces a lower power spectrum than the spectrum analysis using DTT, as in Fig. 8. The DTT spectrum is also capable of capturing the fourth formant for the consonant 'RA': spectrum analysis using DTT produces the four formants F1, F2, F3 and F4 concurrently for a consonant. The power spectral density of the vowel 'O' using FFT, as in Fig. 9, is higher than the power spectral density using DTT. The power spectral density using DTT in Fig. 10 also shows more noise in the frequency spectrum than that using FFT.
As observed in Fig. 11 and 12, the peaks of the first (F1), second (F2) and third (F3) frequency formants using FFT and DTT are quite similar. Based on the experimental results presented in Tables 1 and 2, the frequency formants obtained using FFT and DTT for the five vowels and five consonants are nearly the same.
The results show that the peaks for the five vowels and five consonants using DTT are similar to those using FFT in terms of vowel and consonant recognition. DTT is able to capture all three formants F1, F2 and F3 concurrently. Comparing the frequency formants using FFT and DTT, it is evident that the two transforms produce nearly identical outputs in terms of speech recognition. DTT therefore has the potential to perform well in basic vowel and consonant recognition.
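The closeness of the two sets of formants can be quantified from the four complete consonant rows reported with Table 2. The sketch below assumes that the first three values in each row are the FFT formants and the last three are the DTT formants, following the order in which the two transforms are named; the dictionary and function names are illustrative.

```python
# (F1, F2, F3) per consonant, copied from the complete rows of Table 2.
# Assumption: the first triple in each row is FFT, the second is DTT.
FFT_FORMANTS = {
    "k": (796, 1152, 2347),
    "n": (764, 1324, 2519),
    "p": (785, 1076, 2573),
    "r": (635, 1281, 2121),
}
DTT_FORMANTS = {
    "k": (721, 1130, 2336),
    "n": (839, 1345, 2508),
    "p": (753, 1065, 2562),
    "r": (624, 1248, 2131),
}


def relative_differences(fft, dtt):
    """Per-formant |FFT - DTT| / FFT for every consonant present in fft."""
    return {c: tuple(abs(a - b) / a for a, b in zip(fft[c], dtt[c])) for c in fft}


rel = relative_differences(FFT_FORMANTS, DTT_FORMANTS)
worst = max(max(values) for values in rel.values())
```

Under that column assumption, the largest gap is in F1 of 'n' (764 against 839, about 9.8%), while F2 and F3 differ by less than 3% in all four rows.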

CONCLUSION
FFT has been a popular transform for speech recognition over the last decades. As an alternative, this paper introduces DTT for speech recognition. As a discrete orthonormal transform, DTT is a simpler and more computationally efficient transform than FFT. FFT is computationally more complex because it deals with imaginary numbers, whereas DTT requires simpler computation on real rational numbers only. DTT therefore operates on a friendly domain that involves only algebraic expressions, and it can be computed easily using a set of recurrence relations. It is well suited as a discrete transform for converting speech from the time domain into the frequency domain. The autoregressive models using FFT and DTT produce smooth curves of similar shape. DTT has been shown to perform better with a smaller frame size in the recognition of vowels and consonants.
Furthermore, speech recognition using DTT can be extended in the future in terms of time complexity. The FFT algorithm has time complexity O(n log n), whereas the computation of DTT has time complexity O(n^2). In future research, DTT can be improved to reduce the time complexity from O(n^2) to O(n log n) using a convolution algorithm. DTT is capable of increasing speech recognition performance while obtaining similar frequency formants.
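To put the two growth rates side by side for the frame size used in this paper, the short sketch below evaluates n^2 and n log2 n at n = 1024. Constant factors, and the difference between real and complex arithmetic, are deliberately ignored, so the figures indicate orders of growth rather than measured running times; the function names are illustrative.

```python
import math


def quadratic_growth(n):
    """n^2: order of growth of a direct matrix-vector transform."""
    return n * n


def n_log_n_growth(n):
    """n * log2(n): order of growth of an FFT-style algorithm."""
    return n * math.log2(n)


frame_size = 1024
ratio = quadratic_growth(frame_size) / n_log_n_growth(frame_size)
```

For a 1024-sample frame this gives 1,048,576 against 10,240, a ratio of 102.4, which is the gap a faster DTT algorithm would need to close.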