Thai Speech Coding Based On Conjugate-Structure Algebraic Code Excited Linear Prediction Algorithm

Problem statement: In mobile communication, speech coding aims to compress speech at the lowest bitrate with the highest quality for standard languages such as English, German and French. For other languages with different speaking styles, the encoded speech quality is not guaranteed at the same bitrate. An appropriate evaluation should therefore be performed so that suitable techniques can be applied to improve the speech quality. Approach: This study compares the quality of speech that is encoded and decoded by the CS-ACELP coder according to the ITU-T G.729 standard. The purpose is to compare the performance of the CS-ACELP coder on Thai speech and on English speech. Results: The study used 2 coding methods: (1) CS-ACELP coder without Voice Activity Detection and (2) CS-ACELP coder with Voice Activity Detection. An objective test was used to measure the speech quality in each case. The results show that both methods give Thai speech quality that is mostly lower than English speech quality. Comparing the methods, for both Thai and English, method (2) gives better speech quality than method (1). Finally, we modified the coder by increasing the order of LP analysis to improve the Thai speech quality. Conclusion: Without modification, the quality of Thai coding is not equivalent to that of English. After the LP order is increased from 10 to 12 or 14, the quality of Thai speech coding improves, but the coding rate also increases because the higher-order information must be allocated additional bits.


INTRODUCTION
Digital communications are now widely developed. Information such as audio, images, video or data can be transmitted through wired or wireless network channels (Jabrane et al., 2007). At the same time, the number of users accessing these networks is increasing rapidly. Consequently, channel capacity has to be increased, and signal compression is one way to achieve this. Speech coding was introduced almost 60 years ago and has been improved continuously since then (Cox, 1997; Hergli et al., 2005).
The International Telecommunication Union Telecommunication Standardization Sector (ITU-T) has already standardized 64 kb sec⁻¹ µ/A-law PCM, 32 kb sec⁻¹ ADPCM and 16 kb sec⁻¹ Low-Delay Code-Excited Linear Prediction (LD-CELP). The next step in the progression is an 8 kb sec⁻¹ speech coding algorithm. Since the three existing standards all provide high-quality and short-delay coding, the main requirement for the 8 kb sec⁻¹ algorithm was initially also high-quality and short-delay coding (less than 5 ms) (Juan et al., 1998; Chitra and Ravichandran, 2007).
To achieve high-quality and short-delay coding at 8 kb sec⁻¹, the backward adaptation technique was tried first. Many coders used the Linear Predictive Coding (LPC) predictor in a backward-adaptive manner by performing LPC analysis on previously quantized speech. Since the reconstructed signal is available in both the encoder and the decoder, this approach does not require the LPC coefficients to be sent to the decoder. However, although this technique is useful for 16 kb sec⁻¹ LD-CELP, an 8 kb sec⁻¹ coder was not found to give high quality when using only backward LPC analysis without pitch prediction. In 1995, the CS-ACELP coder was developed and standardized as the 8 kb sec⁻¹ G.729 coder (Chompun et al., 2000; 2001a; 2001b; Chompun, 2004; Aqel, 2006; Moussaoui et al., 2006).
However, Thai is a tonal language, so using the CS-ACELP coder of ITU-T G.729 to compress Thai speech is not guaranteed to give the same quality as for English.
To verify this, this study compares the performance of the CS-ACELP coder according to ITU-T G.729 on Thai and English speech. Two coding methods are used: the first is the CS-ACELP coder without Voice Activity Detection (VAD) and the second is the CS-ACELP coder with VAD. An objective measure, the segmental signal-to-noise ratio (segSNR), is used to assess the speech quality in each case.
We then modified some parameters of the CS-ACELP coder to improve its efficiency for Thai speech compression: the order of the linear prediction analysis and filtering, originally 10, was increased to 12 and to 14 (Nomura et al., 1998; Ozawa and Serizawa, 1998; Ozawa et al., 1996).

CS-ACELP algorithm:
The CS-ACELP coder is based on the Code-Excited Linear Predictive (CELP) coding model. The coder operates on speech frames of 10 ms, corresponding to 80 samples at a sampling rate of 8000 samples per second. For every 10 ms frame, the speech signal is analyzed to extract the parameters of the CELP model (linear-prediction filter coefficients, adaptive and fixed-codebook indices and gains). These parameters are encoded and transmitted. At the decoder, these parameters are used to retrieve the excitation and synthesis filter parameters. The speech is reconstructed by filtering this excitation through the short-term synthesis filter, which is based on a 10th-order linear prediction filter, and the long-term or pitch synthesis filter, which is implemented using the adaptive-codebook approach. The reconstructed speech is then further enhanced by a post-filter (Schroder and Sherif, 1997).
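The frame figures quoted above follow from simple arithmetic. The sketch below only restates that arithmetic and shows one way to cut a signal into 10 ms frames; the function name `frame_signal` and the choice to drop an incomplete final frame are illustrative assumptions, not part of the standard.

```python
# Frame arithmetic for the coder described above.
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the text
FRAME_MS = 10           # frame length stated in the text
SUBFRAME_MS = 5         # subframe length used in the decoder description
BIT_RATE_BPS = 8000     # 8 kb/s, the G.729 rate

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
samples_per_subframe = SAMPLE_RATE_HZ * SUBFRAME_MS // 1000
bits_per_frame = BIT_RATE_BPS * FRAME_MS // 1000


def frame_signal(samples, frame_len=samples_per_frame):
    """Split a list of samples into consecutive full frames.

    Hypothetical helper: any incomplete tail frame is dropped.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


frames = frame_signal(list(range(250)))
print(samples_per_frame, samples_per_subframe, bits_per_frame, len(frames))
```

At 8 kb sec⁻¹ every 10 ms frame therefore has a budget of 80 bits, which all transmitted parameters must share.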
The encoding principle is shown in Fig. 1. The input signal is high-pass filtered and scaled in the pre-processing block. The pre-processed signal serves as the input signal for all subsequent analysis. LP analysis is done once per 10 ms frame to compute the LP coefficients. These coefficients are converted to Line Spectrum Pairs (LSP) and quantized using predictive two-stage vector quantization with 18 bits. The excitation signal is chosen by using an analysis-by-synthesis search procedure in which the error between the original and the reconstructed speech is minimized according to a perceptually weighted distortion measure. This is done by filtering the error signal with a perceptual weighting filter whose coefficients are derived from the unquantized LP filter. The amount of perceptual weighting is made adaptive to improve the performance for input signals with a flat frequency response.
The decoder principle is shown in Fig. 2. First, the parameter indices are extracted from the received bitstream. These indices are decoded to obtain the coder parameters corresponding to a 10 ms speech frame. These parameters are the LSP coefficients, the 2 fractional pitch delays, the 2 fixed-codebook vectors and the 2 sets of adaptive and fixed-codebook gains. The LSP coefficients are interpolated and converted to LP coefficients for each subframe. Then, for each 5 ms subframe, the excitation is constructed by adding the adaptive and fixed-codebook vectors scaled by their respective gains, and the speech is reconstructed by filtering the excitation through the LP synthesis filter. Finally, the reconstructed speech signal is passed through a post-processing stage, which includes an adaptive post-filter based on the long-term and short-term synthesis filters, followed by a high-pass filter and a scaling operation.
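The per-subframe reconstruction step of the decoder can be illustrated with a minimal sketch. It assumes the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the LP filter, so that the synthesis filter is 1/A(z); the function names and the toy first-order filter in the usage lines are hypothetical, and codebook decoding, interpolation and post-filtering are left out.

```python
def build_excitation(adaptive_vec, fixed_vec, gain_pitch, gain_code):
    """Excitation = gain-scaled adaptive vector + gain-scaled fixed vector."""
    return [gain_pitch * a + gain_code * c
            for a, c in zip(adaptive_vec, fixed_vec)]


def lp_synthesis(excitation, lp_coeffs, memory=None):
    """All-pole filter 1/A(z), assuming A(z) = 1 + sum_k a_k z^-k.

    lp_coeffs holds a_1..a_p. `memory` holds the last p output samples,
    most recent first, so consecutive subframes can be chained.
    """
    p = len(lp_coeffs)
    mem = list(memory) if memory is not None else [0.0] * p
    out = []
    for e in excitation:
        s = e - sum(a * m for a, m in zip(lp_coeffs, mem))
        out.append(s)
        mem = [s] + mem[:-1]   # shift the newest sample into the state
    return out, mem


# Toy usage: an impulse through a first-order filter (not a real G.729 frame).
exc = build_excitation([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0],
                       gain_pitch=1.0, gain_code=0.5)
speech, state = lp_synthesis(exc, [-0.5])
print(speech)
```

Passing the returned `state` as `memory` for the next subframe keeps the filter output continuous across subframe boundaries.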
Voice Activity Detection is performed in the pre-processing part to classify each input speech frame as voiced or unvoiced. In the unvoiced mode, the adaptive-codebook quantization is skipped because no periodicity needs to be modeled, while the voiced mode still uses both the fixed and the adaptive codebook quantization.
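To make the idea of a frame-level voiced/unvoiced decision concrete, the sketch below uses a toy classifier based on frame energy and zero-crossing rate. This is not the detector specified for G.729 and not necessarily the one used in this study; the function names and both thresholds are hypothetical values chosen only for illustration.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_voiced(frame, energy_threshold=1e-3, zcr_threshold=0.3):
    """Toy rule: voiced frames are energetic and cross zero rarely.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    return (frame_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)


# An 80-sample frame of a 100 Hz tone versus a rapidly alternating frame.
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(80)]
noisy = [0.1, -0.1] * 40
print(is_voiced(tone), is_voiced(noisy), is_voiced([0.0] * 80))
```

In a coder with such a mode switch, a frame flagged as unvoiced would go through the fixed-codebook search only.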

Methods of coding:
In the experiment, 2 coding methods were used: one was the CS-ACELP coder without VAD and the other was the CS-ACELP coder with VAD (Chomphan, 2010a; 2010b). Both coding methods were applied to two sets of sentences, and the results for the two sets were then compared. Each set contained 6 Thai and 6 English sentences. The first set is listed in Table 1 and the second in Table 2. In both sets, the speech of 3 men and 3 women was recorded. The sentences chosen for these sets covered all of the characters of each language. The second set was processed in the same way as the first so that the results of the two sets could be compared.

RESULTS
The quality of speech was evaluated for both sets by using the segmental signal-to-noise ratio defined in Eq. 1 (Chompun et al., 2000; 2001b; Chompun, 2004; Sharmeela et al., 2006). Figures 3 and 4 show the original signal and the reconstructed signals from both methods for the first sentence of the first set, for Thai and English respectively, while Table 3 lists the segSNR for both methods.
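A commonly used form of the segmental SNR averages, over all frames, the per-frame ratio of signal energy to coding-error energy expressed in dB. The sketch below implements that common form; it may differ in detail from the definition used in this study (for example in frame length or in how silent frames are handled). The 80-sample frame length and the small `eps` guard are assumptions.

```python
import math


def seg_snr(original, decoded, frame_len=80, eps=1e-12):
    """Mean over frames of 10*log10(signal energy / error energy), in dB.

    frame_len=80 matches one 10 ms frame at 8 kHz (assumed here);
    eps avoids division by zero and log of zero on silent or perfect frames.
    """
    n_frames = min(len(original), len(decoded)) // frame_len
    if n_frames == 0:
        raise ValueError("need at least one full frame")
    total_db = 0.0
    for m in range(n_frames):
        s = original[m * frame_len:(m + 1) * frame_len]
        d = decoded[m * frame_len:(m + 1) * frame_len]
        signal_energy = sum(x * x for x in s)
        error_energy = sum((x - y) ** 2 for x, y in zip(s, d))
        total_db += 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
    return total_db / n_frames


# Toy usage: a constant signal reproduced with a 10 % amplitude error.
print(round(seg_snr([1.0] * 160, [0.9] * 160), 3))
```

Because each frame contributes equally, quiet frames weigh as much as loud ones, which is the main difference from a conventional SNR computed over the whole utterance.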

DISCUSSION
The results in Table 3 show that both methods give Thai speech quality that is mostly lower than English speech quality, by about 0.25-0.34 dB for the first set and about 0.27-0.31 dB for the second set. Comparing the methods, for both Thai and English, method 2 gives better speech quality than method 1, by about 0.01-0.05 dB for the first set and about 0.01-0.04 dB for the second set. The results of the two sets were consistent with each other.
Finally, the order of LP analysis was increased to 12 and 14 to improve the quality of the reconstructed Thai speech. Table 4 shows the segSNR of both methods. For order 12, the speech quality improves by about 0.07-0.12 dB for the first set and about 0.05-0.11 dB for the second set, in comparison to LP order 10. For order 14, the speech quality improves by about 0.09-0.14 dB for the first set and about 0.06-0.13 dB for the second set, in comparison to LP order 10. Again, the results of the two sets were consistent with each other.

Table 4: segSNR (dB) of Thai speech for LP orders 10, 12 and 14 (two columns per order)

                       Order 10       Order 12       Order 14
1st set
  CS-ACELP             9.45  9.38     9.53  9.45     9.54  9.48
  CS-ACELP with VAD    9.50  9.42     9.62  9.50     9.64  9.52
2nd set
  CS-ACELP             9.47  9.40     9.54  9.45     9.53  9.48
  CS-ACELP with VAD    9.49  9.44     9.60  9.49     9.62  9.51
The results show that the coding quality was improved. However, the coding rate also increased, because additional bits must be allocated to the higher-order information.
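The effect of the LP order on prediction accuracy can be illustrated with the autocorrelation method and the Levinson-Durbin recursion. The sketch below is not the G.729 LP analysis (it omits windowing, lag windowing and quantization) and runs on a synthetic test frame rather than on speech; `synthetic_frame` and its parameters are hypothetical.

```python
import math
import random


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err) for A(z) = 1 + sum_k a_k z^-k.

    a[j] holds a_(j+1); err is the remaining prediction-error energy.
    """
    a = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - 1 - j] for j in range(len(a)))
        k = -acc / err                      # reflection coefficient
        a = [a[j] + k * a[i - 2 - j] for j in range(len(a))] + [k]
        err *= (1.0 - k * k)
    return a, err


def synthetic_frame(n=240, seed=0):
    """Hypothetical test frame: two tones plus a little noise."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * 200 * i / 8000)
            + 0.5 * math.sin(2 * math.pi * 1300 * i / 8000)
            + 0.05 * rng.gauss(0.0, 1.0) for i in range(n)]


r = autocorrelation(synthetic_frame(), 14)
residual = {p: levinson_durbin(r, p)[1] for p in (10, 12, 14)}
for p in (10, 12, 14):
    print(p, residual[p])
```

The residual energy cannot grow as the order rises, which is consistent with the small gains reported for orders 12 and 14. Each extra coefficient must also be quantized and transmitted, which is where the higher coding rate comes from.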

CONCLUSION
The ITU-T G.729 speech coder was applied to the Thai language and its performance was evaluated. Without modification, the quality of Thai coding is not equivalent to that of English. After the LP order is increased from 10 to 12 or 14, the quality of Thai speech coding is clearly improved. However, the coding rate also increases, because additional bits must be allocated to the higher-order information.

ACKNOWLEDGEMENT
The researcher wishes to thank the Digital Signal Processing Research Laboratory, Faculty of Engineering, Chulalongkorn University, for the facilities and technical support for this study, and Kasetsart University at Si Racha campus for the research scholarship through the board of research.