Speech Compression of Thai Dialects with Low-Bit-Rate Speech Coders

Problem statement: In modern speech communication at low bit rates, speech coding significantly degrades the characteristics of the coded speech. The coding quality of the four main Thai dialects, spoken by people residing in the four core regions of Thailand (central, north, northeast and south), has not been studied. Approach: This study compares the coding quality of the four main Thai dialects under different low-bit-rate speech coders, namely the Conjugate Structure Algebraic Code Excited Linear Predictive (CS-ACELP) coder and the Multi-Pulse based Code Excited Linear Predictive (MP-CELP) coder. Objective and subjective tests were conducted to evaluate the coding quality of the four dialects. Results: Both tests show that the coding quality of the North dialect is the highest, while that of the Northeast dialect is the lowest. Moreover, the coding quality of male speech is mostly higher than that of female speech. Conclusion: The coding quality differs among the Thai dialects.


INTRODUCTION
Digital communication has improved and developed considerably in recent years. Audio, still image, video and text information can be transmitted through wired and wireless networks, while the number of users accessing these networks is growing rapidly. Therefore, the channel capacity must be increased, and signal compression aims to address this demand (Chompun et al., 2000). In communication systems where packet loss occurs, high-quality speech compression, or speech coding, is in high demand. One related standardization activity is conducted under the MPEG-4 project (Nomura et al., 1998; Chomphan, 2010b). The MP-CELP coder has been proposed as a scalable coder. This speech coder employs a multi-pulse excitation in which the number of pulses in the fixed-entry codebook is selectable, providing bitrate scalability and multiple-bitrate functionality in accordance with the MPEG-4 CELP speech coder requirements (Nomura et al., 1998; Chomphan, 2010b). It should be noted that the MP-CELP coder has been developed from the CS-ACELP coder, standardized as 8-kb/s G.729 in 1995.
In the MP-CELP core coder, amplitudes or signs for the multi-pulse excitation are simultaneously vector quantized. Moreover, to improve speech quality under background noise conditions, an adaptive pulse location restriction method is utilized. This coder operates at bitrates ranging from 4 to 12 kbps by exploiting the flexibility of multi-pulse excitation coding (Chomphan, 2010a).
This study compares the coding quality of the four main Thai dialects spoken by people residing in the four core regions of Thailand: the central, north, northeast and south regions. The compression is based on two different low-bit-rate speech coders: the Conjugate Structure Algebraic Code Excited Linear Predictive (CS-ACELP) coder and the Multi-Pulse based Code Excited Linear Predictive (MP-CELP) coder. Objective and subjective tests were performed to evaluate the coding quality of the four dialects.

CS-ACELP coder:
The CS-ACELP coder is based on the Code-Excited Linear Predictive (CELP) coding model. The coder operates on speech frames of 10 ms, corresponding to 80 samples at a sampling rate of 8000 samples per second. For every 10 ms frame, the speech signal is analyzed to extract the parameters of the CELP model (linear-prediction filter coefficients, adaptive and fixed-codebook indices and gains). These parameters are encoded and transmitted. At the decoder, these parameters are used to retrieve the excitation and synthesis filter parameters. The speech is reconstructed by filtering this excitation through the short-term synthesis filter, based on a 10th-order linear prediction filter, and the long-term or pitch synthesis filter, implemented using the adaptive-codebook approach. The reconstructed speech is then further enhanced by a post-filter (Schroder and Sherif, 1997).
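The frame arithmetic described above can be checked with a short sketch. The function names are illustrative assumptions and are not taken from the G.729 specification; the 8000 bit/s rate is the G.729 figure quoted in the introduction.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the text
FRAME_MS = 10           # frame duration stated in the text
BIT_RATE_BPS = 8000     # 8-kb/s G.729 rate quoted in the introduction


def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples contained in one frame."""
    return sample_rate_hz * frame_ms // 1000


def bits_per_frame(bit_rate_bps, frame_ms):
    """Number of bits available to encode one frame."""
    return bit_rate_bps * frame_ms // 1000


def split_into_frames(signal, frame_len):
    """Split a sample list into full frames, dropping an incomplete tail."""
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


if __name__ == "__main__":
    frame_len = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
    print(frame_len, bits_per_frame(BIT_RATE_BPS, FRAME_MS))
```

At these settings a frame holds 80 samples and is described by 80 bits, that is, one bit per sample on average.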
The encoding principle is shown in Fig. 1. The input signal is high-pass filtered and scaled in the preprocessing block. The preprocessed signal serves as the input signal for all subsequent analysis. LP analysis is done once per 10 ms frame to compute the LP coefficients. These coefficients are converted to Line Spectrum Pairs (LSP) and quantized using predictive two-stage vector quantization with 18 bits. The excitation signal is chosen by an analysis-by-synthesis search procedure in which the error between the original and reconstructed speech is minimized according to a perceptually weighted distortion measure. This is done by filtering the error signal with a perceptual weighting filter whose coefficients are derived from the unquantized LP filter. The amount of perceptual weighting is made adaptive to improve the performance for input signals with a flat frequency response.
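The LP analysis step can be illustrated with a minimal sketch, assuming the autocorrelation method with the Levinson-Durbin recursion and omitting the windowing and other conditioning that a production coder applies. The function names are hypothetical and this is not the reference implementation of the coder.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for the prediction-error filter A(z) = 1 + sum_k a[k] z^-k.

    Returns (a, err) where a[0] == 1.0 and err is the residual energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame: stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err


if __name__ == "__main__":
    # Synthetic decaying signal standing in for a speech frame.
    frame = [0.9 ** n for n in range(200)]
    coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
    print(coeffs, residual)
```

For a 10th-order analysis as used in the coder, `order` would be set to 10 and the autocorrelation computed up to lag 10.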
The decoder principle is shown in Fig. 2. First, the parameter indices are extracted from the received bitstream. These indices are decoded to obtain the coder parameters corresponding to a 10 ms speech frame. These parameters are the LSP coefficients, the two fractional pitch delays, the two fixed-codebook vectors and the two sets of adaptive and fixed-codebook gains. The LSP coefficients are interpolated and converted to LP coefficients for each subframe. Then, for each 5 ms subframe, the excitation is constructed by adding the adaptive and fixed-codebook vectors scaled by their respective gains, and the speech is reconstructed by filtering the excitation through the LP synthesis filter. Finally, the reconstructed speech signal is passed through a post-processing stage, which includes an adaptive post-filter based on the long-term and short-term synthesis filters, followed by a high-pass filter and a scaling operation.
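The per-subframe reconstruction described above can be sketched as follows. This is a simplified illustration under stated assumptions: the codebook vectors and gains are taken as already decoded, the post-processing stage is omitted, and all names are hypothetical.

```python
def build_excitation(adaptive_vec, fixed_vec, gain_pitch, gain_code):
    """Excitation = gain_pitch * adaptive vector + gain_code * fixed vector."""
    return [gain_pitch * a + gain_code * c
            for a, c in zip(adaptive_vec, fixed_vec)]


def lp_synthesis(excitation, lp_coeffs, memory=None):
    """Filter the excitation through 1/A(z), A(z) = 1 + sum_k a_k z^-k.

    lp_coeffs holds [a_1, ..., a_p]; memory holds the last p output
    samples, most recent first, so that subframes can be chained.
    Returns (speech, updated_memory).
    """
    order = len(lp_coeffs)
    mem = list(memory) if memory is not None else [0.0] * order
    speech = []
    for e in excitation:
        s = e - sum(a * m for a, m in zip(lp_coeffs, mem))
        speech.append(s)
        mem = ([s] + mem)[:order]
    return speech, mem


if __name__ == "__main__":
    exc = build_excitation([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], 1.0, 0.0)
    out, state = lp_synthesis(exc, [-0.5])
    print(out)
```

Passing the returned memory into the next call keeps the filter state continuous across the two 5 ms subframes of a frame.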

MP-CELP coder:
The MP-CELP core coder achieves high coding performance by introducing multi-pulse vector quantization, as depicted in Fig. 3 (Taumi et al., 1996; Ozawa et al., 1997). Each 10 ms frame of input speech is processed through Linear Prediction (LP) and pitch analysis. The LP coefficients are quantized in the Line Spectrum Pairs (LSP) domain. The pitch delay is encoded by using an adaptive codebook. The residual signal after LP and pitch analysis is encoded by the multi-pulse excitation scheme. The multi-pulse excitation signal is composed of several non-zero pulses. The pulse positions are restricted by the algebraic-structure codebook and determined by an analysis-by-synthesis approach (Laflamme et al., 1991; Chomphan, 2010a). The pulse signs and positions are encoded, while the gains for the pitch predictor and the multi-pulse excitation are normalized by the frame energy and encoded. The supported core bitrates of this coder are 5600, 8200 and 12200 bps, for the core coder with one, five and ten pulses in the fixed codebook, respectively.
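A multi-pulse excitation vector of the kind described above can be illustrated with a minimal sketch. It assumes unit-amplitude signed pulses and a caller-supplied vector length, neither of which is specified in the text; the position restrictions of the algebraic-structure codebook and the analysis-by-synthesis search are not modeled. The names are hypothetical.

```python
# Core bitrates (bps) and fixed-codebook pulse counts as listed in the text.
PULSES_PER_BITRATE = {5600: 1, 8200: 5, 12200: 10}


def multipulse_excitation(positions, signs, length):
    """Build a sparse excitation vector from pulse positions and signs.

    positions: sample indices of the pulses within the vector
    signs:     +1 or -1 for each pulse
    length:    vector length in samples (assumed, not from the text)
    """
    if len(positions) != len(signs):
        raise ValueError("positions and signs must have the same length")
    vec = [0.0] * length
    for pos, sign in zip(positions, signs):
        if not 0 <= pos < length:
            raise ValueError("pulse position out of range")
        if sign not in (1, -1):
            raise ValueError("pulse sign must be +1 or -1")
        vec[pos] += float(sign)
    return vec


if __name__ == "__main__":
    n_pulses = PULSES_PER_BITRATE[8200]
    exc = multipulse_excitation([0, 7, 13, 22, 31], [1, -1, 1, 1, -1], 40)
    print(n_pulses, sum(1 for v in exc if v != 0.0))
```

Only the positions and signs need to be transmitted, which is why adding pulses raises both the bitrate and the achievable quality.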

RESULTS
The coding quality of the coders was evaluated subjectively and objectively by using 50 test sentences for each of the four main dialects spoken by Thai people residing in the four core regions: the central, north, northeast and south regions. The speech for each dialect was taken from one male and one female speaker.
For the subjective test, the Mean Opinion Score (MOS) was chosen to evaluate the coding quality of all dialects and genders. The subjects were four men and four women. The averaged MOS scores for each dialect are presented in Fig. 4-7.
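The averaging of listener ratings can be sketched as below, assuming each listener rates an utterance on the customary five-point scale (1 = bad, 5 = excellent). The text does not state the scale used, and the ratings in the example are made-up placeholders, not experimental data.

```python
def mean_opinion_score(ratings, low=1, high=5):
    """Average a list of listener ratings on an assumed low..high scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError("rating outside the assumed scale")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Eight hypothetical ratings, one per listener (four men, four women).
    print(mean_opinion_score([4, 3, 5, 4, 4, 3, 4, 5]))
```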
For the objective test, the Signal to Noise Ratio (SNR) was chosen to evaluate the coding quality of all dialects and genders and to confirm the results of the subjective test. The SNR score is computed from the energy of the natural speech and the energy of the difference between the natural speech and the coded speech. The averaged SNR scores for each dialect are presented in Fig. 8-11.
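The SNR computation described above can be sketched as follows, assuming the usual decibel form, 10 log10 of the ratio between the energy of the natural speech and the energy of the difference signal, and assuming the two signals are time-aligned and of equal length. The function name is illustrative.

```python
import math


def snr_db(reference, decoded):
    """SNR in dB between natural (reference) and coded (decoded) speech."""
    if len(reference) != len(decoded):
        raise ValueError("signals must have the same length")
    signal_energy = sum(x * x for x in reference)
    noise_energy = sum((x - y) ** 2 for x, y in zip(reference, decoded))
    if noise_energy == 0.0:
        return math.inf    # identical signals: no coding noise
    if signal_energy == 0.0:
        return -math.inf   # silent reference with non-zero error
    return 10.0 * math.log10(signal_energy / noise_energy)


if __name__ == "__main__":
    natural = [1.0, 1.0, 1.0, 1.0]
    coded = [0.9, 0.9, 0.9, 0.9]
    print(round(snr_db(natural, coded), 3))
```

A per-dialect score would then be the average of this value over the test sentences of that dialect.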
Finally, Fig. 12-13 compare the coding quality of all four dialects using the subjective test and the objective test, respectively.

DISCUSSION
In the subjective test results, the MOS scores in Fig. 12 show that the coding quality of the North dialect is the highest, that of the South dialect is the second highest and that of the Northeast dialect is the lowest. Moreover, the coding quality of male speech is mostly higher than that of female speech. In the objective test results, the SNR scores in Fig. 13 show the same ordering of the dialects as the MOS scores in Fig. 12.
Furthermore, the coding quality of the different coders was compared. From Fig. 4-7, the MP-CELP coder with 10 pulses in the fixed codebook gives the best coding quality, while the CS-ACELP (G.729) coder gives the worst. The SNR scores in Fig. 8-11 agree with the MOS scores in Fig. 4-7 for all dialects.

CONCLUSION
This study compared the coding quality of the four main Thai dialects under two low-bit-rate speech coders, the CS-ACELP coder and the MP-CELP coder. Objective and subjective tests were used to evaluate the coding quality of the four dialects. Both tests show that the coding quality of the North dialect is the highest, while that of the Northeast dialect is the lowest. Moreover, the coding quality of male speech is mostly higher than that of female speech. The study shows that the coding quality differs among the Thai dialects.