Tone Question of Tree Based Context Clustering for Hidden Markov Model Based Thai Speech Synthesis

INTRODUCTION
Speech synthesis is one of the key technologies for realizing natural human-computer interaction. For this purpose, Text-To-Speech (TTS) synthesis systems must be able to generate speech with arbitrary speaker voice characteristics and various speaking styles. A number of TTS techniques have been proposed, and state-of-the-art TTS systems based on unit selection and concatenation can generate natural-sounding speech. However, synthesizing speech with various voice characteristics and various speaking styles remains a difficult problem.
A Hidden Markov Model (HMM) based TTS system, in which each speech synthesis unit is modeled by an HMM, has been proposed in the past decade (Masuko et al., 1996; Yoshimura et al., 1999; Al-Haddad et al., 2008). A distinctive feature of the system is that the speech parameters used in the synthesis stage are generated directly from the HMMs by a parameter generation algorithm (Tokuda et al., 2000; Curran et al., 2005; Aliwa et al., 2010). Since the HMM-based TTS system uses HMMs as the speech units in both modeling and synthesis, the voice characteristics of the synthetic speech can be changed by transforming the HMM parameters appropriately.
As for Thai speech synthesis, a TTS synthesis system based on unit selection was initially implemented by Luksaneeyanawin in 1991 (Chomphan and Kobayashi, 2008). Subsequently, a TTS synthesis system based on unit selection with the TD-PSOLA technique was developed by the National Electronics and Computer Technology Center (NECTEC) in 2003 (Hansakunbuntheung et al., 2005). Since Thai is a tonal language, this study implements an HMM-based Thai speech synthesis system, which can synthesize speech with various voice characteristics and speaking styles, and applies a tone question in the tree-based context clustering stage to improve the overall speech quality. An experiment shows that the tone question yields a considerable improvement.

HMM-based speech synthesis:
A block diagram of the HMM-based TTS system is shown in Fig. 1. The system consists of two stages: the training stage and the synthesis stage (Tamura et al., 2001; Yamagishi et al., 2003).
In the training stage, mel-cepstral coefficients are extracted at each analysis frame as the static features from the speech database. Then the dynamic features, i.e., delta and delta-delta parameters, are calculated from the static features. Spectral parameters and pitch observations are combined into one observation vector frame by frame, and speaker-dependent phoneme HMMs are trained using the observation vectors. To model variations of spectrum, pitch and duration, phonetic and linguistic contextual factors, such as phoneme identity factors, are taken into account (Yoshimura et al., 1999). Spectrum and pitch are modeled by multi-stream HMMs, and the output distributions for the spectral and pitch parts are a continuous probability distribution and a Multi-Space probability Distribution (MSD), respectively. Then, a decision tree based context clustering technique is separately applied to the spectral and pitch parts of the context dependent phoneme HMMs (Young et al., 1994). Finally, state durations are modeled by multidimensional Gaussian distributions and the state clustering technique is also applied to the duration distributions (Yoshimura et al., 1998).
In the synthesis stage, an arbitrary given text to be synthesized is first transformed into a context dependent phoneme label sequence. According to the label sequence, a sentence HMM, which represents the whole text to be synthesized, is constructed by concatenating adapted phoneme HMMs. From the sentence HMM, phoneme durations are determined based on the state duration distributions (Yoshimura et al., 1998). Then spectral and pitch parameter sequences are generated using the algorithm for speech parameter generation from HMMs with dynamic features (Tokuda et al., 1995).
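The dynamic features used in the training stage can be illustrated with a minimal sketch. The regression windows below, (-0.5, 0, 0.5) for delta and (1, -2, 1) for delta-delta, are a common choice but an assumption here, since the paper does not state its window coefficients. The function names and the sample track are likewise hypothetical.

```python
from typing import List

# Assumed regression windows; the paper does not specify them.
DELTA_WINDOW = [-0.5, 0.0, 0.5]
DELTA_DELTA_WINDOW = [1.0, -2.0, 1.0]


def dynamic_features(static: List[float], window: List[float]) -> List[float]:
    """Apply a 3-tap window to a per-frame static track.

    Edge frames are padded by repeating the first and last frame.
    """
    if len(window) != 3:
        raise ValueError("this sketch only supports 3-tap windows")
    n = len(static)
    out = []
    for t in range(n):
        prev = static[max(t - 1, 0)]
        nxt = static[min(t + 1, n - 1)]
        out.append(window[0] * prev + window[1] * static[t] + window[2] * nxt)
    return out


def observation_track(static: List[float]) -> List[List[float]]:
    """Stack static, delta and delta-delta values frame by frame."""
    delta = dynamic_features(static, DELTA_WINDOW)
    delta2 = dynamic_features(static, DELTA_DELTA_WINDOW)
    return [list(frame) for frame in zip(static, delta, delta2)]


if __name__ == "__main__":
    # Made-up values for one coefficient over four frames.
    for frame in observation_track([1.0, 2.0, 4.0, 7.0]):
        print(frame)
```

In an actual system the same windows would be applied to every mel-cepstral coefficient and to the log fundamental frequency, so each stream carries its static values together with both sets of dynamic values.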
In the final step of the synthesis stage, speech is synthesized from the generated mel-cepstral and pitch parameter sequences using the Mel Log Spectrum Approximation (MLSA) filter (Fukuda et al., 1992).

Thai speech attributes:
In a tonal language such as Thai, a syllable is composed of consonants, vowels and a tone (Chompun, 2004). The basic Thai textual syllable is represented in Fig. 2, where C_i, V, C_f and T denote an initial consonant, a vowel, a final consonant and a tone, respectively. The significant difference between tonal and toneless languages is the syllable tone: the meaning of a syllable changes as its tone changes (Thathong et al., 2000; Chompun et al., 2001). Table 1 summarizes the number of Thai characters and phones for each part of the syllable.
In Thai, four different tone markers are generally used to indicate the five Thai tones: middle tone (0), low tone (1), falling tone (2), high tone (3) and rising tone (4). For example, the syllable "บาน" (to widen) has a middle tone and is pronounced /ba:n/, whereas the syllable "บ้าน" (home) has a falling tone and is pronounced /bâ:n/. Each syllable tone can be characterized by its corresponding fundamental frequency contour, as depicted in Fig. 3 (Chompun, 2004; Chompun et al., 2001). Each contour line is constructed by plotting the voice fundamental frequency, extracted at regular intervals, against the normalized syllable duration.
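As a rough illustration of how contours from syllables of different lengths can be placed on a common normalized duration axis, the sketch below linearly resamples a fundamental frequency track onto a fixed number of points between 0 and 1. The function name, the number of points and the sample values are assumptions for illustration only; they are not measurements from Fig. 3.

```python
from typing import List


def normalize_contour(f0: List[float], num_points: int = 11) -> List[float]:
    """Linearly resample an F0 track onto num_points equally spaced
    positions of the normalized syllable duration (0.0 to 1.0)."""
    if len(f0) < 2 or num_points < 2:
        raise ValueError("need at least two F0 values and two output points")
    last = len(f0) - 1
    out = []
    for k in range(num_points):
        pos = k * last / (num_points - 1)   # fractional index into f0
        lo = min(int(pos), last - 1)        # left neighbour
        frac = pos - lo                     # weight of the right neighbour
        out.append(f0[lo] * (1.0 - frac) + f0[lo + 1] * frac)
    return out


if __name__ == "__main__":
    # Hypothetical F0 values in Hz for a short and a long syllable.
    short_syllable = [220.0, 240.0, 200.0]
    long_syllable = [220.0, 230.0, 240.0, 220.0, 200.0]
    print(normalize_contour(short_syllable, 5))
    print(normalize_contour(long_syllable, 5))
```

After resampling, both syllables are described by the same number of points, so contours of the same tone can be averaged or plotted together regardless of their original durations.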

Questions for tree-based context clustering process:
Since tone information is a crucial factor in Thai, as mentioned above, the tone number (0-4) is employed in the context clustering process of the HMM-based TTS system. Moreover, the following contextual factors were also taken into account:
• Syllable position in word
• Part of speech

[...] (Hansakunbuntheung et al., 2005). The speech in the database was uttered by a professional female speaker with clear articulation and a standard Thai accent. The context dependent phoneme labels were extracted based on the phoneme labels and linguistic information included in the database. There are approximately 79 phonemes, including silence and pause. Speech signals were sampled at a rate of 16 kHz and windowed by a 25 ms Blackman window with a 5 ms shift. Then mel-cepstral coefficients were extracted by mel-cepstral analysis. The feature vectors consisted of 25 mel-cepstral coefficients including the zeroth coefficient, the logarithm of the fundamental frequency, and their delta and delta-delta coefficients (Tachibana et al., 2005).
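To illustrate how the tone number can enter the context clustering process described at the start of this section, the following sketch treats each tone question as a yes/no predicate on a context and picks the question that best separates a small set of pitch values. This is a simplified sketch under stated assumptions: the `Context` class, the question names and the log F0 values are hypothetical, and the reduction in squared error stands in for the likelihood-based criterion that decision tree clustering normally uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Context:
    phoneme: str
    tone: int       # 0 middle, 1 low, 2 falling, 3 high, 4 rising
    log_f0: float   # made-up mean log F0 for this context


Question = Tuple[str, Callable[[Context], bool]]


def tone_questions() -> List[Question]:
    """One yes/no question per tone number (hypothetical naming)."""
    names = ["middle", "low", "falling", "high", "rising"]
    return [(f"Is-Tone-{n}?", lambda c, k=k: c.tone == k)
            for k, n in enumerate(names)]


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(contexts: List[Context],
               questions: List[Question]) -> Tuple[str, float]:
    """Return the question with the largest squared-error reduction."""
    total = sse([c.log_f0 for c in contexts])
    best = ("", 0.0)
    for name, holds in questions:
        yes = [c.log_f0 for c in contexts if holds(c)]
        no = [c.log_f0 for c in contexts if not holds(c)]
        if not yes or not no:
            continue  # a question must split the node into two parts
        gain = total - (sse(yes) + sse(no))
        if gain > best[1]:
            best = (name, gain)
    return best


# Hypothetical contexts: only the falling tone has a distinct pitch level.
CONTEXTS = [
    Context("a", 0, 5.0), Context("a", 0, 5.0), Context("a", 1, 5.0),
    Context("a", 2, 5.4), Context("a", 2, 5.4),
    Context("a", 3, 5.0), Context("a", 4, 5.0),
]

if __name__ == "__main__":
    print(best_split(CONTEXTS, tone_questions()))
```

Without tone questions, contexts that differ only in tone cannot be separated by the tree and would share one pitch distribution; with them, the tree can assign separate distributions to tones whose pitch patterns differ.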
We used 5-state left-to-right HMMs in which the spectral part of each state is modeled by a single diagonal Gaussian output distribution (Tachibana et al., 2006; Yamagishi and Kobayashi, 2005). The number of training utterances was varied over 500, 1000, 1500, 2000 and 2500 sentences.
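The left-to-right topology can be sketched as a transition matrix in which each state may only stay where it is or move to the next state. The sketch below assumes no skip transitions and a fixed self-loop probability of 0.6; both are illustrative assumptions, not values from the paper, and the extra last column represents leaving the model.

```python
from typing import List


def left_to_right_transitions(num_states: int = 5,
                              self_loop: float = 0.6) -> List[List[float]]:
    """Build a (num_states x num_states+1) transition matrix.

    Row i holds the probabilities of staying in state i (column i)
    and of advancing (column i+1). The final column is the exit.
    """
    if num_states < 1:
        raise ValueError("num_states must be positive")
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")
    rows = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = self_loop
        row[i + 1] = 1.0 - self_loop
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in left_to_right_transitions():
        print(row)
```

All entries below the diagonal are zero, which is what makes the model left-to-right: once a state has been left, it cannot be revisited.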

Subjective evaluations of synthesized speech:
First, the naturalness of the synthesized speech generated by six approaches was evaluated by a paired comparison test. Five of the approaches are HMM-based systems trained with different numbers of training utterances, and the sixth is the unit selection approach with a corpus size of 5200 sentences (Hansakunbuntheung et al., 2005). The subjects were nine Thai persons. They were presented with pairs of speech samples synthesized by different approaches in random order and asked which one sounded more natural. For each subject, five test sentences were chosen at random from 25 test sentences that were not contained in the training sentences. The preference scores are shown in Fig. 4.
Secondly, a subjective evaluation of the tone question in the context clustering stage was conducted. The correctness of the syllable tones of the synthesized speech generated by two systems was evaluated by a paired comparison test. The first system uses the tone question in the context clustering process, while the other system does not. The preference scores are shown in Fig. 5.
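A preference score for a paired comparison test can be computed as sketched below. The paper does not define the score explicitly, so this sketch assumes it is the percentage of comparisons involving a system in which that system was preferred. The system names and the trial data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Trial = Tuple[str, str, str]  # (system A, system B, preferred system)


def preference_scores(trials: Iterable[Trial]) -> Dict[str, float]:
    """Percentage of comparisons in which each system was preferred."""
    wins: Counter = Counter()
    seen: Counter = Counter()
    for first, second, preferred in trials:
        if preferred not in (first, second):
            raise ValueError("preferred system must be one of the pair")
        seen[first] += 1
        seen[second] += 1
        wins[preferred] += 1
    return {name: 100.0 * wins[name] / seen[name] for name in seen}


if __name__ == "__main__":
    # Made-up judgments; presentation order varies between trials.
    trials = [
        ("tone", "no_tone", "tone"),
        ("no_tone", "tone", "tone"),
        ("tone", "no_tone", "no_tone"),
        ("tone", "no_tone", "tone"),
    ]
    print(preference_scores(trials))
```

With only two systems the scores sum to 100. With more than two systems, as in the naturalness test, each score reflects only the pairs in which that system appeared.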

DISCUSSION
It can be seen from Fig. 4 that the naturalness of the synthesized speech improves as the number of training sentences increases. Although the score of the unit selection approach is above that of the HMM-based approach with 2500 training sentences, the HMM-based approach can be further developed to synthesize speech with various voice characteristics and speaking styles, as mentioned earlier. Moreover, the HMM-based approach is newly constructed for Thai, and we expect that the system will be further improved in the near future.
From Fig. 5, it can be seen that the score of the system with the tone question is considerably superior to that of the system without it for every number of training sentences. As the number of training sentences increases, the percentage score of the system without the tone question increases. The reason is that the shortage of training data for each syllable tone is relieved.

CONCLUSION
In this study, we proposed an HMM-based Thai speech synthesis system. The characteristics of Thai speech were investigated, and the conventional HMM-based synthesis system was then modified according to the tonal attributes of Thai. We found that the number of training sentences affected the naturalness of the synthesized speech, while the tone information significantly affected the synthesized speech.

ACKNOWLEDGEMENT
The researcher is grateful to NECTEC for providing the TSynC-1 speech database.