Fujisaki's Model of Fundamental Frequency Contours for Thai Dialects
© 2010 Science Publications

Problem statement: In general, there are a number of rural dialects in Thai. However, four dialects are mainly spoken by Thai people residing in the four core regions: the central, north, northeast and south regions. Recognizing and synthesizing Thai speech in different dialects is consequently difficult. Approach: Prosody is an important factor that must be taken into account, since it affects not only the naturalness but also the intelligibility of speech. To treat the problem, the speech prosody is carefully preserved by modeling the fundamental frequency (F0) contours. The differences among the model parameters of the four Thai dialects have been summarized. This study proposes an analysis of model parameters for Thai speech prosody with four regional dialects and two genders, which is preliminary work for speech recognition and synthesis. Fujisaki's model, a powerful tool for modeling the F0 contour, has been adopted. Seven parameters are derived from Fujisaki's model as follows. The first parameter is the baseline frequency, which is the lowest level of the F0 contour. The second and third parameters are the numbers of phrase commands and tone commands, which reflect the frequencies of surges of the utterance at the global and local levels, respectively. The fourth and fifth parameters are the phrase command and tone command durations, which reflect the speed of speaking and the length of a syllable, respectively. The sixth and seventh parameters are the amplitudes of the phrase command and tone command, which reflect the energy of the global speech and the energy of the local syllable, respectively. Results: In the experiments, each regional dialect includes 200 samples of one sentence for each of male and female speech. Therefore our speech database contains 1600 utterances in total. The results showed that most of the proposed parameters can distinguish the four regional dialects explicitly.
Conclusion: By using Fujisaki's model, the results confirm that the proposed parameters can distinguish the regional dialects efficiently. In future research, they are expected to be applied in speech recognition and synthesis with various regional dialect characteristics.


INTRODUCTION
An appropriate modeling of the F0 contour contributes to the effectiveness of speech processing, such as speech recognition, speech synthesis and speech coding. Fujisaki's modeling of fundamental frequency for Thai expressive speech, conducted in 2010, proved to be effective for a limited-domain speech corpus (Chomphan, 2010). It was found that the derived parameters can distinguish one style of speech from another.
Speech processing of Thai dialects, however, has not been studied, despite the variety of dialects spread over the four regions of Thailand. In the Northern region of Thailand, the Thai dialect known as "Lanna" or "Kammuang" is widely used; the Lao-style Thai dialect is spoken in the North Eastern region, while the South Thai dialect is generally spoken in the Southern part of Thailand.
Following the same approach as for Thai expressive speech (Chomphan, 2010), this study proposes an analysis of F0 modeling of four Thai dialects: standard Thai, the Lanna or North dialect, the Lao-style or North East dialect and the South dialect. The extension of Fujisaki's model is mainly used. This is a preliminary study for advanced research in speech synthesis and recognition, such as the expressive speech synthesis work for the Japanese language (Tachibana et al., 2005; 2006).

MATERIALS AND METHODS
Fujisaki's model: The F0 contour of an utterance is treated as a linear superposition of a global phrase component and local accent components on a logarithmic scale, as depicted in Fig. 1 (Fujisaki and Sudo, 1971). Two kinds of commands generate these components: the phrase command produces the baseline component, while the accent command produces the accent component of the F0 contour. We use two parameters of Fujisaki's model as our phrase-intonation features, namely the baseline value of F0 and the magnitude of the phrase command, which complementarily reflect the global level of voicing frequency. Mathematically, the F0 contour of an utterance generated from an extension of Fujisaki's model for tonal languages is expressed as follows (Seresangtakul and Takara, 2003):
$$\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i}) + \sum_{j=1}^{J} \sum_{k=1}^{K(j)} A_{t,jk} \left\{ G_{t,jk}(t - T_{1jk}) - G_{t,jk}(t - T_{2jk}) \right\}$$
$$G_{pi}(t) = \begin{cases} \alpha_i^2\, t\, e^{-\alpha_i t}, & t \ge 0 \\ 0, & t < 0 \end{cases}$$
$$G_{t,jk}(t) = \begin{cases} 1 - (1 + \beta_{jk} t)\, e^{-\beta_{jk} t}, & t \ge 0 \\ 0, & t < 0 \end{cases}$$
Where:
$G_{pi}(t)$ = The impulse-response function of the phrase-control mechanism
$G_{t,jk}(t)$ = The step-response function of the tone-control mechanism
In these three equations, $F_b$ is the smallest F0 value in the F0 contour of interest, and $A_{pi}$ and $A_{t,jk}$ are the amplitudes of the i-th phrase command and of the k-th component of the j-th tone command, respectively. $T_{0i}$ is the timing of the i-th phrase command, and $T_{1jk}$ and $T_{2jk}$ are the onset and offset of the k-th component of the j-th tone command. $\alpha_i$ and $\beta_{jk}$ are time constant parameters, while $I$, $J$ and $K(j)$ are the number of phrase commands, the number of tone commands and the number of components of the j-th tone command contained in the utterance.
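As an illustration of the superposition described above, the following minimal Python sketch generates an F0 contour from a set of phrase and tone commands. It assumes the standard response functions of the Fujisaki model (impulse response α²t·exp(−αt) for phrase commands and step response 1 − (1 + βt)·exp(−βt) for tone commands). The function names, the flattening of the (j, k) tone components into a single list, and all numeric values are assumptions made for this example only; they are not taken from the speech database of this study.

```python
import math

def phrase_response(t, alpha):
    """Impulse response of the phrase-control mechanism (zero before the command)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def tone_response(t, beta):
    """Step response of the tone-control mechanism (zero before the command)."""
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t) if t >= 0 else 0.0

def ln_f0(t, fb, phrase_cmds, tone_cmds):
    """ln F0(t): baseline plus phrase components plus tone components.

    phrase_cmds: list of (A_p, T0, alpha)
    tone_cmds:   list of (A_t, T1, T2, beta), one entry per tone component
    """
    value = math.log(fb)
    for a_p, t0, alpha in phrase_cmds:
        value += a_p * phrase_response(t - t0, alpha)
    for a_t, t1, t2, beta in tone_cmds:
        # A tone component is a pulse: step response at onset minus step response at offset.
        value += a_t * (tone_response(t - t1, beta) - tone_response(t - t2, beta))
    return value

def f0_contour(times, fb, phrase_cmds, tone_cmds):
    """F0 in Hz at each time point (seconds)."""
    return [math.exp(ln_f0(t, fb, phrase_cmds, tone_cmds)) for t in times]

# Hypothetical values, chosen only to exercise the functions.
fb = 100.0                                    # baseline frequency in Hz
phrase_cmds = [(0.5, 0.0, 2.0)]               # one phrase command at t = 0 s
tone_cmds = [(0.3, 0.2, 0.5, 20.0),           # a positive tone component
             (-0.2, 0.6, 0.9, 20.0)]          # a negative tone component
times = [i * 0.01 for i in range(150)]        # 1.5 s sampled every 10 ms
contour = f0_contour(times, fb, phrase_cmds, tone_cmds)
```

Because the components are added in the ln F0 domain, each command scales the contour multiplicatively in Hz, and with no commands at all the contour stays flat at the baseline frequency.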
The optimization is carried out by minimizing the mean squared error in the ln F0(t) domain through the hill-climbing search in the space of model parameters to find the optimal representative parameters in the modeling process (Seresangtakul and Takara, 2003).
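The optimization step can be illustrated with a minimal hill-climbing sketch in Python. This is a simplified stand-in, not the extraction tool used in this study: it assumes one phrase command and one tone component with fixed, hypothetical timings and time constants, takes the baseline as the smallest value of the target contour, and searches only over the two amplitudes. The target contour is synthetic, generated from known amplitudes so that the recovered values can be checked.

```python
import math

def phrase_response(t, alpha):
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def tone_response(t, beta):
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t) if t >= 0 else 0.0

# Hypothetical fixed timings (s) and time constants (1/s); only amplitudes are searched.
T0, ALPHA = 0.0, 2.0
T1, T2, BETA = 0.2, 0.5, 20.0

def model_ln_f0(times, ln_fb, a_p, a_t):
    """ln F0 contour for one phrase command and one tone component."""
    return [ln_fb
            + a_p * phrase_response(t - T0, ALPHA)
            + a_t * (tone_response(t - T1, BETA) - tone_response(t - T2, BETA))
            for t in times]

def mean_squared_error(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def hill_climb(objective, start, step=0.1, min_step=1e-6):
    """Move one parameter at a time by +/- step; halve the step when nothing improves."""
    best = list(start)
    best_err = objective(best)
    while step > min_step:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                candidate = list(best)
                candidate[i] += delta
                err = objective(candidate)
                if err < best_err:
                    best, best_err = candidate, err
                    improved = True
        if not improved:
            step /= 2.0
    return best, best_err

# Synthetic target generated from known amplitudes (A_p = 0.5, A_t = 0.3).
times = [i * 0.02 for i in range(75)]
target = model_ln_f0(times, math.log(100.0), 0.5, 0.3)
ln_fb = min(target)  # baseline taken as the smallest value of the contour

def objective(params):
    return mean_squared_error(model_ln_f0(times, ln_fb, params[0], params[1]), target)

fitted, error = hill_climb(objective, [0.0, 0.0])
```

On this synthetic target the search should return amplitudes close to the generating values. A full fit would also search over command timings and time constants, where the error surface is no longer this well behaved and the result depends on the starting point.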
By using this generative model, the parameters are extracted from our speech database, utterance by utterance. Subsequently, the derived parameters are computed and analyzed.
Derived parameters: From the conventional parameters, we calculated seven derived parameters which reflect the geometrical appearance of the F0 contour of an utterance as follows:
• Baseline frequency
• Numbers of phrase commands
• Numbers of tone commands
• Phrase command duration
• Tone command duration
• Amplitude of phrase command
• Amplitude of tone command
All of these derived parameters have been extracted for four regional Thai dialects including standard Thai, Lanna or North dialect, Lao-style or North East dialect and South dialect.
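The seven derived parameters listed above could be computed from the extracted commands of one utterance roughly as in the Python sketch below. Several details are assumptions, since the exact definitions are not spelled out here: the tone command duration is taken as offset minus onset, the phrase command duration as the interval from one phrase command to the next (or to the end of the utterance), and durations and amplitudes are averaged per utterance. The function name, data layout and example values are hypothetical.

```python
from statistics import mean

def derived_parameters(fb, phrase_cmds, tone_cmds, utterance_end):
    """Seven derived parameters for one utterance.

    phrase_cmds: list of (amplitude, onset) in seconds
    tone_cmds:   list of (amplitude, onset, offset) in seconds
    """
    # Assumption: a phrase command lasts until the next one starts (or the utterance ends).
    onsets = sorted(t0 for _, t0 in phrase_cmds)
    boundaries = onsets + [utterance_end]
    phrase_durations = [b - a for a, b in zip(boundaries, boundaries[1:])]
    # Assumption: a tone command lasts from its onset to its offset.
    tone_durations = [t2 - t1 for _, t1, t2 in tone_cmds]
    return {
        "baseline_frequency": fb,
        "num_phrase_commands": len(phrase_cmds),
        "num_tone_commands": len(tone_cmds),
        "phrase_command_duration": mean(phrase_durations),
        "tone_command_duration": mean(tone_durations),
        "phrase_command_amplitude": mean(a for a, _ in phrase_cmds),
        "tone_command_amplitude": mean(a for a, _, _ in tone_cmds),
    }

# Hypothetical commands for a single 2-second utterance.
params = derived_parameters(
    fb=100.0,
    phrase_cmds=[(0.5, 0.0), (0.3, 1.0)],
    tone_cmds=[(0.3, 0.2, 0.5), (-0.2, 0.6, 0.9), (0.4, 1.1, 1.5)],
    utterance_end=2.0,
)
```

Collecting such a dictionary for every utterance gives, for each dialect and gender, one set of values per parameter to analyze.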

RESULTS
In our speech database, we use the sentence "จินตนาการสำคัญกว่าความรู้" (meaning "Imagination is more important than knowledge" in English), spoken by both male and female speakers. This sentence has been recorded in four Thai dialects: standard Thai, the Lanna Thai dialect, the Lao-style Thai dialect and the South Thai dialect. For each gender, each dialect contains 200 utterances. Therefore we have 800 utterances for each gender. The parameter extraction tools used in (Mixdorff and Fujisaki, 1997) are applied in this study.
For each derived parameter, we analyzed the frequency distribution over its range, and the distributions of the four Thai dialects are then plotted in one graph to show the differences and similarities among the dialects. The first seven graphs are of female speech (Fig. 2-8), while the next seven are of male speech (Fig. 9-15). The following abbreviations are defined and used in most figures: EW, MW, NW and SW denote the North East woman dialect, Standard Thai woman dialect, North woman dialect and South woman dialect, respectively, while EM, MM, NM and SM denote the North East man dialect, Standard Thai man dialect, North man dialect and South man dialect, respectively. From all of these frequency distribution graphs, the first and second statistical moments (mean and standard deviation values) were subsequently calculated and are shown in the comparative bar charts (Fig. 16-22). From these bar charts, we can also observe some differences between male and female speech.
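The analysis described above (a frequency distribution per dialect, followed by its mean and standard deviation) can be sketched in Python as follows. The bin width, the use of the sample standard deviation and the numeric values are assumptions for illustration; the values are placeholders and not measurements from the speech database.

```python
import math
from collections import Counter
from statistics import mean, stdev

def frequency_distribution(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {index * bin_width: counts[index] for index in sorted(counts)}

def moments(values):
    """Mean and sample standard deviation (assumed; the population form is also possible)."""
    return mean(values), stdev(values)

# Placeholder baseline-frequency values in Hz for two of the dialect groups.
samples = {
    "MW": [210.0, 215.0, 220.0, 215.0],  # Standard Thai woman dialect
    "NW": [190.0, 200.0, 195.0, 205.0],  # North woman dialect
}

distributions = {name: frequency_distribution(vals, 10.0) for name, vals in samples.items()}
summary = {name: moments(vals) for name, vals in samples.items()}
```

Plotting the entries of `distributions` on a shared axis corresponds to one frequency distribution graph, and the pairs in `summary` correspond to the bars of one comparative chart.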

DISCUSSION
From the frequency distribution graphs of male and female speech in Fig. 2-15, most results show that the distributions of the four Thai dialects are clearly different. Only in a few cases is the distribution of one dialect similar to that of another, e.g., the amplitude of tone command for the North East and North dialects in Fig. 8. It has also been noted that some distributions are multimodal, e.g., the phrase command duration in Fig. 5 and 12. All in all, in nearly all of the frequency distribution graphs, the four dialect distributions can be distinguished from each other empirically.
The statistical bar charts in Fig. 16-22 compare the mean and standard deviation values of all seven parameters between male and female speech. For the baseline frequency, number of phrase commands and number of tone commands (Fig. 16-18), the mean values of male speech are lower than those of female speech for all dialects. For the phrase command duration and tone command duration (Fig. 19 and 20), the mean values of male speech are quite similar to those of female speech. For the amplitude of phrase command and amplitude of tone command (Fig. 21 and 22), standard Thai shows a marked difference between male and female speech, while the other dialects show only small differences.
In the classification problem, one dialect can be distinguished from the others by using the derived parameters in combination. These experimental results provide strong evidence for further applying the derived parameters in speech synthesis systems or other speech processing technologies. They also correspond to the previous results obtained for Thai expressive speech (Chomphan, 2010).
For example, the parameters are expected to be applied in the tree-based context clustering process of hidden Markov model based Thai speech synthesis (Chomphan and Kobayashi, 2007a; 2007b; 2008; 2009) to categorize the training speech units into groups. Appropriate data sharing within each of the speech unit clusters can consequently improve the efficiency of the overall speech synthesis system.

CONCLUSION
An analysis of Fujisaki's model parameters for four Thai dialects has been performed in this study. The specified dialects include standard Thai, the North dialect, the North East dialect and the South dialect, while the speech database covers both male and female genders. Fujisaki's model has been applied to model the F0 contour of all dialects, and seven derived parameters have been extracted from it. The experimental results show that most of the derived parameters can be used to distinguish all four Thai dialects explicitly. From this finding, the parameters are expected to be applied further in speech recognition and speech synthesis systems.