Analytical Study on Fundamental Frequency Contours of Thai Tones using Tone-Geometrical Model

Problem statement: In tonal languages such as Thai, Mandarin and Vietnamese, tone is an important feature of a syllable that must be taken into consideration, since tone is a supra-segmental feature of speech prosody. Modeling tone with high accuracy could enhance the quality of the synthesized speech in a speech synthesis system. This study focuses on a model analysis of fundamental frequency (F0) contours of Thai tones using the tone-geometrical model. Approach: The tone-geometrical model applied in this study is a basic model which is expected to extract the dimensional features of a syllable-length portion of the fundamental frequency contour. Seven selected parameters are extracted from each syllable-length portion of the fundamental frequency contour and then analyzed. Results: In the experiments, 2,500 speech utterances from the TSynC-1 speech database were selected and used as speech materials. The distributions of all seven parameters are presented. The mean and standard deviation values for the five tones are also calculated. The results show that most of the proposed parameters can distinguish the five Thai tones explicitly. Conclusion: From this finding, the proposed parameters of the tone-geometrical model could be further applied in speech processing technologies.


INTRODUCTION
Tone analysis has been conducted for years in a number of speech technology research fields and for many tonal languages. In speech processing areas, including speech recognition, speech synthesis, speech analysis and speech coding, appropriate tone modeling of a portion of the F0 contour contributes to the effectiveness of the implemented systems. Former studies on F0 modeling have been conducted extensively on various speech units and with several techniques, such as at the utterance level (Fujisaki and Ohno, 1998; Fujisaki et al., 1990; Tao et al., 2006; Saito and Sakamoto, 2002; Ni and Hirose, 2006; Li et al., 2004) and at the word and syllable levels (Fujisaki et al., 1990; Fujisaki and Sudo, 1971). In Thai speech, Fujisaki's model has been successfully applied to the modeling of utterances, tones and words (Hiroya and Sumio, 2002; Seresangtakul and Takara, 2002; 2003). In Thai speech synthesis, Chomphan and Kobayashi implemented speaker-dependent and speaker-independent systems (Chomphan and Kobayashi, 2007; 2008), in which the F0 contour was modeled using a statistical approach. Moreover, the speaker-independent system also used Fujisaki's model in its extended modules. In another approach, the tone-geometrical model is a simple, geometrically structured, syllable-level model that has been applied in a speech synthesis system (Chomphan and Kobayashi, 2009). However, it has not been applied thoroughly to model all five Thai tones. Therefore, this study proposes an analysis of F0 modeling of Thai tones using the tone-geometrical model, as a preliminary study for advanced research in speech synthesis and recognition.

Tone-geometrical model:
The F0 contour is treated as a concatenation of sequential portions, each represented by a tone-geometrical model, as depicted in Fig. 1.
Seven selected parameters are calculated based on the following criteria:
Parameter 1: dur = t_final
Parameter 2: F0_init
Parameter 3: delta_F0 = F0_final - F0_init
Parameter 4: F0_range = F0_max - F0_min
Parameter 5: sign_F0_range = sign(delta_F0) * F0_range
Parameter 6: contour_slope = delta_F0 / dur
Parameter 7: sign_contour_slope = sign_F0_range / dur
Parameter 1, dur, denotes the syllable duration, which can be obtained directly from t_final as shown in Fig. 2. Parameter 2, F0_init, is the initial value of a portion of an F0 contour as depicted in Fig. 2. Parameter 3, delta_F0, is the difference between the final frequency and the initial frequency. Parameter 4, F0_range, represents the dynamic range of the portion in Hertz, i.e., the difference between the maximum frequency and the minimum frequency. Parameter 5, sign_F0_range, is F0_range with a sign, where a positive value represents upward movement and a negative value represents downward movement. Parameter 6, contour_slope, is the ratio between delta_F0 and dur. This parameter reflects the gradient magnitude of the portion. Finally, parameter 7, sign_contour_slope, is the ratio between sign_F0_range and dur. This parameter reflects the gradient magnitude of the portion and also the direction of movement. Note that sign_contour_slope is based on the dynamic range of the portion, whereas contour_slope is based on delta_F0.
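The seven definitions can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the names `ToneGeometry`, `tone_geometry` and `frame_shift` are hypothetical, and the sketch assumes that the F0 portion is sampled at a constant frame shift and that t_final is the time of the last frame measured from the first frame of the portion.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ToneGeometry:
    """The seven tone-geometrical parameters of one syllable-length F0 portion."""
    dur: float                 # Parameter 1: t_final, in seconds
    f0_init: float             # Parameter 2: first F0 value, in Hz
    delta_f0: float            # Parameter 3: F0_final - F0_init
    f0_range: float            # Parameter 4: F0_max - F0_min
    sign_f0_range: float       # Parameter 5: sign(delta_F0) * F0_range
    contour_slope: float       # Parameter 6: delta_F0 / dur
    sign_contour_slope: float  # Parameter 7: sign_F0_range / dur


def sign(x: float) -> int:
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def tone_geometry(f0: Sequence[float], frame_shift: float) -> ToneGeometry:
    """Compute the seven parameters from a cleaned F0 portion (values in Hz).

    Assumption: frames are equally spaced by frame_shift seconds, so
    t_final = (number of frames - 1) * frame_shift.
    """
    if len(f0) < 2:
        raise ValueError("at least two F0 frames are needed")
    if frame_shift <= 0:
        raise ValueError("frame_shift must be positive")
    dur = (len(f0) - 1) * frame_shift
    f0_init = f0[0]
    delta_f0 = f0[-1] - f0_init
    f0_range = max(f0) - min(f0)
    # When delta_F0 is exactly zero, sign() is 0 and parameters 5 and 7 vanish.
    sign_f0_range = sign(delta_f0) * f0_range
    return ToneGeometry(
        dur=dur,
        f0_init=f0_init,
        delta_f0=delta_f0,
        f0_range=f0_range,
        sign_f0_range=sign_f0_range,
        contour_slope=delta_f0 / dur,
        sign_contour_slope=sign_f0_range / dur,
    )
```

For a portion that rises and then falls below its starting value, delta_F0 is small and negative while F0_range is large, so parameters 5 and 7 have a much larger magnitude than parameters 3 and 6; this is the difference between the two slope definitions noted above.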
Procedures of parameter analysis:
The following procedure is applied to each utterance from the speech data material:
1. Extracting the F0 values from the raw speech file
2. Extracting the beginning time and ending time of all syllables from the label file and converting them to the corresponding frame numbers
3. Cutting the F0 intervals for all syllables from step 1 by using the frame numbers from step 2
4. Eliminating the intervals with non-sense and zero-value F0s from the output of step 3
5. Calculating the tone-geometrical model parameters of the F0 portions from step 4 by using the definitions given earlier
6. Plotting the distribution of each parameter over its range
7. Calculating the statistical values of the parameters from the distributions in step 6
It should be noted that the F0 portions output by step 4 differ slightly from those output by step 3, as depicted in Fig. 3, since the intervals of non-sense and zero-value F0s are completely eliminated. The term "non-sense F0" refers to an abnormal F0 value which is largely different from the neighboring F0s. In some cases, the non-sense F0s are F0s from the adjacent portion of the F0 contour. These non-sense F0s can deteriorate the geometrical features of the model; therefore, they should be eliminated before calculating the model parameters.
The term "zero-value F0" refers to a non-existing F0, meaning that the F0 cannot be calculated. Such non-existing F0s are usually located in the unvoiced or voiceless regions of speech (Zainal et al., 2009).
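The cutting and elimination steps of the procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the `frame_shift` argument and the threshold `max_rel_dev` are hypothetical, and the test for a non-sense F0 (a relative deviation from the median of the voiced frames in the portion) is a simple stand-in for "largely different from the neighboring F0s", since the exact criterion is not specified here.

```python
from statistics import median
from typing import List, Sequence


def time_to_frame(t: float, frame_shift: float) -> int:
    """Convert a label time in seconds to the nearest frame number."""
    return int(round(t / frame_shift))


def cut_portion(f0: Sequence[float], start: float, end: float,
                frame_shift: float) -> List[float]:
    """Cut the F0 frames of one syllable, given its label times in seconds."""
    first = time_to_frame(start, frame_shift)
    last = time_to_frame(end, frame_shift)
    return list(f0[first:last])


def clean_portion(portion: Sequence[float],
                  max_rel_dev: float = 0.3) -> List[float]:
    """Drop zero-value F0s and values far from the median of the portion.

    Assumptions: unvoiced frames are stored as 0, and an F0 is treated as
    non-sense when it deviates from the median of the voiced frames by more
    than max_rel_dev (a hypothetical threshold) times that median.
    """
    voiced = [v for v in portion if v > 0.0]
    if not voiced:
        return []
    centre = median(voiced)
    return [v for v in voiced if abs(v - centre) <= max_rel_dev * centre]
```

A relative threshold is used so that the same setting can serve speakers with different pitch ranges; a tone with a very wide F0 movement may need a looser value.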
For each of the selected parameters, we analyzed the frequency distribution over its range; the distributions for all five Thai tones are shown comparatively in Fig. 4-10. Figures 4-10 present the distributions of parameters 1-7, respectively, for the five Thai tones and for all tones combined (the "alltone" notation), to show the differences and similarities among the tones. The labels "tone0", "tone1", "tone2", "tone3" and "tone4" denote the middle, low, falling, high and rising tones, respectively.
From all of these frequency distribution graphs, the mean and standard deviation values were subsequently calculated and are shown in Table 1.
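As a rough illustration of how a frequency distribution and its mean and standard deviation can be obtained for one parameter and one tone, consider the following Python sketch. The function names and the bin width are hypothetical, the statistics are computed from the bin centres of the histogram, and the population (not sample) standard deviation is assumed.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def histogram(values: Iterable[float], bin_width: float) -> Dict[float, int]:
    """Count values in equal-width bins, keyed by the centre of each bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {(k + 0.5) * bin_width: n for k, n in sorted(counts.items())}


def moments(hist: Dict[float, int]) -> Tuple[float, float]:
    """Return (mean, standard deviation) estimated from a histogram."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("the histogram is empty")
    mean = sum(centre * n for centre, n in hist.items()) / total
    variance = sum(n * (centre - mean) ** 2 for centre, n in hist.items()) / total
    return mean, math.sqrt(variance)
```

Applying `histogram` separately to the values of each tone, and once to all tones together, gives distributions that can be compared in the way described above. Statistics computed from bin centres differ slightly from those computed on the raw values, and the difference shrinks as the bin width decreases.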

DISCUSSION
From the frequency distribution graphs in Fig. 4-10, most results show that the distributions of the five tones are clearly different. There are a few exceptions; for example, the distributions of parameter 2 for tone0 and tone1 are quite similar (Fig. 5). It should also be noted that some distributions are multimodal, e.g., those in Fig. 8 and 10 (parameters 5 and 7, respectively). All in all, in nearly all of the frequency distribution graphs, the five tones can be distinguished from each other empirically.
Table 1 presents the mean and standard deviation values of all seven parameters for comparison. The parameters of different tones have different mean and standard deviation values. To distinguish one tone from the others, the derived parameters need to be used in combination.
The experimental results suggest that the selected parameters should be further applied in other speech technologies. For example, the parameters are expected to be applied in the tree-based context clustering in Thai speech synthesis (Chomphan and Kobayashi, 2007; Hassini et al., 2009; Jenq et al., 2009; Teymourzadeh et al., 2010) to categorize the speech units into tone groups. The data sharing within each of the speech unit clusters can consequently improve the efficiency of the overall synthesis system.

CONCLUSION
This study proposes an analysis of tone-geometrical model parameters for the five Thai tones. The tone-geometrical model has been applied to model a syllable-length portion of the F0 contour. The middle, low, falling, high and rising tones have been studied. Seven selected parameters of the tone-geometrical model are extracted. The results show that most of the selected parameters can distinguish the five tones explicitly. From this finding, the selected parameters are expected to be applied in speech synthesis systems in the future.

ACKNOWLEDGEMENT
The researcher is grateful to NECTEC for providing the TSynC-1 speech database.