Analytical Study on Fundamental Frequency Contours of Thai Expressive Speech Using Fujisaki’s Model

Problem statement: In spontaneous speech communication, prosody is an important factor that must be taken into account, since it affects not only the naturalness but also the intelligibility of speech. A number of Thai speech synthesis systems have been developed over the years. However, the generation of expressive speech with various speaking styles has not yet been accomplished. To generate expressive speech, the fundamental frequency (F0) contour must be modeled accurately so that the speech prosody is preserved. Approach: This study therefore proposes an analysis of model parameters for Thai speech prosody with three speaking styles and two genders, as preliminary work for speech synthesis. Fujisaki's model, a powerful tool for modeling the F0 contour, has been adopted, and the speaking styles of happiness, sadness and reading have been considered. Seven parameters are derived from Fujisaki's model. The first is the baseline frequency, which is the lowest level of the F0 contour. The second and third are the numbers of phrase commands and tone commands, which reflect how frequently surges occur in the utterance at the global and local levels, respectively. The fourth and fifth are the phrase command and tone command durations, which reflect the speed of speaking and the length of a syllable, respectively. The sixth and seventh are the amplitudes of the phrase command and the tone command, which reflect the energy of the global speech and the energy of the local syllable, respectively. Results: In the experiments, each speaking style includes 200 samples of one sentence for each gender, so the speech database contains 1200 utterances in total. The results show that most of the proposed parameters can clearly distinguish the three speaking styles.
Conclusion: These findings provide strong evidence for applying the successful parameters in speech synthesis systems or other speech processing technologies.


INTRODUCTION
In speech processing, including speech recognition, speech synthesis, speech analysis and speech coding, appropriate modeling of the F0 contour contributes to the effectiveness of the implemented systems. F0 modeling has been studied extensively for various speech units and with several techniques, for example at the utterance level (Fujisaki and Ohno, 1998; Fujisaki et al., 1990; Tao et al., 2006; Saito et al., 2002; Ni and Hirose, 2006; Li et al., 2004) and at the word and syllable levels (Fujisaki et al., 1990; Fujisaki and Sudo, 1971; Tran et al., 2006). For Thai speech, Fujisaki's model has been successfully applied to the modeling of utterances, tones and words (Hiroya and Sumio, 2002; Seresangtakul and Takara, 2003; Seresangtakul and Takara, 2002). In Thai speech synthesis, Chomphan and Kobayashi implemented speaker-dependent and speaker-independent systems in 2007-2009 (Chomphan and Kobayashi, 2007a; 2007b; Chomphan and Kobayashi, 2008; Chomphan and Kobayashi, 2009), in which the F0 contour was modeled using a statistical approach. Moreover, the speaker-independent system also used Fujisaki's model in its extended modules. However, expressive speech such as sad, happy and angry styles has not been considered. This study therefore proposes an analysis of F0 modeling of Thai expressive speech using Fujisaki's model, as a preliminary study for advanced research in speech synthesis and recognition, such as the work on expressive speech synthesis for Japanese (Tachibana et al., 2005; Tachibana, 2006).

Fujisaki's model:
The F0 contour is treated as a linear superposition of a global phrase component and local accent components on a logarithmic scale, as depicted in Fig. 1. The phrase command produces a baseline component, while the accent command produces the accent component of an F0 contour. We use two parameters of Fujisaki's model as our phrase-intonation features, namely the baseline value of F0 and the magnitude of the phrase command, which complementarily reflect the global level of voicing frequency. Mathematically, the F0 contour of an utterance generated from an extension of Fujisaki's model for tonal languages is expressed as follows (Seresangtakul and Takara, 2003):

$$\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i}) + \sum_{j=1}^{J} \sum_{k=1}^{K(j)} A_{t,jk} \left[ G_{t,jk}(t - T_{1jk}) - G_{t,jk}(t - T_{2jk}) \right]$$

$$G_{pi}(t) = \begin{cases} \alpha_i^{2}\, t\, e^{-\alpha_i t}, & t \ge 0 \\ 0, & t < 0 \end{cases}$$

$$G_{t,jk}(t) = \begin{cases} 1 - (1 + \beta_{jk} t)\, e^{-\beta_{jk} t}, & t \ge 0 \\ 0, & t < 0 \end{cases}$$

Where:
$G_{pi}(t)$ = The impulse-response function of the phrase-control mechanism
$G_{t,jk}(t)$ = The step-response function of the tone-control mechanism

In these equations, $F_b$ is the smallest F0 value in the F0 contour of interest, and $A_{pi}$ and $A_{t,jk}$ are the amplitudes of the $i$-th phrase command and of the $k$-th component of the $j$-th tone command, respectively. $T_{0i}$ is the timing of the $i$-th phrase command, and $T_{1jk}$ and $T_{2jk}$ are the onset and offset of the $k$-th component of the $j$-th tone command. $\alpha_i$ and $\beta_{jk}$ are time-constant parameters, while $I$, $J$ and $K(j)$ are the number of phrases, the number of tones and the number of components of the $j$-th tone contained in the utterance, respectively.
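As a minimal sketch of how such a contour can be generated, the Python code below evaluates the superposition in the ln F0 domain. It assumes the commonly used response functions of the model (a critically damped second-order impulse response for phrase commands and the corresponding step response for tone commands). The function names and all numeric values are hypothetical and chosen only for illustration; they are not parameters measured in this study.

```python
import math


def phrase_response(t, alpha):
    """Impulse response of the phrase-control mechanism (zero before the command)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def tone_response(t, beta):
    """Step response of the tone-control mechanism (zero before the command)."""
    if t < 0:
        return 0.0
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t)


def ln_f0(t, fb, phrase_cmds, tone_cmds):
    """ln F0(t) as baseline plus phrase components plus tone components.

    phrase_cmds: list of (A_p, T_0, alpha)
    tone_cmds:   list of (A_t, T_1, T_2, beta), one entry per tone-command component
    """
    value = math.log(fb)
    for a_p, t0, alpha in phrase_cmds:
        value += a_p * phrase_response(t - t0, alpha)
    for a_t, t1, t2, beta in tone_cmds:
        value += a_t * (tone_response(t - t1, beta) - tone_response(t - t2, beta))
    return value


def f0_contour(times, fb, phrase_cmds, tone_cmds):
    """F0 in Hz at each time point."""
    return [math.exp(ln_f0(t, fb, phrase_cmds, tone_cmds)) for t in times]


# Hypothetical example: one phrase command and two tone-command components.
fb = 120.0
phrase_cmds = [(0.5, 0.0, 2.0)]
tone_cmds = [(0.3, 0.2, 0.5, 20.0), (-0.2, 0.6, 0.9, 20.0)]
times = [i * 0.01 for i in range(150)]
contour = f0_contour(times, fb, phrase_cmds, tone_cmds)
```

With no commands the contour stays at the baseline frequency. Each phrase command adds a slowly decaying rise, and each tone-command component adds a local rise or fall between its onset and offset, depending on the sign of its amplitude.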
To find the optimal parameters, the mean squared error in the ln F0(t) domain is minimized through a hill-climbing search in the space of model parameters (Seresangtakul and Takara, 2003). Using this model, the parameters are extracted from our speech database utterance by utterance. Subsequently, the derived parameters are computed and analyzed.
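The optimization step can be illustrated with a simple coordinate-wise hill-climbing search. The sketch below is not the extraction tool used in this study. It fits only three parameters (ln of the baseline frequency, one phrase amplitude and one tone amplitude) of a reduced model to a synthetic contour. The fixed timings, the time constants and the step-halving schedule are all assumptions made for illustration.

```python
import math


def ln_f0(t, params):
    """Reduced model: one phrase command and one tone command."""
    ln_fb, a_p, a_t = params
    alpha, beta = 2.0, 20.0        # hypothetical fixed time constants
    t0, t1, t2 = 0.0, 0.2, 0.5     # hypothetical fixed command timings

    def g_p(x):
        return alpha * alpha * x * math.exp(-alpha * x) if x >= 0 else 0.0

    def g_t(x):
        return 1.0 - (1.0 + beta * x) * math.exp(-beta * x) if x >= 0 else 0.0

    return ln_fb + a_p * g_p(t - t0) + a_t * (g_t(t - t1) - g_t(t - t2))


def mse(params, times, observed_ln_f0):
    """Mean squared error between model and observation in the ln F0 domain."""
    errors = [(ln_f0(t, params) - y) ** 2 for t, y in zip(times, observed_ln_f0)]
    return sum(errors) / len(errors)


def hill_climb(initial, times, observed, step=0.1, min_step=1e-5):
    """Move one parameter at a time by +/- step; keep only strict improvements.

    When no single move improves the error, the step is halved.
    """
    current = list(initial)
    best = mse(current, times, observed)
    while step > min_step:
        improved = False
        for i in range(len(current)):
            for delta in (step, -step):
                candidate = list(current)
                candidate[i] += delta
                err = mse(candidate, times, observed)
                if err < best:
                    current, best, improved = candidate, err, True
        if not improved:
            step /= 2.0
    return current, best


# Synthetic target generated from known (hypothetical) parameters.
times = [i * 0.02 for i in range(50)]
true_params = [math.log(120.0), 0.5, 0.3]
observed = [ln_f0(t, true_params) for t in times]

start = [math.log(100.0), 0.0, 0.0]
fitted, error = hill_climb(start, times, observed)
```

Because this reduced model is linear in the three fitted parameters, the error surface has a single minimum and the search recovers values close to the ones used to generate the target. A full extraction would also have to search over command timings and time constants, where local minima are possible.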
Derived parameters:
From the conventional parameters, we proposed seven derived parameters which reflect the geometrical appearance of the F0 contour of an utterance as follows:
• Baseline frequency
• Numbers of phrase commands
• Numbers of tone commands
• Phrase command duration
• Tone command duration
• Amplitude of phrase command
• Amplitude of tone command
All of them have been extracted for three expressive speech styles of happiness, sadness and reading.
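One possible way to collect the seven derived parameters from the extracted commands of a single utterance is sketched below. It rests on two assumptions. The tone command duration is taken as the offset minus the onset of a tone command. The phrase command duration is taken as the interval from one phrase command to the next, or to the end of the utterance for the last one. The function name, the data layout and the numeric values are hypothetical.

```python
def derived_parameters(fb, phrase_cmds, tone_cmds, utterance_end):
    """Collect the seven derived parameters for one utterance.

    fb:            baseline frequency in Hz
    phrase_cmds:   list of (A_p, T_0), sorted by T_0
    tone_cmds:     list of (A_t, T_1, T_2)
    utterance_end: end time of the utterance in seconds
    """
    onsets = [t0 for _, t0 in phrase_cmds]
    # Assumption: a phrase command lasts until the next one (or the utterance end).
    boundaries = onsets[1:] + [utterance_end]
    phrase_durations = [end - start for start, end in zip(onsets, boundaries)]
    # Assumption: a tone command lasts from its onset T_1 to its offset T_2.
    tone_durations = [t2 - t1 for _, t1, t2 in tone_cmds]
    return {
        "baseline_frequency": fb,
        "num_phrase_commands": len(phrase_cmds),
        "num_tone_commands": len(tone_cmds),
        "phrase_command_durations": phrase_durations,
        "tone_command_durations": tone_durations,
        "phrase_command_amplitudes": [a for a, _ in phrase_cmds],
        "tone_command_amplitudes": [a for a, _, _ in tone_cmds],
    }


# Hypothetical commands for one utterance (not measured data).
params = derived_parameters(
    fb=180.0,
    phrase_cmds=[(0.4, 0.0), (0.2, 0.8)],
    tone_cmds=[(0.3, 0.1, 0.3), (-0.2, 0.4, 0.7), (0.25, 0.9, 1.2)],
    utterance_end=1.5,
)
```

Durations and amplitudes are returned per command rather than averaged, so that their values can be pooled over all utterances of a speaking style before the distributions are examined.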

RESULTS
In our speech database, we use the sentence "kʰid tʰɯŋ tɕaŋ lɤːj" in IPA (meaning "Think of you so much" in English) for both male and female speech. This sentence was recorded in three expressive speech styles: happiness, sadness and reading. Each style contains 200 utterances, giving 600 utterances for each gender. The parameter extraction tools used in (Mixdorff and Fujisaki, 1997) are applied in this study.
For each derived parameter, we analyzed the frequency distribution over its range, and the distributions of the three expressive speech styles are plotted in one graph to show the differences and similarities among the styles. The first seven graphs are for female speech (Fig. 2-8), while the next seven are for male speech (Fig. 9-15).
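A frequency distribution of this kind can be computed by counting values in equal-width bins over a range shared by the three styles, so that the resulting curves are directly comparable. The sketch below assumes equal-width bins and uses hypothetical baseline-frequency values; the bin count and the function name are illustrative choices, not those used for the figures.

```python
def frequency_distribution(values, low, high, num_bins):
    """Count how many values fall into each of num_bins equal-width bins on [low, high]."""
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        if low <= v <= high:
            # The upper edge of the range is assigned to the last bin.
            index = min(int((v - low) / width), num_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical baseline-frequency values (Hz) for one gender; not measured data.
samples = {
    "happiness": [210.0, 225.0, 232.0, 240.0, 228.0],
    "sadness": [170.0, 176.0, 181.0, 168.0, 174.0],
    "reading": [190.0, 195.0, 201.0, 188.0, 197.0],
}

# Use one common range so that the three distributions share the same bins.
all_values = [v for values in samples.values() for v in values]
low, high = min(all_values), max(all_values)
distributions = {
    style: frequency_distribution(values, low, high, num_bins=6)
    for style, values in samples.items()
}
```

Styles whose counts concentrate in different bins are well separated by the parameter, while styles that occupy the same bins are not.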
From these frequency distributions, the first and second statistical moments (mean and standard deviation) were then calculated and are shown in the comparative bar charts (Fig. 16-22). These bar charts also reveal some differences between male and female speech.
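The summary statistics behind such bar charts can be sketched as follows. The example groups hypothetical tone-command durations by gender and speaking style and computes the mean and standard deviation of each group. The use of the population standard deviation is an assumption, since the text does not state which estimator was applied.

```python
import statistics


def summarize(values):
    """Return (mean, population standard deviation) of a list of values."""
    return statistics.mean(values), statistics.pstdev(values)


# Hypothetical tone-command durations in seconds; not measured data.
data = {
    ("female", "happiness"): [0.20, 0.22, 0.18, 0.24],
    ("female", "sadness"): [0.30, 0.34, 0.28, 0.32],
    ("male", "happiness"): [0.19, 0.21, 0.23, 0.17],
    ("male", "sadness"): [0.29, 0.33, 0.31, 0.27],
}

summary = {key: summarize(values) for key, values in data.items()}

for (gender, style), (mean, std) in sorted(summary.items()):
    print(f"{gender:6s} {style:10s} mean={mean:.3f} std={std:.3f}")
```

Replacing `statistics.pstdev` with `statistics.stdev` gives the sample standard deviation instead, which is slightly larger for small groups.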

DISCUSSION
The frequency distribution graphs of male and female speech in Fig. 2-15 show that, in most cases, the three distributions of the speaking styles are clearly different. In only a few cases is the distribution of one speaking style similar to that of another, e.g., the sad and reading styles for the amplitude of tone command in Fig. 8. Some distributions are multimodal, e.g., in Fig. 5 and 8 (the phrase command duration). All in all, in nearly all of the frequency distribution graphs, the three distributions of the speaking styles can be distinguished from each other empirically.
The statistical bar charts in Fig. 16-22 compare the mean and standard deviation values of all seven parameters between male and female speech. For the baseline frequency, the number of phrase commands and the phrase command duration (Fig. 16, 17 and 19), the mean values of male speech are lower than those of female speech for all speaking styles. For the number of tone commands and the amplitude of phrase command (Fig. 18 and 21), the mean values of male speech are higher than those of female speech for all speaking styles. For the phrase command duration (Fig. 19), all speaking styles of male speech have the same level of mean values, whereas the other parameters have different levels of mean values for different speaking styles. To distinguish one speaking style from the others, the derived parameters therefore need to be used in combination.
The experimental results provide strong evidence for applying the derived parameters in speech synthesis systems or other speech processing technologies. For example, the parameters are expected to be applicable to the tree-based context clustering in Thai speech synthesis (Chomphan and Kobayashi, 2007a) to categorize the speech units into groups. Data sharing within each speech unit cluster can consequently improve the efficiency of the overall speech synthesis system.

CONCLUSION
This study proposes an analysis of model parameters for Thai speech prosody with three speaking styles and two genders. Fujisaki's model has been applied to model the F0 contour, and the speaking styles of happiness, sadness and reading have been studied. Seven derived parameters are extracted from Fujisaki's model. The results show that most of the proposed parameters can clearly distinguish the three speaking styles. Based on this finding, the parameters are expected to be applied in speech synthesis systems in the future.