Structural Modeling of Fundamental Frequency Contour for Thai Expressive Speech

Problem statement: Appropriate modeling of the fundamental frequency (F0) contour of speech is a key factor in preserving the quality of speech prosody. One successful approach was developed for Mandarin Chinese, a tonal language. It is based on the assumption that the behavioral characteristics of vocal-fold elongation in vibration can be approximated by those of a simple forced vibrating system. This approach was therefore applied to model Thai expressive speech with a best-fit function. Approach: Structural modeling of the voice F0 contours of Thai expressive speech utterances was carried out using the approximation by a simple forced vibrating system. Modeling the F0 contours of Thai expressive speech is important for speech analysis and supports more engaging and effective speech communication. Our speech database consists of male and female speech, each containing four speech styles: angry, sad, enjoyable and reading. We use 5 sentences for each speech style, and each sentence includes 100 samples. Each speech sample is analyzed to obtain its F0 contour, and a number of structural modeling parameters are then extracted for each contour. The parameters are used to synthesize the F0 contour, and the synthesized contour is compared with that of natural speech by calculating the RMS error. Results: The experimental analysis shows that the RMS error differs across speech styles, which reveals that the structural model responds to each speech style differently. Moreover, the reading style has the smallest error among all styles. Conclusion: The findings support applying the modeling approach to speech synthesis systems with good preservation of speech prosody.


INTRODUCTION
Prosody is an inherently supra-segmental feature of human speech. The fundamental frequency (F0) of voiced speech is the most important of the features known to carry prosodic information. The F0 contour of an utterance conveys the stress, intonation and rhythmic structures, which determine the naturalness and intelligibility of synthetic speech. As a result, appropriate modeling of the F0 contour plays a significant role in speech processing, e.g., speech recognition, speech synthesis, speech analysis and speech coding. A number of modeling techniques in former studies have been applied at various levels of speech units, e.g., the utterance level (Saito and Sakamoto, 2002; Li et al., 2004; Tao et al., 2006) and the word and syllable levels (Fujisaki and Sudo, 1971; Tran et al., 2006). In Thai speech, Fujisaki's model has been successfully applied to the modeling of utterances, tones and words (Hiroya and Sumio, 2002; Seresangtakul and Takara, 2002; 2003). In Thai speech synthesis, statistical modeling of the F0 contour was conducted by Chomphan and Kobayashi in the implementation of both speaker-dependent and speaker-independent systems in 2007-2009 (Chomphan and Kobayashi, 2007; 2008; 2009; Chomphan, 2009). More recently, Fujisaki's model has been applied within a speaker-independent system as extended modules. Moreover, it has also been exploited in the modeling of Thai expressive speech, i.e., sad, happy and angry styles (Chomphan, 2010). This study proposes another approach to F0 modeling of Thai expressive speech using the structural model, which is based on the assumption that the behavioral characteristics of vocal-fold elongation in vibration can be approximated by those of a simple forced vibrating system (Ni and Hirose, 2006). The RMS error is calculated to evaluate the modeling performance for all speech styles, including the angry, sad, enjoyable and reading styles.
This study is a preliminary step toward further research on advanced speech synthesis with various speaking styles for Thai.

MATERIALS AND METHODS
Structural model: The voice F0 contour is modeled on a logarithmic scale, as depicted in Fig. 1. The mathematical model of Ni and Hirose (2006) is applied. It uses a structural control that places a series of normalized F0 targets along the time axis, each specified by its transition time and amplitude. The transitions between targets are approximated by connecting truncated second-order transition functions.
The physical factors that regulate the frequency of vocal-fold vibration are the mass, length and tension of the vibrating structures, all of which are dynamically controlled primarily by the intrinsic and extrinsic muscles of the larynx and secondarily by the sub-glottal pressure (Ni and Hirose, 2006). Fujisaki explained that the logarithmic fundamental frequency varies linearly with vocal-fold elongation x (Fujisaki, 1983); this relation is given by Eq. 1, where a, b and c_0 are constant coefficients (Fujisaki, 1983). The behavioral characteristics of vocal-fold elongation in vibration can be approximated by those of a simple forced vibrating system (Ni and Hirose, 2006). Under this assumption, the behavior of a simple forced vibrating system is characterized by the amplifying coefficient of its vibrating amplitude (Eq. 2), where ω_d and ω_f denote the natural angular frequencies of the driven system and the driving force, respectively, and ζ is the damping ratio, which indicates how tightly the driving force and the driven system are coupled. Replacing the square frequency ratio ω_f²/ω_d² by λ and substituting A(λ, ζ) of Eq. 2 for x in Eq. 1, the logarithmic fundamental frequency can be expressed as Eq. 3, where C is a constant coefficient. Typically, a speaker has an individual vocal range. Let f0t and f0b denote the top and bottom frequencies of the vocal range of a speaker, and let λ_t and λ_b denote the two λ values that are mapped one-to-one to f0t and f0b. The relationship between f0 within the vocal-range frequency interval and its corresponding λ is given by Eq. 4. Since f0t and f0b are the top and bottom frequencies of the vocal range, λ_t and λ_b are determined regardless of ζ.
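The chain of relations described in this paragraph can be written out as follows. This is a sketch under stated assumptions, not a transcription of the cited equations: α and β are hypothetical lumped constants standing in for the combination of a, b and c_0, and the amplifying coefficient is assumed to take the textbook form of the amplitude response of a damped, driven oscillator, with λ taken as the squared ratio of the driving frequency to the natural frequency.

```latex
\begin{align}
\ln f_0 &= \alpha x + \beta
  && \text{(log-frequency linear in elongation } x\text{)} \\
A(\lambda,\zeta) &= \frac{1}{\sqrt{(1-\lambda)^2 + 4\zeta^2\lambda}},
  \qquad \lambda = \frac{\omega_f^2}{\omega_d^2}
  && \text{(assumed textbook form)} \\
\ln f_0 &= \alpha\, A(\lambda,\zeta) + C
  && \text{(substituting } A \text{ for } x\text{)} \\
\frac{\ln f_0 - \ln f_{0b}}{\ln f_{0t} - \ln f_{0b}}
  &= \frac{A(\lambda,\zeta) - A(\lambda_b,\zeta)}
          {A(\lambda_t,\zeta) - A(\lambda_b,\zeta)}
  && \text{(eliminating } \alpha \text{ and } C\text{)}
\end{align}
```

The last line follows from the third by evaluating it at (λ_t, f0t) and (λ_b, f0b) and eliminating the two constants, so it holds for any choice of A. Only the second line depends on the assumed oscillator form.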
In practice, f0 and λ can be converted into each other through f0 = T_f0(λ, ζ) and λ = T_λ(f0, ζ). T_f0(λ, ζ) is derived from Eq. 4 and is given by Eq. 5. T_λ(f0, ζ) is obtained by searching λ upward from 1 in small increments (e.g., 0.0001), given λ_b > λ_t, until λ satisfies the conditions of Eq. 6 and 7.
Allocation of pitch targets: When the structural model is applied, the pitch targets on an F0 contour are allocated first, and the parameters of the structural model are then estimated. For example, Fig. 2 shows a sparser specification of the F0 contours shown in Fig. 1: the 'o' signs indicate the tonal peaks, the 'x' signs indicate the tonal valleys, and the square sign indicates a neutral target that resets an intonational phrase.
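The two mappings T_f0 and T_λ introduced before the pitch-target allocation can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. The function names `amp`, `t_f0` and `t_lambda` are hypothetical. The amplifying coefficient is assumed to have the textbook form for a damped, driven oscillator. The stopping rule of the search is also an assumption (the first λ whose mapped f0 no longer exceeds the requested value), chosen because f0 falls monotonically as λ grows from λ_t = 1 to λ_b = 2 under the assumed form.

```python
import math

def amp(lam, zeta):
    """Assumed amplifying coefficient of a damped, driven oscillator."""
    return 1.0 / math.sqrt((1.0 - lam) ** 2 + 4.0 * zeta ** 2 * lam)

def t_f0(lam, zeta, f0b, f0t, lam_t=1.0, lam_b=2.0):
    """Map lambda to f0 (Hz) by normalizing A between the range ends."""
    a_b = amp(lam_b, zeta)
    a_t = amp(lam_t, zeta)
    ratio = (amp(lam, zeta) - a_b) / (a_t - a_b)
    return f0b * math.exp(ratio * math.log(f0t / f0b))

def t_lambda(f0, zeta, f0b, f0t, lam_t=1.0, lam_b=2.0, step=1e-4):
    """Map f0 (Hz) to lambda by stepping upward from lam_t.

    Assumed stopping rule: return the first lambda whose mapped f0 has
    dropped to the requested value or below, or lam_b if none does.
    """
    n_steps = int(round((lam_b - lam_t) / step))
    for k in range(n_steps + 1):
        lam = lam_t + k * step
        if t_f0(lam, zeta, f0b, f0t, lam_t, lam_b) <= f0:
            return lam
    return lam_b

# Round trip with the range and damping ratio quoted in the text;
# the 245 Hz query is a made-up value for illustration.
lam = t_lambda(245.0, 0.156, 120.0, 420.0)
print(lam, t_f0(lam, 0.156, 120.0, 420.0))
```

A step search is slow compared with bisection, but it mirrors the procedure described in the text and needs no assumptions beyond monotonicity on [λ_t, λ_b].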
Let t_i and f0,i denote the timing and F0 value of the i-th target, respectively. Table 1 lists the first eight target points (t_i, f0,i) and the corresponding λ_i = T_λ(f0,i, ζ_0), given λ_t = 1, λ_b = 2, f0b = 120 Hz, f0t = 420 Hz and ζ_0 = 0.156. The i-th local F0 movement, i = 0, ..., 15, is defined as the span extending from the i-th target (t_i, f0,i) to the next. If f0,i ≤ f0,i+1, the local F0 movement is rising; otherwise, it is falling. The measured F0 contours are then approximated by connecting the rising and falling transitions through these target points. For an F0 falling movement, say i = 0, first compute λ(t) for t_0 ≤ t ≤ t_1 using Eq. 8 with the following parameters (based on Table 1): λ_p = λ_0 = 1.28; ∆λ = ∆λ_0/0.95 and ∆t = ∆t_0/0.95, where ∆λ_0 = λ_1 − λ_0 = 0.12 and ∆t_0 = t_1 − t_0 = 0.14. Second, synthesize the contour using f0(t) = T_f0(λ(t), ζ_0) of Eq. 5. The thick line between the 0th and 1st targets in Fig. 2 indicates the re-synthesized F0 contour. For an F0 rising movement, say i = 2, first compute λ(t) for t_2 ≤ t ≤ t_3 using Eq. 8 with the following parameters: λ_p = 2 − λ_2 = 0.58; ∆λ = ∆λ_2/0.95, where ∆λ_2 = (2 − λ_3) − (2 − λ_2) = 0.08; and ∆t = ∆t_2/0.95, where ∆t_2 = t_3 − t_2 = 0.15. Then synthesize the contour using f0(t) = T_f0(2 − λ(t), ζ_0) of Eq. 5. Note that the relation A(λ, ζ) = A(2 − λ, ζ) is applied in this computation. In Fig. 2, the thick line between the 2nd and 3rd targets indicates the re-synthesized F0 contour.
A mathematical model: Let F0(t) represent an F0 contour as a function of time t in a vocal range [f0b, f0t] (Ni and Hirose, 2006). Let Λ(t) denote a sequence of virtual tone graphs in λ-time space that specifies the underlying lexical tone structures, and let a latent scale ζ(t) characterize the intonation components.
Thus, the F0 contour on the logarithmic scale of fundamental frequency is expressed as a scale transformation from Λ(t) to F0(t), corresponding to the syllabic tones fitting themselves to the sentence intonation within the vocal range (Eq. 9 and 10). Equations 9 and 10 jointly give a structural formulation of the control process that couples the syllabic tones and the sentence intonation to form the final sentence melody. Equation 9 states that the F0 contour F0(t) is a transformation of a sequence of virtual tone graphs Λ(t) on a latent scale ζ(t). Λ(t) is expressed as a concatenation of n parametric bell-shaped patterns lined up in series on the time axis (Ni and Hirose, 2006).
Figure 3 shows an example of the re-synthesis of an F0 contour using the structural model. Figure 3a shows the F0 contour extracted from natural speech, while Fig. 3b shows the corresponding values of λ for three different fixed damping ratios ζ (0.156, 0.02 and 0.9). Figure 3c shows the re-synthesized F0 contours for the three damping ratios in Fig. 3b, while Fig. 3d compares the F0 contour extracted from natural speech with the re-synthesized F0 contour for a limited number of samples.
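The parameter rules used for the falling movement (i = 0) and the rising movement (i = 2) in the pitch-target example can be collected into a small helper. This is a minimal sketch, not the authors' implementation. The function name `movement_params` and its return layout are hypothetical. The target values λ_0 = 1.28, λ_1 = 1.40, λ_2 = 1.42 and λ_3 = 1.34 are inferred from the differences quoted in the text rather than read from Table 1. The rising test is expressed in λ, on the assumption that f0 decreases as λ grows over [λ_t, λ_b]. The transition function of Eq. 8 itself is not reproduced here.

```python
def movement_params(lam_start, lam_end, duration, cut=0.95):
    """Return (direction, lam_p, d_lam, d_t) for one local F0 movement.

    lam_start, lam_end: lambda values of the i-th and (i+1)-th targets.
    duration: t_{i+1} - t_i in seconds.
    cut: truncation level used to rescale the steps (0.95 in the text).
    """
    if lam_end <= lam_start:
        # Lambda falls, so F0 rises: work on the mirrored axis 2 - lambda.
        direction = "rising"
        lam_p = 2.0 - lam_start
        d_lam0 = (2.0 - lam_end) - (2.0 - lam_start)
    else:
        direction = "falling"
        lam_p = lam_start
        d_lam0 = lam_end - lam_start
    return direction, lam_p, d_lam0 / cut, duration / cut

# Values inferred from the worked example (assumption, not Table 1 itself).
falling = movement_params(1.28, 1.40, 0.14)   # movement i = 0
rising = movement_params(1.42, 1.34, 0.15)    # movement i = 2
print(falling)
print(rising)
```

For a rising movement the returned λ_p lives on the mirrored axis, so the synthesized contour would be obtained from T_f0(2 − λ(t), ζ_0) rather than T_f0(λ(t), ζ_0), as in the worked example.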

An experimental design: The flow chart in Fig. 4 shows the core process of our experiment. First, the speech corpus was constructed. It contains male and female speech, each with four speech styles: happy, sad, angry and reading. Each style consists of 5 sentences with 100 utterance samples per sentence, so the corpus contains 4,000 utterances (2 genders × 4 styles × 5 sentences × 100 samples). For each utterance, the F0 values are calculated and the pitch targets are allocated using local minimum/maximum criteria. Between any two adjacent pitch targets, which serve as fixed points, an exponential function is fitted to minimize the difference between the fitted function and the F0 contour. The parameters of all the fitted functions along the utterance are used as its representation. The F0 contour is then resynthesized from these parameters, and the RMS error between the natural F0 contour and the resynthesized F0 contour is calculated. Finally, we analyzed the summarized data from the previous stages.
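Two steps of this pipeline, the allocation of pitch targets at local extrema and the RMS error between natural and resynthesized contours, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' code. The function names `allocate_targets` and `rms_error` are hypothetical, the contour values are made up, the error is assumed to be measured in Hz over all frames, and straight-line interpolation between targets stands in for the fitted exponential functions, whose exact form is not given in the text.

```python
import math

def allocate_targets(f0):
    """Return indices of pitch targets: both endpoints plus every strict
    local maximum or minimum of the sampled F0 contour."""
    targets = [0]
    for i in range(1, len(f0) - 1):
        is_peak = f0[i - 1] < f0[i] > f0[i + 1]
        is_valley = f0[i - 1] > f0[i] < f0[i + 1]
        if is_peak or is_valley:
            targets.append(i)
    if len(f0) > 1:
        targets.append(len(f0) - 1)
    return targets

def rms_error(natural, resynth):
    """Root-mean-square difference between two equal-length contours."""
    if len(natural) != len(resynth) or not natural:
        raise ValueError("contours must be non-empty and equal in length")
    total = sum((a - b) ** 2 for a, b in zip(natural, resynth))
    return math.sqrt(total / len(natural))

# Toy contour in Hz (made-up values for illustration only).
natural = [200.0, 220.0, 240.0, 230.0, 210.0, 215.0, 225.0]
idx = allocate_targets(natural)

# Crude stand-in for resynthesis: straight lines between targets.
resynth = list(natural)
for a, b in zip(idx, idx[1:]):
    for k in range(a, b + 1):
        w = (k - a) / (b - a)
        resynth[k] = (1 - w) * natural[a] + w * natural[b]

print(idx, rms_error(natural, resynth))
```

Real F0 tracks contain unvoiced gaps and frame-level jitter, so a practical version would smooth the contour and skip unvoiced frames before searching for extrema.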

RESULTS
From the RMS error calculation process, the experimental data are summarized in five graphs (Fig. 5-9), which show the averaged RMS errors for the five different sentences. Each graph represents one sentence and contains the four speech styles: happy, sad, angry and reading. Moreover, each graph contains two lines, one for male speech and one for female speech.
[Figure caption: Averaged RMS error for the sentence meaning "I will go back home." in English]

DISCUSSION
From the experimental results in Fig. 5-9, we found that the averaged RMS error of the angry speech is the highest, while that of the reading speech is the lowest. The averaged RMS errors of the happy and sad speech lie in between. All five sentences in Fig. 5-9 show the same pattern. Regarding the differences between genders, we found that the averaged RMS error of female speech is higher than that of male speech, and all sentences confirm this observation.

CONCLUSION
This study proposes an approach to the structural modeling of voice F0 contours of Thai expressive speech utterances, using the approximation by a simple forced vibrating system. Four different speech styles have been considered. The experimental analysis shows that the RMS error differs across speech styles. The reading style is modeled with the best fit, while the angry style is modeled with the highest RMS error. It can be concluded that the structural model responds to each speech style differently. In further study, this modeling approach is expected to be applied to speech synthesis systems to preserve speech prosody across various speaking styles.