Modeling of Fundamental Frequency Contour of Thai Expressive Speech using Fujisaki’s Model and Structural Model

Problem statement: In spontaneous speech communication, prosody is an important factor that must be taken into account, since it affects not only the naturalness but also the intelligibility of speech. A number of systems for the synthesis of Thai expressive speech have been developed over the years. However, expressive speech with various speaking styles has not yet been achieved. To generate expressive speech, the fundamental frequency (F0) contours must be modeled accurately so that the quality of the speech prosody is preserved. Approach: This study presents a comparison of two successful F0 models. One approach is based on Fujisaki's model, which has been applied to many tonal and non-tonal languages. The other is based on the structural model, which has been developed primarily for Mandarin Chinese. It is based on the assumption that the behavioral characteristics of vocal-fold elongation in vibration can be approximated by those of a simple forced vibrating system. This approach has therefore been applied to model Thai expressive speech with a best-fit function. Our speech database consists of male and female speech, each containing four speech styles: angry, sad, enjoyable and reading. Five sentences are used for each speech style and each sentence includes 100 samples. Each speech sample is analyzed to obtain an F0 contour, and the parameters of Fujisaki's model and of the structural model are then extracted from each contour. Thereafter, the parameters are used to synthesize the F0 contour, and the synthesized contour is compared with that of natural speech by calculating the RMS error. Results: The experimental analysis shows that the RMS error differs from one speech style to another for both models. It also reveals that the RMS error of Fujisaki's model is higher than that of the structural model for all speech styles. In other words, the structural model gives a better fit to the F0 contour of expressive speech than Fujisaki's model. Conclusion: The findings provide evidence that the structural model is more appropriate than Fujisaki's model for modeling the four speech styles, namely the angry, sad, enjoyable and reading styles.


INTRODUCTION
The fundamental frequency (F0) of voiced speech is the most important of the features known to carry prosodic information, which is an inherently supra-segmental property of human speech. The F0 contour of an utterance conveys the stress, intonation and rhythmic structures that determine the naturalness and intelligibility of synthetic speech. As a result, appropriate modeling of the F0 contour plays a significant role in speech processing, e.g., speech recognition, speech synthesis, speech analysis and speech coding. In earlier studies, modeling techniques have been applied at various levels of speech units, e.g., the utterance level (Saito and Sakamoto, 2002; Li et al., 2004; Tao et al., 2006) and the word and syllable levels (Fujisaki and Sudo, 1971; Tran et al., 2006). For Thai speech, Fujisaki's model has been successfully applied to the modeling of utterances, tones and words (Hiroya and Sumio, 2002; Seresangtakul and Takara, 2002; 2003). In Thai speech synthesis, statistical modeling of the F0 contour has been conducted by Chomphan and Kobayashi in the implementation of both speaker-dependent and speaker-independent systems (Chomphan and Kobayashi, 2007; 2008). More recently, Fujisaki's model has been applied within a speaker-independent system as extended modules. It has also been exploited in the modeling of Thai expressive speech, i.e., sad, happy and angry styles (Chomphan and Kobayashi, 2008). Another study used a structural model, which is based on the assumption that the behavioral characteristics of vocal-fold elongation in vibration can be approximated by those of a simple forced vibrating system (Ni and Hirose, 2006; Chomphan and Kobayashi, 2009). In the present work, the RMS error is calculated to evaluate the modeling performance of both models for all speech styles, namely the angry, sad, enjoyable and reading styles.
This study mainly aims at comparing the Fujisaki's model and the structural model.

Fujisaki's model:
The F0 contour is treated as a linear superposition of a global phrase component and local accent components on a logarithmic scale, as depicted in Fig. 1 (Chomphan and Kobayashi, 2008). Typically, the phrase command produces a baseline component, while the accent command produces the accent component of an F0 contour. Mathematically, the F0 contour of an utterance generated from an extension of Fujisaki's model for tonal languages has the following expressions (Seresangtakul and Takara, 2003):

$$\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i}) + \sum_{j=1}^{J} \sum_{k=1}^{K(j)} A_{t,jk} \left[ G_{t,jk}(t - T_{1jk}) - G_{t,jk}(t - T_{2jk}) \right] \tag{1}$$

$$G_{pi}(t) = \begin{cases} \alpha_i^2\, t\, e^{-\alpha_i t}, & t \ge 0 \\ 0, & t < 0 \end{cases} \tag{2}$$

$$G_{t,jk}(t) = \begin{cases} 1 - (1 + \beta_{jk} t)\, e^{-\beta_{jk} t}, & t \ge 0 \\ 0, & t < 0 \end{cases} \tag{3}$$

Where:
$G_{pi}(t)$ = the impulse-response function of the phrase-control mechanism
$G_{t,jk}(t)$ = the step-response function of the tone-control mechanism

In these three equations, $F_b$ is the smallest F0 value in the F0 contour of interest, and $A_{pi}$ and $A_{t,jk}$ are the amplitudes of the i-th phrase command and of the k-th component of the j-th tone command, respectively. $T_{0i}$ is the timing of the i-th phrase command, and $T_{1jk}$ and $T_{2jk}$ are the onset and offset of the k-th component of the j-th tone command. $\alpha_i$ and $\beta_{jk}$ are time-constant parameters, while I, J and K(j) are the numbers of phrases, tones and components of the j-th tone contained in the utterance.
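As an illustration of the superposition described above, the following minimal Python sketch synthesizes a contour from one phrase command and two tone-command components. It assumes the commonly used forms of the two response functions (a critically damped impulse response for phrases and a step response for tones); the command values in `PHRASES` and `TONES` and all function names are hypothetical and are not taken from the speech database.

```python
import math

def phrase_response(t, alpha):
    """Impulse response of the phrase-control mechanism (zero for t < 0)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0

def tone_response(t, beta):
    """Step response of the tone-control mechanism (zero for t < 0)."""
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t) if t >= 0.0 else 0.0

def fujisaki_log_f0(t, fb, phrases, tones):
    """Return ln F0(t) as baseline + phrase components + tone components.

    phrases: list of (A_p, T_0, alpha)
    tones:   list of (A_t, T_1, T_2, beta), one entry per tone-command component
    """
    value = math.log(fb)
    for a_p, t0, alpha in phrases:
        value += a_p * phrase_response(t - t0, alpha)
    for a_t, t1, t2, beta in tones:
        value += a_t * (tone_response(t - t1, beta) - tone_response(t - t2, beta))
    return value

# Hypothetical command values, chosen only for illustration.
FB = 100.0                                   # baseline frequency in Hz
PHRASES = [(0.5, 0.0, 2.0)]                  # (amplitude, onset time, alpha)
TONES = [(0.3, 0.20, 0.45, 20.0),            # (amplitude, onset, offset, beta)
         (-0.2, 0.50, 0.80, 20.0)]

def synthesize(duration=1.0, step=0.01):
    """Sample the F0 contour in Hz every `step` seconds over `duration` seconds."""
    n = int(round(duration / step)) + 1
    return [math.exp(fujisaki_log_f0(i * step, FB, PHRASES, TONES)) for i in range(n)]

contour = synthesize()
```

Because every command contributes additively in the ln F0 domain, a negative tone amplitude simply lowers the contour below the phrase component, which is how falling tonal movements can be represented in this sketch.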
To find the optimal representative parameters, optimization is carried out by minimizing the mean squared error in the ln F0(t) domain through a hill-climbing search in the space of model parameters (Seresangtakul and Takara, 2003). To use this model, the parameters are extracted from the speech database utterance by utterance. The derived parameters are subsequently computed from them.
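The search strategy can be sketched as follows. This is a generic coordinate-wise hill climb on a deliberately reduced toy model (one phrase command, no tone commands), not the extraction procedure used in the cited work; the starting point, step sizes and the synthetic target contour are assumptions made for the example.

```python
import math

def model_log_f0(t, params):
    """Toy one-phrase model: ln F0(t) = ln Fb + Ap * alpha^2 * t * exp(-alpha * t)."""
    fb, ap, alpha = params
    return math.log(fb) + ap * alpha * alpha * t * math.exp(-alpha * t)

def mse_log_f0(params, times, target):
    """Mean squared error between model and target in the ln F0 domain."""
    return sum((model_log_f0(t, params) - y) ** 2
               for t, y in zip(times, target)) / len(times)

def hill_climb(start, steps, times, target,
               shrink=0.5, min_step=1e-6, max_rounds=10000):
    """Try +/- one step on each parameter, keep any move that lowers the
    error, and shrink all steps when no single move helps."""
    params, steps = list(start), list(steps)
    best = mse_log_f0(params, times, target)
    for _ in range(max_rounds):
        if max(steps) <= min_step:
            break
        improved = False
        for i in range(len(params)):
            for delta in (steps[i], -steps[i]):
                trial = list(params)
                trial[i] += delta
                if trial[0] <= 0.0 or trial[2] <= 0.0:   # keep Fb and alpha positive
                    continue
                err = mse_log_f0(trial, times, target)
                if err < best:
                    params, best, improved = trial, err, True
                    break
        if not improved:
            steps = [s * shrink for s in steps]
    return params, best

# Synthetic target generated from known (hypothetical) parameters.
times = [0.05 * i for i in range(21)]
target = [model_log_f0(t, (120.0, 0.4, 3.0)) for t in times]
start = (100.0, 0.2, 2.0)
initial_error = mse_log_f0(start, times, target)
fitted, final_error = hill_climb(start, (5.0, 0.05, 0.25), times, target)
```

A hill climb of this kind only guarantees a local minimum, so in practice the result depends on the initial command estimates.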
Derived parameters: From the conventional parameters, we proposed seven derived parameters which reflect the geometrical appearance of the F0 contour of an utterance, as follows:
• Baseline frequency
• Number of phrase commands
• Number of tone commands
• Phrase command duration
• Tone command duration
• Amplitude of phrase command
• Amplitude of tone command
All of them have been extracted for the four speech styles (angry, sad, enjoyable and reading). Thereafter, the extracted parameters are used to resynthesize the F0 contour in the evaluation process.

Structural model:
The voice F0 contour is modeled on a logarithmic scale, as depicted in Fig. 2. The mathematical model (Ni and Hirose, 2006; Chomphan and Kobayashi, 2009) uses a structural control that places a series of normalized F0 targets along the time axis, each specified by its transition time and amplitude. The transitions between targets are approximated by connecting truncated second-order transition functions.
The physical factors that regulate the frequency of vocal-fold vibration are the mass, length and tension of the vibrating structures, all of which are dynamically controlled primarily by the intrinsic and extrinsic muscles of the larynx and secondarily by the sub-glottal pressure (Ni and Hirose, 2006). Fujisaki explained that the logarithmic fundamental frequency varies linearly with the vocal-fold elongation x (Fujisaki, 1983), which can be represented as:

$$\ln f_0 = \frac{b}{2}\,x + \ln\left(\sqrt{a}\,c_0\right) \tag{4}$$

where a, b and $c_0$ are constant coefficients (Fujisaki, 1983).

Assumption:
The behavioral characteristics of vocal-fold elongation in vibration can be approximated by those of a simple forced vibrating system (Ni and Hirose, 2006). Under this assumption, the behavior of a simple forced vibrating system is characterized by the amplifying coefficient of its vibrating amplitude (Eq. 5), where $\omega_d$ and $\omega_f$ denote the natural angular frequencies of the driven system and the driving force, respectively, and ζ is the damping ratio, which indicates how tightly the driving force and the driven system are coupled together. Replacing the squared frequency ratio $\omega_f^2/\omega_d^2$ by λ and substituting A(λ, ζ) of Eq. 5 for x in Eq. 4, the logarithmic fundamental frequency can be expressed in terms of A(λ, ζ) and a constant coefficient C (Eq. 6).

Typically, a speaker has an individual vocal range. Let $f_{0t}$ and $f_{0b}$ denote the top and bottom frequencies of the vocal range of a speaker, and let $\lambda_t$ and $\lambda_b$ denote the two λ values that are mapped one-to-one to $f_{0t}$ and $f_{0b}$. The relationship between $f_0$ within the vocal-range frequency interval and its corresponding λ is given by Eq. 7. Since $f_{0t}$ and $f_{0b}$ are the top and bottom frequencies of the vocal range, $\lambda_t$ and $\lambda_b$ shall be determined regardless of ζ.
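The mapping from λ to a frequency inside the vocal range can be sketched numerically. The sketch below makes two assumptions that should be checked against Ni and Hirose (2006): it uses the textbook amplitude magnification factor of a damped forced oscillator for A(λ, ζ), and it maps A to ln f0 linearly so that $\lambda_t$ lands on $f_{0t}$ and $\lambda_b$ on $f_{0b}$. The default values (120 Hz, 420 Hz, λ between 1 and 2, ζ = 0.156) are the settings quoted in the text for Table 1; the function names are hypothetical.

```python
import math

def amplification(lam, zeta):
    """Assumed form of A(lambda, zeta): magnification factor of a damped
    forced oscillator, with lam the squared frequency ratio."""
    return 1.0 / math.sqrt((1.0 - lam) ** 2 + 4.0 * zeta * zeta * lam)

def lambda_to_f0(lam, zeta, f0b=120.0, f0t=420.0, lam_b=2.0, lam_t=1.0):
    """Map lam in [lam_t, lam_b] to f0 in [f0b, f0t].

    ln f0 is taken to be linear in A, normalized so that lam_t maps to the
    top frequency and lam_b to the bottom frequency of the vocal range.
    """
    a = amplification(lam, zeta)
    a_b = amplification(lam_b, zeta)
    a_t = amplification(lam_t, zeta)
    ratio = (a - a_b) / (a_t - a_b)
    return math.exp(math.log(f0b) + ratio * (math.log(f0t) - math.log(f0b)))

# Frequencies at the two ends and in the middle of the lambda interval.
top = lambda_to_f0(1.0, 0.156)
middle = lambda_to_f0(1.5, 0.156)
bottom = lambda_to_f0(2.0, 0.156)
```

Under this assumed form, A decreases monotonically as λ grows from 1 to 2, so each frequency in the vocal range corresponds to exactly one λ in that interval.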
Practically, $f_0$ and λ can be determined through $f_0 = T_{f0}(\lambda, \zeta)$ and $\lambda = T_{\lambda}(f_0, \zeta)$. $T_{f0}$ is derived directly from Eq. 7, and $T_{\lambda}(f_0, \zeta)$ can be obtained by searching λ upward from 1 in small increments (e.g., 0.0001), given $\lambda_b > \lambda_t$, until λ satisfies the termination conditions.

Allocation of pitch targets: Before the structural model is applied, the pitch targets on an F0 contour are allocated. The parameters of the structural model are then estimated. For example, Fig. 3 shows a sparser specification of the F0 contours shown in Fig. 1: the 'o' signs indicate the tonal peaks, the 'x' signs indicate the tonal valleys and the square sign indicates a neutral target that resets an intonational phrase.
Let $t_i$ and $f_{0i}$ denote the timing and F0 value of the i-th target, respectively. Table 1 lists the first eight target points $(t_i, f_{0i})$ and the corresponding $\lambda_i = T_{\lambda}(f_{0i}, \zeta_0)$, given $\lambda_t = 1$; $\lambda_b = 2$; $f_{0b}$ = 120 Hz; $f_{0t}$ = 420 Hz and $\zeta_0 = 0.156$. The i-th local F0 movement, i = 0, ..., 15, is defined as the span extending from the i-th target $(t_i, f_{0i})$ to the next one. If $f_{0i} \le f_{0,i+1}$, the local F0 movement is rising; otherwise, it is falling. The measured F0 contours are subsequently approximated by connecting the rising and falling transitions through these target points.
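The step-by-step search for λ and the rising/falling labeling of local movements can be sketched as below. Several points are assumptions rather than the published procedure: the form of A(λ, ζ) is the textbook magnification factor of a damped forced oscillator, the search stops as soon as the mapped frequency has fallen to the requested value, and the four target points are hypothetical values, not those of Table 1.

```python
import math

F0B, F0T = 120.0, 420.0      # vocal range quoted in the text for Table 1
LAM_T, LAM_B = 1.0, 2.0      # lambda values mapped to F0T and F0B
ZETA0 = 0.156                # damping ratio quoted in the text

def amplification(lam, zeta):
    """Assumed form of A(lambda, zeta) for a damped forced oscillator."""
    return 1.0 / math.sqrt((1.0 - lam) ** 2 + 4.0 * zeta * zeta * lam)

def lambda_to_f0(lam, zeta):
    """Assumed mapping: ln f0 linear in A, pinned at both ends of the range."""
    a_b, a_t = amplification(LAM_B, zeta), amplification(LAM_T, zeta)
    ratio = (amplification(lam, zeta) - a_b) / (a_t - a_b)
    return math.exp(math.log(F0B) + ratio * (math.log(F0T) - math.log(F0B)))

def f0_to_lambda(f0, zeta, step=0.0001):
    """Search lam upward from LAM_T in small increments until the mapped
    frequency has fallen to f0 (assumed stopping rule)."""
    n_steps = int(round((LAM_B - LAM_T) / step))
    for k in range(n_steps + 1):
        lam = LAM_T + k * step
        if lambda_to_f0(lam, zeta) <= f0:
            return lam
    return LAM_B

def movement_directions(targets):
    """Label each local F0 movement between consecutive targets."""
    return ["rising" if f_a <= f_b else "falling"
            for (_, f_a), (_, f_b) in zip(targets, targets[1:])]

# Hypothetical targets (time in seconds, F0 in Hz), for illustration only.
targets = [(0.00, 300.0), (0.14, 250.0), (0.30, 200.0), (0.45, 260.0)]
lambdas = [f0_to_lambda(f, ZETA0) for _, f in targets]
directions = movement_directions(targets)
```

With a step of 0.0001 the linear search needs at most 10,000 evaluations per target; a bisection search would be faster, but the linear scan mirrors the description in the text.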
For an F0 falling movement, say i = 0, first compute λ(t) for $t_0 \le t \le t_1$ by using the transition function (Eq. 8) with the following parameters (based on Table 1): $\lambda_p (= \lambda_0) = 1.28$; $\Delta\lambda = \Delta\lambda_0/0.95$ and $\Delta t = \Delta t_0/0.95$, where $\Delta\lambda_0 = (\lambda_1 - \lambda_0) = 0.12$ and $\Delta t_0 = (t_1 - t_0) = 0.14$. Second, synthesize the contour by using $f_0(t) = T_{f0}(\lambda(t), \zeta_0)$ of Eq. 5. The thick line between the 0th and 1st targets shown in Fig. 3 indicates the resynthesized F0 contour.

For an F0 rising movement, say i = 2, first compute λ(t) for $t_2 \le t \le t_3$ by using Eq. 8 with the following parameters: $\lambda_p = 2 - \lambda_2 = 0.58$; $\Delta\lambda = \Delta\lambda_2/0.95$, where $\Delta\lambda_2 = (2 - \lambda_3) - (2 - \lambda_2) = 0.08$, and $\Delta t = \Delta t_2/0.95$, where $\Delta t_2 = t_3 - t_2 = 0.15$. Then, synthesize the contour by using $f_0(t) = T_{f0}(2 - \lambda(t), \zeta_0)$ of Eq. 5. Note that A(λ, ζ) = A(2 - λ, ζ) is applied in the computation. In Fig. 3, the thick line between the 2nd and 3rd targets indicates the resynthesized F0 contour.

A mathematical model: Let $F_0(t)$ represent an F0 contour as a function of time t in a vocal range $[f_{0b}, f_{0t}]$ (Ni and Hirose, 2006; Nicknam et al., 2009). Let Λ(t) denote a sequence of virtual tone graphs in λ-time space that specifies the underlying lexical tone structures, and let a latent scale ζ(t) characterize the intonation components. The F0 contour on the logarithmic scale of fundamental frequency is then expressed, for t ≥ 0, as a scale transformation from Λ(t) to $F_0(t)$, corresponding to the syllabic tones fitting themselves to the sentence intonation within the vocal range. Equations 12-13 jointly indicate a structural formulation of the control process that couples the syllabic tones and the sentence intonation together to form the final sentence melody. Equation 12 states that the F0 contour $F_0(t)$ is a transformation of a sequence of virtual tone graphs Λ(t) on a latent scale ζ(t).
Λ(t) is expressed as a concatenation of n parametric bell-shaped patterns lined up in series on the time axis.

Figure 4 shows an example of the resynthesis of an F0 contour by using the structural model. Figure 4a shows the F0 contour extracted from natural speech, while Fig. 4b shows the corresponding values of λ for three different fixed damping ratios ζ (0.156, 0.02 and 0.9). Figure 4c shows the resynthesized F0 contours for the three damping ratios of Fig. 4b, while Fig. 4d compares the F0 contour extracted from natural speech with the F0 contour resynthesized from a limited number of samples (Geravanchizadeh and Rezaii, 2009).

An experimental design:
The flow chart in Fig. 5 shows the core process of our experiment. First, the speech corpus was constructed. The corpus contains both male and female speech, each with four speech styles: happy, sad, angry and reading. Each style consists of 5 sentences with 100 sample utterances per sentence, so the speech corpus contains 4,000 utterances in total. For each utterance, the F0 values were calculated and the pitch targets were then allocated by using local minimum/maximum criteria. Between any two adjacent pitch targets, which serve as fixed points, an exponential function was fitted so as to minimize the difference between the fitted function and the F0 contour. The parameters of all the functions along the utterance are used as its representation. Subsequently, the F0 contour was resynthesized from these parameters. Thereafter, the RMS error between the natural F0 contour and the resynthesized F0 contour was calculated. Finally, we analyzed the summarized data from the previous stages.
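The target allocation and the RMS evaluation steps can be sketched as follows. This is a simplified stand-in rather than the experimental code: targets are strict local extrema plus the two endpoints, and each segment is an exponential curve in Hz that passes through both of its targets (equivalently, a straight line in ln F0) instead of a fitted function. The sample contour and function names are hypothetical.

```python
import math

def allocate_pitch_targets(f0):
    """Return indices of the two endpoints and every strict local extremum."""
    idx = [0]
    for i in range(1, len(f0) - 1):
        is_peak = f0[i - 1] < f0[i] > f0[i + 1]
        is_valley = f0[i - 1] > f0[i] < f0[i + 1]
        if is_peak or is_valley:
            idx.append(i)
    idx.append(len(f0) - 1)
    return idx

def resynthesize(f0, idx):
    """Connect adjacent targets with an exponential segment in Hz
    (linear interpolation in the ln F0 domain)."""
    out = []
    for a, b in zip(idx, idx[1:]):
        la, lb = math.log(f0[a]), math.log(f0[b])
        for i in range(a, b):
            out.append(math.exp(la + (lb - la) * (i - a) / (b - a)))
    out.append(f0[idx[-1]])
    return out

def rms_error(natural, resynth):
    """Root-mean-square difference between two equally long contours (Hz)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(natural, resynth))
                     / len(natural))

# Hypothetical F0 samples in Hz for one voiced stretch.
natural = [200.0, 210.0, 230.0, 220.0, 205.0, 215.0, 240.0, 235.0]
targets = allocate_pitch_targets(natural)
resynth = resynthesize(natural, targets)
error = rms_error(natural, resynth)
```

Averaging `error` over the utterances of one sentence and style gives the kind of per-style figure compared in the discussion; real F0 tracks would also need unvoiced frames removed before the extrema are located.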

DISCUSSION
The experimental results in Fig. 6-10 show that the averaged RMS error is highest for angry speech and lowest for reading speech, while the averaged RMS errors of happy and sad speech lie in between. All five sentences show consistent results across the figures. Regarding the difference between genders, the averaged RMS error of female speech is higher than that of male speech. Finally, the RMS error of Fujisaki's model is mostly higher than that of the structural model for all speech styles. In other words, the structural model gives a better fit to the F0 contour of expressive speech than Fujisaki's model. The results of all sentences confirm this observation (Suhartono, 2011; Souleymane et al., 2009; Geravanchizadeh and Rezaii, 2009; Camminatiello and Lucademo, 2010; Nicknam et al., 2009).

CONCLUSION
This study presents a comparison of two successful F0 models: Fujisaki's model and the structural model. The speech database consists of male and female speech, each containing four speech styles: angry, sad, enjoyable and reading. Five sentences are used for each speech style and each sentence includes 100 samples. The experimental results show that the RMS error differs from one speech style to another for both models. They also reveal that the RMS error of Fujisaki's model is higher than that of the structural model for all speech styles. In other words, the structural model gives a better fit to the F0 contour of expressive speech than Fujisaki's model.