Modeling of Fundamental Frequency Contour of Thai Expressive Speech using Fujisaki's Model and Structural Model
DOI : 10.3844/jcssp.2011.1310.1317
Journal of Computer Science
Volume 7, Issue 8
Problem statement: In spontaneous speech communication, prosody is an important factor that must be taken into account, since it affects not only the naturalness but also the intelligibility of speech. A number of systems for synthesizing Thai expressive speech have been developed over the years. However, expressive speech with various speaking styles has not yet been achieved. To generate expressive speech, the fundamental frequency (F0) contours must be modeled accurately so that the quality of the speech prosody is preserved. Approach: This study presents a comparison of two successful F0 models. The first is based on Fujisaki’s model, which has been applied to many tonal and non-tonal languages. The second is based on the structural model, which has been developed primarily for Mandarin Chinese. The structural model rests on the assumption that the behavioral characteristics of vocal-fold elongation in vibration can be approximated by those of a simple forced vibrating system. In this study it is applied to Thai expressive speech with a best-fit function. Our speech database consists of male and female speech, each containing four speech styles: angry, sad, enjoyable and reading. Five sentences are used for each speech style and each sentence includes 100 samples. Each speech sample is analyzed to obtain an F0 contour, and a set of Fujisaki and structural model parameters is then extracted for each contour. The parameters are then used to synthesize the F0 contour, and the synthesized contour is compared with that of natural speech by calculating the root-mean-square (RMS) error. Results: The experimental analysis shows that the RMS error differs among speech styles for both models. It also shows that the RMS error of Fujisaki’s model is higher than that of the structural model for all speech styles. In other words, the structural model gives a better fit to the F0 contours of expressive speech than Fujisaki’s model. Conclusion: The findings provide clear evidence that the structural model is more appropriate than Fujisaki’s model for modeling the four speech styles: angry, sad, enjoyable and reading.
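The evaluation step described in the abstract, comparing a synthesized F0 contour with the natural one by RMS error, can be illustrated with a minimal Python sketch. The abstract does not state whether the error is computed in Hz or in the log-F0 domain, nor how unvoiced frames are handled, so the choices below (values in Hz, frame-aligned contours, frames with a missing or non-positive F0 skipped as unvoiced) are assumptions. The function name `rms_error` and the sample values are hypothetical, not taken from the study.

```python
import math


def rms_error(natural_f0, synthesized_f0):
    """Return the RMS error between two frame-aligned F0 contours (in Hz).

    Assumption: a frame counts as voiced only when both contours hold a
    positive value; all other frames are skipped.
    """
    if len(natural_f0) != len(synthesized_f0):
        raise ValueError("contours must have the same number of frames")

    squared_diffs = [
        (n - s) ** 2
        for n, s in zip(natural_f0, synthesized_f0)
        if n is not None and s is not None and n > 0 and s > 0
    ]
    if not squared_diffs:
        raise ValueError("no voiced frames to compare")

    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Illustrative values only; frame 2 is unvoiced in the natural contour.
natural = [100.0, 110.0, 0.0, 120.0, 130.0]
synthesized = [103.0, 114.0, 50.0, 120.0, 130.0]
print(rms_error(natural, synthesized))  # 2.5
```

Under these assumptions, a lower value for one model on the same set of utterances corresponds to the "better fit" reported in the results.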
© 2011 Suphattharachai Chomphan. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.