Research Article Open Access

Modeling of Fundamental Frequency Contour of Thai Expressive Speech using Fujisaki's Model and Structural Model

Suphattharachai Chomphan
Journal of Computer Science
Volume 7 No. 8, 2011, 1310-1317

DOI: https://doi.org/10.3844/jcssp.2011.1310.1317

Submitted On: 20 April 2011 Published On: 9 August 2011

How to Cite: Chomphan, S. (2011). Modeling of Fundamental Frequency Contour of Thai Expressive Speech using Fujisaki's Model and Structural Model. Journal of Computer Science, 7(8), 1310-1317. https://doi.org/10.3844/jcssp.2011.1310.1317

Abstract

Problem statement: In spontaneous speech communication, prosody is an important factor because it affects not only the naturalness but also the intelligibility of speech. A number of Thai speech synthesis systems have been developed over the years, but expressive speech with various speaking styles has not yet been achieved. Generating expressive speech requires modeling the fundamental frequency (F0) contour accurately so that the quality of the speech prosody is preserved.

Approach: This study compares two successful F0 models. The first is Fujisaki's model, which has been applied to many tonal and non-tonal languages. The second is the structural model, which has been developed primarily for Mandarin Chinese. It is based on the assumption that the behavioral characteristics of vocal-fold elongation in vibration can be approximated by those of a simple forced vibrating system. This approach is therefore applied here to model Thai expressive speech with a best-fit function. The speech database consists of male and female speech, each containing four speech styles: angry, sad, enjoyable and reading. Five sentences are used for each speech style and each sentence includes 100 samples. Each speech sample is analyzed to obtain its F0 contour, and the parameters of Fujisaki's model and of the structural model are then extracted for each contour. The parameters are used to synthesize the F0 contour, and the synthesized contour is compared with that of natural speech by calculating the RMS error.

Results: The experimental analysis shows that the RMS error differs from one speech style to another for both models. It also shows that the RMS error of Fujisaki's model is higher than that of the structural model for all speech styles. In other words, the structural model fits the F0 contour of expressive speech better than Fujisaki's model does.

Conclusion: The findings provide clear evidence that the structural model is more appropriate than Fujisaki's model for modeling the four speech styles: angry, sad, enjoyable and reading.
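
The analysis-by-synthesis comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. It uses the commonly cited form of Fujisaki's model, in which ln F0(t) is the sum of a baseline term, phrase components and accent (or tone) components. The function names, the default values of alpha, beta and gamma, and the choice to compute the RMS error in Hz are all assumptions made for illustration; the abstract does not state them. The structural model is not sketched because the abstract does not give its equations.

```python
import math

def phrase_component(t, alpha=2.0):
    """Phrase control response Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0.
    alpha is an illustrative value, not one taken from the article."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_component(t, beta=20.0, gamma=0.9):
    """Accent control response Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma)
    for t >= 0. beta and gamma are illustrative values."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(times, fb, phrases, accents):
    """Synthesize an F0 contour in Hz.
    fb      : baseline frequency in Hz
    phrases : list of (Ap, T0) phrase commands
    accents : list of (Aa, T1, T2) accent commands (onset T1, offset T2)"""
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        log_f0 += sum(ap * phrase_component(t - t0) for ap, t0 in phrases)
        log_f0 += sum(
            aa * (accent_component(t - t1) - accent_component(t - t2))
            for aa, t1, t2 in accents
        )
        contour.append(math.exp(log_f0))
    return contour

def rms_error(natural, synthesized):
    """Root-mean-square error between two equal-length F0 contours.
    Assumes both are sampled at the same time points and given in Hz."""
    if not natural or len(natural) != len(synthesized):
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(natural, synthesized))
    return math.sqrt(squared / len(natural))
```

Given a measured contour `natural` sampled at `times`, a call such as `rms_error(natural, fujisaki_f0(times, fb, phrases, accents))` gives the kind of fit score compared across speech styles, under the assumptions above.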

Keywords

  • Noised speech
  • speech analysis
  • fundamental frequency contour
  • speech enhancement
  • speech synthesis
  • speaker-independent
  • local-accent
  • model parameters
  • Fujisaki's model