Prosody Modification of Standard Arabic Speech Using Combining Synchronous Overlap and Add With Fixed-Synthesis Algorithm and Multi Level Discrete Wavelet Transform
Ykhlef Faycal, Bensebti Mesaoud and Bendaouia Lotfi
DOI: 10.3844/jcssp.2010.392.405
Journal of Computer Science
Volume 6, Issue 4
Problem statement: The objective of prosody modification is to change the amplitude, duration and pitch (F0) of speech segments without altering their spectral envelope. Applications are numerous, including text-to-speech synthesis, transformation of voice characteristics and foreign language learning. Several approaches have been developed in the literature to achieve this goal. Their main limitations lie in the modification range, the quality of the synthesized speech and the naturalness of the spoken language. Recent studies provide evidence that the first formant (F1) and F0 are dependent, suggesting that, to preserve the quality and naturalness of the speech signal, any change to one of these parameters must be accompanied by a suitable modification of the other. Approach: This study introduced a prosody modification method that combines the Synchronous Overlap and Add with Fixed-Synthesis (SOLAFS) algorithm with a multi-level decomposition based on the Discrete Wavelet Transform (DWT) to overcome the limitations cited above. It used Standard Arabic (SA) sounds. For comparison, two techniques based on frame-by-frame processing were proposed. The first consists of pitch-synchronous processing of the mth approximation level time segments used in the SOLAFS algorithm. It was intended to modify the prosody of the input speech without affecting the spectral envelope. The second exploits the correlation between F1 and F0 in the corresponding approximation level of SA sounds and modifies the duration together with both the F0 and F1 scales. It was based on a re-sampling method using FFT interpolation. The multi-level analysis was intended to provide independent control over the spectral envelope. In both techniques, the decomposition level depends on the chosen sampling frequency (FS). F0 marking was based on multi-level peak comparison.
Both techniques used an automatic speech classification algorithm based on a modified version of the Johnson algorithm. Results: The performance of the proposed techniques was evaluated by listening tests using SA sentences sampled at an FS of 16 kHz. It was found that manipulating F0 in conjunction with the local F1 in the third approximation level significantly improved the naturalness of the modified speech compared to classical prosody modification. Conclusion: This improvement is best suited to high F0 scales, because speakers generally increase F1 as they increase F0. Further, the technique can be used to manipulate the remaining formant structure.
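The link between the decomposition level and FS can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a dyadic DWT whose level-m approximation ideally keeps the band from 0 to FS/2^(m+1) Hz, uses the Haar wavelet only because it is the simplest choice (the abstract does not name the wavelet), and picks the level with a hypothetical rule, the deepest level whose band still reaches an assumed 1 kHz ceiling for F0 and F1. The function names are illustrative.

```python
import math


def approximation_band_hz(fs_hz, level):
    """Upper edge (Hz) of the ideal band kept by the level-`level` approximation."""
    return fs_hz / 2 ** (level + 1)


def deepest_level_keeping(fs_hz, f_max_hz):
    """Deepest level whose approximation band still reaches f_max_hz.

    Hypothetical selection rule, used here only for illustration.
    Returns 0 when even the first level would cut below f_max_hz.
    """
    level = 0
    # Going one level deeper halves the band; stop before it drops below f_max_hz.
    while approximation_band_hz(fs_hz, level + 1) >= f_max_hz:
        level += 1
    return level


def haar_approximation(samples, level):
    """Level-`level` Haar approximation coefficients of `samples`.

    Each level applies the Haar scaling filter and keeps every second
    output, halving the length. A trailing odd sample is dropped.
    """
    approx = list(samples)
    for _ in range(level):
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2.0)
                  for i in range(0, len(approx) - 1, 2)]
    return approx


fs = 16000
level = deepest_level_keeping(fs, 1000.0)
print(level, approximation_band_hz(fs, level))   # 3 1000.0
print(len(haar_approximation([1.0] * 16, level)))  # 2
```

Under these assumptions an FS of 16 kHz gives level 3, whose approximation spans roughly 0 to 1 kHz, which is consistent with the third approximation level reported in the results. The same rule would give level 2 at an FS of 8 kHz.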
© 2010 Ykhlef Faycal, Bensebti Mesaoud and Bendaouia Lotfi. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.