Acoustic Model Adaptation for Indonesian Language Utterance Training System
Linda Indrayanti, Yoshifumi Chisaki and Tsuyoshi Usagawa
DOI : 10.3844/jcssp.2010.1334.1340
Journal of Computer Science
Volume 6, Issue 11
Problem statement: To build an utterance training system for the Indonesian language, a speech recognition system designed for Indonesian is necessary. However, such a system performs poorly because the pronunciation variants in non-native utterances may lead to substitution and deletion errors. This research investigated these pronunciation variants and proposed an acoustic model adaptation to improve the performance of the system. Approach: The proposed acoustic model adaptation worked in three steps: (1) analyzing pronunciation variants with knowledge-based and data-derived methods; (2) aligning the knowledge-based and data-derived results to list frequently mispronounced phones with their variants; and (3) performing a state-clustering procedure with the list obtained from the second step. Further, three Speaker Adaptation (SA) techniques were used in combination with the acoustic model adaptation and compared with one another. To evaluate and tune the adaptation techniques, a perceptual-based evaluation by three human raters was performed to obtain the "true" recognition results. Results: The proposed method achieved an average gain in Hit + Rejection (the percentage of utterances that the system accepts or rejects in agreement with the human raters) of 2.9 points and 2 points for native and non-native subjects, respectively, compared with the system without adaptation. Average gains of 12.7 and 6.2 points in Hit + Rejection for native and non-native students, respectively, were obtained by combining SA with the acoustic model adaptation. Conclusion/Recommendations: Performance evaluation of the adapted system demonstrated that the proposed acoustic model adaptation can improve Hit, even though there is a slight increase in False Alarm (FA, the percentage of utterances that the system accepts but the human raters reject).
The performance of the proposed acoustic model adaptation depends strongly on the effectiveness of the state-clustering procedure in recovering only in-vocabulary words. For future research, a confidence measure to discriminate between in-vocabulary and out-of-vocabulary words will be investigated.
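The evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes one accept/reject decision per utterance from the system and one reference decision per utterance from the human raters, and it assumes that every rate is a percentage of all evaluated utterances; the abstract does not state the denominator or how the three raters' judgments are combined. The function name `agreement_rates`, the label `false_rejection` for the fourth outcome, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def agreement_rates(system_accepts: Sequence[bool],
                    raters_accept: Sequence[bool]) -> Dict[str, float]:
    """Return each outcome as a percentage of all utterances (assumed denominator)."""
    if len(system_accepts) != len(raters_accept):
        raise ValueError("both sequences must have one entry per utterance")
    if not system_accepts:
        raise ValueError("at least one utterance is required")

    counts = {"hit": 0, "rejection": 0, "false_alarm": 0, "false_rejection": 0}
    for sys_ok, human_ok in zip(system_accepts, raters_accept):
        if sys_ok and human_ok:
            counts["hit"] += 1              # system accepts, raters accept
        elif not sys_ok and not human_ok:
            counts["rejection"] += 1        # system rejects, raters reject
        elif sys_ok and not human_ok:
            counts["false_alarm"] += 1      # system accepts, raters reject (FA)
        else:
            counts["false_rejection"] += 1  # system rejects, raters accept (hypothetical label)

    total = len(system_accepts)
    rates = {name: 100.0 * n / total for name, n in counts.items()}
    rates["hit_plus_rejection"] = rates["hit"] + rates["rejection"]
    return rates


# Toy decisions, invented for illustration only (not data from the paper).
system = [True, True, False, False, True, False, True, True]
raters = [True, False, False, True, True, False, True, True]
rates = agreement_rates(system, raters)
print(rates)
```

Under these assumptions, a gain in Hit + Rejection "in points" is the difference between the `hit_plus_rejection` values of two systems scored on the same utterances.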
© 2010 Linda Indrayanti, Yoshifumi Chisaki and Tsuyoshi Usagawa. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.