Statistical Parametric Evaluation on New Corpus Design for Malay Speech Articulation Disorder Early Diagnosis

Speech-to-text, also known as speech recognition, plays an important role in medicine, particularly in the assessment of speech impairment. In this study, a Malay speech-to-text system was designed using the Hidden Markov Model (HMM) as the statistical engine, with emphasis on the design of a Malay speech corpus specifically for Malay articulation disorder. The study also tests different numbers of HMM states to analyze how they change the recognition accuracy of the Malay speech recognizer. A statistical parametric representation method was used, and the Malay corpus database was constructed to be balanced across all the places and manners of articulation that appear in Malay speech articulation therapy. The results were obtained from several experiments on samples collected from 80 patient speakers (children and adults), giving approximately 30,720 training samples.


Introduction
Speech is a method of communication between people (Yong and Swee, 2014b), who shape its content to deliver their message. The process of producing speech sounds is called articulation; it is caused by the movement of the lips, tongue, velum and jaws to shape the flow of air into sounds (Thomas and Carmack, 1990). A person who has difficulty articulating spoken language correctly has an articulation disorder. According to Ting et al. (2003), the cause of an articulation disorder can be organic or functional. Organic causes include anatomical, sensory or motor impairment, while functional causes have several etiologies (Van Riper and Erikson, 1996).
Early screening or diagnosis of an articulation disorder, including identifying its type, benefits the subsequent treatment stage. Currently, hospitals and speech centers in Malaysia still rely on the traditional method of manual diagnosis (Mohd Nizam and Tan, 2012). Manual diagnosis requires heavy involvement of a Speech Language Pathologist (SLP) in each session. At present, the ratio of SLPs to speech disorder patients is 1:500 at Hospital Sultanah Aminah (HSA), while the ideal ratio according to the World Health Organization (WHO) is 1:50 (Mohd Nizam and Tan, 2012). According to Jamilah (2014), an SLP at a speech center, an SLP might need to work at a ratio of 1:4 (SLP to speech disorder patients), and this can reach 1:8 in a hospital, while the ideal is 1:1. This shows that more effort is needed to narrow the gap between the real and the ideal situation.
A computerized system can assist the SLP in the early screening process, making it more efficient, with the aim of reducing the time consumed and raising accuracy. However, the right combination of words needs to be designed into the system to address language and speech disorders, which concern disorders of human communication (Tan et al., 2007), with a specific focus on articulation disorder. This chapter has introduced the background and importance of the study. Chapter 2 introduces the process of voice production, reviews current research and gives some background on the Malay language. Chapter 3 presents the methodology, including the Malay articulation corpus design and the structure of the Hidden Markov Model (HMM) with adjustment of the number of states. The following chapters cover the speech recognizer architecture, HMM likelihood evaluation, results and conclusion.

Literature Review
Automatic Speech Recognition (ASR) has gained increasing attention in the Malaysian medical field for its possible use in diagnosis and treatment, especially for speech disorders. This section gives a brief introduction to voice production, the statistical modelling method and the Malay language.

Production of Voice
Producing voice requires complex planning and coordination of mouth and tongue movements (Dronkers, 1996a). It involves muscles along the vocal tract acting with specific and quick timing. Voice production needs air that forms in the lungs, flows through the vocal folds in the larynx and vibrates in the oral cavity area, which includes the soft palate, hard palate, lips, jaw and tongue. In detail, when we plan to speak, the brain sends a signal to the laryngeal muscles to close the vocal cords (they remain open for smooth breathing when we are not speaking). Once the vocal cords are closed, the air coming up from the lungs meets the blockage, and the vocal cords open and close repeatedly, which becomes a rapid vibration. This produces sound waves in the air around the vocal cords, which we know as the basic tone of the voice (Voice Production, 2009). This anatomical structure is shown in Fig. 1.
Manner of articulation is another important part of speaking, after the production of voice. When the moving air passes the vocal cords and reaches the oral cavity above the tongue, the way people position their speech organs and control the flow of air determines whether words are pronounced correctly. Manner of articulation refers to these methods of controlling the speech organs, such as stops (plosives), fricatives, affricates and nasal consonants. A stop is created by trapping the air flow in the mouth with different speech organs. A fricative is similar to a hissing sound: a tight constriction is made, so the passing air becomes turbulent and creates sound. For a nasal consonant, the air is directed through the nasal cavity and released through the nose. Another manner of articulation is taps and flaps, which describes a rapid brush against the alveolar ridge area (Hayes, 2011). Thus, human speech production requires complex planning and coordination of mouth and tongue movements (Dronkers, 1996b).

Statistical Modelling Technique
To address this problem, which involves recognizing the smallest lexical unit in sound production, the suggested method is the probabilistic pattern matching technique of the Hidden Markov Model (HMM). The HMM is widely used in speech recognition because of its ability to model speech variability statistically. It is flexible and can be used in many other applications that involve stochastic modeling tasks, especially in the medical field for early screening diagnosis of articulation disorder. Put simply, a stochastic task is one that deals with uncertainty and incompleteness in the input variable. HMM-based recognition begins by creating stochastic models from known utterances. The probability that an unknown utterance was generated by each model is then compared (Paul, 1990). At the hidden level, most speech recognition systems represent the phonetic information of the underlying speech signal. Some have achieved success with this methodology, but the approach does not explicitly incorporate knowledge of certain aspects of human speech production (Aymen et al., 2011).
Another problem that arises when developing a speech recognition system concerns the corpus design and the pattern matching phase. These are described in more detail in the methodology chapter.

Malay Language Overview
Malay, or Bahasa Melayu, is the official language of Malaysia, Indonesia and Brunei (Yong and Swee, 2014a). There are about 500 million Malay speakers in the world (Noraini and Kamaruzaman, 2008). With speakers spread across at least three countries, the language has several dialects, such as Kelantan Malay, Ulu Muar Malay, Langkawi Malay and others (Teoh, 1994). 'Standard Malay' or 'Bahasa Baku', which refers to the formal form and usage of the language (Sariyan, 1988), is based on the Johor-Riau Malay dialect.
According to Noraini and Kamaruzaman (2008), Malay is a phonetic language written in Roman characters. Standard Malay has a total of 6 main vowels and 29 consonants. The book 'The Malay Sound System' (1980) mentions 9 vowel sounds: the 6 main vowels /i, e, a, ə, u, o/ and, in addition, the allophones /ɛ, ɜ, ɔ/.
In the Malay system of transliteration, the Malay language is represented in Roman characters, with some notes on the pronunciation of Malay words (Hugh and Frank, 1894). It is divided into 3 main parts: vowels, diphthongs and consonants. For example, for words starting with "A", Malays pronounce â (long) like the vowel sound a in "calm"; â (medium) like the vowel sound in "come"; and â (short) a little shorter than the vowel sound in the English word "but". Thus, Malay pronunciation is quite similar to English pronunciation. According to Noraini and Kamaruzaman (2008), the language belongs to the Austronesian family and is agglutinative; words in Bahasa Melayu are usually formed by joining syllables. Attempts to define the syllable commonly rely on peaks of sonority within an utterance (Ladefoged, 1975). A precise definition of the structure of a syllable, presented by Laver (1994), is illustrated in Fig. 2.
The figure shows a syllable formed around a central vowel or vowel-like consonant, which is the nucleus. The nucleus may optionally be paired with one or more consonants, which form the onset and coda.
The smallest unit of speech is the phoneme (Swee and Salleh, 2008): a family of sounds that are close enough in perceptual quality to be distinguished from other phonemes and that is capable of conveying a distinction in meaning, as the 'm' of 'mat' and the 'b' of 'bat' in English (Tischer, 2009). The Malay language has about 24 pure phonemes and 6 borrowed phonemes. Malay phoneme features follow the standard of the International Phonetic Association (IPA) (Asmah, 1983). Figure 3 shows the possible types of phoneme that can be combined, as presented by Hina (2012).
The phonology of speech also involves diphthongs and vowel combinations. Diphthongs, sometimes called gliding vowels, are vowels in which two vowel sounds are connected in a continuous, gliding motion. Malay has 3 diphthongs: [ai], [au] and [oi] (Raminah and Rahim, 1987). Malay also has seven two-vowel combinations: 'ai', 'au', 'ia', 'io', 'iu', 'ua' and 'ui'. However, the vowel combinations 'au' and 'ai' differ from the diphthongs 'au' and 'ai' in the way they are pronounced.

Malay Articulation Disorder Corpus
Computers play a major role in computerizing the diagnosis process. However, one difficulty is that current speech signal processing technology cannot reliably identify and differentiate words that are pronounced similarly. For example, 'Start' and 'Stop' in English are difficult to differentiate because they share the same initial pronunciation, 'St'. In addition, the consonants S and T both belong to the alveolar group in the International Phonetic Alphabet (IPA) table (Nur Hana, 2007). The raw speech waveforms of these two words have a similar pattern in the initial part (the 'St-' sound), as shown in Fig. 4. Similarly, in Malay, 'Lampu' (Light) and 'Lembu' (Cow) create confusion in a speech signal processing system because they share a similar place and manner of articulation in the IPA table, namely bilabial plosive (consonants P and B) (Nur Hana, 2007). We therefore take this into consideration when simplifying the existing sample word corpus to make it more compatible with a computer system. The sample words of the lexicon database were gathered from Hospital Sultanah Aminah by Mohd Nizam and Tan (2012). A total of 128 different Malay words were collected based on consonant categories, mainly of the alveolar and plosive types.
The next factor in the simplification process is to make sure that each chosen word has the maximum function in testing a person's articulation, in other words, that it tackles particular manners of articulation. For instance, 'komputer' (computer), 'sembilan' (nine) and 'televisyen' (television) can each test 4 places. Taking 'sembilan' (nine) as an example, the word tests 'sem-' (consonant S) for initial consonant articulation, '-bi-' (consonant B) and '-la-' (consonant L) for middle consonant articulation, and '-an' (consonant N) for end consonant articulation. All the consonant tests target alveolar and plosive consonants.
Tables 1-3 list the target consonant corpus for this research after simplification. According to Donald and Katherine (1996), consonant sounds are misarticulated more commonly than vowel sounds in English. Our corpus therefore focuses on consonant sounds. The corpus contains 64 words in total.
All of the target words in this corpus were chosen with the consonant position in mind, that is, the initial, middle or end position of a word. Fig. 5 illustrates that the consonant R can occur in the initial position, as in Rumah (House), the middle position, as in Jari (Finger), and the end position, as in Motor (Motor).
As an example from the corpus, the word sotong (Squid) can be divided into 3 places to be tested: consonant S in the initial position, consonant T in the middle position and consonant G in the end position, as illustrated in Fig. 6.
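One way to hold this position information in software is sketched below. The `CorpusEntry` class, its field names and the `places_distribution` helper are illustrative assumptions, not part of the original system; only the consonant positions for 'sotong' and 'sembilan' are taken from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    """One target word with the consonants it tests (hypothetical structure)."""
    word: str
    gloss: str
    targets: tuple  # (position, consonant) pairs, e.g. ("initial", "s")

    def places(self) -> int:
        """Number of consonant positions this word can test."""
        return len(self.targets)


# Two entries annotated as described in the text; the real corpus has 64 words.
CORPUS = [
    CorpusEntry("sotong", "squid",
                (("initial", "s"), ("middle", "t"), ("end", "g"))),
    CorpusEntry("sembilan", "nine",
                (("initial", "s"), ("middle", "b"), ("middle", "l"), ("end", "n"))),
]


def places_distribution(corpus):
    """Percentage of words per number of tested places."""
    counts = Counter(entry.places() for entry in corpus)
    total = len(corpus)
    return {places: 100.0 * n / total for places, n in sorted(counts.items())}
```

Applied to the full 64-word list, the same helper would give the share of words per number of tested places.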

Architecture of Speech Recognizer Engine
The core structure of this experiment follows the basic, state-of-the-art ASR architecture, which consists of front-end processing and back-end processing of the speech sample signal. Figure 7 shows the ASR architecture used in this research. Front-end processing starts by capturing the sound wave of the speech sample with a standard microphone at 16 kHz and 16-bit resolution. The speech signal is then converted from waveform format into a parametric representation by feature extraction (FE), using a setup of 12 Mel-Frequency Cepstral Coefficients (MFCC) (Davis and Mermelstein, 1980). MFCC was selected because previous research shows that it reflects characteristics of the human auditory system and is commonly used in ASR (Axelsson and Björhäll, 2003). The next phase is back-end processing. As explained in the previous chapter, HMMs deal with stochastic modeling tasks. A general HMM covers two stochastic processes: the transition process between states and the process by which individual states generate feature vectors (Zimmermann and Zimmermann Jr, 2002). The acoustic HMM training mechanism is used to generate a set of models that represent the observed acoustic vectors of the samples. In a typical speech recognition system, a word-based pronunciation dictionary is created to describe how the HMM acoustic models, such as phonemes or syllables, are mapped to form a word for both training and decoding.
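The front-end step can be illustrated with a minimal pure-Python MFCC sketch. Only the 16 kHz sampling rate and the 12 coefficients come from the text. The 25 ms frame, 10 ms shift, 0.97 pre-emphasis, 26 mel filters, Hamming window and unscaled DCT are common defaults assumed here, not settings reported for this study. A practical system would use an optimized toolkit instead of the naive DFT below.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with edges spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * hz / sample_rate) for hz in edges_hz]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):        # rising slope
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):       # falling slope, peak at centre
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank


def mfcc(signal, sample_rate=16000, frame_ms=25, shift_ms=10,
         n_filters=26, n_ceps=12, pre_emphasis=0.97):
    """Return one list of n_ceps cepstral coefficients per frame."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    emphasised = [signal[0]] + [signal[i] - pre_emphasis * signal[i - 1]
                                for i in range(1, len(signal))]
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    # Precomputed DFT factors exp(-2*pi*i*m/N); the DFT here is O(N^2).
    twiddle = [cmath.exp(-2j * math.pi * m / frame_len) for m in range(frame_len)]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(emphasised) - frame_len + 1, shift):
        frame = [emphasised[start + n] * window[n] for n in range(frame_len)]
        power = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * twiddle[(k * n) % frame_len]
                      for n in range(frame_len))
            power.append(abs(acc) ** 2 / frame_len)
        # Log mel energies, floored to avoid log(0).
        log_energy = [math.log(max(sum(w * p for w, p in zip(row, power)), 1e-10))
                      for row in bank]
        # DCT-II of the log energies; c0 is skipped, leaving c1..c12.
        features.append([
            sum(log_energy[j] * math.cos(math.pi * i * (j + 0.5) / n_filters)
                for j in range(n_filters))
            for i in range(1, n_ceps + 1)
        ])
    return features
```

Each returned vector corresponds to one observation of the kind consumed by the back-end HMMs.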
For this experiment, two pronunciation dictionaries were created: a training pronunciation dictionary and a decoding pronunciation dictionary. To develop an accurate and robust model, a few thousand utterance samples must be involved in HMM acoustic training (Young et al., 2006). The training samples for this experiment cover different Malay accents (Chinese and Malay speakers), genders, and both normal speakers and disorder patients, all of which were taken into account during data collection. The word-based pronunciation dictionary created for acoustic training is shown in Fig. 8, with a total of 23 monophones used.
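A word-based pronunciation dictionary of this kind can be pictured as a mapping from each word to its monophone sequence. The sketch below is illustrative only: the phone labels and the helper names `monophone_set` and `phone_transcription` are assumptions, and the entries are not copied from the dictionary in Fig. 8.

```python
# Hypothetical entries; the phone labels are illustrative, not those of Fig. 8.
TRAINING_DICT = {
    "sotong": ["s", "o", "t", "o", "ng"],
    "lampu": ["l", "a", "m", "p", "u"],
    "lembu": ["l", "e", "m", "b", "u"],
}


def monophone_set(dictionary):
    """All distinct monophones that need an acoustic model."""
    return sorted({phone for phones in dictionary.values() for phone in phones})


def phone_transcription(words, dictionary):
    """Expand a word-level transcription into the phone sequence used for training."""
    return [phone for word in words for phone in dictionary[word]]
```

For example, `phone_transcription(["lampu"], TRAINING_DICT)` yields the phone sequence whose monophone HMMs would be concatenated to form the word model.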

HMM State Likelihood Evaluation
The experiment in this research applies the general approach of modeling the FE output with HMMs, which provide a statistical framework for modeling speech patterns and are the most widely used technique in speech recognizers (Rabiner, 1989). First, the sequence of feature vectors from the MFCC FE process is taken as a realization of a concatenation of elementary processes described by HMMs. These HMMs are observed through a stochastic process that produces the time sequence of observations. The HMM speech recognizer identifies unknown speech by estimating the likelihood of each phoneme at the frames of the speech signal. A search procedure then determines the phoneme sequence with the highest likelihood, restricted to sequences that correspond to words in the vocabulary.
Isolated word recognition in this experiment assumes that each spoken word in the speech utterance is represented by a sequence of speech vectors, or observations, O, as denoted in Equation 1:

$$O = o_1, o_2, \ldots, o_T \tag{1}$$

where $o_t$ is the speech vector observed at time $t$. The isolated word recognition problem can then be denoted as in Equation 2:

$$\arg\max_i \{ P(w_i \mid O) \} \tag{2}$$

where $w_i$ is the $i$-th vocabulary word. This probability is not computed directly, but is computed using Bayes' rule, as in Equation 3:

$$P(w_i \mid O) = \frac{P(O \mid w_i)\, P(w_i)}{P(O)} \tag{3}$$

Here $P(w_i)$ is the set of prior probabilities, so the most probable spoken word depends on the likelihood $P(O \mid w_i)$. In HMM-based speech recognition, the observation sequence O corresponding to each word is assumed to be generated by a Markov model, which supplies the likelihood $P(O \mid w_i)$ in Equation 3. A Markov model is a finite state machine that changes state once every time unit, and each time $t$ that a state $j$ is entered, a speech vector $o_t$ is generated from the probability density $b_j(o_t)$. For example, to generate the sequence $o_1$ to $o_6$, a six-state model moves through the state sequence $X = 1, 2, 2, 3, 4, 4, 5, 6$, in which the entry and exit states are non-emitting.

Fig. 8. Word-based training pronunciation dictionary
Fig. 9. HMM recognition of isolated word
The transition probabilities and output probabilities can be expressed as in Equation 4:

$$P(O, X \mid M) = a_{12}\, b_2(o_1)\, a_{22}\, b_2(o_2)\, a_{23}\, b_3(o_3) \cdots \tag{4}$$

which is the joint probability that O is generated by the model M moving through the state sequence X, where $a_{ij}$ is the probability of a transition from state $i$ to state $j$ and $b_j(o_t)$ is the output probability density of state $j$. In practice, only the observation sequence O is known, and the underlying state sequence X is hidden. The likelihood can be expressed by considering only the most likely state sequence, as in Equation 5:

$$\hat{P}(O \mid M) = \max_X \left\{ a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \right\} \tag{5}$$

where $x(0)$ is constrained to be the model entry state and $x(T+1)$ is constrained to be the model exit state. The process is displayed in Fig. 9. Based on Fig. 9, the training set consists of samples corresponding to a particular model, whose parameters can be determined automatically by the HMM re-estimation procedure, provided that a sufficient number of representative samples of each word can be collected. An HMM is trained for each vocabulary word, so that it implicitly models all the sources of variability inherent in real speech. In the recognition phase, the likelihood of each model generating the unknown word is calculated, and the most likely model identifies the word.
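The joint probability along a known state sequence and the most-likely-sequence likelihood can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the transition values in `A` and the one-dimensional Gaussian output densities in `B` are toy numbers invented for the example, and the function names are hypothetical. Scores are kept in the log domain to avoid underflow.

```python
import math

N_STATES = 6  # states 1 and 6 are the non-emitting entry and exit states

# Toy left-to-right transition probabilities a_ij (assumed values).
A = {
    1: {2: 1.0},
    2: {2: 0.6, 3: 0.4},
    3: {3: 0.6, 4: 0.4},
    4: {4: 0.6, 5: 0.4},
    5: {5: 0.6, 6: 0.4},
}
# Toy Gaussian output densities b_j(o) as (mean, variance) per emitting state.
B = {2: (0.0, 1.0), 3: (1.0, 1.0), 4: (2.0, 1.0), 5: (3.0, 1.0)}


def log_a(i, j):
    p = A.get(i, {}).get(j, 0.0)
    return math.log(p) if p > 0.0 else float("-inf")


def log_b(j, o):
    mean, var = B[j]
    return -0.5 * (math.log(2.0 * math.pi * var) + (o - mean) ** 2 / var)


def log_joint(obs, path):
    """log P(O, X | M) for a full path from the entry state to the exit state."""
    assert path[0] == 1 and path[-1] == N_STATES and len(path) == len(obs) + 2
    total = 0.0
    for t, o in enumerate(obs, start=1):
        total += log_a(path[t - 1], path[t]) + log_b(path[t], o)
    return total + log_a(path[-2], path[-1])


def viterbi(obs):
    """Return (max over X of log P(O, X | M), best state sequence)."""
    emitting = range(2, N_STATES)
    delta = {j: log_a(1, j) + log_b(j, obs[0]) for j in emitting}
    back = []
    for o in obs[1:]:
        new_delta, pointers = {}, {}
        for j in emitting:
            best_i = max(emitting, key=lambda i: delta[i] + log_a(i, j))
            new_delta[j] = delta[best_i] + log_a(best_i, j) + log_b(j, o)
            pointers[j] = best_i
        delta = new_delta
        back.append(pointers)
    last = max(emitting, key=lambda i: delta[i] + log_a(i, N_STATES))
    score = delta[last] + log_a(last, N_STATES)
    path = [last]
    for pointers in reversed(back):  # follow back-pointers to the first frame
        path.append(pointers[path[-1]])
    path.reverse()
    return score, [1] + path + [N_STATES]
```

In an isolated word recognizer, `viterbi` would be run once per word model, and the word whose model gives the highest score would be selected.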

Results
The experiment was divided into 2 phases: training and testing. The samples were divided into 2 categories: children and adults. The acoustic training data were selected from both the best speakers and patient speakers to give rich input for the training database. About 42 children and 38 adults were involved. As explained in the previous chapter, the word list for this new corpus design was simplified from 128 words to 64 words. Each speaker was required to speak each word 6 times to keep the wave signal consistent, which brings the training set to 80 × 64 × 6 = 30,720 voice samples. The main concern is to test unknown utterances according to place and manner of articulation in order to detect anomalies in the unknown sample. Table 4 lists the Malay phonemes according to place and manner of articulation, which is a common guideline among linguistic researchers (Celce-Murcia and McIntosh, 1979; Michailovsky, 1994).
Table 4 has two main categories: place of articulation and manner of articulation. Place of articulation is the point of contact in the vocal tract where an obstruction occurs during an articulatory gesture, and manner of articulation is the configuration and interaction of the speech organs when making a speech sound. Of all the elements in Table 4, we focus on two priority groups, alveolar (place) and plosive (manner), because these two groups have the highest likelihood of being articulated wrongly (Jamilah, 2014).
The words in the corpus were categorized by the number of places they test, as listed in Table 5. The largest category is words that test 3 places, which make up almost 50% of the corpus. The smallest category is words that test 4 places, since the corpus design contains few such words; only about 6.25% of the corpus falls into this category.
The recognizer's accuracy is evaluated by string alignment of the recognition results. In this process, the reference transcription is aligned with the recognizer's transcription using dynamic programming (Stein et al., 2001), and the differences are then counted. There are three types of error: substitutions (S), deletions (D) and insertions (I). With the total number of phonemes in the reference transcription, N, and the counts of these three error types, the two informative values of correctness and accuracy can be computed as shown in Equations 6 and 7:

$$\%\,\text{Correct} = \frac{N - D - S}{N} \times 100\% \tag{6}$$

$$\text{Accuracy} = \frac{N - D - S - I}{N} \times 100\% \tag{7}$$

Correctness is the proportion of reference phonemes that are recognized correctly, while accuracy additionally penalizes the excess phonemes inserted by the recognizer.
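The alignment and counting step can be sketched as follows. This is a minimal illustration that assumes unit costs for substitutions, deletions and insertions; scoring tools may weight the error types differently, which can change the alignment chosen. The function names and the example phone strings are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) from a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits aligning ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the end, counting each operation on the optimal path.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def score(ref, hyp):
    """Percent correct, percent accuracy and error rate for one reference/hypothesis pair."""
    hits, subs, dels, ins = align_counts(ref, hyp)
    n = len(ref)
    return {
        "correct": 100.0 * (n - dels - subs) / n,
        "accuracy": 100.0 * (n - dels - subs - ins) / n,
        "wer": (subs + dels + ins) / n,
    }
```

For instance, scoring the reference `s o t o ng` against the hypothesis `s o d o ng a` counts one substitution and one insertion, so correctness is higher than accuracy.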
In addition, the intelligibility of the recognizer was computed using the Word Error Rate (WER), denoted in Equation 8:

$$\text{WER} = \frac{S + D + I}{S + D + C} \tag{8}$$

where S, D and I are gathered from the alignment with the speaker's transcription and C is the number of correct words. As in previous research, a lower WER indicates better recognition accuracy.
The overall results, shown in Table 6, reach almost 55% Correct at the sentence level, based on the number of label files that are identical to the transcription files. The second line of the table gives the word-level result, based on the dynamic programming (DP) string alignment between the label files and the transcriptions. The results reach a correctness of 75.89% and an accuracy of about 66.96%. The error counts for D, S and I are reported with these results. The results were calculated using Equations 6 and 7, respectively.
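As a consistency check on the reported figures, and assuming the usual definitions in which N = S + D + C is the number of reference words, accuracy and WER are complementary. The reported accuracy of 66.96% therefore implies the reported WER of about 0.33, and the gap between correctness and accuracy gives the share of insertions:

```latex
\begin{align*}
\mathrm{WER} &= \frac{S + D + I}{N} = 1 - \frac{N - D - S - I}{N} = 1 - 0.6696 \approx 0.33,\\
\frac{I}{N} &= \frac{N - D - S}{N} - \frac{N - D - S - I}{N} = 0.7589 - 0.6696 = 0.0893.
\end{align*}
```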
Examples of the informative values for each type of error are illustrated in Fig. 10. The example uses 3 pairs of reference and recognition samples from the testing data. The illustration shows how the string alignment yields the numbers of deletions, substitutions and insertions before accuracy and correctness are computed. Compared with the other numbers of states, the baseline state setting achieved the highest sentence-level correctness, about 55%. It also gave the best (lowest) WER, about 0.33, and the highest accuracy, about 66.96%, as shown in Table 7. Figure 11 plots the percentage of accuracy against the number of states. It clearly indicates that increasing the number of states does not necessarily increase accuracy. Several tests were carried out on 80 different speakers, with more than a hundred sample utterances taken from the database.

Conclusion
This study proposes the HMM as a statistical modeling technique in a speech recognition system for diagnosing patients who suffer from articulation disorder. It emphasizes the design of a Malay speech corpus that is balanced across all the places and manners of articulation that appear in the Malay speech articulation therapy environment. The architecture of the speech recognition engine was also described, with some discussion of HMM state likelihood evaluation. The 64-word corpus design was tested with several changes to the HMM settings, including the resulting changes in the probability densities under different state settings. The output shows that the baseline 5-state setting is the best, producing a WER of about 0.33, a correctness of about 75.89% and an accuracy of about 66.96%. In short, a phonetically balanced database with the correct HMM setting can provide a good speech database for the recognizer.

Future Work
Preparing and designing the training corpus involves a lot of work. Collaborating with experts such as speech language pathologists during the design could shorten the process in the future. Techniques for segmenting the training utterances and adjusting the FE settings might also improve recognition accuracy, especially for isolated phoneme recognition. These are directions of future interest for improving this project.