© 2007 Science Publications Arabic Speech Pathology Therapy Computer Aided System

This article presents a computer-aided pathological speech therapy program based on speech models such as the hidden Markov model and artificial neural networks. It aims to help people suffering from language pathologies follow a corrective learning process, with different forms of interactive feedback, and to evaluate the evolution of the illness or of the therapy. As a first case we dealt with the Arabic occlusive sigmatism, which is the inability to pronounce [s] or [ʃ]. The results obtained are satisfying, and the therapy program is ready for autonomous use by patients and for deeper analysis and verification.


INTRODUCTION
Computer-aided therapy software is a new concept in the field of medicine, especially in speech processing. It introduces different algorithms into illness correction and defect detection. This approach is mainly used in hospitals before surgery [1,2] or for defect correction through repeated learning sessions.
When treating a speech pathology, therapists try to detect the different defects with non-invasive methods, in order to preserve the real functioning conditions. The speech therapist intervenes in the treatment, using his or her knowledge and acquired accuracy to detect the defects and correct them, before and/or after surgery.
Speech recognition techniques are well suited to the implementation of a computer-aided therapy system used in speech correction or language learning [3]. These techniques are based on different approaches, such as Gaussian mixture models (GMM) [4], artificial neural networks (ANN) [5,6], hidden Markov models (HMM) [7], or hybrid techniques like ANN/HMM [8] and HMM/GMM [9]. In this work, we used a complementary technique to obtain a more robust recognition system, as an extension of our previous work based only on HMM/GMM [10]. We first focus on the speech pathology related to our problem. The following part is the material and methods section, and the third section concerns the modeling parameters. The fourth part concerns the automatic word segmentation procedure, followed by a discussion of the results and a conclusion.

SPEECH PATHOLOGY
Speech pathology deals essentially with the detection of the faulty pronunciation areas, in order to measure the degree of the illness for a future adequate treatment.
The organs affected by this pathology fall into two main groups. The first group concerns peripheral parts, such as the harelip, cleft palate, velar insufficiency, lingual defects, etc., which mechanically constrain the movements involved in the production of the phonemes, or problems related to the inability to articulate phonemes in a systematic and permanent way [11]. The second group of pathologies concerns the source, which is related to the vocal cords; this direction is not considered in this article. Generally, defects of the phonetic system give rise to different pathologies, such as lisping, hissing, rhotacism, stammering, gammacism, etc.

MATERIAL & METHODS
This work is similar, in its problem definition, to the methods used to teach English to non-native Taiwanese or Japanese speakers [3,12]. These studies are oriented toward the detection of badly pronounced phonemes; this defect is common among these speakers because the phonemes are absent from their native languages.
Different programs are available in the speech pathology domain, but none treats the Arabic language or the occlusive sigmatism [13].
We tried to follow the same procedure in order to arrive at a similar approach for the occlusive sigmatism, in which the air flows from one side of the mouth because of a bad placement of the tongue. The similarity lies in regarding the task either as learning a phoneme in a new language context or as learning how to pronounce a faulty phoneme in a pathology context.
The pathology we are dealing with concerns the lingual interposition in the production of Arabic phonemes such as [ ]. Because Arabic speech pathology databases are lacking, the examples are taken from our own recorded database. For this article we have chosen the sample word [ aχsija], where the occlusion pathology is highly significant. This word was pronounced by a patient presenting the speech pathology; the transcribed heard sounds are shown in table 1. The second word is given as an example to illustrate the confusion that may happen in the pronunciation of other words.
These phonemes are easily confused, owing to their close places of articulation.
Note the changes at the phonemic level, which yield a nearby word; that is the correction target. In both situations, the heard sounds give the impression of the right word but with a phoneme 'shift': the tongue is shifted from the post-alveolar to the alveolar position, sometimes ending in the dental position, so that a faulty phoneme is pronounced. The therapy methodology, shown in the flowchart of figure 1, also describes how other similar pathologies can be treated as soon as the corresponding correct and pathological databases are available. The choice of the parameters modeling the healthy or pathological speech depends essentially on the desired recognition level.
In this work, we used two approaches: a global word segmentation approach based on a hidden Markov model (HMM), then a specific phoneme-level recognition, based on a neural network rating process, in order to distinguish phonemes from their neighbors, as shown in figure 1.

MODELLING PARAMETERS
Choice of the cepstral coefficients: To make the recognition robust, we worked within a moderately noisy environment, since the tool will be used in a hospital, or even at home, where the noise level is difficult to control or keep low. The Mel Frequency Cepstral Coefficients (MFCC) are a good compromise: they are robust to noise and well decorrelated, and they model the hearing system as a filter bank, as the cochlea does.
These coefficients have been used intensively in the recognition of English digits in a noisy environment [14], in medical monitoring for the detection of different sounds [15], in the comparison of parameters for Arabic language recognition [7], in the detection of pathological speech [16], in acoustic modeling techniques for embedded systems [17] and in speaker recognition [18]. They have shown good results in the identification of different complex phonetic features of the Arabic language [7], and especially in the recognition and verification of Japanese students' pronunciation while learning English [3].
The number of MFCC coefficients is set to 13 and then reduced to 12, considering that the first coefficient is the energy of the frame, which does not contribute to the discrimination process in our situation. The other 12 coefficients represent the spectral envelope without the high frequencies.
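As an illustration of the feature extraction described above, the following is a minimal NumPy sketch that computes 13 cepstral coefficients per frame and then drops the first one. The sampling rate, frame length, hop size, FFT size and number of mel filters are assumptions chosen for the example; they are not taken from this article, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # All default values here are assumptions for illustration.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II of the log mel energies gives the cepstral coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1)
                   / (2 * n_filters))
    return log_mel @ basis.T

def mfcc_12(signal, **kwargs):
    # Drop coefficient 0 (frame energy term) and keep the other 12.
    return mfcc(signal, **kwargs)[:, 1:]
```

Each row of the returned array is one frame and each column one of the 12 retained coefficients.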
Choice of the modeling parameters: The Arabic language contains 34 phonemes; one Markov model may represent each phoneme [19].
The Markov representation known as the hidden Markov model (HMM) allows the information within a corpus to be synthesized via learning. The different occurrences are represented as probabilistic transitions within a graph model; each graph can be a phoneme or a word, depending on the required level of recognition.
A large set of data is very important in order to obtain good selectivity in terms of the multi-Gaussian distribution. The total set of data, called 'O', is approximated by a Gaussian mixture model for two main reasons. The first is compression: the most important parameters are the means and the variances of the Gaussians, together with their degree of participation in the model. The second is that the natural distribution of real-world data is well approximated by the Gaussian model.
The equation relating the data to the GMM is as follows:

$$p(O \mid \Theta) = \sum_{i=1}^{T} c_i \, \mathcal{N}(O;\ \mu_i, \sigma_i^2)$$

and:

$$\sum_{i=1}^{T} c_i = 1, \qquad c_i \ge 0$$

where:
O: represents the acoustic observations
Θ: represents the GMM, with $c_1 \dots c_T$ being respectively the degrees of participation of the sub-models 1 … T, with means and variances $\{\mu_i, \sigma_i^2\}$ respectively
T: the number of sub-models chosen.

We initially used the HMM/GMM approach for the automatic word segmentation, with different topologies defined by the number of states in the HMM and the number of Gaussians associated with each state, then an ANN to distinguish between nearby phonemes.
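The Gaussian mixture likelihood described above can be evaluated numerically as in the following sketch. It assumes diagonal covariances (one variance per dimension and per sub-model) and works in the log domain for numerical stability; both choices, and the function name, are assumptions made for the example rather than details given in this article.

```python
import numpy as np

def gmm_log_likelihood(obs, weights, means, variances):
    """Total log-likelihood of the frames in `obs` under a diagonal GMM.

    obs:       (N, D) observations, one frame per row
    weights:   (T,)   mixture weights c_i, summing to 1
    means:     (T, D) means of the T sub-models
    variances: (T, D) variances of the T sub-models
    """
    obs = np.atleast_2d(obs)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = obs[:, None, :] - means[None, :, :]            # (N, T, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_weighted = np.log(weights) + log_gauss            # (N, T)
    # Log-sum-exp over the sub-models, then sum over the frames.
    m = log_weighted.max(axis=1, keepdims=True)
    per_frame = m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))
    return float(per_frame.sum())
```

With 12-dimensional MFCC frames, `obs` would have 12 columns and `weights` would have 4, 8 or 12 entries depending on the mixture size under test.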
The chosen HMM topology is a right-to-left word model, which allows the speech to progress in one direction only, with 4, 8 and 12 Gaussian mixture components associated with each state or phoneme.
From some previous studies [7,201], a model with 3 right-to-left states is also adequate for each phoneme, which yields an HMM 3 times longer.
The GMM/HMM model is built with the Baum-Welch algorithm, a variant of the Expectation Maximization (EM) algorithm [20], which computes the expectation of an unknown (hidden) random variable given a known (observed) random variable. The method is iterative and runs until the model fits the data reasonably well. This can be seen in the adjustment of the parameters of both the HMM and the GMM.
The global model is defined by its parameter set, in which π represents the initial probabilities of the HMM. In order to model the data, it is required to find the best path through the transition matrix, that is, the state sequence of maximal probability. The Viterbi approach is used because, instead of summing over all the probability paths, only the maximal probability at each state is kept; this technique relies on the fact that an optimal path is composed of optimal sub-paths.
At each step of the EM algorithm, we check whether the new model improves the fit to the data, that is, whether the data fits the model better at the n-th step than at the (n-1)-th step. Initially we start from equation (7), which represents the relative frequency of being in the state q_i at the initial time. The parameters are then updated by completing the required data; the last of the update equations gives the variance of each Gaussian distribution within the state q_i.
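The Viterbi search described above can be sketched as follows. This is a generic log-domain implementation, not the authors' code: the argument names are hypothetical, and the emission scores `log_B` are assumed to have been computed beforehand, for instance from the GMM attached to each state.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path of an HMM, in the log domain.

    log_pi: (S,)   log initial probabilities
    log_A:  (S, S) log transition probabilities, log_A[i, j] for i -> j
    log_B:  (T, S) log emission likelihood of each frame in each state
    Returns the best path (length T) and its log-probability.
    """
    n_frames, n_states = log_B.shape
    delta = np.empty((n_frames, n_states))
    psi = np.zeros((n_frames, n_states), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)           # best predecessor of each j
        delta[t] = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state.
    path = np.empty(n_frames, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n_frames - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, float(delta[-1].max())
```

For a unidirectional topology, the forbidden backward transitions simply carry a log-probability of minus infinity in `log_A`, so the returned path never moves back to an earlier state.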

ANN phoneme recognition/distinction:
The multilayer perceptron (MLP) has been widely used in the area of pattern recognition, namely document recognition, image processing and speech recognition [5,6]. It is one of the most successful neural networks, known for its discrimination capability in pattern classification. Given enough hidden units it can approximate any continuous function, so such a model can easily associate an input shape with its class in a supervised manner.

Fig.2: Multilayer network
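A multilayer network of this kind, trained by gradient descent on the quadratic output error, can be sketched as below. The sigmoid activation, the weight initialisation, the step value `eta` and the function names are assumptions made for the example; the article does not specify them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_mlp(sizes, seed=0):
    # sizes such as [12, 32, 32, 4]: inputs, hidden layers, output classes.
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # Returns the activations of every layer, input included.
    acts = [x]
    for W, b in params:
        acts.append(sigmoid(acts[-1] @ W + b))
    return acts

def train_step(params, x, target, eta=0.1):
    """One gradient step on the quadratic error for a single sample."""
    acts = forward(params, x)
    error = 0.5 * np.sum((acts[-1] - target) ** 2)
    # Error signal at the output layer (derivative of the sigmoid included).
    delta = (acts[-1] - target) * acts[-1] * (1.0 - acts[-1])
    new_params = list(params)
    for l in range(len(params) - 1, -1, -1):
        W, b = params[l]
        grad_W = np.outer(acts[l], delta)
        grad_b = delta
        if l > 0:
            # Propagate the error signal to the previous layer.
            delta_prev = (delta @ W.T) * acts[l] * (1.0 - acts[l])
        new_params[l] = (W - eta * grad_W, b - eta * grad_b)
        if l > 0:
            delta = delta_prev
    return new_params, error
```

Calling `init_mlp([12, 32, 32, 4])` builds a network with 12 inputs, two hidden layers of 32 neurons and 4 outputs; the returned `error` is the value measured before the weights are updated.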
The ANN is based upon two processes, training and test. During training, the weights are updated by the gradient method, where the vector X(k) represents the direction of the minimum, "l" the index of the layer and "η" the positive constant step of the learning process (training speed) required to minimise the quadratic output error, defined from the difference between the desired value and the computed one [21].

AUTOMATIC WORD SEGMENTATION
Best word pronunciation: Our automatic speech therapy process is divided into two parts: a best word pronunciation score is given via the EM algorithm, then a phoneme deviation rate is generated by an ANN.
Initially we concentrate on the whole word, in order to adapt the tongue in a global context; then we tune the recognition at the phonemic level to distinguish between neighboring phonemes.
We tested different HMM/GMM models to find a compromise between speed and precision, using 12 MFCC coefficients together with their first (Delta) and second (Delta-Delta) dynamic components, with mixtures of 4, 8 and 12 Gaussians and a maximum of 35 iterations of the EM algorithm, as shown in figure 3.
Figure 4 shows that behind the HMM automatic segmentation there is mainly a phoneme correspondence, which is used to compute a second deviation score added to the word evaluation score, in our case the log-likelihood, as presented in figure 5. The best pronunciation, as shown in figure 5, corresponds to the highest log-likelihood, in this case at trial 25.
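The feature vector made of 12 MFCC coefficients with their Delta and Delta-Delta components can be assembled as in the sketch below. It uses a simple central difference between neighboring frames as the derivative; this is an assumption made for brevity, since the article does not state how the dynamic components were computed, and the function name is hypothetical.

```python
import numpy as np

def add_deltas(features):
    """Append first and second time derivatives to a (frames, coeffs) array."""
    delta = np.gradient(features, axis=0)    # first dynamic component
    delta2 = np.gradient(delta, axis=0)      # second dynamic component
    return np.hstack([features, delta, delta2])
```

For an input of 12 coefficients per frame, the result has 36 columns per frame.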
To help the patient pronounce well, different trials are recorded, analyzed and then fed back each time to the therapist as well as to the patient; a global view is then presented visually to help the correction.
This score may be given to the therapist in different forms, such as a grade out of 100.
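One possible way to turn the per-trial log-likelihoods into a grade out of 100 is sketched below. The min-max scaling used here is purely an assumption for illustration; the article does not say how the grade is derived, and the function name is hypothetical.

```python
def grade_trials(log_likelihoods):
    """Return the index of the best trial and a 0-100 grade for each trial.

    Assumption: grades are obtained by min-max scaling of the
    log-likelihoods across the recorded trials.
    """
    lo, hi = min(log_likelihoods), max(log_likelihoods)
    best = log_likelihoods.index(hi)          # highest log-likelihood
    if hi == lo:
        return best, [100.0] * len(log_likelihoods)
    grades = [100.0 * (ll - lo) / (hi - lo) for ll in log_likelihoods]
    return best, grades
```

Such a relative grade only compares the trials of one session with each other; comparing sessions would need a fixed reference, which is not covered by this sketch.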
After some hours of recording, which might extend over days or even weeks, each session being followed by the necessary visual and hearing feedback (either the patient's own recording, so that he can hear his pronunciation, or a good pronunciation with videos and/or images of the tongue and lip movements), we noticed that the patient started to pronounce the pathological word well, as shown in figure 6. Both the worst and the best pronunciations are illustrated, aiming to capture the visual change in the speech therapy process; this can also help the patient manage the word or phoneme stress.
At this level, a deeper analysis is undertaken to answer the following questions: which phoneme has really been pronounced, to what extent is it morphing toward the good one, and how reliable is our system in ensuring correctness within a limited number of trials?
Phoneme level recognition: We noticed that, throughout the segmentation process, we dealt with [ a] instead of [ ]; this is mainly due to the co-articulation, which tends to decrease the effectiveness of the automatic segmentation compared to the manual process. The latter is based upon visual and hearing cues as well as human experience, while the automatic process is mainly based on distributions and probabilities; our aim is to build a small bridge between the two approaches.
In order to get a good deviation rate score, we tested different configurations. The first configuration has 12 inputs, one hidden layer and 4 output classes, while the second has 12 inputs, two hidden layers and 4 output classes. Both networks used the same TIMIT database, with 70% as training data and 30% as testing data, as shown in table 2 and table 3. For the network with one hidden layer, we obtained the recognition rates shown in table 2.
From the above results, the chosen configuration is the network made of 2 hidden layers of 32 neurons each and 4 outputs. Nevertheless, this is not a final rate or score, because we still do not know the real phoneme deviation. That is why a second process is implemented, based on 2 parallel networks that best detect the neighborhood limit of the phonemes, giving a kind of phonemic distance or deviation rate.

RESULTS AND DISCUSSION
We were concerned, initially, with the first and fourth phonemes of the word C1; the second phoneme [a] (a vowel) is absorbed by the [ ], while the third phoneme is of no concern in this pathology. We compared several occurrences or trials in order to measure the degree of correction and similarity achieved by the patient.
During the therapy process, different other words were used in order to reinforce the phoneme stress. Correction occurred mainly when the patients heard themselves, even when the pronunciation was faulty; this helped a lot in the word segmentation. As a second point, the more occurrences of the good pronunciation the database contains, the better the segmentation and the phoneme recognition. That is why the phoneme database was partly taken from the TIMIT source.
At the end of the speech therapy process, we noticed that patients corrected themselves during the training.

CONCLUSION
In this work, we developed a new approach to a computer-aided therapy system for Arabic speech pathology. We designed an automatic application that helps in the word segmentation and phoneme recognition of some well-selected words, identifying the targeted illness using a simple computer and a basic recording tool.
This therapy approach is based upon visual and hearing feedback; it may help the patient as well as the therapist to find the faulty speech regions and to follow the evolution of the illness, through a history or log file concept as well as comparative methods. The whole system is designed to be autonomous, with less and less need for the continuous presence of the therapist.
We focused initially, as a new trend, on the replacement of the phoneme [ ], badly pronounced as [ ] or [s].
The designed system allows an automatic recognition of different pathological phonemes, with a deviation rate score that indicates how a neighboring phoneme is pronounced instead of the original phoneme.
Let us also remark that, from our experiments, we deduced that beyond the computer process, the visual feedback has a strong impact on the speech process as well as on the psychological side of the experiment. In fact, patients kept asking: 'Why did we not try this before?', and this helped us a lot in the speech therapy.