Difficulties of Standard Arabic Phonemes Spoken by Non-Arab Primary School Children based on Formant Frequencies

Problem statement: Studies of Arabic phonemes as spoken by Malaysians are rare, which makes reference work difficult. No specific guideline for Malaysian subjects is available, even though much acoustic and phonetic research has been done on other languages such as English, French and Chinese. Approach: This study discusses the correct and simplest way of pronouncing Arabic phonemes in a Malay accent. The International Phonetic Alphabet chart for Arabic was used as the reference for every speech sample recorded from Malaysian children, in order to locate the place of articulation (makhraj point) of each letter. The recorded sounds were analysed to determine the origin of each letter by measuring its formant frequencies. The consonants of Standard Arabic (SA) were studied and the appropriate place of articulation of every phoneme was assessed through its formants. Results: Only seven out of 25 SA consonants in the children’s samples did not give the appropriate formant values. These phonemes are /kof/, [ق], /zo/, [ظ], /kho/, [خ], /gheyn/, [غ], /ha/, [ح], /ain/, [ع] and /ha/, [ ], which are considered the SA phonemes most difficult for Malaysian children to utter. Conclusion/Recommendations: The values obtained are used as the reference database for our recognition system.


INTRODUCTION
A spectrogram is a well-known visual representation of a time-varying signal. Through a spectrogram, the signal can be interpreted according to its colour distribution, time and frequency. Each narrow vertical slice of the spectrogram shows the spectrum of the signal at that instant. The robustness and convenience of the spectrogram in speech applications are well established in the literature (Laprie and Berger, 1996; Hatazaki et al., 1989; Silverman and Lee, 1987; Hunt, 1987; Connolly and Edmonds, 1986; Zue and Lamel, 1986; Zue and Cole, 1979; Mporas et al., 2007).
Using the spectrogram, Awais et al. (2006) described the detection of phoneme boundaries and the identification of each segment as a pause, vowel or consonant. Their results for segmenting continuous Arabic speech using an FFT spectrogram were excellent: every phoneme from 10 speakers was classified as a consonant or a vowel with an overall accuracy of 95.39%.
Therefore, this study uses the robustness of the spectrogram of each SA phoneme signal to detect the formant frequencies and to study the waveform patterns. The formant frequency Fn of a phoneme is obtained according to Eq. 1 (Fant, 1960; Johnson, 2003; Rabiner and Juang, 1993; Mohd et al., 2009):

F_n = \frac{(2n - 1)\,c}{4L} \qquad (1)

where L is the length of the vocal tract, measured from the place of articulation (makhraj point) of the sound to the lips, n is the formant number and c is the speed of sound in warm, moist air (35000 cm/sec). Based on Eq. 1, since L is the length of the vocal tract and n and c are fixed, the formant frequency Fn will be lower the further into the mouth the constriction is located. For example, phoneme /ain/, [ع] (pharyngeal place of articulation) will possess a greater formant frequency than /tsa/, [ث] (dental place of articulation), as reported by Ali et al. (2001).
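As a rough illustration of how a resonance frequency depends on tube length, the following Python sketch assumes that Eq. 1 is the standard quarter-wavelength model, F_n = (2n - 1)c/(4L), for a uniform tube closed at one end. The function name and the 17.5 cm length are hypothetical and are not taken from this study; only the speed of sound is the value quoted above.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # value quoted in the text for warm, moist air


def formant_frequency(n: int, length_cm: float,
                      c: float = SPEED_OF_SOUND_CM_PER_S) -> float:
    """Return the n-th resonance in Hz of a tube closed at one end.

    Assumes the quarter-wavelength model F_n = (2n - 1) * c / (4 * L).
    """
    if n < 1:
        raise ValueError("formant index n must be >= 1")
    if length_cm <= 0:
        raise ValueError("tube length must be positive")
    return (2 * n - 1) * c / (4.0 * length_cm)


if __name__ == "__main__":
    # 17.5 cm is a hypothetical length chosen only to give round numbers.
    for n in range(1, 5):
        print(f"F{n} = {formant_frequency(n, 17.5):.0f} Hz")
```

Under this model a shorter tube always gives higher resonances, so any comparison between phonemes depends on which cavity length is taken as L.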
Every speech sound has its own unique waveform pattern. Therefore, to obtain the spectrographic image of a speech sound, the speech data must be compressed and undergo a signal acquisition process or digital signal processing (Smith, 1997; Mitra, 2006).

Using the Fourier transform technique, the spectrogram is computed from Eq. 2:

X_m(e^{j\omega}) = \sum_{n} x(n)\, w(m - n)\, e^{-j\omega n} \qquad (2)

Here x(n) is the speech signal, w(n) is a window of length L and X_m(e^{j\omega}) is the collection of Fourier transforms of the windowed segments of x(n), where n is the sample index and m is the index of the windowed frame. Equation (2) is visualized as an image with the convention that frequency increases from bottom to top and time increases from left to right. The pixel value at each point in the spectrogram is proportional to the magnitude of the spectrum at a given frequency and time.

An experiment by Jackson (2001) on three voiced (/b, d, g/) and three unvoiced (/p, t, k/) English consonants showed that formant F4 of English plosives is less reliable, while the values of F2 and F3 can distinguish the consonants according to their place of articulation: the further back the place of articulation, the higher F2.

Iqbal et al. (2008) used formant frequencies to identify Arabic vowels and the nasal formants of /lam/, [ل], /mim/, [م] and /nun/, [ن]. The subjects were experts in Quranic recitation, and the average accuracy was 90%.
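The spectrogram construction of Eq. 2 can be sketched in Python as a magnitude short-time Fourier transform. This is only an illustration, not the procedure used by the analysis software in this study: the use of NumPy, the Hamming window, the frame length of 256 samples, the hop of 128 samples, the 8 kHz sampling rate and the 1 kHz test tone are all assumptions made for the example.

```python
import numpy as np


def spectrogram(x, frame_len=256, hop=128):
    """Magnitude STFT: rows are frequency bins, columns are windowed frames."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal shorter than one frame")
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Cut the signal into overlapping frames and apply the window to each.
    frames = np.stack([x[m * hop: m * hop + frame_len] * window
                       for m in range(n_frames)])
    # rfft keeps the non-negative frequencies; transpose so that
    # frequency runs along axis 0 and time along axis 1.
    return np.abs(np.fft.rfft(frames, axis=1)).T


if __name__ == "__main__":
    fs = 8000                          # hypothetical sampling rate in Hz
    t = np.arange(fs) / fs             # one second of samples
    tone = np.sin(2 * np.pi * 1000 * t)
    S = spectrogram(tone)
    freqs = np.fft.rfftfreq(256, d=1 / fs)
    print(S.shape, freqs[S[:, 0].argmax()])
```

Plotting the returned array with the first axis vertical gives the image convention described above, with frequency increasing upward and time increasing to the right.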
The following parts discuss the SA phonemes according to their manner of production. Phonemes that share a manner of production show the same spectrogram characteristics because of the way they are pronounced. The subsequent sections describe the methods used, present the results and conclude the findings.
Fricative phonemes: Fricatives form the largest set in SA, containing thirteen phonemes. A fricative sound is produced as air escapes through a narrow constriction in the mouth (Fant, 1960; Johnson, 2003; Rabiner and Juang, 1993; Hassan, 1984). The classifications of the fricatives, with their phonetics as pronounced by Malay subjects, are as follows:

Plosive phonemes: The plosives consist of nine phonemes in SA, the second largest set after the fricatives. A plosive is a sound produced by stopping the airflow in the mouth (Fant, 1960; Johnson, 2003; Rabiner and Juang, 1993; Hassan, 1984). Plosives are sometimes called stops. They include bilabial, dental, velar, alveolar, interdental and uvular phonemes. The plosive classifications, with their pronunciations in a Malay accent, are as follows:

Nasal phoneme: In SA, only two phonemes are categorized in this group. A nasal is produced with a lowered velum in the mouth, allowing air to escape freely through the nose (Fant, 1960; Johnson, 2003; Rabiner and Juang, 1993; Hassan, 1984). The nasal phonemes are bilabial: /mim/, [م] and dental: /nun/, [ن].

Approximant phonemes: The approximant manner of articulation comprises two SA phonemes. Both approximants are semivowels, which are produced by bringing one articulator (alveolar) in the vocal tract close to another (Fant, 1960; Johnson, 2003; Rabiner and Juang, 1993; Hassan, 1984). The approximant phonemes are palatal: /ya/, [ى] and velar: /wao/, [و].

Trill phoneme: Only one phoneme is categorized as a trill, which is produced by vibration of the tongue against some other part of the mouth (alveolar). The trill phoneme is the alveolar /ro/, [ر].

Lateral phoneme: The lateral group also consists of a single SA phoneme. The phoneme is produced by raising the tip of the tongue against the roof of the mouth so that the airstream flows past one or both sides of the tongue (Johnson, 2003; Rabiner and Juang, 1993). The lateral phoneme is /lam/, [ل].

Figure 1 shows the distribution of the places of articulation in SA and Table 1 lists the SA phonemes under consideration with their manner and place of articulation (Awadalla et al., 2005).

MATERIALS AND METHODS
The speech signals were recorded in a quiet room at a tuition centre. All of the subjects were Malaysian children aged seven to eleven years. The children, ten boys and fifteen girls, were randomly selected. Their background in Arabic phonemes is unknown. The speech data were collected and analysed in order to determine how accurately Malaysian children pronounce Arabic phonemes. Other requirements of the project are as follows:
• Data collection and data analysis were carried out separately.

During the recording session, each child was asked to pronounce all 28 Arabic phonemes sequentially. Recording started a few seconds before the child uttered the first phoneme and continued until all of the phonemes had been pronounced, stopping a few seconds after the last phoneme was spoken. Speech recordings from 25 children were gathered.
The first step in data analysis was to group the recorded speech sounds by phoneme. Using Goldwave software, the recording of each child was cut into segments of one second. Therefore, a total of 25 samples was collected for each phoneme. SFS software was then used to analyse the data. A wideband spectrogram was produced for each sample and the formant frequencies (Fn) were determined. Averages were calculated because each phoneme has more than one measured value. All of the average formant frequencies are summarized in Results and Discussion. Each average is taken over the 25 samples of a phoneme using Eq. 3:

\bar{F}_n = \frac{1}{25} \sum_{i=1}^{25} F_{n,i} \qquad (3)

where F_{n,i} is the n-th formant frequency of the i-th sample, n = 1, 2, 3, 4, F1 is the lowest formant and F4 is the highest formant measured in the study. Figure 2 summarizes the flow of the method used in this study and Fig. 3 shows the algorithm used to obtain the spectrogram.
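The averaging step can be sketched in Python as follows. The function name and the input layout (one tuple of four formant values per child) are hypothetical choices made for illustration, and the numbers in the usage example are placeholders rather than measurements from this study.

```python
from statistics import fmean


def average_formants(samples):
    """Average each formant F1..F4 over all samples of one phoneme.

    `samples` is a list of (F1, F2, F3, F4) tuples in kHz, one per child.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Average the n-th formant across every sample, for n = 1..4.
    return tuple(fmean(sample[n] for sample in samples) for n in range(4))


if __name__ == "__main__":
    # Placeholder values for two children; the study used 25 per phoneme.
    demo = [(0.5, 1.5, 2.5, 3.5), (1.0, 2.0, 3.0, 4.0)]
    print(average_formants(demo))
```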

RESULTS AND DISCUSSION
The pattern of every phoneme was observed in the spectrogram. The waveforms were compared within each manner-of-articulation group and the formant frequencies were observed. The average value of each formant is summarized in Tables 2, 4, 6, 7 and 9 according to place of articulation. The waveform of a fricative sound is shown in Fig. 4. Because a fricative is produced by a jet of air flowing through a constriction in the vocal tract, the sound begins with hissing, as air hitting the front teeth causes turbulence at all frequencies, before the formants can be seen clearly at about 450 ms, as circled in the spectrogram.
According to the literature, F1 and F4 are not reliable. This is confirmed here: their frequencies increase and decrease irregularly rather than following the place of articulation. As summarized in Table 2, the value of F2 increases from bilabial to post-alveolar phonemes. As the place of articulation moves further into the mouth, the value increases and decreases inconsistently. F3 increases from the dental phonemes to the velar phoneme (/kho/, [خ]) but decreases and increases inconsistently for the velar phoneme (/gheyn/, [غ]) and for places of articulation further inside the mouth. The coronal phoneme (/sod/, [ص]) is pronounced with the blade of the tongue raised and the lips rounded, and its place of articulation is considered to be longer than that of /shin/, [ش]. Therefore, the value of

From the results, only seven fricative phonemes (the front fricative consonants) agree with Jackson (2001). The other six fricative phonemes (the back fricative consonants) do not follow his findings. The results show that F2 and F3 can distinguish the place of articulation for front fricative consonants, since these formants increase from bilabial to post-alveolar, while F1 and F4 are less reliable for the same purpose, as summarized in Table 3.
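The trend check described above, whether a formant rises consistently as the place of articulation moves from the lips inward, can be sketched in Python as a simple monotonicity test. The function name is hypothetical and the values below are placeholders for illustration only; they are not the averages reported in Table 2.

```python
def is_monotonic_increasing(values, tolerance=0.0):
    """True if each value is at least the previous one minus `tolerance`."""
    return all(b >= a - tolerance for a, b in zip(values, values[1:]))


# Placeholder F2 averages in kHz, ordered from the lips inward.
# These are NOT the values from Table 2.
front_f2 = [1.8, 1.9, 2.1, 2.3]   # front places of articulation
back_f2 = [2.0, 1.7, 2.2, 1.6]    # back places of articulation

print(is_monotonic_increasing(front_f2))
print(is_monotonic_increasing(back_f2))
```

A non-zero `tolerance` would allow for small measurement noise before a sequence is judged inconsistent.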
In Fig. 5, the waveform of the plosive consonant (/baa/, [ب]) starts with a burst before the waveform becomes clearly visible on the spectrogram. This is because plosive consonants are produced by stopping the airflow in the vocal tract.
The average value for each phoneme is summarized in Table 4. Neglecting the changes in F1 and F4, the value of F3 increases from 3.9 kHz for the dental phonemes to 4.4 kHz for the velar phonemes.
The formants F2 and F3 of the uvular phoneme /kof/, [ق] are lower than those of the velar phonemes /kaf/, [ك] and /jim/, [ج]. Thus only one phoneme (/kof/, [ق]) does not follow the trend described above. This shows that Malaysian children mostly have difficulty articulating the uvular plosive correctly, as discussed by Abdul-Kadir et al. (2010). A comparison between previous findings and the current results is summarized in Table 5.

Fig. 6 shows the waveform of /mim/, [م] and its spectrogram. In the spectrogram, the starting region is less dark than the middle region because, for nasal pronunciation, the vocal tract is divided into a nasal branch and an oral branch. Therefore, more antiresonances are produced by the interference between these two branches. Antiresonances in the vocal tract cancel nearby formants, which therefore appear weak in the spectrogram.
From Table 6, the values of F1 and F3 for /mim/, [م] decrease from the front to the end of the phoneme, while F2 and F4 decrease from front to middle but increase towards the end. For /nun/, [ن], F1 increases from front to middle and decreases from middle to end, whereas F2, F3 and F4 decrease from front to middle and increase from middle to end. The difference between the pronunciation of the nasal consonants /mim/, [م] and /nun/, [ن] lies in the lip rounding when the phonemes are pronounced.

The effect of the blocked oral tract in the centre and the effect of the side branches can be seen in the spectrogram in Fig. 7. From the spectrogram, the average values for /l/ are F1 = 0.4 kHz, F2 = 2.3 kHz, F3 = 3.8 kHz and F4 = 4.9 kHz, as summarized in Table 7. From front to middle (to produce the /la/ sound), F1 increases from 0.4 kHz to 1.1 kHz, while F2, F3 and F4 decrease to 1.9 kHz, 3.5 kHz and 4.6 kHz respectively. To complete the pronunciation of /lam/, [ل], the formant changes from middle to end are the reverse of those from front to middle.
For both the nasal and lateral phonemes, F2 is outside the range reported by Iqbal et al. (2008), as shown in Table 8. This might be an effect of vocal tract length, since this study used children aged 7 to 11 years, while Iqbal et al. used experts in Quranic recitation aged 15 to 30 years.
A trill consonant is produced by vibrating the tongue against the alveolar ridge. From the waveform and spectrogram in Fig. 8, it can be observed that the formants decrease, except for F4. This is due to the way the phoneme is pronounced: less vibration (friction) can be heard as the tip of the tongue curls back towards the mouth. In the spectrogram, F4 is observed to rise drastically.
The average values of F1, F2, F3 and F4 are 0.6, 1.6, 3.8 and 5.3 kHz respectively, as summarized in Table 9.
The consonant /ro/, [ر] was examined through the spectrogram only. The formant transition of F4 increases drastically compared with the other formants in the same spectrogram, as found by Ladefoged (2003) (see Table 10).

CONCLUSION
In conclusion, the results were compared with previous research findings according to manner of articulation. The 25 consonants considered in this study fall into five types. For most manners of articulation, F2 and F3 are reliable indicators of the exact place of articulation. The first and fourth formants (F1 and F4) cannot be relied on.