Frontal Arabic Fricative Consonants Characteristics among Primary School Children: Spectral Density Function Approach

Problem statement: Arabic speech recognition systems remain underdeveloped, and systems built for Malaysian speakers are rarely found. Using speakers of non-Arab nationality, the set of frontal fricative Arabic consonants is studied as an early step toward developing a recognition system for Malaysians. Approach: The characteristics of each phoneme were studied based on its spectral density function. The features considered were the peaks and bandwidth that appear in the spectrum of each consonant speech sample. The values of each feature were averaged to obtain a reliable reference value, which will serve as the baseline for the selected database later. Results: The bandwidth of the first peak increased as the consonants moved from the outer part (labiodental, 3.6 kHz) to the inner part of the mouth (alveolar, 3.8 kHz), while the peaks appeared at lower frequencies as the consonant group moved further into the mouth. Conclusion/Recommendations: The values obtained were used as the reference for the database of our recognition system. Only suitable consonant samples were chosen to be stored in the database to improve the system accuracy.


INTRODUCTION
In Arabic speech, fricatives form the major group, which is composed of 13 consonants according to the International Phonetic Alphabet (IPA) system (http://en.wikipedia.org/wiki/Wikipedia:IPA_for_Arab). A fricative is a sound produced by passing air through a narrow constriction that makes the airflow turbulent and noisy. The thirteen consonants span eight places of articulation, but only seven consonants are considered in this study. Figure 1 shows the places of articulation, while Table 1 lists the consonants.
As seen in Fig. 1, the dental, alveolar, post-alveolar and palatal places are located in about the same region. Even though palatal consonants use the tongue to produce the correct sound, they are articulated in the middle of the oral tract, as are alveolar and post-alveolar consonants. Thus the correct pronunciations are nearly the same, and it is not easy to hear whether a pronunciation is correct. Therefore, the speech signals are transformed into spectra in order to read the behavior of each sample at the different places of articulation, especially for the frontal consonants.
The common method for distinguishing speech samples is spectrogram reading, which has proved to be a powerful method for obtaining the formants of each sample and has long been used in speech processing (Zue and Cole, 1979; Johannsen et al., 1983; Hassan, 1982; Zue and Lamel, 1986; Debyeche et al., 1998; Johnson, 2003; Awadalla et al., 2005; Awais et al., 2006; Steinberg and O'Shaughnessy, 2008; Iqbal et al., 2008; Abdul-Kadir et al., 2010; Mporas et al., 2007; Sumithra et al., 2011; Abdul-Kadir et al., 2011a).
Table 1: The consonants, grouped by cavity (front, back)
As an alternative, the power spectral density of each phoneme is computed and the graphical illustration of the spectrum is studied. The peak, bandwidth and trough are determined and the average of each is set as the reference. The amplitude of the spectral peak decreases as the place of articulation moves further into the mouth, while the bandwidth becomes wider. In addition, the tilt at the beginning of the spectrum also helps to identify the phoneme. Furthermore, Li (2009) found that English-speaking children generally produced alveolar fricatives more accurately than post-alveolar ones, whereas the opposite was true for Japanese-speaking children. Using non-native children as subjects, this study examines the characteristics of frontal fricative consonants through the power spectral density. Further observation is discussed in section 3.
Shadle et al. (1995) also carried out a spectral analysis of [x], but with German speakers. The peaks were reported at 2 and 3.8 kHz, with a trough around 3 kHz. Beautemps et al. (1995) found four peaks (600, 1211, 2180 and 3665 Hz) in a study of French speakers pronouncing [x]. Shadle and Mair (1996) identified spectral parameters that are reliable for modeling a fricative corpus.
Phillips (2000) found the spectral peaks of the fricative /s/ at 1, 1.4, 1.8 and 26 kHz in his study of the characteristics of plosive, fricative and aspiration components in speech production. Phillips (2001) used the spectrum of plosives to obtain the spectral characteristics of the burst. The differentiation between voiced and unvoiced sounds is also discussed there, with voiced sounds stated to contain less aspiration. Jesus and Shadle (2005) observed that back places of articulation possess high-amplitude peaks. In their study, [s], which is situated in the front cavity, had a broad peak at 8 kHz, [ʃ], in the center, had a peak at 3.5 kHz, while [χ], with a longer front cavity, had a series of peaks at 1.3 and 2.4 kHz. The peaks were found to be distributed over the ranges 1-1.6 kHz, 2.1-2.8 kHz and 3.2-4 kHz. The obvious trough always appears within 1.8-2.5 kHz. A 2008 study tracked the value of F0 for fricatives of British English and European Portuguese and obtained an average value of less than 300 Hz. Johnson (2009) modeled the speech signal and stated that each peak in the spectrum of a speech signal represents a formant.
The recording sessions were held in a room with a normal environment. Each subject spoke each of the Arabic frontal fricative consonants in a recording. Following this specification, a total of 175 phoneme signals were extracted and analyzed.

MATERIALS AND METHODS
Processing step: An FFT method with a window size of 8 is used as the processing method. The spectrum of each sample is obtained in the MATLAB environment. Figure 2 shows the overall workflow for obtaining the power spectral density graph. Four steps are needed to obtain the spectral reading. First, the audio signal is divided into short segments. Second, the number of data points is defined. Third, the window length is defined so that the FFT can be used to find the frequency spectrum of each segment. Last, the power spectral density is plotted and analysed.
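The four steps above can be sketched in Python with NumPy as an averaged periodogram. This is only an illustrative sketch, not the MATLAB routine used in the study: the function name `power_spectral_density`, the Hann window, the non-overlapping segments and the default FFT length of 2^7 points (taken from the window length mentioned in the results) are all assumptions, and the test tone is synthetic.

```python
import numpy as np


def power_spectral_density(signal, fs=44_100, n_fft=2 ** 7):
    """Estimate a one-sided PSD by averaging windowed FFTs of short segments."""
    signal = np.asarray(signal, dtype=float)
    # Step 1: cut the signal into short, non-overlapping segments.
    n_segments = len(signal) // n_fft
    if n_segments == 0:
        raise ValueError("signal is shorter than one FFT window")
    # Step 2: fix the number of data points that are actually used.
    n_points = n_segments * n_fft
    segments = signal[:n_points].reshape(n_segments, n_fft)
    # Step 3: window each segment (Hann window is an assumption) and take the FFT.
    window = np.hanning(n_fft)
    spectra = np.fft.rfft(segments * window, axis=1)
    # Step 4: average the squared magnitudes and scale to power per Hz.
    psd = np.mean(np.abs(spectra) ** 2, axis=0) / (fs * np.sum(window ** 2))
    psd[1:-1] *= 2.0  # one-sided estimate; assumes an even n_fft
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs, psd


# Synthetic check signal (not data from the study): a tone centred on FFT bin 5.
fs = 44_100
tone_hz = 5 * fs / 2 ** 7
t = np.arange(fs) / fs
freqs, psd = power_spectral_density(np.sin(2 * np.pi * tone_hz * t), fs=fs)
peak_hz = freqs[np.argmax(psd)]
```

With a 2^7-point FFT at 44.1 kHz the bins are about 344.5 Hz apart, so each reading of a peak or bandwidth is resolved only to that spacing. Averaging over segments reduces the variance of the estimate but does not improve this resolution.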

RESULTS AND DISCUSSION
The first part of this section discusses the voiceless alveolar /s/ and the voiceless post-alveolar /ʃ/, while the second part discusses the other consonants.

Alveolar /s/ and Post-Alveolar /ʃ/:
After all of the data had been collected, the spectrum of each speech sample was computed in MATLAB with a specific algorithm. The power spectral density of every consonant was obtained, as in Fig. 3, which shows the spectrum of one sample of /s/, [س]. The pattern of all spectra is studied. The peak(s), tilt and bandwidth are gathered and the reliable average values are listed in Table 2. Because the spectrum is expressed in Fast Fourier Transform (FFT) window indices, the results are then converted to Hertz. The overall window length is 2^7 points, indexed by n, for a sampling frequency F_S of 44.1 kHz. The Hertz values (f) in Table 3 are computed using Eq. 1. In Fig. 3, the spectrum rises from n = 0 to the peak at n = 3 and then drops drastically until n = 6. The bandwidth is measured from n = 1 to n = 6. There is no tilt, since only one peak appears. Thus the peak appeared at 541.4 Hz and the bandwidth is 725.3 Hz.
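Eq. 1 is not reproduced in this text. The sketch below assumes the form f = (n - 1) F_S / N with N = 2^7 and a 1-based index n, which reproduces the frequencies quoted later for Figs. 5 and 6 (for example n = 6 gives 1722.7 Hz). The function names `bin_to_hz` and `hz_to_bin` are hypothetical, and the assumed form should be checked against the original equation.

```python
FS_HZ = 44_100      # sampling frequency F_S stated in the text
N_FFT = 2 ** 7      # overall window length stated in the text


def bin_to_hz(n, fs=FS_HZ, n_fft=N_FFT):
    """Assumed form of Eq. 1 for a 1-based index n: f = (n - 1) * F_S / N."""
    return (n - 1) * fs / n_fft


def hz_to_bin(f, fs=FS_HZ, n_fft=N_FFT):
    """Inverse of the assumed Eq. 1; averaged readings may be non-integer."""
    return f * n_fft / fs + 1


# Indices quoted in the text for Figs. 5 and 6, rounded as in the text.
for n in (3, 5, 6, 10, 11):
    print(n, round(bin_to_hz(n), 1))
```

Under this assumption the bin spacing is F_S / N, about 344.5 Hz. Averaged values such as those in Table 3 therefore need not fall on an integer index.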
The spectral readings in Tables 2 and 3 are plotted in Fig. 4. Consonants /s/ and /ʃ/, which are located at the alveolar and post-alveolar places respectively, have first peaks at 541.4 and 725.3 Hz, while their bandwidths are 1689.8 and 1849.6 Hz respectively. The bandwidth became wider as the place of articulation moved further into the oral tract. For both consonants, the spectral power decreases drastically after nearly 2 kHz. According to theory, a post-alveolar consonant, which is situated further inside the mouth than an alveolar one, should show its peak at a lower frequency for the same peak number (first peak compared with first peak). According to Li (2009), English-speaking children generally pronounced alveolar fricatives more accurately than post-alveolar ones, while Japanese-speaking children pronounced post-alveolar fricatives better than alveolar ones. As can be seen in Fig. 4, however, the first peak of the alveolar consonant is lower than that of the post-alveolar consonant. This might happen because the Malaysian children have difficulty producing the voiceless post-alveolar sound, as there is no post-alveolar fricative in their language. Both peaks appeared at frequencies below 1 kHz and the spectrum decreases drastically, as in Fig. 3. The bandwidth is greater as the place of constriction moves from alveolar to post-alveolar.
Labiodental /f/, Dental /θ, ð/, Alveolar /z/ and Palatal /ð'/: This part (B) discusses the remaining five frontal fricative consonants under study, which are labiodental /f/, dental /θ, ð/, alveolar /z/ and palatal /ð'/. The power spectral density spectra of consonants /f/ and /θ/ are illustrated in Fig. 5 and 6 respectively. Both show more than one peak in the spectrum of each sample of the same duration. In Fig. 5, two peaks appear at n = 6 and n = 10, which correspond to 1722.7 and 3100.8 Hz respectively. In Fig. 6, three peaks appear at n = 3, 6 and 11, which correspond to 689.1, 1722.7 and 3445.3 Hz, with the slope at n = 5, which is 1378.1 Hz. The spectra of all samples were obtained and the averaged results are summarized in Table 4. Table 5 gives the corresponding frequencies, computed using Eq. 1.
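Reading peaks and troughs off a spectrum, as done for Figs. 5 and 6, can be sketched as a search for local maxima and minima. The function name `local_extrema` and the spectrum values below are hypothetical: they are made-up numbers arranged so that the peaks fall on the indices quoted for Fig. 6, not measurements from the study. Indices here are 0-based, so index k corresponds to n = k + 1 in the text, and the conversion to Hertz assumes a bin spacing of F_S / 2^7.

```python
import numpy as np


def local_extrema(psd):
    """Return 0-based indices of strict local maxima and minima of a spectrum."""
    psd = np.asarray(psd, dtype=float)
    interior = np.arange(1, len(psd) - 1)
    mid, left, right = psd[1:-1], psd[:-2], psd[2:]
    peaks = interior[(mid > left) & (mid > right)]
    troughs = interior[(mid < left) & (mid < right)]
    return peaks, troughs


# Made-up spectrum with peaks at 0-based indices 2, 5 and 10 (n = 3, 6, 11).
spectrum = [1.0, 3.0, 6.0, 4.0, 2.0, 5.0, 3.0, 2.0,
            1.5, 2.0, 4.0, 2.0, 1.0, 0.8, 0.6, 0.5]
peaks, troughs = local_extrema(spectrum)

bin_spacing_hz = 44_100 / 2 ** 7      # assumed spacing between FFT bins
peaks_hz = peaks * bin_spacing_hz     # approximately 689.1, 1722.7, 3445.3 Hz
troughs_hz = troughs * bin_spacing_hz
```

Strict comparisons mean that flat-topped peaks are skipped. A measured spectrum would usually need smoothing or a prominence threshold before this step.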
Most of these five consonants have two peaks, while /θ/ and /z/ have three. The third peak always appears at f = 3937.5 and 3962.1 Hz for these two consonants respectively. As the place of constriction moves from labiodental to palatal, the first peak decreases, as in Fig. 7. [...] Hz respectively, as in Fig. 8. The bandwidth increases from labiodental to alveolar but does not increase further at the palatal place. This may happen because, as reported by Li (2009), children have difficulty producing the correct sound of a consonant from a non-native language. A comparison with previous findings is summarized in Table 6.

CONCLUSION
The second and third peaks always appeared at higher frequencies than the first peak for each consonant at each place of articulation. The first peak always had the highest amplitude of all peaks in a sample. The spectrum then drops drastically to form a slope, known as the trough, and rises again to produce the second peak. This pattern repeats with decreasing amplitude until the last peak, as can be seen in the spectra discussed previously. From the power spectral density representation, the characteristics of fricatives are easier to identify than from the spectrogram representation, because fricative sounds contain more hissing and aspiration, which makes the formants hard to see and read directly. Furthermore, the formants can be read from the spectral density function of each sample, where they are represented by the peaks.