Speech Segmentation Using Dynamic Windows and Thresholds for Arabic and English Languages

Segmentation of audio data such as human speech (splitting a recording so that each word is stored in a separate .WAV audio file) has been a major concern when working with multimedia such as radio or TV recordings. Work on detecting the boundaries of spoken language has focused mainly on energy and zero-crossing thresholds for endpoint detection, and errors in endpoint detection remain a main cause of low accuracy in segmentation systems. The goal of this research is to develop an efficient algorithm that segments human speech in both English and Arabic, at different speaking speeds, with high accuracy. Simulation results show that the developed algorithm achieves an average segmentation accuracy of 91.6% for English and 89.0% for Arabic.


Introduction
Speech is a primary means of communication between humans. In the information society, speech is used not only in its original form but also through many digital electronic devices (Zhang and Kuo, 2001): mobile phone use is growing rapidly, wired telephone systems are becoming digital, and Internet telephony and audio are in common use. Computers, in turn, may speak to humans with a synthetic voice (Juan et al., 2015) and listen to us using speech recognition. To understand these processes for both human and machine, we have to study carefully the structures and functions of spoken language: how it is produced and perceived, and how speech technology may help us to communicate (Hennig and Chellali, 2012). Speech segmentation is the process of splitting speech into separate words, each of which is saved in a separate audio file for further processing, as shown in Fig. 1. Speech segmentation has become an active area of research. Several algorithms work on audio files, mainly in English, and achieve different segmentation accuracies depending on the properties of the spoken languages themselves.
In this research, we developed a fast and simple-to-implement segmentation algorithm that closely matches subjective expectations of the required target. The algorithm is based on two thresholds that benefit from a dynamic window in order to split the words correctly. The algorithm is implemented in MATLAB and applied to the Arabic and English languages, and its accuracy is high. The remainder of the paper is organized as follows: Part II provides an overview of audio signals and sampling. Part III describes the related work. Part IV describes the proposed algorithm. Part V describes the experiments and the evaluation of the results. Part VI analyzes the complexity of the algorithm and Part VII is the conclusion.

Related Work
Several studies and methods have been proposed to improve the performance of speech segmentation. Naoki et al. (2006) propose an audio signal segmentation and classification method using fuzzy c-means clustering. Shi-sian and Hsin-min (2003) propose a sequential metric-based audio segmentation method that combines the low computation cost of metric-based methods with the high accuracy of model-selection-based methods.

Other approaches are based on the Hidden Markov Model (HMM) (Daniel and James, 2017). Lefevre et al. (2002) combine a K-Means classifier with Hidden Markov Models in order to analyze audio segments using several audio features based on either the segment or the frame. Another HMM-based method is that of Biswajit et al. (2015), which explores the Vowel Onset Point (VOP) and the Vowel offset or End Point (VEP) to correct the boundaries obtained using HMM alignment. An HMM models the class information well, but it may not detect the exact boundary. Another HMM-based method, designed for the Arabic language, is that of Abed et al. (2016), who propose a system for the automatic segmentation of Arabic speech into phonemes. This segmentation is based on two different techniques: Hidden Markov Models (HMM) and Artificial Neural Networks (ANN). Both systems were used to classify the speech signals, extracted from the ALGerian Arabic Speech Database (ALGASD corpus), into five classes: fricatives, plosives, nasals, liquids and vowels. Adriana et al. (2015) introduce a first attempt to perform phoneme-level segmentation of speech based on a perceptual representation, the Spectro Temporal Excitation Pattern (STEP), and a dimensionality reduction technique, the t-distributed Stochastic Neighbor Embedding (t-SNE). The method searches for the true phonetic boundaries in the vicinity of those produced by an HMM-based segmentation. It looks for perceptually salient spectral changes that occur at these phonetic transitions and exploits the ability of t-SNE to capture both the local and the global structure of the data.
Some methods rely on a sliding-window technique. Shih-sian and Hsin-min (2004) present a hybrid approach to audio segmentation in which metric-based segmentation with long sliding windows is first applied to divide an audio stream into shorter sub-segments; divide-and-conquer segmentation is then applied to a fixed-length window that slides from the beginning to the end of each sub-segment to sequentially detect the remaining acoustic changes. Bartosz et al. (2006) apply the Discrete Wavelet Transform (DWT) to analyze speech signals, the resulting power spectrum and its derivatives; this information allows the boundaries of phonemes to be located. Ghazaal and Farshad (2011) investigate the problem of segmenting speech into sub-word units: a technique based on fuzzy smoothing is applied to the short-term energy function of the speech wave, and the smoothed energy contour is then searched for local minima that indicate syllable units. Benati and Halima (2016) suggest an acoustic-based segmentation algorithm that exploits acoustic particularities of the speech stream to detect word boundaries. Fréjus et al. (2015) present an algorithm that uses a fuzzy logic approach to perform continuous speech segmentation from non-linear speech analysis. The proposed algorithm is based on time-domain features: the short-term energy, the zero-crossing rate and the singularity exponents calculated at each point of the signal. Brognaux and Thomas (2016) focus on a particular case of hidden Markov model (HMM)-based forced alignment in which the models are trained directly on the corpus to be aligned. Kamper et al. (2017) introduce an approximation to a recent Bayesian model that still has a clear objective function but improves efficiency by using hard clustering and segmentation rather than full Bayesian inference. Kiss et al. (2013) introduce a language-independent solution based on the segmentation of continuous speech into 9 broad phonetic classes; the classification and segmentation were performed using Hidden Markov Models.

Proposed Method
The algorithm aims to split an audio file that contains several words into several audio files, each of which contains one word only.
The segmentation process goes through several steps. It starts by converting the audio signal into samples and then normalizing them (Lucero and Koenig, 2000) to handle irregularities in sample amplitude. It then checks for muteness periods and, as the last step before saving the segmented samples into separate files, checks for continuous words, as shown in Fig. 2. Figure 3 illustrates the flowchart, which is explained in more detail below.
The first step after sampling is normalization. Equation 1 converts each sample into its normalized value:
$F_n = \frac{F_s}{F_h} \times H$ (1)
Where:
F_n = The amplitude of the normalized sample
F_s = The amplitude of the sample before normalization
H = Normalization level
F_h = The highest sample amplitude
The algorithm then applies two thresholds and the dynamic window. The first threshold segments slow speech, which has considerable periods of muteness, while the second segments speech that is relatively fast.
The first threshold is computed from the normalized samples (Equation 2):
$Th1 = \frac{0.55}{j} \sum_{i=1}^{j} f_n(i)$ (2)
Where:
Th1 = The value of threshold 1
f_n(i) = The amplitude of the i-th normalized sample
j = The total number of samples
The factor 0.55 was determined experimentally: of the several percentages tried, 55% achieved the best performance in both languages.
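The normalization and first-threshold steps can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB implementation. It assumes that Equation 1 is ordinary peak normalization (each sample scaled by H / F_h) and that "amplitude" in the threshold definition means the absolute sample value. The function names `normalize` and `first_threshold` are hypothetical.

```python
def normalize(samples, level=1.0):
    """Peak normalization, read here as F_n = F_s * H / F_h (assumed form of Equation 1)."""
    # F_h: the highest absolute amplitude in the recording.
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # A completely silent recording cannot be rescaled.
        return [0.0] * len(samples)
    return [s * level / peak for s in samples]


def first_threshold(normalized, ratio=0.55):
    """Th1 as `ratio` times the mean absolute amplitude (assumed form of Equation 2)."""
    if not normalized:
        raise ValueError("at least one sample is required")
    return ratio * sum(abs(s) for s in normalized) / len(normalized)


# Toy input: four raw samples, normalization level H = 1.0.
normalized = normalize([0.1, -0.5, 0.25, 0.15], level=1.0)
th1 = first_threshold(normalized)
```

Because the threshold is derived from the normalized samples themselves, it adapts to each recording instead of being a fixed constant.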
The next step is to determine the width of the first window. The initial width is equal to the average length of a spoken word (which varies with the spoken language). Starting from the beginning of the samples, the algorithm tests the amplitude at the end of the window and extends the window's width until muteness is measured.
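One possible reading of this window-extension step is sketched below in Python. The text does not specify how far the window grows at each step or which threshold defines muteness at this stage, so the `step` and `threshold` parameters, like the function name `extend_window`, are assumptions made for illustration.

```python
def extend_window(samples, start, initial_width, threshold, step=1):
    """Grow a window that starts at `start` until its last sample is quiet.

    Returns the exclusive end index of the window. `threshold` is the
    amplitude below which a sample is treated as muteness (assumed).
    """
    if initial_width < 1 or step < 1:
        raise ValueError("initial_width and step must be at least 1")
    end = min(start + initial_width, len(samples))
    # Test the amplitude at the end of the window; extend while it is still loud.
    while end < len(samples) and abs(samples[end - 1]) >= threshold:
        end = min(end + step, len(samples))
    return end


# Toy input: the window starts 2 samples wide and grows until it reaches the quiet sample.
window_end = extend_window([0.9, 0.8, 0.7, 0.6, 0.05, 0.9], start=0,
                           initial_width=2, threshold=0.1)
```

In this toy input the window stops at index 5, just after the first sample whose amplitude falls below the threshold.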
The first threshold is used to detect muteness. The window is moved through the samples, and the maximum value inside it is compared with the threshold; taking the maximum ensures that the highest sample value is used each time. If that value is greater than the threshold, the window lies inside a continuous word, and the algorithm moves the window to the next position and checks again until a value less than the threshold is obtained, which marks the end of the word(s), as described below (here H denotes the highest value within the window, not the normalization level of Equation 1):
1. Set the starting point of the samples.
2. Compare H, the highest value of the samples within the window, to Th1 (threshold 1):
   If H is less than Th1
      Delete the muteness
      Split the words
   Else
      Move the window and go to step 2.
3. Repeat until the end of the samples.
The next step is to eliminate the muteness periods between words, which have an average length of 1000 samples (this number was obtained by testing many Arabic and English audio wave files). The algorithm checks for that number of samples and eliminates these unusable segments, as shown below:

If H is less than Th1
1. Add 1000 to W
2. Delete samples between W and W+1000
3. Move the window to W+1000 position
4. Go to step 1
The algorithm then analyzes the muteness between long segments in order to adjust the width of the window. The length of the shortest word in the spoken language (Arabic or English) is measured over the whole audio file by counting the samples of reasonable amplitude between two consecutive muteness intervals, as shown below:
1. Set M to the end of a muteness interval
2. Set N to the start of the next muteness interval
3. Calculate the width of the window = N-M
A second threshold is used to segment the remaining continuous words, as shown in Fig. 3. Its value is the average amplitude of the samples (Equation 3):
$Th2 = \frac{1}{j} \sum_{i=1}^{j} f_n(i)$ (3)
The same idea is applied by moving the dynamic window between the borders of the continuous words (i.e., the start and end indexes), as shown in Fig. 4, and finding the next value that is less than the second threshold; such a value marks the end of a complete word, and the segmentation is made there.
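The window-width adjustment and the second threshold can be sketched in Python as below. The sketch assumes that the muteness intervals have already been found and are given as sorted (start, end) index pairs with an exclusive end. It also assumes that the "shortest word" is the smallest N-M gap in the file and that amplitude means absolute sample value. The names `shortest_word_width` and `second_threshold` are hypothetical.

```python
def second_threshold(normalized):
    """Second threshold: the mean absolute amplitude of the normalized samples (assumed reading)."""
    if not normalized:
        raise ValueError("at least one sample is required")
    return sum(abs(s) for s in normalized) / len(normalized)


def shortest_word_width(muteness_intervals):
    """Smallest gap N - M between the end of one muteness interval and the start of the next."""
    widths = []
    for (_, m), (n, _) in zip(muteness_intervals, muteness_intervals[1:]):
        if n > m:  # ignore touching or overlapping intervals
            widths.append(n - m)
    if not widths:
        raise ValueError("at least two separated muteness intervals are required")
    return min(widths)


# Toy input: three muteness intervals enclosing two words of 4200 and 1800 samples.
width = shortest_word_width([(0, 1000), (5200, 6200), (8000, 9000)])
th2 = second_threshold([0.2, -1.0, 0.5, 0.3])
```

With these inputs the second-pass window is 1800 samples wide, and the second threshold (0.5) is higher than 55% of the same mean (0.275), so shorter and shallower dips between words can be detected.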
The previous steps are repeated until the end of the samples is reached, so that all the words are segmented.
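The overall two-pass idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: windows are non-overlapping, quiet windows are simply left out of the returned ranges instead of being deleted in fixed 1000-sample blocks, and both thresholds are passed in by the caller. The names `split_on_quiet_windows` and `segment` are hypothetical.

```python
def split_on_quiet_windows(samples, start, end, width, threshold):
    """Return (start, end) ranges of consecutive windows whose peak reaches `threshold`."""
    if width < 1:
        raise ValueError("width must be at least 1")
    words, word_start = [], None
    for pos in range(start, end, width):
        # Peak amplitude inside the current window.
        peak = max(abs(s) for s in samples[pos:min(pos + width, end)])
        if peak >= threshold and word_start is None:
            word_start = pos                  # a word begins
        elif peak < threshold and word_start is not None:
            words.append((word_start, pos))   # muteness reached: the word ends
            word_start = None
    if word_start is not None:
        words.append((word_start, end))
    return words


def segment(normalized, first_width, second_width, th1, th2):
    """First pass: wide window with th1. Second pass: narrow window with th2 inside each segment."""
    words = []
    for seg_start, seg_end in split_on_quiet_windows(
            normalized, 0, len(normalized), first_width, th1):
        words.extend(split_on_quiet_windows(
            normalized, seg_start, seg_end, second_width, th2))
    return words


# Toy signal: one isolated word, a long pause, then two words joined by a shallow dip.
signal = [1.0] * 4 + [0.0] * 4 + [0.9, 0.9, 0.35, 0.9, 0.9, 0.0, 0.0, 0.0]
mean_amplitude = sum(abs(s) for s in signal) / len(signal)
words = segment(signal, first_width=4, second_width=1,
                th1=0.55 * mean_amplitude, th2=mean_amplitude)
```

In this toy signal the dip of 0.35 stays above the first threshold, so the first pass keeps the last two words together, while the second pass with the higher threshold and narrower window separates them.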

Experiments and Evaluation
The proposed algorithm was applied, using MATLAB, to 50 .wav audio files in English and another 50 .wav audio files in Arabic; the speed of speech ranges from fast to slow in both languages. Table 1 shows a sample of the tested English files and the accuracy of their segmentation.
The accuracy of segmentation for the English language is 91.6%. Table 2 shows a sample of the tested Arabic files; the accuracy of segmentation for the Arabic language is 89.0%.
Accuracy is lower for the Arabic files than for the English ones because of the nature of the language itself: Arabic has vowels that last relatively longer than their English counterparts, and the algorithm may treat them as muteness and therefore fail to identify muteness correctly.
These results can vary depending on the speed of speech and the spoken language itself.

Algorithm Complexity
This section measures the performance of the algorithm using Big O notation. The algorithm contains some single loops and two nested loops; the nested loops dominate the running time, so its complexity is $O(n^2)$.
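How a nested loop over window positions and over the samples inside each window leads to quadratic growth can be illustrated with the Python sketch below. The paper does not list its loops, so this is an assumed example in which the window slides one sample at a time; the function name `naive_window_peaks` is hypothetical. For n samples and a window of width w, the inner loop performs (n - w + 1)(w - 1) comparisons, which grows as $O(n^2)$ when w is proportional to n.

```python
def naive_window_peaks(samples, width):
    """Peak of every sliding window, counting the comparisons made by the inner loop."""
    if width < 1:
        raise ValueError("width must be at least 1")
    peaks, comparisons = [], 0
    for start in range(len(samples) - width + 1):      # outer loop: window positions
        peak = abs(samples[start])
        for s in samples[start + 1:start + width]:     # inner loop: samples in the window
            comparisons += 1
            if abs(s) > peak:
                peak = abs(s)
        peaks.append(peak)
    return peaks, comparisons


# Toy input: n = 8 samples, window width w = 4, so (8 - 4 + 1) * (4 - 1) comparisons.
peaks, comparisons = naive_window_peaks([1, -3, 2, 0, 5, -1, 0, 2], 4)
```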

Conclusion
In this research, a new segmentation technique is introduced. It depends on two threshold values that are calculated from the samples themselves and on two window widths that are measured from the muteness in the audio samples.
The technique follows the peak values of the samples by means of a dynamic moving window that captures a number of samples and calculates the maximum amplitude within the window. This ensures that the algorithm is not misled by samples that fall below the threshold while still being part of a word.
The first threshold detects muteness when a sufficiently long period of muteness exists. When that period is too short, the algorithm splits the words using a second threshold that is higher than the first, together with a window that is narrower than the previous one, so that it can capture the short period of time between two words.
The algorithm was tested on the Arabic and English languages; the accuracy for English is better than that for Arabic because of the nature of the language itself. Speech Segmentation Using Dynamic Windows and Thresholds for Arabic and English Languages has not been published in whole or in part elsewhere.