An Improved Time Domain Pitch Detection Algorithm for Pathological Voice

Problem statement: The present study proposes a new pitch detection algorithm which could potentially be used to detect pitch in disordered or pathological voices. Pitch is one of the parameters required for dysphonia diagnosis, and this prompted the development of a new and reliable pitch detection algorithm capable of accurately detecting pitch in disordered voices. Approach: The proposed method applies a technique where the frame size of the half-wave rectified autocorrelation is adjusted to a smaller frame after two potential pitch candidates are identified within the preliminary frame. Results: The method is compared to PRAAT's standard autocorrelation and the results show a significant improvement in detecting pitch for pathological voices. Conclusion: The proposed method is a more reliable way to detect pitch, in either low or high pitched voices, without adjusting the window size, fixing the pitch candidate search range or predefining a threshold as most standard autocorrelation methods do.


INTRODUCTION
Vocal cords within the laryngeal structure vibrate due to air passing through them during voiced speech (Swee et al., 2010). During voiced phonation, pitch is produced, and the fundamental frequency, F_0, and its reciprocal, known as the pitch period, T_0, can be calculated (Amado and Filho, 2008; Kotnik et al., 2009; Manfredi et al., 2000). Vocal hyperfunction, vocal abuse and misuse, or unhealthy social habits such as smoking and alcohol consumption may, over time, cause physical changes to the laryngeal structure and lead to voice changes such as loss of power, changes in pitch and reduction in voice range (Hadjitodorov and Mitev, 2002; Timmermans et al., 2002; Godino-Llorente et al., 2006; De Bodt et al., 2007).
Cycle-to-cycle pitch period perturbation (also known as jitter) is usually one of the parameters used to measure voice quality. In order to obtain an accurate pitch period for each cycle of voiced phonation, the Pitch Detection Algorithm (PDA) needs to perform equally well on pathological voices (Manfredi et al., 2000; Jang et al., 2007; Schoentgen, 2003). The detection of pitch is difficult for the following reasons:
• The nonstationarity and quasiperiodicity of the speech signal, as well as the interaction between the glottal excitation and the vocal tract (Ahmadi and Spanias, 1999; Chen and Wang, 2001; Rabiner et al., 1976)
• False pitch estimates caused by noise and signal distortion that occur in real environments, and by errors in the voicing decision (Cai and Liu, 1997; Tabrikian et al., 2004; Chomphan, 2011)
• For dysphonic voices, significant perturbation of amplitude and frequency in the voiced signal, the presence of subharmonic and aperiodic components of high intensity, and the influence of the formant structure of the voiced signal (Mitev and Hadjitorov, 2003)
Many PDAs have been developed, yet their results are not adequately reliable in detecting pitch in pathological voices (Mitev and Hadjitorov, 2003).
Fig. 1: (a) A frame of the speech waveform of the vowel /a/; (b) The corresponding ACF according to Eq. 1; (c) The corresponding ACF according to Eq. 2
This study aims to propose a newly developed time domain PDA with improved reliability in detecting disordered pitch. The PDA was tested on the KayPENTAX Elemetrics database for the vowel /a/ from 50 normal voices and 100 pathological voices, randomly selected. The results were compared with the datasheet provided by KayPENTAX Elemetrics for the accuracy test. The performance of the proposed PDA was also compared with the well-known and publicly available PRAAT toolkit (Kotnik et al., 2009).
There are several known types of time domain PDA. The most prominent is the Autocorrelation Function (ACF). The general equation of the ACF is as follows (Abdullah-Al-Mamun et al., 2009; De Cheveigne and Kawahara, 2002; Quatieri, 2002; Lahat et al., 1987; Momani, 2009):
$$R_x(l) = \sum_{n=i}^{i+N-1} s[n]\, s[n+l] \tag{1}$$
Where:
R_x = The autocorrelation value
s[n] = The input speech signal at sample n
i = The first sample inside a frame
N = The frame size
l = The lag or time displacement, which ranges from zero to the number of samples per frame minus one
The lag value that produces the maximum peak is chosen as the pitch period. According to De Cheveigne and Kawahara (2002), another type of autocorrelation equation is as follows:
$$R_x(l) = \sum_{n=i}^{i+N-1-l} s[n]\, s[n+l] \tag{2}$$
Figure 1a is a frame of the speech waveform of the vowel /a/. In Eq. 2 the lag, l, is subtracted from the upper limit of the summation, which produces the ACF shown in Fig. 1c, while Eq. 1 produces the ACF in Fig. 1b and 2. The ACF produced by Eq. 2 decays as l increases, since fewer samples contribute to the sum. Figure 1b-c show the ACF of the acoustic waveform in Fig. 1a after normalization and half-wave rectification. It can be seen from Fig. 1b that there are two dominant ACF peaks, which are termed pitch candidates. The first peak is at lag = 146 and the second peak lies at lag = 292. Usually, to choose the best candidate as the pitch period of the frame, a rule must be set whereby the search range does not come near zero lag and does not exceed a certain lag value. This rule reflects the limits of the human pitch range, which is 60-500 Hz (Mitev and Hadjitorov, 2003). Most existing commercial software, such as PRAAT and the Computerized Speech Laboratory by Kay Elemetrics, requires users to specify their own fundamental frequency range of interest in order for the algorithm to work efficiently. Some literature also proposes the use of an ACF threshold so that only peaks that exceed this predetermined threshold are accepted as pitch candidates (Mitev and Hadjitorov, 2003).
However, these rules lack flexibility. If the range is poorly specified, the algorithm will take the wrong lag as the pitch period. If the range is not specified at all, the autocorrelation will not be able to accurately detect low pitched voices, as reported by Samad et al. (2000). The threshold rule is also inappropriate for Eq. 1-2, since for some voices the ACF does not exceed a threshold of 0.5 or more.
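As an illustration of the two autocorrelation forms and of the conventional fixed search range, the following minimal sketch may help. It assumes that Eq. 1 sums over a full N-sample window and that Eq. 2 shortens the window by the lag, following De Cheveigne and Kawahara (2002). The function names, the 8 kHz sampling rate and the synthetic 100 Hz sine are illustrative assumptions and are not taken from the study.

```python
import math

def acf_fixed_window(s, i, N, lag):
    """Assumed Eq. 1 form: sum over a full N-sample window.
    Needs samples up to index i + N - 1 + lag."""
    return sum(s[n] * s[n + lag] for n in range(i, i + N))

def acf_shrinking_window(s, i, N, lag):
    """Assumed Eq. 2 form: the window shrinks by `lag` samples,
    so the envelope of the ACF decays as the lag grows."""
    return sum(s[n] * s[n + lag] for n in range(i, i + N - lag))

def pick_lag_in_range(acf, fs, f_min=60.0, f_max=500.0):
    """Conventional rule: largest ACF value inside a fixed 60-500 Hz lag range."""
    lo = math.ceil(fs / f_max)                 # shortest admissible period
    hi = min(int(fs // f_min), len(acf) - 1)   # longest admissible period
    return max(range(lo, hi + 1), key=lambda l: acf[l])

# Synthetic test signal (illustrative only): 100 Hz sine sampled at 8 kHz.
fs = 8000
period = 80
N = 400
s = [math.sin(2 * math.pi * n / period) for n in range(4 * N)]

acf1 = [acf_fixed_window(s, 0, N, l) for l in range(N)]
acf2 = [acf_shrinking_window(s, 0, N, l) for l in range(N)]
best1 = pick_lag_in_range(acf1, fs)
best2 = pick_lag_in_range(acf2, fs)
print(best1, best2, round(acf1[period], 3), round(acf2[period], 3))
```

On this clean periodic signal both forms place the peak at the true period, but the peak of the shrinking-window form is lower, which is the decay described in the text. The fixed search range only works here because the true pitch happens to fall inside it.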
Another method is the Average Magnitude Difference Function (AMDF) (Manfredi et al., 2000). The general equation for the AMDF, R_y, is commonly written (up to a normalization constant, and with the same symbols as Eq. 1) as Eq. 3 (Quatieri, 2002; Chong and Shih-Chien, 1977):
$$R_y(l) = \sum_{n=i}^{i+N-1} \left| s[n] - s[n+l] \right| \tag{3}$$
Unlike the ACF, which selects the maximum peak as the pitch candidate, the AMDF searches for the deepest valley as the pitch candidate. Manfredi et al. (2000) proposed a modified AMDF where the first valley found to be below a threshold is set as the pitch period of the frame. This approach shares a weakness with the ACF, in that some harmonics and noise effects can also produce AMDF values that fall below this threshold.
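A minimal sketch of the AMDF and of a first-valley-below-threshold rule is given here for illustration. It assumes the common mean-absolute-difference form of the AMDF. The function names, the threshold of 0.1 and the synthetic sine are assumptions, and the valley rule only approximates the idea attributed to Manfredi et al. (2000) rather than reproducing their method.

```python
import math

def amdf(s, i, N, lag):
    """Mean absolute difference between the frame and its lagged copy."""
    return sum(abs(s[n] - s[n + lag]) for n in range(i, i + N)) / N

def first_valley_below(values, threshold):
    """Return the first lag that is a local minimum and lies below the threshold,
    or None when no such valley exists. Lag 0 is skipped (it is trivially zero)."""
    for l in range(1, len(values) - 1):
        is_valley = values[l] <= values[l - 1] and values[l] <= values[l + 1]
        if is_valley and values[l] < threshold:
            return l
    return None

# Synthetic test signal (illustrative only): sine with a period of 80 samples.
period = 80
N = 400
s = [math.sin(2 * math.pi * n / period) for n in range(2 * N)]
values = [amdf(s, 0, N, l) for l in range(200)]
print(first_valley_below(values, threshold=0.1))
```

With a noisy or subharmonic-rich signal, a valley at a shorter lag can dip below the threshold first, which is the weakness noted above.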
Starting from these basic time domain PDAs, many researchers have modified the algorithms so that they obtain pitch more efficiently. One interesting approach is the Merged Normalized Forward Backward Correlation (MNFBC), which uses the same concept as autocorrelation but replaces the autocorrelation with the MNFBC, which is designed to be noise robust (Kotnik et al., 2009). In addition, the exact pitch period is found by applying a Viterbi search to the MNFBC. The Viterbi search takes the three largest values of the MNFBC as the pitch candidates for each voiced frame. However, the Viterbi search makes the pitch value of the current frame highly dependent on that of the previous frame, so it will not work efficiently with dysphonic voices, in which the pitch period can vary extremely from one cycle to the next. False period estimation can also occur when the MNFBC value is larger at pitch candidates other than the true pitch period. Huang and Pan (2006) and Donato et al. (1999) proposed the Hilbert-Huang Transform (HHT) for pitch detection, which was developed to account for the non-linear characteristics of the speech signal. It was shown to improve the accuracy of the detected pitch, but the computational requirements also increase (Kotnik et al., 2009). Jang et al. (2007) applied several PDAs to pathological voices and the results showed that the ACF was the most credible PDA for pathological voices. Mitev and Hadjitorov (2003) showed that, with a small modification, the ACF can be an accurate PDA for pathological voices. However, their method still depends on a threshold, for which they used 0.5, and some pathological voices have ACF values below 0.5 even at the pitch period. These findings indicate that an ACF-based time domain PDA can still detect pitch in dysphonic voices with high accuracy.
Based on the above, this study proposes a modified ACF with lower computational cost for pitch detection in dysphonic voices. The method does not use a predetermined threshold and sets the pitch search range automatically, unlike most commercial software, where the users themselves need to set the search range.

MATERIALS AND METHODS
The proposed PDA is a time domain approach based on a modified ACF. The procedure is as follows.
Step (1): Initialization: Let t = 1 be the initial point of the speech signal. The frame size used for the algorithm is two times the maximum pitch period, MAX_PER. MAX_PER is the pitch period of the lowest pitch that a human can produce, taken as 60 Hz for voiced speech, so that at least two strong ACF peaks within this frame can be chosen as pitch candidates.
Step (2): Compute autocorrelation: The autocorrelation equation used is Eq. 2. The equation produces the ACF, R_x, with i equal to t, l ranging from zero to 2*MAX_PER-1 and N equal to 2*MAX_PER.
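The frame size in Step (1) follows directly from the sampling rate, as the short sketch below illustrates. The sampling rate of 48 kHz and the function names are assumptions made only for this example; the text does not state the sampling rate of the recordings.

```python
def max_period_samples(fs_hz, f_min_hz=60.0):
    """MAX_PER in samples: the period of the lowest expected pitch (60 Hz)."""
    return int(round(fs_hz / f_min_hz))

def preliminary_frame_size(fs_hz, f_min_hz=60.0):
    """Preliminary frame of 2*MAX_PER samples, so that two periods of even
    the lowest pitch fit inside one frame."""
    return 2 * max_period_samples(fs_hz, f_min_hz)

fs_hz = 48000                      # assumed sampling rate, for illustration only
max_per = max_period_samples(fs_hz)
frame = preliminary_frame_size(fs_hz)
lags = range(frame)                # l = 0 .. 2*MAX_PER - 1, as in Step (2)
print(max_per, frame, lags[-1])
```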

Step (3): Half-wave rectification and normalization:
The ACF is then normalized and half-wave rectified so that the values under consideration are normalized and positive. This technique was introduced by Kotnik et al. (2009) and uses the following procedure:
• R_x is calculated using Eq. 2. R_o and R_t are found using Eq. 4 and 5, where, as in Eq. 2, s[n] is the input speech signal at sample n, i is the first sample inside a frame, N is the frame size and l is the lag. The lag l ranges from zero to N-1
• R_x is then normalized to give R using Eq. 6 or Eq. 7
• R is then half-wave rectified by setting all negative values to zero
Step (4): Mark all possible candidates: All the peaks of the ACF are then marked as possible pitch candidates, T_i. Figure 3a shows one frame of the ACF of the vowel /a/ and Fig. 3b shows the marked peaks, T_i. Once every T_i has been identified, the algorithm considers the following conditions:
• If there is no T_i, or i = 0, the pitch for that frame is set to 0 and the frame advances by 2*MAX_PER to the next frame
• If T_i exists, go to step (5)
Step (5): Find two best candidates: The algorithm then finds the best two candidates by sorting the R_x values at every T_i from the largest to the smallest, along with their T_i, as shown in Table 1. From the rearranged candidates, the best two are found by first using Eq. 8 to compute the difference, diff, between a pair of T_i, where j = 2, 3, 4, …, j_total. The best two candidates are then chosen based on the following condition, Eq. 9:
$$\mathrm{diff} \geq \frac{2 \cdot \mathrm{MAX\_PER}}{8} \tag{9}$$
The first T_i pair that satisfies this condition is kept as b_1 and b_2 for the next step. The value (2*MAX_PER)/8 was obtained experimentally, as values lower or higher than this degrade the ability of the proposed PDA to detect the pitch accurately. Figure 4 shows the two pitch candidates which have been marked.
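Steps (3) to (5) can be sketched roughly as follows, starting from an ACF that has already been normalized. Two points are assumptions of this sketch and are not confirmed by the text: a peak is taken to be a simple local maximum, and the difference in Eq. 8 is taken to be the absolute lag distance between the highest-ranked candidate and the j-th ranked candidate. All function names are hypothetical.

```python
import math

def half_wave_rectify(acf):
    """Set all negative ACF values to zero."""
    return [v if v > 0.0 else 0.0 for v in acf]

def mark_peaks(acf):
    """Lags of local maxima, excluding lag 0 and the final lag."""
    return [l for l in range(1, len(acf) - 1)
            if acf[l] > acf[l - 1] and acf[l] >= acf[l + 1]]

def best_two_candidates(acf, peaks, max_per):
    """Rank peaks by ACF value, then return the first pair whose lag distance
    is at least (2*MAX_PER)/8. The pairing rule (top candidate against the
    j-th candidate) is an assumed reading of Eq. 8."""
    if not peaks:
        return None                      # no candidates: frame pitch is 0
    ranked = sorted(peaks, key=lambda l: acf[l], reverse=True)
    min_gap = (2 * max_per) / 8
    top = ranked[0]
    for other in ranked[1:]:
        if abs(top - other) >= min_gap:
            return top, other
    return None

# Synthetic normalized ACF (illustrative only): decaying cosine with period 50.
max_per = 100
acf = [(1 - l / 400) * math.cos(2 * math.pi * l / 50) for l in range(2 * max_per)]
rectified = half_wave_rectify(acf)
peaks = mark_peaks(rectified)
b1, b2 = best_two_candidates(rectified, peaks, max_per)
print(peaks, b1, b2)
```

The minimum distance keeps two peaks that belong to the same broad ACF lobe, or a peak and a nearby spurious maximum, from being selected together as b_1 and b_2.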
Step (6): Create new frame: Once the two candidates are selected, the size of the new frame is calculated using Eq. 10-12. Instead of searching for the pitch within the range 1 to MAX_PER-1, or within a predefined range as most ACF methods in the literature do, this study introduces a new search range from new_frame_i to new_frame_f. Within this new frame, the T_i with the largest R_x(T_i) value is taken as the pitch period, T_0. Figure 5 shows the new frame, or the new region in which T_0 is searched, with T_0 marked by a blue line.
Step (7): Proceed to the next frame: Since the frame may contain two or more pitch periods, the starting point of the next frame is found according to Eq. 13:
$$t_{new} = t_{prev} + T_0 \tag{13}$$
Where:
t_new = The new frame's starting point
t_prev = The previous frame's starting point
T_0 = The pitch period found from the previous frame
In this way, every pitch period, or pitch epoch, can be located accurately, as shown in Fig. 6.
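The frame advance of Step (7), together with the no-candidate rule of Step (4), can be sketched as a simple loop. The sketch assumes that the next frame starts one pitch period after the previous frame start, uses zero-based sample indices instead of t = 1, and treats the per-frame estimator as a hypothetical callback `estimate_period` standing in for Steps (2) to (6).

```python
def track_pitch_marks(num_samples, max_per, estimate_period):
    """Walk through the signal one pitch period at a time.

    estimate_period(t) is a hypothetical callback that returns the pitch
    period T_0 (in samples) for the frame starting at t, or 0 when the
    frame has no pitch candidates.
    """
    frame = 2 * max_per
    t = 0
    marks = []
    while t + frame <= num_samples:
        t0 = estimate_period(t)
        if t0 <= 0:
            t += frame            # no candidates: skip a whole frame
        else:
            marks.append((t, t0))
            t += t0               # t_new = t_prev + T_0
    return marks

# Illustrative use with a constant-period stand-in estimator.
marks = track_pitch_marks(1000, 100, lambda t: 80)
print([start for start, _ in marks])
```

Because each frame starts exactly one period after the previous one, every cycle receives its own estimate, which is what cycle-to-cycle measures such as jitter require.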
The experiment was conducted to test the accuracy and the effectiveness of the proposed PDA on normal voices and pathological voices.
In the error calculation, p_proposedPDA is the value of each parameter obtained using the proposed PDA and p_reference is the value of each parameter given by the reference. The results were also compared with the well-known and publicly available PRAAT toolkit, where the PRAAT autocorrelation (PRAAT_ac) was chosen because the proposed algorithm is a modified autocorrelation (Kotnik et al., 2009). Table 2 shows the errors of each parameter produced by the proposed PDA, while Table 3 presents the errors of each parameter produced by PRAAT_ac. Table 2-3 give the mean of the errors for each voice sample and each parameter for the two PDAs. PRAAT_ac works slightly better than the proposed algorithm for normal voices, but the difference in error between PRAAT_ac and the proposed PDA is very small. According to Table 2-3, PRAAT_ac produces more error for pathological voices than the proposed PDA.
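A minimal sketch of a per-parameter error and its mean over the parameters of one voice sample is shown below. It assumes the error is the absolute difference relative to the reference value, expressed as a percentage, which is a common definition but is not confirmed by the text. The parameter names and numbers are invented for illustration and are not results from the study.

```python
def percent_error(p_proposed, p_reference):
    """Assumed definition: |p_proposed - p_reference| / |p_reference| * 100.
    p_reference must be nonzero."""
    return 100.0 * abs(p_proposed - p_reference) / abs(p_reference)

def mean_percent_error(proposed, reference):
    """Mean of the per-parameter errors for one voice sample."""
    errors = [percent_error(proposed[name], reference[name]) for name in reference]
    return sum(errors) / len(errors)

# Hypothetical parameter values for one voice sample (not from the paper).
proposed = {"mean_f0_hz": 198.0, "jitter_percent": 0.55}
reference = {"mean_f0_hz": 200.0, "jitter_percent": 0.50}
sample_error = mean_percent_error(proposed, reference)
print(round(sample_error, 2))
```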

RESULTS
To summarize the results obtained using the proposed PDA and PRAAT_ac, the errors of all parameters were averaged to obtain the mean error for each voice sample. For the proposed PDA, 49 normal voices had a mean error of less than 20% and one normal voice had a mean error of more than 20%, while 87 pathological voices had a mean error of less than 20% and 13 had a mean error of more than 20%. These data are presented in Table 4. For PRAAT_ac, as shown in Table 5, the normal voices behaved similarly to the proposed PDA: 49 voices had less than 20% error and only one fell into the more than 20% category. For pathological voices, 15 had a mean error of more than 20% and the other 85 had a mean error of less than 20%.

DISCUSSION
Even though PRAAT_ac was observed to work better for normal voices, Fig. 11-14 show that it works poorly for pathological voices, while the error produced by the proposed algorithm is smaller for pathological voices. Figure 12 shows that three samples exceed 40% error with PRAAT_ac, while the proposed algorithm has no sample that exceeds 40% error. Figure 13 also indicates that the proposed algorithm produces less error, with three samples above 40% error compared with four samples for PRAAT_ac. As can be seen in Fig. 14, the standard deviation error for PRAAT_ac exceeds 100% for two pathological voice samples: the error of the standard deviation is over 1000% for sample number 44 as well as for sample number 78. As can be observed in Fig. 15a, PRAAT_ac marked the pitch period correctly for the first half of the signal but marked it wrongly for the second half of the signal, shown in Fig. 15b. This may be due to the autocorrelation used for pitch detection, whereby the second pitch period has a higher ACF value than the first, or true, pitch period. This will happen if the search criterion for the autocorrelation only involves finding the maximum ACF within a predefined range.
When the proposed algorithm is applied to the same voice sample, as can be seen in Fig. 15c, the pitch period is determined well along the whole signal, thus producing a smaller error than PRAAT_ac.
In both methods, the voice samples with an error of more than 20% are due to strong subharmonic frequencies. A disordered voice with creaky or breathy characteristics also influences the signal's waveform, and since the autocorrelation depends on the signal's amplitude and on how well correlated the periodic pattern is, the resulting autocorrelation function is also distorted.

CONCLUSION
For disordered voices, the proposed method of determining pitch provides a significant improvement over the standard autocorrelation, represented in this case by the PRAAT autocorrelation. It offers a more reliable way to detect pitch, in either low or high pitched voices, without adjusting the window size, fixing the pitch candidate search range or predefining a threshold as most standard autocorrelation methods do.