Recognition of pathological voices by Human Factor Cepstral Coefficients (HFCC)


Several tools have been introduced to achieve early detection of voice disorders. Among them are the human factor cepstral coefficients (HFCC) combined with prosodic parameters: the noise-to-harmonic ratio (NHR), the harmonic-to-noise ratio (HNR), detrended fluctuation analysis (DFA) and the fundamental frequency (F0). These parameters are calculated for every frame. In this work, we varied the equivalent rectangular bandwidth (ERB) used in HFCC to study its effect on the classification of pathological voices. Using the HTK classifier, classification is carried out on two pathological voice databases, the Massachusetts Eye and Ear Infirmary (MEEI) database and the Saarbruecken Voice Database (SVD). System performance is assessed using sensitivity and specificity.


Introduction
In biomedical applications of speech technology, the diagnosis of pathological voice is an important matter. The human voice may be affected by several diseases of the vocal cords. Acoustic processing of the pathological voice offers some advantages, such as its noninvasive and quantitative nature. These benefits allow the identification and monitoring of diseases of the vocal system and reduce the cost and time required for their treatment. The main objective in the classification of pathological voices is to predict whether the patient's voice is normal or pathological. Proper classification will allow automatic diagnosis and treatment of the disease [1]. The detection of vocal pathology can be evaluated in a subjective or an objective way [2]. Objective evaluation of acoustic signals is done with computer tools; it identifies and quantifies underlying vocal pathology that humans cannot hear [3]. Thanks to technological progress, the voice can be easily captured and processed: smart devices are used for recording, and cloud technologies help with remote processing. In [4,5,6,7], the authors used signal processing techniques and machine learning algorithms to build reliable systems that distinguish precisely between healthy and pathological voices. In this same context, we have developed an automatic recognition system for pathological voices. The system is composed of two basic modules. The first is the parametrization module, which extracts the relevant parameters from pathological voices; it is based on the human factor cepstral coefficients (HFCC) proposed by Skowronski and Harris [8]. The second module is the classifier used to classify vocal pathologies.
We used a hidden Markov model with Gaussian mixture densities (HMM-GM) [9], implemented with the Hidden Markov Model Toolkit (HTK 3.4.1) [10]. Researchers in this field have frequently carried out objective assessment of vocal pathology on several databases. The most widely used are the MEEI database [11,3], the Saarbruecken Voice Database (SVD) [4,7,12] and the Arabic Voice Pathology Database (AVPD) [13,7]. Research on these databases is generally based on the analysis of the phonation of the vowel /a/, for example in [14,15,16,17], while other works combine several vowels for the analysis, for example [5,6,18].
In our work, we used two databases, the MEEI database and the Saarbruecken Voice Database, for the classification of pathological voices. Unlike previous work, we analyzed sentences rather than vowels. In the first database, the acoustic samples are recordings of up to 12 seconds of readings of the "Rainbow Passage" by men and women. In the second database, we used recordings of the sentence "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). Our study is based on the HFCC method combined with the prosodic parameters: the noise-to-harmonic ratio (NHR), the harmonic-to-noise ratio (HNR), detrended fluctuation analysis (DFA) and the fundamental frequency (F0), which are calculated for each frame. The performance of a diagnostic test can be measured with different indices such as sensitivity, specificity and accuracy [19], as well as with receiver operating characteristic (ROC) curves. Sensitivity is the probability that the test is positive given that the subject is sick; it measures the ability of a test to detect patients. The closer the sensitivity is to 1, the fewer sick subjects are missed (false negatives). Specificity is the probability that the test is negative given that the subject is healthy; it measures the ability of a test to identify healthy individuals. The closer the specificity is to 1, the fewer false positives there are [20]. The relation between the sensitivity and the specificity of a test is represented graphically by the ROC curve, calculated for all possible threshold values. The area under the ROC curve (AUC) is one of the most widely used overall measures of test performance. It varies between 0.5 for a noninformative test and 1 for a perfect test [20]. The aim of this work is to determine the capacity of these parameters to detect and classify voice pathologies.
We also considered other scenarios in which the parameters are used alone, with HFCC, and in hybrid combinations. To validate the performance of the recognition system, we used the ROC curve and the area under it (AUC).
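As a minimal sketch of the two indices defined above, sensitivity and specificity can be computed from binary decisions as follows. The function name and the label convention (1 = pathological, 0 = healthy) are assumptions made for illustration; they are not taken from the evaluation code used in this work.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumed convention: 1 = pathological (positive), 0 = healthy (negative).
    """
    # Count the four outcomes of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one pathological and one healthy sample")
    sensitivity = tp / (tp + fn)  # P(test positive | subject is sick)
    specificity = tn / (tn + fp)  # P(test negative | subject is healthy)
    return sensitivity, specificity
```

With three pathological and two healthy samples, one missed patient and one false alarm give a sensitivity of 2/3 and a specificity of 1/2.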

Methods and Materials
The fundamental frequency (F0), the Human Factor Cepstral Coefficients (HFCC), the harmonic-to-noise ratio (HNR) and Detrended Fluctuation Analysis (DFA) are the classical features used for the classification of pathological voices. These features are inspired by the cues used in the field of speech recognition. This section provides an overview of the most common features used in pathological voice analysis.

Fundamental Frequency
The fundamental frequency (F0) of a speech signal corresponds to the frequency of vibration of the speaker's vocal cords. This parameter is used in most studies, sometimes in conjunction with the Human Factor Cepstral Coefficient (HFCC) [21].

Human Factor Cepstral Coefficient (HFCC)
Many robust feature extraction methods are available; one of the efficient ones is the Human Factor Cepstral Coefficient (HFCC). This approach to extracting speech features was proposed and described in detail in [8].
Skowronski and Harris introduced the HFCC variant in 2004 as a more recent implementation of the mel filter bank. HFCC sets the width of each filter using the Equivalent Rectangular Bandwidth (ERB) proposed by Moore and Glasberg in 1983. For each filter, the ERB is defined as the width of an ideal band-pass filter with the same center frequency; ERB measurements characterize the frequency resolution of the auditory system. It is given by the following formula:

$$\mathrm{ERB} = a f_c^2 + b f_c + c \qquad (1)$$

where $f_c$ is the center frequency of the filter in Hz. The block diagram of HFCC feature extraction is shown in Figure 2.
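As a small illustration, the ERB of a filter can be evaluated as a quadratic in its center frequency, using the constants quoted in this section (6.23 × 10⁻⁶, 93.39 × 10⁻³ and 28.52, with the frequency in Hz). The function name `erb_hz` and the `e_factor` argument are hypothetical: `e_factor` is included only on the assumption that the "ERB = 1 to 6" settings reported in the results act as a multiplier on this bandwidth, which the text does not state explicitly.

```python
# ERB constants a, b, c as quoted in the text (center frequency in Hz).
A, B, C = 6.23e-6, 93.39e-3, 28.52


def erb_hz(fc_hz, e_factor=1.0):
    """Equivalent rectangular bandwidth (Hz) for a center frequency in Hz.

    e_factor is a hypothetical scale multiplier (assumption, see lead-in).
    """
    if fc_hz < 0:
        raise ValueError("center frequency must be non-negative")
    return e_factor * (A * fc_hz ** 2 + B * fc_hz + C)
```

For a center frequency of 1000 Hz the three terms are 6.23, 93.39 and 28.52 Hz, so the unscaled ERB is about 128.14 Hz.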

Fig. 2 HFCC Implementation.
First, the speech signal is pre-emphasized and weighted by a Hamming window of 25 ms with a frame shift of 10 ms. The DFT is then applied to each frame to obtain the spectrum $X(j)$, from which the amplitude spectrum $|X(j)|$ is calculated. The result is then filtered by a human factor filter bank, and the filter bank outputs are compressed by the logarithm function. Finally, the Discrete Cosine Transform (DCT) is used to decorrelate the outputs, yielding the HFCC coefficients [23]:

$$E_k = \log\Big(\sum_j |X(j)|\, H_k(j)\Big), \qquad \mathrm{HFCC}(m) = \sum_{k=1}^{N} E_k \cos\Big(\frac{m\,(k - 0.5)\,\pi}{N}\Big), \quad m = 1, \dots, M$$

where $N$ is the number of filters in the filter bank, $M$ is the number of HFCC coefficients, $E_k$ is the logarithmic energy output of the $k$th filter ($k = 1, 2, \dots, N$), and $H_k(j)$, $k = 1, 2, \dots, N$, represents the filter bank in the frequency domain. For the HFCC computations we chose $N = 32$ and $M = 12$. The sampling frequency is 16000 Hz. Skowronski and Harris proposed this implementation of the 32-filter HFCC filter bank, which covers the frequency range [115, 8000] Hz.

The design of the HFCC filter bank is as follows. First, we choose the number of filters $N$ as well as the minimum and maximum frequencies, $f_{low}$ and $f_{high}$, of the entire filter bank. The center frequencies of the first and last filters, $f_{c_1}$ and $f_{c_N}$, are then calculated from coefficients $\bar{b}$ and $\bar{c}$, which in turn depend on the ERB constants $a$, $b$ and $c$ of (1) and on coefficients $\hat{a}$, $\hat{b}$ and $\hat{c}$. The constants $a$, $b$ and $c$ take the values $6.23 \times 10^{-6}$, $93.39 \times 10^{-3}$ and $28.52$, respectively, whereas $\hat{a}$, $\hat{b}$ and $\hat{c}$ differ between the first and the last filter. Finally, the minimum and maximum frequencies of each filter $i$ are obtained from its center frequency and its ERB; the full expressions are given in [8].

The NHR measures the amount of noise in the voice signal and assesses vocal quality. A high signal-to-noise ratio indicates good voice signal quality [24]. The HNR is thus a measure of the presence of noise during phonation.
To calculate the HNR, the signal is first downsampled to 16 kHz and split into frames of 25 ms with a 10 ms shift. In each frame, a comb filter is applied to the signal to compute the energy in the harmonic components.
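Returning to the HFCC extraction chain described in this section (pre-emphasis, 25 ms Hamming windows with a 10 ms shift, DFT magnitude, filter bank, logarithm, DCT), the steps can be sketched roughly as below. This is not the implementation used in this work. It assumes NumPy, takes the filter bank as a precomputed matrix `fbank` (one row per filter, one column per DFT bin) instead of implementing the HFCC filter design, and uses a pre-emphasis coefficient of 0.97 and an unnormalized DCT-II, none of which are specified in the text. All function and parameter names are illustrative.

```python
import numpy as np


def frame_signal(x, fs, win_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    win = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    if len(x) < win:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def cepstra_from_filterbank(x, fs, fbank, n_ceps=12, preemph=0.97):
    """Cepstral coefficients from a given filter bank (n_filters x n_bins)."""
    x = np.asarray(x, dtype=float)
    # Pre-emphasis: y[n] = x[n] - preemph * x[n-1] (coefficient is an assumption).
    x = np.append(x[0], x[1:] - preemph * x[:-1])
    frames = frame_signal(x, fs)
    frames = frames * np.hamming(frames.shape[1])
    # DFT size implied by the number of filter bank columns.
    n_fft = 2 * (fbank.shape[1] - 1)
    if n_fft < frames.shape[1]:
        raise ValueError("filter bank has too few bins for this frame length")
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # amplitude spectrum
    # Log of the filter bank outputs; a small floor avoids log(0).
    log_energy = np.log(mag @ fbank.T + 1e-10)
    # DCT-II basis for coefficients m = 1..n_ceps over k = 1..n_filters.
    n_filt = fbank.shape[0]
    k = np.arange(1, n_filt + 1)
    m = np.arange(1, n_ceps + 1)
    basis = np.cos(np.pi * np.outer(m, k - 0.5) / n_filt)
    return log_energy @ basis.T  # shape: (n_frames, n_ceps)
```

With a sampling rate of 16000 Hz, a one-second signal and a filter bank of 32 filters over 257 bins (a 512-point DFT), this yields 98 frames of 12 coefficients each.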

DATABASES
The experimental study was carried out on two pathological voice databases, the MEEI database and the Saarbruecken Voice Database (SVD). In the first database, the acoustic samples are recordings of up to 12 seconds of readings of the "Rainbow Passage" by men and women. In the second database, we used recordings of the sentence "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). For the MEEI database, we chose a subset comprising 53 healthy voices (33 female and 20 male) and 96 pathological voices (47 female and 49 male). For the SVD, we selected a subset comprising 211 healthy voices (127 female and 84 male) and 154 pathological voices (95 female and 59 male). Table 1 summarizes the number of pathological voice samples from each database.

MEEI Database
The MEEI database was recorded at the Massachusetts Eye and Ear Infirmary and marketed by Kay Elemetrics. It contains recordings of the sustained vowel phonation [ah] (3 to 4 s long) and of the first 12 seconds of the Rainbow Passage, spoken by normophonic subjects and by patients with psychogenic, neurological, organic and traumatic voice disorders at different stages (from onset to full development). The speech samples were recorded in a controlled environment at a sampling rate of 25 kHz or 50 kHz with 16-bit resolution [11].

Receiver Operating Characteristic (ROC) analysis is a useful method of measuring the ability of a voice recognition model to distinguish between people with an illness and those without. Its use in speech processing stems from a method for summarizing the specificity and sensitivity of diagnostic tests across a range of possible cut-off points. The area under the ROC curve can be interpreted as a probability of correct classification or prediction [30]. In this paper we discuss the use of the area under the ROC curve (AUC) as a measure of the performance of a classifier.

The Area under the ROC curve
The area under the ROC curve (AUC) is one of the most popular summary indices associated with the ROC curve. It is an overall measure of the performance of the diagnostic test. The AUC takes values between 0 and 1, where 0.5 corresponds to a noninformative test. The closer the AUC is to 1, the better the overall diagnostic performance of the test [31] [32].
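To make the AUC concrete, the following sketch builds the ROC points by sweeping the decision threshold over the classifier scores and integrates them with the trapezoidal rule. It assumes a binary setting (1 = pathological, 0 = healthy) in which a higher score means "more likely pathological"; the function name and this convention are illustrative and are not taken from the HTK-based system.

```python
def roc_auc(scores, labels):
    """Return (roc_points, auc) for binary labels and real-valued scores.

    roc_points is a list of (false positive rate, true positive rate) pairs,
    i.e. (1 - specificity, sensitivity), one per distinct threshold.
    """
    pos = sum(1 for label in labels if label == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Lower the threshold step by step, from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    # Trapezoidal integration of sensitivity over (1 - specificity).
    auc = sum((x1 - x0) * (y0 + y1) / 2.0
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc
```

Perfectly separated scores give an AUC of 1, while identical scores for all samples give 0.5, the noninformative case.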

RESULTS AND DISCUSSIONS
The results of the detection and classification experiments are expressed in terms of accuracy (the ratio of correctly detected samples to the total number of samples), sensitivity (the proportion of pathological samples identified as positive), specificity (the proportion of normal samples identified as negative), and the area under the receiver operating characteristic (ROC) curve (AUC). The features extracted from the two databases must be verified in both the detection and the classification processes. Therefore, many experiments were carried out to verify their reliability and accuracy in both processes. Detection and classification experiments were carried out individually for each combination of parameters (HFCC, F0, HNR, NHR and DFA) and for each value of ERB.

Table 3 gives the pathological voice recognition rates for the MEEI database. The acoustic models for this database use four-Gaussian mixture densities. The best recognition rates are obtained for HFCC-NHR with ERB = 6 and HFCC-F0-NHR with ERB = 4 (99.07%), followed by HFCC-F0-NHR with ERB = 5 and HFCC-HNR-DFA with ERB = 6 (98.13%). The recognition rate tends to improve as the ERB increases; here the best rates are obtained for ERB = 4, 5 and 6, as shown in Figure 3.

Table 4 gives the pathological voice recognition rates for the SVD database. The acoustic models for this database also use four-Gaussian mixture densities. The best recognition rates are obtained for HFCC-NHR with ERB = 6 (87.40%) and HFCC-HNR-NHR with ERB = 6 (86.03%). Again, the recognition rate tends to improve as the ERB increases; here the best rate is obtained for ERB = 6, as shown in Figure 4.
According to Table 3, the combination of the HFCC-NHR parameters gives the highest overall recognition rate for ERB = 6. This result is detailed in Table 5, which gives the recognition rate by pathology for female and male voices for each ERB value. In this case the best recognition is obtained for female voices. Tables 6 and 7 present the classification results for each type of pathology and for the different ERB values, for female and male voices respectively. Table 6 gives the performance of the recognition system for female voices with the HFCC-NHR combination. The accuracy varies between 83% and 100% as the ERB varies from 1 to 6, and increasing the ERB improves the overall recognition of pathologies.

Similarly, the performance of the recognition system for male voices with the HFCC-NHR combination is given in Table 7. The system is accurate: the accuracy varies between 84% and 100% as the ERB varies from 1 to 6, and increasing the ERB again improves the overall recognition of pathologies (for ERB = 5 and ERB = 6, all pathologies are correctly predicted).

Figure 6 combines sensitivity and specificity and shows that the pathological voice recognition system is effective in this case. The area under the curve (AUC) varies between 0.99 and 1 for the different ERB values. The best performance is obtained for ERB values of 5 and 6, for which the distinction between the different types of pathology is perfect.

Fig. 6 ROC for classification by HFCC-HNR for MEEI database for male voices.

Combining HFCC-F0-NHR parameters
For the HFCC-F0-NHR combination, the recognition system is more accurate for male voices on the ventricular, paralysis and hyperfunction pathologies, for all ERB values.
On the other hand, for all types of pathologies and at the different ERB values, the recognition rate for female voices is better. Table 8 gives an overview of the results.

According to Table 3, the best pathology recognition rate for the HFCC-NHR combination is obtained for female voices. Table 11 shows that the laryngitis and spasmodic pathologies reach a recognition rate of 100% for ERB values of 3, 4 and 5, while for male voices the best recognition rate, 98.8%, is obtained for healthy voices. The recognition rate of pathologies improves as the ERB increases.

Table 12 gives the performance of the recognition system on the SVD database for female voices with the HFCC-NHR combination. The best recognition rates are 100% (ERB = 3, 4 and 5) for the laryngitis and spasmodic pathologies, 96.1% (ERB = 6) for normal voices and 90.9% (ERB = 6) for hyperfunction. However, accuracy alone is not sufficient to draw a final conclusion on the effectiveness of the recognition system, so these measures should be supplemented by the ROC curve shown in Figure 9. The latter shows that the best performance is obtained with the highest values of the equivalent rectangular bandwidth (ERB) (AUC = 0.9704).

Fig. 9 ROC for classification by HFCC-NHR for SVD database for female voices.

Table 13 gives the performance of the recognition system on the SVD database for male voices with the HFCC-NHR combination. The best recognition rates are 98.8% (ERB = 5), 87.5% (ERB = 1, 2, 4), 77.8% (ERB = 3, 6) and 76.7% (ERB = 6), corresponding respectively to the Normal, Spasmodic, laryngitis, polyp and Hyperfunction classes. However, accuracy alone is not sufficient to draw a final conclusion on the effectiveness of the recognition system, so these measures should be supplemented by the ROC curve shown in Figure 10. The latter shows that the best performance is obtained for ERB values of 5, 4 and 3, with AUC values of 0.9752, 0.9734 and 0.9671, respectively.

Conclusion
In this work, we improved the assessment of voice disorders using prosodic parameters on two different databases, MEEI and SVD. The first database provides 105 test recordings of the first 12 seconds of the Rainbow Passage spoken by men and women (58 female patients and 47 male patients), representing 5 types of pathologies (ventricular, gastric, edema, paralysis and hyperfunction). The second database consists of 4 types of pathologies (hyperfunction, laryngitis, polyp and spasmodic); its samples are recordings of the phrase "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?") pronounced by male and female patients. There are 365 recordings in this database, divided into 143 samples from male patients and 222 samples from female patients. Recognition rates varied from database to database for the same combination of prosodic parameters. The best overall recognition rates we obtained were 99.07% and 87.40% for the MEEI and SVD samples, respectively. The recognition rates and sensitivities obtained in this study are essential for detecting and classifying vocal pathologies. In particular, certain combinations of parameters, such as HFCC-NHR and HFCC-F0-NHR, show strong potential for the detection and classification of voice pathologies. For the MEEI database, the recognition of pathological voices is better for female voices with these two combinations. The recognition system is most accurate for an equivalent rectangular bandwidth (ERB) of 6.
For the SVD database, we conclude that female voices are recognized better than male voices for the spasmodic, laryngitis and hyperfunction pathologies. On the other hand, the polyp pathology is recognized better for male voices than for female voices.