Speech Enhancement Using Minimum Mean-Square Error Amplitude Estimators Under Normal and Generalized Gamma Distribution

Problem statement: In this study, DFT-based speech enhancement via Minimum Mean-Square Error (MMSE) amplitude estimators was considered. Approach: Several variants of the basic approach (MMSE-STSA) have been proposed over the years to address certain shortcomings, chiefly the quality of the residual noise and its trade-off with speech distortion. We presented a comparative study between the MM-LSA estimator and the estimators based on the Gamma model, followed by a Matlab implementation of these algorithms and an objective evaluation using a speech corpus. Results: We obtained the best values of the parameters used by the different estimators. Conclusion: Objective evaluation confirmed that, in stationary environments, the estimators derived under the generalized Gamma distribution are superior to the estimators derived under the normal distribution in both noise suppression and quality of the enhanced speech.


INTRODUCTION
The interest in the field of speech enhancement stems from the increased use of digital speech processing applications such as mobile telephony, digital hearing aids and human-machine communication systems in our daily life. The trend to make these applications mobile increases the variety of potential sources of quality degradation. Speech enhancement methods can be used to increase the quality of these speech processing devices and make them more robust under noisy conditions. A large group of speech enhancement methods is meant to improve certain quality aspects of these devices. In this study we focus on single-microphone additive noise reduction and on methods that operate in the Discrete Fourier Transform (DFT) domain.
The traditional hypothesis for speech enhancement in the DFT domain is that the distribution of the complex speech DFT coefficients is Gaussian (Ephraim and Malah, 1984;1985). Therefore, the spectral amplitude distribution is modeled by a Rayleigh distribution. More recently, super-Gaussian models of the DFT coefficients have been used, because they lead to estimators with better performance than those based on a Gaussian model. Martin (2005) derived complex-DFT estimators under Laplacian and Gamma speech assumptions. Lotter and Vary (2005) proposed a MAP amplitude estimator for a generalized Gamma amplitude distribution.
MMSE estimators of the complex DFT coefficients, assuming a two-sided generalized Gamma distribution, have also been derived in the literature. MMSE estimators for the amplitudes, assuming a one-sided generalized Gamma distribution, are treated in (Andrianakis and White, 2006) and elsewhere. For all these estimators, the decision-directed method is commonly used to estimate the a priori SNR (Ephraim and Malah, 1984).
In this study, we present a comparative study between the MM-LSA estimator, which is the most efficient variant of the estimators based on the Gaussian model, and the estimators based on the Gamma model. This study is followed by a Matlab implementation of these algorithms and an objective evaluation using a speech corpus.

MMSE spectral estimation: Modeling noise DFT magnitudes and assumptions:
Assume that we observe a noisy speech signal y(t) that is the sum of a speech signal x(t) and a noise signal d(t), which are uncorrelated. Their representation in the Short Time Fourier Transform (STFT) domain is given by:

$$Y(k, l) = X(k, l) + D(k, l)$$

where Y(k, l), X(k, l) and D(k, l) are the STFT samples of the noisy speech, the clean speech and the noise signal, respectively. The index k corresponds to the frequency bins and the index l to the time frames of the STFT. Since DFT coefficients from different time frames and frequency indices are assumed to be independent, the indices k and l will be omitted for simplicity. We can write $X = A e^{j\Phi}$ and $Y = R e^{j\Theta}$, where the random variables A and R represent the clean and noisy amplitudes and Φ and Θ the corresponding phase values.
In this study we focus on MMSE estimation of the clean amplitude A. The MMSE estimate of A is the expectation of the clean amplitude conditional on the noisy amplitude r, that is, $E\{A \mid r\}$. With Bayes' formula we can express the MMSE estimate Â as:

$$\hat{A} = E\{A \mid r\} = \frac{\int_0^\infty a\, f_{R|A}(r \mid a)\, f_A(a)\, da}{\int_0^\infty f_{R|A}(r \mid a)\, f_A(a)\, da}$$

The estimation of the clean amplitude A requires some assumptions about the distribution of the speech and the noise. The speech has usually been assumed Gaussian, e.g., (Ephraim and Malah, 1984;1985), but more recently estimators based on super-Gaussian speech assumptions such as Laplacian or Gamma distributions have been derived (Lotter and Vary, 2004). A similar development has been seen for the noise assumptions: most commonly the noise is assumed Gaussian, but estimators exist which suppose the noise to obey a super-Gaussian distribution (Lotter and Vary, 2004).
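With the zero-mean Gaussian assumption on the noise DFT coefficients, $f_{R|A}(r \mid a)$ can be written as (McAulay and Malpass, 1980):

$$f_{R|A}(r \mid a) = \frac{2r}{\lambda_d} \exp\left(-\frac{r^2 + a^2}{\lambda_d}\right) I_0\left(\frac{2ar}{\lambda_d}\right)$$

where $I_0$ is the 0th order modified Bessel function of the first kind and $\lambda_d$ is the variance of the noise DFT coefficients.

Gaussian based short-time spectral amplitude estimator:
In this case, the DFT coefficients of both the speech and the noise are assumed to be independent Gaussian random variables. Moreover, the speech signal might not be present at all times and at all frequencies. We therefore consider the following two hypotheses for each bin k:

$$H_0^k: \text{speech absent}, \quad Y_k = D_k$$
$$H_1^k: \text{speech present}, \quad Y_k = X_k + D_k$$

Hence, the probability density functions can be given as:

$$p(Y_k \mid H_0^k) = \frac{1}{\pi \lambda_d(k)} \exp\left(-\frac{|Y_k|^2}{\lambda_d(k)}\right)$$
$$p(Y_k \mid H_1^k) = \frac{1}{\pi (\lambda_x(k) + \lambda_d(k))} \exp\left(-\frac{|Y_k|^2}{\lambda_x(k) + \lambda_d(k)}\right)$$

where $\lambda_x(k)$ and $\lambda_d(k)$ are the variances of the spectral components of the speech and the noise, respectively. Let $C_k$ be some function of the short-time spectral amplitude $A_k$ of the clean speech in the kth bin (e.g., $A_k$, $\log A_k$, $A_k^2$). The MMSE estimator $\hat{C}_k$ of $C_k$ is given by (McAulay and Malpass, 1980):

$$\hat{C}_k = E\{C_k \mid Y_k, H_1^k\}\, P(H_1^k \mid Y_k) + E\{C_k \mid Y_k, H_0^k\}\, P(H_0^k \mid Y_k)$$

where $E\{\cdot \mid \cdot\}$ and $P(\cdot \mid \cdot)$ denote conditional expectations and conditional probabilities, respectively. Thus, for $C_k = A_k$, the second term vanishes because the clean amplitude is zero under $H_0^k$:

$$\hat{A}_k = E\{A_k \mid Y_k, H_1^k\}\, P(H_1^k \mid Y_k)$$

Based on the results reported in (Ephraim and Malah, 1985;Malah et al., 1999), the Multiplicatively-Modified Log-Spectral Amplitude (MM-LSA) estimator (corresponding to $C_k = \log A_k$) outperformed the traditional MMSE-STSA estimator (Ephraim and Malah, 1984), which corresponds to $C_k = A_k$, both with and without incorporating speech presence uncertainty (denoted MMSE-SPU and MMSE, respectively).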
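The MM-LSA estimator is (Malah et al., 1999):

$$\hat{A}_k = G_{MM}(k)\, G_{LSA}(k)\, R_k$$

Under the Gaussian assumptions on the speech and noise, the gain function $G_{LSA}(k)$ is derived in (Ephraim and Malah, 1985) to be:

$$G_{LSA}(k) = \frac{\xi_k}{1 + \xi_k} \exp\left(\frac{1}{2} \int_{v_k}^{\infty} \frac{e^{-t}}{t}\, dt\right)$$

where:

$$v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k, \quad \xi_k = \frac{\eta_k}{1 - q_k}$$

Here $\eta_k$ is called the a priori SNR, $\gamma_k$ is the a posteriori SNR and $q_k$ is the a priori probability of speech absence in the kth bin.
The gain modification $G_{MM}(k)$ is the soft-decision modification of the optimal estimator under the signal presence hypothesis and is given by (Ephraim and Malah, 1984;Malah et al., 1999):

$$G_{MM}(k) = \frac{\Lambda(k)}{1 + \Lambda(k)}$$

where the likelihood ratio Λ(k) is defined as:

$$\Lambda(k) = \frac{1 - q_k}{q_k}\, \frac{p(Y_k \mid H_1^k)}{p(Y_k \mid H_0^k)}$$

and $q_k$ denotes the a priori probability of speech absence in the kth bin. Using the Gaussian densities under the two hypotheses, we get:

$$\Lambda(k) = \frac{1 - q_k}{q_k}\, \frac{\exp(v_k)}{1 + \xi_k}$$

Gamma based short-time spectral amplitude estimator: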
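In the Gamma based MMSE estimators of the speech DFT magnitudes, we assume that the speech DFT magnitudes are distributed according to a one-sided generalized Gamma prior density of the form:

$$f_A(a) = \frac{\gamma \beta^{\nu}}{\Gamma(\nu)}\, a^{\gamma\nu - 1} \exp(-\beta a^{\gamma}), \quad a \ge 0$$

where Γ(·) is the Gamma function and the random variable A represents the DFT magnitudes, with the constraints on the parameters β>0, γ>0, ν>0. The shape parameter γ of the prior should not be confused with the a posteriori SNR $\gamma_k$.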
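As a quick sanity check on this prior, the following minimal Python sketch evaluates the density and verifies two properties numerically. It assumes the usual form of the one-sided generalized Gamma density, f_A(a) = γ β^ν / Γ(ν) · a^(γν−1) · exp(−β a^γ). The function names and the midpoint integration are illustrative choices and not part of the original Matlab implementation.

```python
import math

def gen_gamma_pdf(a, beta, gamma, nu):
    """One-sided generalized Gamma density (assumed form, see lead-in)."""
    if a <= 0.0:
        return 0.0  # density is defined for a >= 0; a = 0 is a set of measure zero
    norm = gamma * beta ** nu / math.gamma(nu)
    return norm * a ** (gamma * nu - 1.0) * math.exp(-beta * a ** gamma)

def rayleigh_pdf(a, speech_var):
    """Rayleigh amplitude density implied by a complex Gaussian speech model."""
    return 2.0 * a / speech_var * math.exp(-a * a / speech_var)

def second_moment(beta, gamma, nu, upper=12.0, steps=20000):
    """E{A^2} by midpoint integration (illustrative, not optimized)."""
    h = upper / steps
    return sum(((i + 0.5) * h) ** 2 * gen_gamma_pdf((i + 0.5) * h, beta, gamma, nu)
               for i in range(steps)) * h

# gamma = 2, nu = 1, beta = 1/speech_var reproduces the Rayleigh (Gaussian) case.
rayleigh_gap = abs(gen_gamma_pdf(0.7, 0.5, 2.0, 1.0) - rayleigh_pdf(0.7, 2.0))

# For gamma = 2 the second moment is nu / beta, so beta = nu / speech_var
# ties the prior to a given speech variance (here speech_var = 1, nu = 0.5).
m2 = second_moment(beta=0.5, gamma=2.0, nu=0.5)
print(rayleigh_gap, m2)
```

The Rayleigh special case shows that the Gaussian model is contained in the Gamma family, so smaller values of ν correspond to more peaked, heavier-tailed amplitude priors.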
The Gamma based MMSE amplitude estimators for the cases γ = 1 and γ = 2 have been derived in (Andrianakis and White, 2006;Hendriks et al., 2006;Erkelens et al., 2007). We will use the case γ = 2, as the related estimator can be derived without any approximations and the maximum achievable performance for both cases is about the same.
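The two gain functions compared in this study can be sketched in Python as follows. This is a hedged illustration, not the authors' Matlab code. The MM-LSA gain assumes G_LSA = ξ/(1+ξ)·exp(E1(v)/2) and Λ = (1−q)/q · exp(v)/(1+ξ) with v = ξγ_post/(1+ξ). The Gamma gain for γ = 2 is the closed form obtained by inserting the prior (with β = ν/λ_x, an assumption that fixes the speech variance) and the Rician likelihood into the Bayes estimate; it may differ in notation from the expressions in the cited papers. The helpers `exp1` and `hyp1f1` are simple stand-ins for library special functions and are only adequate for moderate arguments.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp1(x, tol=1e-12, max_iter=200):
    """Exponential integral E1(x) = int_x^inf exp(-t)/t dt, x > 0."""
    if x <= 0.0:
        raise ValueError("exp1 requires x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma_E - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, max_iter):
            term *= -x / k              # term = (-x)^k / k!
            total -= term / k
            if abs(term / k) < tol:
                break
        return -EULER_GAMMA - math.log(x) + total
    # Continued fraction evaluated with the modified Lentz method for x > 1
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -float(i * i)
        b += 2.0
        d = 1.0 / (an * d + b)
        c = b + an / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < tol:
            break
    return h * math.exp(-x)

def hyp1f1(a, b, z, tol=1e-14, max_iter=2000):
    """Confluent hypergeometric function M(a; b; z) by its power series."""
    total, term = 1.0, 1.0
    for n in range(max_iter):
        term *= (a + n) / (b + n) * z / (n + 1)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def mmlsa_gain(xi, gamma_post, q=0.3):
    """MM-LSA gain G_MM * G_LSA; xi = eta / (1 - q), gamma_post > 0."""
    v = xi / (1.0 + xi) * gamma_post
    g_lsa = xi / (1.0 + xi) * math.exp(0.5 * exp1(v))
    inv_lambda = q / (1.0 - q) * (1.0 + xi) * math.exp(-v)   # 1 / Lambda
    g_mm = 1.0 / (1.0 + inv_lambda)                           # Lambda / (1 + Lambda)
    return g_mm * g_lsa

def gamma_gain(xi, gamma_post, nu):
    """MMSE amplitude gain under the gamma = 2 generalized Gamma prior (sketch)."""
    z = gamma_post * xi / (nu + xi)
    ratio = hyp1f1(nu + 0.5, 1.0, z) / hyp1f1(nu, 1.0, z)
    scale = math.exp(math.lgamma(nu + 0.5) - math.lgamma(nu))
    return scale * math.sqrt(xi / (gamma_post * (nu + xi))) * ratio

print(mmlsa_gain(1.0, 2.0), gamma_gain(1.0, 2.0, nu=0.1))
```

For ν = 1 the Gamma gain reduces to the classical MMSE-STSA gain of the Gaussian model, which gives a convenient consistency check. Because both hypergeometric series grow like exp(z), a production implementation would need a scaled or asymptotic evaluation for high a posteriori SNR.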

The decision-directed estimator of the a priori SNR:
In order to evaluate the above gain functions, we must first estimate the noise power $\lambda_d(k)$. This is often done during periods of speech absence as determined by a Voice Activity Detector (VAD), by using a noise-estimation algorithm like the minimum statistics approach (Martin, 1994;2001), or by using the actual noise signal in comparative studies. The a posteriori SNR estimate $\gamma_k$ is the ratio of the squared input amplitude $R_k^2$ and the estimated noise spectrum.
In (Ephraim and Malah, 1984;1985;Cape, 1994), a decision-directed approach for the a priori SNR estimation is proposed:

$$\hat{\eta}_k(l) = \max\left[\alpha\, \frac{\hat{A}_k^2(l-1)}{\lambda_d(k)} + (1 - \alpha) \max(\gamma_k(l) - 1,\ 0),\ \eta_{min}\right]$$

where the smoothing factor satisfies 0≤α≤1 (a value of α = 0.98 was used in the implementation). The lower limit $\eta_{min}$, recommended by (Cape, 1994), plays the same role as the spectral floor in the basic spectral subtraction method (Berouti et al., 1979). A lower limit of at least -15 dB is recommended.
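The recursion can be illustrated with a minimal Python sketch for a single frequency bin. It assumes the standard decision-directed rule, in which the previous clean-amplitude estimate and the current a posteriori SNR are combined with weight α and then floored at η_min. The function names are hypothetical, and the Wiener gain is only a stand-in for whichever gain function (MM-LSA or Gamma based) is being evaluated.

```python
def decision_directed(prev_amp_sq, noise_var, gamma_post, alpha=0.98, eta_min_db=-15.0):
    """A priori SNR estimate for one bin and one frame (assumed standard form)."""
    eta_min = 10.0 ** (eta_min_db / 10.0)           # lower limit converted from dB
    estimate = (alpha * prev_amp_sq / noise_var
                + (1.0 - alpha) * max(gamma_post - 1.0, 0.0))
    return max(estimate, eta_min)

def wiener_gain(eta, gamma_post):
    """Placeholder gain used only to close the loop in this sketch."""
    return eta / (1.0 + eta)

def track_bin(noisy_amps, noise_var, gain_fn=wiener_gain, alpha=0.98, eta_min_db=-15.0):
    """Run the decision-directed recursion over successive frames of one bin."""
    prev_amp_sq = 0.0                                # no speech estimate before frame 0
    enhanced = []
    for r in noisy_amps:
        gamma_post = r * r / noise_var               # a posteriori SNR
        eta = decision_directed(prev_amp_sq, noise_var, gamma_post, alpha, eta_min_db)
        a_hat = gain_fn(eta, gamma_post) * r         # enhanced amplitude
        prev_amp_sq = a_hat * a_hat                  # feeds the next frame
        enhanced.append(a_hat)
    return enhanced

out = track_bin([1.0, 1.0, 3.0, 3.0], noise_var=1.0)
print(out)
```

With α close to 1 the estimate follows the previous frame's output much more strongly than the instantaneous a posteriori SNR, which is what smooths the gain over time and reduces musical noise.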

Implementation and performance evaluation:
For the experiment, the NOIZEUS database (Hu and Loizou, 2007) was used, which consists of 30 IRS-filtered speech signals sampled at 8 kHz, contaminated by various additive noise sources. The frame size is 256 samples, with an overlap of 50%. The data window used was a Hanning window. The enhanced signal was reconstructed using the overlap-add approach. The a priori probability of speech absence, $q_k$, was set to $q_k = 0.3$. The noise variance was estimated from 0.64 seconds of noise only, preceding speech activity. Matlab implementations available from (Borrowes, 2003) have been used to evaluate the confluent hypergeometric functions.
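The framing parameters above can be checked with a short Python sketch. It assumes a periodic Hann window (the text does not say whether the periodic or symmetric variant was used) and omits the DFT and gain stages, so it only illustrates the analysis and overlap-add bookkeeping. All names are illustrative.

```python
import math

FS = 8000          # sampling rate in Hz
FRAME_LEN = 256    # samples per frame
HOP = 128          # 50% overlap

def hann_periodic(n):
    """Periodic Hann window; two copies shifted by n/2 sum to one."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / n)) for i in range(n)]

def split_frames(x, frame_len, hop):
    """Slice x into overlapping frames, dropping any incomplete tail."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def overlap_add(frames, hop, total_len):
    """Sum the frames back at their original positions."""
    out = [0.0] * total_len
    for m, frame in enumerate(frames):
        for i, v in enumerate(frame):
            out[m * hop + i] += v
    return out

signal = [math.sin(0.05 * n) for n in range(1024)]
window = hann_periodic(FRAME_LEN)
frames = [[w * s for w, s in zip(window, fr)]
          for fr in split_frames(signal, FRAME_LEN, HOP)]
rebuilt = overlap_add(frames, HOP, len(signal))

# 0.64 s of noise-only signal at 8 kHz is 5120 samples.
noise_frames = split_frames([0.0] * round(0.64 * FS), FRAME_LEN, HOP)
print(len(frames), len(noise_frames))
```

Away from the first and last half frame, the windowed frames add back to the input, so any change in the output comes from the spectral gain alone. Under these assumptions the noise-only segment yields 39 overlapping frames for the initial noise variance estimate.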
To measure quality of the enhanced signal, we have used the segmental SNR, the Log-Likelihood Ratio measure (LLR) (Hansen and Pellom, 1998) and the Perceptual Evaluation of Speech Quality (PESQ) (Rix et al., 2001). All the measures show high correlation with subjective quality.
The LLR measure for each 20-ms speech frame is given by:

$$LLR(\mathbf{a}_d, \mathbf{a}_\phi) = \log\left(\frac{\mathbf{a}_d \mathbf{R}_\phi \mathbf{a}_d^T}{\mathbf{a}_\phi \mathbf{R}_\phi \mathbf{a}_\phi^T}\right)$$

where $\mathbf{a}_\phi$ and $\mathbf{a}_d$ are the Linear Prediction (LP) coefficient vectors for the clean and processed speech frames, respectively, and $\mathbf{R}_\phi$ is the autocorrelation matrix of the clean speech frame. The LLR is a spectral distance measure which mainly models the mismatch between the formants of the original and enhanced signals. The mean LLR value was obtained by averaging the individual frame LLR values across the sentence. The highest 5% of the LLR values were discarded, as suggested in (Hansen and Pellom, 1998), to exclude unrealistically high spectral distance values. The lower the LLR of an enhanced speech signal, the better its perceived quality.
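A minimal Python sketch of the frame-level LLR and the trimmed average is given below. It assumes the autocorrelation method with the Levinson-Durbin recursion and an LP order of 10, a common choice at 8 kHz that is not stated in the text. The synthetic frames are for illustration only and are not taken from the evaluation corpus.

```python
import math
import random

def autocorr(x, order):
    """Biased autocorrelation sequence r[0..order] of one frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson(r, order):
    """LP error-filter coefficients [1, a1, ..., ap] via Levinson-Durbin."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                       # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                 # updated prediction error power
    return a

def quad_form(a, r):
    """a R a^T, where R is the Toeplitz matrix built from r."""
    p = len(a)
    return sum(a[i] * a[j] * r[abs(i - j)] for i in range(p) for j in range(p))

def llr(clean, processed, order=10):
    """Log-likelihood ratio between a clean and a processed frame."""
    r_clean = autocorr(clean, order)
    a_clean = levinson(r_clean, order)
    a_proc = levinson(autocorr(processed, order), order)
    return math.log(quad_form(a_proc, r_clean) / quad_form(a_clean, r_clean))

def mean_llr(frame_llrs, discard_fraction=0.05):
    """Average after discarding the highest fraction of frame values."""
    keep = max(1, int(round(len(frame_llrs) * (1.0 - discard_fraction))))
    kept = sorted(frame_llrs)[:keep]
    return sum(kept) / len(kept)

# Synthetic 20-ms frames at 8 kHz (160 samples): a first-order AR process
# and a noisy copy of it, used only to exercise the functions.
rng = random.Random(0)
clean_frame, state = [], 0.0
for _ in range(160):
    state = 0.9 * state + rng.gauss(0.0, 1.0)
    clean_frame.append(state)
noisy_frame = [s + rng.gauss(0.0, 1.0) for s in clean_frame]
print(llr(clean_frame, clean_frame), llr(clean_frame, noisy_frame))
```

Because the clean LP vector minimizes the quadratic form in the denominator, the ratio is at least one and the LLR is non-negative, reaching zero only when the two LP vectors coincide.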
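Since the correlation of the overall SNR with subjective quality is poor, we instead use the frame-based segmental SNR, which averages frame-level SNR estimates and is defined by (Hansen and Pellom, 1998):

$$SNR_{seg} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{n=Nm}^{Nm+N-1} x^2(n)}{\sum_{n=Nm}^{Nm+N-1} \left(x(n) - \hat{x}(n)\right)^2}$$

where M denotes the number of frames, N the frame length, x(n) the clean speech and $\hat{x}(n)$ the enhanced speech. The frame-level SNR values are limited to the range between a lower and an upper threshold, selected to be -10 dB and +35 dB, respectively. The Perceptual Evaluation of Speech Quality (Rix et al., 2001) predicts the subjective quality of speech signals with high correlation between subjective and objective results and expresses the quality in a score from 1.0 (worst) up to 4.5 (best).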
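The segmental SNR with its two thresholds can be sketched in Python as follows. The frame length of 256 samples and the use of non-overlapping frames are assumptions made for this illustration, since the text does not state the framing used for this measure.

```python
import math

def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0):
    """Mean of frame-level SNRs in dB, each limited to [lo_db, hi_db]."""
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    total = 0.0
    for m in range(n_frames):
        start = m * frame_len
        x = clean[start:start + frame_len]
        y = enhanced[start:start + frame_len]
        sig = sum(v * v for v in x)
        err = sum((a - b) ** 2 for a, b in zip(x, y))
        if err == 0.0:
            snr = hi_db                      # perfect frame: use the upper threshold
        elif sig == 0.0:
            snr = lo_db                      # silent clean frame: use the lower threshold
        else:
            snr = 10.0 * math.log10(sig / err)
        total += min(max(snr, lo_db), hi_db)
    return total / n_frames

reference = [1.0] * 512
print(segmental_snr(reference, [0.9] * 512))
```

Limiting each frame keeps silent or nearly perfect frames from dominating the average, which is the purpose of the thresholds.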

RESULTS AND DISCUSSION
We evaluated the two estimators (MM-LSA and the Gamma based estimator). For a proper choice of ν, we evaluated the Gamma based estimator for a wide range of values between 0.01 and 2.5. Figures 1 and 2 show plots of the segmental SNR and PESQ versus ν for γ = 2, at 0 and 5 dB SNR, in the case of white noise and babble noise, respectively.
The PESQ plots and the segmental SNR plots follow a similar trend. Furthermore, the best performance is reached with lower ν-values, and the Gamma based estimator scores very well for ν≈0.1. The quality of speech enhanced by the Gamma estimator was compared against the quality of speech produced by the other MMSE STSA estimators. Tables 1 and 2 summarize the objective results for noisy speech, for speech enhanced with the MM-LSA estimator and for speech enhanced with the Gamma based estimator.
From the results in Tables 1 and 2, it can be seen that the Gamma based estimator had higher scores than the MM-LSA estimator for all noise types at 0 and 5 dB SNR. Further, the enhanced speech from the Gamma estimator sounds less musical than that obtained from the other estimator. This can be attributed to the fact that the Gamma priors fit measured speech DFT distributions better than the Gaussian priors.

CONCLUSION
This study considered DFT based techniques for single channel speech enhancement. We showed an increase in the quality of the enhanced speech for different noise types. Results, in terms of objective measures and listening tests, indicated that the Gamma based estimator yielded better performance than the MM-LSA estimator based on a Gaussian model.
In the future, we plan to evaluate its possible application in preprocessing for new communication systems and hearing aid systems.