Audio Environment Recognition using Zero Crossing Features and MPEG-7 Descriptors

Problem statement: This study investigated zero crossing features and selected MPEG-7 audio descriptors for environment sound recognition applications such as audio forensics. Approach: Several experiments were carried out on the problem of recognizing the environment from audio, particularly for forensic applications. Results: The effect of the temporal zero crossing feature and of selected MPEG-7 audio low-level descriptors on environment sound recognition was investigated. Performance was evaluated against a varying number of training files and a varying number of samples per training file. Conclusion/Recommendations: Experimental results showed that higher recognition accuracy is achieved by increasing the number of training files and by decreasing the number of samples per training file. This study presented an approach to audio environment recognition using zero crossing features and MPEG-7 descriptors.


INTRODUCTION
Digital forensics can be defined as the collection of scientific techniques for the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for the purpose of facilitating or furthering the reconstruction of events, usually of a criminal nature (Delp et al., 2009). There are several areas of digital forensics: image forensics, audio forensics, video forensics and multimedia.
In this study, we focused on digital audio forensics. Digital audio forensics provides evidence from audio files left over in audio/video media at the crime scene. This type of forensics can be categorized into four classes according to its nature:
• Speaker identification/verification/recognition, to answer the question of who
• Speech recognition/enhancement, to answer the question of what
• Environment detection, to answer the question of where or in what situation
• Source authentication, to answer the question of how
A significant amount of research can be found in the areas of speech recognition or enhancement (Faghihi and Jangjoo, 2005), speaker recognition (Campbell et al., 2006) and authentication of audio (Begault et al., 2005). However, little research can be found in the area of environment recognition for digital audio forensics, where foreground human speech is present in environment recordings. Recognizing the environment from audio is difficult because, unlike in speech or speaker recognition, different environment sounds may have similar characteristics.
The study presents several experiments on environment recognition for digital audio forensics, covering seven environments: restaurant, office room, fountain, cafeteria, mall, meeting room and corridor. The temporal Zero Crossing (ZC) feature and selected MPEG-7 audio low-level descriptors are used as features. The MPEG-7 descriptors we use are Audio Waveform (AWF), Audio Power (AP), Audio Spectrum Envelope (ASE), Audio Spectrum Centroid (ASC) and Audio Spectrum Spread (ASS). This selection is based on our ongoing research using the Fisher ratio. Two types of experiments on environment recognition are performed by varying (a) the number of training files and (b) the number of samples per training file.
The study is organized as follows. The next part reviews related past work. It is followed by a description of feature extraction with ZC and MPEG-7 audio descriptors, the classifier and the data used in the experiments. The proposed approach to recognizing environment sound is then presented, together with the experimental results and discussion. Finally, conclusions and future directions are given.
Literature review and current related work: Most previous work in environment detection used Mel Frequency Cepstral Coefficients (MFCC) as features, which are applied not only in environment detection but also in speech and speaker recognition applications (Zulkarnain and Nor, 2010), together with Hidden Markov Model (HMM) based classification. While HMMs are perhaps the most widely used classifiers across applications, the k-Nearest Neighbor classifier (k-NN) is also applied because of its simplicity (Duda et al., 2000). As noted previously, little work specifically targets forensic applications, but some related works bear on this area.
A comprehensive evaluation of computer and human performance in audio-based context (environment) recognition is presented by Eronen et al. (2006). In their study, Eronen et al. (2006) used several time-domain and spectral-domain features in addition to MFCC. Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Linear Discriminant Analysis (LDA) were used to reduce the dimensionality of the feature vector. Two types of classifiers were applied separately: k-NN (k = 1) and HMM with the number of states and the number of mixtures within each state varying from 1-4 (and 5), respectively. Nature and outdoors were recognized with the highest accuracy (96-97%), while the library, a quiet place, had the lowest accuracy (35%).
Chu et al. (2008) introduced the Matching Pursuit (MP) technique (Mallat and Zhang, 1993) to environmental sound recognition. MP provides a way to extract features that can describe sounds where other audio features such as MFCC fail. In their MP technique, they used Gabor function based time-frequency dictionaries. They claimed that features with Gabor properties could provide a flexible representation of the time and frequency localization of unstructured sounds in the background environment. They applied k-NN (k = 1) and a Gaussian Mixture Model (GMM) with 5 mixtures (Chu et al., 2006). In Chu et al. (2006), they also used a Support Vector Machine (SVM) with a second-degree polynomial kernel as classifier and reduced the dimension by applying forward feature selection and backward feature selection procedures.
Sixty-four dimensional MFCC plus the spectral centroid were used as features by Malkin and Waibel (2005). They used forensic-application-like audio files, in which both ambient (environmental) sound and human speech were present. However, for the experiments they selected only those segments that were quieter than the average power of an audio file. They introduced linear autoencoding neural networks for classifying the environment. A hybrid of autoencoder and GMM was used in their experiments and an average accuracy of 80.05% was obtained. Wang et al. (2006) used three MPEG-7 audio low-level descriptors as features in their study on environmental sound classification. They proposed a hybrid SVM and k-NN classifier. For the SVM, they used three different types of kernel functions: linear kernel, polynomial kernel and radial basis kernel. The system with three MPEG-7 features achieved 85.1% accuracy averaged over 12 classes. Ntalampiras et al. (2008) used MFCC along with MPEG-7 features to classify urban environments. They made full use of the MPEG-7 low-level descriptors, namely audio waveform, audio power, audio spectrum centroid, audio spectrum spread, audio spectrum flatness, harmonic ratio, upper limit of harmonicity and audio fundamental frequency.
To detect the microphone used and the background environment of audio recordings, Kraetzer et al. (2007) extracted 63 statistical features from audio signals. Seven of the features were time-domain features: empirical variance, covariance, entropy, LSB ratio, LSB flipping rate, mean of samples and median of samples. Besides these temporal features, they used 28 mel-cepstral features and 18 filtered mel-cepstral features. They applied k-NN and Naïve Bayes classifiers to evaluate microphone and environment classification. Their study reported that the highest accuracy, 41.54%, was obtained by Naïve Bayes classifiers with 10-fold cross validation, while 26.49% was the highest accuracy achieved by simple k-means clustering. They did not use HMM or GMM for classification.

Feature extraction:
Zero crossing: Zero crossing is a commonly used term in electronics, mathematics and image processing. In mathematical terms, a "zero crossing" is a point where the sign of a function changes (e.g., from positive to negative), represented by a crossing of the axis (zero value) in the graph of the function. Zero crossing features are well suited to recognizing environment sound when the number of training files is increased and the number of samples is decreased (Johnston and Gulrajani, 2002). Before the feature is extracted, the mean value is subtracted from each signal. The frame length is 512 samples, with an overlap of 256 samples between consecutive frames.
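The framing described above can be illustrated with a minimal sketch. It assumes the feature is the number of sign changes per frame and that a sample equal to zero is treated as non-negative; neither detail is specified in the text, and the function name `zero_crossing_counts` is hypothetical.

```python
def zero_crossing_counts(signal, frame_len=512, hop=256):
    """Count sign changes in each frame of a mean-removed signal.

    frame_len=512 and hop=256 follow the frame length and overlap
    stated in the text (512 samples, 256 samples of overlap).
    Assumption: a sample equal to zero counts as non-negative.
    """
    if not signal:
        return []
    # Subtract the mean value from the whole signal first.
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]

    counts = []
    # Only full-length frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(centered) - frame_len + 1, hop):
        frame = centered[start:start + frame_len]
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        counts.append(crossings)
    return counts
```

Each returned value is the ZC count of one frame; whether the counts are normalized by the frame length before classification is left open here.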
Selected MPEG-7 audio descriptor: MPEG-7 Audio describes audio content using low-level characteristics, structure and models. The objective of MPEG-7 Audio is to provide fast and efficient searching, indexing and retrieval of information from audio files. The characteristics can be divided into scalar and vector types. Scalar types return scalar values such as power or fundamental frequency, while vector types return, for example, spectrum flatness calculated for each band in a frame. In the following we briefly describe each characteristic, or descriptor, used. Though Ntalampiras et al. (2008) utilized a partial MPEG-7 feature with seven dimensions, we exploit the full advantage of MPEG-7 features in this study.
MPEG-7 Audio low-level descriptors: Audio Spectrum Spread (ASS) is the second moment of the log-frequency power spectrum. It indicates how much the power spectrum is spread out over the spectrum. It is measured by the root mean square deviation of the spectrum from its centroid. This feature can help to differentiate between noise-like or tonal sounds and speech.
Classifier: We used the k-Nearest Neighbor algorithm (k-NN) as classifier. k is the most important parameter in a text categorization system based on k-NN. In the classification process, the k documents nearest to the test document in the training set are determined first. Then, the prediction is made according to the category distribution among these k nearest neighbors. k-NN is one of the most popular algorithms for text categorization (Manning and Schutze, 1999). Many researchers have found that the k-NN algorithm achieves very good performance in their experiments on different data sets (Yang and Liu, 1999; Hirzallah, 2007; Baoli et al., 2002). The nearest neighbors are defined in terms of Euclidean distance.
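The decision rule just described can be sketched as follows. This is only an illustration, not the implementation used in the experiments: the function names, the value of k and the tie-breaking rule (ties go to the label of the closest neighbor) are assumptions, since the text does not state them.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_vectors, train_labels, query, k=1):
    """Return the majority label among the k training vectors nearest to query.

    Assumption: on a tied vote, the label that appears first among the
    nearest neighbors wins (Counter keeps insertion order for equal counts).
    """
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In this setting each training vector would hold the features of one training file (for example ZC or an MPEG-7 descriptor) and each label would name one of the seven environments.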
The Euclidean distance, or Euclidean metric, is the "ordinary" distance between two points that one would measure with a ruler and is given by the Pythagorean formula:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
where p and q denote two n-dimensional feature vectors.
We recorded audio signals from seven different scenarios: restaurant, office room, fountain, cafeteria, mall, meeting room and corridor. The recording for each environment is half an hour (30 min) long. Each environment recording is split into many files with a fixed number of samples d. For example, for the restaurant environment, file 1 runs from sample 1 to sample d, file 2 from sample d+1 to sample 2d and, similarly, file 10 from sample 9d+1 to sample 10d.
Sounds were recorded with an IC recorder (ICD-UX71F/UX81F/UX91F). Sampling rate was set to 22.05 kHz and quantization was 16 bit.
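The splitting of each recording into fixed-length files can be sketched as below. The helper names are hypothetical, and discarding a trailing partial file is an assumption; the text does not say how leftover samples were handled.

```python
SAMPLE_RATE_HZ = 22050  # 22.05 kHz, the sampling rate stated for the recordings


def split_into_files(num_samples, d):
    """Return 1-based inclusive (first, last) sample ranges for files of d samples.

    File k (counting from 1) covers samples (k-1)*d + 1 to k*d.
    Assumption: samples left over after the last full file are discarded.
    """
    return [(k * d + 1, (k + 1) * d) for k in range(num_samples // d)]


def file_duration_seconds(d, sample_rate=SAMPLE_RATE_HZ):
    """Duration in seconds of one file of d samples."""
    return d / sample_rate


# A 30 min recording at 22.05 kHz holds 30 * 60 * 22050 = 39,690,000 samples.
recording_samples = 30 * 60 * SAMPLE_RATE_HZ
ranges = split_into_files(recording_samples, 1_000_000)
```

Under these assumptions a file of 1,000,000 samples lasts about 45.4 s and a file of 500,000 samples about 22.7 s, the two file sizes used in the experiments.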

RESULTS AND DISCUSSION
The feature extraction and classification used in the experiments are:
• ZC and selected MPEG-7 descriptors as features
• k-NN as classifier
Two types of experiments are performed, one with a decreasing number of samples per file and the other with an increasing number of training files. First, we decrease the number of samples with the number of training files fixed at six. Second, we repeat the same procedure with the number of training files fixed at fifteen.

Six file training:
In this case the first six files of each environment are used for training and the last five files for testing. The experiment was run with two different numbers of samples per file: 1,000,000 and 500,000. The objective of this experiment is to see the effect of decreasing the number of samples. The results are presented in Fig. 1.
For the ZC feature, the average accuracy over all environments is 20% when the number of samples is 1,000,000. When we decrease the number of samples to 500,000, the average accuracy over all environments improves to 40%. The average accuracy increases for all features except AP and AWF_min. The ASE feature has the highest average accuracy, followed by ZC.
Fifteen file training:
The number of training files is increased from six to fifteen and the same experiment as described previously is repeated. The accuracy improves when the number of training files is increased. The highest accuracy is achieved when the number of samples is 1,000,000. The results are given in Fig. 2. All features gave an increased overall average accuracy when we decreased the number of samples and increased the number of training files. Figures 3 and 4 give recognition accuracies (%) for a fixed number of samples and a varying number of training files. From these figures, we can see that increasing the number of training files increases recognition accuracy for all the feature types. For the number of samples, however, the reverse is true: if we decrease the number of samples, the accuracies increase.

CONCLUSION
In this study we investigated zero crossing features and selected MPEG-7 audio descriptors for environment sound recognition applications such as audio forensics. The experimental results showed a significant improvement in accuracy using MPEG-7 Audio features and zero crossing when the number of training files is increased and the number of samples is decreased. Further study is needed to examine the effect of other types of features and classifiers on environment recognition in order to achieve yet higher performance. This study is an attempt to fill the knowledge gap in audio environment recognition using zero crossing features and MPEG-7 descriptors.