Classification of Breast Tumor in Mammogram Images Using Unsupervised Feature Learning

Abstract: In this study, we propose a learning-based approach that uses feature learning to minimize the manual effort required to extract features. First, we extract features from equally spaced sub-patches covering the input Region of Interest (ROI). The dimensionality of the extracted features is then reduced using max pooling. Furthermore, spherical k-means clustering coupled with max pooling (k-means-max pooling) is compared with a well-known feature extraction method, namely Bag-of-features. The resulting feature vector is fed to two different classifiers: K-Nearest Neighbor (K-NN) and Support Vector Machine (SVM). The performance of these classifiers is evaluated using Receiver Operating Characteristics (ROC). Our results show that k-means-max pooling, combined with K-NN, achieved good performance, with an average classification accuracy of 98.19%, sensitivity of 97.09% and specificity of 99.35%.


Introduction
Breast cancer is a disease diagnosed among women worldwide (Marrocco et al., 2010). Mammography is used as an early screening tool for breast cancer. Several studies (Perlmutter et al., 1997; Rawashdeh et al., 2013; Rojas Domínguez and Nandi, 2008) reported that radiologists may overlook some abnormalities during mammogram image screening. Hence, a Computer-Aided Diagnosis (CAD) system helps the radiologist to double-check the mammogram images. A large body of literature addresses CAD systems for breast cancer as well as techniques for improving classification accuracy (Etehadtavakol et al., 2013; Korkmaz and Korkmaz, 2015; Pak et al., 2015; Ramirez-Villegas and Ramirez-Moreno, 2012; Wang et al., 2014). Feature representation is one of the techniques developed to improve the classification accuracy of CAD. It comprises a set of techniques that convert the input data into a more meaningful representation that machine learning algorithms can readily use.
Traditional manually designed feature descriptors (Yadav et al., 2015), such as gradient operators and filter banks, are not able to capture the complex frequency-related variation found in medical images (Rose et al., 2010). This paves the way for designing more efficient image descriptors. Feature learning falls into two main categories, namely supervised and unsupervised learning. Supervised learning algorithms usually learn using classes or targets. In unsupervised learning methods, however, feature learning is carried out without using any classes (Hinton et al., 2006; Mairal et al., 2009), for example with autoencoders (Boureau and Cun, 2008).
To the best of our knowledge, limited work has been done on applying unsupervised feature learning to mammogram images. In this study, we adopted an unsupervised feature learning process based on unlabeled data (Coates and Ng, 2012). This approach can learn important and subtle features from the statistics of the image. Features were extracted from patches of size 8×8 covering the ROI of each mammogram image, and the learned features were then coupled with labels to train the classifiers. Moreover, the features obtained from the patches were pooled together to reduce their dimension. Finally, classification between normal and abnormal was performed using the features obtained from the unsupervised learning algorithm. The main contribution of this paper is the application of k-means-max pooling to mammogram images for automatic feature representation, in order to enhance the classification of mammogram images into normal and abnormal, as shown in Fig. 1.

Materials and Methods
The Digital Database for Screening Mammography (DDSM) is a publicly available resource used by the image analysis community (Heath et al., 2000). In this experiment, we used a total of 400 images, comprising 200 normal and 200 abnormal (benign and malignant) cases. The images were cropped manually to select the ROI and cut off the unwanted portions; the ROIs were cropped to a size of 128×128, as shown in Fig. 2. All unnecessary parts outside the tumor area, such as the background, were eliminated.

Feature Learning Architecture
The ROIs obtained from mammogram images were divided into patches. Each patch of size 8×8 was collected and stored as a feature vector. However, mammogram images taken during breast cancer screening might vary in brightness and contrast. Hence, image normalization is applied to eliminate this issue. The learning architecture proposed by Coates and Ng (2012) uses the following procedure for the feature learning representation of an image patch:
• Normalize: for each patch, subtract the mean of the intensities and divide by the standard deviation as follows:
$$x^{(i)} = \frac{\tilde{x}^{(i)} - \mathrm{mean}(\tilde{x}^{(i)})}{\sqrt{\mathrm{var}(\tilde{x}^{(i)})}}$$
where $x^{(i)}$ are the normalized patches and "mean" and "var" are the mean and variance of the elements of the unnormalized patch $\tilde{x}^{(i)}$.
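The patch extraction and normalization step can be sketched as follows. This is a minimal illustration in Python with NumPy, not the MATLAB code used in the study. The stride of 8 pixels, the small constant `eps` that guards against division by zero, the function names and the random stand-in ROI are all assumptions made for the example.

```python
import numpy as np

def extract_patches(roi, patch_size=8, stride=8):
    """Collect equally spaced patch_size x patch_size patches as flat row vectors."""
    rows, cols = roi.shape
    patches = [
        roi[r:r + patch_size, c:c + patch_size].ravel()
        for r in range(0, rows - patch_size + 1, stride)
        for c in range(0, cols - patch_size + 1, stride)
    ]
    return np.asarray(patches, dtype=np.float64)

def normalize_patches(patches, eps=1e-8):
    """Subtract each patch's mean and divide by its standard deviation.

    eps is a hypothetical guard so that perfectly flat patches do not divide by zero.
    """
    mean = patches.mean(axis=1, keepdims=True)
    var = patches.var(axis=1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

# Random stand-in for one cropped 128 x 128 ROI (not real mammogram data).
rng = np.random.default_rng(0)
roi = rng.integers(0, 256, size=(128, 128)).astype(np.float64)

patches = extract_patches(roi)           # one row per 8 x 8 patch
normalized = normalize_patches(patches)  # zero mean, unit variance per patch
```

With a non-overlapping stride of 8, a 128×128 ROI yields 16 × 16 = 256 patches of 64 values each; a smaller stride would produce overlapping patches.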

Whitening Transform
Principal Component Analysis (PCA) is used to reduce the dimensionality of the data (Kambhatla and Leen, 1997). A closely related preprocessing step, called ZCA whitening, is required for some algorithms. When training on images, the raw input is redundant, since neighboring pixel values are highly correlated. The aim of whitening is to make the input less redundant: if $V D V^\top = \mathrm{cov}(x)$ is the eigenvalue decomposition of the covariance of the data points $x$, then the whitened points are computed as $V (D + \epsilon_{zca} I)^{-1/2} V^\top x$, where $\epsilon_{zca}$ is a small constant. In this study, to normalize the data, we set $\epsilon_{zca}$ to 0.01 for 8×8 pixel patches.
Coates and Ng (2012) make a convincing case that K-means clustering is capable of learning dictionaries that can be easily used for classification. We follow the steps of the spherical K-means algorithm, which is much faster than conventional K-means using Euclidean distances (Zhong, 2005).
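The whitening transform and the clustering step can be sketched as below. This is an illustrative Python/NumPy sketch under stated assumptions, not the implementation used in the study: the centroid update follows the damped spherical K-means update described by Coates and Ng (2012), while the random initialization, the iteration count, the function names and the synthetic input are hypothetical choices made for the example.

```python
import numpy as np

def zca_whiten(X, eps_zca=0.01):
    """Whiten the rows of X with V (D + eps_zca I)^(-1/2) V^T."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, V = np.linalg.eigh(cov)  # cov = V diag(eigvals) V^T
    W = V @ np.diag(1.0 / np.sqrt(eigvals + eps_zca)) @ V.T
    return X_centered @ W             # W is symmetric, so rows are W x

def spherical_kmeans(X, k, n_iter=10, seed=0):
    """Spherical K-means with unit-norm centroids (rows of D)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((k, X.shape[1]))       # assumed random initialization
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    idx = np.arange(X.shape[0])
    for _ in range(n_iter):
        sims = X @ D.T                             # dot product with every centroid
        labels = np.argmax(np.abs(sims), axis=1)   # closest centroid by |dot product|
        S = np.zeros_like(sims)
        S[idx, labels] = sims[idx, labels]         # one non-zero code per sample
        D = S.T @ X + D                            # damped update keeps empty clusters in place
        D /= np.linalg.norm(D, axis=1, keepdims=True)
    labels = np.argmax(np.abs(X @ D.T), axis=1)
    return D, labels

# Synthetic stand-in for normalized patches (500 samples, 16 dimensions).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 16)) * np.linspace(1.0, 4.0, 16)

X_white = zca_whiten(X)
centroids, labels = spherical_kmeans(X_white, k=10)
```

After whitening, the sample covariance is close to the identity whenever the eigenvalues of the original covariance are large relative to $\epsilon_{zca}$, which is what removes the correlation between neighboring pixels.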

Max Pooling
Max pooling transforms the learned features into new, reusable features that keep the significant information while removing redundant information. Typical properties of pooling are robustness to clutter and compactness of representation (Boureau et al., 2010). In this study, the following pooling steps were applied.

Notations:
• If the unpooled data are a P × K matrix of 1-of-K codes taken at P locations, we extract a single P-dimensional column v of 0s and 1s, indicating the absence or presence of the feature at each location
• The vector v is reduced by a pooling operation to a single scalar f(v)
• Average pooling: $f_a(v) = \frac{1}{P} \sum_{i=1}^{P} v_i$; max pooling: $f_m(v) = \max_i v_i$
• Given two classes $C_1$ and $C_2$, we examine the separation of the conditional distributions $p(f_m \mid C_1)$ and $p(f_m \mid C_2)$ (and likewise for $f_a$)
In this experimental design, we take the max of the cluster memberships over each 8×8 region. Max pooling reduces the dependency of the feature vectors on their exact placement in an image (each element of each 8×8 block is treated about the same) and it also retains much of the information in each of the feature vectors, especially when the feature vectors are expected to be sparse. Moreover, in the experimental design, we used different numbers of clusters (100 and 150) to assess the performance of the classifiers, as shown in section 3.
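The pooling operations in the notation above can be illustrated with a small example. This is a minimal Python/NumPy sketch; the cluster assignments are made-up values and the function names are hypothetical.

```python
import numpy as np

def one_of_k_codes(labels, k):
    """Build the P x K matrix of 1-of-K codes from cluster indices at P locations."""
    codes = np.zeros((labels.size, k))
    codes[np.arange(labels.size), labels] = 1.0
    return codes

def max_pool(codes):
    """f_m(v) = max_i v_i, applied to every column (feature) of the code matrix."""
    return codes.max(axis=0)

def average_pool(codes):
    """f_a(v) = (1/P) * sum_i v_i, applied to every column of the code matrix."""
    return codes.mean(axis=0)

# Made-up cluster assignments at P = 5 locations with K = 4 clusters.
labels = np.array([0, 2, 2, 1, 2])
codes = one_of_k_codes(labels, k=4)

pooled_max = max_pool(codes)      # 1 where a cluster occurs at least once
pooled_avg = average_pool(codes)  # fraction of locations assigned to each cluster
```

Max pooling only records whether a feature is present anywhere in the pooled region, which is why it is insensitive to the exact position of the feature.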

Bag-of-Features
Bag-of-features is a vector of occurrence counts of a vocabulary of local image features. The basic steps of bag-of-words when applied to images are as follows (Cheng et al., 2010):
• Building a codebook for local patches
• Extracting the local patches from the ROIs of mammogram images
• Representing an ROI using the statistics of its quantized local patches
• Inference based on the statistics collected in step 3
The obtained set of mammogram ROIs is divided into training and testing sets. Then, a visual vocabulary is built by clustering the patches from the training set, and the patches are represented as per-image distributions. Moreover, each ROI is represented as a histogram of visual words drawn from the vocabulary. In the experimental setup, k-means clustering with 100 and 150 clusters was applied to evaluate the classifier performance.
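The quantization and histogram steps listed above can be sketched as follows. This Python/NumPy example assumes that a codebook has already been learned by k-means; the tiny two-dimensional codebook and patches are made-up values, and the function names are hypothetical.

```python
import numpy as np

def assign_to_codebook(patches, codebook):
    """Quantize each patch to the index of its nearest codeword (Euclidean distance)."""
    sq_dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def bag_of_features(patches, codebook):
    """Vector of occurrence counts of each visual word among the patches of one ROI."""
    words = assign_to_codebook(patches, codebook)
    return np.bincount(words, minlength=codebook.shape[0])

# Made-up codebook with 3 visual words and 4 patches from one ROI.
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
patches = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [10.0, 11.0]])

histogram = bag_of_features(patches, codebook)
```

Unlike max pooling, the histogram keeps how often each visual word occurs, so its entries grow with the number of patches in the ROI.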

Classification
The learned features were stored as a feature vector. The next step is to separate the samples into the normal and abnormal classes. Two well-known classifiers are adopted in this study, as follows:

K-Nearest Neighbor
The KNN classifier usually applies either the Euclidean distance or the cosine similarity between the training tuples and the test tuples. In this study, the Euclidean distance is applied in implementing the KNN(k = 1) model for feature classification (Cunningham and Delany, 2007).
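A nearest-neighbor classifier with k = 1 and Euclidean distance can be sketched in a few lines. This is an illustrative Python/NumPy example with made-up two-dimensional feature vectors, where label 0 stands for normal and 1 for abnormal; the function name is hypothetical.

```python
import numpy as np

def predict_1nn(train_X, train_y, test_X):
    """Label each test vector with the class of its nearest training vector."""
    sq_dists = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    return train_y[sq_dists.argmin(axis=1)]

# Made-up training features: 0 = normal, 1 = abnormal.
train_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([0, 0, 1, 1])
test_X = np.array([[0.2, 0.1], [5.4, 4.8]])

predictions = predict_1nn(train_X, train_y, test_X)
```

Squared distances are compared directly because taking the square root does not change which neighbor is nearest.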

Support Vector Machine
A Support Vector Machine (SVM) is widely used in mammogram image classification because of its high accuracy. This classifier separates the classes by means of a hyperplane. To separate the two classes (normal vs. abnormal) well, the hyperplane should have the largest distance to the nearest training data points of the two classes (Chang and Lin, 2011).
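The margin criterion can be made concrete by computing, for a given hyperplane, its distance to the nearest training point. The Python/NumPy sketch below does not train an SVM; it only compares two hand-picked candidate hyperplanes on made-up data, with labels encoded as -1 (normal) and +1 (abnormal) as an assumption of the example.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Smallest signed distance from the points to the hyperplane w.x + b = 0.

    y holds labels in {-1, +1}; a positive result means every point is on its correct side.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Made-up, linearly separable data.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

margin_centered = geometric_margin(np.array([1.0, 0.0]), -1.5, X, y)  # hyperplane x1 = 1.5
margin_shifted = geometric_margin(np.array([1.0, 0.0]), -0.5, X, y)   # hyperplane x1 = 0.5
```

Both hyperplanes separate the data, but the one halfway between the classes has the larger margin and is the one a maximum-margin classifier would prefer.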

Performance Measures
The performance of the classifier is evaluated using a ten-fold cross-validation method (Acharya et al., 2015). The dataset is divided into ten equal parts. Each part contains the features from a similar proportion of images from the normal and abnormal classes. Nine parts are used to train the classifiers. The remaining part is used as the testing set. This process is repeated ten times, using a different part for testing in each case. In each fold, we apply several classification measures in order to obtain a more reliable comparison. Normal and abnormal mammographic images correspond to negative and positive samples, respectively. In this study, True Positive (TP) and True Negative (TN) represent the number of abnormal and normal tests, respectively, that are properly classified. Similarly, False Positive (FP) and False Negative (FN) represent the number of normal and abnormal tests, respectively, that are incorrectly classified. The performance measures are computed as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP}$$
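The fold construction and the three measures can be sketched in plain Python as below. This is an illustration under assumptions: folds are filled in a fixed round-robin order without shuffling, labels are encoded as 0 (normal, negative) and 1 (abnormal, positive), and the predictions are made-up values; the function names are hypothetical.

```python
def stratified_folds(labels, n_folds=10):
    """Split sample indices into n_folds parts with similar class proportions."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)  # round-robin within each class
    return folds

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with 1 = abnormal (positive) and 0 = normal (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def performance_measures(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from the confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# 200 normal and 200 abnormal samples, as in the dataset described above.
labels = [0] * 200 + [1] * 200
folds = stratified_folds(labels)

# Made-up predictions for one small test fold.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
measures = performance_measures(*confusion_counts(y_true, y_pred))
```

In each round, one fold serves as the testing set and the other nine are merged for training; a practical run would typically shuffle the indices within each class before assigning folds.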

Receiver Operating Characteristics
The Receiver Operating Characteristics (ROC) curve provides a visual representation of the tradeoff between true positives and false positives. It plots the percentage of correctly classified positive features against the percentage of incorrectly classified negative features (Nascimento et al., 2013). As shown in Figs. 4 to 7, the point (0,0) on the curve represents a classifier that by default classifies all features as negative, whereas the point (1,1) represents a classifier that classifies all features as positive.
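An ROC curve and the area under it (AUC) can be computed from classifier scores by sweeping a decision threshold. The Python/NumPy sketch below uses made-up scores and labels (1 = abnormal, 0 = normal) and assumes that all scores are distinct; the function names are hypothetical.

```python
import numpy as np

def roc_points(scores, labels):
    """False and true positive rates as the threshold is lowered past each score.

    Assumes distinct scores; tied scores would need to be grouped into one step.
    """
    order = np.argsort(-scores, kind="stable")  # highest score first
    sorted_labels = labels[order]
    tps = np.concatenate(([0], np.cumsum(sorted_labels == 1)))
    fps = np.concatenate(([0], np.cumsum(sorted_labels == 0)))
    return fps / fps[-1], tps / tps[-1]

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up classifier scores (higher means more likely abnormal).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4])
labels = np.array([1, 1, 0, 1, 0, 0])

fpr, tpr = roc_points(scores, labels)
auc_value = auc(fpr, tpr)
```

The first point of the curve corresponds to a threshold above every score, where everything is classified as negative, and the last point to a threshold below every score, where everything is classified as positive. A curve that rises steeply along the y-axis gives an AUC close to 1.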
The experiments were carried out using Windows 8 and MATLAB R2013a with the Image Processing and Statistics toolboxes, on a laptop with an Intel Core i5-4200U CPU @ 1.6 GHz. Tables 1 to 4 present the classification results using the KNN and SVM classifiers for k-means-max pooling and Bag-of-features. We obtained the highest performance for k-means-max pooling as compared to the Bag-of-features method, with an average accuracy, sensitivity and specificity of 98.19, 97.09 and 99.35%, respectively (at 150 clusters). The results also show that the KNN classifier performs better than SVM, with an accuracy of 98.19%. To be the best classifier for a given dataset, a classifier should not only have high accuracy, but should also provide good sensitivity and specificity. Such a balance is required to make reliable decisions between the two classes (normal vs. abnormal). Hence, the K-NN classifier was selected as the optimum classifier for this dataset, mainly because it provided the best accuracy, sensitivity and specificity. Figure 3 visualizes the resulting 50 cluster centers from k-means-max pooling. We sorted the centers by how often data points are assigned to each cluster, so the top left has the most elements and the bottom right has the least. Figures 4 to 7 show the ROC curves for k-means-max pooling compared with Bag-of-features using KNN and SVM. It can be observed from these figures that k-means-max pooling performs best with the KNN classifier at 150 clusters. Furthermore, the ROC curve of k-means-max pooling with the KNN classifier lies closer to the y-axis than that of Bag-of-features, indicating a higher AUC value. Therefore, k-means-max pooling coupled with KNN has the highest discrimination capacity to separate the two classes (normal vs. abnormal).

Discussion
Selecting significant features from mammogram images is important for mammogram image classification. Significant features increase the accuracy of CAD systems for breast cancer. The methods proposed to extract important features from mammogram images vary from one another and are implemented with either supervised or unsupervised learning. In this study, we adopt an unsupervised feature learning method to help detect breast tumors. We compare the results obtained using our technique with those of other techniques in the literature that also aim to diagnose breast tumors in mammograms.
A summary of the studies reported by different authors is shown in Table 5. Rocha et al. (2014) classified 200 DDSM mammogram images into normal and malignant classes using Gleason and Menhinick diversity indexes and the SVM classifier. They reported a sensitivity and specificity of 90 and 83.33%, respectively. Lim and classified 343 DDSM mammogram images into benign and malignant using first-order gradient distribution, gray-level co-occurrence matrices and Generalized Dynamic Fuzzy Neural Networks (GDFNN). They achieved true-positive and false-positive fractions of 95.0 and 52.8%, respectively. The advantages of our proposed method are as follows:
• We obtained the highest sensitivity of 97.09% and a specificity of 99.35% in classifying the normal and abnormal classes compared to Bag-of-features
• The proposed system is able to detect the two classes (normal vs. abnormal) with 98.19% accuracy; this will reduce the workload of clinicians
• Our proposed method is more reliable, as we used ten-fold stratified cross-validation and 400 mammogram images in this study

Future Works
In the future, this study will focus outside the boundaries of the skin, which is the area of cancer, by performing possible multiple selection between benign and malignant

Conclusion
An accurate and fast diagnosis of breast cancer can help clinicians in their work. Hence, in this study, we proposed a CAD system based on k-means-max pooling for early detection of breast cancer. We reported an accuracy of 98.19%, sensitivity of 97.09% and specificity of 99.35% using the K-NN classifier for k-means-max pooling. We have also shown that the features of k-means-max pooling performed better than Bag-of-features. Our proposed system is able to diagnose normal and abnormal cases with high sensitivity (97.09%) and reduce clinician workloads.