Combining SURF and MSER along with Color Features for Image Retrieval System Based on Bag of Visual Words

Content-Based Image Retrieval (CBIR) has received extensive attention from researchers owing to the rapid growth and spread of image databases. Despite the considerable research effort devoted to CBIR, fully satisfactory results have not yet been attained. In this article, we present a new CBIR technique that extracts Speeded Up Robust Features (SURF) and Maximally Stable Extremal Regions (MSER) feature descriptors together with two color features: color correlograms and the Improved Color Coherence Vector (ICCV). These features are combined into a multidimensional feature vector. The Bag-of-Visual-Words (BoVW) technique is used to quantize the extracted feature vector, and a multiclass Support Vector Machine (SVM) then classifies the query images. The performance of the presented retrieval framework is analyzed by comparing it with three alternative approaches. The first extracts SURF descriptors only; the second extracts SURF descriptors, color correlograms and ICCV; the third extracts MSER, color correlograms and ICCV. All implemented schemes are tested on two benchmark datasets, Corel-1000 and COIL-100. The empirical results show that the suggested approach has superior classification and retrieval performance relative to the other approaches. The proposed method achieves average precisions of 88% and 93% on the Corel-1000 and COIL-100 datasets, respectively. Moreover, the proposed system shows a substantial improvement in retrieval precision when compared with other existing systems.


Introduction
An image retrieval framework is a computerized scheme designed to manage (browse, search and retrieve) digital images within large databases. The size of digital image collections is increasing rapidly owing to the growth of the internet and the availability of image-capturing devices such as digital cameras and image scanners. Thus, there is a pressing need for efficient and effective tools that let users search, browse and retrieve images in areas including medicine, remote sensing, publishing, architecture, crime prevention and so forth. To achieve this goal, research efforts have been directed at developing various general-purpose image retrieval schemes. Nowadays, applications in practically every area of human life use images to deliver efficient services. A large collection of such images is called an image database: an organized structure in which a large number of digital images are stored and queried.
Over the last few years, many studies have been conducted on image retrieval. These investigations can be grouped into three broad domains according to the methodology used: the text-based approach (conventional annotation), the context-based approach and the content-based approach. In the text-based methodology, metadata such as captions, keywords or text are added to the images so that retrieval can be performed over the annotation words. Images are annotated manually and subsequently retrieved in the same fashion as text documents using a database management system. However, traditional annotation has three disadvantages: manual annotation requires a significant level of human effort; the annotation is inaccurate because human perception is subjective; and it suffers from the polysemy problem, in which the same word can refer to more than one object (Markkula and Sormunen, 2000; Zhang et al., 2012). These problems drew attention to image retrieval approaches based on content.
Content-Based Image Retrieval (CBIR) approaches query images by their actual content instead of annotated metadata such as keywords, tags or text descriptions. Early CBIR approaches indexed and retrieved images automatically using low-level visual features such as texture, shape, spatial information or color (Zhang and Lu, 2004; Yasmin et al., 2013; Danish et al., 2013). Color characteristics are the most intuitive and easily perceived low-level features, and they play a vital role in human perception. Besides, color features are considered stable, robust and invariant to scaling, translation and rotation compared with other visual features (Kodituwakku and Selvarajah, 2010; Afifi and Ashour, 2012; Elnemr et al., 2016). Unfortunately, when searching for images that contain the same object or scene from different viewpoints, low-level features have a main drawback: much detailed information about the images is lost. Recently, interest point detectors and descriptors (Krig, 2014) have been used in several CBIR schemes to overcome this drawback.
A wide variety of feature detectors and descriptors has been presented in the literature, including the well-known Scale Invariant Feature Transform (SIFT) (Lowe, 2004), Speeded Up Robust Features (SURF) (Bay et al., 2008) and the affine invariant region detector Maximally Stable Extremal Regions (MSER) (Matas et al., 2002). While SIFT has proven very effective in computer vision applications because of its robustness to common image transformations (Panchal et al., 2013; Bauer et al., 2007), its computational cost is significantly high. The SURF algorithm is therefore preferred, since it performs more efficiently with a smaller but adequate number of detected points (Panchal et al., 2013; Bauer et al., 2007). Furthermore, MSER is usually discussed in the literature as an interest region detector. Thus, the MSER detector and the SURF descriptor are combined here to obtain better performance.
This work proposes an image retrieval system that employs a combination of the SURF and MSER methods. The SURF detector is able to detect features such as corners and blobs; however, it cannot detect keypoints around regions. It is robust to noise and invariant to rotation and scale, but it is not affine invariant (Shaikh and Patankar, 2015). MSER, on the other hand, detects regions that are characterized by an extremal property of the intensity function inside the regions and on their outer boundaries. MSER can distinguish features around the region of an object, but it is not able to detect corner and blob features. MSER is invariant to rotation and scale as well as to affine transformation (Shaikh and Patankar, 2015). Therefore, the two detectors may be complementary: used together, each can compensate for the limitations of the other and yield better performance.
Moreover, to improve the performance of the proposed system, we added color correlograms (Huang et al., 1997) and the Improved Color Coherence Vector (ICCV) (Pass et al., 1996; Chen et al., 2007), since SURF and MSER work only on grey scale images.
Usually, hundreds of interest points and regions are detected in each image, and the resulting feature vector is long. This increases the computational complexity of image matching. Hence, we implemented a popular technique, Bag-of-Visual-Words (BoVW), to obtain a compact representation of the image features. The BoVW approach is adapted from document retrieval to image retrieval: instead of the actual words used in document retrieval, it employs image features as visual words to describe an image (Liu, 2013; Bosch et al., 2007). Finally, a multi-class Support Vector Machine (SVM) classifier is trained to discriminate between the image categories.
This paper is structured as follows. Section 2 briefly reviews some CBIR systems. The proposed approach is described in detail in Section 3. Section 4 discusses the experimental results and implementation. Finally, the conclusion and future work are given in Section 5.

Related Work
Giveki et al. (2015b) offered two methods for implementing SIFT features in CBIR. These methods apply k-means clustering to the extracted SIFT feature matrix and aim to reduce the dimension of that matrix.
Ashraf et al. (2015) implemented an image representation technique based on the Bandelet transform. The Bandelet transform recovers the geometric boundaries of the main objects detected in an image. A Gabor filter is then used to evaluate the texture content around the detected boundaries, and back propagation neural networks are used to estimate its parameters to ensure maximum accuracy. This texture information is combined with color information in the YCbCr domain to improve the feature vector. Finally, Artificial Neural Networks are used to derive the image semantics.
Velmurugan and Baboo (2011) combined SURF with the color moments feature to create a CBIR system. For each SURF key point, the first and second color moments are calculated. Retrieval is achieved using an indexing strategy and a matching policy: a KD-tree accompanied by the Best Bin First search procedure is used to index and match the SURF and color features.
On the other hand, Bahri and Zouaki (2013) proposed an image retrieval method that also joins SURF and color moments. However, their technique is based on constructing a bag of visual features model, in which the visual words are constructed from SURF and color moments. Chandrika (2014) developed a method of image retrieval using SURF and BoVW. The author builds a visual dictionary for each class or group in the test dataset, rather than the single overall dictionary used by the standard BoVW. This makes the technique more discriminative, with higher accuracy and precision, yet it is highly supervised because the number of groups must be known a priori, before classification.
An experimental study of how the wavelet transform, used as an image feature descriptor over various color models, affects the performance of a CBIR system is presented in (Giveki et al., 2015a). The results indicate that the Lab color model gives the best results. Consequently, the authors constructed a content-based retrieval paradigm that applies the wavelet transform to the Lab color model combined with color moments.
Sharma (2013) suggested a CBIR system based on extracting the histogram from the image, the color moments from the HSV (Hue, Saturation and Value) space and the SURF interest points.
Shrivastava and Tyagi (2014) presented an image retrieval technique built on matching selected regions using region codes. These codes are based on the location of the target region with reference to the focal region. For each region, the dominant color and the local binary pattern features are obtained. The feature vectors extracted from regions whose codes are similar to those of the query image regions are used for comparison. Karakasis et al. (2015) proposed an image retrieval structure that uses image affine moment invariants as descriptors of salient image patches. The BoVW concept is used for indexing and retrieval. The authors considered three setups in their experimental study. In the first, color affine moment invariants are computed. In the second, the invariant moments are computed over all chromaticities of the original image, whereas in the third a normalization method is applied. Jain et al. (2015) introduced a CBIR system based on five elements: Columnar Mean, Diagonal Mean, Histogram Analysis, Color Image Analysis via RGB Components and, finally, Euclidean Distance for retrieving similar images.
Bhargavi et al. (2013) applied Gabor wavelets and the Color Coherence Vector (CCV) to extract texture and color features. The Class Attribute Interdependence Maximization (CAIM) algorithm is used to discretize these features, converting the continuous features into discrete ones. Finally, the Particle Swarm Optimization (PSO) algorithm is used to select the most significant features. Jasmine et al. (2015) presented an image retrieval technique that integrates a color histogram in the HSV space with a multi-resolution Local Maximum Edge Binary Patterns (LMEBP) joint histogram.

Materials and Methods
In this study, we present an image retrieval strategy based on extracting SURF and MSER key points, color correlograms and ICCV. The proposed system comprises three stages: feature extraction, BoVW creation and, finally, image classification.

Feature Extraction
Feature computation consists of detecting SURF interest points and MSER interest regions and then calculating the corresponding feature descriptors. Furthermore, since SURF and MSER work only on grey scale images, color correlograms and ICCV are used to extract color features.
SURF was first introduced in (Bay et al., 2008) as an interest point detector and descriptor that is scale and rotation invariant and very fast to compute. SURF generates a set of interest points for each image, along with a 64-dimensional descriptor for each interest point.
On the other hand, Matas et al. (2002) presented MSER as an affine invariant feature detector. MSER detects image regions that are covariant with image transformations, and these are then used as interest regions for computing the descriptors. The descriptors are computed using SURF. Thus, each image has a set of interest regions, and each region has a set of key points, each of which is represented by a 64-dimensional descriptor.
To extract the color features, color correlograms (Huang et al., 1997) and ICCV (Chen et al., 2007) are implemented. The color correlogram represents the correlation of colors in an image as a function of their spatial distance. Unlike a color histogram, it captures not only the distribution of pixel colors but also their spatial arrangement in the image. The size of a color correlogram depends on the number of quantized colors used for feature extraction. In this study, we use the RGB color model with 64 quantized colors and two distances. Hence, the size of the correlogram feature vector is 2×64.
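As a rough illustration of the 2×64 layout described above, the following sketch computes an auto-correlogram in plain Python. It is not the authors' implementation: the uniform 4-levels-per-channel quantization, the distances (1, 3), the chessboard distance and the in-bounds normalization are all assumptions made for the example, since the text specifies only "64 quantized colors with two distances".

```python
def quantize_rgb(pixel, levels=4):
    """Map an 8-bit (r, g, b) triple to one of levels**3 color bins."""
    r, g, b = (v * levels // 256 for v in pixel)
    return (r * levels + g) * levels + b


def ring_offsets(d):
    """All (dy, dx) offsets at chessboard (L-infinity) distance exactly d."""
    return [(dy, dx)
            for dy in range(-d, d + 1)
            for dx in range(-d, d + 1)
            if max(abs(dy), abs(dx)) == d]


def auto_correlogram(image, distances=(1, 3), levels=4):
    """Return a len(distances) x levels**3 auto-correlogram.

    image is a list of rows, each row a list of (r, g, b) tuples.
    Entry [k][c] estimates the probability that a pixel at distance
    distances[k] from a pixel of color c also has color c.
    The distances (1, 3) are illustrative, not taken from the paper.
    """
    height, width = len(image), len(image[0])
    bins = levels ** 3
    q = [[quantize_rgb(p, levels) for p in row] for row in image]
    result = []
    for d in distances:
        offsets = ring_offsets(d)
        same = [0] * bins   # neighbor pairs in which both pixels have color c
        total = [0] * bins  # all in-bounds neighbor pairs starting at color c
        for y in range(height):
            for x in range(width):
                c = q[y][x]
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        total[c] += 1
                        same[c] += q[ny][nx] == c
        result.append([s / t if t else 0.0 for s, t in zip(same, total)])
    return result
```

With the defaults, the result has two rows of 64 values each, matching the 2×64 feature size stated in the text.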
ICCV divides the color histogram into two components: A coherent component that contains pixels that are spatially connected and a non-coherent component that comprises pixels that are detached.
Furthermore, ICCV contains more spatial information than the traditional color coherence vector, which improves its performance without much additional computation (Chen et al., 2007). In this work, the ICCV feature vector is formed of 64 coherence pairs; each pair gives the number of coherent and non-coherent pixels of a specific color in the RGB space. Thus, the size of the ICCV is 2×64.
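The coherent/non-coherent split can be sketched as below. This covers only the basic coherence pairs of a color coherence vector, not the additional spatial information that distinguishes ICCV. The 8-connectivity and the region-size threshold `tau` are assumptions for illustration, as the text does not state them; the input is assumed to be already quantized to 64 color indices.

```python
from collections import deque


def coherence_pairs(labels, bins=64, tau=4):
    """Return a (coherent, non_coherent) pixel-count pair for each color bin.

    labels is a 2-D list of quantized color indices in range(bins).
    A pixel counts as coherent when its 8-connected same-color region
    has at least tau pixels (tau is an illustrative threshold).
    """
    height, width = len(labels), len(labels[0])
    seen = [[False] * width for _ in range(height)]
    pairs = [[0, 0] for _ in range(bins)]
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0]:
                continue
            color = labels[y0][x0]
            seen[y0][x0] = True
            queue, size = deque([(y0, x0)]), 0
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and labels[ny][nx] == color):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Add the whole region to the coherent or non-coherent count.
            pairs[color][0 if size >= tau else 1] += size
    return [tuple(p) for p in pairs]
```

The output is a list of 64 pairs, which flattens to the 2×64 vector mentioned above.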
The obtained feature vectors from the images in each training set of each class in the database are combined and portrayed as a multidimensional feature vector.

Bag-of-Visual-Words (BoVW)
BoVW is inspired directly by the bag-of-words methodology, a popular and widely applied technique for text retrieval. In the bag-of-words methodology, a document is characterized by a set of distinctive keywords.
A BoVW is a counting vector of the occurrence frequency of a vocabulary of local visual features (Liu, 2013;Bosch et al., 2007). To distil the BoVW characteristic from images, the extracted local descriptors are quantized into visual words to form the visual dictionary. Hence, each image is portrayed as a vector of words like one document. Then, the occurrences of each individual word in the dictionary of each image are obtained in order to build the BoVW (histogram of words).
The K-means clustering technique is utilized to cluster all the extracted features obtained from all training images to find a certain number of centroids. These centroids represent the set of generated visual words and their number depends on the number of clusters (i.e., K).
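The vocabulary construction and histogram steps described in this section can be sketched in plain Python as follows. This is a minimal Lloyd-style K-means written for illustration, not the authors' implementation; the function names, the fixed iteration count and the random initialization from the data are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], point))


def build_vocabulary(descriptors, k, iterations=20, seed=0):
    """Plain Lloyd's K-means; returns k centroids (the visual words)."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(centroids, d)].append(d)
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def bovw_histogram(centroids, descriptors):
    """Count how many descriptors of one image fall on each visual word."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest(centroids, d)] += 1
    return hist
```

In use, `build_vocabulary` would be run once over the descriptors pooled from all training images, and `bovw_histogram` once per image, giving a length-K vector for each image.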

Classification
After the BoVW feature has been obtained from the images, it is passed to the classifier stage for training or testing. In this study, we used a nonlinear multi-class SVM with the Radial Basis Function (RBF) kernel for the classification stage. SVMs are a group of supervised learning techniques that may be used for classification and regression. In the classification stage, the data are separated into training and testing sets, and the SVM generates a model from the training data. This model represents the training samples as points in space so that the samples of different groups are separated by a clear gap that is as wide as possible. Afterward, incoming samples are mapped into the same space, and the group of each sample is predicted according to the side of the gap on which it falls. The SVM methodology has some advantages over other classifiers. It is robust, fast, accurate and efficient in dealing with large datasets. Besides, it can be used to solve multi-class classification problems with a large number of classes, and it requires little memory to store its model.
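For reference, the RBF kernel named above is commonly written as exp(-gamma * ||x - y||^2). The sketch below evaluates it on BoVW histograms; the value of `gamma` is a placeholder, since the text does not report the kernel parameters, and the SVM training itself is not shown.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2); gamma is illustrative."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(histograms, gamma=0.5):
    """Pairwise kernel values between BoVW histograms."""
    return [[rbf_kernel(h1, h2, gamma) for h2 in histograms]
            for h1 in histograms]
```

Identical histograms give a kernel value of 1, and the value decays towards 0 as the histograms move apart, which is what lets the SVM draw nonlinear boundaries between image categories.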

Datasets
The proposed image retrieval system is tested and evaluated on two image datasets. The first dataset is a subset of the Corel-1000 images (Wang et al., 2001). It consists of 10 classes of 100 images each. The classes are highly diverse: dinosaurs, cyber, horses, bonsai, textures, fitness, dishes, Easter egg, antiques and elephants. For each class, 70 images are used to train the system (building the visual dictionary) and 30 images are used to test it (i.e., 700 and 300 images for training and testing, respectively).
The second dataset is the COIL-100 object database (Nene et al., 1996). COIL-100 is a widely used benchmark image database that includes 72 views of each of 100 objects, obtained by rotating the object around the vertical axis. To evaluate the system, 50 images of each object are selected for training and the remaining 22 images are selected for testing. Thus, 5000 images are reserved for training and 2200 images for testing.
The training and testing sets are randomly selected from both datasets. Samples of the investigated databases are displayed in Fig. 1 and 2.
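A per-class random split of this kind can be sketched as follows. The image identifiers and the seed are hypothetical; only the counts (100 objects, 72 views, 50 training images per object) come from the text.

```python
import random


def split_per_class(images_by_class, n_train, seed=0):
    """Randomly split each class into training and testing lists."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, images in images_by_class.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:]
    return train, test


# COIL-100 layout from the text: 100 objects, 72 views each.
# The identifier strings are placeholders, not real file names.
coil = {obj: [f"obj{obj}_view{v}" for v in range(72)] for obj in range(100)}
train, test = split_per_class(coil, n_train=50)
n_train = sum(len(v) for v in train.values())  # 100 * 50
n_test = sum(len(v) for v in test.values())    # 100 * 22
```

The same helper with `n_train=70` over 10 classes of 100 images would reproduce the 700/300 split described for Corel-1000.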

Results Evaluation
The feature database is represented by a multidimensional vector whose size is determined by the following quantities: N_S, the number of training/testing samples; N_SURF, the number of SURF descriptors; N_MSER, the number of MSER descriptors; N_corr, the number of color correlogram descriptors (2 descriptors); and N_ICCV, the number of ICCV descriptors (2 descriptors).
The training feature database is clustered into K clusters using the K-means algorithm. K-means clustering is the most widespread technique for building the visual dictionary owing to its simplicity and speed of convergence. The obtained cluster centers are the visual words, and the set of visual words forms the vocabulary. Each descriptor extracted from a query image is assigned to the closest cluster centroid. Then, the occurrences of each visual word are counted to create the BoVW histogram. Therefore, each image can be regarded as a long and sparse vector of words of length K. Consequently, we can imitate text-retrieval systems, applying fast search on this vector space. A multi-class SVM-RBF is trained on the training BoVW histograms, and the test BoVW histograms are then fed to the SVM to be classified.
To analyze the performance of our proposed retrieval method, which combines SURF, MSER, color correlogram and ICCV descriptors with BoVW, we compared it with three other approaches. The first approach uses the SURF descriptors only, the second uses the SURF, color correlogram and ICCV descriptors, and the third uses the MSER, color correlogram and ICCV descriptors. Furthermore, to evaluate the proposed system, we investigate the effect of the vocabulary size K on the performance of the retrieval system. We set K = 100, 200, 300, 400 and 500 in turn in the comparison experiments.
The precision, recall and accuracy ratios are used to assess the efficacy of the proposed technique. Figure 3 shows the recall, precision and accuracy of the experiments conducted on the Corel-1000 dataset. The results indicate that K = 400 gives the best recall, precision and accuracy for all studied approaches. It can be clearly seen that the color features significantly enhance the performance of the retrieval system. Furthermore, Fig. 3 demonstrates that our proposed scheme performs better than the other considered schemes in terms of accuracy (0.97), precision (0.88) and recall (0.84).
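The three ratios can be illustrated from a confusion matrix as sketched below. The one-vs-rest definitions used here (precision = TP/(TP+FP), recall = TP/(TP+FN), accuracy = (TP+TN)/total) and the macro averaging are the usual ones and are an assumption; the exact form used in the experiments is not stated.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, accuracy) for each class, one-vs-rest.

    confusion[i][j] is the number of class-i samples predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        accuracy = (tp + tn) / total
        metrics.append((precision, recall, accuracy))
    return metrics


def macro_average(metrics):
    """Average each metric over the classes."""
    return tuple(sum(col) / len(metrics) for col in zip(*metrics))
```

Under these definitions, accuracy counts the true negatives of every other class, so with many classes it can be high even when recall is modest, which is consistent with the pattern of values reported above.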
Moreover, Fig. 4 shows the experiments conducted on the COIL-100 dataset. The results show that the optimum vocabulary size K changes with the extracted descriptors. For the SURF descriptors, K = 500 gives the best accuracy (0.98), precision (0.81) and recall (0.5), while K = 400 gives the best accuracy (0.99), precision (0.9) and recall (0.66) for the SURF, color correlogram and ICCV descriptors. With MSER, color correlograms and ICCV, the best accuracy (0.99) and recall (0.67) are achieved at K = 300, but the precision of 0.85 is slightly lower than that at K = 400. For the SURF, MSER, color correlogram and ICCV descriptors, recall is almost saturated at 0.7 for K = 100, 200 and 300 and changes insignificantly to 0.69 at K = 400. Precision increases to 0.93 at K = 400 and then slightly to 0.94 at K = 500, while the accuracy is almost 0.99 at all values of K. Thus, it can be concluded that our proposed approach outperforms the other considered methods and performs at or near its best at K = 400. It is also worth noting that the color features significantly enhance the overall system performance.
Furthermore, the investigated approaches are evaluated using the Precision-Recall Curve (PRC). Figures 5 and 6 display the retrieval performance for K = 400, in terms of precision and recall, on the Corel-1000 and COIL-100 datasets, respectively. The comparisons indicate that our proposed technique achieves superior performance compared with the other assessed schemes.

Comparison of Computation Time
For each implemented retrieval method, the total computation time to extract the multidimensional feature vectors for the images in the COIL-100 (2200 images) and Corel-1000 (300 images) datasets at K = 400 is recorded in Table 1. The average computation time taken to construct the feature vector for each image is calculated from this total time and given in Table 2. As shown in Table 1, extracting SURF descriptors alone takes considerably less time than the other methods, while the proposed method takes slightly longer than the other techniques. Furthermore, Table 2 shows that the average time to construct the multidimensional feature vector of an image from the COIL-100 dataset is significantly less than that for the Corel-1000 dataset. This is because Corel-1000 images have complex backgrounds and plenty of detail compared with COIL-100 images.
On the other hand, Tables 3 and 4 compare the total retrieval time and the average retrieval time of the studied methods at K = 400 for the COIL-100 and Corel-1000 datasets. Clearly, all methods take almost the same retrieval time. Although the image retrieval approach based on SURF descriptors alone consumes the least processing time, it has the lowest precision and recall. The proposed retrieval approach achieves the best precision and recall in a reasonable time: 0.76 s and 3.51 s per image for the COIL-100 and Corel-1000 datasets, respectively.

Comparison with Existing Systems
To assess the performance of the proposed system, we compared it with some existing CBIR systems. The systems selected for comparison were evaluated on subsets of the COIL-100 dataset. The result reported in this study is compared against those of Velmurugan and Baboo (2011), Bahri and Zouaki (2013), Kavitha and Sudhamani (2013) and Kavya and Shashirekha (2015). Table 5 displays the average precision of retrieved images for these existing systems and the proposed work. The results show that, although our proposed system used the whole dataset, it outperforms the other existing systems.

Table 5. Average precision of existing systems and the proposed system on COIL-100
Method | Dataset remarks | Average precision (%)
Kavya and Shashirekha (2015) | 10 random objects were considered | 86
Kavitha and Sudhamani (2013) | 10 random objects only were considered | 83
Velmurugan and Baboo (2011) | 15 random objects only were considered | 88
Bahri and Zouaki (2013) | 15 random objects only were considered | 78
Proposed system | Whole dataset | 93

Conclusion
The prime contribution of this work is an efficient and effective CBIR system that is intended to be feasible for large datasets. We have proposed a new CBIR system based on extracting SURF and MSER feature descriptors combined with two color features, color correlograms and ICCV. These features are used to build a BoVW model, which in turn is fed to a multiclass SVM that performs the classification step. The effectiveness of the presented retrieval procedure has been investigated by comparing its performance with three other implemented approaches. In the first approach, SURF is used on its own for the retrieval process, while in the second and third approaches, color correlograms and ICCV are combined with SURF and MSER descriptors, respectively.
Furthermore, a set of experiments has been performed to choose the vocabulary size that achieves the best retrieval performance.
All considered retrieval procedures are examined on the Corel-1000 and COIL-100 datasets. The results obtained from these experiments indicate that our proposed methodology is effective and outperforms the other studied methods in a reasonable time. Moreover, it retrieves images with higher precision than the existing CBIR systems considered.
A further extension of this work would be to improve the system performance by using high-level features and by replacing K-means, which is computationally expensive, with a more powerful clustering algorithm. Furthermore, high-performance computing techniques can be applied to improve the computational performance and thus reduce the processing time.

Author's Contributions
The author prepared the study, elaborated the methodology, performed the analysis and wrote the manuscript.

Ethics
This article is original and contains unpublished material. The corresponding author confirms that no ethical issues are involved.