Feature Fusion in Improving Object Class Recognition

Problem statement: Feature extraction in object class recognition research has mainly focused on local features as discriminative features, because local features are invariant to viewpoint, translation and rotation. However, local features are limited in capturing high-level representations of objects, and problems arise when an object is too small or lacks strong local features. Approach: This study proposes combining different features with local features to improve the performance of object class recognition. The objective is to address the problem of building an object class representation based on these different features, which are sourced from boundary-based shape features. The dataset consists of segmented objects with unrestricted poses and sizes from a publicly available image database. Both types of features are combined using a feature fusion approach that concatenates them into a new single feature vector. This feature vector is used to train a Support Vector Machine (SVM) to predict the class of unknown objects. Result/Conclusion: Experimental results show that including more than one type of feature improves object class recognition compared to using a single feature.


INTRODUCTION
An image can be easily understood by humans, but everyone describes an image differently. In Fig. 1, a person may describe the image based on the scenery or surroundings, such as a "city" or "outdoor" image. The image may also be recognized based on the objects it contains, such as "car" and "trees". Those objects are identified based on features such as shape and color. In computer vision research, various features have been introduced to support object recognition systems that capture concepts similar to those understood by humans.
Features of objects in an image can be extracted from their shapes, colors and sizes. In addition, objects can be seen from different views, for example front, side or rear. To relate these visual features to a higher level of conceptual representation that is closer to human understanding, it is sufficient to identify the category or class of the object, a task known as Object Class Recognition. To achieve this, the semantic gap between the simplicity of visual features and the richness of user semantics needs to be reduced (Hare et al., 2006). Much effort in related research therefore attempts to map an object within an image to a suitable class, which is also referred to as a "concept".
Earlier researchers introduced local features to identify objects that vary in pose and size. Local features are extracted around interest points detected on the object by a region detector. Interest points capture information from their neighborhoods and are invariant to scale, translation and rotation. What makes local features appealing is their ability to represent the variability of object classes across different scales, orientations, sizes and poses. Opelt et al. (2006a) use several local features, such as the Scale Invariant Feature Transform (SIFT), subsampled gray values, basic intensity moments and moment invariants, as input to a boosting classifier for recognizing object classes. The authors conclude that combining local features yields higher classification accuracy than using the SIFT feature alone. However, many object classes such as "cups", "horses" or "cows" are better described by shape features than by local features. For example, "cup" objects have limited local features, such as a fixed color or shade. This makes it difficult to discriminate among the classes and in turn leads to poor recognition results.
Graz02 dataset: http://www.emt.tugraz.at/~pinz/data/GRAZ_02/
Furthermore, local features focus on the local information of objects without considering other properties such as shape. This makes it hard for a computer to recognize objects that have limited or plain local features (Mansur and Kuno, 2007). Shape features are often used as a replacement for, or complement to, local features in several works (Opelt et al., 2006b; Yu et al., 2007; Shotton et al., 2008). Because of its richness of information, shape is an important part of the semantic content of images and should be the main feature in recognizing object classes (Yang et al., 2008). Several researchers concentrate on local shape information such as shape context and area. Other shape features are based on contour fragments (Shotton et al., 2008), which represent the partial shape of objects. However, contour fragments cannot guarantee that the actual shape of the object is captured. In addition, they may be affected by noise at high resolution, while small details may disappear at low resolution.
A minority of applications concentrate on the full contour of the object's shape, such as face recognition (Su et al., 2003) and medical image retrieval (Arun and Menon, 2009; Jeong and Radke, 2007; Schaefer et al., 2009), where the objects have roughly restricted poses. Natural images, however, can contain objects in different poses. For example, Fig. 2 shows images of the car class in rear, front and side poses. Although the poses differ, these images belong to the same class. However, the shape representation of an object changes once its pose changes. To overcome this limitation, several papers take advantage of local features by combining them with shape features to improve object class recognition (Mansur and Kuno, 2007; Opelt et al., 2006b; Zhang et al., 2005). This study proposes boundary-based shape features, which describe the entire contour of an object class, to be combined with local features. In contrast to color and texture, shape is described after the objects in the image have been segmented. Moreover, shape features are capable of representing the entire object and can therefore be interpreted by human vision. Good recognition accuracy requires effective shape features that are as close as possible to human perceptual interpretation (Yang et al., 2008). The advantages of these features are that they can be robustly extracted from the image, are insensitive to surface properties such as texture and color, and are invariant to lighting conditions. Furthermore, the shape of objects can be easily encoded.
However, an object's shape extracted by boundary-based features may lead to ambiguity in the recognition process, because in a natural image a single pose of an object is insufficient to identify the actual object. Hence, this study considers numerous poses of objects to resolve the ambiguity problem. Local features are then used to discriminate objects that cannot be distinguished using shape features, since they are able to handle objects in various poses, scales and rotations.
To predict the class of unlabelled objects based on visual features, a feature fusion approach is used. Feature fusion combines multiple features into a new single feature vector (Oliveira and Nunes, 2008; Sun et al., 2009; Ali, 2007). This is a simple approach in which the features are mapped into one feature space. A Support Vector Machine (SVM) classifier is trained on this new feature vector because of its ability to generalize and to handle high-dimensional and nonlinear data for classification.

MATERIALS AND METHODS
Feature fusion is a straightforward method that forms the input to the classifier. At present, researchers face difficulties in determining the combination of methods that produces optimal results (Kludas et al., 2008; Dimitrovski et al., 2011). In our study, the features are sourced from two different types: boundary-based shape features and local features. The former are based on the outline of segmented objects, while the latter are based on the interior information of objects. The motivation of this study is to demonstrate that a combination of different features produces better recognition performance than a single type of feature. The feature fusion framework proposed in our study is illustrated in Fig. 3. Each feature is computed separately and represented as a feature vector. All feature vectors of the boundary-based shape features and local features are then concatenated into a single vector, defined as Eq. 1:

$OC_i = [\mathrm{FD}, \mathrm{EFD}, \mathrm{MI}, \mathrm{SIFT}]$ (1)

where $O_i$ is an object class, FD, EFD and MI represent the boundary-based shape features and SIFT represents the local features. $OC_i$ is the new feature vector that results from concatenating the feature vectors FD, EFD, MI and SIFT.
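The concatenation step can be illustrated with a minimal sketch. The function name `fuse_features` is hypothetical, and the placeholder vectors only mimic the sizes reported later in the paper (40 FDs, 28 EFDs, a 100-word SIFT vocabulary and a total of 175 dimensions); the 7 MI values are assumed from Hu's seven moment invariants.

```python
import numpy as np

def fuse_features(fd, efd, mi, sift_hist):
    """Concatenate per-object feature vectors into one fused vector (early fusion)."""
    parts = [np.asarray(p, dtype=float).ravel() for p in (fd, efd, mi, sift_hist)]
    return np.concatenate(parts)

# Placeholder vectors with illustrative sizes (not real features).
fd = np.zeros(40)              # Fourier Descriptors
efd = np.ones(28)              # Elliptical Fourier Descriptors
mi = np.full(7, 2.0)           # Moment Invariants (assumed: Hu's seven values)
sift_hist = np.full(100, 3.0)  # bag-of-keypoints histogram over 100 visual words

oc = fuse_features(fd, efd, mi, sift_hist)
print(oc.shape)  # (175,)
```

Because the blocks are simply placed side by side, features with very different numeric ranges may dominate a distance-based classifier; whether the authors rescaled each block before fusion is not stated, so no scaling is applied here.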
Feature extraction: Feature extraction is divided into two types of features: boundary-based shape features and local features. The boundary-based shape features used in this study are Fourier Descriptors (FD), Elliptical Fourier Descriptors (EFD) and Moment Invariants (MI), which are extracted from a segmented dataset. For local features, the Scale Invariant Feature Transform (SIFT) is adopted to complement the shape features.

Boundary-based shape features:
These boundary-based shape features are based on the silhouettes of the segmented objects. The primary factors taken into consideration include invariance under translation, rotation, reflection and scaling (Gonzalez et al., 2004). These features are employed because they represent shape accurately and can be easily normalized (Zhang and Lu, 2004). FD and EFD features have not previously been used in object class recognition research, although they are widely used in medical image processing (Arun and Menon, 2009; Jeong and Radke, 2007; Schaefer et al., 2009; Reig-Bolano et al., 2010).
FD: FD values are produced by a Fourier transformation that represents the shape of the object in a given image in the frequency domain (Gonzalez et al., 2004). Based on frequency analysis, the Fourier coefficients can be used to describe the shape of an object. These shape descriptors are normalized to make them independent of translation, rotation and scale. Higher-frequency descriptors capture the detailed shape of an object, whereas lower-frequency descriptors capture a rough approximation of the original shape. In shape description, Fourier transform theory may be applied in many different ways. In this study, the boundary (outline) of the object is treated as a curve in the complex plane (Zahn and Roskies, 1972). Each boundary point with row and column coordinates, B(k) = [x(k), y(k)], k = 0, 1, …, K-1, is expressed as a complex number as denoted by Eq. 2:

$b(k) = x(k) + j\,y(k)$ (2)

where j = sqrt(-1). The boundary starts at an arbitrary point (x_0, y_0) and is traced in the counterclockwise direction at a constant speed. The result is a sequence of coordinates represented by complex numbers. Figure 4 shows an example of an object boundary. Since the images are discrete, the Discrete Fourier Transform (DFT) is applied. The DFT of b(k) is defined in Eq. 3, where u = 0, 1, 2, …, K-1. The complex coefficients DFT(u) are called the Fourier Descriptors of the boundary and describe the shape of the object. The inverse Fourier transform of these coefficients restores b(k), k = 0, 1, 2, …, K-1, as shown in Eq. 4. The inverse transform is computed from a specified number of descriptors in order to obtain a closed spatial curve.
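As a rough illustration of the FD computation described above, the following sketch builds the complex boundary sequence, applies the DFT and normalizes the coefficients. It is written under stated assumptions rather than taken from the paper: the function name `fourier_descriptors` is hypothetical, `np.fft.fft` is used with its own convention (no 1/K factor in the forward transform), and the particular normalization (drop DFT(0), keep magnitudes, divide by |DFT(1)|) is one common choice that the paper does not spell out.

```python
import numpy as np

def fourier_descriptors(boundary_xy, n_desc=40):
    """Return up to n_desc normalized Fourier descriptor magnitudes.

    boundary_xy: (K, 2) array of boundary points traced counterclockwise.
    Fewer than n_desc values are returned if the boundary has too few points.
    """
    pts = np.asarray(boundary_xy, dtype=float)
    b = pts[:, 0] + 1j * pts[:, 1]   # b(k) = x(k) + j*y(k)
    coeffs = np.fft.fft(b)           # DFT(u), u = 0..K-1
    # Assumed normalization:
    #   dropping DFT(0)        -> translation invariance
    #   taking magnitudes      -> rotation and start-point invariance
    #   dividing by |DFT(1)|   -> scale invariance
    mags = np.abs(coeffs[1:])
    mags = mags / mags[0]
    return mags[1:n_desc + 1]        # skip the trivial value |DFT(1)|/|DFT(1)| = 1

# Example: a smooth closed curve sampled at 128 points.
t = np.linspace(0.0, 2.0 * np.pi, 128, endpoint=False)
shape = np.stack([3 * np.cos(t) + 0.5 * np.cos(3 * t), 2 * np.sin(t)], axis=1)
fd = fourier_descriptors(shape)
```

Keeping only the first `n_desc` positive-frequency coefficients is a simplification; a fuller version might also keep the low negative frequencies, which carry shape information of comparable coarseness.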
EFD: Like FD, EFD is applied to the closed contour of the object based on its boundary information. The closed contour is defined by a differential chain code, which represents the point coordinates of the closed contour. Figure 5 illustrates an example contour of a binary image together with the chain code generated from it.
Based on Fig. 5, the length $dt_i$ of element $v_i$ of the chain code is given by Eq. 5, and the length over all elements in a contour is given by Eq. 6. The projections of each $v_i$ on the X and Y axes are given by Eq. 7, and the projections on the X and Y axes of all elements in the chain up to p are given by Eq. 8. EFD is calculated from the sum of elliptical harmonics. In identifying the closed contour, K points and N harmonics are considered. Kuhl and Giardina (1982) use four Fourier coefficients, $a_n$, $b_n$, $c_n$ and $d_n$, in each harmonic; Equation 9 presents these four coefficients. The harmonics and their corresponding coefficients produce coordinates that define ellipses fitted within the object's outline to represent the object's shape.
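The four coefficients per harmonic can be sketched directly from contour points using the standard Kuhl and Giardina formulation for a piecewise-linear closed contour. This is an illustrative sketch, not the authors' implementation: the function name `elliptic_fourier_coeffs` is hypothetical, contour points are used instead of an explicit chain code, no invariance normalization is applied, and the default of 7 harmonics is only an assumption chosen because 7 x 4 coefficients would match the 28 EFDs reported in the Results.

```python
import numpy as np

def elliptic_fourier_coeffs(contour_xy, n_harmonics=7):
    """Return an (n_harmonics, 4) array of [a_n, b_n, c_n, d_n] for a closed contour."""
    pts = np.asarray(contour_xy, dtype=float)
    closed = np.vstack([pts, pts[:1]])          # return to the start point
    d = np.diff(closed, axis=0)                 # per-element projections (dx_i, dy_i)
    dt = np.hypot(d[:, 0], d[:, 1])             # per-element length dt_i
    keep = dt > 0                               # ignore repeated points
    d, dt = d[keep], dt[keep]
    t = np.concatenate([[0.0], np.cumsum(dt)])  # accumulated length t_p
    T = t[-1]                                   # total contour length

    coeffs = np.zeros((n_harmonics, 4))
    for n in range(1, n_harmonics + 1):
        w = 2.0 * n * np.pi / T
        scale = T / (2.0 * n ** 2 * np.pi ** 2)
        dcos = np.cos(w * t[1:]) - np.cos(w * t[:-1])
        dsin = np.sin(w * t[1:]) - np.sin(w * t[:-1])
        coeffs[n - 1] = scale * np.array([
            np.sum(d[:, 0] / dt * dcos),        # a_n
            np.sum(d[:, 0] / dt * dsin),        # b_n
            np.sum(d[:, 1] / dt * dcos),        # c_n
            np.sum(d[:, 1] / dt * dsin),        # d_n
        ])
    return coeffs

# Example: a circle of radius 5 traced counterclockwise from angle 0.
theta = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
circle = np.stack([5 * np.cos(theta), 5 * np.sin(theta)], axis=1)
efd = elliptic_fourier_coeffs(circle)
```

For the circle, the first harmonic is close to a_1 = d_1 = 5 with b_1 = c_1 = 0 and the higher harmonics are close to zero, which is a convenient sanity check that one ellipse (here a circle) reproduces the outline.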
Moment Invariants (MI): MI are shape features that have been successfully used in pattern recognition research, such as aircraft recognition (Sarfraz, 2006), object class recognition (Yuan and Hui, 2008), face recognition (Nabatchian et al., 2008) and handwriting recognition (Ramteke and Mehrotra, 2008). These features can be extracted from the boundary and the interior region of an object. In this research, MI values are extracted from the boundary points of the segmented objects following Hu (1962), who proposed seven expressions computed from normalized central moments that are invariant to object scale, translation and rotation. Hence, the MI features used in this research are able to represent different geometrical features of the input objects. MI can also be applied to disjoint shapes, which FD cannot handle (Chen, 2003).
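Hu's seven expressions can be sketched as follows. The function name `hu_moments` is hypothetical, and the sketch takes a 2-D image array as input (for example a binary image in which only boundary pixels are set); how the authors rasterized the boundary points is not stated, so that input format is an assumption.

```python
import numpy as np

def hu_moments(image):
    """Return Hu's seven moment invariants of a 2-D (binary or gray-level) image."""
    img = np.asarray(image, dtype=float)
    rows, cols = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    dx = cols - (cols * img).sum() / m00   # x relative to the centroid
    dy = rows - (rows * img).sum() / m00   # y relative to the centroid

    def eta(p, q):
        # Normalized central moment: mu_pq / mu_00^(1 + (p + q) / 2)
        return (dx ** p * dy ** q * img).sum() / m00 ** (1.0 + (p + q) / 2.0)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03            # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03    # recurring differences
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ])

# Example: an L-shaped blob, then the same blob shifted and rotated by 90 degrees.
img = np.zeros((24, 24))
img[5:15, 6:9] = 1.0
img[12:15, 6:16] = 1.0
h0 = hu_moments(img)
h_shift = hu_moments(np.roll(img, (4, 3), axis=(0, 1)))
h_rot = hu_moments(np.rot90(img))
```

On a pixel grid, translation and 90-degree rotation leave the seven values unchanged up to floating-point error, whereas scale invariance holds only approximately because resampling changes the discretized shape.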

Local features:
In object class recognition, each object should ideally have a unique representation. However, this is difficult because the same object may appear in many poses. One disadvantage of the chosen shape features is that silhouette information may be insufficient and ambiguous: similar silhouettes often correspond to different objects seen from different viewpoints. To overcome this, the study also uses the local feature SIFT in combination with the boundary-based shape features. SIFT is regarded as one of the best-performing local features for recognizing objects under different views and scales, including in blurry images and in images with changes in lighting and translation (Mikolajczyk and Schmid, 2005; Lowe, 1999). SIFT feature extraction employs the bag of keypoints approach (Csurka et al., 2004), which is based on vector quantization of the SIFT features extracted from the object. The difference-of-Gaussian detector is applied to identify the interest points of an object. Each interest point yields a descriptor that is usually 128-dimensional, so the size of an object's local feature set depends on the number of interest points generated by the region detector on the object patches. Once this multi-dimensional feature set has been extracted from an object, a clustering algorithm is run to generate the visual vocabulary. To construct a bag of keypoints as the feature vector, the number of patches assigned to each cluster is counted, and the learning algorithm is trained on this feature vector. The category of test data can then be determined from the trained model.
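The vocabulary and histogram steps of the bag of keypoints approach can be sketched with a small NumPy-only example. Everything here is illustrative: the descriptors are random stand-ins for 128-dimensional SIFT descriptors, the helper names (`assign`, `kmeans`, `bag_of_keypoints`) are hypothetical, and a plain Lloyd-style K-Means with a tiny vocabulary is used in place of whatever clustering implementation the authors ran.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest cluster center (visual word) for each descriptor."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Plain Lloyd-style K-Means; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    X = np.asarray(descriptors, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):                 # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_keypoints(descriptors, centers):
    """Histogram counting how many descriptors fall into each visual word."""
    labels = assign(np.asarray(descriptors, dtype=float), centers)
    return np.bincount(labels, minlength=len(centers)).astype(float)

# Random stand-ins for SIFT descriptors pooled over the training objects.
rng = np.random.default_rng(1)
all_descriptors = rng.normal(size=(300, 128))
vocabulary = kmeans(all_descriptors, k=5)    # the paper uses K = 100
one_object = all_descriptors[:60]            # descriptors of a single object
hist = bag_of_keypoints(one_object, vocabulary)
```

The histogram has a fixed length regardless of how many interest points an object has, which is what allows it to be concatenated with the shape features in a single vector.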

RESULTS
The goal of this study is to investigate whether fusing different types of features into a single feature vector improves performance in recognizing object classes. The proposed approach is compared against the recognition results obtained from a single feature. For this experiment, the Graz02 dataset is used because the objects it contains are more realistic and are not restricted in pose, size, translation or illumination.
Empirically, 40 FDs and 28 EFDs are used in this study, as these numbers describe the shape of objects accurately. Figure 7 presents the error rates of the SVM classifier for each object class for different numbers of FDs. The error rate improves slightly between 30 and 40 FDs. With more than 40 FDs, however, the recognition error rate increases to 0.02%. Therefore, 40 FDs are taken in this study as the optimal number of descriptors for each object class in terms of accuracy.
Following previous research (Opelt et al., 2006a), the SIFT features are clustered using the K-Means algorithm with K = 100. The new feature vector OC from Eq. 1 represents a single feature space with a total of 175 dimensions and is then used to train a binary SVM classifier that models each class. For recognition, all features are extracted from the testing data and the trained model is used to predict the final object class. The Radial Basis Function (RBF) kernel is applied, with the gamma (γ) and cost (C) parameters selected using 10-fold cross-validation. The sizes of the training and testing data are adopted from Opelt et al. (2006a).

All experiments in this study are evaluated using the Receiver Operating Characteristic (ROC) curve, and the recognition results are shown in Table 1. This evaluation method is a good measure of recognition performance since it takes into account the difference between errors on positive and negative examples (Rakotomamonjy, 2004). The combination of shape-based features with local features, referred to as Feature Fusion (FF), improves classification performance compared to using a single feature only, and FF achieves the highest ROC rates in all classes. Combining boundary-based shape features with SIFT features increases the performance for the bike class by 8.1%, for the car class by 0.1% and for the people class by about 1.7%. By comparison, with boundary-based shape features alone, an average of up to 85% of objects are correctly classified. For the local feature alone, however, the performance of the bike and car classifiers is not promising, at only 56.4% and 40.1% correct classification, respectively. In Fig. 8, we present the error rates of the SVM classifier for each object class for the different features, to give a clearer picture of the recognition performance of single features compared with FF (FD+EFD+MI+SIFT).
In Fig. 8, the error rate improves slightly when using FD+EFD+MI+SIFT, whereas it increases significantly when using SIFT alone. Table 2 compares our results with the state-of-the-art approaches of Opelt et al. (2006a) and Hegazy and Denzler (2008), which combine a variety of local features only. We observe that combining global and local features improves classification over these state-of-the-art features by more than 15% for all object classes.
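The training setup described in this section (a binary RBF-kernel SVM on the 175-dimensional fused vector, with gamma and C chosen by 10-fold cross-validation) can be sketched with scikit-learn. This is a sketch under stated assumptions: the data are synthetic Gaussian clusters rather than Graz02 features, the parameter grid is arbitrary, plain accuracy is reported instead of the ROC-based measure used in the paper, and the 7 MI dimensions are assumed from Hu's seven invariants.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

n_dim = 40 + 28 + 7 + 100   # FD + EFD + MI (assumed 7) + SIFT vocabulary = 175

# Synthetic stand-in data: one positive and one negative cluster per binary classifier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, size=(60, n_dim)),
               rng.normal(0.0, 1.0, size=(60, n_dim))])
y = np.array([1] * 60 + [0] * 60)
order = rng.permutation(len(y))
train, test = order[:80], order[80:]

# Illustrative grid; the values searched by the authors are not reported.
param_grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
grid.fit(X[train], y[train])

accuracy = grid.score(X[test], y[test])
print(grid.best_params_, accuracy)
```

One such binary classifier would be trained per object class (for example bike, car and people against background), which matches the per-class results reported above.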

DISCUSSION
In recognizing objects, shape and local features both play an important role in building a successful recognition system that reduces classification error. However, from our observation, the object classes in the Graz02 dataset (Opelt et al., 2006a) do not carry much information in their local features; hence SIFT alone is not able to increase recognition performance, even though SIFT local features are robust to scale, viewpoint and illumination. We also observed that shape features have a high influence on the final decision in the feature fusion approach, even when the local features of objects are limited.

CONCLUSION
This study proposed a combination of boundary-based shape features and local features using a feature fusion technique with a binary SVM classifier. The experimental results on this challenging dataset show that feature fusion improves classification accuracy compared to using a single feature. However, the approach has some drawbacks, including the high dimensionality of the feature vector and contradictory information when too many different features are combined. With this method, it is hard to identify which features are actually relevant to the resulting accuracy, since all features are represented in one feature space. To overcome this weakness and to reduce computation time, further research will explore decision fusion methods for aggregating different types of features.