Unsupervised Cosegmentation Model based on Saliency Detection and Optimized Hue Saturation Value Features of Superpixels

Corresponding Author: Rasha Shoitan, Department of Computer and Systems, Electronic Research Institute, Cairo, Egypt. Email: rasha.shoitan@eri.sci.eg

Abstract: Segmenting the foreground from a single image remains a challenge in computer vision. Image co-segmentation has recently been used to ease single-image segmentation by exploiting information about a common object that appears in a group of images. This research proposes an unsupervised co-segmentation technique based on saliency detection and optimized features of the Hue, Saturation and Value (HSV) histogram of the superpixels. The proposed method is formulated as the conventional Markov Random Field (MRF) segmentation model with an added co-segmentation constraint. First, an initial segmentation is extracted using a saliency technique. Afterward, Particle Swarm Optimization (PSO) is used to iteratively select superpixels on the inner and outer boundary of the initial segmentation and assign them to the foreground or the background according to the optimized energy function. PSO is guided by the HSV dominant colors of the image class and the superpixels around the foreground to decide whether a given superpixel belongs to the foreground or the background. The proposed method is evaluated on two datasets, iCoseg and MSRC, and compared with ten conventional methods using the Intersection over Union (IoU) metric. The experimental results demonstrate that the proposed method segments the object more accurately than the traditional co-segmentation methods, even with a cluttered background.


Introduction
In an era where the number of digital cameras is increasing rapidly, many images need to be used effectively by expert and artificial intelligence systems. Extracting the objects from these images is beneficial in understanding and analyzing their contents. Indeed, segmenting the objects from an image is crucial for many multimedia and computer vision applications, such as skeletonization, image classification, action recognition, scene analysis and image retrieval (Merdassi et al., 2020). However, it is hard to automatically segment the foreground object from a single image because of the lack of information about this object. Recently, researchers have attempted to cope with this lack of information and increase the segmentation accuracy by exploiting the details of common objects that exist in a set of similar images. The process of segmenting out the common objects by simultaneously processing a collection of similar images is known as co-segmentation. In recent years, different co-segmentation methods have been suggested for segmenting the common object. Some of these methods use the traditional segmentation models and add a foreground consistency constraint to segment the common object in a pair of images (Hochbaum and Singh, 2009; Mukherjee and Dyer, 2011). Then, the idea was broadened to segment the common object in a group of images (Mukherjee and Peng, 2011; Meng et al., 2016). However, these methods still suffer from limited segmentation accuracy. In this study, an unsupervised co-segmentation method based on the traditional MRF segmentation model and PSO is proposed to improve the segmentation accuracy for a group of images. The main purpose of the proposed method is to use PSO for the first time to solve the proposed MRF co-segmentation model. Also, a new foreground consistency term based on the histogram of the HSV color space is proposed to improve the accuracy of the segmentation results.
The proposed algorithm utilizes a saliency detection technique (Borji et al., 2019) to detect the initial foreground for each image. Afterward, the superpixels around the contour of the initial segmented foreground are determined. PSO is responsible for achieving the best co-segmentation by flipping the state of the superpixels around the contour to foreground or background and then evaluating the energy function. The energy function starts with the initial foreground and seeks a consistent segmentation at two levels: (1) within the image and (2) within the class (intra-class), which is the co-segmentation term. The proposed co-segmentation term in the energy function compares the dominant colors of the initial foregrounds from all the images in the same class with the dominant colors of the selected superpixel. Based on this comparison, a weight is added to or subtracted from the energy function. According to the energy function value, PSO decides whether the best state of a superpixel is foreground or background. The four main contributions of the proposed method are:
1) Exploiting the fast convergence and good global search capability of PSO to optimize the energy function.
2) Using the HSV color domain instead of RGB to determine the dominant colors, which leads to better separation of the color information from the intensity.
3) Facilitating the comparison between the dominant colors by converting them to a binary matrix.
4) Searching for the superpixels on the foreground contour instead of searching the whole image, which makes it easier to reach the best solution.

Literature Review
Object co-segmentation was first introduced by Rother et al. (2006) to demonstrate that segmenting the common objects from a pair of images improves the segmentation accuracy compared with using a single image. Since then, co-segmentation has drawn much attention and different methods have been introduced (Merdassi et al., 2020). Some of these methods describe the co-segmentation problem as an energy function of the conventional segmentation model with an added foreground consistency constraint term. The co-segmentation problem is then solved by minimizing this energy function, which enforces the foreground consistency (Meng et al., 2016). Rother et al. (2006) formulate the energy minimization function as an MRF-based segmentation with the L1-norm of the histograms of the common foreground objects to force their consistency. This co-segmentation problem is solved by the Trust-Region Graph Cuts (TRGC) optimization method. Mukherjee and Dyer (2011) use the L2-norm instead of the L1-norm for the foreground consistency term to relax the energy minimization problem to linear programming and solve it using Pseudo Boolean optimization. Unfortunately, the histogram difference term in both approaches increases the complexity of the optimization problem. Therefore, Hochbaum and Singh (2009) simplify the energy minimization function by rewarding the foreground consistency term and optimizing it using the max-flow algorithm.
The previously mentioned methods are applied to pairs of images and have restricted applications. Thus, the co-segmentation methods have been generalized to groups of images to extract common objects for practical applications. Batra et al. (2010) have scaled co-segmentation to multiple images by introducing an interactive co-segmentation method. This method depends on the user adding scribbles on an image to discriminate between foreground and background; then the Gaussian mixture model and Grabcut are used to co-segment these images. Collins et al. (2012) propose an interactive co-segmentation framework based on a random walker model to add constraint information for foreground regions. However, the co-segmentation results are affected by the position and the size of these scribbles. Also, Lee et al. (2015) deduce that a random walker is insufficient to segment the images accurately because it depends on a single walker. Therefore, they propose a graph-based system that analyzes the motions and relations of Multiple Random Walkers (MRW) to improve co-segmentation performance. Mukherjee and Peng (2011) broaden the MRF co-segmentation methods to multiple images by modifying the histogram term and considering the foreground scale variations. Meng et al. (2016) address the problem of Co-segmentation of Multiple Groups of images (CMG) and formulate it as an energy minimization function consisting of three terms: the single image segmentation term, the single group term and the multiple group term. The Expectation-Maximization (EM) algorithm is then utilized to solve this optimization problem. On the other hand, Joulin et al. (2010) propose a co-segmentation method based on discriminative clustering to assign the labels jointly for the common foregrounds in a group of images. Chang et al. (2011) use co-saliency to detect the foreground locations and utilize this initial foreground as prior knowledge in the energy function. This energy function is solved using the graph cut technique.
Saliency has been widely used in different ways to solve the co-segmentation problem. Jerripothula et al. (2014) utilize the Geometric Mean Saliency (GMS) technique to form a global saliency map by fusing the aligned saliency maps of a group of similar images, then segment each single image using this global saliency map. Because the process is repeated for each image, the GMS method suffers from excessive calculations for large-scale datasets. Furthermore, Jerripothula et al. (2014) propose an image co-segmentation method based on Saliency Co-Fusion (SCF) to increase the robustness of the co-segmentation system. The authors apply different saliency detection techniques to each image to obtain multiple saliency maps and enhance the common object. These saliency maps are weighted and fused at the superpixel level and the resultant saliency map is used to perform single segmentation on each image. Afterward, Jerripothula et al. (2015) address the computational issue of the GMS technique by using the Group Saliency Propagation (GSP) method to perform the co-segmentation. The basic idea of this method is to cluster similar images into groups and choose a key image that represents each group. This method reduces the information transferred between images and decreases the calculations (Lu et al., 2019).
Moreover, Meng et al. (2012) propose a co-segmentation method using the salient information and the Shortest Path Algorithm (SPA) to segment similar objects with different colors. First, the local regions are extracted from each image; then a digraph is built from the similar local regions and the saliency map. The problem is formulated as a shortest path problem and solved by a dynamic programming method. Other authors suggest joint processing between various visual tasks so that they support each other with useful information. For example, Dai et al. (2013) propose an energy function that combines co-sketch and co-segmentation to align the similar objects in a group of images and improve the segmentation results. Similarly, Jerripothula et al. (2017) propose a framework that couples co-skeletonization and co-segmentation and uses the interdependencies between them so that the two tasks support each other.
Faktor and Irani (2013) propose a co-segmentation method by composition. The algorithm measures the degree of overlap between the co-occurring regions and the initial segments, then constructs a co-segment map reflecting the overlap score for each pixel. Finally, Grabcut is applied to this score map to obtain the segmentation output. Recently, Convolutional Neural Networks (CNNs) have been used for co-segmentation. Kamranian et al. (2018) propose CNN-based Feature Visualization (CNN-FV) for detecting the foreground and use this information in the energy minimization function to improve the co-segmentation results. Gong et al. (2020) propose a co-segmentation technique that calculates the visual correlation between images based on a co-attention computation block and a Siamese network. Li and Liu (2021) propose a new IS-Triplet loss and merge it with the conventional image segmentation loss for the co-segmentation application.

The Proposed Co-segmentation Method
This section presents a detailed description of the proposed co-segmentation method. As mentioned in the introduction, the proposed technique formulates the problem as an MRF-based segmentation model with an additional co-segmentation term, where the optimization is achieved using PSO. In the proposed method, a salient object detection technique is used to extract an initial foreground in each image and then PSO selects superpixels around the contour of this initial foreground. Finally, PSO optimizes the energy function and decides whether each superpixel belongs to the foreground or the background according to the co-segmentation information obtained from the image group. Each process is described in detail in the following subsections.

Salient Detection Method
Salient object detection is used in the proposed method to recognize and segment out the most visually attractive objects or regions from the background in an image. The advantage of using the salient object detection method in co-segmentation is that it concentrates only on distinctive and interesting objects.
Therefore, it is used in the proposed approach to extract an initial foreground to be used as prior information in the energy minimization function. A saliency detection method based on the deep learning method mentioned in (Borji et al., 2019) is used in the proposed technique because of its ability to locate the most salient regions accurately.

Superpixels on the Contour
Superpixels are used in the proposed technique instead of individual pixels for several reasons. First, superpixels carry more information than pixels: each superpixel groups pixels that share the same visual properties, so it has a perceptual meaning. Second, superpixels arrange the pixels in a compact form, which is beneficial for computationally heavy tasks. Third, superpixels adhere to the image boundaries when clustering the pixels, which is essential in image segmentation (Achanta et al., 2012; Lézoray et al., 2017; Stutz et al., 2018). In this research, the Simple Linear Iterative Clustering (SLIC) algorithm (Achanta et al., 2012; Stutz et al., 2018) is used as a preprocessing stage to over-segment the images into many superpixels. The proposed technique exploits the boundary adherence of the superpixels and selects only the superpixels around the initially segmented foreground. This step directs the PSO search to the relevant locations and avoids the time needed to search the whole image, as shown in Fig. 1. Each particle in the PSO is represented as a vector whose length equals the number of selected superpixels (n) around the initial foreground contour, as shown in Eq. 1. Each particle is initialized with the status of the selected superpixels (foreground 'FG' or background 'BG'):

$$P = [s_1, s_2, \ldots, s_n], \quad s_i \in \{\mathrm{FG}, \mathrm{BG}\} \qquad (1)$$

In each iteration, each particle selects one of the superpixels, flips its status (foreground/background) and checks the effect on the energy minimization function. Afterward, PSO decides whether to flip one of the selected superpixels based on the calculated fitness function value. Then the superpixels on the boundary are updated to be used in the next iteration.
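The selection of contour superpixels and the particle encoding described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the superpixel label map is assumed to be given (for example, produced by a SLIC implementation), the function names `contour_superpixels`, `init_particle` and `flip` are hypothetical, and a superpixel is treated as foreground when most of its pixels lie inside the initial mask.

```python
import numpy as np

def contour_superpixels(labels, fg_mask):
    """Return the ids of superpixels that touch the foreground contour.

    labels  : 2-D integer array, one superpixel id per pixel (assumed given).
    fg_mask : 2-D boolean array, True where the initial foreground is.
    """
    fg = np.asarray(fg_mask, dtype=bool)
    edge = np.zeros_like(fg)
    # A pixel is on the contour if a 4-neighbour has the opposite state.
    diff_v = fg[1:, :] != fg[:-1, :]
    diff_h = fg[:, 1:] != fg[:, :-1]
    edge[1:, :] |= diff_v
    edge[:-1, :] |= diff_v
    edge[:, 1:] |= diff_h
    edge[:, :-1] |= diff_h
    # Superpixels on both the inner and the outer side of the contour.
    return np.unique(labels[edge])

def init_particle(labels, fg_mask, ids):
    """Encode the status of each selected superpixel: 1 = FG, 0 = BG.

    Assumption: a superpixel counts as FG when more than half of its
    pixels are inside the initial foreground mask.
    """
    fg = np.asarray(fg_mask, dtype=bool)
    return np.array([int(fg[labels == i].mean() > 0.5) for i in ids])

def flip(particle, k):
    """Return a copy of the particle with the k-th superpixel status flipped."""
    trial = particle.copy()
    trial[k] = 1 - trial[k]
    return trial
```

Restricting the vector to contour superpixels keeps the particle short, which is the reason given in the text for not searching the whole image.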

Dominant Color
The co-segmentation term in the energy minimization function is considered the most crucial part of any co-segmentation technique, as this term affects the accuracy of the segmentation and the complexity of the energy minimization function. Most previous studies use the RGB color domain and some features in the co-segmentation term to segment the common object (Hochbaum and Singh, 2009; Mukherjee and Dyer, 2011). However, the RGB color space is not recommended for color analysis because it cannot differentiate between color and luminance (intensity). Therefore, the proposed algorithm uses the HSV color space, which separates intensity from color and captures the color characteristic (Shaik et al., 2015; Garcia-Lamont et al., 2018). The HSV color space consists of three components: the Hue (H), the Saturation (S) and the Value (V). The Hue term represents the pure color (yellow, orange, blue, etc.). The Saturation term measures the degree to which the Hue is mixed with white. The Value term measures the intensity or brightness of the color, as shown in Fig. 2 (Bourne, 2010).
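The separation of color from intensity can be illustrated with a small sketch that uses Python's standard `colorsys` module. The helper name `rgb_to_hsv255` and the two sample colors are illustrative assumptions: a bright red and a darker version of the same red differ in all three RGB channels, yet share the same Hue and Saturation and differ only in Value.

```python
import colorsys

def rgb_to_hsv255(r, g, b):
    """Convert 8-bit RGB values to HSV with all components in [0, 1]."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

# Two shades of the same red: every RGB channel changes with brightness.
bright_red = rgb_to_hsv255(200, 40, 40)
dark_red = rgb_to_hsv255(100, 20, 20)

# In HSV only the Value component changes; Hue and Saturation stay the same.
print("bright:", bright_red)
print("dark:  ", dark_red)
```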
To determine the dominant colors in the proposed technique, all the images from the same class are segmented using the salient object detection technique. The segmented foregrounds are then converted to the HSV color space. Subsequently, a 2-D histogram of the Hue and Saturation components is computed for each segmented foreground. The vertical axis of the histogram represents the Hue, while the horizontal axis represents the Saturation. The histogram presents the color frequency distribution of each image foreground over Hue and Saturation. As shown in Fig. 3, a group of images from the same class is used to get the salient regions and then a histogram is created from the Hue and Saturation values of these regions. The yellow points in the histograms mark the locations that have non-zero elements (colors). Most of the yellow points are concentrated in the top right corner of the histogram, which means that the color is red (Fig. 2). The histograms generated from all segmented foregrounds are summed into a new histogram that represents all the colors of the images in this class. The dominant colors are determined from the summed histogram by arranging its values (colors) in descending order, where the first color is the most frequent one (the highest number of pixels). According to the shape of the histogram (wide or narrow), a percentage of the arranged colors is considered dominant. The histogram shape is determined from the ratio of non-zero elements in the histogram, calculated as:

$$\text{ratio} = \frac{\text{number of non-zero elements in the histogram}}{\text{total number of elements in the histogram}} \qquad (2)$$

If the ratio is small, the histogram is narrow and a high percentage of the colors is taken as dominant colors. However, if the ratio is high, the histogram is wide and a low percentage of the colors is taken as dominant colors.
Thus, the dominant colors are determined to help PSO choose the state of a given superpixel (foreground or background) by comparing the superpixel colors with the dominant colors of the image class. In the proposed method, this comparison is facilitated by converting the histogram matrix that contains the dominant colors to a binary matrix B according to the following equation:

$$B(i, j) = \begin{cases} 1, & \text{if the color at histogram location } (i, j) \text{ is dominant} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

The 1's in the binary matrix reflect the dominant colors, which are represented by yellow points, as shown in Fig. 3.
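A minimal sketch of the dominant-color computation is given below, assuming NumPy and Hue/Saturation values scaled to [0, 1]. The number of bins, the threshold `narrow_ratio` that separates narrow from wide histograms, and the percentages `pct_narrow` and `pct_wide` are hypothetical values chosen for illustration; the text does not state the ones used by the authors.

```python
import numpy as np

def hs_histogram(hue, sat, bins=(16, 16)):
    """2-D histogram with Hue on the rows and Saturation on the columns."""
    hist, _, _ = np.histogram2d(hue, sat, bins=bins, range=[[0, 1], [0, 1]])
    return hist

def dominant_binary(hists, narrow_ratio=0.1, pct_narrow=0.8, pct_wide=0.3):
    """Sum per-image histograms and mark the dominant colors with 1's.

    narrow_ratio, pct_narrow and pct_wide are illustrative assumptions.
    """
    summed = np.sum(hists, axis=0)
    nonzero = np.count_nonzero(summed)
    ratio = nonzero / summed.size          # share of non-zero elements
    # Narrow histogram: keep a high percentage; wide: keep a low percentage.
    pct = pct_narrow if ratio < narrow_ratio else pct_wide
    keep = max(1, int(round(pct * nonzero)))
    # Sort colors by frequency in descending order and keep the top ones.
    order = np.argsort(summed, axis=None)[::-1][:keep]
    binary = np.zeros(summed.size, dtype=np.uint8)
    binary[order] = 1
    binary = binary.reshape(summed.shape)
    binary[summed == 0] = 0                # never mark an empty color
    return binary, ratio
```

The same two functions can be applied to a single superpixel by passing a list that holds only its histogram, which yields the superpixel's own binary matrix for the later comparison.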

Energy Minimization Function
In the proposed method, the common objects are segmented out using the conventional MRF-based segmentation model with a constraint that reflects the foreground consistency. This energy function is formulated as:

$$E = E_{MRF} + E_{Coseg} \qquad (4)$$

where E_MRF is the segmentation term for a single image, which ensures smoothness and differentiates between foreground and background in an image, while E_Coseg is the co-segmentation term, which enforces consistency between the foregrounds in a group of images of the same class (intra-class) (Popova et al., 2018).
The MRF term is responsible for assigning a binary label to each pixel, marking it as foreground or background, according to the following equation:

$$E_{MRF} = E_u + E_p \qquad (5)$$

where E_u is the unary potential term, encoding the probability that a pixel belongs to the foreground or the background, while E_p is the pairwise potential or smoothness term, which rewards assigning the same label to adjacent pixels with the same color feature.
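The two MRF potentials can be sketched on a pixel grid as follows. The text does not give their exact form, so this sketch assumes a common choice: a negative log-likelihood unary term computed from a per-pixel foreground probability, and a contrast-sensitive Potts pairwise term over 4-neighbours with a hypothetical sharpness parameter `beta`. The function name `mrf_energy` is likewise an assumption.

```python
import numpy as np

def mrf_energy(labels, fg_prob, image, beta=1.0):
    """Unary plus pairwise energy of a binary labelling (lower is better).

    labels  : 2-D boolean array, True = foreground.
    fg_prob : 2-D array, assumed probability that each pixel is foreground.
    image   : 2-D array with one color feature per pixel.
    beta    : hypothetical contrast sensitivity of the smoothness term.
    """
    labels = np.asarray(labels, dtype=bool)
    p = np.clip(np.asarray(fg_prob, dtype=float), 1e-9, 1.0 - 1e-9)
    image = np.asarray(image, dtype=float)

    # Unary term: cost of each pixel's label given its foreground probability.
    e_u = -np.sum(np.where(labels, np.log(p), np.log(1.0 - p)))

    # Pairwise term: neighbours with different labels pay a cost that is
    # largest when their color features are similar.
    cut_v = labels[1:, :] != labels[:-1, :]
    cut_h = labels[:, 1:] != labels[:, :-1]
    e_p = np.sum(cut_v * np.exp(-beta * (image[1:, :] - image[:-1, :]) ** 2))
    e_p += np.sum(cut_h * np.exp(-beta * (image[:, 1:] - image[:, :-1]) ** 2))

    return float(e_u + e_p)
```

With this form, a labelling that follows both the probabilities and the color edges receives a lower energy than one that cuts through a uniform region.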
In the proposed method, the co-segmentation term E_Coseg is constructed from a comparison, based on the dominant colors, between the superpixels selected by the PSO and the image group. Relative to their degree of similarity, a weight is added to or subtracted from the energy function to help PSO decide whether the superpixel is foreground or background. This similarity is measured by a Similarity Ratio (SR), which is defined as the intersection between the binary matrix of the class and the binary matrix of the selected superpixel (Eq. 6). The weight is then assigned according to five cases (Eq. 7), where th1 and th2 are determined empirically to decide whether the SR value is considered high or low. In the first two cases, SR has a high value, so if PSO suggests flipping the superpixel to foreground, this suggestion is rewarded; otherwise, it is penalized. In the second pair of cases, SR has a small value, so if PSO suggests flipping the superpixel to background, this suggestion is rewarded; otherwise, it is penalized. In the last case, the SR value is between th1 and th2, which means that the superpixel's colors are neutral (neither dominant nor distinct), so the weight is zero. Thus, the final energy minimization equation is:

$$E = E_u + E_p + E_{Coseg} \qquad (8)$$

A block diagram of the proposed algorithm is presented in Fig. 4 and the proposed algorithm steps are given below:
Step 1: Determine the saliency region (initial foreground) of the image to be segmented.
Step 2: Determine the saliency regions of all the images in the class.
Step 3: Apply SLIC to the image class to generate superpixels for each image.
Step 4: Convert the segmented foregrounds from RGB to HSV color space and then generate the histogram for each converted foreground.
Step 5: Sum the histograms of all the foregrounds into one histogram, then determine the dominant colors as described in the dominant color subsection.
Step 6: Determine the superpixels on the foreground contour for the image to be segmented.
Step 7: For each iteration, the selected superpixels around the contour, the foreground and the dominant colors of the image class are used as input for the PSO.
Step 7.1: Calculate the HSV histogram of the selected superpixel.
Step 7.2: Determine the dominant colors of the superpixel as described in the dominant color subsection.
Step 7.3: Calculate the SR between the dominant colors of the image class from step 5 and the dominant colors of the selected superpixel from step 7.2 using Eq. 6.
Step 7.4: Determine the value of the weight to be added to the energy minimization function from the SR value, as in Eq. 7.
Step 7.5: PSO evaluates the energy function; if a new global best is reached, PSO flips the superpixel status.
Step 7.6: Update the foreground and determine the superpixels around the updated foreground.
Step 8: If the termination criterion is not met, go to step 7.
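Steps 7.3 to 7.5 can be sketched as follows. Several parts are assumptions made for illustration and are not taken from the paper: SR is normalized by the number of dominant colors of the superpixel, the thresholds are named `th_high` and `th_low` (how they map onto th1 and th2 is not stated in the text), the weight magnitude `w` is arbitrary, and a simple greedy single-flip search stands in for PSO. The sketch also omits the contour update of step 7.6.

```python
import numpy as np

def similarity_ratio(class_binary, sp_binary):
    """Share of the superpixel's dominant colors that are also class-dominant.

    Assumption: the intersection is normalized by the superpixel's own count.
    """
    sp_count = np.count_nonzero(sp_binary)
    if sp_count == 0:
        return 0.0
    both = np.logical_and(class_binary, sp_binary)
    return np.count_nonzero(both) / sp_count

def coseg_weight(sr, proposed_fg, th_high=0.7, th_low=0.3, w=1.0):
    """Weight added to the energy; negative values reward the proposal."""
    if sr >= th_high:                 # colors match the class: FG is rewarded
        return -w if proposed_fg else w
    if sr <= th_low:                  # colors differ from the class: BG is rewarded
        return w if proposed_fg else -w
    return 0.0                        # neutral colors: no influence

def greedy_flip_search(states, srs, base_energy, max_iters=10):
    """Greedy stand-in for PSO: accept any single flip that lowers the energy.

    states      : 0/1 status of each contour superpixel (1 = FG).
    srs         : similarity ratio of each contour superpixel.
    base_energy : callable returning the MRF energy for a status vector.
    """
    states = np.array(states, dtype=int)

    def total(s):
        return base_energy(s) + sum(
            coseg_weight(r, bool(x)) for r, x in zip(srs, s))

    best = total(states)
    for _ in range(max_iters):
        improved = False
        for k in range(len(states)):
            trial = states.copy()
            trial[k] ^= 1             # flip FG <-> BG
            energy = total(trial)
            if energy < best:
                states, best, improved = trial, energy, True
        if not improved:
            break
    return states, best
```

A population-based optimizer such as PSO would explore several candidate status vectors in parallel instead of one; the energy evaluation inside `total` would stay the same.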

Experimental Results and Analysis
The performance of the proposed approach is evaluated by conducting different experiments on two commonly used benchmark co-segmentation datasets, iCoseg (Batra et al., 2010) and MSRC (Shotton et al., 2006). The iCoseg dataset has 643 images divided into 38 classes. The MSRC dataset is composed of 233 images of 8 object classes. In these datasets, the images in each class show the same object at different scales, viewpoints and illumination.
The performance of the proposed method is evaluated and compared with different previously proposed techniques (Meng et al., 2012; Dai et al., 2013; Faktor and Irani, 2013; Jerripothula et al., 2014; Jerripothula et al., 2015; Lee et al., 2015; Kamranian et al., 2018) using the IoU metric. This metric is calculated as the ratio of the intersection area of the segmented foreground and the ground truth to the area of their union. The results of this comparison on iCoseg and MSRC in terms of IoU are tabulated in Table 1 and Table 2, respectively. In both tables, the average IoU is reported for each class because each dataset has a large number of images. Some results of the methods of (Meng et al., 2016; Kamranian et al., 2018) are missing because these results are taken from the original papers. The results of the other methods are obtained from (Lu et al., 2019).
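The IoU metric and the per-class averaging described above can be sketched with NumPy as follows. The function names are illustrative, and returning 1.0 when both masks are empty is a convention assumed here.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)

def class_average_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs of one class."""
    return float(np.mean([iou(p, g) for p, g in pairs]))
```

For example, a prediction that covers two pixels of which one belongs to a one-pixel ground truth has an intersection of 1 and a union of 2, so its IoU is 0.5.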
The results presented in Table 1 show that the proposed method outperforms the three saliency-based methods, GMS, SCF and GSP. Although SCF applies different saliency techniques to each image and then fuses them to improve the performance of the co-segmentation technique, the proposed method succeeds in improving the results without the need for extra processes. Table 1 also shows that the proposed method gives the best performance among the ten conventional techniques for several challenging classes, such as 'Red sox players', 'Stonehenge 1', 'Liverpool FC players', 'Ferrari', 'Taj Mahal', 'Elephants', 'Airshows-planes', 'Airshows-Huntsville', 'Gymnastics-2', 'Skating-3', 'Soccer players' and 'Track and field'. The results of the proposed method for the rest of the classes are still competitive. Overall, the proposed technique achieves the best result in the largest number of classes, 14 out of 38. Meanwhile, the best-performing conventional method, CMG, achieves the best result in 7 out of 38 classes. Also, the average IoU over all the classes is calculated for each technique and the results are presented in the last row of Table 1. The proposed method achieves the highest average IoU on the iCoseg dataset, 0.7536, compared with the other conventional techniques.
For the MSRC dataset results in Table 2, it can be observed that none of the techniques performs well for all the classes and the results are slightly lower than those on iCoseg. The first reason is that MSRC is considered more complicated than iCoseg, since the same class can contain different individuals of the same category with varied colors and shapes (for example, the animal class has cows, sheep, horses, etc.). The second reason is that the MSRC ground truth is not accurate, as shown in the second and sixth rows of Fig. 6. Despite these problems with the MSRC dataset, the proposed method achieves the best result in the largest number of classes, 4 out of 8, compared to the other conventional methods. Also, the average IoU of the proposed method is still competitive. Visual results of the proposed method on different image classes of iCoseg and MSRC are shown in Fig. 5 and Fig. 6, respectively. Figure 5 shows the segmentation results of the proposed method for four image classes from the iCoseg dataset: 'Panda', 'Elephant', 'Hot balloons' and 'Cheetah'. It can be seen from the figure that the proposed method segments out the common object accurately even with a cluttered background. On the other hand, Fig. 6 shows the results of the proposed method for four image classes from the MSRC dataset: 'Cow', 'Car', 'Animal' and 'Plane'. It can be seen from the figure that the binary foreground of the proposed method is sharper and more accurate than the ground truth, which leads to the common object being localized and segmented accurately.

Conclusion
In this study, an unsupervised co-segmentation technique is proposed to extract the common objects from a group of images. The proposed method builds on the MRF-based segmentation method and adds a co-segmentation term to it. PSO is used in the proposed co-segmentation technique, for the first time as far as the authors know, to optimize the energy function. The saliency technique is used to create an initial segmentation for each image in the class, and the HSV color system is used to determine the dominant colors of the segmented images. The superpixels around the initial foreground to be segmented are selected and assigned to the PSO particles. Selecting the superpixels around the foreground contour helps PSO concentrate on the region of interest and reduces the computation time. The dominant colors of the image class and the dominant colors of the selected superpixels are compared. Based on their similarity ratio, a weight is added to the energy function to help PSO efficiently decide whether a superpixel belongs to the background or the foreground. The experiments have been conducted on two benchmark image co-segmentation datasets and have achieved high accuracy compared to other research in this field. The results indicate that using the dominant colors for the foreground consistency term and optimizing the function using PSO improves the co-segmentation results even with a cluttered background.