Video Retrieval Using Histogram and SIFT Combined with Graph-Based Image Segmentation

Content-Based Video Retrieval (CBVR) remains an open and hard problem because of the semantic gap between low-level features and high-level features, the size of video databases, the content of keyframes, the choice of features, etc. In this paper we introduce a new approach to this problem based on the Scale-Invariant Feature Transform (SIFT) feature, a new metric and an object retrieval method. Our algorithm is built on a Content-Based Image Retrieval (CBIR) method in which the keyframe database consists of keyframes detected from the video database by our shot detection method. Experiments show that our approach achieves fairly high accuracy.


Introduction
Finding and retrieving relevant videos from video collections is a natural and important problem. It becomes more and more necessary as videos are generated at an increasing rate. Motivated by this demand, much research on video retrieval has sought more effective methods that can be applied in real applications such as video-on-demand systems, digital libraries, etc. Most current digital systems support retrieval using low-level features such as color, texture and motion [1] (for example, Google's search engine, Yahoo's search engine…). However, these features generally do not reflect users' demands clearly because they express only a small part of the content of videos, while users often care about high-level semantics or concepts. This is one reason why many content-based video retrieval methods have been developed.
Considered a conceptual extension of CBIR into the video domain [2], the CBVR problem can be traced back to the early 1980s with the introduction of CBIR.
Although CBVR is a young field, many different approaches have been proposed, such as methods using visual information, retrieval based on textual information present in the video, relevance feedback algorithms… [3]. A framework for these methods often includes breaking videos into shots and keyframes, and retrieving suitable keyframes for the input data based on chosen features extracted from these shots or keyframes [4]. Many different approaches, which focus on various properties of frames and videos (such as visual effects, motion, sound, etc.), are used to solve each sub-problem.
A common first step for most content-based retrieval techniques is shot segmentation. Although some approaches do not use histograms, histogram difference is still the most widely used method [3].

Information Sciences Letters
An International Journal @ 2012 NSP Natural Sciences Publishing Cor.
Many shot detection techniques use the histogram as a feature, such as the optimal feature selection method based on rough-fuzzy sets of Bing Han et al. [5], the hidden Markov model method of Boreczky and Lynn [6] and the sliding window method of Li and Lee [7]. Others are based directly on histograms, such as the method of Colin et al. [8] and our method, which is presented in section 3.
Keyframe feature extraction is one of the main tasks in the video retrieval problem, especially since video retrieval techniques are nowadays mostly extended, directly or indirectly, from image retrieval techniques. Although this approach does not exploit the spatial-temporal relationship among video frames effectively, it has had some success [3]. In our model, the SIFT feature is chosen because it is almost invariant under variations in recording conditions (light intensity, scale and geometric transformations). Moreover, the SIFT detection algorithm runs fast and the SIFT matching algorithm has high precision and recall.
For a large video database, clustering is commonly chosen to summarize and organize the content of videos.
In most cases, it is used to create a useful indexing scheme for video retrieval by grouping similar shots.
There are mainly two types of clustering: partition clustering, in which similar data are arranged into separate clusters (for example, the shot clustering techniques of Cao et al. [9], K-means, ISODATA, etc.), and hierarchical clustering, which generates a hierarchical classification tree and treats groups as nodes of the tree [3]. That is, hierarchical clustering methods reveal the relationship (in a tree structure) between different groups at different levels. Therefore, in our scheme, we choose a hierarchical clustering method for the clustering process. Moreover, we apply a new metric to "increase the difference" between feature vectors (compared with the Euclidean metric).
The objective of this work is to retrieve from a video database the frames that are visually similar to an input image or object. We describe this process as follows. In section 2, we present the framework of our algorithm. We provide a shot detection method in section 3. The next section describes the process of clustering keyframes and building an index file. Section 5 covers three techniques: graph-based segmentation, finding a representative vector for each object using the SIFT feature, and clustering these vectors. Our new metric is also described in that section. We present the results of our experiments in section 6, and section 7 gives conclusions and possible extensions.

Video retrieval framework
We convert the video database into feature vectors to compare with the feature vectors extracted from a query image, so the goal here is to extract SIFT features [10]. In this paper we create a video retrieval system by combining available techniques such as shot detection [11], graph-based segmentation [12], the SIFT detection algorithm [10]… The model of our system is shown in Figure 1 (general model of the video retrieval system).

Pre-processing:
- Segmenting each video in the database into shots.
- Extracting keyframes from shots. We then cluster them to get a database of representative keyframes and create an index file linking them to the corresponding videos.
- Segmenting representative keyframes and extracting SIFT features from them. Calculating a feature vector for each object.
- Reducing the database once more by clustering objects. Each group of objects is represented by a feature vector.

Retrieval:
The query image is processed in two stages. At stage 1, we segment the image into objects and calculate the SIFT feature vectors of these objects. At stage 2, the matching stage, the representative objects that are most similar to the input objects are chosen and the keyframes containing them are shown as results. Our system supports retrieval based on the entire input image or on an object in the image. We use a new metric to match the feature vectors of objects in the query image against the feature vectors in the database to determine the results.
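The matching stage can be sketched as a nearest-representative lookup followed by an index lookup. This is only a minimal sketch under stated assumptions: the Euclidean distance stands in for the new metric, and the function `retrieve`, the dictionaries `representatives` and `index`, and the toy two-dimensional vectors are hypothetical names and data introduced for illustration.

```python
import math
from typing import Dict, List, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Euclidean distance (a stand-in for the paper's metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def retrieve(
    query_vectors: Sequence[Sequence[float]],
    representatives: Dict[str, Sequence[float]],
    index: Dict[str, List[str]],
    top_k: int = 1,
) -> List[str]:
    """Return keyframes linked to the clusters nearest to each query object."""
    results: List[str] = []
    for q in query_vectors:
        # Rank cluster ids by distance between the query object and the
        # cluster's representative vector.
        ranked = sorted(representatives, key=lambda cid: euclidean(q, representatives[cid]))
        for cid in ranked[:top_k]:
            for keyframe in index[cid]:
                if keyframe not in results:  # keep first occurrence only
                    results.append(keyframe)
    return results


# Hypothetical toy database: two object clusters and the keyframes they map to.
representatives = {"cat": [0.1, 0.9], "mouse": [0.9, 0.1]}
index = {"cat": ["video1/kf_003", "video2/kf_010"], "mouse": ["video1/kf_007"]}

print(retrieve([[0.2, 0.8]], representatives, index))
```

Querying with a single object vector corresponds to retrieval by one object. Passing the vectors of all segmented objects corresponds to retrieval by the entire image.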

Shot detection
As mentioned above, a popular first step in CBVR schemes is segmenting video into shots. A shot is a group of consecutive frames from the start to the end of a recording by one camera, which is used to describe a context of a video such as a continuous action, an event, etc. [3]. In our paper, we use a novel method […] representative keyframe by one vector. We start with representative keyframes and output groups of the feature vectors.
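As a point of reference, the histogram-difference baseline mentioned in the introduction can be sketched as follows. This is not the shot detection method of this paper. The frame layout (a flat sequence of gray levels), the number of bins, the threshold value and the function names are all assumptions chosen for illustration, and only hard cuts are handled.

```python
from typing import List, Sequence

Frame = Sequence[int]  # assumed layout: flat sequence of gray levels in 0..255


def gray_histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Return a normalized gray-level histogram with equal-width bins."""
    hist = [0.0] * bins
    for value in frame:
        hist[min(value * bins // 256, bins - 1)] += 1.0
    total = float(len(frame)) or 1.0  # avoid division by zero on empty frames
    return [count / total for count in hist]


def histogram_difference(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of absolute bin differences: 0.0 for identical, 2.0 for disjoint."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return indices i where a cut is declared between frame i-1 and frame i."""
    cuts: List[int] = []
    previous = gray_histogram(frames[0]) if frames else None
    for i in range(1, len(frames)):
        current = gray_histogram(frames[i])
        if histogram_difference(previous, current) > threshold:
            cuts.append(i)
        previous = current
    return cuts


# Toy video: three dark frames followed by three bright frames.
dark = [10] * 64
bright = [240] * 64
video = [dark, dark, dark, bright, bright, bright]
print(detect_cuts(video))
```

A fixed threshold of this kind reacts to abrupt changes only. Gradual transitions spread the change over many frames and need a different criterion.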
Although an image is used as input, users often focus on one particular object in the image, such as an actor, an item or an animal, rather than the whole image. To satisfy this demand, we segment every keyframe into regions (objects). We use the graph-based image segmentation method of Pedro F. Felzenszwalb and Daniel P. Huttenlocher [12]. After an image is segmented by this algorithm, there is always evidence for a boundary between every pair of objects in the image. In addition, the algorithm satisfies two global properties, runs in time nearly linear in the number of edges of the graph that represents the image, and preserves detail in low-variability image regions while ignoring detail in high-variability regions [12].
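The core merging rule of the graph-based method [12] can be sketched in a few lines. The sketch below makes simplifying assumptions: a grayscale image given as a list of rows, a 4-connected pixel grid, and the absolute gray-level difference as edge weight. It omits the smoothing and small-region clean-up steps of the full method. The names `DisjointSet` and `segment` and the value of the scale parameter `k` are illustrative only.

```python
from typing import Dict, List, Tuple


class DisjointSet:
    """Union-find that also tracks component size and internal difference."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n  # heaviest edge merged into the component

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int, weight: float) -> None:
        """Merge two roots; edges arrive sorted, so `weight` is the new maximum."""
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.internal[a] = weight


def segment(image: List[List[int]], k: float = 50.0) -> List[List[int]]:
    """Label each pixel of a grayscale image with a region id."""
    height, width = len(image), len(image[0])
    edges: List[Tuple[int, int, int]] = []
    for r in range(height):
        for c in range(width):
            p = r * width + c
            if c + 1 < width:
                edges.append((abs(image[r][c] - image[r][c + 1]), p, p + 1))
            if r + 1 < height:
                edges.append((abs(image[r][c] - image[r + 1][c]), p, p + width))
    edges.sort()  # process edges in non-decreasing weight order

    ds = DisjointSet(height * width)
    for weight, p, q in edges:
        a, b = ds.find(p), ds.find(q)
        if a == b:
            continue
        # Merge only if the edge is no heavier than the tolerance of both
        # components: internal difference plus k / component size.
        limit = min(ds.internal[a] + k / ds.size[a], ds.internal[b] + k / ds.size[b])
        if weight <= limit:
            ds.union(a, b, float(weight))

    ids: Dict[int, int] = {}
    return [
        [ids.setdefault(ds.find(r * width + c), len(ids)) for c in range(width)]
        for r in range(height)
    ]


# Toy image: a dark left half and a bright right half.
image = [[10, 10, 200, 200] for _ in range(4)]
for row in segment(image):
    print(row)
```

A larger `k` tolerates heavier edges inside a region and therefore yields fewer, larger regions.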

Feature vectors clustering
In the SIFT framework [10], interest points on objects in an image are called keypoints, and each keypoint has a corresponding descriptor vector. This approach often generates a large number of descriptor vectors from an image, so using it raises a problem: the matching process is slow. The authors of [14] propose an idea to overcome this difficulty. They replace the N descriptor vectors corresponding to the N keypoints on an object with the mean of those vectors. With this method, each object is represented by one mean descriptor vector.
After completing the above processes we get a large collection of feature vectors. To make retrieval run more quickly, we cluster these vectors. We again use the complete-link algorithm [13] for this task. The representative vector of a cluster is the mean of all vectors in that cluster.
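The replacement of N descriptor vectors by their mean can be illustrated with a short sketch. The function name `mean_descriptor` is hypothetical, and toy 4-dimensional vectors stand in for real 128-dimensional SIFT descriptors.

```python
from typing import List, Sequence


def mean_descriptor(descriptors: Sequence[Sequence[float]]) -> List[float]:
    """Average N descriptor vectors component-wise into one vector."""
    if not descriptors:
        raise ValueError("an object needs at least one descriptor")
    n = len(descriptors)
    dim = len(descriptors[0])
    return [sum(d[i] for d in descriptors) / n for i in range(dim)]


# Three toy descriptors belonging to the keypoints of one object.
object_descriptors = [
    [0.0, 0.2, 0.4, 0.0],
    [0.2, 0.2, 0.0, 0.4],
    [0.4, 0.2, 0.2, 0.2],
]
representative = mean_descriptor(object_descriptors)
print(representative)
```

Matching then compares one vector per object instead of N, at the cost of discarding the spatial layout of the individual keypoints.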
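A complete-link clustering pass with a distance threshold, followed by the computation of cluster means, can be sketched as follows. This is a naive illustration rather than the implementation used in the experiments: the stopping rule by `threshold`, the Euclidean distance and the function names are assumptions, and the quadratic pair search is only suitable for small inputs.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(x: Vector, y: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def complete_link(vectors: Sequence[Vector], threshold: float) -> List[List[int]]:
    """Agglomerative clustering; cluster distance is the largest pairwise distance."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        return max(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # the closest pair is already too far apart
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def cluster_means(vectors: Sequence[Vector], clusters: List[List[int]]) -> List[List[float]]:
    """Representative vector of each cluster: the mean of its members."""
    dim = len(vectors[0])
    return [
        [sum(vectors[i][k] for i in members) / len(members) for k in range(dim)]
        for members in clusters
    ]


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
groups = complete_link(points, threshold=2.0)
print(groups)
print(cluster_means(points, groups))
```

Because complete link uses the largest pairwise distance, every pair of vectors inside a resulting cluster is no farther apart than the threshold.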

A new metric
To apply the clustering algorithm and the matching process, we created a new metric based on a characteristic of SIFT descriptor vectors: some components of a SIFT descriptor vector are always large and some other components are always small. For example, for one descriptor vector, the 9th, 17th, 41st and 49th components are mostly larger than 0.1 and sometimes larger than 0.2, but the 4th, 6th, 7th and 8th components are mostly smaller than 0.5.
If we choose the 9th component as a landmark and set its value to 3.25 (in order to […]), then the values of the other components in the above example are approximated as in Table 1.
Table 1: Approximate values of the 128 components (the 1st component is 1, the 2nd component is 0.75, the 3rd component is 0.75, and so on).
After some experiments we found that for two descriptor vectors x, y, if […] is small then […] is often small, and if […] is large then […] is often large, too. So, we define a new metric […] for every […].
Compared with the Euclidean metric, this metric "increases the distance" between two descriptor vectors x, y by increasing large components and decreasing small components. Therefore, we can more easily choose a clustering threshold and get a better clustering result.
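The general idea of emphasizing some components over others can be illustrated with a weighted Euclidean distance. This is purely an illustration and is not the metric defined in this paper, whose formula is not reproduced here. The function `weighted_distance`, the toy 4-dimensional vectors and the weight vector are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def weighted_distance(x: Sequence[float], y: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted Euclidean distance: each component difference is scaled first.

    With strictly positive weights this is still a metric, because it equals
    the Euclidean distance between the rescaled vectors.
    """
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights)))


# Hypothetical weights: the last component plays the role of a dominant component.
weights = [1.0, 0.75, 0.75, 3.25]
x = [0.10, 0.02, 0.03, 0.20]
y = [0.00, 0.02, 0.03, 0.00]

print(euclidean_distance(x, y))
print(weighted_distance(x, y, weights))
```

When the two vectors differ mostly in heavily weighted components, the weighted distance exceeds the Euclidean one, so such vectors are pushed further apart relative to a fixed clustering threshold.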

Experimental Results
To evaluate the performance of our system, we performed experiments on a medium-sized video database (200 GB) of eleven categories which represent distinct contents rather than scenes. Since many keyframes are blurred (due to film effects, fast movement of objects…) or contain only a part of a real object (an actor, an animal…), the results are strongly affected. For query keyframes taken from the database, the results are highly accurate (more than 90% in our experiments). For query images that are not in the database and whose content differs slightly from the content of keyframes in the database, the query precision is about 30%. We tested 100 images from 10 different categories of interest. The following are our detailed experiments:
In a movie, the movement of the main objects (people, vehicles, etc.) and the variation of the background create different shots, although many shots contain the same main objects. Therefore, clustering a main object that appears in different shots (if this object does not change much) into one cluster is an important requirement for reducing the size of the keyframe database. Because the segmentation process can separate main objects from their background with acceptable accuracy, and because SIFT is invariant under geometric transformations and changes of scale, the scheme of keyframe segmentation, SIFT feature calculation and object retrieval can recognize similar main objects from different shots with good accuracy (see Figures 2 and 4). In other words, the scheme is a good choice for meeting the above requirement. Moreover, since the SIFT feature is unchanged under varying light intensity, it discounts the lighting effects used in movies during the clustering process (see the first cluster in Figure 2). In summary, our algorithm works fairly well at retrieving query images that differ from some keyframes by geometric and lighting variations. This is not the case for other variations such as variations of feeling, changes of background, etc.

Conclusion
In this paper, we developed a video retrieval system that combines histograms, the SIFT algorithm, a graph-based segmentation method and the complete-link algorithm, and that has the advantage of simplicity and efficiency in searching for distinct objects rather than scenes. Users can retrieve with an input image or with an object of that image. Moreover, the system can easily be applied to specific data domains, for instance video shot retrieval for face sets [10], events… However, our system has two main disadvantages: long query time and excess detections when detecting gradual shot transitions. So, our future work is to overcome these disadvantages to obtain a better video retrieval system.

Figure 2: Two images (a) and (c) are segmented into objects (images (b) and (d)) with acceptable accuracy.
component, 41 st component and 49 th component are almost more large than 0.1 and sometimes more larger than 0.2, but 4 th component, 6 th component, 7 th component 8 th component are almost smaller than 0.5.

Figure 3: Sum of the representative descriptor vectors of all objects in 2000 random representative keyframes. The x-axis contains 1,…, 128 and the y-axis is the value of each component of the sum vector.

Figure 4: (a) a query image, (b) a corresponding result (a representative keyframe) from the movie "Tom and Jerry" in the database.

Table 2: Experiment result. The columns show the accuracy and average query time of the three methods on three rows.