Sparse Representation Tracking with Auxiliary Adaptive Appearance Models

Abstract: We propose an effective tracking algorithm based on sparse representation and auxiliary adaptive appearance modeling. With a sparse representation, l1 minimization can follow targets in challenging situations. Unfortunately, tracking approaches based on l1 minimization tend to be inefficient because they evaluate candidates through the density of their coefficient distributions. The number of target candidates can be very large when the state space is densely sampled, and each minimization takes a long time to reach a solution. Traditionally, the coefficients must be calculated for every tracking candidate, which is computationally expensive. In this work, we find that l1 minimization can be limited to a few regions with reasonable probability, selected using adaptive appearance modeling and background probability estimation. The computational cost is therefore greatly reduced. We also find that appearance information is useful for the region selection, and we build the appearance model from colors and shapes. Experimental results show that the proposed tracker performs well.


Introduction
Object tracking has many important applications such as surveillance, human-computer interface and robotics. A good tracker should possess properties such as high efficiency and accuracy. However, in real tracking tasks, many issues are hurdles for developing good trackers. For instance, viewpoint variations and illumination changes tend to make a good model invalid. Additionally, occlusions and unusual motions are also challenging problems for tracking.
Target representation is essential for developing a good tracker. Target appearance is usually not constant in an image sequence. It changes gradually or suddenly as a result of viewpoint or illumination variations. Since viewpoints and illuminations lie in continuous spaces, the target appearance can have infinitely many possibilities. Despite the large number of appearance variations, it has been found that these variations lie in a low-dimensional space (Tenenbaum et al., 2000; Gu et al., 2008; Mei et al., 2011). This space is well approximated by a set of templates. This observation is useful in tracking based on l1 minimization, which pursues sparse representations in a large space (Donoho, 2006). Matching probability in l1 tracking is partly measured by the density of the coefficient distribution. A target candidate that can be sparsely described has a higher probability of being correctly recognized than one with many small non-zero coefficients. A tracker based on l1 minimization can deal with appearance variations and occlusions. Despite their robustness, l1-based trackers are not efficient because the coefficients must be calculated in a large space.
Based on their appearance modeling approaches, object tracking algorithms can be classified into two categories: generative and discriminative trackers. Generative trackers model targets using representations that can reconstruct the targets. Such trackers do not consider the background information of the targets, which can be helpful in distinguishing a target from its background. Discriminative trackers are effective in separating targets from their background using the differences between foreground and background. However, discriminative trackers lack the reconstruction ability of generative trackers. In this work, we aim to combine the merits of discriminative and generative trackers in an integrated way.
Tracking based on l1 minimization belongs to the generative trackers. We aim to develop an l1 tracker supported by discriminative approaches. The mean-shift algorithm (Comaniciu et al., 2003) is a representative discriminative tracker. Collins et al. (2005) select good color space combinations to improve discriminative ability. Wang and Yagi (2008) consider multiple shape and color features in an integrated framework. We found that the adaptive appearance modeling of Wang and Yagi (2008) is helpful for including discriminative power in our tracker. We characterize the target and its background using adaptive appearance modeling. Foreground likelihood images are then computed based on the appearance model. With this support, the search space of the l1 minimization can be greatly constrained. We develop a robust and efficient tracker that combines adaptive appearance modeling and l1 minimization.
Particle filtering, a general Monte Carlo method, has been widely used in non-linear sequential state estimation (Isard and Blake, 1998; Toyama and Blake, 2001; Zhou et al., 2004; Hastie et al., 2003; Ross et al., 2007; Wang and Yagi, 2009; Das et al., 2012; Kim and Jeon, 2012). Particle filtering estimates the state of a sequential system that evolves over time. Generally, the system measurement we obtain is polluted by noise. To estimate the state of the system correctly from noisy measurements, particle filtering uses many samples to represent state distributions. The computational cost of a tracker depends on the number of samples. We constrain our particle filtering using adaptive appearance modeling (Wang and Yagi, 2008).
This paper is organized as follows. Section 2 introduces our particle filtering framework. Section 3 describes target modeling using a set of target templates and trivial templates; trivial templates are used to describe occlusions in template matching. Adaptive appearance modeling is also introduced in Section 3. Experimental results on real image sequences are presented in Section 4. Section 5 summarizes this work.

Particle Filtering
Particle filtering estimates the state of a system from current and historical observations. The state of our tracking system evolves according to $x_t = f_t(x_{t-1}, u_{t-1})$, where $x_t$ is a vector representing the state of the system at time $t$, $x_{t-1}$ is the state at time $t-1$, $u_{t-1}$ is a noise vector, and $f_t$ is a nonlinear time-dependent function. We obtain the observation $y_t$ from the system as $y_t = g(x_t, e_t)$, where $g$ is a nonlinear transform function characterizing the measurement and $e_t$ is the measurement noise (Doucet et al., 2001).
We estimate the distribution $p(x_t | y_{0:t})$ using particle filtering. Here, we consider all the observations from $y_0$ to $y_t$, where $y_0, \ldots, y_{t-1}$ are the historical observations. Particle filtering consists of two steps: predicting and updating the state. In the prediction step, the prior $p(x_t | y_{0:t-1})$ is computed from the posterior available at time $t-1$, $p(x_{t-1} | y_{0:t-1})$. In the update step, this prior is updated with the new observation $y_t$ based on Bayes' rule. Particle filtering represents the probability distribution $p(x_t | y_{0:t})$ as a weighted particle set $\pi$. The samples in the weighted particle set are calculated using sequential importance re-sampling (Arulampalam et al., 2002). $J$ samples are drawn from the proposal distribution, and the samples are updated and propagated recursively: given the distribution $p(x_{t-1} | y_{0:t-1})$, particle filtering calculates the filtered distribution $p(x_t | y_{0:t})$. Embedding the weighted particle set into the Bayesian update yields a particle approximation of $p(x_t | y_{0:t})$.
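The predict/update recursion described above can be sketched as a bootstrap variant of sequential importance re-sampling. This is a minimal illustration under stated assumptions, not the tracker's actual implementation: the proposal is assumed to be the motion model itself, and the `transition` and `likelihood` callables are hypothetical stand-ins for the motion and observation models.

```python
import numpy as np


def sir_step(particles, weights, transition, likelihood, rng):
    """One resample/predict/update step of a bootstrap (SIR) particle filter.

    particles : (J, k) array of state samples x_{t-1}
    weights   : (J,) normalized weights of those samples
    transition: hypothetical motion model, (particles, rng) -> new particles
    likelihood: hypothetical observation model, particles -> p(y_t | x_t)
    """
    num = len(particles)
    # Resample: draw J indices in proportion to the current weights.
    idx = rng.choice(num, size=num, p=weights)
    # Predict: propagate the resampled particles through the motion model.
    predicted = transition(particles[idx], rng)
    # Update: reweight by the observation likelihood and normalize.
    new_weights = likelihood(predicted) + 1e-300  # guard against all-zero weights
    new_weights = new_weights / new_weights.sum()
    return predicted, new_weights


def weighted_mean(particles, weights):
    """Point estimate of the state: the weighted mean of the particle set."""
    return weights @ particles
```

Because the proposal equals the motion model in this sketch, the importance weights reduce to the normalized observation likelihoods; a different proposal distribution would require an additional correction factor in the weight update.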

Target Modeling
We represent the target using a generative method, in which correct candidates are found by l1 minimization. We also characterize the target and its background using adaptive appearance modeling (Wang and Yagi, 2008), which is a discriminative modeling approach. The auxiliary adaptive appearance modeling accelerates the l1 minimization in particle filtering by shrinking the state space of the target.

Generative Target Modeling
We have a few modeling templates denoted by $m_i \in \mathbb{R}^d$, where $d$ denotes the size of a template. They are used to create a matrix $M = [m_1, m_2, \ldots, m_n]$, where $M \in \mathbb{R}^{d \times n}$.
Considering all the modeling templates, we search for a good target candidate $y^*$ from a set of candidates $y$, each of which is reconstructed from the templates as $y = M\alpha$.
The reconstruction $y = M\alpha$ does not consider noise or occlusion. Unfortunately, observation noise and corruption are unavoidable in real tracking tasks, which makes an ideal reconstruction impossible. To deal with this problem, we introduce a noise vector $v$ and a trivial template set $\beta$. Each trivial template has only one non-zero element. By combining many such trivial templates, we can describe noise and occlusion in the reconstruction, $y = M\alpha + \beta v$. Here, the template set $\beta$ represents corruptions, $\alpha$ is the target coefficient vector and $v$ is the trivial coefficient vector.
To reconstruct $y$, we calculate the coefficients in $\alpha$. The values in $\alpha$ should be constrained to be nonnegative because the templates that resemble the tracking target are positively related to the target. This constraint can be imposed on $\alpha$ directly, but it is not easy to enforce nonnegativity on the trivial coefficient vector $v$. To solve this problem, the trivial templates are extended by including negative trivial templates. The target templates $M$ and the positive and negative trivial templates are then combined to form a single matrix, $[M, \beta, -\beta]$, with a nonnegative coefficient vector.
A reasonable reconstruction has only a few coefficients with large values, so the optimization has to keep the number of large coefficients small. This is more difficult than only considering the reconstruction error. To retain a few non-zero coefficients and keep the others at zero, the problem is transformed into the minimization of a penalized residual sum, with the penalty placed on the sum of absolute values of the coefficients (Hastie et al., 2003).
We search for a good candidate by minimizing an objective with an l1 penalty (Tibshirani, 1996; Koh et al., 2007a; 2007b), referred to as Equation 6, in which λ is the regularization parameter. If λ is not chosen properly, the solution of Equation 6 is not meaningful. λ also influences the runtime of the minimization process. In this work, we set λ according to the suggestions given in Koh et al. (2007b). For each minimization, we set λ to 0.1 λ_max, where λ_max is an upper bound on the useful range of the regularization parameter λ and can be computed as described in Koh et al. (2007b). Efficient solutions to l1 minimization have been proposed in Koh et al. (2007a; 2007b). Despite these developments, l1 minimization is still computationally expensive for our tracking tasks. We introduce adaptive appearance modeling to constrain the cost in Section 3.4.
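The construction above can be illustrated with a small sketch. It assumes the objective has the common form of a squared reconstruction error plus λ times the l1 norm of a nonnegative coefficient vector, with identity columns as trivial templates. The solver here is a simple projected proximal-gradient loop chosen for brevity; it is not the truncated Newton interior-point method used in this work. The function names and the formula used for λ_max (the largest absolute entry of $2A^T y$) are assumptions made for illustration.

```python
import numpy as np


def build_dictionary(M):
    """Stack target templates with positive and negative trivial templates."""
    d = M.shape[0]
    identity = np.eye(d)  # each trivial template has a single non-zero element
    return np.hstack([M, identity, -identity])


def lambda_max(A, y):
    """Assumed upper bound on useful lambda: above it the solution is all zeros."""
    return np.max(np.abs(2.0 * (A.T @ y)))


def solve_nonneg_l1(A, y, lam, n_iter=500):
    """Minimize ||A c - y||_2^2 + lam * sum(c) subject to c >= 0."""
    lipschitz = 2.0 * np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # For c >= 0 the l1 norm equals sum(c), so its gradient is the constant lam.
        grad = 2.0 * (A.T @ (A @ c - y)) + lam
        # Gradient step followed by projection onto the nonnegative orthant.
        c = np.maximum(0.0, c - grad / lipschitz)
    return c
```

With `A = build_dictionary(M)`, calling `solve_nonneg_l1(A, y, 0.1 * lambda_max(A, y))` mirrors the choice of λ described above. The first n entries of the result play the role of α and the remaining entries the role of the extended trivial coefficients.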

Generative l1 minimization
We can find target candidates by directly comparing the pixel difference between a target template and its candidates in images. Unfortunately, this kind of matching is very sensitive to noise. In this work, we instead search for sparse coefficients for appearance recovery. The residual errors are defined as the difference between the templates and the target candidates.
The minimization in Equation 6 can be performed using the Forward Stage-wise Linear Regression (FSLR) method (Hastie et al., 2003). However, this approach is time consuming. Many alternatives have been proposed to speed up the minimization process. We solve the l1 minimization problem using the truncated Newton interior-point method (Koh et al., 2007a; 2007b). In this method, logarithmic barriers are defined for the bound constraints. In the iterative solving process, search directions are computed using preconditioned conjugate gradients (Koh et al., 2007a; 2007b).

Matching Probability in Generative Reconstruction
We compute the matching probability of a target candidate from its reconstruction result, as an exponential function of the reconstruction error. Here, α is the coefficient vector calculated in the l1 minimization, t is a negative coefficient that controls the exponential function and Z is a normalization factor.
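A minimal sketch of such a matching probability is given below. It assumes the form $p = \frac{1}{Z}\exp(t\,\|y - M\alpha\|_2^2)$ with $t < 0$, and takes $Z$ as the sum over all candidates. The exact error term used in this work is not reproduced here, so the squared l2 residual and the default value of `t` are assumptions.

```python
import numpy as np


def matching_probabilities(Y, M, alphas, t=-10.0):
    """Matching probability of each candidate from its reconstruction error.

    Y      : (d, J) array, one candidate per column
    M      : (d, n) target template matrix
    alphas : (n, J) target coefficients, one column per candidate
    t      : negative coefficient controlling the exponential (assumed value)
    """
    residual = Y - M @ alphas
    err = np.sum(residual ** 2, axis=0)  # squared reconstruction error
    # Subtracting the minimum error only rescales Z and avoids underflow.
    scores = np.exp(t * (err - err.min()))
    return scores / scores.sum()  # division by Z
```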

Adaptive Appearance Modeling in a Discriminative Way
We represent target candidates using a set of particles. If no other information is taken into consideration, the reconstruction error must be computed for every target candidate. Since the particle set can be large, the computational cost of candidate matching using l1 minimization is rather high. We alleviate this problem by constraining particle filtering with foreground likelihood images produced by appearance model back-projection. Wang and Yagi (2008) proposed a tracking algorithm that selects reliable features from color and shape-texture cues. The selection is performed according to the descriptive ability of each feature. The target model is updated based on the similarity between the first and the most recent models. Thanks to the multi-feature selection mechanism, the target can be characterized effectively. We adopt the feature selection approach of Wang and Yagi (2008) to incorporate appearance guidance in our particle filtering framework.
We compute weighted histograms for multiple cues. Log-likelihood ratio images are then generated from these histograms (Equation 8). We measure the descriptive ability of a cue according to its variance ratio, and the most discriminative features are selected for foreground likelihood computation. Foreground and background histograms are denoted by $\xi_F$ and $\xi_B$, respectively. Foreground likelihood ratios are generated by back-projecting the histograms onto the input image (Wang and Yagi, 2008); in Equation 8, $\sigma_\eta$ is a very small number.
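The following sketch illustrates the idea under stated assumptions. It uses the log-likelihood ratio and variance ratio in the form given by Collins et al. (2005), with a small constant `eps` playing the role of $\sigma_\eta$; whether Equation 8 matches this form exactly is an assumption. The function names and the per-pixel bin-index image are hypothetical.

```python
import numpy as np


def log_likelihood_ratio(hist_fg, hist_bg, eps=1e-3):
    """Per-bin log ratio of normalized foreground and background histograms."""
    p = hist_fg / hist_fg.sum()
    q = hist_bg / hist_bg.sum()
    # eps keeps the ratio finite when a bin is empty.
    llr = np.log(np.maximum(p, eps) / np.maximum(q, eps))
    return llr, p, q


def variance_ratio(llr, p, q):
    """Descriptive ability of a cue: between-class over within-class variance."""
    def var(a):
        return np.sum(a * llr ** 2) - np.sum(a * llr) ** 2
    return var(0.5 * (p + q)) / (var(p) + var(q) + 1e-12)


def back_project(bin_image, llr):
    """Foreground likelihood image: look up each pixel's bin in the ratio table."""
    return llr[bin_image]
```

A cue whose foreground and background histograms overlap heavily gives ratios near zero and a low variance ratio, so it would be ranked below a cue that separates the two classes.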
Log-likelihood images are helpful for reducing the search space of particle filtering. We show a foreground likelihood image in Fig. 1. The distribution of target candidates is closely related to the likelihood ratios in a given region. Samples outside the foreground likelihood distribution are not helpful in particle filtering, so we do not need to consider samples in such regions.
We calculate the sum of the likelihood ratios for each candidate region. Candidate matching can also be performed by measuring the l2 norm of the difference between the templates and candidates. The l2 norm computation is more efficient than l1 minimization, so the computational cost of particle filtering can be reduced by applying it first, because the result of the l1 minimization is bounded by the l2 norm computation (Mei et al., 2011). Although the l2 norm is helpful for restricting the l1 minimization cost, it is still less efficient than the foreground likelihood checking strategy. A computational cost analysis is given in Section 4.2.
Fig. 1. (a) ... (Collins et al., 2005). (b) Foreground likelihood image calculated using the method in Wang and Yagi (2008).
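The two pruning tests discussed above can be sketched as follows. The integral-image box sum, the `keep` parameter and the use of an unconstrained least-squares fit as the l2 bound are illustrative assumptions, not details taken from the original method.

```python
import numpy as np


def box_sums(likelihood, boxes):
    """Sum of likelihood ratios inside each box; boxes is a (J, 4) int array of (x, y, w, h)."""
    # Integral image with a leading row and column of zeros.
    ii = np.pad(likelihood, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    x, y, w, h = boxes.T
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]


def select_candidates(likelihood, boxes, keep=10):
    """Indices and scores of the `keep` candidates with the largest likelihood sums."""
    scores = box_sums(likelihood, boxes)
    order = np.argsort(scores)[::-1][:keep]
    return order, scores[order]


def l2_lower_bound(M, y):
    """Least-squares residual: no constrained fit with M alone can do better."""
    alpha, *_ = np.linalg.lstsq(M, y, rcond=None)
    return np.linalg.norm(M @ alpha - y)
```

Only the candidates returned by `select_candidates` would then be passed to the expensive l1 minimization. The box sum costs a constant number of lookups per candidate once the integral image is built, whereas the l2 test requires a least-squares solve per candidate.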

Experimental Results
We implemented the proposed tracking method and compared it with several strong trackers. We compare the proposed tracker with a sparse minimization based tracker, the structural local sparse appearance model-based tracker (Struct) of Jia et al. (2012). This tracker is similar to ours in its use of sparse representation. We also compare with the visual tracking decomposition tracker (VTD) of Kwon and Lee (2010), which is good at dealing with occlusions. Other baseline trackers used in the experiments include the Variance Ratio Mean-Shift tracker (VRMS) of Collins et al. (2005), which is an extension of the mean-shift algorithm (Comaniciu et al., 2003); the Incremental Visual Tracking algorithm (IVT), which computes an incremental Principal Component Analysis for target modeling; and the fragment-based tracking method (Frag) of Adam et al. (2006), which uses a voting technique in the localization process.
We have tested the trackers on many image sequences and evaluate them on challenging ones. These sequences contain typical difficulties for visual tracking: cluttered background, heavy occlusions, shape deformations and background motion. We also include one sequence that appears to be simple, because complicated trackers can unexpectedly fail on simple sequences. Here we show 6 typical sequences to demonstrate the performance of our algorithm.
In the first sequence, a bird is tracked through shape and appearance variations. The bird changes its pose to a large degree and is occluded by other objects in some frames. The Struct tracker gives very good performance in this sequence. Nevertheless, our results are slightly better than those of the Struct tracker. The Frag tracker's performance is also reasonable although it is a relatively simple tracker.
We track a vehicle in the second sequence (Fig. 3a). The vehicle has a distinctive appearance. The difficulty of this sequence is that the scale of the vehicle changes continually: it is small at the beginning of the tracking, becomes large after a few frames and then becomes small again. The Struct tracker becomes unstable from frame 420 because of viewpoint variations and fails at frame 907 due to scale variations. The VRMS tracker does not give a good estimate of the scale. The IVT tracker is better in this respect. However, both VRMS and IVT lose the target after a few frames. Although the IVT tracker finds the vehicle again by chance in frame 600, it soon loses it again. The proposed tracker follows the position and scale of the vehicle well.
A face is tracked in the third sequence (Fig. 3b). The face is occluded by a toy during the tracking. Since the VRMS tracker is not good at dealing with occlusion, the face positions it provides deviate from the true positions. The IVT tracker is better at handling occlusions. Both the IVT tracker and our tracker give good estimates under heavy occlusions. The VTD tracker successfully deals with the heavy occlusions in many frames, but it fails at the end of the sequence due to the fast motion of the face.
We track another face in the fourth sequence. The face moves forward and backward during the tracking. The l1, Struct, VTD and IVT trackers can handle the pose and scale variations before frame 400. The MIL tracker fails because it is distracted by another face. The VRMS tracker cannot follow the face after a few frames. Although the IVT tracker follows the face in some frames, its results are not as precise as those given by the proposed tracker.
We track a basketball player in the fifth sequence. The player has large pose variations. In addition, there are sudden illumination changes in the sequence. The VTD tracker and our tracker give good performance in this sequence.
We track a woman in the sixth sequence. The woman is occluded by several cars in the sequence. The VTD and Struct trackers fail during the tracking due to the heavy occlusions.

Quantitative Comparison
We measure the tracking accuracy of the trackers. We compute the error as the L2 distance between the centers of the tracking result bounding boxes and the centers of the ground truth bounding boxes. The evaluation results are shown in Table 1. The proposed tracker gives low errors, although it is not always the most accurate tracker. For example, its results on the GirlFace sequence are worse than those of the Struct tracker. In the other sequences, our tracker gives the best tracking accuracy. The VRMS tracker usually gives the worst results compared with the other trackers. However, it is much more efficient than the other trackers.
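The error measure described above can be computed as in the following sketch. Boxes are assumed to be given as (x, y, width, height) rows, one per frame; this layout is an assumption for illustration.

```python
import numpy as np


def center_errors(pred_boxes, gt_boxes):
    """Per-frame L2 distance between predicted and ground truth box centers."""
    pred = np.asarray(pred_boxes, dtype=float)
    gt = np.asarray(gt_boxes, dtype=float)
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0  # (x + w/2, y + h/2)
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)
```

Averaging the returned values over a sequence gives a single error figure per tracker and sequence.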

Computational Complexity Analysis
We test the likelihood ratios for each target candidate. The computational cost of our approach depends only on the size of the candidate regions d and the number of samples J in particle filtering. The cost is linear in both factors, that is, O(dJ). This computation is extremely fast in practice.
Matching with the l2 norm is simpler than l1 minimization. However, it is more expensive than our foreground likelihood test. Although the complexity of the l2 norm computation is also O(dJ), it takes more flops in practice because the computation is more involved than our foreground likelihood ratio test.
The l1 minimization problem can be solved by a truncated Newton interior-point method (Koh et al., 2007a; 2007b). In the iterative computation, the search direction is computed using a preconditioned conjugate gradients method. The computational cost of this method depends on the number of iterations required by the truncated Newton interior-point method, which takes a few hundred iterations to reach the solution. The complexity of the preconditioned conjugate gradients method is O(d^2 + dJ), which is much more expensive than our foreground likelihood test.
The number of target templates usually is less than the dimensionality of the templates. Therefore, the size of the templates has a large effect on the computational cost of l1 minimization. In Fig. 2, we compare the computational cost of foreground likelihood checking, l2 norm computation and l1 minimization. We use 600 samples in our particle filtering process. The computational costs of different template sizes are shown for all the three approaches. Since the difference between the methods is rather large, we demonstrate the results in log domain. Obviously, all computational costs increase with larger template size. Foreground likelihood checking is the most efficient method compared with the other two methods and the l2 norm computation takes less time than the l1 minimization.
We also perform a quantitative efficiency comparison of the trackers. Tracking speed is reported in frames per second. We run all the trackers on a computer with Windows 8, an i5 CPU and 4 GB of RAM. We mark the most efficient tracker in bold and the second most efficient tracker in italics. It is clear that the VRMS tracker is far more efficient than all the other trackers. The efficiency of this appearance-based mean-shift tracker indicates the importance of overall appearance modeling. Our tracker is the second best in all the sequences. The computational cost of our tracker is much higher than that of the VRMS tracker, but it provides much better tracking performance. It is more efficient than all the other trackers except the VRMS tracker.

Conclusion
We estimate a target likelihood image based on adaptive appearance modeling. With the support of the target likelihood image, we represent target motion using fewer particles. Particle filtering in this framework is computationally cheaper than matching all the target candidates using an expensive minimization process. The proposed method can deal with target appearance and scale variations. It is also good at handling occlusions.