Efficient Pose Tracking based on Line Segment Matching

Abstract: Pose tracking is a crucial issue for many applications such as robotic tasks and facility operations. Vision-based approaches, being non-contact, are appropriate choices for these tasks. However, existing vision-based approaches are not sufficiently robust and fast. In this work, we propose a vision-based pose tracking system to deal with these problems. We estimate poses using Lie group and Lie algebra representation theory. The estimation is performed in a linearized space, which is convenient for pose estimation. To provide reliable visual information for our pose estimation, we detect line segments based on semi-global image information. We describe all line segments and match those detected in consecutive frames. Our line segment detector and matching descriptor are good at discarding ambiguous line segments and finding real ones in noisy situations. The integration of group theory with line segment detection and matching plays an important role in developing a robust vision-based pose tracking system. Our system proves to be efficient and robust.


Introduction
Vision-based pose estimation finds numerous applications, including robotic tasks, facility operations and automatic measurement. Real-time vision-based pose estimation has become practical in recent years. Despite this progress, real-time precise pose estimation remains challenging because the amount of information in a video stream is huge. In addition, viewpoint variations, image noise and illumination changes are difficult to handle in many scenarios. We aim at developing robust vision-based 6D pose estimation based on the projection of an object in an image sequence. There are two key ideas in our system. The first is pose estimation using Lie group and Lie algebra representation theory. The second is the detection and matching of line segment features. The integration of the two techniques is important for developing a robust vision-based pose tracking system.
Given an initialization for pose tracking, the transformation between consecutive poses can be represented by a 3D rotation and translation; in a homogeneous coordinate setting it is a 4×4 matrix. We estimate this transformation to update the relative pose between a camera mounted on a system and an object in the camera's view. Given two sets of points P = {p_i}, one obtained by a Euclidean transformation and the other from detection results in an image, we can estimate the motion. A good representation for estimating transformations must allow them to be composed, inverted and differentiated. Unfortunately, direct matrix operations do not meet these demands; it is especially difficult to differentiate a transformation matrix. To deal with this problem, we represent object motion using Lie group and Lie algebra theory.
Group theory studies algebraic structures with certain properties defined on a set. A Lie group is a group that is also a smooth manifold; Lie groups are differentiable manifolds (Sattinger and Weaver, 1998). The group operation of a Lie group is smooth, and so is the inverse mapping. The special orthogonal group SO(3) describes rotations in 3D space. The special Euclidean group SE(3) represents 3D rigid transformations, which act as linear transformations on homogeneous vectors (Taylor and Kriegman, 1994). A Lie algebra, defined on the tangent space of a Lie group, characterizes the local properties of the elements of the group. It consists of a vector space over some field together with a binary operation called the Lie bracket. The Lie algebra so(3) of SO(3) can be described by a 3D vector that can be transformed into a skew-symmetric matrix. The Lie algebra se(3) associated with the Lie group SE(3) is able to represent motions. It is also possible to describe coordinate frame transformations using the adjoint representation of a Lie group. We can thus find a local representation and operation of a Lie group to estimate parameters in a linearized way.
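To make the so(3) representation concrete, the following is a minimal numpy sketch (our own illustration, not part of the tracking system itself) of the map from a 3D rotation vector to its skew-symmetric matrix, and of the exponential map to SO(3) via Rodrigues' formula:

```python
import numpy as np

def hat(omega):
    """so(3): map a 3D rotation vector to its skew-symmetric matrix."""
    wx, wy, wz = omega
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def exp_so3(omega):
    """Exponential map so(3) -> SO(3) (Rodrigues' formula)."""
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(np.asarray(omega) / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```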
To provide reliable information for pose estimation, feature correspondences between consecutive frames are crucial (Davison, 2003; Drummond and Cipolla, 2002; Kim et al., 2016). Point matching has been widely used for tracking and pose estimation because it is easy to detect point features and find their correspondences in consecutive frames using an effective feature description. However, point features are not always consistent under viewpoint and illumination variations. Although it is possible to find scale, rotation, or affine invariant features (Lowe, 2004; Mikolajczyk et al., 2005), such features are computationally expensive to detect. Therefore, invariant features are not good choices for practical applications such as tracking and pose estimation.
Edge features are abundant in images. The Canny edge detector finds edges heuristically, with the number of edges adjusted by thresholds. Edge features can be searched in 1D when the search direction can be calculated (Rosten and Drummond, 2005). However, edge features are not always reliable, especially in image regions with low contrast. In addition, some edge features are formed by local noise; such edges introduce outliers into pose estimation systems.
Lie group theory has been applied in a tracking system based on edge detection (Drummond and Cipolla, 2002). There, edge detection is performed in a local region, and the detection results may drift because of noise in local neighborhoods. The drift of edges can lead to tracking errors, which should be avoided. In contrast, we detect line segments based on semi-global information. These line segments are used to find correspondences with the projection of the 3D model. This paper is organized as follows. After reviewing related work in section 2, we introduce line segment detection, description and matching in section 3. Camera projection and the Lie group and Lie algebra formulation are given in section 4. Experimental results on image sequences are demonstrated in section 5. Section 6 concludes this work.

Related Work
Our goal is to develop a fast and robust pose tracking system. Pose tracking has been handled with different sensors. Ultrasound and laser sensors are useful in many applications (Talib et al., 2007; Koolwal et al., 2010). Although ultrasound sensors are cheap and convenient, they do not provide high accuracy. Laser sensors can be precise; unfortunately, it is relatively difficult to measure an object at many points simultaneously with them. Vision-based pose tracking does not need expensive hardware. Moreover, vision-based approaches can flexibly accommodate other functions such as recognition, giving them better flexibility than other sensor-based approaches. Vision-based pose tracking has been addressed with deterministic approaches (Wunsch and Hirzinger, 1997) and probabilistic approaches (Isard and Blake, 1998; Wang and Yagi, 2008). Using either approach, the motion has to be linearized to solve the problem. In this work, we handle this using Lie groups and Lie algebras. This bears certain similarities to works such as (Taylor and Kriegman, 1994; Drummond and Cipolla, 2002; Rosten and Drummond, 2005; Flint et al., 2011). However, our method differs from these works in three important aspects: first, we consider line segments as reliable input for pose tracking; second, we propose a line segment descriptor for matching; third, we formulate the problem as a minimization with a robust fitting function. Line segments have several advantages over low-level features. Our line segment descriptor can discard ambiguous correspondences in consecutive frames, which is important for the input of the minimization process. The robust fitting function is useful for obtaining a stable solution to the minimization problem. Unlike pose estimation and tracking based on point, edge or line features, we compute the 3D relative pose using line segments. In our system, line segment detection runs in a semi-global way: given a local seed pixel, we find a possible line segment based on gradient orientation consistency. We also propose a line segment feature descriptor to match line segments, and we discard ambiguous line segments based on the matching results.

Line Segment Detection, Description and Matching
Line feature detection can be cast as a two-step problem: detecting edges and connecting them. Line features can be detected by connecting edge features with similar gradient orientations. This approach depends heavily on the edge detector, and edges in weak-contrast regions are difficult to detect. Although a low threshold in the Canny edge detector increases the detection probability, it comes at the cost of many false positives. In addition, the computational cost of the two steps, edge detection and connection, is high.
The Hough transform detects line features in a global way, based on a transformation from feature space to parameter space. A Hough-transform-based method performs detection by splitting the input frame into a set of voting elements (Hough, 1962). The vote for each line hypothesis is obtained from the edges by observing the distribution of line parameters. The Hough transform has been extended to detect multiple object instances (Barinova et al., 2012). However, the Hough transform accumulates all votes in an image, so unnecessary information is considered for line detection. It also cannot provide robust performance in image regions with low contrast, because votes from such regions cannot compete with votes from high-contrast regions.
Line segments contain more information than lines since they provide endpoints, and they tend to be more reliable than lines in this application. Line segment detection can be done by proposing line hypotheses from corners found in an image (Rosten and Drummond, 2005): features are detected with a corner detector, many hypotheses are proposed based on the corners' positions, and the hypotheses are tested; those passing the test are considered useful lines for pose estimation. This approach has to test a large number of hypotheses because the number of line hypotheses is quadratic in the number of detected corners. In addition, many lines that pass the test overlap, and it is difficult to eliminate the ambiguous lines.

Line Segment Detection
Line segments can be formed by the occlusion boundaries between foreground objects and their background (Burns et al., 1986). Contrast within an object can also give rise to a line segment. Line segments are sparse compared with the number of pixels in an image. The direction of a line segment is closely related to the pixel gradient orientations in its neighborhood: they are approximately orthogonal to each other (Gioi et al., 2010). Pixels following this rule are called aligned pixels, and they are important for line segment detection. Despite the sparsity of line segments, there are many line segment hypotheses when we consider a local region in an image; most of them are discarded once a large support region is checked. A rectangular region along a line segment is defined as the support region of that line segment. The pixels following the gradient consistency rule are used for line segment detection.
We first calculate the difference between two diagonally adjacent pixels:

\Delta I(u,v) = I(u+1,\, v+1) - I(u,v) \qquad (1)

where I(u,v) is the intensity value at pixel (u,v). The difference is then used for calculating the two derivatives:

\alpha_u(u,v) = \frac{\Delta I(u,v) + I(u+1,\, v) - I(u,\, v+1)}{2} \qquad (2)

\alpha_v(u,v) = \frac{\Delta I(u,v) - I(u+1,\, v) + I(u,\, v+1)}{2} \qquad (3)

Since \Delta I(u,v) appears in both (2) and (3), it is calculated once and used twice to reduce the computational cost. The magnitude of the gradient is:

G(u,v) = \sqrt{\alpha_u(u,v)^2 + \alpha_v(u,v)^2} \qquad (4)

The gradient orientation of a pixel (u,v) can be calculated by:

\theta(u,v) = \arctan\!\left( \alpha_v(u,v) / \alpha_u(u,v) \right) \qquad (5)

The gradient orientation is compared with the direction of a line segment hypothesis. We do not calculate the difference between two orientations directly because it is computationally expensive, and because the angle difference computed directly can be ambiguous due to periodicity. Instead, we keep the two values \alpha_u(u,v) and \alpha_v(u,v) for each pixel.
The orientation difference of two pixels I(u_1,v_1) and I(u_2,v_2) is calculated by measuring the distance between their normalized gradient vectors:

\hat{g}_1 = \left( \alpha_u(u_1,v_1),\; \alpha_v(u_1,v_1) \right) / G(u_1,v_1) \qquad (6)

\hat{g}_2 = \left( \alpha_u(u_2,v_2),\; \alpha_v(u_2,v_2) \right) / G(u_2,v_2) \qquad (7)

The calculation, illustrated in Fig. 1, is:

d = \left\| \hat{g}_1 - \hat{g}_2 \right\| = 2 \sin\!\left( \Delta\theta / 2 \right) \qquad (8)

where \Delta\theta is the angle between the two gradient directions. In the input image, the pixels on edges have higher gradient magnitudes than the pixels in smooth regions. A few pixels with consistent gradient orientations are sampled as the seed set of a line segment, since the gradient orientations along a line segment are known to be consistent. We check the gradient orientation consistency between the selected pixel and its neighborhood. Once a pixel is sampled as a seed, the region grows from that single pixel into a set of pixels with similar gradient orientations. The pixels in the rectangular region are then examined to check the fitness of the region. The orientation difference in (8) is computationally inexpensive. Moreover, it reflects the orientation difference in a better way: for example, the difference between two angles is not straightforward to compute when one angle is larger than 0 and the other is less than 0.
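As an illustration of equations (1)-(8), the following numpy sketch (our own, with the image stored as I[v, u], an assumed axis convention) computes the shared diagonal difference, the two derivatives, the gradient magnitude, and the normalized-vector orientation distance:

```python
import numpy as np

def gradients(I):
    """Eqs. (1)-(4): derivatives over 2x2 neighborhoods; the diagonal
    difference Delta_I is computed once and reused. I is float, I[v, u]."""
    dI = I[1:, 1:] - I[:-1, :-1]                 # Delta_I(u, v), Eq. (1)
    a_u = (dI + I[:-1, 1:] - I[1:, :-1]) / 2.0   # alpha_u, Eq. (2)
    a_v = (dI - I[:-1, 1:] + I[1:, :-1]) / 2.0   # alpha_v, Eq. (3)
    G = np.sqrt(a_u ** 2 + a_v ** 2)             # magnitude, Eq. (4)
    return a_u, a_v, G

def orientation_distance(g1, g2, eps=1e-12):
    """Eq. (8): distance between unit gradient vectors (a_u, a_v);
    equals 2*sin(dtheta/2), so an angle of pi/8 maps to 0.3902."""
    u1 = g1 / (np.linalg.norm(g1) + eps)
    u2 = g2 / (np.linalg.norm(g2) + eps)
    return np.linalg.norm(u1 - u2)
```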
We sample several pixels in a region and calculate their orientations. Then, we compute the median of the orientations in the pixel set. The pixel with the shortest distance to the median is considered as the seed for region growing. Pixels are checked along the direction given by the median, and pixels whose gradient orientations fall within the orientation range are added into the region. The orientation range is set to \pi/8, corresponding to a distance threshold of 0.3902 using (8). The direction of the growing rectangular region Rec can be updated by accumulating the i-th pixel's gradient orientation:

\theta_{Rec} = \arctan\!\left( \sum_i \sin\theta_i \,\Big/\, \sum_i \cos\theta_i \right) \qquad (9)

The weighted coordinate means of the rectangular region are calculated by:

c_u = \sum_i G(u_i,v_i)\, u_i \,\Big/\, \sum_i G(u_i,v_i) \qquad (10)

and:

c_v = \sum_i G(u_i,v_i)\, v_i \,\Big/\, \sum_i G(u_i,v_i) \qquad (11)

The direction of the rectangular region is estimated by calculating the eigenvector of the association matrix defined by:

D = \begin{pmatrix} d_{uu} & d_{uv} \\ d_{uv} & d_{vv} \end{pmatrix} \qquad (12)

where the elements of the matrix d_{uu}, d_{uv}, d_{vv} are calculated by:

d_{uu} = \sum_i G(u_i,v_i)\,(u_i - c_u)^2 \,\Big/\, \sum_i G(u_i,v_i) \qquad (13)

d_{uv} = \sum_i G(u_i,v_i)\,(u_i - c_u)(v_i - c_v) \,\Big/\, \sum_i G(u_i,v_i) \qquad (14)

d_{vv} = \sum_i G(u_i,v_i)\,(v_i - c_v)^2 \,\Big/\, \sum_i G(u_i,v_i) \qquad (15)

The eigenvector associated with the smallest eigenvalue of the matrix can be approximated with an iterative algorithm in a few iterations (Gastal and Oliveira, 2012). The computational complexity of this eigenvector computation is linear in the matrix dimension, whereas traditional methods are quadratic in the dimensionality of the matrix. In our implementation, we estimate eigenvectors in 1 or 2 iterations; in most cases, a single iteration is sufficient for accurate estimation. Figure 2 shows the detection results for one image.
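As an illustration of equations (10)-(15), the sketch below (ours; for the 2×2 case it uses the closed-form smallest-eigenvalue eigenvector rather than the iterative scheme of Gastal and Oliveira (2012)) computes the direction of a line-support region:

```python
import numpy as np

def rectangle_direction(u, v, G):
    """Direction of a line-support region from the gradient-weighted
    scatter matrix, Eqs. (10)-(15). u, v, G: 1-D arrays over pixels."""
    w = G / G.sum()                               # gradient-magnitude weights
    cu, cv = np.sum(w * u), np.sum(w * v)         # weighted center, Eqs. (10)-(11)
    duu = np.sum(w * (u - cu) ** 2)               # Eq. (13)
    duv = np.sum(w * (u - cu) * (v - cv))         # Eq. (14)
    dvv = np.sum(w * (v - cv) ** 2)               # Eq. (15)
    # Smallest eigenvalue of [[duu, duv], [duv, dvv]] and its eigenvector.
    lam = 0.5 * (duu + dvv - np.hypot(duu - dvv, 2.0 * duv))
    if abs(duv) > 1e-12:
        d = np.array([duv, lam - duu])
    else:                                         # already axis-aligned
        d = np.array([1.0, 0.0]) if duu < dvv else np.array([0.0, 1.0])
    return (cu, cv), d / np.linalg.norm(d)
```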

Line Segment Description and Matching
We can find the line segment correspondence for a projection of a 3D contour model. The search is performed over the line segments near the contour projection.

Fig. 2: Line segment detection results using our detector
However, this approach cannot guarantee correct matching because multiple line segments may be available in the neighborhood. An incorrect line segment can then be selected for pose estimation, which leads to tracking failure. To deal with this problem, we match line segments detected in consecutive image frames. The line segments in the previous frame are detected and described using a compact representation, and are matched with the line segments detected in the current frame.
A good line segment descriptor should be compact and efficient to compare. We propose a descriptor that compares the intensities of pixel pairs sampled in the rectangular region of a line segment. Each intensity comparison is saved as one binary digit: the digit is set to 1 when the first sampled pixel is brighter than the second one, and to 0 otherwise. Matching is performed by calculating the Hamming distance between two line segment binary description vectors, which greatly reduces the danger of incorrect matching.
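A minimal sketch of the binary descriptor and Hamming matching (ours; the pixel-pair sampling pattern and the distance threshold are assumptions, since only the brightness-comparison rule is specified above):

```python
import numpy as np

def describe(I, pairs):
    """Binary descriptor for one line segment. `pairs` is an (N, 2, 2)
    integer array of (u, v) pixel pairs sampled inside the segment's
    rectangular support region; bit = 1 if the first pixel is brighter."""
    p, q = pairs[:, 0], pairs[:, 1]
    return I[p[:, 1], p[:, 0]] > I[q[:, 1], q[:, 0]]

def hamming(d1, d2):
    """Matching score: Hamming distance between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))

def match(desc_prev, desc_curr, max_dist=48):
    """Greedy nearest-neighbour matching between consecutive frames;
    pairs above max_dist are treated as ambiguous and discarded."""
    matches = []
    for i, d in enumerate(desc_prev):
        dists = [hamming(d, c) for c in desc_curr]
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            matches.append((i, j))
    return matches
```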
The binary description is also cheaper to match than alternatives such as the Sum of Squared Differences (SSD). In addition, matching using SSD is sensitive to small pixel intensity variations, so its results tend not to be robust to noise and viewpoint changes. We estimate the relative pose by minimizing an error term defined as the sum of the distances between the projected features of the 3D model and their corresponding image features of the object.

Pose Tracking
We calibrate the camera using the method proposed by Zhang (2000). The pose is initialized using an effective method (DeMenthon and Davis, 1995; Desolneux et al., 2000). We detect line segments in the input image and find correspondences using the line segment matching method of section 3. We then linearize the motion using the Lie algebra and calculate the motion by solving a minimization problem. The flowchart of our system is shown in Fig. 3.

3D to 2D Projection
3D objects have projections in images. A camera model with intrinsic and extrinsic parameters describes the mapping from 3D to 2D. The intrinsic parameters of a camera are characterized by a 3×3 matrix A:

A = \begin{pmatrix} f_u & 0 & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (16)

where f_u and f_v are the scale parameters of the camera, and u_0 and v_0 are the coordinates of the principal point in the image plane. The extrinsic parameters are described by a 3×4 matrix consisting of a rotation matrix R and a translation vector T:

E = \left[\, R \;|\; T \,\right] \qquad (17)

The coordinates in the image are calculated by:

s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = A\, E \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \qquad (18)

Motion between consecutive frames is updated by right-multiplying the projection matrix E_{t-1} of the last frame by a Euclidean transformation:

E_t = E_{t-1} M_t \qquad (19)

M_t is a 4×4 matrix composed of a rotation R_M and a translation t_M of the motion:

M_t = \begin{pmatrix} R_M & t_M \\ 0^T & 1 \end{pmatrix} \qquad (20)

We estimate the motion matrix from each input image; the pose is then calculated using the known transformation.
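A small numpy sketch of the projection chain (16)-(19) (our illustration; E is kept as a 3×4 matrix and M_t as a 4×4 motion, as defined above):

```python
import numpy as np

def project(A, E, X):
    """Eq. (18): project homogeneous 3D points X (4xN) into pixels,
    with intrinsics A (3x3) and extrinsics E = [R | T] (3x4)."""
    x = A @ (E @ X)          # 3xN homogeneous image coordinates
    return x[:2] / x[2]      # (u, v) after dividing by the scale s

def update_pose(E_prev, M_t):
    """Eq. (19): right-multiply the previous projection by the motion."""
    return E_prev @ M_t      # (3x4) @ (4x4) -> new 3x4 extrinsics
```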

Minimization for Motion Tracking
The pose of the camera is calculated from the feature correspondences. To make the estimation more precise, we perform M-estimation to minimize the fitting errors. A point P of the 3D model is projected into the image as p = [u v]^T using (18). Given the corresponding image feature q, the fitting error is:

e = \sum_i \psi\!\left( q_i - p_i \right) \qquad (21)

where \psi(q - p) is a metric for measuring the distances. The simplest definition is the Euclidean distance between the two points:

\psi(q - p) = \left\| q - p \right\|_2 \qquad (22)

The matching of a single point pair is insufficient for estimating the transformation; instead, many correspondences are used. The transformation is calculated by solving a least squares problem:

\hat{M}_t = \arg\min_{M_t} \sum_i \left\| q_i - p_i(M_t) \right\|_2^2 \qquad (23)

To solve this problem, the derivative of the transformation matrix is needed for the minimization. The derivative should be calculated in a 6D space because the transformation is described by 3 rotation parameters and 3 translation parameters. Unfortunately, the set of transformation matrices is not closed under matrix addition: the sum of two transformation matrices is not a meaningful transformation.

Motion Estimation using Lie Algebra
The Lie group SO(3), represented by a 3×3 matrix, has 3 degrees of freedom corresponding to a rotation. The Lie group SE(3), described by a 4×4 matrix, has 6 degrees of freedom: R ∈ SO(3) and t ∈ R^3. The Lie algebra se(3) is the tangent space of SE(3) at the identity. The mapping R^6 → se(3) is:

alg(\delta) = \begin{pmatrix} \omega_\times & \varphi \\ 0^T & 0 \end{pmatrix}, \qquad \delta = (\varphi, \omega) \qquad (24)

where \varphi denotes the translation vector and \omega denotes the rotation vector; \omega_\times is the anti-symmetric matrix formed from \omega. alg(\delta) is the linear combination of the generators G_1, ..., G_6 of se(3), specified by the coefficients \delta_i:

alg(\delta) = \sum_{i=1}^{6} \delta_i G_i \qquad (25)

The motion matrix is approximated linearly by the generators:

M \approx I + \sum_{i=1}^{6} \delta_i G_i \qquad (26)

The partial derivatives with respect to the motion parameters can be obtained by:

\frac{\partial M}{\partial \delta_i} \bigg|_{\delta = 0} = G_i \qquad (27)

The derivatives of the image coordinates with respect to the motion parameters are calculated by the chain rule through the projection (18):

\frac{\partial p}{\partial \delta_i} = J(x')\, A\, E\, G_i\, P, \qquad x' = A E M P \qquad (28)

where J(x') is the Jacobian of the perspective division (u, v) = (x'_1 / x'_3,\; x'_2 / x'_3).
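The generators and the first-order motion update (24)-(27) can be written compactly; the sketch below (ours, ordering translations before rotations as in δ = (φ, ω)) is a direct transcription:

```python
import numpy as np

# Generators G_1..G_6 of se(3): three translations, then three rotations.
G = np.zeros((6, 4, 4))
G[0, 0, 3] = G[1, 1, 3] = G[2, 2, 3] = 1.0      # translation generators
G[3, 1, 2], G[3, 2, 1] = -1.0, 1.0              # rotation about x
G[4, 0, 2], G[4, 2, 0] = 1.0, -1.0              # rotation about y
G[5, 0, 1], G[5, 1, 0] = -1.0, 1.0              # rotation about z

def alg(delta):
    """Eqs. (24)-(25): linear combination of the generators, R^6 -> se(3)."""
    return np.einsum('i,ijk->jk', np.asarray(delta, float), G)

def motion(delta):
    """Eq. (26): first-order approximation M ~ I + alg(delta); the partial
    derivative of M with respect to delta_i at delta = 0 is G_i, Eq. (27)."""
    return np.eye(4) + alg(delta)
```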

Motion Optimization
The distances measured between the projection of the 3D model and the line segments are projected into the 6D subspace spanned by the transformation vectors. The solution is obtained by minimizing a robust error term in this 6D subspace.
The standard least squares algorithm considers the residual values as the only evidence for the optimization. A few large residuals penalized quadratically can drive the parameter estimate far from the real solution. Unfortunately, such large residuals are usually due to wrong positions or correspondences; they are outliers that cannot be fitted by the transformation model. This problem can be addressed by adopting a robust error function. Instead of using the simple quadratic penalty, we weight the residuals according to their values: a smaller penalty is given to residuals with large values. We use the Huber cost function:

\psi(\delta) = \begin{cases} \delta^2 / 2 & |\delta| \le b \\ b\,|\delta| - b^2/2 & |\delta| > b \end{cases} \qquad (29)

where the value b gives the range of \delta for the quadratic approximation. The function is a hybrid between the L1 and least squares cost functions.
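A short sketch of the Huber cost (29) and the corresponding weight used in iteratively reweighted least squares (the IRLS weight form is our addition, a standard companion to M-estimation):

```python
import numpy as np

def huber(r, b):
    """Eq. (29): quadratic for |r| <= b, linear beyond."""
    a = np.abs(r)
    return np.where(a <= b, 0.5 * r ** 2, b * (a - 0.5 * b))

def huber_weight(r, b):
    """IRLS weight w(r) = psi'(r) / r: down-weights large residuals."""
    a = np.maximum(np.abs(r), 1e-12)
    return np.where(a <= b, 1.0, b / a)
```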

Experimental Results
We implemented the proposed system and performed experiments in which the camera was moved along different paths. In some of the experiments, the camera was mounted on a robot arm. The end effector on which the camera is mounted has a coordinate system that differs from the camera coordinate system. The relationship between the two coordinate systems is calibrated using the projections of a calibration board in the images. A point m_e in the end effector coordinate system is mapped to an image point p_c through a chain of transformation matrices:

p_c = E_c Q_{bc} Q_{eb} m_e

where E_c is the camera projection matrix composed of the internal and external parameter matrices; Q_{bc} is the matrix describing the transformation between the camera coordinate system and the robot arm base coordinate system; and Q_{eb} is the matrix describing the transformation from the robot arm base to its end effector.
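A one-function sketch of the calibrated transformation chain (ours; the frame conventions follow the equation above):

```python
import numpy as np

def end_effector_to_pixels(E_c, Q_bc, Q_eb, m_e):
    """p_c = E_c Q_bc Q_eb m_e: map a homogeneous end-effector point m_e
    through the robot base and camera frames into the image. E_c is the
    3x4 camera projection matrix (intrinsics times extrinsics)."""
    x = E_c @ (Q_bc @ (Q_eb @ m_e))
    return x[:2] / x[2]
```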
We use a calibration board to calibrate the system. Several images of the calibration board are captured by the camera. The calibration board moves together with the end effector on which it is mounted. The camera internal parameters are calibrated using the method proposed by Zhang (2000). We detect the corner points in the images, and the correspondences between the image points and the points in 3D space are used to calibrate the two transformations T_ce and T_eb. A transformation matrix can be converted into a transformation vector, which makes the representation compact: the vector consists of 3 translation elements and 3 rotation elements.
The two transformations (T_ce and T_eb) are solved using the interior point algorithm. Noise in the images leads to drift of the transformations, so we calculate the distances between the image points and the fitted positions and discard the correspondences with large errors. Then, we run the optimization again to obtain precise transformation parameters. We obtain the ground truth using a laser tracker that measures several points on the object; the pose is calculated by fitting these points. The results of the experiments are demonstrated in Fig. 4. The tracking accuracy is given in Table 1. The measured accuracy includes the translation and rotation errors; translation errors are measured in millimeters and rotation errors in degrees. The accuracy given in Table 1 is the average error over a sequence.

Fig. 4: Motion estimation results. The object in blue indicates the estimated pose and the object in red indicates the ground truth. The estimation error has been enlarged to make the difference visible.

Conclusion
We presented a practical framework for 6D pose tracking. Our system finds applications in augmented reality and robotics. The line segment detector uses semi-global information for line segment parameter estimation and efficiently detects a sufficient number of line segments for matching. The detector is good at discarding ambiguous line segments while finding real ones in noisy situations. Line segments lay the foundation for matching and pose tracking. We describe 3D frame transformations using Lie groups and the corresponding Lie algebras, so that motion parameter estimation is performed in a linearized space, thanks to the good properties of Lie algebras.