Multi-View RGB-D Video Analysis and Fusion for 360 Degrees Unified Motion Reconstruction

Abstract: We present a new method for capturing human motion over 360 degrees by fusing multi-view RGB-D video data from Kinect sensors. Our method reconstructs unified human motion from the fused RGB-D and skeletal data over 360 degrees and creates a single unified skeletal animation. We make use of all three streams, RGB, depth and skeleton, along with the joint tracking confidence state from the Microsoft Kinect SDK, to identify the correctly oriented skeletons and merge them into a uniform measurement of human motion, resulting in a unified skeletal animation. We quantitatively validate the goodness of the unified motion using two evaluation techniques. Our method is easy to implement and provides a simple solution for measuring and reconstructing a plausible 360 degree unified human motion that would not be possible to capture with a single Kinect due to tracking failures, self-occlusions, limited field of view and subject orientation.


Introduction
The field of marker-less motion capture and 3D or free-viewpoint video has received a lot of interest in the past decade. It has a number of applications in areas such as natural user interface design, motion analysis, video surveillance and virtual reality. Traditionally, multi-view RGB camera systems have been used to capture the motion, shape and appearance of a real-world actor. Carranza et al. (2003) presented one of the pioneering works in this area by employing eight synchronized RGB video cameras to capture a real-world actor. Using the eight video streams, they developed a template-based marker-less motion capture scheme to correctly estimate the motion of the actor. This work was later extended by Theobalt et al. (2007), who measured the surface reflectance properties of the actor in addition to its motion. Afterward, de Aguiar et al. (2008) presented another template-based deformation framework to capture high-quality motion of a real-world actor. In contrast, Vlasic et al. (2008) used skeletal data to deform a template mesh to capture high-quality motion. Ahmed et al. (2008) used a shape matching approach over dynamic visual hulls to track a single mesh over the complete sequence. All of the methods described so far relied on RGB data.
Depth cameras, especially consumer-grade depth cameras, were made popular by the introduction of Kinect by Microsoft (2010). The major benefit of Kinect is its low cost, which allows it to be used as a very cheap RGB-D sensor to acquire both color and depth data at 30 frames per second (Ahmed and Khaifa, 2016). If only depth data is desired, Time-of-Flight (TOF) cameras can also be employed (Kim et al., 2008). Unlike Kinect, a TOF camera does not provide a unified solution for acquiring both depth and RGB data, which is one of the major strengths of Kinect. In addition, using Microsoft's Kinect SDK, one can also acquire real-time pose estimates, i.e., skeletal data, of a real-world actor.
Pose estimation from a single camera has been a hallmark feature of Kinect and a number of solutions have been proposed for human pose estimation using a single Kinect (Girshick et al., 2011; Ye et al., 2011; Baak et al., 2011). The real-time skeletal data from Kinect is employed in a number of applications, ranging from controlling a robot using the skeletal data to controller-free gaming experiences driven by body poses (Lun and Zhao, 2015). The Kinect SDK can provide the skeletal data of multiple actors in a standing or sitting position.
A number of methods have been proposed that only use the depth data for real-time pose estimation, using machine learning or non-linear optimization (Chen et al., 2013). Alternatively, one can use the Kinect SDK directly to get the real-time pose data. The Kinect SDK thus provides a simpler solution for pose retrieval compared to a number of other methods that are considerably more difficult to implement (Wei et al., 2012; Ye et al., 2011; Baak et al., 2011; Yasin et al., 2015; Shotton et al., 2011; Dantone et al., 2013). Due to their complexity, these methods are not as widely adopted as Kinect's pose estimation. In practice, the Kinect SDK has been widely adopted for real-time pose estimation and has been employed in applications across a number of areas (Lun and Zhao, 2015).
Kinect was developed to be used as a standalone camera in a living room, where the person is always facing the camera. Therefore, the Kinect SDK only captures the correct pose of a person as long as he or she is facing the camera in a frontal orientation (Obdrzalek et al., 2012). If the person is not facing the camera, or if body parts are occluded due to self-occlusion, the incorrect orientation or the missing depth information results in incorrect pose estimation. Additionally, due to the limited field of view of a Kinect combined with the orientation of the person, it is not possible to capture the motion of the person from all sides. Thus, a 360 degree capture of the person's motion is not possible using a single Kinect.
In order to resolve these shortcomings of pose estimation from a single Kinect, a number of methods have been proposed that employ more than one Kinect for the pose estimation. Viewing a scene from multiple Kinects provides a number of benefits: a body part that is occluded in one camera view will be visible in some other camera. Additionally, if the cameras are placed around the person, the person will be oriented towards at least one camera, which can correctly estimate the pose. On the other hand, using multiple Kinects results in the loss of depth data due to interference between the depth sensors. As shown by Ahmed (2012), this interference does not degrade the quality of a 360 degree 3D animation reconstruction, because the information missing from one depth sensor is filled in by the other sensors. In that work, Ahmed (2012) employed six synchronized Kinects to reconstruct a 360 degree 3D animation. In contrast, Berger et al. (2011) employed four Kinects for unsynchronized marker-less motion capture. Ye et al. (2013) employed three hand-held Kinects for marker-less performance capture and Caputo et al. (2012) employed multiple Kinects for hand gesture recognition. None of these methods used the real-time pose data from Kinect; rather, they estimated the pose through an optimization process, by means of silhouette-based minimization or template deformation. Even though these methods work well in practice, using the Kinect SDK for pose estimation has a number of benefits. First, the pose data is available at 30 frames per second, making it suitable for a number of real-time applications. No additional post-processing is required before using the skeletal data. In addition, the reliability of the skeletal data is good enough for a number of applications as long as the person is facing the camera and the person's pose does not result in self-occlusion of body parts (Obdrzalek et al., 2012).
If multiple Kinects are used to acquire the real-time pose data, it is not straightforward to fuse these poses together for 360 degree unified motion reconstruction. As Kinect only estimates the correct pose if the person is facing the camera, a completely incorrect pose with inverted joints is estimated for a back-facing camera. To fuse the pose data from multiple Kinects, it is important to first identify the Kinects toward which the person is oriented. In addition, even for the Kinects with the correct person orientation, the joint data should be selected so that self-occluded joints are discarded and only the pose data estimated from non-occluded joints is used. Finally, even if a joint is not occluded, a joint that is oriented more directly towards a Kinect should be preferred because, in general, it is better tracked than a joint that is not oriented towards the Kinect (Obdrzalek et al., 2012).
In this study, we propose a new method for fusing the skeleton data from multiple Kinects over 360 degrees. Our method automatically detects the correct orientation of the actor with respect to each camera and fuses the joint data based on our novel confidence score to create a unified skeletal representation at each frame. Our method uses the Microsoft Kinect SDK for acquisition and its implementation is relatively simple. The result of our method is a unified measurement of human motion in the form of a skeletal animation over 360 degrees that is free from artifacts due to occlusions or tracking failures. Our work does not estimate the pose from the depth data; rather, it presents a very simple and effective method for combining the data acquired from multiple low-cost sensors for reliable 360 degree motion capture. An algorithmic flowchart of our method can be seen in Fig. 1 and the algorithmic details in Fig. 2.
In the following sections, we present each algorithmic step in detail, starting with a discussion of the data acquisition, followed by the presentation of the unified skeletal animation reconstruction algorithm. Afterward, the results are presented and validated, followed by the conclusions.

Data Acquisition
Our acquisition system comprises four Kinects placed at 90 degrees with respect to each other. Our system is not confined to a fixed camera setup, but can also work for a hand-held acquisition, if required. We use a software-based synchronization similar to Ye et al. (2013) for the multi-view acquisition. We use the Kinect SDK to acquire the RGB, depth and skeleton data. The RGB-D streams from Kinect are low resolution (640x480) at 30 frames per second. For each frame, Kinect tracks a skeleton comprising 20 joints. One frame from our acquisition system, showing the RGB, depth and skeleton data, can be seen in Fig. 3a and 3b.
One of the benefits of using the Kinect SDK is that it circumvents the need for manual intrinsic camera calibration. The SDK provides the mapping between the RGB, depth and skeleton data. It also maps the depth and skeleton data to a unified three-space coordinate system. Thus, for every depth value the corresponding RGB value is available and, for every joint position, we know its depth value and its mapping to the RGB data. For our work, we only need the mapping between the depth and the skeleton data. The depth-to-world coordinate mapping allows us to resample the depth data into a 3D point cloud. Thus, for each frame we obtain four 3D point clouds along with their corresponding estimated skeleton data in their local coordinate systems. In addition, the Kinect SDK also provides a tracking state for the skeleton and for each joint. For the skeleton, the tracking states are: Not Tracked (nothing was tracked), Position Only (no joints were tracked, only a single skeleton position) and Tracked (joints were tracked). For the joints, the tracking states are: Not Tracked (joint data is not available), Inferred (joint data is calculated from other tracked joints) and Tracked (joint data is tracked and available).
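To make the per-camera data concrete, the following minimal Python sketch shows one way such a frame could be organized in a client application; the class and field names are our own illustration and do not correspond to the actual Kinect SDK types.

    from dataclasses import dataclass
    from enum import Enum
    import numpy as np

    class JointTrackingState(Enum):
        NOT_TRACKED = 0   # joint data is not available
        INFERRED = 1      # joint calculated from other tracked joints
        TRACKED = 2       # joint tracked and available

    @dataclass
    class JointSample:
        position: np.ndarray          # 3D position in the camera's local coordinate system
        state: JointTrackingState     # per-joint tracking state reported by the SDK

    @dataclass
    class CameraFrame:
        point_cloud: np.ndarray       # (N, 3) depth pixels resampled into 3D points
        joints: list                  # 20 JointSample entries, indexed by joint id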
The joint tracking states are an important part of the confidence score assigned to each joint for our method, as discussed in the next section. Even though our experiments use a static camera setup, our method can also work without a fixed extrinsic parameterization between the cameras for the whole sequence, in case the cameras are not static. We show that the extrinsic parameters can also be calculated dynamically using the skeleton data as explained in the next section.

Unified Skeletal Animation Reconstruction
The fusion of skeleton data from multiple Kinects poses a number of challenges. First, the skeleton data from Kinect is not usable if the actor is not facing the camera. Kinect processes the depth data under the assumption that the actor is facing the camera and returns an incorrect pose if the actor is not facing the camera, as seen in Fig. 3b (right). In the first step, for every frame we need to identify which cameras can be used for reconstructing the unified skeleton. As the depth data, or the skeleton and joint tracking states, are not helpful in finding the correct orientation of the human actor, we use a standard face detection method (Viola and Jones, 2001) over the RGB data to determine the front-facing cameras. We use two detectors, one trained for the frontal face and one for the profile face, to determine which cameras can be used for the fusion (Fig. 3a). Face detection is a standard feature provided in nearly all camera systems, ranging from mobile phones to high-end DSLRs. It is prone to failure if the actor's face is occluded and it can sometimes produce false positives. We use simple sanity checks to circumvent these issues, discussed in the Results and Validations section. Additionally, we could have used the face detection API provided with the Kinect SDK, which works robustly in practice, but we found that running it in real time adversely affected the performance of our acquisition system. In principle, as Kinect already provides the head position in depth image coordinates, the extrinsic camera parameters can be used to localize the head position in the RGB space. Using the head position, some other image processing algorithm could also be used to detect the front-facing camera.
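As an illustration of this orientation test, the sketch below runs the standard OpenCV Haar cascades (a frontal and a profile detector, in the spirit of Viola and Jones) on one camera's RGB frame; the detector parameters and the area-based sanity check are assumptions of this example, not the exact values used in our system.

    import cv2

    # Viola-Jones style cascades shipped with the opencv-python package.
    frontal = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    profile = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")

    def camera_sees_face(rgb_frame, min_area=400, max_area=40000):
        """Return True if a plausibly sized face is found in this camera's RGB frame."""
        gray = cv2.cvtColor(rgb_frame, cv2.COLOR_BGR2GRAY)
        faces = list(frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
        faces += list(profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
        # Simple sanity check: discard face rectangles with implausible areas.
        faces = [(x, y, w, h) for (x, y, w, h) in faces if min_area <= w * h <= max_area]
        return len(faces) > 0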
Once the cameras to be used are identified, we start the fusion process by assigning a confidence score to each skeleton joint of each camera. Assuming we are using C cameras and there are T frames in the sequence, the confidence score S for a joint j_t^c, where j = 1,...,20, c = 1,...,C and t = 1,...,T, is defined as:

S(j_t^c) = O(j_t^c) + D(j_t^c) + B(j_t^c)     (1)

O(j_t^c) is the occlusion score for j_t^c; it is 0 if the joint is occluded and 1 otherwise. We determine whether a joint is occluded by back-projecting its depth value to the depth image and comparing the z value of the three-space joint position from Kinect with the depth image. We cannot completely discard a joint if it is occluded, because in some cases Kinect can still track the pose even if a joint is occluded for a small number of frames.

D(j_t^c) is the temporal smoothness score which, if a joint is moving, compares its displacement d_t at frame t with the displacement d_{t-1} at frame t-1. If the joint is not moving, or if there is very little movement, it is set to 1. If d_t ≤ σ * d_{t-1} it is set to 1; if d_t > σ * d_{t-1} and d_t ≤ ρ * d_{t-1} it is 0.5; otherwise it is 0. We found this term to be very important because it penalizes sudden jerky motion of the joints in case of a tracking failure. Skeleton tracking from Kinect can fail not only because of occlusions but also due to the limitations of the underlying pose estimation algorithm. By introducing this temporal smoothness term, we try to compensate for these failures. It is to be noted that D(j_t^c) cannot compensate for jerkiness if it is present for a joint in a particular frame in all the cameras; the jerkiness is a shortcoming of the underlying pose estimation algorithm, whereas this term favors the best available joint with the least jerky motion. In this regard, the term compensates for this particular shortcoming of the underlying skeleton estimation algorithm. The parameters σ and ρ are found through experiments, as discussed in the Results and Validations section. For our method, we chose σ = 1.2 and ρ = 2.0 for the two slow sequences, while for the faster motion their values were 1.05 and 1.7 respectively.

B(j_t^c) is the bone length score. Similar to Yeung et al. (2013), we initialize all the bone lengths manually for the first frame and classify them as the ideal lengths. As each joint is associated with one or more bones, let L(j) be the sum of the lengths of all bones associated with joint j and L_ideal(j) the corresponding sum of ideal bone lengths. The normalized term B(j_t^c) is a ratio of L(j) and L_ideal(j): it is 1.0 when the two are equal and any deviation from the ideal bone length results in a smaller score. We also use the bone length variation to quantitatively validate our method, as discussed in the Results and Validations section.
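The following minimal Python sketch illustrates how these three terms could be combined into the per-joint score of Equation 1; the occlusion tolerance, the small-movement threshold and the exact normalization of the bone-length term are illustrative assumptions.

    SIGMA, RHO = 1.2, 2.0   # temporal smoothness parameters (values used for the slow sequences)

    def occlusion_score(joint_z, depth_at_joint, tol=0.05):
        # O: 1 if the joint's z agrees with the depth map (not occluded), else 0.
        return 1.0 if abs(joint_z - depth_at_joint) <= tol else 0.0

    def displacement_score(d_t, d_prev, eps=1e-3):
        # D: penalizes sudden jerky motion relative to the previous frame.
        if d_t <= eps or d_t <= SIGMA * d_prev:
            return 1.0
        if d_t <= RHO * d_prev:
            return 0.5
        return 0.0

    def bone_length_score(length_sum, ideal_sum):
        # B: 1.0 when the bone lengths around the joint match the ideal lengths
        # (one possible normalization; the ratio decreases with any deviation).
        return min(length_sum, ideal_sum) / max(length_sum, ideal_sum)

    def confidence(O, D, B):
        # Equation 1: S = O + D + B
        return O + D + B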
For all the cameras oriented towards the face of the actor, we use the three-space positions of the joints with a confidence score greater than 2 and find the least-squares solution for the transformation that maps one camera to the other. As explained earlier, given the static camera setup, this is not a required step; it is only performed to demonstrate that our method can also work with moving cameras. This dynamic extrinsic calibration is done at every frame and, if more than two cameras are used, they are all mapped to one reference camera. The results of the extrinsic calibration can be seen in Fig. 4a and 4b. In practice, we always found 12 or more joints with a confidence value greater than 2, so the linear system was never underdetermined. The confidence score in Equation 1 is one way to perform the dynamic extrinsic calibration. Using the skeletal data is a novel approach in this regard, but one can also use traditional image-processing-based methods, similar to Ahmed (2012), to achieve similar results. To reconstruct the unified skeleton, dynamic extrinsic calibration by means of Equation 1 is the first step. We then modify Equation 1 with an additional orientation term to select the best possible joints for the unified skeletal reconstruction.
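For the least-squares alignment itself, a standard rigid registration in the Kabsch style over the corresponding high-confidence joints can be used; the sketch below assumes the matched joints of two cameras are given as (N, 3) arrays.

    import numpy as np

    def rigid_transform(src, dst):
        """Least-squares rotation R and translation t such that dst ≈ R @ src + t.

        src, dst: (N, 3) arrays of corresponding joint positions (N >= 3,
        here the joints with confidence score > 2 in both cameras).
        """
        src_c = src.mean(axis=0)
        dst_c = dst.mean(axis=0)
        H = (src - src_c).T @ (dst - dst_c)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:      # guard against a reflection solution
            Vt[-1, :] *= -1
            R = Vt.T @ U.T
        t = dst_c - R @ src_c
        return R, t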
Using the extrinsic calibration, we first map the 3D point clouds and skeletons to the global world coordinate system. In the next step, we use the unified point cloud (Fig. 4a) to estimate the normal n(j_t^c) of each joint j_t^c. The normal orientation is estimated using SVD-based plane fitting on the neighboring 3D points of j_t^c in the unified point cloud. If we did not use the unified point cloud and the normal for j_t^c were estimated only from its corresponding camera's point cloud, the normal orientation would be biased towards that particular camera.
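A minimal sketch of this SVD-based plane fitting, assuming a fixed-radius neighborhood in the unified point cloud (the radius is an illustrative choice):

    import numpy as np

    def joint_normal(joint_pos, unified_cloud, radius=0.1):
        """Estimate the normal at a joint by fitting a plane to the nearby 3D points."""
        d = np.linalg.norm(unified_cloud - joint_pos, axis=1)
        neighbors = unified_cloud[d < radius]
        if len(neighbors) < 3:
            return None
        centered = neighbors - neighbors.mean(axis=0)
        # The right singular vector with the smallest singular value is the plane normal.
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        return Vt[-1]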
Before merging the skeleton data, we modify our confidence measure (Equation 1) and introduce a new orientation term N(j_t^c), the dot product of the joint normal n(j_t^c) and the viewing direction of camera c, giving the extended score:

S'(j_t^c) = O(j_t^c) + D(j_t^c) + B(j_t^c) + N(j_t^c)

N(j_t^c) is largest when j_t^c is oriented towards c and decreases as the actor rotates away from the camera. This term increases the confidence score for the joints of the front-facing camera, which is desired, as Kinect best estimates the skeleton when the actor is facing the camera. Finally, we reconstruct the unified skeleton at frame t by selecting each of the 20 joints from the camera c with the highest confidence score S'(j_t^c).
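A small sketch of the orientation term and the final per-joint selection; the sign convention of the dot product and the definition of the camera direction (a unit vector from the joint towards the camera) are assumptions of this illustration.

    import numpy as np

    def orientation_score(joint_normal, camera_dir):
        # N: dot product of the joint normal and the unit direction from joint to camera;
        # largest when the joint is oriented towards that camera.
        return float(np.dot(joint_normal, camera_dir))

    def select_unified_joint(candidates):
        """candidates: list of (S_prime, joint_position) over the front-facing cameras.
        Pick the joint from the camera with the highest extended confidence score."""
        return max(candidates, key=lambda c: c[0])[1]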

Results, Validations and Discussions
We recorded three sequences of 200 frames each. The first sequence shows a fast boxing motion, the second a normal walking motion and the third a fast rotation of the whole body. Our method was able to track all sequences successfully and the joints selected from multiple cameras capture the motion accurately. Our confidence measure ensures that joints with a wrong pose are replaced by joints from other cameras that estimate the correct pose, as can be seen in Fig. 5. More results from two of the sequences can be seen in Fig. 3c and 4c. It can be observed in the results that our method can merge the skeleton data from multiple cameras to reconstruct the unified skeletal animation. Please note that the boxing sequence is shown with only three cameras because the actor never turned around to face the fourth camera. It can be seen in the figures that, because of the faster motion, the boxing sequence has a number of tracking failures, even in the front-facing camera, but our method was able to reconstruct the correct motion by merging data from the other cameras. The walking sequence shows a complete 360 degree reconstructed unified motion.
In addition to the qualitative visual evaluation, we also perform multiple quantitative validations. In general, there is no ground-truth data available against which to compare the goodness of our method. In addition, this work does not estimate the pose directly from the depth data (Chen et al., 2013); rather, it uses the estimated pose from each camera and combines them. Therefore, we do not need to quantify the quality of the individual poses, but rather to establish whether the unified skeleton is better than the individual poses from each camera. We use two methods that compare the unified skeleton with the individual skeletons, using bone-length variation estimation and 3D point cloud overlap, to quantify the goodness of the unified skeleton:

Bone-length Variation Estimation
For the first quantitative analysis, we implement the bone-length variation estimation presented and employed by Yeung et al. (2013). Similar to their method, we initialize all the bone lengths manually for the first frame and classify them as the ideal lengths. Ideally, the bone lengths of the reconstructed skeleton at each frame should be as close as possible to the ideal lengths. Following Yeung et al. (2013), we compared the bone lengths of the unified skeleton and of the front-facing cameras at each frame. For all the sequences, we found the unified skeleton to be closest to the ideal lengths compared to the individual Kinects.
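A short sketch of this bone-length comparison, assuming the skeleton's 19 bones are given as pairs of joint indices (the bone list shown is truncated for brevity):

    import numpy as np

    # Hypothetical bone list: pairs of joint indices (19 bones for a 20-joint skeleton).
    BONES = [(0, 1), (1, 2), (2, 3)]  # ... extended to all 19 bones

    def bone_lengths(joints):
        """joints: (20, 3) array of joint positions for one skeleton at one frame."""
        return np.array([np.linalg.norm(joints[a] - joints[b]) for a, b in BONES])

    def mean_deviation_from_ideal(per_frame_joints, ideal_lengths):
        """Average absolute difference between per-frame bone lengths and the ideal lengths."""
        diffs = [np.abs(bone_lengths(j) - ideal_lengths) for j in per_frame_joints]
        return np.mean(diffs, axis=0)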

Fig. 5: Merging of three cameras (black, red and green) is shown on the right. As can be seen, the algorithm correctly selects the joints from the cameras that depict the most accurate motion.

In this study, due to space constraints, we are only showing results of the boxing sequence, because it is the most challenging sequence, with very fast motion and a number of tracking failures for all the cameras. Similar to Yeung et al. (2013), we show the statistics of bone-length variation for a number of bones of the boxing sequence in Fig. 6. Table 1 shows the absolute difference between the average bone length and the ideal bone length for the individual Kinects and the unified skeleton for the boxing sequence. As can be seen in Fig. 6 and Table 1, over the course of the sequence, the bone lengths of the unified skeleton are always closest to the ideal lengths, when compared to the individual Kinects.

Bounding-box and Skeleton Overlap Estimation
The bounding-box-based measure calculates the overlap between the skeleton and the underlying 3D point cloud. For each bone in the individual skeletons from the Kinects and in the unified skeleton, a bounding box B_i is defined at the first frame, where i = 1,...,19 is the bone index. The size of each bounding box B_i is initialized manually and remains constant throughout the sequence. The orientation of each bounding box is automatically determined from the orientation of the bone. The bounding boxes are tracked over the whole sequence using the skeletal animation. An example bounding box of a bone at an arbitrary frame can be seen in Fig. 7.
For each bounding box, the number of overlapping 3D points is calculated for each skeleton. A normalized overlap measure ξ_t is calculated for a time frame t as:

ξ_t = (Σ_{i=1,...,19} p_{i,t}) / P_t

where p_{i,t} is the number of 3D points that fall inside the bounding box B_i at frame t and P_t is the count of all the points in the complete 3D point cloud. As shown in Fig. 7(c and d), the bounding boxes from the unified skeleton completely overlap the correct regions of the merged 3D point cloud, resulting in a higher value of ξ_t. In this particular frame, the unified skeleton has on average 7.73% better quality compared to the individual cameras. For comparison, we also estimate the goodness criterion ξ_t for each individual front-facing camera. For all three sequences, we found that on average the goodness ξ_t of the unified skeleton was better than the average goodness of the individual front-facing cameras by 7% to 10%.
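The sketch below illustrates this overlap computation under the assumption that each bone's bounding box is stored as a center, a rotation whose rows are the box axes and the half-extents; the exact box representation in our implementation may differ.

    import numpy as np

    def points_in_box(points, center, axes, half_extents):
        """Count the 3D points inside an oriented bounding box.
        axes: (3, 3) rotation matrix whose rows are the box axes."""
        local = (points - center) @ axes.T
        inside = np.all(np.abs(local) <= half_extents, axis=1)
        return int(inside.sum())

    def overlap_measure(points, boxes):
        """xi_t: fraction of the merged point cloud covered by the 19 bone boxes."""
        covered = sum(points_in_box(points, c, a, h) for (c, a, h) in boxes)
        return covered / len(points)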

Discussion
We used both evaluation methods to estimate the two parameters σ and ρ of the temporal smoothness term D(j_t^c) used in Equation 1. For each sequence, we reconstructed the unified skeletons with varying parameter values, compared each bone length with the corresponding ideal length and evaluated the goodness of the skeleton by calculating ξ_t. We observed that for the sequences with slower motion the optimal values were higher, whereas for the faster motion they were lower, because sudden jerky motion is heavily penalized in the case of faster motion.
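One simple way to carry out such a parameter search is a grid sweep that re-runs the fusion for each (σ, ρ) pair and keeps the best-scoring result; the candidate values and the scoring callback in this sketch are assumptions.

    import itertools

    def tune_parameters(run_fusion, score,
                        sigmas=(1.05, 1.1, 1.2, 1.3), rhos=(1.5, 1.7, 2.0, 2.5)):
        """run_fusion(sigma, rho) -> unified skeleton sequence; score(seq) -> higher is better."""
        best = None
        for sigma, rho in itertools.product(sigmas, rhos):
            s = score(run_fusion(sigma, rho))
            if best is None or s > best[0]:
                best = (s, sigma, rho)
        return best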
In terms of computing speed, our method runs at a moderate rate and can estimate 12 frames of unified skeletons per second. Ignoring the I/O overhead, and if the extrinsic calibration is pre-established, it runs in real time at 30 frames per second. We tested the method on a 2.4 GHz quad-core i5 system with 4 GB of memory.
Our method can easily be parallelized on a cluster as each frame is processed individually.
Our method is subject to a couple of limitations. We employ face detection to find the actor's orientation with respect to the camera. Face detection works well in more than 90% of the frames, but it can fail if the face is occluded, for example in the boxing sequence. We solve this issue in a pre-processing step by analyzing the sequence: if the face is missing in a couple of frames, we look at the frames before and after the missing frames, under the assumption that the detections were missed due to occlusion. Additionally, we also use the normal of the root joint from the previous frame to determine whether the actor is still oriented towards the camera. For example, if face detection has failed but the actor was facing the camera in the previous frame, it is unlikely that the actor has rotated by 90 degrees in a single frame. Similarly, face detection can also produce false positives; for example, some parts of the surroundings can be incorrectly classified as faces. Again, we make use of the full sequence to determine the correct size and most likely position of the face. Incorrect face rectangles with very small or very large areas are immediately discarded.
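A minimal sketch of the gap-filling part of this pre-processing step, assuming per-frame boolean face-detection results for one camera and a hypothetical maximum gap length:

    def fill_short_gaps(face_found, max_gap=5):
        """face_found: list of per-frame booleans for one camera.
        Mark a short run of missed detections as front-facing if the frames
        immediately before and after the run both detected a face."""
        filled = list(face_found)
        i = 0
        while i < len(filled):
            if not filled[i]:
                j = i
                while j < len(filled) and not filled[j]:
                    j += 1
                # Only fill interior gaps that are short enough to be occlusion dropouts.
                if 0 < i and j < len(filled) and (j - i) <= max_gap:
                    for k in range(i, j):
                        filled[k] = True
                i = j
            else:
                i += 1
        return filled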
One can also see some flickering in the reconstructed sequences, where a joint switches quickly between two cameras. This is due to very similar confidence scores, which can vary with the normal orientation if both cameras see the joint clearly. The depth data from Kinect is very noisy and we do not compensate for this noise, so the normal orientation can differ slightly in each frame. Additionally, the general flickering in joint positions does not come from our algorithm; rather, it is present in the raw skeleton data from Kinect, which is not smooth over time. In the future, we want to explore smoothing the skeleton data by reconstructing the joint positions from all available cameras by means of a weighted average, or by incorporating a probabilistic model in the confidence measure.
Despite these limitations, we show that our method is able to reconstruct human motion over 360 degrees by fusing data from multiple RGB-D sensors into a plausible unified skeletal animation that would not be possible with a single Kinect.

Conclusion
We presented a method to reconstruct human motion over 360 degrees by fusing data from multiple RGB-D Kinect sensors into a unified skeletal animation. Our method merges the skeleton data directly from the Kinects by assigning a confidence score to each joint based on its tracking state, occlusion, displacement, bone length and orientation. The confidence score is then used to select the 20 best joints from the cameras towards which the actor's face is oriented; this orientation is found by means of face detection. Our method can reconstruct a unified 360 degree skeletal animation from multiple Kinects that would not be possible from a single Kinect due to occlusions and tracking failures. We also quantified the goodness of the reconstructed unified skeleton using the bone-length variation calculation and the bounding-box overlap ratio. In the future, we would like to extend the unified skeletal animation reconstruction algorithm by incorporating a probabilistic model in the confidence measure. In addition, we would also like to work on new methods to quantify the goodness of the reconstructed unified skeleton.