Hand Gesture Recognition for Human-Computer Interaction

Problem statement: With the development of ubiquitous computing, current user interaction approaches with keyboard, mouse and pen are not sufficient. Due to the limitations of these devices, the usable command set is also limited. The hands can be used directly as an input device to provide natural interaction. Approach: In this study, a Gaussian Mixture Model (GMM) was used to extract the hand from the video sequence. Extreme points were extracted from the segmented hand using star skeletonization and recognition was performed using the distance signature. Results: The proposed method was tested on a dataset captured in a closed environment, with the assumption that the user is within the Field Of View (FOV). The study was performed on 5 different datasets in varying lighting conditions. Conclusion: This study proposed a real-time vision system for hand gesture based computer interaction to control an event such as navigation of slides in a PowerPoint presentation.


INTRODUCTION
Human gestures have long been an important means of communication, adding emphasis to voice messages or even forming a complete message by themselves. Such gestures could be used to improve the human-machine interface and to control a wide variety of devices remotely. A vision-based framework can be developed to allow users to interact with computers through human gestures. This study focuses on the recognition of such gestures, specifically hand gestures. Hand gesture recognition generally involves several stages: video acquisition, background subtraction, feature extraction and gesture recognition. The rationale of background subtraction is to detect moving objects from the difference between the current frame and a reference frame, often called the background image or background model. Wren et al. (1997) proposed to model the background independently at each pixel location. The model is based on fitting a Gaussian probability density function (pdf) to the last few values of each pixel. Lo and Velastin (2001) proposed to use the median value of the last 'n' frames as the background model. Cucchiara et al. (2003) argued that such a median value provides an adequate background model even if the subsequent frames are subsampled with respect to the original frame rate by a factor of 10. The main disadvantage of a median-based approach is that its computation requires a buffer holding the recent pixel values. Stauffer and Grimson (1999) proposed the Gaussian Mixture Model (GMM), in which the scene background is modeled and pixels are classified as object or background by computing posterior probabilities. The advantage of using a GMM is that it provides multiple background models to cope with multiple background objects. Features are then extracted from the foreground objects.
Skin color based features can be extracted from the foreground objects as in Jones and Rehg (1999), but this approach lacks robustness to varying illumination conditions and requires an exhaustive training phase. Extreme points of the foreground object, typically the hand, can be used to best describe the gesture. Skeletonization is used to extract the extreme points as it provides a mechanism for controlling scale sensitivity. In Sánchez-Nielsen et al. (2004), gestures are recognized by the Hausdorff distance measure, but it is too sensitive to the shape of the hand gesture. The proposed method employs a Gaussian Mixture Model to segment the hand region. Star skeletonization is used to extract the extreme points of the hand region. Gestures are recognized based on the distance signature.

MATERIALS AND METHODS
This study proposes a method to automatically recognize hand gestures, which could be used to control an event such as a PowerPoint presentation. The proposed method has three stages, viz. a Gaussian Mixture Model to detect the hand, star skeletonization for feature extraction and the distance signature for hand gesture recognition. The overall block diagram of the work is given in Fig. 1.
where, 'T' denotes the transpose operation. Here the mean $\mu_m$ and covariance $C_m$ parameters are encapsulated into a parameter vector $\theta_m = (\mu_m, C_m)$. For a mixture of $M$ components, the parameters $\theta_m$ and the weights $\alpha_m$ are concatenated as $\Theta = (\alpha_1, \alpha_2, \ldots, \alpha_M, \theta_1, \theta_2, \ldots, \theta_M)$. Using $\Theta$, Eq. 1 can be rewritten as Eq. 4. If the component from which 'x' originated were known, it would be feasible to determine the parameters $\Theta$, and vice versa. Since neither the component labels nor the parameters are known, direct estimation is difficult. The EM algorithm is incorporated to overcome this difficulty through the concept of missing data.
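For orientation, the mixture density described by these parameters can be written in the standard textbook form below. This is a sketch of the usual Gaussian mixture formulation, not a reproduction of the numbered equations: the symbol $M$ for the number of components and $d$ for the dimension of $x$ are assumptions.

```latex
p(x \mid \Theta) = \sum_{m=1}^{M} \alpha_m \, p(x \mid \theta_m),
\qquad \sum_{m=1}^{M} \alpha_m = 1, \quad \alpha_m \ge 0

p(x \mid \theta_m) = \frac{1}{(2\pi)^{d/2} \, |C_m|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_m)^{T} C_m^{-1} (x - \mu_m) \right)
```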
The EM algorithm: The Expectation Maximization (EM) algorithm (Dempster et al., 1977) is a widely used class of iterative algorithms for Maximum Likelihood (ML) or Maximum a Posteriori (MAP) estimation in problems with missing data. Given a set of samples $X = (x_1, x_2, \ldots, x_K)$, the complete data set $Z = (X, Y)$ consists of the sample set $X$ and a set $Y$ of variables indicating from which component of the mixture each sample came. The estimation of the parameters of Gaussian mixtures with the EM algorithm is discussed in Zhang et al. (2003).
The EM algorithm consists of an E-step and an M-step. Suppose that $\Theta^{(t)}$ denotes the estimate of $\Theta$ obtained after the $t$-th iteration of the algorithm. Then, at the $(t+1)$-th iteration, the E-step computes the expected complete data log-likelihood function given by Eq. 5, where $P(m \mid x_k; \Theta^{(t)})$ is a posterior probability computed as in Eq. 6. The M-step finds the $(t+1)$-th estimate $\Theta^{(t+1)}$ of $\Theta$ by maximizing this function through Eq. 7-9. The two steps are iterated, and the optimal parameter values are obtained once convergence is achieved. The pixel 'x' is then assigned to the corresponding component using the optimal weight, mean and covariance. The foreground object extracted by the Gaussian Mixture Model is passed to star skeletonization for feature extraction.
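The E-step and M-step described above can be illustrated with a minimal sketch for a one-dimensional mixture (for example, grey-level values of a single pixel over time). It uses the standard EM updates for Gaussian mixtures; the function name `em_gmm_1d`, the quantile-based initialisation, the variance floor and the stopping tolerance are assumptions of this sketch, not details taken from the study.

```python
import numpy as np

def em_gmm_1d(x, n_components=2, n_iter=200, tol=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    x = np.asarray(x, dtype=float)
    # Initialisation (assumption of this sketch): equal weights, means at
    # evenly spaced quantiles of the data, shared sample variance.
    alpha = np.full(n_components, 1.0 / n_components)
    mu = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    var = np.full(n_components, x.var() + 1e-6)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior P(m | x_k; Theta^(t)) for every sample and component.
        diff = x[:, None] - mu                          # shape (K, M)
        dens = np.exp(-0.5 * diff ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        weighted = alpha * dens
        total = weighted.sum(axis=1, keepdims=True) + 1e-300
        post = weighted / total
        # M-step: re-estimate weights, means and variances from the posteriors.
        nm = post.sum(axis=0) + 1e-12                   # soft counts per component
        alpha = nm / nm.sum()
        mu = (post * x[:, None]).sum(axis=0) / nm
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nm + 1e-6
        # Log-likelihood under the parameters used in this E-step;
        # stop once it no longer changes appreciably.
        ll = float(np.log(total).sum())
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return alpha, mu, var
```

In a background-subtraction setting, a pixel value would then be assigned to the component with the highest posterior, and components regarded as background would be separated from the foreground; that decision rule is outside the scope of this sketch.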
Star skeletonization: The star skeleton, a simple but robust technique, extracts the feature points from the foreground object. The features consist of several vectors, which are the distances from the extremities of the object contour to its centroid. The basis of the star skeleton is to connect the extremities of the contour with its centroid. To find the extremities, the distance from each boundary point to the centroid is calculated through boundary tracking in a clockwise or counter-clockwise order. In the resulting distance function, the extremities are located at local maxima. The distance function is smoothed by a low pass filter for noise reduction. Consequently, the final extremities are detected by finding local maxima in the smoothed distance function.

Boundary extraction:
The first pre-processing step is morphological dilation followed by erosion to clean up anomalies in the targets. This removes any small holes in the object and smooths any interlacing anomalies. This closing operation is performed on the binarized image, i.e., the detected hand (A) is dilated and then eroded using the structuring element B, as given in Eq. 10. The centroid of the extracted boundary is then computed, where:
$(x_c, y_c)$ = The average boundary pixel position
$N_b$ = The number of boundary pixels
$(x_i, y_i)$ = The $i$-th point on the boundary
To find the extrema points in the object, the distances between the centroid and the boundary points are calculated using Eq. 13. The distance from boundary to centroid, $d_i$, gives information about the extremal points of the object. From the distance plot, the extrema points are taken as skeleton points.
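The quantities described in this step have standard forms, sketched below. The sketch assumes that the closing uses the usual dilation ($\oplus$) and erosion ($\ominus$) operators and that the boundary-to-centroid distance is Euclidean; the original equation numbering is not reproduced.

```latex
A \bullet B = (A \oplus B) \ominus B

x_c = \frac{1}{N_b} \sum_{i=1}^{N_b} x_i,
\qquad
y_c = \frac{1}{N_b} \sum_{i=1}^{N_b} y_i

d_i = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}
```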

Skeleton extraction:
The distance between the boundary points and the centroid is calculated and plotted. The distance plot of an object contour is noisy, so the noise is removed by smoothing in the frequency domain. The Fourier transform of the measured distance is computed as given in Eq. 14 and is smoothed by a low pass filter, where $L$ is the size of the distance vector $d(x)$. The low pass filter in the frequency domain is represented as in Eq. 15. Local maxima of $d_{smooth}$ are taken as extrema points, and the star skeleton is constructed by connecting them to the object centroid $(x_c, y_c)$. Local maxima are detected by finding zero-crossings of the difference function given in Eq. 18.
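The skeleton extraction steps can be sketched as follows, under stated assumptions: the low pass filter is taken to be an ideal (brick-wall) filter that keeps only the lowest frequency terms, the difference function is taken as $d_{smooth}(x) - d_{smooth}(x-1)$ evaluated circularly, and the function name `star_skeleton` and the default `cutoff` are hypothetical choices for illustration.

```python
import numpy as np

def star_skeleton(boundary, cutoff=8):
    """Return (centroid, smoothed distance signature, indices of extremal points).

    boundary : (N, 2) array of contour points in tracking order.
    cutoff   : highest DFT frequency index kept (hypothetical default).
    """
    boundary = np.asarray(boundary, dtype=float)
    # Centroid as the average boundary pixel position.
    centroid = boundary.mean(axis=0)
    # Distance from every boundary point to the centroid.
    d = np.linalg.norm(boundary - centroid, axis=1)
    # Low-pass filtering in the frequency domain: zero every DFT
    # coefficient above the cutoff, then transform back.
    spectrum = np.fft.rfft(d)
    spectrum[cutoff + 1:] = 0.0
    d_smooth = np.fft.irfft(spectrum, n=d.size)
    # Difference function, taken circularly because the contour is closed.
    delta = d_smooth - np.roll(d_smooth, 1)
    # A local maximum sits where the difference changes from positive
    # to non-positive (a zero-crossing of the difference function).
    peaks = np.where((delta > 0) & (np.roll(delta, -1) <= 0))[0]
    return centroid, d_smooth, peaks
```

Connecting `boundary[peaks]` to `centroid` gives the star skeleton. A smaller `cutoff` smooths more aggressively and suppresses small protrusions, while a larger value preserves finer detail such as individual fingers.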

RESULTS AND DISCUSSION
The dataset for the proposed study was acquired using a webcam and the method was simulated in Matlab 7.0. The open and closed fists are used to represent navigation to the next slide and the previous slide respectively. These gestures, shown in Fig. 2 and 3, are used as the vocabulary for human computer interaction. The Gaussian Mixture Model is applied to the input video to extract the foreground. The input frames are shown in Fig. 4a and 5a. The algorithm is trained to segment the object that exhibits drastic movements. Fig. 4b and 5b show the segmented hand image in the input video, depicting the gestures to move to the next slide and the previous slide respectively. The extracted moving object is given to the star skeletonization algorithm. Morphological operations are applied to extract the contour of the segmented hand region, as shown in Fig. 4c and 5c. The plot of the distance between the centroid and the boundary of the object is shown in Fig. 4d and 5d. The smoothed distance plot is shown in Fig. 4e and 5e, and its local maxima give the extrema points. By connecting these extrema points with the centroid, the skeleton of the object is obtained, as shown in Fig. 4f and 5f. The difference between the global maximum and minimum of the distance signature is used to recognize the gestures. The proposed algorithm has been tested on various datasets and the results are depicted in Fig. 6-9.
As an alternative for comparison, gesture recognition was also implemented by extracting Multi-scale Fourier Shape Descriptors (Direkoglu and Nixon, 2008) at various scales, σ1 = 15, σ2 = 11, σ3 = 8, σ4 = 5, σ5 = 3, σ6 = 1, on segmented hand images as in Fig. 10. However, this approach needs storage of pre-defined hand gesture templates, which increases the memory requirement.
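The recognition rule based on the distance signature can be sketched as below. The study states only that the difference between the global maximum and minimum of the signature is used; normalising that difference by the maximum, the threshold value and the function name `classify_gesture` are assumptions made here for illustration.

```python
import numpy as np

def classify_gesture(d_smooth, threshold=0.5):
    """Map a distance signature to 'next' (open hand) or 'previous' (closed fist)."""
    d = np.asarray(d_smooth, dtype=float)
    if d.size == 0 or d.max() <= 0:
        raise ValueError("distance signature must contain positive values")
    # Spread between global maximum and minimum, divided by the maximum so
    # that the value does not depend on the apparent size of the hand
    # (this normalisation is an assumption of the sketch).
    spread = (d.max() - d.min()) / d.max()
    # Extended fingers give a large spread; a closed fist is nearly round.
    return "next" if spread > threshold else "previous"
```

For example, a signature such as `[4, 10, 4, 9, 4, 10]` has a normalised spread of 0.6 and would map to the next-slide gesture, whereas `[5, 6, 5, 6]` would map to the previous-slide gesture. A suitable threshold would have to be chosen from recorded data.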

CONCLUSION
A hand gesture based recognition algorithm is proposed to control the PowerPoint application. In the proposed method, the foreground is extracted through a Gaussian Mixture Model. The extracted object is passed to the star skeletonization process to detect the extreme points, and gestures are recognized from the distance signature. The method was tested on various datasets, and the results indicate that it compares favorably with the template-based alternative: it is robust to scale variance and does not require any predefined templates for recognition.