Static Hand Gestures: Fingertips Detection Based on Segmented Images

Fingertip detection is important for recognizing static gestures in sign language. This paper presents a new method for fingertip detection, based on the YCbCr colour space and skeletonization, as a step towards gesture segmentation and recognition. The method begins by converting an image from the RGB colour space into the YCbCr colour space. The chrominance components Cb and Cr are then extracted. A thresholding technique, based on a pre-defined range of Cb and Cr values representing skin colour, is used to extract the hand from the background and to produce a binary image. Morphological processing is performed on the binary image to remove noise and unwanted pixels. The candidate fingertip positions are then calculated using a skeletonization algorithm and a tracing process. The centroid is subsequently found, and the Euclidean distances between the candidate fingertip pixels and the centroid are calculated to validate that the candidates represent actual fingertips. With the proposed method, the fingertips in hand-sign samples of the twenty-six American Sign Language (ASL) alphabet letters were detected successfully and the gestures were correctly recognized with an accuracy of 96.3% during the conducted


Introduction
For years, sign language has been one of several communication options used by hard-of-hearing or deaf people. It is a language that employs signs made by moving the hands in combination with body postures and facial expressions (Nguyen et al., 2013). Gesture language identification, with its high applicability, is likewise among the areas explored to assist deaf people's integration into the community. To detect gestures, researchers use either specialized equipment, such as gloves, or recognition techniques based on image processing with computers and cameras (Rautaray and Agrawal, 2015).
Owing to its potential use in human-computer interaction, hand gesture recognition has become a very active research topic. Human hand gestures cover a range of non-verbal communication, from simple gestures, such as pointing, to more complex ones, such as sign language. Because hand appearance is variable, accurately detecting hands in video or still images remains a challenging problem. Most methods that detect fingertips for gesture recognition consist of two steps: hand segmentation first, followed by fingertip detection (Bhuyan et al., 2012). Hand segmentation shrinks the area of interest in which the fingertips are sought, so detection becomes faster. Most hand segmentation methods, however, do not give good results under conditions such as cluttered backgrounds, fast hand motion and poor lighting, and poor hand segmentation usually invalidates the fingertip detection results (Raheja et al., 2012). Two trends emerge from the study of hand gesture recognition, corresponding to two types of data: static and dynamic gestures.
Hand shape and pose information are commonly used to describe static gestures, while hand motion is used to represent dynamic gestures (Hasan and Abdul-Kareem, 2014). The two trends are closely related, because many dynamic gesture identification approaches are built on solutions for static gesture recognition. Static gesture identification is therefore regarded as the research basis for general gesture recognition (Shen et al., 2012). Most hand segmentation research is based on skin colour detection. The main idea of this approach rests on the assumption that skin colour differs from the colours of other objects and that its distribution forms a cluster in a few specific colour spaces.
Research on skin segmentation has mainly progressed in two directions: statistical colour models and generic colour models (Bhuyan et al., 2015; Cooper et al., 2012). A generic colour model defines a fixed colour range that separates skin from non-skin pixels. In general, the fixed colour range is obtained empirically from a few collected training instances. Generic colour models alone cannot handle variations in human skin and illumination. Wenkai and Lee (2012) and Khan and Ravi (2013) used the classic YCbCr colour space for hand segmentation. The chrominance components Cb and Cr were extracted from the YCbCr colour space, and a binary hand segmentation result was obtained by taking the CbCr chrominance values as the parameters of an elliptic equation model.
This paper is organized as follows. Section 2 presents the state of the art in fingertip detection. Section 3 introduces the colour spaces used for skin detection. Section 4 describes the proposed methodology. Section 5 presents the experimental results. Section 6 concludes the paper and outlines future work.

State of the Art
Hand gestures are now counted among the non-verbal methods of communication that can be employed in Human-Computer Interaction (HCI). This method has also been extended to Human-Robot Interaction (HRI). Hand images can be obtained easily with a camera attached to the computer or the robot. A few processing steps, however, are needed to recognize the meaning of a hand gesture. Before the recognition process, hand features such as the centre of the hand, the finger directions and the fingertip positions should be extracted. Ibraheem et al. (2013) proposed a method for hand segmentation based on skin colour in the YCbCr colour space. To detect the skin, YCbCr was used for the extraction of the luminance component of the colour space. However, the method is unreliable because it is difficult to build a skin colour model that represents people of different ethnicities and applies under various lighting conditions. Vieriu et al. (2013) proposed a simple system built on an artificial neural network and skin colour segmentation. However, the features extracted by these methods do not carry sufficient hand shape information, which is a key attribute affecting recognition accuracy. The level of generalization of these methods is therefore not high. Dung and Mizukawa (2009) suggested finding the fingertip positions from a set of new feature pixels. Their method focuses on finding the fingertip positions in an image frame that includes both the face and the hands. These feature pixels, called distance-based feature pixels, are extracted with a simple technique based on the distance transformation of the connected-component-labelled image, so that they can be found quickly and easily in the binary skin colour detection image.
Thus, hand gesture recognition and hand tracking can be performed with more ease and accuracy. Furthermore, none of the above-mentioned processing steps requires complex calculations, which allows a hand gesture recognition or hand tracking system to achieve real-time performance.
Joshi and Vig (2015) presented a computer vision system that recognizes six American Sign Language (ASL) hand gestures using combined hand geometry and colour information. The overall accuracy reported for this system is 89%.

Colour Spaces used for Skin Detection
Skin detection is the process of finding skin-coloured pixels and regions in a video or an image. Typically, it is used as a pre-processing step for finding image regions that potentially contain human hands and faces. To detect skin in an image, the pixels are usually transformed into an appropriate colour space. A skin classifier then labels each pixel as skin or non-skin (Shakti, 2013).

RGB Colour Space
The RGB colour space is a type of mixed colour space. It describes colour through the primary colours red, green and blue. In RGB, the chrominance and luminance components are mixed, which is why RGB is not a preferred colour space for analysis and segmentation algorithms (Al Tairi and Saripan, 2014).

YCbCr Colour Space
The YCbCr colour space is typically used for skin colour segmentation. The Y component represents luminance. The Cb component reflects the difference between the blue component and a reference value (Qiu-yu et al., 2015). The Cr component, on the other hand, reflects the difference between the red component and a reference value. The conventional skin detection algorithm does not consider the luminance component Y, since the chrominance components Cb and Cr are nearly independent of it (Ayala-Ramirez et al., 2011). In a few works, therefore, only the Cb and Cr components were used to classify each pixel as skin or non-skin. In this research study, the YCbCr colour space is chosen instead of RGB for the following reasons:
• YCbCr has the same structure theory as a person's vision (Kaur and Kranthi, 2012)
• The transformation between the YCbCr and RGB colour spaces is linear, so the computation is simple. It is also characteristically better clustered than the other colour models
• The colour model is extensively used in television as well as in other vision devices
• Under various illumination conditions, YCbCr has only a small overlap between skin and non-skin colours
• The RGB colour space is highly sensitive to intensity differences
The formulas for the conversion from RGB to YCbCr and vice versa are given in "Equation 1 to 3". Figure 1 shows the YCbCr colour space.
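As an illustration of the linear RGB-to-YCbCr transformation mentioned above, the following minimal Python sketch converts one 8-bit pixel. It assumes the ITU-R BT.601 studio-range coefficients (Y in 16 to 235, Cb and Cr in 16 to 240); the paper's own Equations 1 to 3 may use a different variant, and the function name `rgb_to_ycbcr` is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr.

    Assumption: ITU-R BT.601 studio-range coefficients, not necessarily
    the exact form of the paper's Equations 1 to 3.
    """
    # Scale the 8-bit inputs to [0, 1] before applying the linear transform.
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    y = 16.0 + 65.481 * rn + 128.553 * gn + 24.966 * bn
    cb = 128.0 - 37.797 * rn - 74.203 * gn + 112.0 * bn
    cr = 128.0 + 112.0 * rn - 93.786 * gn - 18.214 * bn
    return y, cb, cr
```

Because each output is a weighted sum of R, G and B plus an offset, neutral greys always map to Cb = Cr = 128, which is why chrominance can be thresholded largely independently of brightness.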

Proposed Methodology
The four key stages of this research study's proposed methodology are presented below. These consist of Image Acquisition, Segmentation, Morphological Operation and Fingertips Detection. Figure 2 illustrates the flowchart of our approach.

Image Acquisition
The first step is the acquisition of hand gesture images of the American Sign Language alphabet. These images, which may include varying illumination and cluttered background conditions, were captured with a camera. Normal static hand gesture images were also photographed, in addition to the static ASL signs. A Logitech USB 2.0 webcam was used at a resolution of 320×240 pixels. In general, the subsequent stages are less complicated if the image is shot against a simple background and the contrast between the background and the hand is high. ASL images were also gathered from a few available online data sets 'http://www.lifeprint.com/asl101/pageslayout/handshapes.htm'. Only some of the symbols in American Sign Language (ASL), i.e., B, C, D, F, I, K, R, V, W and Claw, are used, as illustrated in Fig. 3.

Segmentation
In the proposed method, the images are converted from the RGB to the YCbCr colour space. When the background is not complex, simple thresholding is used for hand segmentation. When the background is complex and the hand gestures involve people of various ethnicities, the free-form skin colour model described in (Dawod et al., 2010a) is used for hand segmentation instead. The thresholding condition is shown in "Equation 4"; it produces a binary image in which the white pixels (value of 1) represent the hand and the black pixels (value of 0) represent the background. The thresholding algorithm is stated below:

    for i = 1:m
        for j = 1:n
            % skin pixel if 77 < Cb < 127 and 133 < Cr < 180
            if (Cb(i,j) > 77 && Cb(i,j) < 127 && Cr(i,j) > 133 && Cr(i,j) < 180)
                ImageWithHand(i,j) = 1;
            else
                ImageWithHand(i,j) = 0;
            end
        end
    end

In the algorithm above, Cb and Cr are the m-by-n matrices holding the Cb and Cr pixel values of the image in the YCbCr colour space, and ImageWithHand is the matrix holding the resulting binary image. The threshold values were determined experimentally, based on the values that gave good results. Figure 4 illustrates the result of the segmented hand.
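The thresholding step can also be sketched in Python. This is a minimal illustration, assuming the intended condition bounds Cb between 77 and 127 and Cr between 133 and 180 (strict inequalities), with the Cb and Cr planes given as nested lists of equal shape. The function name `segment_skin` and its parameters are hypothetical.

```python
def segment_skin(cb, cr, cb_range=(77, 127), cr_range=(133, 180)):
    """Return a binary mask: 1 where (Cb, Cr) lies inside the skin range, else 0.

    Assumption: the ranges are the fixed thresholds quoted in the text.
    """
    mask = []
    for cb_row, cr_row in zip(cb, cr):
        mask.append([
            1 if (cb_range[0] < cb_v < cb_range[1]
                  and cr_range[0] < cr_v < cr_range[1]) else 0
            for cb_v, cr_v in zip(cb_row, cr_row)
        ])
    return mask
```

Keeping the ranges as parameters makes it easy to try other empirically chosen thresholds without touching the loop.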

Morphological Operation
Image pre-processing is the term for image operations at the lowest level of abstraction that are intended to enhance the image data. This is done by removing or minimizing undesired noise or by enhancing image features that are important for further processing and analysis. Image pre-processing does not increase the information content of the image.
The segmented binary hand image undergoes morphological operations, namely 'Opening' followed by 'Dilation'. The Opening operation eliminates small connected components, breaks narrow isthmuses and smooths the contours. The Dilation operation employs a disk-shaped structuring element to shape the contour and increase the size of the skin region, which helps the subsequent region labelling operation. In this research study, Dilation and Erosion are applied with 3×3 structuring elements, as shown in "Equation 5 and 6". This removes small background objects and separates the hand from the background, as illustrated in Fig. 5.
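Let $N \subseteq \mathbb{Z}^2$ be a set of points $(a, b)$ in 2D space and let $M \subseteq \mathbb{Z}^2$ be the structuring element (SE) that controls the structure of the morphological operations. Then, for any binary image $I$ where $N \subseteq I$, dilation and erosion can be defined as in (5) and (6), respectively (Hasan and Mishra, 2012).
Dilation operator: given the sets $N$ and $M$ in $\mathbb{Z}^2$, the dilation of $N$ by $M$ is defined by:

$$N \oplus M = \{\, x \mid (\hat{M})_x \cap N \neq \emptyset \,\} \qquad (5)$$

where $\hat{M}$ is the reflection of $M$ about its origin and $(\cdot)_x$ denotes translation by $x$.
Erosion operator: given the sets $N$ and $M$ in $\mathbb{Z}^2$, the erosion of $N$ by the structuring element $M$ is defined by:

$$N \ominus M = \{\, x \mid (M)_x \subseteq N \,\} \qquad (6)$$

That is, the erosion of $N$ by $M$ is the set of all points $x$ such that $M$, translated by $x$, is contained in $N$.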
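To make the set definitions of dilation and erosion concrete, the sketch below implements both on a binary image stored as nested lists, and composes them into an opening. It is a minimal pure-Python illustration, not the paper's MATLAB implementation. Pixels outside the image are treated as background, and the names `erode`, `dilate`, `opening` and `SE_3X3` are hypothetical.

```python
def erode(img, se_offsets):
    """Binary erosion: keep a pixel only if the SE, translated there, fits inside the foreground."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = int(all(
                0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc] == 1
                for dr, dc in se_offsets
            ))
    return out


def dilate(img, se_offsets):
    """Binary dilation: set a pixel if the reflected SE, translated there, overlaps the foreground."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = int(any(
                0 <= r - dr < h and 0 <= c - dc < w and img[r - dr][c - dc] == 1
                for dr, dc in se_offsets
            ))
    return out


def opening(img, se_offsets):
    """Opening is erosion followed by dilation with the same SE."""
    return dilate(erode(img, se_offsets), se_offsets)


# A 3x3 square structuring element expressed as (row, column) offsets from its centre.
SE_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
```

With this structuring element, an isolated noise pixel disappears under opening while a solid 3×3 (or larger) region survives, which is the noise-removal behaviour the text relies on.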
Even after the morphological operations, a few background objects with a colour similar to skin may still remain. Blob detection is performed to solve this problem. A unique label is assigned to each blob to separate it from the other blobs, and all the pixels within a spatially connected blob of 1s receive the same label. The blob with the highest number of pixels is selected to represent the hand, and all other blobs are removed.
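The blob selection described above can be sketched as a connected-component labelling pass followed by keeping the largest component. The sketch assumes 8-connectivity, which the text does not specify, and the function name `largest_blob` is hypothetical.

```python
from collections import deque


def largest_blob(mask):
    """Label 8-connected blobs of 1s and return a mask that keeps only the largest one."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] != 1 or labels[r][c] != 0:
                continue
            # Start a new blob and flood-fill it breadth-first.
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            size = 0
            while queue:
                cur_r, cur_c = queue.popleft()
                size += 1
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cur_r + dr, cur_c + dc
                        if (0 <= nr < h and 0 <= nc < w
                                and mask[nr][nc] == 1 and labels[nr][nc] == 0):
                            labels[nr][nc] = next_label
                            queue.append((nr, nc))
            sizes[next_label] = size
    if not sizes:
        return [[0] * w for _ in range(h)]
    keep = max(sizes, key=sizes.get)
    return [[1 if labels[r][c] == keep else 0 for c in range(w)] for r in range(h)]
```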

Fingertips Detection
In the skeletonization process, which is used to identify the fingers in the hand image, the hand blob is 'thinned' until a skeleton one pixel wide remains. Additional details on the algorithm can be found in (Dawod et al., 2010b; Omkar and Monisha, 2011).
The fingertip skeleton image may contain a few noise pixels, which can be removed using regional connectivity analysis. Figure 6c illustrates the resulting fingertip skeleton. As Figure 7 shows, a fingertip position must be one of the endpoints of the finger skeleton.
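Since candidate fingertips are endpoints of the skeleton, one simple way to locate them is to look for skeleton pixels that have exactly one neighbouring skeleton pixel. The sketch below assumes a one-pixel-wide skeleton stored as nested lists and 8-connectivity. It illustrates the idea rather than the tracing process used in the paper, and the name `skeleton_endpoints` is hypothetical.

```python
def skeleton_endpoints(skel):
    """Return (row, col) of skeleton pixels that have exactly one 8-connected neighbour."""
    h, w = len(skel), len(skel[0])
    endpoints = []
    for r in range(h):
        for c in range(w):
            if skel[r][c] != 1:
                continue
            # Count foreground pixels among the up-to-eight in-bounds neighbours.
            neighbours = sum(
                skel[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0) and 0 <= r + dr < h and 0 <= c + dc < w
            )
            if neighbours == 1:
                endpoints.append((r, c))
    return endpoints
```

For a Y-shaped skeleton this returns the two branch tips and the stem end, while the junction and interior pixels are skipped.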
The following steps are performed to determine whether an endpoint represents the wrist or a fingertip. First, we find the convex hull of the endpoints. We then compute the distances between neighbouring endpoints. The endpoint with the largest distance is tagged as the wrist, while the remaining endpoints are tagged as fingertips.
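One possible reading of the wrist test above is sketched below. It is an interpretation, not the paper's exact procedure: the endpoints are ordered by angle around their mean (which matches convex-hull order only when every endpoint lies on the hull), and the endpoint whose summed distance to its two neighbours in that order is largest is tagged as the wrist. Points are (x, y) tuples and the function name `split_wrist_and_fingertips` is hypothetical.

```python
import math


def split_wrist_and_fingertips(endpoints):
    """Tag the endpoint farthest from its two angular neighbours as the wrist.

    Assumption: all endpoints lie on their convex hull, so sorting them by
    angle around their mean gives the hull order.
    """
    if len(endpoints) < 3:
        raise ValueError("need at least three endpoints")
    cx = sum(p[0] for p in endpoints) / len(endpoints)
    cy = sum(p[1] for p in endpoints) / len(endpoints)
    ordered = sorted(endpoints, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    n = len(ordered)

    def neighbour_gap(i):
        # Sum of distances to the previous and next endpoint in circular order.
        px, py = ordered[i]
        ax, ay = ordered[i - 1]
        bx, by = ordered[(i + 1) % n]
        return math.hypot(px - ax, py - ay) + math.hypot(px - bx, py - by)

    wrist_index = max(range(n), key=neighbour_gap)
    wrist = ordered[wrist_index]
    fingertips = [p for i, p in enumerate(ordered) if i != wrist_index]
    return wrist, fingertips
```

The design relies on fingertips being close to one another while the wrist sits far from all of them, which holds for open-hand poses but may fail when only one or two fingers are extended.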

Experimental Results
This experiment was conducted on a PC with an Intel Core i7, 2450M, 8G, 1.97GHz processor, running MATLAB under the Windows 8.1 operating system. The dataset used for the experiment is composed of single-handed gesture images with different illumination conditions and backgrounds. The image resolution is 320×240.
Given the segmented hand region, the centroid or centre $(X_C, Y_C)$ of the hand image is calculated. In this research work, the centroid of the hand silhouette is calculated with the following formulas:

$$X_C = \frac{1}{M}\sum_{i=1}^{M} x_i \qquad (7)$$

$$Y_C = \frac{1}{M}\sum_{i=1}^{M} y_i \qquad (8)$$

where $x_i$ and $y_i$ are the x and y coordinates of the i-th pixel in the hand region and $M$ is the total number of pixels in the hand region. Lastly, we compute the ∆x and ∆y of the centroid of the hand image.
Figure 8 shows the detected fingertips for hand images with a complex background. The results show that the fingertips in all images with a complex background are detected successfully when all the fingers are open. Figure 9 shows the detected fingertips for the ASL alphabet datasets, together with the estimated centroid. The method was able to detect the fingertips in all the images. However, it has one limitation: at present, it cannot detect fingertips that are closed or bent into the palm. This is a possible direction for our future work. Figure 10 shows the binary hand images together with the histogram of detected fingertips. The number of bars represents the number of fingers detected in each image.
Once the endpoints of the fingers are obtained, the distances between them are calculated. Each sign is recognized by a rule-based system that uses the number of fingertips, the number of branches and the distance between neighbouring fingertips, with limits specified for each sign. The efficiency of the proposed method is then calculated for three different backgrounds, as shown in Table 1, and the processing time is recorded. Each of these calculations covers all the American Sign Language alphabet signs from A to Z. A correct result (CN) is obtained whenever a sign is correctly recognized, while an incorrect result (IN) is obtained whenever the sign is unidentifiable or recognized incorrectly. The efficiency is computed from these counts. The letters J and Z are dynamic gestures; thus, only their final positions could be captured.
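The centroid of Equations 7 and 8 and the Euclidean distances from candidate fingertips to it can be sketched as follows. This is a minimal illustration, assuming the hand region is a binary mask stored as nested lists and that points are (x, y) tuples with x as the column index. The function names `centroid` and `distances_to_centroid` are hypothetical.

```python
import math


def centroid(mask):
    """Mean (x, y) position of all foreground pixels in a binary mask."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value == 1:
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("mask has no foreground pixels")
    m = len(xs)  # total number of pixels in the hand region
    return sum(xs) / m, sum(ys) / m


def distances_to_centroid(points, centre):
    """Euclidean distance from each candidate point to the centroid."""
    return [math.hypot(px - centre[0], py - centre[1]) for px, py in points]
```

A candidate whose distance to the centroid is small relative to the others is unlikely to be an extended fingertip, which is the intuition behind using these distances for validation. Any concrete cut-off would have to be chosen empirically.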
The efficiency of this approach is given in "Equation 9". The different backgrounds for ASL are shown in Fig. 11. The confusion matrix for binary classification has four categories, where TP stands for true positive, FN for false negative, FP for false positive and TN for true negative counts. For a ROC curve, one plots the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis. For a precision-recall curve, one plots Recall on the x-axis and Precision on the y-axis. Recall is the same as the TPR, whereas Precision is the Positive Predictive Value. Accuracy (ACC) is calculated as shown in "Equation 10 to 12".
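The measures named above can be sketched from the four confusion-matrix counts using their standard definitions. The `efficiency` helper assumes that efficiency is the percentage of correctly recognized signs, CN / (CN + IN) × 100, which is a guess at the form of Equation 9. All function names are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures for binary classification."""
    return {
        "recall": _ratio(tp, tp + fn),        # true positive rate (TPR)
        "precision": _ratio(tp, tp + fp),     # positive predictive value
        "fpr": _ratio(fp, fp + tn),           # false positive rate
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


def efficiency(correct, incorrect):
    """Assumed form of the efficiency measure: percentage of correct results."""
    return 100.0 * _ratio(correct, correct + incorrect)
```

For example, with purely illustrative counts, `efficiency(3, 1)` gives 75.0, that is, three correct recognitions out of four attempts.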

Conclusion and Future Work
This paper presented a new method that detects fingertip positions based on skeletonization and a few morphological operations. The proposed method is effective in recognizing static hand gestures, and the work can be applied further to sign language recognition and to the HCI domain.
To obtain highly accurate hand detection, this work employed the free-form skin colour model described in (Dawod et al., 2010a), along with morphological filtering. In the conducted experiments, the proposed method succeeded in detecting all the open fingertips in images with different complex backgrounds, as well as in the images from the ASL alphabet dataset, which show various gestures.
As future work, the method will be improved to recognize dynamic hand gestures and closed fingertips. Moreover, a possible extension of this work can be applied to the performance of static gesture recognition.

Acknowledgement
I would like to take this opportunity to thank Dr. Md. Jan and Dr. Junaidi for their constant support and guidance in carrying out this research, and for giving me the chance to be the corresponding author. It is a big responsibility for me, and we have been a very effective research team.