An Appearance based Method for Eye Gaze Tracking

Problem statement: Gaze estimation systems compute the direction of eye gaze based on observed eye movements. The need for gaze-contingent applications is the basis of the current research work. Gaze pointing systems are a substitute for existing input devices. Approach: Gaze tracking methods are either feature based or appearance based. In this study, an appearance based approach for gaze tracking is proposed based on Run Length Coding (RLC). The experiment was conducted considering transitional changes and the class-intervals in iris pixels. The image acquisition begins from the center of the screen and proceeds in the anticlockwise direction. The center of the screen was the pivot point. Results: Using RLC, a recognition rate of 95% was achieved. The image analysis in different directions determines the gaze point. The directions were determined with respect to the pivot point. Conclusion: The proposed system provides a robust, computationally inexpensive gaze tracking method using a web camera.


INTRODUCTION
Eyes provide reliable and prominent features for communication using gaze enabled interfaces. The gaze tracking system captures the intention of a person on the screen. The gaze point determines the direction in which an individual is looking, as shown in Fig. 1. Eye movements are categorized into fixations and saccades. A fixation occurs when the focus of the eye is on a particular object. The movements of the eyes from one fixation to another are known as saccades. As the human eye scans over the scene or image, the focus shifts about 25 times per second to take in the disparate parts in its field of vision. The movements and information combine to form a cohesive vision of the scene. Analysis of the fixations and saccades is important for understanding visual behavior. The major challenges are due to illumination, variability in position, faster saccades and eye blinks (Hansen and Ji, 2010). Existing eye gaze tracking systems are confined to controlled environments. The usability of the system under natural environments needs improvement (Zhu and Ji, 2004). Some methods require a strict calibration procedure prior to gaze tracking. The accuracy of a gaze tracking system depends on the size of the eye's visual field, the range of eyeball rotation, the diameter of the fovea and the radius of the eyeball.
The existing gaze tracking techniques are broadly classified into intrusive and non-intrusive. The intrusive techniques require attachments around the eye to determine the gaze. These include search coils, electrooculography, contact lenses and head mounted devices. Non-intrusive techniques use video cameras under infrared or natural light sources. The non-intrusive or video based techniques are classified into appearance based and feature based techniques. Appearance based techniques use the image contents as input and map them directly to the screen coordinates (Hansen and Ji, 2010). These methods require several significant calibration points to infer the gaze direction from the images. The analysis of the images at calibration points is important for gaze estimation. Explicit camera calibration is eliminated. In a morphable model developed by Rikert and Jones (1998), the texture for a set of prototype images is mapped to the reference image based on shape transformation. A neural network is used for training the prototypes with parameters for shape and texture of the eye region. Betke and Kawai (1999) determine gaze direction by gray scale unit images. During calibration, gray scale units are created in an elliptic pattern to form model images. The learning process uses a self-organizing map. The comparison of the pupil positions in the model image with various regions in the trial image using the correlation coefficient determines the gaze direction. The method by Tan et al. (2002) uses an appearance manifold for gaze estimation. A set of sample eye images with varying parameters and pose represents a continuous set of points called the appearance manifold in the high dimensional space. For a test image, the set of closest manifold points is determined by interpolation based on least squares criteria. In the method by Hansen et al. (2002), the shape and pixel intensity information of eye corners and pupil position is obtained by an active appearance model.
The mapping function is based on the Gaussian process interpolation method considering the mean value. Yoo et al. (2002) determined the gaze point on the screen based on the glint positions. The cross ratios in the image are mapped to the monitor screen to obtain the coordinates of the gaze point. The appearance models are used for tracking smaller eye movements compared to the size of the object. In feature based methods, gaze estimation requires a set of features as a prerequisite. The pupil corneal reflection or the pupil-glint vector is the most common feature used in feature based techniques (Morimoto and Mimica, 2005; Zhu et al., 2006; Baluja and Pomerleau, 1993). The local gaze features include pupil and limbus position, iris center, eye corner, inner eye boundary and sclera region. The global gaze features are face skin color, interpupil distance, ratio between average intensity, shapes and sizes of both pupils and orientation of the pupil ellipse with respect to face pose (Zhu and Ji, 2004; Khosravi and Safabakhsha, 2008). The gaze mapping functions determine the screen coordinates. The mapping is an analytical function, either linear or a second order polynomial. In the method by Kim and Ramakrishna (1999), the displacement of the iris center is based on linear approximation. The method by Zhu and Yang uses the eye corner-iris center vector as input for gaze angle calculation (Zhu and Yang, 2002). Interpolation is used to determine the gaze direction. Some nonlinear mapping functions use neural networks, support vector machines and radial basis functions. The gaze detection system by Kiat and Ranganath (2004) uses two Radial Basis Function Neural Networks (RBFNN) to determine the x and y coordinates of the gaze point on the screen. The pupil and glint parameters are used to train the RBFNN.
In the method by Zhu and Ji (2004), six input parameters, the pupil glint displacement along x and y direction, ratio of major to minor axes of the ellipse that fits the pupil, pupil ellipse orientation and glint image coordinates are used for training the generalized regression neural networks. An extension of the work was developed using Support Vector Regression (SVR) to determine the gaze coordinates (Zhu et al., 2006). In this study, an appearance based model is presented for eye gaze tracking. The features are the pupil and iris pixels in the captured eye images. The gaze direction is estimated based on RLC. The sequence of transitional changes and class intervals remain unique for each gaze direction.

Prologue:
Region growing: The entire image region is represented by S. Region based segmentation partitions S into n subregions, S_1, S_2, …, S_n, such that (a) the regions cover the whole image, for any adjacent regions S_i and S_j. Region growing groups pixels or subregions into larger regions based on predefined criteria. The pixel aggregation starts with a set of seed points (Abas, 2010). The seeds mark each of the objects to be segmented. Regions are iteratively grown by appending to each seed point those unallocated neighboring pixels that have similar properties. The region growing is shown in Fig. 2.
The distance between a pixel's intensity value and the region's mean, dist, is used as a measure of similarity. The pixel with the smallest difference measured this way is allocated to the respective region.
This process continues until all pixels are allocated to a region, or until the intensity distance between the region mean and the new pixel becomes larger than a certain threshold value. This value is the region's Maximum Intensity Distance (MID). The algorithm for region growing is given in Algorithm 1.
Algorithm 1: Region Growing
Let S be the image region with intensity values I and let R be the grown region.
1. The seed point, s_p, and the MID value are initialized.
2. The q neighbor locations are determined and each neighboring pixel n_p is added to the list L. For instance, if q = 4, 4 neighboring pixels are added to L.
3. The mean value of the region, m(S), is computed.
4. For each pixel n_p in L:
   The distance between the pixel's intensity value and the region's mean is computed, dist = |I(n_p) - m(S)|.
   if n_p ∉ R and dist < MID
      n_p is assigned as a region pixel r_p and added to R
      The neighbors of r_p are added to L
      The new mean value of the region is calculated
   end
end
The segmentation results are dependent on the choice of seeds. Seed point selection is based on some user criterion such as pixels in a certain gray-level range, texture, color and shape. The initial region begins at the exact location of these seeds. The MID determines the condition for region growing. The region growing for MID values ranging from 0.01 to 0.06 with a seed point is shown in Fig. 3. It is observed that the optimum value of MID for region growing needs to be selected on a trial and error basis. Even a small change in the value can prevent the complete region from being obtained.
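The region growing procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `region_grow`, the nested-list image representation and intensities scaled to [0, 1] are all assumptions made for the example. At each step the candidate closest to the region mean is added, and growth stops once that distance reaches the MID.

```python
def region_grow(image, seed, mid, connectivity=4):
    """Grow a region from `seed` (row, col) over a 2-D list of intensities.

    Hypothetical helper: `image` is assumed to hold values in [0, 1] and
    `mid` plays the role of the Maximum Intensity Distance (MID).
    """
    rows, cols = len(image), len(image[0])
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:  # 8-connected neighborhood
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]

    region = {seed}                       # R: pixels already in the region
    total = image[seed[0]][seed[1]]       # running sum used for the mean
    candidates = set()                    # L: unallocated neighboring pixels

    def add_neighbors(r, c):
        for dr, dc in offsets:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region:
                candidates.add((nr, nc))

    add_neighbors(*seed)
    while candidates:
        mean = total / len(region)
        # Pick the candidate whose intensity is closest to the region mean.
        best = min(candidates, key=lambda p: abs(image[p[0]][p[1]] - mean))
        dist = abs(image[best[0]][best[1]] - mean)
        if dist >= mid:                   # growth stops at the MID threshold
            break
        candidates.remove(best)
        region.add(best)
        total += image[best[0]][best[1]]
        add_neighbors(*best)
    return region


# Toy example: a dark 2x2 blob on a bright background.
toy = [
    [0.90, 0.90, 0.90, 0.90],
    [0.90, 0.10, 0.12, 0.90],
    [0.90, 0.11, 0.10, 0.90],
    [0.90, 0.90, 0.90, 0.90],
]
blob = region_grow(toy, seed=(1, 1), mid=0.05)
```

With the MID set to 0.05 the region covers only the four dark pixels; a much larger MID would let the region leak into the background, which illustrates why the value has to be tuned.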
Run length coding: RLC provides a compact representation of a binary image. A sequence of repeated intensity values is represented as a single value and a count. The representation is useful for images which contain runs of data. A long sequence of the same value is replaced by two values. The intensity values v_1, v_2, v_3, ..., v_n are mapped to pairs (a_1, r_1), (a_2, r_2), ..., (a_n, r_n), where a_i represents the image intensity and r_i represents the run of pixels (Gonzalez and Woods, 2005). The algorithm for RLC is given in Algorithm 2.
Algorithm 2: Run Length Coding
Let seq be the sequence of intensity values of a binary image.
k=1;
For the binary image sequence seq = {1 1 1 1 0 0 0 0 0 0 1 1 0 0 0} the following values are obtained: a = {1, 0, 1, 0}, d = {4, 10, 12, 15} and r = {4, 6, 2, 3}, where the intensity values are represented by a and the cumulative count is represented by d. The vector r represents the run length values with respect to a. The RLC algorithm applied on iris images returns the number of 1s in each row. The number of 1s corresponds to the iris pixels, i_p, in the segmented image.
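The worked example above can be reproduced with a short Python sketch. The function names `run_length_code` and `count_ones` are hypothetical and chosen for illustration; the vectors a, d and r follow the definitions given in the text.

```python
def run_length_code(seq):
    """Return (a, d, r) for a sequence of binary intensity values.

    a: intensity value of each run
    d: cumulative count at the end of each run
    r: length of each run
    """
    a, d, r = [], [], []
    for i, value in enumerate(seq):
        if a and a[-1] == value:
            r[-1] += 1          # extend the current run
            d[-1] = i + 1       # cumulative count up to this pixel
        else:
            a.append(value)     # a new run starts here
            r.append(1)
            d.append(i + 1)
    return a, d, r


def count_ones(seq):
    """Number of 1s in a row, i.e. the iris pixel count for that row."""
    a, _, r = run_length_code(seq)
    return sum(run for value, run in zip(a, r) if value == 1)


seq = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
a, d, r = run_length_code(seq)
```

For this sequence the sketch yields a = [1, 0, 1, 0], d = [4, 10, 12, 15] and r = [4, 6, 2, 3], matching the values in the text, and `count_ones(seq)` gives 6.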

Related work:
A gaze estimation algorithm based on pupil-corneal reflection and a second order polynomial calibration function was proposed by Morimoto and Mimica (2005). An average error was achieved for the entire screen. Yamazoe et al. (2008) developed a single-camera-based gaze estimation algorithm. The method consists of facial feature detection, eye model estimation and gaze estimation. An average estimation error is obtained. The method proposed by Lee and Park (2009) used a head-mounted display environment. The method used a virtual eyeball model obtained by analyzing the 3D structure of the eyeball. A head based approach developed by Kaminski and Knaan (2009) determines 3D face orientation from the two glints and the bottom point of the nose. The gaze detection involves estimating the center of the cornea. The method by Ohno et al. (2002) uses the pupil and the centroid of the Purkinje image as input to gaze detection based on an eyeball model. The model determines two parameters, the center of cornea curvature and the center of the pupil in the camera coordinate system. The method by Wang et al. (2003) estimates gaze using iris contours. Eye gaze is determined as the line joining the eyeball center and the iris center in the eye model. A calibration free method by Shih and Liu (2004) estimates gaze direction directly from the orientation of the line of sight. In the study proposed by Park (2007), the pupil center and six boundary points of the pupil are used for gaze estimation. The gaze vector is obtained as the average of six gaze vectors, where each vector is computed by the cross product of the pupil boundary points.
The video input object vobj is created to aid communication between the system and the image acquisition device using (1a), where adn is the adapter and dvid indicates the device identification. The video data for vobj is previewed using (1b) for positioning the eye. h contains information about the image such as the video resolution and the number of bands. The video resolution is 640×480. The capturing of the image is event driven. The trigger configuration and the number of frames per trigger are initialized using (1c). The directions screen, marked with degrees, is displayed on the monitor in fullscreen mode with a resolution of 1280×800, with a small preview window of size 150×150 at the bottom left of the screen, using (1d). This is shown in Fig. 4.

Algorithm 3: Capture of eye and screen images
Let N be the number of eye images to be captured.
Repeat for i = 1 to N
   The execution of vobj is initiated.
      init(vobj)
   The image acquisition from vobj is activated. An event is triggered for the capture of eye images.
      act(vobj)
   The eye images as seen in the preview are captured and buffered.
      buffer_eye(i) = cap_eye(vobj)
   The screen dimensions, width (wd) and height (hgt), are obtained to determine the rectangular coordinates.
      [wd, hgt] = scr_size(FC); rec = [0 0 wd hgt]
   The image is created by reading the pixels from the screen with respect to the coordinates given by rec. The image data is buffered to the input stream.

Fig. 5: Preprocessing stages
The center of the screen is denoted by the pivot P. It is the initial focus point and all other directions are determined with respect to this center point. The changes in the shape of the iris are observed. The system does not require exclusive calibration. With horizontal head movements parallel to the screen, the position of the iris with respect to the sclera of the eye does not change remarkably, as shown in Fig. 5. The primary position of the eye is defined anatomically by the head and eye planes. Photographic and video analyses show that the primary position of the eyes is a natural constant position in alert normal humans (Jompel and Shi, 1992). If the axis of horizontal rotation of the head and the screen plane are parallel, the algorithm is independent of the initial head position.

Preprocessing:
The eye images are captured using a web camera. The RGB eye image is shown in Fig. 5a. The captured 640×480 eye images Img(x,y) are converted to grayscale as shown in Fig. 5b. Filtering is performed using a suitable mask to extract the exact position of the eye using (2a), as shown in Fig. 5c. The mask is a block of 1s of size 6×6, 24×6 or 42×6. Each subject has a unique mask. The image is binarized and the maximum value of the x-coordinate is determined using (2b). The rectangular coordinates for the exact eye position are given by (2c). The image is cropped using (2d). The filter response determines the bounding box for the eye region as shown in Fig. The value s_p = (m_x, m_y) represents the seed point used to grow the iris region. Region growing is performed on the eye image as shown in Fig. 5e. Growing terminates when the dist value exceeds the MID using (4a) and (4b). In the proposed work, the optimum value of MID is 0.05. The reflections formed in the eye due to illumination are eliminated and smoothing is performed to define the contour of the iris region. The boundary of the iris is determined and the iris region is extracted as shown in Fig. 5f. The MID determines the condition for region growing. The segmented iris is normalized to the size 25×25 as shown in Fig. 5g.

RLC based on transitional changes:
The result of preprocessing is the iris region. The image of the iris region is resized to 50×100. The binary image is scanned row wise and each row is given as input to the RLC algorithm. The run length algorithm returns the count of the number of 1s and 0s row wise. The sequences are '1 0', '0 1 0', '0', '1' and '0 1'. Tables 1 and 2 show similar coding values for the direction pointing to P, D_1, of a subject. Tables 3 and 4 show similar run length values for direction 90°, D_5, of a subject. The coding values for D_1 and D_5 are different for a subject.
The transitional changes are considered in this range. The changeover from 1 to 0 is a negative transition represented by '-1'. Similarly, the changeover from 0 to 1 is a positive transition indicated by '+1'. The intensity values and transitional changes for direction D_1 are listed in Table 6. The transitional changes for direction D_5 are shown in Table 7. It is evident that they differ from those of D_1. Similar analysis is made for other directions. The count of positive transitions is considered for different gaze directions. The proposed system considers gazing at the pivot irrespective of the initial head position. In this experiment it is evident that all the other gaze directions can be identified with respect to the pivot using the difference in positive transition threshold values.

[Table values by image row; caption and column headers lost in extraction]
Row    Values
…      …    …    5    3
24     0    12   5    2
25     0    13   6    2
26     0    16   7    2
27     0    16   7    2
28     0    19   11   2
29     0    19   13   5
30     2    21   12   5
31     3    22   14   7
32     5    23   10   7
33     7    24   16   2
34     10   24   19   7
35     12   27   19   10
36     16   28   19   10
37     19   31   22   13
38     23   32   23   16
39     25   32   26   23
40     29   35   28   26

The significant information for shape analysis is given by the number of 1s. The shape of the iris depends on the bend of the eyelid and the variations in the iris pixels for each direction. The variations are indicated in the binarized image in terms of growth or shrinkage of the iris pixels i_p. The shape of the iris is different for different directions. The number of 1s from the RLC algorithm is different for each direction in each row. The eye images in different directions are shown in Fig. 8.
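The transition coding described above (-1 for a changeover from 1 to 0, +1 for a changeover from 0 to 1) can be illustrated with a small Python sketch. The function names are hypothetical, and how the resulting count is thresholded per direction is not specified here, so the sketch stops at counting positive transitions.

```python
def row_transitions(row):
    """List the transitions in a binary row: +1 for 0 -> 1, -1 for 1 -> 0."""
    return [curr - prev for prev, curr in zip(row, row[1:]) if curr != prev]


def count_positive_transitions(image):
    """Total number of 0 -> 1 changeovers over all rows of a binary image."""
    return sum(t == 1 for row in image for t in row_transitions(row))


row = [0, 1, 1, 0, 0, 1]
changes = row_transitions(row)          # one entry per changeover in the row
binary_image = [
    [0, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
positives = count_positive_transitions(binary_image)
```

For the single row above, the changeovers are +1, -1 and +1. A row that starts with 1 and never rises again, such as the second row of `binary_image`, contributes no positive transition.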
The count of i_p is considered. The class-interval is defined to determine the variations in i_p. The iris pixels are assigned to specific class intervals. The class intervals considered are 1-5, 6-10, 11-15, 16-20 and 21-25. The intervals are denoted by the grades A, B, C, D and E respectively. The sequence of grades forms a unique pattern for each direction. For instance, the sequence {BEEEEEEEEEDDDDDDDDDCCCCBB} represents the direction of eye gaze pointing to D_1. The eye samples pointing to the same direction acquire a similar Sequence Of Grades (SOG). The grades for directions D_1 and D_5 are given in Tables 8 and 9. The grades are shown for two samples of a subject.
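The mapping from row-wise iris pixel counts to a sequence of grades can be sketched as follows. The names `ROW_INTERVALS`, `grade` and `sequence_of_grades` are hypothetical, and the '-' placeholder for a count that falls outside every interval (for example a row with no iris pixels) is an assumption of this sketch, since the text does not say how such rows are handled.

```python
# Class intervals for row-wise iris pixel counts and their grades.
ROW_INTERVALS = [
    (1, 5, "A"),
    (6, 10, "B"),
    (11, 15, "C"),
    (16, 20, "D"),
    (21, 25, "E"),
]


def grade(count, intervals=ROW_INTERVALS):
    """Return the grade whose class interval contains `count`."""
    for low, high, letter in intervals:
        if low <= count <= high:
            return letter
    return "-"  # assumption: placeholder for counts outside all intervals


def sequence_of_grades(counts):
    """Build the Sequence Of Grades (SOG) from per-row iris pixel counts."""
    return "".join(grade(c) for c in counts)


# Illustrative counts only, not taken from the paper's tables.
example_counts = [7, 23, 18, 12, 3]
sog = sequence_of_grades(example_counts)
```

Two eye samples gazing in the same direction would be expected to produce similar strings, so comparing SOG strings gives a simple way to compare iris shapes.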
The sequence signifies the shape of the iris. In direction D_1, the iris pixels are more concentrated towards the initial few rows and decrease for the other rows. This gives a bulged appearance of the iris. D_1 denotes the center of the screen. As the gaze moves to the direction D_5, which denotes 90°, the iris shape becomes elongated and appears lean. The iris pixels are uniformly distributed for most of the rows in the eye image. Similar analysis is done for other directions. The SOG is unique for each direction. There is variation in the grades for iris pixel counts at the boundary of the class intervals, leading to different SOG for the same direction. This is highlighted in Tables 8 and 9. In order to attain equal weights for i_p, the summation of iris pixels is considered segment wise in the horizontal and vertical directions. The horizontal segment, hs, is the summation of iris pixels over every 5 rows using (5). The vertical segment, vs, corresponds to the summation of iris pixels over every 5 columns using (6), where c = 1, 5, 10, 15, 20 and j = 1, 2, 3, 4, 5.

MATERIALS AND METHODS
The Smart Infocomm digital webcam is used in the experiments for image acquisition. The USB color webcam captures 30 frames per second with a resolution of 640×480. The data format is RGB24. The focal distance is 3 cm with a 62° view angle.
The class-intervals are defined as 1-25, 26-50, 51-75, 76-100 and 101-125. The grades are assigned as A, B, C, D and E for horizontal direction and F, G, H, I and J for vertical direction. Each horizontal and vertical segment is assigned a grade.
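The segment-wise grading can be sketched in Python as below. This is an illustration under assumptions rather than the authors' code: it assumes a 25×25 binary iris image stored as nested lists, sums iris pixels over consecutive non-overlapping blocks of 5 rows (hs) and 5 columns (vs), and uses '-' for a segment total outside the intervals 1-125. All function names are hypothetical.

```python
H_GRADES = "ABCDE"   # grades for horizontal segments
V_GRADES = "FGHIJ"   # grades for vertical segments


def segment_sums(image, size=5):
    """Return (hs, vs): iris pixel totals per block of `size` rows / columns."""
    n_rows, n_cols = len(image), len(image[0])
    row_counts = [sum(row) for row in image]
    col_counts = [sum(image[r][c] for r in range(n_rows)) for c in range(n_cols)]
    hs = [sum(row_counts[i:i + size]) for i in range(0, n_rows, size)]
    vs = [sum(col_counts[j:j + size]) for j in range(0, n_cols, size)]
    return hs, vs


def segment_grade(total, letters, width=25):
    """Map a segment total to a grade using intervals 1-25, 26-50, ..., 101-125."""
    if total < 1 or total > width * len(letters):
        return "-"  # assumption: totals outside 1-125 get a placeholder
    return letters[(total - 1) // width]


def segment_sog(image):
    """Horizontal and vertical sequences of grades for a binary iris image."""
    hs, vs = segment_sums(image)
    horizontal = "".join(segment_grade(t, H_GRADES) for t in hs)
    vertical = "".join(segment_grade(t, V_GRADES) for t in vs)
    return horizontal, vertical


# Toy 25x25 image: the first five rows are all iris pixels, the rest are empty.
toy_iris = [[1] * 25 if r < 5 else [0] * 25 for r in range(25)]
hs, vs = segment_sums(toy_iris)
h_sog, v_sog = segment_sog(toy_iris)
```

Summing over blocks before grading means that a single row whose count sits on a class boundary shifts the segment total only slightly, which is the motivation given in the text for grading segments instead of individual rows.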

RESULTS AND DISCUSSION
The SOG form similar patterns for the same direction. Tables 10 and 11 show the SOG for directions D_1 and D_5 for two samples of a subject. Similar analysis has been made for all other directions. The segment-wise SOG values determine the gaze direction. A correct recognition rate of 95% was achieved.