Elderly People Fall Detection System Using Skeleton Tracking and Recognition

Corresponding Author: P.W.C. Prasad, School of Computing and Mathematics, Charles Sturt University, Sydney, Australia. Email: cwithana@studygroup.com

Abstract: Fall detection systems use a variety of technologies such as sensors, wearable devices, color cameras and thermal cameras. With the use of the Microsoft Kinect camera for non-gaming purposes, depth images have started being utilized for fall detection. Various authors have attempted to use Kinect for fall detection in combination with techniques such as ellipse analysis and bounding box analysis. However, most of these attempts fail to differentiate between human subjects and inanimate objects, and they fail to identify the person who has fallen because they assume that only one person needs to be monitored. This paper proposes a new system based on the depth images captured by Microsoft Kinect, skeleton tracking and bounding box analysis. The key novelty of this system is that it wraps the moving object in a bounding box and determines the change in size of that box by analyzing the motion over time, in order to distinguish human from non-human moving objects. The system stores the joint measurements of known people in a database and compares the joint measurements of the detected person with these values to identify the person. The proposed solution provides a significantly higher accuracy rate than the current best solution, especially when the person is carrying an object, sweeping the floor, dropping an object or picking an object up from the floor.


Introduction
The rapid development of Information and Communication Technology (ICT) has had a number of implications for modern life, and one of its major advancements has been in elderly care. It is bringing disruptive changes to the way society functions. As society changes, so does its structure, and elderly people increasingly have to live alone at home. This situation necessitates the development of a supportive environment, built on modern technologies, that helps them live an independent life. One of the main concerns with independent living is safety: elderly people are often weak and fragile and are therefore prone to injuring themselves in falls (Vaidehi et al., 2001).
Falls constitute a major threat to elderly safety: one out of every three people aged 65 or above in the US experiences a fall every year (Chua et al., 2015). Studies have shown that falls are the leading cause of death for elderly people aged 79 years or more (Chaudhuri et al., 2014). A fall can lead to a serious injury and may affect or completely impair the mobility of the elderly (Deandrea et al., 2010). In addition, if a fall is not detected in time, no help can be provided and this may further aggravate the situation. Fall detection systems try to address this problem by using the wide range of technologies available today. Some use accelerometers integrated into wearable devices; others use acoustic sensors or sensors integrated into floors and carpets. Computer vision has also been used in fall detection by employing color, thermal and infrared cameras. The existing solutions suffer from three main problems: accuracy, i.e., detecting only real fall incidents and differentiating fall-like events from actual falls; the lack of information about who has fallen, as there can be more than one person in the house; and the lack of differentiation between humans, moving objects and pets, as they all display motion in one form or another.
Another issue with computer vision is identity protection (Willems et al., 2009). The proposed solution attempts to address the above limitations by using depth images, which help in identity protection as they do not reveal any facial characteristics and are not affected by lighting conditions or surrounding temperature. The depth camera works independently of light and illumination. Each pixel value of the image represents the distance between the object (here, the human body) and the camera, and it carries no information about the facial expression of the elderly person. This property helps address privacy and identity protection, which is one of the biggest concerns with vision-based technology.
In the proposed scheme, the joint measurements of known people are stored in a database and are used to identify a person from data measured using depth information, skeleton tracking, bounding box analysis and the change in size of the bounding box over time. Wrapping the person in a bounding box and determining the change in size of that box provides more accurate results by distinguishing human from non-human moving objects. Thus, the proposed scheme outperforms the state-of-the-art method with a significantly higher accuracy, especially in cases such as carrying an object, sweeping the floor, dropping an object and picking an object up from the floor.

Related Work
There is a large amount of literature available today for elderly fall detection. Various techniques are used to detect falls and they can be broadly classified as follows:

Non-Vision Fall Detection Techniques
These systems generally use technologies such as wearable devices, acoustic sensors or ambient sensors. One fall detection system was designed and integrated into a wristwatch, which uses two sensors to measure acceleration along the x, y and z axes (Degen et al., 2005). Designing wearable devices for fall detection is a cumbersome task, as the device has to be designed in a way that it occupies minimal space and causes the least discomfort to the elderly. Moreover, the device must have a low power requirement so that it does not have to be charged frequently. The complexity of the algorithm also remains a major issue. Another attempt was made to capture the sound energy produced during a fall and use it for detection. Based on the energy and the height of the signal, the signal is evaluated by feature extraction and classified using a k-Nearest Neighbor (k-NN) classifier trained to distinguish falls from non-falls.
However, such a system requires extensive training of the classifier, as a variety of sound signals prevail around the house and there may be more than one person or pets in the house (Popescu et al., 2008). Yin and Bruckner (2011) proposed a system that monitors activity using motion detectors and builds a daily activity model of the elderly person. Any unusual activity is sensed, as it does not match the daily activity model. Deviations from the routine would thus be detected, but the application of this approach to fall detection remains to be studied.

Vision-Based Fall Detection Techniques
Non-vision based techniques need careful consideration in their design to integrate them into the lives of the elderly and generally have complex algorithms for fall detection (Diraco et al., 2010). Another important issue is that if the elderly people forget to wear the device or forget to charge it, the purpose of the system is defeated (Li et al., 2009). On the other hand, vision systems do not suffer from such problems as the person is not required to wear any device and thus, these systems tend to be less intrusive.
A vision system can be implemented that detects falls based on shape changes of the human silhouette. A person is represented using three points, and the ratio of the distances between these points is calculated. A fall is detected by monitoring changes in this ratio, as the ratio remains 1 while performing most routine activities. However, there are cases in which the changes in the orientation of the upper and lower body are not prominent and go unnoticed (Chua et al., 2015). Nasution and Emmanuel (2007) used a combination of a k-NN classifier and evidence accumulation to achieve a better output. After obtaining the projection histograms of the foreground, posture classification is done using a combination of these two techniques, which involves storing temporal information about the last standing posture. The problematic area is bending postures, especially towards the camera. To get around this, they proposed a bounding angle test (Mastorakis and Makris, 2014). The system works well, but bending and lying directions can be mistaken and shadows pose a problem. Moreover, with the integration of evidence accumulation, the response time suffers and is delayed by an average of 8 frames. Foroughi et al. (2008) proposed a method that improves the feature extraction in order to reduce the false alarm rate. They used three features (approximated ellipses, projection histograms and temporal changes in head position) to construct the feature vector. Sixsmith et al. proposed a system that uses infrared thermal imaging to monitor the elderly, tracks them by wrapping them in an ellipse and uses a neural network to classify falls. Nevertheless, such a system is dependent on the ambient temperature, and there is not always a clear temperature difference between the subject and the background (Wang et al., 2014; Lee and Lee, 2013).
Lee et al. (2013) worked on a fall detection system using a Microsoft Kinect camera, which provides three data streams: infrared, color and depth. The first function of this system checks whether the user's center of mass is within a certain threshold from the ground, and the second function checks the velocity of the center of mass. Then, a posture recognition algorithm is applied, which classifies an event as a fall if the ankles are below the hip center and either one of the legs is folded or one knee is higher than the other (Chen and Ming Liu, 2007; Zerrouki et al., 2016). Using depth images for fall detection solves a major issue with computer vision, i.e., identity protection (Chua et al., 2015; Feng et al., 2014). Depth images not only help in identity protection but are also independent of lighting conditions and ambient temperature, unlike color and thermal images.

Current Best Solution (State-of-Art Method)
Mastorakis and Makris (2014) proposed a fall detection system that uses the Microsoft Kinect camera but analyses the captured depth images in a different way. It uses a 3D bounding box to analyze the motion of the elderly person and detect a fall (Fig. 1). The system successfully detects a wide range of falls (forward, backward and sideways) and is able to differentiate non-fall or fall-like events from real falls. This solution uses the depth images provided by Kinect. Using depth images helps protect the identity of the elderly, as they do not reveal any facial characteristics. In order to extract the moving parts from the image, background subtraction and motion tracking are applied. Then, a bounding box is used to analyze the motion of the extracted part. This bounding box is expressed in world coordinates, which makes it independent of the floor coordinates of the scene. The algorithm takes the dimensions of the bounding box as input and computes the first derivatives of the change in those dimensions. In the event of a fall, the height of the box decreases and the width-depth dimension increases. The derivatives are compared to threshold values to differentiate fall-like events from actual falls. If a fall is detected, the subject is monitored further, as a fall ends in a state of inactivity (Fig. 2).
This approach is relatively robust and simple to implement. However, it suffers from some drawbacks as well. The method used for background subtraction and motion tracking does not differentiate between humans and moving objects. Thus, it extracts any moving part from the image, whether it is a human or an object that a person is moving from one place to another. If the moving object happens to fall, the system raises a false alarm. Also, the threshold values are estimated by performing a random search on the dataset. In Fig. 1, the black boxes are the process steps of the current solution, the blue boxes indicate its good features and the red boxes show its limitations.
Fluckiger proposed the use of skeletons for recognizing humans and differentiating them from objects. The system detects skeletal joints from the depth images, takes joint measurements and compares them with the values stored in a database. This helps in recognizing the various people in the house, so more accurate information can be given about who has fallen (Fluckiger, 2012).

Proposed Solution
Remote monitoring capabilities, such as fall risk evaluation, fall detection and detection of small changes from predefined baselines in health situations and motor functional abilities of the elderly, will address the challenges associated with self-dependent living (Fauzia et al., 2016). The proposed system takes its input from the Microsoft Kinect camera and is based on 3D bounding box analysis and skeletal tracking and recognition, as illustrated in Fig. 3. Kinect provides three data streams: color, depth and infrared. One of the most important development tools for Kinect is OpenNI. It is open-source and developed by Primesense (Tao et al., 2005). With the help of OpenNI, it is possible to work with the depth component of the images, and this can be utilized in various ways such as human tracking and gesture recognition. Human skeletal joints are located from depth images using OpenNI's skeletal tracker.
The system continuously monitors for skeletons. When a skeleton is detected, the system tries to recognize it by taking joint measurements and comparing them to the values stored in the database. The total difference delta between the two sets of measurements is compared with epsilon, where:
s1 = measurement of detected person
s2 = measurement of known person
delta = total difference
epsilon = maximum allowable difference
The captured skeleton provides a range of joint information, of which 9 distances are considered here: left forearm, right forearm, left upper arm, right upper arm, left shin, right shin, left thigh, right thigh and shoulder width. The Kinect measurements differed by 5% to 13% from the actual lengths when the distance between the person and the Kinect was between 1 and 2.2 m. The average total length of the 9 segments for the given test subjects was 327 cm. The proposed value of epsilon is 15 cm, which is roughly 5% of this average (15/327 ≈ 4.6%). The value of epsilon naturally depends on the human body; here it is derived from the average length measured for the observed people. This threshold may be varied based on the observed people, but we find that it works in our experiment. Several articles (Ghoreishi and Allaire, 2017; Xie et al., 2018; Imani and Braga-Neto, 2017) describe effective ways to estimate parameters and state in dynamic environments. Although they are effective, we did not consider them in the proposed method, as the hardcoded threshold already provides a better result than the state-of-the-art method. In future work, we will consider them to determine the optimum threshold for the proposed scheme.
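As an illustration of how the nine segment lengths listed above could be derived from tracked joints, the following is a minimal Python sketch. The joint names and the `segment_lengths` helper are hypothetical and chosen for readability; they are not OpenNI identifiers, and the paper does not specify how the lengths are computed beyond naming the segments. The sketch assumes each joint is given as a 3D point in a common unit.

```python
import math

# Hypothetical joint names (not OpenNI identifiers). Each segment is the
# straight-line distance between two tracked joints.
SEGMENTS = {
    "left_forearm": ("left_elbow", "left_hand"),
    "right_forearm": ("right_elbow", "right_hand"),
    "left_upper_arm": ("left_shoulder", "left_elbow"),
    "right_upper_arm": ("right_shoulder", "right_elbow"),
    "left_shin": ("left_knee", "left_foot"),
    "right_shin": ("right_knee", "right_foot"),
    "left_thigh": ("left_hip", "left_knee"),
    "right_thigh": ("right_hip", "right_knee"),
    "shoulder_width": ("left_shoulder", "right_shoulder"),
}


def segment_lengths(joints):
    """Return the 9 segment lengths, in the same unit as the joint coordinates.

    `joints` maps a joint name to an (x, y, z) tuple.
    """
    return {
        name: math.dist(joints[a], joints[b])
        for name, (a, b) in SEGMENTS.items()
    }
```

The resulting dictionary is the kind of per-person record that could be stored in the database and later compared against a newly detected skeleton.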
To handle the situation in which a different person enters the scene and the previous person leaves, constant authentication would be needed. At this stage we do not consider it, in order to reduce complexity. The skeleton-based authentication can nevertheless identify a different person and act accordingly. Here, i runs over the 9 joint measurements of a person; the sum of the differences is calculated and compared with the epsilon value. Epsilon is the maximum allowable difference; it can be stored in the database with the measurements of a known person and may be varied per person depending on body structure. Typically, the system uses a 15 cm threshold for accepting a skeleton. That is, if the segments of the detected person and the segments of the known person in the database differ in total by less than 15 cm, the system considers the authentication successful. If a person is authenticated in one frame, he is authenticated for all consecutive frames. This avoids repeating the comparison on every frame, which reduces computation time and improves performance.
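The matching rule described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: delta is assumed to be the sum of absolute per-segment differences, the closest match is chosen when several known people fall under epsilon, and the names `total_difference`, `identify` and `TrackedPerson` are hypothetical. Measurements are assumed to be in centimetres.

```python
EPSILON_CM = 15.0  # threshold quoted in the text


def total_difference(detected, known):
    """Assumed form of delta: sum of absolute per-segment differences."""
    return sum(abs(detected[k] - known[k]) for k in known)


def identify(detected, database, epsilon=EPSILON_CM):
    """Return the name of the closest known person with delta < epsilon, else None.

    Picking the closest match is an assumption of this sketch.
    """
    best_name, best_delta = None, None
    for name, known in database.items():
        delta = total_difference(detected, known)
        if delta < epsilon and (best_delta is None or delta < best_delta):
            best_name, best_delta = name, delta
    return best_name


class TrackedPerson:
    """Keeps an identity once it is established, so later frames skip matching."""

    def __init__(self):
        self.identity = None

    def update(self, detected, database):
        if self.identity is None:
            self.identity = identify(detected, database)
        return self.identity
```

Caching the identity in `TrackedPerson` mirrors the statement that a person authenticated in one frame stays authenticated in the following frames; it also means a change of person would go unnoticed until tracking is reset, which matches the limitation acknowledged in the text.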
A 3D bounding box then wraps the skeleton. Using a 3D bounding box helps to analyze shape changes in all three dimensions and gives more accurate results in terms of the range of falls detected (Fauzia et al., 2016). It is created using OpenNI's DepthMetaData process. This bounding box is expressed in world coordinates, which makes it independent of the floor coordinates of the scene. The height, width and depth are estimated by calculating the difference between the maximum and minimum points along the three axes X, Y and Z. The system continuously monitors the bounding box. During a fall, the height of the bounding box decreases and the composite width-depth dimension increases. The system also computes the velocity of change of these dimensions over a sequence of N frames and compares it with a threshold. During a fall, the velocity exceeds the threshold, whereas during normal routine activities such as bending or sitting it remains below the threshold. If a fall is detected, the system monitors another N sequential frames to detect inactivity, as a fall is followed by a period of inactivity. Monitoring only for skeletons eliminates moving objects irrespective of their size and places no limit on the size of the bounding box, so the elderly person is monitored whatever posture he or she adopts. In addition, skeletal data can be used to recognize the various people in the house and differentiate them from one another.
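The bounding box analysis described above can be sketched as follows in Python. This is a minimal sketch under several assumptions that the text does not fix: Y is taken as the vertical axis, the composite width-depth dimension is taken as the horizontal diagonal sqrt(W^2 + D^2), velocity is a finite difference over n frames, and inactivity means the height varies by at most a tolerance over the following m frames. The function names and the parameters `t_vh`, `t_vwd` and `t_ih` are hypothetical.

```python
import math


def bounding_box(points):
    """Return (width, height, depth) of the axis-aligned box around 3D points.

    Assumes X = width, Y = height (vertical), Z = depth.
    """
    xs, ys, zs = zip(*points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def width_depth(width, depth):
    """Composite width-depth dimension, assumed to be the horizontal diagonal."""
    return math.hypot(width, depth)


def detect_fall(boxes, fps, n, m, t_vh, t_vwd, t_ih):
    """Return the frame index at which a fall is confirmed, or None.

    boxes: per-frame (width, height, depth) tuples.
    n: frames over which velocity is measured; m: inactivity frames (m >= 1).
    t_vh, t_vwd: velocity thresholds for height and width-depth.
    t_ih: maximum height variation allowed during inactivity.
    """
    heights = [h for _, h, _ in boxes]
    wds = [width_depth(w, d) for w, _, d in boxes]
    dt = n / fps  # time spanned by n frames, in seconds
    for t in range(n, len(boxes) - m):
        v_h = (heights[t] - heights[t - n]) / dt
        v_wd = (wds[t] - wds[t - n]) / dt
        # Fall candidate: height shrinks fast while width-depth grows fast.
        if v_h <= -t_vh and v_wd >= t_vwd:
            after = heights[t + 1 : t + 1 + m]
            # Confirm only if the box height stays nearly constant afterwards.
            if max(after) - min(after) <= t_ih:
                return t
    return None
```

A slow action such as bending keeps both velocities under their thresholds, so it never becomes a fall candidate, while an object dropped by the person is never boxed at all because only skeleton points are passed to `bounding_box`.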

Implementation of the Proposed System
The tests were carried out using Matlab R2016b on 11 sample image streams. The samples were collected from a standard Google dataset. The images have a resolution of 320×240. OpenNI's skeletal tracker was used to extract the skeleton from the depth stream. Since the proposed system focuses on improving human detection, the system was tested with moving objects in the scene as well. The system was tested on a variety of routine activities such as sitting on a sofa, walking around the house, bending and squatting, to check its robustness with respect to the normal day-to-day activities that take place in a house. Complex activities such as carrying an object from one place to another, picking an object up from the floor and sweeping the floor were also tested to see how the proposed system performs when another moving object comes into the scene. The system was subjected to a number of fall sequences as well, such as falling towards the camera, falling sideways and falling from a chair. The results of the test are illustrated below in Table 1.
Step 1: SET
  threshold T_vH, T_vWD
  threshold T_iH
  counter a, b = 0
  counter falling = N frames
  counter inactivity = M frames
  boolean activityDetection = false
  boolean inactivityDetection = false
Step 2

Experimental Validation
The test dataset consists of three parts: simple routine activities, complex routine activities and falls. The proposed system accurately detects humans while they go around the house doing routine activities such as bending and squatting. Only the human skeleton is extracted and monitored. The system does fairly well, as these activities involve significant changes to the human shape. It also shows good results when complex activities such as sweeping the floor or moving or lifting an object are involved, because it monitors for skeletons only and therefore takes only the human subject into account. In contrast, the previously existing system tracks both the subject and the object in the scene, as it relies on background subtraction and motion tracking and fails to differentiate humans from inanimate objects. Finally, the proposed system performs well when the human is subject to a fall. A variety of falls were tested and all were detected. The system was able to distinguish between real falls and fall-like incidents such as picking an object up from the floor. We selected challenging scenarios to compare the performance of the proposed scheme and the current best solution. In future research, we would like to add more challenging scenes and evaluate the proposed scheme on them.
As can be seen from Fig. 4, 5 and Table 2, the proposed solution (FDSTR) provides a significantly higher accuracy rate than the current best solution (FDK). The reported accuracy is the average over all events. For simple activities such as "walking around the house" and "sitting on sofa", both solutions perform fairly well. FDK based experiments give good results, as these are simple activities and do not involve complex or confusing motion. The introduction of skeleton tracking does not add any overhead to the proposed system and the processing time remains the same. For complex human postures such as "bending" and "squatting", both systems again perform fairly well by detecting human subjects only. This is due to the fact that only one moving candidate is extracted from the background by the FDK based system. The processing time of the FDSTR system for "bending" increases by 0.1 ms, whereas for "squatting" it stays the same.
Now, we introduce complex routine activities such as "sweeping the floor" and "carrying an object". FDK based experiments fail to differentiate between the two moving candidates and detect both of them. This is due to the fact that FDK relies on background subtraction and motion tracking, and the inanimate object changes its position relative to the background frame. FDSTR based experiments prove to be highly accurate, detecting only the human subject and achieving an accuracy of 100%. Although the processing time increases by 0.1 ms, this increase is insignificant compared to the improvement in accuracy from 50% to 100%.
Next, we subject the proposed system to complex fall-like activities such as "picking an object" and "dropping an object". Results of FDK based experiments demonstrate that it is again not able to differentiate between the extracted parts of interest.
It monitors both the subject and the object being handled and gives a false alarm when the object is dropped, as a fall is detected by the system even though the human has not fallen. FDSTR based experiments give 100% accurate results, as the system monitors only the human subject and does not take into account the object being handled. The processing time of the FDSTR system for "picking" stays the same, but for "dropping" it increases by a small factor. The accuracy increases from 50% to 100% as we move from the FDK based system to FDSTR in this case. Finally, both the current and proposed systems were subjected to fall sequences: "falling sideways", "falling from the chair" and "falling towards the camera". FDK based experiments and FDSTR based experiments show equivalent results and are 100% accurate in detecting all three types of falls. The proposed solution also has the feature of recognizing people whose joint measurements have been previously stored. The proposed system takes the joint measurements of a person as a sample, and then images of the same person doing a variety of different activities are fed to the system. FDSTR based experiments are able to successfully authenticate the person based on his joint measurements. The subject performs a range of activities such as "bending", "squatting", "sweeping", "falling" and "sitting", and the system is able to authenticate the person in these activities.

Conclusion
In this study, a new model for elderly fall detection, called fall detection with skeleton tracking and recognition, is proposed. It involves monitoring for human skeletons, recognizing them in order to provide the identity of the person who has fallen and then tracking the person with a bounding box in order to detect a fall. Because it monitors only for skeletons, this solution overcomes the limitation of tracking moving objects in the scene and raising false alarms. The proposed system gives highly accurate results when complex activities are performed and moving objects are brought into the scene. It provides a method to exploit the features of the inexpensive Kinect camera and hence can be used on a wide scale. The model does not increase the processing time by a significant factor and helps solve the problem of identity protection of the elderly. However, the algorithm can only differentiate between people who have at least a 7-13% difference in their skeletons, so people with closely matching skeletons can fool the system. Another drawback can be pets in the scene: the system would need to be prevented from monitoring non-human skeletons.