Gender-Based Customer Counting System Using Computer Vision for Retail Stores

Corresponding Author: Intan Sari Areni, Department of Electrical Engineering, Hasanuddin University, Gowa, Indonesia. Email: intan@unhas.ac.id

Abstract: The modern retail business is developing ever faster, increasing the level of competition among retailers. Retailers are changing their business strategies to acquire new customers, maintain customer loyalty and improve customer service. One indicator of the market performance of a retail shop is the number of visitors, broken down by gender. The number of visitors in a retail shop can be observed by installing a CCTV camera. In existing shop systems, CCTV is used only for monitoring activities in the shop. The data from the CCTV camera can therefore also be used to count the number of visitors by gender automatically using computer vision. This study designed a system to count the number of female visitors from video data. The data acquisition process involved 89 people, 41 women and 48 men. The data are stored in .AVI video file format with a resolution of 720×1280 pixels. The system is divided into three main stages: face detection using the Viola-Jones method, feature extraction using Gabor Filter 2D and classification using the Support Vector Machine (SVM) method. The results showed that the system can count the number of female and non-female visitors with an accuracy rate of 96.52%. System performance could be improved further by using other feature extraction and classification methods.


Introduction
The development of modern technology has made daily life more practical, faster and more economical. Along with these technologies, modern retail has become more prominent because spending patterns have shifted: people increasingly favor modern retail outlets such as minimarkets and supermarkets, where they find practicality and convenience in shopping (Adji and Subagio, 2013).
The continued growth of the modern retail business has intensified competition among retailers. This competition forces retailers to change their business strategies in order to increase market share by retaining existing customers and attracting new ones.
Market performance can be known by observing the customer's perspective, such as market share in various locations, customer comfort, brand image and the number of customers who come every day. By knowing these aspects, the company will be able to maximize the customer's perspective (Zaroni, 2015).
Counting the number of people who visit a store or a shopping center gives essential information about retail performance. This information can be used to find the number of visitors, the busiest days and the peak hours. It can also show how shopper counts change throughout the year; what effect promotions, advertising, competition, weather, school holidays, road works, etc., have on visitor counts; and what effect relocating merchandise and concessions has. Moreover, it gives managers the ability to make better business decisions by identifying and analyzing trends in systematic statistics, optimizing sales performance, improving sales efficiency and assigning the right number of employees to give the best customer service (Mavenstat, 2019).

One of the major aspects that can influence a customer's decision is gender. Men and women approach shopping with different motives, perspectives and considerations, and women tend to have a higher hedonic shopping orientation than men (Wahyuddin et al., 2017). In contrast to men, who shop out of necessity, women shop to purchase both essential and discretionary goods, to relax and to socialize. In addition, women are reported to be more emotional and more easily attracted by advertisements than men (Imam, 2013). Therefore, gender identification is one of the marketing techniques that can be used to influence the buying behavior of consumers, especially women.
Gender identification is based on the differences between female and male physical appearance. The most significant differences are often seen in the facial features. Female faces that are considered anatomically attractive tend to be heart-shaped, with rounded corners from the hairline extending to the chin area. Men tend to have square-shaped faces with more complicated hairlines and slightly square jawlines. Men also tend to have a longer lower face to accommodate a long upper lip and long chin. Moreover, the facial profile also influences the development of feminine and masculine characteristics. Women's profiles tend to be flatter, while men's foreheads tend to tilt back and their faces tend to bulge forward. Women's lips and the tip of the nose tend to point up, whereas men have more distance between the base of the nose and the top of the lips. Women's eyebrows are more curved and sit higher above the eyes, while men's eyebrows sit slightly above the brow bone and are straight, without arches (Brucelf et al., 2019).
Cameras have gradually become more involved in business activities, including finding out the number of customer visits each day. Data recorded by a camera can be used to estimate the number of customers by gender. However, manual counting requires a lot of time and effort. Therefore, retail companies need a system that detects and counts customers from camera footage using computer vision, including counting the number of customers by gender.
Moreover, research has also shown that recent computer vision technologies can overcome the data acquisition bottleneck and allow data-based innovations that help traditional retailers improve their customers' shopping experience and consequently strengthen their market position.
Computer vision can also be used to generate movement tracks over time for individual customers, from the time a customer enters the supermarket until the customer leaves. The acquired data allow for several data-based applications that can achieve goals similar to those of their counterparts in online retail (Hernandez et al., 2019). An example of research that implemented computer vision in retail activities is a system that can detect a certain face in a crowd. The system used face recognition to search for a person in a database. The input was video recorded by a CCTV camera mounted on a wall at a height of 2.5 meters and a slope angle of 60 degrees. The results showed that a threshold value between 0.5 and 0.7 was best for classifying faces, and the system achieved an accuracy of 88.88%. Another study proposed an approach to record and analyze customer behavior data based on image and infrared sensors. Both the movement and the activity data were used for customer behavior analysis. This study revealed different customer behaviors, not only for different setups but also for different times of day, days of the week and seasons. Information gained from these results can be used as a basis for planning dynamic product placements or seasonal offers. It also helped to identify rarely visited areas or products and therefore allowed retail store managers to analyze and optimize their shopping environment (Kröckel and Bodendorf, 2012).
Studies on face detection and recognition using computer vision approaches have also been carried out. Kumrawat and Chawla (2017) proposed a human face detection and recognition system using a combination of skin grabbing, Gabor filter and PCA. The system was tested with different backgrounds and lighting conditions and yielded up to 100% accuracy. Dey et al. (2013) built a system that can automatically detect human faces and gender from input images. The system was evaluated on a database containing more than 4000 facial images and achieved an average accuracy of more than 78%. Wulansari et al. (2017) proposed a gender identification system based on facial images using an Artificial Neural Network (ANN). Facial features were segmented into the eye area, right eye area, left eye area and mouth area. These segments were converted into vectors and used as input for an ANN trained with the Backpropagation method. Identification on 60 training samples yielded 100% accuracy, while identification on 60 samples not used in training produced 82% accuracy. Furthermore, more recent research by Liu et al. (2018) proposed a smart unstaffed retail shop scheme that utilized Artificial Intelligence (AI) and the Internet of Things (IoT). Based on a data set of 11,000 images in different scenarios containing 10 different types of Stock Keeping Unit (SKU), an end-to-end classification model trained by the MASK-RCNN method was developed for SKU counting and recognition. The proposed solution achieved 97.7% accuracy for counting and 98.7% accuracy for recognition on the test dataset.

Although a wide variety of face recognition algorithms has been studied extensively, the Support Vector Machine (SVM) is particularly popular for its high performance and suitability for a wide range of problems. Many previous studies have achieved high system accuracy using this method. For example, Rustam and Ruvita (2018) presented a face recognition system for gender classification using SVM and tested it with different kernels. The results showed that SVM with the RBF kernel and with the polynomial kernel achieved the same maximum accuracy of 100 percent. Moreover, Leo and Suchitra (2018) combined 3D Principal Component Analysis (PCA) with SVM and achieved a 96.29% recognition rate when tested on a rich facial expressions database. Another paper described research on face recognition using a simple feature vector and an SVM classifier. That paper also compared the polynomial and Radial Basis Function (RBF) kernels of SVM and showed that the RBF kernel gave better recognition results than the polynomial kernel.
This research focuses on building a system that automatically counts female visitors in a retail shop using a CCTV camera.

Proposed Methods
In this research, data retrieval is carried out by simulating the real situation in a retail company. The CCTV camera is placed in front of the entrance door at a distance of 3.0 m from the turnstile entrance, mounted on a metal pole at a height of 2.10 m above the floor with a camera angle of 54.5 degrees. This predetermined angle is meant to allow the camera to capture objects. An illustration of the data acquisition process can be seen in Fig. 1.
Data are captured from the front corner of the entrance so that the object approaches the camera. The system design comprises several stages, as shown in Fig. 2. In this study, the system design is divided into two main parts, namely the training process and the testing process. The training process trains the system to recognize the training data according to the predetermined classification groups. The testing process determines system performance. The training process and the testing process use different data. Each process begins with data input and preprocessing, followed by the proposed method, which combines feature extraction using the Gabor Filter with SVM as the classification method. The counting process depends strongly on the preceding stages, because the classification results determine the counts.

Input Video
The initial step in building the system is to prepare the input data as video files. The input videos are obtained from the data acquisition process and are recorded using a Vivotek IB369 IP camera and an ASUS K401L Core i5 laptop with 4GB RAM and a 1TB hard drive. The system is built using MATLAB R2015a (64-bit) and video converter software. The data are stored in .AVI video file format with a resolution of 720×1280 pixels, a duration of 78 seconds and a frame rate of 30 fps. The input video is then split into RGB frames. Figure 3 shows an example of a frame taken from the input video.

Face Detection
The RGB frames are then used in the face detection process with the Viola-Jones method. The Viola-Jones method uses a box-shaped sliding window that scans the image and examines its pixel values. Haar-like features are detected inside the box. The shapes of the Haar features can be seen in Fig. 4 (Viola and Jones, 2001), which shows the types of rectangular Haar-like features. The types of Haar-like features depend on the number of adjacent rectangles. The first type is the edge feature, which consists of two squares and is represented by the first and second images in Fig. 4. The third and fourth images consist of three squares and represent the line feature, and the last represents the diagonal feature. The value of a feature is calculated by subtracting the pixel values in the white area from those in the black area. The following equations give the feature value according to the number of squares:

Two squares: $f = B - W$
Four squares: $f = (B_1 + B_2) - (W_1 + W_2)$

where B and W are the pixel values of the black area and the white area, respectively, and the subscripts index the individual squares of each color. After the sliding window reaches the end of the image, the window is resized to a smaller size and scans the image again. The process continues until the sliding window cannot be reduced any further, so many features are obtained (Indrabayu and Areni, 2019).
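As an illustration of the feature computation described above, the following sketch evaluates a two-square (edge) Haar-like feature using an integral image, the structure Viola and Jones (2001) use so that any rectangle sum needs only four lookups. It is a minimal sketch, not the study's MATLAB implementation: the function names, the toy image and the choice of the left half as the black area are assumptions made for illustration.

```python
def integral_image(img):
    """Return ii where ii[r][c] is the sum of img over rows < r and columns < c."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four integral-image lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def edge_feature(ii, top, left, height, width):
    """Two-square feature: black (left half) sum minus white (right half) sum."""
    half = width // 2
    black = rect_sum(ii, top, left, height, half)
    white = rect_sum(ii, top, left + half, height, half)
    return black - white


# Hypothetical 4x4 grayscale patch: bright left half, dark right half.
toy = [[9, 9, 1, 1] for _ in range(4)]
toy_ii = integral_image(toy)
toy_value = edge_feature(toy_ii, 0, 0, 4, 4)
```

A large magnitude of `toy_value` indicates a strong vertical edge inside the window; a detector evaluates many such features at many positions and scales.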
The face area, or foreground, can be separated from the background by selecting a threshold as the cut point. Pixel values greater than the threshold are called object points (labeled 1), while all others are called background points (labeled 0). Finding the optimal threshold is crucial to minimize the chance that the system detects face-like features on objects that are not faces. In other words, the thresholded image g(x,y) is defined as follows (Prasetyo, 2011):

$$g(x, y) = \begin{cases} 1, & f(x, y) > T \\ 0, & f(x, y) \le T \end{cases}$$

Where:
g(x,y) = Pixel value of the thresholded image at point (x,y)
f(x,y) = Pixel value of the input image at point (x,y)
T = Threshold value

In this study, the optimal threshold value, T = 12, was obtained by trial and error. An example of the output of the face detection system is shown in Fig. 5.
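The thresholding rule above can be sketched in a few lines. This is only an illustration of the labeling rule (values greater than T become 1, all others 0); the function name and the sample values are hypothetical, and only T = 12 comes from the text.

```python
def threshold_image(img, t):
    """Label values greater than t as object (1) and all others as background (0)."""
    return [[1 if value > t else 0 for value in row] for row in img]


# Hypothetical input values; T = 12 is the threshold reported in this study.
sample = [[5, 12, 13], [40, 0, 12]]
mask = threshold_image(sample, 12)
```

Note that a value exactly equal to the threshold is treated as background, matching the "greater than" wording in the text.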
The output of the Viola-Jones method is a bounding box indicating the location and extent of the face. The next step is to crop the face area according to the detected bounding box. To capture additional features around the face, such as headscarves and hairstyles, the bounding box is slightly enlarged. Figure 6 shows the part of the bounding box that is enlarged.
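One possible way to enlarge a detected bounding box is sketched below. The amount by which the study enlarged the box is not recoverable from the text, so the padding fraction `pad`, the function name and the frame size in the example are all assumptions for illustration. The sketch grows the box on every side and clamps it to the frame so that the crop never leaves the image.

```python
def enlarge_box(x, y, w, h, img_w, img_h, pad=0.25):
    """Grow box (x, y, w, h) by a fraction `pad` of its size on every side.

    `pad` is a hypothetical value; the result is clamped to the image bounds.
    """
    dx, dy = int(w * pad), int(h * pad)
    left = max(0, x - dx)
    top = max(0, y - dy)
    right = min(img_w, x + w + dx)
    bottom = min(img_h, y + h + dy)
    return left, top, right - left, bottom - top


# Hypothetical 1000x1000 frame with one box in the middle and one near a corner.
inner = enlarge_box(100, 50, 80, 80, 1000, 1000)
corner = enlarge_box(5, 5, 80, 80, 1000, 1000)
```

Clamping matters for faces near the frame border, where an unclamped enlargement would produce negative coordinates.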

Preprocessing
The images obtained from the face detection process vary in size. Therefore, in the preprocessing stage, image resizing is carried out to equalize the face size. All face data are resized to a predetermined size of 101×111 pixels. This value is based on the average image size from the face detection stage. The resizing is performed using the nearest neighbor method.
An illustration of the nearest neighbor method can be seen in Fig. 8. The method works by copying the value of the nearest original pixel to each pixel of the resized image.
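Nearest neighbor resizing can be sketched as follows. This is a minimal illustration rather than the MATLAB routine used in the study; the function name and the index mapping (integer scaling of the output coordinate back to the input grid) are assumptions, and other implementations may round slightly differently.

```python
def resize_nearest(img, new_h, new_w):
    """Resize a 2-D image by copying, for each output pixel, the nearest source pixel."""
    old_h, old_w = len(img), len(img[0])
    out = []
    for r in range(new_h):
        src_r = min(old_h - 1, r * old_h // new_h)
        out.append([img[src_r][min(old_w - 1, c * old_w // new_w)]
                    for c in range(new_w)])
    return out


# Upscaling a hypothetical 2x2 image to 4x4 duplicates each pixel into a 2x2 block.
small = [[1, 2], [3, 4]]
big = resize_nearest(small, 4, 4)
```

Because no new intensity values are interpolated, the method is fast but can produce blocky edges when an image is enlarged.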
The next process is image conversion from RGB to grayscale. This process is done by calculating the average value of the red, green and blue channels. Then the average value is used as the grayscale value in each pixel.
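The grayscale value of each pixel can be calculated using Equation (8):

$$\text{Gray} = \frac{R + G + B}{3}$$

where R, G and B are the red, green and blue channel values of the pixel. An example of the grayscale images is shown in Fig. 9.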
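The channel-averaging conversion described above can be sketched as follows. The function name and the representation of a pixel as an (R, G, B) tuple are assumptions for illustration.

```python
def rgb_to_gray(rgb_img):
    """Convert an image of (R, G, B) tuples to grayscale by averaging the channels."""
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in rgb_img]


# Hypothetical one-row image with two pixels.
gray = rgb_to_gray([[(30, 60, 90), (255, 255, 255)]])
```

Many image libraries instead use a luminance-weighted sum of the channels; the plain average here follows the description in the text.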

Feature Extraction
Feature extraction aims to get valuable information from a digital image. The features can be local facial features (nose, eyes, mouth, etc.) or global facial features (all parts of the face). In this study, global features of the face are used. The feature extraction process is carried out using the Gabor Filter 2-D method. Gabor Filter 2-D is a linear filter used to detect edges.
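The Gabor Filter was developed to simulate the human visual ability to observe the texture of objects (Daugman, 1985). Gabor Filter 2-D has four main parameters, lambda (λ), theta (θ), psi (ψ) and gamma (γ), together with the standard deviation σ of the Gaussian envelope. The Gabor Filter 2-D value can be calculated using Equation (11), given here in the standard real-valued form:

$$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi \frac{x'}{\lambda} + \psi\right)$$

with $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$.

Where:
x, y = Pixel coordinates
λ = Wavelength of the sinusoidal factor (pixel)
θ = Orientation of the filter
ψ = Phase offset (rad)
σ = Standard deviation of the Gaussian envelope (pixel)
γ = Spatial aspect ratio

Examples of feature extraction results can be seen in Fig. 10.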
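To make the role of each parameter concrete, the sketch below samples a Gabor kernel on a small grid, assuming the standard real-valued form, a Gaussian envelope multiplied by a cosine carrier along the rotated x axis. The values θ = 0, ψ = 1, γ = 0.5 and λ = 8 are the ones reported in this study; the kernel size of 9 and σ = 4.0 are assumptions, since the text does not state them, and the function name is hypothetical.

```python
import math


def gabor_kernel(size, lam, theta, psi, sigma, gamma):
    """Real-valued 2-D Gabor kernel on a size x size grid centred at the origin."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lam + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# size and sigma are assumed values; the other parameters follow the text.
kernel = gabor_kernel(size=9, lam=8, theta=0.0, psi=1.0, sigma=4.0, gamma=0.5)
```

Convolving the grayscale face image with such a kernel gives the filter response; a larger λ stretches the carrier wave, which is consistent with the blurrier responses the study reports for high λ.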

Classification
The classification process aims to analyse facial features based on the data from feature extraction. The classification method used in this study is the Support Vector Machine (SVM). SVM performs classification by finding the optimal hyperplane that maximizes the margin between the two classes with the help of support vectors. It is based on the concept of decision planes (or hyperplanes) that define decision boundaries for the classification. The hyperplanes are boundaries that divide the data points into different classes, so data points falling on either side of a hyperplane can be attributed to different classes. Support vectors are the data points located closest to the hyperplane. They affect the position and orientation of the hyperplane. An illustration of the SVM algorithm can be seen in Fig. 11. The first stage of SVM is to divide the data into two groups, namely training data and testing data. In this study, the training data consist of 8 objects, each with 35 frames obtained from the face detection result. The training data are classified into two classes. Class 1 includes 4 positive objects (female) and class 2 includes 4 negative objects (not female). An illustration of the training data can be seen in Fig. 12.

Fig. 10: Result of the feature extraction process

SVM uses a scoring function to assign each input to one of the given classes. It is a binary (two-class) classification method. If the score of an input for a certain class is positive, the input is classified as that class. Otherwise, if the score is negative, the input is classified as the opposite class.
Using a linear kernel and assigning the label -1 to the first class (female) and 1 to the second class (not female), the predicted score value for all test data in the female class can be calculated using Equation 5 (Bhavsar and Panchal, 2012):

$$f(x) = \left(\frac{x}{s}\right)^{\top} \beta + b$$

Where:
f(x) = Predicted score value
x = Feature value
s = SVM kernel scale
β = Weight
b = Bias
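The linear scoring rule can be sketched as below, assuming the score takes the form f(x) = (x/s)·β + b built from the symbols listed above. The weights, bias and scale in the example are hypothetical numbers, not the values learned in the study, and treating a score of exactly zero as female is an arbitrary choice of this sketch.

```python
def svm_score(x, beta, bias, scale=1.0):
    """Linear SVM score f(x) = (x / s) . beta + b."""
    return sum((xi / scale) * bi for xi, bi in zip(x, beta)) + bias


def predict_label(x, beta, bias, scale=1.0):
    """Return 1 (not female) for a positive score, otherwise -1 (female)."""
    return 1 if svm_score(x, beta, bias, scale) > 0 else -1


# Hypothetical model parameters and two hypothetical feature vectors.
beta, bias, scale = [0.5, -1.0], 0.25, 2.0
score_a = svm_score([2.0, 1.0], beta, bias, scale)
label_a = predict_label([2.0, 1.0], beta, bias, scale)
label_b = predict_label([0.0, 2.0], beta, bias, scale)
```

In practice the feature vector x would hold the Gabor responses of one face frame, and β, b and s would come from training.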

Counting
After the classification process is carried out, the next stage is to count female and non-female visitors. The counting process tallies the frames that show each person and then observes which class has the most frames for that person. If the number of frames predicted as class 1 is greater than or equal to the number of frames predicted as class 2, the counter for female visitors increases by 1. Otherwise, if the number of frames predicted as class 2 is higher than the number of frames predicted as class 1, the counter for non-female visitors increases by 1.
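The per-person voting rule above can be sketched as follows. The sketch assumes that the frame-level predictions have already been grouped by person, which the text does not detail, and that female frames carry the label -1; the function name and the sample labels are hypothetical.

```python
def count_visitors(frame_labels_per_person, female_label=-1):
    """Count female and non-female visitors by majority vote over each person's frames.

    A tie counts as female, following the 'greater than or equal' rule in the text.
    """
    female = non_female = 0
    for labels in frame_labels_per_person:
        female_frames = sum(1 for label in labels if label == female_label)
        other_frames = len(labels) - female_frames
        if female_frames >= other_frames:
            female += 1
        else:
            non_female += 1
    return female, non_female


# Three hypothetical people: mostly female frames, mostly non-female frames, a tie.
counts = count_visitors([[-1, -1, 1], [1, 1, -1], [-1, 1]])
```

Voting over many frames makes the count robust to occasional frame-level misclassifications, at the cost of a built-in bias toward the female class when the vote is tied.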

Results and Discussion
In this study, the testing data consist of 6 videos with a total duration of 527 seconds and 15916 frames. The data acquisition process involved 89 people, 41 women and 48 men, as the test objects. To calculate system accuracy, this study uses a confusion matrix, which compares the classification results produced by the system with the actual classes (Prasetyo, 2011). An example of a confusion matrix can be seen in Table 1.
To calculate the accuracy of the female visitor counting system, the following equation is used:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$

Three of the Gabor Filter 2-D parameters were fixed in advance: θ = 0°, ψ = 1 and γ = 0.5. After the optimal λ value is obtained, the next stage is classification using the SVM method. The result of this step can be seen in Fig. 14, which shows that the scores of the SVM feature vectors are divided into two classes. The female class has a maximum score of 1.4705 and a minimum score of 0.6264. The non-female class has a maximum score of 1.9787 and a minimum score of 0.3589. The female data, marked with red circles, are in class -1 (Y-axis) and the non-female data, marked with blue circles, are in class 1 (X-axis).
The experiment showed that the best system performance was obtained using λ = 8. The result can be seen in Table 2, which shows that one positive object was counted as female in Video A, while two positive objects were counted as female in Video C. The combination of the Viola-Jones, Gabor Filter 2-D and SVM methods for counting the number of female visitors has an average accuracy of 96.52%.
The result is affected by the value of the Gabor Filter λ parameter. The higher the value of λ, the brighter and blurrier the result of texture segmentation becomes. If the value of λ is too low, the image becomes increasingly difficult to recognize because of the low contrast, making the extraction results indistinguishable. The Gabor Filter 2-D method also depends on the brightness of the frames, so significant differences in brightness levels can cause the system to misclassify. An example of a system error can be seen in Fig. 15.
The terms of the confusion matrix are defined as follows:
• True Positive (TP) is the number of positive objects correctly classified as female
• False Positive (FP) is the number of negative objects falsely classified as female
• False Negative (FN) is the number of positive objects falsely classified as non-female
• True Negative (TN) is the number of negative objects correctly classified as non-female
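The accuracy calculation can be sketched from the four confusion matrix terms, assuming the usual definition of accuracy as the share of correctly classified objects, (TP + TN) / (TP + TN + FP + FN). The counts in the example are hypothetical and are not the results of this study; the function name is also an assumption.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy in percent from the four confusion matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return 100.0 * (tp + tn) / total


# Hypothetical counts for one video, not taken from Table 2.
example_accuracy = accuracy(tp=9, tn=10, fp=1, fn=0)
```

Averaging this value over the test videos would give a figure comparable to the average accuracy reported in the text.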

Conclusion and Future Work
The output of the gender-based customer counting system is a video composed of face frames marked with yellow bounding boxes. Each frame is accompanied by counters of female and non-female visitors, which appear in the result panel. These values change according to the number of visitors' faces detected by the system. This study used the Viola-Jones method, Gabor Filter 2-D and SVM. The Viola-Jones method was used for face detection, Gabor Filter 2-D was used to extract global facial features and the SVM method was used to classify female and non-female customers. The results show that the system can count the number of female and non-female visitors with an accuracy of 96.52% on 6 testing videos with a total duration of 527 seconds, a total of 15916 frames and 89 people. In future work, other feature extraction and classification methods will be studied to obtain higher accuracy.