Face Verification for Person Re-Identification from Surveillance Camera and Drone-based Videos

Person re-identification in surveillance camera videos is attracting widespread interest due to its increasing number of applications. It is being applied in security, healthcare, product manufacturing, product sales and more. Though there are a variety of methods for person re-identification, face verification-based methods are highly effective. In this study, a deep learning framework to perform face verification in videos is proposed. Developing a face verification deep learning model includes stages such as face detection, cropping, alignment, augmentation, image enhancement and face image selection for model training. The authors put forward innovative methods to be adopted in various stages of this sequence to improve the performance of the models. The focus of this study is on these image preprocessing stages rather than the deep learning part, which makes the approach unique. The overall model is improved by increasing the efficiency of each of these stages through methods such as face detection and cropping based on face landmarks, effective training image selection using face landmark symmetry, various image augmentation techniques including perspective transformation, and image enhancement methods such as contrast stretching and histogram equalization. An average increase of two percent is obtained in the accuracy of the face verification models by applying these methods.


Introduction
Human beings can identify a person in a photo or video, by observing it from a viewable distance and angle. This identification is based on the overall appearance of the person, facial features, pose, voice and complexion, or even based on an accessory the person carries. When this person re-identification task is automated, one or a combination of the above methods can be applied. Person re-identification through face recognition and verification (Koide et al., 2017) is one of the easiest ways to achieve the best results for this task.
Person identification methods based on face recognition can be divided into two categories: feature-based and metric-based methods. In the first approach, effective descriptors are used for representing each person. In the second approach, an effective distance calculation method is used to minimize the distance between images of the same person and increase the distance between images of different persons (Wang et al., 2018a). This study focuses on one of the methods under the first category: person re-identification based on face verification from videos using deep learning methods. Convolutional Neural Network (CNN) based deep learning models are feature-based classification methods (Jogin et al., 2018), even though the features are not listed explicitly.
The biggest challenges in face recognition tasks from videos are the limitations due to the poor quality of images extracted from videos, ineffective object localization methods (Koc, 2021) and inefficient image preprocessing methods (Liao et al., 2012).
Person re-identification using face verification involves three stages (Mathew et al., 2019):
1. Data collection: Collect a set of videos or images of the person or persons to be identified.
2. Face image extraction: Process the image frames in the video, detect faces using face detection algorithms and extract the faces of the person or persons to be identified.
3. Model development: Develop a deep learning model for face verification using the above face images after applying the required image preprocessing, which includes image alignment, augmentation, selection, image enhancement, etc.
The face images are collected from running video sequences. The quality and pose of the extracted face images have a big impact on the performance of the face verification model. Though many studies exist on fine-tuning the deep model parameters (Wang et al., 2018b; Parkhi et al., 2015), we have focused on improving the pre-processing stages to improve the performance of the model. Studies in this direction are fewer compared to the model fine-tuning approach (Liao et al., 2012; Koc, 2021). An effective face verification algorithm in which importance is given to the face image preparation stages is proposed in this study. This algorithm systematically collects suitable face images, preprocesses them properly and develops models for face verification. Innovative methods in the face image preprocessing stages are the main contributions of the paper. With the help of this algorithm, it is possible to develop face verification models and identify persons from surveillance camera videos more accurately.
The remaining part of this study is organized as follows. The Literature Review describes the related work, categorized into sections based on the different stages of the person re-identification framework. The Materials and Methods section describes the datasets used and the proposed person re-identification framework, including the algorithm. It also includes the model development stage and the results of the face verification or re-identification stage. Results related to the different stages of the algorithm, such as face detection, face alignment, selection, augmentation, symmetric face selection and perspective transformation, are included next. A comparative study with other existing methods is carried out in the Discussion section. Finally, the Conclusion and Future Scope are included.

Literature Review
Many studies have been carried out on automated person re-identification and activity detection to develop intelligent systems that reduce human effort in home and public environments (Thakur and Han, 2021). The different methods used for person re-identification tasks include Sequence Information (Hu et al., 2018), Saliency Learning (Zhao et al., 2017), Pose-aware method (Cho and Yoon, 2018), Multi-shot ranking (Karanam et al., 2019), Correlation-based method (Hsu et al., 2018), PCA and Eigenfaces (Meedeniya and Ratnaweera, 2007), etc. A multimodal method of face and body matching is used in (Koo et al., 2018).
In this study, the focus is on the face detection-based person re-identification method. Other challenges in implementing person re-identification using face detection from videos are occlusion (Zhuo et al., 2018), poor quality of the video (Hitesh et al., 2017), partial faces (Liao et al., 2012), etc. The nature of the input dataset is always a factor that affects the performance of any model (Mathew et al., 2018).
In an earlier work, the authors tried to improve the performance of deep convolutional neural networks for face recognition using an angular loss function. Wang et al. (2018b) used a large margin cosine loss to improve the performance of the model. A face verification model development process involves a sequence of steps, and the model can be improved by incorporating improvements in any of these stages. A sequence of steps is applied in the proposed algorithm to make the input face images sharp and clean before model training. The existing works in these stages are listed below.

Face Detection
Face detection is the identification of the location of one or more faces that exist in an image frame. In most cases, it returns the bounding rectangles in which the face image exists (Ranjan et al., 2019). Face detection algorithms have received much attention in the last decade and a variety of algorithms are generally accepted for the purpose. Viola and Jones (2001), Multitasking Convolutional Neural Networks (Zhang et al., 2016) and FaceNet (Schroff et al., 2015) are the most popular ones which provide considerable accuracy. A landmark-based algorithm is proposed in this study to improve face detection accuracy.

Face alignment and Augmentation
When face images are captured from videos and taken as input for model development, the factors that bring down the model accuracy are the lack of alignment of face images and the presence of partial faces (Liao et al., 2012). So, it is required to align all the face images captured to the same orientation and filter out partial faces. Some of the methods used for this purpose are Face alignment using regressing local binary features (Ren et al., 2016), Adaptive Pose Alignment for Pose-Invariant Face Recognition (Liao et al., 2012) and Viewpoint-Consistent 3D Face Alignment (Tulyakov et al., 2017).
Another important factor that affects the orientation of images is the position of the camera. Cameras are mostly fixed on the walls of a room or on traffic posts, above the view level. Studies on image correction to nullify the effect of the position and angle of the camera have been conducted earlier. Mean shift clustering and Laplace linear regression were suggested for automatic radial distortion correction (Tang et al., 2019). When the camera is positioned above view level, perspective transformation is a useful option to cancel the distortion introduced on object images by the position of the camera (Ansari and Shim, 2019).
Face alignment is an important image preparation stage that results in better performance of models. Joint face alignment is one such method where optimization was performed iteratively and sequentially (Zhang et al., 2020). Existing face alignment methods are explored and a new method with better performance is proposed in the study. The most aligned faces are selected based on this method for the model developed.

Enhancement of Face Images
Fuzzy-based illumination normalization of face images is applied in (Nasution et al., 2014) as a face enhancement step. Collaborative Random Faces Guided Encoders are used in (Shao et al., 2017).
Two face enhancement methods, Histogram Equalization (H. E.) and Contrast Stretching (C.S.) are proposed as effective methods for face enhancement in this study.

Compute Face Embedding
When the input face images used for model development are extracted from videos in different contexts, with completely different attire and accessories, the direct deep learning method for face verification fails considerably. A method to overcome this is to extract the face features first using the face embedding method and then apply deep learning for face verification (Schroff et al., 2015).

Datasets
The datasets used in the study are listed below.

a. YouTube Faces Dataset
Originally collected from YouTube by its contributors, this dataset is intended for face recognition from videos. It includes 3425 videos of 1595 people. The videos are split into frames and stored in different folders. The shortest video includes 48 frames and the longest includes 6070 frames, with an average of 181.3 frames per video (Wolf et al., 2011). As the dataset is too large, a subset of it was used for testing the algorithm. Subjects having only very few images were removed to obtain a balanced training dataset.

b. Children Data Set
Using an 8MP phone camera, videos of five children were captured for face detection and verification. The set includes 15 videos of five teenage children, with 3 videos of each child. Some videos that include more than one of these children were also collected for verification and for marking the person in the video.
Smartphone cameras are the next growing source of video and images. This way, the dataset is intended to represent real-world situations and hence make the model suitable for a wide range of applications from multi-type sources.

c. Choke Point Dataset
It is a dataset sponsored by NICTA (National Information and Communications Technology Australia) designed for carrying out person re-identification tasks (Wong et al., 2011). From this collection, the dataset used in the study includes cropped face images from videos of 30 people with an average of 50 images per person.

d. Film Star Dataset
It is a dataset of two popular film actors collected from the Kaggle website and it was enhanced by adding more images from the internet. It is an image dataset that includes around 1000 images of each actor. Since they are very popular film actors who have acted in several roles in different films, the images include a variety of appearances of the same actor. This dataset is being used for the development of a binary classification model for verifying the face of these two actors. The dataset includes two folders in which the two sets of images are stored.

e. Drone Face Dataset
Though many datasets are available for face identification (Yang et al., 2016), the dataset most suitable for performance analysis of the deep learning models based on the position of the surveillance camera is drone face (Hsu and Chen, 2017).
The drone face dataset makes it possible to study face recognition performance on drone videos. The part of the dataset used in the experiment includes:
- 1364 face images of 11 persons, including seven male and four female candidates. The face images are between 23 × 31 and 384 × 384 resolution. All are frontal face images.
- For each person, four sets of images are included, captured from cameras positioned at heights of 1.5, 3, 4 and 5 m from the person's head level. For each camera height, there are 31 images captured at distances of 2 to 17 m from the person, at 0.5 m intervals.
- All images were taken in daylight.
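The figures quoted for this subset are mutually consistent, which a short check makes explicit. The sketch below only recomputes the image count from the numbers stated above; the variable names are illustrative.

```python
# Recompute the drone face subset size from the numbers stated in the text.
persons = 11
camera_heights_m = [1.5, 3, 4, 5]

# Distances from 2 m to 17 m in 0.5 m steps, both ends included.
distances_m = [2 + 0.5 * i for i in range(int((17 - 2) / 0.5) + 1)]

images_per_height = len(distances_m)
total_images = persons * len(camera_heights_m) * images_per_height
print(images_per_height, total_images)  # 31 1364
```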

f. Newsreader Dataset
It is a video dataset of ten television newsreaders collected from popular news channels. Different clippings of the 10 newsreaders were collected, split into frames and stored in 10 different folders, each containing a minimum of 1000 images.
In this study, the main objective is to study the performance of the proposed preprocessing methods. In addition to the benchmark datasets, authors have created some additional datasets as listed above. While preparing the additional datasets, it was ensured that enough input videos are included to avoid class imbalance problems, after certain stages of preprocessing like landmark-based face detection and face symmetry-based image selection.

Proposed Person Re-Identification Framework
The proposed person re-identification framework comprises different stages: (1) Face detection and extraction. (2) Face image selection, alignment and augmentation. (3) Face image quality enhancement. (4) Face embedding computation, if required based on the input dataset. (5) Model development using extracted images and (6) Face verification using the developed model. Figure 1 shows an overview of the person identification framework.
The different stages in the proposed person identification framework are explained below in detail.

Face Detection and Extraction
A video is a collection of image frames. To extract face images from the video, image frames are captured at regular intervals and each of these image frames is analyzed to detect face images using a face detection algorithm. The performance of different face detection algorithms was compared by developing models using face images extracted with each of these algorithms. The model based on the Viola and Jones (2001) algorithm and the model based on the Multitasking Convolutional Neural Network (MTCNN) (Zhang et al., 2016) algorithm produced comparatively better performance, and the MTCNN-based model was found to be the better of the two (Mathew et al., 2019). In this study, the Viola-Jones and MTCNN algorithms were explored for their suitability and performance for face detection from videos. A landmark-based algorithm that gives better performance in the context of person re-identification from videos is also proposed.

Face Image Selection
In the proposed person re-identification framework, face extraction and filtering are done in different stages. One of the algorithms, Viola-Jones, is used in stage 1 as shown in Fig. 2 for extracting face images, which are stored in different folders, each corresponding to a different person in the dataset.
While analyzing raw video for face detection, it was found that many non-face images were detected as face images due to the deficiency of the face detection algorithm. After completing stage 1 in Fig. 2, many non-face images were thus captured as face images into the folder. One effective mechanism to remove such non-face images is to apply a sequence of face detection algorithms, as in Fig. 2. Irrespective of the algorithms being applied, it is most important to apply more than one algorithm in sequence to filter out all non-face images from the dataset.
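The idea of passing candidate face crops through several detectors in sequence can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the detectors are represented by hypothetical callables that return True when a face is found, and the toy "images" are dictionaries holding precomputed verdicts.

```python
from typing import Any, Callable, Iterable, List, Sequence

# A detector is any callable that returns True when it finds a face (hypothetical interface).
Detector = Callable[[Any], bool]


def filter_faces(candidates: Iterable[Any], detectors: Sequence[Detector]) -> List[Any]:
    """Keep only the candidates that every detector in the sequence accepts."""
    kept = list(candidates)
    for detect in detectors:
        # Each stage sees only what survived the previous stage.
        kept = [image for image in kept if detect(image)]
    return kept


# Toy stand-ins: each "image" records what three hypothetical stages would decide.
crops = [
    {"name": "face_1", "stage1": True, "stage2": True, "stage3": True},
    {"name": "poster", "stage1": True, "stage2": False, "stage3": False},
    {"name": "face_2", "stage1": True, "stage2": True, "stage3": False},
]
stages = [lambda c: c["stage1"], lambda c: c["stage2"], lambda c: c["stage3"]]
survivors = [c["name"] for c in filter_faces(crops, stages)]
print(survivors)  # ['face_1']
```

Because each stage only removes candidates, the order of the detectors does not change the final set; it only affects how much work the later, typically slower, stages have to do.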
The quality and completeness of the face images used for model development are always a factor that affects the performance of the model used for face verification. The landmark-based face detection algorithm used in stage 3 of Fig. 2 is based on landmark point detection. In this algorithm, the face images are extracted only if the face images contain all the face landmarks in the face. A face with landmark points marked is shown in Fig. 3a.
If not all face landmark points can be found in the landmark detection stage, the face is dropped from the folder. If face images are scarce, the face is instead regenerated by mirroring the available half of the face.
Different methods to improve the quality of the input face images undergoing classification are proposed here. Face images having poor resolution are removed, since the resolution of input face images is a factor that affects the performance of the face verification model; a face with good resolution is detected more easily. Removing low-resolution face images reduces the number of samples. When the input video collection is large, this reduction can be ignored as there will still be enough face images.
If the face is small, scale it with interpolation. If the input video collection is large, this step can be skipped, as there will be enough images even without regenerating them by scaling.
Another issue to be handled in the image selection stage is the number of samples. If there exist enough input videos with the person to be identified in it, it is easy to select the most suitable face images to make the model. But in some cases, there will be only very few videos of the person to be identified. In that case, utilize all the available face images without filtering out many faces even if they are not perfect.
The datasets include both male and female subjects. In the model architecture, one more level can be added that first performs male-female detection and then identifies the person. For a dataset that is balanced between the two groups, this roughly halves the number of classes each person classifier has to separate, and better accuracy may be achieved.

Face Alignment and Symmetric Face Selection
A challenging problem in person identification from surveillance videos is the presence of misaligned, distorted and rotated faces among the detected face images. The faces extracted from videos are not always aligned; they can also be side faces or rotated faces. When these face images are used for model development, the accuracy of the models is affected. Therefore, a set of face alignment methods is used to avoid these problems. Even though good quality face detection systems can detect slightly misaligned faces, the performance of the face classifier is found to decrease when misaligned face inputs are also considered during model development. Face orientation correction and frontal face detection need to be applied to avoid this issue.

a. Rotation
If the face is misaligned, rotate it. Extract the face using a face detection algorithm and find all face landmarks, which are a set of points marked on the face. The face landmarks are used to align the face. The angle of rotation of the face is obtained from two face landmark points lying on a straight line; the corners of the two eyes are taken here. The angle of rotation of the face image is calculated as:

$$\theta = \tan^{-1}\left(\frac{y_2 - y_1}{x_2 - x_1}\right)$$

Where:
(x1, y1) = The left corner of the left eye
(x2, y2) = The right corner of the right eye

Rotate the face through this angle. A face that has been rotated in this way is shown in Fig. 3; the face image before and after rotation is depicted in Fig. 3a and 3b, respectively.
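The rotation step can be illustrated with a short sketch. It assumes the two eye-corner landmarks are already available as (x, y) pixel coordinates; the coordinates used here are hypothetical, and `atan2` is used in place of a plain arctangent so that a vertical eye line does not cause a division by zero.

```python
import math


def rotation_angle_deg(left_eye, right_eye):
    """Angle of the line through the two eye corners relative to the horizontal, in degrees."""
    (x1, y1), (x2, y2) = left_eye, right_eye
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def rotate_point(point, center, angle_deg):
    """Rotate `point` about `center` by `angle_deg` using the standard 2D rotation matrix."""
    theta = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + dx * math.cos(theta) - dy * math.sin(theta),
        center[1] + dx * math.sin(theta) + dy * math.cos(theta),
    )


left_eye, right_eye = (30.0, 40.0), (70.0, 50.0)  # hypothetical landmark positions
angle = rotation_angle_deg(left_eye, right_eye)

# Rotating by -angle about the left eye brings both eye corners to the same height.
aligned_right = rotate_point(right_eye, left_eye, -angle)
print(round(angle, 2), round(aligned_right[1], 6))
```

In practice the same angle would be passed to an image rotation routine so that every pixel, not just the landmarks, is rotated about the chosen center.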

b. Symmetric Image Selection
Faces lacking symmetry in landmark points may be partial images and are therefore removed before model development. A mathematical method to check the symmetry of faces has been adopted to select the most suitable images to be included in the model. The mean square error from the axis is calculated based on the positions of the landmark points to find the more symmetric images. Face images with different symmetry error values are shown in Fig. 4a to 4e.
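A possible form of the symmetry check is sketched below. The exact error formula is an assumption made for illustration: each mirrored landmark pair contributes the squared difference between its left and right distances to the vertical axis, and the mean over all pairs is taken as the error. The face records, landmark pairs and axis positions are hypothetical.

```python
def symmetry_error(pairs, axis_x):
    """Mean squared mismatch between mirrored landmark distances to the vertical axis.

    `pairs` holds (x_left, x_right) coordinates of mirrored landmarks.
    ASSUMPTION: this pairing-based error is illustrative, not the study's exact formula.
    """
    errors = [((axis_x - x_left) - (x_right - axis_x)) ** 2 for x_left, x_right in pairs]
    return sum(errors) / len(errors)


def select_most_symmetric(faces, n):
    """Return the names of the n faces with the smallest symmetry error."""
    ranked = sorted(faces, key=lambda f: symmetry_error(f["pairs"], f["axis_x"]))
    return [f["name"] for f in ranked[:n]]


faces = [
    {"name": "frontal", "axis_x": 50.0, "pairs": [(30.0, 70.0), (40.0, 60.0)]},
    {"name": "slightly_turned", "axis_x": 50.0, "pairs": [(32.0, 66.0), (41.0, 58.0)]},
    {"name": "profile", "axis_x": 50.0, "pairs": [(45.0, 80.0), (48.0, 70.0)]},
]
print(select_most_symmetric(faces, 2))  # ['frontal', 'slightly_turned']
```

A perfectly mirrored face gives an error of zero, so ranking by this value and keeping the first n faces favors frontal, complete faces.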
In the symmetry calculation, d denotes the mean square error from the axis, x corresponds to the vertical axis passing through the middle of the face and xi to the position of the i-th landmark point. From the available face image set, the n faces with the minimum value of d are selected. Face images before and after applying landmark points-based symmetrical image selection are shown in Fig. 5.

c. Image Distortion Correction using Perspective Transformation
In a practical scenario, the surveillance camera is fixed at an elevated position, whether indoors or outdoors. It is fixed on the sidewall of a room, on a traffic post, or even on a drone.
In such cases, since the camera lens and the object are not always in parallel planes, image distortion occurs in the captured images. The important distortions that can occur are radial and tangential distortion. Camera calibration methods are used to obtain rectified images from distorted images (Beauchemin and Bajcsy, 2001).
The amount and type of distortion depends on the position and orientation of the camera against the object. In the case of face images, there can be a large variation between normal and distorted images as in Fig. 6. While applying face recognition models, this distortion introduced by the camera position will reduce the effectiveness of the face verification models.
The way the values of the projected coordinates change when images are captured is indicated in Fig. 7.
The projections are obtained using the perspective projection equations:

$$x = \frac{d\,X}{Z}, \qquad y = \frac{d\,Y}{Z}$$

Where:
(X, Y, Z) = The coordinates of the scene point
(x, y, d) = The coordinates of the image point

To obtain the corrected image after capturing images without camera calibration, perspective transformation can be applied (Ansari and Shim, 2019).
The perspective transformation matrix is calculated using 4 sets of points in source and destination images similar to the process of camera calibration using chessboard images (Beauchemin and Bajcsy, 2001). Then perspective transformation is applied to the distorted face images with this perspective transformation matrix to rectify this distortion.
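How a perspective transformation matrix follows from four point pairs can be sketched in plain Python. This is an illustrative derivation under the usual convention that the last matrix entry is fixed to 1, giving eight linear equations in eight unknowns; the point coordinates are hypothetical and the function names are not taken from the study.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def perspective_matrix(src, dst):
    """3x3 matrix mapping four source points onto four destination points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]  # last entry fixed to 1
    return [h[0:3], h[3:6], h[6:9]]


def apply_perspective(matrix, point):
    """Map one (x, y) point through the matrix, including the perspective divide."""
    x, y = point
    u, v, w = (row[0] * x + row[1] * y + row[2] for row in matrix)
    return (u / w, v / w)


# Hypothetical example: a trapezoid (as seen from an elevated camera) mapped to a square.
src = [(20, 0), (80, 0), (100, 100), (0, 100)]
dst = [(0, 0), (100, 0), (100, 100), (0, 100)]
H = perspective_matrix(src, dst)
print([tuple(round(c, 3) for c in apply_perspective(H, p)) for p in src])
```

Image libraries provide the same two operations for whole images; OpenCV, for example, offers `cv2.getPerspectiveTransform` and `cv2.warpPerspective`.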
After extracting face images from the video, perspective transformation is applied to reduce the distortion in the face image. When perspective transformation was applied to the first level distorted images in the drone face dataset, an accuracy improvement of 2% was achieved in the face verification model. Figure 8 shows some sample face images on which perspective transformation was applied.

Face Image Augmentation
To introduce variety in the input face images, suitable augmentation methods were applied during the model construction stage.
The following methods were tested using the programming framework before building the Convolutional Neural Network (CNN) model for face verification:
- Horizontal shifting inside the image
- Vertical shifting inside the image
- Horizontal flip
- Vertical flip
- Varying brightness
- Rotating the images through different angles
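The first five augmentations in the list can be sketched on a grayscale image stored as a list of rows. This is only an illustration of what each operation does; it is not the framework used in the study, and rotation through arbitrary angles is omitted because it requires interpolation.

```python
def shift_horizontal(img, dx, fill=0):
    """Shift pixels right by dx (left if dx is negative), padding with `fill`."""
    w = len(img[0])
    return [[row[c - dx] if 0 <= c - dx < w else fill for c in range(w)] for row in img]


def shift_vertical(img, dy, fill=0):
    """Shift pixels down by dy (up if dy is negative), padding with `fill`."""
    h, w = len(img), len(img[0])
    return [list(img[r - dy]) if 0 <= r - dy < h else [fill] * w for r in range(h)]


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [list(row) for row in img[::-1]]


def adjust_brightness(img, factor):
    """Scale every pixel by `factor`, clipping to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


image = [[10, 20, 30],
         [40, 50, 60]]
print(shift_horizontal(image, 1))  # [[0, 10, 20], [0, 40, 50]]
print(flip_horizontal(image))      # [[30, 20, 10], [60, 50, 40]]
```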

Face Image Quality Enhancement
Changes in lighting in the scenes may change the intensity of images captured in videos (Starovoitov et al., 2003). The videos and images captured from various sources will be of different quality and the quality of images may, in turn, affect the performance of the model being developed. Two methods are proposed to make the quality of images uniform and improve model performance.

a. Histogram Equalization (H.E.)
Histogram equalization is an important method for image enhancement (Honnutagi and Maranur, 2018; Akila, 2017). After applying histogram equalization, an image with an approximately uniform distribution of pixel values is obtained (Jeon and Kim, 2016). Image processing such as histogram equalization is found to increase the performance of the neural network model (Bertalmío, 2019).
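The standard histogram equalization procedure can be sketched as follows for a grayscale image stored as rows of integer intensities. It is a textbook formulation given for illustration, assuming 256 grey levels; it is not claimed to be the exact implementation used in the study.

```python
def equalize_histogram(img, levels=256):
    """Histogram-equalize a grayscale image given as rows of integer intensities."""
    pixels = [p for row in img for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to equalize
        return [list(row) for row in img]

    # Map each intensity through the normalized CDF onto the full output range.
    lut = [
        round((c - cdf_min) * (levels - 1) / (total - cdf_min)) if c > 0 else 0
        for c in cdf
    ]
    return [[lut[p] for p in row] for row in img]


image = [[50, 50, 50],
         [100, 150, 200]]
print(equalize_histogram(image))  # [[0, 0, 0], [85, 170, 255]]
```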
For an image, histogram equalization is carried out by computing the histogram of pixel intensities, forming its cumulative distribution and mapping each pixel intensity through the normalized cumulative distribution.

b. Contrast Stretching (C.S.)
Contrast stretching is applied to improve the contrast in an image by increasing the range of intensity values of the pixels (Ruikar et al., 2018). It is applied in many image analysis algorithms as a preprocessing step (Cao and Li, 2020). Increasing image contrast in this way can improve the quality of the image, and it is an important image enhancement method in various applications (Erwin and Ningsih, 2020).
Contrast stretching is obtained by the following formula:

$$g(x, y) = \frac{f(x, y) - f_{min}}{f_{max} - f_{min}} \times (L - 1)$$

Where:
f(x, y), g(x, y) = The pixel intensity at position (x, y) in the input and output image
fmin, fmax = The minimum and maximum pixel intensity of the pixels in the input image
L = 256 = The number of grey levels of the image

This method increases the image contrast and thereby improves the model performance. Better performance is obtained from models developed after applying either of these two image enhancements.
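A minimal sketch of contrast stretching is given below, assuming the common min-max form in which the observed intensity range is mapped linearly onto the full range of 256 grey levels. The function name and the sample image are illustrative.

```python
def contrast_stretch(img, levels=256):
    """Linearly stretch intensities so the darkest pixel maps to 0 and the brightest to levels - 1."""
    pixels = [p for row in img for p in row]
    f_min, f_max = min(pixels), max(pixels)
    if f_max == f_min:  # constant image: no range to stretch
        return [list(row) for row in img]
    return [
        [round((p - f_min) * (levels - 1) / (f_max - f_min)) for p in row]
        for row in img
    ]


image = [[100, 110],
         [120, 150]]
print(contrast_stretch(image))  # [[0, 51], [102, 255]]
```

Unlike histogram equalization, this mapping is linear, so the relative spacing between intensities is preserved while the overall range is widened.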

Model Development for Face Verification
The face images were prepared by applying the various steps in the proposed algorithm to each of the datasets. The images were organized into five different folders, based on the image processing steps applied: existing MTCNN algorithm-based face image cropping, face landmark-based face image cropping, rotation and face landmark-based symmetry, histogram equalization and contrast stretching. Five models were prepared for each dataset to evaluate the performance of the stages in the algorithm, one for each of the above five categories of images as its training set. A convolutional neural network (Wang et al., 2018a) was used to develop the deep learning model for face image classification. Results of the performance comparison of these models are included and discussed in the Results and Discussion sections.
The general structure of the Convolutional Neural Network used in all the models is shown in Fig. 9. The effect of applying perspective transformation for face alignment was tested with the Drone Face dataset, which is suitable for studying the performance of face alignment when the camera is fixed above view level. The models developed with images before and after applying perspective transformation were compared for performance.

Face Verification
During face verification, all the pre-processing stages applied during image preparation for model development have to be carried out first. It includes the sequence of face detection and cropping stages, face rotation, histogram equalization, contrast stretching, etc. After these stages, prediction is carried out. The confusion matrix and accuracy were computed by providing a test set of face images. The results are included in Table 1.

Compute Face Embedding
While experimenting with person re-identification from videos, it was found that a CNN-based image classifier does not perform well when the input face images are extracted from videos captured in a context very different from the one being predicted, or if the person being verified is wearing different face accessories or has a different facial appearance. The face embedding calculation method described in the FaceNet paper (Schroff et al., 2015) gives better performance in this context. In this method, face embedding computation is performed first, after which a deep learning model is prepared from the computed results.
While doing face verification using face embedding, the process can be enhanced by incorporating a clustering layer to group the face images of a scene first. The computed face embedding values can be used for this clustering. The number of clusters is identified using the elbow method (Marutho et al., 2018). This clustering procedure helps to fix the number of persons in a scene first, which simplifies the task of person verification.
Figures 10a and 10b contain the accuracy and loss curves comparing the model developed using existing MTCNN-based face image cropping against the model developed using face landmark points-based face images for the YouTube Faces dataset. The proposed face verification pipeline recommends more than one face image detection and cropping algorithm during its different stages. However, when the proposed landmark-based cropping is compared against the MTCNN algorithm in the pipeline, the accuracy curve of the landmark-based model is found to be more stable and it results in better accuracy, as shown in Fig. 10.

Results
Similarly, the performance comparison of the accuracy of the models for the five datasets and five different image enhancement methods is summarized in Table 1.

Model Performance for different Datasets and Image Enhancements
The performance improvement on the application of perspective transformation was tested using the drone face dataset and an accuracy of 99.5% was obtained after filtering out very low-resolution images from the dataset. Earlier work has reported 89.9% accuracy in the droneface dataset (Bustos et al., 2018).
For a collection of 5000 images of 100 × 100 resolution, the face landmark-based symmetry calculation, histogram equalization and contrast stretching took around 15000 ms, 22000 ms and 900 ms respectively on an i7 processor with 32 GB of memory, including the time required for storing the images created. A comparison of image preprocessing time for the various methods used in this study with the multimodal method in (Koo et al., 2018) is included in Table 2.

Table 2: Comparison of image preprocessing time
Method | Preprocessing time (ms per image)
Multimodal method (Koo et al., 2018) | 98
Face landmark-based symmetry calculation | 3
Contrast stretching | 1
Histogram equalization | 5

Fig. 10 (a), (b): Performance comparison of models developed using faces extracted with MTCNN face detection method and proposed landmark-based face image cropping method

Discussion
As per these results, an accuracy improvement of at least one percent is obtained when the proposed landmark-based face image detection and cropping is adopted instead of the normal Multi-Tasking CNN-based algorithm.
It is also clear from Table 1 that the face image enhancement methods, histogram equalization and contrast stretching, improved the accuracy by around three percent compared to the normal images.
The graphs in Fig. 11a, 11b, 11c and Fig. 12a, 12b, 12c, 12d show the comparison of accuracy curves of different datasets for five different image enhancement methods. To avoid cluttering the images, the comparison is done by splitting the graphs into two sets. The comparison of normal, landmark-based and symmetrical images is included in Fig. 11. The comparison of models developed using normal images with those using histogram equalization and contrast stretching is included in Fig. 12.
Consistent performance improvement for the proposed methods in different datasets is clear in the graphs plotted. The accuracy curves for different datasets are not similar as the number of images and features of the images in the different datasets vary widely and the number of epochs for which the model training was carried out to get maximum performance also is different.
In addition to these results, applying perspective transformation on the drone face dataset gave an improvement of approximately two percent compared to the normal dataset.
To further verify how the proposed algorithm performs in comparison with existing algorithms, a comparison study was performed. Out of the six datasets used, three are benchmark datasets for face verification tasks. The performance comparison of our algorithm with methods from other studies that used the same datasets is shown in Tables 3 and 4.

Table 3: Performance comparison on the dataset of (Wong et al., 2011)
Method | Accuracy (%)
S+V Model (Mokhayeri and Granger, 2019) | 89.20
CCM-CNN (Parchami et al., 2017) | 98

Table 4: Performance comparison with existing methods
Method | Accuracy (%)
Parkhi et al. (2015) | 97.40
Wang et al. (2018a) | 97.60
Deng et al. (2019) | 98.02
Wang et al. (2018a; Hu et al., 2018) | 98.12
Face landmark-based symmetry | 97.00
Contrast stretching | 99.00
Histogram equalization | 99.90

One limitation of the algorithm is that, while applying face landmark-based face image extraction and landmark symmetry-based selection of face images, a large number of images are required as input. Otherwise, face images that do not match the required criteria will be filtered out by the algorithm and the dataset size will be reduced, which may in turn reduce the performance of the algorithm.

Conclusion and Future Scope
The algorithm for person re-identification using face verification has been developed and tested using five different datasets. The performance of the face verification model is enhanced when the proposed sequence of face detection, face selection, alignment, augmentation and enhancement methods is used. The landmark-based symmetry calculation algorithm is highly efficient in the face selection process for selecting the most suitable face images for model development. The accuracy improvement obtained by applying landmark points symmetry-based face image selection was two percent. It is observed that applying histogram equalization and contrast stretching to the face images also improved the performance of the face verification model by around three percent. The important method used in the face alignment stage is the application of perspective transformation. The improvement obtained by applying perspective transformation was also two percent. The different possibilities to improve face orientation, alignment and quality were tried rigorously and the results were presented.
The proposed algorithm with its novel steps like face symmetry-based face image selection, preprocessing using perspective transformation, contrast stretching and histogram equalization helps the scientific community in implementing accurate face verification models. These concepts can also be extended to other areas like emotion recognition and activity detection where face image-based model construction is required. It is also planned to explore more preprocessing methods and the selection of better models and their parameters in future work.