Feature Discretization for Individuality Representation in Twins Handwritten Identification

Problem statement: Twin studies are important in forensics and biometrics because twins share similar genetic traits. Handwriting is one of the most common types of forensic evidence, and distinguishing the handwriting of a pair of twins is critical to establishing the reliability of handwriting identification. Writing style can serve as a biometric feature for authenticating individual uniqueness, and these unique features can be used to identify the writer, even between a pair of twins. Existing work in Writer Identification concentrates on feature extraction and classification in order to identify authorship. However, the high similarity within a pair of twins’ handwriting may degrade classification performance. These unique features should therefore be represented in a standard form before classification, which can be achieved with a discretization technique. Approach: We propose a new framework for writer identification of twins’ handwriting and show the effect of the discretization process on handwriting samples from pairs of twins in order to obtain individual identification. Results: An experiment was conducted at the Sulaimania University in Iraq with fourteen pairs of identical twins, where each twin provided 4 handwriting samples. These samples were used to compare the proposed framework with the classic framework. Conclusion: Our experimental results show that, with the new framework, identification of the handwriting of a pair of twins can be improved through the discretization process.


INTRODUCTION
Pattern recognition is a vital part of various engineering and scientific fields such as computer vision, biology and artificial intelligence. Handwriting analysis is an essential branch of pattern recognition because it plays an important role in the courtroom and in forensic document investigations (Sargur et al., 2008; Tan et al., 2010; Srihari et al., 2007), alongside related tasks such as signature identification (Li and Tan, 2009) and iris recognition (Chowhan and Shinde, 2008). A person's handwriting is affected by many personal factors, such as self-training history and physiological and psychological state, which makes distinguishing the handwriting of a pair of twins a study of great importance.
Twins' handwriting identification is a popular area of research in pattern recognition and computer vision because, in some situations, it provides the only means of discovering the real writer of a written text among a group of people (Plamondon and Lorette, 1989; Srihari, 2010).
Previous studies on the biometric identification of twins include the discriminability between the fingerprints of a pair of twins (Jain et al., 2002), DNA analysis (Rubucki et al., 2001), computational discriminability analysis of the fingerprints of twins (Liu and Sargur, 2009) and the use of coefficient values in individual sets as a unique code for a person's face (Rycchilk et al., 2009). These studies show that there are physiological traits that do not change over the years. Handwriting, however, is associated more with a person's attitude and behavioral factors than with psychological traits, which gives researchers the real motive behind studies on handwriting (Sargur et al., 2008).
Among the obstacles mentioned, distinguishing twins' handwriting is considered one of the foremost. Because the individuality of a person's handwriting has long been recognized, many techniques have been established over the years that depend on human knowledge and proficiency to sort and compare people's handwriting (Sargur et al., 2008).
The ability to distinguish the handwriting of a pair of twins is considered an efficient means of identifying each twin through his or her own handwriting. Identifying twins' handwriting, however, is much more complex than identifying that of non-twins, because the resemblance in the writing manner of twins causes strong similarities in the features of their handwriting. The identification task is divided into two stages: the analysis of individual features, and the identification and capture of similar features. The two stages are then computerized and executed via the classical method of pattern recognition in order to obtain results both rapidly and accurately.
Pattern recognition applications are usually executed through feature extraction followed by a classification or learning scheme (Li and Tan, 2009; Liu et al., 2003). The most critical process in pattern recognition is capturing and selecting the desired main features. This is especially important for the identification of twins' handwriting. There are thus two main problems in Writer Identification (WI). The first is how to acquire the main features from different, or very closely similar, handwriting styles as a way of discovering the real writer (Xu et al., 2008; Bensefia et al., 2005; Schlapbach and Bunke, 2004; Yu et al., 2004; Srihari et al., 2002; Shen et al., 2002), and how to obtain meaningful features when comparing the handwriting of a pair of twins. The second is how to assign the features selected from the different handwriting styles, including those of twins, to the classes to which they belong.
Previous studies have developed new approaches or techniques for better feature extraction and to prove the concept of individuality in handwriting. However, the literature shows that most studies have focused on how to extract the individual features of a pair of twins, and not on representing the individual characteristics of the twins' handwriting in a systematic way.
The focus of this study is to propose a new framework and to apply the invariant discretization process to the features in order to represent the individual features of each writer within a pair of twins. Representing the related features systematically makes classification easier and yields better identification results.

Individuality of handwriting:
Handwriting has long been considered one of the means of presenting a person's individualistic nature, and the writer's individuality rests on the hypothesis that each individual has consistent handwriting (Srihari et al., 2007). The relation between characters and words, and the shape or style of writing, is very similar between a pair of twins. However, each twin still has unique features. These unique features can be generalized as the individual's handwriting, even though there can be high similarity within a pair of twins' handwriting. Figure 1 shows an example of the handwriting samples of two pairs of identical twins and the similarity between them.

Individuality representation:
Good features as input to a classifier are important for good identification performance. Extracted features are usually passed directly to the classification task in order to identify a writer. Such features do not portray the individual features of a writer within a pair of twins, because the handwriting of twins has very closely similar features, which leads to small variance between the handwriting of the two twins. An additional process is needed in order to improve authorship invarianceness. This study adopts the Invariant Discretization technique from the previous work in (Muda et al., 2008) and applies it to twins' handwriting.
This process helps to increase the variance between the features in the handwriting of a pair of twins. The new framework therefore adds discretization as an additional procedure prior to the classification task in order to improve the identification performance for twins' handwriting.
The traditional framework is shown in Fig. 2 while the new framework is shown in Fig. 3.

Feature extraction:
Macro-features capture the global characteristics of a writer's individual writing habit and style. They are extracted from the entire document (Sargur et al., 2008; Srihari, 2010) and are used in this work on twins' handwriting. There are thirteen macro-features in total, including the initial eleven features reported in (Srihari et al., 2002; Srihari, 2010). The initial eleven features are: entropy of grey values, binarization threshold, number of black pixels, number of interior contours, number of exterior contours, four contour slope components (number of horizontal, positive, vertical and negative components), average height and average slant. Eight features are used in our experiment: entropy of grey values, binarization threshold, number of black pixels, number of interior contours, number of exterior contours, average height, average slant and average stroke width. We chose macro-features because they capture the global characteristics of the writer's individual writing habit and style (Srihari et al., 2007). More details on the macro-feature algorithm can be found in (Sargur et al., 2008; Srihari et al., 2001).
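As an illustration of what three of these macro-features measure, the following minimal sketch computes the entropy of grey values, a binarization threshold and the number of black pixels for a small grey-scale image. It is not the macro-feature algorithm of the cited works: the function names are hypothetical, and the mean grey value is used as a stand-in threshold because the thresholding method is not specified here.

```python
import math
from collections import Counter


def grey_entropy(image):
    """Shannon entropy (in bits) of the grey-level histogram of the image."""
    pixels = [p for row in image for p in row]
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mean_threshold(image):
    """Stand-in binarization threshold: the mean grey value (an assumption)."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def count_black_pixels(image, threshold):
    """Number of pixels darker than the threshold."""
    return sum(1 for row in image for p in row if p < threshold)


# Toy 4x4 image: 0 is ink, 255 is paper.
image = [
    [255, 255, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 0, 255],
    [255, 255, 255, 255],
]

threshold = mean_threshold(image)
features = [grey_entropy(image), threshold, count_black_pixels(image, threshold)]
print(features)
```

In a full pipeline, one such feature row would be produced per handwriting image and extended with the contour, height, slant and stroke-width features.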

Discretization:
Classification problems usually focus on training instances. The set of training instances is usually categorized into classes, with certain distinct features describing them. Through discretization, continuous features are transformed into a discrete partition with a certain number of intervals. The range of each interval is defined by its lower and upper boundaries.
However, continuous features can be represented in many ways, so two points need to be decided. First, the number of intervals in the discrete partition must be chosen; this number is usually selected arbitrarily by the user. Second, the boundaries of the intervals must be determined. Several methods for discretization are known, such as Equal Information Gain (EIG), Maximum Entropy (ME) and Equal Interval Width (EIW). The recently proposed Invariants Discretization has provided higher identification rates (Muda et al., 2008). Invariants Discretization is a supervised method in which the process starts with a search for the appropriate intervals representing the information about the writer. Each interval has an upper and a lower boundary. The number of intervals for each image is the same as the number of feature vectors. The computation is done separately for each writer, so that the writer's individual uniqueness is preserved and the classification task becomes easier. Feature vectors that fall within the same interval are represented in the same way. Discretization provides several benefits: it gives a non-linear representation (Agre and Peev, 2002), the set of intervals is easier for humans to interpret (Liu et al., 2002) and, with the reduced amount of data, computation is faster and higher accuracy can be achieved (Pongaksorn et al., 2009; Hwang and Li, 2002). The study in (Muda et al., 2008) showed that discretized data can provide better classification than undiscretized data. Its results showed a significant increase in identification accuracy when discretization was applied to the proposed integrated Moment Invariant.
Issues regarding supervised and unsupervised discretization were discussed by Agre and Peev (2002). Two supervised methods for discretization, entropy-based discretization and MVDM-based discretization, were enhanced and successfully increased the accuracy of the classification process. Another study, by Mehta et al. (2004), proposed a correlation-preserving discretization, which is an unsupervised method. The proposed algorithm was applied to multivariate datasets, where it proved effective in predicting missing values.
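To make the two decisions above concrete (the number of intervals, then their boundaries), here is a minimal sketch of Equal Interval Width discretization, the simplest of the methods listed. The function name and the sample values are illustrative assumptions, not taken from the experiment.

```python
def equal_width_discretize(values, n_intervals):
    """Map each continuous value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    if width == 0:
        # All values are identical, so they share a single interval.
        return [0] * len(values)
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_intervals - 1) for v in values]


values = [0.0, 1.0, 2.5, 4.9, 5.0, 10.0]
print(equal_width_discretize(values, 4))  # -> [0, 0, 1, 1, 2, 3]
```

Equal Interval Width ignores the class labels, which is what distinguishes it from a supervised method such as Invariants Discretization.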

Invariant discretization process:
The purpose of Invariant Discretization in this work is to achieve more accurate classification of the writing of a pair of twins. Given the class of each handwriting image, the discretization algorithm can be applied to find an appropriate set of cuts that represents the writer's data. The range from the minimum to the maximum of the data is divided by the number of intervals, which gives each interval or cut its lower and upper approximation. The number of feature vectors for each image defines the number of intervals or cuts. In this way, the number of invariant vectors applied in a moment invariant function is kept at its original amount. A single representation value is defined for each interval or cut, ensuring that the feature vectors falling in the same interval are represented in the same way. The algorithm for the discretization process is as shown below. The intervals or cuts, and the values representing each interval, are calculated based on the writers' classes. This follows the concept in the Writer Identification domain that each twin in every pair has his or her own writing style and individuality, which ensures that the individual's unique characteristics are preserved. Through discretization, the features can be illustrated much more clearly while their characteristics are maintained. Therefore, in order to match the process to the concept, the intervals and representation values are calculated for each writer's class.
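The per-writer procedure described above can be sketched as follows. This is one reading of the description, not the published algorithm of Muda et al. (2008): the function name is hypothetical, the range is pooled over all of a writer's feature values, and the midpoint of each interval is assumed as its representation value because the text does not state how that value is chosen.

```python
def invariant_discretize(samples, labels):
    """Replace each feature value by the representation value of its interval.

    samples: list of feature rows (one row per handwriting image)
    labels:  writer class of each row
    """
    # Group the rows by writer so that cuts are computed per class.
    by_writer = {}
    for row, label in zip(samples, labels):
        by_writer.setdefault(label, []).append(row)

    # For each writer: range of the data and width of each interval.
    cuts = {}
    for label, rows in by_writer.items():
        values = [v for row in rows for v in row]
        lo, hi = min(values), max(values)
        n_intervals = len(rows[0])  # one interval per feature column
        cuts[label] = (lo, (hi - lo) / n_intervals, n_intervals)

    discretized = []
    for row, label in zip(samples, labels):
        lo, width, n_intervals = cuts[label]
        new_row = []
        for v in row:
            if width == 0:
                new_row.append(lo)  # degenerate case: constant data
                continue
            idx = min(int((v - lo) / width), n_intervals - 1)
            # Assumed representation value: midpoint of the interval.
            new_row.append(lo + (idx + 0.5) * width)
        discretized.append(new_row)
    return discretized


# Two images per twin, two features per image (toy values).
samples = [[0.0, 4.0], [1.0, 3.0], [10.0, 14.0], [11.0, 13.0]]
labels = ["a", "a", "b", "b"]
print(invariant_discretize(samples, labels))
# -> [[1.0, 3.0], [1.0, 3.0], [11.0, 13.0], [11.0, 13.0]]
```

In the toy output, the two images of each writer collapse to identical rows while the rows of the two writers stay apart, which is the effect the process aims for: smaller variation within a writer and a clearer separation between writers.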

Algorithm of discretization
Feature invariant vectors are transformed into discretized feature vectors, as shown in the examples in Fig. 4 and 7. Figure 4 shows the data for the handwriting of a pair of twins before the discretization process. The eight columns represent the extracted feature vectors, while the last column shows the class label of the writer. Each row of eight invariant vectors represents a single image from one writer. Figures 5 and 6 show the continuation of the discretization process applied to the data in Fig. 4, using the data of twins a and b of pair 1 as an example. The discretized feature data is shown in Fig. 7.
The discretized feature data in Fig. 7 shows that the data representation illustrates the characteristics of each twin. This correctly represents the concept of individuality for every pair of twins as stated in the WI domain. The identification performance is then analyzed by running the identification process on the discretized feature data.

RESULTS
The new framework for twins' handwriting identification was successfully implemented. The identification of the handwriting of a pair of twins can be improved through the discretization process (Fig. 8). An experiment was conducted to compare the identification accuracy for discretized and undiscretized data.
The accuracy percentages are shown in Table 1. Feature extraction is done with the macro-features algorithm (Sargur et al., 2008; Srihari et al., 2001). Techniques provided by the Rosetta toolkit, including Holte 1R classification, the Genetic algorithm and the Exhaustive algorithm, are used on the discretized and undiscretized data for identification (Ohrn and Komorowski, 1997). The extracted features are the undiscretized data, while the extracted features that have gone through the discretization process make up the discretized data. The experiment uses handwriting samples from fourteen pairs of twins, where each twin provides four samples. Table 1 summarizes the experimental results for three splits: 78 training and 33 testing samples (70% training, 30% testing), 67 training and 44 testing samples (60% training, 40% testing), and 56 training and 55 testing samples (50% training, 50% testing). The results show that the overall identification rate (Average Accuracy (%)) with discretized data is very good.
The overall identification accuracy for each training and testing dataset is calculated from the confusion matrix, as in (Ohrn and Komorowski, 1997).
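The three split sizes quoted above each sum to 111 samples. The short sketch below reproduces them, assuming the training count is the training percentage of 111 rounded up; the rounding rule and the function name are assumptions, since the text does not say how the splits were drawn.

```python
def split_sizes(total, train_percent):
    """Return (training, testing) counts, rounding the training count up."""
    train = -(-total * train_percent // 100)  # integer ceiling division
    return train, total - train


for percent in (70, 60, 50):
    print(percent, split_sizes(111, percent))
```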
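As a reminder of how accuracy follows from a confusion matrix, the sketch below counts the predictions for each pair of true and predicted writers and divides the number of correct ones (the diagonal) by the total. The labels are toy values and the function names are hypothetical; this does not reproduce the Rosetta toolkit's output.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true writers, columns are predicted writers."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly identified writers)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


true_labels = ["a", "a", "b", "b", "b"]
predicted = ["a", "b", "b", "b", "a"]
cm = confusion_matrix(true_labels, predicted, ["a", "b"])
print(cm)            # -> [[1, 1], [1, 2]]
print(accuracy(cm))  # -> 0.6
```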

DISCUSSION
In this study, we have introduced a new framework for twins' handwriting identification, which was successfully implemented with the discretization algorithm. The experiment, which used the macro-features algorithm for feature extraction, shows that discretized data produced more accurate results for all tested classifiers. The discretization process represents the features of the data systematically, which helps to illustrate the individuality of each twin's handwriting more clearly. This study focuses on showing that identification using discretized data improves the performance of identifying the real writer within a pair of twins compared to using undiscretized data.

CONCLUSION
In this study, a new framework for identifying twins' handwriting is proposed, and we have shown the effects of the discretization process on the handwriting samples of pairs of twins. An experiment has been successfully conducted using the proposed framework. The individual features in the twins' handwriting can be systematically represented with the invariants discretization algorithm. The extracted features are passed through the discretization process to mine the authorship of the writer at a finer granularity. Authorship identification becomes easier because similarity errors are reduced. The results reveal that, with the invariant discretization technique, the accuracy of twins' handwriting identification is significantly improved, and the overall classification accuracy is good compared to undiscretized data.