AUTHORSHIP CATEGORIZATION IN EMAIL INVESTIGATIONS USING FISHER’S LINEAR DISCRIMINANT METHOD WITH RADIAL BASIS FUNCTION

Email plays a vital role in fast communication. Many emails containing falsified information that appears realistic are sent to the general public. It is therefore necessary to trace the origin of such emails and the authors or systems responsible for generating them. Representative signatures of emails are generated using lexical and syntactic methods. The signature of each email has a large number of dimensions and is called a vector or pattern. To make subsequent processing convenient, the high-dimensional signature is converted into a 2-dimensional pattern using Fisher’s Linear Discriminant (FLD) function. The 2-dimensional patterns of the email signatures under consideration are used as training data for a Radial Basis Function (RBF) network, which can learn non-linear data. The classification of emails is achieved well owing to the transformation by FLD and the training by RBF. The proposed method helps in building a signature database for accurate categorization in email forensics. The proposed combination of algorithms helps in clustering the different emails generated by an author or by a system.


INTRODUCTION
Previous authorship studies (Zheng et al., 2006; Stamatatos, 2009) use lexical, syntactic (David, 1992; Grieve, 2007; Luyckx and Daelemans, 2008), structural and content-specific features. Lexical features are used to learn about an individual's preferred use of isolated characters and words. Word-based features, including word length distribution, words per sentence and vocabulary richness, were very effective in earlier authorship studies. Syntactic features, called style markers, consist of all-purpose function words such as 'though', 'where' and 'your' and punctuation marks such as '!' and ':'.
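The word-based features named above can be illustrated with a short Python sketch. The function name, the regular expressions used for splitting and the use of a type-token ratio as the measure of vocabulary richness are assumptions made for illustration; they are not taken from the cited studies.

```python
import re

def lexical_features(text):
    """Return a few word-based style features for one email body."""
    # Sentences are approximated by splitting on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {"avg_word_length": 0.0,
                "words_per_sentence": 0.0,
                "vocabulary_richness": 0.0}
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        # Type-token ratio: unique words divided by total words (assumed measure)
        "vocabulary_richness": len(set(words)) / len(words),
    }

features = lexical_features("Where is your report? Send your report today!")
```

For this sample text the sketch counts 8 words in 2 sentences, 6 of them unique.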
The objective of this study is to create a signature for each email using lexical and syntactic methods. The signature represents the uniqueness of each email, and hence the grouping of emails by author is improved. The information in an email is based upon the thoughts of its author. If the message were handwritten, it would be even easier to identify the author. However, the same behavior is reflected in emails created by the same author, even though a handwritten signature is not available in the email. This property helps in identifying the author using the unique signature of the email.

Materials
Table 1 describes the sequence of operations of the proposed system for email authorship categorization. The proposed system combines the FLD and RBF algorithms:
Step 1: Emails are taken from the Enron database.
JCS

Step 2: Tokenize the content of the Enron emails. Create a dictionary of information. The template contains function words such as prepositions, conjunctions, interjections, pronouns, verbs, adverbs and adjectives. This template is used to filter out irrelevant information that will not be used for authorship analysis.
Step 3: A signature is created for each email by extracting features based on lexical characters, lexical words and syntactic properties. The total number of features in each email signature is 322. The details of the features (Farkhund et al., 2008; 2010) are as follows. Lexical analysis based on characters: find the number of words and their numbers of occurrences (frequencies) in an email and in all the emails of the authors. Create a matrix whose number of rows equals the total number of unique words extracted from all emails of all authors and whose number of columns equals the number of authors. Fill the columns with the frequencies of the words for the respective authors. Each column is treated as a signature, which is further transformed into a 2-dimensional pattern. Each pattern is labeled.
Step 4: The emails of each author are taken as a separate class. In this study, the emails of 100 authors are grouped into 100 classes. Fisher's linear discriminant method is used to create two projection vectors, ϕ1 and ϕ2. These projection vectors transform a 322-dimensional signature into a 2-dimensional pattern. Fifty emails are considered for each author, and hence a total of 5000 (50 emails × 100 authors) signatures are obtained.
Step 5: A radial basis function network with 75 centers (other values may also be used) is trained on 20% of the emails of each author (10 emails × 100 authors = 1000 signatures) to obtain the final weights. Many neural networks are available; however, RBF was preferred because it learns non-linear data effectively.

Step 6: The proposed system is tested using the remaining 80% of the 50 emails per author (40 emails × 100 authors = 4000 signatures).
Steps 2 to 4 are applied to obtain the 2-dimensional signatures of the testing emails. Each signature is processed with the final weights obtained in Step 5. The output of the RBF is used to categorize the authorship of an email.
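The data-preparation parts of the steps above (tokenizing, filtering with the template, building the word-by-author frequency matrix, and the 20%/80% split) can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the study: the function-word set is a hypothetical, heavily abbreviated stand-in for the template, tokens that match the template are assumed to be the ones kept, the training emails are assumed to be chosen at random, and all function names are introduced here.

```python
import re
from collections import Counter
import numpy as np

# Hypothetical, abbreviated template; the study's actual word list is not given.
FUNCTION_WORDS = {"though", "where", "your", "the", "in", "to", "and", "of"}

def tokenize(text):
    """Step 2: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def filter_tokens(tokens, template=FUNCTION_WORDS):
    """Step 2: keep only tokens found in the template (assumed direction)."""
    return [t for t in tokens if t in template]

def word_author_matrix(tokens_by_author):
    """Step 3: rows are unique words, columns are authors, cells are counts."""
    authors = sorted(tokens_by_author)
    counts = {a: Counter(t for email in tokens_by_author[a] for t in email)
              for a in authors}
    vocab = sorted(set().union(*counts.values()))
    matrix = np.array([[counts[a][w] for a in authors] for w in vocab],
                      dtype=float)
    return vocab, authors, matrix

def split_per_author(labels, train_fraction=0.2, seed=0):
    """Steps 5 and 6: draw train_fraction of each author's emails for training."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for author in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == author))
        n_train = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Toy emails, invented for illustration only.
emails = {
    "author_a": ["Though the meeting is in Houston, send your notes to Ken.",
                 "Where is your report?"],
    "author_b": ["The contract is in the mail."],
}
tokens_by_author = {a: [filter_tokens(tokenize(e)) for e in texts]
                    for a, texts in emails.items()}
vocab, authors, matrix = word_author_matrix(tokens_by_author)

# 100 authors with 50 emails each, as in the study: 1000 training, 4000 testing.
labels = np.repeat(np.arange(100), 50)
train_idx, test_idx = split_per_author(labels)
```

With the sizes used in the study, the split yields 10 training and 40 testing emails per author.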

Linear Discriminant
Linear Discriminant Analysis (LDA) (Sambasiva et al., 2009) and the related Fisher's linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier. This linear classification can be fine-tuned by applying a radial basis function to it. The mapping of the original vector 'X' onto a new vector 'Y' on a plane is done by a matrix transformation, which is given by Equations (1) and (2), where X is a signature, ϕ1 is a projection vector (also called a discriminant vector) and ϕ2 is another projection vector.
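For reference, the matrix transformation described above can be written in the following form. This is a sketch consistent with the description rather than a reproduction of the study's own typesetting: the symbol A for the transformation matrix is introduced here, and it is assumed that A stacks the two projection vectors as its rows.

```latex
\begin{align}
Y &= A X \\
A &= \begin{bmatrix} \phi_1 & \phi_2 \end{bmatrix}^{T}
\end{align}
```

Under this form, a 322-dimensional signature $X_i$ maps to the 2-dimensional point $(\phi_1^{T} X_i,\ \phi_2^{T} X_i)$.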
The 2-dimensional pattern obtained from the original 322-dimensional vector is denoted by 'yi' and is given by Equation (3). The set of vectors 'yi' is obtained by projecting the original signatures 'X' of the 5000 signature patterns onto the space spanned by ϕ1 and ϕ2.
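A minimal NumPy sketch of this projection is given below. The text does not state how ϕ1 and ϕ2 are computed, so the sketch assumes the standard multi-class formulation, in which the projection vectors are the leading eigenvectors of the product of the (pseudo-)inverse within-class scatter matrix and the between-class scatter matrix. The function name and the synthetic data are illustrative only.

```python
import numpy as np

def fisher_projection_vectors(X, labels, n_components=2):
    """Return a (d, n_components) matrix whose columns play the role of phi1, phi2."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # The pseudo-inverse keeps the sketch usable when S_w is singular.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real

# Synthetic stand-in for the signatures: 3 classes, 20 patterns each, 5 features.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
X = rng.normal(size=(60, 5)) + 4.0 * labels[:, None]

Phi = fisher_projection_vectors(X, labels)  # columns: phi1, phi2
Y = X @ Phi                                 # row i is the 2-D pattern yi
```

In the study's setting, X would hold the 5000 signatures of 322 features each, and Y would hold the corresponding 5000 two-dimensional patterns.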

Radial Basis Function
The radial basis function network is a supervised neural network that uses a distance measure between the input pattern and the centers of the RBF nodes (Pandian and Sadiq, 2011). The sum of the distances is passed through an exponential activation function. This forms the outputs of the hidden nodes in the RBF network. A bias value is appended to the outputs of the nodes in the hidden layer. The outputs of the hidden layer are processed with the assigned labeled values (targets) to obtain the final weights, which are used for testing.
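The training procedure described above can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: a Gaussian form of the exponential activation with a fixed width, centers drawn from the training patterns, one-hot targets, and a least-squares solution for the final weights.

```python
import numpy as np

def rbf_hidden(Y, centers, width=1.0):
    """Hidden-layer outputs: Gaussian of the squared distance to each center."""
    Y = np.asarray(Y, dtype=float)
    sq_dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-sq_dist / (2.0 * width ** 2))
    # Append a bias column to the hidden-layer outputs.
    return np.hstack([G, np.ones((len(Y), 1))])

def train_rbf(Y, targets, centers, width=1.0):
    """Solve for the final weights in the least-squares sense."""
    G = rbf_hidden(Y, centers, width)
    weights, *_ = np.linalg.lstsq(G, targets, rcond=None)
    return weights

def predict_rbf(Y, centers, weights, width=1.0):
    """Apply the final weights to new 2-D patterns."""
    return rbf_hidden(Y, centers, width) @ weights

# Synthetic 2-D patterns for two authors (illustrative data only).
rng = np.random.default_rng(0)
Y_train = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                     rng.normal(3.0, 0.3, size=(20, 2))])
labels = np.repeat([0, 1], 20)
targets = np.eye(2)[labels]  # one-hot targets (assumed encoding)

centers = Y_train[rng.choice(len(Y_train), size=6, replace=False)]
weights = train_rbf(Y_train, targets, centers)
predicted = predict_rbf(Y_train, centers, weights).argmax(axis=1)
```

Because the weights are obtained in a single linear solve, training cost depends mainly on the number of centers and the number of training patterns.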

RESULTS
The plots in Fig. 1-13 show the characteristics of the emails of 100 authors based on the features described in Step 3. An email can be categorized to an author by averaging the signatures of the emails, as shown in Fig. 14. The brown plot shows the difference between successive authors. The average difference is 0.3511, which indicates that the author can be categorized. Figure 15 presents the intersections of the ϕ1 and ϕ2 projection vectors. In Fig. 16, the signatures of 100 authors are projected into 2 dimensions using the ϕ1 and ϕ2 vectors.
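One way to compute an averaged signature per author and the difference between successive authors is sketched below. The exact averaging used for Fig. 14 is not described, so the sketch assumes that each author is summarized by the mean over all features of all of that author's emails, and that the difference is the absolute difference between consecutive authors. The data are invented for illustration.

```python
import numpy as np

def successive_author_difference(signatures, labels):
    """Average each author's signatures, then difference consecutive authors."""
    signatures = np.asarray(signatures, dtype=float)
    labels = np.asarray(labels)
    authors = np.unique(labels)
    # One averaged value per author (assumed summary statistic).
    author_means = np.array([signatures[labels == a].mean() for a in authors])
    diffs = np.abs(np.diff(author_means))
    return author_means, diffs, diffs.mean()

# Illustrative data: 4 emails with 2 features each, belonging to 3 authors.
signatures = [[1, 3], [3, 5], [6, 8], [2, 2]]
labels = [0, 0, 1, 2]
author_means, diffs, average_difference = successive_author_difference(signatures, labels)
```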

DISCUSSION
In Fig. 16, the signatures of very few authors overlap and the signatures of the remaining authors are distinctly visible. To overcome the overlap, RBF is used for correct categorization. The RBF network is trained with the projected signature patterns along with their labels. A final weight matrix is obtained, which is then used to test the untrained emails. Based on the output of the RBF, an email is assigned to an author in the trained authors database; otherwise, it is attributed to an author outside the database.
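The final decision described above can be sketched as a simple rule on the RBF outputs. The text does not say how an email is judged to fall outside the database, so the acceptance threshold used here is a hypothetical parameter, as are the function name and the author identifiers.

```python
import numpy as np

def categorize(outputs, author_ids, threshold=0.5):
    """Return the best-matching author, or None if no output is strong enough."""
    outputs = np.asarray(outputs, dtype=float)
    best = int(outputs.argmax())
    # The threshold is an assumed acceptance level, not a value from the study.
    if outputs[best] >= threshold:
        return author_ids[best]
    return None  # treated as an author outside the trained database

author_ids = ["author_1", "author_2", "author_3"]
known = categorize([0.05, 0.91, 0.10], author_ids)
unknown = categorize([0.20, 0.30, 0.25], author_ids)
```

Here the first email is assigned to the second author, while the second email is left unassigned because none of its outputs reaches the threshold.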

CONCLUSION
This study presents email authorship categorization using Fisher's linear discriminant method combined with a radial basis function network. FLD transforms the 322-dimensional signature pattern into a 2-dimensional pattern. As the signatures of a few authors overlap (Fig. 16), RBF has been used. The advantages of the proposed system are as follows:
• The 322-dimensional signature pattern is reduced to 2 dimensions
• The training of RBF is faster, with less computational complexity
• The input layer of the RBF topology is reduced from 322 nodes to 2
• Since the activation function used in RBF is non-linear, the overlapping problem is solved