Fuzzy ARTMAP Approach for Arabic Writer Identification using Novel Features Fusion

Arabic writer identification and its associated tasks remain open research problems owing to the huge variety of Arabic writing styles. This research presents a fusion of statistical features, extracted from fragments of Arabic handwriting samples, to identify the writer using a fuzzy ARTMAP classifier. Fuzzy ARTMAP is a supervised neural model that is especially suited to classification problems. It is fast to train and needs fewer training epochs to "learn" from input data for generalization. The extracted features are fed to fuzzy ARTMAP for training and testing. Fuzzy ARTMAP is employed for the first time, along with a novel fusion of statistical features, for Arabic writer identification. The entire IFN/ENIT database is used in the experiments: 75% of the handwritten Arabic words from 411 writers, selected at random, are employed for training and the remaining 25% for testing. Several combinations of the extracted features are tested using the fuzzy ARTMAP classifier, and one combination exhibited a promising accuracy of 94.724% for Arabic writer identification on the IFN/ENIT benchmark database.


Introduction
Handwriting is one of the significant distinguishing characteristics of human beings. Writer identification is the process of identifying the author of an unknown script (Saba et al., 2014a; 2014b; 2012; Mahmood et al., 2017; Nodehi et al., 2014). Automatic writer identification systems can be used in courts of law for forensic document analysis, in banks to trace check writers, in medical prescriptions to identify doctors and so on (Zendehdel and Paim, 2014; Rasool and Khan, 2015). Writer identification is an important tool for verifying legal documents, assisting specialists in establishing the writer of an unknown script (Saba et al., 2011a; Muhsin et al., 2014; Saba et al., 2010a; 2010b). Done manually, however, it is time-consuming and error-prone owing to the huge size of the databases involved. Hence, there is a need for a writer identification system that supports specialists with a high degree of accuracy. Such a system could reduce the analyst's intensive effort when comparing unknown samples against tens of thousands of documents (Elsayed et al., 2015; Loe and Liu, 2015; Aljalidi et al., 2016). Arabic writer identification accuracy is still far from mature despite decades of research (Mundher et al., 2014). This is mainly because writing style is unique and no two writers write in the same style unintentionally (Arafat and Saba, 2016). Arabic is considered a large-alphabet language and therefore poses serious difficulties during the Arabic writer identification process (Saba et al., 2011b; Saba et al., 2016). Initially, writer identification systems designed for Latin scripts were applied directly to Arabic writer identification, which resulted in low accuracy (Soleimanizadeh et al., 2015; Younus et al., 2015). Hence, researchers concluded that Arabic writer identification accuracy is far from mature and that continued effort is required from the research community.
This research presents a system for Arabic writer identification that explores the fusion of different statistical features for fuzzy ARTMAP training and testing. The main objective is to measure the similarity between samples in the training set and the test set in order to decide whether two handwritten scripts were written by the same person or by different persons.
The rest of the paper is organized into four main sections: section 2 explores the current state of the art, section 3 presents the proposed approach in depth, section 4 exhibits the results and section 5 concludes the research.

Research Background
The literature is replete with Latin and Chinese writer identification techniques that report high accuracy. However, only a few works address Arabic writer identification, which still needs further investigation to raise accuracy (Ahmad et al., 2014; Harouni et al., 2014; Neamah et al., 2014). All reported writer identification approaches can be categorized into text-dependent and text-independent approaches. Text-dependent techniques employ the same samples in the training and testing phases, while in text-independent techniques the handwritten samples in the test phase differ from those used in training (Waheed et al., 2016). The second category is closer to the real-life problems of forensic document analysis. In either category, feature extraction and selection play a critical role (Mughal et al., 2017a; Muhammad et al., 2017b; c). Sometimes a single feature is enough to identify the writer (Bashardoost et al., 2017). Extracted features can be structural features, statistical features or a fusion of both. Structural features are normally extracted from directional and angular distributions, average height, the width of words etc. Statistical features include texture measurements, grapheme distributions, grey-level statistics and cross-correlation distributions. These features are normally extracted from handwriting fragments and are effectively employed for classification (Hussain et al., 2017; Saba et al., 2014c).
Initially, the authors in (Srihari et al., 2002) worked on 1500 scripts to identify the writer, on the assumption that handwriting is individual and that the writer can therefore be identified from it; they reported an accuracy of 98% for Latin script. Later, different researchers continued the effort of extracting statistical features, structural features or their fusion to identify writers. Siddiqi and Vincent (2007) employed the IFN/ENIT benchmark database for Arabic writer identification using 350 writers. The authors extracted texture features and claimed an accuracy of 88%. Abdi et al. (2009) employed contour-based features for offline text-independent Arabic writer identification. They reported 90% accuracy in an evaluation on 82 Arabic writers taken from the IFN/ENIT benchmark database. Siddiqi and Vincent (2010) also employed a set of texture features for Arabic writer identification and reported an accuracy of 82% using samples from 130 writers in the IFN/ENIT benchmark database. In the same line of research (Abdi and Khemakhem, 2015; Hannad et al., 2015), texture features extracted from Arabic script fragments taken from the IFN/ENIT benchmark database were employed for Arabic writer identification, with an accuracy of 90% on 411 writers and 87% on a 130-writer test case.
The reviewed literature shows that current Arabic writer identification research explores features and their combinations but ignores the role of the classifier, which is also one of the vital components of the whole process. Additionally, accuracy is not yet satisfactory, even though most approaches were tested on only around a hundred writers. The accuracy achievable on a large number of Arabic scripts is still an open question. Accordingly, the current research aims to explore efficient features and their possible combinations with the fuzzy ARTMAP classifier, which is employed here for the first time to identify Arabic writers.

Database
In this study, the IFN/ENIT benchmark database has been used. It provides 2200 forms filled with 26000 Arabic handwritten words from 411 different Arabic writers, scanned at 300 dpi and available in binary format. It is one of the most commonly used Arabic script databases for training and testing Arabic script classification systems (Pechwitz et al., 2002). Each line of script is cropped from each of the forms, and a label corresponding to the script author is assigned to the cropped image. These lines of script are provided to the system without any noise. The images are then transformed into binary form by removing their colour map information. No further thresholding is required; however, the images need to be inverted before they are used by the proposed algorithm, because white areas are treated as objects in binary images. A few sample images from the IFN/ENIT database are presented in Fig. 1.

Proposed Approach
The proposed approach is exhibited in Fig. 2. The main steps involved in Arabic writer identification are described in the sections below.

Pre-Processing
It is evident from the literature that preprocessing of images is a mandatory step in all image processing applications to streamline subsequent processing. Accordingly, a series of preprocessing steps, such as noise removal, is applied to the images using established techniques (Rehman et al., 2009; Muhammad et al., 2017a; Fern et al., 2017; Iftikhar et al., 2017; Mehmood et al., 2018; Saba, 2017).

Segmentation
Further segmentation of the images is required prior to feature extraction (Rahim et al., 2017a; Iqbal et al., 2018; 2017). In this study, since only author identification is required, the actual meaning of the text and knowledge of the characters involved are of lesser importance. Each line of text is cropped into smaller segments, with care taken not to cut through any letter so that the integrity of the author's handwriting is maintained. Once segmentation is completed, unusable fragments of text are removed and the segmented images are resized to 70×70 pixels in order to reduce variations between similar texts. This provides an additional benefit, as certain features require square images. The following steps are used to segment the image:
• Input an image of an Arabic handwritten word
• Divide it into smaller images based on connected component analysis (without cutting through a letter or joined letters)
• Remove images containing small pieces of text
• Resize the images to 70×70 pixels
The segmentation process is shown in Fig. 3-5. Figure 3 shows the first line of text from the author labelled 'ae07' (IFN/ENIT database), while Fig. 4 and 5 demonstrate the segmentation outputs.
In its original form, the image is unusable for the feature extraction step; it therefore needs to be inverted and segmented into smaller chunks without breaking a character or joined characters in two, as exhibited in Fig. 4. Additionally, the images are normalized by resizing them to 70×70 pixels, as shown in Fig. 5.
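The segmentation steps listed above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that every 8-connected component is one fragment, that the input is a binary image with ink stored as 0, and that fragments below a hypothetical `min_area` threshold count as "small pieces of text". The function names and the nearest-neighbour resizing are likewise assumptions.

```python
from collections import deque

def invert(img):
    """Flip a binary image so that ink pixels become 1 (foreground)."""
    return [[1 - p for p in row] for row in img]

def connected_components(img):
    """Return (top, left, bottom, right, area) for each 8-connected blob of 1s."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < h and 0 <= nx < w
                        if inside and img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right, area))
    return boxes

def resize_nearest(patch, size=70):
    """Nearest-neighbour resize of a 2-D list to size x size."""
    h, w = len(patch), len(patch[0])
    return [[patch[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]

def segment(page, min_area=4, size=70):
    """Invert, split into components, drop tiny ones, resize the rest."""
    ink = invert(page)
    fragments = []
    for top, left, bottom, right, area in connected_components(ink):
        if area < min_area:  # hypothetical threshold for unusable fragments
            continue
        patch = [row[left:right + 1] for row in ink[top:bottom + 1]]
        fragments.append(resize_nearest(patch, size))
    return fragments
```

Cropping by bounding box may include pixels from an overlapping neighbour; a fuller version would mask each fragment by its component label.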

Feature Extraction
The literature was reviewed and a considerable number of features were tested to arrive at the most appropriate features for the task at hand (Mughal et al., 2017b; Meethongjan et al., 2013; Lung et al., 2014; Fahad et al., 2018). Features that did not produce acceptable results were discarded. Moreover, combinations of certain features were also considered. As discussed earlier, extracted features are categorized into statistical and structural features. The literature shows that low accuracy rates are reported when structural features are employed for writer identification; additionally, they are computationally demanding owing to complex segmentation and feature extraction procedures. Statistical features, in contrast, are extracted from entire script images or from their fragments and exhibit better accuracy for handwriting recognition, writer identification and writer verification (Fadhil et al., 2016; Al-Turkistani and Saba, 2015).
Along these lines, Bertolini et al. (2013) exhibited high writer identification rates by using a fusion of two statistical features: Local Binary Pattern (LBP) and Local Phase Quantization (LPQ). Hannad et al. (2016) introduced one more statistical feature, Local Ternary Pattern (LTP), in addition to LBP and LPQ. They reported an accuracy rate of 87% for the data of 130 writers taken from the IFN/ENIT database.
In this research, further statistical features are proposed and their fusion is evaluated for Arabic writer identification on a larger set of Arabic writers taken from the IFN/ENIT database. The fusion of features employed is shown in Table 1. Computer-aided systems often require a variety of features to make reliable decisions. As a higher number of features, as well as the high complexity of certain features, may lead to slow response times, measures need to be taken to reduce the computational complexity of the system. In this study, we have used Principal Component Analysis (PCA), which uses linear combinations of the original features to produce new synthetic features so that the dimensionality of the feature set can be reduced. Extracted features are normalized and fed to the classifier. Fuzzy ARTMAP is selected for Arabic writer identification; it was developed primarily to resolve object detection and identification problems and proved ideal for the Arabic writer identification problem at hand.

Fig. 2: Proposed approach. IFN/ENIT benchmark Arabic writer script database → Pre-processing: background and noise removal → Segmentation: script image segmentation into graphemes → Feature extraction, selection and normalized vector set → Writer identification: training and testing fuzzy ARTMAP classifier
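The fusion and normalization step mentioned above might look like the sketch below. The paper does not state which normalization it uses; min-max scaling to [0, 1] is assumed here because fuzzy ARTMAP expects inputs in that range, and the helper names `fuse` and `min_max_normalize` are hypothetical.

```python
def fuse(*feature_sets):
    """Concatenate, sample by sample, the descriptors of several extractors."""
    n_samples = len(feature_sets[0])
    return [sum((list(fs[i]) for fs in feature_sets), [])
            for i in range(n_samples)]

def min_max_normalize(vectors):
    """Scale every feature column to [0, 1]; constant columns map to 0."""
    n_features = len(vectors[0])
    lows = [min(v[k] for v in vectors) for k in range(n_features)]
    highs = [max(v[k] for v in vectors) for k in range(n_features)]
    return [
        [(v[k] - lows[k]) / (highs[k] - lows[k]) if highs[k] > lows[k] else 0.0
         for k in range(n_features)]
        for v in vectors
    ]
```

In practice the minimum and maximum of each column would be taken from the training set only and reused for the test set, so that no information from the test samples leaks into training.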

Local Binary Pattern (LBP)
LBP is a grayscale local texture operator with powerful discrimination and low computational complexity. Among its many desirable properties is its invariance to monotonic grayscale transformations; hence, it has low sensitivity to changes in illumination.
The LBP operator encodes the differences between a pixel x and its symmetric neighbour set of P pixels placed on a circle of radius R (the value of a neighbour is obtained by interpolation if it does not coincide with a pixel). In the current research, P = 8 and R = 1, and LBP with uniform bins is used. The LBP descriptor is extracted from a set of sub-regions obtained by dividing each image into 9×10 equal non-overlapping regions. The descriptors are concatenated to describe the entire image.
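As an illustration of the operator with P = 8 and R = 1, the sketch below computes LBP codes and a uniform-bin histogram. It is a simplification under stated assumptions: the eight immediate grid neighbours are used without interpolating the diagonal ones, the bit ordering is arbitrary, and the 9×10 sub-region split is omitted. All names are hypothetical.

```python
# Clockwise neighbour offsets (dy, dx), starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, r, c):
    """8-bit LBP code: bit i is set when neighbour i >= the centre pixel."""
    center = img[r][c]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[r + dy][c + dx] >= center:
            code |= 1 << bit
    return code

def transitions(code, p=8):
    """Number of 0/1 changes when the bit pattern is read circularly."""
    bits = [(code >> i) & 1 for i in range(p)]
    return sum(bits[i] != bits[(i + 1) % p] for i in range(p))

# Uniform patterns have at most two transitions; each gets its own bin and
# all remaining patterns share one extra bin.
UNIFORM = sorted(code for code in range(256) if transitions(code) <= 2)
BIN_OF = {code: i for i, code in enumerate(UNIFORM)}
NON_UNIFORM_BIN = len(UNIFORM)

def uniform_lbp_histogram(img):
    """Normalized uniform-LBP histogram over the interior pixels of img."""
    hist = [0] * (len(UNIFORM) + 1)
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[BIN_OF.get(lbp_code(img, r, c), NON_UNIFORM_BIN)] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```

To follow the sub-region scheme described above, the histogram would be computed once per region and the resulting vectors concatenated.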

Multi-Dimensional LBP (MD-LBP)
To perform multi-scale LBP, that is, LBP with diverse radii, the LBP features from each radius are concatenated into a single one-dimensional feature vector. Radius values of R = 2, R = 3 and R = 4 are used for text recognition.

Hierarchical Multi-Scale LBP (H-LBP)
In a few cases, patterns whose smaller-radius counterparts are identical are non-identical at larger radii. Hence, the smaller-radius counterparts can be used to classify them further according to their non-identical patterns. In H-LBP, a multiscale histogram is constructed in which the LBP map of the main radius is separated into identical and non-identical groups.

Complete LBP (CLBP)
CLBP is an LBP variant that utilizes both the sign and the magnitude of the difference between the central pixel and the pixels in its neighbourhood (the conventional LBP operator uses only the sign component). CLBP also takes the intensity of the central pixel into account; therefore, the final code is obtained from the combination of three codes: CLBP_S, which encodes the sign component of the difference (i.e., the standard LBP); CLBP_M, which encodes the magnitude component of the difference; and CLBP_C, which reflects the intensity of the central pixel.

Local Phase Quantization (LPQ)
Ojansivu and Heikkila (2008) proposed the Local Phase Quantization (LPQ) operator as a texture descriptor. LPQ is based on the blur invariance property of the Fourier phase spectrum. The local phase information is extracted using the 2-D Short-Term Fourier Transform (STFT), computed over a rectangular neighbourhood at each pixel position of the image. Parameter values of 3, 5, 7 and 9 are used to extract the LPQ features.
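A basic form of the LPQ descriptor can be sketched as below. Several points are assumptions rather than details taken from this paper: the parameter `win` is taken to be the side of the square neighbourhood (3, 5, 7 or 9), the four frequency points are the ones commonly used with LPQ, the bit order is arbitrary, and the optional decorrelation step of the original operator is left out.

```python
import cmath
import math

def lpq_code(img, r, c, win=3):
    """8-bit LPQ code for pixel (r, c) from a win x win neighbourhood."""
    half = win // 2
    a = 1.0 / win  # lowest non-zero frequency for this window size
    freqs = [(a, 0.0), (0.0, a), (a, a), (a, -a)]
    code = 0
    for k, (u, v) in enumerate(freqs):
        acc = 0j
        # Local Fourier coefficient at frequency (u, v).
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                phase = -2.0 * math.pi * (u * dx + v * dy)
                acc += img[r + dy][c + dx] * cmath.exp(1j * phase)
        # Quantize the signs of the real and imaginary parts to two bits.
        if acc.real >= 0:
            code |= 1 << (2 * k)
        if acc.imag >= 0:
            code |= 1 << (2 * k + 1)
    return code

def lpq_histogram(img, win=3):
    """Normalized 256-bin histogram of LPQ codes over the image interior."""
    half = win // 2
    hist = [0] * 256
    for r in range(half, len(img) - half):
        for c in range(half, len(img[0]) - half):
            hist[lpq_code(img, r, c, win)] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```

Calling `lpq_histogram` once for each window size and concatenating the results would give one descriptor per fragment that covers all four parameter values.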

Gabor Wavelets
Dennis Gabor introduced the Gabor wavelets, using complex functions to serve as a basis for Fourier transforms in information theory applications. The main property of the wavelet is that it minimizes the joint uncertainty (the product of the spreads) in the time and frequency domains.
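For reference, this property is usually written as follows. The formula is a standard textbook result and is not taken from this study; the symbols $\sigma$ (width of the Gaussian envelope), $f_0$ (centre frequency), and $\Delta t$, $\Delta f$ (root-mean-square spreads in time and frequency) are introduced here only for illustration.

```latex
g(t) = \exp\!\left(-\frac{t^{2}}{2\sigma^{2}}\right)\exp\!\left(j\,2\pi f_{0}\,t\right),
\qquad
\Delta t\,\Delta f \;\ge\; \frac{1}{4\pi}
```

The Gaussian-modulated complex sinusoid $g(t)$ attains the lower bound with equality, which is the sense in which the Gabor function is optimally localized in both domains.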

Dimensionality Reduction
Dimensionality reduction is a classic strategy for decreasing the number of features while retaining those that play a significant role in classification. Several techniques are reported in the state of the art, including Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Matrix Feature Factorization. Such approaches operate on the existing features and reduce them to the most discriminative ones. In this research, PCA is applied for dimensionality reduction.
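A small sketch of PCA-based reduction is given below. To stay dependency-free it finds the leading principal directions by power iteration with deflation instead of a full eigendecomposition; this is a substitute technique chosen for brevity, not the procedure used in the study. The function name, the starting vector and the iteration count are assumptions.

```python
import math

def pca_project(vectors, n_components=1, iters=200):
    """Project samples onto their first n_components principal directions."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[k] for v in vectors) / n for k in range(d)]
    centered = [[v[k] - means[k] for k in range(d)] for v in vectors]
    # Sample covariance matrix of the centred data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    components = []
    for _ in range(n_components):
        # Deterministic, non-symmetric start; power iteration can stall if
        # the start happens to be orthogonal to the leading direction.
        vec = [float(k + 1) for k in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iters):
            nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        eigval = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
                     for i in range(d))
        components.append(vec)
        # Deflate: remove the direction just found from the covariance.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(row[k] * comp[k] for k in range(d)) for comp in components]
            for row in centered]
```

For feature vectors with hundreds of dimensions, a library eigen-solver would be the practical choice; the sketch only shows what "a linear combination of the original features" means in code.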

Classification
Fuzzy ARTMAP is a neural network architecture developed by Carpenter et al. (1992), based on Adaptive Resonance Theory (ART) and exhibited in Fig. 6. It is a supervised model that uses an incremental learning system with a dynamic structure and learning algorithm. The main feature of fuzzy ARTMAP is that it retains what it has learned from previous data while learning new information from new data. Additionally, it carries out pattern classification tasks without prior knowledge of the distribution of the data samples. It is able to self-organize and self-stabilize the information and to manage the network configuration as input samples are presented. Fuzzy ARTMAP is considered fast even among members of the ARTMAP family owing to the low computational demand of its mapping between inputs and outputs; it also requires less classification time. Its structure is composed of two ART modules, ART_a and ART_b, whose node layers $F_2^a$ and $F_2^b$ represent the input and the respective desired output. The training pairs are $\{(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), \ldots, (X_n, Y_n)\}$, where $X_k$ and $Y_k$ $(k = 1, 2, 3, \ldots, n)$ represent the input and output vectors respectively. The weights of the nodes are adapted automatically during the learning process from inputs to desired outputs. Nodes can be added during training if required. Upon the addition of a node, its weight vector is set to 1 and its centres are initialized to 0.

Fig. 6: Fuzzy ARTMAP architecture (Carpenter et al., 1992)

The entire process of training, learning and matching is exhibited in Fig. 6, and readers are referred to (Charalampidis et al., 2001; Wong et al., 2015) for an in-depth study of fuzzy ARTMAP.
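To make the training and matching cycle concrete, the sketch below implements a simplified fuzzy ARTMAP in which the ART_b module is replaced by a direct class label per category node, a common simplification that is not the full two-module architecture described above. Complement coding, the choice function, the vigilance test and match tracking follow the usual formulation; the class name and the default values of `rho`, `alpha`, `beta` and `epsilon` are assumptions.

```python
class SimplifiedFuzzyARTMAP:
    """Fuzzy ARTMAP with class labels attached directly to category nodes."""

    def __init__(self, rho=0.0, alpha=0.001, beta=1.0, epsilon=1e-4):
        self.rho = rho          # baseline vigilance
        self.alpha = alpha      # choice parameter
        self.beta = beta        # learning rate (1.0 = fast learning)
        self.epsilon = epsilon  # match-tracking increment
        self.weights = []       # one weight vector per category node
        self.labels = []        # class label attached to each node

    @staticmethod
    def _code(a):
        """Complement coding: a -> (a, 1 - a); inputs must lie in [0, 1]."""
        return list(a) + [1.0 - x for x in a]

    def _scores(self, i_vec):
        """(choice value, match value, node index), best choice first."""
        out = []
        for j, w in enumerate(self.weights):
            overlap = sum(min(x, y) for x, y in zip(i_vec, w))
            choice = overlap / (self.alpha + sum(w))
            match = overlap / sum(i_vec)
            out.append((choice, match, j))
        return sorted(out, key=lambda s: -s[0])

    def train_one(self, a, label):
        i_vec = self._code(a)
        rho = self.rho
        for _, match, j in self._scores(i_vec):
            if match < rho:
                continue  # vigilance test failed, try the next node
            if self.labels[j] == label:
                w = self.weights[j]
                self.weights[j] = [self.beta * min(x, y) + (1 - self.beta) * y
                                   for x, y in zip(i_vec, w)]
                return
            rho = match + self.epsilon  # match tracking after a wrong label
        # No suitable node: commit a new one. With fast learning this equals
        # starting from all-ones weights and learning the input once.
        self.weights.append(i_vec)
        self.labels.append(label)

    def fit(self, samples, labels, epochs=1):
        for _ in range(epochs):
            for a, label in zip(samples, labels):
                self.train_one(a, label)
        return self

    def predict(self, a):
        scores = self._scores(self._code(a))
        return self.labels[scores[0][2]] if scores else None
```

A new node is committed only when no existing node both passes vigilance and carries the right label, which is how the network grows during training without overwriting earlier categories.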

Experimental Results
The proposed approach is evaluated on the full set of 411 Arabic writers from the IFN/ENIT database. Following preprocessing, the handwritten scripts are segmented into graphemes, as explained in the previous section. Further segmentation results are exhibited in Fig. 7.
Following segmentation, the prescribed features are extracted from these graphemes and fed to the fuzzy ARTMAP classifier for Arabic writer identification. The results are demonstrated in Fig. 8. Finally, the results for each feature set under observation are summarized in Table 1.
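The evaluation protocol stated in the abstract (a random 75% of the words for training and 25% for testing) and the accuracy figure can be sketched as below. The seed, the function names and the decision to split the pooled word list rather than per writer are assumptions; the paper does not specify them.

```python
import random

def split_75_25(samples, seed=0):
    """Shuffle a copy of the samples and cut it into 75% / 25% parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Percentage of test samples whose predicted writer is correct."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)
```

Fixing the seed makes the split reproducible, which matters when several feature combinations are compared on the same partition.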

Results Analysis
Arabic writer identification is a current research problem. Several techniques have been reported in this regard; however, they differ in the number of features extracted, the database employed, the number of training and testing images, the classifier used and so forth. In this research, therefore, results are compared only with studies conducted on the IFN/ENIT benchmark. Bulacu et al. (2007) experimented with 350 writers from IFN/ENIT and reported an accuracy rate of 88%. Abdi and Khemakhem (2015) reported an accuracy rate of 90% for 411 writers from IFN/ENIT. Hannad et al. (2016) reported an accuracy rate of 94.89% for the data of 411 writers from the IFN/ENIT database. The comparative writer identification rates are presented in

Conclusion
This research has presented an Arabic writer identification approach using the data of 411 Arabic writers from the IFN/ENIT benchmark database. The main contribution of the approach is the investigation of several statistical features extracted from Arabic script graphemes and the evaluation of the effect of their possible combinations on Arabic writer identification accuracy using the fuzzy ARTMAP classifier. In this regard, the fuzzy ARTMAP classifier is investigated for the first time in the state of the art. A variety of extracted statistical features are combined with the fuzzy classifier to improve Arabic writer identification accuracy. The most fruitful combination observed is Gabor-wavelet Zernike, with an achieved accuracy of 94.724%. The apparent weakness of the proposed approach is the cost of extracting features from many graphemes, particularly in the case of a large database. However, PCA is adopted to reduce the dimensions and to speed up the process.