ONLINE ARABIC HANDWRITTEN CHARACTER RECOGNITION BASED ON A RULE BASED APPROACH

Handwriting recognition is a very challenging problem. Much work has been done on the recognition of Latin characters, but limited work has been done on recognizing Arabic characters. Most previous work on Arabic handwriting recognition focused on offline script, and little has addressed the online case. The main theme of this study is on-line handwritten Arabic character recognition. A successful handwritten Arabic character recognition system improves interactivity between humans and computers, and it cannot be achieved without suitable feature extraction and classification methods. The foremost contribution of this study is a rule based production method that recognizes Arabic characters using the proposed hybrid of Edge Direction Matrixes and geometrical feature extraction. In addition, horizontal and vertical projection profiles and a Laplacian filter were used to identify the features of the characters. The online handwriting recognition system was trained and tested on our own dataset, using 504 characters from different writers for training and 336 characters from different writers for testing. The proposed method was evaluated against state of the art methods in the classification phase. The results show that the proposed method gives a competitive recognition rate of 97.6% for the character category. The proposed approach succeeded in providing a high recognition rate by matching characters based on their shape and edges. The results show that the proposed method obtains a competitive result compared with state of the art methods.


INTRODUCTION
Automatic recognition of handwritten characters or text can be classified into two approaches, and the choice between them depends largely on the application. The earliest systems recognized text originally written on paper. In such systems, the sheets are digitized into two dimensional images. Features for recognition are first enhanced and then extracted from the bitmap images by digital image processing. This type of recognition is called the offline recognition approach. In the online handwriting recognition approach, the user writes on a digital device using a special stylus, and the system samples and records the point sequence as it is being written. Therefore, online handwriting samples contain additional temporal data, which is not present in offline sampled data (Mezghani et al., 2003; Bakhtiari-haftlang, 2007).
Online handwriting recognition remains an active area of research. This is due to three main challenges: low recognition rates, high recognition times and a lack of comprehensive databases that represent the variation and variability exhibited in handwritten data. Recognition of Arabic characters is a particularly problematic process. The main problem arises when dealing with Arabic characters written by different persons, who represent the same character differently in terms of size and shape. This variation is due to the individuality of the person writing the script, as well as the mood and situation of the writer (Al-Taani and Al-Haj, 2010). The existence of 'dots' in Arabic is the other problem that adds to the difficulty of the recognition process: single, double or triple dots can be placed above or below the letter body. It is common for Arabic letters to have the same body and differ only in their dots, which therefore help to identify them (Fig. 1). It is thus vital to recognize all the component edges and dots in a character. This study focuses on the recognition of isolated Arabic characters by identifying the features and dots in three main steps, namely pre-processing, feature extraction and classification (Fig. 2), to produce a successful online recognition system for isolated Arabic handwritten characters.

Related Works
For the past few decades, intensive research has been done to solve the problem of Arabic character recognition, and various approaches have been proposed. Khorsheed (2003) proposed a technique for on-line handwritten Arabic script recognition based on Hidden Markov models and structural features. El-Sheikh and Guindi (1990) and El-Sheikh and El-Taweel (1988) introduced two algorithms for the recognition of Arabic handwritten characters and cursive words. The former assumes that characters result from a reliable segmentation phase, so their positions are known a priori. The shapes of the characters are called initial, medial, final and isolated. These are further categorized into four subsets based on the number of strokes in the character. Amin (2000) and Amin and Singh (1998) presented a technique for recognizing Arabic words and Chinese characters with the C4.5 machine learning system. This technique has the following steps: digitization, pre-processing, feature extraction and classification. Based on the recognition-based segmentation technique, an Arabic OCR system was proposed by Cheung et al. (2001). This technique eliminates the classical segmentation problems and has a feedback loop to manage the grouping of character fragments for recognition. A mapping for the recognition of on-line handwritten characters was proposed by Khorsheed (2003). This mapping creates the same output pattern regardless of the orientation, position and size of the input pattern. Later in the same year Mezghani et al. (2003) introduced a method for on-line Arabic character recognition. This method is based on Kohonen maps and their corresponding confusion matrices, which serve to prune the maps of error-causing nodes and then to combine them. Alnsour and Alzoubady (2006) proposed a recognition system for handwritten Arabic characters using a neural network classifier. This system was trained on 600 images and tested on 250 images. The classification rate for the system reached 90%. Abdullah et al. (2009) presented an automatic license plate detection system based on image processing and clustering. Enhanced geometrical feature topological analysis was used as the feature extraction technique, while a support vector machine was applied as the classification technique.
Recently Al-Taani and Al-Haj (2010) presented an approach for on-line Arabic handwritten character recognition. This approach utilizes structural features and decision tree learning techniques and has three phases. First, the user writes the character in a dedicated window on the screen, and the coordinates of the pixels forming the character are acquired and saved in a special array. Second, a 5×5 bounding box is drawn around the character and five features are extracted from the character. Third, these features are used to recognize the character with a decision tree learning technique. This approach was tested on a set of 1400 characters written by ten users, each of whom wrote the 28 Arabic characters five times in order, and it achieved a recognition rate of about 75%. Omer and Ma (2010) proposed an approach for online Arabic handwriting character recognition. This approach is based on a decision tree and a matching algorithm that learn the stroke direction of the Arabic character. In this approach a dataset of collected handwriting samples is used as a training set, and the method is tested on a set of 140 characters written by five users, each of whom wrote the 28 characters in random order. The result achieved with this approach was about 97%.

The Arabic script
The distinctiveness of Arabic script lies in its mode of writing: it is written cursively from right to left.

The cursive nature allows some words to be written in a single stroke without lifting the pen from the writing surface. Most of the letters within a word or "sub word" are linked to each other. But at the same time a sub word cannot be attached to another letter in the same word. The Arabic alphabet consists of 28 characters. Each character has two or four different shapes. The form of the character depends on its location within a word, as shown in Table 1. The four positional forms are isolated, initial, medial and final. Letters that have no initial or medial shape are not linked to the following letter, and hence a sub word is formed (Bakhtiari-haftlang, 2007). In this study we focused on the isolated character shape.

The OCR Structure
The proposed system consists of the following steps:

Pre-Processing
To recognize Arabic characters, pre-processing must be done before the recognition stage. In this study the pre-processing stage involves computing the vertical and horizontal projection profiles and applying a Laplacian filter.

Horizontal and Vertical Projection Profile
Projection profiles have been widely used in line and word detection (Ha et al., 1995). The horizontal projection profile is computed to determine the features that in turn determine the shape of the character and the dots above or below it (Fig. 3). The proposed algorithm uses horizontal and vertical projection profiles from different regions of the character to extract features:
• Horizontal profile: sum of black pixels perpendicular to the x axis
• Vertical profile: sum of black pixels perpendicular to the y axis
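The two profile definitions above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes a binary image in which 1 marks a black (ink) pixel, and it reads "perpendicular to the x axis" as one sum per column and "perpendicular to the y axis" as one sum per row. The function name `projection_profiles` and the toy image are hypothetical.

```python
import numpy as np

def projection_profiles(binary_img):
    """Return (horizontal_profile, vertical_profile) of a binary image.

    Assumption: 1 = black (ink) pixel, 0 = background.
    horizontal: black pixels summed along lines perpendicular to the x axis
                (one value per column).
    vertical:   black pixels summed along lines perpendicular to the y axis
                (one value per row).
    """
    img = np.asarray(binary_img, dtype=int)
    horizontal = img.sum(axis=0)  # one count per x position
    vertical = img.sum(axis=1)    # one count per y position
    return horizontal, vertical

# Hypothetical 5x5 character: a small body with a single dot above it.
toy = np.array([
    [0, 0, 1, 0, 0],  # dot
    [0, 0, 0, 0, 0],  # empty row between dot and body
    [0, 1, 1, 1, 0],  # body
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
])
h_profile, v_profile = projection_profiles(toy)
```

In this toy case the zero entry in the row-wise profile separates the dot from the letter body, which is the kind of cue the projection profiles are meant to expose.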

Laplacian Filter
A Laplacian filter forms another basis for edge detection methods. It computes the second derivatives of an image, which measure the rate at which the first derivatives change. This helps to determine the change in adjacent pixel values at an edge. Kernels of Laplacian filters usually contain negative values in a cross pattern (similar to a plus sign) centred within the array. The corners are either zero or positive values. The centre value can be either negative or positive. In this research we applied a Laplacian filter with a 3×3 kernel matrix (Fig. 4), because it is a powerful technique for detecting edges in all directions and it is also effective against salt and pepper noise (Bataineh et al., 2011). Figure 4a shows the original character and Fig. 4b shows the character after applying the Laplacian filter, which is the final result of pre-processing.
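As an illustration of the filtering step, the sketch below applies a 3×3 Laplacian kernel with NumPy only. The kernel values are an assumption: a common 4-neighbour Laplacian with a negative cross, zero corners and a positive centre, which fits the description above but is not necessarily the kernel of Fig. 4. The names `LAPLACIAN_3X3` and `laplacian_filter` are hypothetical.

```python
import numpy as np

# Assumed kernel (not taken from Fig. 4): negative cross, zero corners,
# positive centre.
LAPLACIAN_3X3 = np.array([[ 0, -1,  0],
                          [-1,  4, -1],
                          [ 0, -1,  0]])

def laplacian_filter(img, kernel=LAPLACIAN_3X3):
    """Apply a 3x3 kernel to a 2-D image, replicating the border pixels."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    rows, cols = img.shape
    out = np.zeros_like(img)
    # Accumulate each kernel weight times the correspondingly shifted image.
    # The kernel is symmetric, so correlation and convolution coincide.
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out

# A filled 3x3 square: the response is zero inside and non-zero on the border.
square = np.zeros((5, 5))
square[1:4, 1:4] = 1
edges = laplacian_filter(square)
```

The response vanishes in flat regions and is non-zero only where neighbouring pixel values change, so the outline of the square survives while its interior does not.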

Feature Extraction
In this study two types of features were combined: the Edge Direction Matrixes (EDMS) features and the geometrical features.

EDMS Features
To extract the features of an Arabic character, a statistical analysis technique based on Edge Direction Matrixes (EDMS) is used (Bataineh et al., 2011). An eight-neighbour kernel matrix is applied, and each pixel is associated with its two neighbouring pixels. The method works from two perspectives: the first order relationship (EDM1) and the second order relationship (EDM2). In the first order relationship, we first create a 3×3 Edge Direction Matrix (EDM1). Each cell in EDM1 corresponds to a direction between 0 and 315 degrees. Second, we determine the relationship of the scoped pixel in the edge image Iedge(x, y) by counting the number of occurrences of each value in EDM1. In the second order relationship, each pixel is represented by one relationship only. We first create a 3×3 Edge Direction Matrix (EDM2). Second, we determine the importance of the relationships for Iedge(x, y) by sorting the values in EDM1 in descending order. We then take the most important relationship of the scoped pixel in Iedge(x, y) and count the number of occurrences of each value in EDM2. From EDM1 and EDM2, several feature values are derived. Some features are summarized by calculating their homogeneity and edge regularity as follows:

Homogeneity
This feature represents the percentage of each direction relative to all available directions in the edge character as follows:
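A minimal sketch of a first order edge direction matrix and of homogeneity as a share of all directions is given here for illustration. Several points are assumptions rather than details confirmed by the text: edge pixels are the non-zero pixels of the edge image, the centre cell counts the scoped pixels, the eight outer cells count edge neighbours at 0 to 315 degrees in 45 degree steps, and homogeneity is computed over those eight outer cells. The names `ANGLE_TO_OFFSET`, `edm1` and `homogeneity` are hypothetical.

```python
import numpy as np

# Assumed layout: angle in degrees -> (row offset, column offset) from the
# scoped pixel, with image rows growing downward.
ANGLE_TO_OFFSET = {
    0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1),
    180: (0, -1), 225: (1, -1), 270: (1, 0), 315: (1, 1),
}

def edm1(edge_img):
    """Count, over all edge pixels, how often an edge neighbour lies in each direction."""
    edge = np.asarray(edge_img).astype(bool)
    rows, cols = edge.shape
    matrix = np.zeros((3, 3), dtype=int)
    for y, x in zip(*np.nonzero(edge)):
        matrix[1, 1] += 1  # the scoped pixel itself
        for dy, dx in ANGLE_TO_OFFSET.values():
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols and edge[ny, nx]:
                matrix[1 + dy, 1 + dx] += 1
    return matrix

def homogeneity(matrix):
    """Share of each direction among all counted directions (centre cell excluded)."""
    counts = {a: int(matrix[1 + dy, 1 + dx]) for a, (dy, dx) in ANGLE_TO_OFFSET.items()}
    total = sum(counts.values())
    return {a: (c / total if total else 0.0) for a, c in counts.items()}

# Hypothetical edge image: a horizontal stroke of three pixels.
stroke = np.array([[0, 0, 0],
                   [1, 1, 1],
                   [0, 0, 0]])
m = edm1(stroke)
shares = homogeneity(m)
```

For the horizontal stroke, all neighbour relationships fall on the 0 and 180 degree cells, so those two directions share the homogeneity equally and every other direction is zero.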

Geometrical and Structural Features
To extract the features of an Arabic character, geometrical features were used in addition to the EDMS features. Five features were applied: the width and height of the horizontal and vertical base, the width of the dot, the number of occurrences in the (H, V) projection and a comparison between the two parts of the projection profile (Fig. 5).
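Two of these features can be illustrated on a projection profile. The sketch below is one possible reading, not the authors' definition: it assumes that the "number of occurrences" is the number of separate non-zero runs in a profile, and that the "comparison between the two parts" compares the sums of the first and second halves. The function names and the sample profile are hypothetical.

```python
import numpy as np

def count_runs(profile):
    """Count maximal runs of non-zero values in a 1-D projection profile."""
    nonzero = np.asarray(profile) > 0
    # A run starts wherever a non-zero value follows a zero (or the start).
    previous = np.concatenate(([False], nonzero[:-1]))
    return int((nonzero & ~previous).sum())

def half_sums(profile):
    """Return the sums of the first and second halves of a profile."""
    p = np.asarray(profile)
    mid = len(p) // 2
    return int(p[:mid].sum()), int(p[mid:].sum())

# Hypothetical profile: a dot (1), a gap (0), then the letter body (3, 2).
profile = [1, 0, 3, 2, 0]
runs = count_runs(profile)
first_half, second_half = half_sums(profile)
```

Under these assumptions, two runs suggest a dot plus a body, and the larger second-half sum indicates that most of the ink lies on that side of the profile.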

Normalization
Normalization is a process that changes the range of pixel intensity values. In this research, normalization is applied to the dataset before feeding the input features into the NN and decision tree classifiers. The normalization is calculated with the following equation:

$$V = \frac{\mathit{value} - \mathit{Min}}{\mathit{Max} - \mathit{Min}}$$

where V is the output value, value is the current input value and Max and Min are the maximum and minimum values in the range of all input values, respectively.
In the proposed rule based method, normalization was not used, because the method can discriminate each feature efficiently without it and produces a better classification performance.
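For the NN and decision tree inputs, the normalization step can be sketched as follows, assuming it is the usual min-max scaling of each feature to the range 0 to 1 using the Max and Min of all input values. The function name `min_max_normalize` and the guard for constant inputs are additions for illustration.

```python
import numpy as np

def min_max_normalize(values):
    """Scale values to [0, 1] with (value - Min) / (Max - Min)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Assumption: a constant feature maps to 0 to avoid division by zero.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Hypothetical raw feature values.
scaled = min_max_normalize([2, 4, 6])
```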

The Proposed Production Rule
The rule based approach is the simplest way to express knowledge, but it is tedious and costly, because each piece of knowledge, carrying important information, must be expressed by a human expert. Such a rule is also called a production rule (or IF-THEN rule or, simply, rule) and is the basic structure of most expert systems. In this study the necessary conditions are introduced by an "IF" keyword and the consequences, if those conditions are met, by a "THEN" keyword; compound conditions use the word "and" to join two conditions in the same rule. In a handwritten character, the feature information can be represented by production rules. In this research 28 rules were built, one for each character, and each rule matches its character based on EDMS and geometrical features.
The proposed rule based approach was built on the edge regularity, homogeneity and geometrical features. Figure 6 presents the connecting states of the homogeneity feature representation in the proposed rule based approach for each character. Table 2 presents the decision table of the proposed rules for the 28 characters, without reflection between them. In this table the conditions were built on nine EDMS features, covering edge regularity and homogeneity, and five geometrical features: the width and height of the horizontal and vertical base, the width of the dot, the number of occurrences in the (H, V) projection and a comparison between the two parts of the projection profile. This table also shows which features are used by each character.
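The IF ... and ... THEN structure described above can be sketched as a small rule matcher. Everything specific in this example is hypothetical: the feature names, the value ranges and the labels `char_A` and `char_B` are placeholders and are not taken from Table 2. Each rule joins its conditions with "and", and the first rule whose conditions all hold gives the recognized character.

```python
def classify(features, rules):
    """Return the label of the first rule whose conditions all hold, else None.

    features: dict mapping feature name -> value
    rules:    list of (label, {feature name: (low, high)}) pairs; the
              conditions of one rule are joined with "and".
    """
    for label, conditions in rules:
        if all(low <= features[name] <= high
               for name, (low, high) in conditions.items()):
            return label  # THEN part of the matching rule
    return None

# Hypothetical rules: IF dot_width is 0 and homogeneity_90 >= 0.4 THEN char_A, etc.
RULES = [
    ("char_A", {"dot_width": (0, 0), "homogeneity_90": (0.4, 1.0)}),
    ("char_B", {"dot_width": (1, 5), "homogeneity_0": (0.3, 1.0)}),
]

sample = {"dot_width": 3, "homogeneity_90": 0.1, "homogeneity_0": 0.6}
predicted = classify(sample, RULES)
```

Because the rules operate on raw feature ranges, a matcher of this kind needs no prior scaling of the features; a sample that satisfies no rule is left unclassified in this sketch.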
To obtain a reliable and accurate comparison, three classification techniques were selected: production rule, NN and decision tree. All of these techniques are applied to the same features and dataset.

RESULTS
In these experiments the performance of the proposed rule based technique was evaluated by comparing it with other classification methods that have been used in OCR, namely the Neural Network (NN) and decision tree classifiers. Based on the literature, the neural network and decision tree are among the best classification methods that have been applied to handwriting recognition. The NN was configured with 14 input nodes, 20 hidden nodes and 28 output nodes. For the decision tree, the J48 algorithm was used with its default parameters, -C = 0.25 and -M = 2.
The dataset was split into training and testing datasets, with 60% used for training and 40% for testing. Based on the experimental results, the NN obtains the best performance with a 98.8% recognition rate. For the decision tree, the best rate obtained among the many experiments was 97.5%. As shown in Tables 3 and 5 and Fig. 7, the proposed rule based method achieves a high performance with a recognition rate of about 97.6%. The proposed method therefore obtained a competitive performance with the NN and decision tree methods in all experiments.
Table 4 presents the t-test for the three classifiers applied in this study. The t-test assesses whether the means of two groups are statistically different from each other. If there is less than a 0.05 chance of getting the observed differences by chance (in most social research, the "rule of thumb" is to set the alpha level at 0.05), we reject the null hypothesis and say we found a statistically significant difference between the two groups. In this study the t-test value of the proposed method against both DT and NN is less than 0.05. This indicates a statistically significant difference between the proposed method and both DT and NN, although the difference is small enough to be unimportant.
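The 60/40 split can be checked against the character counts reported in the abstract (504 for training and 336 for testing). The short calculation below is only a worked check. The per-class figure assumes the samples are spread evenly over the 28 characters, which the text does not state, and the count of correct test characters is a rounded estimate from the 97.6% rate.

```python
train_count, test_count = 504, 336   # counts reported in the abstract
num_classes = 28                     # Arabic characters

total = train_count + test_count
train_share = train_count / total
test_share = test_count / total

# Assumption: samples are evenly distributed over the 28 characters.
samples_per_class = total / num_classes

# Rounded estimate of correctly recognized test characters at 97.6%.
approx_correct = round(0.976 * test_count)
```

Under these assumptions the dataset holds 840 characters, about 30 per class, and a 97.6% rate corresponds to roughly 328 of the 336 test characters.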
The detailed results for each character of the best experiment are shown in Table 3, which produced a 98.85% accuracy rate. The lowest accuracy was obtained with (س, ذ) at 1.15%. Taking these results into account, the proposed method obtained a competitive accuracy rate with the NN and decision tree for each class.
Based on the experimental results, the proposed method gives a competitive accuracy compared with the NN and decision tree methods when all are applied to the same extracted features. It produced an accuracy rate of about 97%.

DISCUSSION
The main contribution of this research is a new online Arabic handwriting character recognition system based on the proposed feature extraction and rule based methods. The proposed rule based classifier was evaluated by applying the same feature extraction to three classifiers, namely NN, decision tree and the proposed rule based method, where the proposed method obtained a competitive accuracy rate of 97.6%. In comparison, the decision tree obtained 97.5% and the NN obtained 98.8%. Our experimental results show that the proposed method gives high performance.
From the testing process we noticed the following:
• The drawing speed may affect the recognition process if the user draws very quickly
• The accuracy of the system depends on many factors, such as (i) whether there is noise in the test data, (ii) whether the letters are poorly written and (iii) whether they are deliberately written in some strange and unusual way
• The proposed system works only on isolated Arabic letters

CONCLUSION
The main aim of this study was to build a new system that can obtain a high accuracy rate. This was achieved in three main phases. In the pre-processing phase the Arabic characters are pre-processed with common methods, namely vertical and horizontal projection profiles and a Laplacian filter. In the feature extraction phase, the geometrical and EDMS features are applied. In the recognition phase, we applied three classification techniques: the production rule, NN and decision tree. The proposed rule based classifier was evaluated by applying the same feature extraction to all three classifiers, where the proposed method obtained a competitive accuracy rate of about 97%. Our experimental results show that the proposed method gives competitive accuracy.

ACKNOWLEDGMENT
This research is based on two fundamental research grants from the Ministry of Science, Technology and Innovation, Malaysia, entitled "Logo and Text Detection for moving object using vision guided" UKM-GGPM-ICT-119-2010 and "Determining adaptive threshold for image segmentation" UKM-TT-03-FRGS0129-2010. We would also like to thank Dr. Bilal Bataineh, who is a member of the research group.