Recognition of Hand Printed Characters Based on Simple Geometric Features

Problem statement: The use of computers in information processing is one of the main trends in office automation. For more than four decades, information from the outside world has been transferred into computers in the traditional way, by keying in raw data with a keyboard. Most of these data are hand printed and very large in volume; automatic machine recognition of hand printed characters is therefore essential for efficient, high-speed man-machine communication. Approach: Handwriting recognition has always been a tough problem because of the variability, ambiguity and illegibility of handwriting. This study describes a simple approach to offline hand printed character recognition. Results: The cost of computing was reduced considerably by introducing a simple yet effective mechanism that extracts a minimal number of geometric features for recognition. Conclusion: The experiment showed that when hand printed characters are drawn clearly according to specified rules and models, it is possible to perform recognition perfectly.


INTRODUCTION
Machine simulation of human writing is one of the most challenging research areas. It has been the subject of intensive research, especially in unconstrained handwriting recognition. The interest devoted to this field is explained not only by the exciting challenges involved, but also by the huge benefits that a system designed in the context of a commercial application could bring (Morita et al., 2003). Two classes of recognition systems are usually distinguished: online systems (Tappert et al., 1990; Namboodiri and Jain, 2004; Liu et al., 2004; Bahiraie et al., 2009), for which handwriting data are captured during the writing process, making information on the ordering of the strokes available, and offline systems (Steinherz et al., 1999), for which recognition takes place on a static image captured once the writing process is over. With the increase in popularity of portable computing devices such as PDAs and handheld computers, non-keyboard methods of data entry are receiving more attention in the research community and the commercial sector. The large number of symbols in some natural languages (e.g., Kanji contains 4,000 commonly used characters) makes keyboard entry an even more difficult task.
Problem background: Handwriting recognition has always been a tough problem. Recognition of handwritten characters by computer poses serious problems because of the high variability in the character shapes written by individuals. As people tend to adjust their handwriting style to personal preferences, the resulting variability of handwriting styles often makes reading difficult even for humans. The problem becomes even more complicated when the writer is unknown (Powalka, 1995). Moreover, some pairs of characters are ambiguous for both human and machine recognition, for instance U-V, C-G, Q-G, D-O and F-P.
Handwriting recognition has been studied extensively and a number of reviews exist. General methodologies in pattern recognition and image analysis are presented in Mantas (1986). Character recognition is reviewed in (Suen et al., 1980; Govindan and Shivaprasad, 1990; Vinciarelli, 2002; Koerich et al., 2003) for offline recognition and in (Nouboud and Plamondon, 1990; Plamondon and Sargur, 2000) for online recognition.
Numerous techniques for handwriting recognition have been investigated based on the four general approaches to pattern recognition suggested by Jain et al. (2000): template matching, statistical techniques, structural techniques and neural networks. Template matching operations determine the degree of similarity between two vectors (groups of pixels, shapes, or curvatures) in the feature space. Matching techniques can be grouped into three classes: direct matching (Gader et al., 1991), deformable templates and elastic matching (Dimauro et al., 1997) and relaxation matching (Xie and Suk, 1988; Mitoma et al., 2005).
Statistical techniques are concerned with statistical decision functions and a set of optimality criteria, which determine the probability that the observed pattern belongs to a certain class. The statistical scheme has received increasing attention in recent years (Liu et al., 2004). Statistical techniques use concepts from statistical decision theory to establish decision boundaries between pattern classes (Jain et al., 2000). In structural techniques, the characters are represented as unions of structural primitives. It is assumed that the character primitives extracted from handwriting are quantifiable and that the relationships among them can be found. Structural methods can be categorized into two classes: grammatical methods (Shridhar and Badreldin, 1986; Aly and Abuelnasr, 2010) and graphical methods (Kim and Kim, 1999).
From an application point of view, a recognition method can be writer-dependent or writer-independent. Writer-independent recognition is more challenging due to the diversity of writing styles. Writer-dependent recognition, on the other hand, allows stable recognition due to the relative stability of personal writing styles (Liu et al., 2004). The representation schemes of the input pattern and of the model database are of particular importance, since the classification method depends largely on them.

MATERIALS AND METHODS
The approach taken here combines elements from several methods, using both globally and locally computed features to give unique feature descriptors. Initially, local information is computed from low-level features, i.e., end-points and T-points, together with the relative locations of these features within the character. An end-point is defined as a point where a stroke ends, while a T-point is a point where more than two strokes meet. A frame of four quadrants is then mapped onto the thinned image to determine the quadrant location of each feature. For example, character 'A' (Fig. 1) consists of two end-points, one in each of quadrants 3 and 4, and two T-points, one in each of quadrants 2 and 4. It might be argued that the character bounding frame should be divided into a larger number of sectors instead of four quadrants to give a higher degree of uniqueness; however, division into four quadrants was found to be enough to provide good performance with a low degree of computational complexity. After a low-level feature has been identified (as an end-point or a T-point), its quadrant location is determined and this information is added to the feature descriptor.
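The feature extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the skeleton is a list of rows of 0/1 pixels, it uses the crossing number (the count of 0-to-1 transitions around a pixel) as one common way to detect end-points and T-points, and it assumes the quadrant layout "2 1" over "3 4". The names `crossing_number`, `quadrant`, `extract_features` and `T_SKELETON` are hypothetical.

```python
# Sketch: find end-points and T-points on a one-pixel-wide skeleton and
# record the quadrant of each. All names are illustrative assumptions.

# 8-neighbour offsets in clockwise order, starting from the pixel above.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def crossing_number(img, r, c):
    """Count 0 -> 1 transitions while walking once around pixel (r, c)."""
    rows, cols = len(img), len(img[0])
    ring = []
    for dr, dc in OFFSETS:
        rr, cc = r + dr, c + dc
        inside = 0 <= rr < rows and 0 <= cc < cols
        ring.append(img[rr][cc] if inside else 0)  # outside counts as white
    return sum(1 for i in range(8) if ring[i] == 0 and ring[(i + 1) % 8] == 1)

def quadrant(r, c, rows, cols):
    """Assumed frame layout: quadrants 2 | 1 on top, 3 | 4 below."""
    top = r < rows / 2
    right = c >= cols / 2
    if top:
        return 1 if right else 2
    return 4 if right else 3

def extract_features(img):
    """Return (end-point quadrants, T-point quadrants), each sorted."""
    rows, cols = len(img), len(img[0])
    end_points, t_points = [], []
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            cn = crossing_number(img, r, c)
            if cn == 1:        # exactly one stroke leaves the pixel
                end_points.append(quadrant(r, c, rows, cols))
            elif cn >= 3:      # more than two strokes meet here
                t_points.append(quadrant(r, c, rows, cols))
    return sorted(end_points), sorted(t_points)

# A 7x7 skeleton of a 'T': a top bar plus a stem in the middle column.
T_SKELETON = [[1] * 7] + [[0, 0, 0, 1, 0, 0, 0] for _ in range(6)]
print(extract_features(T_SKELETON))  # -> ([1, 2, 3], [2])
```

On real skeletons the result depends heavily on the quality of the thinning step, so a production version would likely need extra handling for spurious branches and thick junctions.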
The features in the resulting feature descriptor, together with their quadrant locations, are matched against the unique stored features of the standard characters in the character dictionary. The prototype that matches most closely provides the recognition. Prior to this, the expected quadrant locations of the end-points and T-points of each standard letter ('A' to 'Z') are determined and kept in the character dictionary as a standard database (Table 1). It should be noted that in this implementation the features relied upon have been reduced to two, i.e., end-point and T-point. In past implementations, more than two features were used, yet ambiguities still arose and had to be resolved by other, more complex computational means, such as stroke orientation, stroke aspect ratio, or shape curvature (Kerrick and Bovik, 1988).
The reason is that, given a set of possible input characters and even allowing for substantial variation, end-point and T-point features alone are enough to recognize a large number of them. That is, for a large part of the character set, the attributes of these two features are sufficient to make each character unique.
However, in some cases involving very similar characters, the ambiguity cannot be resolved by end-point and T-point properties alone. Ambiguity here refers to two characters having the same properties as far as these features are concerned. For example, the characters 'N' and 'S' have the same number of end-points in the same quadrant locations. This ambiguity is resolved using another property of the drawn stroke, called a corner. A corner is a single intermediate-level feature (Fig. 3).
Although a corner is a locally defined feature, like the low-level features, the number of corners present is used as a source of global information. "Corner" is a rather imprecise term; here it means an area of sustained high stroke curvature, surrounded on either side by areas of low or unsustained curvature. Thus, a corner is a less locally defined phenomenon than an end-point or a T-point. Instead of determining the quadrant locations of the expected corners, the number of corner occurrences in the pattern is checked regardless of their quadrant locations. For example, character 'S' has four expected corners, while character 'N' has only two.
The recognition process proceeds as follows:
• Perform the thinning algorithm on a binary image of the hand-drawn character
• Perform the feature extraction scanning process on the skeleton image. Upon completion of the scanning process, each feature collector holds the quadrant locations of its collected features and passes them to the pattern examiner
• Perform the matching procedures: the pattern examiner matches this information against that stored in the character dictionary for a targeted symbol. A matched symbol that is one of the highly similar characters is passed to a pattern examiner for corner detection to provide the recognition
The whole process is explained in what follows.
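To make the corner criterion concrete (two corners for 'N', four for 'S'), the following sketch counts corners along a stroke. It is a deliberate simplification and an assumption on our part: the stroke is taken to be already traced as an ordered list of (x, y) points, and a corner is approximated as a vertex whose turning angle exceeds a threshold, rather than as a region of sustained high curvature. The function name `count_corners` and the 45 degree threshold are hypothetical.

```python
import math

def count_corners(path, min_turn_deg=45.0):
    """Count vertices of an ordered stroke path with a sharp change of direction.

    Assumption: `path` is a polyline approximation of the skeleton stroke.
    """
    corners = 0
    for (x0, y0), (x1, y1), (x2, y2) in zip(path, path[1:], path[2:]):
        angle_in = math.atan2(y1 - y0, x1 - x0)    # direction arriving at the vertex
        angle_out = math.atan2(y2 - y1, x2 - x1)   # direction leaving the vertex
        turn = abs(math.degrees(angle_out - angle_in))
        turn = min(turn, 360.0 - turn)             # fold into the range 0..180
        if turn >= min_turn_deg:
            corners += 1
    return corners

# Idealised polylines for the two ambiguous letters (illustrative shapes only).
N_PATH = [(0, 0), (0, 10), (10, 0), (10, 10)]
S_PATH = [(10, 10), (0, 10), (0, 5), (10, 5), (10, 0), (0, 0)]

print(count_corners(N_PATH))  # -> 2
print(count_corners(S_PATH))  # -> 4
```

A smoothly written 'S' has no sharp vertices, so a practical corner finder would have to accumulate curvature over a window of points instead of testing single vertices.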

Technique for defining the character dictionary based on the expected end-points and T-points of each of the 26 letters:
Here, the concept of bit masking is used to simplify the constructed character dictionary. This concept, which will be explained later, not only makes the dictionary look simpler but also reduces the complexity of the programming task, especially in the feature matching logic. The idea of bit masking is simple: given a word of one byte (8 bits) that is masked against another word, the result depends on the logical operator used. [Worked bit-pattern example not recoverable from the source; its stated result is TRUE.]
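As a small illustration of the bit-masking idea, the sketch below treats the result of a bitwise AND as TRUE whenever the two words share at least one set bit. The function name `mask_test` and the sample bit patterns are assumptions chosen for illustration; they are not the example from the original figure.

```python
def mask_test(value, mask):
    """Return True when `value` and `mask` have at least one set bit in common."""
    return (value & mask) != 0

# Illustrative 8-bit words (not taken from the source).
print(mask_test(0b00001000, 0b00001111))  # -> True: bit 3 is set in both words
print(mask_test(0b00000100, 0b00000011))  # -> False: no common bit
```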
In the implementation, the quadrant numbers are first converted to binary values. The quadrants are laid out as

2  1
3  4

and are mapped to

2^1 = 2    2^0 = 1
2^2 = 4    2^3 = 8

The character dictionary is then constructed, based on Table 1, for the expected locations of the end-points and T-points of the 26 letters through the following operations:
• If the expected quadrant location of any specified feature is in either quadrant #1 or quadrant #2, the possible value to be put in the dictionary for that particular feature is calculated in the following manner [calculation missing from the source]

Suppose character 'E' is drawn in the shape shown in Fig. 2, with the quadrant locations of its end-points and T-point as indicated there. Its end-points and T-point, as a candidate input to be recognized, are defined as follows:

Input-endpoint = {1,8,8}
Input-T-point = {3}

The matching of the collected features with those stored in the character dictionary takes place in the following sequence. First, the end-point input is passed to a test function to check whether it matches the features kept in the dictionary, starting with index number 1 (i.e., the index of letter 'A') and continuing up to the position of the targeted symbol.
Let us now see how the input elements are tested one after another. Suppose the end-point input for character 'E' drawn as defined above, {1,8,8}, is matched against the standard one, {1,15,8}, stored in the dictionary. The testing sequence is shown in the following steps (Table 2).

Table 2: The testing sequence
Input | Standard | Comments
{1,8,8} | {1,15,8} | Using the bit mask operation: 1 AND 1 is TRUE; the input 1 is removed and the 1 in the standard is replaced by any negative integer (-2)
{0,8,8} | {-2,15,8} | 8 AND 15 is TRUE; the input 8 is removed and the 15 in the standard is replaced by -2
{0,0,8} | {-2,-2,8} | 8 AND 8 is TRUE; the input 8 is removed and the 8 in the standard is replaced by -2
{0,0,0} | {-2,-2,-2} | No input remains, so the matching process terminates
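The testing sequence of Table 2 can be reproduced with a short sketch. It is written under stated assumptions rather than taken from the original program: each input value is matched greedily against the first unused dictionary slot whose mask shares a bit with it, the input and the standard must contain the same number of features, and used slots are skipped explicitly (a plain AND against -2 would otherwise succeed, because -2 has almost every bit set in two's complement). The names `match_feature` and `USED` are hypothetical.

```python
USED = -2  # any negative integer marks a dictionary slot that has been consumed

def match_feature(inputs, standard):
    """Match quadrant bit values against dictionary masks, recording each step.

    Returns (matched, trace), where trace lists the (input, standard) state
    after every step, in the same form as the rows of Table 2.
    """
    inputs, standard = list(inputs), list(standard)
    trace = [(tuple(inputs), tuple(standard))]
    if len(inputs) != len(standard):      # assumption: feature counts must agree
        return False, trace
    for i, value in enumerate(inputs):
        for j, mask in enumerate(standard):
            if mask > 0 and (value & mask):   # skip used slots, then bit-mask test
                inputs[i] = 0                 # the input element is removed
                standard[j] = USED            # the matched slot is retired
                trace.append((tuple(inputs), tuple(standard)))
                break
        else:
            return False, trace               # no unused slot accepts this value
    return True, trace

ok, steps = match_feature([1, 8, 8], [1, 15, 8])
print(ok)  # -> True
for state in steps:
    print(state)
# -> ((1, 8, 8), (1, 15, 8))
# -> ((0, 8, 8), (-2, 15, 8))
# -> ((0, 0, 8), (-2, -2, 8))
# -> ((0, 0, 0), (-2, -2, -2))
```

The mask 15 in this example equals 1 + 2 + 4 + 8, which suggests that a dictionary value is the bitwise OR of all quadrants in which a feature may appear; this reading is an inference from the example, not a statement in the text. Note also that a greedy scan can fail when a wide mask is consumed too early, so a more careful matcher might try the narrowest masks first.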
If the test for the end-point input succeeds, the T-point input is passed to the test function to check whether, at that particular index, it also matches the standard values kept in the dictionary. If the test function returns TRUE for both inputs at the same index, the matched symbol is checked to see whether it belongs to the group of unique characters. If it does not, the ambiguity among the letters involved is resolved by locating the occurrences of corners in the skeleton.
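The overall decision flow might be sketched as below. The dictionary entries are hypothetical placeholders, not the values of Table 1: they only encode the fact, stated earlier, that 'N' and 'S' share their end-point locations and differ in corner count (two versus four). The sketch also collects all matching letters before disambiguating, which is a simplification of the index-by-index scan described in the text. The names `matches`, `DICTIONARY`, `EXPECTED_CORNERS` and `recognise` are assumptions.

```python
def matches(inputs, standard):
    """Greedy bit-mask test of input quadrant values against dictionary masks."""
    if len(inputs) != len(standard):
        return False
    remaining = list(standard)
    for value in inputs:
        for j, mask in enumerate(remaining):
            if mask > 0 and (value & mask):
                remaining[j] = -2          # retire the matched slot
                break
        else:
            return False
    return True

# Hypothetical entries: letter -> (end-point masks, T-point masks).
DICTIONARY = {
    "L": ([2, 8], []),
    "N": ([1, 4], []),
    "S": ([1, 4], []),
}
# Corner counts quoted in the text for the ambiguous pair.
EXPECTED_CORNERS = {"N": 2, "S": 4}

def recognise(end_points, t_points, corner_count):
    """Return the recognised letter, or None when nothing fits."""
    candidates = [letter for letter, (ep, tp) in DICTIONARY.items()
                  if matches(end_points, ep) and matches(t_points, tp)]
    if len(candidates) == 1:               # unique character: no corner test needed
        return candidates[0]
    for letter in candidates:              # ambiguous group: use the corner count
        if EXPECTED_CORNERS.get(letter) == corner_count:
            return letter
    return None

print(recognise([2, 8], [], 1))  # -> L
print(recognise([1, 4], [], 2))  # -> N
print(recognise([1, 4], [], 4))  # -> S
```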

RESULTS AND DISCUSSION
More than one hundred test specimens covering all characters ('A' to 'Z') were collected to study the effectiveness of the proposed technique (Fig. 3 shows some of the scanned hand printed characters).
The size of the binary digitized characters was set to 25×25 pixels. The characters were hand printed in different shapes and with variations within normally accepted handwriting styles that are easily recognizable by a human being. The smoothing, thinning and reshaping algorithms performed well on numerous types of digitized patterns and the results are satisfactory. The experiment has shown that when hand printed characters are drawn clearly according to specified rules and models, it is possible to perform recognition perfectly. The proposed feature extraction mechanism was successfully developed to cater for all possible types of T-point. The algorithm has been tested even with the worst T-point shapes that may occur in a skeleton. As an end-point is defined as any black pixel that has no further leading black pixel, its extraction does not cause any problem. Since, in the proposed method, the corner is vital information used to resolve the ambiguity between highly similar characters, any corner finding algorithm applied should be able to give enough information about the existence of corners in the skeleton. Table 3 shows in detail the results of the feature analysis and recognition for the tested symbols shown in Fig. 3.

Recognizable Character (RC):
A true recognition rate of about 99% was achieved in this experiment. It should be noted that the time taken to recognize one letter is less than 1 sec.

CONCLUSION
In this study, a new deterministic approach based on the concept of using a minimal character feature set has been discussed. The proposed method uses only two easily detectable features, i.e., end-point and T-point, to perform classification. Highly similar characters that share the same end-point and T-point properties are distinguished by the number of corner occurrences in the skeleton. The result is satisfactory and the computing cost is low.