TEXT SIGNAGE RECOGNITION IN ANDROID MOBILE DEVICES

This study presents a Text Signage Recognition (TSR) model on Android mobile devices for Visually Impaired People (VIP). Independent indoor navigation in unfamiliar surroundings is always a challenge for VIPs. Assistive technology such as Android smart devices has great potential to assist VIPs in indoor navigation through the built-in speech synthesizer. In contrast to previous TSR research, which was deployed on a standalone personal computer system using Otsu's algorithm, we have developed an affordable TSR application for Android mobile devices using the Tesseract OCR engine. The proposed TSR model used input images from the International Conference on Document Analysis and Recognition (ICDAR) 2003 dataset for system training and testing. The TSR model was tested by four blindfolded volunteers. Its performance was assessed using several metrics (i.e., Precision, Recall, F-Score and the Recognition Rate formula) to determine its accuracy. Experimental results show that the proposed TSR model achieved a satisfactory recognition rate.


INTRODUCTION
According to the World Health Organization (WHO), around 285 million people around the world suffer from visual impairment and blindness (WHO, 2013). In general, visual impairment is defined as a reduction of eyesight that cannot be corrected by standard glasses or contact lenses and that reduces a person's ability to perform some daily tasks. It is always a great challenge for VIPs to walk along streets in town without the company of a guide dog or caretaker. With the advancement of assistive technology, VIPs in Western countries can perform daily tasks such as grocery shopping, or even take a bus to various places, independently. Several alternatives have been proposed (Coughlan and Shen, 2013; Zhang et al., 2013; Katz et al., 2012; Sarfraz and Rizvi, 2007; Al-Saleh et al., 2011) so that VIPs can get around independently with the help of technology. However, most of these systems require carrying heavy equipment or a computer system, and there are limited choices of text signage recognition applications that run on smart phones. According to infographic statistics, approximately 46.9% of a total of 1.08 billion smart phone users run their mobile applications on the Android platform (http://www.go-gulf.com/blog/smartphone). Indeed, there has been a paradigm shift toward designing, developing and implementing mobile text recognition applications for indoor signage objects.
VIPs are not able to see the text on indoor sign boards clearly. Many electronic gadgets are available on the market, but they are either too expensive or too difficult for VIPs to handle. We were inspired to initiate this research after a site visit to the National Council of the Blind Malaysia (NCBM) in Kuala Lumpur. From the interview sessions, we learned that VIPs desperately need affordable electronic gadgets to assist them in navigating around various places independently, without the help of caretakers (Foong and Safwanah, 2011).

JCS
This is especially pertinent for VIPs who work as professionals such as lawyers, accountants or computer network managers (Tocheh, 2006).
Unlike previous TSR research, which was developed as a standalone system using MATLAB Release 2012 (Foong and Safwanah, 2011; Hairuman and Foong, 2011), this study attempts to develop a Text Signage Recognition (TSR) model on Android mobile devices for better portability. In our previous research, Hairuman and Foong (2011) applied the Canny edge detection algorithm, the Hough transform and a shearing transformation to detect and correct skewed signage images. In this study, the main objective is to propose a lightweight text signage recognition model for VIPs on Android mobile devices using the Tesseract OCR engine.
However, text signage recognition faces many challenges, such as low-quality input images, limited bandwidth, low memory capacity, poor resolution and inadequate lighting. Adapting OCR to mobile devices poses another challenge: mobile devices have much lower processing capabilities than a standalone workstation or personal computer. This is an obstacle in particular for the recognition of signage images on mobile devices, and the problem is intensified for mobile applications that require real-time or on-the-fly processing. Fortunately, mobile devices offer advantages beyond portability and ubiquity. Since smart phones are becoming more affordable and widely used, they are no longer just communication devices. Smart phones or Android mobile devices can be used to capture text images from signage objects and translate them to voice for VIPs or for foreign tourists whose native language is not English.
Hence, we focus on offline mode in the development of text signage recognition for VIPs in Malaysia. Random public signage was photographed. The text was extracted from the captured image and translated into text characters by the Tesseract OCR engine. Finally, the recognized text was read out. The authors implemented the TSR system on a Samsung Galaxy S2 device running Android version 2.3.
The remainder of the paper is organized as follows. Section 2 reviews the related literature. Section 3 describes the proposed Text Signage Recognition model. Section 4 presents the experimental results. Section 5 discusses the experimental findings. Section 6 describes some of the challenges and limitations of text signage recognition. Section 7 states the conclusion and future work.

RELATED WORK
Recent research in OCR uses neural networks or pattern recognition, in which text in an image is translated into a format that computers can manipulate using mathematical formulas. There are numerous devices and system applications that are beneficial to VIPs. One of the most common devices is the domestic helper robot that can recognize scene text in real time (Ruiz et al., 2013). Case et al. (2011) built a robot that reads the text on office signage and performs a semantic mapping between the office location and the name of the office occupant. Such devices are helpful to VIPs, but they have several drawbacks. First, the devices are cumbersome to carry around. Second, the devices lack ergonomic features that would let VIPs operate them.
The rapid development of mobile technology and the increasing number of smart phone users have captured the attention of mobile application developers. Quite a number of OCR-related applications that run on smart phones have been developed. There are numerous text detection/recognition and translation applications (Hsueh, 2011; Lue et al., 2010) on the market, but they lack a speech synthesizer. Those applications can be used anytime and anywhere because they are installed on smart phones and carried around. However, internet connectivity is mandatory because the applications run their processing on a back-end server. Undeniably, roaming will be costly if free Wi-Fi is unavailable.
A considerable amount of related work that incorporates speech synthesis with text recognition can be found in Tian et al. (2010), Yi and Tian (2011), Wang et al. (2013) and Arditi and Tian (2013). The central theme of these works revolves around independent navigation and way finding by exploiting the capabilities of camera-based computer vision. Tian et al. (2010) and Wang et al. (2013) described in detail their way finding methods for accessing unfamiliar indoor environments such as different rooms and other building amenities. Similarly, Yi and Tian (2011) detailed their proposed framework for extracting text regions from scene images with complex backgrounds and multiple text appearances. Taking a different perspective that is closer to the end users' viewpoint, Arditi and Tian (2013) reported on user interface preferences in the design of camera-based navigation and way finding. Their findings confirmed the need for devices that can guide VIPs to architectural landmarks in their environment. Likewise, the findings indicated preferences for device interfaces that give VIPs control over the new surroundings learnt.
Recently, Zhang et al. (2013) conducted a study of state-of-the-art algorithms for scene text classification and detection and claimed that their proposed algorithm significantly outperformed the existing algorithms for character identification. Chang (2011) came up with an intelligent text detection and extraction technique that uses the Canny edge detection algorithm together with a fast connected component algorithm to filter noise and obtain features and candidate text. Gonzalez and Bargasa (2013) applied geometric and gradient properties to design a text reading algorithm for natural images. Carlos et al. (2013) presented a method for perspective recovery of natural scene text in which the geometry of the characters is estimated to rectify the homography for every line of text. Koo and Tian (2013) and Shi et al. (2013) extracted connected components in images using the Maximally Stable Extremal Region (MSER) algorithm to perform scene text detection and recognition. In addition, Liu and Smith (2013) devised a simple algorithm that uses the density of special symbols to detect equation regions in printed documents. They claimed that the performance of the Tesseract OCR engine improved after the detector was enabled. We therefore decided to deploy the Tesseract OCR engine as the back-end engine for this research because it is open source (http://code.google.com/p/tesseract-ocr/) and it ranked among the top three optical character recognition systems at the Fifth Annual Test of OCR Accuracy (Rice et al., 1996).

PROPOSED TEXT SIGNAGE RECOGNITION (TSR) MODEL
As mentioned earlier, the Tesseract OCR engine was selected for this research because it is reliable, open source and free of charge. The recognition process of the Tesseract OCR engine mainly consists of feature detection, segmentation, feature extraction and, lastly, character recognition (Smith, 2007).
To shorten the development time, the authors also adopted the Open Source Computer Vision (OpenCV) libraries for image processing because they provide real-time computer vision functions that are free for academic use (http://opencv.willowgarage.com/wiki/Welcome/). Figure 1 shows the proposed Text Signage Recognition (TSR) Model. In a nutshell, the TSR model recognizes text in a signage image in the following steps:
• Image Acquisition
• Image Pre-Processing
• Character Recognition using Tesseract OCR Engine
• Speech Synthesizing Process

Image Acquisition
The text image was captured using a Samsung Galaxy S2 device and stored in standard JPEG format, or it was selected from the Secure Digital (SD) card in the Android smart phone. The image may contain unwanted noise such as blur, complex background objects and illuminating objects. Thus, pre-processing steps are required to smooth or reduce this noise for a better recognition rate.

Image Pre-processing
Image pre-processing eliminates unwanted noise in signage objects. It comprises a series of operations: converting to gray scale, morphological transformation, thresholding, finding the edges of the characters, copying those characters onto a new white background and finally converting the gray scale image to a binary image. The steps of image pre-processing are illustrated in Fig. 2.
The original colored image in Fig. 3 is converted into a gray scale image by using the cvtColor() function with the "COLOR_BGR2GRAY" color space conversion code (http://opencv.willowgarage.com/wiki/Welcome/). The gray scale image carries intensity information.
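The gray scale conversion can be illustrated without OpenCV. The following is a minimal pure-Python sketch, not the authors' Android code: it assumes the luminance weights that OpenCV documents for RGB-to-gray conversion (0.299, 0.587, 0.114) and represents an image as nested lists of (B, G, R) tuples. The function name `bgr_to_gray` is hypothetical.

```python
def bgr_to_gray(image):
    """Convert rows of (b, g, r) pixels (0-255) to rows of gray levels.

    Assumption: the weights are the standard luminance weights that
    OpenCV documents for RGB -> gray conversion.
    """
    gray = []
    for row in image:
        gray_row = []
        for b, g, r in row:
            y = 0.299 * r + 0.587 * g + 0.114 * b
            gray_row.append(int(round(y)))
        gray.append(gray_row)
    return gray


# Tiny 2x2 test image in BGR order: white, black, pure red, pure blue.
sample = [[(255, 255, 255), (0, 0, 0)],
          [(0, 0, 255), (255, 0, 0)]]
gray = bgr_to_gray(sample)
```

Note that red contributes far more to the gray level than blue, so a red sign on a blue wall keeps some contrast after conversion.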
The formula used for the color space conversion to gray scale is shown in Equation (1), where R, G and B are the normalized input Red, Green and Blue pixel values and Y is the resulting gray level:

$$Y = 0.299\,R + 0.587\,G + 0.114\,B \tag{1}$$

Morphological transformations are carried out next on the image, as shown in Fig. 4. The morphological operation used in this research project is MORPH_OPEN (http://opencv.willowgarage.com/wiki/Welcome/). The image is first eroded and then dilated, as expressed by the following code:

dst = dilate(erode(src, element), element)

A thresholding operation based on Otsu's algorithm (Otsu, 1979) was then performed to convert the gray scale image to a binary image. Otsu's algorithm selects a threshold value that partitions the pixels of the digital gray scale image into two sets; pixels on one side of the threshold are kept as characters, and the remaining pixels are rejected as background.

A mask was created by applying the drawContours() function from the OpenCV library (http://opencv.willowgarage.com/wiki/Welcome/). The mask eliminates (darkens) all background objects in the image and leaves only the characters of the signage image visible, as depicted in Fig. 6.
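The effect of the opening operation (erosion followed by dilation) can be seen on a toy image. This is a minimal pure-Python sketch under stated assumptions: a 3x3 rectangular structuring element, neighbours outside the image ignored, and an image stored as nested lists. The helper names are hypothetical and the paper does not state which structuring element was used.

```python
def _filter3x3(image, pick):
    """Apply pick() (min or max) over each pixel's in-bounds 3x3 neighbourhood."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(pick(window))
        out.append(row)
    return out


def erode(image):
    return _filter3x3(image, min)   # shrinks bright regions


def dilate(image):
    return _filter3x3(image, max)   # grows bright regions


def morph_open(image):
    # Same composition as the expression in the text: dilate(erode(src)).
    return dilate(erode(image))


# A 3x3 bright block plus one isolated bright "noise" pixel at row 2, column 5.
noisy = [[0,   0,   0,   0, 0,   0, 0],
         [0, 255, 255, 255, 0,   0, 0],
         [0, 255, 255, 255, 0, 255, 0],
         [0, 255, 255, 255, 0,   0, 0],
         [0,   0,   0,   0, 0,   0, 0]]
opened = morph_open(noisy)
```

Opening removes bright details smaller than the structuring element (the isolated pixel) while restoring larger shapes (the block) to their original extent.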
The masked image was copied and pasted onto another image with a white background, as illustrated in Fig. 7.
The image is finally converted to a binary image before proceeding to the character recognition process.
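The gray-to-binary conversion in this section relies on Otsu's algorithm (Otsu, 1979), which picks the threshold that maximizes the between-class variance of the gray-level histogram. The following pure-Python sketch shows the idea on a tiny image. It is an illustration rather than the OpenCV implementation, and the function names and the sample values are hypothetical.

```python
def otsu_threshold(gray):
    """Return a threshold (0-255) maximizing between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    w_bg = 0      # pixels with value <= t
    sum_bg = 0    # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0 or w_fg == 0:
            continue  # one class is empty, variance undefined
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, t):
    """Pixels above t become 255 (white), all others 0 (black)."""
    return [[255 if v > t else 0 for v in row] for row in gray]


# Three dark and three bright pixels: any threshold between them separates the two groups.
gray = [[20, 30, 200],
        [25, 210, 220]]
t = otsu_threshold(gray)
binary = binarize(gray, t)
```

Because the threshold is derived from the histogram of each image, no fixed brightness cut-off has to be tuned by hand, which suits signage photographed under varying lighting.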

Character Recognition
The processed binary image is converted to text during the character recognition process (Smith, 2007) without any internet connection. The characters outlined in the binary image were analyzed into Blobs using Connected Component Analysis. The Blobs were organized into text lines, which were then further divided into words.
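Connected Component Analysis groups touching foreground pixels into blobs. The sketch below is a generic breadth-first labelling in pure Python, assuming dark text (0) on a white background (255) and 8-connectivity. Tesseract's own analysis works on outlines and is more involved, so this is only an illustration of the general technique, and `find_blobs` is a hypothetical name.

```python
from collections import deque


def find_blobs(binary, foreground=0):
    """Return bounding boxes (left, top, right, bottom) of 8-connected blobs."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != foreground or seen[y][x]:
                continue
            # Flood-fill one blob, tracking its bounding box.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if binary[ny][nx] == foreground and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    # Sorting by left edge gives a left-to-right order along one text line.
    return sorted(boxes)


# Two dark "characters" on a white background.
page = [[255, 255, 255, 255, 255, 255, 255],
        [255,   0,   0, 255, 255,   0, 255],
        [255,   0,   0, 255, 255,   0, 255],
        [255, 255, 255, 255, 255, 255, 255]]
blobs = find_blobs(page)
```

The horizontal gaps between consecutive bounding boxes are the kind of cue that can then be used to split a text line into words.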

Speech Synthesizing Process
The words or texts were then fed into the speech synthesizer for Text-To-Speech (TTS) translation. The TTS read out the words if they were recognized correctly. Otherwise, it either read out all the characters in the word one after another or produced no output.

RESULTS
For the TSR prototype, the authors designed the mobile user interface to be as simple as possible. Every option is performed in a few clicks. Figure 8 shows the home page of the mobile user interface, which offers two options: selecting an image from the phone gallery or capturing one with the camera.

Experiment Setup and Implementation
A total of 60 different samples of signage images were selected at random from the ICDAR 2003 dataset (http://algoval.essex.ac.uk/icdar/RobustWord.html). Of these, 60% (36 images) were used for training and the remaining 24 images were used for testing. The aim of the experiment was to determine the recognition rate of the TSR model. We used two open source libraries, i.e., OpenCV (Bradski and Kaehler, 2008) and the Tesseract OCR engine, during the development of the TSR mobile application. We installed the proposed TSR model on a Samsung Galaxy SII device for system training and testing.

Training and Testing
Prior to system performance testing, four volunteers/users (3 males and 1 female) between 20 and 22 years old trained the TSR mobile application. The users manually input the images into the proposed model for training purposes. Samples of training and testing data are displayed in Fig. 9 and 10, respectively. The results of training and testing are tabulated in Table 1.
The Recognition Rate (RR) of the proposed TSR model is calculated using Equation (2). Based on our observation, the recognition rate for the training dataset is 16.7% higher than that for the testing dataset, as illustrated in Table 1. This large discrepancy was due to noise inherent in the selected testing images: the testing samples downloaded from the ICDAR 2003 dataset contained complex background objects, which constituted the noise.

User Evaluation on Proposed TSR Model
The TSR model was tested by the four volunteers/users to estimate its recognition rate. All users were blindfolded. Fifteen signage images were randomly selected, in some of which the characters were slanted or skewed. Each signage image was captured using the Samsung Galaxy SII device. If the signage was recognized correctly (T = True), a tick (√) was placed in the entry. Otherwise, the signage was marked as not recognized (F = False). Table 2 shows the user evaluation results for the ICDAR 2003 Robust Word Recognition dataset.

From the experimental results in Table 2, the TSR model could not detect and recognize the text signage for "Alarm" (user 1 and user 4), "AGENTS" (user 2), "PEPPER" (user 3), "Videos" (user 4), "1946" (user 1 and user 3), "EST." (user 2) and "YOU" (user 2 and user 3). However, the TSR model could recognize the signage for "STAFF", "2002", "GAS", "TRAFFIC", "South", "E14", "FLATS" and "SERVICE". In general, most slanted or skewed text images could be detected and recognized by the Tesseract OCR engine.
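The failures listed above can be tallied into per-user and overall recognition rates. The sketch below is an illustration only. It assumes that the recognition rate is the usual ratio of correctly recognized signs to all signs, expressed as a percentage (the exact form of Equation (2) is not reproduced in this text), and that the failures named in the paragraph above are exhaustive for the 15 words. The resulting figures are derived here and are not reported by the authors.

```python
WORDS = ["Alarm", "AGENTS", "PEPPER", "Videos", "1946", "EST.", "YOU",
         "STAFF", "2002", "GAS", "TRAFFIC", "South", "E14", "FLATS", "SERVICE"]

# Users for whom each word was NOT recognized, as listed in the text.
FAILURES = {
    "Alarm": {1, 4},
    "AGENTS": {2},
    "PEPPER": {3},
    "Videos": {4},
    "1946": {1, 3},
    "EST.": {2},
    "YOU": {2, 3},
}


def recognition_rate(correct, total):
    """Assumed form of the recognition rate: percentage of correct results."""
    return 100.0 * correct / total


per_user = {}
for user in (1, 2, 3, 4):
    missed = sum(1 for users in FAILURES.values() if user in users)
    per_user[user] = recognition_rate(len(WORDS) - missed, len(WORDS))

total_trials = 4 * len(WORDS)
total_missed = sum(len(users) for users in FAILURES.values())
overall = recognition_rate(total_trials - total_missed, total_trials)
```

Under these assumptions, users 1 and 4 each miss 2 of 15 signs, users 2 and 3 each miss 3 of 15, and 50 of the 60 trials succeed overall.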

Comparison of Performance Evaluation Using Precision, Recall and F-Score
In this subsection, we compared the proposed TSR model's system performance results with other alternative models in terms of Precision, Recall and F-Score.

DISCUSSION
We observed that the TSR model encountered some problems in the text recognition process. Firstly, signage images with white text were not recognized correctly. This was because, when the text was extracted, masked, copied and pasted onto a new white background image, the white text blended into the white background. Secondly, the model recognized the character 'O' incorrectly as 'Q' and vice versa. Thirdly, signage with an illuminating texture could not be recognized at all. Fourthly, any incomplete or cropped character in the captured image was also recognized wrongly. On the other hand, the proposed TSR model recognized text signage with a simple or plain background well. In addition, the speech synthesizer sometimes failed to speak a word correctly. For example, it pronounced the word "YOU" as "Y", "O" and "U", reading out the characters one by one. Signage recognition deteriorated further if the captured image was either too dim or too bright, since the users were blindfolded.
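The first failure mode can be reproduced on a toy image. The following pure-Python sketch assumes a simple compositing step in which pixels selected by a mask are copied onto a white canvas. The helper names and pixel values are hypothetical, and the authors' actual masking uses OpenCV's drawContours().

```python
WHITE = 255


def paste_on_white(gray, mask):
    """Copy masked pixels onto a white canvas; unmasked pixels become white."""
    return [[pixel if keep else WHITE for pixel, keep in zip(g_row, m_row)]
            for g_row, m_row in zip(gray, mask)]


def contrast(image):
    """Difference between the brightest and darkest pixel."""
    values = [v for row in image for v in row]
    return max(values) - min(values)


# The mask marks the two middle columns as character pixels.
mask = [[0, 1, 1, 0],
        [0, 1, 1, 0]]
dark_text = [[200, 10, 10, 200],      # dark characters on a light sign
             [200, 10, 10, 200]]
white_text = [[40, 255, 255, 40],     # white characters on a dark sign
              [40, 255, 255, 40]]

dark_result = contrast(paste_on_white(dark_text, mask))
white_result = contrast(paste_on_white(white_text, mask))
```

Dark characters keep their contrast against the new background, whereas white characters end up with zero contrast and leave nothing for the OCR engine to segment. One possible remedy, not evaluated in this study, would be to invert the image when the masked characters are brighter than their surroundings.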
As there was no prior comparative study conducted on the ICDAR 2003 dataset using the Recognition Rate formula in Equation (2), further testing on more ICDAR 2003 images was conducted using alternative metrics (i.e., Precision, Recall and F-Score).
The experimental results of the proposed TSR model are shown in Table 3. As shown in Table 3, Yao et al. (2013) demonstrated that their algorithm achieved an average F-Score of 0.63 by minimizing an energy cost function for each region in its surrounding text. The proposed TSR model, which used the Tesseract OCR engine, produced an average F-Score of 0.61. It therefore performed better than Ashida's method (Lucas et al., 2003), which gave an average F-Score of 0.50, but fell below Yao's method (Yao et al., 2013).
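For readers unfamiliar with these metrics, the sketch below computes Precision, Recall and F-Score from detection counts, assuming the F-Score is the usual harmonic mean of Precision and Recall. The counts in the example are made up purely for illustration and are not taken from Table 3. Only the three average F-Scores in the ranking come from the text above.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f_score) from raw counts."""
    detected = true_positives + false_positives   # everything the model reported
    actual = true_positives + false_negatives     # everything that was really there
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 6 words found correctly, 2 spurious, 4 missed.
p, r, f = precision_recall_f(6, 2, 4)

# Average F-Scores quoted in the text, ranked from best to worst.
reported = {
    "Yao et al. (2013)": 0.63,
    "Proposed TSR model": 0.61,
    "Ashida (Lucas et al., 2003)": 0.50,
}
ranking = sorted(reported, key=reported.get, reverse=True)
```

Because the F-Score is a harmonic mean, it stays low unless both Precision and Recall are reasonably high, which makes it a stricter summary than either metric alone.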

CHALLENGES AND LIMITATIONS
Text detection and recognition have posed many challenges to the research community. Researchers have attempted to bridge the research gap by improving text detection and recognition techniques, yet there is still plenty of room for improvement. Challenges that persist in scene text recognition, such as cluttered or complex backgrounds, small font sizes, inappropriate font styles, slanted/skewed images, embossed images and poor lighting, are the main contributing factors to poor detection and recognition rates. However, the greatest challenge for visually impaired or blind people in capturing scene text is that the captured image tends to be out of focus, to contain missing or partial text, or to contain illuminating colors. Undoubtedly, a lack of contrast between foreground and background objects is also a stumbling block to better text detection and recognition.
On the other hand, with the advancement in mobile technology, mobile devices have become cheaper and more affordable. However, mobile devices still have limited memory and processing capability for processing indoor text images. The offline operating mode helps avoid heavy roaming costs when using the TSR mobile application abroad.

CONCLUSION
We set out to build a Text Signage Recognition model on Android mobile devices for visually impaired people in Malaysia. The proposed TSR model, deployed with the Tesseract OCR engine, recognized the text images from the International Conference on Document Analysis and Recognition (ICDAR) 2003 dataset with a satisfactory recognition rate.
Our approach is simple and easy to implement. The proposed model does not require an internet connection during operation because the Tesseract OCR engine is embedded in the Android mobile device. However, the major drawback is that its recognition performance falls short of state-of-the-art algorithms for text extraction and recognition. Nevertheless, the proposed TSR model is capable of translating text to speech through the built-in speech synthesizer.
At this juncture, the TSR model can recognize and pronounce English words only. In the future, we intend to enhance the TSR prototype by translating English text into other languages such as Malay or Mandarin.

ACKNOWLEDGEMENT
Special thanks and gratitude to the National Council of the Blind Malaysia (NCBM) and the Research and Innovation Office (RIO) of Universiti Teknologi PETRONAS for providing support in this research.