Improving Arabic Instant Machine Translation: The Case of Arabic Triangle of Language

Corresponding author: Tala M. Albashir, Department of Computer Engineering, Al-Ahliyya Amman University, Amman, Jordan. Email: t.albashir@ammanu.edu.jo

Abstract: Instant translator applications are very useful when traveling, especially when one knows little about the language of the destination country. Most applications do not yet offer instant Arabic-to-English translation. In this article, we attempt to mimic an application for instant Arabic-to-English translation. The system translates the Arabic Triangle of Language (ālmṯlṯātāllġwyh), which consists of groups of three Arabic words that are homographs. The process starts by capturing an image of a homograph with a mobile phone camera; the captured word is then recognized, taking the diacritic markers into consideration, using Arabic Optical Character Recognition (OCR). Finally, the system provides an English translation of the homograph. The researchers used Histograms of Oriented Gradients (HOG) features, together with a set of structural and geometrical features of the segmented Arabic word, and a multi-class SVM classifier for classification, before providing the English meaning.


Introduction
In the modern age, machines have been used in different domains to improve human life and facilitate communication. Machine Translation (MT), defined as the process by which a computer application translates texts from one natural language into another, is used nowadays by millions of users to translate messages, letters and website content. One of the recent applications of MT is machine instant translation: the process of translating linguistic content from one language into another using mobile devices with a camera. This type of machine translation relies on Optical Character Recognition (OCR), which helps the machine read textual material in images and then translate the content into another natural language.
Google has recently used its own MT engine to power a new mobile application that can read texts in images captured by users with their mobile phones and instantly translate them into several natural languages. The Google application supports translation from several languages but, unfortunately, Arabic is not yet supported. Even the Google Translate engine does not produce good-quality translations for some special cases of the Arabic language, such as homographs.
See examples in Table 1. As the examples show, the word بر has three different meanings depending on the diacritic markers attached to it. The Google and Microsoft machine translation systems translated the three words as Mainland even though they have three different meanings. This problem is very serious, especially when machine translation systems are used by language learners. In this article, we attempt to build our own lexicon that contains groups of these words. The lexicon will be used to create an application that is able to provide instant machine translation of the words included in the Arabic Triangle of Language. This process relies on our proposed OCR algorithm. The following sections discuss the notion of MT and the difficulties that face Arabic-into-English MT and OCR.

Triangle of Language (ālmṯlṯātāllġwyh)
Arabic has a unique linguistic phenomenon called the Triangle of Language (ālmṯlṯātāllġwyh). "In Arabic linguistics, 'triangle of language' refers to three words that are identical in spelling, but are different in diacritics, in which changing the vowel points will lead to a change in meaning. Hence, these three words are homographs. Words that are related to this triangle are nouns and verbs" (Abdul Ameer, 2010), for example (الكِبْرُ، الكُبْرُ، الكَبْرُ). Changing the diacritics of a word thus changes its meaning. Almost all translation applications fail to translate these words correctly because they cannot distinguish between them. In this study, we therefore build our own lexicon containing groups of these words, so that the correct translation can be retrieved once the OCR algorithm proposed in this article has recognized the diacritics and the structure of the word.

Arabic into English Machine Translation
Several studies have examined the most important problems that arise when machine translation is used between Arabic and English. Izwaini (2006) discusses the problems of Arabic-English MT systems by analyzing the Arabic translations of 76 English texts and the English translations of 46 Arabic texts produced by three MT systems: Google, Sakhr and Systran. The researcher found that one of the major problems in the Arabic-into-English mode is the non-vocalization of Arabic words, which leads to the wrong choice of target-language (TL) words. Other problems are related to an inadequate lexicon, multiple meanings, connotation and collocation. Deletion was also reported as one of the main problems of Arabic-into-English MT. Izwaini (2006) also found that Arabic word order, Arabic definite articles and Arabic prepositions are problematic for the three systems. As for the problems that face MT when rendering English text into Arabic, the researcher found additions and deletions in the translations produced by the MT systems. He also found that polysemy, and with it choosing the right TL meaning, is a thorny issue in English-Arabic MT (Izwaini, 2006).

A study from the University of Jazan in Saudi Arabia, published in the Journal of Literature, Language and Culture, discusses the issues encountered in Arabic-English MT. The authors indicate that the complexity of Arabic syntactic structure, the length of the Arabic sentence, homographs and diacritics are among the main factors that make it hard to develop "good" Arabic-English machine translation (Al-Mubark, 2015).

Lutf et al. (2014) considered the diacritics and strokes that have been added to the original Arabic alphabet, which embed very useful information about font type. The authors proposed a method for Arabic word recognition that uses two algorithms to segment diacritics: a flood-fill-based algorithm and a clustering-based algorithm. They used 10 Arabic fonts and showed that their approach could achieve high recognition rates with minimal computation cost and could be integrated with OCR systems. The steps of their approach, for both the training and testing stages, are as follows:

1. Preprocessing: noise removal, orientation correction and text localization
2. Segmentation: flood-fill-based diacritic segmentation and clustering-based diacritic segmentation
3. Diacritics validation: a diacritic D is defined as a label of a letter when the height M and width N of D satisfy M>2N
4. Feature extraction and classification: they compute the Central and Ring Projection (CCRP)
5. Measuring the similarity

The total recognition rate of their approach was 98.73%.

Keysers et al. (2016) presented Google's online handwriting recognition system. In their approach, they used the same architecture in the cloud and on smart mobile devices with limited computational power by changing some features of the system. It is worth mentioning that this system is currently implemented in Google products such as Google Translate, as well as an input method on Android devices. The approach depends on segmentation-and-decoding. First, the authors segmented the ink into character hypotheses. Then, they classified the hypotheses as characters. Finally, they used lattice decoding.

The approach in (Shahrour et al., 2015) performs automatic diacritization of Arabic sentences, combining linguistic rules and statistical treatments.
The system works in four stages:
1. Morphological analysis using the second version of the morphological analyzer AlKhalil Morpho Sys
2. Eliminating invalid word transitions according to the syntactic rules
3. Determining the most probable diacritized sentence using a discrete hidden Markov model and the Viterbi algorithm
4. Dealing with the remaining words using statistical treatments based on the letters

The research in (Al Tameemi et al., 2011) proposed a system that was implemented using a pattern recognition tool and a set of structural and geometrical features of the Arabic word. The word was then classified using a multi-class SVM. The proposed method used multiple databases of Arabic words and achieved good results.

Optical Character Recognition
OCR applications have been used widely in recent decades, as replacing paper storage with computer files became a persistent need. This is evident in the efforts that aim to preserve old manuscripts. The rapid growth of interest in improving OCR techniques has posed challenges for developers, who must achieve high accuracy rates and increase performance. Moreover, economic considerations push developers to provide OCR services in various applications that compete with human reading capabilities at a lower cost (Izwaini, 2006). OCR techniques consist of five essential steps: preprocessing, segmentation, feature extraction, classification and post-processing.

Pre-processing
The pre-processing step is an initial stage that precedes the other steps. This stage is divided into several tasks. First, image enhancement techniques are applied; these techniques are mostly used to deal with low-quality images.

Thresholding
Thresholding is a method of dividing the image into two parts, foreground and background, usually after converting the color image to a gray-scale image.

Image Enhancement Techniques
The main goal of enhancement techniques is to manipulate an image so that the result is more suitable than the original image for specific applications (Al Tameemi et al., 2011).

Noise Removal
Source image may be exposed to several types of noise due to errors. Noise may smear or break the character in the target image and may cause poor recognition rates. Therefore, it is necessary to eliminate this imperfection before the actual processing of the target image is carried out (Al Tameemi et al., 2011).

Morphological Operations
These are non-linear operations that depend on the relative ordering of pixel values rather than on their numerical values, which makes them especially suited to binary images (Maini and Aggarwal, 2010). The structuring element is an essential concept in morphological operations: it is a small binary image in which every pixel has a value of zero or one.
Dilation: this technique is typically used to progressively enlarge the borders of foreground regions.
Erosion: in contrast to dilation, erosion shrinks the area of foreground regions by eroding their boundaries (Maini and Aggarwal, 2010).
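The two operations can be illustrated with a minimal sketch in plain Python. This is not the code used in this work (which relies on MATLAB); the function names, the square structuring element and the rule that pixels outside the image count as background are assumptions made for the illustration.

```python
def dilate(img, r=1):
    """Binary dilation with a (2r+1) x (2r+1) square structuring element."""
    h, w = len(img), len(img[0])
    # A pixel becomes foreground if any pixel under the element is foreground.
    return [[int(any(img[j][i]
                     for j in range(max(0, y - r), min(h, y + r + 1))
                     for i in range(max(0, x - r), min(w, x + r + 1))))
             for x in range(w)] for y in range(h)]


def erode(img, r=1):
    """Binary erosion; pixels outside the image are treated as background."""
    h, w = len(img), len(img[0])
    # A pixel stays foreground only if every pixel under the element is foreground.
    return [[int(all(0 <= j < h and 0 <= i < w and img[j][i]
                     for j in range(y - r, y + r + 1)
                     for i in range(x - r, x + r + 1)))
             for x in range(w)] for y in range(h)]


dot = [[0] * 5 for _ in range(5)]
dot[2][2] = 1            # a single foreground pixel
grown = dilate(dot)      # the pixel grows into a 3 x 3 block
shrunk = erode(grown)    # erosion shrinks the block back to one pixel
```

In this toy case erosion exactly undoes dilation; on real word images the two operations are generally not inverses of each other.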

Features Extraction
The second component of an OCR system is feature extraction, which is the most difficult process in pattern recognition. Feature extraction has two main approaches. The most direct one describes characters by their actual raster image; the other extracts the important features that characterize symbols and leaves out the trivial attributes.

Structural and Geometrical Features
We bypass the segmentation of Arabic words into characters in this study; each Arabic word can therefore be treated as a separate class. The extracted features include the number of connected components and the positions of corners, which are indicated by end points in the word image. The mean value of the word image is calculated, and the image is sliced into 20 vertical slices, with the sum of pixel values calculated for each slice. Next, four local maxima are found in the vertical projection, taking the center of gravity into consideration (Al Tameemi et al., 2011).

Histograms of Oriented Gradients (HOG)
HOG is a popular method that has proven highly successful in complex environments. HOG describes an image as a group of local histograms (Korkmaz et al., 2017), and the descriptor is widely used for object detection. HOG divides an image into small grid cells, calculates a histogram of oriented gradients in each cell to obtain a descriptor for that cell, and then normalizes the result block by block.
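The cell-histogram idea can be sketched in a few lines of plain Python. This is a simplified illustration and not the HOG implementation used in this work: the cell size, the nine unsigned orientation bins, the hard (non-interpolated) bin assignment and the single L2 normalization over the whole descriptor (instead of overlapping blocks) are all assumptions chosen to keep the example short.

```python
import math


def hog_descriptor(img, cell=4, bins=9):
    """Simplified HOG: one magnitude-weighted orientation histogram per cell."""
    h, w = len(img), len(img[0])
    cells_y, cells_x = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(cells_x)] for _ in range(cells_y)]
    for y in range(cells_y * cell):
        for x in range(cells_x * cell):
            # Central differences, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / (180.0 / bins)), bins - 1)
            hist[y // cell][x // cell][b] += magnitude
    descriptor = [v for row in hist for c in row for v in c]
    # Simplification: one L2 normalization instead of block-wise normalization.
    norm = math.sqrt(sum(v * v for v in descriptor)) or 1.0
    return [v / norm for v in descriptor]


# An 8 x 8 image with a vertical edge: all gradients are horizontal.
edge = [[0] * 4 + [1] * 4 for _ in range(8)]
desc = hog_descriptor(edge)   # 2 x 2 cells x 9 bins = 36 values
```

For the vertical edge, only the first orientation bin of each cell receives votes, which is the behaviour one expects from a purely horizontal gradient.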

Classification
The final phase in OCR systems is classification by means of supervised and unsupervised methods. It aims at classifying a character or word into its appropriate category. The Support Vector Machine (SVM) is the adopted classification algorithm in our work to classify input images. We next briefly review the most important aspects of SVM.

Support Vector Machine (SVM)
In our work, an SVM classifier was used to map the features of the input image to different categories or classes. SVM is a strong algorithm, especially for the classification of high-dimensional datasets, and it has achieved great results in pattern recognition applications (Al Tameemi et al., 2011).

The Proposed Approach
In this section, we describe the essential steps in chronological order, accompanied by figures that illustrate the different stages.

Preprocessing
As a first step, we prepare some images of Arabic words.
We used MATLAB software to apply the preprocessing techniques (thresholding and noise removal). First, we convert the RGB image to a gray-scale image and then to a binary one. After that, all objects containing fewer than 30 pixels are removed; the 30-pixel threshold depends on the font size and on the words in our lexicon that will be translated. Salt-and-pepper noise was also removed using a median filter.
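The thresholding and small-object removal steps can be sketched as follows. This is an illustrative Python sketch and not the MATLAB code used in this work: the luma weights, the fixed threshold of 128, the 8-connectivity and the function names are assumptions, and the median filter is omitted for brevity.

```python
from collections import deque


def to_binary(rgb, threshold=128):
    """RGB -> gray scale -> binary; dark ink becomes foreground (1)."""
    return [[1 if 0.299 * r + 0.587 * g + 0.114 * b < threshold else 0
             for (r, g, b) in row] for row in rgb]


def connected_components(img):
    """Return a list of components, each a list of (y, x) pixels (8-connectivity)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, component = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


def remove_small_objects(img, min_pixels=30):
    """Keep only the objects that contain at least `min_pixels` pixels."""
    out = [[0] * len(img[0]) for _ in img]
    for component in connected_components(img):
        if len(component) >= min_pixels:
            for y, x in component:
                out[y][x] = 1
    return out


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
rgb = [[BLACK if y < 6 and x < 6 else WHITE for x in range(10)] for y in range(10)]
rgb[9][9] = BLACK                      # an isolated speck of noise
binary = to_binary(rgb)
clean = remove_small_objects(binary)   # the 36-pixel block survives, the speck does not
```

As the text notes, the 30-pixel limit is tied to the font size, so it would need tuning for other fonts or image resolutions.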

Dilation
We dilated the word image so that the center of the word can be located easily.

Segmentation
After assigning the central line, we dilated it to cover the middle area of the word structure, so that the word and the dilated central line became one object. The diacritic markers were then isolated from the word structure by removing the biggest object from the image, which produces two images (Fig. 1).
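One way to read this step is sketched below in plain Python. It is an interpretation and not the authors' MATLAB code: the central line is assumed to be the row of the ink centroid, `band` is a hypothetical parameter for how far the line is dilated, and the object merged with the line is assumed to be the biggest one. The image must contain at least one foreground pixel.

```python
from collections import deque


def split_word_and_diacritics(img, band=1):
    """Return (word_image, diacritics_image) for a non-empty binary word image."""
    h, w = len(img), len(img[0])
    ink = [(y, x) for y in range(h) for x in range(w) if img[y][x]]
    center_row = round(sum(y for y, _ in ink) / len(ink))   # row of the centroid
    rows = range(max(0, center_row - band), min(h, center_row + band + 1))
    # Merge the dilated central line with the ink so the body forms one object.
    merged = [[1 if (img[y][x] or y in rows) else 0 for x in range(w)]
              for y in range(h)]
    # Flood fill (8-connectivity) starting from every pixel of the central line.
    body = [[False] * w for _ in range(h)]
    queue = deque((y, x) for y in rows for x in range(w))
    for y, x in queue:
        body[y][x] = True
    while queue:
        y, x = queue.popleft()
        for ny in range(max(0, y - 1), min(h, y + 2)):
            for nx in range(max(0, x - 1), min(w, x + 2)):
                if merged[ny][nx] and not body[ny][nx]:
                    body[ny][nx] = True
                    queue.append((ny, nx))
    # Ink reached by the fill is the word body; the remaining ink is diacritics.
    word = [[int(img[y][x] and body[y][x]) for x in range(w)] for y in range(h)]
    marks = [[int(img[y][x] and not body[y][x]) for x in range(w)] for y in range(h)]
    return word, marks


img = [[0] * 7 for _ in range(9)]
img[4] = [1] * 7      # base line of the word body
img[3][6] = 1         # a stroke attached to the body
img[1][3] = 1         # a mark above the word
img[7][2] = 1         # a mark below the word
word, marks = split_word_and_diacritics(img)
```

The sketch also shows why a misplaced central line is harmful: any diacritic that touches the dilated line is absorbed into the word body.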

Features Extraction
The feature extraction stage starts with the word image. We divide the word image into 18 equal parts (Fig. 2) and calculate the sum of pixels in each part to obtain feature number one of the feature vector.
The vertical projection of the word image is then calculated and divided into four parts, and the maximum and minimum points of the four parts are taken as the second group of features.
The number of corners, the number of connected components, the mean and the gravity of the word image are calculated to obtain the remaining features. The diacritic markers are segmented independently to obtain discriminative features at the connected-component level (Fig. 3).
HOG features are calculated for each diacritic image.
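The projection-based part of the word feature vector can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the 18 parts are assumed to be vertical strips of equal width (the actual layout is defined in Fig. 2), the image is assumed to be at least 18 pixels wide and non-empty, and the corner and connected-component counts are left out.

```python
def word_features(img, parts=18, segments=4):
    """Strip sums, projection extremes, mean and center of gravity of a binary image."""
    h, w = len(img), len(img[0])
    # Vertical projection: amount of ink in each column.
    projection = [sum(img[y][x] for y in range(h)) for x in range(w)]
    # Feature group 1: sum of pixels in each of the `parts` strips.
    cuts = [i * w // parts for i in range(parts + 1)]
    strip_sums = [sum(projection[cuts[i]:cuts[i + 1]]) for i in range(parts)]
    # Feature group 2: maximum and minimum of each quarter of the projection.
    quarters = [projection[i * w // segments:(i + 1) * w // segments]
                for i in range(segments)]
    maxima = [max(q) for q in quarters]
    minima = [min(q) for q in quarters]
    # Remaining features covered here: mean value and center of gravity (x, y).
    total = sum(projection)
    mean = total / (h * w)
    cx = sum(x * projection[x] for x in range(w)) / total
    cy = sum(y * sum(img[y]) for y in range(h)) / total
    return strip_sums + maxima + minima + [mean, cx, cy]


# A 4 x 36 test image: fully inked except for an empty first column.
img = [[0] + [1] * 35 for _ in range(4)]
features = word_features(img)   # 18 + 4 + 4 + 3 = 29 values
```

Because the strip sums scale with the size of the rendered word while the normalized quantities do not, a sketch like this also makes it easy to see which features are sensitive to font size.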

Classification
The Multiclass Support Vector Machine (MCSVM) algorithm is applied once the word features and the HOG features of each diacritic have been prepared in the previous step. The MCSVM classifier maps the input to the classes learned from the training data in order to assign the word class and identify the diacritic marks in the same word, as follows:
1. Whole-word classification: as mentioned previously, each word is considered a standalone class. In this step, the database matrices of each class are loaded and rearranged as the training set. The order of the training matrices determines the class number, as can be seen in Table 2
2. Diacritics classification: seven classes of diacritics were classified in this stage (hmzh همزة, ftḥh فتحة, ksrh كسرة, ḍmh ضمة, šdh شدة, skwn سكون, nqṭh نقطة, nqṭtyn نقطتين). The classification approach for diacritics is the same as for the whole word: we load the training database and specify the class number of each diacritic

As can be noticed, some diacritics match in structure, such as (ftḥh فتحة, ksrh كسرة) and (nqṭh نقطة, nqṭtyn نقطتين). The following steps are followed to distinguish between them:
i. Compute the center of each diacritic marker in the input image individually
ii. Compute the center of the whole word (as described in the dilation step)
iii. Subtract the y coordinate of each diacritic center (Yd) from the y coordinate of the word center (Yc)
iv. Examine the sign of the result. If it is negative, the marker is a (ksrh كسرة) or a (nqṭh نقطة) under the word, and so on; otherwise, the diacritic marker is located above the word

The number of diacritics differs from word to word. The complete system code repeats the classification function once for each diacritic in the word. The output of this function (Diacritics_class{i}) is a cell matrix that contains the class number of each diacritic; it is converted to a one-dimensional matrix and concatenated with the word class to generate the key for the lexicon (Table 3).
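Steps i to iv amount to a sign test on the difference Yc - Yd, which can be sketched as follows. The sketch assumes image coordinates in which the row index grows downward, so a marker under the word has a larger y than the word center; the function names and the label table are hypothetical and are not taken from the system code.

```python
def centroid(pixels):
    """Center (y, x) of a list of (y, x) pixel coordinates."""
    n = len(pixels)
    return (sum(y for y, _ in pixels) / n, sum(x for _, x in pixels) / n)


def position(word_center_y, diacritic_center_y):
    """Steps iii-iv: a negative Yc - Yd means the marker lies under the word."""
    return "below" if word_center_y - diacritic_center_y < 0 else "above"


# Hypothetical labels for one look-alike pair named in the text.
STROKE = {"above": "ftḥh", "below": "ksrh"}

word_pixels = [(5, x) for x in range(10)]   # a flat word body on row 5
mark_above = [(1, 4), (1, 5)]               # a short stroke above the word
mark_below = [(9, 4), (9, 5)]               # the same stroke under the word
yc = centroid(word_pixels)[0]
labels = [STROKE[position(yc, centroid(m)[0])] for m in (mark_above, mark_below)]
```

The two strokes have identical shapes, so only their position relative to the word center separates them.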

Translation
At the end of the previous step, the proposed OCR algorithm terminates, having produced the required key, which consists of the word class and the diacritics classes. In the next phase, we look up the matching entry in our lexicon to give the English translation of the word. Terms and words in Arabic have multiple meanings, and almost all machine translation systems suffer from this problem. Non-native Arabic speakers who are interested in Arabic and Islamic studies have raised this problem in the Google Translate community. Our own lexicon contains Arabic homographs in order to provide the correct English meaning. In this lexicon, the words are arranged in an Excel sheet with three columns: Arabic word, Arabic meaning and English meaning. The Excel sheet is loaded into the MATLAB workspace by the system code, which creates a map object that maps each key to its values using the container map function.
In our case, the retrieved values are the meanings of the word in Arabic and English, whereas the key is a vector that consists of the class numbers of the diacritics concatenated with the class number of the word (Table 3).
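The lookup can be sketched with a Python dictionary standing in for the MATLAB map object. The three keys and English meanings below are the sample entries reported in the experimental results; placing the word class first in the key is an assumption inferred from those samples, and the function names are hypothetical.

```python
# Sample lexicon entries; the key is the word class followed by the diacritic classes.
LEXICON = {
    "9 6 5 -7 5 4 5 1": "Wednesday / cross-legged sitting",
    "9 6 5 -7 5 4 6 1": "name of place",
    "9 6 5 -7 5 4 -5 -1": "Wednesday",
}


def make_key(word_class, diacritic_classes):
    """Join the word class and the diacritic classes into one lexicon key."""
    return " ".join(str(c) for c in [word_class, *diacritic_classes])


def translate(word_class, diacritic_classes):
    """Return the English meaning, or None when the key is not in the lexicon."""
    return LEXICON.get(make_key(word_class, diacritic_classes))


meaning = translate(9, [6, 5, -7, 5, 4, 6, 1])
missing = translate(9, [1, 1, 1])
```

Returning None for an unknown key makes recognition errors visible: a single misclassified diacritic yields either no entry or the entry of a different homograph.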

Experimental Results
We discuss the collected data (Triangle of Language (ālmṯlṯātāllġwyh)) and the analysis approach. Then, we point out salient results for translating captured images of Arabic three-word homograph groups after they have been recognized, taking the diacritic markers into consideration. To demonstrate the effectiveness of the proposed algorithm, two hundred Arabic homographs were taken from the Arabic poem (mṯlṯqṭrb) and used as the input dataset for the experiments, to be recognized and then translated.
In this section, we describe three types of experiments. First, we apply our procedure to a certain number of Arabic word images, then evaluate the system and measure the accuracy of the OCR. Next, we change the ratio between training images and testing images and show how that affects the accuracy and performance of the system. Finally, we evaluate and measure the accuracy of the diacritic marker recognizer.
As mentioned in the proposed approach section, the experiments were performed to validate the system on the Arial font type and multiple font sizes (16, 22 and 36). The experiments were executed using a low-resolution mobile camera to take word images with a width of 500 pixels and a height of 750 pixels. The output of the MCSVM classifier, whether for the word or for the diacritics, generates the key for our lexicon. Table 4 defines the class number of each diacritic used in our system. The word class, on the other hand, is specified according to the arrangement of the training set in our system; for example, if the training set is Alef1 then the class number (Group set) is 1, if the training set is Alef2 the class number (Group set) is 2, and so on (Table 2).

Lexicon entries (key | Arabic word | Arabic meaning | English meaning):
"9 6 5 -7 5 4 5 1" | الأَرْبَعَاءُ | اسم يوم / اسم موضع / أن يجلس متربعا | Wednesday / cross-legged sitting
"9 6 5 -7 5 4 6 1" | الأرْبَعَاءُ | اسم موضع | name of place
"9 6 5 -7 5 4 -5 -1" | الإرْبَعَاءُ | اسم يوم | Wednesday

It is worth mentioning that the key of our lexicon consists of the word class and the diacritics classes. To clarify the system approach, we illustrate it with several examples. Example 1 and Example 2 show the similarity between the feature vector of a test image and the training set matrix for the same image, whereas Example 3 shows how segmentation problems affect the efficiency of the proposed OCR.

Example 1
The word (ālʾrbʿāʾ) has class number 9, according to the order of the training data for this word (Alef 9) in our lexicon, after classifying its feature vector, which comprises 31 coefficient values at font size 22 (Table 3). The diacritics classes for each structure of the word are defined in Table 5.

Example 2
Consider the homographs of the word (الجرم) (aljrm). The word has class number 96, as can be seen in Table 5, according to the order of the training data for this word (je18.mat) in our lexicon. The three experiments use different numbers of samples for training and testing. The feature values are extracted for each training image and then stored in matrix format. The training model is constructed from the feature matrix and labels that were created manually.

Experiment 1
In this experiment, 540 Arabic word images distributed over 135 classes were used; each word class was trained using 3 samples and tested using one word sample. There was no overlap between the training and testing samples. The values of the evaluation parameters were calculated using MATLAB 2017 software (MATLAB, 2019). Table 6 shows the evaluation parameters for this experiment.

Experiment 2
In this experiment, we used 1215 Arabic word images distributed over 135 classes; each word class was trained using 6 samples and tested using 3 word samples. The evaluation parameters were calculated with the same method as in Experiment 1 and are shown in Table 7.

Experiment 3
This experiment is based on images of diacritic markers. Table 4 presents the 7 classes of diacritic markers; each class was trained using 54 samples and tested using 18 samples, so the total number of images in this experiment is 504 (7 × 72).
The small number of diacritics classes makes it easy to display the result, as shown in Fig. 4, which displays the confusion matrix for the diacritic marker classes. A MATLAB function was used to calculate the classification accuracy of the diacritics. Table 8 presents the evaluation parameters for this experiment.
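Classification accuracy can be read off a confusion matrix as the share of samples on its diagonal. The sketch below shows that calculation in plain Python on made-up toy labels; it is not the MATLAB function used in this work and the numbers are not experimental results.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """cm[t][p] counts the samples of true class t that were predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        cm[t][p] += 1
    return cm


def accuracy(cm):
    """Correctly classified samples (the diagonal) divided by all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(map(sum, cm))


# Toy labels for three classes (illustrative only).
truth = [0, 0, 1, 1, 2, 2, 2]
preds = [0, 0, 1, 2, 2, 2, 1]
cm = confusion_matrix(truth, preds, 3)
acc = accuracy(cm)   # 5 of the 7 toy samples lie on the diagonal
```

Off-diagonal cells show which classes are confused with each other, which is useful for look-alike diacritics.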

Discussion
The difference in font size (16, 22 and 36) plays an essential role in the allocation of pixels in each image, so feature number 1 (the pixel sums in the 18 divisions) differs completely between the training dataset and the testing vectors. The gravity and the values of the projection points, however, are similar. The last three coefficients (mean, number of connected components and number of corners) also match, because the structure of the word is fixed regardless of font size. Different circumstances, controlled and uncontrolled, also play an essential role in the performance of the system. The controlled conditions include the lexicon size (number of words), the database type, and the font type and size, which clearly affect accuracy, especially when the number of testing images changes. In Experiment 1, one test image per class was used, chosen randomly regardless of its font size, whereas in Experiment 2 three testing images were used, which increases the chance of a mismatch with the corresponding training feature matrix and greatly increases the error ratio. Based on the above, the recognition accuracies and other evaluation parameters for the two experiments are reported in Table 9.
Figure 5 shows the method followed for capturing an image with a mobile camera. The whole word in this image is located at the bottom of the image, so the central line passes through the upper part of the word rather than its middle. This affects the segmentation stage: some diacritics are not separated correctly from the whole image, as can be seen in Fig. 6. Some people are better than others at holding a camera steady. If images are not sharp, in other words if vibration degrades the image quality, a segmentation problem occurs; see Fig. 7 for illustration. In the example of (alall) (الأُلل), the result of vibration is clear. Figures 8 to 10 show other affected phrases in our system. One of the most serious effects of vibration is melting: Fig. 9 shows melting across the whole image, which either omits the characteristic features of some diacritics or causes a segmentation fault because diacritics are treated as part of the whole word. Using a mobile stabilizer decreases the effect of vibration and helps solve the problem.

Conclusion
This research presents an approach for recognizing printed Arabic words in the Arial font at multiple font sizes (16, 22 and 36). A segmentation algorithm is applied after the morphological operations and produces two images: one for the Arabic word stripped of diacritics and the other containing the diacritics. A set of structural and geometrical features is obtained for the segmented Arabic word, in addition to HOG features extracted from the diacritics image. The word and the diacritics are recognized using an MCSVM classifier to provide an Arabic-to-English translation of the whole word (taking the diacritic markers into consideration). The experiments were carried out using a collected dataset (Triangle of Language (ālmṯlṯātāllġwyh)) to test the proposed system. The achieved recognition rate exceeded 90.92% at the word level and 96.83% for diacritic markers.
The following points are recommended as future work:
1. Increasing the number of words in the database
2. Expanding the idea from the translation of words to whole sentences: segmenting each word using the proposed approach, then using MT approaches to provide non-literal translation
3. Improving the quality of input images by controlling the circumstances in which images are captured, to increase the accuracy of the OCR
4. Using parallel and GPU techniques to extract entries from the lexicon in order to enhance the speed of the system
5. Providing a framework for the recognition of texts in multiple font types
6. Exploiting Google cloud features to enhance the efficiency of MT