Bangla-English Machine Translation Using Attention-based Multi-Headed Transformer Model

Corresponding Author: M. A. H. Akhand, Department of Computer Science and Engineering, Khulna University of Engineering and Technology, Khulna 9203, Bangladesh. Email: akhand@cse.kuet.ac.bd

Abstract: Machine Translation (MT) refers to translating texts or documents from a source language into a target language without human intervention. Any MT model is language-dependent and its development requires grammar, phrase rules, vocabulary, or relevant data for the particular language pair. Hitherto, little research on MT for Bangla-English has been reported in the literature, although Bangla is a major language. This study presents a deep learning-based MT system for translation in both directions for the Bangla-English language pair. The attention-based multi-headed transformer model is considered in this study because it processes its input in parallel. A transformer model consisting of encoders and decoders is adapted by tuning different parameters (especially the number of heads) to identify the best performing model for Bangla to English and vice versa. The proposed model is tested on the SUPara benchmark Bangla-English corpus and evaluated using the Bilingual Evaluation Understudy (BLEU) score, which is currently the most popular evaluation metric in the MT field. The proposed method proves to be a promising Bangla-English MT system, achieving BLEU scores of 21.42 and 25.44 for Bangla to English and English to Bangla MT, respectively.


Introduction
Translation of speech or text from one natural language to another is often indispensable in politics, business, research and other areas. Translation by human experts has been the established approach for centuries. Human translators perform an expert job interpreting conversations between two parties (e.g., country chiefs, tourists, business giants) who speak different languages. Globalization today requires translating web content (e.g., websites, references, documents) in everyday life. The sheer volume of such content (especially text, documents and web pages) has made machine translation an emerging research field in recent years (Garg and Agarwal, 2018).
The idea of natural language translation using computer systems appeared in the 1950s (Hutchins, 2000). Machine Translation (MT) became a research field following the public demonstration of the Georgetown-IBM experiment (Hutchins, 2005). In its most basic form, MT directly substituted source-language words with target-language words (Hutchins, 1995). However, word-for-word translation alone does not preserve enough semantic meaning to be useful in real life. Over the last several decades, the research community has therefore developed new methods to improve the quality of MT.
The MT methods are broadly categorized into four approaches: Rule-Based MT (RBMT), Example-Based MT (EBMT), Statistical MT (SMT) and Neural MT (NMT). A number of hybrid methods combining two individual approaches are also available, e.g., RBMT and SMT (Xuan et al., 2012). RBMT is basically based on linguistic information and it produces translation through rules generated by human experts considering verbs, phrases, prepositions, etc., of the language pair (Bhattacharyya, 2015). EBMT takes a parallel corpus that contains the source sentence and its translation. After taking help from parallel corpus, the translation mechanism finds similar words/phrases to adopt the previously available word/phrase to translate a new sentence (Sumita and Iida, 1991). SMT is an MT model which generates translation on the basis of probability generated through statistical analysis of bilingual aligned corpora (Babhulgaonkar and Bharad, 2017). NMT is the most recent method with encoders and decoders in the core; it is a data-driven approach that trains a special Neural Network (NN) model for MT (Kalchbrenner and Blunsom, 2013). NMT has emerged as a powerful approach to MT research with the advancement of deep neural networks over the last decade (Stahlberg, 2020).
A number of notable studies with rich resources have achieved good performance for the English-French (Luong et al., 2014), English-German (Jean et al., 2014) and English-Chinese (Wang et al., 2018) language pairs. In contrast, MT resources for Bangla are very limited even though it is a major world language, ranked fifth globally with 228 million native speakers, and the first language of Bangladesh (Akhand et al., 2016). A number of Bangla-English MT studies are available with different methods, but their results are not significant compared with those for resource-rich languages (Dandapat and Lewis, 2018; Hasan et al., 2019a; Siddique et al., 2021). Therefore, the aim of this study is to develop an NMT system for the Bangla-English language pair.
A deep learning-based transformer model is investigated in this study to develop an MT system, taking advantage of the transformer's parallelism in input data processing. A transformer model consists of encoders and decoders, whose hyperparameters are tuned to identify the best performing MT model for Bangla to English and vice versa. The proposed model is tested on the SUPara benchmark Bangla-English corpus and evaluated using the Bilingual Evaluation Understudy (BLEU) score. On the basis of the achieved BLEU scores, the proposed method proves to be a promising Bangla-English MT system when compared with prominent existing methods.
The rest of the paper briefly reviews existing Bangla-English studies, describes the proposed methodology, reports experimental studies and results. Finally, the paper concludes the findings with a few remarks.

Related Studies
A number of studies have been reported over the last decade for Bangla-English MT with different techniques. Most of the existing studies consider only the Bangla to English (denoted as B2E) or the English to Bangla (denoted as E2B) case. Among the existing studies, the E2B method called ANUBAAD (Naskar et al., 2004) is the pioneering one; it is a hybrid MT system that explicitly combines EBMT and RBMT. ANUBAAD considers noun phrases, adverbial phrases and verb phrases. The system morphologically analyzes the input sentences and defines some formal grammars. Noun phrases and adverbial phrases are translated through EBMT with a template matching module, whereas verb phrases use the RBMT approach.
Several other EBMT and RBMT methods have been proposed for E2B MT in recent years. Dandapat et al. (2010) investigated a Translation Memory (TM) based EBMT architecture for E2B. They built two TMs: one based on phrase-pair alignment and the other based on a word-aligned file from the source to the target language. Finally, they integrated TM with EBMT and compared it with basic EBMT. Salam et al. (2013) suggested an EBMT method emphasizing unknown word handling using WordNet and International Phonetic Alphabet (IPA) based transliteration with software. Salam et al. (2017) proposed another EBMT method for E2B where the unknown words are searched in WordNet using synonyms, antonyms and hypernyms. Francisca et al. (2011) proposed an E2B RBMT that divides the words of English sentences based on sentence characteristics like grammar and structure. A lexical analyzer is used to generate the class of sentences utilizing the information of the word from a dictionary. With the help of the partially or fully matched fuzzy rules, output Bangla sentences are generated using a dictionary. Ashrafi et al. (2013) used Context-Free Grammar (CFG) to replace the tokenized words with variables in their E2B RBMT. CFG provides grammatical rules according to the English and Bangla language structures. They created an intermittent parse tree to stimulate computational history. The outcome is the substitution of the English words with equivalent Bangla meaning as well as reordering the previous tree to get the actual parse tree by Bangla CFG rules. Muntarina et al. (2013) proposed an E2B RBMT model on the basis of tense-based rules. The model constructs a parse tree for input English sentences and then converts it into a Bangla parse tree based on production rules for both languages generated by syntactic and morphological analysis. Rabbani et al. (2014) proposed an E2B RBMT approach, which transforms different forms of English sentences (like active, passive, assertive, interrogative, imperative, exclamatory, simple, complex and compound) into simplified forms, i.e., subject + verb + object. After identifying the principal verb from the English sentence, it binds the rest of the parts of speech as subject and object. Bangla output sentences are generated by translating the English words of the newly structured English sentences. Recently, Haque and Hasan (2018) proposed an algorithm that takes person, verb root and tense as arguments and finds the appropriate verb form for the sentence, which was later applied to an E2B RBMT system architecture.
A few studies have been carried out on B2E RBMT. Anwar et al. (2009) used Context-Sensitive Grammar (CSG) rules to analyze a Bangla sentence syntactically. The sentences can be simple, complex, or compound. After analysis, the sentences are translated into English. Rahman et al. (2010) proposed a method of using root words to translate Bangla to English. Morphological analysis is used to find the root word. In addition to the root word, the parts of speech and grammar of the source sentence are also detected. After combining all of these, a Bangla sentence is translated into English. Anwar et al. (2010) focused on the lexical mappings and structural analysis of Bangla sentences. They introduced a rule-based grammatical approach to perform syntactic analysis on every type of sentence. The system tokenizes the Bangla words based on the lexicon and uses a parser to group the tokenized words according to grammatical rules. Chowdhury (2013) proposed a system where Bangla sentences are read from left to right and corresponding English words are generated by using a dictionary and the context of the Bangla sentence. In addition to word generation, a set of grammatical rules is used to analyze the source sentence properly. Arefin et al. (2015) used CSG rules for translating assertive, interrogative and imperative Bangla sentences into English. The rules are developed based on the mood of the sentence, ignoring sentence structure. Alamgir et al. (2016) also used CSG to translate imperative, optative and exclamatory Bangla sentences into English. Mukta et al. (2019) proposed a phrase-based E2B MT using fuzzy rules. The system takes different types of sentences as input, distinguished by tense, phrase, and affirmative or negative form. The system also emphasizes English grammar, verbs, prepositions, inflection and other grammatical rules on Bangla. After tokenizing and matching fuzzy rules, the model translates English sentences to Bangla with the help of a dictionary. Anwar (2018) also used fuzzy logic for B2E MT, which includes syntactic analysis of the source language and generation of the target language.
As a data-driven approach, the SMT model has been developed in several Bangla-English MT studies. Roy and Popowich (2010a) presented a phrase-based B2E SMT with a unique transliteration method. In addition, a specialized component for detecting prepositions and Bangla compound words is used to improve the performance. Roy and Popowich (2010b), in another work, presented a word reordering technique with SMT that had a positive effect on overall performance. Recently, Al Mumin et al. (2019a) presented a phrase-based SMT model (called shu-torjoma) for both B2E and E2B. The proposed system significantly outperforms other developed systems. On the other hand, Rabbani et al. (2016) proposed a hybrid phrase-based E2B MT using the concepts of RBMT and SMT. The model finds the principal verb from any kind of sentence and then converts the sentence into the simplest form.
Deep learning-based NMT is a recent trend in MT systems for different languages and a few studies are available for Bangla-English. Hasan et al. (2019a) used the Bidirectional Long Short-Term Memory Network (BiLSTM) and the transformer, two popular deep learning methods, for B2E NMT. In a comparison between the methods, the BiLSTM-based model is found to be better than the transformer. Hasan et al. (2019b) compared BiLSTM-based methods with SMT. They also used different datasets and measured which model worked better for which dataset. Their results showed that the NMT model provides a better result than the SMT model. Dandapat and Lewis (2018) developed a general-purpose English-Bangla MT system and worked on both SMT and NMT. They used the Phrasal (Green et al., 2015) (for B2E and vice versa) and Treelet (Quirk et al., 2005) (for E2B) translation models with different training sets. They also developed a word segmentation model to handle unknown words. They showed that NMT works better than SMT. Recently

Attention-based Multi-Headed Transformer Model for Bangla-English MT
A transformer deep learning model with a multi-headed attention mechanism for both B2E and E2B MT is proposed in this study. The method comprises two major phases: the data preprocessing phase and the transformer model training phase.

Data Preprocessing
For the NMT system, data preprocessing includes tokenization, true-casing, normalizing punctuation and removing non-printable characters from the data. Long sentences and empty sentences may cause problems, so the sentence length is limited in this study, as in any NMT system. The Byte Pair Encoding (BPE) algorithm (Gage, 1994; Sennrich et al., 2015) is applied to the corpus for subword segmentation to handle rare words. Preprocessing depends on the data to be used and is explained for the selected data in the experimental studies section.

Transformer Architecture and Its Adaptation
The recently proposed deep learning model, Transformer (Vaswani et al., 2017), is one of the most significant models in the field of Natural Language Processing (NLP). Its significance is that data do not need to be fed into the model sequentially, which permits parallelism. So, a transformer model ensures fast training for NLP tasks. Transformers are widely used in MT, time series prediction (Maxime, 2019), named entity recognition (Davydova, 2017), document generation (Radford et al., 2019) and biological sequence analysis (Nambiar et al., 2020; Rives et al., 2021). Another reason for choosing the model is the availability of an open-source implementation that can be customized for a particular task in the OpenNMT toolkit (Klein et al., 2017).

Figure 1 demonstrates the layers of the transformer model, which has four main operating units: Embedding, Encoder, Decoder and Output Generation. In the embedding layer, the words in a sentence are converted into word embeddings. A word embedding is a fixed-size vector representing an input word. Each embedding is then added to a positional encoding vector (with values in the range of -1 to 1) of the same dimension. The resultant vector carries all the necessary information, such as the order of the words in the input sentence and the distance between different words.
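As an illustration of the embedding step, the following minimal sketch adds positional encodings to word embeddings. It assumes the sinusoidal encoding of Vaswani et al. (2017), which is consistent with the stated range of -1 to 1; the function names, the toy embeddings and the dimension of 8 are hypothetical and chosen only for readability.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings; every value lies in [-1, 1]."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Element-wise sum of word embeddings and positional encodings."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, pe)]


# Toy 4-token sentence with hypothetical 8-dimensional embeddings.
toy_embeddings = [[0.1 * (t + 1)] * 8 for t in range(4)]
encoder_input = add_positions(toy_embeddings)
```

Because the encoding depends only on the position and the dimension index, it can be computed once and reused for every sentence up to the maximum length.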
The resultant embedding vector of numerical values is the input to the encoder module. The encoder module contains several encoders in cascade, and the encoded vector is the outcome of the encoder module after the successive operation of the individual encoders. The Encoder Output Vector (EOV) is fed to the decoder module and output words are generated sequentially, considering the decoders' current status and the previously generated words. For the sample input sentence in English, 'I Love My Country', the Bengali word 'আমি' (phonetic: Ami; means I) is generated first through successive operations on the EOV by the decoders and the output generation. To generate the second word, the already generated output word 'আমি' is fed into the decoder module and the word 'আমার' (phonetic: Amar; means my) is generated. The word 'আমার' is used to generate 'দেশকে' (phonetic: Deshke; means country). Finally, the last word 'ভালোবাসি' (phonetic: Bhalobashi; means love) is generated when the third output word is fed into the decoder module.
The separate encoder and decoder modules are the core of the transformer model, which handles the attention mechanism to improve NMT performance. The encoder (or decoder) module is a stack of several encoders (or decoders) and the number of encoders and decoders is generally the same. Figure 2 presents the general architectures of an encoder and a decoder, illustrating the individual layers. An encoder has two main sub-layers: Multi-Headed Attention (MHA) and Feed Forward NN (FFNN). In each sub-layer, the sub-layer's input vector is added to the vector produced by the MHA/FFNN, and normalization is performed on the sum.
The attention mechanism enables the transformer model to understand how relevant the other words are to the word that is currently being processed. First, the attention process multiplies the Embedded Input Vector (EIV) with three matrices (Wq, Wk and Wv) individually and creates three vectors: a Query vector (q), a Key vector (k) and a Value vector (v). These new vectors are smaller in dimension than the EIV and are essential for the attention calculation. The second step is to calculate a score that determines how much focus to put on other parts of the input sentence. To calculate the score for the first word of the shown example (i.e., 'I'), every word in the input sentence is scored against the first word. This is done by calculating the dot product of the query vector of the first word with the key vector of each of the n words in the sentence. Therefore, the scores for 'I' are q1·k1, q1·k2, q1·k3, ..., q1·kn. The third and fourth steps are to divide the scores by the square root of the dimension of the Key vectors and to pass the result through a Softmax operation. Softmax normalizes the scores so that they are all positive and add up to 1. The fifth step is to multiply each value vector by its Softmax score. The sixth step is to sum up the weighted value vectors. This produces the output of the attention layer for the first word.
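The six steps can be traced in a short, dependency-free sketch of single-head self-attention. The weight matrices and the toy input below are hypothetical values chosen for illustration; in a trained model Wq, Wk and Wv are learned.

```python
import math


def matvec(vec, matrix):
    """Multiply a row vector (length d_in) by a d_in x d_out matrix."""
    return [sum(v * matrix[i][j] for i, v in enumerate(vec))
            for j in range(len(matrix[0]))]


def softmax(scores):
    """Normalize scores so they are all positive and add up to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(eiv, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head.

    eiv holds one embedded input vector per word.
    Returns (outputs, weights), each with one row per word.
    """
    # Step 1: create query, key and value vectors for every word.
    queries = [matvec(x, w_q) for x in eiv]
    keys = [matvec(x, w_k) for x in eiv]
    values = [matvec(x, w_v) for x in eiv]
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Step 2: dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        # Step 3: divide by the square root of the key dimension.
        scaled = [s / math.sqrt(d_k) for s in scores]
        # Step 4: Softmax.
        w = softmax(scaled)
        # Steps 5 and 6: weight each value vector and sum them up.
        out = [sum(w[t] * values[t][j] for t in range(len(values)))
               for j in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy data: 3 words, 4-dimensional EIV, 2-dimensional q/k/v.
eiv = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
outputs, weights = attention(eiv, w_q, w_k, w_v)
```

Each row of `weights` shows how much one word attends to every word in the sentence, and each row of `outputs` is the corresponding weighted sum of value vectors.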
The transformer model uses MHA, with the above-described attention applied in each individual head. In the multi-headed case, a separate set of Query, Key and Value vectors is produced for each individual head. This expands the model's ability to focus on different positions. For example, to translate a sentence like "The animal didn't cross the street because it was too tired", the MHA helps to know which word "it" refers to.
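The bookkeeping behind multiple heads can be sketched as follows. This assumes, following Vaswani et al. (2017), that the model dimension is divided evenly among the heads and that the head outputs are concatenated afterwards; the helper names and the specific head counts (powers of two from 1 to 32) are illustrative assumptions rather than details stated in this study.

```python
def split_heads(vector, num_heads):
    """Split one model-dimension vector into equal slices, one per head."""
    if len(vector) % num_heads != 0:
        raise ValueError("model dimension must be divisible by num_heads")
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


def merge_heads(heads):
    """Concatenate the per-head vectors back into one vector."""
    return [x for head in heads for x in head]


# With a 512-dimensional embedding, more heads means smaller heads.
vec = [float(i) for i in range(512)]
head_sizes = {h: len(split_heads(vec, h)[0]) for h in (1, 2, 4, 8, 16, 32)}
```

Under this assumption, two heads of 256 dimensions and 32 heads of 16 dimensions use the same total width, so changing the number of heads trades the capacity of each head against the number of positions that can be attended to independently.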
A decoder consists of three sub-layers: MHA, Encoder-Decoder Attention (EDA) and FFNN. The operations of MHA and FFNN are the same as in an encoder, whereas the EDA sub-layer has no counterpart in an encoder. The EDA layer works just like MHA, except that it creates its query matrix from the layer underneath it and takes its key and value matrices from the output of the encoder module. The output generation mainly consists of the Linear and Softmax layers, which convert the decoder output vector into probability values over the target vocabulary. These values help the model generate the next token.
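The output generation step can be illustrated with a small sketch: a linear projection produces one score per vocabulary entry, Softmax turns the scores into probabilities, and the most probable token is selected. The four-word vocabulary, the projection weights and the greedy selection are hypothetical choices for illustration only.

```python
import math


def output_generation(decoder_vector, projection, vocabulary):
    """Linear layer followed by Softmax; returns (best_token, probabilities)."""
    # Linear layer: one score (logit) per vocabulary entry.
    logits = [sum(x * w for x, w in zip(decoder_vector, weights))
              for weights in projection]
    # Softmax: positive values that add up to 1.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Greedy choice of the most probable token.
    best = max(range(len(vocabulary)), key=probs.__getitem__)
    return vocabulary[best], probs


vocabulary = ["আমি", "আমার", "দেশকে", "ভালোবাসি"]
# Hypothetical projection weights: one row per vocabulary entry.
projection = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
token, probs = output_generation([1.0, 0.2], projection, vocabulary)
```

In the full model the selected token is fed back into the decoder module to generate the following word, as in the 'I Love My Country' example.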
There are several hyperparameters in the transformer model, such as batch size, dropout, learning rate, number of encoder layers, number of decoder layers, number of heads, etc. To achieve better performances, the hyperparameters of the transformer model should be adjusted. Multiple numbers of heads help the self-attention and make the attention layer work better as it increases the model's ability to guess the other words referring to a particular word that is currently being processed.

Experimental Studies
This section describes the experimental outcomes of the proposed NMT system on the chosen benchmark dataset. The performance of the proposed method is also compared with existing methods.

Benchmark Data and Preprocessing
A few parallel corpora are available for Bangla-English MT. In this study, the SUPara dataset (Al Mumin et al., 2012) is used, as a number of recent studies have used this corpus (Al Mumin et al., 2019a, 2019b; Hasan et al., 2019a, 2019b). The dataset contains 70,861, 500 and 500 parallel sentences for the training, validation and test sets, respectively. In the data processing step, tokenization, true-casing, normalizing punctuation and removing non-printable characters are performed using Moses (Koehn et al., 2007), the open-source toolkit for MT. Moses changes the raw sentences into a number of tokens where words and punctuation marks are separated by a space. As long sentences and empty sentences may cause problems, the sentence length is limited to 40 in our NMT model. A small corpus results in a poor dictionary, which might cause a large number of unknown words in the test case. To handle such scenarios, sub-word segmentation is employed using the BPE algorithm. The algorithm counts the frequency of each word in the corpus and adds a special stop symbol </w> at the end of each token. Characters are then separated. After that, the algorithm finds the most frequent pair of consecutive symbols and merges the pair into one token. As an example, the BPE algorithm can recognize 'r' and 'o' as a frequent consecutive token pair and thus merge them into one token 'ro'. The same explanation applies to the 'ses' token. Then the BPE algorithm divides the token 'roses' into two sub-words, 'ro' and 'ses', adding '@@' in between them. After all these preprocessing operations, the training, validation and test sets contain 65,855, 366 and 361 parallel sentences, respectively. Table 1 shows the effect of preprocessing on a few sample sentences in both Bangla (for B2E) and English (for E2B). For the Bangla sentences, English phonetics and meanings are given for the benefit of international readers. As the table shows, a few sentences contain '@@' markers after preprocessing.
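The merge-learning loop of BPE described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation used in the experiments: the function name `learn_bpe` and the two-word corpus are hypothetical, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary."""
    # Separate each word into characters and append the stop symbol.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every pair of consecutive symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus: 'r' + 'o' is the most frequent pair (4 + 3 = 7).
merges, vocab = learn_bpe({"roses": 4, "road": 3}, num_merges=1)
# merges == [('r', 'o')]
```

Running more merge operations on a realistic corpus would gradually build longer sub-words such as 'ses'; at segmentation time, words are split according to the learned merges and the split points are marked with '@@'.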

Performance Evaluation and Experimental Setup
For evaluating the performance of the model, the Bilingual Evaluation Understudy (BLEU) score is measured, which is currently the most popular evaluation metric in the MT field (Papineni et al., 2001). It is a precision-oriented measurement and evaluates the correctness of system output. The BLEU score is measured in three steps. In the first step, the n-gram matches between the candidate sentences (system output) and the reference sentences are counted. The candidate counts are then clipped by their corresponding maximum reference value. Next, the clipped n-gram counts are summed and divided by the number of candidate n-grams (Papineni et al., 2001). This step gives the modified precision score (p_n):

$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} \mathrm{Count}_{clip}(n\text{-gram})}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} \mathrm{Count}(n\text{-gram}')} \qquad (1)$$

Here Candidates denotes the complete corpus and C denotes a hypothesis sentence. The second step is the BLEU Brevity Penalty (BP) factor calculation:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \qquad (2)$$

Here c is the length of the candidate translation and r is the length of the reference translation. Finally, the BLEU score is the geometric mean of the precision scores multiplied by BP and is calculated using Eq. (3):

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \qquad (3)$$

Here, N is set to 4 as in the baseline system and w_n is a positive weight that is typically set to 1/N. The BLEU score represents the proficiency of an MT system and a higher value indicates better performance. A translation with a score between 20 and 29 is quite understandable (Cloud, 2021).

The proposed NMT model is implemented using the OpenNMT toolkit (Klein et al., 2017). To train the model, we used a batch size of 4096 and 2048 neurons in the FFNN. The word embedding size was 512. The number of encoder and decoder layers was kept at 6. The Adam optimizer (Kingma and Ba, 2015) is used for training the model with a dropout of 0.1. The values of alpha, beta1 and beta2 are 0.00031, 0.9 and 0.998, respectively. The PC on which the experiments were conducted had the following configuration: 7th Generation Intel® Core™ i5-7400 CPU @ 3.50GHz and NVIDIA GeForce GTX 1070Ti GPU, 8 GB.
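As a sketch of the BLEU computation described earlier in this subsection, the following plain-Python function combines the clipped n-gram precision, the brevity penalty and the weighted geometric mean with N = 4 and uniform weights. It assumes tokenized input with a single reference per candidate and applies no smoothing; the function names are hypothetical and this is not the scoring script used in the experiments. The result lies between 0 and 1, so it would be multiplied by 100 to match the scale of the scores reported in this paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU for tokenized candidates, one reference each."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        clipped, total = 0, 0
        for cand, ref in zip(candidates, references):
            cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
            # Clip each candidate count by its count in the reference.
            clipped += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total += sum(cand_counts.values())
        if clipped == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precision_sum += (1.0 / max_n) * math.log(clipped / total)
    c = sum(len(x) for x in candidates)  # total candidate length
    r = sum(len(x) for x in references)  # total reference length
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_precision_sum)


reference = "the cat sat on the mat".split()
perfect = corpus_bleu([reference], [reference])
truncated = corpus_bleu(["the cat sat on the".split()], [reference])
```

In the second call every candidate n-gram appears in the reference, so all precisions equal 1 and the score is determined by the brevity penalty alone, which shows why BP is needed to penalize translations that are too short.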

Experimental Result and Analysis
A number of experiments have been conducted to improve the performance of the proposed transformer model. Since the number of heads is an important issue, experiments have been performed with varying numbers of heads to identify the appropriate number. Figure 3 presents B2E and E2B BLEU scores at 10,000 and 20,000 iterations for different numbers of heads from 1 to 32. From the figure, it is observed that the BLEU scores vary with the number of heads, but the scores are not correlated with the number of heads. Therefore, identifying the best-suited number of heads for a dataset is a matter of empirical study. The best BLEU scores for B2E and E2B are 17.24 and 19.09, respectively, at 20,000 iterations when the number of heads is equal to two. It is also observed that the BLEU score at 20,000 iterations is always better than that at 10,000 iterations, which indicates that more training steps may provide a better score.

Figure 4 shows the BLEU scores for the training, validation and test sets for two heads while training continued for 100,000 iterations. From the figure, it is noticed that the BLEU score initially improved rapidly but did not improve much after a certain number of steps. As an example, after 10,000 iterations, the E2B BLEU scores for the training, validation and test sets are 81.32, 12.57 and 13.16, respectively. On the other hand, at 50,000 iterations, the scores for the three sets are 91.2, 24.15 and 23.51, respectively. A similar observation is made for the B2E case. Notably, the training set BLEU score is much better than those of the validation and test sets for both B2E and E2B cases. The better BLEU score for the training set is expected because its samples are used for training the model, and performance on the training set is a kind of memorization. Since the dataset provides a separate validation set, we report performance on it without using it in the training process. Thus, the role of the validation set is similar to that of the test set in this study, and the achieved BLEU scores for the validation and test sets are almost the same. For any MT system, the test set BLEU score is always important, as it reflects the generalization ability of the system. The proposed transformer model with two heads achieved the best B2E and E2B test set BLEU scores (at 100,000 iterations) of 21.42 and 25.44, respectively.

Table 2 compares the performance of the proposed transformer model with other prominent Bangla MT models on the basis of the achieved BLEU score on the SUPara test set. The existing methods are a phrase-based SMT and five recent deep learning-based NMT methods. The table also gives a brief description of the datasets used in the different methods. The existing methods reported BLEU scores for the SUPara test dataset, while several methods considered a more extensive training set combining different datasets with the SUPara training set. The training dataset is augmented with one or more datasets from among the Indic Languages Multilingual Parallel Corpus (ILMPC) (Nakazawa et al., 2018), Six Indian Parallel Corpus (SIPC) (Post et al., 2012), Penn Treebank Bangla-English parallel corpus (PTB), AmaderCAT (Hasan et al., 2020) and GlobalVoices (Tiedemann, 2012), which contain ~337K, ~20K, 1,313, 1,782 and 126,724 sentences, respectively. As an example, Hasan et al. (2019a) trained the transformer model with a base configuration on 419,109 sentences combining ILMPC, SIPC, PTB, SUPara and AmaderCAT. On the other hand, our transformer model with optimal heads and BPE is trained only with the SUPara training set (i.e., 70,861 sentences). The results indicate the computational proficiency of the proposed model.

The method of Al Mumin et al. (2019b) shows the best BLEU score among the existing deep learning-based methods, which is 22.68. The proposed method shows competitive performance with a BLEU score of 21.42.
It is notable that the GRU method is trained with a large training set (197,338 samples) combining SUPara and GlobalVoices. In comparison, the proposed transformer model is trained with the SUPara training set with 70,861 samples. The existing transformer model (Hasan et al., 2019a) achieved a B2E BLEU score of 18.99 with the base configuration. The outperformance of the proposed model over the existing transformer model indicates the proficiency of model tuning and head selection.
Among the existing methods presented in Table 2, only two studies considered E2B MT. The E2B BLEU scores achieved by the existing SMT and NMT methods are 15.27 and 16.26, respectively. On the other hand, the proposed model achieved an E2B BLEU score of 25.44, which is much better than the other two studies. Moreover, both existing methods are trained with samples combining the SUPara and GlobalVoices datasets, whereas the proposed model uses only the SUPara training set. The table clearly demonstrates the proficiency of the proposed transformer model for Bangla-English language pair MT.
The proposed model outperforms the existing methods because of the techniques employed and the appropriate settings. In the proposed model, sub-word segmentation helped the model to handle rare words. In addition, the proper number of heads for the proposed model, which is two, is identified through empirical study. The two heads enhance the ability of the model to put appropriate attention on different positions of words in a sentence, which helps the model to perform better in combination with sub-word segmentation.

Conclusion
In this study, an MT system for the Bangla-English language pair has been proposed using a deep learning technique. Specifically, a standard transformer model is tuned to achieve better performance for B2E and E2B MT. It is identified that two heads in the model performed better than a larger number of heads when tested on the benchmark dataset. The proposed model uses Byte Pair Encoding in preprocessing. The achieved BLEU scores are higher than 20 for both B2E and E2B cases on the benchmark dataset, which indicates the model's translation proficiency. The proposed model outperformed several leading MT methods in terms of the achieved BLEU score.
The present study also opens several research directions. This study has identified that the performance for Bangla to English is different from English to Bangla, although the training, validation and test sets were the same. Therefore, it will be interesting to investigate the effect of source-target interchange on MT performance. In this study, only SUPara training set is used to train the model; therefore, training with additional samples might improve the performance of the present model.