Context-based Machine Translation of English-Hindi using CE-Encoder

Abstract: The difficulty of obtaining accurate word alignments and of determining which target word best fits a given source context leads machine translation systems to produce divergent translations. In this study, we propose a method with a more accurate context model. Our Neural Machine Translation (NMT) approach focuses on the encoder so that it better apprehends the meaning of source sentences and thus improves translation. The recurrent encoder takes into consideration both the history and the future information of the source context. We implement the proposed approach in three steps. First, we learn the representation of the future context in advance. Second, we use a context-based recurrent encoder, called the CE-Encoder, with a two-level Gated Recurrent Unit (GRU): the bottom-level GRU gathers the history of a sentence and the top-level GRU assembles the future information. Finally, the learned future context and the history information flowing in the opposite direction are integrated. The factor distinguishing the proposed framework from existing models, specifically the Bidirectional Recurrent Neural Network (BiRNN), is that current models do not spend substantial time and capacity on learning the future context or on disambiguating source and target words based on the context defined by the source sentence. We conduct experiments on the ILCC and CFILT datasets for the English-Hindi language pair. The comparative evaluation shows that the proposed model outperforms the BiRNN encoder in terms of translation quality, improving over BiRNN by 7 BLEU points on the ILCC dataset and 9 points on the CFILT dataset.


Introduction
Machine Translation (MT) is one of the earliest applications of Natural Language Processing (NLP). It applies computational linguistics to translate text from one language to another: the meaning of the source text is decoded and then re-encoded in the target language, helping people from different regions communicate. Despite being one of the official languages of India, English is understood by less than 3% of the Indian population. Hindi, the other official language, is used by more than 400 million people, and in many regions of north India it is the only language known apart from the native language. At the same time, most Indian government records, documents, educational material, news and historical data are available in English. This is one of the primary reasons why automatic translation from English into Indian languages is gaining significant importance. However, English-Hindi translation raises challenges arising from the nature of the languages: (i) Indian languages are morphologically rich; (ii) English differs from Hindi in word order; (iii) the available parallel corpora for this language pair are very limited. Our research therefore focuses on machine translation for the English-Hindi language pair. To address the above challenges, we propose a context-based Neural Machine Translation (NMT) technique using a Context Encoder (CE) for English to Hindi. The framework achieves the best performance in comparison with the baseline methods.

Motivation and Contribution of the Work
The existing research on translation systems was largely based on Statistical Machine Translation (Liu et al., 2019), which combines translation and language models. These models build translation systems from phrases or words. Recent advances in deep neural models have led to significant breakthroughs in NLP applications such as language modeling, word embedding and paraphrase detection. NMT (Forcada, 2017) has attracted attention and provided promising results on various language pairs. This approach uses an Artificial Neural Network (ANN) to predict the likelihood of a sequence of words while modeling the entire sentence in a single integrated model.
Only a few research works, Agrawal and Sharma (2017), Parida and Ondřej (2018), Grundkiewicz and Heafield (2018) and Saini and Sahula (2018), used the NMT method for English to Hindi translation. Agrawal and Sharma (2017) use a Recurrent Neural Network (RNN) to deal with variable-length input and output by employing Gated Recurrent Units, Long Short Term Memory (LSTM) units, Bidirectional LSTMs and an attention mechanism. They do not address repeated tokens and unknown words (Knowles and Koehn, 2018) present in the sentence. Parida and Ondřej (2018) target the translation of short sentences or noun phrases with NMT, using Bidirectional RNN, shallow and deep sequence-to-sequence and Transformer models; they do not explore monolingual or large datasets. Grundkiewicz and Heafield (2018) transliterate named entities, i.e., the phonetic translation of names across languages, using deep attentional RNN encoder-decoder models. The encoder encodes the input sentence into word embeddings for a vector representation, whereas the decoder decodes that representation into another sequence of symbols, learning to align and translate simultaneously. Saini and Sahula (2018) replace LSTM with Bi-LSTM and deep Bi-LSTM by adding residual connections; their method lacks fine-tuned training on rare and long sentences using smaller datasets.
Our main contribution with respect to the above literature is investigating the pros and cons of our strategy for translating the English-Hindi language pair and highlighting the additional opportunities it provides. All four of the previous works above process sentences in isolation, whereas an extended context can prevent mistakes in ambiguous cases and improve translation coherence. The following example illustrates the context-related problem in English sentences:

I Tried to Teach Her the Meaning of Fast and Slow
In the above sentence, without knowing the future context of "slow", an RNN encoding would not know whether the word "fast" means speedy, vigorous, or abstaining from food. To understand the exact meaning of "fast", we must feed the future context information into the encoder. Words of this type are sometimes translated as "व्रत" or as "unk" tokens. Therefore, in this study, we aim to solve a context-related issue that leads to ambiguity in the translated Hindi sentence and to improve translation quality. The main contributions of our work can be summarized as follows. 1) Unlike a Bidirectional Recurrent Neural Network, where one neural network encodes the forward contextual representation of the sentence, passing information from history to future, and another neural network encodes the backward representation, our approach gives the future-word information to the past words of a sentence; by using the two combined hidden states, information is preserved from both the history and the future. We first compute the future context representation in advance and call the resulting encoder the CE-Encoder. The CE-Encoder pre-computes semantic knowledge for the source words and then feeds it into the RNN model together with the history of the context; the two work in opposite directions using a top-level and a bottom-level hierarchy. 2) In a BiRNN, we do not receive all of the input at the same time; sometimes, for the forward-pass input, the backward-pass information is not yet available, which tends to generate less accurate target sentences. The BiRNN also passes a stack of layers in the forward and backward directions simultaneously, and this stacking makes it slower than our model. 3) Through the proposed method and the conducted experiments, we show that source context is important when translating texts from several domains. Our source language, English, has Subject-Verb-Object (SVO) order, whereas the target language follows Subject-Object-Verb (SOV) order. Source-context information is therefore also relevant when source words of the same form can translate into target words of different forms.

Related Work
A brief review of the research work carried out on English-Hindi translation is given in this section. Sen et al. (2016) developed a hierarchical Phrase-Based Statistical Machine Translation (PBSMT) system for English-Hindi. They performed reordering and augmented bilingual dictionaries to improve the syntactic order of English. The PBSMT system used an independent reordering model that reorders phrases so that the English word order and the Hindi word order become similar. Banik et al. (2020) proposed an alternative SMT technique in which the scores of the phrases in the phrase table are re-balanced by increasing the weight of correct phrases and decreasing the weight of incorrect phrases. Singh et al. (2017) presented a translator that uses Translation Memory for English-Hindi and works on both fuzzy and exact matches. The translation memory is a database of segments that have already been translated to assist human translators: English text is divided into segments and matched against the database to fetch translations. Sharma and Singh (2021) incorporated a phrase-based topic model into a baseline phrase-based system and analysed the effect of topic modeling on general corpus sentences mixed with in-domain text. Jaya and Gupta (2016) considered an Out-Of-Vocabulary (OOV) approach to standardize the dataset. They investigated a corpus augmentation method to improve the translation quality of a bidirectional English-Hindi SMT system; their strategy worked well with few resources without assimilating external parallel data.
Ambiguities of content words and some function words pose challenges to MT systems. These ambiguities motivated the exploration of deep learning (Costa-jussà et al., 2017) for the English-Hindi translation system. Narayan et al. (2016) presented a quantum neural network machine translator based on machine learning over a semantically correct corpus. The system performed the translation task using knowledge gained while learning the input sentences from the source to the target language, acquiring the required knowledge in implicit form from the input sentence pairs. Saini and Sahula (2018) investigated shallow RNN (Jang et al., 2019) and Long Short Term Memory (LSTM) based NMT for machine translation, using a small dataset and few layers. They reported results for 2-layer and 4-layer LSTMs trained with Stochastic Gradient Descent (SGD), and compared these configurations with and without residual connections. Saini and Sahula (2021) also proposed the Sequential Adaptive Memory (SAM) model, an augmented version of the Cortical Learning Algorithm (CLA); they created word pairs, rules and dictionaries for translation, but used a smaller dataset.
Bhatnagar and Chatterjee (2020) adapted bilingual embeddings and autoencoder networks for English-Hindi translation. Ojha et al. (2018) performed machine translation for the Indic Languages Multilingual task of the 2018 edition of the WAT shared task, covering English to Hindi, Telugu, Bengali and Tamil in a shared statistical and neural translation task. Gupta et al. (2020) introduced a pathological-invariance methodology for syntactically similar but semantically different sentences: they replaced one word in a sentence using a masked language model and removed words or phrases based on the constituency structure. Parida and Ondřej (2018) discussed three NMT models: shallow and deep sequence-to-sequence models and the Transformer. They used the Visual Genome dataset for translating English to Hindi with out-of-domain datasets of varying sizes; their target domain was short segments appearing in descriptions of image regions in Visual Genome. Kunchukuttan et al. (2018) trained neural transliteration models for multiple language pairs, where each language pair benefits from sharing knowledge with related tasks, i.e., phonetic properties and writing systems. They used maximal sharing of network components to exploit high task relatedness arising from orthographic similarities (overlapping phonemes, similar grapheme-to-phoneme mappings) for zero-shot transliteration. Ratnam et al. (2021) developed a knowledge-based method able to handle linguistic specificities such as auxiliary and helping verbs of the source and target languages. Pathak and Pakray (2019) considered optimality in translation through the training of a neural network on parallel corpora with a considerable number of instances in the form of parallel English-Tamil, English-Hindi (Agrawal and Sharma, 2017) and English-Punjabi translations. This helps analyze the context better (Weissenborn et al., 2018) and produces fluency, making NMT a good choice for Indian languages. Very few approaches have explored context-based English-Hindi statistical and neural MT. Gaikwad (2020) explored the suitability of word-to-vector (word2vec) and hash-to-vector (hash2vec) approaches for sequence-to-sequence text translation using a Recurrent Neural Network; word2vec uses a neural network, whereas hash2vec is based on a hashing algorithm. Gupta et al. (2016) presented a methodology for lexical disambiguation in a phrase-based MT system in which the source context is used to extract information from training sentences similar to the sentences to be translated.
In contrast to these previous research works, which used unsupervised machine learning with deep autoencoders, we propose a supervised deep learning approach in which the CE-Encoder is used to solve the context-related issue. To the best of our knowledge, the proposed work is the first to consider this model.

Basic Model-Neural Machine Translation
In this section, we discuss the basics of Neural Machine Translation. The model attends to the words of a source sentence that are most relevant to the prediction of a target word, which frees the neural network from compressing source sentences, regardless of their length, into a fixed-length vector. Here, the input sentence is a sequence of words $x_1, x_2, \ldots, x_{T_x}$ that needs to be translated and the target sentence is a sequence of words $y_1, y_2, \ldots, y_{T_y}$.

Encoder
The encoder is a bidirectional RNN (Sundermeyer et al., 2014) consisting of a forward and a backward RNN. The forward RNN reads the source sentence $x = (x_1, x_2, \ldots, x_{T_x})$ from left to right, i.e., from $x_1$ to $x_{T_x}$, and computes a sequence of forward hidden states (semantic representations) $\overrightarrow{h}_j$. The backward RNN reads the sequence in the reverse order, from $x_{T_x}$ to $x_1$, and computes the backward representations $\overleftarrow{h}_j$. The forward and backward hidden states are then combined as $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$. Therefore, the source annotation $h_j$ contains a summary of both the preceding and the following words and focuses on the words around $x_j$, as described in Table 1.
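As a rough illustration, the following is a minimal Python/numpy sketch of this bidirectional encoding (not the Theano implementation used in our experiments; the toy weight shapes and the plain tanh cell, used here instead of a GRU, are simplifying assumptions):

    import numpy as np

    def rnn_pass(embeddings, W, U, reverse=False):
        # Run a simple tanh RNN over the embedded source sentence.
        h = np.zeros(U.shape[0])
        states = []
        seq = embeddings[::-1] if reverse else embeddings
        for x in seq:
            h = np.tanh(W @ x + U @ h)        # recurrent update for one token
            states.append(h)
        return states[::-1] if reverse else states

    d_w, d_h, T = 4, 3, 6                      # toy dimensions, not the paper's 512/1000
    rng = np.random.default_rng(0)
    X = [rng.normal(size=d_w) for _ in range(T)]                 # embeddings of x_1..x_T
    W_f, U_f = rng.normal(size=(d_h, d_w)), rng.normal(size=(d_h, d_h))
    W_b, U_b = rng.normal(size=(d_h, d_w)), rng.normal(size=(d_h, d_h))
    fwd = rnn_pass(X, W_f, U_f)                # forward states, read left to right
    bwd = rnn_pass(X, W_b, U_b, reverse=True)  # backward states, read right to left
    annotations = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # h_j = [fwd_j ; bwd_j]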

Decoder
In the decoder, which is a forward RNN, the soft-alignment mechanism first decides at each time step which annotation vectors are most relevant. A relevance score $e_{ij}$ for the $j$-th annotation vector is computed by a feed-forward neural network that takes the decoder's previous hidden state $s_{i-1}$, the $j$-th annotation $h_j$ of the input sentence and the previous output $y_{i-1}$. The scores $e_{ij}$ are normalized over the sequence of annotation vectors to give the alignment weights $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$. The alignment weight $\alpha_{ij}$ of $h_j$ is the probability that the target word $y_i$ is translated from $x_j$, and it is used to obtain the context vector $c_i$ of the $i$-th target word.
The $i$-th context vector $c_i$ is the expectation of the annotations under the probabilities $\alpha_{ij}$, i.e., $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$. In this way, information can be spread throughout the sequence of annotations and retrieved by the decoder accordingly. Here, $s_{i-1}$ is the decoder's previous hidden state, $y_{i-1}$ the previous decoder output, $h_j$ the $j$-th annotation of the input sentence, $e_{ij}$ the relevance score, $\alpha_{ij}$ the alignment weight giving the probability that the target word $y_i$ is translated from $x_j$, and $c_i$ the context vector of the $i$-th target word. Figure 1 shows the two Recurrent Neural Networks acting as encoder and decoder. The encoder is a bidirectional RNN consisting of a forward and a backward RNN. The forward RNN reads the source sentence "I will visit you in april" (ordered $x_1, x_2, \ldots, x_6$) and computes the forward hidden states; the alignment weights $\alpha_{3,j}$ are used to obtain the context vector $c_3$, i.e., the target word $y_3$ is translated using the input sequence $x_1, \ldots, x_6$. The decoder's context vector $c_t$ is calculated by combining the encoder hidden states through the attention weights, and the source annotation $h_j$ encodes information about the $j$-th word with respect to all the other surrounding words in the sentence.
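The soft-alignment step itself can be sketched as follows (Python/numpy, toy dimensions; the single-layer tanh scoring network is an assumption about the exact parameterization of the feed-forward scoring function):

    import numpy as np

    def attention(s_prev, annotations, W_a, U_a, v_a):
        # Relevance scores e_ij between the previous decoder state and each annotation h_j.
        scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()                       # normalized alignment weights alpha_ij
        context = sum(a * h for a, h in zip(alpha, annotations))  # c_i = sum_j alpha_ij * h_j
        return alpha, context

    # toy usage with random annotations of dimension 6 (e.g., 2 * d_h from the encoder sketch)
    rng = np.random.default_rng(1)
    annotations = [rng.normal(size=6) for _ in range(6)]
    d_s, d_a = 3, 5
    W_a, U_a, v_a = rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, 6)), rng.normal(size=d_a)
    alpha, c_i = attention(rng.normal(size=d_s), annotations, W_a, U_a, v_a)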

Our Approach
As the hidden representation cannot capture future context information sufficiently and can only encode nearby context, directly combining independent forward and backward RNNs in a Bidirectional RNN does not fully use the contextual information. Therefore, in this section we discuss the interdependent Bidirectional Recurrent Neural Network that we call the CE-Encoder. There are two types of CE-Encoder: the FE-Encoder and the BE-Encoder. To correctly understand the semantic meaning of a word, we use the FE-Encoder if the future context is required; otherwise, we use the BE-Encoder, which learns the historical context information first. We explain the FE-Encoder in detail below: in the first step, we learn the representation of the future context; then, we define the context-based recurrent encoder over the input tokens based on the learned context representation; finally, in the third step, we integrate both steps into NMT.

Representation of Future Context
First, an RNN learns the future context of the source sentence. The GRU can be considered a simplified version of the LSTM; we select a GRU-based recurrent network (Chung et al., 2014) due to its capability to remember the processed tokens. When reading the word "fast" in the sentence "I tried to teach her the meaning of fast and slow", the word "slow" appears later in the sentence. The context representation $h^e_j$ is induced from right to left over the source sentence $x$ at the $j$-th time step, and $h^e_{j+1}$ is considered to store the information of all inputs from position $j+1$ onward. Here, the superscript $e$ denotes the context.
The update gate combines the input $x_j$, weighted by $W_z$, with the future state $h^e_{j+1}$, which holds the information of the future and is multiplied by its own weight $U_z$; the two terms are added and the sigmoid function is applied, $z_j = \sigma(W_z x_j + U_z h^e_{j+1})$. The update gate determines how much information should be passed further. The reset gate $r_j = \sigma(W_r x_j + U_r h^e_{j+1})$ decides how much information to forget. The new memory content uses the reset gate to keep the relevant information while discarding information from the previous time steps, $\tilde{h}^e_j = \tanh(W x_j + U (r_j \odot h^e_{j+1}))$, where $\tilde{h}^e_j$ is the candidate activation for the $j$-th token. The final memory content determines what to collect from the current candidate content and from the previous steps, $h^e_j = (1 - z_j) \odot h^e_{j+1} + z_j \odot \tilde{h}^e_j$.

Context-Based Forward Recurrent Encoder Network
The source sentence representation is obtained by concatenating the forward hidden state of the encoder with the future context, producing the semantic source representation. For this purpose, the Forward Encoder consists of a two-level hierarchy with a top level and a bottom level. Their hidden states are connected to form the final hidden state, which not only captures long-term dependencies but also contains more information about the sentence. The GRU acts as a bridge that fuses these two kinds of information flow, as sketched below.
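The sketch below (reusing gru_step, future_context and the parameter dictionary P from the previous sketch) shows one plausible wiring of the two levels; the exact way the top-level GRU consumes the pre-computed future state is our illustrative assumption rather than a verbatim transcription of the model's equations:

    def fe_encoder(X, P_hist, P_fut, P_top, d_h):
        # Bottom level: left-to-right GRU over the history; top level: injects the future context.
        h_future = future_context(X, P_fut, d_h)       # h^e_j, learned in advance (right to left)
        h = np.zeros(d_h)
        states = []
        for j, x_j in enumerate(X):
            h = gru_step(x_j, h, P_hist)               # bottom-level GRU gathers history
            h = gru_step(h_future[j], h, P_top)        # top-level GRU assembles future information
            states.append(h)                           # semantic source representation for x_j
        return states

    # the top-level GRU takes d_h-dimensional future states as input, so its W matrices are square
    P_top = {k: rng.normal(size=(d_h, d_h)) for k in ("Wz", "Wr", "W", "Uz", "Ur", "U")}
    h_n = fe_encoder([rng.normal(size=d_w) for _ in range(6)], P, P, P_top, d_h)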

Integration of FE-Encoder with Neural Machine Translation
The FE-Encoder encodes the history information from left to right, whereas the future context is encoded from right to left; the history and future context, which are related to each other, are thus encoded in opposite directions. The conventional Bidirectional RNN connects the forward RNN and the backward RNN in the $(j-1)$-th layer and sends the result to the $j$-th layer. In contrast, our method combines the future context and the history after both have been obtained, giving the final hidden state for the $(j+1)$-th layer. Our encoder uses the independent forward and backward information from the $j$-th layer, making them no longer independent of each other at the $(j+1)$-th layer. Therefore, the information of the previous layer is utilized to obtain accurate content at the current layer.

FE-Encoder
It is called the Forward Encoder. It reads the source sentence from left to right and the future context in the opposite direction. The future context calculated previously is represented by Eq. 7 and the summary of the history information produced by the FE-Encoder is represented by Eq. 8. Substituting Eq. 7 to 9 gives the Forward Encoder output $h^n_j$, as illustrated in Fig. 2.

Backward CE-Encoder
Above, we expressed the computations required for the FE-Encoder. The BE-Encoder is explained as follows.

BE-Encoder
It is called the Backward Encoder. Compared with the Forward Encoder, it operates in the opposite direction for both the future context and the FE-Encoder: it first learns the historical context and then combines it with the future information. The historical context is represented in Eq. 11 and the future information, computed with another GRU, in Eq. 12. Combining Eq. 11 and 12 and inserting them into Eq. 13 gives the Backward Encoder, which works in the reverse order of the FE-Encoder, as shown in Fig. 3. The Bidirectional RNN in Fig. 4 generates the forward hidden states $h^e_j$ and the backward hidden states $h^n_j$ using the same gated recurrent unit for both the history and the future contextual information, and then concatenates the two hidden states $[h^e_j; h^n_j]$ to generate the output. In a BiRNN, the output layer receives information from the past and future states simultaneously, i.e., it does not save the left-to-right and right-to-left information independently in advance. If either the forward or the backward information is missing, the translation will not be accurate, and we may not receive all the inputs at the same time: for example, to compute the output $y_3$ at time $t+2$, we need the input $x_1$ at time $t$, $x_2$ at time $t+1$, $x_3$ at time $t+2$ and $x_4$ at time $t+3$. The BiRNN therefore lags behind on sentences with long-term dependencies, whereas our method computes the final hidden state by analysing the forward and backward information independently. The alignment model assigns a score $\alpha_{ij}$ to the input at position $j$ and the output at position $i$.

Pseudocode of our Proposed Approach
1. The CE-Encoder computes the hidden states of the source sentence, combining the history with the pre-computed future context.
2. At each decoder step, an alignment score is computed between the previous decoder state and each encoder hidden state.
3. The encoder hidden states and the alignment scores are multiplied to form a context vector.
4. The context vector is used to compute the final output of the decoder.
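As a compact illustration of these stages, the sketch below performs one decoder step (Python/numpy, reusing the attention helper from the decoder sketch above; the simplified tanh decoder cell and the output projection are placeholders, not the exact dl4mt implementation):

    def decode_step(s_prev, y_prev_emb, annotations, att_params, dec_params, W_out):
        # Steps 2-3: score the encoder states, normalize, and form the context vector c_i.
        alpha, c = attention(s_prev, annotations, *att_params)
        # Step 4: update the decoder state from the previous output, previous state and context,
        # then score the target vocabulary (simplified single-layer cell and softmax).
        s = np.tanh(dec_params["W"] @ y_prev_emb + dec_params["U"] @ s_prev + dec_params["C"] @ c)
        logits = W_out @ s
        probs = np.exp(logits - logits.max())
        return s, probs / probs.sum(), alpha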

Experiments
This section describes the experimental setup and the datasets used for training the English-Hindi translation models with the proposed methodology.

Dataset Details
The initial requirement for setting up an MT system is the availability of a parallel corpus for the source and target languages. The NMT system has been trained on source-target sentence pairs, with English as the source and Hindi as the target language. We considered the following datasets for our experiments: (i) a dataset from the Institute for Language, Cognition and Computation (ILCC), the University of Edinburgh, and (ii) a dataset from the Centre for Indian Language Technology (CFILT), IIT Bombay. The ILCC dataset, named the Indic parallel corpus, contains sentences translated from Wikipedia. The total number of English-Hindi sentence pairs in ILCC is 43,396; we used 35,396 sentences for training, 4,000 for development and 4,000 for testing. The CFILT dataset contains data from multiple disciplines; the training set consists of sentences, phrases and dictionary entries, their applications and details. The total number of English-Hindi sentence pairs is 1,495,854; we used 1,468,827 sentences for training, 12,000 for development and 12,000 for testing.

Dataset Preprocessing
In the extracted corpus described above, we observed noise in the sentences of both languages. To ensure this does not affect the training of the translation model, we used the Moses toolkit for tokenizing and cleaning the English data; the Hindi data is first normalized with the Indic NLP library and then tokenized. All sentences longer than 80 tokens were removed from our training corpus, and sentence pairs containing URLs were excluded. To overcome the rare-word problem in the corpus, we used Byte Pair Encoding (BPE), as proposed by Sennrich et al. (2015). BPE is a data compression technique that replaces the most frequent pair of symbols in a sequence with a single unused symbol; by merging frequent pairs, characters are combined into character sequences. BPE helps with compound splitting and suffix/prefix separation, which creates new words in the target language. For training, we lowercased all of our training data, and we used the Moses truecaser during testing.
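To make the BPE step concrete, here is a toy sketch of the merge procedure on a tiny character-level vocabulary (Python; our experiments used an existing BPE implementation rather than this illustration, and the words, frequencies and merge count below are invented):

    import re
    from collections import Counter

    def pair_stats(vocab):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Replace every occurrence of the chosen pair with a single merged symbol.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # words are pre-split into characters with an end-of-word marker
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):                      # number of merge operations (toy value)
        stats = pair_stats(vocab)
        if not stats:
            break
        vocab = merge_pair(max(stats, key=stats.get), vocab)
    print(vocab)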

Training Details
We implemented our model using the open-source dl4mt system. Our NMT system is trained on a GPU-based system using the Theano computational framework. The word-embedding and hidden-state dimensions for both source and target languages are dw = 512 and dh = 1000, respectively (Bahdanau et al., 2014); the encoder and decoder each have 1000 hidden units. We initialized the non-recurrent parameters randomly according to a normal distribution with zero mean and a standard deviation of 0.01, whereas the recurrent square matrices are initialized using Singular Value Decomposition (SVD). We clipped the gradient norm at 5, used a batch size of 80 and optimized with the AdaDelta algorithm (Zeiler, 2012); each update is computed on a minibatch of 80 sentence pairs. The learning rate is set to 0.0005 and is halved at every epoch. Dropout is applied on the output layer to avoid over-fitting. Both datasets are trained for 40 epochs. During decoding we use a beam-search algorithm with a beam size of 10.
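For reference, the configuration described above can be summarized as follows (the dictionary keys are descriptive names chosen for this sketch, not the exact option names of the dl4mt/Theano toolkit):

    train_config = {
        "dim_word": 512,                  # word-embedding dimension d_w (256 and 620 variants below)
        "dim_hidden": 1000,               # encoder/decoder hidden units d_h
        "init_nonrecurrent": ("normal", 0.0, 0.01),   # zero mean, std 0.01
        "init_recurrent": "svd",          # recurrent square matrices initialized via SVD
        "clip_grad_norm": 5.0,
        "batch_size": 80,
        "optimizer": "adadelta",
        "learning_rate": 0.0005,          # halved at every epoch
        "max_epochs": 40,
        "beam_size": 10,                  # beam search during decoding
        "unk_token": "UNK",
    }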

We also train two further variants of word embeddings, with 256 and 620 units: one with an embedding dimension of 256 and a batch size of 128, and one with an embedding dimension of 620 and a batch size of 160, keeping the remaining parameters the same. The 256-dimension models are trained for 30 epochs and the 620-dimension models for 60 epochs. Words that do not appear in the vocabulary are replaced with the token "UNK".

Experimental Setup
We compared our proposed model with three systems, i.e., Moses, RNN Search and Transformer:

a) Moses: An open-source phrase-based Statistical Machine Translation system consisting of two components, training and decoding. The training process in Moses takes parallel data and infers translations from phrases; the decoder finds the highest-scoring sentence in the target language corresponding to the source sentence (Koehn et al., 2007). For Moses, we used both the ILCC and CFILT datasets with the same splits as above: 35,396 training, 4,000 development and 4,000 test sentences for ILCC, and 1,468,827 training, 12,000 development and 12,000 test sentences for CFILT. We trained a 4-gram language model (Rahimi et al., 2016) on the target data using SRILM (SRI Language Modeling toolkit) with modified Kneser-Ney smoothing (Stolcke, 2002). We used the GIZA++ toolkit for the word alignment of the training corpus with the "grow-diag-final-and" option. We also applied the lexical reordering model of type "wbe-msd-bidirectional-fe-allff", i.e., word-based extraction considering monotone, swap and discontinuous orientations; this model is built for both the forward and backward directions, is conditioned on the source and target languages and treats the scores as individual features. All other parameters were kept at their default settings. A configuration file is generated to translate our test sets and compute the Bilingual Evaluation Understudy (BLEU), NIST and METEOR scores. b) RNN Search/BiRNN: The attention-based encoder-decoder model (Bahdanau et al., 2014), trained on the same dataset splits. c) Transformer: The model of Vaswani et al. (2017) with six layers in the encoder and decoder networks, trained on the same dataset splits. Each encoder block contains a self-attention layer followed by two fully connected feed-forward layers with a Rectified Linear Unit (ReLU) non-linearity between them; each decoder block contains self-attention, followed by encoder-decoder attention and two fully connected feed-forward layers with a ReLU between them. We used a word-embedding and hidden-state dimension of 512, a feed-forward inner-layer dimension of 2048 and dropout = 0.1. We used the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 10^-9, multi-headed attention with eight attention heads, a length penalty α = 0.6 and beam search with a beam size of 4. The trained model is saved to produce the translation results. d) Our Approach: First, we trained our system on the English-Hindi ILCC training dataset and saved the trained models obtained at 30, 40 and 60 epochs for the different word-embedding dimensions. Each trained model has been tested on the ILCC test set and the predicted translations are evaluated with BLEU (Guzmán et al., 2017). We then re-trained our NMT system on the CFILT training corpus and repeated the same training process.
Further, such a setup helps analyze the change in the behavior of the Machine Translation system with an increasing number of new sentences in the corpus.

Results and Analysis
We compared the performance of our proposed model with Moses, RNN Search, Transformer and their different variants as given in the tables below.
Using the ILCC test set of 4,000 sentences, the FE-Encoder achieves a 28.56 BLEU score and the BE-Encoder a 28.48 score, the highest scores being obtained when the model is trained with 512-dimensional word embeddings (Table 2). They score about 10 BLEU points more than Moses, 7 points more than BiRNN and 2 points more than the Transformer. The FE-Encoder NIST and METEOR scores are 5.95 and 0.58, respectively (Tables 3 and 4). The BE-Encoder reaches a NIST score of 6.05, higher than the FE-Encoder, whereas the METEOR scores are almost identical across all dimensions. Compared with Moses, our encoders gain about 2 NIST and 2 METEOR points; compared with BiRNN, about 1.5 NIST and 1.5 METEOR points; and compared with the Transformer, about 1 point on both NIST and METEOR. Therefore, both the FE-Encoder and the BE-Encoder outperform the baselines on all evaluation metrics.
Using the CFILT test set of 12,000 sentences, the FE-Encoder achieves a 43.52 BLEU score and the BE-Encoder a 44.34 score. On this dataset as well, the highest BLEU score is obtained with 512-dimensional word embeddings; medium-sized embedding dimensions are therefore sufficient to train the translation model.
They score about 13 BLEU points more than Moses, 9.5 points more than BiRNN and 2 points more than the Transformer. The FE-Encoder NIST and METEOR scores are 8.35 and 0.75, respectively. The BE-Encoder reaches a NIST score of 8.91, higher than the FE-Encoder, whereas the METEOR scores are almost the same. Compared with Moses, our encoders gain about 2 NIST and 2 METEOR points; compared with BiRNN, about 1 NIST and 2 METEOR points; and compared with the Transformer, about 1 point on both NIST and METEOR. This shows that our CE-Encoder is better than the other models. The difference between the scores of the FE-Encoder and the BE-Encoder is minimal, because the Forward Encoder reads the sentence from left to right and the Backward Encoder from right to left; they differ only in reading direction.
In a BiRNN, we do not receive all the inputs at once with respect to time; sometimes, for the forward-pass input, the backward-pass information is not yet available, which lowers the accuracy of translating source sentences. Another reason BiRNN lags behind is that it does not encode the entire input sentence into a fixed-length vector. Therefore, its BLEU, NIST and METEOR scores are comparably lower than those of our model.
Figure 5 shows how the BLEU score of our proposed model changes as the number of epochs increases. The BLEU score is highest for the medium-sized trained model, which therefore achieves the best accuracy for the translation model. The learning of the training network improves with the number of epochs, and the BLEU score on the training data stabilizes after 36 epochs, i.e., training converges.
The best-trained model was obtained at epoch 40 and the scores did not improve much after 40 epochs. We then group sentences of similar length to evaluate the translation model.
We categorize our test set into six groups according to source-sentence length: [0-10), [10-20), [20-30), [30-40), [40-50), with all sentences longer than 50 kept in a separate set. NMT models generally perform better on shorter sentences than on long sentences. The results in Fig. 6 confirm this finding, as our CE-Encoder performs better on short sentences than on long ones. The CE-Encoder also outperforms BiRNN, Moses and the Transformer for all sentence lengths. Moses performs comparatively well on the longest sentences because it is based on a phrase-based translation process, but the average sentence length is the same for all models, so this does not affect the improvement of the CE-Encoder. The BiRNN passes a stack of layers in both the forward and backward directions simultaneously, and this stacking makes it slower than our model. The bucketing used for this analysis is sketched below.
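A small sketch of the length bucketing (Python; the example triple is a hypothetical (source, system output, reference) entry, not taken from our test sets):

    from collections import defaultdict

    def length_bucket(n_tokens):
        # Assign a source sentence to one of the length groups used in Fig. 6.
        for upper in (10, 20, 30, 40, 50):
            if n_tokens < upper:
                return f"[{upper - 10},{upper})"
        return ">=50"

    test_triples = [
        ("i will visit you in april", "main april mein aapse milne aaunga", "main april mein tumse milne aaunga"),
    ]
    buckets = defaultdict(list)
    for src, hyp, ref in test_triples:
        buckets[length_bucket(len(src.split()))].append((hyp, ref))
    # BLEU is then computed on each bucket separately, with the same scorer as for the full test set.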
Table 5 reports the BLEU scores achieved by RNN Search, the Transformer and our model with respect to training time, using 512-dimensional word embeddings; Table 6 gives the BLEU scores on the CFILT dataset. As we can see from both tables, BiRNN needs more training time to obtain better results, while the Transformer requires approximately 70 more hours to complete training. Within these training times, the Transformer does not reach the BLEU scores of 26.09 and 41.77 mentioned in Table 2 for the two datasets.
In Table 7, we compare our model with approaches previously applied to the English-Hindi language pair using BLEU scores. We divide the 1,492,827 CFILT sentences into training, validation and test sets: the training set contains 130,000 sentences, and the validation and test sets contain 35,000 sentences each. The sentences are trained for only 11 epochs, after which the BLEU score is calculated on the test set. In this comparison, our BE-Encoder achieves a 21.63 BLEU score, and the FE-Encoder outperforms all the other models with a 21.84 BLEU score.
Tables 8, 9 and 10 show translations of three types of sentences (short, medium and long), where MosesT, RNN SearchT, TransformerT and CE-EncoderT are the transliterated outputs. Moses conveys the meaning of the sentences incorrectly and sometimes drops words. RNN Search is not as fluent as our CE-Encoder. The Transformer comes close to translating the short and medium sentences correctly but can only deal with fixed-size text strings. In the medium translated sentence of Table 9, Moses drops the words "किया जाता है" after the word "किभाजन", and in RNN Search the position of the phrase "खिलाड़ी में" changes the meaning of the source sentence. In the long translated sentence of Table 10, the Moses model translates "seeking to break" as "में टू टा ऑटो ब़ीमा" and RNN Search as "टू टना चाह रहा था"; they therefore fail to recognize the proper meaning of the source sentence. These systems try to convey a similar meaning, and the sentence structure is somewhat adequate but not fluent.
For a deeper analysis of the other systems, the attention-weight alignments of RNN Search, the Transformer and the CE-Encoder for short, medium and long sentences are shown in Fig. 7, 8 and 9, where the English sentence on the vertical axis is aligned with the Hindi translation on the horizontal axis. For the short sentence in Fig. 7, RNN Search aligns all the words correctly except "He" and "their", whereas the Transformer aligns the complete sentence correctly. As the sentence length increases, the performance of RNN Search and the Transformer deteriorates. For the medium-sized sentence in Fig. 8(a), RNN Search has no alignment for the word "their" and only a weak alignment for the word "are", which modifies the meaning of the input sentence, whereas in Fig. 8(b) the Transformer has no alignment for "are categorized"; thus, RNN Search and the Transformer do not perform the translation thoroughly. For the long sentence in Fig. 9(a), RNN Search has no strong alignment for the words "seeking to break" and the Transformer in Fig. 9(b) does not align "would pay" correctly. The CE-Encoder aligns "seeking to break" to "प्रिे श िरने िा प्रयास" and "would pay" to "भु गतान िरें गे". We observed that our CE-Encoder solves these problems by preserving all the words, maintaining adequacy and fluency better than the Moses, RNN Search and Transformer systems.

Table 7. Comparison with previously applied approaches on the CFILT split (BLEU):
    (Ma et al., 2019): 17.64
    Phrase and Neural Unsupervised MT (Lample et al., 2018): 18.10
    Deep Recurrent NMT (Zhou et al., 2016): 18.79
    NMT in Linear Time (Kalchbrenner et al., 2016): 19.40
    Evolved Transformer (So et al., 2019): 20.55
    FE-Encoder: 21.84
    BE-Encoder: 21.63

Conclusion
The above findings draw a picture of context-based machine translation. Our approach can be tuned more easily than the existing methods. Using the BLEU, NIST and METEOR evaluation metrics, we evaluated the performance of the proposed CE-Encoder, a context-based recurrent encoder for translation, and compared the results across these metrics. On translating source English sentences into target Hindi sentences, the proposed model proved more accurate than the other strong baseline models. In particular, the comparison of BLEU scores with the conventional translation systems shows that our model achieves 44.34 on the CFILT dataset and 28.56 on the ILCC dataset, higher than the scores obtained by the other translation approaches. The results demonstrate that our model outperforms the Moses and BiRNN models, gaining 13 and 10 BLEU points on CFILT and 10 and 7 BLEU points on ILCC, respectively. In the future, we would like to apply our approach to other language pairs and to other datasets with longer sentences. We will also consider the BERT model and compare its performance with our CE-Encoder.