A Systematic Literature Review on Extraction of Parallel Corpora from Comparable Corpora

Abstract: In today's globalized world, the demand for translation is high and growing rapidly across many fields, but it is difficult to translate everything manually. Machine Translation (MT), which depends on the availability of corpora, is a means of meeting this demand. Most translation knowledge is acquired from parallel corpora, so their quantity and quality are critical. Because parallel corpora are not readily available for many language pairs, widely available comparable corpora can be used to extract parallel data. A systematic literature survey is performed on 188 research articles published in premier journals, conferences, workshops and book chapters. The review process is guided by a set of research questions. Different MT systems are identified along with their features. Several datasets and techniques for bilingual lexicon extraction and for parallel sentence and fragment extraction are presented. A proposed architecture and a mind map are also included to clarify how parallel data is extracted from comparable corpora. This study will increase readers' understanding of parallel data mining through bilingual lexicons, parallel sentences and fragments.


Introduction
In today's era of globalization, a large amount of data is accessible on the internet in diverse languages and domains. Because of the diversity of the world's languages, it is impossible to learn every language, and the people accessing the internet come from different language backgrounds. To overcome this problem, the content available on the internet needs to be translated. It is, however, difficult to perform this translation manually, so machines are required to do it. This requirement gave rise to Machine Translation (MT). MT is a means of meeting the high demand for translation and is one of the applications that fall under the umbrella of Natural Language Processing (NLP). It is a powerful tool for increasing the efficiency and decreasing the cost of translation. Translation requires some kind of dataset or corpus to train the machine; thus, translation depends heavily on the availability of corpora. A corpus is a very large collection of text used to analyze how words, phrases and language are used. It is used by linguists, social scientists, natural language processing experts, etc. Corpora can be divided into two main classes, namely Parallel Corpora (PC) and Comparable Corpora (CC). A parallel corpus comprises corpora in two different languages where one is the translation of the other; it consists of sentence-aligned bilingual texts. A comparable corpus, in contrast, is a set of two or more corpora in different languages that are not exact translations of each other and hence are not aligned. There are two fundamental approaches to MT, namely rule-based and statistical (Babych et al., 2012). In the early days of MT research, the rule-based approach played a key role. In this approach, all the conversion rules are written manually and then encoded into the MT framework.
However, because languages are vast and complex, it is practically impossible to write all the rules manually in a relatively short period. To address this issue, the emphasis shifted to statistical analysis. Over the last decade or two, MT research has moved towards Statistical Machine Translation (SMT) (Koehn, 2009;Och and Ney, 2003;Brown et al., 1993), which has become a key method in both research and industry. In SMT, translation knowledge is acquired automatically from parallel corpora, i.e., sentence-aligned bilingual texts. As a result, MT frameworks for various language pairs have grown rapidly, and most machine translation research is now conducted with this approach. Because SMT relies heavily on PC, the quality and quantity of PC are critical (Ali et al., 2010;Srivastava and Bhat, 2013;Post et al., 2012). Nonetheless, apart from a few language pairs and some specialized domains, high-quality PC of adequate size remain a scarce resource, and this scarcity has become SMT's primary challenge. Using Comparable Corpora (CC) is a compelling way to address the shortage of PC for SMT. There are three main reasons for using CC. First, they are far more accessible than PC across different fields; examples include Wikipedia, bilingual articles, bilingual websites, patent documents, e-newspapers, social media and academic papers. Second, monolingual corpora are easy to obtain, and when using comparable corpora, work is performed on monolingual corpora only. Third, a lot of parallel data, such as bilingual lexicons, parallel sentences and fragments, can be obtained from comparable data.

Motivation
The motivation for this review is that a detailed study is needed of the mining of Parallel Corpora (PC), in the form of lexicons, sentences and fragments, from Comparable Corpora (CC). A search of the relevant literature did not reveal a clear review covering PC and CC. No consolidated work that covers the extraction of lexicons, fragments and sentences for PC was available in a systematic format. Also, no relevant review focused on statistical machine translation for different languages and domains. There was a need to study this area and give an authoritative overview.
The goal of this study is to review and contribute towards:
• A feasibility study that focuses on the strengths and weaknesses of the research in the concerned domain.
• A systematic review in the area of parallel data mining from comparable data; this study therefore explores the existing research on different data extraction techniques.
• A combined treatment of the extraction of lexicons, sentences and fragments from comparable data.
• Insight into the number of languages and domains that use machine translation.

Background of Related Work
Different authors have studied PC and CC and their usage for Indian as well as European languages. Reviews in this field have been given by Maskara and Bhattacharyya (2018), Khosla and Acharya (2018), Iyer (2015), Kulkarni (2013), Lehal et al. (2018), Saini and Sahula (2015) and Padhya and Sheth (2019). Maskara and Bhattacharyya (2018) focused on recent developments in parallel sentence mining from CC using techniques such as word embeddings, deep learning and machine translation systems. Different classifiers were used by different authors, but most of the work was done with maximum entropy-based and SVM classifiers (Chang and Lin, 2011;Zhu et al., 2011;Bouamor and Sajjad, 2018). Some research projects have made use of the Solr (Zhang and Zweigenbaum, 2017) and Lucene (Azpeitia et al., 2017) search engines, which are information retrieval-based frameworks. Khosla and Acharya (2018) described in their literature survey the existing methods by which a parallel corpus can be built. The survey focused on corpora aligned at the document level and the sentence level, and discussed three approaches to creating a parallel corpus, i.e., the sentence alignment approach, the web mining approach and the manual approach. Iyer (2015) performed a literature survey on comparable corpora, reviewing existing methods for extracting PC from CC at the sentence, phrase and word levels. The survey report was organized into chapters. The introductory part covered approaches to machine translation. It was followed by the different techniques used for mining parallel sentences from comparable data. The report also gave a glimpse of different approaches for extracting phrases from comparable data. Lastly, the report focused on the extraction of bilingual lexicons, an application of CC. Lehal et al. (2018) offered a review of different processes and approaches for bilingual lexicon extraction.
It included Correlation-based extraction, Vector Depiction, Projection-based approach, Classifier-based mining, PC-based approach and Linguistic Knowledge-based extraction for mining bilingual lexicons from Comparable Corpora (CC). The review also mentioned the limitations of the different extraction techniques, such as level of complexity, parameters, corpus size and accuracy. It also suggested combining two or three methods to improve efficiency and overcome these limitations. Kulkarni (2013) surveyed the literature by exploring different approaches for parallel sentence mining, parallel phrase extraction and bilingual lexicon extraction from CC. Padhya and Sheth (2019) reviewed the literature on numerous Machine Translation (MT) systems for Indian languages. The survey found that Statistical Machine Translation and Example-Based MT are the best approaches when working with a large corpus, that Rule-Based Machine Translation is useful when there is no corpus, and that Direct Machine Translation is best suited to languages with the same word order. The survey concluded that there is scope for future improvement for all Indian languages. Similarly, Saini and Sahula (2015) described the current state of machine translation research in India and the differences among the methodologies used.
Literature surveys, according to the aforementioned researchers, include the latest updates on previously completed work in the concerned domain. They are also a simple way to look through the literature on a particular subject. The systematic review in this study follows the approach of Singh and Kaur (2018). According to their study, a systematic literature survey traces the available and relevant literature by framing research questions. Literature is collected by applying inclusion and exclusion criteria, keeping the main topic in mind. Before getting into the specifics of the work, it is important to define "Machine Translation", "Statistical Machine Translation", Comparable Corpora (CC), Parallel Corpora (PC) and bilingual lexicons. The study then reviews the steps to extract parallel data from comparable data.

Corpus
A corpus is a very large collection of text used to analyze how words, phrases and languages are used; its plural is corpora. It is used by linguists, social scientists, specialists in natural language analysis, etc. Corpora are used in different domains and are of key importance. They are used in discourse analysis, literary studies, translation work, forensic linguistics, pragmatics, political discourse and social discourse (O'Keeffe and McCarthy, 2010). There are different kinds of corpora that are used for various purposes. Tognini Bonelli and Sinclair (2006) presented a typology of corpora, i.e., sample corpora, CC, special corpora, corpora with a time dimension, bilingual and multilingual corpora, PC, spoken corpora, non-native speaker corpora and normative corpora. This literature survey focuses only on comparable and parallel corpora, which are used for translation work.

Comparable Corpora
CC are groups of texts that are closely related to one another but differ in some aspects (Kenning, 2010). The texts in CC are linked together based on criteria such as a set of topics, text size, time period of the text, etc. CC are sets of two or more corpora in different languages that are not exact translations of each other and are not aligned. Some sources of comparable corpora are Wikipedia (Adafre and Rijke, 2006) and bilingual articles from newspapers and the web (Tillmann, 2009;Zhao and Vogel, 2002).

Parallel Corpora
Parallel Corpora (PC) are collections of texts in two or more languages that are precise translations of each other. The relationship among the texts in the two or more languages lies in their shared meaning (Kenning, 2010). Parallel corpora are scarce and not easily available. Parallel texts such as lexicons, fragments and sentences therefore need to be mined from comparable corpora. The section "Parallel Sentences and Fragments Extraction" gives more detail on the extraction of fragments and sentences. Examples of existing PC are the English-Norwegian Parallel Corpus and the WHO bilingual articles, which are in English and Spanish.

Machine Translation
"Machine Translation" is a system used to produce translations from one natural language into another, with or without human intervention (Hutchins and Somers, 1992). Nirenburg and Wilks (2000) gave an overview of Machine Translation along with its crucial issues and highlights of its latest applications. Their survey explained the different areas in which MT is used, such as linguistics, neuroscience, artificial intelligence, software design and philosophy. MT gives software engineers an opportunity to experiment with and construct complex non-numerical systems. It is also used by computational linguists for encoding the syntax and semantics of different languages into a computer-understandable form. Machine translation is a subfield of computational linguistics. Isabelle and Foster (2006) also gave an overview of MT. They define MT as a process that translates between two human languages, the source and the target language, and describe it as the study of the ways and methods that make a machine produce translations. They noted that translation always requires an understanding of the source language, the grammar of the target language and the relevant knowledge needed to bridge the informational gap between the source and target languages. Their study gave insight into segmenting texts into words, word forms, word co-occurrence, handling unknown words, finding idioms in sentences, resolving ambiguity in the source language, word insertion and deletion, word order, etc. Machine translation has two levels: Metaphrase and Paraphrase (Tripathi and Sarkhel, 2010). Metaphrase refers to word-to-word translation; the translated text may not convey the same meaning, and its semantics can differ from the original text. Paraphrase is not a word-to-word translation but provides the user with the same meaning as the original text.
Two models play a role in MT: Rule-Based MT and Statistical Machine Translation (SMT) (Babych et al., 2012). Rule-Based MT relies on a set of rules created manually using linguistic information, whereas SMT (Koehn, 2009;Och and Ney, 2003;Brown et al., 1993) is a machine translation method in which translations are produced on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. MT systems can be classified in several ways based on the specific approach by which the translation is carried out (Chand, 2016). Figure 1 depicts the classification of MT approaches. There are many methods under MT, such as Rule-based, Corpus-based, Hybrid and Neural-based.

Statistical Machine Translation
"Statistical Machine Translation" (Koehn, 2009;Och and Ney, 2003;Brown et al., 1993) is a method of building a system that automatically learns translation rules from a collection of translated texts by relating the input and output of the translation process and deriving the results from statistics over the data (Koehn, 2009). Brown et al. (1993) presented a mathematical formulation of how SMT works, describing the model needed to convert a sentence in a foreign language into a sentence in English. SMT (Koehn, 2009) has emerged as a key method in both the academic community and the commercial sector over the last decade or so, with machine translation research taking a turn towards it.
In SMT (Koehn, 2009;Och and Ney, 2003;Brown et al., 1993), Parallel Corpora (PC) are used to automatically acquire translation knowledge (Veronis, 2000), which enables the rapid creation of MT frameworks for various language pairs and domains. The scale and quality of PC have a tremendous effect on the quality of translation in an SMT system, but very few PC resources are available for many languages and domains. SMT has emerged as the main tool for translation work over the past two decades and has proved fruitful for both the research and commercial communities (Koehn, 2009). Babhulgaonkar and Bharad (2017) suggested that translation problems can be reduced by restricting the system to certain domains and languages.
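The mathematical formulation attributed to Brown et al. (1993) above is commonly written as a noisy-channel model. The sketch below uses the conventional symbols, which are not taken from this review: f is the foreign (source) sentence, e is a candidate English (target) sentence, P(e) is a language model and P(f | e) is a translation model whose parameters are estimated from parallel corpora.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

The denominator P(f) is dropped because it does not depend on e. Since P(f | e) must be estimated from sentence-aligned data, this form makes explicit why the size and quality of PC directly limit SMT quality.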

Research Methods
A systematic approach to reviewing the literature is chosen. It is a process for identifying, evaluating and understanding all the available research in a particular domain. The literature review technique of Singh and Kaur (2018) was followed to direct the systematic approach in this paper. A systematic approach is chosen to give better insight into the subject. Moreover, to the best of our knowledge, no such systematic review existed in this particular field when the literature was collected. A few surveys, already mentioned in "Background of Related Work", do exist, but none of them was conducted systematically.

Procedure of the Review
The review protocol consists of framing the research questions, collecting data, analyzing the data, applying inclusion and exclusion criteria, reviewing and assessing the research results and concluding with a discussion. Both electronic and manual sources, including journals, conference proceedings and researchers' theses, are searched for the literature review. A literature review is needed because it collects all the evidence related to the research questions on the specified topic. A proper research procedure must be followed, as it gives more clarity on the topic and makes the research work more organized.

Research Questions
Research questions are the key component in designing a systematic survey. They are stated to keep the study focused on a specific goal and to steer the survey in a particular direction. The main aim of the research questions in this survey is to reveal the different MT approaches, the datasets used in the extraction process, the techniques followed for bilingual lexicon extraction and the various methods for parallel sentence extraction. Table 3 lists the research questions used to perform the systematic literature review in the current study.

Sources of Information
To collect the relevant studies, the current work identifies and evaluates a pool of articles. A literature survey requires extensive searching. Before the review begins, suitable databases are chosen and then searched using keywords. The study checked the following databases of academic resources and publishers on a general and iterative basis:
a) "ACM Digital Library" (http://dl.acm.org)
b) "ScienceDirect" (https://www.sciencedirect.com)
c) "IEEE Xplore" (https://ieeexplore.ieee.org/)
d) "ACL" (https://www.aclweb.org/)
e) "Springer" (https://www.springer.com/)
For the review, the study also included international journals, review articles, researchers' theses, book chapters and conference proceedings, which are grouped under "Other" academic resources. This category contains research works indexed in "Google Scholar" and "Citeseer". Some papers presented at "MT Summit" are also included. Papers published in journals and outlets such as IJET, IGI Global, IJCA, Tandfonline, CFILT, etc. are also placed under "Other". A total of 188 papers have been added from the various databases. The pie chart in Fig. 2 presents the percentage of papers included in this survey from each of these data sources.

Vital Keywords
Keywords play a vital role in the process of a systematic review. During the research process, a set of keywords was defined. These keywords were used to search the databases for relevant papers. Every database mentioned in "Sources of Information" was searched for the given keywords. After papers were obtained based on the keywords, their titles were read, and papers were included or excluded according to the title. If the title seemed satisfactory, the abstract was then read. Keywords made the search easy and relevant to the field. Figure 3 depicts the percentage of papers included in the survey for each keyword. The following keywords were used for each data source:

Inclusion and Exclusion Criteria
A great deal of literature is available for the above keywords. While exploring different databases, the same papers were found in multiple repositories. Therefore, to keep the search manageable, the review established certain conditions for inclusion and exclusion in the selection of articles.

Table 3: Research questions
RQ1 What is the current status of data extraction in respect of parallel and comparable data?
RQ2 What are the various datasets used in the process of extraction?
RQ3 What type of machine translation approaches are used for translation in different language domains?
RQ4 What kind of parallel data can be mined by using CC ("comparable corpora")?
RQ5 What are the different kinds of parallel sentence and fragment extraction techniques followed?
RQ6 What are the different ways to extract Bilingual Lexicons from comparable data?

Inclusion Criteria: The study follows these criteria to include articles in the survey:
• Only articles relating to computer science and engineering were included, because the term "corpora" is multi-disciplinary and is found in different branches.
• Papers written in English were included.
• Conference papers were included.
• Book chapters were included.
• Papers indexed in Google Scholar were included when relevant to the keywords.
Exclusion Criteria: The following criteria are followed to exclude unwanted articles from the review:
• All articles on other subjects, such as medicine, animal sciences, biomechanics, etc., were excluded.
• Informal studies, such as those from unknown conferences or journals, were discarded.
• Papers irrelevant to the research questions were excluded.
• Wikipedia articles were excluded.
• Predatory journals were left out.
• Information or articles available in blogs were not included.
The inclusion and exclusion process was divided into the following 4 levels.

Parallel Data Extraction: The Proposed Architecture
The extraction of parallel data involves a number of tasks, which are elaborated in Fig. 5. First, data resources are required for performing the extraction. As mentioned earlier in the survey, however, parallel data is not easily available in the desired languages. To overcome this scarcity, CC, which are available in huge amounts but in raw form, can be used for the extraction of parallel data. Three types of parallel data can be obtained from CC, i.e., "parallel sentences", "parallel fragments" and "bilingual lexicons".
Parallel data extraction consists of the following steps:
1. Potential resources, such as comparable corpora in the desired language pair, extraction of bilingual lexicons and a seed parallel dictionary.
2. Document alignment model, to get similar document pairs.
3. Parallel sentence and fragment extraction, to get parallel sentences and fragments from the aligned documents.
4. Improving SMT accuracy.
The different techniques and methods used by numerous researchers for performing these steps are elaborated below.
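The document alignment and sentence extraction steps listed above can be illustrated with a minimal sketch. It is not the architecture proposed in this review: the toy seed dictionary, the Jaccard-overlap score, the 0.3 threshold and all function names are assumptions chosen only for illustration. Documents are represented as lists of tokenized sentences.

```python
from typing import Dict, List, Set, Tuple

Sentence = List[str]
Document = List[Sentence]

# Hypothetical toy seed dictionary: source word -> set of target translations.
SEED: Dict[str, Set[str]] = {
    "house": {"maison"},
    "red": {"rouge"},
    "cat": {"chat"},
    "black": {"noir"},
}


def translate_bag(tokens: Sentence, seed: Dict[str, Set[str]]) -> Set[str]:
    """Project source tokens into the target vocabulary via the seed dictionary."""
    projected: Set[str] = set()
    for token in tokens:
        projected |= seed.get(token, set())
    return projected


def overlap(src: Sentence, tgt: Sentence, seed: Dict[str, Set[str]]) -> float:
    """Jaccard overlap between the projected source words and the target words."""
    projected = translate_bag(src, seed)
    target = set(tgt)
    union = projected | target
    return len(projected & target) / len(union) if union else 0.0


def flatten(doc: Document) -> Sentence:
    """Concatenate all sentences of a document into one token list."""
    return [token for sentence in doc for token in sentence]


def align_documents(src_docs: List[Document], tgt_docs: List[Document],
                    seed: Dict[str, Set[str]]) -> List[Tuple[int, int]]:
    """Step 2: pair each source document with its best-scoring target document."""
    pairs = []
    for i, src in enumerate(src_docs):
        scores = [overlap(flatten(src), flatten(tgt), seed) for tgt in tgt_docs]
        best = max(range(len(tgt_docs)), key=scores.__getitem__)
        pairs.append((i, best))
    return pairs


def extract_sentences(src_doc: Document, tgt_doc: Document,
                      seed: Dict[str, Set[str]],
                      threshold: float = 0.3) -> List[Tuple[Sentence, Sentence]]:
    """Step 3: keep sentence pairs whose overlap reaches the threshold."""
    return [(s, t) for s in src_doc for t in tgt_doc
            if overlap(s, t, seed) >= threshold]


# Toy comparable corpus (step 1): two English and two French documents.
src_docs = [
    [["the", "red", "house"], ["it", "is", "old"]],
    [["a", "black", "cat"]],
]
tgt_docs = [
    [["un", "chat", "noir"]],
    [["la", "maison", "rouge"], ["bonjour"]],
]

doc_pairs = align_documents(src_docs, tgt_docs, SEED)
sentence_pairs = [pair for i, j in doc_pairs
                  for pair in extract_sentences(src_docs[i], tgt_docs[j], SEED)]
```

In this sketch the extracted sentence pairs would then be added to the SMT training data (step 4). The systems surveyed below typically replace the overlap score with trained classifiers or information retrieval engines.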

 Comparable Corpora
Comparable Corpora are composed of textual data in two languages that are rough translations of each other. The documents in CC are not properly aligned. Researchers have used various kinds of datasets as sources of comparable data, such as bilingual newspapers (Zhao and Vogel, 2002;Munteanu and Marcu, 2005;Tillmann, 2009;Do et al., 2010), bilingual articles (Munteanu et al., 2004;Utiyama and Isahara, 2003;Abdul-Rauf and Schwenk, 2011;Abdul-Rauf et al., 2017), the Web (Jiang et al., 2009;Hong et al., 2010), Wikipedia (Stefanescu and Ion, 2013;Chu et al., 2012;Chang et al., 2008;Adafre and Rijke, 2006;Smith et al., 2010;Mohammadi and Ghasem Aghaee, 2010;Archana et al., 2015;Chu et al., 2014b) and social media (Ling et al., 2013). Non-parallel and non-aligned bilingual documents make up a quasi-comparable corpus (Fung and Cheung, 2004b;Quirk et al., 2007). The TDT3 Corpus, which is a transcription of radio and TV reports in bilingual sentences and paraphrases, is an example of a quasi-comparable corpus. Data can also be collected from the "Internet Archive" (Resnik and Smith, 2003), a non-profit organization that archives the entire Web and makes the material freely accessible via the Wayback Machine web interface. Hindi and Punjabi data for the development of lexicons can also be taken from two conventional dictionaries available from Bhasha Vibhag and the National Book Trust, but this data has to be converted into digital format manually (Goyal and Lehal, 2010). Data can also be obtained from sources where bilingual texts are available, such as Vikaspedia.in, e-books, film captions, freely available online encyclopedias, the Quran, the Bhagavad Gita and the Bible (Premjith et al., 2019). Jindal et al. (2018) collected raw data from different sources for creating a PC. English and Punjabi textual data were collected from online as well as offline resources.
The raw data was collected from Gyan Nidhi, EMILLE, the Bible, the Guru Granth Sahib corpus available in electronic form, PSEB e-books, bilingual newspapers and tourism and health-related corpora from the web. Figure 6 depicts the percentage of the different datasets used by the researchers mentioned in this survey. The figure shows that news datasets and Wikipedia are the most common sources when creating comparable data.

 Parallel Seed Dictionary
The seed dictionary is a kind of glossary that contains source words and their target translations. It is very important for training the machine. An external resource such as a seed dictionary is always required along with the CC for sentence and fragment extraction. Lakshmi et al. (2020) reported that a dictionary is always a good alternative to CC and that the two can also work hand in hand. Table 4 lists some of the seed dictionaries used by researchers for various language pairs. The seed dictionary can be created manually (Utiyama and Isahara, 2003;Fung and Cheung, 2004;Adafre and Rijke, 2006;Lu et al., 2010;Jindal et al., 2018a;Deep et al., 2018), or an available seed parallel corpus (Zhao and Vogel, 2002;Kumar and Goyal, 2010;Munteanu and Marcu, 2006;Ling et al., 2013;Smith et al., 2010;Tillmann, 2009;Lakshmi and Shambhavi, 2020;Gahbiche-Braham et al., 2011;Stefanescu and Ion, 2013;Stefanescu et al., 2012;Abdul-Rauf and Schwenk, 2011) can be utilized. Lu et al. (2010) provided a large parallel corpus derived from an Internet-sourced corpus of comparable English-Chinese patents. First, parallel sentence pairs were formed using Champollion, a publicly available sentence aligner, and then the candidates were filtered using MS Aligner, another publicly available sentence aligner. Around 7 million high-quality parallel sentences were chosen as the final parallel corpus from a pool of over 22 million bilingual sentence pair candidates. This is one of the largest corpora of parallel sentences in the patent domain. Later, Zhu et al. (2011, 2012) also designed a system that mined PC from web pages automatically. The system identified a decent number of parallel texts for minority language pairs like Chinese-Mongolian based on heuristic information extracted from web content. Similar work was done by Tan and Zhou (2010) for the English-Chinese language pair, which also produced a web-based parallel corpus.
Kumar and Goyal (2010a) created a Hindi-Punjabi parallel corpus of 50,000 sentences using a freely accessible Hindi-Punjabi machine translation system. The corpus is sentence-aligned and is available in .xml and .doc formats. A few errors in categories such as out-of-vocabulary words, grammar, inflection generation and transliteration were found when the parallel corpus was created. The existing Hindi-Punjabi Machine Translation System was used to analyze the errors, and the terms discovered during the study were added to the existing machine translation dictionary. Jindal et al. (2018a) focused on creating a large English-Punjabi corpus, since a parallel corpus is important for training statistical machine translation. Creating the corpus was very challenging because raw data was not easily available for the required regional language pair. The raw text was obtained from different resources such as Gyan Nidhi, EMILLE, the Bible, the electronic version of the Guru Granth Sahib, PSEB e-books, bilingual newspapers and tourism and health websites. Jindal et al. (2018b) also worked on English-to-Punjabi machine translation using the free translation software Moses (Koehn et al., 2007). In their research, they created a corpus of 20,000 sentences from different domains. The sentences were aligned using the GIZA++ alignment tool, and accuracy was checked using BLEU scripts. Lakshmi and Shambhavi (2020) noted that PC are one of the most promising resources for extracting dictionaries and found that Comparable Corpora (CC) could be an alternative resource for dictionary extraction. Their proposed solution was to extract a dictionary for the low-resource language pair English-Kannada using CC collected from Wikipedia dumps and a corpus collected from the Indian Language Corpus Initiative (ILCI).
The constructed dictionary comprises both translation and transliteration entities with term-level associations from English to Kannada. The resulting dictionary contains 77,545 tokens, with a precision score of 0.79.
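As a rough illustration of how a small seed parallel corpus can yield a seed dictionary, the sketch below scores word pairs with the Dice coefficient over sentence-level co-occurrence. This is a generic textbook technique, not the method of any study cited above; the function name `dice_dictionary`, the 0.6 cut-off and the toy English-French sentences are assumptions made for the example.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]


def dice_dictionary(pairs: List[SentencePair],
                    min_score: float = 0.6) -> Dict[str, Tuple[str, float]]:
    """Map each source word to its best target word by Dice coefficient.

    Dice(s, t) = 2 * c(s, t) / (c(s) + c(t)), where c(.) counts the aligned
    sentence pairs in which a word (or a word pair) occurs.
    """
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    joint: Counter = Counter()
    for src, tgt in pairs:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint.update(product(src_set, tgt_set))

    best: Dict[str, Tuple[str, float]] = {}
    for (s, t), c in joint.items():
        score = 2 * c / (src_count[s] + tgt_count[t])
        # Keep only confident pairs, and only the top candidate per source word.
        if score >= min_score and (s not in best or score > best[s][1]):
            best[s] = (t, score)
    return best


# Hypothetical sentence-aligned seed corpus.
seed_corpus: List[SentencePair] = [
    (["red", "house"], ["maison", "rouge"]),
    (["red", "car"], ["voiture", "rouge"]),
    (["old", "house"], ["vieille", "maison"]),
]

seed_dictionary = dice_dictionary(seed_corpus)
for source_word, (target_word, score) in sorted(seed_dictionary.items()):
    print(f"{source_word} -> {target_word} ({score:.2f})")
```

Tools such as GIZA++, mentioned above, learn word alignments with probabilistic models rather than a co-occurrence score; the sketch only conveys the idea that aligned sentences constrain which words can translate each other.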

Bilingual Lexicon Extraction
The oldest use of CC is the extraction of bilingual lexicons. Artificial Intelligence and CLIR (Cross-Lingual Information Retrieval) (Widdows et al., 2002) both depend heavily on bilingual lexicons. A bilingual lexicon consists of words that are near-synonyms of one another across the two languages (Haghighi et al., 2008). The bilingual lexicon is either hand-crafted or automatically produced from a Parallel Corpus (PC). Different extraction systems are described in Table 5. Since early work (Rapp, 1995), Bilingual Lexicon Extraction (BLE) has exploited Comparable Corpora (CC) for SMT. BLE's main aim is to create bilingual dictionaries or seed lexicons, which are critical for both SMT and CLIR (Pirkola et al., 2001;Jagarlamudi and Kumaran, 2007;Chinnakotla et al., 2007). Their manual creation requires a high level of proficiency in both languages involved and can be time-consuming. Vectors, projections, classifiers, correlations, linguistic information and other techniques may be used to derive bilingual lexicons from CC. Goyal and Lehal (2010) worked on a direct translation approach. The data was collected for Hindi and Punjabi, two languages that are closely related in terms of grammar and vocabulary. The data was available in hard copy; it was digitized and shaped as required for machine translation. A lexicon of 100,000 words was manually created for word-to-word translation. The problem of ambiguity was resolved using a tri-gram approach. Using BiLDA topic models, an advanced version of the LDA model (Blei, 2003), Liu et al. (2013) developed a method for converting CC into a parallel aligned corpus and then, with the aid of word alignment, identifying word translations. Large-scale experiments in this study demonstrated that the proposed model outperformed a range of baselines under both automated measures and manual assessments.
The research also demonstrated that their subject-dependent translation systems are capable of capturing some of the important polysemous concepts in dictionary construction. Bouamor et al. (2013) later proposed a different approach for constructing a domain-specific bilingual lexicon based on Wikipedia. This large multilingual encyclopedia paved the way for the development of lexicons for a vast range of language pairs. Gaussier and Li (2010) proposed a comparability metric and then created a model for improving a CC by eliminating a subpart and completing the remaining subpart with external tools. They showed how to improve bilingual lexicon extraction using information gathered during the building process.
Fung and Yee (1998) described an algorithm for mining bilingual lexicons from CC for the English-Chinese language pair. The algorithm was language independent and took into account the weight of bilingual seed words. Other language pairs, such as English-French or English-German, could also benefit. The algorithm can also be applied iteratively: better bilingual word pairs are added to the seed word list, which yields additional new bilingual word pairs. Xu et al. (2011) explained a context-based approach for the creation of bilingual lexicons from CC. The experiments covered the mapping of context words, directions and types of dependency relationships. The proposed method surpassed the state-of-the-art scheme in bilingual lexicon creation for the English-Chinese language pair. Later, Qian et al. (2012) presented a bilingual dependency mapping model for building bilingual lexicons from English to Chinese using a comparable corpus. This model considers both dependent words and their relationships when measuring the similarity between bilingual words and thus offers a more precise and less noisy representation. They also showed that bilingual dependency mappings can be created and optimized automatically without human input, yielding a medium-sized set of dependency mappings, and that their impact on Bilingual Lexicon Construction (BLC) can be fully exploited through weight learning using a simple but effective perceptron algorithm, making the approach quickly adaptable to other language pairs. For BLE from CC, Bouamor et al. (2013) presented a further approach. This research focuses on the unresolved issue of polysemous words revealed by the dictionary and introduces a word sense disambiguation step to improve the adequacy of context vectors. 
Empirical findings on two specialized French-English CC showed that the technique outperformed two state-of-the-art approaches. The most widely used methods for BLE from CC were compared by Hazem and Morin (2013a). Their observations supported the hypothesis that re-estimating word co-occurrence counts in a comparable corpus can improve the accuracy of the standard method. A year later, Chu et al. (2014a) developed a scheme for extracting bilingual lexicons that combined topic-based (Vulic et al., 2011) and context-based (Rapp, 1999; Harastani et al., 2013) methods. Experiments on Chinese-English and Japanese-English Wikipedia data showed that their approach outperformed a state-of-the-art technique. Cao et al. (2007) also described a system that automatically extracts English-Chinese translation pairs from a large quantity of monolingual Chinese web material. Candidate translations are derived using pre-specified models in this method. On over 300GB of Chinese content online, the study compares a variety of approaches to aligning transliterations and mining translations.
To provide a new perspective on BLE, Gaussier et al. (2004) presented a geometric view of BLE from CC. The strategies were evaluated on a comparable corpus extracted from the CLEF collection, which showed the strengths and weaknesses of each technique. The final results showed that a combination of relatively simple strategies improved the average precision of BLE approaches from CC by ten points.
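The context-based family of methods discussed above can be illustrated with a minimal sketch. It is not the implementation of any cited author: the toy corpora, the seed lexicon and all function names are assumptions made for illustration, and raw co-occurrence counts stand in for the association measures used in practice. Each source word gets a context vector, the vector is projected into the target language through the seed lexicon, and target candidates are ranked by cosine similarity.

```python
import math
from collections import Counter


def context_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            ctx = vectors.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def project(vector, seed_lexicon):
    """Translate a source context vector into target terms via the seed lexicon."""
    projected = Counter()
    for word, count in vector.items():
        if word in seed_lexicon:
            projected[seed_lexicon[word]] += count
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(word, src_vectors, tgt_vectors, seed_lexicon):
    """Rank every target word as a translation candidate for `word`."""
    projected = project(src_vectors[word], seed_lexicon)
    scored = [(cand, cosine(projected, vec)) for cand, vec in tgt_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy comparable corpora (accents dropped for simplicity).
english = [["doctor", "treats", "patient"], ["doctor", "works", "hospital"]]
french = [["medecin", "soigne", "patient"], ["medecin", "travaille", "hopital"]]
seed = {"treats": "soigne", "patient": "patient",
        "works": "travaille", "hospital": "hopital"}

ranking = rank_translations("doctor", context_vectors(english),
                            context_vectors(french), seed)
print(ranking[0])  # 'medecin' ranks first on this toy data
```

The iterative variant described for Fung and Yee (1998) would correspond to adding the best-ranked pairs to `seed` and repeating the ranking.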

Document Alignment Model
CC can be huge, so it is difficult to examine every sentence in the corpora. Attention is therefore restricted to those documents and sentences that have similar content. For finding similar or comparable documents, techniques like topic alignment (Zhu et al., 2013), content alignment, text alignment and cosine similarity can be employed. The document alignment techniques employed by different authors in parallel data extraction are listed in Table 6.
Table 5 (partial): ... (2010), English-French; BiLDA model, Liu et al. (2013), English-French; Direct translation approach, Goyal and Lehal (2010), Hindi-Punjabi.
Table 6 (partial): Gale and Church (1991); Resnik and Smith (2003); Zesch and Gurevych (2010).

Cosine Similarity
Cosine similarity is used to compute the similarity between two documents represented as vectors of the terms they contain. It is defined as the dot product of the vectors divided by the product of their lengths (Fung and Cheung, 2004a). Lehal et al. (2019) analyzed and compared cosine similarity, the Jaccard coefficient, Hamming distance and Euclidean distance (Fung, 1995; Yu and Tsujii, 2009). Accuracy levels were computed for these metrics. It was concluded that, if the data is imbalanced, accuracy does not reflect true performance; in that scenario, precision and recall are more informative. For Euclidean distance, cosine similarity and Jaccard similarity, precision was above 95%, higher than for Hamming distance. However, for Hamming distance, recall was higher than precision. With cosine similarity, the F1 score and accuracy were much better than with any other similarity measure. The data was taken from Wikipedia in Punjabi and English. Considerable work has been done in the field of web alignment. Alignment software and websites available online help in aligning all kinds of textual data (Nie et al., 1999; Zhang et al., 2006; Fung et al., 2010; Uszkoreit et al., 2010). Goyal et al. (2020) described the process of aligning documents based on topics. A comparable English-Punjabi corpus was built from a Wikipedia dump. PHP scripts were developed for fetching and aligning articles. Articles were aligned in two separate directories: for each Punjabi record, the corresponding English record was identified. The resulting corpus could be used for parallel data extraction. Utiyama and Isahara (2003) suggested two measures to ensure the correct alignment of articles and sentences. 
The measure for article alignment uses similarities in sentences aligned by Dynamic Programming (DP) matching, and the measure for sentence alignment uses similarities in articles aligned by Cross-Language Information Retrieval (CLIR) (Utiyama and Isahara, 2003; Fung and Cheung, 2004b; Munteanu and Marcu, 2005; Gahbiche-Braham, 2011). The two measures enhance each other and permitted accurate mining of article and sentence alignments from an extremely noisy parallel Japanese-English corpus. A large-scale article and sentence alignment corpus was built with these measures and made available to the public. Li (2011b) also introduced a technique for selecting candidate sentences for sentence alignment. The technique was evaluated mainly on bilingual Comparable Corpora (CC) obtained from Wikipedia for the English-Chinese language pair.
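The four measures compared in this section can be sketched on bag-of-words term vectors as follows. This is a minimal illustration under stated assumptions, not the setup of Lehal et al. (2019): the function names and the two example documents are hypothetical, both documents are assumed to share one vocabulary already (for a cross-lingual pair, one side would first be mapped through a dictionary), and Hamming distance is taken over binary term-presence vectors.

```python
import math
from collections import Counter


def term_vector(text):
    """Lower-cased whitespace tokens counted into a sparse term vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Shared terms divided by all distinct terms."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between the two count vectors."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


def hamming_distance(a, b):
    """Number of vocabulary terms present in exactly one document."""
    return len(set(a) ^ set(b))


doc1 = term_vector("parallel corpora are scarce")
doc2 = term_vector("comparable corpora are abundant")

print(cosine_similarity(doc1, doc2))   # 0.5
print(jaccard_similarity(doc1, doc2))  # 2 shared terms out of 6 distinct
print(euclidean_distance(doc1, doc2))  # 2.0
print(hamming_distance(doc1, doc2))    # 4
```

Cosine and Jaccard are similarities (higher means closer), while Euclidean and Hamming are distances (lower means closer), which matters when thresholds are chosen for document alignment.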

Context Alignment
Context alignment was proposed by Gale and Church (1991) and Brown et al. (1993). Brown et al. (1993) defined a series of five statistical models of the translation process and provided algorithms for estimating their parameters from a set of sentence pairs that are translations of each other. The study defined the concept of word-by-word alignment between such pairs of sentences and also offered an algorithm for finding the most probable alignment. Though the algorithm is sub-optimal, the alignment it produces accounts well for the word-by-word relationships within the sentence pair. Resnik and Smith (2003) also used word-to-word translation. Their technique measures translation similarity based on a word-by-word translation lexicon. It is also known as content-based alignment. BLE from CC is built on the distributional hypothesis (Harris, 1954) that terms with similar meanings have similar distributions across languages. Srivastava and Sanyal (2012) presented an approach that increased the performance of word alignment for a small English-Hindi PC. Their model used POS tagging with word alignment and achieved a significant decrease in Alignment Error Rate. Post et al. (2012) compiled and fine-tuned PC at the document level between English and six verb-final languages: Bengali, Hindi, Malayalam, Tamil, Telugu and Urdu. Their research presented a set of six Parallel Corpora (PC) containing four-way redundant translations of the source-language text. They noted that the Indian languages of these corpora are low-resource and understudied and exhibit markedly different linguistic properties compared to English. 
Their study included performing baseline experiments quantifying the translation performance of several systems, investigated the effect of data quality on model quality and suggested many approaches that could improve the quality of models constructed from the datasets. They also concluded that the PC provides a suite of SOV languages for translation research and experiments.
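As an illustration of word-by-word alignment estimated from sentence pairs, the following is a minimal sketch of expectation-maximization training in the style of the simplest of the Brown et al. (1993) models. It is a simplification under stated assumptions: there is no NULL source word, the three toy sentence pairs are hypothetical, and the function names are chosen for illustration only.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate word translation probabilities prob[(t, s)] ~ P(t | s) with EM."""
    target_vocab = {t for _, tgt in sentence_pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)
        # E-step: distribute each target word over the source words of its pair.
        for src, tgt in sentence_pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    counts[(t, s)] += frac
                    totals[s] += frac
        # M-step: renormalise the expected counts per source word.
        for (t, s), c in counts.items():
            prob[(t, s)] = c / totals[s]
    return prob


def best_translation(prob, source_word):
    """Return the target word with the highest P(t | source_word)."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy parallel corpus: (source tokens, target tokens).
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

model = train_ibm_model1(pairs)
for word in ["das", "haus", "buch", "ein"]:
    print(word, "->", best_translation(model, word))
```

Even though no pair is labeled at the word level, the co-occurrence pattern across the three sentence pairs is enough for EM to separate "haus" from "das" and "buch" from "ein".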

Parallel Sentences and Fragments Extraction
Data collected from different web sources in the form of comparable, quasi-comparable, or noisy parallel corpora is used to mine parallel data in the form of sentences and fragments. PC are phrase-aligned bilingual documents. They are vital tools for natural language production in bilingual or multilingual contexts. PC provide the majority of translation knowledge, but their quality and quantity are limited. A significant portion of the phrases encountered at run-time for such language pairs is unknown. Integrating paraphrases into statistical machine translation, according to Callison-Burch et al. (2006), yields substantial improvements in coverage and translation accuracy. Paraphrases, in essence, introduce a degree of generalization into statistical machine translation. Their study was able to take advantage of information outside the translation model, such as terms with similar meanings, and apply it to the translation process. Parallel sentences can be identified by classification (Munteanu and Marcu, 2005; Tillmann, 2009; Smith et al., 2010; Bharadwaj and Varma, 2011; Stefanescu et al., 2012) or by similarity measures (Utiyama and Isahara, 2003; Fung and Cheung, 2004; Fung et al., 2010; Abdul-Rauf and Schwenk, 2011; Abdul-Rauf et al., 2017).
Various techniques for sentence extraction, along with language pairs, are listed in Table 7. The techniques used by authors for classifying and mining sentences from aligned documents are elaborated below. Munteanu and Marcu (2005) used a classifier for mining parallel sentences. The model uses linear functions and classifies sentence pairs into parallel and non-parallel classes. However, the classification was skewed, as most of the sentences were labeled non-parallel, which created an imbalance in extraction. Chu et al. (2013a) also suggested a procedure for extracting sentences from a quasi-comparable corpus. The system trained and tested a classifier for parallel sentence extraction. The study used linguistic information about Chinese characters for extraction. Smith et al. (2010) used a Maximum Entropy model to address the problem faced in the above technique. The same model used in the classification technique was used here as well. Sentences are chosen based on probability scores: the higher the score, the greater the chance that the sentence pair is parallel. Fung and Cheung (2004a) suggested a multi-level bootstrapping method for parallel sentence extraction from quasi-comparable corpora. The research examined the suitability of various bilingual corpora for a trilingual natural language system. Beginning with parallel, comparable and non-parallel corpora, a variety of bilingual corpora were contrasted and differentiated. A lexical alignment score, computed for the bilingual lexicon over the matched bilingual sentence pairs, is then used to assess the usability of each corpus type. Fung et al. (2010) introduced a new multilingual web crawler and sentence extraction method for mining parallel sentences from trillions of websites with no regard for domain or address architectures or publication dates. 
Their primary goal was to improve statistical machine translation systems.
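A similarity-based selection of candidate sentence pairs can be sketched as follows. This is a hedged illustration rather than the method of any cited work: it combines a length-ratio check with the share of source words that have a dictionary translation in the target sentence, and the lexicon, sentences, thresholds and function names are all assumptions. In a classifier-based pipeline, a filter of this kind would typically only pre-select pairs that a trained model then scores.

```python
def lexical_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source words with at least one lexicon translation in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    covered = sum(1 for word in src_tokens if lexicon.get(word, set()) & tgt_set)
    return covered / len(src_tokens)


def length_ratio_ok(src_tokens, tgt_tokens, max_ratio=2.0):
    """Reject pairs whose lengths differ by more than `max_ratio`."""
    if not src_tokens or not tgt_tokens:
        return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return longer / shorter <= max_ratio


def extract_candidate_pairs(src_sentences, tgt_sentences, lexicon, threshold=0.5):
    """Keep every (source, target, score) pair that passes both filters."""
    pairs = []
    for src in src_sentences:
        for tgt in tgt_sentences:
            if not length_ratio_ok(src, tgt):
                continue
            score = lexical_overlap(src, tgt, lexicon)
            if score >= threshold:
                pairs.append((src, tgt, score))
    return pairs


# Hypothetical seed lexicon (source word -> set of target words) and documents.
lexicon = {"the": {"das", "der", "die"}, "house": {"haus"},
           "is": {"ist"}, "small": {"klein"}, "tea": {"tee"}}
english_doc = [["the", "house", "is", "small"], ["i", "like", "tea"]]
german_doc = [["das", "haus", "ist", "klein"], ["es", "regnet", "heute"]]

candidates = extract_candidate_pairs(english_doc, german_doc, lexicon)
print(candidates)  # only the first English and first German sentence pair up
```

Because most sentence pairs in aligned comparable documents are not translations of each other, the threshold controls the same precision and recall trade-off that the imbalance discussed above creates for classifiers.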

Conditional Random Field
For aligning parallel sentences, Smith et al. (2010) made use of a conditional random field. With this approach, only sentences present in the aligned documents can be extracted. The same technique was also followed by Blunsom and Cohn (2006). Their study worked on a small set of data and made use of GIZA++ for training. In this technique, every word is aligned to its target word and, in turn, a target word can be aligned to a number of source words. Wolk and Marasek (2014) presented a method that constructed PC from noisy parallel and CC. Wikipedia was selected as the data source for the Polish and English languages. A web crawler was used to obtain bilingual articles from Wikipedia. The Hunalign tool was used for sentence alignment. Freely available translators were used to translate Polish into English. The MGIZA++ tool was used for word and sentence alignment. Training was then done using Moses, an open-source SMT toolkit (Koehn et al., 2007). BLEU was used for evaluation. Finally, to assess quality and quantity, human translators manually aligned the articles at the sentence level. The study was limited by its reliance on human intervention and by the small amount of data available.

LEXACC
It stands for "Lucene based Parallel Sentence Extraction from Comparable Corpora". Stefanescu and Ion (2013) identified a series of parallel sentences for three language pairs, English-German, English-Romanian and English-Spanish, which were extracted from Wikipedia. To do so, they used LEXACC, a method developed during their project for extracting parallel sentences from CC. Stefanescu et al. (2012) made use of CLIR to find parallel sentences. With the help of a seed dictionary, the source words are translated into target words. Rahimi et al. (2016) explained CLIR ("Cross-Language Information Retrieval") and the extraction of translations from CC for CLIR. CLIR performance is directly linked to translation quality, so a proper translation model must be built from the available CC. The experimental work involved gathering an English-Persian CC from news articles in both languages. A successful translation model was built from the available CC without any additional linguistic tools. To extract correlations between each pair of bilingual terms, a language modeling method was proposed. Monolingual word co-occurrence relations were integrated with translational relations for the translation of low-frequency terms. Various estimates of translation probabilities from word correlations were compared. It was therefore claimed that the estimation method affected the efficiency of cross-language information retrieval.
Some other authors also contributed to "Parallel Sentence Extraction". Kumar and Goyal (2018) used a statistical approach to investigate the design of a Hindi to Punjabi machine translation system. A set of 3 lakh (300,000) parallel sentences was the starting point for the creation of the system. The parallel sentences were developed using different tools such as Akhar, Microsoft's bilingual sentence aligner, a spell checker, a tokenizer and the translation software mentioned in Kumar and Goyal (2012). The parallel corpus used by Kumar and Goyal (2018) was supplemented with approximately one lakh (100,000) Hindi-Punjabi lexicon entries. For statistical analysis of the Hindi and Punjabi languages, pre-processing and post-processing modules were developed. For pre-processing, "Word Tokenizer" and "Text Normalization" modules were developed. To improve precision, the Transliteration and Grammatical Error Correction modules were used. The GIZA++ tool was used to create the translation model and Moses (Koehn et al., 2007) was used as the decoder. BLEU and NIST scores were used to assess quality. Deep et al. (2018) presented different sources for collecting English and Punjabi data. Their work presents a Punjabi-English parallel corpus named Pun Eng. They used human translation and online translators to convert the data into the required language. The entire data was cleaned and unnecessary tags were removed manually. After tag removal, translation with Google Translate and human verification, they were left with parallel sentences. Premith et al. (2019) presented a neural MT technique for building four Parallel Corpora (PC) in the language combinations English-Malayalam, English-Hindi, English-Tamil and English-Punjabi. The data was gathered as text from both online and offline sources. The resulting models were evaluated both automatically and manually. 
The BLEU score was used for automatic evaluation and three criteria, fluency, rating and adequacy, were used for manual evaluation. Long sentences were found in the English-Malayalam and English-Hindi corpora, which affected the translation. The attention mechanism was therefore applied to the problem of translating long sentences. Their findings revealed that, in addition to corpus size and coverage, sentence length plays an important role in translation quality. Later, Agic and Vulic (2020) created a parallel corpus for 300 languages with nearly one lakh (100,000) sentences per language. Their study worked on extracting parallel sentences and building a corpus from them. The resulting corpus was named JW300 and is freely available online. It can be used for part-of-speech tagging projects as well as for cross-lingual processing.
When little parallel data is available, the focus turns towards non-parallel data. Extracting full sentences from non-parallel data is not feasible, so fragment extraction is required. Fragments are phrases present in the sentences.
A number of authors have used various techniques to mine fragments. The approaches to fragment mining are described below. In the approach of Munteanu and Marcu (2006), the system is provided with some sentence pairs from the corpus. These sentence pairs are obtained by applying the GIZA++ tool to the given data. Fragments are then extracted only from those sentences that have an exact translation. The main drawback of this technique is that the system has to be provided with correctly translated words in the seed dictionary. Hewavitharana and Vogel (2013) extracted fragments from the available non-parallel corpus. In their technique, a source fragment and a sentence pair are taken and the words are aligned. The words are combined with the help of heuristics. On the target side, split points are searched for based on word alignment probabilities. Words inside the source phrase align with words inside the target phrase, and words outside the source phrase align with words outside the target phrase. Alignment proceeds jointly on the source and target sides.

Chunking Approach
A chunk is a small part or phrase of a sentence. These small phrases are extracted from the main sentence and translated individually. A chunk can be placed anywhere in the target sentence. The words in the chunk remain the same even after translation. Gupta et al. (2013) used the chunking method to translate a source fragment and measured the similarity between the translated source fragment and target fragments to classify the target fragment. The study presented an automated method for extracting parallel English-Bengali text fragments from CC built from Wikipedia material. The method takes advantage of Wikipedia's multilingualism. The study also found that using an out-of-domain corpus was beneficial in training a site-specific MT system.
Table 7: Techniques for sentence extraction, with authors and language pairs
Classification
  Munteanu and Marcu (2005): Chinese, Arabic, English, German
  Tillmann (2009): Spanish-English
  Bharadwaj and Varma (2011): English-Hindi
  Chu et al. (2013): Chinese
Maximum Entropy Model
  Smith et al. (2010): Spanish-English, Bulgarian-English, German-English
Sentence Similarity
  Fung and Cheung (2004a): English-Chinese
  Utiyama and Isahara (2003): Japanese-English
  Fung et al. (2010): Chinese
  Abdul-Rauf and Schwenk (2011): Arabic-English, French-English
Conditional Random Field
  Smith et al. (2010): Spanish-English, Bulgarian-English, German-English
  Blunsom and Cohn (2006): French-English, Romanian-English
  Och and Ney (2003): German-English, French-English
LEXACC
  Stefanescu et al. (2012): English, Estonian, German, Greek, Lithuanian, Latvian, Romanian, Slovene
  Stefanescu and Ion (2013): English-German, English-Romanian and English-Spanish

Improving SMT Accuracy
In "Statistical Machine Translation" (Koehn, 2009; Och and Ney, 2003; Brown et al., 1993), the translation model is trained in an unsupervised manner from parallel corpora. The translation model consists of translation pairs together with their feature scores. The accuracy of SMT suffers from inaccurate translation pairs and feature scores, which arise from the paucity of parallel corpora. Accuracy can be improved by (i) increasing the amount of parallel corpora, (ii) filtering noisy translation pairs from the translation model, and (iii) estimating new features from comparable corpora for the translation pairs. Parallel corpora are not easily available for many languages and domains, so increasing their quantity is not an easy task.
Filtering noisy translation pairs can increase accuracy, but it can also remove some good translation pairs, which in turn decreases the coverage of the translation model.
Comparable features (Irvine and Callison, 2013), such as similarity scores obtained from comparable corpora, can be combined with the original features to differentiate between good and bad translation pairs. BLE can be used to address the accuracy issues in SMT. Different similarities, such as topical, contextual (Rapp, 1999), orthographic and temporal, can be used individually or in combination for bilingual lexicon extraction. SMT quality and coverage issues were discussed together with BLE by Irvine and Callison (2013); Pal et al. (2014); Marton et al. (2009); Ganitkevitch and Callison-Burch (2014). For six languages with limited resources, a comparable corpus was used to improve the accuracy and coverage of phrase-based Machine Translation models built from small bilingual corpora. The experiments show that each of these approaches improves the BLEU score on its own. Nevertheless, the findings suggest that adding translations of low-frequency words improves performance more than adding translations of OOVs (out-of-vocabulary words) (Callison-Burch et al., 2006) alone. The results showed improvement when little parallel training data is available. Richardson et al. (2013) showed that contextual features can dramatically improve the quality of transliteration. In addition, even for out-of-domain source terms with an unknown topic distribution, their extended model may produce a considerable improvement in accuracy. Chu et al. (2014a) used paraphrases along with BLE to address the accuracy problem. Paraphrases can also be used as training data to improve the accuracy of SMT. Paraphrases can be generated from a parallel corpus and can thus also reduce the problem of data sparseness.
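The trade-off between filtering and coverage, and the role of features estimated from comparable corpora, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the phrase pairs, the feature names `translation_prob` and `contextual_sim`, the equal weights and the threshold. A real SMT system would combine such features in a log-linear model with tuned weights rather than a fixed linear sum.

```python
def combined_score(features, weights):
    """Weighted sum of the feature values of one translation pair."""
    return sum(weights[name] * value for name, value in features.items())


def filter_phrase_table(table, weights, threshold):
    """Keep the translation pairs whose combined score reaches the threshold."""
    kept = {}
    for pair, features in table.items():
        score = combined_score(features, weights)
        if score >= threshold:
            kept[pair] = score
    return kept


# Hypothetical phrase table: one feature from the parallel corpus and one
# similarity score estimated from comparable corpora.
phrase_table = {
    ("maison", "house"): {"translation_prob": 0.6, "contextual_sim": 0.8},
    ("maison", "horse"): {"translation_prob": 0.3, "contextual_sim": 0.1},
    ("logis", "dwelling"): {"translation_prob": 0.2, "contextual_sim": 0.9},
}

parallel_only = {"translation_prob": 1.0, "contextual_sim": 0.0}
with_comparable = {"translation_prob": 0.5, "contextual_sim": 0.5}

kept_parallel = filter_phrase_table(phrase_table, parallel_only, threshold=0.5)
kept_combined = filter_phrase_table(phrase_table, with_comparable, threshold=0.5)
print(sorted(kept_parallel))
print(sorted(kept_combined))
```

With the parallel-corpus feature alone, the rare but correct pair is filtered out together with the noisy one; adding the comparable-corpus feature keeps it while still removing the noisy pair.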

Results and Discussion
The findings of the systematic literature review are organized in accordance with the research questions listed in Table 3. A total of 188 papers were reviewed in this survey. The survey focused on the proposed techniques through which parallel data can be extracted from non-parallel data. Different approaches to "parallel data extraction" used by researchers are covered in this survey. Out of the 188 papers, 34% fall under the term "Machine Translation", whereas 9% concern "Statistical Machine Translation". Furthermore, 20% and 15% of the papers concern "Comparable Corpora" and "Parallel Corpora" respectively. Additionally, 10% of the papers concern "Bilingual Lexicon Extraction" and 6% contribute to "Parallel Sentence Extraction". These papers, published in esteemed journals, conferences and workshops, are depicted in Fig. 3.
The research papers were collected from databases such as IEEE, ACL, ACM, Springer and Science Direct, and from journals indexed in Google Scholar and Citeseer. A total of 188 papers were selected for writing this review. Of these, 12% were published by IEEE, 16% by ACM, 9% by Springer and 4% by Science Direct, while ACL contributed 26%. A further 34% were papers accepted at conferences and workshops and indexed in Google Scholar. The contribution of papers from different sources is depicted in Fig. 2.
The survey performed is based on questions framed which are mentioned in Table 3. We will provide insight into these questions with justifications as per the literature we reviewed.

RQ1: What is the current status of data extraction in respect of parallel and comparable data?
It has been discovered from the literature survey that extracting parallel data requires several resources, such as comparable data, a seed dictionary, bilingual lexicons and alignment tools. Even then, the parallel data is retrieved from parallel fragments and sentences. To study this process, different databases were searched for relevant works in these fields. Around 188 papers were studied after applying the inclusion/exclusion criteria mentioned in Research Process and in Fig. 4. Over the years, different researchers used various techniques for finding comparable data, aligning the data and extracting lexicons, fragments and sentences. Very little work appeared in the period 1990-2000, as shown in Fig. 7. With the passage of time and improvements in technology, a noticeable amount of research was carried out in the period 2001-2020, as shown in Fig. 7.
It is observed from Fig. 8 that most of the work was done from 2010 to 2014. Fig. 9 shows the comparative analysis for the period 2010-2014. The comparative analysis is based on the selected research papers, grouped by the keywords used in the literature survey. Fig. 9 shows that 25 papers were published in this period on "Machine Translation", 7 on "Statistical Machine Translation", 21 on "Comparable Corpora" and 14 on "Parallel Corpora".
RQ2: What are the various datasets used in the process of extraction?
The literature survey showed that various authors used multiple kinds of datasets for mining parallel data from comparable data. Different researchers worked with a variety of European and Asian language pairs. The textual data was taken from numerous online as well as offline resources. Data was taken from Wikipedia (Stefanescu and Ion, 2013; Chu et al., 2014; Adafre and Rijke, 2006; Smith et al., 2010; Mohammadi and Ghasem, 2010), bilingual newspapers (Zhao and Vogel, 2002; Munteanu and Marcu, 2005; Tillmann, 2009; Do et al., 2010), e-books such as Gyan Nidhi and PSEB e-books, the Bible, social media, bilingual websites, etc. Table 8 lists the different datasets used by researchers in this survey of 188 papers, along with their sizes. Fig. 6 shows the percentage of different datasets used by the researchers covered in this survey. It is evident from Fig. 6 that news and Wikipedia datasets are the most widely used. In this survey, 39% of researchers used news or newspapers as a source of data for comparable corpora, since news is easily available in bilingual form. Wikipedia was also a major source of comparable data, contributing nearly 28%. Wikipedia has its own translator for various languages. The literature survey makes it evident that Wikipedia is widely used by researchers because of its easy availability.
RQ3: What type of machine translation approaches are used for translation in different language domains?
The systematic literature survey also focused on different works in the field of "Machine Translation". There are different approaches to MT, such as Rule-Based, Corpus-Based and Hybrid. These approaches are further subdivided into Interlingual, Statistical, Direct, etc. Figure 1 presents the division of the various MT approaches. Based on these approaches, several MT systems were created by various researchers. In this literature survey, we gathered 188 papers matching the important keywords shown in Fig. 3. With a focus on these 188 papers, MT systems created by several authors were studied. Table 1 lists various MT systems created for Indian languages along with some key features and the language pairs used by the researchers. Table 2 lists the MT systems created by researchers for European and Asian languages (other than Indian languages), again with the prominent features and the MT approach followed. Tables 1 and 2 show that, despite the work done in the field of MT, much remains to be accomplished in "Statistical Machine Translation" without human intervention. This literature survey focuses on the "Statistical Machine Translation" approach and on various methods to mine parallel data from comparable data under SMT.
RQ4: What kind of parallel data can be mined by using CC?
Parallel data is a collection of texts in two or more languages that are exact translations of each other. "Parallel Corpora" are aligned texts where one is in the source language and the other in the target language. The target language is the one into which the translation is made. PC are in high demand when translation is done in the context of SMT. But PC remain a scarce resource because they are not available in sufficient quantity and quality (Ali et al., 2010; Srivastava and Bhat, 2013; Post et al., 2012). So, CC are exploited to obtain parallel data in the form of lexicons, fragments and sentences. Bilingual datasets can be easily created from textual dumps of different languages, available through Wikipedia, Bilateral articles, etc. This data can further be filtered, aligned and cleaned to form "comparable corpora". Different techniques are used for mining bilingual lexicons, parallel sentences and fragments, which are mentioned under the heading "Parallel Data Extraction" of this literature survey. Figure 10 presents a mind map that covers numerous aspects, properties and extraction methods of lexicons, sentences and fragments derived from this literature survey. "Bilingual lexicons", "Parallel Fragments" and "Parallel Sentences" together form "Parallel Corpora". All the mining techniques are elaborated in the sections "Parallel Resources" and "Parallel Sentence and Fragment Extraction". Tables 5 and 7 also give an insight into the techniques used by authors for "Bilingual Lexicon Extraction" and "Parallel Sentence Extraction" respectively.
RQ5: What are the different kinds of parallel sentence and fragment extraction techniques followed?
The systematic literature survey is concerned with ways of mining parallel data from the available non-parallel data. Because a sufficient amount of parallel data is unavailable, attention shifts to mining comparable data, from which "Parallel Sentences" and "Parallel Fragments" can be mined. This survey covers the 188 papers that remained after applying the Inclusion/Exclusion Criteria described in the Research Process. From these 188 papers, the survey collates the extraction techniques for "Parallel Sentences" shown in Table 7. The paper also gives a clearer insight into the mining procedures that authors have followed for different language pairs in "Parallel Sentence and Fragment Extraction".
RQ6: What are the different ways to extract Bilingual Lexicons from comparable data?
Many bilingual Natural Language Processing (NLP) tasks, such as "statistical machine translation", rely heavily on bilingual lexicons. Automatic construction of bilingual lexicons is desirable because manual construction is extremely tedious and costly. One approach is to mine bilingual lexicons from "parallel corpora". However, as noted in the Introduction of this study, "parallel corpora" are not available in sufficient quantity or quality. Extracting "bilingual lexicons" from CC is an appealing alternative, since "comparable corpora" are much more commonly accessible than "parallel corpora". Lexicons are also an integral part of building PC. Table 5 presents the extraction techniques for "bilingual lexicons" used in the 188 papers included in the survey. A detailed discussion of BLE is provided in the section "Bilingual Lexicon Extraction" of this literature review.
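
As a rough illustration of how a bilingual lexicon can be mined from comparable (non-aligned) text, the sketch below uses context-vector comparison with a small seed dictionary, a commonly used BLE baseline. It is not the method of any particular paper in this survey; the function names, the window size, the seed dictionary and the toy sentences are all assumptions made for the example.

```python
from collections import Counter
import math

def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def project(vec, seed):
    """Rewrite a source-language context vector in target-language words."""
    out = Counter()
    for w, c in vec.items():
        if w in seed:  # context words missing from the seed dictionary are dropped
            out[seed[w]] += c
    return out

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_translation(word, src_vecs, tgt_vecs, seed):
    """Pick the target word (not already in the seed) with the most similar context."""
    projected = project(src_vecs[word], seed)
    known = set(seed.values())
    scores = {t: cosine(projected, v) for t, v in tgt_vecs.items() if t not in known}
    return max(scores, key=scores.get)

# Toy comparable corpora (invented): same topic, but not sentence-aligned.
src = [["el", "gato", "bebe", "leche"],
       ["el", "perro", "come", "carne"],
       ["un", "gato", "bebe", "agua"]]
tgt = [["the", "cat", "drinks", "milk"],
       ["a", "dog", "eats", "meat"],
       ["the", "cat", "drinks", "water"],
       ["the", "dog", "eats", "bread"]]
# Hypothetical seed dictionary; "gato" is deliberately left out.
seed = {"el": "the", "bebe": "drinks", "leche": "milk", "perro": "dog",
        "come": "eats", "carne": "meat", "agua": "water"}

src_vecs = context_vectors(src)
tgt_vecs = context_vectors(tgt)
result = best_translation("gato", src_vecs, tgt_vecs, seed)
print(result)
```

In practice the raw counts would usually be replaced by an association measure and the candidates ranked rather than reduced to a single best match, but the overall flow (build context vectors, project through a seed dictionary, compare) stays the same.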

Conclusion
To summarize, this systematic literature survey was conducted on 188 research papers collected from databases such as ACM, ACL, IEEE, Springer, ScienceDirect, etc., as detailed in Fig. 2. Papers were also taken from conference proceedings and workshops, using the keywords listed under the subheading "Vital Keywords". A search across 5 digital libraries returned 1270 papers. After applying the inclusion/exclusion criteria to these 1270 papers, 188 were retained for this literature survey. The results are presented in the form of tables, figures, pie charts, flow diagrams, a mind map, bar graphs, etc. Fig. 10 presents a mind map that gives a clear picture of the different aspects involved in machine translation. The study identifies the contributions of different researchers to parallel data mining from CC. PC remain a scarce resource, which is a major hurdle in developing statistical machine translation for many language pairs. However, a large amount of comparable and non-parallel corpora is available and can be used to extract parallel data. The work has described the different kinds of "Parallel Data Extraction" that can be performed on CC, namely "Parallel Sentence Extraction", "Parallel Fragment Extraction" and "Bilingual Lexicon Extraction". The paper also proposes an architecture for mining parallel data with the help of bilingual lexicons, fragments and sentences under "Statistical Machine Translation". Thus, data mined from CC can be of considerable importance in forming a "parallel corpus" for language pairs with a shortage of PC resources.