Evaluating Machine Translation for Domain Specific Low-Resource Nepali-English Language Pairs: The Impact of Tokenization on Statistical and Neural Techniques
- Department of Computer Science, Assam University, Silchar, Cachar, Assam, India
Abstract
In recent years, Machine Translation (MT) has shifted decisively towards Neural Machine Translation (NMT) techniques, which have surpassed traditional Statistical Machine Translation (SMT) models in translation quality. Even so, the efficacy of these techniques can differ depending on the language pair under consideration. NMT typically needs sizable parallel corpora to attain high translation accuracy, whereas SMT is somewhat more flexible in this regard. As a result, a benchmark system that offers adequate translation for low-resource languages such as Nepali is still lacking. This paper focuses on translating text with statistical and neural MT techniques for the under-resourced English-Nepali language pair. As part of this system development, we built an English-Nepali parallel corpus in the tourism domain. We explore the impact of different tokenization techniques on translation outcomes, and we analyze the performance of both approaches in detail using the automatic evaluation metrics BLEU and TER. The paper aims to provide insights into the applicability of SMT and NMT for the under-resourced English-Nepali language pair under two popular tokenization approaches and to determine the most effective approach for achieving accurate translations.
DOI: https://doi.org/10.3844/jcssp.2025.3041.3050
Copyright: © 2025 Amit Kumar Roy and Bipul Syam Purkayastha. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Statistical MT
- Neural MT
- Tokenization
- SentencePiece
- Low-Resource MT
- Nepali Language