Named Entity Recognition for Kannada using Gazetteers list with Conditional Random Fields
K. P. Pallavi, L. Sobha and M. M. Ramya
DOI: 10.3844/jcssp.2018.645.653
Journal of Computer Science
Volume 14, Issue 5
Named Entities (NEs) occurring in sentences are essential for building Natural Language Processing (NLP) applications that perform Information Extraction (IE) from large corpora. However, generating a large corpus is challenging for resource-poor languages such as Kannada. Further, no annotated corpus is available online. Annotating NEs with pre-defined classes is difficult for two reasons: NEs are morphologically joined with other words, and spelling variations are frequent in Kannada words. Sentence structure also depends on the morphology, parts of speech (POS) and chunking of a language, and these parameters differ from one language to another. To address these challenges, a novel application system is proposed to identify NEs in Kannada using a large corpus of 73,676 tokens. The Named Entity Recognition (NER) system consists of a robust POS tagger and a Noun Phrase (NP) chunker developed for generic data. Five gazetteer lists were created from many orthographic patterns for each word. Context information, namely the previous two words, the next two words, word morphology and gazetteer lists, was added to the feature lists. A unigram-bigram template was designed and incorporated into Conditional Random Fields (CRFs) to generate conditional feature functions. The proposed system achieved F-measures of 86.85% on gold test data and 71.01% on newspaper data.
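The context features described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `<PAD>` boundary marker, the three-character suffix used as a crude stand-in for word morphology, and the toy gazetteer are all assumptions made for illustration. The combined previous-word/current-word feature is only a loose analogue of a unigram-bigram template, and the POS and chunk features of the actual system are omitted.

```python
from typing import Dict, List, Set

PAD = "<PAD>"  # hypothetical marker for positions outside the sentence


def token_features(tokens: List[str], i: int,
                   gazetteers: Dict[str, Set[str]],
                   suffix_len: int = 3) -> Dict[str, str]:
    """Build a feature dict for tokens[i] from a +/-2 word window,
    a suffix (assumed proxy for morphology) and gazetteer membership."""

    def word_at(offset: int) -> str:
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else PAD

    word = tokens[i]
    feats = {
        "w0": word,
        "w-1": word_at(-1),
        "w-2": word_at(-2),
        "w+1": word_at(1),
        "w+2": word_at(2),
        "suffix": word[-suffix_len:],
        # combined feature over the previous and current word
        "w-1|w0": word_at(-1) + "|" + word,
    }
    # one binary feature per gazetteer list
    for name, entries in gazetteers.items():
        feats["gaz_" + name] = "1" if word in entries else "0"
    return feats


def sentence_features(tokens: List[str],
                      gazetteers: Dict[str, Set[str]]) -> List[Dict[str, str]]:
    """Feature dicts for every token in a sentence."""
    return [token_features(tokens, i, gazetteers) for i in range(len(tokens))]


# Toy input: placeholder tokens and a one-entry illustrative gazetteer.
example_tokens = ["a", "b", "bengaluru", "d"]
example_gazetteers = {"location": {"bengaluru"}}
example_features = sentence_features(example_tokens, example_gazetteers)
```

Per-token feature dictionaries of this shape are a common input format for CRF toolkits; a real system would replace the suffix heuristic with output from a morphological analyser.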
© 2018 K. P. Pallavi, L. Sobha and M. M. Ramya. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.