Named Entity Recognition for Kannada using Gazetteers list with Conditional Random Fields

K. P. Pallavi; L. Sobha; M. M. Ramya

doi:10.3844/jcssp.2018.645.653

Research Article Open Access

K. P. Pallavi¹, L. Sobha² and M. M. Ramya¹

¹ Hindustan Institute of Technology and Science, India
² AUKBC, India

Abstract

Named Entities (NEs) in sentences are essential for building Natural Language Processing (NLP) applications that perform Information Extraction (IE) from large corpora. However, generating a large corpus is challenging for resource-poor languages such as Kannada, and no annotated corpus is available online. Annotating NEs with pre-defined classes is difficult for two reasons: NEs are morphologically joined with other words, and spelling variations are frequent in Kannada words. Sentence structure also varies with the morphology, parts of speech (POS) and chunking of a language, and these parameters differ from one language to another. To address these challenges, a novel system is proposed to identify NEs in Kannada using a large corpus of 73,676 tokens. The Named Entity Recognition (NER) system consists of a robust POS tagger and a Noun Phrase (NP) chunker developed for generic data. Five gazetteer lists were created from many orthographic patterns for each word. Context information, such as the previous two words, the next two words, word morphology and gazetteer lists, was added to the feature lists. A unigram-bigram template was designed and incorporated into Conditional Random Fields (CRFs) to generate conditional feature functions. The proposed system achieved F-measures of 86.85% and 71.01% on gold test data and newspaper data, respectively.
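To make the feature description concrete, the following is a minimal Python sketch of per-token feature extraction with a two-word context window, a suffix feature and gazetteer lookups. It is not the authors' implementation. The function names, the feature keys, the padding symbols, the two gazetteer lists and the romanized example tokens are all hypothetical. The fixed-length suffix is only a crude stand-in for the word morphology used in the paper, and the word-pair feature is one assumed reading of the "bigram" part of the template.

```python
from typing import Dict, List, Set

# Hypothetical padding symbols for positions outside the sentence.
PAD_LEFT = "<BOS>"
PAD_RIGHT = "<EOS>"


def word_at(tokens: List[str], i: int) -> str:
    """Return the token at position i, or a padding symbol if i is out of range."""
    if i < 0:
        return PAD_LEFT
    if i >= len(tokens):
        return PAD_RIGHT
    return tokens[i]


def token_features(
    tokens: List[str],
    i: int,
    gazetteers: Dict[str, Set[str]],
    suffix_len: int = 3,
) -> Dict[str, str]:
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    feats: Dict[str, str] = {}

    # Unigram context: previous two words, current word, next two words.
    for offset in (-2, -1, 0, 1, 2):
        feats[f"w[{offset}]"] = word_at(tokens, i + offset)

    # Assumed bigram feature: previous word joined with the current word.
    feats["w[-1]|w[0]"] = word_at(tokens, i - 1) + "|" + word

    # Crude morphology stand-in (assumption): the last few characters.
    feats["suffix"] = word[-suffix_len:]

    # One binary flag per gazetteer list.
    for name, entries in gazetteers.items():
        feats[f"gaz:{name}"] = "1" if word in entries else "0"

    return feats


# Made-up romanized tokens and gazetteer entries, for illustration only.
tokens = ["ravi", "mysuru", "nagarakke", "hodanu"]
gazetteers = {"person": {"ravi"}, "location": {"mysuru"}}
features = [token_features(tokens, i, gazetteers) for i in range(len(tokens))]
```

Feature dictionaries of this shape, one per token, are the usual input to a CRF toolkit, which turns each key-value pair into feature functions conditioned on the output labels.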

Journal of Computer Science

Volume 14 No. 5, 2018, 645-653

DOI: https://doi.org/10.3844/jcssp.2018.645.653

Submitted On: 3 October 2017 Published On: 24 February 2018

How to Cite: Pallavi, K. P., Sobha, L. & Ramya, M. M. (2018). Named Entity Recognition for Kannada using Gazetteers list with Conditional Random Fields. Journal of Computer Science, 14(5), 645-653. https://doi.org/10.3844/jcssp.2018.645.653

Copyright: © 2018 K. P. Pallavi, L. Sobha and M. M. Ramya. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Named Entities
Natural Language Processing
Noun Phrase Chunker
Conditional Random Fields