Improving Arabic Named Entity Recognition with a Modified Transformer Encoder

Abstract: This article investigates the use of a transformer encoder for Arabic Named Entity Recognition (NER). The classic transformer, originally proposed for machine translation, adopts absolute sinusoidal position embeddings, which are aware of distance but not of directionality. In the NER task, however, both distance and direction are crucial. Therefore, in this study, instead of using absolute sinusoidal position encoding, we employ relative positional encoding and incorporate directionality information in our NER model. More specifically, our proposed model uses a Bidirectional Long Short-Term Memory (BiLSTM) network to encode every input token. The output of the encoder is then fed to a multi-head attention layer where both distance and directionality information are incorporated. The decoder layer, a simple fully connected layer, takes the result of the attention layer as input, and the prediction layer with Conditional Random Fields (CRF) predicts the tag of each token. We validate our proposed approach on two merged public datasets, namely ANERCorp and AQMAR. Our experimental results demonstrate significant improvements when compared to the vanilla Transformer with absolute sinusoidal position encoding, while achieving a state-of-the-art result on the two merged Arabic public datasets.


Introduction
Named Entity Recognition (NER) is one of the most studied Natural Language Processing (NLP) tasks. Given a text, the NER model's goal is to assign to each word a label from a pre-defined set of labels (Li et al., 2020). NER is an important text pre-processing phase in several NLP tasks such as question answering, relation extraction, and so forth. More specifically, in NLP tasks such as machine translation, question answering, text clustering, and relation extraction, NER produces useful information from chunks of raw text to be used in such tasks (Roy, 2021). In the literature, numerous natural languages, including English (Sun et al., 2018), Chinese (Cao et al., 2018), and Arabic (Ali et al., 2020), have been the focus of NER research.
Arabic is widely spoken: Over 25 nations have more than 360 million Arabic speakers. Arabic is rich in morphology and semantics, which is one of its important features, and it is one of the five recognized international languages (Ali et al., 2019). When it comes to NER research, Arabic has recently emerged as one of the most significant challenges because of: (1) Particular features and peculiarities of the Arabic language (Farghaly and Shaalan, 2009) and (2) The scarcity of labeled corpora currently available. Several Arabic NER models and a small number of publicly available datasets have been suggested in the literature (Shaalan, 2014). In particular, the ANERCorp dataset was created by Benajiba and Rosso (2007), while Mohit et al. (2012) proposed the AQMAR dataset. These datasets have a limited number of classes; therefore, the WikiFANEGold dataset, which contains 50 classes, was proposed by Alotaibi and Lee (2014). Despite significant efforts that have been made recently in Arabic NER, challenges still exist because of some characteristics of the Arabic language. Named entity recognition consists of two subtasks: (1) Detection of entities in the raw text: If capital letters could be used as indicators to determine where named entities start and end, as in English, this would be a fairly easy task. However, the Arabic language does not support capitalization, which is one obstacle to obtaining good performance in NER (Benajiba and Rosso, 2007). As illustrated in Fig. 1, we present an example of two words where only one of them is a Named Entity (NE), while both of them start with the same character. (2) Classification of NEs: To decide the class of an NE, the classification system looks at both the word and the sentence in which it occurs. However, Arabic is a highly inflectional language, meaning that a word is formed by the following components (Eq. 1):

 
$$\text{word} = \text{prefix(es)} + \text{lemma} + \text{suffix(es)} \qquad (1)$$

where the prefixes can be articles, conjunctions, or prepositions. In the Arabic language, both prefixes and suffixes can be combinations, meaning that a word can have one or more affixes. From a statistical viewpoint, this inflectional feature makes Arabic a morphologically more complex and sparser language compared to other languages. Therefore, when dealing with Arabic NLP tasks, a large amount of training data must be used for the model to work well. Additionally, the lack of uniformity in writing styles is another challenge. Moreover, compared to other languages, the Arabic language has more speech sounds; that is, when names in another language are transliterated to Arabic, many variants of named entities derived from the same name and indicating the same meaning can be obtained. Also, in Arabic, most vowels are represented by diacritics, which influences the phonetic representation and therefore results in words with different meanings. Thus, having the same written form with different meanings indicates that a word may be represented by two different named entities.
Considering the above-mentioned challenges, in recent years researchers have tackled Arabic NER using either handcrafted rules (Shaalan and Raza, 2008) or statistical learning (Benajiba and Rosso, 2007). In the former case, where handcrafted grammatical rules written by linguists are employed, in-depth knowledge of the Arabic language is required.
This approach is time-consuming and performs poorly when the linguists' knowledge and background are weak. In contrast, statistical learning uses a training set of instances to learn the patterns relevant to any NER task. The advantage of such an approach is that no knowledge of the Arabic language is required; only a sufficient corpus is needed to fit Machine Learning (ML) models. The challenge with traditional ML-based methods is that they are not able to deal with the great amount of Arabic data available on the web. Recently, researchers' interest in neural networks has led to the development of several deep learning architectures. More specifically, advanced solutions for NLP tasks in different languages have been obtained by combining supervised methods with deep neural networks (Melis et al., 2017). These neural networks achieved extremely impressive results on different NLP tasks such as NER (Lample et al., 2016), relation extraction (Konstantinova, 2014), machine translation (Chu and Wang, 2018), and so forth. These neural networks are mostly built using word representation information, or embeddings. However, such models struggle when handling Out-of-Vocabulary (OOV) words. In the past years, several NLP models have considered both word and character embeddings to solve the OOV problem.
For the purpose of machine translation, Vaswani et al. (2017) suggested a fully connected self-attention design called a "Transformer" for representing lengthy contexts. The attention mechanism is widely used to improve deep learning models in numerous fields including NLP tasks, computer vision, and so on, even though it was first proposed for machine translation models (Vaswani et al., 2017; Luong et al., 2015). In particular, in the context of NER, the attention mechanism enables a deep learning model to focus on the relevant parts of the sequence during the prediction of the label of a word. However, researchers (Yan et al., 2019; Guo et al., 2019) demonstrated that the vanilla Transformer does not perform well on NER tasks. They found that the sinusoidal position encoding used by Vaswani et al. (2017) is aware of distance but not of directionality, which is important for NER models. To the best of our knowledge, the Transformer network has not yet been used for Arabic NER. Even though the Transformer is used for NER in other languages such as English (Liu et al., 2022), all the models that employed the Transformer used the vanilla one, which is not aware of directionality. In this study, we employ the Transformer for Arabic NER by incorporating directionality awareness into our model. We measure the performance of our architecture using the F1-score metric.
Related Works

This section reviews recent works that have been published in the literature in the context of Arabic NER using deep learning. Nevertheless, before deep learning showed its efficiency in NLP tasks, feature engineering methods were used. In particular, Goyal et al. (2018) classified NER approaches into ones that depend on rules, on learning, or on a mix of both.
Moreover, Etaiwi et al. (2017) categorized Arabic NER approaches into the following classes: Hidden Markov Model (HMM), Conditional Random Field (CRF), Naïve Bayes, neural networks, maximum entropy, and support vector machine.

Fig. 1: Illustration of the lack of capital letters in the Arabic language
An in-depth survey of Arabic NLP approaches was presented by Al-Ayyoub et al. (2018). By comparing Arabic NLP approaches to works available for the English language, the authors concluded that there exists a considerable gap that needs to be filled. Dahan et al. (2015) developed an Arabic NER model based on HMM where stemming was used to solve some ambiguity problems related to the Arabic language. A decision tree approach was proposed by Al-Shoukry and Omar (2015) for Arabic NER. The authors considered named entities such as person, time, and date and achieved an F1-measure of 81.35% using a dataset gathered from online resources. Benajiba et al. (2008) replaced maximum entropy with a conditional random field to improve the performance of their NER model. They obtained 72.77, 86.90, and 79.21% for recall, precision, and F-measure, respectively.
AbdelRahman et al. (2010) used word-level features, POS tagging, BPC, gazetteers, and morphological features to handle Arabic NER. They used CRF to identify named entities such as person, location, device, cell phone, organization, car, date, and time. Mohammed and Omar (2012) used Neural Networks (NNs) to classify four entity types in the Arabic language: Person, organization, location, and miscellaneous. Compared to decision trees, the authors demonstrated that NNs reach a precision of 92% while the decision tree achieves a precision of only 87%.
Bazi and Laachfoubi (2018) conducted an investigation of the impact of word representations on Arabic NER systems. They compared different neural word embedding algorithms on the AQMAR dataset and concluded that combining different approaches improves network performance.
Because handcrafted features and machine learning approaches cannot deal with the large volume of data available on the web, current studies have shifted towards deep learning to address Arabic NLP problems, including NER. In what follows, we review NER methods proposed for both Arabic and other languages that rely on deep learning. When it comes to Arabic NER, several deep learning approaches have recently been proposed. More specifically, Shahina et al. (2019) applied bidirectional recurrent networks to the ANERCorp dataset and the results revealed that bidirectional implementations provide excellent performance compared to one-direction recurrent networks. In particular, the authors experimented with RNN variants such as LSTM and GRU.
A deep co-learning approach based on a semi-supervised learning algorithm was proposed by Helwe and Elbassuoni (2019) for Arabic NER. The authors first designed and implemented a classifier for Wikipedia articles using an LSTM-DNN and thus obtained a semi-labeled corpus for the Arabic NER application. In the testing phase, the authors used three public datasets and concluded that their proposal outperforms state-of-the-art approaches.
Using a BiLSTM design, Ali et al. (2018) suggested an approach to Arabic NER. The authors used both word and character embeddings. When comparing BiLSTM to BiGRU on the ANERCorp dataset, the authors achieved with BiLSTM an F1-score of 88.01% while BiGRU reached an F1-measure of 87.12%. The same researchers suggested a BiLSTM and multi-attention-layer-based NER technique (Ali et al., 2019). The first attention layer is the embedding attention layer, which concatenates the word and character embeddings. The second attention layer is on top of the encoder, which is a BiLSTM. By using two attention layers, the authors significantly improved the performance of their model and reached an F1-measure of 91% on the ANERCorp dataset.
El Bazi and Laachfoubi (2019) came up with a new design for Arabic NER based on BiLSTM and Conditional Random Fields (CRF). On the ANERCorp dataset, they obtained an F1-measure of 90.6%. Kuru et al. (2016) proposed a NER model for the Arabic language by employing character-level information and word embeddings. A deep BiLSTM is used to encode the embedding information. By comparing their approach to methods based on hand-engineered features, they achieved good results. Al-Smadi et al. (2020) suggested an Arabic NER model based on transfer learning. They used a BiGRU to get the encoded output from a pre-trained Universal Sentence Encoder (USE). Next, the researchers applied Global Max Pooling (GMP) and Global Average Pooling (GAP). The output of these operations is merged and passed into a feed-forward neural network, which makes predictions based on this data. For the WikiFANEGold corpus, they obtained an F1-score of 91.20%. Ali and Tan (2019) suggested an encoder-decoder NER model for Arabic, introducing a minor modification to the model proposed by Ali et al. (2019). Using an attention layer, the word embedding and character embedding in the initial layer were joined together. A BiLSTM is used to encode the result of this layer. Following the construction of the encoder, a second attention layer is added on top, the output of which is fed into the decoder, itself a second BiLSTM. The model built by the researchers performed much better on the AQMAR and ANERCorp corpora than the model presented by Ali et al. (2019).
Alsaaran and Alrabiah (2021) fine-tuned the BERT model for Arabic NER. In particular, the pre-trained BERT model was used for embeddings, and this layer's output was used as input features to a Bidirectional Gated Recurrent Unit (BiGRU). Finally, a fully connected layer is used for classification in the last layer. On the ANERCorp corpus, they obtained an F1-measure of 92.28%, and when they added the AQMAR corpus to the ANERCorp corpus, they obtained an F1-measure of 90.68%.
Gridach (2016) utilized a BiLSTM-CRF system that embeds characters and utilized pre-trained word2vec to embed words for extracting named entities from Arabic social media. Khalifa and Shaalan (2019) used Convolutional Neural Networks (CNNs) for character-level embedding and the Skip-Gram Word2vec model for word embedding. To further train the model, a BiLSTM is given the embedded data and a CRF performs the classification in the last layer. To avoid using task-related features in sequence labeling tasks such as NER, Part of Speech (PoS) tagging, and chunking, Collobert et al. (2011) combined a Multi-Layer Perceptron (MLP) and a CNN. The BiLSTM was first combined with a CNN by Huang et al. (2015) to address sequence tagging problems, and after that BiLSTM has been widely used in NER tasks for various languages, for instance by Chiu and Nichols (2016) and by Dong et al. (2016), who used a combination of BiLSTM and CRF to recognize Chinese named entities.

Materials and Methods
In this section, we describe in depth the methodology behind our suggested system. First, we present the transformer encoder and its components. We then explain the absolute position encoding that is inherent to the classic transformer, before turning to our modified solution. Our suggested system is shown in Fig. 2.

Transformer Encoder
Given a sequence of length $l$ and letting $d$ be the dimension of the input, the transformer encoder of Vaswani et al. (2017) takes as input a matrix $H \in \mathbb{R}^{l \times d}$. Then, $H$ is projected into different spaces using the learned query, key, and value matrices $W_q$, $W_k$, and $W_v$, producing $Q, K, V \in \mathbb{R}^{l \times d_k}$. Here, $d_k$ is a hyperparameter. After $H$ has been projected, the scaled dot-product attention is computed using Eqs. 2-4, respectively:

$$Q = H W_q, \quad K = H W_k, \quad V = H W_v \qquad (2)$$

$$A_{t,j} = \frac{Q_t K_j^{\top}}{\sqrt{d_k}} \qquad (3)$$

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}(A)\,V \qquad (4)$$

where $Q_t$, $j$, and $K_j$ are the query vector of the $t$-th token, the token the $t$-th token attends to, and the key vector representation of the $j$-th token, respectively. The softmax is applied along the last dimension. To improve the capability of self-attention, several sets of $W_q$, $W_k$, and $W_v$ can be used instead of one. Multi-head self-attention is formed by using several such groups and is calculated using Eqs. 5-7, respectively:

$$Q^{(h)} = H W_q^{(h)}, \quad K^{(h)} = H W_k^{(h)}, \quad V^{(h)} = H W_v^{(h)} \qquad (5)$$

$$\mathrm{head}^{(h)} = \mathrm{Attn}(Q^{(h)}, K^{(h)}, V^{(h)}) \qquad (6)$$

$$\mathrm{MultiHead}(H) = [\mathrm{head}^{(1)}, \mathrm{head}^{(2)}, \ldots, \mathrm{head}^{(n)}]\,W_o \qquad (7)$$

Here, $n$ represents the number of heads, the superscript $h$ represents the head index, and $[\mathrm{head}^{(1)}, \mathrm{head}^{(2)}, \ldots, \mathrm{head}^{(n)}]$ means concatenation along the last dimension, giving a matrix of size $\mathbb{R}^{l \times d}$. $W_o$ is a learnable parameter of size $\mathbb{R}^{d \times d}$, and $d_k \cdot n = d$.
After that, the result of multi-head self-attention is fed into a position-wise feed-forward network, Eq. 8:

$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1)\,W_2 + b_2 \qquad (8)$$

Here, $W_1 \in \mathbb{R}^{d \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d}$, $b_1$, and $b_2$ are trainable parameters and $d_{ff}$ is a hyperparameter. In addition, the layer normalization and residual connections of Vaswani et al. (2017) are also included in our architecture.
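To make Eqs. 2-8 concrete, the following is a minimal PyTorch sketch of multi-head self-attention followed by the position-wise feed-forward network, with residual connections and layer normalization. The class name, initialization, and dimensions are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of multi-head self-attention (Eqs. 2-7) plus the
# position-wise FFN (Eq. 8), with residual connections and layer norm.
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d % n_heads == 0              # d_k * n = d, as stated in the text
        self.d_k, self.n_heads = d // n_heads, n_heads
        self.w_q, self.w_k, self.w_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.w_o = nn.Linear(d, d)           # W_o of Eq. 7
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, l, d)
        b, l, _ = h.shape
        def split(x):                        # (b, l, d) -> (b, n_heads, l, d_k)
            return x.view(b, l, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(h)), split(self.w_k(h)), split(self.w_v(h))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # Eq. 3
        attn = torch.softmax(scores, dim=-1)                     # Eq. 4
        heads = (attn @ v).transpose(1, 2).reshape(b, l, -1)     # Eq. 6, concatenated
        out = self.norm1(h + self.w_o(heads))                    # residual + layer norm
        return self.norm2(out + self.ffn(out))                   # Eq. 8

x = torch.randn(2, 10, 128)
print(MultiHeadSelfAttention(d=128, n_heads=8, d_ff=256)(x).shape)  # torch.Size([2, 10, 128])
```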

Sinusoidal Position Encoding and Relative Position Encoding
In this section, we describe the difference between the sinusoidal position encoding used in the classic Transformer and the relative position encoding that we employ in our proposed model.

Sinusoidal Position Encoding
In the classic transformer, Vaswani et al. (2017) proposed using position embeddings produced by sinusoids of different frequencies in order to enable self-attention to capture the sequential features of language. In particular, the sinusoidal encoding models the position of each token and measures the distance between two tokens. Given a sentence of length $l$, the position encoding of the $t$-th token is obtained using Eqs. 9-10, respectively:

$$PE_{t,2i} = \sin\!\left(\frac{t}{10000^{2i/d}}\right) \qquad (9)$$

$$PE_{t,2i+1} = \cos\!\left(\frac{t}{10000^{2i/d}}\right) \qquad (10)$$

where $d$ is the dimension of the input sequence and $i$ ranges over $[0, d/2)$. For any fixed distance $k$, $PE_{t+k}$ can be represented as a linear transformation of $PE_t$ (Vaswani et al., 2017). However, as mentioned earlier, this type of encoding is not aware of directionality, which is important in NER models.
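For illustration, a short NumPy sketch of the absolute sinusoidal encoding of Eqs. 9-10 is given below; it is shown only to contrast with the relative encoding adopted later, and the variable names are illustrative assumptions.

```python
# Absolute sinusoidal position encoding (Eqs. 9-10, Vaswani et al., 2017).
import numpy as np

def sinusoidal_pe(length: int, d: int) -> np.ndarray:
    pe = np.zeros((length, d))
    pos = np.arange(length)[:, None]            # token position t
    i = np.arange(0, d, 2)[None, :]             # dimension index 2i
    angle = pos / np.power(10000.0, i / d)
    pe[:, 0::2] = np.sin(angle)                 # even dimensions: sin
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions: cos
    return pe

pe = sinusoidal_pe(length=50, d=128)
# PE(t+k) is a linear transform of PE(t), so distance is captured, but the
# dot products for offsets +k and -k are symmetric: direction is lost.
```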

Relative Position Encoding
Following Shaw et al. (2018), we adopt relative position encoding instead of the sinusoidal encoding used in the classic transformer. In the classic transformer, for a position $t$ and a distance $k$, the interaction between $PE_t$ and $PE_{t+k}$ in the attention score is:

$$PE_t^{\top} W_q^{\top} W_k \, PE_{t+k}$$

where $W_q$ and $W_k$ are the learnable parameters of the query and key matrices; this term cannot distinguish whether the second token lies before or after the first. To incorporate directionality into our model, the attention score is computed using Eqs. 11-14, respectively (Shaw et al., 2018):

$$Q = H W_q, \quad K = H W_k, \quad V = H W_v \qquad (11)$$

$$R_{t-j} = \left[\ldots \; \sin\!\left(\frac{t-j}{10000^{2i/d_k}}\right) \; \cos\!\left(\frac{t-j}{10000^{2i/d_k}}\right) \; \ldots\right]^{\top} \qquad (12)$$

$$A^{rel}_{t,j} = Q_t K_j^{\top} + Q_t R_{t-j}^{\top} + u K_j^{\top} + v R_{t-j}^{\top} \qquad (13)$$

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}(A^{rel})\,V \qquad (14)$$

where $t$ denotes the target token and $j$ denotes the context token. $Q_t R_{t-j}^{\top}$ denotes the $t$-th token's bias towards a specific relative distance, $u K_j^{\top}$ denotes the bias towards the $j$-th token, and $v R_{t-j}^{\top}$ represents the bias term for a specific orientation and distance. Using Eq. 12 and applying the rules $\sin(-x) = -\sin(x)$ and $\cos(-x) = \cos(x)$, the relative position encoding can be expressed as given in Eq. 15:

$$R_{t} = [\sin(c_0 t), \cos(c_0 t), \ldots]^{\top}, \quad R_{-t} = [-\sin(c_0 t), \cos(c_0 t), \ldots]^{\top}, \quad c_i = \frac{1}{10000^{2i/d_k}} \qquad (15)$$

In the above equation, for a distance $t$, the forward and backward relative position encodings are identical in the $\cos(c_i t)$ terms and opposite in the $\sin(c_i t)$ terms. Consequently, by using $R_{t-j}$, we enable the attention values to differentiate between different distances and orientations.
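As an illustration of Eqs. 11-15, the sketch below computes the direction-aware attention scores for a single head. The use of the signed offset $t-j$ and the global bias vectors $u$ and $v$ follows the formulation above; the function names and shapes are illustrative assumptions rather than our exact implementation.

```python
# Direction-aware relative attention scores (single head, Eqs. 11-15).
import torch

def relative_encoding(offsets: torch.Tensor, d_k: int) -> torch.Tensor:
    # R_{t-j}: sinusoid of the *signed* offset, so sin terms flip with
    # direction while cos terms do not (Eq. 15).
    i = torch.arange(0, d_k, 2, dtype=torch.float)
    angle = offsets[:, None].float() / torch.pow(10000.0, i / d_k)
    return torch.cat([torch.sin(angle), torch.cos(angle)], dim=-1)

def rel_attention_scores(q, k, u, v):
    # q, k: (l, d_k); u, v: (d_k,) learnable global biases
    l, d_k = q.shape
    offsets = torch.arange(l)[:, None] - torch.arange(l)[None, :]   # t - j
    r = relative_encoding(offsets.flatten(), d_k).view(l, l, d_k)   # R_{t-j}
    content = q @ k.T                                 # Q_t K_j^T
    rel = torch.einsum('td,tjd->tj', q, r)            # Q_t R_{t-j}^T
    bias_k = k @ u                                    # u K_j^T (same for every t)
    bias_r = torch.einsum('d,tjd->tj', v, r)          # v R_{t-j}^T
    return content + rel + bias_k[None, :] + bias_r   # Eq. 13

q, k = torch.randn(6, 16), torch.randn(6, 16)
u, v = torch.randn(16), torch.randn(16)
print(rel_attention_scores(q, k, u, v).shape)         # torch.Size([6, 6])
```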

Embedding Layer
To tackle the OOV challenge, we use both word and character embeddings. In the literature, word-level embedding is handled in most cases with pre-trained word embeddings such as GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). On the other hand, character-level representations are obtained through CNNs or BiLSTMs (Khalifa and Shaalan, 2019).

Character Level Embedding
Currently, to the best of our knowledge, character-level representations are extracted through either CNNs or BiLSTMs (Gridach, 2016). Initially, the vocabulary characters receive randomly initialized embedding vectors. After that, a BiLSTM or CNN is applied to each word in the vocabulary to process its characters one by one. Finally, the network's output is the word's character representation. In this study, we used a BiLSTM for character embedding.
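The following is a minimal sketch of such a character-level BiLSTM embedder, assuming a small hypothetical character vocabulary; the final forward and backward hidden states are concatenated to form the word's 48-dimensional character representation, matching our experimental settings.

```python
# Character-level embedding with a BiLSTM (sketch).
import torch
import torch.nn as nn

class CharBiLSTM(nn.Module):
    def __init__(self, n_chars: int, char_dim: int = 24, out_dim: int = 48):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, out_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (n_words, max_word_len) -> (n_words, out_dim)
        _, (h_n, _) = self.lstm(self.embed(char_ids))
        return torch.cat([h_n[0], h_n[1]], dim=-1)    # forward || backward final states

words_as_chars = torch.randint(1, 40, (5, 12))        # 5 words, 12 characters each
print(CharBiLSTM(n_chars=40)(words_as_chars).shape)   # torch.Size([5, 48])
```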

Word Level Embedding
Any word's meaning can be described in a 40- to 300-dimensional space by a continuous vector of real values, i.e., its embedding. While unrelated words have dissimilar representations in this space, related words are represented by similar vectors. In this study, we use fastText (Bojanowski et al., 2017), an open-source pre-trained word embedding model maintained by Facebook, to encode the information of each word.
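A hedged usage sketch with the fasttext Python package is shown below, assuming a pre-trained Arabic binary (e.g., the Common Crawl model cc.ar.300.bin) has already been downloaded. Because fastText composes vectors from character n-grams, it can still return a vector for an out-of-vocabulary word.

```python
# Loading pre-trained 300-dimensional Arabic fastText vectors (sketch).
import fasttext

model = fasttext.load_model("cc.ar.300.bin")   # assumed local path to the pre-trained binary
vec = model.get_word_vector("القاهرة")          # subword n-grams handle unseen words too
print(vec.shape)                                # (300,)
```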

Encoding Layer
In this study, we adopt a BiLSTM for encoding, with the LSTM cell defined in Eqs. 16-19, respectively:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (16)$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (17)$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (18)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (19)$$

where $c$, $f$, $i$, and $o$ represent the cell state, forget, input, and output gates, respectively, all $b$ terms are biases, $\sigma$ is the sigmoid activation function, and the hidden state is obtained as $h_t = o_t \odot \tanh(c_t)$. To consider both the past and the future context of the input sequence, we use a BiLSTM, which takes the same input and processes it in both directions (right-to-left and left-to-right) before combining the two outputs. A BiLSTM is composed of a forward hidden layer and a backward hidden layer. More specifically, the forward hidden layer $h_t^{f}$ processes the input sequence from left to right, i.e., $t = 1, 2, 3, \ldots, T$, while the backward hidden layer $h_t^{b}$ processes the input sequence from right to left, i.e., $t = T, \ldots, 3, 2, 1$. Finally, $h_t^{f}$ and $h_t^{b}$ are combined to generate the output $y_t$. The BiLSTM is implemented using Eqs. 20-22, respectively:

$$h_t^{f} = \overrightarrow{\mathrm{LSTM}}(x_t, h_{t-1}^{f}) \qquad (20)$$

$$h_t^{b} = \overleftarrow{\mathrm{LSTM}}(x_t, h_{t+1}^{b}) \qquad (21)$$

$$y_t = [h_t^{f} ; h_t^{b}] \qquad (22)$$
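A minimal sketch of this encoding layer (Eqs. 16-22) is given below: the concatenated word and character embeddings (300 + 48 features, as in our settings) are processed in both directions and the two hidden states are combined at every time step. The class name and hidden size are illustrative assumptions.

```python
# BiLSTM encoding layer (Eqs. 16-22, sketch).
import torch
import torch.nn as nn

class EncodingLayer(nn.Module):
    def __init__(self, in_dim: int = 300 + 48, hidden: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sentence_len, in_dim)
        y, _ = self.bilstm(x)    # y_t = [h_t^f ; h_t^b], Eq. 22
        return y                 # (batch, sentence_len, 2 * hidden)

tokens = torch.randn(4, 20, 348)
print(EncodingLayer()(tokens).shape)   # torch.Size([4, 20, 256])
```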

Multi-Head Attention Layer
In this layer, we implement the transformer encoder with relative positional encoding to consider both the directionality and distance features. Both the residual connections and the layer normalization of the transformer are also used.

Decoding Layer
In the decoding layer, we use a simple fully connected layer, which reduces the number of learnable parameters, instead of using a recurrent unit as is done in other sequence-to-sequence models.

Prediction Layer
When dealing with sequence labeling problems, such as NER or POS tagging, it is essential to take advantage of dependencies among tags instead of making decisions about each tag in isolation. For example, it is more common for a "PER" label to be followed by another "PER" label than by an "ORG" label. Such dependencies can be exploited with the CRF suggested by Lafferty et al. (2001).
Given an input sequence $X = \{x_1, x_2, \ldots, x_n\}$ with corresponding labels or tags $y = \{y_1, y_2, \ldots, y_n\}$, the score is expressed by Eq. 23:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (23)$$

where $P \in \mathbb{R}^{n \times k}$ is the matrix of scores output by the word representation layer, $k$ is the number of possible labels or tags, and $A_{i,j}$ is the score of moving from label $i$ to label $j$. For a given sequence of labels $y$, the likelihood is given by Eq. 24:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}} \qquad (24)$$

Here, $Y_X$ represents all potential label sequences for the given $X$. Equation 25 shows that the goal of the learning procedure is to maximize the log-likelihood of the correct label sequence:

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} \qquad (25)$$

During the decoding phase, the Viterbi algorithm (Viterbi, 1967) can be utilized to identify the label sequence with the highest score, Eq. 26:

$$y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y}) \qquad (26)$$
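A compact sketch of the Viterbi decoding step of Eq. 26 is given below: given per-token emission scores $P$ and a transition matrix $A$, it recovers the highest-scoring tag sequence. Training (maximizing Eq. 25) additionally requires the forward algorithm for the partition function, which is omitted here for brevity; the toy inputs are assumptions.

```python
# CRF Viterbi decoding (Eq. 26, sketch).
import torch

def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor):
    # emissions: (n, k) scores P; transitions: (k, k) scores A[i, j] for i -> j
    n, k = emissions.shape
    score = emissions[0]                               # best score ending in each tag
    backptr = []
    for t in range(1, n):
        # candidate[i, j] = best score ending at tag i, then moving to tag j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        score, idx = candidate.max(dim=0)
        backptr.append(idx)
    best_last = int(score.argmax())
    path = [best_last]
    for idx in reversed(backptr):                      # follow back-pointers
        path.append(int(idx[path[-1]]))
    return list(reversed(path))

P = torch.randn(6, 5)        # 6 tokens, 5 tags (toy values)
A = torch.randn(5, 5)
print(viterbi_decode(P, A))  # e.g. [2, 0, 0, 4, 1, 3]
```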

Experiments
In this section, we detail our experiment settings, the datasets that we used, and the state-of-the-art baselines that we use for comparison, and briefly define our evaluation metric.

Dataset
In this study, we used the ANERCorp (Benajiba et al., 2008) and AQMAR (Mohit et al., 2012) datasets, which were proposed for Arabic NER tasks, for training and testing. On one hand, the ANERCorp dataset is divided into four categories: Pers (39%), Loc (30.4%), Org (20.6%), and Misc (10%). The dataset follows the standard IOB (Inside, Outside, Beginning) format: If a token comes at the start of a named entity, it is labeled with a B- label, while an I- label means that the token is inside the named entity but is not its first token. If a token is neither at the start of a named entity nor inside one, it is labeled as O. The dataset is composed of 316 articles chosen from different newspapers. These articles cover many topics, which makes the dataset generic. There are over 150,000 tokens in the dataset and the three main named entity classes available for labeling are Persons, Locations, and Organizations. On the other hand, Mohit et al. (2012) proposed the AQMAR dataset as an Arabic named entity dataset collected from Wikipedia. AQMAR consists of 74,000 tokens that are hand-labeled for named entities to ensure labeling consistency, and it is openly accessible for research purposes. By merging these two datasets, 224,286 tokens were obtained.
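For illustration, a purely hypothetical IOB-labeled fragment (not taken from ANERCorp or AQMAR) is shown below: the first token of an entity receives a B- tag, continuation tokens an I- tag, and all other tokens O.

```python
# Hypothetical IOB labeling example (tokens and tags are illustrative only).
labeled_sentence = [
    ("جامعة", "B-ORG"), ("القاهرة", "I-ORG"),   # "Cairo University" as one ORG entity
    ("تقع", "O"),                               # "is located" -> outside any entity
    ("في", "O"),                                # "in"
    ("مصر", "B-LOC"),                           # "Egypt" -> a single-token LOC entity
]
```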

Experiment Settings
The following settings are used in our experiments. The maximum sentence length in the merged dataset is 212 tokens. Each word's features (word embedding dimension) were set to 300, while each character was represented by a vector of length 48 (character embedding dimension). For regularization, we added an L2 component to our loss function. We used a 50% dropout on the inputs of the encoder to avoid overfitting. To optimize the network, we used Adam (Kingma and Ba, 2014) with a batch size of 64. Each model was trained for a maximum of 100 epochs. We set the initial learning rate to 1e-3 and progressively reduced it during training with a decay factor of 0.95 until it reached the minimum learning rate of 5e-5. We employed early stopping, with min delta and patience set to 0 and 6, respectively.
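A hedged sketch of this optimization schedule is shown below: Adam with an initial learning rate of 1e-3 decayed by 0.95 per epoch down to a floor of 5e-5, an L2 term via weight decay, and early stopping with patience 6 and min delta 0. The tiny model, the L2 coefficient, and the validation score are placeholders; this is not the authors' exact training script.

```python
# Training schedule sketch: Adam + exponential decay + early stopping.
import torch
import torch.nn as nn

model = nn.Linear(348, 9)                                   # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)  # assumed L2 coefficient
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

def validation_f1() -> float:                               # placeholder metric
    return float(torch.rand(1))

best_f1, patience, wait = 0.0, 6, 0
for epoch in range(100):                                    # at most 100 epochs
    x = torch.randn(64, 348)                                # dummy mini-batch of size 64
    loss = model(x).pow(2).mean()                           # placeholder loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if scheduler.get_last_lr()[0] > 5e-5:                   # stop decaying at the floor
        scheduler.step()
    f1 = validation_f1()
    if f1 > best_f1:                                        # min delta = 0
        best_f1, wait = f1, 0
    else:
        wait += 1
        if wait >= patience:                                # early stopping
            break
```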

Evaluation Metrics
The F1-measure is the primary criterion for evaluating the effectiveness of the different systems. This measure depends on two others: Recall (R) and Precision (P). P, R, and F1 can be calculated with the help of Eqs. 27-29:

$$P = \frac{TP}{TP + FP} \qquad (27)$$

$$R = \frac{TP}{TP + FN} \qquad (28)$$

$$F1 = \frac{2 \times P \times R}{P + R} \qquad (29)$$

The notations TP, FP, and FN stand for true positive, false positive, and false negative, respectively.
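A small worked sketch of Eqs. 27-29 from raw counts follows; the counts are toy numbers chosen only for illustration.

```python
# Precision, recall, and F1 from raw counts (Eqs. 27-29).
def precision_recall_f1(tp: int, fp: int, fn: int):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

print(precision_recall_f1(tp=850, fp=70, fn=95))
# approximately (0.924, 0.899, 0.911)
```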

Results and Discussion
Table 1 illustrates the experimental findings. We obtained an F1-measure of 93.27 by combining the data from ANERCorp (Benajiba et al., 2008) and AQMAR (Mohit et al., 2012) and using relative positional encoding. With the same settings and the same merged datasets, absolute sinusoidal encoding achieved an F1-score of 92.40.
Our experiment results demonstrated that both the distance and directionality information are important for NER as we obtained significant improvement when adding directionality information to the vanilla transformer encoder.
We have chosen three baselines that used the same data and the same performance metrics and compared our proposal with them. More specifically, (Ali and Tan, 2019; Ali et al., 2019; Al-Smadi et al., 2020) were selected as baselines. Table 1 also displays the findings from these baselines under the same experimental conditions. Out of these three baseline models, Ali and Tan (2019) performs best, achieving an F1-score of 92%, while Ali et al. (2019) and Al-Smadi et al. (2020) achieved F1-scores of 91 and 90%, respectively. Our proposed approach therefore outperformed these baselines on Arabic NER.

Conclusion
In this article, we proposed a Transformer-based Arabic NER model. More specifically, a BiLSTM layer was utilized for encoding each input token of the sentence. Next, the multi-head attention layer receives the encoder's output, where both the distance and directionality information are incorporated. The attention layer's results are passed into the decoder layer, which is a simple fully connected layer. Then the prediction layer with Conditional Random Fields (CRF) predicts the tag of each token. We validated our proposed approach on two merged public datasets, namely ANERCorp and AQMAR. The use of relative positional encoding enables our network to be aware of both distance and directionality. Our model used fastText to embed each word and a BiLSTM to embed each character. Our experimental results have demonstrated that the directionality feature is important when using the transformer for NER. We outperformed the vanilla transformer and some recent works that were carried out using the same datasets and the same performance metrics.
In the future, we aim to rely on a fully connected self-attention architecture (i.e., a transformer) without an RNN variant for Arabic NER.

Fig. 2: Architecture of the proposed model

Table 1: Experiment results