Arabic Online Event-Based System for Monitoring and Extracting Infectious Disease-Related Information

With the revolution of the internet, online data play a significant role in identifying disease outbreaks. This has led researchers, governments and organizations to pay close attention to such data and to exploit them in developing event-based systems. This research studies the domain of infectious disease outbreaks in the Arabic language. In this paper, the Arabic Surveillance Infectious Disease Outbreak System (ASIDOS), which extracts infectious disease-related information from unstructured data published by newswires, is developed. The word association methodology was adopted for identifying the extraction features and performing the data analysis. The proposed system is validated through experiments using a corpus collated from different sources. Precision, recall and F-measure are used to evaluate the performance of the proposed information extraction method. The overall results achieved are: precision 94%, recall 74% and F-measure 83%.


Introduction
During the past few years, the spread of pandemic diseases has increased worldwide. For example, the disease caused by the Ebola virus was first reported by the World Health Organization (WHO) in Guinea in 2014 and then spread rapidly to many West African countries, causing hundreds of deaths (Guinea 346, Liberia 181, Nigeria 1, Sierra Leone 37) (Washington, 2015; PAHO/WHO, 2014). It was also transmitted to countries outside the African continent, including Italy, the United Kingdom, Spain and the United States of America. The latest information about Ebola can be accessed via the following link: http://www.who.int/csr/don/archive/disease/ebola/en/.
In addition to the outbreak of Ebola in Africa, severe acute respiratory syndrome coronavirus (SARSCoV) was identified in Asia (2002-2003), an outbreak of the pandemic H1N1 influenza virus occurred worldwide (2009) and the Middle East Respiratory Syndrome (MERS) was found in Saudi Arabia (2012 to date) (Velasco et al., 2014; CDC, 1996). The threat of infectious disease outbreaks to public health has therefore prompted countries and organizations to develop several early warning surveillance systems (Choi et al., 2016). However, traditional surveillance systems, or indicator-based surveillance systems, require public health networks (sentinel networks) to collect predefined structured data about diseases on a routine basis from indicator sources, such as over-the-counter drug sales and emergency department visits (Christaki, 2015; Collier and Doan, 2012). The World Health Organization (WHO) defines this type of system thus: "A passive surveillance system relies on the cooperation of health-care providers - laboratories, hospitals, health facilities and private practitioners - to report the occurrence of a vaccine-preventable disease to a higher administrative level" (WHO, 2014). Therefore, when passive surveillance systems are used, all health facilities are required to submit regular monthly, weekly or daily reports of disease data. Although implementing this type of system has some advantages, such as its ability to cover all parts of a country and its statistical power, it takes a couple of weeks for disease patterns to be detected and for the results regarding possible outbreaks to be disseminated; furthermore, not all countries have the infrastructure required to implement such a system (Velasco et al., 2014; Collier and Doan, 2012; WHO, 2014; Ramalingam, 2016). On the other hand, as a result of the technological revolution of the internet, another type of surveillance system has emerged. This type is known as the event-based surveillance system.
The WHO defines it thus, "Event-based surveillance is the organized and rapid capture of information about events that are a potential risk to public health" (WHO, 2008). Generally, event-based surveillance systems can be described as real-time monitoring of diseases 24/7 through gathering information from informal sources, such as online news. According to Blench et al. (2009), the WHO's investigations into the majority of disease outbreaks are obtained through diverse informal online sources.
The remainder of the paper is organized as follows. In Section 2, a background to the topic and a review of related work are presented. Section 3 describes the corpus built for this study, which contains reports on various infectious disease events collected from different online sources. Data analysis is provided in Section 4. Section 5 gives an overview of the Arabic Surveillance Infectious Disease Outbreak System (ASIDOS) architecture, together with an explanation of the methodology used for event information extraction. Section 6 presents the experiments and the performance evaluation. Finally, the conclusion of the work is presented in Section 7.

Related Work
There are two types of surveillance systems: indicator-based surveillance systems (syndromic surveillance) and event-based surveillance systems (Agheneza, 2011). This section examines event-based surveillance systems, which use informal data related to disease outbreaks collected from newswires to detect and extract infectious disease outbreak-related information, such as disease type, location name, date and number of victims, in order to issue early warnings. Greater concentration is placed on event-based surveillance systems that support Arabic.

Indicator-based Surveillance Systems
This type of system relies on structured data collected from various official sources, such as emergency departments, telephone calls and over-the-counter drug sales, to detect and track increases in disease incidence rates (Christaki, 2015). Many detection approaches are utilized for achieving this task. These approaches are classified into three types: temporal, spatial and spatiotemporal surveillance techniques (Tsui et al., 2010). Further information on these types of systems can be found in (Tsui et al., 2010; 2008), and recent reviews of these systems can be found in (Azzedin et al., 2014; Abat et al., 2016). Moreover, the national communicable disease surveillance systems developed between 2000 and 2016 in developed countries were reviewed in (Bagherian et al., 2017).
• HealthMap consists of five parts: data gathering from different sources (newswires, Really Simple Syndication (RSS) feeds, ProMED-mail and WHO), classification, database, web backend and web frontend (Freifeld et al., 2008). HealthMap is similar to GPHIN in that it is a multilingual web-based system, but unlike GPHIN it is free and publicly available. Information produced by HealthMap is presented in seven languages: Arabic, French, Chinese, Spanish, Portuguese, English and Russian. HealthMap uses translation to process non-English news reports.
• The Medical Information System (MedISys) and Pattern-based Understanding and Learning System (PULS) is also a multilingual early- [...] (Woodall, 1997).
• ProMED-mail is a multilingual system with free public access, and no subscription fees are required. In addition, it relies on public health reports obtained from its subscribers and ordinary users through the Submit Info form (http://www.promedmail.org/submitinfo). The information presented by the system is first examined by ProMED-mail experts prior to publication (Woodall, 1997; ProMED, 2010). Therefore, many surveillance systems, such as Argus, BioCaster and HealthMap, depend on health warning reports produced by ProMED-mail.
As can be seen, most systems developed in the USA and Europe process data written in their own native language and are then enhanced to serve other languages by utilizing a translation engine, which helps monitor the spread of pandemic diseases worldwide.
On the other hand, few systems were found in the literature that are able to process Arabic health data directly, i.e., without using translation engines. For example, in (Samy et al., 2012) two approaches were proposed for recognizing and extracting medical terms from an Arabic medical dataset. The first approach is based on a gazetteer that contains 3473 Arabic medical terms translated from English medical terminology resources (SNOMED and UMLS). The second approach is based on 410 Arabic terms that are equivalents of Latin prefixes and suffixes commonly used in the medical and health domain.
Furthermore, a named entity recognition system, NAMERAMA, has been proposed to identify cancer-related information, such as disease names, symptoms, treatment and diagnosis methods, from Arabic texts in the medical domain (Alanazi, 2017). The system relies on the Bayesian Belief Networks (BBN) algorithm. However, neither system is an event-based surveillance system, and neither has been developed for processing data related to infectious disease outbreaks.
Finally, a special type of system has been developed for tracking specific epidemic diseases based on online social networks, such as Twitter, Facebook and Instagram. For example, the flutrack system (http://flutrack.org) was developed for tracking the spread of influenza epidemics based on data published by Twitter users. In addition, Tracking Flu Infections on Twitter (Lamb et al., 2013), the HealthTweets.org website, a platform using Twitter for public health surveillance (Dredze et al., 2014), and the ARGO system for monitoring dengue fever epidemics using internet-based sources (Yang et al., 2017) all belong to this type of system. Further information about systems that rely on online social networks can be found in (Al-garadi et al., 2016).

Datasets
To the author's knowledge, there is currently no available dataset for infectious disease outbreaks. As is well known, the availability of a suitable corpus is the basis of text mining research. In order to identify and understand the language used in the health domain to describe infectious disease outbreak events, a specialized dataset must be built. Therefore, five corpora containing only Arabic data on epidemic diseases were manually compiled from different sources to conduct this research; creating such a linguistic resource (i.e., a corpus) is one of the contributions of this study. These corpora contain news reports written in Arabic that describe various types of infectious disease events. The five corpora contain a total of 317 files comprising 76,230 tokens. Each corpus represents different events related to a specific disease from the following list:
In addition, n-grams are used to create five independent sub-datasets by extracting sequences of ten words on both sides of the target word, i.e., the epidemic disease names listed in Table 2, from the five original datasets. It was found that these disease names can be written in different forms, as shown in Table 3. It was also found that the H1N1 disease name is sometimes written in Arabic script as it is pronounced in English. However, the sparseness problem in the disease names, caused by typographic and spelling variants, can be reduced by performing normalization and stemming.
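The window extraction described above can be illustrated with a minimal Python sketch. The function name `concordance_windows` and the toy English tokens are assumptions made for illustration; the paper does not publish its extraction code, and its data are Arabic news reports.

```python
def concordance_windows(tokens, targets, span=10):
    """Return one token window per occurrence of a target word.

    Each window holds up to `span` tokens on each side of the target,
    plus the target itself (fewer at the edges of the text).
    """
    windows = []
    for i, tok in enumerate(tokens):
        if tok in targets:
            start = max(0, i - span)
            windows.append(tokens[start:i + span + 1])
    return windows


# Toy English stand-in for a tokenized news report (hypothetical text).
text = "health officials confirmed three new cases of ebola virus in the capital on monday"
tokens = text.split()
windows = concordance_windows(tokens, {"ebola"}, span=3)
print(windows)
```

With `span=10` the same function would produce the ten-word context on each side that the sub-datasets are built from.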

Word Frequency
Reducing the original corpora helps identify the greatest number of word associations within a specific extracted window around the target token. Moreover, it helps in extracting the most common words within and across these corpora. Fig. 1 shows the distribution of the top 30 words in the Ebola dataset. Informative words appear in this analysis, some of which indicate types of epidemic diseases, such as virus, fever, flu, epidemic and disease. Table 4 shows their inflected forms.
In addition, some words indicate the number of victims, as in Table 5.
Further analysis will be conducted on these words in the next sections.
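A word-frequency ranking of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace tokenization, an optional stopword set, and toy English documents standing in for the Arabic reports; the helper name `top_words` is hypothetical.

```python
from collections import Counter


def top_words(documents, n=30, stopwords=frozenset()):
    """Count whitespace-separated tokens over all documents and
    return the n most frequent (word, count) pairs."""
    counts = Counter()
    for doc in documents:
        counts.update(tok for tok in doc.split() if tok not in stopwords)
    return counts.most_common(n)


# Toy documents (hypothetical); the paper reports the top 30 words per corpus.
docs = ["ebola virus outbreak", "virus cases rise", "ebola virus deaths"]
top = top_words(docs, n=2)
print(top)
```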

Common Words Analysis
Another analysis was performed to identify the 25 common words among the five corpora. For example, Fig. 2, 3, 4 and 5 show the comparison between the Ebola corpus and the corpora of the other diseases listed in Table 2. As can be seen, many words, such as "fever", "flu" and "virus", are common to the five corpora. The derived common words have a strong association with the infectious disease names, as will be seen below.
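One possible reading of this comparison is to take the most frequent words of each corpus and keep those shared by all of them. The sketch below follows that reading; it is an assumption, since the paper does not state exactly how the common words were selected, and the corpus contents are toy English tokens.

```python
from collections import Counter


def common_top_words(corpora, k=25):
    """corpora maps a corpus name to its list of tokens.

    Returns the words that appear among the k most frequent
    words of every corpus."""
    top_sets = [
        {word for word, _ in Counter(tokens).most_common(k)}
        for tokens in corpora.values()
    ]
    return set.intersection(*top_sets)


# Toy corpora (hypothetical) with k=2 to keep the example small.
corpora = {
    "ebola": "virus virus virus fever fever ebola".split(),
    "dengue": "fever fever fever virus virus dengue".split(),
}
common = common_top_words(corpora, k=2)
print(common)
```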

Words Cluster Analysis
Examining how words are organized in sentences is an important step in identifying how sentence structure might affect the identification of epidemic disease-related information. Therefore, the focus is placed on identifying the relevant words (keywords) regarding disease names, event locations and number of victims.
The word association methodology was used to derive keywords from the text data (the disease name concordance corpus), i.e. to find or extract relations between units (n-gram span 10 and collocation span 5). In other words, the content of the five corpora was analysed in search of relations between the words. Clustering analysis was used to model hierarchical relations between words that frequently occur together. Word cluster analysis (a text-based dendrogram) was performed with R packages and is based on the dissimilarities (distances) between the rows of the Term-Document Matrix (TDM). In order to avoid a cluttered dendrogram and to facilitate interpretation, the sparsity of the TDM was adjusted to 0.999, which limits the number of words in the TDM. The clustering analysis covers the five corpora that represent the five diseases: Ebola, MERS, Dengue, H1N1 and SARSCoV. The hcluster package provided in the R language was used for implementing the hierarchical clustering algorithm. For performing word clustering, the following steps are applied:
• Loading the data.
• Creating a Corpus for storing text documents.
• Applying the tm_map function to the corpus for data preprocessing, such as removing punctuation marks, extra white spaces and stopwords, and replacing "tab" with a space.
• Creating the term-document matrix with the TermDocumentMatrix function.
• Applying the removeSparseTerms function.
• Converting the data into a distance matrix by using the dist function to compute the Euclidean distance between the rows of the matrix (the terms).
• Applying the hclust function to perform cluster analysis on the dissimilarities of the distance matrix.
As can be seen, valuable clusters appear in the Figures.
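The steps above were carried out in R; the following is a rough Python analogue using only the standard library, offered as a sketch rather than a reproduction of the original analysis. Several points are assumptions: complete linkage is used because it is the default method of R's hclust, the sparsity filter keeps terms whose share of empty documents is below the threshold, the punctuation list is ASCII only (Arabic punctuation would need to be added), and the documents are toy English strings.

```python
import math
import string
from collections import Counter
from itertools import combinations


def term_document_matrix(docs, stopwords=frozenset()):
    """Map each term to its list of counts, one count per document."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = [Counter(t for t in d.translate(table).split() if t not in stopwords)
              for d in docs]
    terms = sorted(set().union(*counts))
    return {t: [c[t] for c in counts] for t in terms}


def remove_sparse_terms(tdm, sparse):
    """Keep terms whose fraction of zero entries is below `sparse`."""
    if not tdm:
        return {}
    n_docs = len(next(iter(tdm.values())))
    return {t: row for t, row in tdm.items() if row.count(0) / n_docs < sparse}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(tdm, c1, c2):
    # Complete linkage: distance between the two farthest members.
    return max(euclidean(tdm[a], tdm[b]) for a in c1 for b in c2)


def hierarchical_clustering(tdm):
    """Agglomerative clustering of terms; returns the merge history
    as (cluster_a, cluster_b, distance) tuples."""
    clusters = [(term,) for term in tdm]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(tdm, *pair))
        merges.append((c1, c2, cluster_distance(tdm, c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy documents (hypothetical).
docs = ["ebola virus outbreak", "ebola virus cases",
        "dengue fever cases", "dengue fever outbreak"]
tdm = remove_sparse_terms(term_document_matrix(docs), sparse=0.999)
merges = hierarchical_clustering(tdm)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

In this toy input, "dengue" merges first with "fever" and "ebola" with "virus", which mirrors the kind of disease-keyword pairing the dendrograms are used to reveal.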
The Figures show the hierarchical relations between words in each disease corpus. In Fig. 6, 7 and 8, the disease names "Ebola", "MERS" and "SARSCoV" always appear related to the word "virus", whereas in Fig. 9 and 10 the two disease names "Dengue" and "H1N1" always appear related to the words "fever" and "flu", respectively. Moreover, these disease names co-occur with the words "disease" and "epidemic". Therefore, these words can be utilized to identify disease names in the news reports.
On the other hand, in the preprocessing phase, all Arabic numbers that indicate the number of victims in the dataset were changed to the word "number". This word appears in the clustering analysis related to the words "cases", "deaths" and "infected", as in "3 cases", "3 deaths" and "3 infected". As is well known, Arabic nouns are inflected for gender (masculine and feminine) and number (singular, dual and plural). As a result, the number of victims is sometimes written in the form of a dual noun without an explicit number, as in the Arabic equivalents of "2 cases", "2 deaths" and "2 infected". Similarly, this may happen when the singular form is used, as in "one case", "one infected" and "one death".

With regard to the location and date of events, it is not necessary to perform an analysis of word relations. The location of a disease outbreak can be found through a pattern-matching method using predefined lists that contain names of countries and cities. Regarding the date, the report date is adopted as the date of the infectious disease event, and the <pubdate></pubdate> tag is used for recognizing it, as presented later.

Arabic Surveillance Infectious Disease Outbreak System (ASIDOS)
In this section, the extraction methods for the four entities: disease name, location, number of victims and date are explained. The architecture of the Arabic Surveillance Infectious Disease Outbreak System (ASIDOS) is depicted in Fig. 11. The ASIDOS system is now online (Alruily, 2018).
ASIDOS is a real-time system for monitoring diseases 24/7 via information gathering from informal sources, such as online news, for detecting and tracking disease outbreaks. ASIDOS comprises several stages, as follows:

• Data collection
The process used is based on receiving Really Simple Syndication (RSS) feeds from predefined sources that normally publish reports on public health that include disease outbreaks.
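As an illustration of the data collection stage, the sketch below parses a feed document with Python's standard `xml.etree.ElementTree` module. The tag names `title`, `body` and `pubdate` follow the tags mentioned in this paper; the `item` wrapper, the sample feed and the helper name `parse_items` are assumptions, and fetching the feed over the network is left out.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; real feeds may use different element names.
SAMPLE_FEED = """<rss><channel>
  <item>
    <title>Sample headline about a disease outbreak</title>
    <body>Sample report text.</body>
    <pubdate>2018-01-01</pubdate>
  </item>
</channel></rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, with empty strings for missing tags."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "body": (item.findtext("body") or "").strip(),
            "pubdate": (item.findtext("pubdate") or "").strip(),
        })
    return items


items = parse_items(SAMPLE_FEED)
print(items[0]["title"], items[0]["pubdate"])
```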

• Information extraction engine
In this stage, XML files are processed in order to extract information about outbreak incidents (disease name, location, number of victims, date). It was noticed that the disease-related information often occurs within the <title></title> tag; if an entity or entities are missing, the system moves to the <body></body> tag to find the remaining entities. The developed system is based on rules using regular expressions that were inferred from the analysis in Section 4. In addition, a gazetteer containing keywords that indicate the position of patterns of interest within a body of text is used:

Extracting the infectious disease
For extracting an infectious disease from incident narrative reports, a regular expression built around a single keyword from the keyword list in Table 6 is used. The word that follows the keyword is taken to be the name of an infectious disease.
The following is the regular expression for extracting infectious diseases within a body of text:

(Regex) Keyword[\s]?[\w]+

Extracting location of event
Here, a straightforward pattern-matching method is used for identifying the place of a disease outbreak. A location gazetteer was created for this task.
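The keyword pattern and the gazetteer lookup described above can be sketched in Python as follows. The keyword list and gazetteer entries are illustrative assumptions (the paper's own lists are in its tables and are not reproduced here), as are the function names and the sample sentence. In Python 3, `\w` matches Arabic letters by default.

```python
import re

# Illustrative keywords only: "virus", "disease", "epidemic".
DISEASE_KEYWORDS = ["فيروس", "مرض", "وباء"]
# Illustrative gazetteer entries only: "Guinea", "Saudi Arabia".
LOCATIONS = ["غينيا", "السعودية"]

# Keyword, optional whitespace, then the following word (captured).
DISEASE_RE = re.compile(
    r"(?:%s)\s?(\w+)" % "|".join(map(re.escape, DISEASE_KEYWORDS))
)


def extract_disease(text):
    """Return the word following the first keyword, or None."""
    match = DISEASE_RE.search(text)
    return match.group(1) if match else None


def extract_location(text, gazetteer=LOCATIONS):
    """Return the first gazetteer entry found in the text, or None."""
    for name in gazetteer:
        if name in text:
            return name
    return None


# Hypothetical headline: "New infections recorded due to Ebola virus in Guinea".
sample = "تسجيل إصابات جديدة بسبب فيروس إيبولا في غينيا"
print(extract_disease(sample), extract_location(sample))
```

Returning only the first gazetteer hit matches the stated design of identifying a single location per report.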

Extracting number of victims
For extracting the number of people affected in the incidence of a disease outbreak, the following regular expressions are utilized:

(Regex) \d[\d]+[\s]?Keyword
In the case of these regular expressions failing to extract the number of victims, the list of keywords in Table 7 is used for performing pattern matching.
These keywords indicate that the number of victims is either one or two.
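The two-step procedure for victim counts (a digit pattern first, then a keyword list for singular and dual forms) can be sketched as below. All Arabic keywords here are illustrative assumptions rather than the paper's Table 7, and the sketch uses `\d+` (one or more digits) so that single-digit counts are also matched, which is a simplification of the pattern printed above.

```python
import re

# Illustrative keywords only: "cases", "deaths", "infections".
VICTIM_KEYWORDS = ["حالات", "وفيات", "إصابات"]
# Illustrative dual/singular forms that imply a count without a digit:
# "two cases", "a case", "a death".
IMPLICIT_COUNTS = {"حالتين": 2, "حالة": 1, "وفاة": 1}

NUMBER_RE = re.compile(
    r"(\d+)\s?(?:%s)" % "|".join(map(re.escape, VICTIM_KEYWORDS))
)


def extract_victims(text):
    """Return the victim count, or None when nothing matches."""
    match = NUMBER_RE.search(text)
    if match:
        return int(match.group(1))
    # Fallback: look for a noun form that implies one or two victims.
    tokens = text.split()
    for word, count in IMPLICIT_COUNTS.items():
        if word in tokens:
            return count
    return None


# Hypothetical phrases: "3 new cases recorded", "two new cases recorded".
with_digit = extract_victims("تسجيل 3 حالات جديدة")
with_dual = extract_victims("تسجيل حالتين جديدتين")
print(with_digit, with_dual)
```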

Extracting date of event
Regarding the date, the report date embedded in the XML file is adopted as the date of the infectious disease outbreak, and the <pubdate></pubdate> tag is used for recognizing it.

Interface
After outbreak-related information has been detected and extracted from the online textual news reports, it is mapped to a logical representation so that it can be visualized. A relational database is therefore created so that the extracted facts can be stored and then visualized in different forms, e.g. temporal and spatial visualizations. A web server makes the database available online, and it can be accessed through the website (www.wabaa.org). A web-based interface is provided to make the database [...] Additionally, the database can be searched with several options: users are able to search by disease name, location, source and period of time (last month, last three months and last six months), as shown in Fig. 15. Advanced searching, which combines the previous search options, is also available. Furthermore, the system is able to generate statistical information for users about a specific disease within a specific year, as can be seen in Fig. 16.

Experiments and Evaluation
The experiments were performed on a new, previously unseen corpus. This dataset contains 266 articles collected from different sources. Moreover, reports of new types of infectious disease outbreaks, such as typhoid, cholera, malaria and tuberculosis, were added to this corpus in order to test the performance of the ASIDOS system more thoroughly. To evaluate the performance of this system, the Precision (P), Recall (R) and F-measure metrics are used.
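For reference, the three metrics can be computed from counts of true positives, false positives and false negatives, as in the short sketch below. The example call uses the disease-name figures reported in the experiments (266 names extracted out of 287 with 100% precision), read here as 266 true positives, 0 false positives and 21 false negatives; that reading is an interpretation, not something the paper states in these terms.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and balanced F-measure."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


p, r, f = precision_recall_f1(tp=266, fp=0, fn=21)
print(f"{p:.0%} {r:.0%} {f:.0%}")  # prints "100% 93% 96%"
```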

Experiments
The following experiments test the performance of the ASIDOS system for extracting outbreaks of epidemic information, i.e., disease, location, number of victims and date.

• Disease
In this experiment, the derived regular expressions and the list of keywords for pattern matching were tested for extracting infectious diseases.
The ASIDOS system was able to extract 266 disease names out of 287. The precision is 100% and the recall is 93%, yielding an overall F-measure of 96%. Some disease names were not identified because a number of news reports did not contain any of the keywords used in the regular expressions and, as a result, the extraction process was not triggered.

• Location
The system was able to correctly identify 205 locations out of 328. The precision, recall and F-measure results are 98%, 63% and 76%, respectively. The recall value is low for various reasons. Arabic is a highly inflected language with a very complex morphology. For example, the word "Beijing", the capital of China, can be written in two different forms in Arabic-language news. Therefore, all the different forms of the names of countries and cities must be added to the location gazetteers in order to improve recall. Moreover, in reports of events, city names in Arabic are preceded by either the preposition "في" or "ب", both of which mean "in". The preposition "ب" is fused to the beginning of the word as a prefix. Hence, such locations cannot be extracted using straightforward pattern matching. In addition, the name of a location is sometimes written in the form of an adjective rather than a noun, as with "Syrian coast", and therefore cannot be recognized. In some cases, two or more locations are written in a report as places of disease outbreaks, but the system is designed to identify only one location.

• Victims
In this experiment, the system achieved 84% precision, 69% recall and 75% F-measure. Although the system failed to extract 104 entities and 44 were wrongly identified, the results seem relatively satisfactory. A number of entities were not extracted because, in certain cases, the Arabic numbers occur after the keywords rather than before them. Moreover, the regular expressions created for the extraction process are sometimes not triggered because the keywords are not found in the text. In consequence, the number of victims cannot be extracted.

• Date
The publishing date of a news report is considered the date of the outbreak of an infectious disease. For extracting the date, the XML <pubdate></pubdate> tag indicating the date is used. The system successfully extracted it with no errors.
To the author's knowledge, no event-based epidemic disease surveillance system has been developed for directly processing Arabic texts, i.e. without using a translation engine. For example, the Global Public Health Intelligence Network (GPHIN) is a multilingual system supporting eight languages, including Arabic, but it uses machine translation to translate non-English reports into English in order to process them. It also presents information in English and French. The HealthMap system, unlike GPHIN, is freely and publicly available. HealthMap also uses machine translation to support other languages, including Arabic, and presents information in Arabic. HealthMap's Arabic page (http://www.healthmap.org/ar/) was visited many times, the last visit being on 21 September 2017; it did not work on any visit. The system was evaluated using data written in English; its overall accuracy is 84%, and the ASIDOS system, with an overall precision of 94%, outperforms it. Moreover, MedISys and PULS are able to process 60 languages. Although MedISys is able to retrieve data from many resources in different languages, the PULS system is responsible for information extraction tasks and is only able to process reports in English, i.e., it makes use of machine translation to perform text extraction for other languages. The system achieved 72% accuracy. The overall performance results for the ASIDOS system are listed in Table 8.

Conclusion
The main aim of this work was to develop an event-based surveillance system for extracting the infectious disease, the location of the disease outbreak, the number of affected victims and the date of the outbreak from Arabic health news reports. An overview of the architecture of the proposed system was presented. Also, the performance of the Arabic Surveillance Infectious Disease Outbreak System (ASIDOS) was evaluated through experiments. The overall results achieved were: precision 94%, recall 74% and F-measure 83%. The system interface and its features were explained, and the system can be accessed via the link www.wabaa.org. The extracted information is visualized in various ways, and a user is able to search historical data.
This research makes several contributions. The main contribution is the development of the ASIDOS system, which is the first system developed and available online in the Middle East and North Africa (MENA) region to track outbreaks of infectious disease. In addition, the first analysis of infectious disease data written in Arabic was performed using R in this research. Moreover, building an infectious disease corpus is one of the contributions of this study, as there is currently no available dataset containing reports about infectious diseases.
Finally, the ASIDOS system could be improved in the future through extending its work to cover other events, such as biological events and diseases that affect animals and plants. Also, disseminating warning messages to subscribers could be added to the system to alert them of any disease outbreaks.