DAD: A Detailed Arabic Dataset for Online Text Recognition and Writer Identification, a New Type

Email: sssaloum@ju.edu.sa, said.saloum@gmail.com

Abstract: This paper presents a novel Arabic dataset that considers the characteristics of the Arabic language and fills some gaps not covered by existing datasets. Conventional datasets treat Arabic in a similar way to Latin languages. These datasets either delete diacritics and supplement marks, considering them as defects, or keep them without considering their actual meaning. More than half of all Arabic characters have diacritics above or below them. In this context, this work presents the novel Detailed Arabic Dataset (DAD) for bridging these gaps. The additional marks included in this dataset are the single dot, two dots "-", three dots "^", Hamza and two supplement marks: the bar for Tah or Zah and the complement bar for Kaf. A special application (called OFMArabicDatasetBuilder) was built to generate a dataset for Arabic online recognition and writer identification. In total, the ground truth contains 93064 entries based on sub-words and letter parts (not on words or lines as in other datasets). This dataset will provide researchers with a strong tool for online Arabic text recognition, especially in the segmentation phase, and for writer identification. This paper also presents benchmarking results of using k-nearest neighbours machine learning with DAD.


Introduction
In the literature, many papers focus on Arabic text recognition. Most of these papers present offline recognition systems. The reason for this might be that databases for offline systems are easy to create and some benchmark databases for offline systems have become available over the last two decades (Al-Hashim and Mahmoud, 2010;Mezghani et al., 2012;Alamri et al., 2008;Mahmoud et al., 2012;Kharma et al., 1999).
Lately, several papers regarding online Arabic recognition have been published. However, many of them use their own databases (El Abed et al., 2009). To the best of my knowledge, only nine online text datasets and three online digits datasets have been published, which are the ones from the Online Arabic Handwriting Recognition in 2009 (El Abed et al., 2009), the On/Off (LMCA) Dual Arabic Handwriting Database (Kherallah et al., 2008), the online database of Quranic handwritten words (Abuzaraida et al., 2014), the one from the online Arabic handwriting recognition competition in 2011, the MAYASTROUN Multilanguage handwriting database (Njah et al., 2012), the OHASD online Arabic sentence handwritten on tablet PC database (Elanwar et al., 2010), the AltecOnDB large-vocabulary Arabic online handwriting recognition database (Abdelaziz and Abdou, 2014) and the one from the online Arabic handwriting digits recognition (Azeem et al., 2012). However, they do not properly handle all the features of Arabic Language. They simply provide databases similar to English language databases. They either delete the supplement marks or do not add pertinent information regarding these marks to the ground truth. Figure 1 illustrates the importance of these supplementary marks. In this figure, eight Arabic words are shown, all of which consist of the same sub-word (having three letters). However, the diacritics of each word completely change the meaning of the word. From this, one can understand the significance of including diacritics.
Fig. 1: Eight Arabic words consisting of the same sub-word with different diacritics. The meaning of each word is different (two dots appear as "-"; three dots appear as "^")

In this study, additional marks refer to both diacritics and supplement marks. When using the previous datasets, it is difficult to discern the cause of a recognition error: the error may occur in the body of the word (sub-word) or in one or more of the additional marks around it.
More than half of all Arabic characters have additional marks above or below the letters. However, no Arabic dataset contains information regarding these marks in the ground truth files. Instead, the marks are simply deleted as part of the preprocessing. Some of the online datasets provide the coordinates of their pixels without referring to them in the ground truth. For example, any word in Fig. 1 has 1 entry in the ground truth of such datasets, but in this dataset it will have 4 entries. If the writer had written two dots instead of "-", three dots instead of "^", or a mix of them, the ground truth would contain more entries. This information (style of writing) is very important in writer identification and very helpful in the segmentation phase.
This study is aimed at addressing this issue by presenting a dataset prepared using a tool designed specifically for the Arabic Language (OFM-ArabicDatasetBuilder). The ground truth files of this dataset contain information regarding sub-words, dots, Hamza, bar for Tah, Zah and complement for the letter Kaf. Some diacritics which are rarely used in handwritten texts are not considered in this version.
There are no criteria for a "good" dataset with regard to offline recognition; however, I do not believe this holds true for an online dataset. An online dataset, to be acceptable, must provide the researcher with the ability to rewrite any word in the dataset in the exact same manner that the original was written in (El Abed et al., 2009). To accomplish this, the database must contain all necessary information, including coordinates of all pixels, the time when the digital-pen/finger passed the pixel, the colour, the azimuth of the pen, the altitude of the pen, pressure and so on. This version of the OFM-ArabicDatasetBuilder is designed to manipulate the most important information required to rewrite a word as the original author had written the word, the coordinate and the time of every pixel. Other data (mainly pressure, azimuth and altitude of the pen) will be considered in the subsequent version.
The words in DAD were very carefully selected, such that they contain all Arabic letter shapes (initial, middle, last and isolated).
The paper is organized as follows. The next section introduces the most relevant related works. Section 3 indicates the main features of the Arabic language as the basis of this work. Section 4 presents the novel DAD. Section 5 describes the experimentation with this dataset and section 6 shows the results discussing the most relevant aspects. Finally, section 7 mentions the conclusions and future work.

Literature Review
Since the beginning of scientific research regarding optical and writer recognition, many researchers have used datasets that they created on their own. These datasets mostly included templates of tens of writers. Only a few datasets had templates exceeding 50 writers. Lately, benchmark databases have appeared. These databases include templates of hundreds of writers. Several even have 1000 or more writers. For example, the "KHATT" database (Mahmoud et al., 2012) incorporated 1788 pages with a total of 165890 words. However, the oldest and most widespread database is IFN/ENIT, which includes texts written by 411 people with 26,459 words (Pechwitz et al., 2002;El Abed and Margner, 2007). The databases above were built for offline optical recognition; more information about offline datasets can be found in (Parvez and Mahmoud, 2013).
In the first work (QHW (Abuzaraida et al., 2014)), a special tool was used to record the coordinates of the dots over which the digital pen travels. Here, a platform was designed to collect handwritten information. A total of 120 words were written by 200 writers. Overall, 12000 samples with over 42000 characters and 23300 sub-words were included. However, no information regarding the time was considered.

The third study (ADAB (El Abed et al., 2011;Kherallah et al., 2011)) is the most common among researchers (Elleuch et al., 2015;Potrus et al., 2014;Eraqi and Azeem, 2011;Chernodub and Nowicki, 2016;Hamdi et al., 2016;Ahmed and Azeem, 2011;Maalej et al., 2016;Abdelazeem and Eraqi, 2011). It consists of 19,575 Arabic words written by 166 different writers. This database is the only benchmark that has been widely recognized among researchers so far. However, it did not consider any information regarding diacritics or additional marks. The recent online version of the KHATT dataset consists of 10,040 lines of Arabic text written by 623 writers. Part of the collected data is segmented into characters (separated dataset). It includes information about time and pressure, but it lacks detailed information about sub-words, diacritics or additional marks (Mahmoud et al., 2018).
The last two datasets, AHWDB1 and AHWDB2, appeared in 2019. Each of them contains a single input, "Mohammad" and "Mohammad Abdallah" respectively, and each input was written 10 times by 200 writers. The goal of these two datasets is to identify writers by their Arabic handwriting from one or two words only. The second group of the DAD dataset will cover the shortage of this type of dataset. Detailed information about online text datasets can be found in (Al-Helali and Mahmoud, 2017; Al-Salman and Alyahya, 2017; Tagougui et al., 2013).
All datasets of the aforementioned studies build ground truth tables without considering Arabic language characteristics. Arabic words usually have multiple parts, referred to as sub-words in this study (see next section), and additional marks. In many cases, an Arabic word consists of more than one sub-word. Each Arabic word is divided into two or more sub-words if one of the non-connectable letters appears in the word.

There are datasets for online handwritten digits too. Many researchers have studied the recognition of Arabic (or Hindi) digits: offline (de Sousa, 2018; Jaha, 2019; Abdleazeem and El-Sherif, 2008;El-Sawy et al., 2016;Abdelazeem, 2009;Almodfer et al., 2017;AlKhateeb and Alseid, 2014;Mahmoud, 2008) and online (Ahmad and Maen, 2008;Azeem et al., 2012). Researchers have used their own datasets or some benchmark datasets.

Dataset                           Year   Digits   Writers
LMCA (Kherallah et al., 2008)     2008   30,000   55
AOD (Azeem et al., 2012)          2012   30,000   100
MAYASTROUN (Njah et al., 2012)    2012   6,500    355
The current work resolves the Arabic characteristics problems appearing in previous studies, by designing a tool specialized for the Arabic alphabet.

Summarized Online Digit Datasets
The summarized online digit datasets contain isolated digits. In the new dataset DAD, the writers were asked to write the ten digits on one screen. In real life, native speakers usually write digits sequentially, as in an ID number or a bank account number. Table 3 indicates the features of the most relevant summarized datasets. Writing the digits sequentially allows researchers to study the delay between every two digits. Note the relevance to forensic sciences of the connection from the last pixel of one digit to the first pixel of the next digit.

Characteristics of the Arabic Language
For simplicity and the benefit of readers who are unfamiliar with Arabic, only the main characteristics that are required to build an Arabic dataset will be discussed. The Arabic alphabet comprises 28 letters. Some of these share the same main body shape and are differentiated by dots placed above or below them (Al-Hashim and Mahmoud, 2010). The number of dots is one, two, or three. There is a symbol called "Hamza" that appears above or below some letters (such as Alif, Waw, Yaa and Kaf). Sometimes this symbol is considered to be a different letter if it is written separately. The letters Tah, Zah and Kaf are sometimes written in two parts, like the letter "t" in English. For this reason, the last two options (Bar to Tah and Zah and complement to Kaf) have been included as shown in Fig. 2. It is important to note, especially for non-Arabic readers, that the Arabic alphabet contains 28 letters, but because of the additional marks and connectivity, the Arabic keyboard contains additional characters (Table 1).

Words written in Arabic differ from words written using the Latin alphabet in that typed and hand-written Arabic words have almost the same shape. Printed Latin text, by contrast, is written using separate letters, whereas the letters of words written by hand are usually connected. Additionally, Latin handwriting has no fixed rules for connecting letters to each other. Hence, the connections between letters can be considered the writer's preference. For example, when someone writes the word "university," he/she can connect all the letters together, or he/she can write each letter separately.

Arabic, however, has a rule that cannot be broken in either printed or hand-written texts. Understanding this rule explains (a) why and how the program in Fig. 2 divides the word into sub-words and (b) the manner in which the truth file must reflect the structure of the word. The rule can be summarized as follows: every letter must be connected to the letter after it, unless it is one of the letters indicated by the following numbers in Table 2: 1, 2, 3, 10, 11, 12, 13, 29, 31, 33, 34, 35, 36, 37 and 38. If one of these characters appears in a word, the word is written as two sub-words. If two such characters appear, the word is written as three sub-words, and so on. Figure 3 demonstrates how a five-letter word is written. Two letters are non-connectable with the letter after them; hence, the word is written using three sub-words. The same word written by hand is somewhat similar to the printed word. However, there are some differences between the printed and the handwritten word. For instance, two dots in the handwritten word are usually written as a small bar and three dots are usually written with the symbol "^" (Fig. 1); there may also be overlapping between sub-words, slant, skew, etc. Fig. 4 shows the importance of information about diacritics. It is a word written by three writers and the main difference is the diacritic of the first letter, so detailed information would help researchers to develop new writer identification algorithms.
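The connection rule can be illustrated with a short sketch. This is not the code of the OFM-ArabicDatasetBuilder; the function name `split_subwords` and the set `NON_CONNECTING` are hypothetical, and the letter set is an assumption based on common knowledge of Arabic script rather than a copy of the numbered entries in Table 2. Diacritics and the stand-alone Hamza are not handled.

```python
# Assumed set of letters that never connect to the following letter
# (Alif and its Hamza/Madda forms, Dal, Thal, Ra, Zay, Waw and Waw with Hamza).
NON_CONNECTING = set("اأإآدذرزوؤ")


def split_subwords(word):
    """Split an Arabic word into sub-words at every non-connecting letter."""
    subwords = []
    current = ""
    for letter in word:
        current += letter
        if letter in NON_CONNECTING:
            # This letter cannot join the next one, so the sub-word ends here.
            subwords.append(current)
            current = ""
    if current:
        subwords.append(current)
    return subwords


# The five-letter word "Tawila" contains two non-connecting letters (Alif, Waw).
tawila_parts = split_subwords("طاولة")
```

For the word Tawila, which is used later as the ground truth example, the sketch yields three sub-words, in line with the statement that two non-connectable letters produce three sub-words.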

Detailed Arabic Dataset
The creation of DAD was performed with an iterative process with the three steps of data collection (see section 4.1), data extraction (introduced in section 4.2) and the incorporation of data in ground truth file (described in section 4.3).

Data Collection
The program instructs the volunteer to write 132 entries, of which 10 entries are the numbers 0-9 (the numbers must be written 10 times on 10 screens). Additionally, there are six words that must be entered 10 times each and 62 words that must be entered just once. The information regarding these 132 entries is saved in a text file. In this study, 159 people participated voluntarily; they were recruited among instructors and students of the faculty. Table 4 summarizes important statistics about DAD. In total, the ground truth for DAD contains 93064 records in 18767 files, so the number of records is about five times the number of files. In other Arabic datasets, the number of records is usually equal to the number of files. The difference is explained by the fact that the ground truth in DAD is based on sub-words and letter parts, not on words or lines. These details are important for segmentation and writer identification (Fig. 4).
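The counts quoted above can be checked with a few lines of arithmetic. The variable names are illustrative only; all numbers are taken from the text.

```python
# Entries each volunteer is asked to write.
digit_screens = 10        # digits 0-9, written on 10 screens
repeated_words = 6        # words entered 10 times each
repetitions = 10
single_words = 62         # words entered once

entries_per_writer = digit_screens + repeated_words * repetitions + single_words

# Ground truth size reported for DAD.
records = 93064
files = 18767
records_per_file = records / files  # slightly below 5 records per file
```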

Fig. 4:
Arabic word written by three writers (means ox), the main difference is the diacritics of the first letter. (a) dataset will contain two records: #1 and #2; (b) dataset will contain three records of #1; (c) dataset will contain one record of #3. In addition to two entries for the two ...

The file contains the following information about each writer: writer name, strokes count, letters count, the word in Arabic letters, the word in English letters, the word in Unicode, the coordinates of all pixels detected by the hardware and the time in milliseconds for every pixel. This file contains a crude dataset called a library.
In fact, libraries can be used as datasets; however, some volunteers make grammatical or syntactic errors, or enter the words with significant slant and skew, or in some cases, it is simply difficult for both human and machine to read the word. Most importantly, a library does not contain detailed information regarding additional symbols (diacritics and supplement marks). Hence, the library (crude dataset) must be treated, as explained in the next section. The dataset was collected using the "Wacom pro tablet", which is a device that captures the writing when a user writes on it with a special digital pen. Figure 5 shows an Arabic word written by two writers and Table 5 indicates the main information found in the truth files for each word. The second writer wrote the first and third sub-words using two strokes. In addition, the first writer input the two additional marks immediately after the sub-word they belong to, whereas the second writer input the three additional marks after writing all sub-words. This is valuable information for writer identification.

Data Extraction
The libraries generated in the previous section must be preprocessed before they can be used. This is done in the same application using the "edit Library mode", which allows the user to edit the Arabic samples. In this mode, the application redraws the first word in the library (crude dataset) and colours the first stroke in red. The user (author) can click "cancel this word" (Fig. 2) if the word is not legible or has any of the aforementioned issues. If the word is suitable for the dataset, the user must choose the type of the stroke marked in red. If it is a sub-word, its name must be chosen from a drop-down list. If it is a diacritic mark, its type must be chosen first and then the sub-word it relates to, also from the drop-down list.

For example, the word in Fig. 6 contains three sub-words, but it is written in five phases. The reason is that the first letter has a bar and the last letter has two dots above it (a small horizontal line in the hand-written word). In order to build a ground truth file, the program instructs the user to enter the name of each stroke, as shown in Fig. 2, in the same order in which the writer wrote the word with the digital pen. In this example, the names of the five strokes were entered as listed in Table 6. To help the user, the program highlights the relevant stroke in red. The first, third and fourth strokes are sub-words without any additional marks. Their letters are listed in the drop-down list after the word is analysed. The second stroke is the bar for the "Taa" letter of the first sub-word ("Bar to Tah or Zah"). The final phase involves a line that replaces the two dots of the letter "Taa-Al-Marbouta." If the stroke is incorrect, the user can press "Ignore this stroke"; the program then deletes it and allows the user to re-enter names. Alternatively, the user can name it "!" from the drop-down list, which implies that the end-user of the dataset must ignore this stroke.
In order to build the ground truth table for this word, some records are required to define the strokes of the word. In the next section, the information regarding this word and the required records is discussed. The data are organized in a manner that makes processing easy in any programming language. Hence, every word is stored in a separate text file in Comma-Separated Value (CSV) format, as used by MATLAB. The name of the file denotes the writer ID, the Arabic word in Latin letters and a number. This number is generated by the Windows operating system when the application attempts to store a file with a name that already exists. This occurs when the same writer inputs the same word more than once. In other words, the name of the file has the following structure: Writer ID, 1st letter, 2nd letter, …, nth letter(Number).txt.
Internally, the information in the file is stored in records. Each record has two items: an attribute and its data. The first six attributes contain information regarding the word (obtained from the library), as listed in Table 7. To make the information more readable, attribute names start with the "#" character. The attributes are explained next.

#WriterName attribute: The data in this attribute is the writer ID. Every volunteer must input his/her ID before he/she inputs words into the application.
#StrokesCount attribute: The data in this attribute is the number of strokes used by the writer to write (draw) the word. The points of each stroke are detected by the hardware, starting at the moment the pen touches the screen and stopping at the moment the pen is raised from the screen.
#LettersCount attribute: The data in this attribute is the number of letters in the Arabic word. This number is counted by the software, which knows that the word shown in the buttons bar is the word the volunteer must write.
#WArabic attribute: The data in this attribute is the Arabic word written in Arabic letters. This record may not appear correctly if the operating system does not support the Arabic language. For this reason, the next two attributes were added.
#WEnglish attribute: The data in this attribute is the Arabic word in English letters, according to Table 2.
#WUnicode attribute: The data in this attribute is the Arabic letters in Unicode.

The #StrokesCount attribute provides the number of strokes used to write (draw) the word. Hence, the succeeding records in the file provide information regarding every stroke. Five or six records are used per stroke (Table 8). If the stroke is a sub-word, then five records are used: #SArabic, #SEnglish, #SUnicode, #Dots and #Tms. If the file contains numbers, only the first record was used.

Records for stroke points and times:
Attribute   Description
#Dots       Number of points detected by hardware between when the pen touches the screen and when it is raised again. Subsequent line(s) list the coordinates of all points in the format x1, y1, x2, y2, x3, y3 and so on.
#Tms        Number of time values. The subsequent line(s) list all times in milliseconds in the format t1, t2, t3 and so on. t1 is the time between the first and second points, t2 is the time between the second and third points.

Records for additional marks:
Attribute   Description
#1          Single dot above or below the Arabic letter.
#2          Two dots above or below the Arabic letter.
#3          Three dots above the Arabic letter (or under in some languages written in Arabic letters).
#H          Hamza above or below the Arabic letter.
#K          Bar for letter "Kaf" in the beginning or middle.
#T          Bar for "Tah" or "Zah".
#Dots attribute: The data of this attribute is the number of points detected by the hardware between when the pen touched the screen and when it was raised back up. The line(s) after this indicate the coordinates of all points in the format x1, y1, x2, y2, x3, y3 and so on. #Tms attribute: The data of this attribute is the number of time values. The next line(s) list all times in milliseconds in the format t1, t2, t3 and so on.
If the stroke is a dot(s), a Hamza or a bar for Kaf or Tah, then one of the following records is added before the previous five records: #1, #2, #3, #H, #T or #K, for a single dot, two dots, three dots, Hamza, bar for Tah, or bar for Kaf respectively (Table 9).
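To make the record structure concrete, the following is a minimal reading sketch. It is not the author's tool. The attribute names and the mark codes come from the text, but the exact line layout is an assumption: each record is taken to be a line of the form `#Attribute,data`, followed for #Dots and #Tms by one or more lines of comma-separated numbers. The function `parse_ground_truth`, the writer ID `W001` and the transliteration `TaAlif` are hypothetical.

```python
# Mark codes named in the text; the "#Attribute,data" line layout is an assumption.
MARKS = {"#1": "single dot", "#2": "two dots", "#3": "three dots",
         "#H": "Hamza", "#T": "bar for Tah or Zah", "#K": "bar for Kaf"}
HEADER = {"#WriterName", "#StrokesCount", "#LettersCount",
          "#WArabic", "#WEnglish", "#WUnicode"}


def parse_ground_truth(text):
    """Return (header dict, list of stroke dicts) for one ground truth file."""
    header, strokes = {}, []
    stroke, target = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not line.startswith("#"):
            # Numeric line belonging to the last #Dots or #Tms record.
            if stroke is not None and target is not None:
                stroke[target].extend(float(v) for v in line.split(",") if v.strip())
            continue
        key, _, value = line.partition(",")
        key, value = key.strip(), value.strip()
        if key in HEADER:
            header[key] = value
            continue
        if stroke is None or stroke["closed"]:
            stroke = {"mark": None, "appears_as": None,
                      "xy": [], "times": [], "closed": False}
            strokes.append(stroke)
        target = None
        code, _, seen = key.partition("*")  # e.g. "#1*2": a dot that looks like two dots
        if code in MARKS:
            stroke["mark"] = code
            stroke["appears_as"] = "#" + seen if seen else None
        elif key == "#Dots":
            target = "xy"
        elif key == "#Tms":
            target = "times"
            stroke["closed"] = True  # #Tms is the last record of a stroke
        else:
            stroke[key] = value      # #SArabic, #SEnglish, #SUnicode
    for s in strokes:
        s["points"] = list(zip(s["xy"][0::2], s["xy"][1::2]))
    return header, strokes


# Hypothetical two-stroke file: a sub-word followed by a bar for Tah.
SAMPLE = """#WriterName,W001
#StrokesCount,2
#SEnglish,TaAlif
#Dots,3
10,20,11,21,12,22
#Tms,3
0,15,31
#T
#SEnglish,TaAlif
#Dots,2
8,5,14,5
#Tms,2
1972,1990
"""
header, strokes = parse_ground_truth(SAMPLE)
```

If the real files separate the attribute from its data differently, only the `partition(",")` step would need to change.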

Ground Truth File
The following example explains the structure of the ground truth file and the records mentioned in the previous section. Figure 7 illustrates the ground truth table for the word in Fig. 6. First, six records are used to indicate each of the following: writer name, strokes count, letters count and the word in Arabic, Latin and Unicode. Since the strokes count is five, there are five sections (only three of them are shown in Fig. 7). The first section gives information regarding the first stroke, which consists of two letters (given in Arabic, Latin and Unicode). While writing these two letters, the volunteer passed the pen through 132 pixels. The subsequent lines indicate the coordinates of these pixels in the previously mentioned format. The last record provides the time for each point (i.e., when the hardware detects it). This is very helpful for writer identification. The next section contains six records. Since this section is not a sub-word, but rather a supplement mark (a bar for Tah), the first record is #T. The subsequent three records provide the name of the sub-word the supplement mark belongs to. The next record (#Dots) provides the number of pixels used in writing the bar. The lines after this provide the coordinates of these pixels. The last record (#Tms) gives the time at which each pixel was detected. Note that the writer starts to write the second stroke after approximately two seconds (1972 ms).

Fig. 7:
Ground truth table for the Arabic word Tawila
In other words, the writer uses approximately two seconds for writing the two letters in the previous subword, plus the delay between two sub-words. The delay is equal to the time of the first pixel in the second sub-word (1972 ms) minus the time of the last pixel in the first subword (1282 ms), which is not shown in Fig. 7. The delay time (time that the pen was raised from the tablet) is also a useful feature for writer identification.
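The delay computation described above can be sketched as follows, assuming that each stroke carries absolute timestamps in milliseconds, as in this example. The function name `pen_up_delays` is hypothetical, and apart from 1282 ms and 1972 ms the timestamps below are invented for illustration.

```python
def pen_up_delays(stroke_times):
    """Delay between consecutive strokes: first time of the next stroke
    minus last time of the previous stroke (all values in milliseconds)."""
    return [nxt[0] - prev[-1] for prev, nxt in zip(stroke_times, stroke_times[1:])]


# First stroke ends at 1282 ms, second stroke starts at 1972 ms (from the text);
# the remaining values are made up for the example.
delays = pen_up_delays([[0, 640, 1282], [1972, 2100]])
```

With the two values quoted in the text, the pen-up delay is 1972 - 1282 = 690 ms.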
The third section includes information regarding the third stroke. It consists of a single letter, as can be seen from the first three records. The fourth record indicates the number of points and the coordinates of the points. The fifth and sixth records determine the number of time values and the next line lists the values themselves. The results of this study indicate that some writers often wrote a single dot as a long line, so that it appeared as two dots. Other writers sometimes wrote two dots as a small line, which was then detected as a single dot. This behaviour is important for identifying writers. Thus, in many cases, *x is added to the record. For example, if a dot appears as two dots, the record will be #1*2.

Every group of DAD was randomly distributed into training, testing and verification sets, containing 70, 15 and 15% of the entries of the dataset, respectively.
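A random 70/15/15 split of this kind can be sketched as below. The paper does not state how the split was implemented, so the function `split_entries`, the fixed seed and the rounding convention (any remainder goes to the verification set) are assumptions.

```python
import random


def split_entries(entries, seed=0, train_pct=70, test_pct=15):
    """Shuffle entries and split them into training, testing and verification sets."""
    items = list(entries)
    random.Random(seed).shuffle(items)  # seeded so that the split is reproducible
    n_train = len(items) * train_pct // 100
    n_test = len(items) * test_pct // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    verification = items[n_train + n_test:]  # remaining entries (about 15%)
    return train, test, verification


train, test, verification = split_entries(range(100))
```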

Writing Identification Using KNN and Nearest Neighbour Interpolation
To provide other researchers with a benchmark against which to compare their results, the K Nearest Neighbour (KNN) classification algorithm is used to test both groups. The KNN algorithm is a simple and very effective machine learning technique (Mohammad, 2019;Dhurandhar and Dobra, 2013); as a result, it is commonly used as a classification algorithm among researchers. It is used in text recognition and categorization (Alotaibi et al., 2017;Wan et al., 2012;ALSaif and Alotaibi, 2019;Chen, 2018;La et al., 2012), writer identification, image annotation (Gu et al., 2017), digit recognition (Gu et al., 2017), Arabic language processing (Selamat et al., 2009;Boubaker et al., 2014;Hafiz and Bhat, 2016;Al-Tamimi et al., 2017;Assaleh et al., 2009), internet content filtering (Guo et al., 2018) and many more.
KNN has several merits, such as simplicity and high accuracy, but it is relatively computationally expensive.

Preprocessing
This section describes how the data are prepared for the KNN-based writing recognition system, since the preprocessing determines how DAD can be used to achieve high accuracy levels.
Due to the variations in position and scale among different writers, and even for the same writer, preprocessing is an essential step for improving accuracy. Each word undergoes the following steps:
a-All strokes are combined into one stroke.
b-Strokes from the previous step are resampled to a fixed number of points, equal for all strokes. For shorter strokes this means up-sampling, whereas it means down-sampling for long ones. Resampling is performed by Nearest Neighbour Interpolation (NNI) (Elglaly and Quek, 2011).
c-All stroke points are normalized to mean zero and standard deviation one. This normalization is performed by subtracting the sample mean from all the values, so that the mean of the new values is zero. Then, all the values are divided by the standard deviation, so that the standard deviation of the new values is one. This normalization allows one to properly apply KNN with an adequate balance between the different properties.

KNN predictions are made using the training dataset directly. A prediction for a new input is made by searching through the entire training set for the K most similar instances and then summarizing the output variable for those K instances. For classification, the output is usually the most common class among the most similar cases. To determine which of the K instances in the training dataset are most similar to a new input, a distance measure is used. The most popular distance metrics are Euclidean distance, Cosine distance, Minkowski distance, Mahalanobis distance, Chebychev distance, Hamming distance and Spearman distance.
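Steps a to c of the preprocessing can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the function names and the default of 64 points are hypothetical, the x and y coordinates are normalized separately, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def combine_strokes(strokes):
    """Step a: concatenate all strokes of a word into one point sequence."""
    return [point for stroke in strokes for point in stroke]


def resample_nearest(points, n):
    """Step b: nearest neighbour interpolation to exactly n points."""
    m = len(points)
    if n == 1:
        return [points[0]]
    # Index of the source point nearest to position i * (m - 1) / (n - 1).
    return [points[(2 * i * (m - 1) + (n - 1)) // (2 * (n - 1))] for i in range(n)]


def zscore(values):
    """Step c: shift to mean zero and scale to standard deviation one."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def preprocess(strokes, n_points=64):
    """Return a fixed-length feature vector: normalized x values, then y values."""
    points = resample_nearest(combine_strokes(strokes), n_points)
    xs = zscore([x for x, _ in points])
    ys = zscore([y for _, y in points])
    return xs + ys


# Hypothetical word made of two short strokes.
features = preprocess([[(0, 0), (2, 4)], [(4, 8)]], n_points=3)
```

Every word then maps to a vector of the same length, which is what a distance-based classifier such as KNN requires.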
In order to compare the accuracies of KNN with this dataset, several KNN model types were examined considering different parameter values, k and distance metrics. Table 10 summarizes these models and their importance.
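The prediction rule and some of the distance metrics mentioned above can be sketched as follows. This is a generic KNN illustration, not the configuration of any specific model in Table 10; the function names, the toy feature vectors and the writer labels `w1` and `w2` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p=3):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_x, train_y, query, k=1, distance=euclidean):
    """Return the most common label among the k training samples nearest to query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set: two writers, two feature vectors each.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["w1", "w1", "w2", "w2"]
predicted = knn_predict(train_x, train_y, [5.5, 5.0], k=1)
```

Changing `k` or passing another function as `distance` gives the kind of model variants compared in the experiments.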
The block diagram of Fig. 8 summarizes the whole work of building DAD and using it for experimentation. First, an iterative process collected handwriting examples, extracted the relevant parts and included them in a ground truth file, forming DAD. In the experimentation phase, DAD was used for training a recognition system with KNN. This recognition system was then validated with different writing samples to assess its accuracy.

Results
The DAD dataset was evaluated by training a recognition system with it and measuring the accuracy of the trained system. In this research, both groups 1 and 2 are used for training and classification purposes. Table 11 and Figs. 9 and 10 show that the accuracy and prediction speed achieved using group 2 are higher than those achieved using group 1. For the second group, the accuracy and speed are very high, except for the speed of the cubic model. For the first group, the speed is likewise very high except for the cubic model, but the accuracy is not as high as for the second group.

Discussion
A high level of accuracy is considered an indicator of the utility of the dataset for writing recognition systems.
The improvement of group 2 over group 1 is attributed to the fact that words in group 2 are repeated several times by every writer.
Using another preprocessing algorithm or another classifier, such as SVM or deep learning, may improve the accuracy further.
Regarding the different KNN models, the most effective one in terms of accuracy was Fine KNN. This model used a K parameter of 1 (i.e., considering only the single most similar neighbour) and the Euclidean distance, which is essentially based on the distances for each dimension. This reveals that a small number of neighbours can obtain appropriate results in this context of Arabic writing recognition.

Fig. 8: Block diagram: data collection with writing tablet device; data extraction with edit library mode; inclusion in ground truth file; is DAD complete? [Yes] [No]; training KNN recognition system with DAD; evaluate recognition system with different writing

Conclusion and Future Work
This paper has introduced a new type of Arabic online handwritten word dataset. Its novelty lies in considering certain details of Arabic writing that are not taken into account by other similar datasets. Its ground truth files contain full details regarding sub-words and additional marks: single dot, two dots, three dots, Hamza and supplement marks (i.e., bar to "Tah" or "Zah" and complement to "Kaf"). It contains 136 bars for the letter "Kaf", 930 three-dots, 2409 bars for "Tah" or "Zah", 2777 Hamzas, 9580 two-dots, 15610 single dots and 46560 sub-words. It also contains 14946 Indian numbers.
In total, the ground truth for DAD contains 93064 records in 18767 files. The number of records is about five times the number of files, whereas in other Arabic datasets the number of records is usually equal to the number of files. This is because the ground truth in DAD is based on sub-words and letter parts instead of words or lines.
This dataset will provide researchers with a strong tool for online Arabic language text recognition especially in the segmentation phase and writer identification.
As future work, it is planned to make a new version of this dataset that can (a) consider rarely used diacritics such as "Madda" and (b) collect more data, such as pressure, altitude, azimuth and coordinates while the pen is raised. This information would be very helpful for writer identification. With the contribution of other researchers from other countries, it is also planned to increase the number of writers and words and to apply these methods for offline datasets.
For some methods, it is a good idea to start experiments with this dataset and then move on to other datasets with less information about the text, but more writers. This dataset will be freely available to academic researchers worldwide with an interest in this study.