Block-Matching Twitter Data for Traffic Event Location

We used a block-matching approach that is data-driven and relies mostly on patterns of tagged speech in Twitter streams as a way to identify events in road traffic. Events are useful because their location may identify the status of road segments, especially when cross-street data are available. Basing a system on patterns that are not pre-defined has the advantage of flexibility for a variety of scenarios.


Introduction
Road traffic congestion is a problem for many areas around the globe. It has a variety of causes and may occur even where it is not normally a problem, such as when natural disasters occur. Social networking services are widely used by people sharing personal information about status and events. Twitter is one example where such information is shared, is accessible to the public and has a large volume of use. Because many users share information while in congested conditions, several methods have been developed to extract information from Twitter for use in traffic analysis.
These methods have in common that tweets are usually detected, tokenized and pruned, but they differ in the specifics (Kulkarni et al., 2016; Shuqair and Kozaitis, 2015). It has been shown that tweets alone can be used to estimate road traffic congestion conditions, as tweet density and hours of the day are useful attributes for building a congestion severity prediction model (Wongcharoen and Senivongse, 2016). In addition, official and public tweets have been correlated to identify mobility patterns (Rebelo et al., 2015).
A real-time monitoring system with particular reference to traffic congestion and car accidents from Twitter stream analysis has been developed (D'Andrea et al., 2015). It used text mining techniques and then classification to determine if a tweet is traffic related. The system was reported to identify issues, often before traffic news web sites. It was also able to discriminate whether traffic was caused by an external event or not by solving a multiclass problem.
There have been several other approaches to using Twitter and social media to provide real-time traffic information by way of text mining or Natural Language Processing (NLP). In one system, words were tokenized, matched to a database and classified into one of eight categories of words such as adjective, noun, verb, etc. (Sakaki et al., 2010).
When only tweets that were official in capacity were considered, they were tokenized and classified into 12 categories, which were in turn labeled as either events or situations (Ribeiro Jr et al., 2012). Then, an exact string matching process was used to find street names in a large database and fuzzy string searching using gazetteers was performed to match the streets by crossroads and neighborhood names. Another method used an existing tokenizer called Lexto to analyze and classify Twitter data by keywords that described traffic conditions (Wanichayapong et al., 2011). This approach classified road data either into a point (e.g., an intersection) or a link (e.g., a road) and then into eight subcategories such as place, verb, etc. Tweets were limited to traffic keywords such as accident and traffic congestion, with a large dictionary created for each category.
We examined Twitter data for cross streets and points-of-interest to help more accurately determine the status of a road segment in congested traffic. We looked for patterns of text and grouped like patterns together in what we refer to as a block-matching approach. The approach is also data-driven in that groups of like patterns are examined to extract information such as the condition of a road. This approach has the advantage that it can adapt to different syntaxes and potentially different languages.

System
The system initially acquired tweets from Twitter streams and preprocessed them before further analysis.
Preprocessing included removing unnecessary characters and retaining only tweets that were related to traffic. Then, the tweets were broken into tokens, tagged and grouped before being classified. During classification, rules were applied that clustered and/or separated blocks of tweets. A block diagram of the system is shown in Fig. 1.
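As a rough illustration of the preprocessing stage, the following is a minimal Python sketch, not the implementation used in this work. The street list, the regular expressions and all function names are assumptions introduced for illustration. For brevity, the sketch matches every token against the street list rather than only tokens tagged as likely names.

```python
import re

# Hypothetical street list for the user-specified area (illustrative only).
STREET_NAMES = {"main", "oak", "babcock"}

URL_RE = re.compile(r"https?://\S+")
# Anything that is not an ASCII letter, digit, or whitespace. This covers
# punctuation, ASCII emoticons such as ":(" and non-ASCII emoji.
SYMBOL_RE = re.compile(r"[^A-Za-z0-9\s]")


def clean_tweet(text):
    """Strip URLs, punctuation and emoticons, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return " ".join(text.split())


def mentions_known_street(text, streets=STREET_NAMES):
    """Return True if any token matches a name in the street list."""
    return any(token.lower() in streets for token in text.split())


def preprocess(tweets, streets=STREET_NAMES):
    """Clean every tweet and keep only those that mention a known street."""
    cleaned = (clean_tweet(t) for t in tweets)
    return [t for t in cleaned if mentions_known_street(t, streets)]
```

For example, `preprocess(["Accident on Main St!! http://t.co/abc :(", "Lunch was great :)"])` keeps only the first tweet, cleaned to `"Accident on Main St"`.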

Preprocessing
Tweets were first gathered by the Twitter Streaming API and unwanted symbols were removed. Tweets were limited to a geographical area specified by a user. Symbols were removed including punctuation marks, URLs, emoticons, etc. Retaining those symbols would not help the classification process and would add unnecessary complexity to later stages in the system. The next part of the preprocessing step involved the removal of tweets that did not pertain to traffic. In order to determine if a tweet was useful, words within the tweets first had to be tagged. To perform this operation, each tweet was separated into tokens and each token was tagged as a Part Of Speech (POS). We then compared each POS that was most likely a name to a list of street names that were determined to be in the geographical area specified by a user. If a match

POS Tagging
We used a text parsing POS method by means of a Stanford parser (Endarnoto et al., 2011) and a Carnegie Mellon parser (de Marneffe et al., 2006) to tag tokens. This approach allowed us to develop algorithms based on tags to classify text. Although there are many tags, our work focused only on the most common and simplest ones, such as noun and verb.

Block Matching
Once a collection of tweets had been acquired, we selected the first N POS tags and then searched the tweets for the same N tags, which we refer to as blocks. Each time we found the same block, we grouped the corresponding words together. We then looked at the next block of N tags in the tweet and repeated the process. Eventually, we built up a collection of blocks and associated words. We continued this process until all blocks of N consecutive tags were used in the matching process.
Results from a simple example of the block matching process are shown in Fig. 2. The process included nine tweets that were tagged using the convention in Fig. 3a, adopted from LG (2016). Considering N = 4, the grouping results are shown in Fig. 3b. The first tweet contained four tags, so only one grouping was possible and is shown in the first row of Fig. 3b. The second tweet contained only three tags, so that tweet was discarded because the number of tags was less than N. The third tweet contained eight tags, so five entries were possible and are shown in rows 2-6. The block matching results show 14 different blocks. Rules can be applied at this level or within the classification stage to reject or manipulate blocks of tweets if necessary. Once a number of tweets had been collected, the system passed the groups of blocks to the next step for classification.
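The block extraction described above can be sketched as a sliding window over the tags of each tweet. This is a minimal illustration under stated assumptions, not the authors' code: tweets are assumed to arrive as lists of (word, tag) pairs, and the function names and Penn-style tags in the usage note are hypothetical. A tweet with T tags yields T - N + 1 blocks when T >= N and none otherwise, which agrees with the counts in the example (eight tags and N = 4 give five blocks; three tags give none).

```python
from collections import defaultdict


def extract_blocks(tagged_tweet, n):
    """Yield (tags, words) for every run of n consecutive tags."""
    # range() is empty when the tweet has fewer than n tags,
    # so such tweets contribute no blocks.
    for i in range(len(tagged_tweet) - n + 1):
        window = tagged_tweet[i:i + n]
        tags = tuple(tag for _, tag in window)
        words = tuple(word for word, _ in window)
        yield tags, words


def match_blocks(tagged_tweets, n):
    """Group word sequences under the tag block they share."""
    groups = defaultdict(list)
    for tweet in tagged_tweets:
        for tags, words in extract_blocks(tweet, n):
            groups[tags].append(words)
    return dict(groups)
```

With two hypothetical tweets tagged `NN IN NNP NNP` ("accident on Main Street" and "crash on Oak Avenue"), `match_blocks(tweets, 4)` returns a single block whose group holds both word sequences.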

Match Blocks
We chose an example to illustrate our approach that determined whether a street was passable or not. Specifically, we determined whether a street was open or closed between two cross streets. Our approach was to look for a pattern of tags that indicated a possible obstruction in a specific area, such as the intersection between two streets. We started by considering two different sets of blocks of tags. Then, we eliminated and/or combined different blocks before further classification, because they provided specific information about the location and condition of the road. Blocks used for block matching must satisfy one of the rules described below with N = 8. Applying these rules, we used the block matching process to create different blocks for further processing. Table 1 illustrates an example of the blocks created from example tweets. On the left side of the table, several tagged tweets are shown and the right side shows the resulting blocks using the rules above.

Eliminate Blocks
We also used additional rules to eliminate tweets from further processing. For example, we used the following rules for this case to eliminate tweets:

Classify Blocks
In this step, groups of blocks were formed that contained information on a particular road. Then, those blocks were examined to determine the status of a road. The conditions used to group blocks were as follows:
• Blocks that include the same NNP tags in the same positions
• Blocks that include traffic nouns and at least two NNP tags
• Blocks that refer to the same street are grouped together
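The grouping conditions above could be approximated as in the following Python sketch. It is illustrative only: the traffic vocabulary, the function names and the rule that the first NNP word identifies the street are all assumptions rather than details from this work. The sketch covers the second and third conditions only.

```python
from collections import defaultdict

# Hypothetical traffic vocabulary (illustrative only).
TRAFFIC_NOUNS = {"accident", "crash", "closure", "flooding"}


def is_candidate(block):
    """Keep blocks with a traffic noun and at least two NNP tags."""
    nnp_count = sum(1 for _, tag in block if tag == "NNP")
    has_traffic_noun = any(
        tag == "NN" and word.lower() in TRAFFIC_NOUNS for word, tag in block
    )
    return has_traffic_noun and nnp_count >= 2


def street_key(block):
    """Assumption: the first NNP word names the street the block is about."""
    return next(word.lower() for word, tag in block if tag == "NNP")


def group_by_street(blocks):
    """Collect candidate blocks under the street they appear to refer to."""
    groups = defaultdict(list)
    for block in blocks:
        if is_candidate(block):
            groups[street_key(block)].append(block)
    return dict(groups)
```

A block here is a list of (word, tag) pairs. Keying on the first NNP word is a deliberate simplification; a fuller version would resolve names against a street list for the area.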

Results
The performance of the system can be altered by the user. For example, if the rules for block-matching are very specific, then we can easily determine the status of a road, but many tweets may have to be rejected, which is not necessarily practical. In general, most tweets are not about traffic; however, at the time of a significant natural disaster, weather event or crisis, traffic related tweets will be more probable.
To test our system, we distributed a map of a city that contained indications of closed roads, accidents, etc. to students without any knowledge of our system and asked them to send a traffic-related tweet. Using the example described, 39% of the tweets generated were retained as useful. Of those, 54% helped to decide whether a road segment was closed or open.
A decision is ready to be made about a road segment when combining and separating groups has ceased. At this point a group may consist of a single entry or multiple entries. For multiple entries, a variety of methods can be used to identify the status of a road. A weighted average of probabilities assigned to the tweets is the most straightforward. Relative probabilities and a threshold for a closed/open decision can be assigned by the user.
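One possible form of the weighted-average decision is sketched below in Python. The (probability, weight) representation, the function names and the default threshold of 0.5 are assumptions for illustration; the text specifies only that the probabilities and threshold are user-assigned.

```python
def road_closed_probability(entries):
    """Weighted average of per-tweet closure probabilities.

    entries: iterable of (probability_closed, weight) pairs.
    """
    entries = list(entries)
    total_weight = sum(weight for _, weight in entries)
    if total_weight <= 0:
        raise ValueError("at least one entry with positive weight is required")
    return sum(p * w for p, w in entries) / total_weight


def decide(entries, threshold=0.5):
    """Return 'closed' if the weighted average reaches the threshold."""
    if road_closed_probability(entries) >= threshold:
        return "closed"
    return "open"
```

For a group with entries `[(0.9, 2.0), (0.2, 1.0)]`, the weighted average is (0.9 × 2 + 0.2 × 1) / 3 ≈ 0.67. The segment is therefore reported as closed at a threshold of 0.5 and as open at a threshold of 0.7.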

Conclusion
By grouping blocks of tokenized POS Twitter tags in a data-driven approach, we were able to determine the condition of road traffic segments. This process allows for more detailed information about congested areas to help navigate away from the area. Furthermore, by identifying cross-streets near a traffic event, better paths can be found.

Funding Information
The authors have no support or funding to report.

Authors' Contributions
Amal Shuqair: Participated in all experiments, did programming, gathered results, wrote draft of manuscript, did portion of literature search.
Samuel Kozaitis: Designed research plan, organized study, performed final writing of manuscript, did portion of literature search.

Ethics
This article is mostly original and is an extension of a conference paper.