Newcastle Traffic Classification Using Clustering Algorithms

Corresponding Author: Nayef Z Al-Mutairi, Department of Civil Engineering, College of Engineering and Petroleum, Kuwait University, Kuwait. Email: nayef.almutairiku.edu.kw

Abstract: The evolution of urban road traffic networks is complex and varies depending on road type, zoning type and social activities. Typical variation in traffic patterns across a road network can be examined by considering daily human travel activities; thus, factor and cluster analysis is carried out. This paper is a comparative analysis of various data mining clustering methods for grouping roads based on traffic profile. The analysis was carried out using data available from 45 Automatic Traffic Recorder (ATR) sites in Newcastle, UK. Factor and cluster analysis were applied to the road traffic data so that roads could be classified, allowing diurnal traffic profiles to be assigned to groups of roads with similar attributes. These groups could be classified based on road traffic characteristics. Five road classifications were found.


Introduction
Owing to the growth in vehicle ownership and to road traffic demand exceeding road network capacity in urban areas, traffic congestion causes longer travel times, more pollutant emissions and a higher accident risk. Nair et al. (2019) stated that "the congestion index of different cities is strongly related to emissions, population density and GDP per capita". The traditional development of city road networks, such as constructing new roads or widening existing ones, is not a sustainable solution because resources are limited. The road network can be managed effectively by using Intelligent Transportation Systems (ITSs), such as Advanced Traffic Management Systems (ATMSs), to reduce the travel time of road users (Lee et al., 2011; Zhang et al., 2011). The morning and evening traffic peak hours, caused by traffic to and from workplaces or schools, form a recurrent traffic pattern due to repetitive daily travel activities (Francesc, 2012). Numerous monitoring systems are available for conducting traffic measurement surveys. These are divided into intrusive and non-intrusive sensors. Intrusive sensors, such as induction tubes, are positioned in or on the road surface to monitor traffic flow (Heidemann et al., 2008).
Non-intrusive sensors, such as cameras, are positioned in a way that does not interfere with road traffic during either operation or installation (Heidemann et al., 2008). Road traffic monitoring is classified into manual, semi-automatic and automatic methods. The manual method uses sheets of paper or mechanical or electronic counting boards and has typical count intervals of 5, 10 or 15 min, with counts conducted over a period of less than 24 h (Smith and McIntyre, 2002). Vehicle type, turning movements, travel direction and vehicle occupancy are some of the types of data collected during a manual count (Smith and McIntyre, 2002). The semi-automatic method combines manual counts with pneumatic tubes. The pneumatic tube counts traffic automatically by recording the air pulse produced in the tube when a vehicle axle passes over it (Mimbela and Klein, 2000). A manual traffic count, which focuses on vehicle type, is then conducted for a short period of time. The vehicle type and axle number ratios are calculated from the manual count data. These ratios, together with the axle count from the pneumatic tube, are used to extrapolate the traffic volume. The inductive loop system is one of the automatic methods. This system records the electrical current pulse on a detection wire set in the road, which is caused by the electromagnetic field as the vehicle [metal] passes over it (Ritchie et al., 2002; Lin et al., 2004).
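The semi-automatic extrapolation described above can be illustrated with a short sketch. It assumes that the manual sample gives a vehicle count and a number of axles for each vehicle class; the class names, the sample figures and the function names are hypothetical and are not taken from the study.

```python
# Hypothetical manual classified count over a short period:
# vehicle class -> (vehicles counted, axles per vehicle). Illustrative values only.
manual_sample = {
    "car": (180, 2),
    "rigid_truck": (15, 3),
    "articulated_truck": (5, 5),
}


def axles_per_vehicle(sample):
    """Average number of axles per vehicle observed in the manual sample."""
    vehicles = sum(count for count, _ in sample.values())
    axles = sum(count * n_axles for count, n_axles in sample.values())
    return axles / vehicles


def estimate_volume(axle_hits, sample):
    """Convert a pneumatic-tube axle count into estimated vehicles per class."""
    total_vehicles = axle_hits / axles_per_vehicle(sample)
    sample_total = sum(count for count, _ in sample.values())
    # Split the total using the class proportions seen in the manual sample.
    return {
        cls: total_vehicles * count / sample_total
        for cls, (count, _) in sample.items()
    }


# 4300 axle pulses at 2.15 axles per vehicle give about 2000 vehicles.
estimated = estimate_volume(axle_hits=4300, sample=manual_sample)
```

The sketch relies on the manual sample being representative of the whole tube-count period, which is the same assumption the semi-automatic method itself makes.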
The general shapes of daily traffic profiles have been analysed by a number of researchers (Festin, 1996; DfT, 2001; US DoT, 2001; Chrobok et al., 2004), who mainly focused on different road and vehicle types. These researchers classified days into working days and weekend days. The daily traffic profile for working days typically illustrates morning and evening peak periods and ■■ between peak periods. Based on the results from one UK study, the peak hour typically carries 8% to 12% of the total daily traffic volume (Taylor et al., 1996). Traffic volumes on weekend days are typically lower than traffic volumes on working days. Researchers investigated the difference in traffic volume between weekdays and weekends and concluded that there was no significant difference among the weekdays, although there was a significant difference between Saturdays and Sundays (Stathopoulos and Karlaftis, 2001).
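The peak-hour share quoted above is simple arithmetic on a 24-value daily profile. The sketch below shows the calculation on a hypothetical working-day profile; the hourly volumes are invented for illustration and are not data from any site.

```python
def peak_hour_share(hourly_volumes):
    """Return (peak hour index, share of the daily total carried in that hour)."""
    if len(hourly_volumes) != 24:
        raise ValueError("expected 24 hourly volumes")
    total = sum(hourly_volumes)
    peak_hour = max(range(24), key=lambda h: hourly_volumes[h])
    return peak_hour, hourly_volumes[peak_hour] / total


# Hypothetical working-day profile in veh/h, hours 00:00 to 23:00 (sums to 10000).
profile = [
    50, 30, 20, 20, 40, 120,
    400, 800, 900, 650, 550, 550,
    600, 580, 580, 650, 800, 850,
    650, 450, 300, 200, 130, 80,
]

hour, share = peak_hour_share(profile)  # hour 8 carries 900 of 10000 vehicles, i.e. 9%
```

In this invented example the share falls inside the 8% to 12% range reported for the UK.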
Advances in road traffic sensing technologies have made it possible to collect traffic data from multiple sensors, such as video cameras, inductive loop detectors and mobile phones (Leduc, 2008). Data mining techniques are applied to the traffic data to extract useful information for urban traffic management (Elhenawy et al., 2014). Weijermars and Van Berkum (2005) illustrate the advantage of using traffic patterns obtained from data mining to forecast traffic flow, impute missing data and build traffic models. One data mining tool is clustering, which is used for traffic pattern profiling, including traffic volume prediction (Stutz and Runkler, 2002; Yang et al., 2017), dynamic traffic flow identification (Jiang et al., 2003), extraction of congestion patterns from daily traffic (Wen et al., 2014) and detection of non-recurrent traffic congestion on urban road networks (Anbaroglu et al., 2014). Nair et al. (2019) compare traffic congestion conditions across multiple cities of the world by using a standard crowdsourced dataset.

Methodology
Traffic data was obtained from the Traffic and Accident Data Unit [TADU] at Gateshead Council. A number of traffic counters located within the Gosforth area were selected. The traffic data, in vehicles per hour [veh/h], was obtained from 45 counter sites. The records range from daily records for the whole year to a couple of weeks, depending on whether the traffic count site has a fixed or a temporary counter. Traffic counts were the only traffic data available; traffic composition and speed were not provided. Due to these limitations in the available data on flow characteristics, the sites could not be classified into congested, busy or quiet roads. Therefore, cluster analysis was conducted on the data to classify daily traffic profiles into several groups.
Cluster analysis is one of the methods recommended for classifying roads by the Traffic Monitoring Guide published by the US Department of Transportation (US DoT, 2001). This guide assists local authorities in applying appropriate extrapolation factors to convert short-period traffic counts into annual average daily traffic [AADT]. A short-period traffic count measures the traffic for specific times of the day or specific days. The guide provides a method for classifying a road based on traffic and road characteristics, such as average speed and traffic flow; appropriate AADT factors are then applied to each road based on its classification. Chen et al. (2008) showed that cluster analysis is an adequate technique for road classification and that it requires little or no knowledge of the physical layout of the road network.
Clustering is the unsupervised classification of behaviours or patterns into groups, or clusters. The classification is unsupervised because the patterns are grouped based on the input data alone. During cluster analysis, a pattern is defined mathematically by a number of features. Here, these features are the hourly traffic flows, or the peak and off-peak traffic flows, that make up a daily traffic flow profile. In some studies, cluster analysis is applied first and the results are then validated by factor analysis (Berlage and Terweduwe, 1988; Clark et al., 2003). One of the clustering techniques used is the k-means algorithm, as it is simple, efficient and easy to apply (Anil, 2010). The k-means algorithm has been documented as applicable in many fields, such as engineering, energy, medicine, electrical systems and transport (Chen et al., 2008; Docquier et al., 2009; Mora-Flórez et al., 2009; Pandit et al., 2011; Yiakopoulos et al., 2011). Cluster analysis divides data into meaningful subgroups (Fraley and Raftery, 1998). The main categories of cluster analysis techniques are hierarchical and non-hierarchical (Fraley and Raftery, 1998; Grimm and Yarnold, 2000; Anderson, 2003).
Non-hierarchical techniques allow multiple passes through the data, which is an advantage over hierarchical methods. Each data point is allowed to move among clusters with each subsequent iteration, with the aim of maximising intra-cluster similarity and inter-cluster dissimilarity, provided that the move does not increase the square error within each cluster (Aldenderfer and Blashfield, 1984). However, these techniques require prior information to be applicable (Ketchen and Shook, 1996). For example, traffic characteristics and fleet compositions were used to classify more than 3000 roads in Leicester by using non-hierarchical techniques (Chen et al., 2008). The K-means clustering algorithm was used, which requires the number of clusters to be specified before conducting the analysis. The number of clusters allowed has to be flexible so that the result is derived from the data and not from an assumption.
The K-means algorithm is one of the most commonly used cluster analysis techniques due to its ease of use and simplicity. In K-means cluster analysis, starting from an initial partition, objects are moved iteratively from one group or cluster centre to another (Fraley and Raftery, 1998; Cheung, 2003; Chitta and Murty, 2010). As with all non-hierarchical procedures, the number of clusters has to be specified before conducting K-means cluster analysis (Grimm and Yarnold, 2000). The cluster centres are selected randomly, and each object is then assigned to the nearest cluster (Ketchen and Shook, 1996). The square error is calculated within each cluster (Kalyani and Swarup, 2011) and is minimised by allowing objects to move between clusters (Jeffrey et al., 2006). Whenever an object has moved to another cluster, the cluster centres are recalculated and the process is repeated. The final groupings are determined once the clusters become stable (Hair et al., 1992).
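The K-means procedure described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming squared Euclidean distance and random selection of initial centres from the data; it is not the SPSS implementation used in this study, and the function names and the `seed` and `max_iter` parameters are introduced here for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def k_means(points, k, seed=0, max_iter=100):
    """Basic K-means: returns (labels, centres, total within-cluster square error)."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]  # random initial centres
    labels = None
    for _ in range(max_iter):
        # Assign each object to its nearest cluster centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:  # no object moved, so the clusters are stable
            break
        labels = new_labels
        # Recalculate each centre as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = mean_vector(members)
    sse = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, centres, sse


# Two well-separated groups of toy points (not traffic data).
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centres, toy_sse = k_means(toy, k=2)
```

For daily traffic profiles, each point would be a vector of 24 hourly volumes. Because the initial centres are random, different seeds can in general lead to different final groupings.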
This research employed cluster analysis to classify roads based on traffic counts, so that the traffic data would drive the classification process. The 2010 traffic count data for 45 sites in the Gosforth area of Newcastle was obtained from TADU. The traffic count data from each site ranged from daily records for the whole year to a couple of weeks. Excel and SPSS software packages were used to analyse the traffic flow profile for each site. Before conducting cluster analysis, days with missing traffic data were edited, and any day for which the data was missing entirely was removed. A zero traffic volume lying between two traffic volumes greater than 100 vehicles was replaced by interpolation, assuming a linear relationship between the traffic data before and after the zero value.
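The zero-volume rule above can be sketched as follows. The sketch assumes that a run of one or more consecutive zeros is filled when the volumes immediately before and after the run both exceed 100 vehicles; whether the study filled only single zeros or also longer runs is not stated, so the handling of runs is an assumption, as are the function name and the `threshold` parameter.

```python
def fill_bounded_zeros(volumes, threshold=100):
    """Linearly interpolate runs of zeros bounded on both sides by volumes > threshold."""
    filled = list(volumes)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] != 0:
            i += 1
            continue
        start = i
        while i < n and filled[i] == 0:
            i += 1
        end = i  # index of the first non-zero value after the run, or n
        has_neighbours = start > 0 and end < n
        if has_neighbours and filled[start - 1] > threshold and filled[end] > threshold:
            left, right = filled[start - 1], filled[end]
            steps = end - start + 1
            for j in range(start, end):
                filled[j] = left + (right - left) * (j - start + 1) / steps
    return filled


fill_bounded_zeros([200, 0, 300])  # the zero becomes 250.0
fill_bounded_zeros([20, 0, 30])    # unchanged: neighbours are not above 100
```

Zeros next to low volumes, for example during the night, are left untouched because they may be genuine observations.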
The data was coded before conducting cluster analysis and was divided into three groups: weekdays, Saturdays and Sundays. Each group was analysed independently for each site. The means and standard deviations were calculated for each hour of the 24-h traffic count. A day was excluded if either four consecutive hourly traffic volumes or six hourly traffic volumes in total exceeded the 95% confidence interval. Subsequently, the means and standard errors were calculated for each hour of the 24-h traffic count for each site.
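The day-screening rule above can be sketched as follows. Several details are assumptions made for illustration: the interval for each hour is taken as the mean plus or minus 1.96 standard deviations across the days in the group, a volume counts as exceeding the interval if it falls outside on either side, and the six hours are counted over the whole day. The function names are hypothetical.

```python
from statistics import mean, stdev


def hourly_bands(days, z=1.96):
    """Per-hour (lower, upper) limits from the mean and standard deviation across days."""
    bands = []
    for hour_values in zip(*days):
        m, s = mean(hour_values), stdev(hour_values)
        bands.append((m - z * s, m + z * s))
    return bands


def is_outlier_day(day, bands, max_run=4, max_total=6):
    """True if max_run consecutive hours, or max_total hours overall, lie outside the bands."""
    outside = [not (lo <= v <= hi) for v, (lo, hi) in zip(day, bands)]
    longest = run = 0
    for flag in outside:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    return longest >= max_run or sum(outside) >= max_total


def screen_days(days):
    """Return only the days that pass the screening rule."""
    bands = hourly_bands(days)
    return [day for day in days if not is_outlier_day(day, bands)]
```

Here `days` is a list of 24-value daily profiles for one site and one day type. After screening, the hourly means and standard errors would be recalculated from the remaining days.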

Traffic Data Analysis
In this section, the traffic data analysis is described in detail. The section consists of several stages of traffic data analysis. The original aim of collecting the traffic data was to select sites based on a traffic data classification. Due to the data limitations, the aim was changed to classifying the daily traffic profiles into groups.

Traffic Counters Location
Traffic data provided by TADU, operating at Gateshead Council, is used for the traffic analysis. As shown in Fig. 1, the 2010 traffic count data for a number of sites in Newcastle were obtained from TADU. Sixty traffic count sites located within the Gosforth area of Newcastle were identified. Four traffic counters were not counting road traffic and eleven traffic counters had no traffic records, leaving 45 counter sites with traffic data. The traffic count data, recorded as traffic flow per hour, was saved in an Excel file. Depending on the traffic count site, the traffic data ranged from daily records for the whole year to a couple of weeks. This variation depended on whether the site had a fixed or a temporary counter, although it could also have been caused by traffic data being lost or corrupted. The data was analysed using Excel and SPSS software packages to produce a traffic flow profile. A statistical method was used to identify the traffic on a typical day. Next, a cluster analysis was conducted to group roads based on the traffic data for a typical day.

Traffic Data Limitation
The traffic count was the only traffic data that was available. The traffic composition and speed were not provided. Due to limitations in the available data with regards to the flow characteristics, the sites could not be classified into congested, busy or quiet roads. Therefore, the cluster analysis was conducted on the data to classify the daily traffic profile into several groups.

Cluster Analysis
As explained earlier, this research employed cluster analysis to classify roads based on traffic counts; thus, the traffic data would drive the classification process. The 2010 traffic count data for 45 sites in the Gosforth area of Newcastle was obtained from TADU. The traffic count data from each site ranged from daily records for the whole year to a couple of weeks. Excel and SPSS software packages were used to analyse the traffic flow profile for each site.
Before conducting cluster analysis, the days with missing traffic data were edited. Days for which the whole day's data was missing were coded to be excluded from the analysis. Traffic volumes equal to zero lying between two traffic volumes greater than 100 vehicles were interpolated by assuming a linear relationship between the traffic data before and after the zero value. The day of the week and the month were coded to enable identification after conducting cluster analysis. The day type (working day, weekend or holiday) was coded for the same purpose. Annual traffic flow has been classified into three groups, specifically working days, Saturdays and Sundays (Cheung, 2003). An alternative classification uses four groups: Monday to Thursday, excluding holidays; Friday or the day before a holiday; Saturday; and Sunday (Chrobok et al., 2004). Therefore, five clusters were specified for the cluster analysis in an attempt to capture the main clusters for each site. K-means cluster analysis was employed to classify the hourly traffic data from each traffic counter into five clusters. Using the hourly traffic volumes for a day, a pattern was defined mathematically by K-means cluster analysis.
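The role of the day coding can be illustrated with a short sketch that cross-tabulates cluster labels against day types once the clustering has been run. The cluster labels are assumed to come from a K-means run on the daily profiles; the labels, the holiday set and the function names below are hypothetical and serve only as an example.

```python
from collections import Counter
from datetime import date


def day_type(d, holidays=frozenset()):
    """Code a date as 'holiday', 'weekday', 'Saturday' or 'Sunday'."""
    if d in holidays:
        return "holiday"
    if d.weekday() == 5:
        return "Saturday"
    if d.weekday() == 6:
        return "Sunday"
    return "weekday"


def cross_tabulate(dates, labels, holidays=frozenset()):
    """Count how many days of each type fall into each cluster."""
    table = {}
    for d, lab in zip(dates, labels):
        table.setdefault(lab, Counter())[day_type(d, holidays)] += 1
    return table


# One week in January 2010, Monday 4th to Sunday 10th, with hypothetical labels.
week = [date(2010, 1, 4 + i) for i in range(7)]
week_labels = [0, 0, 0, 0, 0, 1, 2]
table = cross_tabulate(week, week_labels)
# table[0] counts 5 weekdays, table[1] one Saturday, table[2] one Sunday
```

A cluster whose days are almost all of one type can then be named after that day type, which is how the clusters are interpreted in the results.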
The traffic data were analysed to produce daily traffic profiles for each site. Days for which the whole day's data was missing were flagged and excluded from the analysis. The traffic data was divided into three groups: weekdays, Saturdays and Sundays. Each group was analysed independently for each site. The means and standard deviations were calculated for each hour of the 24-h traffic count. A day was excluded if either four consecutive hourly traffic volumes or six hourly traffic volumes in total exceeded the 95% confidence interval. Subsequently, the means and standard errors were calculated for each hour of the 24-h traffic count for each site.

Results
The results of the cluster analysis varied across the 45 traffic count sites. There were two to four clusters for the data from each traffic counter site. The site results fell into four types: two clusters (weekday and weekend); three clusters (weekday, Saturday and Sunday); three clusters (two weekday clusters and a weekend cluster); and four clusters (two weekday clusters, a Saturday cluster and a Sunday cluster). Yang et al. (2017) found five clusters of traffic profiles, which is similar to the four clusters found in this paper despite the absence of traffic speed data.
On that basis, the traffic count sites were grouped into four groups. The first group consisted of 21 traffic counter sites, whose data was classified into weekday, Saturday and Sunday clusters. The second group consisted of ten traffic counter sites, whose data was classified into two clusters, weekday and weekend. The third group consisted of another eleven traffic counter sites, whose data was classified into four clusters: weekday 1, weekday 2, Saturday and Sunday. The fourth group consisted of three traffic counter sites, whose data was classified into weekday 1, weekday 2 and weekend clusters. ■■ K-means cluster analysis was a useful tool for classifying traffic data based on the hourly traffic flow, and each cluster could be explained by day type. The daily traffic profiles varied depending on the hourly traffic volumes and the day type [weekday, Saturday and Sunday] (Fig. 2 to 4). The profile trends for the three day types followed similar patterns across the sites. The weekday profiles could be divided into four patterns based on the peaks (Fig. 2). One pattern had similar morning and evening peaks. Two patterns had either a morning or an evening peak. The last pattern had neither a morning nor an evening peak; instead, it had a steady flow of traffic from the morning to the evening. The traffic profiles for Saturdays and Sundays were divided into three shapes based on the peaks (Fig. 3 and 4). One pattern had two small peaks, at noon and in the afternoon. A further pattern had one peak around noon. The third pattern had one peak and a high volume of traffic during the early hours of the morning.

Conclusion
This paper described a traffic data analysis in which cluster analysis was employed to examine daily traffic data variation and to produce traffic pattern profiles for roads. Owing to several data limitations, the daily traffic profiles were classified based on traffic counts alone. Data from each traffic counter site was grouped into two to four clusters. The traffic data was classified successfully by employing K-means cluster analysis based on the hourly traffic flow, and each cluster could be explained by day type. The daily traffic profiles varied depending on the day type and the hourly traffic volumes. These results could be used prior to road traffic or air quality modelling. Road transport emissions can be calculated based on daily traffic profiles.

Author's Contributions
Hamad B Matar: Writing the paper and analysis. Talal Almutairi: Literature review, data collection and preparation.
Nayef Z Al-Mutairi: Editing and preparation for publication.

Ethics
This article is original and contains unpublished material. The corresponding author confirms that all of the other authors have read and approved the manuscript and that no ethical issues are involved.