Measuring the Relevance of Trajectory Matching and Profile Matching in the Context of Carpooling Computational Systems

Carpooling consists of sharing individual vehicle space among people with comparable trajectories. Although there are some software initiatives to support carpooling, none of them really implements features such as searching for people with similar trajectories and profiles. In this study, we propose an innovative approach to generate clusters of users that share similar trajectories and profiles for carpooling purposes, based on the Optics and K-means algorithms and ensemble learning. First, we provide proper definitions of the fundamental elements of the carpooling context in order to contribute to a standardization of the related nomenclature. Next, we perform four different experiments to show the feasibility of the approach. We also contribute a real dataset (donated to UCI), properly described, which is used in two of these experiments. Results with the Davies-Bouldin index indicate that the generated clusters are suitable for the design of a carpooling recommendation system. The time performance of the approach has also been evaluated, both through dynamic program analysis via software profiling and through time complexity analysis in Big O notation.


Introduction
Traffic jams are a serious concern in metropolitan areas (He et al., 2012). Economic losses, health issues and environmental damage are some of the known consequences (Resende and Sousa, 2009;Currie and Walker, 2010;Levy et al., 2010;Hart et al., 2009). According to DENATRAN (Brazilian National Department of Traffic), the number of vehicles has increased by more than 100% in the last 10 years (DENATRAN, 2013). One alternative to avoid congestion is to adopt a policy of restricting traffic for private cars. For example, in Beijing, China, the government has adopted this solution to address one of the worst traffic situations in the world. Although it helps, traffic is still critical at peak hours (He et al., 2014). Since the USA suffered a loss of almost $78 billion in 2007 due to traffic jams (Schrank and Lomax, 2007), many measures have been adopted to reduce the problem, such as improving traffic light synchronization (He et al., 2014), building new roads and avenues, encouraging the use of bicycles as daily transportation and improving public transportation.
Carpooling (sharing individual vehicle space among people with similar destinations) is a typical solution used by some nations to avoid the problems generated by increasing traffic. However, this solution is strongly related to some cultural aspects (Gowri, 2008;Matos et al., 2014). Sharing cars' empty seats may be seen as an optimization method if we consider, for instance, the low occupancy rate per vehicle in traffic (He et al., 2014). In 2011, research conducted by the Michigan University showed an occupancy rate of 1.5 in the U.S.A. This occupancy rate drops to 1.4 if we consider only "home->work" or "work->home" trajectories. In other words, there are plenty of vehicles with just the driver inside (Ghoseiri et al., 2011).
There are software initiatives to facilitate the practice of carpooling. Caronas Brasil (Azzam and Bellis, 2008), Zumpy, Poolmyride, BlaBlaCar (Mazzella, 2004), Go!, Carma (O'Sullivan, 2015), Carticipate (Frost, 2015), BeepMe, Lyft (Zimmer and Logan, 2012) and Bynd are some examples. However, to date, few of them have had commercial success (Ghoseiri et al., 2011). Some of the services provided by this software require interested users to search for people who offer a ride with the same or a similar trajectory. In some cases, it is not so simple to find such a trajectory. Another problem is that the driver and the passenger are unfamiliar with each other; information about the other party contributes to increasing trust among users of the services (Furuhata et al., 2013). There are other factors that may discourage carpooling: the presence of a smoker, features of the vehicle itself, some aspects of the driver's profile (Agatz et al., 2012) and gender (Levin et al., 1977). The ridematching procedure has been proposed to deal with these issues and suggest carpool formation instantaneously (Agatz et al., 2012). It promises to facilitate the matching process among candidates by correctly assigning users who want a ride to users who offer one.
Currently, the spread and ease of use of smartphone apps, the Global Positioning System (GPS) and APIs like Google Maps allow people to track their own trajectories and share them broadly through services such as GPS-Way-Points, Share-My-Route, Bikely and Facebook (Shang et al., 2012). These shared data can be used to develop many useful features, such as mining frequent trajectories (Savage et al., 2010), finding similar trajectories (Pelekis et al., 2007), mining Points of Interest (POIs) (Telles et al., 2012), discovering sub-trajectories and so on. He et al. (2014) and Lee et al. (2007) have tried different trajectory mining approaches to provide ridematching among users. Most research centers on improving the trajectory mining process, but few works propose an effective approach to define the suitable granularity level of a GPS-based trajectory and none has properly formalized the central elements and features of the carpooling context. As a consequence, different terms with the same meaning are used in different works and may confuse the reader: route (He et al., 2014) or trajectory (Lee et al., 2007); driver (Furuhata et al., 2013), passenger or rider (Agatz et al., 2012), etc. Finally, in the context of carpooling, few academic studies consider both profile and trajectory similarity (Furuhata et al., 2013;Yan and Chen, 2011). Cruz et al. (2015) propose a clustering approach to trajectories in the context of carpooling. This paper aims to extend the work of Cruz et al. (2015) along three axes: (i) propose a (semi-)formal definition of the elements of the carpooling context, towards a standardized nomenclature, (ii) add users' profiles to the clustering approach and (iii) provide a proper evaluation of the trajectory dataset (GO!Track) used to train the clustering model.
In Section 2, we provide proper definitions of the elements of the carpooling context. In Section 3, we describe our approach to generating clusters of users with similar profiles and trajectories. In Section 4, we present the experiments and discuss the results. We conclude the work in Section 5.

Formalization
Definition 1: A trajectory is a sequence of multi-dimensional points. These points are discrete and finite and are represented by Tr = {p_1, p_2, p_3, ..., p_n}. Here, p is a 3-dimensional point composed of latitude, longitude and time-stamp: p = {lat, lng, t}.

Definition 2:
Driver is the user who shares a vehicle with the passengers and has a trajectory similar to that of all passengers. Tr(d) is a trajectory that pertains to the driver. In this study, Tr(d) ≃ U = {Tr(a_1), Tr(a_2), ..., Tr(a_n)} means that the driver's trajectory and the passengers' trajectories are similar. In such a case, dist(Tr(d), Tr(a_i)) ≤ r, where dist() is some distance function and r is a threshold constant.

Definition 3:
A vehicle is defined as any means by which someone may travel: A car, a motorcycle, etc. Here, a vehicle is represented by V, where V(d) is a vehicle that belongs to the driver d.

Definition 4:
Passenger is a user who shares a vehicle with a driver. Tr(a) is a trajectory that belongs to a passenger a. In this study, Tr(a) ≃ Tr(d) means that the driver's trajectory and the passenger's trajectory are similar.

Definition 5:
Ride is described as a way to share private vehicle space among people with similar trajectories and interests. A ride is represented by a pair consisting of Tr(d) and A, where Tr(d) is a driver's trajectory and A is the set of passengers.

Definition 6:
Origin is defined as the first point p_1 ∈ Tr of each trajectory.

Definition 7:
Destination is defined as the last point p_n ∈ Tr of each trajectory.
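The definitions above can be sketched in code as follows. This is only an illustration under stated assumptions: the names `Point`, `origin`, `destination`, `haversine_m` and `is_similar` are hypothetical, the time-stamp unit (seconds) is assumed, and the great-circle distance between origins and destinations is just one possible choice for the unspecified function dist().

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import List


@dataclass(frozen=True)
class Point:
    """A trajectory point p = {lat, lng, t} (Definition 1)."""
    lat: float  # latitude in degrees
    lng: float  # longitude in degrees
    t: float    # time-stamp; seconds are an assumed unit


Trajectory = List[Point]


def origin(tr: Trajectory) -> Point:
    """First point p_1 of the trajectory (Definition 6)."""
    return tr[0]


def destination(tr: Trajectory) -> Point:
    """Last point p_n of the trajectory (Definition 7)."""
    return tr[-1]


def haversine_m(p: Point, q: Point) -> float:
    """Great-circle distance in meters between two points."""
    earth_radius_m = 6371000.0
    dlat = radians(q.lat - p.lat)
    dlng = radians(q.lng - p.lng)
    a = (sin(dlat / 2) ** 2
         + cos(radians(p.lat)) * cos(radians(q.lat)) * sin(dlng / 2) ** 2)
    return 2 * earth_radius_m * asin(sqrt(a))


def is_similar(tr_d: Trajectory, tr_a: Trajectory, r: float) -> bool:
    """Assumed dist(): the larger of the origin gap and the destination gap.

    Returns True when dist(Tr(d), Tr(a)) <= r, with r in meters.
    """
    gap = max(haversine_m(origin(tr_d), origin(tr_a)),
              haversine_m(destination(tr_d), destination(tr_a)))
    return gap <= r
```

With this choice of dist(), a passenger whose origin lies about 111 m from the driver's origin is considered similar for r = 150 m but not for r = 100 m.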

Method
Our approach extends the trajectory clustering proposed by Cruz et al. (2015) with the K-means algorithm (Macqueen, 1967) as follows. Given a set of users U = {S_1, S_2, …, S_n}, where each S_i is a tuple composed of a user's trajectory Tr_i and the user's profile P_i, Optics* generates a set of clusters A = {C_1, C_2, ..., C_n}, where each C_i = {Tr_1, Tr_2, ..., Tr_n} denotes a set of users' trajectories and contains at least one trajectory from a user, called driver d, who gives a ride. Next, K-means produces a set of clusters B = {X_1, X_2, ..., X_n}, where each X_i = {P_1, P_2, ..., P_n} represents a set of users' profiles. Finally, the sets of clusters A and B are combined using an ensemble approach. The result is a set of clusters R of users related by profile and trajectory. Figure 1 describes the entire method. Note that the clustering process takes into account proper distances between trajectories and the social distance between users' profiles. The method is split into five principal steps: (i) defining the granularity of users' trajectories, (ii) temporal filter, (iii) Optics clustering, (iv) K-means clustering and (v) relabeling and intersection of clusters. Since the first three steps are properly described in Cruz et al. (2015), we present them briefly.

Trajectory's Granularity
Consider U a set of users' trajectories. Each trajectory contains points that were collected at short time intervals, so the number of points must be reduced. RotaFacil (Telles et al., 2012) dramatically reduces the number of trajectory points by detecting Points of Interest (POIs) (Fig. 2). A new subset U' is thus generated.
Figure 2: POIs within a circumference for a given radius.

Temporal Filter
A temporal filter is a stage of the processing pipeline used in order to find similar trajectories with similar departure and arrival times. Indeed, it is pointless to cluster Tr(d) and Tr(a) together if the departure and/or arrival times of users d and a are very different, even though Tr(d) ≃ Tr(a). Taking t as the departure time of the ride and x as the tolerance within which the user is prepared to accept ride requests, the width of the filter is the interval [t − x, t + x]. For instance, a user d can offer a ride with departure time at 6 am and specify a tolerance of 30 min earlier or later than t, thus accepting ride requests in the interval [5:30, 6:30].
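The temporal filter can be illustrated with a short sketch. The function name `within_temporal_filter` and the calendar date are hypothetical; only the interval [t − x, t + x] and the 6 am / 30 min example come from the text.

```python
from datetime import datetime, timedelta


def within_temporal_filter(offer_departure: datetime,
                           tolerance: timedelta,
                           request_departure: datetime) -> bool:
    """Return True when the request falls inside [t - x, t + x]."""
    lower = offer_departure - tolerance
    upper = offer_departure + tolerance
    return lower <= request_departure <= upper


# Driver d departs at 6:00 and accepts requests 30 min earlier or later.
t = datetime(2016, 3, 1, 6, 0)   # the date itself is arbitrary
x = timedelta(minutes=30)

accepted = within_temporal_filter(t, x, datetime(2016, 3, 1, 5, 45))
rejected = within_temporal_filter(t, x, datetime(2016, 3, 1, 6, 31))
```

Here `accepted` is True because 5:45 lies inside [5:30, 6:30], whereas `rejected` is False because 6:31 falls outside it. Only trajectories that pass this filter would be handed to the clustering step.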

Optics Clustering
The trajectory clustering is performed by an adaptation of the Optics algorithm, Optics* (Ankerst et al., 1999). Figure 3 illustrates the algorithm's behavior. Tr(a) belongs to a passenger a who wishes to get a ride, whereas Tr(d) belongs to the driver. The similarity only takes into account the origin and destination points of the passenger's trajectory.

For profile clustering, the cosine function has been used to compute the similarity (Theodoridis et al., 2010) between two users, sim(P_i, P_j):

$$\mathrm{sim}(P_i, P_j) = \frac{P_i \cdot P_j}{\lVert P_i \rVert \, \lVert P_j \rVert} \qquad (1)$$
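As an illustration of the cosine similarity of Equation 1, the sketch below assumes that each profile has already been encoded as a numeric vector (for example, one flag per attribute). That encoding and the function name `cosine_similarity` are assumptions, not details given in the text.

```python
from math import sqrt
from typing import Sequence


def cosine_similarity(p_i: Sequence[float], p_j: Sequence[float]) -> float:
    """Cosine of the angle between two profile vectors."""
    if len(p_i) != len(p_j):
        raise ValueError("profiles must have the same number of attributes")
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm_i = sqrt(sum(a * a for a in p_i))
    norm_j = sqrt(sum(b * b for b in p_j))
    if norm_i == 0.0 or norm_j == 0.0:
        # An all-zero profile has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_i * norm_j)


# Hypothetical profiles encoded as [smoker, likes_music, talkative].
same = cosine_similarity([1, 0, 1], [1, 0, 1])
disjoint = cosine_similarity([1, 0, 0], [0, 1, 0])
```

Identical profiles give a similarity of (approximately) 1, and profiles with no attribute in common give 0.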

Relabel and Intersection
The relabel strategy is a way to align clusters regarded as similar, as shown in Fig. 4.
The results obtained by Optics* and K-means are two partitions (A and B) that are processed by the Hungarian algorithm. The algorithm relabels the partitions to identify which clusters have the most users in common. Consider, for instance, that cluster C_1 has 4 users with comparable trajectories and cluster X_2 has 10 users with similar profiles. Considering all combinations between A and B, X_2 and C_1 are the clusters that have the most users in common. Figure 5 illustrates the relabeling process. The columns and rows represent the partitions and the users, respectively. The permutation is used to align the most similar clusters. Users that belong to aligned clusters become part of the final clusters. This work uses the trajectory partition as the reference partition, which supports the alignment of the other partitions. As Fig. 4 and 5 show, the voting approach is not fully used because, in our context, there is no need to vote to generate final clusters with just two partitions.
The consensus function (represented by τ in Fig. 4) considers the intersection between clusters, because the partition D results from two other partitions, A and B. This final partition is composed of clusters whose users have similar trajectories and profiles.
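A minimal sketch of the relabel-and-intersection step, assuming each partition is given as a list of sets of user identifiers. For brevity it replaces the Hungarian algorithm with an exhaustive search over permutations, which solves the same assignment problem but is only practical for a handful of clusters. The function name and the example user ids are hypothetical.

```python
from itertools import permutations
from typing import List, Set


def align_and_intersect(reference: List[Set[int]],
                        other: List[Set[int]]) -> List[Set[int]]:
    """Align `other` to `reference` and intersect the aligned clusters.

    The alignment maximizes the total number of shared users.
    Brute force stands in for the Hungarian algorithm here.
    """
    k = max(len(reference), len(other))
    # Pad the shorter partition with empty clusters so both have k entries.
    ref = reference + [set() for _ in range(k - len(reference))]
    oth = other + [set() for _ in range(k - len(other))]

    def shared_users(perm):
        return sum(len(ref[i] & oth[perm[i]]) for i in range(k))

    best = max(permutations(range(k)), key=shared_users)
    final = [ref[i] & oth[best[i]] for i in range(k)]
    # Empty intersections do not form a carpooling group.
    return [cluster for cluster in final if cluster]


# Partition A: trajectory clusters (reference); partition B: profile clusters.
A = [{1, 2, 3, 4}, {5, 6}]
B = [{5, 6, 7}, {1, 2, 3, 8, 9}]
final_clusters = align_and_intersect(A, B)
```

In this example the first trajectory cluster is aligned with the second profile cluster, so `final_clusters` is `[{1, 2, 3}, {5, 6}]`. For larger numbers of clusters, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` could replace the permutation search.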


Experiments
We have conducted four experiments to show the feasibility of ridematching with trajectory and profile. The first two experiments are based on Cruz et al. (2015); the difference is that we use a larger real dataset and show the results produced using Optics*. The third experiment shows the results of Optics and K-means when clustering users' profiles. Finally, the fourth experiment presents results obtained from clusters of users that have both trajectory and profile similarity.

Datasets
Three different datasets have been used. The first consists of trajectories of users driving cars or taking buses, collected by the GO!Track app. The second consists of 500 trajectories that were artificially produced by Rota Facil. The third dataset consists of 500 records of profile attributes, also artificially produced.

GO!Track Dataset
Table 1 presents the first dataset. A total of 445 trajectories from 65 different mobile devices were collected between September 2014 and March 2016 in the city of Aracaju/SE. Each trajectory is a set of points obtained at an interval of 0.5 sec (for car) and 10 sec (for bus). The values of these parameters were defined empirically. Tables 2 and 3 show the fields of the dataset: the first stores the collected trajectories and the second stores mainly latitude and longitude. Figure 6 shows a trajectory instance and the corresponding data. The present version of the dataset includes an important set of streets and avenues of the city of Aracaju. We have plotted all the dataset points on the Aracaju city map (Fig. 7).
In addition, Table 4 lists the top-20 traffic roads most visited by GO!Track users, according to the number of trajectories (#T) that actually used them. This was accomplished using the Google Geocode API. The number of points column (#P) shows the number of points located on that traffic road, regardless of the trajectory. We highlight the main city traffic roads known for their traffic density during peak hours.
Many useful applications use date-time data to predict traffic states or traveling time. The following graphs relate geographic points to time. The graph of Fig. 8 presents the relation between the trajectories and date-time.
The x-axis represents the 163 trajectories and the y-axis the time bands. Each line in the graph represents a trajectory: the higher the line, the more time has been spent on the trajectory, regardless of the traveled distance. The points represent the city at a diverse range of times, making it possible to observe situations where traffic is probably heavier (peak time). Similarly, the graph of Fig. 9 shows the relation between each of the top-20 most visited traffic roads (streets and avenues) and the specific instant at which it was visited. We have considered a daily time interval, in particular at peak hours.

The experiments were carried out with the following parameter settings. For experiments one and two, the value of the parameter ε was set to 100, 150, 200 and 300 m. MinPts was set to 2 and 3. We have assumed that 100 to 300 m are moderate distances for a user who wants a ride to move towards the destination point of an offered ride. The values 50, 150, 200 and 250 were assigned to the parameter ε', which is used in the cluster extraction algorithm.

Evaluation Metrics
The Davies-Bouldin Index (DBI) (Theodoridis et al., 2010) was used to evaluate the clustering task. Equation 3 defines the DB value:

$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\alpha_i + \alpha_j}{d(c_i, c_j)} \right) \qquad (3)$$

where n is the number of clusters, c_i and c_j are the centroids of clusters i and j, d(c_i, c_j) is the distance between these centroids, and α_i and α_j are the similarity measures for clusters c_i and c_j. The values generated by Equation 3 reflect how similar the elements of the same cluster are, as well as the dissimilarity among different clusters. Smaller DBI values are better.

Table 5 shows the results of the first experiment. The radius used in the granularity definition step was 25 m. The best result occurred when ε was 150 and MinPts was smaller than 3. Table 6 shows a slight contrast with Table 5: for example, the best results occurred when MinPts was bigger than 2. The radius used in the granularity definition step was 30 m. Table 7 shows the results of the second experiment. The DBI values among the three algorithms are similar. According to the experiments, ε directly influences the size of the clusters. An ε that is large enough will produce good results; conversely, a small ε will generate many objects whose reachability-distance is undefined. Here, as in Cruz et al. (2015), no method was used to deduce the ideal ε.
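For reference, the Davies-Bouldin computation used in these experiments can be sketched as below. This is a minimal sketch under assumptions: points are plain coordinate tuples, distances are Euclidean, and the per-cluster measure α is taken to be the mean distance of a cluster's members to its centroid. The function names are hypothetical.

```python
from math import dist
from typing import List, Sequence

Cluster = List[Sequence[float]]


def centroid(cluster: Cluster) -> List[float]:
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def davies_bouldin(clusters: List[Cluster]) -> float:
    """Davies-Bouldin index; smaller values indicate better clustering.

    Requires at least two clusters with distinct centroids.
    """
    centroids = [centroid(c) for c in clusters]
    # alpha_i: mean distance from the members of cluster i to its centroid.
    alphas = [sum(dist(p, c) for p in members) / len(members)
              for members, c in zip(clusters, centroids)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max((alphas[i] + alphas[j]) / dist(centroids[i], centroids[j])
                     for j in range(n) if j != i)
    return total / n


loose = davies_bouldin([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])
tight = davies_bouldin([[(0, 0), (0, 1)], [(10, 0), (10, 1)]])
```

The two toy partitions have the same centroid separation, but the second one has more compact clusters, so `tight` is smaller than `loose`, in line with the rule that smaller DBI values are better.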

Experimentation Results
The results of the third experiment are shown in Table 8. The experiment presents a comparison of clustering methods with regard to users' profiles. First, we show the results for Optics and then for K-means. We can verify that K-means has better results as the number of clusters grows.
As a consequence of these results, K-means has been used to generate the profile clusters in the fourth experiment. Table 9 shows the results of the ensemble learning approach used to provide clusters of users with similar trajectories and profiles. The results have been obtained by matching the trajectory clusters generated by Optics* with the profile clusters obtained with K-means. Table 9 shows the Number of Final Clusters (NFC), the Davies-Bouldin Index related to the Trajectory (DBIT) and the Davies-Bouldin Index related to the Profile (DBIP). It also shows the results of Optics* and K-means by Number of Clusters (NC), evaluated with the DBI metric once the ensemble learning step is done. The DBI is better when the number of clusters is larger (NC ≥ 40).

Complexity Analysis
We have provided some time analysis for our approach. First, we used software profiling, which is a form of dynamic program analysis. Next, we estimated the time complexity in big O notation.
The cProfile library enables software profiling analysis. Profiling was used to determine which part of the method should be optimized. The analysis covers a set of features such as the number of function calls (NumofCall), the total time spent by functions or operations, the cumulative time spent by the functions (C Time), etc. In addition, we could verify the overall time spent in each method. Table 10 shows the four most expensive functions. The results of Table 10 were obtained with the following setup: ε = 100, MinPts = 2, ε' = 150, k = 54. Table 10 shows that the math.cos function is called many times, but the time it takes is less than 10% of that of the neighbors function. The neighbors function is used by the Optics algorithm and is of fundamental importance. It is a bottleneck of the Optics algorithm because it consults the whole set of trajectories every time it is called. According to Ankerst et al. (1999), an index structure such as a tree-based spatial index can be used to decrease the overall runtime.
A first complexity analysis of the method was also carried out. According to Ankerst et al. (1999), the Optics* algorithm has an overall runtime of O(n^2 · lg n), considering a spatial index and the similarity algorithm. Additionally, the K-means algorithm has an overall runtime of O(n^(dk+1) · lg n), where d is the dimension and k the number of clusters. The ensemble approach has an overall runtime of O(K^3), considering that the Hungarian algorithm is employed. The runtime of the ensemble approach can be considered constant because K is the number of partitions, which is fixed at 2. Thus, the global runtime is O(n^2 · lg n + n^(dk+1) · lg n).
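The kind of profiling run described above can be reproduced with the standard `cProfile` and `pstats` modules. The sketch below is illustrative only: `neighbors` is a hypothetical linear-scan stand-in for the real neighborhood query, and `workload` uses synthetic one-dimensional data rather than the trajectory dataset.

```python
import cProfile
import io
import pstats
from math import cos


def neighbors(points, center, eps):
    # Hypothetical stand-in: scans the whole set on every call,
    # which is what makes the real function a bottleneck.
    return [p for p in points if abs(p - center) <= eps]


def workload():
    # Synthetic data; not the GO!Track trajectories.
    points = [1000.0 * cos(i) for i in range(2000)]
    total = 0
    for center in points[:200]:
        total += len(neighbors(points, center, 100.0))
    return total


def profile_report(func, top=4):
    """Profile `func` and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_report(workload)
print(report)
```

The report lists, for each function, the number of calls (`ncalls`), the time spent in the function itself (`tottime`) and the cumulative time including callees (`cumtime`), which correspond to the kinds of measurements summarized in Table 10.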

Conclusion
Encouraging carpooling is an important effort towards reducing the number of vehicles in transit. Although there are some related research initiatives and even some related software, they do not appropriately address the specificities of the carpooling context. In this study, we have proposed an extension to the method developed by Cruz et al. (2015) in order to deal with some of these specificities: finding groups of users that have similar profiles and trajectories and consequently determining potential carpooling possibilities. Furthermore, the clustering of users' trajectories and the clustering of users' profiles are results that can be considered separately, depending on the interests of those who wish to use the proposed approach in carpooling applications.
The density-based algorithm Optics was chosen due to features such as a minimal number of input parameters, the ability to build non-spherical clusters and robustness to noise. These features are important in the task of finding similarity among trajectories. The well-known K-means was chosen because it shows better results than Optics in the context of users' profiles.
Clustering results and the corresponding Davies-Bouldin Index values, obtained from a dataset of actual trajectories collected pervasively, have shown the feasibility of the proposal. According to the experiments, POIs with different radii do not seem to influence the quality of our approach. The initial runtime analysis indicates a high complexity. However, if we consider that the problem of finding similarity among users' trajectories is a special case of the so-called pickup and delivery problem, which is NP-Complete (Agatz et al., 2012), the analysis result supports its feasibility.
We are currently working on experiments that consider weighted profile attributes, so that users can define their weighted preferences. We are also embedding this similarity approach into a carpooling recommendation service so that it can be integrated into open-source carpooling software, such as GO!Caronas (Macedo et al., 2014). Finally, we intend to compare our ride and profile matching approaches with the approach of Carvalho and Macedo (2013), which uses coalition structures to provide proper group formation.