A Hierarchical Clustering Approach for DBpedia based Contextual Information of Tweets

Corresponding Author: Venkatesha Maravanthe, Department of Computer Science and Engineering, VTU Research Resource Centre, Belagavi, India. Email: venkatesha.uvce@gmail.com
Abstract: The past decade has seen a tremendous increase in the adoption of the Social Web, leading to the generation of an enormous amount of user data every day. The constant stream of tweets, with its innately complex sentimental and contextual nature, makes searching for relevant information a herculean task. Multiple applications use Twitter for various domain-sensitive and analytical use cases. This paper proposes a scalable context modeling framework for a set of tweets that finds two forms of metadata, termed primary and extended contexts. Further, our work presents a hierarchical clustering approach to find hidden patterns using the generated primary and extended contexts. Ontologies from DBpedia are used for generating primary contexts and subsequently for finding relevant extended contexts. DBpedia Spotlight in conjunction with the DBpedia Ontology forms the backbone of the proposed model. We consider both Twitter trend and stream data to demonstrate how this contextual information can be applied in clustering. We also discuss the advantages of using hierarchical clustering and the information obtained from cutting dendrograms.


Introduction
Adoption of the Social Web as a platform for people to exchange opinions has led to a high volume of user-generated content. The availability of affordable smart devices and easy access to the internet via wireless mediums such as WiFi, 4G, etc., has greatly aided this transition. Smartphone adoption, in particular, has resulted in a further increase in the user base for social applications. The technology shift has promoted the evolution of multiple social media platforms, prominent among which are Facebook, Twitter, Instagram and LinkedIn. Users freely exchange information over these platforms via text, images and videos. All the social platforms provide the necessary security and privacy features. Accessibility of the information is largely defined and controlled by the users themselves. In the microblogging space, Twitter is the most popular social media website, allowing users to assimilate concise information directly from its source. More than half of the world's population today are active Internet users, with around 1.3 billion Twitter users. Recent statistics report an average of 500 million tweets posted by 100 million active users every day (https://www.omnicoreagency.com/twitter-statistics/).
Twitter works on the concept of followee and follower. Most content on Twitter is publicly accessible. Some users, however, have opted for private accounts and their activities are often not available. The abundance of tweets and the simplified access to them lead to the problem of finding the right content and mining useful information from a given user's perspective. Human tendency has always been to stay curious about diverse topics, and the inability to find tweets suiting one's interests can be discouraging. Hence there is a need for efficient frameworks that identify topics and tag contexts, which eventually help define a foundation for the development of context aware applications. Numerous apps in popular marketplaces work on information derived by processing Twitter data. Such apps include recommender systems, in-app marketing, sentiment analysis, market insights, crisis management etc.
The DBpedia knowledge base is a crowd-sourced project developed by extracting contextual data from Wikipedia. It follows the Linked Open Data standard, consists of 3 billion Resource Description Framework (RDF) triples and provides a mechanism to query the relevant content. It links mainly to Wikipedia and Yet Another Great Ontology (YAGO) categories along with many external web pages. Multilingual support is also provided by DBpedia. Lehmann et al. (2009) provide more insight into the design of this framework for building the Ontology and into the maturity of the available content. They also describe the mechanisms to access information via online interfaces.
Hierarchical clustering is one of the cluster analysis methods in data mining that groups similar data points into clusters. It relies on cluster dissimilarity and cluster linkage, which are explained in subsequent sections. This form of clustering is advantageous for visualizing meaningful taxonomies and nested clusters. There are two different ways to perform hierarchical clustering:
• Agglomerative (bottom-up) approach, where each sample starts as its own cluster and clusters are merged until a single cluster is formed
• Divisive (top-down) approach, which starts with a single cluster that is broken down until each sample forms its own cluster
The first part of this work presents a model for extracting valid contexts from tweets using DBpedia, followed by a demonstration of how clustering can be applied to these contexts. Section Literature Survey provides an overview of existing similar models and the applicability of their contributions to our work. Section Problem Definition describes the problem addressed in this work. Section Methodology outlines our optimized approach to determining contexts and clustering them. Section Experimental Results presents the results and Section Conclusion and Future Work wraps up the findings while noting the scope for future enhancements.

Literature Survey
The use of Wikipedia and DBpedia as knowledge bases for mining text data has long been part of Semantic Mining. Bontcheva and Rout (2012) [...] of Wikipedia categories. Gabrilovich and Markovitch (2007) further show the versatility of concepts derived from Wikipedia and propose 'Explicit Semantic Analysis (ESA)' for computing semantic relatedness in natural language texts. Ramanathan and Kapoor (2009) propose a model for creating user profiles with the help of Wikipedia. The framework by Genc et al. (2011) leverages Wikipedia to map a tweet to its semantic space and to calculate the distance between tweets, which helps classification. Muñoz García et al. (2011) describe a topic recognition scheme that links keywords to a ranked list of DBpedia resources. Hamdan et al. (2013) utilized DBpedia along with WordNet and SentiWordNet for sentiment classification.
User interest modeling is an important application of Semantic Mining. Michelson and Macskassy (2010) use 'Named Entity Recognition (NER)' to obtain entities and disambiguate them by leveraging Wikipedia to generate Twopics. Lu and Lam (2012) put forth an expansion of users' interests through Wikipedia concept linking, and their results show better recommendations using these interests. Kapanipathi et al. (2014) process a hierarchy on the 'Wikipedia Concept Graph (WCG)' to come up with a 'User Interest Generator' and an 'Interest Hierarchy Generator', mapping users' primitive interests to the Wikipedia hierarchy. Shah et al. (2018) propose an enrichment technique using the DBpedia Ontology that generates niche interests and inferred general interests and works for the least active users as well. Interests identified using DBpedia are aggregated into a user profiling framework in (Orlandi et al., 2012).
Context or topic identification has promoted many recommender systems that use knowledge bases other than DBpedia. Abel et al. (2011) represent user modeling with entity identification via OpenCalais. A news recommendation system has been built on top of this user modeling and considers the temporal dynamics of profile changes. The work initiated in (Pla Karidi, 2016) matures into a complete recommender architecture in (Pla Karidi et al., 2017) that suggests both tweets and followees. This system takes advantage of the Alchemy API for deriving contexts to build a Knowledge Graph of 1092 nodes and 1323 edges. A superset of 1092 concepts may not be sufficient in specific areas, and DBpedia offers an alternative to explore. Papneja et al. (2018) propose a content recommender related to user interest. DBpedia Spotlight serves to find the mapping between the domain ontology and DBpedia classes. Romero and Becker (2017) describe a classification framework that takes advantage of DBpedia for enriching semantic features. DBpedia Spotlight connects terms to their respective URIs for semantic enrichment.
DBpedia and Wikipedia based solutions can also be found in combination with clustering. Szczuka et al. (2012) used a DBpedia dictionary and matched it against the respective concepts for converting texts from scientific documents. They conclude that the DBpedia concept representation of clusters is in line with manually assigned cluster labels. Likewise, in (Schuhmacher and Ponzetto, 2013) web search results are processed with DBpedia Spotlight for snippet semantification and topic assignment, leading to better quality of the clusters formed. Hu et al. (2009) present a method to cluster different sets of documents by generating a document-category matrix built on top of a Wikipedia term-concept matrix. Their results show that Wikipedia category information yields better cluster output along with hierarchical clustering methods.
In their review, Alnajran et al. (2017) compare 13 different research works on applications of clustering for mining Twitter datasets. Although the performance of hierarchical clustering is low, the quality of its clusters is pointed out to be much better. Twitter event detection using hierarchical clustering after computing pairwise distances on a tweet-by-term matrix is proposed in (Ifrim et al., 2014). Their experiments show that hierarchical clustering can process 24 hours of stream data within a 1-hour time frame with an accuracy of 80%.
Flisar and Podgorelec (2018) frame a classification model for tweets using DBpedia, which is similar to our effort for identifying contexts. Their work makes use of DBpedia Spotlight and queries the DBpedia ontology for enrichment of data. In our prior work (Venkatesha et al., 2019), we attempted to find extended contexts and provided a scalable framework along with relevant data filtering. Vicient and Moreno (2015) recommend hierarchical clustering for topic discovery in tweets. As a first step, semantic annotation is performed on the hashtags of tweets with the help of WordNet and Wikipedia categories. These annotated hashtags are the input for a bottom-up hierarchical clustering procedure using complete linkage, identical to what we are proposing. Saraçli et al. (2013) provide a detailed comparison of hierarchical clustering methods and help determine the right distance measures, thus guiding our choice of clustering approach.

Problem Definition
Given a large set of tweets, model a framework to generate and cluster contexts for those tweets. The framework should meet the objectives outlined below:
1. Read and process the set of tweets, resolving ambiguities
2. Generate primary context(s) from tweets
3. Get the extended context(s) for every primary context obtained
4. Cluster the primary/extended context(s) to visualize extracted metadata
5. Discover associated context(s) information at different levels of clusters
The first consideration is a proper tool for Named Entity Recognition and disambiguation to avoid confusion between different contexts. A long sentence or paragraph can result in one or more applicable contexts or categories. Hence the second step should consider all the applicable contexts of the text input. The third step is to find metadata or hidden data around the primary contexts, which helps derive more meaningful information about the text. This additional information about primary contexts is termed extended contexts. Considering the amount of text data generated on Twitter every day, the framework for extracting primary and extended contexts should scale to larger datasets. For better learning from tweets, the fourth and fifth steps attempt to cluster the contextual knowledge of a group of tweets.

Methodology
The proposed framework for generating contexts is illustrated in Fig. 1 and the subsequent clustering is given in Fig. 4. We commence the process by acquiring data, i.e., tweets. Tweets can be stored in multiple formats. We have chosen the JSON file format to store the input data for ease of use. The Java nio package is primarily used to read files and the open source Jackson library is utilized to process JSON data. Every tweet in the input data goes through the 'Primary Context Generator' and the 'Extended Context Generator', and the outcome is stored in JSON format.
We use the text "Roaming around amazon forest is a great experience" to illustrate the working of the intended framework.

Primary Context Extractor
Section Problem Definition highlights the difficulties with ambiguous contexts. We chose the open source DBpedia Spotlight (Mendes et al., 2011; Daiber et al., 2013) for handling ambiguity and deriving primary contexts:
• DBpedia Spotlight: DBpedia Spotlight is an annotation tool built on top of DBpedia resources. It includes built-in disambiguation of the phrases extracted from text. Results in Mendes et al. (2011) show that DBpedia disambiguation evaluation has an accuracy of 80.52% for the Spotlight Mixed approach. Daiber et al. (2013) further extend the model to multiple languages. The tool has both Web and web-service based interfaces. In this paper, we rely on the RESTful web service, connecting to the '/candidates' endpoint.
Interlinking of each annotated term to DBpedia resources through a unique URI string is an advantage of this tool. URIs can be directly connected to either DBpedia or Wikipedia resources.
The output of this step is a set of URIs based on the input text. Hence, for a given tweet t, the output can be defined as a set of URIs termed primary contexts:
$$PC_t = \{pc_1, pc_2, \ldots, pc_n\}$$
for all t ∈ T, where T consists of multiple texts/tweets.
The sample text produces Amazon_rainforest as the URI. Ambiguous tagging of Amazon as a company is avoided by the API endpoint. The response from the API contains additional attributes tagged to every URI, such as contextualScore, support, priorScore and finalScore. This supplementary information is captured and stored but not used in this work.

Extended Context Generator
The Extended Context Generator defined in Fig. 1 deals with finding additional metadata for the extracted primary contexts. Extended contexts are queried through the DBpedia Ontology using SPARQL (https://www.w3.org/TR/rdf-sparql-query/). The DBpedia SPARQL endpoint (http://dbpedia.org/sparql) is used to run the respective queries and the response is collected in JSON format. The query is modified to fetch the resource class types of a primary context in the DBpedia Ontology. These types are in turn mapped to multiple namespace schemas (e.g., dbo, dul, yago). Each primary context is mapped to the valid response from the SPARQL endpoint. The resultant data is filtered to extract only Wikipedia based categories. Java stream filters (https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html) are used to keep up with performance. The filtered results are termed extended contexts:
$$EC_{pc} = \{ec_1, ec_2, \ldots, ec_m\}$$
for all pc ∈ PC and m > n for the set of tweets T. The JSON object representation of the primary context and extended contexts for the sample text is depicted in Fig. 2. The Java Executors capability (https://docs.oracle.com/javase/tutorial/essential/concurrency/exinter.html) is utilized to enable multiple threads to read tweets and to connect to the two sources in parallel.
IDF is calculated for every term with respect to the entire set of documents:
$$idf_t = \log\frac{N}{n_t}$$
where N is the total number of documents and n_t is the number of documents in which a particular term t appears. The TF-IDF weight w of term t in document d is calculated as:
$$w_{t,d} = tf_{t,d} \times idf_t$$
where tf_{t,d} is the term frequency of t in d and d is the document consisting of either primary contexts or extended contexts. The TF-IDF weight w of each term is represented in vector format for every document. The similarity measure is calculated over multiple documents obtained after sampling vectors of primary contexts or extended contexts. This approach is represented in Fig. 3.
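For the extended-context lookup described at the start of this section, the query and the filtering step might look roughly like the following Python sketch. It is an illustration only: the original implementation is in Java, the exact query text is not given in the paper, the `format` request parameter is an assumption about the endpoint, and treating "Wikipedia based categories" as YAGO `Wikicat` classes is our assumption. The sample response is hand-written in the standard SPARQL JSON results layout and is not real endpoint output.

```python
from urllib.parse import urlencode

DBPEDIA_SPARQL = "http://dbpedia.org/sparql"  # endpoint named in the text
RESOURCE_NS = "http://dbpedia.org/resource/"
# Assumption: "Wikipedia based categories" are taken to be YAGO Wikicat classes.
WIKICAT_PREFIX = "http://dbpedia.org/class/yago/Wikicat"


def build_type_query(primary_context):
    """Return a SPARQL query asking for every rdf:type of one DBpedia resource."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        f"SELECT ?type WHERE {{ <{RESOURCE_NS}{primary_context}> rdf:type ?type }}"
    )


def build_request_url(primary_context):
    """Build a GET URL for the endpoint; the 'format' parameter is an assumption."""
    params = {
        "query": build_type_query(primary_context),
        "format": "application/sparql-results+json",
    }
    return f"{DBPEDIA_SPARQL}?{urlencode(params)}"


def extract_extended_contexts(response, prefix=WIKICAT_PREFIX):
    """Keep only the type URIs that start with the chosen prefix."""
    bindings = response["results"]["bindings"]
    uris = [binding["type"]["value"] for binding in bindings]
    return [uri for uri in uris if uri.startswith(prefix)]


# Hand-written response in the SPARQL JSON results layout (illustrative values).
sample_response = {
    "head": {"vars": ["type"]},
    "results": {"bindings": [
        {"type": {"type": "uri", "value": "http://dbpedia.org/ontology/Place"}},
        {"type": {"type": "uri",
                  "value": "http://dbpedia.org/class/yago/WikicatRainforests"}},
    ]},
}
extended = extract_extended_contexts(sample_response)
```

Keeping the prefix as a parameter makes it easy to try a different notion of "Wikipedia based" category without touching the query itself.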
Given two n-dimensional vectors W1 and W2 of TF-IDF weights, the cosine similarity between them is represented as:
$$\cos(W1, W2) = \frac{W1 \cdot W2}{\lVert W1 \rVert \, \lVert W2 \rVert} \qquad (6)$$
where W1 and W2 are the vectors consisting of TF-IDF weights. The cosine similarity in (6) can be expanded as:
$$\cos(W1, W2) = \frac{\sum_{i=1}^{n} W1_i \, W2_i}{\sqrt{\sum_{i=1}^{n} W1_i^2} \, \sqrt{\sum_{i=1}^{n} W2_i^2}}$$
where W1_i and W2_i are the components of W1 and W2 respectively. A higher cosine similarity value indicates more similar vectors. If the vectors being compared are exactly the same because of the underlying text, the similarity value is 1, which is the maximum possible value.
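The TF-IDF weighting and cosine similarity described above can be sketched together in plain Python. This is a minimal illustration rather than the code used in the experiments: it assumes tf is the raw count of a context label in a document and uses the unsmoothed idf = log(N / n_t), and the context labels in the example are made up.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with w = tf * log(N / n_t) for each term.

    Assumption: tf is the raw count of a context label in a document.
    """
    n_docs = len(documents)
    vocabulary = sorted({term for doc in documents for term in doc})
    # n_t: number of documents in which term t appears at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            counts[term] * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


def cosine_similarity(w1, w2):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)


# Three toy documents of context labels (illustrative, not from the dataset).
documents = [
    ["Catalonia", "Referendum"],
    ["Catalonia", "Referendum"],
    ["Football", "Madrid"],
]
vocabulary, vectors = tfidf_vectors(documents)
same = cosine_similarity(vectors[0], vectors[1])      # identical documents
disjoint = cosine_similarity(vectors[0], vectors[2])  # no shared labels
```

Note that with the unsmoothed idf, a label that occurs in every document receives weight 0 and therefore does not contribute to the similarity.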

Clustering Approach
Having validated the usefulness of the contexts by computing similarities, we next feed this metadata to a clustering method. The clustering step is mainly meant to identify useful patterns. Input data for clustering varies with the type and size of the tweets being sampled. The nature of the input data makes it difficult to decide on the number of clusters or the window size, which are required by the standard K-Means and Mean-Shift clustering approaches respectively. Similarly, algorithms like DBSCAN may not perform well if clusters have varied density because of unrelated texts/tweets. Given these circumstances and the need for a generic approach, we chose the bottom-up approach of hierarchical clustering, i.e., Agglomerative Clustering.
In this study, we have followed the steps given in Fig. 4 for clustering data. For clustering, too, we first need to convert primary/extended contexts to vectors, and we use the TF-IDF representation explained earlier for the similarity computation. The next step is to select a distance measure to find the dissimilarity between data points. The Euclidean distance measure was chosen after referring to the results of (Saraçli et al., 2013).
Here the Euclidean distance is calculated as pairwise distances between vectors. The Euclidean distance between a pair of row vectors x and y is given as:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (8)$$
For 2-dimensional x and y, the Euclidean distance can also be represented as:
$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$
The distance matrix from (8) is then used to generate a linkage matrix. Linkage determines the proximity of two clusters. For a large number of samples, it is favorable to apply complete linkage with most of the distance measures, as indicated in (Saraçli et al., 2013). Clustering starts with computing the distance matrix between data points and repeatedly merging the two closest clusters until a single cluster is formed. For the implementation, we have used the Python SciPy library (https://www.scipy.org/) along with the Python machine learning package scikit-learn (https://scikit-learn.org/stable/).
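To make the merge procedure concrete, the following is a deliberately naive pure-Python sketch of agglomerative clustering with Euclidean distance and complete linkage. It is not the SciPy routine used in the experiments, it runs in cubic time, and the four sample points are invented for illustration; in practice the vectors would be the TF-IDF rows of the context documents.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def complete_linkage(points):
    """Naive agglomerative clustering; returns the list of merges.

    Each merge is (members_a, members_b, distance), where distance is the
    largest pairwise distance between the two clusters (complete linkage).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Drop the two merged clusters and append their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Two well-separated pairs of points (illustrative data only).
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = complete_linkage(points)
```

The list of merges plays the role of a linkage matrix: each entry records which clusters were joined and at what height, which is the information a dendrogram draws. With SciPy available, `scipy.cluster.hierarchy.linkage(X, method="complete", metric="euclidean")` performs the same kind of clustering far more efficiently.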

Experimental Results
In order to evaluate the model proposed in Section Methodology, we have used an already extracted set of tweets as the data source. Updates to Twitter API policies in July 2018 and the subsequent difficulties with API rate limits faced in our prior work (Rao et al., 2018) made us choose standardized, readily available data. Existing data has been used so that more focus could be given to extracting the tweets' context information.

Datasets
We have considered two different datasets: one contains tweets specific to trends and the other is a stream of tweets.

Contexts and Similarity
The "#CatalanReferendum" trend from the 1st of October is selected for generating contexts. In total, 1297 unique texts are tagged to this trend. The extracted tweets are randomly split into 5 equal sized chunks. These 5 sub-lists are used as input for the context generator, and for each sub-list one document stores the primary contexts and another stores the extended contexts. Merging these two documents for each sub-list results in 5 more documents, leading to 3 different sets of 5 documents each, containing primary, extended and (primary + extended) contexts.
Similarity scores for all 3 sets of contexts are given in Table 1. The outcome shows that either extended contexts or both contexts together give much better results than using only the primary contexts. A random split of a specific trend might result in disjoint sets. When disjoint sets are compared, the similarity measure after extending is lower, e.g., Doc 1 v/s Doc 4 in Table 1. For a better understanding of the results, Document 1 v/s the other documents is plotted in Fig. 5. Cosine similarity for all splits of the #CatalanReferendum trend is represented in Fig. 6. Better results are obtained with extended contexts in most scenarios and with both contexts in a few cases.
Dataset source (footnote q): http://followthehashtag.com/datasets/free-twitter-dataset-usa-200000free-usa-tweets/
We have also tested the model on a few more trends extracted from the dataset, outlined in Table 2. As with #CatalanReferendum, similarity scores for these trends are better with extended contexts and with both contexts.
To fine-tune performance, the number of threads used to process tweets is kept configurable. Figure 7 shows the time taken by the proposed system for the Cristiano Ronaldo trend consisting of 794 tweets. With 250 threads, the program took approximately 30 sec to execute. On average, 6000 tweets are posted every second. Applying a trend-specific filter to the live stream of tweets, we might end up with roughly 1000 tweets per trend for processing. The proposed system can be calibrated with an appropriate number of threads to complete processing within 30 sec. Running the framework on a multi-core server should provide even better scalability.
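The configurable thread count can be illustrated with a small Python analogue of the Java Executors setup. This is only a sketch under stated assumptions: `generate_contexts` is a hypothetical stand-in for the per-tweet calls to DBpedia Spotlight and the SPARQL endpoint, and `max_workers` is the tunable that corresponds to the number of threads discussed above.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_contexts(tweet):
    """Hypothetical stand-in for the per-tweet Spotlight and SPARQL calls."""
    return {"text": tweet, "primary": [], "extended": []}


def process_tweets(tweets, max_workers=8):
    """Run generate_contexts over all tweets with a configurable thread count."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input tweets
        return list(pool.map(generate_contexts, tweets))


results = process_tweets(["tweet one", "tweet two", "tweet three"], max_workers=2)
```

Because the per-tweet work is dominated by waiting on remote services, a thread pool of this kind can keep many requests in flight at once, which is why the thread count is a natural knob for calibration.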
When we compare our approach with (Vicient and Moreno, 2015), our effort overcomes the difficulties involved in semantic annotation of hashtags by using DBpedia. Having observed the classification results on DBpedia based enriched data in (Flisar and Podgorelec, 2018), we set out to experiment with clustering of contexts. Outcomes of the chosen hierarchical clustering are explained in the subsequent section.

Cluster Representation
Clustering was carried out on both datasets to observe what kind of patterns emerge. Some pre- and post-processing was needed while generating contexts:
• Removal of the RT prefix from tweets, as DBpedia treats it as a separate context and links it with the category RT (TV network)
• Generated contexts are URIs which might contain commas in the string label, e.g., Hilo, Hawaii. Therefore, a different delimiter had to be used to split the data to get the actual context labels
Unlike the step described in the previous section, context generation was carried out for the entire set of tweets and not for equal sized chunks.
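The two processing steps listed above can be sketched as follows. The regular expression and the choice of `|` as delimiter are our assumptions for illustration; the paper does not state which pattern or delimiter was actually used.

```python
import re

DELIMITER = "|"  # assumption: a character that does not occur in context labels


def strip_rt_prefix(tweet):
    """Remove a leading 'RT' retweet marker so it is not annotated as a context."""
    return re.sub(r"^\s*RT\b[\s:]*", "", tweet)


def join_labels(labels):
    """Serialize context labels with a delimiter that is safe for commas."""
    return DELIMITER.join(labels)


def split_labels(serialized):
    """Recover the list of context labels from the serialized form."""
    return serialized.split(DELIMITER) if serialized else []


cleaned = strip_rt_prefix("RT @user: Voting today")
labels = ["Hilo, Hawaii", "Catalonia"]
round_trip = split_labels(join_labels(labels))
# Splitting on commas instead would break "Hilo, Hawaii" into two labels
comma_split = "Hilo, Hawaii,Catalonia".split(",")
```

The word boundary in the pattern keeps words that merely start with "RT" intact, so only the standalone retweet marker is removed.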
Firstly, the #CatalanReferendum trend was sampled with the generated primary contexts. The resulting dendrogram is depicted in Fig. 8. Along the same lines, we experimented with clustering the extended contexts of #CatalanReferendum. The dendrogram for this, shown only for the last 10 cluster merges, is given in Fig. 9.
To get insights into the associated contexts, related context labels from the dendrogram are printed in a user friendly format in Fig. 10a and 10b. Although a trend is predominantly a single context, we observe one cluster showing other categories that emerged out of the contexts. The clusters formed with primary and extended contexts provide comprehensive information about the hidden patterns of the trend.
As a next step, we ran the same clustering on stream data. Here the objective was to find patterns within a corpus of unrelated tweets. We selected the first 1000 tweets from the stream, and Fig. 11 gives a visual representation of the dendrogram. Selected cluster labels of primary and extended contexts for the stream of tweets are provided in Fig. 12a and 12b.
Hierarchical clustering also provides the flexibility of inspecting clusters at any of the merged levels. This could be of great use if we have to drill down to specific levels to find clusters. Table 3 displays details of a few flattened clusters for the #CatalanReferendum trend.
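Flattening clusters at a chosen level can be illustrated by cutting the hierarchy at a distance threshold. The sketch below is a naive pure-Python version, not the routine used for Table 3: it merges clusters under complete linkage and stops as soon as the next merge would exceed `max_distance`, which for complete linkage gives the same groups as cutting the dendrogram at that height. The points and thresholds are invented for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def flat_clusters(points, max_distance):
    """Merge under complete linkage until the next merge exceeds max_distance."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # the cut height has been reached
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Two tight pairs and one distant point (illustrative data only).
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]]
fine = flat_clusters(points, 2.0)     # cut low in the hierarchy
coarse = flat_clusters(points, 10.0)  # cut higher up
```

Lower thresholds yield many small, specific clusters and higher thresholds yield fewer, broader ones, which is the drill-down behaviour described above. With SciPy available, `scipy.cluster.hierarchy.fcluster(Z, t, criterion="distance")` applies the same kind of cut to an existing linkage matrix.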

Conclusion and Future Work
In this work, a context modeling framework for tweets and the use of the generated contexts in clustering have been elaborated. Since we use DBpedia, a proven knowledge base that efficiently handles ambiguities in text contexts, we ensure that relevant contexts are generated. The similarity scores tabulated above demonstrate the quality of the disambiguated primary contexts and extended contexts. The architecture takes care of scalability aspects, handling fairly large datasets in multiple threads. In the second part of this paper, we have shown how this contextual information is useful for uncovering additional information about tweets. We have also presented flattened cluster data from various levels of the hierarchy. Many existing works focus on specific areas and fall short of a comprehensive solution. We wanted to build a generic approach and have presented it with two different types of tweet datasets. Overall, we have contributed a scalable framework built on open source knowledge bases and tools and have shown clustering of this contextual data, which can be a backbone for specialized problem domains.
In the future, we want to apply this model to cluster users and build a generic recommender system. Accurate user interests support the design of a strong recommender framework capable of suggesting tweets, topics, users or external content. Depending on the problem domain, we can pick either primary or extended contexts and configure the number of clusters to address over-recommendation or over-specialization issues. We also intend to test the prototype with real-time streaming data to design an end-to-end framework. We wish to extend the model to a specific domain and compare it with existing methods. Different combinations of distance measure and linkage can also be explored.

Author's Contributions
All the authors have contributed to the conceptualization, finalization of the approaches, write-up and review of the manuscript. Venkatesha and Prasanth have assisted with the implementation and experimentation of the proposed framework.

Ethics
This article is original and contains unpublished material. The corresponding author confirms that all of the other authors have read and approved the manuscript and there are no ethical issues involved.