An Application of Session Based Clustering to Analyze Web Pages of User Interest from Web Log Files

Problem statement: With the continued growth and proliferation of e-commerce, Web services and Web-based information systems, the volumes of click-stream and user data collected by Web-based organizations in their daily operations have reached astronomical proportions. Analyzing such data can help these organizations optimize the functionality of web-based applications and provide more personalized content to visitors. This type of analysis involves the automatic discovery of usage interest in web pages from data that is often stored in web and application server access logs. Approach: The usage interest in the web pages in various sessions was partitioned into clusters, such that sessions with "similar" interest were placed in the same cluster, using the expectation maximization clustering technique discussed in this study. Results: The approach results in the generation of usage profiles and the automatic identification of user interest in each profile. Conclusion: The results will help organizations improve their web sites based on users' navigational interest and recommend page(s) not yet visited by the user.


INTRODUCTION
With the continual growth of Web-based information systems, click-stream data (sequential series of pageview requests) and user data are collected by Web-based organizations in their daily operations. The need to understand such large amounts of data arises in all fields of business, science and engineering. The ability to extract useful knowledge hidden in these data and to act on that knowledge is an important strategic asset in today's competitive world. Capturing click-stream data helps to model and analyze the users' browsing behavior. This type of analysis requires the automatic discovery of meaningful relationships from a large collection of semi-structured data stored in the application server logs. The automatic discovery of usage profiles from click-stream data will enable organizations to improve their web site management and Web personalization by providing dynamic recommendations.
Web mining can be considered a special case of Knowledge Discovery in Databases (KDD) (Cooley et al., 1997;Srivatsava et al., 2000;Agarwal and Srikant, 1994;Mobasher et al., 2000b). Web usage mining deals with the automatic discovery and analysis of "interesting" patterns from clickstream and associated data collected during interactions with the Web server on one or more Web sites. To discover knowledge from the raw data, it is necessary to perform the following three steps:
• Data collection and preprocessing
• Pattern discovery
• Pattern analysis
This study focuses on the discovery of "similar" interests of groups of sessions based on navigation behavior. To achieve this:
• First, a usage model is developed based on the browsing behavior in various sessions
• Using this model, we learn "similar" interests of groups of sessions. This is called the aggregate usage profile
• Each usage profile consists of pages of varying user interest/significance
• In the proposed method, the significance of the pages in each profile is determined. These profiles would later help in various applications of web usage mining, such as web site improvement, assigning a new user to the appropriate cluster and recommending pages of interest not yet visited by the user to provide personalized web content
Related work: A lot of research is being done in the area of Web Usage Mining. Based on the goals of the analyst and the application, various algorithms can be applied for cluster analysis. On the whole, clustering is the process of grouping samples into clusters such that samples within a cluster are highly similar to each other but dissimilar to samples in other clusters. As mentioned in (Mobasher et al., 2000a), two types of clustering can be performed on usage data: Transaction Clusters and Page Clusters.
Each type of clustering is helpful in different applications such as personalization and recommendation, system improvement, web site structure, business intelligence and user behavior. Similarity measures form the core component for every clustering algorithm.
For several years, cluster analysis has focused mainly on distance-based methods. There are various distance-based similarity measures such as the Euclidean distance, Manhattan distance, Minkowski distance, Mutual Neighbor Distance (MND), Simple Matching Coefficient, Jaccard Coefficient and Rao's coefficient. The choice of similarity measure depends on the features of the samples. However, in (Chaofeng, 2009), a Sequence Alignment Method has been used for measuring similarities between web pages by considering the URL and the viewing time of the URL. The proposed algorithm for Web Session Clustering Based on Increase of Similarities (WSCBIS) has been implemented along with k-means clustering and Robust Clustering using linKs (ROCK), showing a decrease in time and space complexity. Clustering of URLs using the Sequence Alignment Method (SAM) has also been studied in (Hay et al., 2004). In that study, Hay et al. (2004) clustered web users using two different similarity measures: SAM (a non-Euclidean distance-based measure) and an Association measure (a Euclidean distance-based measure). The sequential order of pages is taken into consideration but not the position of the pages. As mentioned in (Hay et al., 2004), sequences with the same elements occurring in the same order, irrespective of the positions of the elements, are called open sequences. For example, the open sequence (1, 3, 5) occurs in the sequences (4, 1, 2, 3, 6, 5), (1, 2, 3, 4, 2, 5) and (3, 1, 3, 5, 2). Unlike most research, where users are grouped into clusters with similar pages, it was shown that SAM retrieves sequences not only with similar pages but also with a similar order of pages, in contrast to the association measure, which is Euclidean distance-based. Hence users are clustered based on the sequential order of their web navigation.
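The open-sequence definition above can be illustrated with a short check. This is a minimal sketch and not the Sequence Alignment Method itself; the function name `contains_open_sequence` is a hypothetical helper introduced only for illustration.

```python
from typing import Sequence


def contains_open_sequence(pattern: Sequence[int], session: Sequence[int]) -> bool:
    """Return True if `pattern` occurs in `session` in the same order, gaps allowed."""
    remaining = iter(session)
    # `page in remaining` consumes the iterator up to the first match,
    # so each later pattern element can only match a later position.
    return all(page in remaining for page in pattern)


# The three sessions from the example in the text.
sessions = [(4, 1, 2, 3, 6, 5), (1, 2, 3, 4, 2, 5), (3, 1, 3, 5, 2)]
matches = [contains_open_sequence((1, 3, 5), s) for s in sessions]
print(matches)
```

Only the relative order matters here: the same pages in a different order, such as (5, 3, 1), would not match the first session.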
Stochastic methods have been proposed for clustering user transactions for the purpose of user modeling. Since each user may reveal different types of navigation behavior, the patterns should also capture the overlapping interests of these users.
Mixture models are able to capture complex, dynamic user behavior (Cooley et al., 1997). To determine user behavior in web usage mining systems, (Mustapha et al., 2009) deals with a model-based clustering method using the Expectation-Maximization (EM) algorithm, which is an extension of the k-means algorithm. The EM algorithm is used for finding the parameter estimates in probabilistic models. The EM algorithm has been compared with the k-means algorithm and the results showed an improvement in accuracy. A variant of model-based clustering has been presented in (Pallis et al., 2005), in which model-based clustering schemes are interpreted and visualized using the concept of Correspondence Analysis (CO-AN). User sessions are clustered with a first-order Markov model using the EM algorithm. CO-AN is a multivariate statistical analysis method used to interpret and visualize Web users' navigation patterns. This helps commercial Web sites to understand customer behavior and provides scope for site improvement.
Similar research on model-based clustering has been done in (Igor et al., 2003) based on a first-order Markov model and using a visualization tool, Web Canvas. The results have shown that learning time scales linearly with sample size using model-based clustering, compared to agglomerative distance-based methods in which the learning time scales quadratically with sample size. In addition to the discovery of navigation patterns, prediction of future navigation behavior has been included in (Borges and Levene, 2008). Different scoring metrics such as the hit and miss score, the mean absolute error and the ignorance score have been employed to determine the quality of prediction. In (Lee and Fu, 2008a), two levels of prediction of users' browsing behavior have been proposed. Using a Markov model, browsing behavior is predicted at the category level and using Bayes' theorem, prediction is done at the web page level. The combination of the Markov model and Bayes' theorem results in a two-level prediction of users' browsing behavior. The results showed that the hit ratio is effective and accurate at both levels. An extension of (Lee and Fu, 2008a) has been presented in (Lee and Fu, 2008b), in which the overlapping or heterogeneous nature of users' behavior and an improvement in hit ratio have been considered.
Fuzzy Relational Clustering Algorithms have been applied for web usage mining (Krishnapuram et al., 2001;Labroche, 2007). Clustering of relational data using fuzzy approaches has been implemented using Fuzzy C-Medoids (FCMdd) and Robust Fuzzy C-Medoids (RFCMdd) in (Krishnapuram et al., 2001). These algorithms have been applied in Web Usage Mining for discovering user profiles. Similar research has been performed in (Labroche, 2007) for discovery of user profiles using Ant clustering algorithm and a linear version of Fuzzy c-Medoids.
Another approach to observing path traversal and clustering based on that data is advanced by Shahabi et al. (1997). The basic approach there is to define a path similarity measure for a given Web site. Then, the logged data about a user's paths is clustered using a simple K-means algorithm to aggregate users into groups. However, it is not clear how the similarity metric is devised and whether it can produce meaningful clusters. Approaches, essentially based on the association rule ideas (Agarwal and Srikant, 1994;Etzioni, 1996), have been proposed in (Cooley et al., 1997). However, these approaches assume that logs contain user IDs, which is not common in the real world except in the rare case that the ident protocol is used and the clients are agreeable to release the user names. A related topic that has been recently gaining momentum is the idea that we can learn much about users and customers by tracking and analyzing their clickstreams, which is of great importance in e-commerce.

MATERIALS AND METHODS
Data preprocessing: An important task in any web mining application is the collection of the target data set to which mining techniques can be applied. Data preprocessing is a prerequisite and time-consuming phase before the data can be mined to obtain useful and interesting patterns. As mentioned in (Suresh and Padmajavalli, 2006;Cooley et al., 1997;Srivatsava et al., 2000), data preprocessing involves data cleaning, user identification and session identification, resulting in the creation of a user session file. A user session file is a collection of pageviews grouped by user sessions. Each user session can be considered in two ways: (i) a single transaction of many page references or (ii) a set of transactions each consisting of a single page reference. In this study, each session is considered as a single transaction consisting of a set of pageviews navigated by a user during a single visit to the site. Thus the session file consists of users' requests for pages $P = \{p_1, p_2, p_3, \ldots, p_n\}$ and a set of m sessions, $S = \{s_1, s_2, s_3, \ldots, s_m\}$, where each $s_i \in S$ is a subset of P.
Example: 2 1 3 2 1 represents a session consisting of a sequence of page requests.
Conceptually, each page is associated with a weight representing its significance. The weights can be determined in various ways depending on the type of analysis. In most Web Usage Mining tasks, the weights may be based on a combination of factors such as the time that the user has spent on a page visited, number of visits to the page and size of the page.
Consider the web as a directed graph G, where a node $p_i$ represents a web page visited and the edge $e_i$ represents the successive linear path to $p_i$ followed by the user. A session can be represented graphically as in Fig. 1.
In the context of web usage mining, a modified version of the PageRank algorithm is used for assigning weights to the pages of a session based on the navigational behavior. Since the path is linear (no parallel paths), each edge has a weight of 1. Hence the weight of a node $p_i$ is the sum of the weights of its incoming edges, i.e., its in-degree. From these weights, a session-pageview matrix is obtained. Each row represents a session and each column holds the frequency of occurrence of a pageview in the session. This is represented in Table 1, in which the first row represents the page id. In our approach, the weight of the pageview is further determined by evaluating the importance of a page in terms of the ratio of the frequency of visits to the page to the overall page visits in a session. A numerical weight is assigned to each pageview visited with the purpose of "measuring" its relative importance/interest within the session. If the page has not been visited, the weight of the page is 0. Page visits repeated consecutively have been treated as a single visit to that page. The weights have been normalized to account for variances.
The above session file is represented using the vector space model. Each session $s_i$ is modeled as a vector over the n-dimensional space of pageviews: $s_i = \{pf_1, pf_2, pf_3, \ldots, pf_n\}$, where each $pf_j$ is the relative frequency of pageview j in session i. This type of weight normalization is referred to as transaction normalization, which is beneficial since it captures the relative importance/interest of the pageview in a session.
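The transaction normalization described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: page ids run from 1 to n, consecutive repeats have already been suppressed during preprocessing, and `session_vector` is a hypothetical helper name.

```python
from typing import List, Sequence


def session_vector(session: Sequence[int], n_pages: int) -> List[float]:
    """Relative frequency of each page id (1..n_pages) within one session."""
    counts = [0] * n_pages
    for page in session:
        counts[page - 1] += 1  # page ids are assumed to start at 1
    total = len(session)
    # An unvisited page keeps weight 0; an empty session yields all zeros.
    return [c / total if total else 0.0 for c in counts]


# A session over 17 pages that visits page 1 twice and page 2 twice.
vec = session_vector([1, 2, 1, 2], n_pages=17)
print(vec[:3])
```

Because each weight is divided by the session length, the weights of a non-empty session always sum to 1.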
Example 1: Suppose there are 17 pages and the user has visited page 1 twice (Table 2), page 2 twice and none of the other pages in session 1; then $s_1 = (2/4, 2/4, 0, 0, 0, \ldots, 0)$.
Pattern discovery-model-based session clustering: Let $S = \{s_1, s_2, s_3, \ldots, s_m\}$ be a set of m objects where each object is represented by a vector of pageviews. Our goal is to obtain sessions with "similar" interests. In contrast to other clustering methods such as partitional and hierarchical clustering, in which the similarity measure is distance-based, model-based clustering employs a probability-based approach. The probability-based clustering approach is built on a finite mixture model. Mixture models are able to capture more complex, dynamic user behavior.
A mixture is a set of k probability distributions, each of which governs the distribution of attribute values within a cluster. Each individual distribution is referred to as a component distribution and is assumed to follow a normal distribution. Each cluster is represented by a probability model. Model-based clustering methods optimize the fit between the given data and a mathematical model.

A popular model-based clustering method is the Expectation-Maximization (EM) algorithm, based on Bayesian probability theory, which is an extension of the k-means algorithm. The Expectation Maximization algorithm is an efficient iterative method to determine the Maximum Likelihood (ML) estimate when missing or hidden data exists.
The EM algorithm is an iterative refinement algorithm that can be used for finding the mean and standard deviation parameter estimates. The expectation maximization algorithm assigns each object to a cluster according to a weight representing the probability of membership. Therefore, new means are computed based on weighted measures.
Each iteration of the EM algorithm (Han and Kamber, 2006) consists of two processes:
• E-step: Each object $x_i$ is assigned to cluster $C_k$ based on Bayesian probability. This is achieved using the conditional expectation:
$$P(x_i \in C_k) = p(C_k \mid x_i) = \frac{p(C_k)\, p(x_i \mid C_k)}{p(x_i)}$$
The missing data are estimated given the observed data and the current estimate of the model parameters. The probability of cluster membership of object $x_i$ is calculated for each of the clusters. These probabilities are the "expected" cluster memberships for object $x_i$
• M-step: Assuming that the missing data are known, the likelihood function is maximized. The probability estimates from the E-step are used to re-estimate the model parameters:
$$m_k = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i\, P(x_i \in C_k)}{\sum_{j} P(x_i \in C_j)}$$
where $m_k$ is the mean of cluster $C_k$ and n is the number of objects. This step is the "maximization" of the likelihood of the distributions given the data.
The algorithm iterates until the parameters reach a stable convergence point or until the likelihood reaches its maximum. The essence of the EM algorithm is that, at every iteration, maximizing the conditional expectation leads to an increase of the log likelihood of the observed data. This determines the number of clusters.
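The E-step and M-step described above can be illustrated with a small, self-contained sketch. It is not the Weka implementation used in the experiments. It assumes one-dimensional data, a fixed number of clusters, and the common weighted-mean form of the M-step. The names `em_1d`, `init_std` and `min_std` are introduced here for illustration only.

```python
import math
from typing import List, Tuple


def normal_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_1d(data: List[float], init_means: List[float], init_std: float,
          n_iter: int = 20, min_std: float = 1e-3
          ) -> Tuple[List[float], List[float], List[float], List[float]]:
    """EM for a 1-D Gaussian mixture.

    Returns (priors, means, stds, log-likelihood trace).
    """
    k = len(init_means)
    means = list(init_means)
    stds = [init_std] * k
    priors = [1.0 / k] * k
    trace: List[float] = []
    for _ in range(n_iter):
        # E-step: membership probability of every object in every cluster,
        # computed with Bayes' rule from the current parameters.
        resp = []
        loglik = 0.0
        for x in data:
            joint = [priors[c] * normal_pdf(x, means[c], stds[c]) for c in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        trace.append(loglik)
        # M-step: re-estimate priors, means and standard deviations
        # from the objects weighted by their membership probabilities.
        for c in range(k):
            weight = sum(r[c] for r in resp)
            priors[c] = weight / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / weight
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / weight
            stds[c] = max(math.sqrt(var), min_std)  # guard against collapse
    return priors, means, stds, trace


# Toy data with two well-separated groups (illustrative values only).
data = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
priors, means, stds, trace = em_1d(data, init_means=[0.0, 1.0], init_std=0.2)
print([round(m, 3) for m in means])
```

The log-likelihood trace recorded at each E-step should not decrease from one iteration to the next, which is the property the text refers to.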

Pattern analysis: The web log files of the msnbc.com web site have been used for this research. This data set is publicly available through the UCI KDD Archive (2005) at the University of California. The data set includes the page visits of users who visited the "msnbc.com" web site on 28/9/99. The visits are recorded at the level of URL category (for example sports, news and so on) and are recorded in time order. The data includes visits to 17 categories (i.e., 17 distinct pageviews). The data is obtained from IIS logs for msnbc.com and news-related portions of msn.com. Client-side data is not available in the web log files. Each sequence in the dataset corresponds to a user's requests for pages.
In a sample of this data, each record is a session and each row is a sequence of visits. The first row indicates that the user has visited category 1 twice. The second row indicates that a user has visited category 2 once. The third row indicates that the user visited category 3 once, then category 2 consecutively, then category 4 once and finally category 2 three times consecutively. About 10000 sessions have been selected randomly for this experiment. A portion of the dataset is as follows: 1 6 1 1 6 11 1 11 1 11 1 14 1 12 1 2 1 8 1 1 7 1 4 4. After suppressing the page visits repeated consecutively in a session (Table 3) using a shell script in Linux, the sample dataset is as follows: 1 6 1 1 6 11 1 11 1 11 1 14 1 12 1 2 1 8 1 1 7 1 4. Each page has been given a numerical weight for each session. This indicates the relative importance of each page in the session. If the page has not been visited, its weight is 0. This is represented in Table 4.
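The suppression of consecutively repeated page visits was done with a shell script in the study. An equivalent step can be sketched in Python as follows, assuming one session per line with space-separated category ids; the function name and the sample row are illustrative.

```python
from itertools import groupby


def suppress_consecutive_repeats(line: str) -> str:
    """Collapse runs of the same page id in one session line into a single visit."""
    pages = line.split()
    # groupby yields one group per run of equal adjacent items.
    return " ".join(page for page, _run in groupby(pages))


# Illustrative session: category 2 is repeated consecutively in two places.
print(suppress_consecutive_repeats("3 2 2 4 2 2 2"))
```

Non-adjacent revisits are kept, so a page that is visited again after another page still counts as a separate visit.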
The Weka tool has been used for the experimental evaluation. A model is estimated from the available samples in the dataset, which are generally split into training and testing sets. The model is first built using the training samples and then evaluated based on its performance on the test samples. In the proposed approach, the dataset has been partitioned into 60% training data and 40% test data. The Expectation Maximization clustering algorithm has been applied. The experiment completed within 10 iterations, resulting in 9 clusters with a maximum likelihood estimate of 24.95867. Each cluster represents sessions of "similar" interest in the web pages, i.e., a usage profile. Hence an aggregate usage profile is determined for each cluster. This is represented in Table 5.
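One way to compute an aggregate usage profile from clustered sessions is sketched below. It assumes that the aggregate weight of a page in a cluster is the mean of that page's weights over the sessions assigned to the cluster, i.e., the cluster centroid. That is an assumption of this sketch, not a formula taken from the study, and the function name and inputs are hypothetical.

```python
from typing import Dict, List, Sequence


def aggregate_profiles(vectors: Sequence[Sequence[float]],
                       labels: Sequence[int]) -> Dict[int, List[float]]:
    """Mean page weight per cluster (assumed definition of the aggregate profile)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, w in enumerate(vec):
            acc[i] += w
        counts[label] = counts.get(label, 0) + 1
    return {label: [w / counts[label] for w in acc] for label, acc in sums.items()}


# Three toy session vectors over two pages, with hypothetical cluster labels.
profiles = aggregate_profiles([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]], [0, 0, 1])
print(profiles)
```

If every session vector sums to 1, each profile computed this way also sums to 1.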

Evaluation methods for session clustering: The Precision P(i,j), Recall R(i,j) and Purity evaluation measures of each cluster j for each web page i are calculated. The Precision measure is given by:
$$P(i,j) = \frac{w_{ij}}{\sum w_j}$$
where $w_{ij}$ represents the aggregate weight (user interest) of page i in cluster j and $\sum w_j = \sum_i w_{ij}$ is the total weight of cluster j. Since the weights have been normalized to [0,1], the weight of the cluster $\sum w_j$ is always equal to 1. Hence Table 5 also represents the Precision measure of page i in cluster j. The modified Recall measure is given by:
$$R(i,j) = \frac{w_{ij}}{\sum_{j=0}^{n} w_{ij}}$$
where the sum in the denominator runs over all clusters. This is represented in Table 6. Purity is a simple and transparent evaluation measure. It represents the portion of the cluster corresponding to the largest aggregate weight of a page with respect to the cluster:
$$\mathrm{Purity}(j) = \frac{\max_i w_{ij}}{\sum w_j}$$
This is represented in Table 7. The average purity of the clustering is the ratio of the sum of the purity values of all the clusters to the total number of clusters. It is found to be 61%. The larger the purity, the better the clustering result.
In Table 3, each cluster is represented by its cluster centroid. The centroid represents the mean values of the web pages contained in each cluster. The aggregate usage profile is represented in Fig. 2, which shows the user interest in the web pages. The centroids enable us to describe each cluster by assigning it a name. Among all the clusters, using the Recall measure of each cluster for each web page shown in Table 7, we observe that the maximum recall of web page 1 (Frontpage) is present in Cluster6. Similarly, for web page 2, it occurs in Cluster2. For web page 3, the maximum rating occurs in Cluster4. For web page 8, the maximum rating appears in Cluster0. From these observations, we infer that sessions in Cluster0 are interested in obtaining information about weather. Hence Cluster0 can be labeled as "Weather". This is represented in Fig. 3. However, sessions in Cluster1 indicate that users are randomly scanning the web pages. Hence Cluster1 may be labeled as "Random Surfers", as depicted in Fig. 4.
Cluster2 is characterized by interest in web pages 2 (News) and 13 (Summary), as shown in Fig. 5. Hence Cluster2 can be labeled as "News". Cluster3 is focused on web pages 12 and 14. Hence this cluster can be labeled as "Sports", as shown in Fig. 6. Cluster4 can be labeled as "Technology", as shown in Fig. 7. Cluster5 can be labeled as "Opinion", as represented in Fig. 8. Cluster6 is focused on "Frontpage", as seen in Fig. 9. Cluster7 can be labeled as "On-air" (Fig. 10). Cluster8 can be labeled as "Miscellaneous" (Fig. 11). A visual representation of the users' interest in the web pages is shown in Fig. 2-9. From this one can conclude that the user interests are either uni-focused (for instance Cluster0) or multi-focused (Cluster1). Data preprocessing has been done extensively in this work with respect to the user's interest in each session. The application of the Expectation Maximization algorithm to web usage mining, and the use of statistical techniques to draw conclusions from it, have been widely reported in the literature. This article deviates from the conventional statistical methods of interpretation and employs the Aggregate Usage Profile and a modified Recall measure for analyzing the user interest in various clusters. Further cluster evaluation measures, namely Precision and Purity, have also been adopted.
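The Precision, modified Recall and Purity measures used in the evaluation can be sketched as follows. The sketch assumes that the profiles are stored as a mapping from cluster id to a list of aggregate page weights, and that recall normalizes a page's weight over all clusters. The function names and the toy profiles are illustrative and are not values from the study.

```python
from typing import Dict, List

Profiles = Dict[int, List[float]]


def precision(profiles: Profiles, page: int, cluster: int) -> float:
    """Weight of the page relative to the total weight of the cluster."""
    total = sum(profiles[cluster])
    return profiles[cluster][page] / total if total else 0.0


def recall(profiles: Profiles, page: int, cluster: int) -> float:
    """Weight of the page in this cluster relative to its weight over all clusters."""
    total = sum(weights[page] for weights in profiles.values())
    return profiles[cluster][page] / total if total else 0.0


def purity(profiles: Profiles, cluster: int) -> float:
    """Share of the cluster weight held by its most significant page."""
    total = sum(profiles[cluster])
    return max(profiles[cluster]) / total if total else 0.0


def average_purity(profiles: Profiles) -> float:
    """Sum of the purity values divided by the number of clusters."""
    return sum(purity(profiles, c) for c in profiles) / len(profiles)


# Toy profiles for two clusters over two pages (page indices start at 0 here).
toy = {0: [0.75, 0.25], 1: [0.0, 1.0]}
print(precision(toy, 0, 0), recall(toy, 1, 1), average_purity(toy))
```

With profiles that already sum to 1, precision equals the stored aggregate weight, which matches the observation that the profile table also gives the Precision measure.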

CONCLUSION
In this study, the authors have suggested a probabilistic approach for grouping web usage transactions and generating user profiles. First, we mapped the frequency of page visits to the relative user interest within a session and obtained a weighted session-pageview matrix. Then the model-based Expectation Maximization clustering algorithm was employed to generate clusters, or usage profiles. The aggregate usage profile has been used to analyze user interest in the web pages. Experiments were performed on a real-world data set to discover the user interest in the web site. The results will help organizations improve their web sites based on users' navigational interest and recommend page(s) not yet visited by the user. Future research will focus on using other methods for grouping web pages of user interest.