Mining Fuzzy Weighted Browsing Patterns from Time Duration and with Linguistic Thresholds

Abstract: World-wide-web applications have grown very rapidly and have made a significant impact on computer systems. Among them, web browsing for useful information is perhaps the most common. Because of its heavy use, efficient and effective web retrieval has become a very important research topic. Web-mining techniques have thus been developed for this purpose. In this research, a new fuzzy weighted web-mining algorithm is proposed, which processes web-server logs to discover useful browsing behaviors of users from the time durations of the pages browsed. Since the time durations are numeric, fuzzy concepts are used to process them and to form linguistic terms. In addition, different web pages may have different importance. The importance of web pages is evaluated by managers as linguistic terms, which are then transformed and averaged as fuzzy sets of weights. Each linguistic term is then weighted by the importance of its page. Only the linguistic term with the maximum cardinality for a page is chosen in later mining processes, thus reducing the time complexity. The minimum support is given as a linguistic value, which is more natural and understandable for human beings. An example is given to clearly illustrate the proposed approach.


INTRODUCTION
World-wide-web applications have recently grown very rapidly and have made a significant impact on computer systems. Among them, web browsing for useful information is perhaps the most common. Because of its heavy use, efficient and effective web retrieval has become a very important research topic. Web-mining techniques have thus been developed for this purpose. Cooley et al. divided web mining into two classes: web-content mining and web-usage mining [7]. Web-content mining focuses on information discovery from sources across the world-wide-web. Web-usage mining, on the other hand, emphasizes the automatic discovery of user access patterns from web servers [8].
In the past, all the web pages were usually assumed to have the same importance in web mining. Different web pages in a web site may, however, have different importance to users in real applications. For example, a web page with merchandise items on it may be more important than one with a general introduction. Also, a web page with expensive merchandise items may be more important than one with cheap items. In addition, the time durations of the pages browsed are an important feature in analyzing users' browsing behavior. In this research, we thus attempt to mine fuzzy weighted browsing patterns from the browsing time of customers on each web page. The minimum support is given as a linguistic value, which is more natural and understandable for human beings. Since the time durations are numerical and the page importance and the minimum support are linguistic, fuzzy-set concepts are used to process them.
Fuzzy-set theory has been used more and more frequently in intelligent systems because of its simplicity and similarity to human reasoning [20,21]. The theory has been applied in fields such as manufacturing, engineering, diagnosis and economics, among others [11,15,17]. Several fuzzy learning algorithms for inducing rules from given sets of data have been designed and used to good effect in specific domains [2,4,9,10,18]. Some fuzzy mining approaches were proposed in [5,13,16,19].

REVIEW OF RELATED MINING APPROACHES
Agrawal and Srikant proposed a mining algorithm to discover sequential patterns from a set of transactions [1]. Five phases are included in their approach. In the first phase, the transactions are sorted first by customer ID as the major key and then by transaction time as the minor key. This phase thus converts the original transactions into customer sequences. In the second phase, the set of all large itemsets is found from the customer sequences by comparing their counts with a predefined support parameter α. This phase is similar to the process of mining association rules. Note that when an itemset occurs more than once in a customer sequence, it is counted only once for that customer sequence. In the third phase, each large itemset is mapped to a contiguous integer and the original customer sequences are transformed into the mapped integer sequences. In the fourth phase, the set of transformed integer sequences is used to find large sequences among them. In the fifth phase, the maximally large sequences are derived and output to users.
In addition, Hong et al. proposed a fuzzy mining algorithm to mine fuzzy rules from quantitative data [14]. They transformed each quantitative item into a fuzzy set and used fuzzy operations to find fuzzy rules. Cai et al. [3] proposed weighted mining to reflect the different importance of different items. Each item was assigned a numerical weight given by users. Weighted supports and weighted confidences were then defined to determine interesting association rules. Yue et al. [19] then extended their concepts to fuzzy item vectors.

NOTATION
The notation used in this research is defined as follows.
n: The total number of log records
c: The total number of clients
m: The total number of web pages
d: The total number of managers
l: The total number of fuzzy regions
$D_i$: The browsing sequence of the i-th client, 1≤i≤c
$n_i$: The number of log data in $D_i$, 1≤i≤c
$D_{id}$: The d-th log transaction in $D_i$, 1≤d≤$n_i$
$I_g$: The g-th web page, 1≤g≤m
$count_{gk}$: The count of region $R_{gk}$
$max\text{-}count_g$: The maximum count value among all $count_{gk}$ values for page $I_g$
$max\text{-}R_g$: The fuzzy region of page $I_g$ with $max\text{-}count_g$
$W_{gh}$: The transformed fuzzy weight for the importance of page $I_g$, evaluated by the h-th manager, 1≤h≤d
$W_g^{ave}$: The fuzzy average weight for the importance of page $I_g$
u: The total number of membership functions for item importance
$I_t$: The t-th membership function of item importance, 1≤t≤u
$I^{ave}$: The fuzzy average weight of all possible linguistic terms of item importance
$wsup_g$: The fuzzy weighted support of page $I_g$
α: The predefined linguistic minimum support value
minsup: The transformed fuzzy set from the linguistic minimum support value α
wminsup: The fuzzy weighted set of minimum supports
$C_r$: The set of candidate weighted sequences with r linguistic terms
$L_r$: The set of large weighted sequences with r linguistic terms

THE PROPOSED ALGORITHM
Since the time durations of the pages browsed are numeric, fuzzy concepts are used to process them and to form linguistic terms. Each web page uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of original web pages. The algorithm thus focuses on the most important linguistic terms, which reduces its time complexity.
The importance of web pages is considered and represented as linguistic terms. The proposed fuzzy weighted web-mining algorithm uses the set of membership functions for importance to transform managers' linguistic evaluations of the importance of web pages into fuzzy weights. The fuzzy weights of web pages from different managers are then averaged. The algorithm then calculates the weighted supports of the linguistic terms of web pages from browsing sequences. Next, the given linguistic minimum support value is transformed into a fuzzy set of numerical minimum support values. All fuzzy weighted large 1-sequences can thus be found by comparing the fuzzy weighted support of the representative linguistic term of each web page with the fuzzy minimum support. Fuzzy ranking techniques can be used for this comparison. After that, candidate 2-sequences are formed from fuzzy weighted large 1-sequences and the same procedure is used to find all fuzzy weighted large 2-sequences. This procedure is repeated until all fuzzy weighted large sequences have been found. Details of the proposed mining algorithm are described below.

The algorithm
Input: A set of n web log records, a set of m web pages with their importance evaluated by d managers, three sets of membership functions, respectively for browsing duration, web page importance and minimum support and a pre-defined linguistic minimum support value α.
Output: A set of fuzzy weighted browsing patterns.
Step 1: Select the records with file names including .asp, .htm, .html, .jva, .cgi and closing connection from the log data; keep only the fields date, time, client-ip and file-name.
Step 2: Transform the client-ips into contiguous integers (called encoded client ID) for convenience, according to their first browsing time. Note that the same client-ips with two closing connections are given two integers.
Step 3: Sort the resulting log data first by encoded client ID and then by date and time.
Step 4: Calculate the time durations of the web pages browsed by each encoded client ID from the time interval between a web page and its next page.
Step 5: Form a browsing sequence $D_i$ for each client $c_i$ by sequentially listing his/her $n_i$ tuples (web page, duration), where $n_i$ is the number of web pages browsed by client $c_i$. Denote the d-th tuple in $D_i$ as $D_{id}$.
Step 6: Transform the duration value $v_{id}^{g}$ of the web page $I_g$ in $D_{id}$ into a fuzzy set $f_{id}^{g}$, represented as
$$f_{id}^{g} = \frac{f_{id}^{g1}}{R_{g1}} + \frac{f_{id}^{g2}}{R_{g2}} + \cdots + \frac{f_{id}^{gl}}{R_{gl}}$$
using the given membership functions for the browsing duration of web pages, where $I_g$ is the g-th web page, $R_{gk}$ is the k-th fuzzy region of page $I_g$, $f_{id}^{gk}$ is the fuzzy membership value of $v_{id}^{g}$ in region $R_{gk}$ and l is the number of fuzzy regions.
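Step 7: Find the membership value $f_i^{gk}$ of each region $R_{gk}$ in each browsing sequence $D_i$ as the maximum over the tuples of $D_i$ that contain page $I_g$:
$$f_i^{gk} = \max_{1 \le d \le n_i} f_{id}^{gk}$$
Step 8: Calculate the scalar cardinality of each region $R_{gk}$ over all the browsing sequences as its count value:
$$count_{gk} = \sum_{i=1}^{c} f_i^{gk}$$
where c is the number of browsing sequences.
Step 9: Find $max\text{-}count_g = \max_{1 \le k \le l} count_{gk}$ for 1≤g≤m, where m is the number of web pages in the log data and l is the number of linguistic regions for web page $I_g$. Let $max\text{-}R_g$ be the region with $max\text{-}count_g$ for web page $I_g$. The region $max\text{-}R_g$ will be used to represent the fuzzy characteristic of web page $I_g$ in later mining processes.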
Step 10: Transform each linguistic term of the importance of the web page $I_g$, which is evaluated by the h-th manager, into a fuzzy set $W_{gh}$ of weights using the given membership functions of item importance, 1≤g≤m, 1≤h≤d.
Step 11: Calculate the fuzzy average weight $W_g^{ave}$ of each web page $I_g$ by fuzzy addition as:
$$W_g^{ave} = \frac{1}{d} \sum_{h=1}^{d} W_{gh}$$
Step 12: Calculate the fuzzy weighted support $wsup_g$ of the representative region for each web page $I_g$ as:
$$wsup_g = \frac{max\text{-}count_g \times W_g^{ave}}{c}$$
where c is the number of clients.
Step 13: Transform the given linguistic minimum support value α into a fuzzy set (denoted minsup) of minimum supports, using the given membership functions for minimum supports.
Step 14: Calculate the fuzzy weighted set (wminsup) of the given minimum support value as:
wminsup = minsup × (the gravity of $I^{ave}$),
where
$$I^{ave} = \frac{1}{u} \sum_{t=1}^{u} I_t$$
with u being the total number of membership functions for item importance and $I_t$ being the t-th membership function. $I^{ave}$ thus represents the fuzzy average weight of all possible linguistic terms of importance.
Step 15: Check whether the weighted support ($wsup_g$) of the representative region for each web page $I_g$ is larger than or equal to the fuzzy weighted minimum support (wminsup) by fuzzy ranking. Any fuzzy ranking approach can be applied here as long as it can generate a crisp rank. If $wsup_g$ is equal to or greater than wminsup, put $I_g$ in the set of large 1-sequences $L_1$.
Step 16: Set r = 1, where r is used to represent the number of the linguistic items kept in the current large sequences.
Step 17: Generate the candidate set $C_{r+1}$ from $L_r$ in a way similar to that in the AprioriAll algorithm [1]. Restated, the algorithm first joins $L_r$ with itself, under the condition that r-1 linguistic terms in the two sequences are the same and in the same order. Different permutations represent different candidates. The algorithm then keeps in $C_{r+1}$ the sequences which have all their sub-sequences of length r existing in $L_r$.
Step 18: Do the following substeps for each newly formed (r+1)-sequence s with linguistic web browsing pattern $(s_1, s_2, \ldots, s_{r+1})$:
• Find the fuzzy weighted count ($wf_{is}$) of s in each browsing sequence $D_i$, using the minimum operator for the intersection of its weighted linguistic terms.
• Calculate the fuzzy weighted support of s as $wsup_s = \frac{1}{c} \sum_{i=1}^{c} wf_{is}$, where c is the number of clients.
• Check whether the weighted support ($wsup_s$) of sequence s is greater than or equal to the fuzzy weighted minimum support (wminsup) by fuzzy ranking. If $wsup_s$ is greater than or equal to wminsup, put s in the set of large (r+1)-sequences $L_{r+1}$.
Step 19: If $L_{r+1}$ is null, then do the next step; otherwise, set r = r + 1 and repeat Steps 17 to 19.
Step 20: For each large r-sequence s (r>1) with fuzzy weighted support $wsup_s$, find the linguistic minimum support region $S_i$ with $wminsup_i \le wsup_s < wminsup_{i+1}$ by fuzzy ranking, where $wminsup_i = minsup_i \times$ (the gravity of $I^{ave}$) and $minsup_i$ is the given membership function for $S_i$. Output sequence s with linguistic support value $S_i$.
The linguistic weighted browsing patterns output after step 20 can then serve as meta knowledge concerning the given log data.
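The preprocessing in Steps 3 to 5 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the record layout `(timestamp, client_id, page)`, the use of `None` to mark a closing connection, and the function name `build_browsing_sequences` are all assumptions made for the example, and the client IDs are assumed to be already encoded as in Step 2.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_browsing_sequences(records):
    """Group log records by client and turn them into (page, duration) tuples.

    records: iterable of (timestamp, client_id, page); page is None for a
    closing-connection record (an assumed convention for this sketch).
    """
    by_client = defaultdict(list)
    for ts, client, page in records:
        by_client[client].append((ts, page))

    sequences = {}
    for client, visits in by_client.items():
        visits.sort(key=lambda v: v[0])  # Step 3: order by date and time
        seq = []
        # Step 4: duration = interval between a page and the next record
        for (t0, page), (t1, _next_page) in zip(visits, visits[1:]):
            if page is not None:
                seq.append((page, (t1 - t0).total_seconds()))
        sequences[client] = seq  # Step 5: the browsing sequence D_i
    return sequences


# Hypothetical records; only the first timestamp and file name come from the text.
start = datetime(2001, 3, 1, 5, 39, 56)
example_records = [
    (start + timedelta(seconds=30), 1, "search.asp"),
    (start, 1, "inside.htm"),
    (start + timedelta(seconds=50), 1, None),
]
example_sequences = build_browsing_sequences(example_records)
```

The last record of a client contributes no tuple of its own, since a duration needs a following record; this is why the closing connection is kept in Step 1.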

AN EXAMPLE
In this section, an example is given to illustrate the proposed fuzzy weighted web-mining algorithm. The algorithm can be used to generate fuzzy weighted browsing patterns for clients' browsing behavior from the log data in a web server. A part of the log data is shown in Table 1.
Each record in the log data includes fields date, time, client-ip, server-ip, server-port and file-name, among others. Only one file name is contained in each record. For example, the user in client-ip 140.127.194.127 browsed the file inside.htm at 05:39:56 on March 1st, 2001.
Assume the membership functions for a browsing duration of a web page are shown in Fig. 1.
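In Fig. 1, the browsing duration is divided into three fuzzy regions: Short, Middle and Long. Thus, three fuzzy membership values are produced for each duration according to the predefined membership functions. For the log data shown in Table 1, the proposed fuzzy web-mining algorithm proceeds as follows.
Step 1: The records with file names including .asp, .htm, .html, .jva, .cgi and closing connection are selected from the log data, and only the fields date, time, client-ip and file-name are kept. The results are shown in Table 2.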
Step 2: The values of field client-ip are transformed into contiguous integers according to each client's first browsing time. The transformed results for Table 2 are shown in Table 3. In total, six clients logged on to the web server.
Step 3: The resulting log data in Table 3 are then sorted first by encoded client ID and then by date and time. Results are shown in Table 4.
Step 4: The time durations of the web pages browsed by each encoded client ID are calculated. Simple symbols are used here to represent web pages for convenience. Let A, B, C, D and E respectively represent homepage.htm, inside.htm, search.asp, cheap.htm and person.asp. The durations of all pages browsed by each client ID are shown in Table 5.
Step 5: The web pages browsed by each client are listed as a browsing sequence. Each tuple is represented as (web page, duration). The resulting browsing sequences from Table 5 are shown in Table 6.
Step 6: The time duration of each web page in each browsing sequence is transformed into a fuzzy set using the given membership functions (Fig. 1). This step is repeated for the other web pages and browsing sequences. The results are shown in Table 7.
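As an illustration of how a duration is turned into memberships in the three regions Short, Middle and Long, the sketch below uses piecewise-linear membership functions. The breakpoints 20, 70 and 120 seconds are assumptions chosen for the example; the actual shapes are those of Fig. 1, which is not reproduced here, and the function name `fuzzify_duration` is hypothetical.

```python
def fuzzify_duration(v, a=20.0, b=70.0, c=120.0):
    """Return the membership of duration v (seconds) in Short, Middle, Long.

    The breakpoints a, b, c are assumed values, not taken from Fig. 1.
    Short is 1 up to a and falls to 0 at b; Middle peaks at b;
    Long rises from 0 at b to 1 at c and stays at 1 afterwards.
    """
    if v <= a:
        return {"Short": 1.0, "Middle": 0.0, "Long": 0.0}
    if v <= b:
        mid = (v - a) / (b - a)
        return {"Short": 1.0 - mid, "Middle": mid, "Long": 0.0}
    if v <= c:
        high = (v - b) / (c - b)
        return {"Short": 0.0, "Middle": 1.0 - high, "Long": high}
    return {"Short": 0.0, "Middle": 0.0, "Long": 1.0}


example_fuzzy_set = fuzzify_duration(30.0)
```

With these assumed breakpoints, a duration of 30 seconds belongs mostly to Short and partly to Middle, and the three memberships always sum to 1.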
Step 7: The membership value of each region in each browsing sequence is found. Take the region D.Middle for Client 2 as an example. Its membership value is max (0.8, 0.6) = 0.8. The membership values of the other regions can be similarly calculated.
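The maximum operation of Step 7 can be sketched as below. The data layout (a list of `(page, {region: membership})` pairs per client) and the function name are assumptions for the example; only the two D.Middle values 0.8 and 0.6 are taken from the text.

```python
def sequence_memberships(fuzzy_tuples):
    """Keep, for each (page, region), the largest membership in one sequence.

    fuzzy_tuples: list of (page, {region: membership}) for a single client.
    """
    best = {}
    for page, regions in fuzzy_tuples:
        for region, value in regions.items():
            key = (page, region)
            best[key] = max(best.get(key, 0.0), value)
    return best


# Client 2 browsed page D twice; only the Middle memberships are given in the text.
client2 = [("D", {"Middle": 0.8}), ("D", {"Middle": 0.6})]
client2_memberships = sequence_memberships(client2)
```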
Step 8: The cardinality of each fuzzy region in all the browsing sequences is calculated as the count value. Take the fuzzy region D.Middle as an example. Its cardinality = (0.6+0.8+0.8+0.0+1.0+1.0) = 4.2. This step is repeated for the other regions and the results are shown in Table 8.
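The count of Step 8 is a plain sum of the per-client membership values. A minimal sketch, assuming each client's memberships are stored in a dictionary keyed by `(page, region)` (a layout chosen for this example):

```python
def region_counts(per_client_memberships):
    """Sum the membership of each (page, region) over all browsing sequences."""
    counts = {}
    for memberships in per_client_memberships:
        for key, value in memberships.items():
            counts[key] = counts.get(key, 0.0) + value
    return counts


# The six D.Middle values quoted in the text, one per client.
d_middle = [0.6, 0.8, 0.8, 0.0, 1.0, 1.0]
example_counts = region_counts([{("D", "Middle"): v} for v in d_middle])
```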
Step 9: The fuzzy region with the largest count value among the three possible regions for each file is selected. Take the web page A as an example. Its count is 0.0 for Short, 0.8 for Middle and 0.2 for Long. Since the count for Middle is the largest among the three counts, the region Middle is thus used to represent the web page A in later mining processes. This step is repeated for the other web pages. Thus, Short is chosen for B, Middle is chosen for A, C and D and Long is chosen for E.
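Selecting the representative region of each page amounts to an argmax over its region counts. The sketch below assumes the same `(page, region)` keyed dictionary as a hypothetical data layout and uses the counts quoted for page A.

```python
def representative_regions(counts):
    """Return {page: (region, count)} for the region with the largest count."""
    best = {}
    for (page, region), count in counts.items():
        if page not in best or count > best[page][1]:
            best[page] = (region, count)
    return best


counts_a = {("A", "Short"): 0.0, ("A", "Middle"): 0.8, ("A", "Long"): 0.2}
example_representatives = representative_regions(counts_a)
```

Ties are resolved here in favor of the region seen first, which is an arbitrary choice; the text does not say how ties are handled.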
Step 10: Assume the importance of the five web pages (A, B, C, D and E) is evaluated by three managers as shown in Table 9. Assume the membership functions for importance of the web page are given in Fig. 2.
In Fig. 2, the importance of the web page is divided into five fuzzy regions: Very Unimportant, Unimportant, Ordinary, Important and Very Important. Each fuzzy region is represented by a membership function. The linguistic terms for the importance of the web pages given in Table 9 are transformed into fuzzy sets by the membership functions given in Fig. 2. For example, Page A is evaluated as Important by Manager 1. This evaluation is transformed into a triangular fuzzy set (0.5, 0.75, 1) of weights. The transformed results for Table 9 are shown in Table 10.
Step 11: The average weight of each web page is calculated by fuzzy addition.
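The averaging of the managers' fuzzy weights (Step 11 of the algorithm) can be sketched for triangular fuzzy numbers as a component-wise mean. In the example below, only the fuzzy set (0.5, 0.75, 1) for Important comes from the text; the triangle (0.25, 0.5, 0.75) for Ordinary and the evaluations of Managers 2 and 3 are assumptions, picked so that the result matches the average weight quoted for page A in Step 12.

```python
def fuzzy_average(weights):
    """Average triangular fuzzy numbers (a, b, c) component by component."""
    n = len(weights)
    return tuple(sum(w[i] for w in weights) / n for i in range(3))


IMPORTANT = (0.5, 0.75, 1.0)   # from the text
ORDINARY = (0.25, 0.5, 0.75)   # assumed shape, not given in the text

# Hypothetical evaluations of page A by the three managers.
avg_a = fuzzy_average([IMPORTANT, ORDINARY, ORDINARY])
```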
Step 12: The fuzzy weighted support of each web page is calculated. Take the web page A as an example. The average fuzzy weight of A is (0.333, 0.583, 0.833) from Step 11. Since the region Middle is used to represent the web page A and its count is 0.8, its weighted support is then (0.333, 0.583, 0.833) × 0.8/6, which is (0.044, 0.078, 0.111). Results for all the web pages are shown in Table 12.
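The weighted support is the average fuzzy weight scaled by the representative count and divided by the number of clients. A short sketch reproducing the numbers for page A (the function name is hypothetical):

```python
def weighted_support(avg_weight, max_count, n_clients):
    """Scale a triangular fuzzy weight by max_count / n_clients."""
    return tuple(w * max_count / n_clients for w in avg_weight)


wsup_a = weighted_support((0.333, 0.583, 0.833), 0.8, 6)
```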
Step 13: The given linguistic minimum support value is transformed into a fuzzy set of minimum supports. Assume the membership functions for minimum supports are given in Fig. 3. Also assume the given linguistic minimum support value is Low. It is then transformed into a fuzzy set of minimum supports, (0, 0.25, 0.5), according to the given membership functions in Fig. 3.
Step 14: The fuzzy weighted set of minimum supports is calculated. With $I^{ave}$ = (0.3, 0.5, 0.7), the gravity of $I^{ave}$ is (0.3+0.5+0.7)/3, which is 0.5. The fuzzy weighted set of minimum supports for Low is then (0, 0.25, 0.5)×0.5, which is (0, 0.125, 0.25).
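The computation of the fuzzy weighted minimum support can be sketched as follows, taking the gravity of a triangular fuzzy number (a, b, c) as (a + b + c)/3, which is the calculation used in the example. The function names are hypothetical.

```python
def gravity(tri):
    """Center of gravity of a triangular fuzzy number (a, b, c)."""
    return sum(tri) / 3.0


def weighted_minsup(minsup, i_ave):
    """Scale the fuzzy minimum support by the gravity of I_ave."""
    g = gravity(i_ave)
    return tuple(x * g for x in minsup)


i_ave = (0.3, 0.5, 0.7)        # average importance, as in the example
low = (0.0, 0.25, 0.5)         # fuzzy set for the linguistic value Low
wminsup_low = weighted_minsup(low, i_ave)
```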
Step 15: The fuzzy weighted support of the representative region for each web page is compared with the fuzzy weighted minimum support by fuzzy ranking. Any fuzzy ranking approach can be applied here as long as it can generate a crisp rank. Assume the gravity ranking approach is adopted in this example. Take web page B as an example. The gravity of the fuzzy weighted support for B.Short is (0.428+0.611+0.733)/3, which is 0.591. The gravity of the fuzzy weighted minimum support is (0+0.125+0.25)/3, which is 0.125. Since 0.591>0.125, B.Short is thus a large fuzzy weighted 1-sequence. Similarly, C.Middle, D.Middle and E.Long are large fuzzy weighted 1-sequences. These 1-sequences are put in $L_1$ (Table 13).
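The gravity-based comparison can be sketched as a filter over the weighted supports. The values for B.Short and for A.Middle are those quoted in the example; the dictionary layout and the function name `large_one_sequences` are assumptions for illustration.

```python
def large_one_sequences(supports, wminsup):
    """Return the terms whose support gravity is at least that of wminsup.

    supports: {term: (a, b, c)} triangular fuzzy weighted supports.
    """
    threshold = sum(wminsup) / 3.0
    return sorted(t for t, s in supports.items() if sum(s) / 3.0 >= threshold)


supports = {
    "B.Short": (0.428, 0.611, 0.733),
    "A.Middle": (0.044, 0.078, 0.111),
}
example_l1 = large_one_sequences(supports, (0.0, 0.125, 0.25))
```

Under this ranking, A.Middle has gravity of about 0.078, below the threshold 0.125, which agrees with its absence from the list of large 1-sequences in the example.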
Step 16: r is set at 1, where r is used to store the number of the linguistic items kept in the current sequences.
Step 17: The candidate set $C_2$ is first generated from $L_1$.
Step 18: The following substeps are done for each newly formed candidate sequence in $C_2$.
• The fuzzy weighted count of each candidate 2-sequence in each browsing sequence is first calculated. Here, the minimum operator is used for intersection. Take the linguistic browsing sequence (B.Short, C.Middle) for Client 4 as an example. There are three possible subsequences of (B.Short, C.Middle) in that browsing sequence. The average fuzzy weight of web page B is (0.583, 0.833, 1). The results are shown in Table 14.
• The fuzzy weighted count of each candidate 2-sequence in $C_2$ is then calculated. Results for this example are shown in Table 15.
• The fuzzy weighted support of each candidate 2-sequence is then calculated. Take (B.Short, C.Middle) as an example. The fuzzy weighted count of (B.Short, C.Middle) is (1, 1.6, 2.083) and the total number of clients is 6. Its fuzzy weighted support is then (1, 1.6, 2.083)/6, which is (0.167, 0.267, 0.347). All the fuzzy weighted supports of the candidate 2-sequences are shown in Table 16.
• The fuzzy weighted support of each candidate 2-sequence is compared with the fuzzy weighted minimum support by fuzzy ranking. As mentioned above, the gravity ranking approach is adopted in this example.
Step 19: Since $L_2$ is not null, r = r+1 = 2. Steps 17 to 19 are repeated to find $L_3$. $C_3$ is then generated from $L_2$. In this example, $C_3$ is empty. $L_3$ is thus empty.
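The candidate generation of Step 17 can be sketched in the style of AprioriAll: join two large r-sequences that agree on their first r-1 terms, then prune any candidate with a length-r subsequence that is not large. This is an illustrative reading of the step, not the authors' code. In particular, allowing a term to follow itself, as in (B.Short, B.Short), is an assumption, and the large 2-sequences used below are hypothetical because the contents of $L_2$ are not listed in the text.

```python
def generate_candidates(large_r):
    """Join and prune large r-sequences (tuples) into candidate (r+1)-sequences."""
    large = set(large_r)
    if not large:
        return set()
    # Join: sequences sharing the first r-1 terms, extended by the other's last term.
    joined = {p + (q[-1],) for p in large for q in large if p[:-1] == q[:-1]}

    def drop_one(s):
        # All subsequences of s obtained by removing exactly one term.
        return (s[:i] + s[i + 1:] for i in range(len(s)))

    # Prune: every length-r subsequence must itself be large.
    return {s for s in joined if all(sub in large for sub in drop_one(s))}


l1 = {("B.Short",), ("C.Middle",), ("D.Middle",), ("E.Long",)}
c2 = generate_candidates(l1)

# Hypothetical L2, used only to show a case where C3 comes out empty.
c3 = generate_candidates({("B.Short", "C.Middle"), ("C.Middle", "D.Middle")})
```

Because order matters, (B.Short, C.Middle) and (C.Middle, B.Short) are distinct candidates, which is what "different permutations represent different candidates" refers to.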
Step 20: The linguistic support values are found for each large r-sequence s (r>1). The three resulting linguistic browsing patterns are then output as the meta knowledge concerning the given log data.

CONCLUSION AND FUTURE WORK
In this research, we have proposed a new fuzzy weighted web-mining algorithm, which processes web-server logs to discover useful browsing behaviors of users from the time durations of the pages browsed. In the log data, each transaction contains only one web page. The mining process can thus be simplified when compared to that for multiple-item transactions in Agrawal and Srikant's mining approach [1]. Since the time durations are numeric, fuzzy concepts are used to process them and to form linguistic terms. In addition, different web pages may have different importance. The importance of web pages is evaluated by managers as linguistic terms, which are then transformed and averaged as fuzzy sets of weights. Each linguistic term is then weighted by the importance of its page. Only the linguistic term with the maximum cardinality for a page is chosen in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of original web pages. The algorithm therefore focuses on the most important linguistic terms, which reduces its time complexity. The minimum support is also given as a linguistic value. Fuzzy operations including fuzzy ranking are then used to find fuzzy weighted browsing patterns.
Compared to previous mining approaches, the proposed one has linguistic inputs and outputs, which are more natural and understandable for human beings.
Although the proposed method works well in fuzzy weighted web mining and can effectively manage linguistic minimum supports, it is just a beginning. There is still much work to be done in this field. Our method assumes that the membership functions are known in advance. In [6,12], we proposed some fuzzy learning methods to automatically derive the membership functions. In the future, we will attempt to dynamically adjust the membership functions in the proposed web-mining algorithm to avoid the bottleneck of membership function acquisition.