Web Structure Mining: Exploring Hyperlinks and Algorithms for Information Retrieval

Abstract: This paper focuses on hyperlink analysis, the algorithms used for link analysis, a comparison of those algorithms, and the role of hyperlink analysis in Web searching. In hyperlink analysis, the number of incoming links to a page, the number of outgoing links from that page, and the reliability of the linking are analyzed. The concept of authorities and hubs among Web pages is explored. Link analysis algorithms such as PageRank and HITS (Hyperlink-Induced Topic Search) are discussed and compared, and the formulas used by those algorithms are examined.


I. INTRODUCTION
The Web is a massive, explosively growing, diverse, dynamic and mostly unstructured data repository. It delivers an enormous amount of information, and it also increases the complexity of dealing with that information from the different perspectives of knowledge seekers, Web service providers and business analysts. The following are considered challenges [1] in Web mining:
• The Web is huge and Web pages are semi-structured.
• Web information tends to be diverse in meaning.
• The degree of quality of the extracted information.
• Drawing conclusions (knowledge) from the extracted information.
Web mining techniques, together with other areas such as Databases (DB), Information Retrieval (IR), Natural Language Processing (NLP) and Machine Learning, can be used to address these challenges. Web mining is the use of data mining techniques to automatically discover and extract information from the World Wide Web (WWW). Web structure mining helps users retrieve relevant documents by analyzing the link structure of the Web.
This paper is organized as follows. Section II presents the concepts of Web structure mining and the Web graph. Section III covers hyperlink analysis, the algorithms and their comparison. Section IV concludes the paper.

II. WEB STRUCTURE MINING
A. Overview
According to Kosala et al. [2], Web mining consists of the following tasks:
• Resource finding: the task of retrieving intended Web documents.
• Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources.
• Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.
• Analysis: validation and/or interpretation of the mined patterns.
There are three areas of Web mining, distinguished by the kind of Web data used as input to the data mining process: Web Content Mining (WCM), Web Usage Mining (WUM) and Web Structure Mining (WSM). Web content mining is concerned with retrieving information from the WWW into more structured forms and indexing it so that it can be retrieved quickly. Web usage mining is the process of identifying browsing patterns by analyzing users' navigational behavior. Web structure mining aims to discover the model underlying the link structure of Web pages, to catalog the pages, and to generate information such as the similarity and relationships between them, taking advantage of their hyperlink topology. The hyperlink analysis and the algorithms discussed here belong to Web structure mining. Although there are three areas of Web mining, the differences between them are narrowing because they are all interconnected.

B. How Big Is the Web
A Google report [3] of 25 July 2008 states that there are 1 trillion (1,000,000,000,000) unique URLs (Uniform Resource Locators) on the Web. The actual number could be higher, and Google could not index all of these pages. When Google first created its index in 1998, it contained 26 million pages, and by 2000 the Google index had reached 1 billion pages. In the last 9 years, the Web has grown tremendously and its usage has grown beyond what could have been imagined. It is therefore important to understand and analyze the underlying data structure of the Web for effective information retrieval.

C. Web Data Structure
Traditional information retrieval systems focus mainly on the information provided by the text of Web documents. Web mining techniques provide additional information through the hyperlinks that connect different documents. The Web may be viewed as a directed labeled graph whose nodes are the documents or pages and whose edges are the hyperlinks between them. This directed graph structure is called the Web graph.
A graph G consists of two sets V and E, Horowitz et al. [4]. The set V is a finite, nonempty set of vertices. The set E is a set of pairs of vertices; these pairs are called edges. The notations V(G) and E(G) represent the sets of vertices and edges, respectively, of graph G. A graph can also be written as G = (V, E). The graph in Fig. 1 is a directed graph with 3 vertices and 3 edges. Its vertices are V(G) = {A, B, C} and its edges are E(G) = {(A, B), (B, A), (B, C)}. In a directed graph with n vertices, the maximum number of edges is n(n-1), so with 3 vertices there can be at most 3(3-1) = 6 edges. In this example, the edges (C, B), (A, C) and (C, A) are absent. A directed graph is said to be strongly connected if, for every pair of distinct vertices u and v in V(G), there is a directed path from u to v and also from v to u. The graph in Fig. 1 is not strongly connected, as there is no path from vertex C to B.
According to Broder et al. [5], the Web can be imagined as a large graph containing several hundred million or even billions of nodes (vertices) and a few billion arcs (edges). The following section explains hyperlink analysis and the algorithms used in it for information retrieval.
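The graph properties described for Fig. 1 can be checked with a short Python sketch. The vertex and edge sets are the ones given in the text; the function names are illustrative choices, not part of the paper.

```python
from itertools import permutations


def max_directed_edges(n):
    """Maximum number of edges in a directed graph with n vertices (no self-loops)."""
    return n * (n - 1)


def reachable_from(adjacency, start):
    """Return the set of vertices reachable from start along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_strongly_connected(vertices, edges):
    """True if every vertex can reach every other vertex."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
    return all(reachable_from(adjacency, v) == set(vertices) for v in vertices)


# The directed graph G of Fig. 1, as listed in the text.
V = {"A", "B", "C"}
E = {("A", "B"), ("B", "A"), ("B", "C")}

# All ordered pairs of distinct vertices that are not edges of G.
missing_edges = set(permutations(V, 2)) - E

print(max_directed_edges(len(V)))    # 6
print(sorted(missing_edges))         # [('A', 'C'), ('C', 'A'), ('C', 'B')]
print(is_strongly_connected(V, E))   # False: C has no outgoing edge
```

Adding a single edge (C, A) would make this graph strongly connected, since it closes the cycle A, B, C, A.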

III. HYPERLINK ANALYSIS
Many Web pages do not include words that describe their basic purpose (for example, a search engine portal rarely includes the word "search" on its home page), and some Web pages contain very little text (such as image, music and video resources), which makes text-based search techniques difficult to apply. However, how other pages characterize such a page may be useful. This type of "characterization" is found in the text that surrounds the hyperlink pointing to the page.
Much research [6,7,8,11,12] has been done, and solutions have been suggested, on the problem of searching, indexing or querying the Web, taking into account its structure as well as the meta-information included in the hyperlinks and the text surrounding them.
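As an illustration of how the text attached to a hyperlink can characterize its target page, the following minimal sketch collects (target URL, anchor text) pairs using Python's standard `html.parser` module. The class name `AnchorTextParser`, the sample HTML and the example URLs are assumptions made for this example only; it captures just the text inside the link, not the wider surrounding text.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._href = None     # href of the <a> element currently open, if any
        self._chunks = []     # text pieces seen inside the open <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


# Hypothetical page fragment: the linked pages are described by other pages' words.
sample_html = (
    '<p>Try the <a href="http://example.com/">best  web search engine</a> '
    'or read <a href="http://example.org/docs">the manual</a>.</p>'
)

parser = AnchorTextParser()
parser.feed(sample_html)
for href, text in parser.links:
    print(href, "<-", text)
```

Here the first target is described as a "web search engine" by the linking page, even if that word never appears on the target page itself.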
A number of algorithms based on link analysis have been proposed. Using citation analysis, the Co-citation algorithm [13] and the Extended Co-citation algorithm [14] were proposed. These algorithms are simple, and deeper relationships among the pages cannot be discovered with them. Three important algorithms, PageRank [15], Weighted PageRank (WPR) [16] and Hyperlink-Induced Topic Search (HITS) [17], are discussed below in detail and compared.

A. PageRank
Brin and Page developed the PageRank algorithm [15] during their Ph.D. studies at Stanford University, based on citation analysis [9,10]. The PageRank algorithm is used by the well-known search engine Google. They applied citation analysis to Web search by treating incoming links as citations of the Web pages. However, simply applying citation analysis techniques to the diverse set of Web documents did not produce good results. PageRank therefore provides a more advanced way to compute the importance or relevance of a Web page than simply counting the number of pages that link to it (called "backlinks"). If a backlink comes from an "important" page, it is given a higher weight than backlinks that come from non-important pages. Put simply, a link from one page to another may be considered a vote. However, not only the number of votes a page receives matters, but also the "importance" or "relevance" of the pages that cast those votes.
Assume an arbitrary page A has pages T_1 to T_n pointing to it (incoming links). PageRank can be calculated by equation (1), the formula given by Brin and Page [15]:

$$PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right) \tag{1}$$

The parameter d is a damping factor, usually set to 0.85 (to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by 0.85). C(A) is defined as the number of links going out of page A. The PageRanks form a probability distribution over the Web pages, so the sum of all Web pages' PageRanks is one once the values are divided by the number of pages (with equation (1) as written, the values sum to the number of pages whenever every page has at least one outgoing link). PageRank can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the Web.
Let us take an example hyperlink structure of three pages A, B and C, as shown in Fig. 2. The PageRank of pages A, B and C can be calculated using equation (1). After many more iterations of this calculation, the PageRanks shown in Table I are obtained. For a small set of pages it is easy to calculate the PageRank values, but for a Web with billions of pages the calculation is not easy to carry out in this way. In Table I, after iteration 15 the PageRank values of the pages settle to stable values. Previous experiments [18,19] show that PageRank converges to within a reasonable tolerance. The convergence of the PageRank calculation for Table I is shown as a graph in Fig. 3.
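The iterative calculation can be sketched in Python as follows. This is a minimal illustration, assuming the update rule PR(A) = (1 - d) + d * (sum of PR(T_i) / C(T_i) over the pages T_i linking to A) from Brin and Page [15]. The three-page link structure below is a hypothetical example and is not claimed to be the structure of Fig. 2; the function name, the starting value of 1 for every page and the stopping tolerance are likewise assumptions.

```python
def pagerank(out_links, d=0.85, tol=1e-8, max_iter=100):
    """Iterate the PageRank update until the largest change drops below tol.

    out_links maps each page to the list of pages it links to.
    Returns (ranks, number_of_iterations_performed).
    """
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}  # assumed starting value for every page
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_pr = {}
        for p in pages:
            # Sum PR(T)/C(T) over every page T that links to p.
            votes = sum(pr[q] / len(out_links[q])
                        for q in pages if p in out_links[q])
            new_pr[p] = (1 - d) + d * votes
        delta = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if delta < tol:
            break
    return pr, iteration


# Hypothetical link structure: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

ranks, iterations = pagerank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
print("iterations:", iterations)
print("sum of ranks:", round(sum(ranks.values()), 4))
```

With this form of the update and no page lacking outgoing links, the ranks sum to the number of pages; dividing each value by that number gives a distribution that sums to one.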

B. Weighted PageRank Algorithm
Wenpu Xing and Ali Ghorbani [16] proposed the Weighted PageRank (WPR) algorithm, which is an extension of the PageRank algorithm. Rather than dividing the rank value of a page evenly among the pages it links to, this algorithm assigns larger rank values to the more important pages. Each outgoing link gets a value proportional to its importance. The importance is assigned in terms of weight values for the incoming and outgoing links, denoted W^in(m, n) and W^out(m, n) respectively. W^in(m, n), shown in equation (2), is the weight of link (m, n), calculated from the number of incoming links of page n and the number of incoming links of all reference pages of page m:

$$W^{in}_{(m,n)} = \frac{I_n}{\sum_{p \in R(m)} I_p} \tag{2}$$

where I_n and I_p are the number of incoming links of page n and page p respectively, and R(m) denotes the reference page list of page m. W^out(m, n), shown in equation (3), is the weight of link (m, n), calculated from the number of outgoing links of page n and the number of outgoing links of all reference pages of page m:

$$W^{out}_{(m,n)} = \frac{O_n}{\sum_{p \in R(m)} O_p} \tag{3}$$

where O_n and O_p are the number of outgoing links of page n and page p respectively. The WPR formula proposed by Xing and Ghorbani, shown in equation (4), is a modification of the PageRank formula (equation 1):

$$WPR(n) = (1 - d) + d \sum_{m \in B(n)} WPR(m)\, W^{in}_{(m,n)}\, W^{out}_{(m,n)} \tag{4}$$

where B(n) is the set of pages that point to page n.
The WPR calculation is carried out on the same hyperlink structure as shown in Fig. 2. The WPR equations for pages A, B and C are as follows.
The incoming and outgoing link weights are calculated as follows. Substituting the values from equations (4d) and (4e) into equation (4a), with d = 0.85 and an initial value of WPR(C) = 1, gives the WPR of page A. The values of WPR(A), WPR(B) and WPR(C) are given in equations (4f), (4g) and (4h) respectively. Here, WPR(A) > WPR(C) > WPR(B). This result shows that the rank order produced by WPR differs from the one produced by PageRank.
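The WPR calculation can be sketched in Python as follows, assuming the definitions given above: the incoming weight of link (m, n) is the in-link count of n divided by the total in-link count of the pages m references, the outgoing weight is defined the same way with out-link counts, and each page's rank is (1 - d) plus d times the weighted sum over its referrers. The link structure is a hypothetical example, not necessarily that of Fig. 2. The function names, the starting value of 1 and the rule that a zero denominator gives a zero weight are assumptions of this sketch.

```python
def link_weights(out_links):
    """Return {(m, n): (w_in, w_out)} for every link m -> n."""
    pages = list(out_links)
    in_count = {p: sum(p in out_links[q] for q in pages) for p in pages}  # I_p
    out_count = {p: len(out_links[p]) for p in pages}                     # O_p
    weights = {}
    for m in pages:
        refs = out_links[m]  # R(m): the pages that m references
        total_in = sum(in_count[p] for p in refs)
        total_out = sum(out_count[p] for p in refs)
        for n in refs:
            # Assumption: a zero denominator yields a zero weight.
            w_in = in_count[n] / total_in if total_in else 0.0
            w_out = out_count[n] / total_out if total_out else 0.0
            weights[(m, n)] = (w_in, w_out)
    return weights


def weighted_pagerank(out_links, d=0.85, tol=1e-8, max_iter=200):
    """Iterate the WPR update until the largest change drops below tol."""
    weights = link_weights(out_links)
    wpr = {p: 1.0 for p in out_links}  # assumed starting value
    for _ in range(max_iter):
        new_wpr = {p: 1.0 - d for p in out_links}
        for (m, n), (w_in, w_out) in weights.items():
            new_wpr[n] += d * wpr[m] * w_in * w_out
        converged = max(abs(new_wpr[p] - wpr[p]) for p in out_links) < tol
        wpr = new_wpr
        if converged:
            break
    return wpr


# Hypothetical link structure: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

weights = link_weights(graph)
wpr = weighted_pagerank(graph)
for page in sorted(wpr):
    print(page, round(wpr[page], 4))
```

In this example page A splits its rank unevenly: C has two incoming links and B has one, so the link to C carries an incoming weight of 2/3 and the link to B carries 1/3.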

C. The HITS Algorithm: Hubs and Authorities
Kleinberg [17] identifies two different forms of Web pages, called hubs and authorities. Authorities are pages with important content. Hubs are pages that act as resource lists, guiding users to authorities. Thus, a good hub page for a subject points to many authoritative pages on that subject, and a good authority page is pointed to by many good hub pages on the same subject. Hubs and authorities and their calculation are shown in Fig. 4. Kleinberg notes that a page may be a good hub and a good authority at the same time. This circular relationship leads to the definition of an iterative algorithm called HITS (Hyperlink-Induced Topic Search). The HITS algorithm treats the WWW as a directed graph G(V, E), where V is a set of vertices representing pages and E is a set of edges that correspond to links.
There are two major steps in the HITS algorithm: the sampling step and the iterative step. In the sampling step, a set of pages relevant to the given query is collected, i.e. a sub-graph S of G that is rich in authority pages is retrieved. The algorithm starts with a root set R, from which the set S is obtained, keeping in mind that S should be relatively small, rich in pages relevant to the query, and contain most of the good authorities. The second step, the iterative step, finds hubs and authorities from the output of the sampling step using equations (5) and (6):

$$H_p = \sum_{q \in I(p)} A_q \tag{5}$$

$$A_p = \sum_{q \in B(p)} H_q \tag{6}$$

where H_p is the hub weight, A_p is the authority weight, and I(p) and B(p) denote the sets of reference and referrer pages of page p, respectively. A page's authority weight is proportional to the sum of the hub weights of the pages that link to it, Kleinberg [20]. Similarly, a page's hub weight is proportional to the sum of the authority weights of the pages that it links to. Fig. 4 shows an example of the calculation of authority and hub scores. The following are the constraints of the HITS algorithm [6]:
• Hubs and authorities: It is not easy to distinguish between hubs and authorities, because many sites are hubs as well as authorities.
• Topic drift: Sometimes HITS may not return the documents most relevant to the user query because of equivalent weights.
• Automatically generated links: HITS gives equal importance to automatically generated links, which may not lead to topics relevant to the user query.
• Efficiency: The HITS algorithm is not efficient in real time.
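The iterative step of HITS can be sketched in Python as follows. The sketch assumes that the sampling step has already produced the sub-graph and shows only the hub and authority updates: a page's authority weight is the sum of the hub weights of its referrers, and its hub weight is the sum of the authority weights of the pages it references. Rescaling the weights after each round so that their squares sum to one follows Kleinberg's formulation and keeps the values bounded. The four-page graph, the fixed iteration count and the function names are assumptions made for illustration.

```python
import math


def _normalize(scores):
    """Scale the scores so that their squares sum to one (if possible)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {p: v / norm for p, v in scores.items()}


def hits(out_links, iterations=50):
    """Return (hub, authority) weights for the pages of a link graph.

    out_links maps each page to the list of pages it links to.
    """
    pages = list(out_links)
    # Referrer lists B(p): the pages that link to p.
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the referrers.
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # Hub weight: sum of the authority weights of the referenced pages.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        auth = _normalize(auth)
        hub = _normalize(hub)
    return hub, auth


# Hypothetical sub-graph: two list-like pages pointing at two content pages.
graph = {"H1": ["A1", "A2"], "H2": ["A1"], "A1": [], "A2": []}

hub, auth = hits(graph)
for page in graph:
    print(page, "hub:", round(hub[page], 4), "authority:", round(auth[page], 4))
```

In this small example A1 receives the highest authority weight because both list pages point to it, and H1 receives the highest hub weight because it points to both content pages.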
Table II shows the comparison [21] of all the algorithms discussed above.

IV. CONCLUSION
This paper covers the basics of Web mining and explains the importance of Web structure mining in information retrieval. Its main purpose is to explore the hyperlink structure and to understand the Web graph in a simple way. The paper also examines the important algorithms used for hyperlink analysis and compares them.

Figure 1. A Directed Graph G

Figure 3. Convergence of PageRank Calculation

Figure 4. Calculation of Hubs and Authorities