Web Structure Mining
Johannes Fürnkranz
Austrian Research Institute for Artificial Intelligence
Schottengasse 3, A-1010 Wien, Austria
E-mail: juffi@oefai.at
Abstract
The World-Wide Web provides every internet citizen with access to an
abundance of information, but it becomes increasingly difficult to identify
the relevant pieces of information. Web mining is a new research area that
tries to address this problem by applying techniques from data mining and
machine learning to Web data and documents. In this paper, we will give
a brief overview of Web mining, with a special focus on techniques that
aim at exploiting the graph structure of the Web for improved retrieval
performance and classification accuracy.
1 Web Mining
The advent of the World-Wide Web (WWW) has overwhelmed the typical home
computer user with an enormous flood of information. To almost any topic one
can think of, one can find pieces of information that are made available by other
internet citizens, ranging from individual users that post an inventory of their
record collection, to major companies that do business over the Web.
To be able to cope with the abundance of available information, users of
the WWW need to rely on intelligent tools that assist them in finding, sorting,
and filtering the available information. Just as data mining aims at discovering
valuable information that is hidden in conventional databases, the emerging field
of Web mining aims at finding and extracting relevant information that is hidden
in Web-related data, in particular in text documents that are published on the
Web. Like data mining, Web mining is a multi-disciplinary effort that draws
techniques from fields like information retrieval, statistics, machine learning,
natural language processing, and others.
Depending on the nature of the data, one can distinguish three main areas
of research within the Web mining community:
Web Content Mining: application of data mining techniques to unstructured
or semi-structured data, usually HTML-documents
Web Structure Mining: use of the hyperlink structure of the Web as an (ad-
ditional) information source
Web Usage Mining: analysis of user interactions with a Web server (e.g.,
click-stream analysis)
For an excellent survey of the field, cf. (Chakrabarti, 2000); a slim textbook
has appeared as (Chang et al., 2001). For a survey of content mining, we refer to
(Sebastiani, 2002), while a survey of usage mining can be found in (Srivastava
et al., 2000). We are not aware of a previous survey on structure mining.
2 Motivation
2.1 The Web is a Graph
While conventional information retrieval focuses primarily on information that
is provided by the text of Web documents, the Web provides additional informa-
tion through the way in which different documents are connected to each other
via hyperlinks. The Web may be viewed as a (directed) graph whose nodes are
the documents and the edges are the hyperlinks between them.
Several authors have tried to analyze the properties of this graph. The most
comprehensive study is due to Broder et al. (2000). They used data from an
Altavista crawl (May 1999) with 203 million URLs and 1466 million links, and
stored the underlying graph structure in a connectivity server (Bharat et al.,
1998), which implements an efficient document indexing technique that allows
fast access to both outgoing and incoming hyperlinks of a page. The entire
graph fitted in 9.5 GB of storage, and a breadth-first search that reached 100M
nodes took only about 4 minutes.
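As a rough illustration, the following Python sketch (all names are illustrative; this is not the actual connectivity server implementation) shows the kind of index such a server provides: adjacency lists for both the outgoing and the incoming links of each page, over which a breadth-first search can be run.

```python
from collections import defaultdict, deque

def build_link_index(edges):
    """Build adjacency lists for outgoing and incoming links.

    edges is an iterable of (source_url, target_url) pairs."""
    out_links = defaultdict(list)
    in_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)
    return out_links, in_links

def bfs(start, links):
    """Return the set of pages reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```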
Their main result is an analysis of the structure of the web graph, which,
according to them, looks like a giant bow tie, with a strongly connected core
component (SCC) of 56 million pages in the middle, and two components with
44 million pages each on the sides, one containing pages from which the SCC
can be reached (the IN set), and the other containing pages that can be reached
from the SCC (the OUT set). In addition, there are “tubes” that make it possible
to reach the OUT set from the IN set without passing through the SCC, and many
“tendrils” that lead out of the IN set or into the OUT set without connecting
to other components. Finally, there are also several smaller components that
cannot be reached from any point in this structure. Broder et al. (2000) also
sketch a diagram of this structure, which is somewhat deceptive because the
prominent role of the IN, OUT, and SCC sets is based on size only, and there
are other structures with a similar shape, but of somewhat smaller size (e.g.,
the tubes may contain other strongly connected components that differ from the
SCC only in size). The main result is that there are several disjoint components.
In fact, the probability that a path between two randomly selected pages exists
is only about 0.24.
Based on the analysis of this structure, Broder et al. (2000) estimated that
the diameter (i.e., the maximum of the lengths of the shortest paths between
two nodes) of the SCC is larger than 27, that the diameter of the entire graph
is larger than 500, and that the average length of such a path is about 16.
This is, of course, only for cases where a path between two pages exists. These
results correct earlier estimates obtained by Albert et al. (1999), who estimated
the average length at about 19. Their analysis was based on a probabilistic
argument using estimates for the in-degrees and out-degrees, thereby ignoring
the possibility of disjoint components.
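Given such an index, the bow-tie components can be approximated by forward and backward reachability from a page assumed to lie in the SCC. The sketch below reuses the bfs helper from above; it is only a schematic rendering of the decomposition, not the procedure used by Broder et al. (2000).

```python
def bow_tie(seed, out_links, in_links, all_pages):
    """Rough bow-tie decomposition around a seed page assumed to lie in the SCC."""
    forward = bfs(seed, out_links)     # pages reachable from the seed
    backward = bfs(seed, in_links)     # pages from which the seed can be reached
    scc = forward & backward           # strongly connected core around the seed
    out_set = forward - scc            # reachable from the SCC, but not back
    in_set = backward - scc            # can reach the SCC, but not vice versa
    rest = set(all_pages) - forward - backward  # tendrils, tubes, disconnected parts
    return scc, in_set, out_set, rest
```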
Albert et al. (1999) base their analysis on the observation that the in-degrees
(number of incoming links) and out-degrees (number of outgoing links) follow
a power law distribution $P(d) \approx d^{-\gamma}$. They estimated values of $\gamma_{in} = 2.45$ and
$\gamma_{out} = 2.1$ for the in-degrees and out-degrees, respectively. They also note that
these power law distributions imply a much higher probability of encountering
documents with large in- or out-degrees than would be the case for random
networks or random graphs. The power-law results have been confirmed by
Broder et al. (2000) who also observed a power law for the sizes of strongly
connected components in the Web graph. Faloutsos et al. (1999) observed a
Zipf distribution $P(d) \approx r(d)^{-\gamma}$ of the out-degree of nodes, where $r(d)$ is the
rank of the degree in a sorted list of out-degree values. Similarly, Levene et al.
(2001) observed a Zipf distribution in a model of the behavior of Web surfers.
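The exponent of such a power law can be estimated, for instance, by a least-squares fit in log-log space; the following sketch illustrates this common approach (it is not the estimation procedure used in the cited studies).

```python
import math

def estimate_power_law_exponent(degrees):
    """Estimate gamma in P(d) ~ d^(-gamma) by least squares in log-log space.

    degrees is a list of observed in- or out-degrees (positive integers);
    at least two distinct degree values are assumed."""
    counts = {}
    for d in degrees:
        if d > 0:
            counts[d] = counts.get(d, 0) + 1
    n = len(degrees)
    xs = [math.log(d) for d in counts]              # log degree
    ys = [math.log(c / n) for c in counts.values()]  # log empirical frequency
    # slope of the least-squares line  log P(d) = -gamma * log d + const
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope
```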
Finally, another interesting property is the size of the Web. Lawrence and
Giles (1998) suggest estimating the size of the Web from the overlap that
different search engines return for identical queries. Their method is based on
the assumption that the probability that a page is indexed by search engine
A is independent of the probability that this page is indexed by search engine
B. In this case, the percentage of pages in the result set of a query for search
engine B that are also indexed by search engine A could be used as an estimate
for the over-all percentage of pages indexed by A. Obviously, the independence
assumption on which this argument is based does not hold in practice, so that
the estimated percentage is larger than the real percentage (and the obtained
estimates of the Web size are more like lower bounds). Lawrence and Giles
(1998) used the results of several queries to estimate that the largest search
engine indexes only about one third of the indexable Web (the portion of the
Web that is accessible to crawlers, i.e., not hidden behind query interfaces).
Similar arguments were used by Bharat and Broder (1998) to estimate the
relative size of search engines.
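The overlap argument amounts to a capture-recapture style estimate. A minimal sketch, with purely illustrative numbers and under the (unrealistic) independence assumption discussed above:

```python
def estimate_web_size(index_size_a, sample_b, overlap):
    """Capture-recapture style estimate of the indexable Web.

    index_size_a: number of pages reported to be indexed by engine A;
    sample_b:     number of pages sampled from engine B's query results;
    overlap:      how many of those sampled pages are also indexed by A.
    Assumes A and B index pages independently; since that assumption fails
    in practice, the result is effectively a lower bound."""
    coverage_a = overlap / sample_b    # estimated fraction of the Web covered by A
    return index_size_a / coverage_a   # |A| = coverage_a * |Web|

# Purely illustrative numbers: if A reports 150 million indexed pages and half
# of B's sampled result pages are also found in A, the estimate is ~300 million.
print(estimate_web_size(150_000_000, 1_000, 500))
```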
redundancy: Quite often there is more than one page pointing to a single page
on the Web. The ability to combine multiple, independent sources of infor-
mation can improve classification accuracy as is witnessed by the success of
ensemble techniques in other areas of machine learning (Dietterich, 2000).
independent encoding: As the set of predecessor pages typically originates
from several different authors, the provided information is less sensitive to
the vocabulary used by one particular author.
sparse or non-existing text: Many Web pages only contain a limited amount
of text. In fact, many pages only contain images and no machine-readable
text at all. Looking at predecessor pages increases the chances of encoun-
tering informative text about the page to classify.
irrelevant or misleading text: Often pages contain irrelevant text. In other
cases, pages are designed to deliberately contain misleading information.
For example, many pages try to include a lot of keywords into comments
or invisible parts of the text in order to increase the breadth of their
indexing for word-based search engines.1 Again, the fact that predecessor
pages are typically authored by different and independent sources may
provide relevant information or improve focus.
foreign language text: English is (still) the predominant language on the
Web. Nevertheless, documents in other languages occur in non-negligible
numbers. Even if the text on the page is written in a foreign language,
there may be incoming links from pages that are written in English. In
many cases, this allows an English speaker (or a text classifier based on
English vocabulary) to infer something about the contents of the page.
In summary, the use of text on predecessor pages may provide a richer vo-
cabulary, independent assessment of the contents of a page through multiple
authors, redundancy in classification, focus on important aspects, and a simple
mechanism for dealing with pages that have sparse text, no text, or text in
different languages.
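As a simple illustration of this idea, the following hypothetical sketch collects the text of all predecessor pages into a surrogate document that a standard text classifier could use; the data structures (in_links, page_text) are assumed to be available, e.g., from a crawl.

```python
def predecessor_document(target, in_links, page_text):
    """Concatenate the text of all predecessor pages of target into one
    surrogate document for a conventional text classifier.

    in_links maps a page to the list of pages linking to it;
    page_text maps a page to its (possibly empty) textual content."""
    parts = [page_text.get(pred, "") for pred in in_links.get(target, [])]
    return " ".join(parts)
```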
a complete list of Austrian ski resorts in comments, apparently with the goal of being retrieved
if somebody looks for information about one of its competitors (in fact, this was how the page
was found by the author). For recognizing this page as a ski resort page, this information may
be helpful. On the other hand, the author also came across a page that included an entire
dictionary of the most common words of the English language as invisible text. . .
$$H_{i+1}(x) = \sum_{(x,s)} A_i(s)$$
$$A_{i+1}(x) = \sum_{(p,x)} H_i(p)$$
where (x, y) denotes that there is a hyperlink from page x to page y. This com-
putation is conducted on a so-called focussed subgraph of the Web, which is ob-
tained by enhancing the search result of a conventional query (or a bounded sub-
set of the result) with all predecessor and successor pages (or, again, a bounded
subset of them). The hub and authority scores are initialized uniformly with
$A_0(x) = H_0(x) = 1.0$ and normalized so that they sum up to one before each
iteration. Kleinberg (1999) was able to prove that this algorithm (called HITS)
will always converge, and practical experience shows that it will typically do so
within a few iterations (about 5; Chakrabarti et al., 1998b). HITS has been used
for identifying relevant documents for topics in web catalogues (Chakrabarti
et al., 1998b; Bharat and Henzinger, 1998) and for implementing a “Related
Pages” functionality (Dean and Henzinger, 1999).
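A minimal sketch of the HITS iteration described above (synchronous updates with normalization, as in the formulas); the focused subgraph is assumed to be given as a set of pages together with in_links and out_links dictionaries:

```python
def hits(pages, out_links, in_links, iterations=5):
    """Iterative computation of hub and authority scores on a focused subgraph."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of predecessors
        new_auth = {p: sum(hub[q] for q in in_links.get(p, []) if q in hub)
                    for p in pages}
        # hub score: sum of authority scores of successors
        new_hub = {p: sum(auth[q] for q in out_links.get(p, []) if q in auth)
                   for p in pages}
        # normalize so that the scores sum to one before the next iteration
        a_sum = sum(new_auth.values()) or 1.0
        h_sum = sum(new_hub.values()) or 1.0
        auth = {p: s / a_sum for p, s in new_auth.items()}
        hub = {p: s / h_sum for p, s in new_hub.items()}
    return hub, auth
```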
The main drawback of the HITS algorithm is that the hub and authority
scores must be computed iteratively from the query result, which does not
meet the real-time constraints of an on-line search engine. However, the im-
plementation of a similar idea in the Google search engine resulted in a major
break-through in search engine technology. Brin and Page (1998) suggested the
use of the probability that a page is visited by a random surfer on the Web as a
key factor for ranking search results. They approximated this probability with
the so-called PageRank, which is again computed iteratively:
$$PR_{i+1}(x) = (1 - l)\,\frac{1}{N} + l \sum_{(p,x)} \frac{PR_i(p)}{|(p, y)|} \qquad (1)$$
The first term of this sum models the behavior of a surfer who gets bored (with
probability $(1 - l)$, where $l$ is typically set to 0.85) and jumps to a randomly
selected page of the entire set of N pages. The second term uniformly distributes
the current page rank of a page to all its successor pages. Thus, a page receives
a high page rank if it is linked by many pages, which in turn have a high page
rank and/or only a few successor pages. The main advantage of the page rank
over the hub and authority scores is that it can be computed off-line, i.e., it
can be precomputed for all pages in the index of a search engine. Its clever
(but secret) integration with other information that is typically used by search
engines (number of matching query terms, location of matches, proximity of
matches, etc.) promoted Google from a student project to the main player in
search engine technology.
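A minimal sketch of the PageRank iteration in equation (1), assuming the link structure is given as dictionaries; dangling pages (pages without outgoing links) are simply skipped here, which a production implementation would handle more carefully:

```python
def pagerank(pages, out_links, in_links, l=0.85, iterations=50):
    """Iterative PageRank computation following equation (1):
    PR(x) = (1 - l)/N + l * sum over predecessors p of PR(p)/outdegree(p)."""
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for x in pages:
            # distribute the rank of each predecessor over its successors
            rank = sum(pr[p] / len(out_links[p])
                       for p in in_links.get(x, [])
                       if p in pr and out_links.get(p))
            new_pr[x] = (1.0 - l) / n + l * rank
        pr = new_pr
    return pr
```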
3.2 Classification
Not surprisingly, recent research has also looked at the potential of hyperlinks as
additional information source for hypertext categorization tasks. Many authors
addressed this problem in one way or another by merging (parts of) the text
of the predecessor pages with the text of the page to classify, or by keeping
a separate feature set for the predecessor pages. For example, Chakrabarti
et al. (1998a) evaluated two variants, one that simply appends the text of the
neighboring (predecessor and successor) pages to the text of the target page,
and one that uses two different sets of features, one for the target page and
one for a concatenation of the neighboring pages. The results were negative: in
two domains both approaches performed worse than the conventional technique
that uses only features of the target document. From these results, Chakrabarti
et al. (1998a) concluded that the text from the neighbors is too noisy to help
classification and proposed a different technique that included predictions for
the class labels of the neighboring pages into the model. Unless the labels for the
neighbors are known a priori, the implementation of this approach requires an
iterative technique for assigning the labels, because changing the class of a page
may potentially change the class assignments for all neighboring pages as well.
The authors implemented a relaxation labeling technique, and showed that it
improves performance over the standard text-based approach that ignores the
hyperlink structure. The utility of class predictions for neighboring pages was
confirmed by the results of Oh et al. (2000) and Yang et al. (2002).
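The following sketch conveys the flavor of such an iterative labeling scheme; it is not Chakrabarti et al.'s actual relaxation labeling procedure, and both classifiers (base_classify, relational_classify) are assumed to be already trained:

```python
def iterative_labeling(pages, in_links, base_classify, relational_classify,
                       iterations=10):
    """Iteratively re-assign class labels using the current labels of
    predecessor pages, in the spirit of relaxation labeling.

    base_classify(page) returns an initial label from the page's own text;
    relational_classify(page, neighbor_labels) also sees the current labels
    of the page's predecessors."""
    labels = {p: base_classify(p) for p in pages}
    for _ in range(iterations):
        new_labels = {}
        for p in pages:
            neighbor_labels = [labels[q] for q in in_links.get(p, []) if q in labels]
            new_labels[p] = relational_classify(p, neighbor_labels)
        if new_labels == labels:   # labels have stabilized
            break
        labels = new_labels
    return labels
```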
A different line of research concentrates on explicitly encoding the rela-
tional structure of the Web in first-order logic. For example, a binary pred-
icate link_to(page1,page2) can be used to represent the fact that there is
a hyperlink on page1 that points to page2. In order to be able to deal with
such a representation, one has to go beyond traditional attribute-value learn-
ing algorithms and resort to inductive logic programming, aka relational data
mining (Džeroski and Lavrač, 2001). Craven et al. (1998) use a variant of Foil
(Quinlan, 1990) to learn classification rules that can incorporate features from
neighboring pages. The algorithm uses a deterministic version of relational
path-finding (Richards and Mooney, 1992), which overcomes Foil’s restriction
to determinate literals (Quinlan, 1991), to construct chains of link_to/2 pred-
icates that allow the learner to access the words on a page via a predicate of the
type has_word(page,word). For example, the conjunction link_to(P1,P),
has_word(P1,word) means “there exists a predecessor page P1 that contains
the word word”. Slattery and Mitchell (2000) improve the basic Foil-like learning
algorithm by integrating it with ideas originating from the HITS algorithm for
computing hub and authority scores of pages, while Craven and Slattery (2001)
combine it favorably with a Naive Bayes classifier.
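Rendered in Python rather than first-order logic, the rule body link_to(P1,P), has_word(P1,word) simply asks whether some predecessor of the target page contains the given word; a hypothetical sketch:

```python
def covered_by_rule(page, word, in_links, page_words):
    """Check the rule body link_to(P1, P), has_word(P1, word):
    is there a predecessor P1 of page that contains the given word?

    page_words maps each page to the set of words occurring on it."""
    return any(word in page_words.get(p1, set())
               for p1 in in_links.get(page, []))
```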
At its core, using features of pages that are linked via a link_to/2 predicate
is quite similar to the approach evaluated by Chakrabarti et al. (1998a) where
words of neighboring documents are added as a separate feature set: in both
cases, the learner has access to all the features in the neighboring documents.
The main difference lies in the fact that in the relational representation, the
learner may control the depth of the chains of link_to/2 predicates, i.e., it may
incorporate features from pages that are several clicks apart. From a practical
point of view, the main difference lies in the types of learning algorithms that
are used with both approaches: while inductive logic programming typically
relies on rule learning algorithms which classify pages with “hard” classification
rules that predict a class by looking only at a few selected features, Chakrabarti
et al. (1998a) used learning algorithms that always take all available features
into account (such as a Naive Bayes classifier). Yang et al. (2002) discuss both
approaches, relate them to a taxonomy of five possible regularities that may be
present in the neighborhood of a target page, and empirically compare them
under different conditions.
Fürnkranz et al., 1998). The overall classifier improved the full-text classifier
from about 70% accuracy to about 85% accuracy in this domain. It remains to be
seen whether this generalizes to other domains.
4 Conclusion
The main purpose of this paper was to motivate that the graph structure of
the Web may provide a valuable source of information for various Web mining
tasks, as witnessed by the success of search engines that try to incorporate graph
properties into the ranking of their query results. In particular, we wanted to
show that the information provided by predecessor pages can be successfully used for
improving the performance of text classification tasks. We reviewed various
previous approaches, and briefly discussed our own solution. However, research
in this area has just begun. . .
Acknowledgments
The Austrian Research Institute for Artificial Intelligence is supported by the Austrian
Federal Ministry of Education, Science and Culture.
References
R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the world-wide web.
Nature, 401:130–131, September 1999.
K. Bharat and A. Broder. A technique for measuring the relative size and overlap
of public web search engines. Computer Networks, 30(1–7):107–117, 1998.
Proceedings of the 7th International World Wide Web Conference (WWW-
7), Brisbane, Australia.
K. Bharat, A. Broder, M. R. Henzinger, P. Kumar, and S. Venkatasubramanian.
The connectivity server: Fast access to linkage information on the Web. Com-
puter Networks, 30(1–7):469–477, 1998. Proceedings of the 7th International
World Wide Web Conference (WWW-7), Brisbane, Australia.
K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a
hyperlinked environment. In Proceedings of the 21st ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR-98), pages
104–111, 1998.
S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search
engine. Computer Networks, 30(1–7):107–117, 1998. Proceedings of the 7th
International World Wide Web Conference (WWW-7), Brisbane, Australia.
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata,
A. Tomkins, and J. Wiener. Graph structure in the Web. Computer Networks,
33(1–6):309–320, 2000. Proceedings of the 9th International World Wide Web
Conference (WWW-9).
S. Chakrabarti. Data mining for hypertext: A tutorial survey. SIGKDD Explo-
rations, 1(2):1–11, January 2000.
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using
hyperlinks. In Proceedings of the ACM SIGMOD International Conference
on Management of Data, pages 307–318, Seattle, WA, 1998a. ACM Press.
J. Fürnkranz, T. Mitchell, and E. Riloff. A case study in using linguistic phrases
for text categorization on the WWW. In M. Sahami, editor, Learning for Text
Categorization: Proceedings of the 1998 AAAI/ICML Workshop, pages 5–12,
Madison, WI, 1998. AAAI Press. Technical Report WS-98-05.
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal
of the ACM, 46(5):604–632, September 1999.
S. Lawrence and C. L. Giles. Searching the world wide web. Science, 280:
98–100, 1998.
M. Levene, J. Borges, and G. Louizou. Zipf’s law for Web surfers. Knowledge
and Information Systems, 3(1):120–129, 2001.
O. A. McBryan. GENVL and WWWW: Tools for taming the Web. In Proceed-
ings of the 1st World-Wide Web Conference (WWW-1), pages 58–67, Geneva,
Switzerland, 1994. Elsevier.
H.-J. Oh, S. H. Myaeng, and M.-H. Lee. A practical hypertext categorization
method using links and incrementally available class information. In Proceed-
ings of the 23rd ACM International Conference on Research and Development
in Information Retrieval (SIGIR-00), pages 264–271, Athens, Greece, 2000.