CALA: A Non-Supervised URL-Based Web Page Classifier
Inma Hernández∗, Carlos R. Rivero, David Ruiz, Rafael Corchuelo

Universidad de Sevilla, ETSI Informática.
Avda. de la Reina Mercedes, s/n, Sevilla E-41012, Spain.
Abstract
Unsupervised URL-Based Web Page Classification refers to the problem of clustering the URLs in a web site so that each cluster includes a set of pages that can be classified using a unique class. The existing proposals to perform URL-based classification suffer from a number of drawbacks: they are supervised, which requires the user to provide labelled training data and makes them difficult to scale; they are language or domain dependent, since they require the user to provide dictionaries of words; or they require extensive crawling, which is time and resource consuming. In this article, we propose a new statistical technique to mine URL patterns that are able to classify web pages. Our proposal is unsupervised, language and domain independent, and does not require extensive crawling. We have evaluated our proposal on 45 real-world web sites, and the results confirm that it achieves a mean precision of 98% and a mean recall of 91%, and that its performance is comparable to that of a supervised classification technique, while it does not require labelling large sets of sample pages. Furthermore, we propose a novel application that helps to extract the underlying model from non-semantic-web sites.
Keywords:
Web Page Classication, URL Classication, URL Patterns
1. Introduction
Typical web sites provide pages about a number of topics; in other words, their pages can be classified into a number of semantic classes that describe their contents. For instance, a typical on-line book store may have pages about books, authors, publishers, reviews, offers, and so on.

URL-based web page classification refers to the problem of classifying a web page with respect to its topic by analysing features extracted from its URL, regardless of its content and/or structure. Therefore, this classification does not require the page to be downloaded prior to being classified.
URL-based classification of web pages has many applications, including: endowing Virtual Integration crawlers with the intelligence to determine whether a page may contain information relevant to a query or not (Blanco et al., 2011; Li and Zhong, 2004; Vidal et al., 2008), applying the most appropriate extraction model to each page (Crescenzi et al., 2001), performing URL filtering to avoid certain types of contents (e.g., parental control systems (Zhang et al., 2006) or advertising removal (Shih and Karger, 2004)), or detection and canonisation of duplicated URLs (Bar-Yossef et al., 2009; Koppula et al., 2010), amongst others.

∗Corresponding author
Email addresses: inmahernandez@us.es (Inma Hernández), carlosrivero@us.es (Carlos R. Rivero), druiz@us.es (David Ruiz), corchu@us.es (Rafael Corchuelo)

Preprint submitted to Elsevier, June 28, 2012
There are other proposals in the area of URL-based web page classification. These proposals present some drawbacks, mainly regarding scalability and applicability. With regard to scalability, some of them are supervised (Baykan et al., 2009; Kan and Thi, 2005; Vidal et al., 2008), others require an extensive crawling of the whole site under analysis to build the classification model (Vidal et al., 2008), while others need to download each page to extract additional features from inside it to achieve a good performance (Blanco et al., 2011). As for applicability, the main drawback is that many of these proposals are either domain, language, or site-dependent (Baykan et al., 2011, 2009; Kan and Thi, 2005; Shih and Karger, 2004), which means that their techniques cannot be applied to every site on the Web.
In this paper, we propose a statistical technique called CALA Pattern Miner to obtain URL patterns for each web site without supervision. Each of these patterns matches a collection of URLs of the site containing information of a certain class, i.e., they can be used to classify the URLs of a site according to the class of information their pages contain. A two-page poster on our proposal has been published elsewhere (Hernández et al., 2012).
Our proposal does not suffer from the previous drawbacks. First, it is not supervised. Also, it does not require an extensive crawling of the site to build the classification model, but only a small subset of hub pages (e.g., in our experiments, 100 hub pages were enough to achieve a good performance). Hub pages are rich in links, and they are usually obtained in response to a query issued by means of a web form. To fill in the forms, we use keywords that are extracted from the same site automatically, hence no dictionary or user input is needed. Moreover, it is based exclusively on URL features, so pages do not have to be downloaded to be classified. Finally, it is domain, language, and site-independent, since the user does not have to know particular details about the site, the domain, or the language in which the site pages are written in order to make it work properly. Therefore, our technique is both scalable and generally applicable.
We have performed an experimental evaluation of our technique on a representative dataset of web sites, comparing its performance with two baseline techniques, one supervised (Blanco et al., 2008) and one unsupervised (Ben-Hur et al., 2001). We obtained good precision and recall values, comparable to those of the supervised technique, while our F1 values are better than those of both the supervised and the unsupervised baselines. Also, we observed that our execution times are comparable to those of supervised techniques (note that, in the latter case, we did not consider in the analysis the time and effort spent by the user in annotating the pages of the training set, which should be added to the final time), while they are considerably lower than those of unsupervised techniques.

The rest of this article is organised as follows: Section 2 presents the related work in URL-based web page classification; Section 3 defines our proposal to build URL patterns; Section 4 shows the evaluation of our technique; finally, Section 5 lists some of the conclusions drawn from the research and concludes the article.
2. Related Work
In the analysis of existing proposals in the web page classification area, two main issues must be addressed: firstly, features for classification can be obtained from inside and/or outside the pages to be classified; secondly, these proposals can be either supervised or unsupervised. For a complete survey on topic classification of web pages, see (Qi and Davison, 2009).

Features for web page classification can be either inner, i.e., obtained from inside the pages, like the page content (Selamat and Omatu, 2004), its structure (Vidal et al., 2008), or the disposition of links amongst pages (Wang et al., 2001); or outer, i.e., obtained from outside the page, like the URL of the page (Baykan et al., 2009; Blanco et al., 2011; Kan and Thi, 2005). Note that the latter type of features allows classifying pages without the need to download them, which is our focus and is specially desirable in Virtual Integration contexts, in which the user is waiting for an answer and response time is an issue (Trillo et al., 2011). Therefore, we focus on URL pattern mining, i.e., finding patterns that allow classifying a web page building solely on its URL.
Supervised classification techniques require the user to build a training set consisting of labelled examples. In the Web context, this usually means annotating a large number of examples by hand, which requires a significant amount of time. (Baykan et al., 2011, 2009) and (Kan and Thi, 2005) classify URLs by matching them against URL patterns composed of sets of keywords. They need to provide their algorithms with a list of words that is representative of every class, which makes their proposals language and domain dependent. Furthermore, these classifiers need to be provided with a sufficiently large collection of both positive and negative labelled training URLs. (Shih and Karger, 2004) use a tree structure to represent URLs and to obtain URL patterns that are prefixes of all URLs of a certain class. The features on which their technique relies depend heavily on the site being analysed, and it also requires positive labelled training URLs, which makes it difficult to scale.
A solution to this problem might build on unsupervised classifiers, such as traditional clustering techniques. Such techniques rely on a distance function and return clusters that verify that the inter-cluster distance is maximum, whereas the intra-cluster distance is minimum. Since URLs can be naturally represented as strings, the idea would be to use a well-known string distance, e.g., the edit distance (Ristad and Yianilos, 1998) or the longest common subsequence (Hirschberg, 1977). Unfortunately, it has been noticed that edit distance measures applied to URLs do not seem to work well to mine URL patterns (Blanco et al., 2011), i.e., two close URLs may provide information about two different classes, whereas distant URLs may be related to pages with similar classes. For example, there is a minimal distance between the following URLs:
http://academic.research.microsoft.com/Detail?entitytype=2&searchtype=2&id=35096884 and
http://academic.research.microsoft.com/Detail?entitytype=2&searchtype=5&id=35096884,
but they point to pages that are likely to be classified in different classes (Publications and Citations). Contrarily, URLs like
http://academic.research.microsoft.com/Author/1939777/ramesh-jain and
http://academic.research.microsoft.com/Author/10540585/yu-li
are more distant but belong to the same class (Authors). This has motivated a number of authors to work on ad-hoc clustering techniques to mine URL patterns.
(Bar-Yossef et al., 2009) and (Koppula et al., 2010) explore the possibility of creating URL patterns to detect the existence of web pages with different URLs that have the same content, which has a negative impact on crawling efficiency. To solve this problem, some rules are generated to detect URLs that are duplicated, and URL patterns are used to normalise those URLs. In both proposals, the goal is not to classify URLs according to the class of content in their targets, but to locate pages that have exactly the same content. Also, they need to have a large collection of URLs to obtain good results, which means a previous extensive crawling of the site (or the equivalent manual process) to gather them.

(Vidal et al., 2008) proposed a technique to build a map of a web site and generate URL patterns for those pages that lead (directly or eventually) to pages that are similar to a given example. They have to previously crawl the entire site, download each page, and then process them, which usually takes a significant amount of time and resources.

(Blanco et al., 2011) proposed an algorithm to classify web pages without supervision that combines web page contents and their URLs as features. They require a large training set, so they crawl the entire site in their experiments. Furthermore, to improve the classification efficiency, features from the page itself are included in addition to the link-based features, which means that it must be downloaded previously.

(Brin, 1998) builds text patterns for information extraction, supported by URL patterns that match the pages that contain a fragment of text matching the text pattern. To build the URL patterns, their algorithm has first to visit the page to check whether it matches the text pattern. Therefore, they have to previously perform an extensive crawling on the site to detect all pages that match the text pattern.
3. Proposal

Our proposal is a technique to build a set of URL patterns that represent the collection of URLs in a web site, where each pattern represents a different cluster of URLs.
The input to our technique is a set of URLs from pages of the site under analysis. Since our proposal is grounded on statistics, we do not have to deal with the whole site. Instead, we just require a subset of hubs, which are pages that provide summaries and links to other pages (Kleinberg, 1999). Typically, web sites return hubs as a response to user queries. Note that hubs usually contain a larger number of URLs than other types of pages in the web site, given that their goal is to offer the users as many results related to their queries as possible. Therefore, the probability that they contain a sufficiently representative set of URLs is higher than for any other type of page. Hence, we chose hubs to be the source from which we extract sets of URLs.

Pattern building consists of retaining the URL segments that are common to all URLs of the same class, while abstracting the segments that change frequently into a wildcard. To discern those segments that change frequently, we use a statistical technique that assigns each prefix inside a URL an estimator of the probability of that prefix appearing in pages from the web site. Then, this estimator is used to discern whether to keep that prefix as a literal or to replace it with a wildcard.

In the following subsections, we first present some preliminary definitions, then a running example, and, finally, describe the details of our proposal.
3.1. Preliminaries

We first define the concepts of hub and hubset:

Definition 1 (Hub). Let q be a query that can be issued on a search form in a web site; we define H(q) as the hub page that is returned as a response to q.

Definition 2 (Hubset). Let Q be a set of queries that can be issued on a search form in a web site; we define H(Q) as the set of hubs that is obtained after issuing all queries in Q.

Once we have obtained a hubset H(Q), we obtain a URL set U by flattening H(Q), i.e., by joining the sets of URLs inside all hubs in H(Q). Then, we must define a tokenisation of those URLs according to a particular format, e.g., the W3C recommendation for URI syntax, RFC 3986.

Definition 3 (Tokens). Let u be a URL; we define T(u) = ⟨t1, …, tn⟩ as the sequence of tokens that is obtained after tokenising u.

Definition 4 (Prefixes). Let u be a URL, T(u) = ⟨t1, …, tn⟩ a tokenisation of u, and i ≤ n; we define P(u, i) = ⟨t1, …, ti⟩ as a prefix of u with a given size i. Note that when i = n, we have P(u, i) = T(u).

Definition 5 (URL Pattern). A pattern is a sequence of tokens, such that some of the tokens are literals, whereas others are wildcards. We denote a pattern p as p = ⟨t1, …, tm⟩, m > 0. Note that patterns are sequences of tokens, and therefore definitions HubSet(P(p, i)), prefixes P(p, i), and probability estimators Fp,i can be applied to patterns as well. For the sake of brevity, from now onwards we will refer to URL patterns simply as patterns.
Figure 1: URLs in the Microsoft Academic Search example set (a prefix tree whose nodes n0–n31 are labelled with tokens such as http, /academic.research.microsoft.com, /Publication, /Author, /Journal, /Detail, ?entitytype, &searchtype, and &id).
Definition 6 (Hubset of a prefix). Let u be a URL and P(u, i) be a prefix of u of size i; we define HubSet(P(u, i)) as the set of hubs in which there is at least one URL u∗ such that P(u, i) = P(u∗, i).

We use a tree notation to represent sets of URLs and their tokens based on PATRICIA trees, which allows representing large collections of strings efficiently and compactly. Every node in the tree is defined by a label ni and refers to a token tj. Each node actually represents a prefix of a URL, which is the sequence of tokens in the path from the tree root n0 to the node itself. Note that each path from the tree root to a leaf represents a single URL. For the sake of readability, each token in each node is preceded by the character that separates it from the previous token.
3.2. Running example

Microsoft Academic Search is a scholarly web site that offers information about papers, authors, citations, and publishing hosts, such as journals or conferences. This site's search form includes a text box, in which users can perform keyword-based searches. Therefore, an example of a query that can be issued in this site is an author name, e.g., q = Christian Bizer (note that, in our proposal, keywords for queries are automatically extracted from the web site).

The response to that query is a hub page H(q) with a list of publications for authors whose name is similar or equal to Christian Bizer (e.g., publications of Chris Bizer are also included in this hub). For each publication, the hub includes a link to the publication content, links to each of its authors, a link to its citations, and a link to the publishing host. In this case, an example of a set of queries Q is a list of authors, such as Q = {Christian Bizer, Tom Heath, Tim Berners-Lee}. Using every one of these authors as keywords in the search form, we obtain a hubset H(Q). Hubs in this hubset contain URLs like, for example,
i   Node  Token           Fu,i    i   Node  Token                             Fu,i
1   n0    http            1.00    2   n1    /academic.research.microsoft.com  1.00
3   n2    /Publication    0.99    4   n3    /417664                           0.01
5   n4    /dbpedia        0.01    4   n5    /5638047                          0.01
5   n6    /linked-data    0.01    4   n7    /1242380                          0.01
5   n8    /named-graphs   0.01    3   n9    /Author                           0.99
4   n10   /38181          0.01    5   n11   /tim-berners-lee                  0.01
4   n12   /43723          0.02    5   n13   /tom-heath                        0.02
4   n14   /255707         0.01    5   n15   /christian-bizer                  0.01
4   n16   /3535385        0.01    5   n17   /s-ren-auer                       0.01
3   n18   /Journal        0.89    4   n19   /870                              0.01
5   n20   /ijswis         0.01    4   n21   /889                              0.01
5   n22   /jws            0.01    3   n23   /Detail                           1.00
4   n24   ?entityType     1.00    5   n25   =1                                1.00
6   n26   &searchType     1.00    7   n27   =5                                1.00
8   n28   &id             1.00    9   n29   =1242380                          0.01
9   n30   =5638047        0.01    9   n31   =4117664                          0.01

Figure 2: Probability estimators
u1 = http://academic.research.microsoft.com/Publication/5638047/linked-data.

Applying a tokeniser based on RFC 3986, we obtain that T(u1) = ⟨http, academic.research.microsoft.com, Publication, 5638047, linked-data⟩. For i = 3, the prefix of u1 is P(u1, 3) = ⟨http, academic.research.microsoft.com, Publication⟩. Finally, HubSet(P(u1, 3)) is composed of all the hubs in H(Q), since, regardless of the query we issue, every hub in Microsoft Academic Search includes links to publications, and they all share the same prefix. Figures 1 and 2 illustrate this example.
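The tokenisation above can be sketched in a few lines. The following is an illustrative sketch (not the authors' actual implementation), assuming that tokens are the maximal runs of characters between the RFC 3986 path/query separators:

```python
import re

def tokenise(url):
    # Split a URL into tokens at the RFC 3986 path/query separators,
    # keeping the scheme as the first token.
    scheme, rest = url.split("://", 1)
    # Tokens are the maximal runs of non-separator characters.
    return [scheme] + re.findall(r"[^/?#&=]+", rest)

def prefix(url, i):
    # P(u, i): the first i tokens of the tokenised URL.
    return tuple(tokenise(url)[:i])

u1 = "http://academic.research.microsoft.com/Publication/5638047/linked-data"
print(tokenise(u1))
# -> ['http', 'academic.research.microsoft.com', 'Publication',
#     '5638047', 'linked-data']
print(prefix(u1, 3))
# -> ('http', 'academic.research.microsoft.com', 'Publication')
```

The same function also splits query strings, so a URL such as /Detail?entitytype=2&id=35096884 yields the tokens Detail, entitytype, 2, id, 35096884.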
3.3. CALA Pattern Miner

The pattern mining process consists of the following steps: first, calculate probability estimators for each token in each URL of the hubset; then, traverse the whole tree and, for each visited node, use those probability estimators to discern whether it is significant. Non-significant tokens are then abstracted into a wildcard. Finally, each remaining branch of the tree corresponds to a different pattern.
3.3.1. Probability Estimators Calculation

Our technique relies on assigning a probability estimator to each prefix P(u, i) in each URL u obtained from a site. This value estimates the probability of hubs in a hubset H(Q) containing at least one URL that shares the same prefix. Since we are not generally aware of the underlying distribution of prefixes, we calculate the probability estimator as the relative frequency of every prefix in a hubset (Hacking, 2001).
Figure 3: Fu,i values histogram, from 45 evaluation sites (logarithmic scale; one series per site: Amazon, DailyMail, Torrentz, Etsy, Google Scholar, Freelancer, Newegg, People, DailyMotion, Deviantart, Guardian, BBC, Arxiv, BTJunkie, Overblog, Indeed, Ehow, Filestube, Archive, Alibaba, LiveJournal, PlentyOfFish, Chip, Gamefaqs, Answers, HuffingtonPost, Isohunt, Target, Xing, Slideshare, Battle.net, Digg, Sourceforge, Yelp, TDG Scholar, Odesk, Netlog, Fiverr, IndiaTimes, Squidoo, Metacafe, Ms Academic, ArticlesBase, Drupal, Fotolia)
Definition 7 (Probability estimator). Let H(Q) be a hubset, u a URL, and P(u, i) a prefix of u of size i; we define the probability estimator of prefix P(u, i) appearing in H(Q) as:

P(P(u, i) | H(Q)) = |HubSet(P(u, i)) ∩ H(Q)| / |H(Q)|

and we denote it as Fu,i.

Fu,i values range from 0.0 to 1.0. The more frequently that prefixes appear in hubs from H(Q), the higher the value of their Fu,i is. For example, in Figure 2 we show the different probability estimators calculated for the different prefixes in one URL set obtained from Microsoft Academic Search by issuing |Q| = 100 queries. For the sake of simplicity, we show only the last token of each prefix.

Note that the prefix P1 in node n9 is very frequent, since it is in almost every hub from H(Q); contrarily, prefix P2 in node n21 only appears in 1% of the hubs in the hubset, i.e., only by issuing one particular query from the set of 100 queries do we obtain a hub with a URL including that prefix. Another example is prefix P3 in node n18, with a probability estimator of 0.89, which is significantly high, but not as high as the probability estimator of prefix P1. Note that publications are posted in either journals or conferences, but in both cases they have authors and co-authors, hence prefix P1 is more frequent than prefix P3.

Our conjecture is that, for prefixes whose Fu,i in a hubset H(Q) is not near 1.00, it is most probably around 1/|H(Q)|, i.e., probability values are grouped around the two extremes of the distribution (0.00 and 1.00), with a minimal number of prefixes whose probability lies in the middle of the distribution. Figure 3 shows the histogram of Fu,i values obtained from hubsets of 100 hubs from 45 different web sites. We observe that most values are grouped in the interval (0.00, 0.05), except for a small but significant group of values around 1.00, which supports our previous conjecture. We must highlight that this graph is expressed in terms of a logarithmic scale.
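Under Definition 7, Fu,i is simply the fraction of hubs in the hubset that contain at least one URL sharing the prefix. A minimal sketch of this computation (our own illustrative data structures, with hubs given as lists of already-tokenised URLs):

```python
def prefixes(tokens):
    # All prefixes P(u, i) of a tokenised URL, for i = 1..n.
    return [tuple(tokens[:i]) for i in range(1, len(tokens) + 1)]

def estimators(hubset):
    # hubset: list of hubs, each hub a list of tokenised URLs.
    # Returns F[prefix] = |HubSet(P(u, i)) ∩ H(Q)| / |H(Q)|.
    counts = {}
    for hub in hubset:
        seen = set()  # count each hub at most once per prefix
        for tokens in hub:
            seen.update(prefixes(tokens))
        for p in seen:
            counts[p] = counts.get(p, 0) + 1
    return {p: c / len(hubset) for p, c in counts.items()}

# Toy hubset with two hubs: the 'Publication' prefix appears in both
# hubs, while each numeric identifier appears in only one of them.
h1 = [["http", "site.com", "Publication", "1"]]
h2 = [["http", "site.com", "Publication", "2"]]
F = estimators([h1, h2])
print(F[("http", "site.com", "Publication")])       # -> 1.0
print(F[("http", "site.com", "Publication", "1")])  # -> 0.5
```

Note the per-hub deduplication: a prefix repeated many times inside one hub still contributes a single hub to the count, matching the definition over HubSet.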
We represent patterns by means of a limited subset of regular expressions that includes only literals and wildcard expressions, which we represent with the symbol ⋆. A wildcard is a regular expression that accounts for characters other than separators. For example, if we use the RFC 3986 tokenisation, then ⋆ represents the following regular expression: [^/?#&=]+.

As an example, pattern http://academic.research.microsoft.com/Author/⋆/⋆ matches the URLs of pages with information about authors in Microsoft Academic Search; this is the case of the URLs represented as paths in the tree that contain node n9 in Figure 1.
Definition 8 (Sibling prefixes). Let U be a set of URLs obtained from a hubset H(Q), u ∈ U a URL, and P(u, i) a prefix with size i > 1; we define the set Siblings(u, i) as the set of prefixes of URLs u∗ ∈ U with size i that share a common prefix P(u, i − 1) of size i − 1 with u.

3.3.2. Pattern Mining
Based on both the probability estimators and the concept of sibling prefixes, we define a process to mine patterns. For each prefix P(u, i) in the URL set, we compare all the different probability estimators for both P(u, i) and its sibling prefixes. In those prefixes whose value is not significantly high, the last token is replaced with a wildcard. Contrarily, prefixes with a significantly high value are probably part of a pattern, so they are not abstracted, but kept as literals. The criterion to discern significantly high probability estimators is formalised in the following definition.
Definition 9 (Wildcarding criterion). Let u be a URL, i an integer, and α and β parameters. We define the criterion to discern prefixes with a significantly high probability estimator as:

wildcarding(u, i, α) = false, if Fu,i ∈ [1.0 − β, 1.0] ∨ isOutlier(Fu,i, Siblings(u, i), α); true, otherwise

The former criterion is false in two cases: a) if the prefix probability estimator is near 1.0 (where β is a parameter that defines a threshold around 1.0), and b) if the prefix probability estimator is an outlier in the distribution of probability estimators of the other prefixes that are its siblings.
Function isOutlier(a, B, α) checks whether a certain value a is an outlier in the distribution of values B, with a reliability level of α. An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Grubbs, 1969). Although any valid function may be used, ours is based on the Cantelli inequality, which states that, for any given statistical distribution, it holds that:

P(X − µ ≥ kσ) ≤ 1 / (1 + k²)

Therefore, for a given reliability level α, it is possible to calculate a threshold value µ + kσ such that the data above that value represent at most an α fraction of the distribution. Typical values for α are 0.01 or 0.05. Hence, we mark all values above the threshold value as outliers.

Figure 4: URL set tree after pattern building, showing the tree of Figure 1 after wildcarding (the literal nodes n2 /Publication, n9 /Author, n18 /Journal, and n23 /Detail, together with the query tokens ?entityType, =1, &searchType, =5, and &id, are kept, while their sibling nodes are merged into wildcard nodes w1–w7), and the correspondence between the initial nodes ni and the resulting wildcard nodes wj: w1 = {n3, n5, n7}, w2 = {n4, n6, n8}, w3 = {n10, n12, n14, n16}, w4 = {n11, n13, n15, n17}, w5 = {n19, n21}, w6 = {n20, n22}, w7 = {n29, n30, n31}.
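The outlier check and the wildcarding criterion can be sketched as follows. This is an illustrative sketch under the paper's definitions, not the actual implementation; in particular, treating a zero-variance sample as having no upper tail is an assumption of ours:

```python
import math
import statistics

def is_outlier(a, values, alpha):
    # Cantelli: P(X - mu >= k*sigma) <= 1/(1 + k^2). Setting
    # 1/(1 + k^2) = alpha gives k = sqrt(1/alpha - 1); values above
    # mu + k*sigma are marked as outliers.
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return a > mu  # assumed: with zero variance, any deviation is outlying
    k = math.sqrt(1 / alpha - 1)
    return a > mu + k * sigma

def wildcarding(f, siblings, alpha, beta=0.05):
    # False (keep the token as a literal) if the estimator F is near 1.0
    # or is an outlier among its sibling prefixes; True (abstract into a
    # wildcard) otherwise.
    return not (f >= 1.0 - beta or is_outlier(f, siblings, alpha))

sibs = [0.01, 0.01, 0.01]
print(wildcarding(0.01, sibs, alpha=0.05))  # -> True: abstract into a wildcard
print(wildcarding(0.99, sibs, alpha=0.05))  # -> False: keep as a literal
```

The two calls mirror the running example: the numeric identifiers under /Publication all have Fu,i ≈ 0.01 and are wildcarded, whereas a token with Fu,i = 0.99 is kept.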
Once we have defined the wildcarding criterion, we use it to process each node in the tree, using a depth-first traversal. In each visited node, we apply the wildcarding criterion and, in case it is true, we abstract the prefix represented by the node by replacing the node token (i.e., the last token in the prefix) with a wildcard ⋆. Whenever two or more sibling prefixes are wildcarded, their associated nodes are merged. After the whole tree has been traversed and processed, each of the resulting tree branches represents a different pattern.

We present an example of the pattern building process in Figure 4, which shows the configuration of nodes in the tree after applying the building process, some of which are the result of merging one or more nodes ni into a wildcard node wj, together with the correspondence between the initial nodes ni and the resulting wildcard nodes wj. For example, wildcard node w1 is the result of merging nodes 417664 (t3), 5638047 (t5), and 1242380 (t7), given that their Fu,i values are {0.01, 0.01, 0.01}, none of which is near to 1.0 nor an outlier for the usual α values (0.01, 0.05).

After the pattern building process, in the example we obtain the following patterns:

p1 = ⟨http, academic.research.microsoft.com, Publication, ⋆, ⋆⟩,
p2 = ⟨http, academic.research.microsoft.com, Author, ⋆, ⋆⟩,
p3 = ⟨http, academic.research.microsoft.com, Journal, ⋆, ⋆⟩, and
p4 = ⟨http, academic.research.microsoft.com, Detail, entityType, 1, searchType, 5, id, ⋆⟩.
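Once mined, such patterns classify a URL by simple token-wise comparison: literals must coincide and wildcards match any single token. A minimal matching sketch (the dictionary of named patterns is our own illustrative device, not part of the proposal):

```python
WILDCARD = "*"

def matches(pattern, tokens):
    # A tokenised URL matches a pattern if both have the same length and
    # every pattern literal equals the corresponding token; wildcards
    # match any single token.
    if len(pattern) != len(tokens):
        return False
    return all(p == WILDCARD or p == t for p, t in zip(pattern, tokens))

def classify(patterns, tokens):
    # Return the class name of the first pattern the URL matches, if any.
    for name, pattern in patterns.items():
        if matches(pattern, tokens):
            return name
    return None

patterns = {
    "Authors": ["http", "academic.research.microsoft.com", "Author", "*", "*"],
    "Papers":  ["http", "academic.research.microsoft.com", "Publication", "*", "*"],
}
u = ["http", "academic.research.microsoft.com", "Author", "1939777", "ramesh-jain"]
print(classify(patterns, u))  # -> Authors
```

Since matching only inspects tokens, classification never downloads the page, which is the point of URL-based classification.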
4. Evaluation

In this section, we report on our experimental evaluation. In the following subsections, we describe the experimental design and then report on our results.
4.1. Evaluation Design

Instead of proving that our technique works well on all sites in the Web, which might be unapproachable, we focus on the most visited web sites on the Internet. We selected the top 41 most visited web sites according to the Alexa Web Directory (Alexa, 2011), plus four academic sites. For each site, we built URL patterns with CALA Pattern Miner (from now on, CALA), and we evaluated its performance. We also compared our technique with two baseline classification techniques: TemplateTokens (from now on, TT) and Support Vector Clustering (from now on, SVC).

TT is a supervised structural web page classifier, which detects the common template in a set of training pages (Blanco et al., 2008). In this context, the template of a class is the collection of HTML tags that are common to all training pages of said class. To classify validation pages, they are compared to the template, and a similarity measure is calculated. Pages with a similarity above a certain threshold are considered to belong to the class.
SVC, on the other hand, is an unsupervised clustering technique based on Support Vector Machines (Ben-Hur et al., 2001). We chose this clustering technique since it does not require knowing the optimal number of clusters in advance (as other algorithms such as K-Means do), while it achieves reasonable execution times (it is faster than other techniques such as Expectation-Maximisation). Note that SVC is specially suitable for text clustering, since it is able to handle high-dimensional feature spaces.

The performance of the three techniques was measured using the most usual evaluation measures for classification tasks, i.e., precision (P), recall (R), and F1-measure (F1). Execution times (T) for each technique were also measured for comparison. Finally, we also measured the number of URLs (U) in each class of each site.
To calculate the former measures, a definition is needed for each class under evaluation. Hence, for each web site, we observed the different classes of data they contain, each of which determines a separate class of URLs. For example, in Microsoft Academic Search, we identified pages containing three classes of information: Authors, Hosts, and Papers. For each class, an XPath expression was handcrafted to point to links inside hubs whose targets were pages of that class. A similar procedure is reported in (Blanco et al., 2011), the only difference being that they create regular expressions instead of XPath expressions.

For the three techniques, we first retrieved a hubset of 100 hubs from each evaluation site. To retrieve the hubsets, we used a lightweight crawler (CALA Lightweight Crawling) that requires the sites under analysis to have at least one search form with a text field and to be coded in HTML. The crawler located a search form in each site, filled in the form using keyword-based queries, and retrieved the resulting hub pages for each query. From each hub, all URLs were extracted and tokenised.
285
The keywords to ll in the forms were extracted by the lightweight crawler from each web site, by means
286
of an incremental process. This process starts by analysing the home page of the site under analysis, and
287
queueing the
288
word in the queue is used to ll in the form, and retrieve a hub page, which is likewise analysed to extract
289
and queue new keywords. Moreover, the hub itself is also stored to be used to build the classier. Note that
290
not all keywords yield results; in fact, some of the hubs thus retrieved are likely to be empty (i.e., a web page
291
with an informative message that stats that no results were found). Since all those hubs have approximately
292
the same content, they are not useful for building the classier, nor for extracting keywords, so these pages
293
are discarded, and keywords are not extracted from them. The criterion to discard pages is based on the
294
Cantelli inequality to detect the pages with the lower size, which most likely are empty hubs. This process
295
is repeated until a maximum number of hubs
296
retrieved hubsets are available for downloading at (Hernández, 2011b), along with the XPath expressions
297
that dene each URL class we evaluated, and the key word lists we used.
N least frequent words in the page text (e.g., in our experiments we selected N = 10).
M
is obtained (e.g., in our experiment,
M
Each
= 100). All the
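The empty-hub filter can be sketched as follows. The article does not spell out the exact formulation, so the one-sided form of the inequality and the cut-off `alpha` used here are assumptions for illustration:

```python
import statistics

def discard_empty_hubs(sizes, alpha=0.2):
    """Return the page sizes flagged as likely empty hubs.

    Cantelli's (one-sided Chebyshev) inequality bounds the probability
    that a page size X falls as far below the mean as s does:
        P(X <= s) <= sigma^2 / (sigma^2 + (mu - s)^2),  for s < mu.
    Pages below the mean whose bound drops under `alpha` are so
    unusually small that they most likely carry a "no results" message.
    """
    mu = statistics.mean(sizes)
    var = statistics.pvariance(sizes)
    return [s for s in sizes
            if s < mu and var / (var + (mu - s) ** 2) <= alpha]

# Six regular hubs of roughly 40 kB and one tiny "no results" page.
sizes = [41200, 39800, 40500, 42100, 40900, 39500, 1200]
print(discard_empty_hubs(sizes))  # -> [1200]
```

In this sketch the bound is computed per page; only pages whose size is implausibly far below the mean are discarded, which matches the intent of keeping every hub that plausibly contains results.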
To evaluate CALA Pattern Miner, half of the tokenised URLs were used to build a set of URL patterns using our Java implementation of the algorithms, and these patterns were used to classify the other half of the URLs. More details on our implementation are available at (Hernández et al., 2011); the main difference between that paper and this one is that the former lacks an evaluation section.

To evaluate the TT technique, we implemented its algorithms in Java. For each evaluation site and each class, we randomly selected 4 pages of the class to train the classifier, and we randomly selected 30 pages, from different classes, to validate the classifier.
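The tokenisation and the 50/50 build/classify split can be illustrated with a minimal sketch; the separator set and the URLs below are assumptions, not the paper's exact tokeniser:

```python
import re

def tokenise(url):
    # Split a URL into tokens at common separators; this separator set
    # is an illustrative assumption, not the paper's specification.
    return [t for t in re.split(r"[/?=&.:]+", url) if t]

urls = [f"http://www.example.com/books/{i}" for i in range(10)]
tokenised = [tokenise(u) for u in urls]

# Half of the tokenised URLs are used to build the patterns, and the
# other half to validate the classification.
half = len(tokenised) // 2
build, classify = tokenised[:half], tokenised[half:]
print(build[0])  # -> ['http', 'www', 'example', 'com', 'books', '0']
```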
Finally, to evaluate SVC, we used the implementation that can be found in the RapidMiner (Mierswa et al., 2006) data mining tool, using the URL tokens as features.

[Figure 5: Kruskal-Wallis H test and Wilcoxon signed-rank results. (a) Kruskal-Wallis p-values for the variables P, R, and F1, all below 0.05. (b) Wilcoxon signed-rank comparisons: P: TT > CALA (p = 0.018), CALA > SVC (p = 0.000); R: CALA > TT (p = 0.000), SVC > CALA (p = 0.333); F1: CALA > TT (p = 0.000), CALA > SVC (p = 0.000).]
All experiments were executed on a 64-bit Intel Core i7 with four 2.93 GHz cores and 10 GB RAM, running Windows 7 Professional.

4.2. Evaluation Results
The main results of the experiment are presented in Tables 1 and 2. For each site in the experiment, we show the number of URLs we used to build the classifier, the number of URLs classified, and the precision, recall, F1-measure, and time for each site considered. After evaluating all sites, we had classified a total of 133 766 pages, and we obtained a mean precision of 98%, a mean recall of 91%, and a mean F1-measure of 92%. In the former tables, cells with a "-" symbol represent experiments that could not be completed (e.g., the Amazon and Etsy experiments with Support Vector Clustering had not finished after a week of computation).

In general, the reason for the lower recall values is that CALA Pattern Miner tends to create smaller, more specific patterns rather than larger patterns that would reduce precision. Smaller patterns mean that sometimes a single class of URLs is separated into more than one pattern, which makes the mean recall smaller.
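The effect of splitting a class across several patterns can be made concrete with the standard precision/recall/F1 definitions and hypothetical counts:

```python
def prf1(tp, fp, fn):
    """Standard precision, recall, and F1 from match counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical class of 100 URLs split across two patterns: a pattern
# that matches only 60 of them scores perfect precision, but the 40
# URLs captured by the other pattern count as misses, so recall drops
# to 0.60 and drags the mean recall down.
print(prf1(tp=60, fp=0, fn=40))
```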
Finally, we performed some statistical tests to show that our technique is indeed different from the other techniques. First, we performed the Kruskal-Wallis H test, which yielded the results in Figure 5(a), i.e., that there are statistically significant differences amongst the variables P, R, and F1 for the three systems, since the p-values for the three variables are lower than the common significance level α = 0.05 (Hollander and Wolfe, 1999). Then, we used Wilcoxon's signed-rank test to compute a ranking amongst these systems (Hollander and Wolfe, 1999); this test yields the ranking in Figure 5(b).

From the analysis of the evaluation data, we can conclude the following:
[Table 1: Evaluation results on the validation sites. For each site (Amazon, DailyMotion, Ehow, Answers, Digg, Indiatimes, DailyMail, Deviantart, Filestube, Huffingtonpost, Sourceforge, Squidoo, Torrentz, Guardian, Archive, Isohunt, Yelp, Metacafe, Etsy, BBC, and Alibaba) and each class, the table reports the number of URLs (U), and the precision (P), recall (R), F1-measure, and time in seconds (T) obtained by CALA, TT, and SVC.]
1. CALA precision values are higher than SVC precision values in general, and they are comparable to TT precision values, although somewhat lower. Therefore, our unsupervised technique is only slightly less precise than a supervised technique.

2. CALA recall values are lower than SVC recall values in general, and they are higher than TT recall values. This is due to SVC being likely to create a single cluster for each site (R = 1.00).

3. CALA F1-measure values are higher than both SVC and TT F1-measure values. Therefore, we achieve a better trade-off of precision versus recall.

4. CALA times are comparable to TT times, although somewhat higher (order of magnitude of ten seconds). Meanwhile, CALA times are significantly lower than SVC times (order of magnitude of hours or even days). Therefore, our unsupervised technique is significantly faster than the other unsupervised technique, while it is only slightly slower than the supervised technique. Note that in this analysis we did not consider the time spent by the user annotating examples to create a supervised training set.
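The two statistical tests used in this section are standard; a minimal sketch with SciPy follows, assuming SciPy is available. The per-site F1 samples below are synthetic, for illustration only; the real measurements are those in Tables 1 and 2:

```python
from scipy.stats import kruskal, wilcoxon

# Synthetic per-site F1 scores for the three classifiers.
cala = [0.97, 1.00, 0.95, 0.99, 0.62, 0.97]
tt = [0.63, 0.39, 0.67, 0.00, 0.50, 0.67]
svc = [0.85, 0.38, 0.42, 0.82, 0.19, 0.51]

# Kruskal-Wallis H test: do the three samples come from the same
# distribution? A p-value below alpha = 0.05 rejects that hypothesis.
h_stat, h_p = kruskal(cala, tt, svc)
print(f"Kruskal-Wallis p = {h_p:.4f}")

# Wilcoxon signed-rank test on the paired CALA vs. TT scores, used to
# rank one system against another.
w_stat, w_p = wilcoxon(cala, tt)
print(f"Wilcoxon p = {w_p:.4f}")
```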
[Table 2: Evaluation results on the validation sites (cont.). Sites include TDG Scholar, Ms Academic, Google Scholar, Arxiv, Livejournal, Xing, Odesk, ArticlesBase, Freelancer, BTJunkie, PlentyOfFish, Slideshare, Netlog, Drupal, Newegg, Overblog, Chip, Battle.net, Fiverr, Fotolia, Indeed, and Gamefaqs. On average, over 1910.90 URLs per site, CALA obtains P = 0.98, R = 0.91, F1 = 0.92 in 9.19 s; TT obtains P = 1.00, R = 0.61, F1 = 0.76 in 1.75 s; and SVC obtains P = 0.67, R = 0.93, F1 = 0.78 in 10421.70 s.]
We sum up our conclusions in Figure 6. Figure 6(a) shows the precision versus recall for each experiment in the three techniques. Note that most experiments in CALA are located in the upper right corner (therefore, precision and recall are both around 1.00). Figure 6(b) presents the combined histogram of execution times of the three techniques. We must remark that CALA, which is unsupervised, achieves execution times that are comparable to those of TT, which is supervised, whereas SVC obtains larger values. Note that the X axis is expressed in logarithmic scale.

[Figure 6: (a) precision-recall comparison of TT, SVC, and CALA; (b) histogram of their execution times in seconds, on a logarithmic scale.]

5. Conclusions and Future Work

In this article, we propose an unsupervised technique to build URL patterns based on probability estimators. We present the results of an experiment that proves that this technique is able to classify URLs, obtaining an average precision of 98% and an average recall of 91%. We compared our technique with two baseline techniques, obtaining that our technique achieves precision and recall values and execution times that are comparable to those of a supervised classifier, and F1 values that are even higher, while relieving the user from the tedious task of annotating training sets. Meanwhile, when compared with another unsupervised technique, CALA Pattern Miner achieves both a better performance and significantly lower execution times.
In contrast to other proposals, ours is unsupervised, which saves the user a significant amount of time labelling large training sets. Also, it does not require crawling the whole site to build the classification model. We use a small subset of hub pages from the site, from which we obtain a set of URLs, and we apply a statistical technique to discern which tokens of the URL are part of the pattern, and which can be replaced with a wildcard. Since we rely on token probabilities, regardless of whether a token has some meaning in a particular dictionary, our proposal is language and domain independent.
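The idea of keeping some tokens literal and replacing the rest with wildcards can be illustrated with a deliberately simplified, frequency-based sketch; CALA's actual statistical estimator is more involved, and the threshold `theta` and the example URLs are assumptions:

```python
from collections import Counter

def mine_pattern(token_lists, theta=0.8):
    """Keep a token literal when it appears in at least `theta` of the
    URLs at that position; otherwise replace it with a wildcard."""
    n = len(token_lists)
    pattern = []
    for i in range(len(token_lists[0])):
        token, count = Counter(t[i] for t in token_lists).most_common(1)[0]
        pattern.append(token if count / n >= theta else "*")
    return pattern

# Tokenised URLs of a hypothetical "books" class: the identifier in the
# last position varies per page, so it becomes a wildcard.
urls = [["www", "example", "com", "books", "1234"],
        ["www", "example", "com", "books", "5678"],
        ["www", "example", "com", "books", "9012"]]
print(mine_pattern(urls))  # -> ['www', 'example', 'com', 'books', '*']
```

The resulting pattern then classifies any URL it matches into the class it was mined from, without recourse to dictionaries, which is what makes the approach language and domain independent.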
To support our classifier, we have developed CALA Visualiser, a tool for graphically displaying URL patterns, which supports the user in the task of identifying the class behind each pattern. A demo of this tool is available at (Hernández, 2011a).

Proposals in the areas of parental control systems, advertising removal, duplicated page detection, web directory creation, Virtual Integration, and Conceptual Modelling can benefit from the use of our classifier. Some details about the application of our classifier to extract models from non-semantic web sites have been accepted for publication in (Hernández et al., 2012).
Acknowledgements

This work has been partially supported by the European Commission (FEDER), and the Spanish and Andalusian R&D&I programmes (grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, and TIN2010-09988-E).
References

Alexa, 2011. Top sites directory, 14th February 2011. http://www.alexa.com/topsites.

Bar-Yossef, Z., Keidar, I., Schonfeld, U., 2009. Do not crawl in the DUST: Different URLs with similar text. TWEB 3 (1), 3.

Baykan, E., Henzinger, M., Marian, L., Weber, I., July 2011. A comprehensive study of features and algorithms for URL-based topic classification. TWEB 5, 15:1-15:29.

Baykan, E., Henzinger, M. R., Marian, L., Weber, I., 2009. Purely URL-based topic classification. In: WWW. pp. 1109-1110.

Ben-Hur, A., Horn, D., Siegelmann, H. T., Vapnik, V., 2001. Support vector clustering. Journal of Machine Learning Research 2, 125-137.

Blanco, L., Crescenzi, V., Merialdo, P., 2008. Structure and semantics of data-intensive web pages: An experimental study on their relationships. J. UCS 14 (11), 1877-1892.

Blanco, L., Dalvi, N., Machanavajjhala, A., 2011. Highly efficient algorithms for structural clustering of large websites. In: WWW. ACM, pp. 437-446.

Brin, S., 1998. Extracting patterns and relations from the World Wide Web. In: WebDB. pp. 172-183.

Crescenzi, V., Mecca, G., Merialdo, P., 2001. RoadRunner: Towards automatic data extraction from large web sites. In: VLDB. pp. 109-118. URL http://www.vldb.org/conf/2001/P109.pdf

Grubbs, F. E., 1969. Procedures for detecting outlying observations in samples. Technometrics 11 (1).

Hacking, I., Jul. 2001. An Introduction to Probability and Inductive Logic. Cambridge University Press.

Hernández, I., 2011a. CALA demo. http://www.tdg-seville.info/inmahernandez/CALA+Demo.

Hernández, I., 2011b. CALA experimental design. http://www.tdg-seville.info/inmahernandez/Experiment.

Hernández, I., Rivero, C., Ruiz, D., Corchuelo, R., 2011. A tool for link-based web page classification. In: Advances in Artificial Intelligence. Vol. 7023 of LNCS. Springer Berlin / Heidelberg, pp. 443-452.

Hernández, I., Rivero, C. R., Ruiz, D., Corchuelo, R., 2012. A statistical approach to URL-based web page clustering. In: Proceedings of the 21st International Conference Companion on World Wide Web. WWW '12 Companion. ACM, pp. 525-526.

Hernández, I., Rivero, C. R., Ruiz, D., Corchuelo, R., 2012. Towards discovering conceptual models behind web sites. In: ER. To be published.

Hirschberg, D. S., 1977. Algorithms for the longest common subsequence problem. J. ACM 24 (4), 664-675.

Hollander, M., Wolfe, D. A., 1999. Nonparametric Statistical Methods. Wiley-Interscience.

Kan, M.-Y., Thi, H. O. N., 2005. Fast webpage classification using URL features. In: CIKM. pp. 325-326.

Kleinberg, J. M., 1999. Authoritative sources in a hyperlinked environment. J. ACM 46 (5), 604-632.

Koppula, H. S., Leela, K. P., Agarwal, A., Chitrapura, K. P., Garg, S., Sasturkar, A., 2010. Learning URL patterns for webpage de-duplication. In: WSDM. ACM, pp. 381-390.

Li, Y., Zhong, N., 2004. Web mining model and its applications for information gathering. Knowl.-Based Syst. 17 (5-6), 207-217.

Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T., 2006. YALE: Rapid prototyping for complex data mining tasks. In: Knowledge Discovery and Data Mining. pp. 935-940.

Qi, X., Davison, B. D., 2009. Web page classification: Features and algorithms. ACM Comput. Surv. 41 (2). URL http://doi.acm.org/10.1145/1459352.1459357

Ristad, E. S., Yianilos, P. N., 1998. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell. 20 (5), 522-532.

Selamat, A., Omatu, S., 2004. Web page feature selection and classification using neural networks. Inf. Sci. 158, 69-88. URL http://dx.doi.org/10.1016/j.ins.2003.03.003

Shih, L. K., Karger, D. R., 2004. Using URLs and table layout for web classification tasks. In: WWW. pp. 193-202.

Trillo, R., Po, L., Ilarri, S., Bergamaschi, S., Mena, E., Apr. 2011. Using semantic techniques to access web data. Inf. Syst. 36 (2), 117-133.

Vidal, M. L. A., da Silva, A. S., de Moura, E. S., Cavalcanti, J. M. B., 2008. Structure-based crawling in the Hidden Web. J. UCS 14 (11), 1857-1876.

Wang, Y., Wang, Y., Kitsuregawa, M., 2001. Link based clustering of web search results. In: Lecture Notes in Computer Science. Springer, pp. 225-236.

Zhang, J., Qin, J., Yan, Q., 2006. The role of URLs in objectionable web content categorization. In: Web Intelligence. pp. 277-283.