CALA: A Non-Supervised URL-Based Web Page Classifier

Inma Hernández*, Carlos R. Rivero, David Ruiz, Rafael Corchuelo

Universidad de Sevilla, ETSI Informática. Avda. de la Reina Mercedes, s/n, Sevilla E-41012, Spain.

* Corresponding author. Email addresses: inmahernandez@us.es (Inma Hernández), carlosrivero@us.es (Carlos R. Rivero), druiz@us.es (David Ruiz), corchu@us.es (Rafael Corchuelo).

Abstract

Unsupervised URL-based web page classification refers to the problem of clustering the URLs in a web site so that each cluster includes a set of pages that can be classified using a unique class. The existing proposals to perform URL-based classification suffer from a number of drawbacks: they are supervised, which requires the user to provide labelled training data and makes them difficult to scale; they are language or domain dependent, since they require the user to provide dictionaries of words; or they require extensive crawling, which is time and resource consuming. In this article, we propose a new statistical technique to mine URL patterns that are able to classify web pages. Our proposal is unsupervised, language and domain independent, and does not require extensive crawling. We have evaluated our proposal on 45 real-world web sites, and the results confirm that it can achieve a mean precision of 98% and a mean recall of 91%, and that its performance is comparable to that of a supervised classification technique, while it does not require labelling large sets of sample pages. Furthermore, we propose a novel application that helps to extract the underlying model from non-semantic-web sites.

Keywords: Web Page Classification, URL Classification, URL Patterns

Preprint submitted to Elsevier, June 28, 2012

1. Introduction

Typical web sites provide pages about a number of topics; in other words, their pages can be classified into a number of semantic classes that describe their contents. For instance, a typical on-line book store may have pages about books, authors, publishers, reviews, offers, and so on.

URL-based web page classification refers to the problem of classifying a web page with respect to its topic by analysing features extracted from its URL, regardless of its content and/or structure. Therefore, this classification does not require the page to be downloaded prior to being classified.

URL-based classification of web pages has many applications, including: endowing Virtual Integration crawlers with the intelligence to determine whether or not a page may contain information relevant to the query (Blanco et al., 2011; Li and Zhong, 2004; Vidal et al., 2008), applying the most appropriate extraction model to each page (Crescenzi et al., 2001), performing URL filtering to avoid certain types of contents (e.g., parental control systems (Zhang et al., 2006) or advertising removal (Shih and Karger, 2004)), or detection and canonisation of duplicated URLs (Bar-Yossef et al., 2009; Koppula et al., 2010), amongst others.

There are other proposals in the area of URL-based web page classification. These proposals present some drawbacks, mainly regarding scalability and applicability. With regard to scalability, some of them are supervised (Baykan et al., 2009; Kan and Thi, 2005; Vidal et al., 2008), others require an extensive crawling of the whole site under analysis to build the classification model (Vidal et al., 2008), while others need to download the page to extract further features from inside it in order to achieve a good performance (Blanco et al., 2011).
As for applicability, the main drawback is that many of these proposals are either domain, language or site-dependent (Baykan et al., 2011, 2009; Kan and Thi, 2005; Shih and Karger, 2004), which means that their techniques cannot be applied to every site on the Web.

In this paper, we propose a statistical technique called CALA Pattern Miner to obtain URL patterns for each web site without supervision. Each of these patterns matches a collection of URLs of the site containing information of a certain class, i.e., they can be used to classify the URLs of a site according to the class of information their pages contain. A two-page poster on our proposal has been published elsewhere (Hernández et al., 2012).

Our proposal does not suffer from the previous drawbacks. First, it is not supervised. Also, it does not require an extensive crawling of the site to build the classification model, but only a small subset of hub pages (e.g., in our experiments, 100 hub pages were enough to achieve a good performance). Hub pages are rich in links, and they are usually obtained in response to a query issued by means of a web form. To fill in the forms, we use keywords that are extracted from the same site automatically, hence no dictionary or user input is needed. Moreover, it is based exclusively on URL features, so pages do not have to be downloaded to be classified. Finally, it is domain, language, and site-independent, since the user does not have to know particular details about the site, the domain, or the language in which the site pages are written in order to make it work properly. Therefore, our technique is both scalable and generally applicable.

We have performed an experimental evaluation of our technique on a representative dataset of web sites, comparing its performance with two baseline techniques, one supervised (Blanco et al., 2008) and one unsupervised (Ben-Hur et al., 2001). We obtained good precision and recall values, comparable to those of the supervised technique, while our F1 measure values are better than those of both baseline techniques. Also, we observed that our execution times are comparable to those of the supervised technique (note that in the latter case we did not consider in the analysis the time and effort spent by the user in annotating the pages of the training set, which should be added to the final time), while they are considerably lower than those of the unsupervised technique.

The rest of this article is organised as follows: Section 2 presents the related work in URL-based web page classification; Section 3 defines our proposal to build URL patterns; Section 4 shows the evaluation of our technique; finally, Section 5 lists some of the conclusions drawn from the research and concludes the article.

2. Related Work

In the analysis of existing proposals in the web page classification area, two main issues must be addressed: firstly, features for classification can be obtained from inside and/or outside the pages to be classified; secondly, these proposals can be either supervised or unsupervised. For a complete survey on topic classification of web pages, see (Qi and Davison, 2009).
Features for web page classification can be either inner, i.e., obtained from inside the pages, like the page content (Selamat and Omatu, 2004), its structure (Vidal et al., 2008), or the disposition of links amongst pages (Wang et al., 2001); or outer, i.e., obtained from outside the page, like the URL of the page (Baykan et al., 2009; Blanco et al., 2011; Kan and Thi, 2005). Note that the latter type of features allows pages to be classified without the need to download them, which is our focus and is especially desirable in Virtual Integration contexts, in which the user is waiting for an answer and response time is an issue (Trillo et al., 2011). Therefore, we focus on URL pattern mining, i.e., finding patterns that allow a web page to be classified building solely on its URL.

Supervised classification techniques require the user to build a training set consisting of labelled examples. In the Web context, this usually means annotating a large number of examples by hand, which requires a significant amount of time. (Baykan et al., 2011, 2009) and (Kan and Thi, 2005) classify URLs by matching them against URL patterns composed of sets of keywords. They need to provide their algorithms with a list of words that is representative of every class, which makes their proposals language and domain dependent. Furthermore, these classifiers need to be provided with a sufficiently large collection of both positive and negative labelled training URLs. (Shih and Karger, 2004) use a tree structure to represent URLs, and obtain URL patterns that are prefixes of all URLs of a certain class. The features on which their technique relies depend heavily on the site being analysed, and it also requires positive labelled training URLs, which makes it difficult to scale.

A solution to this problem might build on unsupervised classifiers, such as traditional clustering techniques. Such techniques rely on a distance function and return clusters that verify that the inter-cluster distance is maximum, whereas the intra-cluster distance is minimum. Since URLs can be naturally represented as strings, the idea would be to use a well-known string distance, e.g., the edit distance (Ristad and Yianilos, 1998) or the longest common subsequence (Hirschberg, 1977). Unfortunately, it has been noticed that edit distance measures applied to URLs do not seem to work well to mine URL patterns (Blanco et al., 2011), i.e., two close URLs may provide information about two different classes, whereas distant URLs may be related to pages with similar classes. For example, there is a minimum distance between the following URLs:

http://academic.research.microsoft.com/Detail?entitytype=2&searchtype=2&id=35096884 and
http://academic.research.microsoft.com/Detail?entitytype=2&searchtype=5&id=35096884,

but they point to pages that are likely to be classified in different classes (Publications and Citations). Contrarily, URLs like

http://academic.research.microsoft.com/Author/1939777/ramesh-jain and
http://academic.research.microsoft.com/Author/10540585/yu-li

are more distant but belong to the same class (Authors). This has motivated a number of authors to work on ad hoc clustering techniques to mine URL patterns.

(Bar-Yossef et al., 2009) and (Koppula et al., 2010) explore the possibility of creating URL patterns to detect the existence of web pages with different URLs that have the same content, which has a negative impact on crawling efficiency.
To solve this problem, some rules are generated to detect URLs that are duplicated, and URL patterns are used to normalise those URLs. In both proposals, the goal is not to classify URLs according to the class of content in their targets, but to locate pages that have exactly the same content. Also, they need a large collection of URLs to obtain good results, which means a previous extensive crawling of the site (or the equivalent manual process) to gather them.

(Vidal et al., 2008) proposed a technique to build a map of a web site, and generate URL patterns for those pages that lead (directly or eventually) to pages that are similar to a given example. They have to previously crawl the entire site, download each page and then process them, which usually takes a significant amount of time and resources.

(Blanco et al., 2011) proposed an algorithm to classify web pages without supervision that combines web page contents and their URLs as features. They require a large training set, so they crawl the entire site in their experiments. Furthermore, to improve the classification efficiency, features from the page itself are included in addition to the link-based features, which means that it must be downloaded previously.

(Brin, 1998) builds text patterns for information extraction, supported by URL patterns that match the pages that contain a fragment of text matching the text pattern. To build the URL patterns, the algorithm first has to visit each page to check if it matches the text pattern. Therefore, it has to previously perform an extensive crawling of the site to detect all pages that match the text pattern.

3. Proposal

Our proposal is a technique to build a set of URL patterns that represent the collection of URLs in a web site, where each pattern represents a different cluster of URLs.

The input to our technique is a set of URLs from pages of the site under analysis. Since our proposal is grounded on statistics, we do not have to deal with the whole site. Instead, we just require a subset of hubs, which are pages that provide summaries and links to other pages (Kleinberg, 1999). Typically, web sites return hubs as a response to user queries. Note that hubs usually contain a larger number of URLs than other types of pages in the web site, given that their goal is to offer the users as many results related to their queries as possible. Therefore, the probability that they contain a sufficiently representative set of URLs is higher than for any other type of page. Hence, we chose hubs to be the source from which we extract sets of URLs.

Pattern building consists of retaining the URL segments that are common to all URLs of the same class, while abstracting the segments that change frequently with a wildcard. To discern those segments that change frequently, we use a statistical technique that assigns each prefix inside a URL an estimator of the probability of that prefix appearing in pages from the web site. Then, this estimator is used to discern whether to keep that prefix as a literal, or to replace it with a wildcard.

In the following subsections, we first present some preliminary definitions, then a running example, and, finally, describe the details of our proposal.

3.1. Preliminaries

We first define the concepts of hub and hubset:

Definition 1 (Hub).
Let q be a query that can be issued on a search form in a web site; we define H(q) as the hub page that is returned as a response to q.

Definition 2 (Hubset). Let Q be a set of queries that can be issued on a search form in a web site; we define H(Q) as the set of hubs that is obtained after issuing all queries in Q.

Once we have obtained a hubset H(Q), we obtain a URL set U by flattening H(Q), i.e., by joining the sets of URLs inside all hubs in H(Q). Then, we must define a tokenisation of those URLs according to a particular format, e.g., the W3C recommendation for URI syntax, RFC 3986.

Definition 3 (Tokens). Let u be a URL; we define T(u) = ⟨t1, . . . , tn⟩ as the sequence of tokens that is obtained after tokenising u.

Definition 4 (Prefixes). Let u be a URL, T(u) = ⟨t1, . . . , tn⟩ a tokenisation of u, and i ≤ n; we define P(u, i) = ⟨t1, . . . , ti⟩ as a prefix of u with a given size i. Note that when i = n, we have P(u, i) = T(u).

Definition 5 (URL Pattern). A pattern is a sequence of tokens, such that some of the tokens are literals, whereas others are wildcards. We denote a pattern p as p = ⟨t1, . . . , tm⟩, m > 0. Note that patterns are sequences of tokens, and therefore the definitions of hubsets HubSet(P(p, i)), prefixes P(p, i) and probability estimators Fp,i can be applied to patterns as well. For the sake of brevity, from now onwards we will refer to URL patterns simply as patterns.

Definition 6 (Hubset of a prefix). Let u be a URL and P(u, i) a prefix of u of size i; we define HubSet(P(u, i)) as the set of hubs in which there is at least one URL u* such that P(u, i) = P(u*, i).

We use a tree notation to represent sets of URLs and their tokens, based on PATRICIA trees, which allows representing large collections of strings efficiently and compactly. Every node in the tree is defined by a label ni and refers to a token tj. Each node actually represents a prefix of a URL, which is the sequence of tokens in the path from the tree root n0 to the node itself. Note that each path from the tree root to a leaf represents a single URL. For the sake of readability, each token in each node is preceded by the character that separates it from the previous token.

Figure 1: URLs in the Microsoft Academic Search example set, represented as a token tree (nodes n0 to n31; each root-to-leaf path is one URL).
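To make Definitions 3 and 4 concrete, the following is a minimal Java sketch of a tokeniser that splits a URL at the RFC 3986 separators and extracts prefixes. It is our own illustration under these definitions, not the implementation evaluated in Section 4; the class and method names are ours, and it assumes absolute http(s) URLs.

    import java.util.ArrayList;
    import java.util.List;

    public final class UrlTokeniser {

        // Splits a URL into tokens at the separators /, ?, #, & and =,
        // keeping the scheme as the first token, e.g.
        // http://academic.research.microsoft.com/Publication/5638047/linked-data
        // -> [http, academic.research.microsoft.com, Publication, 5638047, linked-data]
        public static List<String> tokenise(String url) {
            List<String> tokens = new ArrayList<>();
            String[] schemeAndRest = url.split("://", 2);
            tokens.add(schemeAndRest[0]);
            for (String t : schemeAndRest[1].split("[/?#&=]+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            return tokens;
        }

        // P(u, i): the prefix of size i (Definition 4).
        public static List<String> prefix(List<String> tokens, int i) {
            return tokens.subList(0, i);
        }

        public static void main(String[] args) {
            List<String> t = tokenise(
                "http://academic.research.microsoft.com/Publication/5638047/linked-data");
            System.out.println(t);             // T(u1)
            System.out.println(prefix(t, 3));  // P(u1, 3)
        }
    }

Running this sketch on the URL u1 used below reproduces the tokenisation and the size-3 prefix discussed in the running example.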
3.2. Running example

Microsoft Academic Search is a scholarly web site that offers information about papers, authors, citations and publishing hosts, such as journals or conferences. The site search form includes a text box, in which users can perform keyword-based searches. Therefore, an example of a query that can be issued in this site is an author name, e.g., q = Christian Bizer (note that in our proposal, keywords for queries are automatically extracted from the web site).

The response to that query is a hub page H(q) with a list of publications by authors whose name is similar or equal to Christian Bizer (e.g., publications by Chris Bizer are also included in this hub). For each publication, the hub includes a link to the publication content, links to each of its authors, a link to its citations, and a link to the publishing host. In this case, an example of a set of queries Q is a list of authors, such as Q = {Christian Bizer, Tom Heath, Tim Berners-Lee}. Using every one of these authors as keywords in the search form, we obtain a hubset H(Q).

Hubs in this hubset contain URLs like, for example,

u1 = http://academic.research.microsoft.com/Publication/5638047/linked-data.

Applying a tokeniser based on RFC 3986, we obtain that

T(u1) = ⟨http, academic.research.microsoft.com, Publication, 5638047, linked-data⟩.

For i = 3, the prefix of u1 is

P(u1, 3) = ⟨http, academic.research.microsoft.com, Publication⟩.

Finally, HubSet(P(u1, 3)) is composed of all the hubs in H(Q), since regardless of the query we issue, every hub in Microsoft Academic Search includes links to publications, and they all share the same prefix.

Figures 1 and 2 illustrate this example.

Figure 2: Probability estimators Fu,i for the prefixes in the Microsoft Academic Search example (|Q| = 100); for each node, the last token of its prefix and its Fu,i value, e.g., 0.99 at n9 (/Author), 0.89 at n18 (/Journal), and around 0.01 at the identifier nodes.

3.3. CALA Pattern Miner

The pattern mining process consists of three steps: first, probability estimators are calculated for each token in each URL of the hubset; then, the whole tree is traversed and, for each visited node, those probability estimators are used to discern whether it is significant, abstracting non-significant tokens into a wildcard; finally, each remaining branch of the tree corresponds to a different pattern.

3.3.1. Probability Estimators Calculation

Our technique relies on assigning a probability estimator to each prefix P(u, i) in each URL u obtained from a site. This value estimates the probability of hubs in a hubset H(Q) containing at least one URL that shares the same prefix. Since we are not generally aware of the underlying distribution of prefixes, we calculate the probability estimator as the relative frequency of every prefix in a hubset (Hacking, 2001).

Figure 3: Histogram of Fu,i values, obtained from hubsets of 100 hubs from the 45 evaluation sites (logarithmic scale).
Definition 7 (Probability estimator). Let H(Q) be a hubset, u a URL and P(u, i) a prefix of u of size i; we define the probability estimator of prefix P(u, i) appearing in H(Q) as:

P(P(u, i) | H(Q)) = |HubSet(P(u, i)) ∩ H(Q)| / |H(Q)|

and we denote it as Fu,i.

Fu,i values range from 0.0 to 1.0. The more frequently a prefix appears in hubs from H(Q), the higher the value of its Fu,i. For example, in Figure 2 we show the different probability estimators calculated for the different prefixes in one URL set obtained from Microsoft Academic Search by issuing |Q| = 100 queries. For the sake of simplicity, we show the last token of each prefix.

Note that the prefix P1 in node n9 is very frequent, since it is in almost every hub from H(Q); contrarily, the prefix P2 in node n21 only appears in 1% of the hubs in the hubset, i.e., only by issuing one particular query from the set of 100 queries do we obtain a hub with a URL including that prefix. Another example is the prefix P3 in node n18, with a probability estimator of 0.89, which is significantly high, but not as high as the probability estimator of prefix P1. Note that publications are posted in either journals or conferences, but in both cases they have authors and co-authors, hence prefix P1 is more frequent than prefix P3.

Our conjecture is that, for prefixes whose Fu,i in a hubset H(Q) is not near 1.00, it is most probably around 1/|H(Q)|, i.e., probability values are grouped around the two extremes of the distribution (0.00 and 1.00), with a minimal number of prefixes whose probability is in the middle of the distribution. Figure 3 shows the histogram of Fu,i values obtained from hubsets of 100 hubs from 45 different web sites. We observe that most values are grouped in the interval (0.00, 0.05), except for a small but significant group of values around 1.00, which supports our conjecture. We must highlight that this graph is expressed in terms of a logarithmic scale.

We represent patterns by means of a limited subset of regular expressions that includes only literals and wildcard expressions, which we represent with the symbol ⋆. A wildcard is a regular expression that accounts for characters other than separators. For example, if we use the RFC 3986 tokenisation, then ⋆ represents the following regular expression: [^/?#&=]+.

As an example, the pattern http://academic.research.microsoft.com/Author/⋆/⋆ matches the URLs of pages with information about authors in Microsoft Academic Search; this is the case of the URLs represented as paths in the tree that contain node n9 in Figure 1.

Definition 8 (Sibling prefixes). Let U be a set of URLs obtained from a hubset H(Q), u ∈ U a URL, and P(u, i) a prefix with size i > 1; we define the set Siblings(u, i) as the set of prefixes of URLs u* ∈ U with size i that share a common prefix P(u, i − 1) of size i − 1 with u.

3.3.2. Pattern Mining

Based on both the probability estimators and the concept of sibling prefixes, we define a process to mine patterns. For each prefix P(u, i) in the URL set, we compare all the different probability estimators for both P(u, i) and its sibling prefixes. In those prefixes whose value is not significantly high, the last token is replaced with a wildcard. Contrarily, prefixes with a significantly high value are probably part of a pattern, so they are not abstracted, but kept as literals.
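The estimators that drive this comparison (Definition 7) reduce to simple counting over the hubset. As a minimal illustration (our own sketch, in which a hub is represented as the set of all prefixes of its URLs, an assumption we make for brevity):

    import java.util.List;
    import java.util.Set;

    public final class Estimators {

        // Fu,i = |HubSet(P(u,i)) ∩ H(Q)| / |H(Q)|: the fraction of hubs in
        // the hubset that contain at least one URL sharing the prefix P(u, i).
        // Each hub is represented here as the set of all prefixes of its URLs.
        public static double probabilityEstimator(List<String> prefix,
                                                  List<Set<List<String>>> hubset) {
            long matching = hubset.stream()
                                  .filter(hubPrefixes -> hubPrefixes.contains(prefix))
                                  .count();
            return (double) matching / hubset.size();
        }
    }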
The criterion to discern significantly high probability estimators is formally stated in the following definition.

Definition 9 (Wildcarding criterion). Let u be a URL, i an integer, and α and β parameters. We define the criterion to discern prefixes with a significantly high probability estimator as:

wildcarding(u, i, α) =
  false,  if Fu,i ∈ [1.0 − β, 1.0] ∨ isOutlier(Fu,i, Siblings(u, i), α)
  true,   otherwise

The former criterion is false in two cases: a) if the prefix probability estimator is near 1.0 (where β is a parameter that defines a threshold around 1.0), and b) if the prefix probability estimator is an outlier in the distribution of probability estimators of the prefixes that are its siblings.

The function isOutlier(a, B, α) checks if a certain value a is an outlier in the distribution of values B, with a reliability level of α. An outlying observation, or outlier, is one that appears to deviate markedly from the other members of the sample in which it occurs (Grubbs, 1969). Although any valid function may be used, ours is based on the Cantelli inequality, which states that, for any given statistical distribution, it holds that:

P(X − µ ≥ kσ) ≤ 1 / (1 + k²)

Therefore, for a given reliability level α, it is possible to calculate a threshold value µ + kσ, taking k such that 1/(1 + k²) = α, i.e., k = √(1/α − 1), so that the data above that value represent at most a fraction α of the distribution. Typical values for α are 0.01 or 0.05. Hence, we mark all values above that threshold as outliers.

Once we have defined the wildcarding criterion, we use it to process each node in the tree, using a depth-first traversal. In each visited node, we apply the wildcarding criterion and, in case it is true, we abstract the prefix represented by the node by replacing the node token (i.e., the last token in the prefix) with a wildcard ⋆. Whenever two or more sibling prefixes are wildcarded, their associated nodes are merged. After the whole tree has been traversed and processed, each of the resulting tree branches represents a different pattern.

Figure 4: URL set tree after pattern building.

We present an example of the pattern building process in Figure 4, which shows the configuration of the nodes in the tree after applying the building process, some of which are the result of merging one or more nodes ni into a wildcard node wj; the figure also shows the correspondence between the initial nodes ni and the resulting wildcard nodes wj. For example, wildcard node w1 is the result of merging nodes n3 (417664), n5 (5638047) and n7 (1242380), given that their Fu,i values are {0.01, 0.01, 0.01}, none of which is near 1.0 nor an outlier for usual α values (0.01, 0.05).

After the pattern building process, in the example we obtain the following patterns:

p1 = ⟨http, academic.research.microsoft.com, Publication, ⋆, ⋆⟩,
p2 = ⟨http, academic.research.microsoft.com, Author, ⋆, ⋆⟩,
p3 = ⟨http, academic.research.microsoft.com, Journal, ⋆, ⋆⟩, and
p4 = ⟨http, academic.research.microsoft.com, Detail, entityType, 1, searchType, 5, id, ⋆⟩.
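Putting Definitions 7 to 9 together, the following Java sketch summarises the wildcarding criterion and the Cantelli-based outlier test as we have stated them above. The closed-form threshold k = √(1/α − 1) is our derivation from the inequality; the code is an illustration under that reading, not the authors' implementation.

    import java.util.Collection;

    public final class Wildcarding {

        // Cantelli's inequality: P(X - mu >= k*sigma) <= 1/(1 + k^2).
        // Setting 1/(1 + k^2) = alpha gives k = sqrt(1/alpha - 1); values
        // above mu + k*sigma are flagged as outliers at reliability level alpha.
        static boolean isOutlier(double value, Collection<Double> siblings,
                                 double alpha) {
            double mu = siblings.stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0.0);
            double variance = siblings.stream()
                    .mapToDouble(v -> (v - mu) * (v - mu)).average().orElse(0.0);
            double k = Math.sqrt(1.0 / alpha - 1.0);
            return value > mu + k * Math.sqrt(variance);
        }

        // wildcarding(u, i, alpha): false (keep the last token as a literal)
        // when the estimator is near 1.0 or is an outlier amongst its sibling
        // prefixes; true (replace it with a wildcard) otherwise.
        static boolean wildcarding(double fui, Collection<Double> siblingEstimators,
                                   double alpha, double beta) {
            return !(fui >= 1.0 - beta
                     || isOutlier(fui, siblingEstimators, alpha));
        }
    }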
4. Evaluation

In this section, we report on our experimental evaluation. In the following subsection, we describe the experimental design; then we report on our results.

4.1. Evaluation Design

Instead of proving that our technique works well on every site on the Web, which might be infeasible, we focus on the most visited web sites on the Internet. We selected the 41 most visited web sites according to the Alexa web directory (Alexa, 2011), plus four academic sites. For each site, we built URL patterns with CALA Pattern Miner (from now on, CALA), and we evaluated its performance. We also compared our technique with two baseline classification techniques: TemplateTokens (from now on, TT) and Support Vector Clustering (from now on, SVC).

TT is a supervised structural web page classifier, which detects the common template in a set of training pages (Blanco et al., 2008). In this context, the template of a class is the collection of HTML tags that are common to all training pages of that class. To classify validation pages, they are compared to the template, and a similarity measure is calculated. Pages with a similarity above a certain threshold are considered to belong to the class.

SVC, on the other hand, is an unsupervised clustering technique based on Support Vector Machines (Ben-Hur et al., 2001). We chose this clustering technique since it does not require knowing the optimal number of clusters in advance (as do other algorithms such as K-Means), while it achieves reasonable execution times (it is faster than other techniques such as Expectation-Maximisation). Note that SVC is especially suitable for text clustering, since it is able to handle high-dimensional feature spaces.

The performance of the three techniques was measured using the most usual evaluation measures for classification tasks, i.e., precision (P), recall (R) and F1-measure (F1). Execution times (T) for each technique were also measured for comparison. Finally, we also measured the number of URLs (U) in each class of each site.

To calculate the former measures, a definition is needed for each class under evaluation. Hence, for each web site, we observed the different classes of data they contain, each of which determines a separate class of URLs. For example, in Microsoft Academic Search, we identified pages containing three classes of information: Authors, Hosts and Papers. For each class, an XPath expression was handcrafted to point to links inside hubs whose targets were pages of that class. A similar procedure is reported in (Blanco et al., 2011), the only difference being that they create regular expressions instead of XPath expressions.

For the three techniques, we first retrieve a hubset of 100 hubs from each evaluation site. To retrieve the hubsets, we used a lightweight crawler (CALA Lightweight Crawling), which requires the sites under analysis to have at least one search form with a text field and to be coded in HTML. The crawler locates a search form in each site, fills in the form using keyword-based queries, and retrieves the resulting hub pages for each query. From each hub, all URLs were extracted and tokenised.
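The overall retrieval loop, including the incremental keyword extraction described in the next paragraph, can be sketched as follows. This is our own illustration: Page, Site and the four helper methods are hypothetical stand-ins for HTTP access, form filling, word counting, and the Cantelli-based empty-hub test.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public abstract class LightweightCrawler {

        public interface Site {}
        public interface Page {}

        protected abstract Page fetchHomePage(Site site);
        protected abstract Page submitQuery(Site site, String keyword);
        protected abstract List<String> leastFrequentWords(Page page, int n);
        protected abstract boolean looksEmpty(Page hub);

        // Seed the keyword queue with the N least frequent words of the home
        // page; then, until M hubs are gathered, submit a keyword, keep the
        // hub unless it is an empty-results page, and harvest new keywords
        // from it.
        public List<Page> retrieveHubset(Site site, int n, int m) {
            List<Page> hubset = new ArrayList<>();
            Deque<String> keywords =
                new ArrayDeque<>(leastFrequentWords(fetchHomePage(site), n));
            while (hubset.size() < m && !keywords.isEmpty()) {
                Page hub = submitQuery(site, keywords.poll());
                if (!looksEmpty(hub)) {
                    hubset.add(hub);
                    keywords.addAll(leastFrequentWords(hub, n));
                }
            }
            return hubset;
        }
    }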
The keywords to fill in the forms were extracted from each web site by the lightweight crawler, by means of an incremental process. This process starts by analysing the home page of the site under analysis, and queueing the N least frequent words in the page text (e.g., in our experiments we selected N = 10). Each word in the queue is used to fill in the form and retrieve a hub page, which is likewise analysed to extract and queue new keywords. Moreover, the hub itself is also stored to be used to build the classifier. Note that not all keywords yield results; in fact, some of the hubs thus retrieved are likely to be empty (i.e., a web page with an informative message that states that no results were found). Since all those hubs have approximately the same content, they are not useful for building the classifier, nor for extracting keywords, so these pages are discarded, and keywords are not extracted from them. The criterion to discard pages is based on the Cantelli inequality, to detect the pages with the smallest size, which are most likely empty hubs. This process is repeated until a maximum number of hubs M is obtained (e.g., in our experiments, M = 100). All the retrieved hubsets are available for downloading at (Hernández, 2011b), along with the XPath expressions that define each URL class we evaluated, and the keyword lists we used.

To evaluate CALA Pattern Miner, half of the tokenised URLs were used to build a set of URL patterns using our Java implementation of the algorithms, and these patterns were used to classify the other half of the URLs. More details on our implementation are available at (Hernández et al., 2011); the main difference between that paper and this one is that the former lacks an evaluation section.

To evaluate the TT technique, we implemented its algorithms in Java. For each evaluation site and each class, we randomly selected 4 pages of the class to train the classifier, and we randomly selected 30 pages, from different classes, to validate the classifier.

Finally, to evaluate SVC, we used the implementation that can be found in the RapidMiner data mining tool (Mierswa et al., 2006), using the URL tokens as features.

All experiments were executed on a 64-bit Intel Core i7 computer with four 2.93 GHz cores and 10 GB RAM, running Windows 7 Professional.

4.2. Evaluation Results

The main results of the experiment are presented in Tables 1 and 2. For each site in the experiment, we show the number of URLs we used to build the classifier, the number of URLs classified, and the precision, recall, F1-measure and time for each site considered. After evaluating all sites, we classified a total of 133,766 pages, and we obtained a mean precision of 98%, a mean recall of 91%, and a mean F1-measure of 92%. In the former tables, cells with a "-" symbol represent experiments that could not be completed (e.g., the Amazon and Etsy experiments with Support Vector Clustering had not finished after a week of computation).

In general, the reason for the lower recall values is that CALA Pattern Miner tends to create smaller, more specific patterns rather than larger patterns that would reduce precision. Smaller patterns mean that sometimes a single class of URLs is separated into more than one pattern, which makes the mean recall smaller.
Finally, we performed some statistical tests to accurately show that our technique is indeed different from the other techniques. First, we performed the Kruskal-Wallis H test, which yielded the results in Figure 5(a), i.e., that there are statistically significant differences amongst the systems for the variables P, R and F1, since the p-values for the three variables are lower than the common α = 0.05 (Hollander and Wolfe, 1999). Then we used Wilcoxon's Signed Rank test to compute a rank amongst these systems (Hollander and Wolfe, 1999). This test yields the rank in Figure 5(b):

  Variable   Rank          p-value
  P          TT > CALA     0.018
  P          CALA > SVC    0.000
  R          CALA > TT     0.000
  R          SVC > CALA    0.333
  F1         CALA > TT     0.000
  F1         CALA > SVC    0.000

Figure 5: Kruskal-Wallis H test and Wilcoxon Signed Rank results.

Table 1: Evaluation results on validation sites (for each site and class: number of URLs U, and P, R, F1 and execution time T in seconds for CALA, TT and SVC).

From the analysis of the evaluation data, we can conclude the following:

1. CALA precision values are higher than SVC precision values in general, and they are comparable to TT precision values, although somewhat lower.
Therefore, our unsupervised technique is only slightly less precise than a supervised technique.

2. CALA recall values are lower than SVC recall values in general, and they are higher than TT recall values. This is due to SVC being likely to create a single cluster for each site (R = 1.00).

3. CALA F1-measure values are higher than both SVC and TT F1-measure values. Therefore, we achieve a better trade-off of precision versus recall.

4. CALA times are comparable to TT times, although somewhat higher (order of magnitude of ten seconds). Meanwhile, CALA times are significantly lower than SVC times (order of magnitude of hours or even days). Therefore, our unsupervised technique is significantly faster than the other unsupervised technique, while it is only slightly slower than the supervised technique. Note that in this analysis we did not consider the time spent by the user annotating examples to create a supervised training set.

Table 2: Evaluation results on validation sites (cont.).
We sum up our conclusions in Figure 6. Figure 6(a) shows the precision versus recall for each experiment with the three techniques. Note that most experiments with CALA are located in the upper right corner (therefore, precision and recall are both around 1.00). Figure 6(b) presents the combined histogram of the execution times of the three techniques. We must remark that CALA, which is unsupervised, achieves execution times that are comparable to those of TT, which is supervised, whereas SVC obtains larger values. Note that the X axis is expressed in logarithmic scale.

Figure 6: Precision versus recall of the three techniques, and histogram of their execution times.

5. Conclusions and Future Work

In this article, we propose an unsupervised technique to build URL patterns based on probability estimators. We present the results of an experiment that proves that this technique is able to classify URLs, obtaining an average precision of 98% and an average recall of 91%. We compared our technique with two baseline techniques, obtaining that our technique achieves precision and recall values and execution times that are comparable to those of a supervised classifier, and F1 values that are even higher, while relieving the user from the tedious task of annotating training sets. Meanwhile, when compared with another unsupervised technique, CALA Pattern Miner achieves both a better performance and significantly lower execution times.

In contrast to other proposals, ours is not supervised, which saves the user a significant amount of time labelling large training sets. Also, it does not require crawling the whole site to build the classification model. We use a small subset of hub pages from the site, from which we obtain a set of URLs, and we apply a statistical technique to discern which tokens of the URL are part of the pattern, and which can be replaced with a wildcard. Since we rely on token probabilities, regardless of the token having some meaning in a particular dictionary, our proposal is language and domain independent.

To support our classifier, we have developed CALA Visualiser, a tool for graphically displaying URL patterns, which supports the user in the task of identifying the class behind each pattern. A demo of this tool is available at (Hernández, 2011a).

Proposals in the areas of parental control systems, advertising removal, duplicated page detection, web directory creation, Virtual Integration and Conceptual Modelling can benefit from the use of our classifier. Some details about the application of our classifier to extract models from non-semantic web sites have been accepted for publication in (Hernández et al., 2012).

Acknowledgements

This work has been partially supported by the European Commission (FEDER) and the Spanish and Andalusian R&D&I programmes (grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, and TIN2010-09988-E).

References
Alexa, 2011. Top sites directory, 14th February 2011. http://www.alexa.com/topsites.

Bar-Yossef, Z., Keidar, I., Schonfeld, U., 2009. Do not crawl in the DUST: Different URLs with similar text. TWEB 3 (1), 3.

Baykan, E., Henzinger, M., Marian, L., Weber, I., July 2011. A comprehensive study of features and algorithms for URL-based topic classification. TWEB 5, 15:1–15:29.

Baykan, E., Henzinger, M. R., Marian, L., Weber, I., 2009. Purely URL-based topic classification. In: WWW. pp. 1109–1110.

Ben-Hur, A., Horn, D., Siegelmann, H. T., Vapnik, V., 2001. Support vector clustering. Journal of Machine Learning Research 2, 125–137.

Blanco, L., Crescenzi, V., Merialdo, P., 2008. Structure and semantics of data-intensive web pages: An experimental study on their relationships. J. UCS 14 (11), 1877–1892.

Blanco, L., Dalvi, N., Machanavajjhala, A., 2011. Highly efficient algorithms for structural clustering of large websites. In: WWW. ACM, pp. 437–446.

Brin, S., 1998. Extracting patterns and relations from the World Wide Web. In: WebDB. pp. 172–183.

Crescenzi, V., Mecca, G., Merialdo, P., 2001. RoadRunner: Towards automatic data extraction from large web sites. In: VLDB. pp. 109–118. URL http://www.vldb.org/conf/2001/P109.pdf

Grubbs, F. E., 1969. Procedures for detecting outlying observations in samples. Technometrics 11 (1).

Hacking, I., Jul. 2001. An Introduction to Probability and Inductive Logic. Cambridge University Press.

Hernández, I., 2011a. CALA demo. http://www.tdg-seville.info/inmahernandez/CALA+Demo.

Hernández, I., 2011b. CALA experimental design. http://www.tdg-seville.info/inmahernandez/Experiment.

Hernández, I., Rivero, C., Ruiz, D., Corchuelo, R., 2011. A tool for link-based web page classification. In: Advances in Artificial Intelligence. Vol. 7023 of LNCS. Springer Berlin/Heidelberg, pp. 443–452.

Hernández, I., Rivero, C. R., Ruiz, D., Corchuelo, R., 2012. A statistical approach to URL-based web page clustering. In: Proceedings of the 21st International Conference Companion on World Wide Web. WWW '12 Companion. ACM, pp. 525–526.

Hernández, I., Rivero, C. R., Ruiz, D., Corchuelo, R., 2012. Towards discovering conceptual models behind web sites. In: ER. To be published.

Hirschberg, D. S., 1977. Algorithms for the longest common subsequence problem. J. ACM 24 (4), 664–675.

Hollander, M., Wolfe, D. A., 1999. Nonparametric Statistical Methods. Wiley-Interscience.

Kan, M.-Y., Thi, H. O. N., 2005. Fast webpage classification using URL features. In: CIKM. pp. 325–326.

Kleinberg, J. M., 1999. Authoritative sources in a hyperlinked environment. J. ACM 46 (5), 604–632.

Koppula, H. S., Leela, K. P., Agarwal, A., Chitrapura, K. P., Garg, S., Sasturkar, A., 2010. Learning URL patterns for webpage de-duplication. In: WSDM. ACM, pp. 381–390.

Li, Y., Zhong, N., 2004. Web mining model and its applications for information gathering. Knowl.-Based Syst. 17 (5-6), 207–217.

Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T., 2006. YALE: Rapid prototyping for complex data mining tasks. In: Knowledge Discovery and Data Mining. pp. 935–940.

Qi, X., Davison, B. D., 2009. Web page classification: Features and algorithms. ACM Comput. Surv. 41 (2). URL http://doi.acm.org/10.1145/1459352.1459357

Ristad, E. S., Yianilos, P. N., 1998. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell. 20 (5), 522–532.
Selamat, A., Omatu, S., 2004. Web page feature selection and classification using neural networks. Inf. Sci. 158, 69–88. URL http://dx.doi.org/10.1016/j.ins.2003.03.003

Shih, L. K., Karger, D. R., 2004. Using URLs and table layout for web classification tasks. In: WWW. pp. 193–202.

Trillo, R., Po, L., Ilarri, S., Bergamaschi, S., Mena, E., Apr. 2011. Using semantic techniques to access web data. Inf. Syst. 36 (2), 117–133.

Vidal, M. L. A., da Silva, A. S., de Moura, E. S., Cavalcanti, J. M. B., 2008. Structure-based crawling in the Hidden Web. J. UCS 14 (11), 1857–1876.

Wang, Y., Wang, Y., Kitsuregawa, M., 2001. Link based clustering of web search results. In: Lecture Notes in Computer Science. Springer, pp. 225–236.

Zhang, J., Qin, J., Yan, Q., 2006. The role of URLs in objectionable web content categorization. In: Web Intelligence. pp. 277–283.