
Web Content Extraction KDD


Web Content Extraction – a Meta-Analysis of its Past and Thoughts on its Future

Tim Weninger
University of Notre Dame
Notre Dame, Indiana, USA
tweninge@nd.edu

Rodrigo Palacios
California State University
Fresno, California, USA
rodpl91@gmail.com

Valter Crescenzi
Università Roma Tre
Dipartimento di Ingegneria, Rome, Italy
crescenz@dia.uniroma3.it

Thomas Gottron
Institute for Web Science and Technologies
University of Koblenz-Landau, Germany
gottron@uni-koblenz.de

Paolo Merialdo
Università Roma Tre
Dipartimento di Ingegneria, Rome, Italy
merialdo@dia.uniroma3.it

ABSTRACT
In this paper, we present a meta-analysis of several Web content extraction algorithms, and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors do not consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: heuristic content extractor performance also tends to degrade over time due to the evolution of Web site forms and practices. We conclude with recommendations for future work that address these and other findings.

1. INTRODUCTION
The field of content extraction, within the larger purview of data mining and information retrieval, is primarily concerned with the identification of the main text of a document, such as a Web page or Web site. The principal argument is that tools that make use of Web page data, e.g., search engines, mobile devices, various analytical tools, demonstrate poor performance due to noise introduced by text not related to the main content [11; 23].

In response the field of content extraction has developed methods that extract the main content from a given Web page or set of Web pages, i.e., a Web site [20; 29]. Frequently, these content extraction methods are based on pattern mining and the construction of well-crafted rules. In other cases, content extractors learn the general skeleton of a Web page by examining multiple Web pages in a Web site [18; 1; 6; 7]. These two classes of content extractors are referred to as heuristic and wrapper induction respectively, and each class of algorithms has its own merits and disadvantages. Generally speaking, wrapper induction methods are more accurate than heuristic approaches, but require some amount of training data in order to initially induce an appropriate wrapper. Conversely, heuristic approaches are able to function without an induction step, but are generally less accurate.

The main criticism of content extraction via wrapper induction is that the learned rules are often brittle and are unable to cope with even minor changes to a Web page's template [12]. When a Web site modifies its template, as Web sites often do, the learned wrappers need to be refreshed by re-computing the expensive induction step. Certain improvements in wrapper induction attempt to induce extraction rules that are more robust to minor changes [9; 8; 25], but the more robust rules only delay the inevitable [5].

Heuristic approaches are often criticised for their lack of generality. That is, heuristics that may work on a certain type of Web site, say a news agency, are often ill suited for business Web sites or message boards, etc. Most approaches also ignore the vast majority of the Web pages that dynamically download or incorporate content via external reference calls during the rendering process, e.g., CSS, JavaScript, images.

The goal of this paper is not to survey the whole of content extraction, so we resist the temptation to verbosely compare and contrast the numerous published methods. Rather, in this paper we make a frank assessment of the state of the field, provide an analysis of content extraction effectiveness over time, and make recommendations for the future of content extraction.

In this paper we make three main contributions:

1. We define the vectors of change in the function and presentation of content on the Web,

2. We examine the state of content extraction with respect to the ever changing Web, and

3. We perform a temporal evaluation on various content extractors.

Finally, we call for a change in the direction of content extraction research and development.

The evolution of Web practices is central to the theme of this paper. A scientific discipline ought to strive to have some invariance in the results over time. Of course, as technology changes, our study of it must also change. With this in mind, one way to determine the success of a model is to measure its stability or durability as the input changes over time.

To that end, we present the results of a case study that compares content extraction algorithms, both old and new, on an evolving
dataset. The goal is to identify which measures, if any, are invariant to the evolution of Web practices.

To that end, we collected a dataset of 1000 Web pages from 10 different domains, listed in Table 1, where each domain has a set of pages from years 2000, 2005, 2010, and 2015. There are 25 HTML documents per lustrum (i.e., 5-year period), for a total of 100 documents per Web site. The documents were automatically and manually gathered from two types of sources: archives¹ and the original websites themselves for the 2015 lustrum.

Web site
news.bbc.co.uk
cnn.com
news.yahoo.com
thenation.com
latimes.com
entertainment.msn.com
foxnews.com
forbes.com
nymag.com
esquire.com

Table 1: Dataset used in case study. 25 Web pages crawled from each Web site per lustrum (5-year period), over 4 lustra and 10 Web sites totals 1,000 Web pages.

We review the evolution that has occurred in Web content delivery and extraction, referring explicitly to recent changes that undermine the effectiveness of existing content extractors. To show this explicitly we perform a large case study wherein we compare the performance over time of several content extraction algorithms. Based on our findings we call for a change in content extraction research and make recommendations for future work.

2. EVOLVING WEB PRACTICES
We begin with the observation that content delivery on the Web has changed dramatically since it was first conceived. The case for content extraction is centered around the philosophy that HTML is a markup language that describes how a Web page ought to look, rather than what a Web page contains. Here, the classic form versus function debate is manifest. Yet, in recent years the Web has seen a simultaneous marriage and divorce of form and function with the massive adoption of scripting languages like JavaScript and with the finalization of HTML5.

In this section we argue that because Web technologies have changed, the way we perform and evaluate content extraction must also change.

2.1 Evolution of Form and Function

JavaScript. Nearly all content extraction algorithms operate by downloading the HTML of the Web page(s) under consideration, and only the HTML. In many cases, Web pages refer directly or indirectly to dozens of client side scripts, i.e., JavaScript files, that may be executed at load-time. Most of the time content extractors do not even bother to download referenced scripts even though JavaScript functions can (and frequently do) completely modify the DOM and content of the downloaded HTML. Indeed, most of the spam and advertisements that content extraction technologies explicitly claim to catch are loaded via JavaScript and are therefore not part of most content extraction testbeds.

CSS. Style sheets pose a problem similar in nature to JavaScript in that structural changes to the displayed content on a Web site are frequently performed by instructions embedded in cascading style sheets. Although CSS instructions are not as expressive as JavaScript functions (they were built for different purposes), the omission of a style sheet often severely affects the rendering of a Web page.

Furthermore, many of the content extractors described earlier rely on formatting hints that live within HTML in order to perform effective extraction. Unfortunately, the ubiquitous use of CSS removes many of the HTML hints that extractors depend upon. Using CSS, it is certainly possible that a complex Web site is made entirely of div-tags.

HTML5. The new markup standards introduced by HTML5 include many new tags, including main, article, header, etc., meant to specify the semantic meaning of content. Widespread adoption of HTML5 is in progress, so it is unclear whether and how the new markup languages will be used or what the negative side effects will be, if any.

The semantic tags in HTML5 are actually a severe departure from the original intent of HTML. That is, HTML4 was originally only meant to be a markup for the visual structure of the Web page, not a description language. Indeed the general lack of semantic tags is one of the main reasons why content extraction algorithms were created in the first place.

Further addition of semantics into HTML markup is provided by the schema.org project. Schema.org is a collaboration among the major Web search providers to provide a unified description language that can be embedded into HTML4/5 tag attributes. Web site developers can use these tags to encode what certain HTML data represents, for example, a Person-itemtype, which may have a name-itemprop, which can then be used by search engines and other Web-services to build intelligent analytics tools. Other efforts to encode semantic meaning in HTML can be found in the Microformats.org project, the Resource Description Framework in Attributes (RDFa) extension to HTML5, and others.

         itemscope  itemtype  itemprop  section  article
Mean     162.2      157.8     899.0     261.0    403.4
Median   65.5       54.5      374.5     25       166.5

Table 3: Mean and Median number of occurrences of semantic tags from schema.org: itemscope, itemtype and itemprop tags, and from HTML5: article and section found in 2015-subset of the dataset. Semantic tags are only found in dataset from 2015.

Table 3 shows the mean and median number of Schema.org and HTML5 semantic tags in our 2015 dataset. We find that 9 out of 10 Web sites we crawled had adopted the Schema.org tagging system, and that 9 out of 10 Web sites had adopted the section and article tags from HTML5 (8/10 adopted both Schema.org and HTML5).

The advent and widespread adoption of HTML5 and Schema.org decreases the need for many extraction tools because the content or data is explicitly marked and described in HTML.

AJAX. Often, modern Web pages are delivered to the client without the content at all. Instead, the content is delivered in a separate JSON or XML message via AJAX. These are not rare cases: as of April 2015, Web Technologies research finds that AJAX is used within 67% of all Web sites². Thus, it is conceivable that the vast majority of content extractors overestimate their effectiveness in 67% of the cases, because a large portion of the final, visually-rendered Web page is not actually present in the HTML file. An observation which supports this hypothesis is that, in our experiments, the most frequent last word found by many content extractors on NY Times articles is "loading..."

Table 2 shows the mean-average number of occurrences of certain HTML tags and attributes that represent ancillary source files in our dataset of 1,000 Web pages. In this table, src refers to the occurrence of the common tag attribute which can refer to a wide range of file types. link refers to the occurrence of the <link> HTML tag which frequently (although not necessarily) references external CSS files. iframe refers to the occurrence of the HTML tag which is used to embed another HTML document into the current HTML document. script refers to the occurrence of the HTML tag which is used to denote a client-side script such as (but not necessarily) JavaScript. js refers to the occurrences of externally referenced JavaScript files; css similarly refers to the occurrences of externally referenced CSS files. The jquery column shows the fraction of Web pages that employ AJAX via the jQuery library; alternative AJAX libraries were found but their occurrence rates were very small.

Year   src      link     iframe   script   js       jquery   css      size
2000   37.092   1.152    0.388    7.600    6.588    0.908    1.828    39,121.90
2005   61.812   2.528    0.408    21.200   14.280   0.944    2.612    52,633.82
2010   57.976   10.104   1.408    44.044   24.096   1.000    10.468   81,033.89
2015   49.396   18.256   10.032   40.652   37.052   0.620    11.692   174,801.64

Table 2: The mean-average occurrences of certain HTML tags and attributes that represent ancillary source files in our dataset of 1,000 news Web pages over 4 equal sized lustra (5-year periods). The use of external content and client-side scripting has been growing quickly and steadily.

In many ways the above observations show that the Web is trending towards a further decoupling of form from content: JavaScript decouples the rendered DOM from the downloaded HTML, CSS similarly separates the final presentation from the downloaded HTML, and AJAX allows for the HTML and extractable content to be separate files entirely. Yet, despite these trends, most content extraction methodologies rely on extractions from statically downloaded HTML files.

An example of why this should be considered a bad practice is highlighted in Figure 1 where the Web page http://kdd.org/kdd2015 is shown rendered in a browser (at left), rendered without JavaScript (center), and rendered with only the static HTML document (at right). The information conveyed to the end user is presented in its complete form in the rendered version; thus, content extractors should strive to operate within the fully rendered document (at left), instead of the HTML-only extraction as is the current practice (at right).

Figure 1: The Web page of http://www.kdd.org/kdd2015/ fully rendered in a modern Web browser (Left). Web page with JavaScript disabled (Middle). Downloaded Web page HTML, statically rendered without any external content (Right). Most extractors operate on the Web page on the right.

2.2 Keeping Pace with the Changing Web
Web presentation has evolved in remarkable ways in a very short time period. Content extraction algorithms have attempted to keep pace with evolving Web practices, but many content extraction algorithms quickly become obsolete.

Counter-intuitively, it seems that although the number of Web sites has increased, the variety of presentation styles has actually decreased. For a variety of reasons, most Web pages within the same Web site look strikingly similar. Marketing and brand-management often dictate that a Web site maintains a style distinct from competitors, but similar to other pages in the same Web site.

Wrapper Induction. The self-similarity of pages in a Web site stems from the fact that the vast majority of Web sites use scripts to generate Web page content retrieved from backend databases. Because of the structural similarity of Web pages within the same Web site, it is possible to reverse engineer the page generation process to find and remove the Web site's skeleton, leaving only the content remaining [18; 1; 6; 7].

A wrapper is induced on one Web site at a time and typically needs only a handful of labelled examples. Once trained the learned wrapper can extract information at near-perfect levels of accuracy. Unfortunately, the wrapper induction techniques assume that the Web site template does not change. Even the smallest of tweaks to a Web site's template or the database schema breaks the induced wrapper and requires retraining. Attempts to learn robust wrappers, which are immune to minor changes in the Web page template, have been shown to be somewhat successful, but even the most robust wrapper rules eventually break [8; 12].

¹ WayBack Machine - http://archive.org/web/
² http://w3techs.com/technologies/overview/javascript_library/all. Accessed May 6, 2015.
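The idea of removing a shared skeleton, and the way a template change breaks it, can be illustrated with a deliberately simple sketch. This is not RoadRunner or any published wrapper induction algorithm; it only assumes that a line of markup appearing on every training page is template and that everything else is content. The function names `induce_template` and `apply_wrapper` and the three sample pages are hypothetical.

```python
def induce_template(pages):
    """Return the set of lines shared by every sample page (the 'skeleton')."""
    line_sets = [set(page.splitlines()) for page in pages]
    return set.intersection(*line_sets)


def apply_wrapper(template, page):
    """Keep only the non-empty lines that are not part of the induced template."""
    return [ln for ln in page.splitlines() if ln.strip() and ln not in template]


# Two hypothetical pages generated from the same template.
page_a = "<div id='nav'>Home | News</div>\n<p>Story about rain.</p>\n<div id='foot'>(c) Site</div>"
page_b = "<div id='nav'>Home | News</div>\n<p>Story about sun.</p>\n<div id='foot'>(c) Site</div>"
template = induce_template([page_a, page_b])

# A later redesign changes the navigation markup, so the old template no longer covers it.
page_c = "<nav>Home | News</nav>\n<p>Story about snow.</p>\n<div id='foot'>(c) Site</div>"

print(apply_wrapper(template, page_a))
print(apply_wrapper(template, page_c))
```

On the training page only the story paragraph survives, while on the redesigned page the new navigation line leaks into the output, which is a small-scale version of the wrapper breakage described above.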
Heuristics and Feature Engineering. Rather than learning rigid rules for content extraction, other works have focused on identifying certain heuristics as a signal for content extraction. The variety of the different heuristics is impressive, and the statistical models learned through a combination of various features may, in many cases, perform comparably to extractors based on wrapper induction.

Each methodology and algorithm was invented at a different time in the evolution of the Web and looked at different aspects of the Web content. From the myriad of options we selected 11 algorithms from different time periods. They are listed in Table 4.

Algorithm                                         Year
Body Text Extractor (BTE) [11]                    2001
Largest Size Increase (LSI) [16]                  2001
Document Slope Curve (DSC) [27]                   2002
Link Quota Filter (LQF) [21]                      2005
K-Feature Extractor (KFE) [10]                    2005
Advanced DSC (ADSC) [13]                          2007
Content Code Blurring (CCB) [14]                  2008
RoadRunner∗ (RR) [7]                              2008
Content Extraction via Tag Ratios (CETR) [28]     2010
BoilerPipe (BP) [17]                              2010
Eatiht [24]                                       2015

Table 4: Content extraction algorithms, with their citation and publication date. ∗ RoadRunner is a wrapper induction algorithm; all others are heuristic methods.

Each algorithm, heuristic, model or methodology is predicated on the form and function of the Web at the time of its development. Each was evaluated similarly on the state of the Web that existed at the time, presumably, just before publication. Furthermore, none of the algorithms considers JavaScript, CSS, or AJAX changes to the Web page; therefore the majority of the Web page may not actually be present for extraction, as is the case in Figure 1.

3. CASE STUDY
We present the results of a case study that compares content extraction algorithms, both old and new, on an evolving dataset. The goal is to test the performance variability of content extractors over time as Web sites evolve. So, for each Web page of each lustrum of each Web site, a gold-standard dataset was created manually by the second author. Each Web content extractor attempted to extract the main content from the Web page.

For the first seven content extractors in Table 4, we used the implementation from the CombineE System [13]. The Eatiht, BoilerPipe and CETR implementations are all available online. BoilerPipe provides a standard implementation as well as an article extractor (AE), a sentence extractor (Sen), an extractor trained on data from the KrdWrd-Canola corpus³, and two "number of words" extractors: a decision tree induced extractor (W) and a decision tree induced extractor manually tuned to at least 15 words per content area (15W). CETR has a default algorithm as well as a threshold option based on the standard deviation of the tag ratios (Th), and a 1 dimension clustering option (1D). See the respective papers for details.

An attempt was made to induce wrappers using the RoadRunner wrapper induction system [7], which was successful on each set of 25 Web pages, but performed very poorly on the subsequent lustrum. Wrapper-breakage is a well known problem for wrapper induction techniques [8; 12]. A five-year window is too long for any wrapper to continue to be effective. Thus RoadRunner had to be trained and evaluated slightly differently. In this case we manually identified Web pages that have very similar HTML structure and learned a wrapper on those few pages. In most cases 90-95% of the Web pages in a single domain could be used to generate a wrapper, but in 2 Web sites only about half of the Web pages were found to have the same style and were useful for training. We used the induced wrapper to extract the content from the Web pages on which it was trained.

We emphasize that our methodology follows that of most content extraction methodologies. Namely, we download the raw HTML of the Web page and perform content extraction on only that static HTML. We further emphasize that this ignores a very large portion of the overall rendered Web page: renderings are increasingly reliant on external sources for content and form via AJAX, stylesheets, iframes, etc. The disadvantages of this methodology are clear, but we are beholden to them because the existing extractors require only static HTML.

3.0.1 Evaluation
We employ standard content extraction metrics to compare the performance of different methods. Precision, recall and F1-scores are calculated by comparing the output of each method to a hand-labeled gold standard. The F1-scores are computed as usual and all results are calculated by averaging each of the metrics over all examples.

The main criticism of these metrics is that they are likely to be inflated. This is because every word in a document is considered to be distinct even if two words are lexically the same. This makes it impossible to align words with the original page and therefore forces us to treat the hand labeled content and automatically extracted content as a bag of words, i.e., where two words are considered the same if they are lexically the same. The bag of words measurement is more lenient and as a result scores may be inflated.

The CleanEval competition has a hand-labeled gold standard as well, from a shared list of 684 English Web pages and 653 Chinese Web pages downloaded in 2006 by "[collecting] URLs returned by making queries to Google, which consisted of four words frequent in an individual language" [22]. CleanEval uses a different approach when computing extraction performance. Their scoring method is based on a word-at-a-time version of the Levenshtein distance between the extraction algorithm and the gold standard divided by the alignment length.

3.1 Results
First, we begin with a straightforward analysis of the results of each algorithm on the dataset. Figure 2a–2d shows the F1-measure for each lustrum, i.e., each 5-year time period, organized by extractor cohort. For example, the BTE-extractor was published in 2001, and is therefore part of the ca. 2000 cohort of extractors; its performance is illustrated in Figure 2a. The eatiht-extractor was published in 2015 and is therefore part of the ca. 2015 cohort of extractors, and is illustrated in Figure 2d.

The shape of the performance curves in Figure 2a–2d over time demonstrates the primary thesis of this paper: extractors quickly become obsolete.

³ https://krdwrd.org/trac/raw-attachment/wiki/Corpora/Canola/CANOLA.pdf
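The bag-of-words scoring described in the Evaluation subsection can be sketched as follows. This is only an illustration under stated assumptions: the paper does not specify a tokenizer, so whitespace splitting is assumed, and repeated words are counted as a multiset. The function name `bag_of_words_scores` and the example strings are hypothetical.

```python
from collections import Counter


def bag_of_words_scores(extracted, gold):
    """Precision, recall and F1 over word multisets (word order is ignored)."""
    ext = Counter(extracted.split())   # assumed tokenizer: whitespace split
    ref = Counter(gold.split())
    overlap = sum((ext & ref).values())  # size of the multiset intersection
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# The extractor output contains two navigation words and misses part of the content.
p, r, f1 = bag_of_words_scores("Home News the cat sat", "the cat sat on the mat")
print(p, r, f1)
```

Because the comparison ignores position, a word extracted from a menu counts as correct whenever the same word also occurs in the gold-standard text, which is one way the leniency mentioned above can inflate scores.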
[Figure 2 panels, each plotting F1-measure (0 to 100) against Web Page Year (2000, 2005, 2010, 2015): (a) ca. 2000 with extractors BTE, DSC, LSI; (b) ca. 2005 with extractors ADSC, CCB, KFE, LQF, RR; (c) ca. 2010 with extractors BP, BP-AE, CETR, CETR-Th; (d) ca. 2015 with extractor eatiht.]

Figure 2: F1-measure for various extractor cohorts by lustrum (5-year period).

Indeed, Figure 3 averages the F1-measure of each cohort and plots their aggregate performance together. We can clearly see that all of the extractor cohorts begin at approximately the same performance on Web page data from the year 2000, but the performance quickly falls as the form and function of the Web pages change. As a naive baseline, we also measure the results if all non-HTML text was extracted and treated as content; in this case, the F1-measure is buoyed by the perfect recall score, but the precision and accuracies are bad as expected.

[Figure 3 plots F1-measure (0 to 100) against Web Page Year (2000, 2005, 2010, 2015) for the cohorts 2000-Extractors, 2005-Extractors, 2010-Extractors and 2015-Extractors, and for the All Text baseline.]

Figure 3: Mean average F1 measure per cohort over each lustrum.

2015-extractors are most invariant to changes in the Web because the developers likely created the extractor knowing the state of the Web in 2015 and with an understanding of the history of the Web. 2010-extractors perform well on data from 2010 and prior, but were unable to adapt to unforeseen changes that appeared in 2015. Similarly, extractors from 2005 performed well on data from 2005 and prior, but did not predict Web changes and quickly became obsolete.

The F1-measure is arguably the best single performance metric to analyze this type of data; however, individual precision, recall and accuracy considerations may be important to various applications. The raw scores are listed in Table 5.

We find that extractors from 2000 and 2005 have a steep downward trend and extractors from 2010 also have a downward trend, although not as steep. Only the 2015 extractor performs steadily. These results indicate that changes in Web design and implementation have adversely affected content extraction tools.

The semantic tags found in new Web standards like HTML5 may be one solution to the falling extractor performance. Table 6 demonstrates surprisingly good extraction performance by extracting only (and all of) the text inside the article tag from the 2015 lustrum as the article's content. Compared to the results from the complex algorithms shown in Table 5 the simple HTML5 extraction rule shows reasonable results with very little effort.

         Precision  Recall  Accuracy
Mean     57.3       67.3    82.4
Median   60.7       72.3    83.4

Table 6: Extraction results using only HTML5 article tags.

This further demonstrates that the nature of the Web is changing, and as a result, our thinking about content extraction must change too.
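The article-tag rule behind Table 6 is simple enough to sketch with the Python standard library. The code below is an assumption about how such a rule could be written, not the implementation used in the study; in particular, skipping script and style content and joining text chunks with single spaces are choices made here for illustration. The names `ArticleTextExtractor` and `extract_article_text` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect the text that appears inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth of open <article> tags
        self.skip = 0     # depth of <script>/<style> opened inside an article
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
        elif tag in ("script", "style") and self.depth:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Keep visible text only when inside an article and outside scripts/styles.
        if self.depth and not self.skip and data.strip():
            self.chunks.append(data.strip())


def extract_article_text(html):
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><body><nav>Home</nav>"
    "<article><h1>Title</h1><p>Body text.</p><script>var x = 1;</script></article>"
    "<footer>Contact</footer></body></html>"
)
print(extract_article_text(sample))
```

A page without any article element yields an empty string under this rule, so in practice a fallback extractor would still be needed for sites that have not adopted the tag.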
                         Lustrum 2000         Lustrum 2005         Lustrum 2010         Lustrum 2015
Cohort  Extractor  Year  Prec   Rec    Acc    Prec   Rec    Acc    Prec   Rec    Acc    Prec   Rec    Acc
-       All Text   -     45.65  100    45.65  38.33  100    38.33  25.78  100    25.78  20.14  100    20.14
2000    BTE        2001  76.36  92.74  82.74  58.15  89.37  72.22  34.32  88.67  53.92  20.67  85.47  44.05
2000    LSI        2001  83.37  89.79  87.87  64.19  88.71  77.32  43.35  80.47  65.7   23.66  77.51  48.56
2000    DSC        2002  85.42  83.25  86.2   66.88  82.98  77.55  46.37  75.74  71.09  23.89  72.97  50.15
2005    KFE        2005  74.21  75.88  83.34  50     69.95  70.71  35.28  63.45  65.4   19.78  64.92  47.72
2005    LQF        2005  71.11  93.81  81.49  56.57  92.39  72.23  38.68  85.44  61.94  20.91  84.47  45.24
2005    ADSC       2007  74.39  92.91  83.68  57.68  91.36  73.51  36.95  86.74  59.28  20.27  85.59  44.25
2005    CCB        2008  85.28  86.95  88.26  65.79  84.91  77.78  44.45  77.06  68.95  22.92  74.43  48.72
2005    RR         2008  81.97  92.11  88.96  70.73  89.75  86.54  61.32  76.57  70.82  47.75  86.25  79.70
2010    CETR       2010  86.74  85.18  88.98  76.05  82.08  85.13  59.01  81.32  80.79  54.66  67.86  88.03
2010    CETR-1D    2010  85.3   85.62  88.55  76.23  82.64  85.41  59.31  80.35  80.89  56.57  67.13  87.78
2010    CETR-Th    2010  89.92  81.95  89.16  82.1   77.52  86.74  65.63  78.31  84.21  57.42  72.76  89.6
2010    BP         2010  93.51  85.92  92.26  91.84  82.64  92.45  79.12  75.72  88.86  83.17  68.84  93.74
2010    BP-AE      2010  94.76  87.11  92.86  92.97  84.25  92.54  94.99  69.39  91.35  85.79  63.21  92.96
2010    BP-Sen     2010  97.37  84.43  92.78  97.47  81.71  93.53  97.19  66.84  91.26  89.06  61.33  93.09
2010    BP-Canola  2010  93.43  87.36  92.58  88.09  84.56  90.97  77.33  77.47  88.71  68.74  71.47  92.02
2010    BP-15W     2010  94.5   83.7   91.55  89.09  80.62  90.22  82.1   74.04  89.37  73.89  68.51  92.4
2010    BP-W       2010  91.45  88.83  92.51  88.31  86.12  91.76  81.79  78.54  89.58  83.31  70.84  93.97
2015    eatiht     2015  81.89  76.3   80.17  82.04  80.04  83.76  93.75  69.18  91.13  88.48  62.93  94.39

Table 5: Precision, recall and accuracy breakdown by lustrum (i.e., the 5-year period in which data was collected) and cohort (i.e., the set of extractors that were developed in the same time period)

3.2 Discussion
The main critique of wrapper induction methods is that they frequently require re-training. In response many heuristic/feature engineering approaches have been developed that are said to not require training and simply work out of the box.

These results underscore a robustness problem in Web content extraction. Ideally, Web science research should be at least partially invariant to change. If published content extractors are to be adopted and widely used they ought to be able to withstand changing Web standards. Wrapper induction techniques admit this problem; however, we find that heuristic content extractors are prone to obsolescence as well.

4. CONCLUSIONS
We conclude by recapping our main findings.

First, we put into concrete terms the changes to the form and function of the Web. We argue that most content extraction methodologies, by their reliance on unrendered, downloaded HTML markup, do not count a very large portion of the final rendered Web page. This is due to the Web's increasing reliance on external sources for content and data via JavaScript, iframes, and so on.

Second, we find that although wrapper induction techniques are prone to breakage and require frequent retraining, the heuristic/feature engineering extractors studied in this paper, which were argued to not require training at all, also quickly become obsolete.

4.1 Recommendations for future work
We argue that the two findings presented in this paper should be immediately addressed by the content extraction community, and we make the following recommendations.

1. Future content extraction methodologies should be performed on completely rendered Web pages, and should therefore be created as Web browser extensions or with a similar rendered-in-browser setup using a headless browser like PhantomJS, etc. This methodology will allow for all of the content to be loaded so that it may be fully extracted. A browser-based content extractor might operate similarly to the popular AdBlock software, but only render content rather than simply removing blacklisted advertisers. Aside from executing JavaScript and gathering all of the external resources, a browser-based content extractor would also allow for a visual-DOM representation that may improve extraction effectiveness.

2. Future content extraction studies should examine Web pages and Web sites from different time periods to measure the overall robustness of the dataset. This is a difficult task and is perhaps contrary to the first recommendation because the external data from old Web pages may not be easily rendered because the external sources may cease to exist. Nevertheless, it is possible to denote Web pages which have not changed through Change Detection and Notification (CDN) systems [4] or through Last-Modified or ETag HTTP headers.

3. With the adoption of semantic tags in HTML5, such as section, header, main, etc., as well as the creation of semantic attributes within the schema.org framework, it is important to ask whether content extraction algorithms are still needed at all. Many Web sites have mobile versions that streamline content delivery and a large number of content providers have content syndication systems or APIs that deliver pure content. It may be more important in the near future to focus attention on structured data extraction from lists and tables [26; 15; 19; 3; 2] and integrating that data for meaningful analysis.

Content extraction research has been an important part of the history and development of the Web, but this area of study would greatly benefit by considering these recommendations as they would lead to new approaches that are more robust and reliable.
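The second recommendation mentions the Last-Modified and ETag HTTP headers as a way to identify pages that have not changed. A minimal sketch of that check is given below, assuming the crawler stores the validators from an earlier fetch in a plain dictionary. The helper names `conditional_headers` and `is_unchanged` and the sample values are hypothetical; only the header names and the 304 status code come from HTTP itself. No network request is made here.

```python
def conditional_headers(cached):
    """Build request headers for a conditional GET from cached validators."""
    headers = {}
    if cached.get("ETag"):
        headers["If-None-Match"] = cached["ETag"]
    if cached.get("Last-Modified"):
        headers["If-Modified-Since"] = cached["Last-Modified"]
    return headers


def is_unchanged(status_code, cached, response_headers):
    """Treat 304 as unchanged; otherwise fall back to comparing validators."""
    if status_code == 304:
        return True
    etag = response_headers.get("ETag")
    if etag and cached.get("ETag"):
        return etag == cached["ETag"]
    modified = response_headers.get("Last-Modified")
    if modified and cached.get("Last-Modified"):
        return modified == cached["Last-Modified"]
    return False  # no validators to compare, so assume the page may have changed


# Validators remembered from a hypothetical earlier crawl of one page.
cached = {"ETag": '"abc123"', "Last-Modified": "Wed, 06 May 2015 10:00:00 GMT"}
print(conditional_headers(cached))
print(is_unchanged(304, cached, {}))
print(is_unchanged(200, cached, {"ETag": '"def456"'}))
```

Pages flagged as unchanged could keep their existing gold-standard labels across crawls, while the remaining pages would need to be re-examined.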
5. REFERENCES

[1] Z. Bar-Yossef and S. Rajagopalan. Template detection via data mining and its applications. In WWW, page 580, New York, New York, USA, May 2002. ACM Press.

[2] M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. Communications of the ACM, 54(2):72–79, 2011.

[3] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: Exploring the power of tables on the web. Proc. VLDB Endow., 1(1):538–549, Aug. 2008.

[4] S. Chakravarthy and S. C. H. Hara. Automating change detection and notification of web pages. In 17th International Conference on Database and Expert Systems Applications, pages 465–469. IEEE, 2006.

[5] B. Chidlovskii, B. Roustant, and M. Brette. Documentum eci self-repairing wrappers: Performance analysis. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD ’06, pages 708–717, New York, NY, USA, 2006. ACM.

[6] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB, pages 109–118, Sept. 2001.

[7] V. Crescenzi and P. Merialdo. Wrapper inference for ambiguous web pages. Applied Artificial Intelligence, 22(1&2):21–52, 2008.

[8] N. Dalvi, P. Bohannon, and F. Sha. Robust web extraction. In SIGMOD, page 335, New York, New York, USA, June 2009. ACM Press.

[9] H. Davulcu, G. Yang, M. Kifer, and I. V. Ramakrishnan. Computational aspects of resilient data extraction from semistructured sources (extended abstract). In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’00, pages 136–144, New York, NY, USA, 2000. ACM.

[10] S. Debnath, P. Mitra, and C. L. Giles. Automatic extraction of informative blocks from webpages. In SAC, page 1722, New York, New York, USA, Mar. 2005. ACM Press.

[11] A. Finn, N. Kushmerick, and B. Smyth. Fact or fiction: Content classification for digital libraries. In DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries, 2001.

[12] D. Gibson, K. Punera, and A. Tomkins. The volume and evolution of web page templates. In Special interest tracks and posters of the 14th international conference on World Wide Web - WWW ’05, page 830, New York, New York, USA, May 2005. ACM Press.

[13] T. Gottron. Combining content extraction heuristics. In iiWAS, page 591, New York, New York, USA, Nov. 2008. ACM Press.

[14] T. Gottron. Content Code Blurring: A New Approach to Content Extraction. In DEXA TIR Workshop, pages 29–33. IEEE, Sept. 2008.

[15] R. Gupta and S. Sarawagi. Answering table augmentation queries from unstructured lists on the web. Proc. VLDB Endow., 2(1):289–300, Aug. 2009.

[16] W. Han, D. Buttler, and C. Pu. Wrapping web data into XML. ACM SIGMOD Record, 30(3):33, Sept. 2001.

[17] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining - WSDM ’10, page 441, New York, New York, USA, Feb. 2010. ACM Press.

[18] N. Kushmerick. Learning to remove internet advertisements. In Proceedings of the third annual conference on Autonomous Agents, pages 175–181. ACM, 1999.

[19] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow., 3(1-2):1338–1347, Sept. 2010.

[20] B. Liu, R. Grossman, and Y. Zhai. Mining data records in Web pages. In SIGKDD, page 601, New York, New York, USA, Aug. 2003. ACM Press.

[21] C. Mantratzis, M. Orgun, and S. Cassidy. Separating XHTML content from navigation clutter using DOM-structure block analysis. In HT, page 145, New York, New York, USA, Sept. 2005. ACM Press.

[22] M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff. CleanEval: a competition for cleaning webpages. In International Language Resources and Evaluation, 2008.

[23] L. Martin and T. Gottron. Readability and the Web. Future Internet, 4:238–252, 2012. Special Issue Selected Papers from ITA 11.

[24] R. Palacios. Eatiht. http://rodricios.github.io/eatiht/, 2015.

[25] A. G. Parameswaran, N. N. Dalvi, H. Garcia-Molina, and R. Rastogi. Optimal schemes for robust web extraction. PVLDB, 4(11):980–991, 2011.

[26] R. Pimplikar and S. Sarawagi. Answering table queries on the web using column keywords. Proceedings of the VLDB Endowment, 5(10):908–919, 2012.

[27] D. Pinto, M. Branstein, R. Coleman, W. B. Croft, M. King, W. Li, and X. Wei. QuASM: a system for question answering using semi-structured data. In JCDL ’02: Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, pages 46–55, New York, NY, USA, 2002. ACM Press.

[28] T. Weninger, W. H. Hsu, and J. Han. CETR. In WWW, page 971, New York, New York, USA, Apr. 2010. ACM Press.

[29] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In WWW, page 76, New York, New York, USA, May 2005. ACM Press.
