Abstract
Research on Web information extraction focuses on learning rules to extract selected information from Web documents. Many proposals are ad hoc and cannot benefit from the advances in machine learning; furthermore, they are likely to fade away as the Web evolves and their intrinsic assumptions are no longer satisfied. Some authors have explored transforming Web documents into relational data and then using techniques inspired by inductive logic programming. In theory, such proposals should be easier to adapt as the Web evolves because they build on catalogues of features that can be adapted without changing the proposals themselves. Unfortunately, they are difficult to scale as the number of documents or features increases. In the general field of machine learning, there are propositio-relational proposals that attempt to provide effective and efficient means to learn from relational data using propositional techniques, but they have seldom been explored in the context of Web information extraction. In this article, we present a new proposal called Roller: it relies on a search procedure that uses a dynamic flattening technique to explore the context of the nodes that provide the information to be extracted; it is configured with an open catalogue of features, so that it can adapt to the evolution of the Web; it also requires a base learner and a rule scorer, which helps it benefit from the continuous advances in machine learning. Our experiments confirm that it outperforms other state-of-the-art proposals in terms of effectiveness and that it is very competitive in terms of efficiency; we have also confirmed that our conclusions are solid from a statistical point of view.
Notes
Cantelli’s inequality states that given a random variable X, if \(\mu \) denotes its mean and \(\sigma \) denotes its standard deviation, then, for any \(k > 0\), the probability that \(X \ge \mu + k \, \sigma \), and likewise the probability that \(X \le \mu - k \, \sigma \), is not greater than \(1 / (1 + k ^ 2)\). If we wish to discard, say, 5% of the data as lower or upper outliers, we then have to set \(1 / (1 + k ^ 2 ) = 0.05\), that is \(k = 4.36\).
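The value of \(k\) in this note follows from solving the bound for \(k\); the steps below only spell out that arithmetic and add nothing beyond it:

\[
\frac{1}{1 + k^2} = 0.05
\;\Longrightarrow\;
k^2 = \frac{1}{0.05} - 1 = 19
\;\Longrightarrow\;
k = \sqrt{19} \approx 4.36 .
\]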
References
Álvarez M, Pan A, Raposo J, Bellas F, Cacheda F (2008) Extracting lists of data records from semi-structured web pages. Data Knowl Eng 64(2):491–509
Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: SIGMOD conference, pp 337–348
Atramentov A, Leiva H, Honavar V (2003) A multi-relational decision tree learning algorithm. In: ILP, pp 38–56
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29
Bernstein PA, Haas LM (2008) Information integration in the enterprise. Commun ACM 51(9):72–79
Blockeel H, Raedt LD (1998) Top-down induction of first-order logical decision trees. Artif Intell 101(1–2):285–297
Blockeel H, Raedt LD, Jacobs N, Demoen B (1999) Scaling up inductive logic programming by learning from interpretations. Data Min Knowl Discov 3(1):59–93
Bădică C, Bădică A, Popescu E, Abraham A (2007) L-wrappers: concepts, properties and construction. Soft Comput 11(8):753–772
Califf ME, Mooney RJ (2003) Bottom-up relational learning of pattern matching rules for information extraction. J Mach Learn Res 4:177–210
Chang C-H, Kuo S-C (2004) OLERA: Semisupervised web-data extraction with visual support. IEEE Intell Syst 19(6):56–64
Chang C-H, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428
Chidlovskii B (2001) Wrapping web information providers by transducer induction. In: ECML, pp 61–72
Crescenzi V, Mecca G (2004) Automatic information extraction from large websites. J ACM 51(5):731–779
Crescenzi V, Merialdo P (2008) Wrapper inference for ambiguous web pages. Appl Artif Intell 22(1&2):21–52
Cumby CM, Roth D (2003) On kernel methods for relational learning. In: ICML, pp 107–114
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Džeroski S, Lavrač N (1993) Inductive learning in deductive databases. IEEE Trans Knowl Data Eng 5(6):939–949
Emde W, Wettschereck D (1996) Relational instance-based learning. In: ICML, pp 122–130
Esposito F, Ferilli S, Fanizzi N, Basile TMA, Mauro ND (2003) Incremental multistrategy learning for document processing. Appl Artif Intell 17(8–9):859–883
Fernández-Villamor JI, Iglesias CÁ, Garijo M (2012) First-order logic rule induction for information extraction in web resources. Int J Artif Intell Tools 21(6):1–20
Flach PA, Lachiche N (2004) Naive bayesian classification of structured data. Mach Learn 57(3):233–269
Frank E, Hall MA, Holmes G, Kirkby R, Pfahringer B, Witten IH, Trigg L (2010) Weka: a machine learning workbench for data mining. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Berlin, pp 1269–1277
Freitag D (1998) Information extraction from HTML: application of a general machine learning approach. In: AAAI/IAAI, pp 517–523
Freitag D (2000) Machine learning for information extraction in informal domains. Mach Learn 39(2/3):169–202
García S, Herrera F (2008) An extension on ‘statistical comparisons of classifiers over multiple data sets’ for all pair-wise comparisons. J Mach Learn Res 9:2677–2694
Gärtner T, Lloyd JW, Flach PA (2004) Kernels and distances for structured data. Mach Learn 57(3):205–232
Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3):1–32
Getoor L, Friedman N, Koller D, Taskar B (2001) Learning probabilistic models of relational structure. In: ICML, pp 170–177
Guo H, Viktor HL (2008) Multirelational classification: a multiple view approach. Knowl Inf Syst 17(3):287–312
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Hickson I, Berjon R, Faulkner S, Leithead T, Navara ED, O’Connor E, Pfeiffer S (2014) HTML 5: a vocabulary and associated APIs for HTML and XHTML. Technical report W3C
Hogue AW, Karger DR (2005) Thresher: automating the unwrapping of semantic content from the world wide web. In: WWW, pp 86–95
Horváth T, Wrobel S, Bohnebeck U (2001) Relational instance-based learning with lists and terms. Mach Learn 43(1/2):53–80
Hsu C-N, Dung M-T (1998) Generating finite-state transducers for semi-structured data extraction from the web. Inf Syst 23(8):521–538
Irmak U, Suel T (2006) Interactive wrapper generation with minimal user effort. In: WWW, pp 553–563
Jaeger M (2008) Probabilistic-logic models: reasoning and learning with relational structures. In: SCAI, pp 197–200
Kavurucu Y, Senkul P, Toroslu IH (2011) A comparative study on ILP-based concept discovery systems. Expert Syst Appl 38(9):11598–11607
Kayed M, Chang C-H (2010) FiVaTech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263
Knobbe AJ, de Haas M, Siebes A (2001) Propositionalisation and aggregates. In: PKDD, pp 277–288
Kramer S, Lavrač N, Flach P (2001a) Propositionalization approaches to relational data mining. In: Džeroski S, Lavrač N (eds) Relational data mining. Springer, Berlin, pp 262–291
Kramer S, Widmer G, Pfahringer B, de Groeve M (2001b) Prediction of ordinal classes using regression trees. Fundam Inform 47(1–2):1–13
Krogel MA (2005) On propositionalization for knowledge discovery in relational databases. PhD thesis, Otto von Guericke Universität Magdeburg
Krogel M-A, Rawles S, Zelezný F, Flach PA, Lavrač N, Wrobel S (2003) Comparative evaluation of approaches to propositionalization. In: ILP, pp 197–214
Kushmerick N, Weld DS, Doorenbos RB (1997) Wrapper induction for information extraction. IJCAI 1:729–737
Lavrač N, Džeroski S (1994) Inductive logic programming: techniques and applications. Ellis Horwood, Chichester
Montoto P, Pan A, Raposo J, Losada J, Bellas F, Carneiro V (2008) A workflow language for web automation. J UCS 14(11):1838–1856
Muggleton S (2000) Learning stochastic logic programs. Electron Trans Artif Intell 4(B):141–153
Muggleton S, Raedt LD, Poole D, Bratko I, Flach PA, Inoue K, Srinivasan A (2012) ILP turns 20: biography and future challenges. Mach Learn 86(1):3–23
Muslea I, Minton S, Knoblock CA (2001) Hierarchical wrapper induction for semistructured information sources. Auton Agents Multi-Agent Syst 4(1/2):93–114
Park J, Barbosa D (2007) Adaptive record extraction from web pages. In: WWW, pp 1335–1336
Quinlan JR, Cameron-Jones RM (1995) Induction of logic programs: FOIL and related systems. New Gener Comput 13(3&4):287–312
Sarawagi S (2008) Information extraction. Found Trends Databases 1(3):261–377
Shen YK, Karger DR (2007) U-REST: an unsupervised record extraction system. In: WWW, pp 1347–1348
Sheskin DJ (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. Chapman and Hall/CRC, Boca Raton/London
Sleiman HA, Corchuelo R (2013a) TEX: an efficient and effective unsupervised web information extractor. Knowl Based Syst 39:109–123
Sleiman HA, Corchuelo R (2013b) A survey on region extractors from web documents. IEEE Trans Knowl Data Eng 25(9):1960–1981
Sleiman HA, Corchuelo R (2014a) A class of neural-network-based transducers for web information extraction. Neurocomputing 135:61–68
Sleiman HA, Corchuelo R (2014b) Trinity: on using trinary trees for unsupervised web data extraction. IEEE Trans Knowl Data Eng 26(6):1544–1556
Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34(1–3):233–272
Srinivasan A (2004) The Aleph manual. Technical report, University of Oxford
Su W, Wang J, Lochovsky FH (2009) ODE: ontology-assisted data extraction. ACM Trans Database Syst 34(2):12.1–12.35
Turmo J, Ageno A, Català N (2006) Adaptive information extraction. ACM Comput Surv 38(2):1–47
van Kesteren A, Gregor A, Russell A, Berjon R (2014) Document object model 4. Technical report W3C
Yin X, Han J, Yang J, Yu PS (2006) Efficient classification across multiple database relations: a crossmine approach. IEEE Trans Knowl Data Eng 18(6):770–783
Zhang H, Su J (2004) Conditional independence trees. In: ECML, pp 513–524
Acknowledgments
We are grateful to Dr. Hassan A. Sleiman, from the Commissariat à l’Énergie Atomique et aux Énergies Alternatives, LIST Institute, France, and Dr. Toñi Reina, from the University of Sevilla, for their help and support with previous proposals that have paved the way for Roller. We would also like to thank Dr. Francisco Herrera, from the University of Granada, Spain, for sharing his statistical analysis software with us, and Dr. José C. Riquelme, from the University of Sevilla, Spain, for the many fruitful discussions regarding evaluating our proposal and comparing our experimental results. Last, but not least, we would like to thank our reviewers and our editor for their insightful comments on the earlier versions of this article, which helped us improve it considerably. Our work was supported by the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (Grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, TIN2010-09988-E, TIN2011-15497-E, and TIN2013-40848-R). The work by Patricia Jiménez was also partially supported by the University of Southern California during a research visit that she paid to the Information Sciences Institute.
Appendices
Appendix 1: The experimentation environment
We performed our experiments on a four-threaded Intel Core i7 computer that ran at 2.93 GHz and had 4 GB of RAM; the software environment consisted of Windows 7 Pro 64-bit, Oracle’s Java Development Kit \(1.7.9\_02\), JTidy 9.38, and Weka 3.6.8. No changes were made to the default configurations of the hardware or the software.
We used a collection of 40 datasets on books, films, cars, events, doctors, jobs, realty, and players, plus the nine datasets from the ExAlg repository and the five datasets from the RISE repository that provide semi-structured documents. The categories regarding the first group of datasets were randomly sampled from The Open Directory sub-categories, and the websites inside each category were randomly selected from the 100 best-ranked websites between December 2010 and March 2011 according to Google’s search engine; we downloaded 30 documents from each website and handcrafted a set of annotations with the slots that we wished to extract from each document. Table 8 describes our datasets; for each category, we report on the sites from which they were downloaded, the slots that model the information that they provide, the number of documents that they have, their average size in KiB, the average number of HTML errors that they have (as reported by JTidy), the average number of positive examples, and the average number of negative examples. The datasets were split ten times; in each split, we randomly selected six documents for training purposes and the remaining ones for testing purposes. The results on which we report in this article were computed on the testing sets.
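The splitting procedure described above can be illustrated with a minimal sketch. It assumes a dataset of 30 documents, as in the first group of datasets; the file names, the function name `make_splits`, and the fixed random seed are hypothetical and are not taken from our implementation.

```python
import random


def make_splits(documents, n_splits=10, n_train=6, seed=0):
    """Return n_splits (train, test) pairs drawn at random from documents.

    Each training set holds n_train documents; the remaining documents
    form the corresponding testing set.
    """
    rng = random.Random(seed)  # fixed seed: an assumption, for reproducibility
    splits = []
    for _ in range(n_splits):
        train = rng.sample(documents, n_train)
        train_set = set(train)
        test = [d for d in documents if d not in train_set]
        splits.append((train, test))
    return splits


# Hypothetical file names standing in for the 30 documents of one website.
docs = [f"doc{i:02d}.html" for i in range(30)]
splits = make_splits(docs)
```

Each split is drawn independently, so the same document may be used for training in several splits.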
We searched the Web and contacted many authors in order to have access to the implementation of as many proposals as possible. We managed to find an implementation for SoftMealy [34] and Wien [44], which are classical proposals, and RoadRunner [14], FiVaTech [38], and Trinity [58], which are recent proposals. We also experimented with a straightforward approach in which we translated our datasets into first-order knowledge bases and then used Aleph [60] to induce rules.
Appendix 2: Performance measures
We collected the usual effectiveness measures, namely: precision (P), recall (R), and the \(F_1\) score, which combines them both. We also collected some efficiency measures, namely: learning time (\({ LT }\)) and extraction time (\({ ET }\)), both measured in CPU seconds.
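For reference, the effectiveness measures can be computed from the counts of true positives, false positives, and false negatives using their standard definitions. The sketch below is illustrative only; the function name and the convention of returning zero when a denominator is zero are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    tp: extracted items that are correct
    fp: extracted items that are incorrect
    fn: correct items that were not extracted
    """
    # Returning 0.0 on an empty denominator is an assumed convention.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, a rule that extracts ten items, eight of which are correct, and misses two correct items has a precision, a recall, and an \(F_1\) score of 0.8 each.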
Effectiveness measures are stable because the proposals that we have compared are deterministic, that is, they do not change when a proposal is run multiple times on the same datasets. Efficiency measures, on the contrary, are subject to external experimental conditions and may vary from execution to execution. We decided to measure CPU times because they are far more stable than user times; given that the proposals that we have compared are deterministic, that means that they follow the same execution paths every time that they are executed on the same dataset, which implies that they execute exactly the same machine-level instructions. IO activities were not a problem in our experimental study; the reason is that the proposals that we compared are CPU bound, not IO bound. In other words, they read the input documents, which typically takes less than a hundredth of a second, then run their algorithms in memory, and finally output the results to a file, which does not usually take more than a hundredth of a second.
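The difference between CPU time and elapsed time can be illustrated with a short sketch. Our proposals were measured in a Java environment; the sketch below uses Python's standard `time` module only for illustration, and the helper `timed` and the sample task `busy_sum` are hypothetical. It assumes that elapsed (wall-clock) time is the measure that is affected by other processes, whereas CPU time only accounts for the time that the processor devotes to the measured process.

```python
import time


def timed(task, *args):
    """Run task(*args) and return (result, cpu_seconds, wall_seconds)."""
    wall_start = time.perf_counter()  # elapsed time, affected by other processes
    cpu_start = time.process_time()   # CPU time of this process only
    result = task(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


def busy_sum(n):
    """A CPU-bound sample task: the sum of the squares below n."""
    return sum(i * i for i in range(n))


result, cpu_seconds, wall_seconds = timed(busy_sum, 1000)
```

For a CPU-bound task on an idle machine, both figures are close to each other; on a loaded machine, the elapsed time grows while the CPU time remains roughly the same.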
To confirm the idea that CPU times are stable enough, we repeated each experiment 25 times and we averaged the timings after discarding a few outliers using the well-known Cantelli’s inequality (see Note 1). We studied these outliers to make sure that they were actual abnormal values. They were due to the fact that our University’s cloud infrastructure was reset a couple of times while the experiments were running; that resulted in a few timings that were abnormally large due to the interferences caused by the temporary interruption of the computing service. The rest of the timings were quite stable; the small differences from run to run were mainly due to the performance of the memory cache system, which, in turn, depended on the other processes that were running on our University’s cloud infrastructure.
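The outlier-discarding step can be sketched as follows. This is a minimal illustration rather than the code used in our experiments: the function names are hypothetical, the use of the population standard deviation is an assumption, and the sample timings are made up for the example. A timing is discarded if it deviates from the mean by more than \(k\) standard deviations, where \(k\) is obtained from Cantelli's bound.

```python
import math
import statistics


def cantelli_k(fraction):
    """Solve 1 / (1 + k^2) = fraction for k > 0."""
    return math.sqrt(1.0 / fraction - 1.0)


def average_without_outliers(timings, fraction=0.05):
    """Average the timings that lie within k standard deviations of the mean."""
    mu = statistics.mean(timings)
    sigma = statistics.pstdev(timings)  # population deviation: an assumption
    k = cantelli_k(fraction)
    kept = [t for t in timings if abs(t - mu) <= k * sigma]
    return statistics.mean(kept), kept


# Made-up data: 24 stable runs and one run disturbed by an interruption.
timings = [1.0] * 24 + [100.0]
average, kept = average_without_outliers(timings)
```

With these made-up timings, the abnormally large value is discarded and the remaining 24 runs are averaged.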
Cite this article
Jiménez, P., Corchuelo, R. Roller: a novel approach to Web information extraction. Knowl Inf Syst 49, 197–241 (2016). https://doi.org/10.1007/s10115-016-0921-4
DOI: https://doi.org/10.1007/s10115-016-0921-4