Roller: a novel approach to Web information extraction

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

The research regarding Web information extraction focuses on learning rules to extract some selected information from Web documents. Many proposals are ad hoc and cannot benefit from the advances in machine learning; furthermore, they are likely to fade away as the Web evolves, and their intrinsic assumptions are not satisfied. Some authors have explored transforming Web documents into relational data and then using techniques that got inspiration from inductive logic programming. In theory, such proposals should be easier to adapt as the Web evolves because they build on catalogues of features that can be adapted without changing the proposals themselves. Unfortunately, they are difficult to scale as the number of documents or features increases. In the general field of machine learning, there are propositio-relational proposals that attempt to provide effective and efficient means to learn from relational data using propositional techniques, but they have seldom been explored regarding Web information extraction. In this article, we present a new proposal called Roller: it relies on a search procedure that uses a dynamic flattening technique to explore the context of the nodes that provide the information to be extracted; it is configured with an open catalogue of features, so that it can adapt to the evolution of the Web; it also requires a base learner and a rule scorer, which helps it benefit from the continuous advances in machine learning. Our experiments confirm that it outperforms other state-of-the-art proposals in terms of effectiveness and that it is very competitive in terms of efficiency; we have also confirmed that our conclusions are solid from a statistical point of view.

Notes

  1. Cantelli’s inequality states that, given a random variable X with mean \(\mu \) and standard deviation \(\sigma \), and given \(k > 0\), the probability that \(X \ge \mu + k \, \sigma \) is not greater than \(1 / (1 + k ^ 2)\), and neither is the probability that \(X \le \mu - k \, \sigma \). If we wish to discard, say, 5 % of the data as lower or upper outliers, we then have to set \(1 / (1 + k ^ 2 ) = 0.05\), that is, \(k ^ 2 = 19\) and \(k = \sqrt{19} \approx 4.36\).

References

  1. Álvarez M, Pan A, Raposo J, Bellas F, Cacheda F (2008) Extracting lists of data records from semi-structured web pages. Data Knowl Eng 64(2):491–509

  2. Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: SIGMOD conference, pp 337–348

  3. Atramentov A, Leiva H, Honavar V (2003) A multi-relational decision tree learning algorithm. In: ILP, pp 38–56

  4. Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29

  5. Bernstein PA, Haas LM (2008) Information integration in the enterprise. Commun ACM 51(9):72–79

  6. Blockeel H, Raedt LD (1998) Top-down induction of first-order logical decision trees. Artif Intell 101(1–2):285–297

  7. Blockeel H, Raedt LD, Jacobs N, Demoen B (1999) Scaling up inductive logic programming by learning from interpretations. Data Min Knowl Discov 3(1):59–93

  8. Bădică C, Bădică A, Popescu E, Abraham A (2007) L-wrappers: concepts, properties and construction. Soft Comput 11(8):753–772

  9. Califf ME, Mooney RJ (2003) Bottom-up relational learning of pattern matching rules for information extraction. J Mach Learn Res 4:177–210

  10. Chang C-H, Kuo S-C (2004) OLERA: Semisupervised web-data extraction with visual support. IEEE Intell Syst 19(6):56–64

  11. Chang C-H, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428

  12. Chidlovskii B (2001) Wrapping web information providers by transducer induction. In: ECML, pp 61–72

  13. Crescenzi V, Mecca G (2004) Automatic information extraction from large websites. J ACM 51(5):731–779

  14. Crescenzi V, Merialdo P (2008) Wrapper inference for ambiguous web pages. Appl Artif Intell 22(1&2):21–52

  15. Cumby CM, Roth D (2003) On kernel methods for relational learning. In: ICML, pp 107–114

  16. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  17. Džeroski S, Lavrač N (1993) Inductive learning in deductive databases. IEEE Trans Knowl Data Eng 5(6):939–949

  18. Emde W, Wettschereck D (1996) Relational instance-based learning. In: ICML, pp 122–130

  19. Esposito F, Ferilli S, Fanizzi N, Basile TMA, Mauro ND (2003) Incremental multistrategy learning for document processing. Appl Artif Intell 17(8–9):859–883

  20. Fernández-Villamor JI, Iglesias CÁ, Garijo M (2012) First-order logic rule induction for information extraction in web resources. Int J Artif Intell Tools 21(6):1–20

  21. Flach PA, Lachiche N (2004) Naive Bayesian classification of structured data. Mach Learn 57(3):233–269

  22. Frank E, Hall MA, Holmes G, Kirkby R, Pfahringer B, Witten IH, Trigg L (2010) Weka-a machine learning workbench for data mining. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Berlin, pp 1269–1277

  23. Freitag D (1998) Information extraction from HTML: application of a general machine learning approach. In: AAAI/IAAI, pp 517–523

  24. Freitag D (2000) Machine learning for information extraction in informal domains. Mach Learn 39(2/3):169–202

  25. García S, Herrera F (2008) An extension on ‘statistical comparisons of classifiers over multiple data sets’ for all pair-wise comparisons. J Mach Learn Res 9:2677–2694

  26. Gärtner T, Lloyd JW, Flach PA (2004) Kernels and distances for structured data. Mach Learn 57(3):205–232

  27. Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3):1–32

  28. Getoor L, Friedman N, Koller D, Taskar B (2001) Learning probabilistic models of relational structure. In: ICML, pp 170–177

  29. Guo H, Viktor HL (2008) Multirelational classification: a multiple view approach. Knowl Inf Syst 17(3):287–312

  30. He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284

  31. Hickson I, Berjon R, Faulkner S, Leithead T, Navara ED, O’Connor E, Pfeiffer S (2014) HTML 5: a vocabulary and associated APIs for HTML and XHTML. Technical report W3C

  32. Hogue AW, Karger DR (2005) Thresher: automating the unwrapping of semantic content from the world wide web. In: WWW, pp 86–95

  33. Horváth T, Wrobel S, Bohnebeck U (2001) Relational instance-based learning with lists and terms. Mach Learn 43(1/2):53–80

  34. Hsu C-N, Dung M-T (1998) Generating finite-state transducers for semi-structured data extraction from the web. Inf Syst 23(8):521–538

  35. Irmak U, Suel T (2006) Interactive wrapper generation with minimal user effort. In: WWW, pp 553–563

  36. Jaeger M (2008) Probabilistic-logic models: reasoning and learning with relational structures. In: SCAI, pp 197–200

  37. Kavurucu Y, Senkul P, Toroslu IH (2011) A comparative study on ILP-based concept discovery systems. Expert Syst Appl 38(9):11598–11607

  38. Kayed M, Chang C-H (2010) FiVaTech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263

  39. Knobbe AJ, de Haas M, Siebes A (2001) Propositionalisation and aggregates. In: PKDD, pp 277–288

  40. Kramer S, Lavrač N, Flach P (2001a) Propositionalization approaches to relational data mining. In: Džeroski S, Lavrač N (eds) Relational data mining. Springer, Berlin, pp 262–291

  41. Kramer S, Widmer G, Pfahringer B, de Groeve M (2001b) Prediction of ordinal classes using regression trees. Fundam Inform 47(1–2):1–13

  42. Krogel MA (2005) On propositionalization for knowledge discovery in relational databases. PhD thesis, Otto von Guericke Universität Magdeburg

  43. Krogel M-A, Rawles S, Zelezný F, Flach PA, Lavrač N, Wrobel S (2003) Comparative evaluation of approaches to propositionalization. In: ILP, pp 197–214

  44. Kushmerick N, Weld DS, Doorenbos RB (1997) Wrapper induction for information extraction. IJCAI 1:729–737

  45. Lavrač N, Džeroski S (1994) Inductive logic programming: techniques and applications. Ellis Horwood, Chichester

  46. Montoto P, Pan A, Raposo J, Losada J, Bellas F, Carneiro V (2008) A workflow language for web automation. J UCS 14(11):1838–1856

  47. Muggleton S (2000) Learning stochastic logic programs. Electron Trans Artif Intell 4(B):141–153

  48. Muggleton S, Raedt LD, Poole D, Bratko I, Flach PA, Inoue K, Srinivasan A (2012) ILP turns 20: biography and future challenges. Mach Learn 86(1):3–23

  49. Muslea I, Minton S, Knoblock CA (2001) Hierarchical wrapper induction for semistructured information sources. Auton Agents Multi-Agent Syst 4(1/2):93–114

  50. Park J, Barbosa D (2007) Adaptive record extraction from web pages. In: WWW, pp 1335–1336

  51. Quinlan JR, Cameron-Jones RM (1995) Induction of logic programs: FOIL and related systems. New Gener Comput 13(3&4):287–312

  52. Sarawagi S (2008) Information extraction. Found Trends Databases 1(3):261–377

  53. Shen YK, Karger DR (2007) U-REST: an unsupervised record extraction system. In: WWW, pp 1347–1348

  54. Sheskin DJ (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. Chapman and Hall/CRC, Boca Raton/London

  55. Sleiman HA, Corchuelo R (2013a) TEX: an efficient and effective unsupervised web information extractor. Knowl Based Syst 39:109–123

  56. Sleiman HA, Corchuelo R (2013b) A survey on region extractors from web documents. IEEE Trans Knowl Data Eng 25(9):1960–1981

  57. Sleiman HA, Corchuelo R (2014a) A class of neural-network-based transducers for web information extraction. Neurocomputing 135:61–68

  58. Sleiman HA, Corchuelo R (2014b) Trinity: on using trinary trees for unsupervised web data extraction. IEEE Trans Knowl Data Eng 26(6):1544–1556

  59. Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34(1–3):233–272

  60. Srinivasan A (2004) The Aleph manual. Technical report, University of Oxford

  61. Su W, Wang J, Lochovsky FH (2009) ODE: ontology-assisted data extraction. ACM Trans Database Syst 34(2):12.1–12.35

  62. Turmo J, Ageno A, Català N (2006) Adaptive information extraction. ACM Comput Surv 38(2):1–47

  63. van Kesteren A, Gregor A, Russell A, Berjon R (2014) Document object model 4. Technical report W3C

  64. Yin X, Han J, Yang J, Yu PS (2006) Efficient classification across multiple database relations: a CrossMine approach. IEEE Trans Knowl Data Eng 18(6):770–783

  65. Zhang H, Su J (2004) Conditional independence trees. In: ECML, pp 513–524

Acknowledgments

We are grateful to Dr. Hassan A. Sleiman, from the Commissariat à l’Énergie Atomique et aux Énergies Alternatives, LIST Institute, France, and Dr. Toñi Reina, from the University of Sevilla, for their help and support with previous proposals that have paved the way for Roller. We would also like to thank Dr. Francisco Herrera, from the University of Granada, Spain, for sharing his statistical analysis software with us, and Dr. José C. Riquelme, from the University of Sevilla, Spain, for the many fruitful discussions regarding evaluating our proposal and comparing our experimental results. Last, but not least, we would like to thank our reviewers and our editor for their insightful comments on the earlier versions of this article, which helped us improve it considerably. Our work was supported by the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (Grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, TIN2010-09988-E, TIN2011-15497-E, and TIN2013-40848-R). The work by Patricia Jiménez was also partially supported by the University of Southern California during a research visit that she paid to the Information Sciences Institute.

Author information

Corresponding author

Correspondence to Patricia Jiménez.

Appendices

Appendix 1: The experimentation environment

Table 8 Description of our datasets

We performed our experiments on a four-threaded Intel Core i7 computer that ran at 2.93 GHz and had 4 GB of RAM; the software was Windows 7 Pro 64-bit, Oracle’s Java Development Kit \(1.7.9\_02\), JTidy 9.38, and Weka 3.6.8. We made no changes to the default configurations of the hardware or the software.

We used a collection of 40 datasets on books, films, cars, events, doctors, jobs, realty, and players, plus the nine datasets from the ExAlg repository and the five datasets from the RISE repository that provide semi-structured documents. The categories regarding the first group of datasets were randomly sampled from The Open Directory sub-categories, and the websites inside each category were randomly selected from the 100 best-ranked websites between December 2010 and March 2011 according to Google’s search engine; we downloaded 30 documents from each website and handcrafted a set of annotations with the slots that we wished to extract from each document. Table 8 describes our datasets; for each category, we report on the sites from which they were downloaded, the slots that model the information that they provide, the number of documents that they have, their average size in KiB, the average number of HTML errors that they have (as reported by JTidy), the average number of positive examples, and the average number of negative examples. The datasets were split ten times; in each split, we randomly selected six documents for training purposes and the remaining ones for testing purposes. The results on which we report in this article were computed on the testing sets.
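The splitting protocol described above can be sketched as follows. This is only a minimal illustration in Python, not the authors' code: the function name `make_splits`, the fixed seed, and the placeholder document names are assumptions introduced here; only the figures (ten splits, six training documents out of the 30 downloaded per website) come from the text.

```python
import random


def make_splits(documents, n_splits=10, n_train=6, seed=0):
    """Return n_splits random (train, test) partitions of the documents.

    Hypothetical helper: each split samples n_train documents for training
    and keeps the remaining ones for testing, as described in the text.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    splits = []
    for _ in range(n_splits):
        train = rng.sample(documents, n_train)
        train_set = set(train)
        test = [doc for doc in documents if doc not in train_set]
        splits.append((train, test))
    return splits


# Placeholder names standing in for the 30 documents of one website.
documents = [f"doc-{i:02d}.html" for i in range(30)]
splits = make_splits(documents)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # prints: 10 6 24
```

Each split is drawn independently, so the same document may appear in the training set of several splits.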

We searched the Web and contacted many authors in order to have access to the implementation of as many proposals as possible. We managed to find implementations of SoftMealy [34] and Wien [44], which are classical proposals, and of RoadRunner [14], FiVaTech [38], and Trinity [58], which are recent proposals. We also experimented with a straightforward approach in which we translated our datasets into first-order knowledge bases and then used Aleph [60] to induce rules.

Appendix 2: Performance measures

We collected the usual effectiveness measures, namely: precision (P), recall (R), and the \(F_1\) score, which combines them both. We also collected some efficiency measures, namely: learning time (\({ LT }\)) and extraction time (\({ ET }\)), both measured in CPU seconds.
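For reference, the standard definitions of these effectiveness measures can be sketched as follows. The counts `tp`, `fp`, and `fn` (true positives, false positives, and false negatives) and the sample numbers are illustrative assumptions; the text does not state how matches were counted or how empty denominators were handled.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for a measure whose denominator is zero (a convention
    assumed here, not taken from the article).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Made-up counts: 8 correct extractions, 2 spurious ones, 2 missed ones.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
```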

Effectiveness measures are stable because the proposals that we have compared are deterministic, that is, they do not change when a proposal is run multiple times on the same datasets. Efficiency measures, on the contrary, are subject to external experimental conditions and may vary from execution to execution. We decided to measure CPU times because they are far more stable than user times; given that the proposals that we have compared are deterministic, that means that they follow the same execution paths every time that they are executed on the same dataset, which implies that they execute exactly the same machine-level instructions. IO activities were not a problem in our experimental study; the reason is that the proposals that we compared are CPU bound, not IO bound. In other words, they read the input documents, which typically takes less than a hundredth of a second, then run their algorithms in memory, and finally output the results to a file, which does not usually take more than a hundredth of a second.
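The difference between CPU time and elapsed time can be illustrated with a small sketch. It is written in Python for brevity, whereas the experiments ran on Java, and it assumes that the contrast of interest is with elapsed (wall-clock) time. `time.process_time` and `time.perf_counter` are standard-library calls; `measure` and `workload` are hypothetical names.

```python
import time


def measure(task):
    """Run task() once and return (cpu_seconds, wall_seconds).

    time.process_time() counts the CPU time consumed by this process (user
    plus system) and excludes time spent sleeping; time.perf_counter()
    measures elapsed wall-clock time.
    """
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    task()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return cpu_seconds, wall_seconds


def workload():
    """Hypothetical CPU-bound task followed by an idle wait."""
    sum(i * i for i in range(200_000))
    time.sleep(0.05)  # idle time: adds to wall time, not to CPU time


cpu, wall = measure(workload)
print(f"CPU: {cpu:.4f} s, wall: {wall:.4f} s")
```

The idle wait stands in for external interference; it inflates the elapsed time but leaves the CPU time essentially unchanged.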

To confirm the idea that CPU times are stable enough, we repeated each experiment 25 times and we averaged the timings after discarding a few outliers using the well-known Cantelli’s inequality (see Note 1). We studied these outliers to make sure that they were actual abnormal values. They were due to the fact that our University’s cloud infrastructure was reset a couple of times while the experiments were running; that resulted in a few timings that were abnormally large due to the interference caused by the temporary interruption of the computing service. The rest of the timings were quite stable; the small differences from run to run were mainly due to the performance of the memory cache system, which, in turn, depended on the other processes that were running on our University’s cloud infrastructure.
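The averaging step can be sketched as follows, assuming that a timing is discarded when it lies more than \(k\) standard deviations away from the mean, with \(k = 4.36\) as in Note 1. The function name, the use of the population standard deviation, the single filtering pass, and the sample timings are all assumptions; the article does not spell out these details.

```python
import statistics


def trimmed_mean(timings, k=4.36):
    """Average the timings that lie within k standard deviations of the mean.

    Sketch only: uses the population standard deviation and a single
    filtering pass, both of which are assumptions.
    Returns the average of the kept timings and the number discarded.
    """
    mu = statistics.fmean(timings)
    sigma = statistics.pstdev(timings)
    kept = [t for t in timings if abs(t - mu) <= k * sigma]
    return statistics.fmean(kept), len(timings) - len(kept)


# Made-up CPU timings (seconds): 24 stable runs and one abnormally large run.
timings = [1.00, 1.02, 0.99, 1.01] * 6 + [9.50]
average, discarded = trimmed_mean(timings)
print(average, discarded)
```

With the population standard deviation, a single value among \(n\) can lie at most \(\sqrt{n - 1}\) standard deviations from the mean, so under these assumptions a threshold of \(k = 4.36\) can only discard a timing when there are at least 21 repetitions; 25 repetitions satisfy this.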

About this article

Cite this article

Jiménez, P., Corchuelo, R. Roller: a novel approach to Web information extraction. Knowl Inf Syst 49, 197–241 (2016). https://doi.org/10.1007/s10115-016-0921-4

  • DOI: https://doi.org/10.1007/s10115-016-0921-4
